
Edge Layer

The edge layer is the exclusive entry point for all client traffic. It provides two distinct paths: CDN-mediated media segment delivery and API Gateway-mediated application request routing.
| Attribute | Detail |
| --- | --- |
| Responsibilities | DDoS absorption, WAF policy enforcement, TLS termination, CDN caching of media segments and manifests, geographic routing, API rate limiting, JWT pre-validation |
| Core Services | WAF (AWS Shield Advanced / Cloudflare Enterprise), CDN (CloudFront / Akamai with MTN PoP integration), API Gateway (Kong or AWS API Gateway with custom authoriser) |
| Scaling Model | CDN scales elastically per edge node. API Gateway scales horizontally behind a load balancer. WAF is managed/serverless. |
| Failure Domains | CDN edge node failure routes to the next-nearest PoP. API Gateway node failure is handled by load-balancer health checks. The WAF fails open: availability is prioritised over WAF enforcement during an outage, with immediate alerting. |
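The JWT pre-validation step can be sketched as a cheap structural and expiry check performed before the request is routed onward. This is a minimal illustration, not the gateway's actual authoriser: the helper name `prevalidate_jwt` is hypothetical, and signature verification is assumed to remain the custom authoriser's job.

```python
import base64
import json
import time


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring stripped padding."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def prevalidate_jwt(token: str, now=None) -> bool:
    """Cheap structural and expiry check on a JWT at the edge.

    Rejects malformed or expired tokens before routing. It does NOT
    verify the signature -- that is left to the custom authoriser.
    """
    now = time.time() if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        return False
    try:
        claims = json.loads(_b64url_decode(parts[1]))
    except ValueError:  # covers bad base64, bad UTF-8, and bad JSON
        return False
    exp = claims.get("exp")
    return isinstance(exp, (int, float)) and exp > now
```

Rejecting obviously invalid tokens at the edge keeps junk traffic off the application layer without duplicating key management outside the authoriser.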

Application Layer

All business logic services run as independently deployable microservices. Services are stateless — all session state is held in Redis, not in process memory.
| Attribute | Detail |
| --- | --- |
| Core Services | User & Auth Service, Upload Service, Content Service, Engagement Service, Playback Service, Subscription Service, Creator Dashboard, Notification Service, Admin Control Plane |
| Scaling Model | Kubernetes HPA on CPU (target 60%) and custom metrics (request queue depth). Each service scales independently. The Engagement Service uses Redis-buffered write batching to handle viral-content write amplification. |
| Failure Domains | Individual service failure is circuit-broken. Playback Service maintains a 3-replica minimum with a dedicated node pool. Auth Service failure degrades to cached session validation for up to 5 minutes. Engagement Service failures queue writes client-side for retry; view counts tolerate brief outages via eventual consistency. |
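The Redis-buffered write batching used by the Engagement Service can be sketched as follows. This is an illustrative sketch only: the `EngagementBuffer` class and its thresholds are hypothetical, and a plain `Counter` stands in for the Redis hash (e.g. `HINCRBY`) that would hold the buffer in production so it survives process restarts.

```python
from collections import Counter
from typing import Callable, Dict


class EngagementBuffer:
    """Buffers per-content engagement increments and flushes them in one
    batch, collapsing a viral spike of N writes into one DB write per key.
    """

    def __init__(self, flush_fn: Callable[[Dict[str, int]], None],
                 max_pending: int = 1000):
        self._counts: Counter = Counter()   # stand-in for a Redis hash
        self._pending = 0
        self._flush_fn = flush_fn           # e.g. a batched UPDATE
        self._max_pending = max_pending

    def increment(self, content_id: str, delta: int = 1) -> None:
        self._counts[content_id] += delta
        self._pending += delta
        if self._pending >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        """Write the accumulated counts downstream and reset the buffer."""
        if self._counts:
            self._flush_fn(dict(self._counts))
            self._counts.clear()
            self._pending = 0
```

A production version would also flush on a timer, so low-traffic counters are not delayed indefinitely.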
Zero-Trust boundary. Every inter-service call on MCSP requires a valid mTLS client certificate issued per service identity. No internal endpoint is reachable without mutual authentication — the service mesh (Istio) enforces this independently of application code.
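The circuit-breaking mentioned above can be sketched as a small state machine: open after consecutive failures, reject calls while open, then half-open to probe recovery. The class name and thresholds below are hypothetical, not the platform's actual implementation (which would typically live in the mesh or a resilience library).

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after `threshold` consecutive
    failures, rejects calls while open, and half-opens after
    `reset_after` seconds to probe whether the downstream recovered.
    """

    def __init__(self, threshold: int = 5, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            self._opened_at = None  # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # success closes the circuit
        return result
```

Rejecting calls immediately while open is what keeps one failing service from tying up threads and cascading into its callers.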

Media Processing Layer

The media processing layer is a fully asynchronous pipeline triggered by upload completion events. Video and audio jobs share the same Kafka-backed job queue and worker pool — content-type metadata in the job descriptor determines the processing branch applied at the transcoding stage.
| Attribute | Detail |
| --- | --- |
| Responsibilities | Format validation, virus scanning, AI copyright fingerprinting, multi-resolution video transcoding, multi-bitrate audio transcoding, HLS/DASH packaging, DRM encryption, thumbnail and cover art generation, metadata indexing |
| Core Services | Upload Ingestor, Copyright Scanner (perceptual hash + audio fingerprint), Transcoding Cluster (FFmpeg: GPU for 4K video, CPU-only for audio), DRM Packager (Shaka), Art/Thumbnail Generator, Metadata Indexer |
| Scaling Model | Audio and video job queues use separate autoscaling profiles. Spot/preemptible instances are used for transcoding (60–70% cost reduction). A minimum of 2 workers is always running to prevent cold-start latency. |
| Failure Domains | Failed jobs retry with exponential backoff (max 5 attempts) before moving to a dead-letter queue with creator notification. Partial failures (e.g., a 4K transcode fails while 1080p succeeds) publish the available variants immediately without blocking lower resolutions. |

AI / ML Layer

The ML layer operates on two timescales: offline batch training (daily/weekly) and online real-time inference (sub-100 ms per request).
| Attribute | Detail |
| --- | --- |
| Responsibilities | Behavioural event collection, feature engineering, offline model training, online recommendation serving, AI content moderation |
| Core Services | Event Collector (Kafka consumer), Feature Store (Feast / Tecton), Offline Trainer (Spark on Kubernetes + Ray), Model Server (Triton / TorchServe), AI Moderation Pipeline |
| Scaling Model | Model server scales horizontally behind a load balancer. GPU nodes are dedicated to inference; CPU nodes serve features. Training cluster scales on-demand for scheduled jobs. |
| Failure Domains | Recommendation inference failure falls back to the trending content feed. AI moderation failure queues content for human review; content is never auto-approved during an outage. Feature store unavailability degrades to cached features. |
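The trending-feed fallback can be sketched as a thin wrapper around the inference call. The function and parameter names are hypothetical; in production the latency budget would be enforced with a real timeout rather than the simple exception handling shown here.

```python
def recommend_with_fallback(user_id, infer_fn, trending_fn):
    """Serve personalised recommendations; on inference failure or an
    empty result, fall back to the trending feed so the viewer always
    gets content. Returns (items, source) so callers can log fallbacks.
    """
    try:
        items = infer_fn(user_id)
        if items:
            return items, "personalised"
    except Exception:
        pass  # metric/alert emission would go here
    return trending_fn(), "trending"
```

Tagging the response with its source lets the fallback rate itself be monitored as an SLO.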

Data Layer

Each data class uses a purpose-fit store. No store is shared across unrelated data domains.
| Store | Purpose | Failure Mode |
| --- | --- | --- |
| PostgreSQL (multi-AZ, read replicas) | Users, content metadata, subscriptions, transactions | Primary failure triggers automated standby failover (RTO < 30 seconds) |
| Object Storage (S3-compatible) | Hot, cold, and residency-isolated media file buckets | 11 nines of durability; cross-AZ replication. Residency buckets never replicate cross-region. |
| Elasticsearch | Full-text content search and discovery | Failure degrades search; search is not on the playback critical path |
| TimescaleDB / ClickHouse | Analytics time-series, creator metrics | Degraded analytics; no impact on streaming |
| Redis Cluster | Sessions, idempotency keys, engagement counters, ML inference cache | Failure degrades performance but not correctness; sessions fall back to DB validation |
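The idempotency keys held in Redis can be sketched as a keyed record of completed requests with a TTL: a duplicate request within the window replays the cached response instead of re-executing the write. The class below is a hypothetical illustration; in production a Redis `SET key value NX EX ttl` plays the role of the in-memory dict.

```python
import time


class IdempotencyStore:
    """Records request idempotency keys with a TTL so a retried request
    (e.g. a double-submitted payment) is replayed, not re-executed."""

    def __init__(self, ttl_s: float = 3600.0, clock=time.monotonic):
        self._store = {}        # key -> (recorded_at, response)
        self._ttl = ttl_s
        self._clock = clock

    def run_once(self, key: str, handler):
        """Return (response, replayed). Executes `handler` only if no
        unexpired record exists for `key`."""
        entry = self._store.get(key)
        if entry is not None and self._clock() - entry[0] < self._ttl:
            return entry[1], True          # duplicate: replay cached response
        response = handler()
        self._store[key] = (self._clock(), response)
        return response, False             # executed fresh
```

Because the store only degrades performance, its loss matches the table above: a Redis outage risks re-executing a handler, which is why handlers on critical paths must also be idempotent at the database level.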

Control Plane

The control plane is operationally isolated from the viewer-facing data plane with its own deployment, network boundaries, and scaling policies.
| Attribute | Detail |
| --- | --- |
| Core Services | Admin Control Plane API, Moderation Dashboard, Residency Policy Engine, Ad Operations Console, Audit Log Service |
| Scaling Model | Scaled conservatively; the control plane handles significantly lower RPS than the data plane. Admin operations are rate-limited to prevent bulk operational errors. |
| Failure Domains | Control plane failure does not impact viewer streaming. Moderation pipeline failure routes all flagged content to a holding queue; content is not auto-approved during an outage. |
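The admin rate limiting above can be sketched as a token bucket: a sustained rate with a bounded burst, so a bulk script cannot fire thousands of destructive calls at once. The class and its parameters are illustrative, not the platform's actual limiter.

```python
import time


class TokenBucket:
    """Token-bucket limiter: at most `rate` ops/sec sustained, with
    bursts up to `capacity` tokens."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        """Refill tokens for elapsed time, then spend one if available."""
        now = self._clock()
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A rejected admin call would surface as an explicit "slow down" error rather than silently queueing, which is what turns a runaway bulk operation into a visible, stoppable event.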

Observability Layer

Observability is a deployment gate. Services that do not emit structured logs, RED metrics (Rate, Errors, Duration), and distributed traces will fail the CI/CD pipeline health check and cannot be deployed to production.
| Component | Role |
| --- | --- |
| Loki / ELK Stack | Centralised structured log aggregation and search |
| Prometheus + Grafana | System and application metrics; SLO dashboards |
| Jaeger / Tempo | Distributed request tracing (1–5% sampled on high-volume paths) |
| PagerDuty | Alerting and on-call routing |
| Append-only Audit Store | DynamoDB (no-delete policy) or a Kafka compacted topic; a compliance-grade immutable record |
Metrics are retained at high resolution for 7 days, then downsampled and retained for 1 year. Audit log records are partitioned and tiered to cold storage after 90 days but are never deleted.
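The RED metrics that gate deployment can be sketched as a minimal per-endpoint recorder. This is an illustrative stand-in, assuming hypothetical class and method names; in production the Prometheus client library's counters and histograms play this role.

```python
from collections import defaultdict


class REDMetrics:
    """Per-endpoint RED metrics (Rate, Errors, Duration): the minimum
    every service must emit to pass the deployment gate."""

    def __init__(self):
        self.requests = defaultdict(int)     # request count per endpoint
        self.errors = defaultdict(int)       # error count per endpoint
        self.durations = defaultdict(list)   # observed latencies (s)

    def observe(self, endpoint: str, duration_s: float, error: bool = False):
        self.requests[endpoint] += 1
        if error:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def snapshot(self, endpoint: str):
        """Summarise one endpoint for a scrape or SLO dashboard."""
        d = self.durations[endpoint]
        return {
            "rate_total": self.requests[endpoint],
            "errors": self.errors[endpoint],
            "avg_duration_s": sum(d) / len(d) if d else 0.0,
        }
```

A CI/CD health check can then assert that every registered route produces a non-empty snapshot before the service is admitted to production.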
