Hard
Amazon, Datadog, Google

Design a Metrics Monitoring System: System Design Interview

Design a system to ingest, store, and visualize metrics from millions of servers, similar to Datadog or Prometheus.

1. Problem Statement

We have 100,000 servers, each reporting CPU and memory usage every 10 seconds. We need a system that ingests this data and shows graphs on a dashboard. How do we build it?
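Before choosing components, it helps to turn the numbers in the problem statement into a load estimate. The sketch below is a back-of-the-envelope calculation, not a sizing guide: the server count, metric count, and interval come from the statement above, while `BYTES_PER_POINT` is an assumption (an uncompressed 8-byte timestamp plus an 8-byte value) and the function names are illustrative.

```python
# Back-of-the-envelope ingest estimate from the problem statement.
SERVERS = 100_000         # from the problem statement
METRICS_PER_SERVER = 2    # CPU and memory
REPORT_INTERVAL_S = 10    # each server reports every 10 seconds
BYTES_PER_POINT = 16      # ASSUMPTION: 8-byte timestamp + 8-byte value, uncompressed
SECONDS_PER_DAY = 86_400


def points_per_second(servers: int, metrics: int, interval_s: int) -> int:
    """Average number of data points arriving per second."""
    return servers * metrics // interval_s


def points_per_day(servers: int, metrics: int, interval_s: int) -> int:
    """Total data points written in one day."""
    return servers * metrics * (SECONDS_PER_DAY // interval_s)


def raw_bytes_per_day(servers: int, metrics: int, interval_s: int,
                      bytes_per_point: int) -> int:
    """Uncompressed storage needed for one day of raw points."""
    return points_per_day(servers, metrics, interval_s) * bytes_per_point


if __name__ == "__main__":
    args = (SERVERS, METRICS_PER_SERVER, REPORT_INTERVAL_S)
    print(points_per_second(*args))                   # 20000
    print(points_per_day(*args))                      # 1728000000
    print(raw_bytes_per_day(*args, BYTES_PER_POINT))  # 27648000000
```

Under these assumptions the stated workload is about 20,000 points per second and roughly 27.6 GB of raw data per day. Reaching millions of writes per second would take on the order of 100 times more series, for example hundreds of metrics per server rather than two.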

2. Target Architecture (Mermaid)

Scaling this system means separating the write path from the read path, buffering writes through a durable log, and using a data store suited to the workload. The reference architecture:

Mermaid source:
graph TD
    A[Client Traffic] -->|HTTPS Load Balancing| B(API Gateway / Layer 7)
    B --> C{Service Router}
    C -->|Read Path| D[Query Aggregator]
    C -->|Write Path| E[Event Sourcing / Kafka]
    D -.-> F[(In-Memory Cache - Redis)]
    D --> G[(Primary Data Store - NoSQL)]
    E -.->|Async Replication| G
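The read path in the diagram (Query Aggregator, cache, primary store) can be read as a cache-aside lookup: check the cache first and go to the primary store only on a miss. The following is a minimal sketch under that interpretation. The `QueryAggregator` class, its `store` callable, and the TTL value are hypothetical, and a plain dictionary stands in for Redis.

```python
import time
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, float]                 # (timestamp, value)
Key = Tuple[str, int, int]                # (series name, start, end)
Store = Callable[[str, int, int], List[Point]]


class QueryAggregator:
    """Read-path sketch: serve from the cache, fall back to the primary store."""

    def __init__(self, store: Store, ttl_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._store = store               # hypothetical primary-store query function
        self._ttl_s = ttl_s               # ASSUMPTION: short TTL, since new points keep arriving
        self._clock = clock
        self._cache: Dict[Key, Tuple[float, List[Point]]] = {}  # stand-in for Redis

    def query(self, series: str, start: int, end: int) -> List[Point]:
        key = (series, start, end)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]               # cache hit that has not expired
        result = self._store(series, start, end)   # miss: read the primary store
        self._cache[key] = (now + self._ttl_s, result)
        return result
```

A short TTL is one possible choice here: dashboards tend to repeat the same query every few seconds, so even brief caching can absorb most repeated reads while keeping graphs close to current.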

3. Key Focus Areas

  1. Write-heavy ingestion (millions of writes/sec)
  2. Data retention and downsampling (rollups)
  3. Querying time-series data efficiently
  4. Pull vs. push models
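Of the focus areas above, downsampling is the easiest to show concretely. The sketch below rolls raw `(timestamp, value)` points into fixed-width buckets. The `rollup` function and the sample data are illustrative assumptions, not part of any particular time-series database.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[int, float]                          # (unix timestamp in seconds, value)
Rollup = Tuple[int, float, float, float, int]      # (bucket start, min, max, avg, count)


def rollup(points: Iterable[Point], bucket_s: int) -> List[Rollup]:
    """Downsample raw points into fixed-width time buckets."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_s].append(value)  # align to the start of the bucket
    return [
        (start, min(vals), max(vals), sum(vals) / len(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]


if __name__ == "__main__":
    # Five raw 10-second samples (hypothetical CPU percentages).
    raw = [(0, 10.0), (10, 20.0), (20, 30.0), (60, 50.0), (70, 70.0)]
    print(rollup(raw, 60))
    # [(0, 10.0, 30.0, 20.0, 3), (60, 50.0, 70.0, 60.0, 2)]
```

Storing the count next to the average lets coarser rollups (for example hourly from per-minute) be computed as a weighted average instead of an average of averages. Keeping min and max preserves spikes that an average alone would hide.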


Core Concepts

Time Series Databases, Data Aggregation, Big Data