TL;DR (Recommended Architecture): For a 1M QPS production environment, the optimal architecture is a Hybrid Local Cache + Redis Lua Scripting approach utilizing the Sliding Window Counter algorithm. This minimizes the latency P99 caused by network round-trips to Redis while maintaining eventual consistency across distributed API Gateway nodes.

Scaling a rate limiter to 1M QPS requires addressing the core bottleneck of centralized state management and concurrency control. A naïve Redis-based GET/SET approach will buckle under the sheer volume of packet overhead and race conditions, causing catastrophic cascading failures at the API Gateway layer.

Back-of-the-envelope Estimation

Before drawing the architecture, we must define the physical constraints.

Metric	Value	Justification
Peak Read/Write QPS	`1,000,000/s`	Target peak traffic from aggressive clients, bad actors, and DDoS attempts.
Daily Requests	`~86 Billion`	1M QPS * 86,400 seconds. A purely storage-driven approach is unviable.
Memory Storage	`250 GB - 500 GB`	Assuming 100M active daily entities. Each Redis Key (User ID + Counter) is ~1-2 KB depending on the algorithm's sliding window granularity.
Redis Network I/O	`~2.5 Gbps`	1M ops/sec * 300 bytes per operation. Without local caching, the network interface cards (NIC) will aggressively saturate.

High-Level Architecture

To handle 1 Million QPS without inducing massive Latency P99 spikes across the microservices layer, we introduce a massive two-tier caching architecture.

Rendering architecture diagram...
Mermaid Source Code
graph TD
    A[Clients] -->|HTTPS Requests| B(Global Load Balancer / Route53)
    B --> C[API Gateway Node 1]
    B --> D[API Gateway Node 2]
    B --> E[API Gateway Node N]
    C -- Local Cache Check --> F[(L1 Cache: Local Memory - caffeine/guava)]
    D -- Local Cache Check --> G[(L1 Cache)]
    C -- Async Batch Sync / Lua --> H[(L2 Cache: Redis Sharded Cluster)]
    D -- Async Batch Sync / Lua --> H
    H -.-> I[Asynchronous Persistence Layer / Cassandra for Audit]

Deep Dive: Solving the Bottlenecks

1. How to Handle Race Conditions in a Distributed Environment?

The most common failure point in system design interviews is ignoring the Check-and-Set (CAS) Race Condition. If two requests from the exact same user hit API Gateway Node 1 and API Gateway Node 2 at the exact same millisecond:

Node 1 reads Counter = 4
Node 2 reads Counter = 4
Node 1 increments and writes Counter = 5
Node 2 increments and writes Counter = 5

The Solution: Bypass application-level locks and utilize Redis Lua Scripting. Redis evaluates Lua scripts atomically. Because Redis is strictly single-threaded, while the Lua script executes, no other command can run. This entirely eliminates the race condition without the massive performance penalty of distributed locks (like Redlock or Zookeeper locks).

2. The Redis Bottleneck (Hot Key Problem & Network Saturation)

A single optimal Redis node maxes out at around 100k - 150k QPS. For a 1M QPS system, hitting Redis for every single incoming request will cause a catastrophic bottleneck. Moreover, if a malicious bot floods the system, their specific User ID becomes a Hot Key, immediately crashing a specific Redis shard within your cluster.

The Solution: L1 Local Memory Caching + Eventual Consistency. Each API Gateway node maintains a fast local sliding window cache (like Caffeine in Java or an optimized LRU Cache in Node.js/Go).

If a user surpasses their limit locally, the Gateway drops the request immediately, returning a 429 Too Many Requests. This ensures Zero Network I/O to Redis during severe spikes.
Instead of hitting Redis synchronously on every request, the API Gateway buffers the counts locally and performs Asynchronous Batch Updates to the Redis cluster every X milliseconds.

This introduces slight eventual consistency—a user might sneak in 1-2 extra requests during the sync window—but it scales backend throughput exponentially and completely mitigates the Single Point of Failure (SPOF) threat vectors.

Algorithmic Trade-offs

When defending your architecture, balancing strictness versus throughput is the primary trade-off.

Implementation Model	Pros	Cons
Standalone Redis Lua (Strict Consistency)	100% accurate. Highly deterministic enforcement boundary.	High SPOF risk. Saturation of Network I/O. Extremely high Latency P99 under aggressive load.
Local Cache + Async Sync (Eventual Consistency)	Blazing fast (Microsecond latency). Completely protects Redis from Hot Key DDoS attacks.	Allowed bursts might slightly exceed the limit threshold during the synchronization window.

How to Design a Distributed Rate Limiter for 1M QPS