Designing a Distributed Rate Limiter

Designing a rate limiter that holds up at high request rates: the algorithm trade-offs, where it lives in the stack, and what starts to break as you scale.

distributed-systemsrate-limitingredis

Jan 18, 2026 · 25 min

Introduction

A rate limiter caps how much traffic a client can send within a window. Anything over the cap gets rejected.

Examples you’ve probably seen:

An API that allows 200 requests per second per key.
A signup flow that won’t let one IP create more than 10 accounts an hour.
A payments endpoint that blocks rapid-fire transaction attempts.

Why bother

Three things break without one.

Resource starvation. A bad client (or a buggy one) can flood the system and crowd everyone else out. DoS is the dramatic case; a misconfigured retry loop is the boring one that happens more often.

Fairness. One heavy user shouldn’t be able to starve the rest. Even without malice, a single noisy tenant can ruin tail latencies for everyone on the same backend.

Cost. Rejecting excess traffic at the edge keeps the infrastructure bill bounded. A runaway script that’s free to call you 10 million times will happily do so.

Requirements

Splitting these into functional (what it does) and non-functional (how well it does it).

Functional

What gets rate-limited?

The “identity” the limit is keyed on. The usual candidates:

IP address. Limit by network origin.
User account. Per-user limits.
API key. Per-key limits for developer APIs.
Service. Service-to-service limits.
Combination (API key + endpoint). Complex rules pinned to a specific endpoint per key.
Application-level properties. Custom attributes specific to the application.

At a minimum: IP, user, and API key.

What happens when a request is rejected?

Reject fast, with a response the client can actually do something useful with:

HTTP 429.
A Retry-After or reset timestamp so the client knows when to come back.

REQReject over-limit requests with a clear 429 response that indicates when the limit resets.

Hard guarantee or approximate?

Approximate is fine. Enforcing the limit exactly requires coordination on every check, and coordination costs latency. Letting a handful of extra requests through during a race is a much better trade than adding a synchronous global counter on the hot path.

Configurable rules?

Yes. Limits change at runtime, never hard-coded.

Distributed?

Yes. Multiple gateways and services share the same view.

Non-functional

Latency

The check sits on every request. It can’t add latency anyone would notice, even at peak.

REQUnder 10ms per check, end to end.

Availability

The limiter has to stay up, and when it can’t, we’d rather let traffic through than block legitimate users.

REQHighly available; fail-open during outages.

Scalability

Designing for a million RPS at peak, ten million identities in the system, on the order of a million active at any moment. Both the limiter and its store have to scale horizontally.

REQ1M RPS, horizontally scalable.

Consistency

Global consistency isn’t worth paying for. A small amount of over-allowance during a partition or a failover is fine; rejecting legitimate users isn’t.

REQEventual consistency. Availability > strict accuracy.

Capacity

Some back-of-the-envelope numbers.

Traffic. 10M daily active clients × ~1,000 requests/day each ≈ 10B requests/day. That averages out to ~115K RPS, but average is the wrong number to plan around. Peak is ~10× average, so we’re sizing for 1M RPS.

State. 10M identities total, 5–10% active at once, so 500K–1M active at peak. Each one stores an identifier, a counter or timestamp, and a bit of rule metadata. Call it 100–200 bytes per identity; the whole active set fits in ~200 MB. That’s small. It comfortably lives in memory.

Access pattern. One read-modify-write per request. At peak that’s 1M reads and 1M writes per second, all atomic, all on the critical path.

Algorithm choice

Real traffic isn’t smooth. It bursts. Users open tabs, clients retry after a failure, mobile apps queue calls and flush them on reconnect. The algorithm has to absorb those bursts without letting the long-run rate creep up.

Two practical constraints. State per client stays small (a few hundred MB total across the active set), and the check has to be one atomic operation on the store. Anything that needs multi-step coordination races under concurrent load, and you find that out the painful way.

The shape that fits all three is token bucket.

Each client owns a bucket of CAPACITY tokens that refills at REFILL_RATE per second. Each request takes one token; an empty bucket means reject. State is just two numbers: tokens and lastRefill. On every check, lazily compute how many tokens have refilled since lastRefill, cap at CAPACITY, then try to consume one.

You get two independent knobs:

REFILL_RATE is the sustained rate. Say 10 req/sec.
CAPACITY is the burst headroom. 100 tokens is 10 seconds of refill banked up.

An idle client accumulates tokens up to capacity and can spend its full burst on reconnect. An active one stays pinned to the refill rate. Which is exactly how real traffic looks: bursts separated by idle periods. AWS API Gateway, Stripe, and most production gateways land on token bucket for the same reasons.

For the rest of the algorithm zoo (fixed window, sliding window log/counter, leaky bucket), the implementations, and the Lua scripts that make them safe under concurrency, see Rate Limiting Algorithms.

Entities and API

Three things the system reasons about: the request (an incoming API call), the client (the identity being limited, e.g. IP, user ID, API key), and the rule (the policy that says what’s allowed).

The public interface is one call:

allowRequest(clientId, ruleId) → RateLimitResponse

{
  "allowed":    true,
  "remaining":  45,
  "resetTime":  1699123500
}

remaining and resetTime go straight back to the client as response headers, so a well-behaved SDK can throttle itself without needing to fail first.

High level design

Four decisions: where the limiter sits, how it identifies clients, where state lives, and what the request path looks like.

Where does the limiter live?

Placement decides what the limiter can see and how it scales. Three common options.

In-process. Runs as a library inside each application server, each with its own local state.

Good: zero network overhead, dead simple.
Bad: every instance has its own view, so the limit is per-instance, not global. With N app servers, a client effectively gets N× the configured rate. Application code also has to carry the limiter logic, which means redeploys to change anything.

Dedicated service. A separate service does the check; app services call it on every request.

Good: shared state, centralised logic, you can update rules without touching app code.
Bad: one extra network hop on the hot path, one more service to run and scale, one more thing to page on.

At the API gateway. Limiter sits inside the gateway (or LB) layer, before any application server sees the request.

Good: one place to control, no extra network call beyond what the gateway is already doing, abusive traffic gets dropped before it touches app resources.
Bad: the gateway has to support it, and the gateway becomes load-bearing for both routing and limiting.

Going with the gateway. It’s the most common production setup, the gateway already sees every request, and the latency cost is essentially zero. Bad traffic gets stopped before it touches anything expensive.

Identifying clients

The gateway has the full HTTP request in front of it: URL, headers, query string, source IP. The question is what we hash to identify the “client”.

User ID (for authenticated APIs). Pulled from the JWT or session in the Authorization header. The right choice when each logged-in user should get their own bucket regardless of which device they’re on, and when account tiers map to different limits.

IP address. From X-Forwarded-For, or the connection itself if you control the LB. Useful for public APIs without auth, but be careful: users behind NATs share an IP, mobile IPs change, and shared addresses (offices, schools, ISPs) will look like one very busy client.

API key from X-API-Key. The natural fit for developer APIs where each consumer is metered separately.

Combined keys. Usually you want some of each: authenticated users limited by user ID, anonymous traffic by IP, partners by key. You can also key on tuples, like api_key + endpoint for per-route limits, or user_id + IP to catch a shared account being abused from many machines.

Where does the state live?

Per-client state (counters, timestamps, tokens) has to be shared across limiter instances. If each instance kept its own copy, a client could spray requests across instances and effectively multiply the limit by however many there are. Sharing it is the whole point.

The store has to be:

Fast. We’re on every request.
Atomic. “Read counter, decide, increment” can’t have a gap.
Self-expiring. Windows roll over constantly; we don’t want to write a cleanup job.
High-throughput. Millions of read-modify-write ops per second.

Redis ticks all four. In-memory, atomic commands (INCR, EXPIRE, Lua), TTLs built in, hundreds of thousands of ops per instance, and Cluster mode for sharding and replication out of the box.

A fixed-window counter in Redis looks like:

key: rate_limit:user_12345:window_202401011200
value: integer counter
TTL: set to expire when the window does

INCR + EXPIRE on the first request of a window covers the basic case. Once the algorithm needs more than one read or write (token bucket, sliding window), a small Lua script makes the whole sequence atomic. That’s the part that keeps you out of trouble under concurrent load.

Memcached, Hazelcast, Ignite, ElastiCache all work too. The requirement is “fast, atomic, with TTLs”. Redis is just the boring obvious answer.

The request path

What actually happens, step by step:

Request flow. The whole check should complete in under 10ms; Redis sits on the hot path so it must be fast and close.

Client sends an HTTP request. It lands at the gateway.
Gateway forwards it to one of the limiter instances behind the LB.
The limiter pulls a client identity out of the request: Authorization for user ID, X-API-Key for keys, falling back to X-Forwarded-For.
Look up the rules for this client + endpoint. Rules are cached in memory; the lookup is a map read.
Query Redis for current state: counter, timestamps, tokens, whatever the algorithm uses.
Decide. Under the limit: allow. Over: reject with 429.
If allowed, the same atomic op that read the state also updated it (this is where the Lua script earns its keep).
On allow, forward to the app with X-RateLimit-Remaining and X-RateLimit-Reset headers. On reject, return 429 with the same headers so the client knows when to retry.
App handles the request normally.

The whole thing has to finish in under 10ms. That budget is what drives most of the rest of the design: fast Redis, pooled connections, cached config, no surprises on the hot path.

Deep dives

The basic shape is settled. Now the parts that take work: getting to 1M RPS, holding the latency budget, and surviving failures.

Scaling to 1M RPS

A single Redis instance does roughly 100–200K ops/sec on commodity hardware, depending on the command mix. At 1M RPS with one read-modify-write per request, that’s 5–10 instances just for throughput, before any replicas.

So we shard. Partition data across N Redis instances by hashing the client identifier. Each shard owns a slice of the keyspace.

Consistent hashing. Each shard owns an arc; a client key hashes to a position and is served by the next shard clockwise.

Consistent hashing is the usual call here. Put each shard at a position on a ring; a client’s key hashes to a point and is owned by the next shard clockwise. The reason to use it over plain hash % N: when you add or remove a shard, only a small fraction of keys move instead of everything reshuffling. Failing over a shard’s range is the same operation as adding one.

Limiter instances are stateless. All state lives in Redis, so scaling the limiter layer is just running more pods behind the LB. No data migration, no warm-up.

So both layers scale horizontally. With 1M active identities split across 10 shards, each shard owns ~100K clients and ~100K ops/sec, comfortably inside what one Redis instance handles. Deploy regional clusters and the cross-region traffic disappears.

Staying up

Rate limiting sits on every request. When it breaks, the question is: do legitimate users get blocked, or does abusive traffic get through? Both are bad. The HA design exists so you don’t have to find out.

The stateless limiter is easy. A dying instance just drops out of the LB pool.

Redis is the part that needs care. Each shard runs with 1–2 replicas, and Redis Cluster handles failover automatically: when a master falls over, a replica gets promoted within seconds. The brief window between failure and promotion is where the fail-open vs fail-closed question matters.

Fail-open lets everything through during the outage. You lose protection, you keep availability.

Fail-closed rejects everything. You stay safe, but legitimate users see a wall of 429s during what’s already a bad day.

I’d default to fail-open. Rate limiting isn’t authentication; a few seconds of unlimited traffic during a Redis failover is rarely catastrophic, and breaking the whole API for users while you fail over feels worse. Sensitive endpoints (payments, signup, password reset) can flip to fail-closed individually.

The things actually worth watching: Redis cluster health and failover events, p99 limiter latency, sudden jumps in 429 rate (might be a config bug, not abuse), and shard memory balance.

Holding the latency budget

Budget is 10ms per check. Most of it gets spent on the network round-trip to Redis. Almost everything else is small.

Connection pooling is the biggest single win. Opening a fresh TCP connection per check costs you a handshake (20–50ms over the wire, more across regions). With a pooled connection, the round-trip is sub-millisecond. Every decent Redis client does this; the only thing to tune is pool size against concurrent load.

Deploy regionally. Latency is bounded by physics. Run a limiter and Redis cluster in each region your traffic lives in. The cost is more clusters to operate, and cross-region rate-limit state is eventually consistent, which is fine here since approximate is in the requirements.

Cache the rules. Rules don’t change often. Hold them in process memory on each limiter instance with a 30-second or 1-minute refresh. A network round-trip to fetch the policy on every check would blow the budget by itself.

Pipelining and Lua scripts combine multiple Redis ops into one round-trip. Use Lua any time the algorithm needs more than one read or write to stay correct under concurrency, which is every algorithm except trivial fixed-window.

What I’d avoid: caching the decision locally. Tempting, but a stale decision can let a client past a tight limit, or block one that shouldn’t be blocked. Pooling and regional deployment do almost all the work. The rest is diminishing returns.

Hot keys

Hot keys happen when one client generates enough traffic on its own to overwhelm the single shard that holds its state. Everyone else’s shard is idle; one shard is pinned at 100%. For this to happen, a single client has to be doing tens of thousands of RPS, which generally means one of three things: a DDoS, a buggy retry loop, or a legitimate high-volume integration (analytics pipelines, batch importers, internal services).

The three of them want different responses.

For legitimate high-volume clients, push work back to them. SDKs that honour Retry-After and X-RateLimit-Remaining smooth traffic out before it ever reaches you. Add batch endpoints so one request carries many operations. Offer a premium tier with higher limits if the use case justifies dedicated capacity.

For abuse, the answer is usually upstream of the limiter. Cloudflare or AWS Shield will absorb the bulk of a DDoS at the edge. Inside the limiter, an auto-blocklist for clients that consistently exceed by a huge margin keeps them from even reaching the check. And if you’re keying on IP, accept that shared addresses exist and lean on authenticated user limits where you can.

For workloads that genuinely need to be one logical client but spread the load: split the counter across N sub-keys with an extra hash, so a single “client” effectively has N buckets. The limit becomes “K req/s across these N counters”. Accounting gets fuzzier, but no one shard takes the whole load.

Most of the time the better fix is upstream: pick limits that match what real workloads actually do, so hot keys don’t show up in the first place.

Dynamic configuration

You’ll want to change rules at runtime, without a deploy: raise limits temporarily during a launch, drop them when traffic looks suspicious, tier them by customer, A/B test values. So the limiter has to pick up rule changes quickly, keep all instances on the same version, and tolerate the config store being briefly unavailable.

Two ways to wire it.

Polling. Each instance asks a config store (DB, etcd, a flat file in S3) every N seconds. Simple, works with anything, no extra push infrastructure to run. Costs: changes lag by up to the poll interval, instances briefly disagree during a transition, and the store gets hit by every instance.

Push. A config service pushes new rules when they change. etcd watches, Consul, or a custom pubsub do this well. Updates land everywhere at once and the steady-state load on the config store is near-zero. Costs: more moving parts, and the push channel itself becomes load-bearing.

In practice you end up with both: push for the hot path (urgent changes propagate in seconds), poll as a fallback (so an instance can’t drift forever if it missed a push). Attach a monotonic version number so an instance can tell whether it’s already seen the latest rules.

Starting point: just polling. Add push when you actually need it.

In summary

Token bucket at the gateway, atomic Lua over Redis. That covers most workloads. Everything else in this post is a layer you add when traffic forces it.

For the algorithms themselves and the Lua scripts that keep each one safe under concurrency, see Rate Limiting Algorithms.

References