Designing a Distributed Rate Limiter

Tags: Redis, System Design, Distributed Systems

Introduction

A rate limiter is a critical component in modern distributed systems that controls the rate of traffic sent by clients or services. It enforces limits based on the number of requests a client is allowed to make within a specified timeframe, blocking any requests that exceed these limits.

Real-world examples:

  • An API that restricts clients to 200 requests per second
  • A registration system that prevents more than 10 account creations per IP address per hour
  • A payment service that limits transaction attempts to prevent abuse

Why Rate Limiters Matter

Rate limiters serve three primary purposes:

1. Prevent Resource Starvation Protect systems from Denial of Service (DoS) attacks or unintentional overload caused by software errors, misconfigurations, or malicious actors flooding the system with excessive requests.

2. Ensure Fair Resource Usage Prevent individual users or services from monopolizing shared resources, ensuring equitable distribution across all clients.

3. Control Costs Reduce infrastructure costs by limiting excess requests, allowing better resource allocation to high-priority APIs and preventing runaway experiments from generating unexpected bills.


Requirements Analysis

Designing a distributed rate limiter requires careful consideration of both functional and non-functional requirements. Let’s break these down systematically.

Functional Requirements

What is being rate limited?

The first decision is which identity requests will be rate limited against.

Common identities include:

  • IP address - Limit requests from specific network origins
  • User account - Enforce per-user limits
  • API key - Control access for different API consumers
  • Service - Apply limits at the service level
  • Combination - Support compound rules, such as API key + endpoint together
  • Application level properties - Custom attributes specific to the application domain

Requirement: Support multiple identity types, including IP address, user account, and API key.

What should happen when the limit is exceeded?

When a request exceeds the rate limit, the system should:

  • Reject the request immediately
  • Return a clear error response indicating throttling
  • Provide information about when the limit will reset (if applicable)

Requirement: Reject requests with proper error responses that clearly indicate rate limiting.

Is limiting a hard guarantee or an approximate one?

Does the system need to enforce limits with perfect accuracy, or are occasional over-allowances acceptable?

Perfect accuracy implies stronger coordination between instances, which can conflict with the non-functional requirements (latency, availability) we need to satisfy.

The system should enforce limits approximately; occasional over-allowances are acceptable.

Will we support configurable rules and policies?

Yes. The system should apply rules and policies that can be configured at runtime.

Should the system work in a distributed environment?

Yes. The rate limiter can be shared across multiple services.

Summary

  • Support different properties/identity to rate limit
  • Reject requests with proper response
  • Rate limiting with minor inaccuracies is allowed
  • Support configurable rules/policies
  • Support distributed rate limiting

Non-Functional Requirements

Latency

Rate limiting checks must be extremely fast to avoid impacting request response times, even under peak load.

Requirement: Rate limit checks must complete in under 10 milliseconds.

Availability

The rate limiter must be highly available. In failure scenarios, the system should prefer allowing traffic rather than blocking it (fail-open behavior).

Requirement: High availability with fail-open behavior during outages.

Scalability

The system must handle massive scale:

  • Support millions of requests per second
  • Scale horizontally with traffic growth
  • Handle millions of unique identities being rate limited simultaneously

Requirement: Scale to 1 million requests per second (RPS) with horizontal scalability.

Consistency

Global consistency is not required. Small amounts of excess traffic are acceptable if it means avoiding false rejections that degrade user experience.

Requirement: Eventual consistency is acceptable. Availability and performance take precedence over perfect consistency.

Non-Functional Requirements Summary

  • Low latency: < 10ms per rate limit check
  • High availability: Fail-open during outages
  • Scalability: Support 1 million RPS
  • Consistency: Eventual consistency acceptable

Capacity Estimation

Understanding the scale helps inform design decisions.

Traffic Estimates

  • Daily active clients: 10 million
  • Requests per client per day: 1,000 (average)
  • Total daily requests: 10 billion
  • Average requests per second: ~115,000 RPS
  • Peak requests per second: ~1 million RPS (10x average)

Storage Estimates

Not all identities are active simultaneously:

  • Total identities: 10 million (API keys, user IDs, IP addresses)
  • Active identities: 5-10% within any given time window
  • Concurrent active identities: 500K to 1 million

Each active identity requires state storage:

  • Unique identifier (API key, user ID, IP address)
  • Current counter/timestamp
  • Metadata (rule ID, window information)

Storage per identity: 100-200 bytes

Total active state: 100-200 MB for 1 million active identities

Read/Write Characteristics

Every request requires a read-modify-write operation:

  • At peak (1M RPS):
    • 1 million reads per second
    • 1 million writes per second

Access pattern:

  • Write-heavy workload
  • Highly concurrent operations
  • Low latency requirements

Capacity Summary

  • Peak traffic: 1 million RPS
  • Active state: 100-200 MB
  • Workload: Write-heavy, highly concurrent

Rate Limiting Algorithms

Different algorithms offer different trade-offs between accuracy, memory efficiency, and implementation complexity. Let’s examine the most common approaches.

Fixed Window Counter

How it works:

  • Divide time into fixed-size windows (e.g., 1-minute windows aligned to wall-clock time)
  • Maintain a counter for each identity within the current window
  • On each request:
    • If counter < limit: increment counter and allow request
    • Else: reject request
  • Reset counter when the window expires

Advantages:

  • Memory efficient: minimal state (counter + window timestamp)
  • Simple to understand and implement
  • Low computational overhead

Disadvantages:

  • Allows traffic bursts at window boundaries
  • A client can make 2x the limit by sending requests at the end of one window and the start of the next
  • Less accurate for strict rate limiting requirements

Use case: Suitable when some burst traffic is acceptable and simplicity is prioritized.
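The steps above can be sketched as a minimal in-memory fixed window counter. This is single-process only (the distributed version stores the counter in Redis, covered later); the injectable clock exists purely to make the example easy to exercise:

```python
import time

class FixedWindowLimiter:
    """Fixed window counter: one counter per (client, window start)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}  # (client_id, window_start) -> request count

    def allow(self, client_id):
        now = self.clock()
        # Align the window to wall-clock time, e.g. minute boundaries.
        window_start = int(now // self.window) * self.window
        key = (client_id, window_start)
        count = self.counters.get(key, 0)
        if count < self.limit:
            self.counters[key] = count + 1
            return True
        return False
```

Note the boundary-burst weakness: a counter resets the instant a new window begins, regardless of how recently requests arrived.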

Sliding Window Log

How it works:

  • Store timestamps of all requests for each identity within the window
  • On each request:
    • Remove timestamps older than (current_time - window_size)
    • Count remaining timestamps
    • If count < limit: add current timestamp and allow request
    • Else: reject request

Advantages:

  • Accurate enforcement: prevents exceeding limits in any rolling window
  • No burst traffic at boundaries
  • Precise rate limiting

Disadvantages:

  • Memory inefficient: state grows linearly with traffic volume
  • Higher memory usage for high-traffic identities
  • More expensive cleanup operations

Use case: When accuracy is critical and memory is not a constraint.
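A minimal in-memory sketch of the sliding window log (a production version would typically keep the timestamps in a Redis sorted set):

```python
import time
from collections import deque

class SlidingWindowLogLimiter:
    """Keeps per-client request timestamps; exact over any rolling window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.logs = {}  # client_id -> deque of request timestamps

    def allow(self, client_id):
        now = self.clock()
        log = self.logs.setdefault(client_id, deque())
        # Drop timestamps that have fallen out of the rolling window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False
```

The memory cost is visible here: the deque holds one entry per allowed request, which is exactly why this algorithm gets expensive for high-traffic identities.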

Sliding Window Counter

How it works:

  • Maintain counters for two consecutive fixed windows (previous and current)
  • On each request:
    • Calculate overlap ratio based on position within current window
    • Compute weighted count: current_window_count + (previous_window_count × overlap_ratio)
    • If weighted count < limit: allow request
    • Else: reject request

Example calculation:

  • Current window: 30% elapsed
  • Overlap ratio: 1 - 0.30 = 0.70
  • Current window count: 50 requests
  • Previous window count: 60 requests
  • Weighted count: 50 + (60 × 0.70) = 92 requests

Advantages:

  • Smooths out traffic spikes better than fixed window
  • Memory efficient (only two counters per identity)
  • More accurate than fixed window while maintaining efficiency

Disadvantages:

  • Approximate: assumes even distribution of requests in previous window
  • Not perfectly accurate but good enough for most use cases

Use case: Balanced approach when you need better accuracy than fixed window but can’t afford the memory cost of sliding window log.
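The weighted-count calculation above translates directly into code. An in-memory sketch (two counters per client, as in the real thing, but without the shared store):

```python
import time

class SlidingWindowCounterLimiter:
    """Weights the previous fixed window's count by its remaining overlap
    with the rolling window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}  # (client_id, window_index) -> count

    def allow(self, client_id):
        now = self.clock()
        idx = int(now // self.window)
        elapsed_ratio = (now % self.window) / self.window
        current = self.counts.get((client_id, idx), 0)
        previous = self.counts.get((client_id, idx - 1), 0)
        # e.g. 30% into the current window -> previous window weighted 0.70
        weighted = current + previous * (1 - elapsed_ratio)
        if weighted < self.limit:
            self.counts[(client_id, idx)] = current + 1
            return True
        return False
```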

Token Bucket

How it works:

  • Maintain for each identity: token count and last refill timestamp
  • Tokens are added at a fixed refill rate up to a maximum capacity (bucket size)
  • On each request:
    • Calculate tokens to add based on elapsed time: (current_time - last_refill_time) × refill_rate
    • Refill tokens (capped at bucket size)
    • If tokens ≥ 1: consume one token and allow request
    • Else: reject request

Advantages:

  • Intuitive and easy to understand
  • Memory efficient (two values per identity)
  • Allows burst traffic: unused tokens accumulate
  • Flexible: can handle both rate limiting and burst tolerance

Disadvantages:

  • Requires careful parameter tuning (bucket size and refill rate)
  • May allow more traffic than intended if parameters are misconfigured
  • Less precise than sliding window approaches

Use case: When you need to allow bursts of traffic while maintaining an average rate limit.
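A minimal in-memory token bucket following the steps above (two values per client; the clock is injectable for testing):

```python
import time

class TokenBucketLimiter:
    """Token bucket: tokens refill continuously up to a burst capacity."""

    def __init__(self, capacity, refill_rate, clock=time.time):
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.state = {}  # client_id -> (tokens, last_refill_time)

    def allow(self, client_id):
        now = self.clock()
        tokens, last = self.state.get(client_id, (self.capacity, now))
        # Refill based on elapsed time, capped at the bucket size.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self.state[client_id] = (tokens - 1, now)
            return True
        self.state[client_id] = (tokens, now)
        return False
```

Burst tolerance falls out naturally: a client that has been idle accumulates up to `capacity` tokens and can spend them all at once.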

Leaky Bucket

How it works:

  • Model requests as items in a queue (bucket) with fixed capacity
  • On each request:
    • If bucket has capacity: enqueue request
    • Else: reject request
  • Process requests from bucket at constant outflow rate

Advantages:

  • Memory efficient: state based on queue size
  • Provides strong protection for downstream systems
  • Stable outflow rate prevents downstream overload
  • Smooths traffic spikes

Disadvantages:

  • Recent requests may be rate limited if old requests are still queued
  • Requires careful tuning of bucket size and outflow rate
  • Less suitable for real-time rate limiting (requests are queued, not immediately processed)

Use case: When protecting downstream systems from traffic spikes is more important than immediate request processing.
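The section above describes the queue form of the leaky bucket. The sketch below uses the equivalent "meter" formulation, which tracks the queue level and drains it at a constant rate, admitting or rejecting immediately instead of actually holding requests:

```python
import time

class LeakyBucketLimiter:
    """Leaky bucket as a meter: the level drains at a constant rate;
    a request is admitted only if the bucket has spare capacity."""

    def __init__(self, capacity, leak_rate, clock=time.time):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.clock = clock
        self.state = {}  # client_id -> (queue_level, last_leak_time)

    def allow(self, client_id):
        now = self.clock()
        level, last = self.state.get(client_id, (0.0, now))
        # Drain the bucket for the time elapsed since the last check.
        level = max(0.0, level - (now - last) * self.leak_rate)
        if level + 1 <= self.capacity:
            self.state[client_id] = (level + 1, now)
            return True
        self.state[client_id] = (level, now)
        return False
```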

Algorithm Comparison

| Algorithm              | Accuracy | Memory | Burst Handling | Complexity |
|------------------------|----------|--------|----------------|------------|
| Fixed Window           | Low      | Low    | Poor           | Low        |
| Sliding Window Log     | High     | High   | Excellent      | Medium     |
| Sliding Window Counter | Medium   | Low    | Good           | Medium     |
| Token Bucket           | Medium   | Low    | Excellent      | Low        |
| Leaky Bucket           | Medium   | Low    | Good           | Medium     |

Core Entities

The system revolves around three main entities:

  • Request - An incoming API request that needs to be rate limited
  • Client - The identity being rate limited (IP address, user ID, API key, etc.)
  • Rule - A rate limiting policy that defines limits, windows, and conditions

API Interface

The rate limiter exposes a simple interface:

Function signature:

allowRequest(clientId, ruleId) → RateLimitResponse

Input parameters:

  • clientId: The identity being rate limited (IP address, user ID, API key, etc.)
  • ruleId: The identifier of the rate limiting rule to apply

Response structure:

  • allowed: Boolean indicating whether the request should be allowed
  • remaining: Number of requests remaining in the current window
  • resetTime: Unix timestamp indicating when the rate limit will reset

Example response:

{
  allowed: true,
  remaining: 45,
  resetTime: 1699123500
}
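The interface can be sketched in Python as follows. The names are illustrative, and the body is a stub that simply mirrors the example response above; a real implementation would look up the rule and consult the shared state store:

```python
from dataclasses import dataclass

@dataclass
class RateLimitResponse:
    allowed: bool    # whether the request should proceed
    remaining: int   # requests left in the current window
    reset_time: int  # Unix timestamp when the limit resets

def allow_request(client_id: str, rule_id: str) -> RateLimitResponse:
    # Stub: a real implementation would resolve the rule for rule_id and
    # check client_id's current usage in the shared state store.
    return RateLimitResponse(allowed=True, remaining=45, reset_time=1699123500)
```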

High Level Design

Now that we understand the requirements and algorithms, let’s design the overall system architecture. We need to decide where the rate limiter lives, how it identifies clients, how it stores state, and how requests flow through the system.

Where should the rate limiter live in the architecture?

Before we can limit requests, we need to decide where in our system architecture the rate limiter should be placed. This decision affects what information we have access to, how it integrates with other services, and how we can scale it.

We have three main placement options:

Option 1: In-Process Rate Limiting

The rate limiter runs as part of the application code on each server. Each application instance maintains its own rate limiting state.

Pros:

  • No network overhead - checks happen locally
  • Simple to implement initially
  • Low latency for rate limit checks

Cons:

  • Each server has its own view of limits - limits don’t work correctly across multiple servers
  • Can’t share state between instances
  • Difficult to scale consistently
  • Application code becomes responsible for rate limiting logic

Option 2: Dedicated Rate Limiter Service

A separate microservice handles all rate limiting. Application services make network calls to this service for every request.

Pros:

  • Centralized rate limiting logic
  • Can share state across all application instances
  • Easier to update rate limiting without deploying application code

Cons:

  • Adds network latency to every request
  • Extra service to maintain and scale
  • Network calls can fail, adding complexity to error handling
  • Can become a bottleneck if not scaled properly

Option 3: API Gateway / Load Balancer

The rate limiter is integrated into the API Gateway or Load Balancer layer, before requests reach application servers.

Pros:

  • Centralized control point for all incoming traffic
  • No extra network calls - rate limiting happens as part of request routing
  • Can block traffic before it reaches application servers, saving resources
  • Natural place to implement rate limiting since gateway already sees all requests
  • Can leverage gateway’s existing infrastructure for scaling

Cons:

  • Requires gateway support for rate limiting features
  • Gateway becomes a critical component - must be highly available

Our Choice: API Gateway Approach

For our design, we’ll use the API Gateway approach. It’s the most common pattern in production systems and gives us centralized control without adding extra network calls. The gateway already sees all incoming requests, so it’s the natural place to enforce rate limits. This also means we can block abusive traffic before it consumes application server resources.


How do we identify clients?

Since we’re using the API Gateway approach, our rate limiter has access to information in the HTTP request itself. This includes the request URL, HTTP headers, query parameters, and the client’s IP address.

We need to decide what makes a “client” unique - this determines how rate limits get applied. The key we use determines whether limits are per-user, per-IP, or per-API-key.

User ID

For authenticated APIs, we can identify clients by user ID. This is typically present in the Authorization header as a JWT token or session token. Each logged-in user gets their own rate limit allocation.

When to use:

  • APIs that require authentication
  • When you want per-user limits regardless of device or location
  • When users can have different limits based on their account type

IP Address

We can identify clients by their IP address, which is typically available in the X-Forwarded-For header or directly from the connection.

When to use:

  • Public APIs without authentication
  • When you want to limit by network origin
  • For basic abuse prevention

Limitations:

  • Users behind NATs or corporate firewalls share IP addresses
  • Mobile users may have changing IP addresses
  • Can block legitimate users if IP is shared

API Key

For developer APIs, clients can be identified by API keys, typically sent in the X-API-Key header.

When to use:

  • Developer-facing APIs
  • When you want different limits for different API consumers
  • When you need to track usage per API key holder

Combined Approaches

In practice, you’ll often want a combination. For example:

  • Authenticated users: rate limit by user ID
  • Unauthenticated requests: rate limit by IP address
  • Developer API requests: rate limit by API key

You might also combine identities for stricter limits, like rate limiting by both API key and endpoint, or user ID and IP address together.

The choice depends on your use case. Are all users authenticated? Is this a developer API? These questions help determine the right identification strategy.


How do we store rate limiting state?

Rate limiting requires storing state for each client - counters, timestamps, tokens, or whatever the algorithm needs. Since we have multiple rate limiter instances (for scalability and availability), we need a shared state store that all instances can access.

Why we need shared state:

If each rate limiter instance maintained its own state, a user could hit different instances and bypass limits. For example, if the limit is 100 requests per minute, a user could send 100 requests to instance A and 100 requests to instance B, effectively getting 200 requests instead of 100.

Requirements for the state store:

  • Low latency: Rate limit checks must be fast, so the store must respond quickly
  • Atomic operations: We need operations like “increment and check” to happen atomically to prevent race conditions
  • Expiration support: Rate limit windows expire, so we need automatic cleanup of old data
  • High throughput: Must handle millions of read-modify-write operations per second
  • High availability: If the store is down, rate limiting breaks

Why Redis?

Redis is the most common choice for rate limiting state storage because:

  • In-memory storage: Extremely fast reads and writes
  • Atomic operations: Commands like INCR, EXPIRE, and Lua scripts ensure atomicity
  • Built-in expiration: Automatic cleanup of expired keys
  • High performance: Can handle millions of operations per second
  • Data structures: Supports counters, sets, and other structures needed for different algorithms
  • Clustering: Redis Cluster provides sharding and high availability

How Redis stores rate limit state:

For a fixed window counter algorithm, we might store:

  • Key: rate_limit:user_12345:window_202401011200 (client ID + window identifier)
  • Value: Counter (number of requests in this window)
  • Expiration: Set to expire when window ends

When a request arrives:

  1. Check if key exists and get current count
  2. If count < limit: increment counter atomically and allow request
  3. If count >= limit: reject request
  4. Set expiration if this is the first request in the window

Redis provides atomic operations like INCR (increment) and EXPIRE (set expiration) that make this straightforward. We can also use Lua scripts to combine multiple operations atomically.
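A sketch of the Lua-script approach, assuming a redis-py-style client that exposes `eval`. This variant increments first and then checks, a common two-command simplification of the steps above; the script runs atomically inside Redis, so concurrent requests cannot race between the read and the write:

```python
# Lua executed atomically by Redis: INCR the window counter, set its TTL on
# first use, then compare against the limit.
FIXED_WINDOW_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0
end
return 1
"""

def allow_request(r, client_id, window_id, limit, window_seconds):
    """r is any client exposing eval(script, numkeys, *keys_and_args),
    e.g. a redis-py Redis instance."""
    key = f"rate_limit:{client_id}:window_{window_id}"
    # KEYS[1] = counter key; ARGV[1] = limit; ARGV[2] = TTL in seconds
    return r.eval(FIXED_WINDOW_LUA, 1, key, limit, window_seconds) == 1
```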

Alternative stores:

While Redis is most common, other options exist:

  • Memcached: Similar to Redis but with fewer features
  • In-memory databases: Like Hazelcast or Apache Ignite
  • Distributed caches: Like Amazon ElastiCache or Azure Cache

The key is choosing something fast, with atomic operations, and expiration support.


How does a request flow through the system?

Let’s trace what happens when a request arrives at our system:

Step 1: Request arrives at API Gateway

A client sends an HTTP request to our API. The request first hits the API Gateway, which acts as the entry point for all traffic.

Step 2: Gateway routes to Rate Limiter

The API Gateway forwards the request to one of the rate limiter instances. The gateway uses load balancing to distribute requests across multiple rate limiter instances for scalability.

Step 3: Rate Limiter extracts client identity

The rate limiter examines the request to identify the client:

  • Checks Authorization header for user ID (if authenticated)
  • Checks X-API-Key header for API key (if present)
  • Falls back to IP address from X-Forwarded-For header

Based on the request type and available information, it determines which identity to use for rate limiting.

Step 4: Rate Limiter retrieves configuration

The rate limiter looks up which rules apply to this client and endpoint. Configuration might be cached in memory or retrieved from a configuration store. Rules specify things like “authenticated users get 1000 requests per hour” or “search API allows 10 requests per minute per IP.”

Step 5: Rate Limiter checks current state

The rate limiter queries the shared state store (Redis) to check the current rate limit state for this client:

  • For fixed window: checks counter for current window
  • For sliding window: counts timestamps within window
  • For token bucket: checks current token count

This involves reading from Redis, which must be fast to meet our latency requirements.

Step 6: Rate Limiter makes decision

Based on the current state and the limit, the rate limiter decides:

  • If under limit: Update state (increment counter, consume token, etc.) and allow request
  • If over limit: Reject request with HTTP 429 status

Step 7: Update shared state

If allowing the request, the rate limiter updates the shared state store atomically. This might involve incrementing a counter, adding a timestamp, or consuming a token.

Step 8: Forward or reject

  • If allowed: Rate limiter forwards the request to the application server and includes rate limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) in the response
  • If rejected: Rate limiter returns HTTP 429 response with helpful headers indicating when the limit resets

Step 9: Application processes request

If the request was allowed, it reaches the application server, which processes it normally and returns a response to the client.

The entire rate limiting check should complete in under 10 milliseconds to avoid impacting request response times. This requires fast Redis operations, efficient network paths, and potentially cached configuration.


Deep Dives: Scaling and Operations

Now that we have the basic architecture, let’s address the challenges of scaling this system to handle 1 million requests per second while maintaining low latency and high availability.

How do we scale to handle 1M requests/second?

At 1 million requests per second, a single Redis instance won’t be enough. Each request requires a read-modify-write operation, which means we need to handle 1 million reads and 1 million writes per second. We need to distribute this load across multiple Redis instances.

The Problem: Single Redis Bottleneck

A single Redis instance can handle roughly 100,000-200,000 operations per second, depending on the operations and hardware. At 1M RPS, we’d need 5-10 Redis instances just to handle the load, not accounting for redundancy.

Solution: Redis Sharding

We need to partition (shard) our Redis data across multiple instances. Each Redis instance handles a subset of clients, distributing the load.

How sharding works:

When a rate limit check happens, we need to determine which Redis shard stores the state for a particular client. We use the client identifier (user ID, IP address, or API key) to compute which shard to use.

Consistent Hashing

Consistent hashing is the standard approach for distributing keys across shards. Here’s how it works:

  1. Each Redis shard is assigned a position on a hash ring (a conceptual circle)
  2. When we need to find which shard stores a client’s data, we hash the client identifier
  3. We walk clockwise to the first shard whose position on the ring is at or after the hash value (wrapping around the ring if necessary)
  4. That shard stores and serves data for that client

Why consistent hashing?

  • Minimal redistribution: When a shard is added or removed, only a small fraction of keys need to move
  • Even distribution: Keys are distributed evenly across shards
  • Handles failures: If a shard fails, its keys can be redistributed to other shards
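A minimal hash ring to make this concrete. Virtual nodes (multiple ring positions per shard) are the standard trick for even distribution; the shard names here are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps client keys to shards via a hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        self.ring = []  # sorted list of (position, shard)
        for shard in shards:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{shard}#{i}"), shard))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, client_id):
        pos = self._hash(client_id)
        # First ring position at or after the key's hash, wrapping around.
        i = bisect.bisect(self.ring, (pos, ""))
        if i == len(self.ring):
            i = 0
        return self.ring[i][1]
```

The "minimal redistribution" property follows directly: removing a shard deletes only its own ring positions, so keys whose successor belonged to another shard keep mapping to the same place.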

Horizontal Scaling of Rate Limiter Instances

The rate limiter instances themselves also need to scale horizontally. We run multiple stateless rate limiter instances behind the API Gateway’s load balancer. Each instance can handle rate limit checks independently, and they all share access to the Redis cluster.

Scaling Strategy:

  • Rate Limiter Instances: Add more instances as traffic grows. Since they’re stateless, scaling is straightforward - just add more instances and update the load balancer
  • Redis Shards: Add more Redis shards as the number of active clients grows or as throughput requirements increase
  • Geographic Distribution: Deploy rate limiter instances and Redis clusters in multiple regions to reduce latency for global users

Capacity Planning:

With 1 million active identities and 1M RPS:

  • Each Redis shard handles a subset of clients (e.g., 100K clients per shard with 10 shards)
  • Each shard handles roughly 100K operations per second
  • Rate limiter instances can be scaled independently based on CPU and network capacity

The key insight is that both rate limiter instances and Redis shards can scale horizontally, allowing the system to grow with traffic.


How do we ensure high availability?

Rate limiters are critical infrastructure - if they fail, either legitimate traffic gets blocked (fail-closed) or abusive traffic gets through (fail-open). We need to design for high availability.

Stateless Rate Limiter Instances

Rate limiter instances are stateless - they don’t store any local state. This means:

  • If an instance fails, the load balancer routes traffic to other instances
  • We can add or remove instances without data migration
  • Instances can be replaced without losing rate limiting state

Redis Replication and Failover

Redis state is critical - if Redis is down, rate limiting breaks. We need redundancy:

Redis Replication:

Each Redis shard should have replicas (typically 1-2 replicas per master). Replicas maintain copies of the master’s data and can take over if the master fails.

Automatic Failover:

When a Redis master fails, one of its replicas is automatically promoted to master. This failover should happen quickly (within seconds) to minimize impact.

Redis Cluster:

Redis Cluster provides built-in sharding, replication, and automatic failover. It’s the standard solution for high-availability Redis deployments at scale.

Fail-Open vs Fail-Closed Behavior

What happens when the rate limiter or Redis is unavailable? We have two options:

Fail-Closed (Strict):

If rate limiting can’t be checked, reject all requests. This is the safest approach for security but can block legitimate traffic during outages.

Fail-Open (Permissive):

If rate limiting can’t be checked, allow all requests through. This maintains availability but loses protection during outages.

Our Choice: Fail-Open

For most use cases, we prefer fail-open behavior:

  • Better user experience - legitimate users aren’t blocked during outages
  • Rate limiting is about preventing abuse, not security-critical authentication
  • Short outages are acceptable - we can detect and fix issues quickly
  • We can monitor for unusual traffic patterns and respond manually if needed

However, for sensitive endpoints (like payment processing), fail-closed might be appropriate.

Monitoring and Alerting

High availability requires good observability:

  • Monitor Redis cluster health and failover events
  • Track rate limiter instance health and response times
  • Alert on unusual rejection rates or latency spikes
  • Monitor for Redis memory usage and shard load distribution

Redundancy at Every Level:

  • Multiple rate limiter instances (load balanced)
  • Redis Cluster with replication (automatic failover)
  • Multiple API Gateway instances (if applicable)
  • Geographic redundancy (multiple regions)

The goal is to ensure that the failure of any single component doesn’t break rate limiting entirely.


How do we minimize latency?

Every request goes through rate limiting, so latency matters. Our target is under 10 milliseconds per rate limit check. Let’s look at where latency comes from and how to reduce it.

Sources of Latency:

  1. Network round-trip to Redis: The time to send a request to Redis and receive a response
  2. Redis processing time: The time Redis takes to execute operations
  3. Connection establishment: The time to establish TCP connections (if not pooled)
  4. Serialization overhead: Converting data to/from Redis protocol format

Connection Pooling

The biggest latency win comes from connection pooling. Instead of establishing a new TCP connection for each rate limit check, we maintain a pool of persistent connections to Redis.

How it works:

  • Rate limiter instances maintain a pool of open TCP connections to Redis
  • When a rate limit check is needed, we reuse an existing connection from the pool
  • After the check, the connection returns to the pool for reuse

Impact:

  • Without pooling: Each request requires TCP handshake (20-50ms depending on network distance)
  • With pooling: Connection reuse eliminates handshake overhead (sub-millisecond)

Most Redis clients handle connection pooling automatically, but we need to tune pool size based on concurrent requests and Redis response times.
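The mechanism is worth seeing even though clients handle it for you. A minimal pooling sketch, where `connect` stands in for a hypothetical factory that would open a Redis socket:

```python
import queue

class ConnectionPool:
    """Minimal pooling sketch: connections are created once up front and
    reused, so each rate-limit check skips the TCP handshake."""

    def __init__(self, connect, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=0.01):
        # Bounded wait keeps worst-case latency predictable under load.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

Usage follows the acquire/use/release pattern: `conn = pool.acquire()`, perform the rate limit check, then `pool.release(conn)` in a `finally` block so connections always return to the pool.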

Geographic Distribution

Network latency increases with distance. A user in Tokyo talking to Redis in Virginia will see much higher latency than talking to Redis in the same region.

Solution: Multi-Region Deployment

Deploy rate limiter instances and Redis clusters in multiple geographic regions:

  • Users in Asia connect to Asian region
  • Users in Europe connect to European region
  • Users in Americas connect to American region

Trade-offs:

  • Lower latency for users (data stays in their region)
  • More complex operations (managing multiple clusters)
  • Eventual consistency between regions (acceptable for rate limiting)

Caching Configuration

Rate limiting rules don’t change frequently. We can cache configuration in memory on rate limiter instances to avoid database lookups on every request.

What to cache:

  • Rate limiting rules (limits, windows, which clients they apply to)
  • Client-to-rule mappings
  • Rule metadata

Cache invalidation:

When rules change, we need to invalidate caches. This can be done via:

  • Time-based expiration (poll for updates periodically)
  • Push notifications (invalidate when rules change)
  • Version numbers (check version on each request, refresh if changed)

Other Optimizations:

  • Redis pipelining: Batch multiple operations in a single network round-trip
  • Lua scripts: Combine multiple Redis operations into a single atomic script
  • Local caching: Cache recent rate limit checks locally (risky - can lead to incorrect decisions)

For most systems, connection pooling and geographic distribution provide the biggest latency improvements. Other optimizations add complexity and are usually unnecessary unless you’re operating at extreme scale.


How do we handle hot keys?

Hot keys occur when a single client (identified by user ID, IP address, or API key) generates enough traffic to overwhelm a single Redis shard. This can happen with both legitimate high-volume clients and abusive traffic.

What creates hot keys?

A single client would need to generate tens of thousands of requests per second to overwhelm a Redis shard. This often indicates:

  • DDoS attacks or malicious bots
  • Misconfigured clients making excessive requests
  • Legitimate high-volume clients (analytics systems, data pipelines, mobile apps with aggressive refresh)

The Problem:

When all requests from a single client hit the same Redis shard, that shard becomes a bottleneck. Other shards might be underutilized while one shard is overwhelmed.

Strategies for Legitimate High-Volume Clients:

Client-Side Rate Limiting:

Encourage well-behaved clients to implement their own rate limiting. API SDKs often include built-in client-side rate limiting that respects server response headers. This:

  • Smooths traffic patterns before requests reach the server
  • Reduces server load
  • Prevents legitimate users from accidentally creating hot shards

Request Queuing and Batching:

Allow clients to batch multiple operations into single requests, reducing the total number of rate limit checks needed.

Premium Tiers:

Offer higher rate limits for power users who need them, potentially with dedicated infrastructure or different sharding strategies.

Strategies for Abusive Traffic:

Automatic Blocking:

When a client consistently exceeds rate limits (e.g., 10 times in a minute), temporarily block their IP address or API key entirely. Maintain a blocklist in Redis (or a separate store) and check it before rate limit checks.
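The blocking rule above (repeated violations within a window trigger a temporary block) can be sketched as follows. This is an illustrative in-memory version; the class name and parameters are hypothetical, and in production the violation counts and blocklist would live in Redis so all instances share them:

```python
class ViolationBlocker:
    """Auto-block clients that exceed rate limits `threshold` times
    within `window_seconds`, for `block_seconds`. In-memory sketch;
    production state would be stored in Redis (or a separate store)."""

    def __init__(self, threshold=10, window_seconds=60, block_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.block_seconds = block_seconds
        self.violations = {}     # client -> recent violation timestamps
        self.blocked_until = {}  # client -> time the block expires

    def is_blocked(self, client, now):
        # Checked *before* the normal rate limit check.
        return self.blocked_until.get(client, 0) > now

    def record_violation(self, client, now):
        # Keep only violations inside the sliding window, then add this one.
        recent = [t for t in self.violations.get(client, [])
                  if now - t < self.window]
        recent.append(now)
        self.violations[client] = recent
        if len(recent) >= self.threshold:
            self.blocked_until[client] = now + self.block_seconds

blocker = ViolationBlocker(threshold=3, window_seconds=60, block_seconds=300)
for t in (1, 2, 3):
    blocker.record_violation("10.0.0.1", now=t)
print(blocker.is_blocked("10.0.0.1", now=4))    # True
print(blocker.is_blocked("10.0.0.1", now=400))  # False (block expired)
```

Because the blocklist check is a single key lookup, it is much cheaper than the full rate limit check, which is why it runs first.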

DDoS Protection:

Use services like Cloudflare or AWS Shield that can detect and block malicious traffic before it reaches your rate limiter. These services operate at the network edge and can filter traffic before it consumes your infrastructure.

Rate Limit Tuning:

Set higher limits for IP-based rate limiting to account for legitimate users sharing IP addresses (corporate NATs, public WiFi). Rely more on authenticated user limits where possible: a user ID identifies a single client, while one IP address may be shared by many.

Key Insight:

Client-side rate limiting serves as a valuable complement to server-side protection. While you can’t trust clients for security, encouraging good citizenship in your API ecosystem reduces infrastructure load and prevents legitimate users from accidentally triggering hot shard scenarios.

The best approach is to design rate limits upfront to account for legitimate use cases rather than trying to handle hot keys after they occur.


How do we handle dynamic rule configuration?

So far, we’ve assumed rate limiting rules are static. But production systems need flexibility to adjust limits without deploying new code. You might need to:

  • Temporarily increase limits during a product launch
  • Give premium customers higher limits
  • Quickly reduce limits if you’re seeing unusual traffic patterns
  • A/B test different limit values

Requirements:

  • Update rules without restarting rate limiter instances
  • Changes should take effect quickly (within seconds or minutes)
  • Support different rules for different clients, endpoints, or scenarios
  • Maintain consistency across all rate limiter instances

Poll-Based Configuration

Rate limiter instances periodically poll a configuration store (database, file system, or configuration service) to check for rule updates.

How it works:

  1. Configuration is stored in a central location (database, S3, etc.)
  2. Rate limiter instances poll every N seconds (e.g., 30 seconds)
  3. If rules have changed, instances reload configuration
  4. New rules take effect on the next poll cycle
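The four steps above amount to a background loop that re-reads the rule set on a timer. A minimal sketch (the `PollingConfigLoader` class is hypothetical; `fetch_rules` is a callable standing in for the read from a database or S3):

```python
import threading
import time

class PollingConfigLoader:
    """Poll-based configuration: a background thread re-reads the rule
    set from a central store every `interval` seconds."""

    def __init__(self, fetch_rules, interval=30.0):
        self.fetch_rules = fetch_rules
        self.interval = interval
        self.rules = fetch_rules()           # step 1-2: initial load
        self._stop = threading.Event()

    def _run(self):
        # wait() doubles as the sleep and the shutdown signal.
        while not self._stop.wait(self.interval):
            self.rules = self.fetch_rules()  # step 3: reload configuration

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def stop(self):
        self._stop.set()

current = {"default": 200}
loader = PollingConfigLoader(lambda: dict(current), interval=0.05)
loader.start()
current["default"] = 500   # rule change in the central store
time.sleep(0.2)            # step 4: takes effect on the next poll cycle
loader.stop()
print(loader.rules)        # {'default': 500}
```

The worst-case propagation delay is one poll interval, which is exactly the trade-off listed under Cons below.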

Pros:

  • Simple to implement
  • Works with any configuration store
  • No need for push infrastructure

Cons:

  • Delay between a rule change and it taking effect (up to one poll interval)
  • Polling adds load to configuration store
  • Instances might have slightly different rules during transition

Push-Based Configuration

A configuration service pushes rule updates to rate limiter instances immediately when rules change.

How it works:

  1. Configuration is stored in a central service (like etcd, Consul, or a custom service)
  2. When rules change, the service pushes updates to all rate limiter instances
  3. Instances update their in-memory configuration immediately
  4. Changes take effect within seconds
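The push model is a publish-subscribe pattern: instances register with the configuration service, which fans out each rule change. A minimal single-process sketch (both class names are hypothetical; a real deployment would push over the network via etcd/Consul watches or a message bus):

```python
class ConfigService:
    """Push-based configuration: instances subscribe and receive new
    rule sets as soon as they are published."""

    def __init__(self):
        self.subscribers = []
        self.rules = {}

    def subscribe(self, instance):
        self.subscribers.append(instance)
        instance.apply_rules(self.rules)    # seed with the current rules

    def publish(self, new_rules):
        self.rules = new_rules
        for instance in self.subscribers:   # push to every instance at once
            instance.apply_rules(new_rules)

class RateLimiterInstance:
    def __init__(self):
        self.rules = {}

    def apply_rules(self, rules):
        self.rules = dict(rules)            # swap in-memory config immediately

service = ConfigService()
a, b = RateLimiterInstance(), RateLimiterInstance()
service.subscribe(a)
service.subscribe(b)
service.publish({"default": 200, "premium": 1000})
print(a.rules == b.rules)  # True: all instances converge immediately
```

The sketch glosses over the hard parts listed under Cons: in a real system a push can fail mid-fan-out, so the service must retry and instances usually keep a fallback poll to reconverge.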

Pros:

  • Fast updates (changes take effect immediately)
  • No polling overhead
  • All instances get updates simultaneously

Cons:

  • More complex infrastructure (need push mechanism)
  • Need to handle instance failures (retry pushes)
  • Requires configuration service to be highly available

Hybrid Approach

Many systems use a hybrid:

  • Poll for periodic updates (fallback)
  • Push for urgent changes (primary)
  • Version numbers to detect changes efficiently

Configuration Store Options:

  • Database: Simple but adds latency for reads
  • Key-Value Store: etcd, Consul provide watch/notify capabilities
  • Object Storage: S3, GCS for file-based configuration
  • Configuration Service: Custom service with push capabilities

Our Recommendation:

For most systems, start with poll-based configuration for simplicity. As you scale and need faster updates, move to push-based configuration. The choice depends on how frequently rules change and how quickly updates need to take effect.

The key is ensuring all rate limiter instances eventually converge on the same configuration, even if there’s a brief period of inconsistency during updates.