Designing a Distributed Rate Limiter

Tags: Redis, System Design, Distributed Systems

Introduction

A rate limiter is a critical component in modern distributed systems that controls the rate of traffic sent by clients or services. It enforces limits based on the number of requests a client is allowed to make within a specified timeframe, blocking any requests that exceed these limits.

Real-world examples:

  • An API that restricts clients to 200 requests per second
  • A registration system that prevents more than 10 account creations per IP address per hour
  • A payment service that limits transaction attempts to prevent abuse

Why Rate Limiters Matter

Rate limiters serve three primary purposes:

1. Prevent Resource Starvation Protect systems from Denial of Service (DoS) attacks or unintentional overload caused by software errors, misconfigurations, or malicious actors flooding the system with excessive requests.

2. Ensure Fair Resource Usage Prevent individual users or services from monopolizing shared resources, ensuring equitable distribution across all clients.

3. Control Costs Reduce infrastructure costs by limiting excess requests, allowing better resource allocation to high-priority APIs and preventing runaway experiments from generating unexpected bills.


Requirements Analysis

Designing a distributed rate limiter requires careful consideration of both functional and non-functional requirements. Let’s break these down systematically.

Functional Requirements

What is being rate limited?

The first decision is which identity requests will be rate limited against.

Common identities include:

  • IP address - Limit requests from specific network origins
  • User account - Enforce per-user limits
  • API key - Control access for different API consumers
  • Service - Apply limits at the service level
  • Combination - Support compound rules, such as API key + endpoint together
  • Application level properties - Custom attributes specific to the application domain

Requirement: Support multiple identity types, including IP address, user account, and API key.

What should happen when the limit is exceeded?

When a request exceeds the rate limit, the system should:

  • Reject the request immediately
  • Return a clear error response indicating throttling
  • Provide information about when the limit will reset (if applicable)

Requirement: Reject requests with proper error responses that clearly indicate rate limiting.

Is limiting a hard guarantee or an approximate one?

Does the system need to enforce limits with perfect accuracy, or are occasional over-allowances acceptable?

Perfect accuracy implies stronger coordination between instances, which can conflict with the non-functional requirements (latency, availability) we need to satisfy.

The system should enforce limits approximately; occasional over-allowances are acceptable.

Will we support configurable rules and policies?

Yes. The system should apply rules and policies that can be configured at runtime.

Should the system work in a distributed environment?

Yes. The rate limiter can be shared across multiple services.

Summary

  • Support different properties/identity to rate limit
  • Reject requests with proper response
  • Rate limiting with minor inaccuracies is allowed
  • Support configurable rules/policies
  • Support distributed rate limiting

Non-Functional Requirements

Latency

Rate limiting checks must be extremely fast to avoid impacting request response times, even under peak load.

Requirement: Rate limit checks must complete in under 10 milliseconds.

Availability

The rate limiter must be highly available. In failure scenarios, the system should prefer allowing traffic rather than blocking it (fail-open behavior).

Requirement: High availability with fail-open behavior during outages.

Scalability

The system must handle massive scale:

  • Support millions of requests per second
  • Scale horizontally with traffic growth
  • Handle millions of unique identities being rate limited simultaneously

Requirement: Scale to 1 million requests per second (RPS) with horizontal scalability.

Consistency

Global consistency is not required. Small amounts of excess traffic are acceptable if it means avoiding false rejections that degrade user experience.

Requirement: Eventual consistency is acceptable. Availability and performance take precedence over perfect consistency.

Non-Functional Requirements Summary

  • Low latency: < 10ms per rate limit check
  • High availability: Fail-open during outages
  • Scalability: Support 1 million RPS
  • Consistency: Eventual consistency acceptable

Capacity Estimation

Understanding the scale helps inform design decisions.

Traffic Estimates

  • Daily active clients: 10 million
  • Requests per client per day: 1,000 (average)
  • Total daily requests: 10 billion
  • Average requests per second: ~115,000 RPS
  • Peak requests per second: ~1 million RPS (10x average)

Storage Estimates

Not all identities are active simultaneously:

  • Total identities: 10 million (API keys, user IDs, IP addresses)
  • Active identities: 5-10% within any given time window
  • Concurrent active identities: 500K to 1 million

Each active identity requires state storage:

  • Unique identifier (API key, user ID, IP address)
  • Current counter/timestamp
  • Metadata (rule ID, window information)

Storage per identity: 100-200 bytes

Total active state: 100-200 MB for 1 million active identities

Read/Write Characteristics

Every request requires a read-modify-write operation:

  • At peak (1M RPS):
    • 1 million reads per second
    • 1 million writes per second

Access pattern:

  • Write-heavy workload
  • Highly concurrent operations
  • Low latency requirements

Capacity Summary

  • Peak traffic: 1 million RPS
  • Active state: 100-200 MB
  • Workload: Write-heavy, highly concurrent

Rate Limiting Algorithms

Different algorithms offer different trade-offs between accuracy, memory efficiency, and implementation complexity. Let’s examine the most common approaches.

Fixed Window Counter

How it works:

  • Divide time into fixed-size windows (e.g., 1-minute windows aligned to wall-clock time)
  • Maintain a counter for each identity within the current window
  • On each request:
    • If counter < limit: increment counter and allow request
    • Else: reject request
  • Reset counter when the window expires

Advantages:

  • Memory efficient: minimal state (counter + window timestamp)
  • Simple to understand and implement
  • Low computational overhead

Disadvantages:

  • Allows traffic bursts at window boundaries
  • A client can make 2x the limit by sending requests at the end of one window and the start of the next
  • Less accurate for strict rate limiting requirements

Use case: Suitable when some burst traffic is acceptable and simplicity is prioritized.
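The steps above can be sketched as a minimal in-memory fixed window counter. This is single-process only (the distributed version stores the counter in Redis, covered later); the injectable clock exists purely to make the example easy to exercise:

```python
import time

class FixedWindowLimiter:
    """Fixed window counter: one counter per (client, window start)."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counters = {}  # (client_id, window_start) -> request count

    def allow(self, client_id):
        now = self.clock()
        # Align the window to wall-clock time, e.g. minute boundaries.
        window_start = int(now // self.window) * self.window
        key = (client_id, window_start)
        count = self.counters.get(key, 0)
        if count < self.limit:
            self.counters[key] = count + 1
            return True
        return False
```

Note the boundary-burst weakness: a counter resets the instant a new window begins, regardless of how recently requests arrived.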

Sliding Window Log

How it works:

  • Store timestamps of all requests for each identity within the window
  • On each request:
    • Remove timestamps older than (current_time - window_size)
    • Count remaining timestamps
    • If count < limit: add current timestamp and allow request
    • Else: reject request

Advantages:

  • Accurate enforcement: prevents exceeding limits in any rolling window
  • No burst traffic at boundaries
  • Precise rate limiting

Disadvantages:

  • Memory inefficient: state grows linearly with traffic volume
  • Higher memory usage for high-traffic identities
  • More expensive cleanup operations

Use case: When accuracy is critical and memory is not a constraint.
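A minimal in-memory sketch of the sliding window log (a production version would typically keep the timestamps in a Redis sorted set):

```python
import time
from collections import deque

class SlidingWindowLogLimiter:
    """Keeps per-client request timestamps; exact over any rolling window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.logs = {}  # client_id -> deque of request timestamps

    def allow(self, client_id):
        now = self.clock()
        log = self.logs.setdefault(client_id, deque())
        # Drop timestamps that have fallen out of the rolling window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False
```

The memory cost is visible here: the deque holds one entry per allowed request, which is exactly why this algorithm gets expensive for high-traffic identities.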

Sliding Window Counter

How it works:

  • Maintain counters for two consecutive fixed windows (previous and current)
  • On each request:
    • Calculate overlap ratio based on position within current window
    • Compute weighted count: current_window_count + (previous_window_count × overlap_ratio)
    • If weighted count < limit: allow request
    • Else: reject request

Example calculation:

  • Current window: 30% elapsed
  • Overlap ratio: 1 - 0.30 = 0.70
  • Current window count: 50 requests
  • Previous window count: 60 requests
  • Weighted count: 50 + (60 × 0.70) = 92 requests

Advantages:

  • Smooths out traffic spikes better than fixed window
  • Memory efficient (only two counters per identity)
  • More accurate than fixed window while maintaining efficiency

Disadvantages:

  • Approximate: assumes even distribution of requests in previous window
  • Not perfectly accurate but good enough for most use cases

Use case: Balanced approach when you need better accuracy than fixed window but can’t afford the memory cost of sliding window log.
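The weighted-count calculation above translates directly into code. An in-memory sketch (two counters per client, as in the real thing, but without the shared store):

```python
import time

class SlidingWindowCounterLimiter:
    """Weights the previous fixed window's count by its remaining overlap
    with the rolling window."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.counts = {}  # (client_id, window_index) -> count

    def allow(self, client_id):
        now = self.clock()
        idx = int(now // self.window)
        elapsed_ratio = (now % self.window) / self.window
        current = self.counts.get((client_id, idx), 0)
        previous = self.counts.get((client_id, idx - 1), 0)
        # e.g. 30% into the current window -> previous window weighted 0.70
        weighted = current + previous * (1 - elapsed_ratio)
        if weighted < self.limit:
            self.counts[(client_id, idx)] = current + 1
            return True
        return False
```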

Token Bucket

How it works:

  • Maintain for each identity: token count and last refill timestamp
  • Tokens are added at a fixed refill rate up to a maximum capacity (bucket size)
  • On each request:
    • Calculate tokens to add based on elapsed time: (current_time - last_refill_time) × refill_rate
    • Refill tokens (capped at bucket size)
    • If tokens ≥ 1: consume one token and allow request
    • Else: reject request

Advantages:

  • Intuitive and easy to understand
  • Memory efficient (two values per identity)
  • Allows burst traffic: unused tokens accumulate
  • Flexible: can handle both rate limiting and burst tolerance

Disadvantages:

  • Requires careful parameter tuning (bucket size and refill rate)
  • May allow more traffic than intended if parameters are misconfigured
  • Less precise than sliding window approaches

Use case: When you need to allow bursts of traffic while maintaining an average rate limit.
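A minimal in-memory token bucket following the steps above (two values per client; the clock is injectable for testing):

```python
import time

class TokenBucketLimiter:
    """Token bucket: tokens refill continuously up to a burst capacity."""

    def __init__(self, capacity, refill_rate, clock=time.time):
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.state = {}  # client_id -> (tokens, last_refill_time)

    def allow(self, client_id):
        now = self.clock()
        tokens, last = self.state.get(client_id, (self.capacity, now))
        # Refill based on elapsed time, capped at the bucket size.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self.state[client_id] = (tokens - 1, now)
            return True
        self.state[client_id] = (tokens, now)
        return False
```

Burst tolerance falls out naturally: a client that has been idle accumulates up to `capacity` tokens and can spend them all at once.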

Leaky Bucket

How it works:

  • Model requests as items in a queue (bucket) with fixed capacity
  • On each request:
    • If bucket has capacity: enqueue request
    • Else: reject request
  • Process requests from bucket at constant outflow rate

Advantages:

  • Memory efficient: state based on queue size
  • Provides strong protection for downstream systems
  • Stable outflow rate prevents downstream overload
  • Smooths traffic spikes

Disadvantages:

  • Recent requests may be rate limited if old requests are still queued
  • Requires careful tuning of bucket size and outflow rate
  • Less suitable for real-time rate limiting (requests are queued, not immediately processed)

Use case: When protecting downstream systems from traffic spikes is more important than immediate request processing.
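The section above describes the queue form of the leaky bucket. The sketch below uses the equivalent "meter" formulation, which tracks the queue level and drains it at a constant rate, admitting or rejecting immediately instead of actually holding requests:

```python
import time

class LeakyBucketLimiter:
    """Leaky bucket as a meter: the level drains at a constant rate;
    a request is admitted only if the bucket has spare capacity."""

    def __init__(self, capacity, leak_rate, clock=time.time):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.clock = clock
        self.state = {}  # client_id -> (queue_level, last_leak_time)

    def allow(self, client_id):
        now = self.clock()
        level, last = self.state.get(client_id, (0.0, now))
        # Drain the bucket for the time elapsed since the last check.
        level = max(0.0, level - (now - last) * self.leak_rate)
        if level + 1 <= self.capacity:
            self.state[client_id] = (level + 1, now)
            return True
        self.state[client_id] = (level, now)
        return False
```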

Algorithm Comparison

| Algorithm              | Accuracy | Memory | Burst Handling | Complexity |
|------------------------|----------|--------|----------------|------------|
| Fixed Window           | Low      | Low    | Poor           | Low        |
| Sliding Window Log     | High     | High   | Excellent      | Medium     |
| Sliding Window Counter | Medium   | Low    | Good           | Medium     |
| Token Bucket           | Medium   | Low    | Excellent      | Low        |
| Leaky Bucket           | Medium   | Low    | Good           | Medium     |

Core Entities

The system revolves around three main entities:

  • Request - An incoming API request that needs to be rate limited
  • Client - The identity being rate limited (IP address, user ID, API key, etc.)
  • Rule - A rate limiting policy that defines limits, windows, and conditions

API Interface

The rate limiter exposes a simple interface:

Function signature:

allowRequest(clientId, ruleId) → RateLimitResponse

Input parameters:

  • clientId: The identity being rate limited (IP address, user ID, API key, etc.)
  • ruleId: The identifier of the rate limiting rule to apply

Response structure:

  • allowed: Boolean indicating whether the request should be allowed
  • remaining: Number of requests remaining in the current window
  • resetTime: Unix timestamp indicating when the rate limit will reset

Example response:

{
  allowed: true,
  remaining: 45,
  resetTime: 1699123500
}
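The interface can be sketched in Python as follows. The names are illustrative, and the body is a stub that simply mirrors the example response above; a real implementation would look up the rule and consult the shared state store:

```python
from dataclasses import dataclass

@dataclass
class RateLimitResponse:
    allowed: bool    # whether the request should proceed
    remaining: int   # requests left in the current window
    reset_time: int  # Unix timestamp when the limit resets

def allow_request(client_id: str, rule_id: str) -> RateLimitResponse:
    # Stub: a real implementation would resolve the rule for rule_id and
    # check client_id's current usage in the shared state store.
    return RateLimitResponse(allowed=True, remaining=45, reset_time=1699123500)
```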

High Level Design

Now that we understand the requirements and algorithms, let’s design the overall system architecture. We need to decide where the rate limiter lives, how it identifies clients, how it stores state, and how requests flow through the system.

Where should the rate limiter live in the architecture?

Before we can limit requests, we need to decide where in our system architecture the rate limiter should be placed. This decision affects what information we have access to, how it integrates with other services, and how we can scale it.

We have three main placement options:

Option 1: In-Process Rate Limiting

The rate limiter runs as part of the application code on each server. Each application instance maintains its own rate limiting state.

Pros:

  • No network overhead - checks happen locally
  • Simple to implement initially
  • Low latency for rate limit checks

Cons:

  • Each server has its own view of limits - limits don’t work correctly across multiple servers
  • Can’t share state between instances
  • Difficult to scale consistently
  • Application code becomes responsible for rate limiting logic

Option 2: Dedicated Rate Limiter Service

A separate microservice handles all rate limiting. Application services make network calls to this service for every request.

Pros:

  • Centralized rate limiting logic
  • Can share state across all application instances
  • Easier to update rate limiting without deploying application code

Cons:

  • Adds network latency to every request
  • Extra service to maintain and scale
  • Network calls can fail, adding complexity to error handling
  • Can become a bottleneck if not scaled properly

Option 3: API Gateway / Load Balancer

The rate limiter is integrated into the API Gateway or Load Balancer layer, before requests reach application servers.

Pros:

  • Centralized control point for all incoming traffic
  • No extra network calls - rate limiting happens as part of request routing
  • Can block traffic before it reaches application servers, saving resources
  • Natural place to implement rate limiting since gateway already sees all requests
  • Can leverage gateway’s existing infrastructure for scaling

Cons:

  • Requires gateway support for rate limiting features
  • Gateway becomes a critical component - must be highly available

Our Choice: API Gateway Approach

For our design, we’ll use the API Gateway approach. It’s the most common pattern in production systems and gives us centralized control without adding extra network calls. The gateway already sees all incoming requests, so it’s the natural place to enforce rate limits. This also means we can block abusive traffic before it consumes application server resources.


How do we identify clients?

Since we’re using the API Gateway approach, our rate limiter has access to information in the HTTP request itself. This includes the request URL, HTTP headers, query parameters, and the client’s IP address.

We need to decide what makes a “client” unique - this determines how rate limits get applied. The key we use determines whether limits are per-user, per-IP, or per-API-key.

User ID

For authenticated APIs, we can identify clients by user ID. This is typically present in the Authorization header as a JWT token or session token. Each logged-in user gets their own rate limit allocation.

When to use:

  • APIs that require authentication
  • When you want per-user limits regardless of device or location
  • When users can have different limits based on their account type

IP Address

We can identify clients by their IP address, which is typically available in the X-Forwarded-For header or directly from the connection.

When to use:

  • Public APIs without authentication
  • When you want to limit by network origin
  • For basic abuse prevention

Limitations:

  • Users behind NATs or corporate firewalls share IP addresses
  • Mobile users may have changing IP addresses
  • Can block legitimate users if IP is shared

API Key

For developer APIs, clients can be identified by API keys, typically sent in the X-API-Key header.

When to use:

  • Developer-facing APIs
  • When you want different limits for different API consumers
  • When you need to track usage per API key holder

Combined Approaches

In practice, you’ll often want a combination. For example:

  • Authenticated users: rate limit by user ID
  • Unauthenticated requests: rate limit by IP address
  • Developer API requests: rate limit by API key

You might also combine identities for stricter limits, like rate limiting by both API key and endpoint, or user ID and IP address together.

The choice depends on your use case. Are all users authenticated? Is this a developer API? These questions help determine the right identification strategy.


How do we store rate limiting state?

Rate limiting requires storing state for each client - counters, timestamps, tokens, or whatever the algorithm needs. Since we have multiple rate limiter instances (for scalability and availability), we need a shared state store that all instances can access.

Why we need shared state:

If each rate limiter instance maintained its own state, a user could hit different instances and bypass limits. For example, if the limit is 100 requests per minute, a user could send 100 requests to instance A and 100 requests to instance B, effectively getting 200 requests instead of 100.

Requirements for the state store:

  • Low latency: Rate limit checks must be fast, so the store must respond quickly
  • Atomic operations: We need operations like “increment and check” to happen atomically to prevent race conditions
  • Expiration support: Rate limit windows expire, so we need automatic cleanup of old data
  • High throughput: Must handle millions of read-modify-write operations per second
  • High availability: If the store is down, rate limiting breaks

Why Redis?

Redis is the most common choice for rate limiting state storage because:

  • In-memory storage: Extremely fast reads and writes
  • Atomic operations: Commands like INCR, EXPIRE, and Lua scripts ensure atomicity
  • Built-in expiration: Automatic cleanup of expired keys
  • High performance: Can handle millions of operations per second
  • Data structures: Supports counters, sets, and other structures needed for different algorithms
  • Clustering: Redis Cluster provides sharding and high availability

How Redis stores rate limit state:

For a fixed window counter algorithm, we might store:

  • Key: rate_limit:user_12345:window_202401011200 (client ID + window identifier)
  • Value: Counter (number of requests in this window)
  • Expiration: Set to expire when window ends

When a request arrives:

  1. Check if key exists and get current count
  2. If count < limit: increment counter atomically and allow request
  3. If count >= limit: reject request
  4. Set expiration if this is the first request in the window

Redis provides atomic operations like INCR (increment) and EXPIRE (set expiration) that make this straightforward. We can also use Lua scripts to combine multiple operations atomically.
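A sketch of the Lua-script approach, assuming a redis-py-style client that exposes `eval`. This variant increments first and then checks, a common two-command simplification of the steps above; the script runs atomically inside Redis, so concurrent requests cannot race between the read and the write:

```python
# Lua executed atomically by Redis: INCR the window counter, set its TTL on
# first use, then compare against the limit.
FIXED_WINDOW_LUA = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if current > tonumber(ARGV[1]) then
    return 0
end
return 1
"""

def allow_request(r, client_id, window_id, limit, window_seconds):
    """r is any client exposing eval(script, numkeys, *keys_and_args),
    e.g. a redis-py Redis instance."""
    key = f"rate_limit:{client_id}:window_{window_id}"
    # KEYS[1] = counter key; ARGV[1] = limit; ARGV[2] = TTL in seconds
    return r.eval(FIXED_WINDOW_LUA, 1, key, limit, window_seconds) == 1
```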

Alternative stores:

While Redis is most common, other options exist:

  • Memcached: Similar to Redis but with fewer features
  • In-memory databases: Like Hazelcast or Apache Ignite
  • Distributed caches: Like Amazon ElastiCache or Azure Cache

The key is choosing something fast, with atomic operations, and expiration support.


How does a request flow through the system?

Let’s trace what happens when a request arrives at our system:

Step 1: Request arrives at API Gateway

A client sends an HTTP request to our API. The request first hits the API Gateway, which acts as the entry point for all traffic.

Step 2: Gateway routes to Rate Limiter

The API Gateway forwards the request to one of the rate limiter instances. The gateway uses load balancing to distribute requests across multiple rate limiter instances for scalability.

Step 3: Rate Limiter extracts client identity

The rate limiter examines the request to identify the client:

  • Checks Authorization header for user ID (if authenticated)
  • Checks X-API-Key header for API key (if present)
  • Falls back to IP address from X-Forwarded-For header

Based on the request type and available information, it determines which identity to use for rate limiting.

Step 4: Rate Limiter retrieves configuration

The rate limiter looks up which rules apply to this client and endpoint. Configuration might be cached in memory or retrieved from a configuration store. Rules specify things like “authenticated users get 1000 requests per hour” or “search API allows 10 requests per minute per IP.”

Step 5: Rate Limiter checks current state

The rate limiter queries the shared state store (Redis) to check the current rate limit state for this client:

  • For fixed window: checks counter for current window
  • For sliding window: counts timestamps within window
  • For token bucket: checks current token count

This involves reading from Redis, which must be fast to meet our latency requirements.

Step 6: Rate Limiter makes decision

Based on the current state and the limit, the rate limiter decides:

  • If under limit: Update state (increment counter, consume token, etc.) and allow request
  • If over limit: Reject request with HTTP 429 status

Step 7: Update shared state

If allowing the request, the rate limiter updates the shared state store atomically. This might involve incrementing a counter, adding a timestamp, or consuming a token.

Step 8: Forward or reject

  • If allowed: Rate limiter forwards the request to the application server and includes rate limit headers (X-RateLimit-Remaining, X-RateLimit-Reset) in the response
  • If rejected: Rate limiter returns HTTP 429 response with helpful headers indicating when the limit resets

Step 9: Application processes request

If the request was allowed, it reaches the application server, which processes it normally and returns a response to the client.

The entire rate limiting check should complete in under 10 milliseconds to avoid impacting request response times. This requires fast Redis operations, efficient network paths, and potentially cached configuration.


Deep Dives: Scaling and Operations

Now that we have the basic architecture, let’s address the challenges of scaling this system to handle 1 million requests per second while maintaining low latency and high availability.

How do we scale to handle 1M requests/second?

At 1 million requests per second, a single Redis instance won’t be enough. Each request requires a read-modify-write operation, which means we need to handle 1 million reads and 1 million writes per second. We need to distribute this load across multiple Redis instances.

The Problem: Single Redis Bottleneck

A single Redis instance can handle roughly 100,000-200,000 operations per second, depending on the operations and hardware. At 1M RPS, we’d need 5-10 Redis instances just to handle the load, not accounting for redundancy.

Solution: Redis Sharding

We need to partition (shard) our Redis data across multiple instances. Each Redis instance handles a subset of clients, distributing the load.

How sharding works:

When a rate limit check happens, we need to determine which Redis shard stores the state for a particular client. We use the client identifier (user ID, IP address, or API key) to compute which shard to use.

Consistent Hashing

Consistent hashing is the standard approach for distributing keys across shards. Here’s how it works:

  1. Each Redis shard is assigned a position on a hash ring (a conceptual circle)
  2. When we need to find which shard stores a client’s data, we hash the client identifier
  3. We walk clockwise to the first shard whose position on the ring is at or after the hash value (wrapping around the ring if necessary)
  4. That shard stores and serves data for that client

Why consistent hashing?

  • Minimal redistribution: When a shard is added or removed, only a small fraction of keys need to move
  • Even distribution: Keys are distributed evenly across shards
  • Handles failures: If a shard fails, its keys can be redistributed to other shards
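A minimal hash ring to make this concrete. Virtual nodes (multiple ring positions per shard) are the standard trick for even distribution; the shard names here are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps client keys to shards via a hash ring with virtual nodes."""

    def __init__(self, shards, vnodes=100):
        self.ring = []  # sorted list of (position, shard)
        for shard in shards:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{shard}#{i}"), shard))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, client_id):
        pos = self._hash(client_id)
        # First ring position at or after the key's hash, wrapping around.
        i = bisect.bisect(self.ring, (pos, ""))
        if i == len(self.ring):
            i = 0
        return self.ring[i][1]
```

The "minimal redistribution" property follows directly: removing a shard deletes only its own ring positions, so keys whose successor belonged to another shard keep mapping to the same place.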

Horizontal Scaling of Rate Limiter Instances

The rate limiter instances themselves also need to scale horizontally. We run multiple stateless rate limiter instances behind the API Gateway’s load balancer. Each instance can handle rate limit checks independently, and they all share access to the Redis cluster.

Scaling Strategy:

  • Rate Limiter Instances: Add more instances as traffic grows. Since they’re stateless, scaling is straightforward - just add more instances and update the load balancer
  • Redis Shards: Add more Redis shards as the number of active clients grows or as throughput requirements increase
  • Geographic Distribution: Deploy rate limiter instances and Redis clusters in multiple regions to reduce latency for global users

Capacity Planning:

With 1 million active identities and 1M RPS:

  • Each Redis shard handles a subset of clients (e.g., 100K clients per shard with 10 shards)
  • Each shard handles roughly 100K operations per second
  • Rate limiter instances can be scaled independently based on CPU and network capacity

The key insight is that both rate limiter instances and Redis shards can scale horizontally, allowing the system to grow with traffic.


How do we ensure high availability?

Rate limiters are critical infrastructure - if they fail, either legitimate traffic gets blocked (fail-closed) or abusive traffic gets through (fail-open). We need to design for high availability.

Stateless Rate Limiter Instances

Rate limiter instances are stateless - they don’t store any local state. This means:

  • If an instance fails, the load balancer routes traffic to other instances
  • We can add or remove instances without data migration
  • Instances can be replaced without losing rate limiting state

Redis Replication and Failover

Redis state is critical - if Redis is down, rate limiting breaks. We need redundancy:

Redis Replication:

Each Redis shard should have replicas (typically 1-2 replicas per master). Replicas maintain copies of the master’s data and can take over if the master fails.

Automatic Failover:

When a Redis master fails, one of its replicas is automatically promoted to master. This failover should happen quickly (within seconds) to minimize impact.

Redis Cluster:

Redis Cluster provides built-in sharding, replication, and automatic failover. It’s the standard solution for high-availability Redis deployments at scale.

Fail-Open vs Fail-Closed Behavior

What happens when the rate limiter or Redis is unavailable? We have two options:

Fail-Closed (Strict):

If rate limiting can’t be checked, reject all requests. This is the safest approach for security but can block legitimate traffic during outages.

Fail-Open (Permissive):

If rate limiting can’t be checked, allow all requests through. This maintains availability but loses protection during outages.

Our Choice: Fail-Open

For most use cases, we prefer fail-open behavior:

  • Better user experience - legitimate users aren’t blocked during outages
  • Rate limiting is about preventing abuse, not security-critical authentication
  • Short outages are acceptable - we can detect and fix issues quickly
  • We can monitor for unusual traffic patterns and respond manually if needed

However, for sensitive endpoints (like payment processing), fail-closed might be appropriate.

Monitoring and Alerting

High availability requires good observability:

  • Monitor Redis cluster health and failover events
  • Track rate limiter instance health and response times
  • Alert on unusual rejection rates or latency spikes
  • Monitor for Redis memory usage and shard load distribution

Redundancy at Every Level:

  • Multiple rate limiter instances (load balanced)
  • Redis Cluster with replication (automatic failover)
  • Multiple API Gateway instances (if applicable)
  • Geographic redundancy (multiple regions)

The goal is to ensure that the failure of any single component doesn’t break rate limiting entirely.


How do we minimize latency?

Every request goes through rate limiting, so latency matters. Our target is under 10 milliseconds per rate limit check. Let’s look at where latency comes from and how to reduce it.

Sources of Latency:

  1. Network round-trip to Redis: The time to send a request to Redis and receive a response
  2. Redis processing time: The time Redis takes to execute operations
  3. Connection establishment: The time to establish TCP connections (if not pooled)
  4. Serialization overhead: Converting data to/from Redis protocol format

Connection Pooling

The biggest latency win comes from connection pooling. Instead of establishing a new TCP connection for each rate limit check, we maintain a pool of persistent connections to Redis.

How it works:

  • Rate limiter instances maintain a pool of open TCP connections to Redis
  • When a rate limit check is needed, we reuse an existing connection from the pool
  • After the check, the connection returns to the pool for reuse

Impact:

  • Without pooling: Each request requires TCP handshake (20-50ms depending on network distance)
  • With pooling: Connection reuse eliminates handshake overhead (sub-millisecond)

Most Redis clients handle connection pooling automatically, but we need to tune pool size based on concurrent requests and Redis response times.
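The mechanism is worth seeing even though clients handle it for you. A minimal pooling sketch, where `connect` stands in for a hypothetical factory that would open a Redis socket:

```python
import queue

class ConnectionPool:
    """Minimal pooling sketch: connections are created once up front and
    reused, so each rate-limit check skips the TCP handshake."""

    def __init__(self, connect, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self, timeout=0.01):
        # Bounded wait keeps worst-case latency predictable under load.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)
```

Usage follows the acquire/use/release pattern: `conn = pool.acquire()`, perform the rate limit check, then `pool.release(conn)` in a `finally` block so connections always return to the pool.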

Geographic Distribution

Network latency increases with distance. A user in Tokyo talking to Redis in Virginia will see much higher latency than talking to Redis in the same region.

Solution: Multi-Region Deployment

Deploy rate limiter instances and Redis clusters in multiple geographic regions:

  • Users in Asia connect to Asian region
  • Users in Europe connect to European region
  • Users in Americas connect to American region

Trade-offs:

  • Lower latency for users (data stays in their region)
  • More complex operations (managing multiple clusters)
  • Eventual consistency between regions (acceptable for rate limiting)

Caching Configuration

Rate limiting rules don’t change frequently. We can cache configuration in memory on rate limiter instances to avoid database lookups on every request.

What to cache:

  • Rate limiting rules (limits, windows, which clients they apply to)
  • Client-to-rule mappings
  • Rule metadata

Cache invalidation:

When rules change, we need to invalidate caches. This can be done via:

  • Time-based expiration (poll for updates periodically)
  • Push notifications (invalidate when rules change)
  • Version numbers (check version on each request, refresh if changed)

Other Optimizations:

  • Redis pipelining: Batch multiple operations in a single network round-trip
  • Lua scripts: Combine multiple Redis operations into a single atomic script
  • Local caching: Cache recent rate limit checks locally (risky - can lead to incorrect decisions)

For most systems, connection pooling and geographic distribution provide the biggest latency improvements. Other optimizations add complexity and are usually unnecessary unless you’re operating at extreme scale.


How do we handle hot keys?

Hot keys occur when a single client (identified by user ID, IP address, or API key) generates enough traffic to overwhelm a single Redis shard. This can happen with both legitimate high-volume clients and abusive traffic.

What creates hot keys?

A single client would need to generate tens of thousands of requests per second to overwhelm a Redis shard. This often indicates:

  • DDoS attacks or malicious bots
  • Misconfigured clients making excessive requests
  • Legitimate high-volume clients (analytics systems, data pipelines, mobile apps with aggressive refresh)

The Problem:

When all requests from a single client hit the same Redis shard, that shard becomes a bottleneck. Other shards might be underutilized while one shard is overwhelmed.

Strategies for Legitimate High-Volume Clients:

Client-Side Rate Limiting:

Encourage well-behaved clients to implement their own rate limiting. API SDKs often include built-in client-side rate limiting that respects server response headers. This:

  • Smooths traffic patterns before requests reach the server
  • Reduces server load
  • Prevents legitimate users from accidentally creating hot shards

Request Queuing and Batching:

Allow clients to batch multiple operations into single requests, reducing the total number of rate limit checks needed.

Premium Tiers:

Offer higher rate limits for power users who need them, potentially with dedicated infrastructure or different sharding strategies.

Strategies for Abusive Traffic:

Automatic Blocking:

When a client consistently exceeds rate limits (e.g., 10 times in a minute), temporarily block their IP address or API key entirely. Maintain a blocklist in Redis (or a separate store) and check it before rate limit checks.
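The blocking rule above (repeated violations within a window trigger a temporary block) can be sketched as follows. This is an illustrative in-memory version; the class name and parameters are hypothetical, and in production the violation counts and blocklist would live in Redis so all instances share them:

```python
class ViolationBlocker:
    """Auto-block clients that exceed rate limits `threshold` times
    within `window_seconds`, for `block_seconds`. In-memory sketch;
    production state would be stored in Redis (or a separate store)."""

    def __init__(self, threshold=10, window_seconds=60, block_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.block_seconds = block_seconds
        self.violations = {}     # client -> recent violation timestamps
        self.blocked_until = {}  # client -> time the block expires

    def is_blocked(self, client, now):
        # Checked *before* the normal rate limit check.
        return self.blocked_until.get(client, 0) > now

    def record_violation(self, client, now):
        # Keep only violations inside the sliding window, then add this one.
        recent = [t for t in self.violations.get(client, [])
                  if now - t < self.window]
        recent.append(now)
        self.violations[client] = recent
        if len(recent) >= self.threshold:
            self.blocked_until[client] = now + self.block_seconds

blocker = ViolationBlocker(threshold=3, window_seconds=60, block_seconds=300)
for t in (1, 2, 3):
    blocker.record_violation("10.0.0.1", now=t)
print(blocker.is_blocked("10.0.0.1", now=4))    # True
print(blocker.is_blocked("10.0.0.1", now=400))  # False (block expired)
```

Because the blocklist check is a single key lookup, it is much cheaper than the full rate limit check, which is why it runs first.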

DDoS Protection:

Use services like Cloudflare or AWS Shield that can detect and block malicious traffic before it reaches your rate limiter. These services operate at the network edge and can filter traffic before it consumes your infrastructure.

Rate Limit Tuning:

Set higher limits for IP-based rate limiting to account for legitimate users sharing IP addresses (corporate NATs, public WiFi). Rely more on authenticated user limits where possible: a user ID identifies a single client, while one IP address may be shared by many.

Key Insight:

Client-side rate limiting serves as a valuable complement to server-side protection. While you can’t trust clients for security, encouraging good citizenship in your API ecosystem reduces infrastructure load and prevents legitimate users from accidentally triggering hot shard scenarios.

The best approach is to design rate limits upfront to account for legitimate use cases rather than trying to handle hot keys after they occur.


How do we handle dynamic rule configuration?

So far, we’ve assumed rate limiting rules are static. But production systems need flexibility to adjust limits without deploying new code. You might need to:

  • Temporarily increase limits during a product launch
  • Give premium customers higher limits
  • Quickly reduce limits if you’re seeing unusual traffic patterns
  • A/B test different limit values

Requirements:

  • Update rules without restarting rate limiter instances
  • Changes should take effect quickly (within seconds or minutes)
  • Support different rules for different clients, endpoints, or scenarios
  • Maintain consistency across all rate limiter instances

Poll-Based Configuration

Rate limiter instances periodically poll a configuration store (database, file system, or configuration service) to check for rule updates.

How it works:

  1. Configuration is stored in a central location (database, S3, etc.)
  2. Rate limiter instances poll every N seconds (e.g., 30 seconds)
  3. If rules have changed, instances reload configuration
  4. New rules take effect on the next poll cycle
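The four steps above amount to a background loop that re-reads the rule set on a timer. A minimal sketch (the `PollingConfigLoader` class is hypothetical; `fetch_rules` is a callable standing in for the read from a database or S3):

```python
import threading
import time

class PollingConfigLoader:
    """Poll-based configuration: a background thread re-reads the rule
    set from a central store every `interval` seconds."""

    def __init__(self, fetch_rules, interval=30.0):
        self.fetch_rules = fetch_rules
        self.interval = interval
        self.rules = fetch_rules()           # step 1-2: initial load
        self._stop = threading.Event()

    def _run(self):
        # wait() doubles as the sleep and the shutdown signal.
        while not self._stop.wait(self.interval):
            self.rules = self.fetch_rules()  # step 3: reload configuration

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def stop(self):
        self._stop.set()

current = {"default": 200}
loader = PollingConfigLoader(lambda: dict(current), interval=0.05)
loader.start()
current["default"] = 500   # rule change in the central store
time.sleep(0.2)            # step 4: takes effect on the next poll cycle
loader.stop()
print(loader.rules)        # {'default': 500}
```

The worst-case propagation delay is one poll interval, which is exactly the trade-off listed under Cons below.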

Pros:

  • Simple to implement
  • Works with any configuration store
  • No need for push infrastructure

Cons:

  • Delay between a rule change and it taking effect (up to one poll interval)
  • Polling adds load to configuration store
  • Instances might have slightly different rules during transition

Push-Based Configuration

A configuration service pushes rule updates to rate limiter instances immediately when rules change.

How it works:

  1. Configuration is stored in a central service (like etcd, Consul, or a custom service)
  2. When rules change, the service pushes updates to all rate limiter instances
  3. Instances update their in-memory configuration immediately
  4. Changes take effect within seconds
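The push model is a publish-subscribe pattern: instances register with the configuration service, which fans out each rule change. A minimal single-process sketch (both class names are hypothetical; a real deployment would push over the network via etcd/Consul watches or a message bus):

```python
class ConfigService:
    """Push-based configuration: instances subscribe and receive new
    rule sets as soon as they are published."""

    def __init__(self):
        self.subscribers = []
        self.rules = {}

    def subscribe(self, instance):
        self.subscribers.append(instance)
        instance.apply_rules(self.rules)    # seed with the current rules

    def publish(self, new_rules):
        self.rules = new_rules
        for instance in self.subscribers:   # push to every instance at once
            instance.apply_rules(new_rules)

class RateLimiterInstance:
    def __init__(self):
        self.rules = {}

    def apply_rules(self, rules):
        self.rules = dict(rules)            # swap in-memory config immediately

service = ConfigService()
a, b = RateLimiterInstance(), RateLimiterInstance()
service.subscribe(a)
service.subscribe(b)
service.publish({"default": 200, "premium": 1000})
print(a.rules == b.rules)  # True: all instances converge immediately
```

The sketch glosses over the hard parts listed under Cons: in a real system a push can fail mid-fan-out, so the service must retry and instances usually keep a fallback poll to reconverge.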

Pros:

  • Fast updates (changes take effect immediately)
  • No polling overhead
  • All instances get updates simultaneously

Cons:

  • More complex infrastructure (need push mechanism)
  • Need to handle instance failures (retry pushes)
  • Requires configuration service to be highly available

Hybrid Approach

Many systems use a hybrid:

  • Poll for periodic updates (fallback)
  • Push for urgent changes (primary)
  • Version numbers to detect changes efficiently

Configuration Store Options:

  • Database: Simple but adds latency for reads
  • Key-Value Store: etcd, Consul provide watch/notify capabilities
  • Object Storage: S3, GCS for file-based configuration
  • Configuration Service: Custom service with push capabilities

Our Recommendation:

For most systems, start with poll-based configuration for simplicity. As you scale and need faster updates, move to push-based configuration. The choice depends on how frequently rules change and how quickly updates need to take effect.

The key is ensuring all rate limiter instances eventually converge on the same configuration, even if there’s a brief period of inconsistency during updates.