Why Sharding?
When many requests happen simultaneously, they all try to modify the same rate limit values in the database. Because Convex provides strong transactions, requests will never overwrite each other incorrectly, so your rate limiter will never allow more requests than configured.
However, high contention for these values causes optimistic concurrency control (OCC) conflicts. Convex automatically retries these conflicts with backoff, but it’s better to avoid them in the first place.
How Sharding Works
Sharding breaks up the total capacity into individual buckets, or “shards”. When consuming capacity, the rate limiter checks a random shard. While you might occasionally get rate limited when capacity exists elsewhere, you’ll never violate the rate limit’s upper bound.
const rateLimiter = new RateLimiter(components.rateLimiter, {
// Use sharding to increase throughput without compromising on correctness
llmTokens: { kind: "token bucket", rate: 40000, period: MINUTE, shards: 10 },
llmRequests: { kind: "fixed window", rate: 1000, period: MINUTE, shards: 10 },
});
Power of Two Technique
The implementation uses a load balancing technique known as "power of two choices": it checks two random shards and consumes from the one with more capacity, which keeps shards relatively balanced.
If neither shard has enough capacity on its own, the rate limiter will also combine the capacity of both shards.
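The selection logic can be sketched roughly as follows. This is an illustrative simulation, not the component's actual implementation; the `Shard` type and `tryConsume` function are invented for the example.

```typescript
// Illustrative sketch of power-of-two-choices shard selection.
// Not the real component internals; names here are hypothetical.
type Shard = { capacity: number };

function tryConsume(shards: Shard[], count: number): boolean {
  // Pick two distinct random shards.
  const i = Math.floor(Math.random() * shards.length);
  let j = Math.floor(Math.random() * (shards.length - 1));
  if (j >= i) j++;
  const a = shards[i];
  const b = shards[j];
  // Prefer the shard with more remaining capacity.
  const [fuller, other] = a.capacity >= b.capacity ? [a, b] : [b, a];
  if (fuller.capacity >= count) {
    fuller.capacity -= count;
    return true;
  }
  // Combine both shards' capacity if neither suffices alone.
  if (fuller.capacity + other.capacity >= count) {
    other.capacity -= count - fuller.capacity;
    fuller.capacity = 0;
    return true;
  }
  // Rate limited, even though unchecked shards may still have capacity.
  return false;
}
```

Note the last branch: this is where sharding can produce a "false negative" (a rejection while capacity exists in shards that weren't checked), the trade-off discussed below.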
Calculating Optimal Shard Count
Use this formula to estimate the number of shards needed:
shards ≈ max queries per second / 2
For example, handling 1,000 queries per minute (~17 QPS) with 10 shards:
llmRequests: { kind: "fixed window", rate: 1000, period: MINUTE, shards: 10 }
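The arithmetic behind that formula can be written as a small helper. This is an assumed utility, not part of the rate-limiter API; `MINUTE` matches the millisecond constants the library exports.

```typescript
// Hypothetical helper applying the "shards ≈ max QPS / 2" rule of thumb.
const MINUTE = 60 * 1000; // ms, matching @convex-dev/rate-limiter's constants

function estimateShards(rate: number, periodMs: number): number {
  const qps = rate / (periodMs / 1000);
  return Math.max(1, Math.ceil(qps / 2));
}
```

For 1,000 requests per minute this yields 9 (1000 / 60 ≈ 16.7 QPS, halved and rounded up); the example above rounds further to a convenient 10.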
Best Practices
- Each shard should have at least 5-10 capacity (ideally 10 or more)
- For the example above: 1000 rate / 10 shards = 100 capacity per shard ✓
- This configuration comfortably handles normal traffic up to ~20 QPS (10 shards × ~2 QPS each, per the formula above)
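Checking per-shard capacity is simple arithmetic; a hypothetical helper (not part of the component's API) makes the guideline explicit:

```typescript
// Illustrative check for the "at least 5-10 capacity per shard" guideline.
function capacityPerShard(rate: number, shards: number): number {
  return rate / shards;
}

function hasHealthyShards(rate: number, shards: number): boolean {
  return capacityPerShard(rate, shards) >= 5;
}
```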
Scaling a Rate Limit
If you want a rate like { rate: 100, period: SECOND } and you’re flexible on the overall period, you can shard it by increasing both the rate and period proportionally:
// Original: with 50 shards this would leave only 2 capacity per shard
{ rate: 100, period: SECOND }
// Better: same overall rate, 5 capacity per shard
{ shards: 50, rate: 250, period: 2.5 * SECOND }
// Best: same overall rate, 20 capacity per shard
{ shards: 50, rate: 1000, period: 10 * SECOND }
This gives each shard more capacity while maintaining the same overall rate limit.
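To sanity-check this, note that rate ÷ period is identical across all three configurations while per-shard capacity grows. A rough verification sketch, assuming `SECOND` is milliseconds as in the library's exported constants:

```typescript
// Verify that scaling rate and period together preserves the overall limit.
const SECOND = 1000; // ms, matching @convex-dev/rate-limiter's constants

type Config = { rate: number; period: number; shards?: number };

const ratePerSecond = (c: Config) => c.rate / (c.period / SECOND);
const perShardCapacity = (c: Config) => c.rate / (c.shards ?? 1);

const original: Config = { rate: 100, period: SECOND };
const better: Config = { shards: 50, rate: 250, period: 2.5 * SECOND };
const best: Config = { shards: 50, rate: 1000, period: 10 * SECOND };
```

All three configurations allow 100 requests per second overall, but per-shard capacity rises from 5 to 20 as the period is stretched.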
Trade-offs
Occasional false negatives: With sharding, you might occasionally be rate limited even when capacity exists in other shards. This is the trade-off for avoiding OCC conflicts.
When to Use Sharding
Use sharding when:
- You expect high concurrent load (>10 QPS per rate limit)
- You’re experiencing OCC conflicts
- You need to scale beyond single-shard throughput
Don’t use sharding when:
- Traffic is low (< 5 QPS)
- You need exact capacity guarantees at all times
- Each shard would have < 5 capacity
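The checklists above can be condensed into a single decision helper. This is purely illustrative; the function and its thresholds encode the rules of thumb from this section, not anything in the component itself.

```typescript
// Illustrative only: encodes this section's rules of thumb for sharding.
function shouldShard(
  expectedQps: number,
  rate: number,
  proposedShards: number,
): boolean {
  const highLoad = expectedQps > 10; // high concurrent load per rate limit
  const healthyShards = rate / proposedShards >= 5; // each shard keeps >= 5 capacity
  return highLoad && healthyShards;
}
```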
Complete Example
import { RateLimiter, MINUTE } from "@convex-dev/rate-limiter";
import { components } from "./_generated/api";
import { mutation } from "./_generated/server";
import { v } from "convex/values";
const rateLimiter = new RateLimiter(components.rateLimiter, {
// Low traffic: no sharding needed
sendMessage: { kind: "token bucket", rate: 10, period: MINUTE },
// High traffic: use sharding
llmTokens: {
kind: "token bucket",
rate: 40000,
period: MINUTE,
shards: 10,
// 40000 / 10 = 4000 capacity per shard ✓
},
// Very high traffic with scaled period
apiRequests: {
kind: "fixed window",
rate: 10000,
period: 10 * MINUTE,
shards: 50,
// 10000 / 50 = 200 capacity per shard ✓
},
});
// Usage is identical whether sharded or not
export const consumeTokens = mutation({
args: { tokens: v.number() },
handler: async (ctx, args) => {
const status = await rateLimiter.limit(ctx, "llmTokens", {
count: args.tokens,
});
if (!status.ok) {
throw new Error(`Rate limited. Retry after ${status.retryAfter}ms`);
}
// Proceed with operation
},
});