Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/pingcap/tidb/llms.txt

Use this file to discover all available pages before exploring further.

TiDB is designed so that no single node failure interrupts service. Data is replicated across multiple nodes using the Raft consensus protocol, leaders are elected automatically when a node goes down, and the TiDB compute layer is stateless so you can add or remove servers without downtime.

Raft consensus protocol

TiDB stores data in TiKV, which uses Raft to maintain consistency across replicas. Raft is a consensus algorithm that guarantees:
  • A write is committed only after a majority of replicas acknowledge it.
  • All replicas apply writes in the same order.
  • If the current leader fails, the remaining replicas elect a new leader automatically without external coordination.
With the default configuration of three replicas, TiKV tolerates one node failure with no data loss and no manual intervention.
Region A (3 replicas)
  ├── TiKV Node 1 — Leader   ← handles reads and writes
  ├── TiKV Node 2 — Follower ← receives replicated log entries
  └── TiKV Node 3 — Follower ← receives replicated log entries

If Node 1 fails:
  ├── TiKV Node 2 — elected Leader (automatic, ~seconds)
  └── TiKV Node 3 — Follower

Region-based data distribution

TiKV splits data into contiguous key ranges called Regions. Each Region is approximately 96 MB by default. Regions are the unit of replication and scheduling.
  • Every Region has its own Raft group with its own leader and followers.
  • Regions split automatically as data grows and merge when data shrinks.
  • PD (Placement Driver) balances Regions across TiKV nodes to distribute load evenly.
Because Regions are independent Raft groups, a failure on one node affects only the Regions it leads. Other Regions continue operating normally.
The default replication factor is 3 replicas per Region. You can increase this to 5 for higher fault tolerance, at the cost of additional storage and write amplification.

Automatic leader election and failover

When a TiKV leader node becomes unavailable, Raft triggers an election among the followers for that Region. The election completes in seconds (typically under 10 seconds by default, configurable with raft-election-timeout). During the election window, writes to the affected Regions are briefly unavailable. Once the new leader is elected, writes resume automatically. Reads can often continue from followers using Follower Read or Stale Read during this period.
To minimize the impact of a leader election on your application, enable Follower Read (SET tidb_replica_read = 'follower') for read-heavy workloads. Follower Read serves reads from any healthy replica even during a leader transition.

PD: Placement Driver

PD is the cluster’s control plane. It:
  • Allocates timestamps — PD is the centralized timestamp oracle. Every transaction gets a globally monotonic start timestamp from PD, enabling consistent snapshot reads across the entire cluster.
  • Schedules Regions — PD balances Regions across TiKV nodes, moves leaders away from hot nodes, and triggers splits and merges.
  • Stores cluster metadata — PD holds the mapping from key ranges to TiKV nodes, which TiDB servers query to route requests.
PD itself runs as a cluster (typically 3 or 5 nodes) using etcd for its own consensus, so it does not introduce a single point of failure.

Stateless TiDB servers

The TiDB server layer (which handles SQL parsing, planning, and execution) is stateless. It holds no data and no session-critical persistent state. This means you can:
  • Add TiDB server nodes at any time without downtime.
  • Remove TiDB server nodes at any time without migrating data.
  • Place a load balancer in front of multiple TiDB servers for horizontal scale-out and request-level failover.
Load Balancer
  ├── TiDB Server 1 (stateless)
  ├── TiDB Server 2 (stateless)
  └── TiDB Server 3 (stateless)


     TiKV Cluster (data + replicas)
If a TiDB server fails, the load balancer routes new connections to healthy servers. In-flight transactions on the failed server are lost and must be retried by the client, but no data is lost.

Geographic placement and disaster recovery

For disaster recovery across data centers or regions, TiDB supports Placement Policies that control where replicas are located.

Multi-zone deployment

Place one replica per availability zone. A zone failure leaves two replicas available, allowing the Raft majority to continue committing writes.
-- Create a placement policy pinning replicas to three zones
CREATE PLACEMENT POLICY three_zone_policy
  CONSTRAINTS = '[+zone=us-east-1a, +zone=us-east-1b, +zone=us-east-1c]'
  FOLLOWERS = 2;

-- Apply the policy to a table
ALTER TABLE orders PLACEMENT POLICY = three_zone_policy;

-- Apply the policy to an entire database
ALTER DATABASE mydb PLACEMENT POLICY = three_zone_policy;

Multi-region deployment

For disaster tolerance across geographic regions, you can place replicas in separate regions. Because the Raft leader must wait for a majority of replicas to acknowledge each write, cross-region deployments increase write latency proportional to the round-trip time between regions.
-- Place the Raft leader in us-east, with followers in us-west and eu-west
CREATE PLACEMENT POLICY geo_policy
  LEADER_CONSTRAINTS = '[+region=us-east]'
  FOLLOWER_CONSTRAINTS = '{"+region=us-west": 1, "+region=eu-west": 1}';

ALTER TABLE critical_data PLACEMENT POLICY = geo_policy;
In a 3-region deployment where each region holds one replica, a single region failure leaves only two of three replicas. Raft still has a majority and can elect a new leader, so writes continue. However, if two regions fail simultaneously, the cluster loses quorum and becomes read-only until quorum is restored.

Five-replica deployment for higher fault tolerance

Increasing the replica count to 5 allows the cluster to survive two simultaneous node failures:
CREATE PLACEMENT POLICY high_ft_policy FOLLOWERS = 4; -- 1 leader + 4 followers = 5 total
ALTER TABLE critical_table PLACEMENT POLICY = high_ft_policy;

Recovery time objectives

Failure scenarioTypical recovery time
Single TiDB server failureImmediate (load balancer reroutes)
Single TiKV node failureSeconds (Raft leader election per Region)
Single TiKV disk failureSeconds (Raft election) + minutes (replica rebalancing)
Complete availability zone failureSeconds (Raft election on surviving nodes)
PD node failure (3-node PD cluster)Seconds (etcd leader election)
These objectives assume the default 3-replica configuration and healthy network connectivity among surviving nodes.

Deployment patterns

The minimum production configuration. Run 3 TiKV nodes, 3 PD nodes, and at least 2 TiDB servers behind a load balancer. This tolerates one TiKV node failure and one PD node failure simultaneously.
TiDB Server ×2+   (stateless, behind load balancer)
PD           ×3   (etcd-backed control plane)
TiKV         ×3   (data, 3 replicas per Region)
Spread TiKV and PD nodes across three availability zones. A zone failure does not interrupt writes because two zones still form a Raft majority.
Zone A: TiKV ×1, PD ×1, TiDB ×1
Zone B: TiKV ×1, PD ×1, TiDB ×1
Zone C: TiKV ×1, PD ×1, TiDB ×1
Place two replicas in the primary region and one in a secondary region for disaster recovery with acceptable write latency. Use Placement Policies to pin the Raft leader to the primary region.
Primary region:   TiKV ×2 (leader + 1 follower), PD ×2, TiDB ×2+
Secondary region: TiKV ×1 (follower), PD ×1
For zero-RPO across regions, use a 3-region active-active setup with one replica per region.

Build docs developers (and LLMs) love