Designing WhatsApp: A System Design Deep Dive
A practical walkthrough of designing a WhatsApp-scale messaging system: persistent connections, message routing, ordering, receipts, group fan-out, and E2E encryption.
- system-design
- messaging
- websockets
- scalability
- distributed-systems

WhatsApp is one of those deceptively simple products: a text box and a send button. Underneath, it is one of the most demanding distributed systems in production — billions of users, tens of billions of messages a day, all expected to arrive in order, exactly once, within a few hundred milliseconds, with end-to-end encryption that even the operator cannot break.
I have designed and reviewed enough real-time systems to know where the bodies are buried. In this post I will walk through how I would design a WhatsApp-style messenger from scratch: requirements, rough capacity math, the architecture, and then deep-dives on the genuinely hard parts — routing between online and offline users, ordering, delivery receipts, group fan-out, presence, and E2E encryption. I will be explicit about trade-offs throughout.
Requirements
Before any boxes-and-arrows, pin down what the system actually has to do.
Functional requirements
- 1:1 messaging — send/receive text and media between two users.
- Group messaging — small to medium groups (WhatsApp caps groups in the low thousands).
- Delivery and read receipts — the familiar single tick (sent to server), double tick (delivered to device), blue tick (read).
- Presence — online status and "last seen".
- Media — images, video, audio, documents.
- End-to-end encryption — the server stores and forwards ciphertext only.
- Push notifications — wake the app when it is backgrounded or offline.
Non-functional requirements
- Low latency — sub-second message delivery for online peers.
- High availability — messaging is the product; downtime is catastrophic.
- Durability — an accepted message must never be silently lost.
- Ordering — messages within a conversation arrive in the order they were sent.
- Horizontal scalability — the design must scale by adding machines, not bigger ones.
The key tension is between durability (write everything to durable storage) and latency (don't let storage be on the hot path of delivery). Most of the design is about resolving that tension.
Capacity estimates
These are back-of-envelope numbers to size the system, not precise figures.
- Assume 2 billion users, 500 million daily active.
- Assume 40 messages/user/day → 2B × 40 = 80 billion messages/day.
- 80B / 86,400s ≈ ~1 million messages/second average. Peak is spikier — call it 3–5x, so 3–5M msg/s.
- Average message payload (ciphertext + metadata): ~200 bytes. Media is referenced by URL, not inlined.
- Write throughput: ~1M msg/s × 200B ≈ 200 MB/s of message writes, sustained.
- Storage: if we retain messages until delivered (then optionally purge), the steady-state store is modest. If we kept 80B msgs/day × 200B for 30 days ≈ ~480 TB/month — but WhatsApp historically deletes messages from the server once delivered, which dramatically cuts storage.
- Connections: 500M concurrent persistent connections. If a single connection server holds ~1M sockets (aggressive but achievable with tuned epoll), that is ~500 connection servers minimum, realistically several thousand for headroom.
The headline numbers: millions of messages per second and hundreds of millions of concurrent long-lived connections. Those two facts drive every decision below.
High-level architecture
+-------------------+
Client <---->| Load Balancer |
(WebSocket) | (TCP/TLS, LB) |
+---------+---------+
|
+---------v-----------+
| Connection Servers | holds live sockets,
| (Gateway / "chat") | tracks userId -> serverId
+---------+-----------+
|
+-------------+--------------+
| | |
+--------v---+ +------v------+ +----v---------+
| Session | | Message | | Presence |
| Registry | | Service | | Service |
| (who is on | | (route + | | (online, |
| which box)| | persist) | | last seen) |
+------------+ +------+------+ +--------------+
|
+-----------+-----------+
| |
+-------v------+ +-------v-------+
| Message Store| | Outbox/Queue |
| (per-user | | (Kafka for |
| inbox, undel)| | fan-out/push)|
+--------------+ +-------+-------+
|
+-------v-------+
| Push Service |
| (APNs / FCM) |
+---------------+
Media path (separate):
Client --PUT--> Blob Store (S3-like) --> CDN --GET--> Client
The core idea: clients hold a persistent connection (WebSocket or a custom protocol over TCP) to a connection server. The connection server is dumb — it terminates TLS, keeps the socket alive with heartbeats, and forwards frames to the message service. The message service does the real work: persist, route, ack.
Why persistent connections
Polling cannot deliver sub-second latency at this scale, and reconnecting per message is wasteful. A long-lived connection lets the server push to the client the instant a message arrives. WhatsApp famously used a tuned XMPP variant over a persistent TCP connection; a modern greenfield design would use WebSocket or a custom binary framing over TCP/QUIC.
The connection servers maintain a mapping in a session registry: userId -> {connectionServerId, deviceId, lastSeen}. This registry (Redis or a similar fast store) is how any other service finds which box currently holds a given user's socket.
Message routing
This is the heart of the system. When user A sends a message to user B:
- A's client sends the message frame over its WebSocket to its connection server.
- The connection server hands it to the message service.
- The message service assigns a server-side
messageId(more on IDs below), persists it durably to B's inbox, and immediately returns a "sent" ack (single tick) to A. - The message service looks up B in the session registry.
- B online: push the message to B's connection server, which writes it to B's socket. When B's device acks receipt, mark delivered (double tick) and notify A.
- B offline: the message stays in B's persistent inbox/queue. A push notification is fired via APNs/FCM. When B reconnects, it drains its inbox.
A.send ──► ConnSrv ──► MsgService ──► persist(B.inbox) ──► ack "sent" to A
│
B online? ──────┤── yes ─► push to B's ConnSrv ─► B device ─► "delivered" ─► notify A
└── no ─► leave in inbox + push notification (APNs/FCM)
The persist-before-ack rule is non-negotiable. We only tell A "sent" after the message is in durable storage. This guarantees that even if every connection server crashes a millisecond later, the message survives.
Message schema
A message frame on the wire might look like:
{
"type": "MESSAGE",
"messageId": "01H...ULID",
"clientMessageId": "uuid-from-sender",
"conversationId": "1to1:userA:userB",
"senderId": "userA",
"recipientId": "userB",
"timestamp": 1745312400123,
"seq": 48213,
"contentType": "text/encrypted",
"ciphertext": "base64...",
"mediaRef": null
}
The clientMessageId is generated by the sender and is the key to idempotency: if A retries after a flaky connection, the server dedupes on (senderId, clientMessageId) and does not create a duplicate.
Message store
I would model storage as a per-user inbox: each user has a queue of messages destined for them that have not yet been confirmed delivered to all their devices.
-- Conceptual schema (a wide-column store like Cassandra fits better than SQL,
-- but SQL makes the model clear)
CREATE TABLE user_inbox (
user_id TEXT,
message_id TEXT, -- ULID, time-sortable
conversation_id TEXT,
sender_id TEXT,
seq BIGINT,
ciphertext BLOB,
status TEXT, -- SENT | DELIVERED | READ
created_at TIMESTAMP,
PRIMARY KEY ((user_id), message_id) -- partition by user, cluster by id
);
Partitioning by user_id means draining a user's inbox is a single-partition scan — fast and cheap. A wide-column store (Cassandra/Scylla) is a natural fit: high write throughput, tunable consistency, and partition-local reads. Once a message is delivered to all of a user's devices (and read, depending on retention policy), it can be deleted from the inbox to keep storage bounded.
Sequencing and ordering
"Messages arrive in order" sounds trivial and is one of the trickiest parts.
Wall-clock timestamps are unreliable — client clocks drift, and two messages can land in the same millisecond. Instead, I assign a monotonic per-conversation sequence number (seq). For a 1:1 chat, a single logical writer (the message service partition that owns that conversationId) increments the counter. Clients render messages sorted by seq, not by timestamp.
For messageId I use a ULID (or Snowflake-style ID): time-ordered and globally unique, so IDs sort roughly chronologically across the system. The combination gives:
- Global uniqueness and dedup via
messageId/clientMessageId. - Strict per-conversation ordering via
seq.
Trade-off: a single writer per conversation gives clean ordering but makes that conversation a potential hot spot. For 1:1 chats this is fine — traffic is naturally low per conversation. For huge groups it is not, which is why group ordering is usually relaxed to "best-effort causal" rather than strict total order.
Delivery semantics: the three ticks
The familiar receipts map directly onto the routing flow:
- Single tick (sent): the server has durably accepted the message. Returned after persist.
- Double tick (delivered): the recipient's device has acked receipt over its connection. The recipient's app sends a
DELIVERY_RECEIPTwhich the server relays back to the sender. - Blue tick (read): the recipient opened the chat; the app sends a
READ_RECEIPT.
{ "type": "RECEIPT", "messageId": "01H...", "status": "DELIVERED", "by": "userB", "at": 1745312401000 }
Each receipt is itself a small message that flows back through the same routing machinery. Read receipts are a privacy toggle, so the client may suppress sending them.
We aim for at-least-once delivery to the device, with the client deduplicating on messageId. True exactly-once across an unreliable network is impossible, so the honest design is at-least-once + idempotent client. The user perceives exactly-once because duplicates are silently dropped.
Group messaging and fan-out
A group message must reach every member. The two classic approaches:
- Fan-out on write: when A posts to a group of N members, the message service writes a copy into each of the N members' inboxes immediately. Reads are cheap (drain your own inbox); writes amplify by N.
- Fan-out on read: store one copy in a group timeline; each member reads from it. Writes are cheap; reads must merge the group timeline.
For messaging, fan-out on write is the standard choice because delivery and per-user receipt tracking need a per-recipient record anyway, and groups are bounded (thousands, not millions). The write amplification is acceptable. For a 1,000-member group, one send becomes 1,000 inbox writes plus up to 1,000 pushes — done asynchronously via a queue (Kafka) so the sender's ack is not blocked on fan-out completing.
A posts to group G (members b,c,d,...,n)
──► persist group message
──► enqueue fan-out job
worker ─► for each member: write to member.inbox + route/push
Trade-off: very large broadcast groups break fan-out-on-write (a message to 1M members = 1M writes). That is why broadcast-style channels use a different, read-oriented model. For chat-sized groups, write fan-out wins.
Presence and last-seen
Presence is high-churn, low-value-per-event data, so I keep it out of the durable message path entirely.
- When a client connects, its connection server writes
online+ a heartbeat timestamp to the presence service (Redis with TTL). - Heartbeats refresh the TTL every ~30s. Miss two heartbeats → considered offline;
last_seenfreezes at the last heartbeat. - Presence is not broadcast to everyone. It is pushed only to users who currently have that user's chat open (a subscription model), to avoid an O(contacts) fan-out storm on every status flip.
This keeps presence cheap and eventually consistent, which is exactly the right consistency level for "online 2 minutes ago".
Push notifications
When the recipient is offline (no live socket), we cannot push over WebSocket. Instead, the message service hands a notification to a push service that talks to APNs (iOS) and FCM (Android). Because the payload is end-to-end encrypted, the push typically carries only a wake signal; the app then connects and pulls the actual (still-encrypted) message from its inbox.
Push providers have their own rate limits and are best-effort, so they are a supplement to the durable inbox, never a replacement. The inbox is the source of truth; push is just a doorbell.
Media handling
Media never flows through the message pipeline — that would crush it. The flow is:
- Client requests an upload URL; server returns a pre-signed URL to a blob store (S3-like).
- Client
PUTs the (client-side encrypted) blob directly to the blob store. - Client sends a normal message whose body contains a media reference (object key + encryption key, itself E2E encrypted) rather than the bytes.
- Recipient fetches the blob via a CDN edge for low latency, then decrypts locally.
This keeps the chat path tiny (a 200-byte reference) and offloads the heavy bytes to infrastructure built for it.
End-to-end encryption
WhatsApp uses the Signal Protocol. The essential concept: keys live on devices, never on the server.
- Each device publishes a long-term identity key and a batch of one-time prekeys to the server.
- To start a conversation, the sender fetches the recipient's prekey bundle and runs X3DH (Extended Triple Diffie-Hellman) to derive a shared secret without both parties being online.
- Ongoing messages use the Double Ratchet: every message advances the key, giving forward secrecy (a compromised key cannot decrypt past messages) and post-compromise security.
The server only ever sees ciphertext plus routing metadata (who, when, message size). For groups, a sender-key scheme encrypts the message once per group and distributes the group key pairwise, avoiding per-recipient re-encryption of the whole payload.
Critical trade-off: E2E encryption means the server cannot do server-side search, content moderation, or multi-device sync by replaying plaintext. Multi-device support requires syncing keys/state across a user's own devices — a hard problem WhatsApp solved later with a per-device session model. Encryption shapes the entire product, not just one module.
Bottlenecks and trade-offs
- Connection servers are stateful. Holding millions of sockets, a single box failing drops all its users; they reconnect (likely to a different box) and re-register in the session registry. Connection draining and sticky-but-rebalanceable assignment matter.
- The session registry is on the hot path of every routing decision. It must be fast (Redis) and highly available; a stale entry sends a message to the wrong box, so it needs short TTLs and reconciliation on reconnect.
- Hot conversations / large groups concentrate writes. Fan-out on write amplifies; mitigate with async queues and by capping group size.
- Durability vs latency: persist-before-ack adds a storage round-trip to the send path. We accept that latency to never lose an accepted message — the right call for a messenger.
- Ordering vs throughput: strict per-conversation ordering needs a single logical writer per conversation, which limits parallelism for that conversation. Fine for 1:1, relaxed for big groups.
- E2E encryption vs features: strong privacy forecloses entire categories of server-side functionality. This is a product decision as much as a technical one.
Final thoughts
A WhatsApp-scale messenger is not one hard problem; it is a stack of them that interact. Persistent connections give you latency; a durable per-user inbox gives you reliability; sequence numbers give you ordering; async fan-out gives you groups; the Signal Protocol gives you privacy — and each of those choices closes off other options.
If I had to compress the design to a few load-bearing principles: persist before you ack, route through a fast session registry, keep presence and media off the durable path, and make the client idempotent so at-least-once feels like exactly-once. Get those right and the rest is engineering. Get them wrong and no amount of horizontal scaling will save you.
