Rohan Shakya
System Design · 11 min read

Designing a Scalable E-commerce Platform

How I would architect a scalable e-commerce platform: domain decomposition, inventory consistency, idempotent checkout, search, caching, order sagas, and flash-sale survival.

  • system-design
  • ecommerce
  • microservices
  • distributed-systems
  • scalability

E-commerce looks like a solved problem from the outside — you list products, people buy them, you ship. But the moment you care about correctness under load, it becomes one of the richest system design problems there is. You cannot sell the same unit twice. You cannot charge a customer twice because their network blipped. You cannot let a flash sale take down the entire site. And you have to do all of this while keeping pages fast enough that people don't bounce.

In this post I will walk through how I would design a scalable e-commerce platform: the requirements, how I decompose the domain, the data models, and then the parts that actually keep me up at night — inventory reservation, idempotent checkout, exactly-once payment, search, caching, and surviving traffic spikes. I will close with the monolith-vs-microservices trade-off, because the right answer is almost never the trendy one.

Requirements

Functional requirements

  • Browse and search a product catalog.
  • Add items to a cart.
  • Checkout: address, shipping, payment.
  • Place orders and view order history/status.
  • Process payments reliably.
  • Track inventory so we never oversell.
  • Manage fulfillment (pick, pack, ship).

Non-functional requirements

  • High availability — browsing must stay up even if checkout degrades.
  • Correctness under concurrency — no overselling, no double-charging.
  • Low read latency — product pages and search should feel instant.
  • Elasticity — absorb 10–50x spikes during sales without falling over.
  • Observability — when money is involved, you need to see everything.

The defining property of commerce systems: reads vastly outnumber writes (people browse far more than they buy), but the rare writes (checkout/payment) demand strong guarantees. That asymmetry drives the whole architecture — optimize reads aggressively with caching and replicas, and spend your consistency budget on the write path.

Domain decomposition

I decompose by business capability, with each bounded context owning its data:

  • Catalog — products, variants, descriptions, pricing, media.
  • Search — an inverted index over the catalog for full-text and faceted search.
  • Cart — ephemeral, per-user item lists.
  • Checkout — orchestrates address, shipping, payment, and order creation.
  • Orders — the durable record of what was purchased.
  • Payments — talks to payment gateways; the most safety-critical service.
  • Inventory — authoritative stock counts and reservations.
  • Fulfillment — warehouse operations after an order is placed.
text
   Browser / App
        │
   ┌────▼─────┐   reads
   │  API GW  ├──────────────┐
   └────┬─────┘              │
        │ writes             │
  ┌─────┼───────────┬────────┼─────────┐
  │     │           │        │         │
┌─▼──┐ ┌▼─────┐ ┌──▼───┐ ┌──▼────┐ ┌──▼──────┐
│Cart│ │Catalog│ │Search│ │Checkout│ │Inventory│
└────┘ └──┬────┘ └──▲───┘ └──┬─────┘ └────┬────┘
          │ events  │ index  │ saga       │
          └─────────┘  ┌─────▼────┐  ┌────▼────┐
                       │ Orders   │  │Payments │
                       └────┬─────┘  └─────────┘
                            │ events
                       ┌────▼──────┐
                       │Fulfillment│
                       └───────────┘

These services communicate synchronously for reads (API gateway → service) and asynchronously via an event bus (Kafka) for state changes. Catalog changes publish events that the search service consumes to update its index. Order events drive fulfillment. This keeps services decoupled and lets the read-heavy parts scale independently.

Data models

Catalog

sql
CREATE TABLE products (
  product_id   UUID PRIMARY KEY,
  title        TEXT NOT NULL,
  description  TEXT,
  brand        TEXT,
  category_id  UUID,
  status       TEXT DEFAULT 'active',
  created_at   TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE variants (
  variant_id   UUID PRIMARY KEY,
  product_id   UUID REFERENCES products(product_id),
  sku          TEXT UNIQUE NOT NULL,
  attributes   JSONB,          -- { "size": "M", "color": "black" }
  price_cents  BIGINT NOT NULL,
  currency     TEXT NOT NULL
);

The variant (the actual buyable SKU) is the unit of inventory and pricing — not the product. This distinction matters: stock and price live on the variant.

Orders

sql
CREATE TABLE orders (
  order_id      UUID PRIMARY KEY,
  user_id       UUID NOT NULL,
  status        TEXT NOT NULL,          -- PENDING, PAID, FULFILLING, SHIPPED, CANCELLED, FAILED
  total_cents   BIGINT NOT NULL,
  currency      TEXT NOT NULL,
  idempotency_key TEXT UNIQUE,          -- guards against duplicate placement
  created_at    TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE order_items (
  order_id      UUID REFERENCES orders(order_id),
  variant_id    UUID NOT NULL,
  qty           INT NOT NULL,
  unit_price_cents BIGINT NOT NULL,     -- price captured at order time
  PRIMARY KEY (order_id, variant_id)
);

Note unit_price_cents is copied into the order. Prices change; an order must remember what the customer actually agreed to pay.

Inventory

sql
CREATE TABLE inventory (
  variant_id   UUID PRIMARY KEY,
  available    INT NOT NULL,    -- on hand and not reserved
  reserved     INT NOT NULL DEFAULT 0,
  version      BIGINT NOT NULL DEFAULT 0   -- optimistic concurrency
);

Inventory reservation: not overselling

This is the first hard consistency problem. Two customers both grab the last unit. Who wins?

The naive read-then-write is a race. I use a conditional, atomic decrement so the database itself enforces the invariant:

sql
-- Reserve N units atomically; succeeds only if enough are available
UPDATE inventory
SET available = available - :qty,
    reserved  = reserved  + :qty,
    version   = version + 1
WHERE variant_id = :variant_id
  AND available >= :qty;
-- 0 rows affected => out of stock; reject the reservation

If zero rows are affected, there wasn't enough stock and we reject. This is a single-row atomic operation: no race, no double-sell. Reservations have a TTL: if checkout isn't completed in, say, 15 minutes, a background job releases the reserved units back to available, which prevents abandoned carts from locking up stock forever. Since the inventory row holds only aggregate counts, the release job needs a per-reservation record (variant, quantity, expiry) to know which holds have expired.

Trade-off: reserving at "add to cart" gives the best UX (you're guaranteed the item) but locks inventory aggressively. Reserving at "begin checkout" is the usual compromise — stock is held only once someone is genuinely buying.

Idempotent checkout

Checkout is the riskiest write in the system, and the network is hostile. A client submits checkout, the request succeeds server-side, but the response is lost. The client retries. Without protection, you create two orders.

The fix is an idempotency key: the client generates a unique key per checkout attempt and sends it with every retry.

http
POST /v1/checkout
Idempotency-Key: 7f3c9a1e-...-client-generated
Content-Type: application/json

{ "cart_id": "...", "address_id": "...", "payment_method_id": "..." }

Server-side logic:

text
on checkout(idempotency_key, payload):
    existing = orders.find_by_idempotency_key(idempotency_key)
    if existing:
        return existing            # same result, no new side effects
    try:
        return create_order(...)   # UNIQUE constraint on key as a backstop
    on unique_violation:           # a concurrent retry won the race
        return orders.find_by_idempotency_key(idempotency_key)

The UNIQUE constraint on idempotency_key is the safety net: even under a race, the second insert fails and we return the already-created order. Idempotency is the single most important pattern on the write path — it turns "at-least-once" delivery (which is all the network gives you) into effectively "exactly-once" outcomes.

Exactly-once payment

Payments compound the idempotency problem because there's an external system — the payment gateway — and a charge is irreversible-ish (refunds are messy and slow). I never want to charge twice.

Defenses, layered:

  1. Idempotency keys on the gateway call. Stripe and most modern gateways accept an idempotency key; retries that reuse the same key are deduplicated into a single charge for as long as the gateway retains the key. I derive it deterministically from the order, e.g. pay_{order_id}.
  2. Local payment state machine. A payment_intent row tracks INITIATED → AUTHORIZED → CAPTURED, with FAILED as a terminal state reachable from any non-terminal step. Transitions are guarded; I never re-attempt a charge for an order already in CAPTURED.
  3. Webhook reconciliation. Gateways deliver outcome via webhook. I treat the webhook as the source of truth, dedupe it on the gateway's event id, and reconcile against my local state. If the synchronous call timed out but the webhook says CAPTURED, the order proceeds.
text
checkout ─► reserve inventory ─► create payment_intent(INITIATED)
        ─► charge gateway (idempotency_key = pay_{order_id})
        ─► on success: payment_intent=CAPTURED, order=PAID
        ─► on timeout: leave INITIATED, let webhook reconcile

Exactly-once delivery across a network boundary is impossible in general. What we build instead is at-least-once attempts that the idempotency key collapses into at most one charge, plus eventual reconciliation via webhooks, which is exactly-once in practice.
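The local state machine and the webhook dedupe can be sketched in a few lines. This is an in-memory illustration only: the class names, the transition table, and the event ids are my assumptions, and a real system would persist both the intent and the set of seen event ids.

```python
from dataclasses import dataclass, field

# Allowed transitions (an assumption for this sketch); CAPTURED and FAILED are terminal.
TRANSITIONS = {
    "INITIATED": {"AUTHORIZED", "CAPTURED", "FAILED"},
    "AUTHORIZED": {"CAPTURED", "FAILED"},
    "CAPTURED": set(),
    "FAILED": set(),
}

@dataclass
class PaymentIntent:
    order_id: str
    state: str = "INITIATED"

    @property
    def gateway_key(self):
        # Deterministic key, so every retry for this order reuses it.
        return f"pay_{self.order_id}"

    def transition(self, new_state):
        """Apply a guarded transition; returns False if already in that state."""
        if new_state == self.state:
            return False  # replayed outcome: nothing to do
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        return True

@dataclass
class WebhookHandler:
    intents: dict  # order_id -> PaymentIntent
    seen_event_ids: set = field(default_factory=set)

    def handle(self, event_id, order_id, outcome):
        """Reconcile one gateway event; duplicate event ids are ignored."""
        if event_id in self.seen_event_ids:
            return False
        changed = self.intents[order_id].transition(outcome)
        self.seen_event_ids.add(event_id)
        return changed
```

If the synchronous charge call times out, the intent simply stays INITIATED until a webhook moves it to CAPTURED or FAILED; an illegal transition raises so it can be surfaced for manual reconciliation rather than silently applied.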

The order saga

A successful checkout touches several services: reserve inventory, charge payment, create the order, kick off fulfillment. There is no distributed transaction across them (2PC at this scale is a liveness nightmare). Instead I use a saga — a sequence of local transactions with compensating actions for rollback.

text
Saga: PlaceOrder
  1. Reserve inventory      ── compensate: release reservation
  2. Authorize payment      ── compensate: void authorization
  3. Capture payment        ── compensate: refund
  4. Confirm order (PAID)
  5. Emit OrderPlaced event ──► fulfillment, notifications

If step 2 fails  ─► compensate step 1 (release stock), mark order FAILED
If step 3 fails  ─► compensate steps 2 & 1

The saga is driven by events on Kafka. Each step is idempotent and persists its progress, so a crashed orchestrator resumes where it left off. Compensation is the key insight: instead of locking everything until the whole flow commits, each step commits locally and we undo if a later step fails.
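The compensation idea fits in a small orchestrator. The sketch below runs in-process and skips the persistence and Kafka plumbing described above, so treat it as an illustration of ordering only; the step names and the `run_saga` helper are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, action, compensation); all names here are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]

def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"done:{name}")
        except Exception:
            log.append(f"failed:{name}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
        completed.append(step)
    return True

def noop():
    pass

def fail_capture():
    raise RuntimeError("gateway declined capture")

log: List[str] = []
ok = run_saga(
    [
        ("reserve_inventory", noop, noop),   # compensate: release reservation
        ("authorize_payment", noop, noop),   # compensate: void authorization
        ("capture_payment", fail_capture, noop),
        ("confirm_order", noop, noop),
    ],
    log,
)
```

With capture failing, the log shows the authorization being compensated before the reservation, which matches the "compensate steps 2 & 1" rule. A real orchestrator would also persist `completed` after every step so it can resume after a crash.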

The event emitted on success:

json
{
  "type": "OrderPlaced",
  "order_id": "...",
  "user_id": "...",
  "items": [{ "variant_id": "...", "qty": 2 }],
  "total_cents": 5998,
  "occurred_at": "2025-12-08T09:00:00Z"
}

Search

Browsing and search are read-dominant and need an inverted index, so I keep search in a dedicated Elasticsearch / OpenSearch cluster rather than the transactional database. The catalog is the system of record; search is a derived, eventually-consistent projection.

The flow:

  1. Catalog service writes a product change to its DB.
  2. It publishes a ProductUpdated event (ideally via the transactional outbox pattern, so the DB write and the event are atomic).
  3. A search indexer consumes the event and updates the index.
json
PUT /products/_doc/{variant_id}
{
  "title": "Merino Wool Sweater",
  "brand": "Acme",
  "category": "apparel",
  "price_cents": 8900,
  "attributes": { "size": "M", "color": "navy" },
  "in_stock": true
}

This gives fast full-text and faceted search (filter by brand, price range, attributes) without burdening the transactional DB. Eventual consistency here is fine — a product appearing in search a few seconds late is invisible to users.
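The transactional outbox mentioned in step 2 is worth a sketch, because it is what makes the catalog write and the event atomic. Below, sqlite3 stands in for the catalog database and a plain callback stands in for the Kafka producer; the outbox table layout and the function names are assumptions.

```python
import json
import sqlite3

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE products (product_id TEXT PRIMARY KEY, title TEXT NOT NULL);
        -- Hypothetical outbox table: events waiting to be relayed to the bus.
        CREATE TABLE outbox (
            event_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            payload   TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)

def update_product(conn, product_id, title):
    """Write the product and its event in one local transaction."""
    event = {"type": "ProductUpdated", "product_id": product_id, "title": title}
    with conn:  # both rows commit together, or neither does
        conn.execute(
            "INSERT OR REPLACE INTO products (product_id, title) VALUES (?, ?)",
            (product_id, title),
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))

def relay_once(conn, publish):
    """Publish pending events in order; returns how many were sent."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, payload in rows:
        # A crash between publish and the UPDATE means a re-send on restart,
        # so the consumer (the search indexer) must be idempotent.
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)
```

The relay gives at-least-once delivery. Indexing by document id, as in the PUT above, makes a duplicate event harmless because it overwrites the same document.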

Caching strategy

Caching is where the read-heavy nature of commerce pays off enormously.

  • CDN for static assets and cacheable product images/pages.
  • Read replicas for the catalog DB to scale reads horizontally.
  • Redis for hot product data, with cache-aside: read from cache, on miss read the DB and populate. Invalidate (or short-TTL) on product update events.
  • Edge/page caching for category and product pages, varying by locale/currency.

Watch out for two related failure modes: a cache stampede, where many requests miss the same cold key at once (use request coalescing or a per-key lock), and synchronized expiry, where hot keys time out together during a sale and the whole herd hits the database (use jittered TTLs and background refresh). Prices and stock displays should be cached briefly but never trusted at checkout: re-validate inventory and price authoritatively at the moment of purchase.
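A minimal cache-aside sketch with both defenses might look like this. An in-process dict stands in for Redis, and the `CacheAside` class, its parameters, and the 20% jitter are assumptions for illustration.

```python
import random
import threading
import time

class CacheAside:
    """Cache-aside with per-key request coalescing and jittered TTLs."""

    def __init__(self, loader, base_ttl=60.0, jitter=0.2):
        self._loader = loader      # called on a miss, e.g. a DB read
        self._base_ttl = base_ttl
        self._jitter = jitter      # +/- fraction applied to each TTL
        self._data = {}            # key -> (value, expires_at)
        self._locks = {}           # key -> lock used to coalesce loads
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Re-check: another thread may have loaded while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            ttl = self._base_ttl * (1 + random.uniform(-self._jitter, self._jitter))
            self._data[key] = (value, time.monotonic() + ttl)
            return value

    def invalidate(self, key):
        """Call this from the product-update event consumer."""
        self._data.pop(key, None)
```

Eight threads asking for the same cold key trigger one load instead of eight, and the jitter spreads expiries so hot keys do not all time out in the same instant. Across multiple app servers the same idea needs a distributed lock or a single-flight layer in front of the cache.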

Flash sales and traffic spikes

A flash sale is an adversarial load test you scheduled yourself. Everyone hits the same few SKUs at the same instant. Strategies:

  • Decouple browse from buy. Browsing is served entirely from caches/CDN and scales freely. Only the checkout path touches the contended write resources.
  • Queue the write path. Put checkout requests through a queue / virtual waiting room so the inventory service sees a controlled, smoothed rate rather than a spike. Users get a "you're in line" experience instead of a 500.
  • Pre-warm caches and pre-scale the relevant services before the sale starts.
  • Protect hot inventory rows. The single contended row (the sale item's stock) is the bottleneck. Mitigations: keep the atomic-decrement path lean, or shard a single SKU's stock into N sub-counters and decrement a random shard (then a unit is "available" if any shard has stock).
  • Rate-limit and shed load gracefully. A clear "sold out / try again" beats a hung page.
  • Idempotency everywhere — under a spike, clients retry aggressively, and idempotency keys are what stop those retries from creating duplicate orders and charges.
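The sharded-counter mitigation is the least obvious item in that list, so here is a rough sketch. It reserves one unit at a time and uses sqlite3 as a stand-in database; the inventory_shards table and both function names are hypothetical.

```python
import random
import sqlite3

def create_sharded_stock(conn, variant_id, total, shards):
    """Split one SKU's stock across `shards` rows as evenly as possible."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS inventory_shards ("
        " variant_id TEXT NOT NULL,"
        " shard INTEGER NOT NULL,"
        " available INTEGER NOT NULL,"
        " PRIMARY KEY (variant_id, shard))"
    )
    base, extra = divmod(total, shards)
    with conn:
        for shard in range(shards):
            conn.execute(
                "INSERT INTO inventory_shards (variant_id, shard, available) VALUES (?, ?, ?)",
                (variant_id, shard, base + (1 if shard < extra else 0)),
            )

def reserve_one(conn, variant_id, shards, rng=random):
    """Reserve one unit from a random shard; returns the shard used, or None if sold out."""
    order = list(range(shards))
    rng.shuffle(order)  # spread contention instead of hammering shard 0
    for shard in order:
        with conn:
            cur = conn.execute(
                "UPDATE inventory_shards SET available = available - 1 "
                "WHERE variant_id = ? AND shard = ? AND available >= 1",
                (variant_id, shard),
            )
        if cur.rowcount == 1:
            return shard
    return None  # every shard is empty
```

Each attempt is still the same atomic conditional decrement, just on a row that only a fraction of the buyers contend for. The cost shows up near sell-out, when a request may have to probe several empty shards before finding stock, and in multi-unit reservations, which this sketch does not handle.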

Observability

When money flows, blindness is unacceptable. I instrument:

  • Metrics (Prometheus/Grafana): checkout success rate, payment failure rate, inventory reservation rejects, p50/p95/p99 latency per service, queue depth.
  • Distributed tracing (OpenTelemetry): follow a single checkout across cart → checkout → inventory → payment → orders to find where latency or errors originate.
  • Structured logs correlated by order_id and trace_id.
  • Business alerts, not just technical ones: "checkout success rate dropped below 98%" catches problems no CPU graph will.
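As an illustration of the business-alert idea, here is a small sliding-window success-rate check. In practice I would express this as an alerting rule in the metrics system rather than in application code; the `SuccessRateAlert` class, the five-minute window, and the minimum sample count are assumptions.

```python
from collections import deque

class SuccessRateAlert:
    """Fire when the checkout success rate over a sliding window drops too low."""

    def __init__(self, window_seconds=300.0, threshold=0.98, min_samples=50):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples   # avoid alerting on a handful of requests
        self._events = deque()           # (timestamp, succeeded), oldest first

    def record(self, timestamp, succeeded):
        self._events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def success_rate(self, now):
        """Success fraction inside the window, or None if there is no data."""
        self._evict(now)
        if not self._events:
            return None
        successes = sum(1 for _, ok in self._events if ok)
        return successes / len(self._events)

    def should_alert(self, now):
        rate = self.success_rate(now)
        if rate is None or len(self._events) < self.min_samples:
            return False
        return rate < self.threshold
```

The minimum sample count matters at low traffic: one failed checkout out of three at 4 a.m. should not page anyone.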

Monolith vs microservices

This is the trade-off teams most often get wrong, usually by reaching for microservices too early.

  • A modular monolith — one deployable, clean internal module boundaries, one database — is the right starting point for almost everyone. You get transactional simplicity (real ACID across cart/order/inventory in one DB), trivial local testing, and no distributed-systems tax. Most stores never outgrow this.
  • Microservices earn their keep when teams and traffic grow past what one codebase/DB can serve: independent scaling (search and catalog reads scale very differently from payments), independent deploys, and team autonomy. The cost is real — network failures, eventual consistency, sagas instead of transactions, and a lot more operational surface.

My rule: start as a modular monolith with clean domain boundaries. When a specific module's scaling, deploy cadence, or team ownership becomes a genuine pain, extract that service. Let pain, not fashion, drive decomposition. The clean module boundaries make extraction cheap when the day comes.

Final thoughts

A scalable e-commerce platform is really two systems wearing one coat: a massively read-heavy browsing system that you scale with caches, replicas, and a derived search index; and a small but unforgiving write system where correctness is everything. The browsing system is about throughput. The write system is about not being wrong — not overselling, not double-charging, not losing orders.

The patterns that make the write side correct are remarkably few and worth internalizing: atomic conditional updates for inventory, idempotency keys for checkout and payment, and sagas with compensation for multi-service flows. Get those three right, lean on caching and events for everything read-heavy, and resist premature microservices. That is a platform that stays correct on a normal Tuesday and survives Black Friday.