Uber: From 2 APIs to 2,200 Microservices—Managing Hypergrowth

Published September 18, 2024 · 14 min read
#Architecture #Scale #Microservices #Distributed Systems


🚗 Case Overview

Context: 2010–2020 transformation from SF-only ride-hailing to global mobility platform

Challenge: Monolithic architecture broke under hypergrowth (10× user growth per year, launches in new cities weekly)

Stakes: If Uber couldn’t reliably match riders with drivers in <30 seconds, users would switch to Lyft/competitors

Result: Evolved to 2,200+ microservices, 8,000+ engineers, handling 20M+ rides/day across 10,000+ cities


1. Background

Company Context

Uber in 2010 (Launch Year):

  • Product: Luxury black car service in San Francisco
  • Scale: ~50 drivers, ~1000 users, ~100 rides/day
  • Tech Team: 2 backend engineers (Ryan Graves, Conrad Whelan)
  • Infrastructure: Monolithic PHP application on single MySQL database

Market Opportunity (2010–2013):

  • Smartphones (iPhone, Android) had GPS + mobile internet → enabled real-time location tracking
  • Traditional taxis were inefficient (hail on street, no price visibility, cash-only)
  • Regulatory gaps: Most cities hadn’t banned ride-hailing yet (window of opportunity)

Strategic Imperative: Launch in every major US city before competitors (Lyft, Sidecar) do, then go international.

Technical Baseline

2012 Architecture (“Monolith Era”):

┌──────────────────────────────────────────────┐
│         Monolithic Application               │
│  ┌─────────────────────────────────────────┐ │
│  │  Rider App (Request Ride)               │ │
│  │  Driver App (Accept Ride)               │ │
│  │  Dispatch Logic (Match Rider/Driver)    │ │
│  │  Pricing (Surge, Promos)                │ │
│  │  Payments (Stripe Integration)          │ │
│  │  Maps (Google Maps API)                 │ │
│  └─────────────────────────────────────────┘ │
│                                              │
│         Single MySQL Database                │
└──────────────────────────────────────────────┘

Why This Worked Initially:

  • Simple to build (2 engineers could manage entire stack)
  • Fast iteration (deploy 10× per day)
  • SF-only operations (single timezone, English-only)

Growth Trajectory:

  • 2011: Launch in NYC (2 cities)
  • 2012: Launch in 10 cities (exponential growth begins)
  • 2013: Launch internationally (Paris, London) → 50 cities
  • 2014: 200+ cities, UberX (non-luxury) launches

2. Problem

The “Hypergrowth Bottleneck” (2013–2014)

Incident Timeline:

Jan 2013: NYC launch day → site crashes. Dispatch system can’t handle 10× spike in ride requests.

  • Root cause: MySQL query for “find nearest available driver” was O(n²) (checked every driver against every rider)
  • Fix: Added spatial index, reduced query time from 8s to 200ms
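
A minimal sketch of that kind of fix, assuming a hypothetical drivers table and a DB-API-style connection (not Uber's actual schema): an index on (lat, lng) lets a bounding-box predicate prune the scan, and only the few candidates inside the box get ranked.

```python
import math

def bounding_box(lat, lng, radius_km=2.0):
    # Rough degree offsets for the search radius (good enough for a pre-filter).
    dlat = radius_km / 111.0
    dlng = radius_km / (111.0 * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng

def find_nearest_driver(db, rider_lat, rider_lng):
    # Hypothetical schema: drivers(driver_id, lat, lng, available) with an index
    # on (lat, lng), so this WHERE clause avoids a full-table scan.
    min_lat, max_lat, min_lng, max_lng = bounding_box(rider_lat, rider_lng)
    cur = db.execute(
        "SELECT driver_id, lat, lng FROM drivers "
        "WHERE available = 1 AND lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?",
        (min_lat, max_lat, min_lng, max_lng),
    )
    rows = cur.fetchall()  # a handful of nearby candidates, not every driver
    return min(
        rows,
        key=lambda r: (r[1] - rider_lat) ** 2 + (r[2] - rider_lng) ** 2,
        default=None,
    )
```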

Mar 2013: Added payment processing → all deployments delayed by 3 days; the payment team sat waiting for the dispatch team to finish its release.

  • Root cause: Single codebase = coordination overhead (20 engineers)
  • Temporary fix: Deploy at 3am to avoid peak hours

Aug 2013: International launches (Paris, London) → wrong pricing displayed. Surge pricing showed “$12” instead of “€10”.

  • Root cause: Pricing logic hardcoded US dollars, no localization
  • Fix: Rewrite pricing module for multi-currency (took 6 weeks)

Dec 2013: Holiday season → database becomes read-only. MySQL master hit write capacity (10K writes/sec).

  • Root cause: Every ride request, driver location update, payment transaction hit same DB
  • Fix: Add read replicas, but master still bottlenecked on writes
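
A hedged sketch of that stopgap, with placeholder connection objects: reads fan out across replicas, but every write still lands on the single primary, which is why write volume remained the bottleneck.

```python
import itertools

class RoutedDB:
    """Illustrative read/write splitting over placeholder DB-API connections."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # naive round-robin over replicas

    def write(self, sql, params=()):
        # INSERT/UPDATE/DELETE: only the primary can take these.
        return self.primary.execute(sql, params)

    def read(self, sql, params=()):
        # SELECTs spread across replicas (and may be slightly stale).
        return next(self._replicas).execute(sql, params)
```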

Strategic Constraints

Leadership (Travis Kalanick, CEO; Thuan Pham, CTO) weighed the decision against five criteria:

  1. Speed: Must launch in 100 new cities per year (competitive moat)
  2. Reliability: 99.9% uptime (riders won’t tolerate “no cars available” errors)
  3. Latency: <30s to match rider with driver (industry standard set by taxis)
  4. Flexibility: Support multiple products (UberX, UberPool, UberEats) without rewriting core systems
  5. Scalability: Handle 10× growth per year (both users and cities)

Key Question: Can we scale the monolith, or do we need a fundamental architectural shift?


3. Decision Criteria

Uber engineering evaluated options based on:

Technical Criteria

  1. Independent Deployment: Teams must deploy without blocking others (payment updates shouldn’t delay dispatch)
  2. Geographic Scalability: New city launches shouldn’t require code changes (configuration-driven)
  3. Product Diversity: Add new products (UberPool, UberEats) without modifying core dispatch logic
  4. Fault Isolation: Bug in pricing shouldn’t crash entire app (rider can still request rides)
  5. Performance: Sub-second dispatch (match rider to driver in <1s)

Business Criteria

  1. Engineering Velocity: 100+ engineers must work in parallel (not waiting for releases)
  2. Market Speed: Launch in new city in <2 weeks (vs. 3 months with monolith)
  3. Innovation: Enable A/B testing on subsets of cities (test new pricing algorithms)
  4. Cost Efficiency: Infrastructure costs grow slower than revenue (economies of scale)

Risk Factors

  1. Migration Complexity: Can’t pause business to rewrite architecture (must migrate while growing)
  2. Data Consistency: Distributed systems = eventual consistency (rider might see stale driver locations)
  3. Operational Overhead: Monitoring 100+ services vs. 1 monolith
  4. Skill Gap: Most engineers hadn’t built distributed systems (training required)

4. Alternatives Considered

Option A: Scale the Monolith (Optimize Current System)

Approach:

  • Shard MySQL by city (SF database, NYC database, etc.)
  • Add caching (Redis) for hot data (driver locations)
  • Optimize queries (add indexes, rewrite O(n²) algorithms)

Pros:

  • Lowest risk (no architectural change)
  • Engineers already familiar with codebase
  • Faster short-term (no migration overhead)

Cons:

  • Coordination overhead still exists (teams block each other)
  • City sharding doesn’t help inter-city features (UberPool across cities)
  • Technical debt accumulates (queries becoming unmanageable)

Why Rejected: Wouldn’t enable independent deployment (main bottleneck for velocity). City-based sharding breaks down when products span cities (e.g., airport pickups on city borders).


Option B: Service-Oriented Architecture (SOA with ESB)

Approach:

  • Split monolith into ~20 large services (Dispatch Service, Payment Service, etc.)
  • Use enterprise service bus (ESB) for communication
  • Shared database across services (single source of truth)

Pros:

  • Industry-standard (used by eBay, Salesforce)
  • Enables some team autonomy (payment team owns Payment Service)
  • Shared database simplifies data consistency

Cons:

  • ESB becomes bottleneck (all messages flow through one system)
  • Shared database = coordination on schema changes (teams still coupled)
  • Doesn’t solve geographic scalability (city-specific logic still in monolith)

Why Rejected: ESB is single point of failure. Shared database defeats purpose (teams still need to coordinate schema migrations).


Option C: Domain-Driven Microservices (Chosen)

Approach:

  • Decompose by business domain (Dispatch, Pricing, Payments, Maps)
  • Each service owns its database (no shared DB)
  • Services communicate via REST APIs + message queues (Kafka)
  • Geographic services (city-specific logic) deployed per region
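
A minimal sketch of the asynchronous half of that design, using the kafka-python client; the broker address, topic name, and payload are invented for illustration.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Dispatch publishes an event; pricing, analytics, and notifications consume it
# on their own schedule instead of being called synchronously.
producer.send("trip-events", {"trip_id": "t-123", "status": "matched", "driver_id": "d-456"})
producer.flush()
```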

Pros:

  • Independent deployment: Payment updates don’t affect Dispatch
  • Fault isolation: Pricing bug doesn’t crash ride requests
  • Product flexibility: Add UberEats without modifying core ride services
  • Geographic scalability: Deploy city-specific services to local AWS regions (lower latency)

Cons:

  • Eventual consistency: Rider might see outdated driver location (5s lag)
  • Operational complexity: Monitor 100+ services (need observability tooling)
  • Migration risk: 18-month project, must keep business running

Why Chosen: Only option that enabled hypergrowth (launch 100 cities/year while scaling engineering team 10×).


5. Solution / Implementation

Phase 1: Service Extraction (2014–2015)

Strategy: Extract services one at a time, starting with least critical.

First Service Extracted: Payment Processing

  • Highest isolation (only touches billing database, not dispatch)
  • Stripe API integration → moved to standalone service
  • Result: Payment team deployed 3× per week (vs. weekly monolith releases)

Second Service: Dispatch (Core Matching Logic)

  • Moved “find nearest driver” algorithm to Dispatch Service
  • Implemented spatial indexing (geohashing) for driver locations
  • Result: Reduced matching latency from 5s to 800ms
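
A self-contained sketch of the geohashing idea (not Uber's dispatch code): nearby coordinates share a hash prefix, so bucketing live driver positions by cell turns "find candidates near this rider" into a dictionary lookup.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lng, precision=6):
    """Standard geohash encoding: interleave longitude/latitude bisection bits."""
    lat_lo, lat_hi, lng_lo, lng_hi = -90.0, 90.0, -180.0, 180.0
    even, bit, ch, result = True, 0, 0, ""
    while len(result) < precision:
        if even:  # even bits refine longitude
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                ch |= 1 << (4 - bit)
                lng_lo = mid
            else:
                lng_hi = mid
        else:     # odd bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch |= 1 << (4 - bit)
                lat_lo = mid
            else:
                lat_hi = mid
        even = not even
        if bit < 4:
            bit += 1
        else:
            result += _BASE32[ch]
            bit, ch = 0, 0
    return result

# Bucket live driver positions by cell; candidate lookup is a dict read, not a scan.
cells = defaultdict(set)
cells[geohash(37.7749, -122.4194)].add("driver:42")   # driver location ping
candidates = cells[geohash(37.7751, -122.4189)]       # rider's cell → nearby drivers
```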

Third Service: Pricing Engine

  • Surge pricing, promo codes, currency conversion → standalone service
  • Enabled A/B testing (test new pricing algorithms on 10% of users)
  • Result: Pricing experiments went from 4 weeks to 2 days
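
One common way to gate such experiments (an assumed sketch, not Uber's experimentation platform): hash the user and experiment name so assignment is deterministic and roughly uniform.

```python
import hashlib

def in_experiment(user_id: str, experiment: str, percent: int) -> bool:
    """Deterministic bucketing: the same user always gets the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Roll a new surge algorithm out to roughly 10% of riders.
use_new_surge = in_experiment("rider-8675309", "surge-v2", 10)
```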

Phase 2: Domain-Driven Design (2015–2017)

Service Taxonomy: Organized microservices by business capability.

Core Services:

  • Trip Service: Manages the ride lifecycle (requested → matched → in-progress → completed); modeled as a state machine in the sketch after this list
  • Dispatch Service: Matches riders with drivers (geospatial optimization)
  • Pricing Service: Calculates fare (surge, promos, tolls)
  • Payment Service: Processes transactions (credit cards, wallets, invoices)
  • Maps Service: Routing, ETAs, driver navigation
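
A sketch of that ride lifecycle as an explicit state machine; the canceled state and the transition table are assumptions beyond the four states listed above.

```python
from enum import Enum

class TripState(Enum):
    REQUESTED = "requested"
    MATCHED = "matched"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    CANCELED = "canceled"  # assumed: not part of the lifecycle listed above

# Legal transitions; anything else is rejected instead of silently applied.
TRANSITIONS = {
    TripState.REQUESTED: {TripState.MATCHED, TripState.CANCELED},
    TripState.MATCHED: {TripState.IN_PROGRESS, TripState.CANCELED},
    TripState.IN_PROGRESS: {TripState.COMPLETED},
    TripState.COMPLETED: set(),
    TripState.CANCELED: set(),
}

def advance(current: TripState, new: TripState) -> TripState:
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```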

Supporting Services:

  • User Service: Authentication, profiles, ratings
  • Notification Service: Push notifications, SMS, emails
  • Analytics Service: Real-time dashboards (active rides, revenue)

Data Ownership: Each service owns its database.

  • Trip Service → PostgreSQL (relational data: trip_id, rider_id, driver_id)
  • Dispatch Service → Redis (in-memory: live driver locations)
  • Payment Service → Stripe (external API)

Phase 3: Geographic Distribution (2016–2018)

Problem: US-based AWS servers caused 300ms+ latency for Asia/Europe riders.

Solution: Deploy services to AWS regions closest to users.

Regional Architecture:

┌─────────────────┐       ┌─────────────────┐       ┌─────────────────┐
│  US Region      │       │  EU Region      │       │  Asia Region    │
│  (us-east-1)    │       │  (eu-west-1)    │       │  (ap-south-1)   │
├─────────────────┤       ├─────────────────┤       ├─────────────────┤
│  Dispatch       │       │  Dispatch       │       │  Dispatch       │
│  Pricing        │       │  Pricing        │       │  Pricing        │
│  Trip Service   │       │  Trip Service   │       │  Trip Service   │
└─────────────────┘       └─────────────────┘       └─────────────────┘
        │                         │                         │
        └────────────── Global Services ───────────────────┘
                        (User Service, Payment Service)
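
In miniature, "configuration-driven" regional routing can look like the sketch below; the city-to-region map and endpoints are invented for illustration.

```python
# Hypothetical config: which regional deployment serves each city.
CITY_TO_REGION = {
    "san_francisco": "us-east-1",
    "paris": "eu-west-1",
    "mumbai": "ap-south-1",
}

REGION_ENDPOINTS = {
    "us-east-1": "https://dispatch.us-east-1.example.internal",
    "eu-west-1": "https://dispatch.eu-west-1.example.internal",
    "ap-south-1": "https://dispatch.ap-south-1.example.internal",
}

def dispatch_endpoint(city: str) -> str:
    # Launching a new city becomes a config change (add a row), not a code change.
    region = CITY_TO_REGION.get(city, "us-east-1")  # assumed default region
    return REGION_ENDPOINTS[region]
```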

Trade-offs:

  • Latency improved: Asia dispatch latency dropped from 300ms to 40ms
  • Data consistency: User profile changes took 5s to propagate globally (eventual consistency)

Phase 4: Reliability Engineering (2017–2020)

Circuit Breakers: If Pricing Service is down, Trip Service uses cached prices (vs. failing entire ride request).
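
A minimal circuit-breaker sketch in that spirit; the thresholds and the cached-price fallback are assumptions.

```python
import time

class CircuitBreaker:
    """After N consecutive failures, stop calling the dependency and serve a
    fallback (e.g., a cached price) until a cooldown passes."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at and time.time() - self.opened_at < self.reset_after:
            return fallback()                    # circuit open: skip the call entirely
        try:
            result = fn()
            self.failures, self.opened_at = 0, None
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()     # trip the breaker
            return fallback()

# e.g. breaker.call(lambda: pricing_client.quote(trip), lambda: cached_price(trip))
```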

Rate Limiting: Dispatch Service limits requests to 10K/sec per city (prevents accidental DDoS from mobile apps).
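
A toy token-bucket limiter of the kind that enforces such a cap; the 10K/sec figure comes from the sentence above, the implementation is assumed.

```python
import time

class TokenBucket:
    """Per-city token bucket: up to `rate` requests/sec, with a bounded burst."""

    def __init__(self, rate=10_000, burst=10_000):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller sheds the request (e.g., returns HTTP 429)
```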

Chaos Engineering: Randomly kill services in production to test fault tolerance (inspired by Netflix).

Regional Failover: If us-east-1 goes down, traffic routes to us-west-2 (automatic DNS failover).


6. Outcome / Lessons

Quantitative Results

| Metric | 2012 (Monolith) | 2016 (Microservices) | 2020 (Current) |
|---|---|---|---|
| Cities | 2 | 400 | 10,000+ |
| Rides/Day | 100 | 5M | 20M+ |
| Engineers | 2 | 500 | 8,000+ |
| Services | 1 monolith | 300+ | 2,200+ |
| Deployment Frequency | Weekly | Multiple per day | 1,000+ deploys/day |
| Dispatch Latency | 5s | 800ms | 200ms |
| Uptime | 99.5% | 99.95% | 99.99% |

Technical Wins

  1. Independent Deployment: UberEats launched in 2014 by reusing Dispatch Service (no changes to core ride logic).

  2. Geographic Scalability: New city launches reduced from 3 months to 2 weeks (configuration change, not code).

  3. A/B Testing: Pricing experiments run on subsets of users (test surge algorithm in NYC, not SF).

  4. Cost Optimization: EC2 spot instances for non-critical workloads (analytics) → 60% cost savings.

Organizational Impact

Team Structure (Post-Microservices):

  • Product Teams: Own end-to-end features (e.g., UberPool team owns dispatch logic, pricing, UI)
  • Platform Teams: Build shared infrastructure (API gateway, observability, CI/CD)
  • Regional Teams: Manage city-specific operations (driver onboarding, regulatory compliance)

Engineering Velocity:

  • 2014: 100 engineers, 1 deploy/week
  • 2020: 8,000 engineers, 1,000+ deploys/day

Challenges / Lessons Learned

  1. Over-Decomposition: Some services were too small (Driver Avatar Service had 1 API endpoint). Merged back into User Service.

  2. Data Consistency: Rider sees “driver 2 min away” but driver canceled 5s ago (eventual consistency lag). Fixed with WebSocket updates (real-time sync).

  3. Monitoring Complexity: Debugging requests across 50+ services required distributed tracing (Uber built Jaeger, open-sourced in 2017); a toy version of the idea is sketched after this list.

  4. API Versioning: Breaking changes in Dispatch Service broke 20 downstream services. Implemented API contracts + backward compatibility rules.

  5. Cultural Shift: Engineers had to learn distributed systems (CAP theorem, idempotency, retries). Took 6 months of training + documentation.
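
Not Jaeger itself, but a toy illustration of the idea behind the distributed tracing mentioned in lesson 3: mint a trace ID at the edge and forward it on every downstream call, so one request can be stitched together across services.

```python
import uuid

def log(trace_id: str, service: str, message: str) -> None:
    # With the same ID in every service's logs, one request can be followed end to end.
    print(f"trace={trace_id} service={service} {message}")

def handle_api_request(headers: dict) -> dict:
    # Reuse the caller's trace ID if present, otherwise mint one at the edge.
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    log(trace_id, "dispatch", "matched driver d-456")
    # Forward the same ID to every downstream service call.
    return {"x-trace-id": trace_id}

handle_api_request({})  # an edge request arriving with no trace context
```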


What I Learned

Technical Insights

  1. Monoliths Aren’t Evil: Uber started with a monolith (right choice for 2 engineers). Microservices made sense at 100+ engineers.

  2. Data Ownership: Each service owning its database is key to independence. Shared databases = hidden coupling.

  3. Geographic Distribution: Latency matters. Uber’s Asia users saw 7× latency improvement by deploying services locally.

Business Insights

  1. Architecture Enables Strategy: Microservices allowed Uber to grow its engineering team from 2 to 8,000+ engineers without the organization collapsing.

  2. Migration is a Product Feature: Uber couldn’t pause growth to rewrite systems. Had to migrate incrementally (service by service).

  3. Operational Complexity is Real: 2,200 services = 2,200 on-call rotations. Uber built internal tooling (e.g., its Mesos-based Peloton resource scheduler) to manage this.

Strategic Thinking

When to Adopt Microservices:

  • Multiple teams (50+ engineers) working on same product
  • Geographic expansion requires local deployments
  • Need independent deployment (team velocity is bottleneck)

When NOT to Adopt Microservices:

  • Early-stage startup (<10 engineers)
  • Single geographic market
  • Monolith isn’t the bottleneck (consider optimizing first)

Key Insight: Architecture should match organizational structure. Uber’s microservices mirrored team boundaries (Dispatch team → Dispatch Service).


Discussion: If you were Uber in 2013, would you bet on microservices? What would you do differently knowing the 2,200-service complexity today?