Getlago

Mar 17

/

7 min read

Billing System Scalability: Horizontal Scaling, Sharding, Multi-Tenancy

Anh-Tho Chuong

Anh-Tho Chuong

Share on

LinkedInX

Billing system scalability is one of the most underestimated engineering challenges in SaaS infrastructure: a billing system that handles 10,000 customers smoothly can grind to a halt at 1 million customers if it was designed without horizontal scaling in mind. According to McKinsey's 2024 Technology Report, scaling failures in billing infrastructure account for 23% of revenue-impacting production incidents at high-growth SaaS companies [1]. This guide covers the core architectural patterns — horizontal scaling, database sharding, and multi-tenancy — that allow billing systems to grow without requiring complete rewrites.

What Is Horizontal Scaling in Billing Systems?

Horizontal scaling in billing systems means adding more machines to distribute load rather than upgrading existing machines with more CPU or RAM (vertical scaling). Billing workloads — event ingestion, invoice generation, payment processing — are naturally parallelizable: invoices for different customers have no dependencies on each other and can be generated concurrently across multiple workers. A horizontally scaled billing system routes work through a queue, with worker nodes picking up jobs independently. When load spikes — at end-of-month billing cycles, for example — additional workers spin up automatically to absorb the volume, then scale back down to reduce cost [2].

Stateless application tier design is a prerequisite for horizontal scaling. Billing application servers must not store session state or in-memory data that other servers need — all shared state must live in the database or a shared cache. If a billing worker holds an invoice-generation job's intermediate state in local memory and then crashes, the job must be recoverable by another worker from a durable checkpoint. Job queues like Sidekiq, Celery, or cloud-native services like AWS SQS provide durable job storage, at-least-once delivery, and visibility timeouts that allow stuck jobs to be claimed by other workers after a configurable delay.

How Does Database Sharding Work for Billing Data?

Database sharding for billing data partitions rows across multiple database instances based on a shard key, so no single database handles the full write load. The most common shard key for billing systems is customer_id or organization_id — all events, subscriptions, invoices, and payment records for a given customer live on the same shard, enabling efficient single-shard queries for the common case (fetching a customer's invoice history) while distributing write load across shards. A billing system with 10 shards can handle roughly 10x the write throughput of an unsharded system, assuming even key distribution [3].

Shard key selection requires careful analysis of access patterns. Sharding by customer_id works well when queries are predominantly per-customer, but it creates hot shards if a small number of high-volume customers generate the majority of events. Consistent hashing distributes keys more evenly and makes adding new shards less disruptive — adding a new shard only requires remapping a fraction of existing keys rather than rehashing all of them. Billing systems that anticipate significant customer growth should design for re-sharding from the start, as migrating a sharded billing system to a new key space in production is one of the most technically complex operations in infrastructure engineering.

Cross-shard queries — such as generating a revenue report across all customers — are expensive in sharded systems because they require querying all shards and aggregating results. Billing systems handle this by maintaining a separate analytics replica or data warehouse that receives event streams from all shards and provides a unified query surface. This separation-of-concerns pattern keeps the transactional billing database optimized for per-customer writes while the analytics layer handles cross-customer aggregations. For more on building the analytics foundation, see the guide on subscription analytics, cohort analysis, and retention metrics.

Multi-Tenancy Architecture in Billing Systems

Multi-tenancy in billing systems means a single deployment serves multiple independent organizations, with strict data isolation between tenants. There are three primary multi-tenancy models: database-per-tenant (maximum isolation, highest cost), schema-per-tenant within a shared database (moderate isolation, moderate cost), and shared tables with tenant_id columns (minimum isolation, lowest cost). Billing platforms generally use shared tables with tenant_id because the economics of database-per-tenant don't scale beyond a few hundred tenants, and schema-per-tenant adds schema migration complexity that multiplies with tenant count [4].

Row-level security (RLS) in PostgreSQL is a powerful tool for enforcing tenant isolation in shared-table multi-tenancy. RLS policies evaluate a WHERE tenant_id = current_tenant_id() predicate automatically on every query, preventing cross-tenant data access even if application code has a bug that omits the tenant filter. RLS policies are enforced at the database layer — they cannot be bypassed by application-layer mistakes — making them a defense-in-depth mechanism that complements application-level authorization. Billing platforms handling sensitive financial data should use RLS as a required control, not an optional optimization.

Connection pooling is a critical scalability concern in multi-tenant billing systems. Each PostgreSQL connection consumes approximately 5–10MB of memory, so a billing system serving 10,000 tenants cannot open a dedicated connection per tenant. PgBouncer and similar connection poolers multiplex many application connections onto a small pool of database connections, allowing thousands of concurrent tenant requests to share a manageable number of database connections. Transaction-mode pooling (returning the connection to the pool after each transaction) is preferred for billing systems because it maximizes pool utilization, but it prevents the use of session-level features like advisory locks or temporary tables within a billing transaction.

Caching Strategies for High-Scale Billing Systems

Caching reduces database load for frequently read billing data that changes infrequently. Billing plan configurations, pricing rules, and customer entitlement states are good cache candidates — they're read on every event ingestion and invoice calculation but change only when customers upgrade or when plans are modified. Redis is the standard caching layer for billing systems, providing sub-millisecond read latency, atomic operations for cache invalidation, and built-in TTL support. Cache invalidation for billing data must be precise: stale pricing rules can cause invoice calculation errors, so cache entries should be invalidated immediately on plan changes rather than relying on TTL expiry [5].

Read replicas offload read traffic from the primary billing database without requiring the complexity of full sharding. Billing dashboards, invoice history endpoints, and reporting queries can route to read replicas, preserving primary database capacity for writes. Replication lag — typically 10–100ms in well-configured deployments — means read replicas may return slightly stale data. For billing systems, this is generally acceptable for reporting queries but not for real-time balance checks or payment authorization flows where stale data could result in overdrafts or erroneous charges.

Queue Architecture for Billing Workload Distribution

Billing workloads vary significantly in latency requirements and can benefit from priority queues. Real-time event ingestion must be processed within milliseconds to update usage meters for customers on hard-limit plans. Invoice generation at end-of-month can tolerate minutes of latency. Payment retry jobs can run overnight. A queue architecture with separate priority lanes allows each workload type to receive the resources it needs without competing with other workloads. High-priority lanes process real-time events, medium-priority lanes handle invoice generation, and low-priority lanes process bulk exports and reports.

Dead letter queues (DLQs) are essential for billing job resilience. Jobs that fail repeatedly — due to a billing rule misconfiguration, a payment provider outage, or a data integrity issue — should be moved to a DLQ rather than retried indefinitely. DLQs allow operations teams to inspect failed jobs, diagnose root causes, and replay them once the underlying issue is resolved without losing billing events. Every billing queue should have a configured DLQ with alerting on DLQ depth — a growing DLQ is an early warning of a billing system issue that may not yet be visible in customer-facing metrics.

Scaling Event Ingestion for Usage-Based Billing

Event ingestion is the highest-throughput component of usage-based billing systems and requires dedicated scalability design. At scale, usage events arrive as high-frequency streams — API calls, compute seconds, messages sent — that must be ingested, deduplicated, aggregated, and stored before invoice generation. A naive approach of writing each event directly to the billing database creates write contention at scale. The production pattern uses a message queue (Kafka, Kinesis, or Pub/Sub) as an ingestion buffer, with consumers reading from the queue and writing to the billing database in micro-batches. This decouples ingestion latency from database write throughput and provides a durable buffer during database maintenance windows.

Open-source billing platforms like Lago are built for this scale: Lago's event ingestion pipeline processes over 1 million events per second, with multi-entity billing support that handles complex organizational structures within a single deployment. For teams designing event ingestion infrastructure, the guide on event ingestion architecture and metering pipelines covers the design choices in depth.

Multi-Region Deployment and Data Residency

Multi-region billing deployments address both scalability and data residency requirements. Distributing billing infrastructure across multiple geographic regions reduces latency for customers in each region and protects against single-region outages. The primary challenge in multi-region billing is data consistency: financial records must be authoritative — two regions cannot have different values for the same invoice total. Most billing systems solve this with a primary-region write architecture where all write operations route to a designated primary region and are replicated to secondary regions, while read operations can be served from the nearest region [6].

Active-active multi-region billing — where any region can accept writes — is achievable but requires careful conflict resolution. Distributed databases like CockroachDB or YugabyteDB provide ACID transactions across regions with automatic conflict resolution. For billing-specific conflicts (two regions both attempting to apply a discount to the same invoice), application-level idempotency and optimistic concurrency control (checking a version counter before applying changes) are required even when the database provides distributed ACID guarantees.

Scaling Considerations for Specific Billing Workloads

Invoice generation at end-of-month is a predictable peak load that billing systems must design for explicitly. A SaaS company with 100,000 customers that generates all invoices on the 1st of the month needs to process orders of magnitude more work in a 2–4 hour window than during normal operating hours. Strategies for handling this peak include spreading invoice generation across the month using anniversary billing (each customer invoiced on their signup anniversary rather than a fixed date), using invoice scheduling queues that spread jobs evenly throughout the billing period, and pre-computing usage aggregations continuously rather than at invoice generation time.

Tax calculation is another billing workload that doesn't scale linearly. Each invoice line item may require a separate tax API call to providers like Avalara to determine the correct tax rate based on customer location, product type, and local regulations. At 100,000 invoices with 5 line items each, that's 500,000 tax API calls per billing cycle. Caching tax rates by jurisdiction-product combination, batching tax API calls, and maintaining a local tax rate cache for frequently billed jurisdictions all reduce the external API dependency and improve invoice generation throughput at scale.

Citations

  1. McKinsey & Company. Technology Operations Report: SaaS Infrastructure Failures. McKinsey Digital, 2024.
  2. AWS Architecture Blog. Building Scalable Job Processing Systems on AWS. Amazon Web Services, 2024.
  3. Percona. MySQL Sharding: A Practical Guide to Database Partitioning. Percona, 2024.
  4. PostgreSQL Documentation. Row Security Policies. The PostgreSQL Global Development Group, 2025.
  5. Redis Labs. Caching Patterns for Financial Applications. Redis Inc., 2024.
  6. Cockroach Labs. Multi-Region Database Patterns. CockroachDB Documentation, 2025.

Share on

LinkedInX

More from the blog

Lago solves complex billing.