1. Reliable, Scalable, and Maintainable Applications
Many applications today are data-intensive rather than compute-intensive. The bottleneck is usually the amount of data, the complexity of data, and how fast it changes—not raw CPU power. This chapter establishes three fundamental concerns for building data systems.
Thinking About Data Systems
Databases, caches, and queues are traditionally thought of as different categories of tools with different access patterns, performance characteristics, and implementations. But the boundaries are blurring:
- Redis: A datastore that can also be used as a message queue
- Kafka: A message queue with database-like durability guarantees
Modern applications usually have requirements too broad for any single tool to meet, so developers compose several tools into a larger data system. In this view, the application developer becomes a data system designer, responsible for a system that provides specific guarantees to its clients.
The book focuses on three main concerns:
- Reliability: Continuing to work correctly when things go wrong
- Scalability: Having strategies to handle growth in load
- Maintainability: Making it easy for people to work on the system over time
Reliability
First, distinguish between faults and failures:
Fault: When one component of the system deviates from its specification.
Failure: When the system as a whole stops providing the required service to the user.
It’s impossible to reduce the probability of a fault to zero. The goal is to design fault-tolerant mechanisms that prevent faults from causing failures.
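To make the distinction concrete, here is a minimal Python sketch, with an illustrative fault rate and retry count (none of this comes from the chapter): a flaky component occasionally deviates from its spec (a fault), and a retry wrapper keeps those faults from surfacing to the user as a failure.

```python
import random

def flaky_component():
    """A component that sometimes deviates from its spec (a fault)."""
    if random.random() < 0.3:  # illustrative 30% transient-fault rate
        raise ConnectionError("transient fault")
    return "ok"

def call_with_retries(fn, attempts=5):
    """Fault tolerance: retry so a component fault doesn't become a failure."""
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError:
            continue  # tolerate the fault and try again
    raise RuntimeError("retries exhausted: fault escalated to a failure")

random.seed(0)  # deterministic for the example
print(call_with_retries(flaky_component))  # prints: ok
```

Retrying only works for transient faults; the point is that the mechanism absorbs the fault rather than letting it propagate.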
Types of faults:
- Hardware faults: Hard disks crashing, RAM failing, power outages. Usually handled with redundancy (RAID, backup generators).
- Software errors: Systematic bugs in the system. Hard to anticipate and tend to cause more failures since they are correlated across nodes.
- Human error: Configuration mistakes by operators are the leading cause of outages. Mitigations:
  - Design systems that minimize opportunities for error
  - Provide sandbox environments to explore and experiment safely
Example: Netflix Chaos Monkey
Netflix’s Chaos Monkey deliberately kills random server processes in production. The goal is to ensure that fault-tolerance mechanisms are actually working—not just theoretically sound.
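A toy version of the idea, with hypothetical instance names and a simulated "kill" (nothing here is Netflix's actual tooling or API): a chaos step injects the fault, and the assertion verifies that the recovery mechanism really repairs the pool.

```python
import random

# Hypothetical instance pool; names and states are illustrative.
instances = {"web-1": "healthy", "web-2": "healthy", "web-3": "healthy"}

def chaos_monkey(pool):
    """Deliberately 'kill' a random instance, Chaos Monkey style."""
    victim = random.choice(sorted(pool))
    pool[victim] = "dead"
    return victim

def supervisor(pool):
    """The fault-tolerance mechanism under test: respawn dead instances."""
    for name, state in pool.items():
        if state == "dead":
            pool[name] = "healthy"

random.seed(1)
chaos_monkey(instances)
supervisor(instances)
assert all(state == "healthy" for state in instances.values())
```

Running the check continuously in production, rather than once in a test environment, is what makes the approach convincing.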
Scalability
Scalability isn’t a simple “yes” or “no” label. It’s about how the system handles growth in specific circumstances.
Describing Load
Choose load parameters—the dimensions of growth you want your system to handle:
- Requests per second
- Ratio of reads to writes
- Number of simultaneous users
Twitter case study:
- Post tweet: avg 4.6k rps, peak 12k rps
- View timeline: 300k rps
Handling 12k writes/sec is easy. But 300k reads/sec is harder, and Twitter aims to deliver tweets to followers within five seconds.
Approach 1: Read-time joins. Store tweets in a global table. On a timeline request, look up everyone the user follows, fetch their tweets, and merge them. This is too slow at 300k reads/sec.
Approach 2: Write-time fan-out. Maintain a timeline cache per user. When someone tweets, insert the tweet into every follower’s cache. Reading is fast because the work is precomputed, but posting is expensive: with 75 followers on average, 4.6k tweets/sec becomes 345k cache writes/sec. This doesn’t scale for celebrities with millions of followers.
Twitter’s hybrid solution: Most users use write fan-out, but celebrity tweets are fetched separately and merged at read time.
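A minimal sketch of the hybrid approach in Python, with an illustrative celebrity cutoff and plain in-memory dictionaries standing in for the timeline caches and the global tweet table (not Twitter's actual implementation):

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000  # illustrative cutoff, not Twitter's real number

followers = defaultdict(set)   # author -> follower ids
tweets = defaultdict(list)     # author -> tweets (the global table)
timelines = defaultdict(list)  # user -> precomputed timeline cache

def post_tweet(author, text):
    tweets[author].append(text)
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        # Write-time fan-out: push the tweet into every follower's cache
        for follower in followers[author]:
            timelines[follower].append((author, text))
    # Celebrity tweets are not fanned out; they are merged at read time

def read_timeline(user, celebrities_followed):
    merged = list(timelines[user])  # cheap: precomputed by fan-out
    for celeb in celebrities_followed:
        merged.extend((celeb, text) for text in tweets[celeb])
    return merged

followers["alice"] = {"bob"}
post_tweet("alice", "hello world")
print(read_timeline("bob", []))  # [('alice', 'hello world')]
```

The design trades a bounded amount of extra read-time work (merging a handful of celebrity feeds) for avoiding millions of cache writes per celebrity tweet.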
Measuring Performance
Different systems have different metrics:
- Batch systems (Hadoop): Throughput (records/sec) or total processing time
- Online systems: Response time—measured as a distribution, not a single number
Response time = time between client sending request and receiving response
Latency = time a request spends waiting to be handled
Percentiles matter more than averages. Amazon optimizes for p99.9 because customers with the most data (from many purchases) tend to have the slowest requests—and they’re the most valuable.
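A small illustration of why the distribution matters more than the mean, using made-up response times and a hand-rolled nearest-rank percentile (production systems would use a metrics library):

```python
# Made-up response times (ms) with one slow outlier
response_times_ms = [20, 22, 21, 25, 23, 24, 22, 21, 2000, 26]

def percentile(values, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

mean = sum(response_times_ms) / len(response_times_ms)
print(mean)                               # 220.4, dragged up by the outlier
print(percentile(response_times_ms, 50))  # 22, the typical user's experience
print(percentile(response_times_ms, 99))  # 2000, what a tail-latency target sees
```

The mean suggests every request takes ~220 ms, which is true of none of them; the median and p99 describe what users actually experience.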
Tail latency amplification: In systems with multiple backend calls, a single slow sub-request can slow down the entire user response.
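The amplification effect follows from basic probability, assuming slow sub-requests are independent: if each of n parallel backend calls is slow with probability p, the user request is slow with probability 1 - (1 - p)^n. A quick sketch:

```python
def prob_user_request_slow(p_slow, n_backends):
    """If any of n parallel backend calls is slow, the whole request is slow."""
    return 1 - (1 - p_slow) ** n_backends

# Even if only 1% of individual backend calls are slow...
print(round(prob_user_request_slow(0.01, 1), 3))    # 0.01
# ...a request fanning out to 100 backends is slow ~63% of the time.
print(round(prob_user_request_slow(0.01, 100), 3))  # 0.634
```

This is why a p99 that looks acceptable for one service can still dominate end-user latency in a fan-out architecture.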
Approaches to Handle Load
- Vertical scaling: Upgrade to a more powerful machine
- Horizontal scaling: Distribute load across multiple machines
- Elastic: Automatically add resources when load increases
- Stateful systems (such as databases) are much harder to distribute than stateless services
Rule of thumb: Scale databases vertically until cost or availability requirements force distribution
Maintainability
The majority of software cost is in ongoing maintenance, not initial development. Design for these three principles:
- Operability: Make it easy for operations teams to keep the system running smoothly
- Simplicity: Make it easy for new engineers to understand the system (reduce accidental complexity)
- Evolvability: Make it easy to adapt the system for unanticipated use cases in the future