1. Reliable, Scalable, and Maintainable Applications
Many applications today are data-intensive rather than compute-intensive. The bottleneck is usually the amount of data, the complexity of data, and how fast it changes—not raw CPU power. This chapter establishes three fundamental concerns for building data systems.
Thinking About Data Systems
Databases, caches, and queues are traditionally thought of as different categories of tools with different access patterns, performance characteristics, and implementations. But the boundaries are blurring:
- Redis: A datastore that can also be used as a message queue
- Kafka: A message queue with database-like durability guarantees
Modern applications usually have requirements too broad for any single tool to meet, so developers compose several tools into a larger data system. In this view, the application developer becomes a data system designer, responsible for a system that provides specific guarantees to its clients.
The book focuses on three main concerns:
- Reliability: Continuing to work correctly when things go wrong
- Scalability: Having strategies to handle growth in load
- Maintainability: Making it easy for people to work on the system over time
Reliability
First, distinguish between faults and failures:
Fault: When one component of the system deviates from its specification.
Failure: When the system as a whole stops providing the required service to the user.
It’s impossible to reduce the probability of a fault to zero. The goal is to design fault-tolerant mechanisms that prevent faults from causing failures.
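To make the distinction concrete, here is a minimal Python sketch, with an illustrative fault rate and retry count (none of this comes from the chapter): a flaky component occasionally deviates from its spec (a fault), and a retry wrapper keeps those faults from surfacing to the user as a failure.

```python
import random

def flaky_component():
    """A component that sometimes deviates from its spec (a fault)."""
    if random.random() < 0.3:  # illustrative 30% transient-fault rate
        raise ConnectionError("transient fault")
    return "ok"

def call_with_retries(fn, attempts=5):
    """Fault tolerance: retry so a component fault doesn't become a failure."""
    for _ in range(attempts):
        try:
            return fn()
        except ConnectionError:
            continue  # tolerate the fault and try again
    raise RuntimeError("retries exhausted: fault escalated to a failure")

random.seed(0)  # deterministic for the example
print(call_with_retries(flaky_component))  # prints: ok
```

Retrying only works for transient faults; the point is that the mechanism absorbs the fault rather than letting it propagate.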
Types of faults:
- Hardware faults: Hard disks crashing, RAM failing, power outages. Usually handled with redundancy (RAID, backup generators).
- Software errors: Systematic bugs in the system. Hard to anticipate and tend to cause more failures since they are correlated across nodes.
- Human error: Configuration mistakes by operators are the leading cause of outages. Mitigations:
  - Design systems that minimize opportunities for error
  - Provide sandbox environments to explore and experiment safely
Example: Netflix Chaos Monkey
Netflix’s Chaos Monkey deliberately kills random server processes in production. The goal is to ensure that fault-tolerance mechanisms are actually working—not just theoretically sound.
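A toy version of the idea, with hypothetical instance names and a simulated "kill" (nothing here is Netflix's actual tooling or API): a chaos step injects the fault, and the assertion verifies that the recovery mechanism really repairs the pool.

```python
import random

# Hypothetical instance pool; names and states are illustrative.
instances = {"web-1": "healthy", "web-2": "healthy", "web-3": "healthy"}

def chaos_monkey(pool):
    """Deliberately 'kill' a random instance, Chaos Monkey style."""
    victim = random.choice(sorted(pool))
    pool[victim] = "dead"
    return victim

def supervisor(pool):
    """The fault-tolerance mechanism under test: respawn dead instances."""
    for name, state in pool.items():
        if state == "dead":
            pool[name] = "healthy"

random.seed(1)
chaos_monkey(instances)
supervisor(instances)
assert all(state == "healthy" for state in instances.values())
```

Running the check continuously in production, rather than once in a test environment, is what makes the approach convincing.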
Scalability
Scalability isn’t a simple “yes” or “no” label. It’s about how the system handles growth in specific circumstances.
Describing Load
Choose load parameters—the dimensions of growth you want your system to handle:
- Requests per second
- Ratio of reads to writes
- Number of simultaneous users
Twitter case study:
- Post tweet: avg 4.6k rps, peak 12k rps
- View timeline: 300k rps
Handling 12k writes/sec is easy. But 300k reads/sec is harder, and Twitter aims to deliver tweets to followers within five seconds.
Approach 1: Read-time joins. Store tweets in a global table. On a timeline request, look up everyone the user follows, fetch their tweets, and merge them. This is too slow at 300k reads/sec.
Approach 2: Write-time fan-out. Maintain a timeline cache per user. When someone tweets, insert the tweet into every follower’s cache. Reading is fast because the work is precomputed, but posting is expensive: with 75 followers on average, 4.6k tweets/sec becomes 345k cache writes/sec. This doesn’t scale for celebrities with millions of followers.
Twitter’s hybrid solution: Most users use write fan-out, but celebrity tweets are fetched separately and merged at read time.
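A minimal sketch of the hybrid approach in Python, with an illustrative celebrity cutoff and plain in-memory dictionaries standing in for the timeline caches and the global tweet table (not Twitter's actual implementation):

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000  # illustrative cutoff, not Twitter's real number

followers = defaultdict(set)   # author -> follower ids
tweets = defaultdict(list)     # author -> tweets (the global table)
timelines = defaultdict(list)  # user -> precomputed timeline cache

def post_tweet(author, text):
    tweets[author].append(text)
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        # Write-time fan-out: push the tweet into every follower's cache
        for follower in followers[author]:
            timelines[follower].append((author, text))
    # Celebrity tweets are not fanned out; they are merged at read time

def read_timeline(user, celebrities_followed):
    merged = list(timelines[user])  # cheap: precomputed by fan-out
    for celeb in celebrities_followed:
        merged.extend((celeb, text) for text in tweets[celeb])
    return merged

followers["alice"] = {"bob"}
post_tweet("alice", "hello world")
print(read_timeline("bob", []))  # [('alice', 'hello world')]
```

The design trades a bounded amount of extra read-time work (merging a handful of celebrity feeds) for avoiding millions of cache writes per celebrity tweet.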
Measuring Performance
Different systems have different metrics:
- Batch systems (Hadoop): Throughput (records/sec) or total processing time
- Online systems: Response time—measured as a distribution, not a single number
Response time = time between client sending request and receiving response
Latency = time a request spends waiting to be handled
Percentiles matter more than averages. Amazon optimizes for p99.9 because customers with the most data (from many purchases) tend to have the slowest requests—and they’re the most valuable.
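A small illustration of why the distribution matters more than the mean, using made-up response times and a hand-rolled nearest-rank percentile (production systems would use a metrics library):

```python
# Made-up response times (ms) with one slow outlier
response_times_ms = [20, 22, 21, 25, 23, 24, 22, 21, 2000, 26]

def percentile(values, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    k = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[k]

mean = sum(response_times_ms) / len(response_times_ms)
print(mean)                               # 220.4, dragged up by the outlier
print(percentile(response_times_ms, 50))  # 22, the typical user's experience
print(percentile(response_times_ms, 99))  # 2000, what a tail-latency target sees
```

The mean suggests every request takes ~220 ms, which is true of none of them; the median and p99 describe what users actually experience.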
Tail latency amplification: In systems with multiple backend calls, a single slow sub-request can slow down the entire user response.
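The amplification effect follows from basic probability, assuming slow sub-requests are independent: if each of n parallel backend calls is slow with probability p, the user request is slow with probability 1 - (1 - p)^n. A quick sketch:

```python
def prob_user_request_slow(p_slow, n_backends):
    """If any of n parallel backend calls is slow, the whole request is slow."""
    return 1 - (1 - p_slow) ** n_backends

# Even if only 1% of individual backend calls are slow...
print(round(prob_user_request_slow(0.01, 1), 3))    # 0.01
# ...a request fanning out to 100 backends is slow ~63% of the time.
print(round(prob_user_request_slow(0.01, 100), 3))  # 0.634
```

This is why a p99 that looks acceptable for one service can still dominate end-user latency in a fan-out architecture.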
Approaches to Handle Load
- Vertical scaling: Upgrade to a more powerful machine
- Horizontal scaling: Distribute load across multiple machines
- Elastic: Automatically add resources when load increases
- Stateful systems (such as databases) are much harder to distribute than stateless services
Rule of thumb: Scale databases vertically until cost or availability requirements force distribution
Maintainability
The majority of software cost is in ongoing maintenance, not initial development. Design for these three principles:
- Operability: Make it easy for operations teams to keep the system running smoothly
- Simplicity: Make it easy for new engineers to understand the system (reduce accidental complexity)
- Evolvability: Make it easy to adapt the system for unanticipated use cases in the future