Designing Data-Intensive Applications

  • Author: Martin Kleppmann
  • Reliability
    • Two ways things can go wrong
      • A fault is when a component inside the system fails to function properly.
      • A failure is when the system as a whole breaks, as a result of one or more faults.
    • Prevent faults when you can, but it's not possible to prevent 100% of faults.
    • The big idea of reliability is to keep faults from turning into failures.
    • The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment — and while that assumption is usually true, it eventually stops being true for some reason
    • Chaos Monkey style testing helps here. As with backup, you only know if your error handling really works if you are testing it.
    • Monitoring can help a lot to catch small problems before they balloon into big problems.
  • Scalability
    • To talk about scalability we have to think about what we are scaling and what the load on the system actually comprises.
    • Describing Load
      • Is it read-heavy? Write-heavy? Are there a lot of active simultaneous users?
      • What kind of SLAs do we have?
        • Both internal and external/contractual
      • Do we care more about the average case or the tails?
      • How do changes that we make affect the different types of load in the system?
      • The nature of the load and what we want to optimize for makes a big difference in the design of the system.
      • In extreme cases, a hybrid design may even work best.
        • Twitter uses an approach in which each user has a timeline cache, and when someone tweets, it is replicated out to the timeline cache of all their followers at the time that the tweet is made, even if the follower isn't currently using Twitter.
        • But this falls apart at the top where some users may have tens of millions of followers, so for those specific users they use a read-based system, where their tweets are fetched when the follower actually opens Twitter and then merged into the rest of the timeline cache.
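        • As a toy sketch of that hybrid approach (the data structures, function names, and follower threshold below are invented for illustration, not Twitter's actual implementation): fan out on write for ordinary accounts, and merge high-follower accounts in at read time.

```python
from collections import defaultdict

# Hypothetical hybrid fan-out: ordinary users' tweets are pushed into each
# follower's timeline cache at write time; accounts over the follower
# threshold are instead merged in when a follower reads their timeline.
CELEBRITY_THRESHOLD = 1_000_000

followers = defaultdict(set)             # author -> ids of their followers
timeline_cache = defaultdict(list)       # user -> (timestamp, author, text) pushed on write
celebrity_tweets = defaultdict(list)     # celebrity author -> their own tweets
followed_celebrities = defaultdict(set)  # user -> celebrities they follow

def post_tweet(author, timestamp, text):
    tweet = (timestamp, author, text)
    if len(followers[author]) < CELEBRITY_THRESHOLD:
        for follower in followers[author]:        # fan out on write
            timeline_cache[follower].append(tweet)
    else:
        celebrity_tweets[author].append(tweet)    # defer the work to read time

def read_timeline(user):
    merged = list(timeline_cache[user])
    for celeb in followed_celebrities[user]:      # fan out on read, celebrities only
        merged.extend(celebrity_tweets[celeb])
    return sorted(merged, reverse=True)           # newest first
```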
    • Describing Performance
      • When you increase load, how is the performance of the system affected?
      • When you increase load, how much do you need to increase the underlying resources (CPU, RAM, etc.) to keep performance the same?
      • Which performance metric is relevant?
        • Response time
        • Latency to start serving a response
        • Throughput of a stream
      • Of course all these are distributions as well and our thinking needs to encompass that.
        • Your best users may have the most data or the most complex queries, and thus experience the longest response times. This is an especially good reason to pay attention to the tail of the distribution.
        • At some point the tail of the distribution is dominated by random events that you can't really control. AWS decided that optimizing for the 99.9th percentile to meet their response time SLA was useful, but optimizing for the 99.99th percentile (1 in 10,000 requests) was not.
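        • A quick way to look at those tail percentiles, assuming you've collected per-request response times over some window (nearest-rank method; the sample numbers are made up):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; p is a fraction such as 0.99."""
    ranked = sorted(samples)
    index = max(0, math.ceil(p * len(ranked)) - 1)
    return ranked[index]

response_times_ms = [12, 14, 15, 15, 16, 18, 22, 25, 40, 320]
for p in (0.5, 0.9, 0.99, 0.999):
    print(f"p{p * 100:g} = {percentile(response_times_ms, p)} ms")
```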
      • Head-of-Line Blocking
        • When multiple slow requests are running and blocking faster requests from running, so those faster requests end up taking a long time as well.
    • Dealing with Load
      • Each factor of 10 increase in load tends to require a change in architecture.
      • The architecture depends a lot on how the load is distributed, read-heavy vs. write-heavy, OLAP vs OLTP, and so on.
  • Maintainability
    • Operability — Keeping things running smoothly shouldn’t be a chore.
      • Monitoring
      • Tracking down problems
      • Updating systems software, OS, etc.
      • How do systems affect and depend on each other?
      • Anticipating future problems, and having the visibility to do so. (Scaling up servers when load is getting high, but before it becomes a problem.)
      • Good practices for deployment, configuration management, etc.
      • Rearranging how applications are deployed in the system and other complex tasks
      • Avoiding dependency on individual machines
      • Maintaining security as things change
      • Defining processes for all the above
      • Writing knowledge down
    • Simplicity — Easy for new engineers to understand the system and its components.
      • Sources of complexity
        • Explosion of the state space
        • Tightly-coupled modules
        • Tangled dependencies
        • Inconsistent naming
        • Performance problems solved via hacks that never go away
        • Special cases to work around issues in other modules
      • Simplicity makes it easier to write and deploy performant, correct software.
      • Good abstractions enable simplicity of the layers above. Bad, leaky abstractions can hurt more than they help.
    • Evolvability — Easy to make changes in the future and adapt to new requirements.
      • This is closely linked to the simplicity of the system. Simpler systems are easier to change.
  • Replication
    • This chapter discusses replication in the case where the entire dataset can be stored on each machine.
    • Replicating immutable data is pretty easy: just copy the data and then leave it there until it's not needed anymore. Replicating changes is the hard part.
    • Algorithms
      • Single-Leader Replication
        • All writes go to a single leader, which applies the writes locally and also sends the writes to all the followers.
        • This simplifies things because you don't have to deal with different replicas receiving simultaneous conflicting writes.
      • Multi-Leader Replication
        • Since all writes go to the same place in a single-leader system, it can become a performance bottleneck and a reliability bottleneck; if the leader goes down, no writes take place until the system can detect it and elect a new leader.
        • If your system is contained in just a single datacenter, then there is usually no need to go to a multi-leader system.
        • When the system spans multiple datacenters, you’ll often want to have a leader in each datacenter, so that each write request can go to a server within the same datacenter.
          • In this model, each follower follows a particular leader — generally the one in its datacenter. Then the leaders replicate with each other, and when a leader receives a write from another leader, it processes it and replicates it to its followers just like any other write.
        • Multi-leader synchronization is almost always asynchronous, and often performed with the help of external tools to handle the inter-datacenter communication. So each datacenter is operating independently, in a sense.
        • Offline Operation
          • Allowing client devices to operate offline is very similar to a multi-leader replication system. Each client device is, in effect, a leader on its own, albeit for a subset of the entire data. And it makes writes which then need to be reconciled with the rest of the system when it comes back online.
        • Write Conflicts
          • The main issue with multi-leader replication is when conflicting writes arrive at different leaders at roughly the same time.
          • They need to be resolved, either by merging them (as with a CRDT) or by determining which write takes effect, and ensuring that all the leaders end up making the same decision on which one takes effect.
          • The easiest way to handle this is avoiding conflicts completely, such as by ensuring that all writes for a particular key go to the same leader. This isn't always possible though.
          • If you can't avoid conflicts or use a data type that can natively resolve conflicts, then you have to figure out how to order concurrent writes. A lot of systems use a timestamp, a random number, or other identifier, and just order the writes by that identifier. A logical or hybrid clock is usually better.
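          • A tiny sketch of that "order by identifier" approach, last write wins, assuming each write carries a made-up (logical clock, node id) tag; note that the losing concurrent write is silently discarded, which is LWW's big drawback:

```python
# Last-write-wins sketch: every write carries a (logical_clock, node_id) tag
# and each leader deterministically keeps the write with the highest tag.
def resolve(current, incoming):
    """current/incoming are (tag, value) pairs; returns the surviving write."""
    if current is None or incoming[0] > current[0]:
        return incoming
    return current

a = ((5, "leader-1"), {"name": "Alice"})
b = ((5, "leader-2"), {"name": "Alicia"})  # concurrent: same clock tick, different leader
# Every leader applies the same rule, so they all converge on the same value.
assert resolve(a, b) == resolve(b, a) == b
```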
          • Some database systems allow you to specify custom conflict resolution logic. Sometimes this can be run by the database itself. This type of logic often ends up handling a ton of edge cases and is bug-prone. If you find yourself doing this, it's better to look into a CRDT or similar solution, if that fits your needs.
        • Things like integrity constraints, autoincrementing types, and so on can cause trouble in multi-leader setups, so they need to be used with care, if at all.
        • Multi-Leader Topologies
          • Usually the nodes all replicate directly to each other, which is called an all-to-all topology.
          • In a circular topology, the nodes form a ring. Each node sends replication data to the next node in the ring and receives it from the previous node in the ring.
          • A star topology designates certain nodes as replication hubs. All other nodes replicate writes to the hubs, which then send them out to the rest of the network. This forms a replication tree at larger scales.
          • All-to-all topologies are the most resilient, but can run into problems with writes arriving in the wrong order at some nodes if the network connection speed isn't the same throughout the system.
      • Leaderless Replication
        • These systems tend to be much more oriented around eventual consistency, and nodes may not even replicate between each other, depending on the design.
        • Without leaders to be the source of truth, it's possible for some nodes to miss writes while unavailable and then just never receive certain updates. Some systems have anti-entropy processes that scan the database for stale values and update them.
        • Leaderless systems often move more intelligence to the client, which writes to and reads from multiple replicas itself instead of relying on a leader to do that.
          • Some clients will also perform read repair. If they read a value from three nodes and detect that one has a stale value, they will write the current value to the stale node to fix it.
        • Quorums
          • Defining the number of nodes we need to acknowledge a write or return a value for a read gives us certain properties of the system. These numbers are called the quorum numbers. Generally we want w + r > n, where w is the write quorum, r is the read quorum, and n is the total number of nodes that should hold a particular key. This lets us ensure that at least one node will have a current value when we read.
          • Commonly w and r are set to (n + 1)/2.
          • When this total quorum condition holds, we can still process writes if w < n and still process reads if r < n, so long as at least w (for writes) or r (for reads) of the n nodes are available.
          • A sloppy quorum can be used to allow writes even when you can't reach the nodes that normally would accept the writes. It trades better availability for less consistency during network partitions. Some sort of handoff process moves the writing back to the proper nodes once they come back.
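          • A minimal single-process sketch of the w + r > n idea (in-memory "replicas", writer-supplied versions, no failure handling or sloppy quorums):

```python
import random

N, W, R = 3, 2, 2                      # w + r > n, so read and write sets overlap
replicas = [dict() for _ in range(N)]  # each replica stores key -> (version, value)

def quorum_write(key, version, value):
    for replica in random.sample(replicas, W):   # reach any w replicas
        replica[key] = (version, value)          # stands in for a network call

def quorum_read(key):
    responses = [r.get(key) for r in random.sample(replicas, R)]
    found = [resp for resp in responses if resp is not None]
    if not found:
        return None
    return max(found, key=lambda vv: vv[0])[1]   # freshest version wins

quorum_write("user:1", version=1, value="alice")
quorum_write("user:1", version=2, value="alice smith")
assert quorum_read("user:1") == "alice smith"    # overlap guarantees we see version 2
```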
        • Staleness should be monitored to make sure things aren't falling too far behind. This can be a sign of a larger problem.
        • With no leaders, concurrent writes can be a problem once again.
    • Synchronous or Asynchronous Replication
      • When accepting a write, do you tell the client right away after writing it locally, and then send the writes to the replicas, or do you wait until all replicas have acknowledged the write to respond to the client?
      • Synchronous is slower (sometimes much slower) but eliminates any possibility that a client will fail to read its own writes if it writes to the leader and then reads the same data from a replica.
        • This also leads to big problems during network partitions or node failures. The entire system can come to a halt. (i.e. CAP theorem).
        • Often this is implemented by doing synchronous replication with a single follower per write and replicating asynchronously with the others.
            • This way the client can be assured that the write has been saved on more than one replica, while maintaining some speed.
          • It does bring back the read-your-writes problem though.
      • Asynchronous allows higher write throughput but the client has to be prepared that its reads may not reflect what it just wrote.
        • In large-scale leaderless systems, each client is sometimes set up to communicate with only a single node, so that it always reads its own writes.
        • Also in the case of a leader failure, some writes that were not yet replicated may be lost.
      • "chain replication" is one variant of replication that is mostly synchronous but still provides good performance.
    • Consistency Models
      • Read-your-writes
        • A common way of ensuring this takes place is to figure out if a particular read might be of data that the user has written, and always read from the leader in that case.
        • For example, always read a user’s own profile data from the leader, but other profiles can be read from any replica.
        • This doesn’t work as well if the set of potentially written data is the majority of the data.
          • You can work around this by keeping track of what you’ve actually changed and requesting reads for those items from the leader.
          • Sending a timestamp (preferably from a logical or hybrid clock) and ensuring that the read is served by a replica that is up to date with that timestamp works too, when the database makes it possible (a small sketch follows at the end of this list).
        • Cross-device read consistency can be an issue too, like when the same user is using a laptop and phone to access the application at the same time.
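        • A minimal single-process sketch of that timestamp approach (the Leader, Replica, and Client classes and their methods are invented for illustration): the client remembers the commit timestamp of its own last write and only reads from a replica that has applied the log at least that far, falling back to the leader otherwise.

```python
# Made-up classes for illustration: the leader hands out monotonically
# increasing commit timestamps, replicas apply the leader's log (possibly
# lagging), and the client routes reads based on its own last write.
class Leader:
    def __init__(self):
        self.data, self.log, self.commit_ts = {}, [], 0

    def write(self, key, value):
        self.commit_ts += 1
        self.data[key] = value
        self.log.append((self.commit_ts, key, value))
        return self.commit_ts                      # the client remembers this

    def read(self, key):
        return self.data.get(key)

class Replica:
    def __init__(self, leader):
        self.leader, self.data, self.applied_ts = leader, {}, 0

    def sync(self):
        """Apply any log entries this replica hasn't seen yet."""
        for ts, key, value in self.leader.log:
            if ts > self.applied_ts:
                self.data[key] = value
                self.applied_ts = ts

    def read(self, key):
        return self.data.get(key)

class Client:
    def __init__(self, leader, replicas):
        self.leader, self.replicas = leader, replicas
        self.last_write_ts = 0                     # timestamp of our own last write

    def write(self, key, value):
        self.last_write_ts = self.leader.write(key, value)

    def read(self, key):
        for replica in self.replicas:
            if replica.applied_ts >= self.last_write_ts:
                return replica.read(key)           # fresh enough to include our writes
        return self.leader.read(key)               # no replica has caught up yet

leader = Leader()
replica = Replica(leader)
client = Client(leader, [replica])
client.write("profile:me", "new bio")              # replica hasn't synced yet
assert client.read("profile:me") == "new bio"      # served by the leader
replica.sync()
assert client.read("profile:me") == "new bio"      # now served by the replica
```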
      • Monotonic reads
        • When reads are spread across multiple different replicas, it’s possible for the reads to actually go backwards in time if the later reads go to a replica that is farther behind in syncing to the leader.
        • Monotonic read guarantees avoid this, so a user will never see a state older than what a previous query returned. This is a stronger guarantee than eventual consistency, but weaker than strong consistency.
        • Making sure that all reads for a user go to the same replica can prevent this.
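        • One simple way to pin a user to a replica is to hash the user id (a sketch; the replica names are placeholders, and if that replica fails the user has to be rerouted, which can break monotonicity):

```python
import hashlib

replicas = ["replica-1", "replica-2", "replica-3"]

def replica_for(user_id):
    """Deterministically map a user to one replica so successive reads
    never hop to a replica that is further behind."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

assert replica_for("user-42") == replica_for("user-42")  # always the same choice
```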
      • Causality
        • Causality violations can happen when replicating from multiple sources, and the replication lag is different between them.
        • In this case you may see writes that were made in response to a write that you don’t see yet. For example, a comment with a missing parent.
        • Ensuring causality is known as “consistent prefix reads”. Writes are seen in the same order that they actually took place.
      • Of course, how much you care about all these things determines how much you care about replication lag.
      • Distributed transactions are one way to get around this. They are difficult to implement correctly but it can be done.
    • Adding New Followers
      • Since the new follower probably starts with no data, you need a way to bootstrap it while still keeping things consistent.
      • Usually this is done by the follower requesting a snapshot of the state. The leader sends this snapshot along with the position in the log that the snapshot represents.
        • Many databases have a way to take a snapshot like this that is consistent in time, but without needing to lock the entire database.
      • Once the follower has received the snapshot, it then asks the leader to send it all writes that have taken place since that position in the log.
      • Finally it catches up with the leader and can replicate normally.
      • Practically, the steps involved in this vary quite a lot between databases. Some make it really easy while others involve a lot of complex steps.
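      • A minimal sketch of that snapshot-plus-catch-up handshake (the "leader" here is just a dict plus an append-only log of (position, key, value) entries; all names are made up):

```python
leader_state = {"a": 1, "b": 2}
leader_log = [(1, "a", 1), (2, "b", 2)]           # already reflected in the state above

def take_snapshot():
    """Return a consistent copy of the state plus the log position it matches."""
    return dict(leader_state), len(leader_log)

def bootstrap_follower():
    # 1. Copy a snapshot of the leader's state, noting its log position.
    follower_state, position = take_snapshot()
    # ...meanwhile the leader keeps accepting writes...
    leader_state["c"] = 3
    leader_log.append((3, "c", 3))
    # 2. Request every log entry after the snapshot's position and apply it.
    for _pos, key, value in leader_log[position:]:
        follower_state[key] = value
    # 3. From here on, the follower keeps tailing the log like any replica.
    return follower_state

assert bootstrap_follower() == {"a": 1, "b": 2, "c": 3}
```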
    • Outages
      • The follower keeps track of its position in the replication log, and so when a node reboots or the follower restarts for some reason, it can just request a catch-up from that location.
      • When a leader fails, there needs to be a way to "failover" to a new leader.
        • This can happen manually, where a person chooses a new node to be leader, or it can happen automatically.
      • Choosing a new leader
        • Normally there's some sort of heartbeat between the replicas, and so when a heartbeat is missing for too long it is assumed to be dead. It's also usually possible to tell the leader to give up its leader status so that it can go down for maintenance.
          • Choice of heartbeat timeout is a hard problem. Too long and the system remains unwritable in a failure situation. Too short and you may bounce around leaders too often.
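        • A bare-bones sketch of heartbeat-based failure detection; the timeout constant is the knob described above, and its value here is arbitrary:

```python
import time

HEARTBEAT_TIMEOUT_S = 10.0   # too long: slow failover; too short: spurious failovers
last_heartbeat = {}          # node id -> monotonic time of the last heartbeat received

def record_heartbeat(node_id):
    last_heartbeat[node_id] = time.monotonic()

def suspected_dead(node_id):
    seen = last_heartbeat.get(node_id)
    return seen is None or time.monotonic() - seen > HEARTBEAT_TIMEOUT_S

record_heartbeat("leader-1")
print(suspected_dead("leader-1"))   # False right after a heartbeat arrives
```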
        • Choosing a new leader can involve either a decentralized election process or a controller node previously chosen to be the one that elects leaders. Generally the replica with the latest position in the log is the new leader.
        • Clients then need to reroute to the new leader, and if the old leader node comes back up still thinking it is the leader, it needs a way to realize that it is not the leader anymore.
      • In a network partition case, it can be tricky to synchronize the writes received by the old leader and the new leader during the partition. Often the old leader just loses its writes but this isn't a great solution.
    • Implementing Replication
      • Statement-based replication involves the leader sending each write command in (roughly) its original form to the followers. This can cause issues with any nondeterministic part of the statement.
        • Does it call a function that generates a timestamp, a UUID, or a random number?
        • Anything with an autoincrementing column also needs all operations to happen in the same order, even if they wouldn't otherwise interact.
        • These problems end up being tricky to solve completely and spawn a ton of edge cases, so this isn't used so much anymore.
      • Write-ahead log shipping
        • Since many databases are using a WAL anyway, one way is to just ship the WAL entries to the followers.
        • The main downside of this method is that the WAL often describes the data on a very low level, and so it's difficult or impossible to ship updates between replicas using different storage engines or software versions.
      • Logical Replication
        • This uses a special data format for replication that describes writes at (usually) the row level. Since this is not linked to a particular version or data format, it's easier to deal with version mismatches, and makes upgrades easier.
        • Logical replication is also nice because it allows external programs to act as followers and do things like database write metrics, change capture, auditing, and so on.
      • Trigger-Based Replication
        • This involves triggers set up by the database administrator, for when you need special logic to handle the replication. Usually this is not what you want, since it's slower and more bug-prone than other methods, but it can be handy in special cases.
  • Consistency and Consensus
    • It’s useful to define certain guarantees, build a system that implements those guarantees, and then allow applications that rely on them to be built on top of that system.
      • e.g. in ACID databases, the transaction abstracts over a bunch of different desired properties: atomicity, fault tolerance, etc.
      • The same technique applies in the distributed world.
    • In distributed applications, the big questions are around consensus and consistency, so we can build an abstraction around that into our database.
    • Part of building these abstractions is deciding what limitations and levels of guarantees they will have.
    • Eventual consistency
      • Guaranteed that if you stop writing to a key, eventually, all nodes will have the same value.
      • Technically there’s no bound on how long that will take though, and you won’t necessarily even read your own writes.
      • This is also a model that requires careful use by the application. It’s easy to make certain assumptions that hold in testing but fall apart under heavy load.
    • Linearizability
      • Often known as strong consistency. This basically makes the system appear as if it’s not distributed. Any read from any node will only ever get the latest write to any node, just as if there was only a single copy of the data.
      • Writes are not considered complete until they’ve replicated to all other nodes.
        • Specifically, before the write completes, all read operations will return the old value, and after the write completes, all reads will return the new value.
      • Note that this doesn’t make any guarantees about the ordering of concurrent writes and reads. Only that the view of time always moves forward. Once a value is read, no other reads will see the old value again.
      • Some models explicitly drop this guarantee in certain cases. Snapshot isolation, for example, is not strictly linearizable because a client in a transaction will not see writes that come in after the transaction begins.
      • Compare-and-set operations are useful primitives here too. Set a value only if the existing value is equal to some other expected value; otherwise return an error.
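      • A single-node sketch of that primitive (a lock keeps the check-then-write atomic locally; making it behave this way across replicas is exactly what linearizability buys you):

```python
import threading

class Register:
    def __init__(self, value=None):
        self._value, self._lock = value, threading.Lock()

    def compare_and_set(self, expected, new):
        with self._lock:                 # atomic check-then-write
            if self._value != expected:
                return False             # someone else changed it first
            self._value = new
            return True

lock_owner = Register(None)
assert lock_owner.compare_and_set(None, "node-a")        # node A grabs the lock
assert not lock_owner.compare_and_set(None, "node-b")    # node B sees it is taken
```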
      • Uses of linearizability
        • Locking and leader election — once a node grabs the lock or leader status, no other node should be able to see it not grabbed.
        • Uniqueness constraints — Without linearizability it becomes possible for two concurrent writes to break uniqueness constraints.
        • Anything that requires all nodes to agree on a single value at some point in time for correctness.
        • Any time when there might be more than one way for information to pass between nodes.
          • e.g. Uploading a file to a storage service and enqueueing a task to work on that file.
      • Implementing Linearizability
        • Single-leader systems can be linearizable when you read from the leader (and assuming that your idea of which node is the leader is actually correct).
        • Consensus algorithms are always linearizable because they take extra effort to do so.
        • Multi-leader replication is not linearizable since multiple leaders can process a write to the same key.
        • Leaderless systems are usually not linearizable. Strict quorum systems can mostly achieve it if read repair is performed synchronously before a read returns and writes also wait for a full quorum; this has significant performance implications, though, and still doesn't permit compare-and-set.
      • CAP Theorem
        • While this is usually phrased as "any two of consistency, availability, or partition tolerance," this isn't a helpful definition.
        • A CP or AP system can achieve both consistency and availability when there is no partition if it's designed for it, so the real question is: in a network partition, does the system retain consistency or does it remain available?
        • CAP also is a narrow set of concerns: linearizability during network partitions. It's a good starting point, but in real systems you also have network delays, disappearing nodes (from crashes, etc.), and other concerns.
      • Linearizability is also rarer than one might think. Even per-core caches and other performance optimizations can make memory access in a single computer non-linearizable without explicit synchronization.
      • Likewise, in a network-distributed system linearizability can be slow, and so it's often dropped or limited to retain good system performance.
    • Ordering Guarantees
      • The primary purpose of ordering guarantees is preserving causality. To some approximation we don't really care about ordering of two events that aren't causally linked.
        • Systems that obey causality are called causally consistent.
        • Causality defines a partial order, not a total order. Concurrent operations are not comparable.
      • Because linearizability defines a total ordering over events, it also preserves causality.
      • But in many cases, causal consistency is really what you want and linearizability is overkill. This is good news, since causal consistency can be implemented with much higher performance.
      • Version vectors are one way of tracking causality across the system. For each write, the system also sends the last version of the key that it knew about, and this lets the system determine which write to apply in the case of concurrent writes.
      • Frequently this instead becomes a single version number using Lamport clocks or Hybrid Logical Clock, which loses the direct tracking of causality but greatly decreases the amount of metadata to track.
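      • A minimal Lamport clock sketch: one counter per node, with the node id as a tiebreaker, so every event gets a position in a single total order (which, as noted above, no longer distinguishes concurrency from causality):

```python
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Call for every local event (e.g. each write); returns its timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, remote_timestamp):
        """Call when a message arrives carrying another node's timestamp."""
        remote_counter, _ = remote_timestamp
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)

a, b = LamportClock("A"), LamportClock("B")
t1 = a.tick()        # (1, "A")
t2 = b.observe(t1)   # (2, "B"): causally after t1, and it also sorts after t1
assert t2 > t1
```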
      • Ordering by such timestamps still doesn't work for enforcing things like uniqueness constraints, though, since it only tells you after the fact which write should win; by then you may already have allowed two users to grab the same username (or whatever the constraint is).
      • For this we need total order broadcast, which requires two properties for messages traveling between nodes:
        • Reliable delivery — every message is seen by every node
        • Totally ordered delivery — Messages arrive in the same order
      • This basically puts you in the CP style of system, where you handle network partitions by not allowing writes until they heal.
      • Linearizable compare-and-set operations can be built upon a total order broadcast, for example by using it as an append-only log.
        • Writing a value becomes appending a message to the log.
        • Then you read the incoming log messages until you see your own message again.
        • If you saw another message writing to that key before your own message, then you lost and consider your write failed. Otherwise your write succeeded.
      • This works because all nodes see the messages in the same order.
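      • A single-process sketch of that recipe, with an in-memory append-only list standing in for a real total order broadcast channel (TotalOrderLog and CasNode are made-up names): every node consumes the log in the same order and applies the same deterministic rule, so they all agree on which compare-and-set wins.

```python
import itertools
import threading

class TotalOrderLog:
    """Stand-in for total order broadcast: an append-only log that every
    node reads in the same order (in-memory and single-process here)."""
    def __init__(self):
        self._entries, self._lock = [], threading.Lock()

    def append(self, entry):
        with self._lock:
            self._entries.append(entry)

    def read_from(self, position):
        with self._lock:
            return list(self._entries[position:])

class CasNode:
    """Compare-and-set on a single register, built on top of the log."""
    _ids = itertools.count()

    def __init__(self, log):
        self.log = log
        self.position = 0    # how far into the log this node has consumed
        self.value = None    # the register's current value, as this node sees it
        self.node_id = next(self._ids)

    def compare_and_set(self, expected, new):
        msg = {"node": self.node_id, "expected": expected, "new": new}
        self.log.append(msg)
        # Consume the log in order until we reach our own message.
        for entry in self.log.read_from(self.position):
            self.position += 1
            applied = entry["expected"] == self.value
            if applied:
                self.value = entry["new"]     # this CAS wins
            if entry is msg:                  # we've reached our own message
                return applied
        return False                          # unreachable: our message is in the log

log = TotalOrderLog()
a, b = CasNode(log), CasNode(log)
assert a.compare_and_set(None, "a")           # A's message is ordered first, so it wins
assert not b.compare_and_set(None, "b")       # B sees A's write before its own and loses
```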
      • You can perform linearizable reads in a somewhat similar fashion. Since the log is asynchronous, you don't always know if you're up to date, but if you send a message out to the log, when you see that message come back then you know that you're up to date as of the time that you sent the message. Some logs also let you do a synchronous sideband query of what the latest position is, so then you just have to wait until you see the message with that position arrive.
      • Conversely, total order broadcast can also be implemented on top of a linearizable register with an atomic increment-and-get operation. For each message that you want to send, increment the register, and then attach the returned number to your message.
      • Both this type of register and total order broadcast end up being equivalent to consensus.
  • Highlights
    • Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. (Location 253)
    • Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. (Location 260)
    • there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (Location 294)
    • The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment — and while that assumption is usually true, it eventually stops being true for some reason (Location 316)
    • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right. (Location 331)
    • Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue. (Location 345)
    • You still need an in-memory index to tell you the offsets for some of the keys, but it can be sparse: one key for every few kilobytes of segment file is sufficient, because a few kilobytes can be scanned very quickly. (Location 2063)
    • In size-tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables. In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate “levels,” which allows the compaction to proceed more incrementally and use less disk space. (Location 2129)
    • is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log). This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. (Location 2190)
    • Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme [21]. A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control, as we shall see in “Snapshot Isolation and Repeatable Read” (Location 2203)
    • LSM-trees are typically able to sustain higher write throughput than B-trees, partly because they sometimes have lower write amplification (although this depends on the storage engine configuration and workload), and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree (Location 2245)
    • A downside of log-structured storage is that the compaction process can sometimes interfere with the performance of ongoing reads and writes. (Location 2258)
    • The impact on throughput and average response time is usually small, but at higher percentiles (see “Describing Performance”) the response time of queries to log-structured storage engines can sometimes be quite high, and B-trees can be more predictable (Location 2261)
    • Typically, SSTable-based storage engines do not throttle the rate of incoming writes, even if compaction cannot keep up, so you need explicit monitoring to detect this situation (Location 2270)
    • An advantage of B-trees is that each key exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. This aspect makes B-trees attractive in databases that want to offer strong transactional semantics: in many relational databases, transaction isolation is implemented using locks on ranges of keys, and in a B-tree index, those locks can be directly attached to the tree (Location 2273)
    • A compromise between a clustered index (storing all row data within the index) and a nonclustered index (storing only references to the data within the index) is known as a covering index or index with included columns, which stores some of a table’s columns within the index (Location 2316)

Thanks for reading! If you have any questions or comments, please send me a note on Twitter.