Book - Designing Data-Intensive Applications

Written
  • Author: Martin Kleppmann
  • Reliability
    • Two ways things can go wrong
      • A fault is when a component inside the system fails to function properly.
      • A failure is when the system as a whole breaks, as a result of one or more faults.
    • Prevent faults when you can, but it's not possible to prevent 100% of faults.
    • The big idea of reliability is to keep faults from turning into failures.
    • The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment — and while that assumption is usually true, it eventually stops being true for some reason
    • Chaos Monkey-style testing helps here. As with backups, you only know whether your error handling really works if you exercise it deliberately and regularly (a sketch of fault injection follows this list).
    • Monitoring can help a lot to catch small problems before they balloon into big problems.
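    • A minimal sketch of what fault injection can look like, assuming a hypothetical FlakyUserStore wrapper and fetch_user_with_retry helper (illustrative names, not from the book):

```python
import random

class TransientNetworkError(Exception):
    """Stand-in for the kind of fault we want to prove we can tolerate."""

class FlakyUserStore:
    """Wraps a real store and injects faults on a fraction of calls, so the
    error-handling path is exercised continuously instead of only in theory."""

    def __init__(self, real_store, fault_rate=0.2):
        self.real_store = real_store
        self.fault_rate = fault_rate

    def fetch_user(self, user_id):
        if random.random() < self.fault_rate:
            raise TransientNetworkError(f"injected fault for user {user_id}")
        return self.real_store.fetch_user(user_id)

def fetch_user_with_retry(store, user_id, attempts=3):
    """The fault-tolerance machinery under test: retry transient faults so an
    individual fault does not become a user-visible failure."""
    for attempt in range(attempts):
        try:
            return store.fetch_user(user_id)
        except TransientNetworkError:
            if attempt == attempts - 1:
                raise  # retries exhausted: the fault surfaces as a failure

class InMemoryStore:
    def fetch_user(self, user_id):
        return {"id": user_id, "name": f"user-{user_id}"}

if __name__ == "__main__":
    store = FlakyUserStore(InMemoryStore(), fault_rate=0.2)
    # With retries, most injected faults are absorbed silently; without them,
    # roughly 20% of these calls would fail outright. Occasionally all retries
    # are exhausted, which is exactly the fault-to-failure case we want to see.
    for uid in range(10):
        print(fetch_user_with_retry(store, uid))
```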
  • Scalability
    • To talk about scalability we have to think about what we are scaling and what the load on the system actually comprises.
    • Describing Load
      • Is it read-heavy? Write-heavy? Are there a lot of active simultaneous users?
      • What kind of SLAs do we have?
        • Both internal and external/contractual
      • Do we care more about the average case or the tails?
      • How do changes that we make affect the different types of load in the system?
      • The nature of the load and what we want to optimize for makes a big difference in the design of the system.
      • In extreme cases, a hybrid design may even work best (a sketch follows below).
        • Twitter gives each user a timeline cache: when someone tweets, the tweet is replicated into the timeline caches of all their followers at the time the tweet is made, even if a follower isn't currently using Twitter.
        • This falls apart at the top end, where some users have tens of millions of followers. For those specific accounts, Twitter switches to a read-based approach: their tweets are fetched when a follower actually opens Twitter and then merged into the rest of the cached timeline.
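        • A rough sketch of this hybrid fan-out, assuming in-memory dicts and an illustrative follower threshold (none of the names or numbers are Twitter's actual implementation):

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 2  # illustrative cutoff; real systems use a much larger number

followers = defaultdict(set)          # author_id -> ids of their followers
following = defaultdict(set)          # user_id -> ids of authors they follow
timeline_cache = defaultdict(list)    # user_id -> tweets pushed at write time
celebrity_tweets = defaultdict(list)  # author_id -> tweets fetched at read time

def follow(user_id, author_id):
    followers[author_id].add(user_id)
    following[user_id].add(author_id)

def post_tweet(author_id, timestamp, text):
    tweet = (timestamp, author_id, text)
    if len(followers[author_id]) < CELEBRITY_THRESHOLD:
        # Fan-out on write: replicate into every follower's timeline cache now.
        for follower_id in followers[author_id]:
            timeline_cache[follower_id].append(tweet)
    else:
        # Fan-out on read: store once; followers pull it when they open their timeline.
        celebrity_tweets[author_id].append(tweet)

def read_timeline(user_id):
    # Merge the precomputed cache with tweets from followed high-fan-out authors.
    merged = list(timeline_cache[user_id])
    for author_id in following[user_id]:
        merged.extend(celebrity_tweets[author_id])
    return sorted(merged, reverse=True)  # newest first

if __name__ == "__main__":
    for fan in ("alice", "bob", "carol"):
        follow(fan, "celebrity")        # 3 followers >= threshold: read path
    follow("alice", "dave")             # 1 follower < threshold: write path

    post_tweet("dave", 1, "hello")          # pushed straight into alice's cache
    post_tweet("celebrity", 2, "big news")  # stored once, pulled at read time
    print(read_timeline("alice"))  # [(2, 'celebrity', 'big news'), (1, 'dave', 'hello')]
```
The write path does more work per tweet but keeps reads cheap; the read path keeps writes cheap for accounts whose fan-out would otherwise be enormous.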
    • Describing Performance
      • When you increase load, how is the performance of the system affected?
      • When you increase load, how much do you need to increase the underlying resources (CPU, RAM, etc.) to keep performance the same?
      • Which performance metric is relevant?
        • Response time
        • Latency to start serving a response
        • Throughput of a stream
      • Of course, all of these metrics are distributions, and our thinking needs to account for that.
        • Your best users may have the most data or the most complex queries, and thus experience the longest response times. This is an especially good reason to pay attention to the tail of the distribution.
        • At some point the tail of the distribution is dominated by random events that you can't really control. AWS decided that optimizing for the 99.9th percentile to meet their response-time SLA was worthwhile, but that optimizing for the 99.99th percentile (1 in 10,000 requests) was not (a sketch of computing these percentiles follows this list).
      • Head-of-Line Blocking
        • A few slow requests hold up the processing of requests queued behind them, so even fast requests end up with long response times as seen by the client.
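      • A minimal sketch of computing tail percentiles from a window of measured response times (the sample data, percentile levels, and nearest-rank method are illustrative choices, not from the book):

```python
import math
import random

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the observations are less than or equal to it."""
    if not sorted_values:
        raise ValueError("no observations")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

if __name__ == "__main__":
    random.seed(0)
    # Simulated response times in ms: most requests are fast, but a small
    # fraction hit a slow path (a heavy account, head-of-line blocking, etc.).
    samples = [random.gauss(100, 20) for _ in range(10_000)]
    samples += [random.gauss(1500, 300) for _ in range(50)]
    samples.sort()

    for p in (50, 95, 99, 99.9, 99.99):
        print(f"p{p}: {percentile(samples, p):.0f} ms")
    # The mean hides the tail that your heaviest users actually experience.
    print(f"mean: {sum(samples) / len(samples):.0f} ms")
```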
    • Dealing with Load
      • Each factor of 10 increase in load tends to require a change in architecture.
      • The architecture depends a lot on how the load is distributed, read-heavy vs. write-heavy, OLAP vs OLTP, and so on.
  • Maintainability
    • Operability — Keeping things running smoothly shouldn’t be a chore.
      • Monitoring
      • Tracking down problems
      • Updating systems software, OS, etc.
      • How do systems affect and depend on each other?
      • Anticipating future problems, and having the visibility to do so. (Scaling up servers when load is getting high, but before it becomes a problem.)
      • Good practices for deployment, configuration management, etc.
      • Rearranging how applications are deployed in the system and other complex tasks
      • Avoiding dependency on individual machines
      • Maintaining security as things change
      • Defining processes for all the above
      • Writing knowledge down
    • Simplicity — Easy for new engineers to understand the system and its components.
      • Sources of complexity
        • Explosion of the state space
        • Tightly-coupled modules
        • Tangled dependencies
        • Inconsistent naming
        • Performance problems solved via hacks that never go away
        • Special cases to work around issues in other modules
      • Simplicity makes it easier to write, deploy, and maintain performant, correct software.
      • Good abstractions enable simplicity of the layers above. Bad, leaky abstractions can hurt more than they help.
    • Evolvability — Easy to make changes in the future and adapt to new requirements.
      • This is closely linked to the simplicity of the system. Simpler systems are easier to change.
  • Highlights
    • Note that a fault is not the same as a failure [2]. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. (Location 253)
    • Many critical bugs are actually due to poor error handling [3]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. (Location 260)
    • there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy. Such systems also have operational advantages: a single-server system requires planned downtime if you need to reboot the machine (to apply operating system security patches, for example), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (Location 294)
    • The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment — and while that assumption is usually true, it eventually stops being true for some reason (Location 316)
    • Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right. (Location 331)
    • Monitoring can show us early warning signals and allow us to check whether any assumptions or constraints are being violated. When a problem occurs, metrics can be invaluable in diagnosing the issue. (Location 345)
