
TIL 251004

I've been reading “Designing Data-Intensive Applications”.

Some interesting things I’ve learned so far:

Human error, not hardware, is the leading cause of outages. To quote the book:

...one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages.

Hardware failures happen more often than I'd expected. Every piece of hardware eventually fails. Two useful metrics are “mean time to failure” (MTTF, for parts you throw away when they fail) and “mean time between failures” (MTBF, for parts you repair when they fail). Neither is infinite, so in a cluster with enough CPUs, RAM modules, GPUs, and hard drives, something will be failing all the time.

One example given in the book goes like this:

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
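A quick back-of-the-envelope check of that claim (this sketch is mine, not the book's, and assumes failures are independent and spread evenly over the MTTF):

```python
# Rough expected number of disk failures per day in a cluster,
# given the fleet size and the per-disk mean time to failure.
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    return num_disks / (mttf_years * 365)


for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years -> ~{rate:.1f} failures/day")

# MTTF 10 years -> ~2.7 failures/day
# MTTF 50 years -> ~0.5 failures/day
```

So “about one disk per day” is the right ballpark for the MTTF range the book cites.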

On top of that, modern cloud platforms prioritize flexibility and elasticity over the stability of any single machine, so your software has to be designed with the loss of individual machines in mind.

One of the techniques mentioned in the book for handling faults is process isolation.

In ancient times, when software ran closer to the bare metal, this meant that one process should not be able to touch memory addresses or other resources used by another. In our modern-day context, the idea extends to technologies like containerization.
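As a toy illustration (mine, not the book's): if a flaky worker runs in its own OS process, even a hard crash in the worker can't take down the parent, which just observes the exit code and can decide how to react.

```python
# A minimal sketch of fault containment via process isolation.
# The worker crashes hard, but the parent process is unaffected.
import multiprocessing as mp
import os
import signal


def flaky_worker() -> None:
    # Simulate a hard crash (POSIX-only here), not a catchable exception.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    worker = mp.Process(target=flaky_worker)
    worker.start()
    worker.join()

    # A negative exit code means the worker was killed by a signal.
    print(f"worker exit code: {worker.exitcode}")
    print("parent is still running")
```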