"Designing data-intensive applications" By Martin Kleppmann: A summary
[Vikram Mandyam] / 2018-09-21
I recently finished reading the book “Designing Data-Intensive Applications” by Martin Kleppmann. It took me a little over a year to finish it end-to-end, but it has been well worth the time!
This is a very good book for getting a clear picture of the landscape of data systems. It provides enough detail to figure out the right data approach for your project, no matter its scale. I highly recommend it to anybody who is building any form of networked system dealing with data.
My takeaway
This is a must-read for every programmer - it works as an in-depth introduction to data storage technologies and distributed systems concepts, two foundations of almost any piece of software built today.
What stands out for me is the amazing job Martin does of distilling seemingly unbounded masses of research into a book that is easy to understand.
TL;DR: Key topics covered in the book
The book starts off with the basics, covering reliability, scalability, and maintainability of data-intensive systems. It then goes down memory lane with the history of how data was stored and accessed. The second part of the book deals with a topic that is currently very hot in the field: distributed systems. When data is distributed, it brings its own challenges, such as unreliable clocks across machines and network partitions. Martin explains the different techniques used in distributed data systems, like replication, partitioning, and distributed transactions. One particularly interesting topic is consistency and consensus in a cluster of machines, which is also the subject of a tech talk by Martin on InfoQ.
A detailed review and summary
The book starts off with fundamental ways of thinking about data-intensive applications. Basic concepts are demystified to serve as a foundation for the remainder of the book, including a good brush-up on reliability, scalability, and maintainability.
After the foundational chapter, we come to an overview of different data models: relational, document, and graph-based. This chapter also demonstrates their applicability to different use cases, using modern real-life challenges that companies such as Twitter have faced.
The next chapter delves into the way data is stored and retrieved by databases. This deals not only with RDBMS, but also with NoSQL databases, including columnar and graph stores. If you are particularly interested in data structures like LSM-trees and B-trees, this chapter is an absolute delight.
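To make the LSM-tree idea concrete, here is a minimal in-memory sketch of my own (not code from the book, and nothing like a real storage engine): writes land in a mutable memtable, which is flushed to an immutable sorted segment when it fills up, and reads check the memtable before the segments, newest first.

```python
# A toy LSM-tree-style store: memtable + immutable sorted segments.
MEMTABLE_LIMIT = 2  # tiny threshold, just to force flushes in this demo

memtable = {}       # in-memory, mutable
segments = []       # on disk in a real system; here, a list of sorted dicts

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: write the keys out in sorted order as an immutable segment.
        segments.append(dict(sorted(memtable.items())))
        memtable.clear()

def get(key):
    if key in memtable:                 # most recent writes live here
        return memtable[key]
    for segment in reversed(segments):  # otherwise, newest segment wins
        if key in segment:
            return segment[key]
    return None

put("a", 1); put("b", 2)   # second put triggers a flush
put("a", 3)                # newer value for "a" lives in the memtable
print(get("a"), get("b"))  # 3 2
```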
Martin goes on to describe the often-overlooked field of encoding: converting data structures into bits and bytes, and storing them optimally in secondary storage. In fact, the encoding concepts explained here can also be used for service-to-service communication, because backward compatibility of encodings is explained very well.
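As a toy illustration of backward compatibility (my own sketch, using JSON for simplicity; the book covers richer formats such as Avro, Thrift, and Protocol Buffers), new code can keep reading records written by old code by giving newly added fields a default. The "nickname" field here is hypothetical, added in a newer schema version:

```python
import json

def decode_user(raw):
    record = json.loads(raw)
    return {
        "name": record["name"],                   # required in every version
        "nickname": record.get("nickname", None), # new field: default when absent
    }

old_record = '{"name": "Ada"}'                    # written by old code
new_record = '{"name": "Ada", "nickname": "ad"}'  # written by new code
print(decode_user(old_record))  # new code can still read old data
print(decode_user(new_record))
```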
The next chapter deals with replication techniques such as single-leader, multi-leader, and leaderless replication, and the problems each one aims to solve. Who knew just maintaining copies of data on multiple machines could turn out to be so tricky!
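To give a flavour of the leaderless style, here is a minimal toy sketch (my own, not from the book) of quorum reads and writes: with n replicas, a write needs w acknowledgements and a read consults r replicas, and choosing w + r > n guarantees that every read overlaps at least one replica holding the latest write.

```python
n, w, r = 3, 2, 2
replicas = [{} for _ in range(n)]  # toy in-memory replicas

def write(key, value, version):
    acks = 0
    for replica in replicas:
        replica[key] = (value, version)  # a real system could fail here
        acks += 1
        if acks >= w:                    # in this toy, stop once w replicas ack
            return True
    return False

def read(key):
    # Ask r replicas and keep the value with the highest version number.
    responses = [rep[key] for rep in replicas[:r] if key in rep]
    return max(responses, key=lambda vv: vv[1])[0] if responses else None

write("x", "hello", version=1)
print(read("x"))  # 'hello'
```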
Next, we learn how to partition large datasets into smaller ones. Here as well, similar to replication, Martin delves into the details of problems such as avoiding query hotspots, multi-partition writes, etc.
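One common way to avoid hotspots is to partition by a hash of the key rather than by key range, so that skewed keys (timestamps, for example) do not all land on one node. A minimal sketch, assuming a fixed number of partitions:

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key):
    # Use a stable hash; Python's built-in hash() is randomized per process.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for user in ["alice", "bob", "carol", "dave"]:
    print(user, "-> partition", partition_for(user))
```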
The seventh chapter deals with the problems that transactions solve and why they are important to certain systems. Further, we get to think and reason about the different isolation levels, which can affect the way an application behaves: dirty reads, dirty writes, read skew, lost updates, write skew, and phantom reads. There is also an explanation of how databases implement transactions.
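As an example of why isolation matters, here is a minimal sketch of the lost-update anomaly, using a plain dict as a stand-in for a database with no isolation at all:

```python
db = {"counter": 0}

# Two "transactions" both read the current value...
a_read = db["counter"]
b_read = db["counter"]

# ...each increments its local copy and writes back.
db["counter"] = a_read + 1   # A commits: counter = 1
db["counter"] = b_read + 1   # B commits: counter = 1, A's update is lost

print(db["counter"])  # 1, not the expected 2
```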
The eighth chapter is an eye-opener for anyone who thinks distributed systems are easy. It discusses the myriad ways in which such systems can fail. In fact, by the end of it, one will be sure that a defining characteristic of distributed systems is that partial failures will occur.
He then tackles the topic of consistency and consensus, and explains how linearizability of the events that occur in a distributed system slows the system down. Causality provides a weaker consistency guarantee: because some events are allowed to be concurrent, the version history looks like branching and merging.
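One way to tell whether two events are causally ordered or concurrent is with version vectors, one counter per node. This comparison sketch is my own illustration of the idea:

```python
def compare(vv_a, vv_b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two version vectors."""
    nodes = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(n, 0) <= vv_b.get(n, 0) for n in nodes)
    b_le_a = all(vv_b.get(n, 0) <= vv_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened before b
    if b_le_a:
        return "after"        # b happened before a
    return "concurrent"       # neither dominates: a branch in the history

print(compare({"n1": 2, "n2": 1}, {"n1": 3, "n2": 1}))  # before
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))  # concurrent
```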
Chapters eight and nine are probably the most important bits of this book, where we get down and dirty with all sorts of problems in distributed systems, and their mitigations.
The next chapter, the penultimate one, has a very interesting bit about how Unix tools such as awk and sed can perform much better than Hadoop and other heavyweight tools. This should also tell us that it is not always right to chase shiny techniques. The chapter provides a good overview of MapReduce and related algorithms.
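For intuition, here is a single-process toy sketch of the MapReduce pattern (word counting); a real framework such as Hadoop distributes the map, shuffle, and reduce steps across many machines, but the shape of the computation is the same:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key; here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```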
Martin brings the book to a close by discussing stream processing and explaining the difference between batch and stream processing techniques. Message brokers and event hubs are introduced, and their role in stream processing, vis-à-vis the role of the file system in batch processing, is explained. All through the book, interesting links are made between the different technologies already explained; in this chapter too, the correlation between log-based event brokers and the log-based replication techniques of databases is brought out.
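That parallel is easy to see in miniature. Here is a toy sketch of a log-based broker's core idea (my own, single in-memory partition): producers append to an ordered log, and each consumer tracks its own offset, much like a replica consuming a database's replication log.

```python
log = []                      # the append-only log (one partition)
offsets = {"consumer-a": 0}   # each consumer remembers how far it has read

def produce(message):
    log.append(message)       # appending is the only way to write

def consume(consumer):
    # Deliver everything the consumer has not seen yet, then advance its offset.
    start = offsets[consumer]
    messages = log[start:]
    offsets[consumer] = len(log)
    return messages

produce("order-created")
produce("order-paid")
print(consume("consumer-a"))  # ['order-created', 'order-paid']
produce("order-shipped")
print(consume("consumer-a"))  # ['order-shipped']
```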
Overall - A worthy read!