The shared log paradigm is at the heart of modern cloud-based distributed applications. Its appeal lies in the simplicity of the abstraction it offers: by adding new items to the log’s totally ordered sequence of records, clients contribute to building a shared ground truth for their system, which they can then leverage both immediately (e.g., to achieve Paxos-like fault tolerance or atomic transactions) and in the background (e.g., to support debugging, deterministic replay, analytics, intrusion detection, and failure recovery). A shared log implementation would ideally offer (i) total order; (ii) high throughput; (iii) high availability during reconfigurations; and (iv) low and predictable latency. No shared log today achieves all these properties, though not for lack of trying. For example, Scalog, a cool shared log implementation we developed a couple of years ago, offered what was then an unprecedented combination of features for continuous smooth delivery of service. It allowed applications to customize data placement, supported reconfiguration with no loss in availability, and recovered quickly from failures. At the same time, Scalog achieved high throughput and total order. Achieving low latency without giving up any of these properties, however, has proved elusive, and, as we will see, the issue goes deeper than simple engineering. I will give you a sneak peek at the new ideas we are exploring to finally crack the problem.
13/07/2023