Monday, 27 March 2023

Distributed Snapshot

Problem Definition
The explanation is as follows:
Point-in-time snapshots are critical for capturing the “consistent” state of a system. That state can be restored after any loss of system state, which makes the system fault tolerant.

Taking a snapshot of one particular server is easy. You define a cut-off time, and the state of the server (its local state) at that exact time is captured for the snapshot.

However, taking a snapshot in a distributed system, i.e. on all the nodes in a cluster, is a challenging problem because the nodes don’t share a common/global clock. Hence, it cannot be guaranteed that all the nodes in the cluster will capture their local state at the same “instant”.

In addition to the “local” state, a distributed system can have state that is in transit, i.e. messages sent from node 1 to node 2 that haven’t arrived at node 2 yet.
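To make the in-transit problem concrete, here is a minimal sketch. The two nodes, their balances, and the single one-way channel are hypothetical and exist only for illustration. It shows that recording only the local states while a message is in flight loses part of the global state.

```python
from collections import deque

# Hypothetical two-node system: each node holds a balance, and a
# one-way channel carries transfers from node 1 to node 2.
balances = {"node1": 50, "node2": 50}
channel_1_to_2 = deque()

# node1 sends 10 units; the message is now in transit.
balances["node1"] -= 10
channel_1_to_2.append(10)

# Record only the local states while the message is still in flight.
local_only = dict(balances)
local_total = sum(local_only.values())

# Record the local states plus the contents of the channel.
in_transit = list(channel_1_to_2)
full_total = local_total + sum(in_transit)

# node2 eventually receives the transfer.
balances["node2"] += channel_1_to_2.popleft()

print(local_total, full_total, sum(balances.values()))
```

The local-only total is short by exactly the amount that was in the channel, which is why a consistent snapshot has to account for channel state as well as node state.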

The other constraint is that taking a snapshot should not be a “stop the world” process, and it should not alter the actual computations.


In short, we need the distributed snapshot to capture a “consistent” view of the global state of the distributed system, without impacting the actual computations on that system.
Algorithm
The explanation is as follows:
The algorithm used for capturing distributed snapshots is the Chandy-Lamport algorithm (yes, Leslie Lamport is also behind Lamport clocks).
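The quoted text only names the algorithm, so what follows is a hedged, single-process sketch of the commonly described marker-based form of Chandy-Lamport, not a reference implementation. It assumes reliable FIFO channels, which the algorithm requires. The `Node` class, its method names, and the balance-transfer scenario are hypothetical and chosen for illustration.

```python
from collections import deque

MARKER = "MARKER"  # special control message, distinct from payloads

class Node:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.peers = {}             # peer name -> Node (outgoing links)
        self.inbox = {}             # sender name -> FIFO channel
        self.recorded_state = None  # local state captured for the snapshot
        self.recording = {}         # sender name -> messages caught in transit
        self.closed = set()         # incoming channels whose marker arrived

    def connect(self, other):
        self.peers[other.name] = other
        other.inbox[self.name] = deque()

    def send(self, to, amount):
        self.balance -= amount
        self.peers[to].inbox[self.name].append(amount)

    def start_snapshot(self):
        self._record_and_broadcast()

    def _record_and_broadcast(self):
        # Record local state, then send a marker on every outgoing channel.
        self.recorded_state = self.balance
        self.recording = {src: [] for src in self.inbox}
        for peer in self.peers.values():
            peer.inbox[self.name].append(MARKER)

    def receive(self, src):
        msg = self.inbox[src].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self._record_and_broadcast()
            self.closed.add(src)  # stop recording this channel
        else:
            self.balance += msg   # normal computation continues
            if self.recorded_state is not None and src not in self.closed:
                self.recording[src].append(msg)

    def done(self):
        return self.recorded_state is not None and self.closed == set(self.inbox)

a, b = Node("a", 50), Node("b", 50)
a.connect(b)
b.connect(a)

b.send("a", 5)       # in transit when the snapshot starts
a.start_snapshot()   # a records 50 and sends a marker to b
a.send("b", 10)      # sent after the marker, so not part of the snapshot
b.receive("a")       # marker: b records 45 and sends a marker to a
a.receive("b")       # the 5 arrives and is recorded as channel state
a.receive("b")       # marker: channel b -> a is closed
b.receive("a")       # the 10 arrives after the marker and is not recorded

snapshot_total = (a.recorded_state + b.recorded_state
                  + sum(sum(msgs) for msgs in a.recording.values())
                  + sum(sum(msgs) for msgs in b.recording.values()))
print(snapshot_total, a.balance + b.balance)
```

In this run the nodes keep exchanging transfers while the snapshot is taken, and the recorded node states plus the recorded channel contents still add up to the system total, even though no node ever observed that exact state at a single instant.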
