Yazılım Çorbası: Distributed Snapshot

Monday, 27 March 2023

Distributed Snapshot

Problem Definition

The explanation is as follows:

Point-in-time snapshots are critical for capturing the “consistent” state of a system. That state can be restored if the system state is lost, which makes the system fault tolerant.

Taking a snapshot of one particular server is easy. You define a cut-off time, and the state of the server (its local state) at that exact time is captured for the snapshot.

However, taking a snapshot in a distributed system, i.e., on all the nodes in a cluster, is a challenging problem because the nodes in a cluster don’t have a common/global clock. Hence, it cannot be guaranteed that all the nodes in the cluster will capture their local state at the same “instant”.

In addition to the “local” state, there can be additional state associated with the distributed system that is in transit, i.e., messages sent from node 1 to node 2 that haven’t arrived at node 2 yet.
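To make the in-transit state concrete, here is a minimal sketch using a made-up money-transfer scenario (the balances, the amount, and the name `naive_vs_consistent` are illustrative assumptions, not from the quoted text). It shows that adding up only the local states loses the message that is still on the wire:

```python
def naive_vs_consistent():
    # Hypothetical scenario: two nodes holding account balances.
    node1_balance = 100
    node2_balance = 50
    channel_1_to_2 = []  # messages sent by node 1, not yet received by node 2

    # Node 1 sends 30 to node 2; the message is now in transit.
    node1_balance -= 30
    channel_1_to_2.append(30)

    # Snapshot that looks only at local state: the 30 in transit is missing.
    local_only_total = node1_balance + node2_balance

    # Snapshot that also records the channel state: the total is preserved.
    with_channel_total = local_only_total + sum(channel_1_to_2)

    return local_only_total, with_channel_total


if __name__ == "__main__":
    print(naive_vs_consistent())
```

The system started with 150 in total, so a snapshot that reports 120 could not be restored correctly; the channel contents have to be part of the global state.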

The other constraint is that taking a snapshot should not be a “stop the world” process, and it should not alter the actual computations.

In short, we need the distributed snapshot to create a “Consistent” snapshot of the global state of the distributed system, without impacting actual computations on that system.

Algorithm

The explanation is as follows:

The algorithm used for capturing distributed snapshots is the Chandy-Lamport algorithm. (Yes, Leslie Lamport is also behind Lamport clocks.)
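The quoted text only names the algorithm, so the following is a rough single-process sketch of its marker rules rather than a reference implementation. It assumes reliable FIFO channels between every pair of nodes, which the algorithm requires. The `Node` class, its method names, and the money-transfer scenario in `run_demo` are all hypothetical and chosen for illustration.

```python
from collections import deque


class Node:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.peers = {}             # peer name -> Node (outgoing channels)
        self.inbox = {}             # sender name -> FIFO queue (incoming channels)
        self.recorded_state = None  # local state captured for the snapshot
        self.recording = {}         # sender name -> messages recorded as in transit
        self.closed = set()         # incoming channels whose marker has arrived

    def connect(self, other):
        # Create a one-way FIFO channel from self to other.
        self.peers[other.name] = other
        other.inbox[self.name] = deque()

    def send(self, to, amount):
        self.balance -= amount
        self.peers[to].inbox[self.name].append(("transfer", amount))

    def _record_and_broadcast(self):
        # Record local state, then send a marker on every outgoing channel.
        self.recorded_state = self.balance
        self.recording = {sender: [] for sender in self.inbox}
        for peer in self.peers.values():
            peer.inbox[self.name].append(("marker", None))

    def start_snapshot(self):
        self._record_and_broadcast()

    def deliver(self, sender):
        # Process the next message on the incoming channel from `sender`.
        kind, amount = self.inbox[sender].popleft()
        if kind == "marker":
            if self.recorded_state is None:
                self._record_and_broadcast()  # first marker seen: record now
            self.closed.add(sender)           # stop recording this channel
        else:
            self.balance += amount            # normal computation continues
            if self.recorded_state is not None and sender not in self.closed:
                self.recording[sender].append(amount)  # was in transit

    def snapshot_done(self):
        return self.recorded_state is not None and self.closed == set(self.inbox)


def run_demo():
    a, b = Node("A", 100), Node("B", 50)
    a.connect(b)
    b.connect(a)
    a.send("B", 30)     # A = 70, transfer in flight to B
    b.send("A", 10)     # B = 40, transfer in flight to A
    a.start_snapshot()  # A records 70, marker queued behind the 30
    b.deliver("A")      # B receives 30 before the marker, B = 70
    b.deliver("A")      # B receives marker, records 70, sends its own marker
    a.deliver("B")      # A receives 10 after recording: counted as in transit
    a.deliver("B")      # A receives marker, channel B->A is closed
    total = (a.recorded_state + b.recorded_state
             + sum(a.recording["B"]) + sum(b.recording["A"]))
    return {"A": a.recorded_state, "B": b.recorded_state,
            "B->A": a.recording["B"], "A->B": b.recording["A"],
            "total": total, "done": a.snapshot_done() and b.snapshot_done()}


if __name__ == "__main__":
    print(run_demo())
```

In this run the nodes record their state at different moments and keep processing transfers throughout, yet the recorded local states plus the recorded channel contents still add up to the original 150. Nothing is paused, which matches the “no stop the world” constraint described above.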
