Tuesday, February 9, 2021

Raft Consensus Algorithm - Requires a Leader

Introduction
There is an article in Turkish on this topic here

What Is a Distributed Consensus Algorithm?
The explanation is as follows:
A consensus algorithm gets multiple nodes to agree on a particular piece of data. In the case of Raft, we try to get all nodes to agree on a sequence of events which is known as a log. You can think of an entry in the log as an operation which changes the state of the cluster.
Some Algorithms
The explanation is as follows:
Hirschberg–Sinclair Algorithm
In a ring topology (each node has two neighbors), each node compares IDs with its neighbors, and the highest one wins after O(n log n) messages.

Bully Algorithm
This synchronous algorithm assumes that each node has a unique ID and knows all the participant IDs.

A node starts an “election” by sending a message to the nodes with higher IDs and waits for their responses; if none of them respond, it declares itself the winner and broadcasts the result to the lower-ID nodes. In the end, the live node with the highest ID wins.

An election is called when a high-ID node is launched, the leader fails, or the heartbeat messages fail.
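As a rough illustration of the Bully rule, here is a minimal sketch in Go; the node IDs and the pingNode helper are assumptions for illustration only:

package election

// shouldDeclareLeader sketches the core Bully rule: a node messages every
// node with a higher ID and, if none of them answer, declares itself leader.
// pingNode is a hypothetical helper that returns true when the contacted
// node responds before a timeout.
func shouldDeclareLeader(myID int, allIDs []int, pingNode func(id int) bool) bool {
	for _, id := range allIDs {
		if id > myID && pingNode(id) {
			// A live node with a higher ID exists; it will win the election.
			return false
		}
	}
	// No higher-ID node responded, so this node broadcasts victory.
	return true
}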
Paxos
Paxos is another algorithm.

Raft vs Paxos
The explanation is as follows. In short, Paxos is a hard algorithm and Raft is easier.
Raft solves a particular problem: how can multiple independent processes decide on a single value for something?

In computer science, this problem is known as distributed consensus; it was famously solved by Leslie Lamport's Paxos algorithm, which is effective but notoriously tricky to understand and implement in practice.

Raft was designed to solve similar problems to Paxos but in a much more understandable way.
The explanation is as follows:
Paxos requires multiple round trips to gain a consensus, which creates a lot of extra latency and fine print about when to use lightweight transactions in your application.

The Raft protocol was developed as the next generation to replace Paxos and several systems such as Etcd, CockroachDB and DynamoDB adopted it. It reduced round trips by creating an elected leader.
The diagram is as follows:


What Is Raft?
In short, its goal is to do leader-based replication, and for that a leader needs to be elected.

Leader-based Replication
The explanation is as follows. That is, write requests are sent to the leader.
On a high level, the machines elect a leader among themselves to handle all client write requests. If a follower receives a client write request, it’ll forward it to the leader. Values only flow from the leader to the followers. Multiple write requests may arrive at the leader simultaneously. The leader will serialize them in the log and replicate them in the same order to the followers.
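A minimal sketch of this routing rule in Go; the Node fields and the appendToLog and forwardToLeader helpers are assumptions for illustration, not any particular Raft library's API:

package raftsketch

import "errors"

// Command is a client operation the cluster must agree on.
type Command []byte

// Node is a hypothetical view of a Raft member.
type Node struct {
	isLeader        bool
	leaderAddr      string
	appendToLog     func(Command) error                // leader path: serialize into the replicated log
	forwardToLeader func(addr string, c Command) error // follower path: hand the request over
}

// HandleWrite shows the routing rule: followers never apply writes themselves,
// they forward the request to the current leader.
func (n *Node) HandleWrite(c Command) error {
	if n.isLeader {
		return n.appendToLog(c)
	}
	if n.leaderAddr == "" {
		return errors.New("no known leader yet")
	}
	return n.forwardToLeader(n.leaderAddr, c)
}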
Another explanation is as follows:
Now that we have multiple nodes, a good question is: which node should we interact with? A Raft cluster has a leader for this purpose, a specific node to which the client sends its write requests, and it is the leader’s duty to get this entry committed to the holy log (we will get to how that happens exactly). So, Raft has three kinds of nodes: “The Leader”, “The Candidates” and the “Followers”.
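The three roles can be modeled as a small state type; a sketch in Go:

package raftsketch

// NodeState models the three roles a Raft node can be in.
type NodeState int

const (
	Follower  NodeState = iota // default role; responds to the leader and to candidates
	Candidate                  // requests votes when the leader goes quiet
	Leader                     // the only node that accepts client writes
)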
How Is the Leader Elected?
The explanation is as follows:
So, the question is: how does Raft decide which node is the leader? Democracy, of course! All the nodes start as “Followers”, and among themselves they elect the leader. Who can request votes? The “Candidate” nodes. So, when there is no leader in the cluster, any node can become a candidate and any of them can win the leadership. Also, when there is only one node in the cluster, it can bootstrap itself and become the leader.
Can Leadership Be Lost?
The explanation is as follows. A node that loses contact with the leader declares itself a candidate and starts an election. If it wins, the cluster moves on to the next term.
The time in office for a particular leader node is called a term. But how does a term end? Obviously, when the leader fails (Believe me, this post is not political).

The leader keeps sending heartbeat requests to the nodes; this is what tells the other nodes that the leader is still alive. When a node does not receive a heartbeat from the leader within a stipulated amount of time, it considers the leader dead and becomes a candidate node. Then it requests votes from all other nodes, and if it wins the election it gets to be the leader. The candidate also increments the current term, as the reign of the previous leader has ended.
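A sketch of this heartbeat-timeout behaviour in Go; the channel, the callback and the shared term counter are assumptions used only to illustrate the flow:

package raftsketch

import "time"

// runFollower waits for heartbeats from the leader; if none arrives within
// electionTimeout, the node gives up on the current leader, increments the
// term and starts an election as a candidate.
func runFollower(heartbeat <-chan struct{}, electionTimeout time.Duration,
	currentTerm *int, startElection func(term int)) {
	for {
		select {
		case <-heartbeat:
			// The leader is alive; keep waiting as a follower.
		case <-time.After(electionTimeout):
			*currentTerm++              // the previous leader's reign has ended
			startElection(*currentTerm) // become a candidate and request votes
			return
		}
	}
}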
The diagram is as follows. Here S2, which is the master, crashes after a while and a new leader is elected.

Details of Declaring Candidacy
The explanation is as follows. That is, when the connection with the leader is lost, not every node immediately declares itself a candidate; each node declares candidacy only after waiting a random amount of time.
Now let’s consider a scenario where the heartbeat timeout period for all the nodes is 100ms, and there is some error in the leader node due to which it shuts down. All the nodes wait 100ms for the leader to send a heartbeat, and all of them become candidates. This can be a problematic scenario, as we have no followers left to vote and all the candidates will vote for themselves.

To solve this scenario, every node has a random timeout. So not all of the followers turn candidate at the same time. If there happen to be multiple candidates and they receive the same number of votes, the election starts all over again.
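The randomized timeout itself is tiny; a sketch in Go, using the 150-300 ms range the Raft paper suggests as an example:

package raftsketch

import (
	"math/rand"
	"time"
)

// randomElectionTimeout gives every node its own timeout so that the followers
// do not all turn candidate at the same instant after the leader disappears.
func randomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}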
Voting Details
The explanation is as follows. That is, while voting for a candidate, a follower also looks at the candidate node's log index.
Let’s get back to voting and elections for a moment. Will the followers vote blindly for any candidate? No. (Here’s where it differs from real-world politics.) A follower won’t vote for a candidate whose log is shorter than its own, as that means the follower has more up-to-date information than the candidate.
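Raft's precise rule compares the last entry's term first and, if the terms are equal, the log length; a sketch in Go:

package raftsketch

// candidateLogIsUpToDate is the comparison a follower makes before granting
// its vote: the candidate's log must be at least as up to date as the voter's,
// meaning a higher last term, or the same last term and an index that is not
// shorter.
func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm
	}
	return candLastIndex >= myLastIndex
}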
What Is a Quorum
I moved this to the Quorum Nedir post.

Log Details
A Finite State Machine (FSM) is run over the log. The explanation is as follows:
Raft says that the log must be applied on the same Finite State Machine (FSM) on each of the nodes … the FSMs must reach a particular state on being applied a log entry. 
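That requirement can be captured as a tiny interface; this is only a sketch of the idea, not any specific library's FSM interface:

package raftsketch

// FSM is the deterministic state machine each node runs. Applying the same
// committed entries in the same order must leave every node's FSM in the
// same state.
type FSM interface {
	Apply(command []byte) // must be deterministic: same input, same resulting state
}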
The details of the log are as follows (a small struct sketch comes after the list):
each log entry has the following:
- An index (starts at 1)
- The term index, which identifies the term in which the log entry was created.
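In Go, such an entry might be modeled as follows; the field names are an assumption, built from the two fields listed above plus the command itself:

package raftsketch

// LogEntry is one operation in the replicated log.
type LogEntry struct {
	Index   int    // position in the log, starting at 1
	Term    int    // the term in which the leader created this entry
	Command []byte // the operation that will eventually be applied to the FSM
}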
The Log Is Stored on Disk
The explanation is as follows:
Also, we must keep in mind the log is always kept in persistent storage, not in memory. The reason being, if the whole cluster goes off for some reason the log can be used to build back up the data. Consistency is the key, at least when it comes to raft.
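A minimal sketch of that persistence step, assuming a simple append-only file of JSON lines; real implementations use more careful on-disk formats, but the point is the fsync before the entry is acknowledged:

package raftsketch

import (
	"encoding/json"
	"os"
)

// appendToDisk writes an entry (the LogEntry struct sketched above) to an
// append-only file and fsyncs it, so the entry survives a crash or a
// full-cluster restart.
func appendToDisk(f *os.File, e LogEntry) error {
	line, err := json.Marshal(e)
	if err != nil {
		return err
	}
	if _, err := f.Write(append(line, '\n')); err != nil {
		return err
	}
	return f.Sync() // flush to stable storage before acknowledging the entry
}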
How Are Entries Appended to the Log?
The details are as follows. In a sense, we can say it somewhat resembles Two-Phase Commit; a short sketch in Go follows the list below.
So, let’s say the leader gets a request:
1. It appends the log entry to its own log, with the proper term and index.
2. It sends out the news to all the followers, and the followers append the entry to their logs with the given term and index.
3. The followers signal the leader that they have appended the log entry.
4. If at least N/2+1 nodes send back a response (N being the number of nodes) … we have reached consensus! The leader commits the log entry and tells the followers it has done so.
5. The followers then commit the log entry. They keep track of the last committed index. Committing a log entry basically means applying it to the FSM.
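Here is the sketch promised above: a compressed view of steps 2-5 in Go, where sendAppend and commit are hypothetical callbacks standing in for the real RPCs, and LogEntry is the struct from the Log Details section:

package raftsketch

// replicateAndCommit sketches steps 2-5: the leader sends the entry to every
// follower, counts acknowledgements (its own log counts as one), and commits
// once at least N/2+1 nodes have appended the entry.
func replicateAndCommit(entry LogEntry, sendAppend []func(LogEntry) bool, commit func(LogEntry)) bool {
	acks := 1 // the leader has already appended the entry to its own log (step 1)
	for _, send := range sendAppend {
		if send(entry) { // the follower appended the entry and acknowledged it (steps 2-3)
			acks++
		}
	}
	n := len(sendAppend) + 1 // total number of nodes
	if acks >= n/2+1 {
		commit(entry) // consensus reached: commit and tell the followers (steps 4-5)
		return true
	}
	return false
}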
How Does Log Catch-up Work?
The explanation is as follows.
What if some node was disconnected from the cluster for a long time and then reconnects? Let’s say when it disconnected it had the log up to (index 3, term 1). Now when it comes back, the leader wants to send it a log entry with (index 15, term 2). If the follower appends this entry to its log, we have a serious gap in the operations to be performed on our data, and thus it becomes inconsistent. To solve this (a short sketch follows the steps below):

1. Along with the current log entry, the leader sends the immediately previous log entry as well (let's say p and p-1).
2. The follower checks whether it has the previous entry (p-1) sent by the leader in its own log. If it has it, it accepts the entry and appends it to its log.
3. If the follower does not have the entry, it returns failure.
4. The leader receives the failure notification and now sends the entries p-1 and p-2 to the follower, and it continues to do so until the follower sends back a success. Thus, the follower is brought up to speed and the logs are reconciled.
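And the sketch promised above: the follower-side consistency check in Go, using the LogEntry struct from the Log Details section; the function shape is an assumption, not a real library's AppendEntries handler:

package raftsketch

// appendEntriesCheck sketches the follower side of steps 1-3: the follower
// accepts a new entry only if it already holds the previous entry (matching
// index and term); otherwise it reports failure and the leader backs up one
// more entry and retries.
func appendEntriesCheck(log []LogEntry, prevIndex, prevTerm int, entry LogEntry) ([]LogEntry, bool) {
	if prevIndex > 0 {
		// Log indices start at 1, so entry i lives at slice position i-1.
		if len(log) < prevIndex || log[prevIndex-1].Term != prevTerm {
			return log, false // gap or mismatch: the leader must go further back
		}
	}
	// Drop any conflicting suffix and append the new entry.
	return append(log[:prevIndex], entry), true
}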
Hazelcast
The explanation is as follows:
Communication failures within the network may lead to parts of the network being unreachable to other parts. Let’s say our cluster is comprised of eight nodes with a backup count of two. Somehow, connectivity between two groups of nodes is lost, and two separate groups of nodes are formed.

Individually, each of those two groups will act as if some part of the cluster has been lost and will immediately start to rebalance data within the cluster. As you can see, this will result in lost data for those clusters, and clients connecting to either group would see compromised data. The problem mentioned here is referred to as the Split Brain Problem, and Hazelcast provides a feature called Quorum configuration to make sure you have the minimum number of machines available in the network for the data structure to respond. If the cluster size drops below the configured number, it results in a QuorumException.
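A library-agnostic sketch of that quorum check in Go; Hazelcast itself is configured declaratively, so the function below only illustrates the idea, and the names are assumptions:

package raftsketch

import "fmt"

// checkQuorum refuses an operation when fewer than minClusterSize members are
// reachable; this is what keeps both halves of a split-brain cluster from
// accepting writes independently.
func checkQuorum(reachableMembers, minClusterSize int) error {
	if reachableMembers < minClusterSize {
		return fmt.Errorf("quorum not present: only %d of the required %d members are reachable",
			reachableMembers, minClusterSize)
	}
	return nil
}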
etcd
The etcd explanation is as follows:
etcd is built on the Raft consensus algorithm to ensure data store consistency across all nodes in a cluster—table stakes for a fault-tolerant distributed system.

Raft achieves this consistency via an elected leader node that manages replication for the other nodes in the cluster, called followers. The leader accepts requests from the clients, which it then forwards to follower nodes. Once the leader has ascertained that a majority of follower nodes have stored each new request as a log entry, it applies the entry to its local state machine and returns the result of that execution—a ‘write’—to the client. If followers crash or network packets are lost, the leader retries until all followers have stored all log entries consistently.

If a follower node fails to receive a message from the leader within a specified time interval, an election is held to choose a new leader. The follower declares itself a candidate, and the other followers vote for it or any other node based on its availability. Once the new leader is elected, it begins managing replication, and the process repeats itself. This process enables all etcd nodes to maintain highly available, consistently replicated copies of the data store.
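As a small usage example, here is a client talking to such a cluster through the official etcd v3 Go client; this assumes an etcd server is already running on localhost:2379:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// A client-side view of an etcd cluster: the client only sees a key-value
// store, while Raft replication between the etcd nodes happens behind the
// scenes.
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The Put is acknowledged only after the leader has replicated the entry
	// to a majority of the cluster.
	if _, err := cli.Put(ctx, "greeting", "hello"); err != nil {
		panic(err)
	}

	resp, err := cli.Get(ctx, "greeting")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}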
