Friday, October 21, 2022

Apache Kafka Rebalancing Protocol

What Is a Rebalance?
The explanation is as follows:
But you may ask: what happens when we add a new consumer to an existing, running group? The process of switching partition ownership from one consumer to another is called a rebalance. It is a short pause in message consumption for the whole group. Deciding which partition goes to which consumer builds on the coordinator election problem.
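
As a minimal illustration, here is a hedged Java sketch of a consumer joining a group; the broker address, topic, and group name are made up for the example. Starting a second copy of this program with the same group.id triggers exactly the rebalance described above, splitting the topic's partitions between the two members.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Every consumer sharing this group.id belongs to the same consumer group
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // poll() is also where the consumer participates in rebalances
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r ->
                    System.out.printf("partition=%d offset=%d%n", r.partition(), r.offset()));
            }
        }
    }
}
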
The rebalance operation takes a "stop the world" approach. The explanation is as follows:
There is, however, a drawback of the default rebalancing approach. Each consumer gives up its entire assignment of topic partitions, and no processing takes place until the topic partitions are reassigned, sometimes referred to as a "stop-the-world" rebalance. To compound the issue, depending on the instance of the ConsumerPartitionAssignor used, consumers are simply reassigned the same topic partitions that they owned prior to the rebalance, the net effect being that there was no need to pause work on those partitions in the first place. This implementation of the rebalance protocol is called eager rebalancing because it prioritizes the importance of ensuring that no consumers in the same group claim ownership over the same topic partitions.
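
To make this eager behavior visible, a ConsumerRebalanceListener can be attached to the subscription from the sketch above; with the default assignors, onPartitionsRevoked fires with the consumer's entire assignment at the start of every rebalance, even when the same partitions are handed straight back.

import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Eager protocol: the FULL current assignment is revoked first
        System.out.println("Revoked: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Frequently the very same partitions come back after the pause
        System.out.println("Assigned: " + partitions);
    }
});
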
Problems with Rebalancing
The explanation is as follows:
Kafka tries to rebalance partitions every time new code is rolled onto a machine. Unsurprisingly, this problem is notorious in the Kafka / Kafka Streams application world. Here is how the protocol works during rolling updates:

- When we roll out new code, let us say on an instance A that hosts a stream application, the currently running stream application is closed. As a consequence, it sends a "leave group" request to the group coordinator.

- After instance A finishes rolling out the new code, it rejoins the old consumer group by sending a "join group" request.

- The Kafka membership protocol resolves membership again and redistributes partitions across the nodes. Instance A might receive a different partition, in which case it needs to rebuild its state store from the changelog topic corresponding to the newly assigned partition.

- Even worse, while rebalancing is happening, the whole consumer group stops the world: all operations, including cluster-internal ones, cannot proceed normally and must wait for the rebalancing to finish. After that, some consumers stop and restore their state from the changelog topics. If the changelog topics are huge, the service can be slow or unresponsive for a long time before returning to its normal state.
The solutions are as follows:
1. Optimize consumer configurations
2. Optimize state store configurations
3. Change to Static Membership (see the sketch after this list)
4. Maintain active and inactive clusters
5. Incremental Cooperative Rebalancing
6. Moving from stateful consumers to stateless consumers
7. Upgrading to Kafka’s stream 2.6
8. Accept downtime
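
For item 3, static membership (KIP-345, available since Kafka 2.3) deserves a quick sketch: giving every consumer instance a stable group.instance.id lets the broker recognize a restarted instance and hand back its old partitions without triggering a rebalance, as long as it returns within session.timeout.ms. The id and timeout values below are illustrative, added to the props of the earlier sketch.

import org.apache.kafka.clients.consumer.ConsumerConfig;

// Static membership: a stable per-instance id that survives restarts (KIP-345)
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, "order-processor-a");
// Give the instance enough time to restart before the broker evicts it
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "60000");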

1. Optimize consumer configurations
There is an article here.
There is another article here.
A timeout for the consumer can be assigned via the session.timeout.ms field; because we widen this heartbeat window, Kafka does not notice the consumer dropping out while the new version starts up. The explanation is as follows:
Default value: 10000 (10 seconds; raised to 45000 in Kafka 3.0 by KIP-735)

Timeout for the heartbeat thread. It prevents unnecessary rebalancing due to network issues or the application doing a GC pause that causes heartbeats not to reach the broker coordinator.

It’s also good to understand the differences between session timeout and polling timeout. See https://stackoverflow.com/questions/39730126/difference-between-session-timeout-ms-and-max-poll-interval-ms-for-kafka-0-10
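
Putting the tuning above into code, a hedged example with illustrative values rather than recommendations: the session timeout is widened so a rolling restart fits inside it, the heartbeat interval is kept at no more than one third of the session timeout (the usual rule of thumb), and max.poll.interval.ms is shown as the separate polling timeout the Stack Overflow link discusses.

import org.apache.kafka.clients.consumer.ConsumerConfig;

// Values are illustrative: widen the session so a rolling restart goes unnoticed
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "60000");     // session.timeout.ms
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "20000");  // at most 1/3 of the session timeout
// Separate knob: the longest allowed gap between poll() calls
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");  // max.poll.interval.ms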

5. What Is Incremental Cooperative Rebalancing?
Instead of the "stop the world" approach, the group leader publishes only the assignments that change, so consumers do not have to stop working. The explanation is as follows:
First introduced to Kafka Connect in Apache Kafka 2.3, incremental cooperative rebalancing has now been implemented for the consumer group protocol too. With the cooperative approach, consumers don’t automatically give up ownership of all topic partitions at the start of the rebalance. Instead, all members encode their current assignment and forward the information to the group leader. The group leader then determines which partitions need to change ownership — instead of producing an entirely new assignment from scratch. Now a second rebalance is issued, but this time, only the topic partitions that need to change ownership are involved. It could be revoking topic partitions that are no longer assigned or adding new topic partitions. For the topic partitions that are in both the new and old assignment, nothing has to change, which means continued processing for topic partitions that aren’t moving. The bottom line is that eliminating the "stop-the-world" approach to rebalancing and only stopping the topic partitions involved means less costly rebalances, thus reducing the total time to complete the rebalance. Even long rebalances are less painful now that processing can continue throughout them. 

This positive change in rebalancing is made possible by using the CooperativeStickyAssignor. The CooperativeStickyAssignor makes the trade-off of having a second rebalance but with the benefit of a faster return to normal operations. To enable this new rebalance protocol, you need to set the partition.assignment.strategy to use the new CooperativeStickyAssignor. Also, note that this change is entirely on the client-side. To take advantage of the new rebalance protocol, you only need to update your client version. If you’re a Kafka Streams user, there is even better news. Kafka Streams enables the cooperative rebalance protocol by default, so there is nothing else to do.
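
For a plain consumer, switching to the cooperative protocol is a one-line, client-side config change; here is a hedged sketch reusing the props from the first example (Kafka Streams users need none of this, since it is the default there):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

// Client-side only: selecting this assignor enables incremental cooperative rebalancing
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
        CooperativeStickyAssignor.class.getName());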
