Sunday, December 26, 2021

Kafka Producer API

Introduction
The explanation is as follows:
The Producer API allows an application to publish a stream of records to one or more Kafka topics.
I moved the Java examples to the Kafka Producer API post.
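Still, for orientation, a minimal sketch of publishing a single record is shown below; the broker address, topic name, key, and payload are illustrative assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    // try-with-resources closes the producer and flushes any pending batches
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // "my-topic" is a hypothetical topic name
      producer.send(new ProducerRecord<>("my-topic", "key-1", "hello"));
    }
  }
}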

Idempotent Producer Configuration
The explanation is as follows:
By configuring the Producer to be idempotent, each Producer is assigned a unique Id (PID) and each message is given a monotonically increasing sequence number. The broker tracks the PID + sequence number combination for each partition, rejecting any duplicate write requests it receives.
The enable.idempotence Field
The explanation is as follows. In other words, in addition to this field, acks must also be set to all.
enable.idempotence determines whether the producer may write duplicates of a retried message to the topic partition when a retryable error is thrown. Examples of such transient errors include leader not available and not enough replicas exceptions. It only applies if the retries configuration is greater than 0 (which it is by default).

To ensure the idempotent behaviour then the Producer configuration acks must also be set to all. The leader will wait until at least the minimum required number of in-sync replica partitions have acknowledged receipt of the message before itself acknowledging the message. The minimum number is based on the configuration parameter min.insync.replicas.

By configuring acks equal to all it favours durability and deduplicating messages over performance. The performance hit is usually considered insignificant.
Timeout for enable.idempotence
The explanation is as follows:
The recommendation is to leave the retries as the default (the maximum integer value) and limit retries by time, using the Producer configuration delivery.timeout.ms (defaulted to 2 minutes).
So the recommended settings are as follows.

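Based on the quoted text, a minimal sketch of those recommended settings might look like this; the broker address is an assumption carried over from the sketch above.

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.ACKS_CONFIG, "all"); // required for idempotent behaviour
// Leave retries at its default (Integer.MAX_VALUE) and bound retrying by time instead:
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000"); // 2 minutes, the default
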
What Is acks from the Perspective of the KafkaProducer Class?
The explanation is as follows:
For data durability, the KafkaProducer has the configuration setting acks. The acks configuration specifies how many acknowledgments the producer receives to consider a record delivered to the broker. The options to choose from are:

none: The producer considers the records successfully delivered once it sends the records to the broker. This is basically “fire and forget.”
one: The producer waits for the lead broker to acknowledge that it has written the record to its log.
all: The producer waits for an acknowledgment from the lead broker and from the follower brokers that they have successfully written the record to their logs.

As you can see, there is a trade-off to make here — and that’s by design because different applications have different requirements. You can opt for higher throughput with a chance for data loss, or you may prefer a very high data durability guarantee at the expense of lower throughput. 
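In the producer configuration itself, these three options correspond to the string values 0, 1, and all. A minimal sketch, reusing the props object from above (pick exactly one):

// none: fire and forget
props.put(ProducerConfig.ACKS_CONFIG, "0");
// one: wait only for the partition leader's acknowledgment
props.put(ProducerConfig.ACKS_CONFIG, "1");
// all: wait for the leader and the in-sync replicas (favors durability)
props.put(ProducerConfig.ACKS_CONFIG, "all");
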
The acks=all Case
The explanation is as follows. The min.insync.replicas value must also be set. After that, exceptions such as NotEnoughReplicasException and NotEnoughReplicasAfterAppendException may be thrown.
If you produce records with acks set to all to a cluster of three Kafka brokers, it means that, under ideal conditions, Kafka contains three replicas of your data; one for the lead broker and one each for two followers. When the logs of each of these replicas all have the same record offsets, they are considered to be in sync. In other words, these in-sync replicas have the same content for a given topic partition. 

But there’s some subtlety to using the acks=all configuration. What it doesn’t specify is how many replicas need to be in sync. The lead broker will always be in sync with itself. But, you could have a situation where the two following brokers can’t keep up due to network partitions, record load, etc. So, when a producer has a successful send, the actual number of acknowledgments could have come from only one broker! If the two followers are not in sync, the producer still receives the required number of acks, but it’s only the leader in this case.

By setting acks=all, you are placing a premium on the durability of your data. So, if the replicas aren’t keeping up, it stands to reason that you want to raise an exception for new records until the replicas are caught up. In a nutshell, having only one in sync replica follows the "letter of the law" but not the "spirit of the law." What we need is a guarantee when using the acks=all setting. A successful send involves at least a majority of the available in sync brokers. There just so happens to be one such configuration: min.insync.replicas. The min.insync.replicas configuration enforces the number of replicas that must be in sync for the write to proceed. Note that the min.insync.replicas configuration is set at the broker or topic level and is not a producer configuration. The default value for min.insync.replicas is one. So, to avoid the scenario described above, in a three-broker cluster, you’d want to increase the value to two. 
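Since min.insync.replicas is a broker- or topic-level setting, one place to set it is at topic creation. A minimal sketch using the AdminClient; the topic name, partition count, replication factor, and bootstrap address are assumptions.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetup {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed

    try (AdminClient admin = AdminClient.create(props)) {
      // Hypothetical topic: 3 partitions, replication factor 3, min ISR 2
      NewTopic topic = new NewTopic("my-topic", 3, (short) 3)
          .configs(Map.of("min.insync.replicas", "2"));
      admin.createTopics(List.of(topic)).all().get(); // blocks until creation completes
    }
  }
}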

The ProducerRecord Class
The explanation is as follows. There is also the option of using the Sticky Partitioner; a sketch is given after the quoted text below.
Kafka uses partitions to increase throughput and spread a load of messages to all brokers in a cluster. Kafka records are in a key/value format, where the keys can be null. Kafka producers don’t immediately send records; instead, they place them into partition-specific batches to be sent later. Batches are an effective means of increasing network utilization.

There are three ways the partitioner determines into which partition the records should be written. The partition can be explicitly provided in the ProducerRecord object via the overloaded ProducerRecord constructor. In this case, the producer always uses this partition. If no partition is provided and the ProducerRecord has a key, the producer takes the hash of the key modulo the number of partitions. The resulting number is the partition that the producer will use.

Previously, if there was no key and no partition present in the ProducerRecord, Kafka would use a round-robin approach to assign messages across partitions. The producer would assign the first record in the batch to partition zero, the second to partition one, and so on, until the end of the partitions. The producer would then start over with partition zero and repeat the entire process for all remaining records.

The round-robin approach works well for even distribution of records across partitions. But there’s one drawback. Due to this "fair" round-robin approach, you can end up sending multiple sparsely populated batches. It’s more efficient to send fewer batches with more records in each batch. Fewer batches mean less queuing of produce requests, hence less load on the brokers.
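Below is a minimal sketch of the three partitioning paths, reusing the producer and props objects from the earlier sketches; the topic name and values are assumptions. Note that in Kafka clients 2.4 and later the keyless case uses sticky partitioning rather than pure round-robin, which addresses the drawback described above.

// 1) Explicit partition: this record always goes to partition 0
producer.send(new ProducerRecord<>("my-topic", 0, "key-1", "value-a"));

// 2) Key, no partition: partition = hash(key) % numberOfPartitions
producer.send(new ProducerRecord<>("my-topic", "key-1", "value-b"));

// 3) No key, no partition: the partitioner decides (round-robin in older
//    clients, sticky batching in newer ones)
producer.send(new ProducerRecord<>("my-topic", "value-c"));

// Selecting the sticky partitioner explicitly (available since Kafka 2.4):
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,
    "org.apache.kafka.clients.producer.UniformStickyPartitioner");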
