Thursday, August 20, 2020

Apache Kafka

Introduction
There are 4 methods used for Distributed Architecture:
- Streaming
- Sharding
- Lambda
- 3-tier architecture (distributed storage)

Who Developed Kafka?
Kafka was developed by LinkedIn using Java + Scala. The explanation is as follows:
Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to Apache Software Foundation. It is written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency streaming platform for handling and processing real-time data feeds.
Development of Kafka began in 2010. The explanation is as follows:
In 2010, LinkedIn engineers faced the problem of integrating huge amounts of data from their infrastructure into a Lambda architecture. It also included Hadoop and real-time event processing systems. 

As for traditional message brokers, they didn't satisfy LinkedIn's needs. These solutions were too heavy and slow. So, the engineering team developed a scalable and fault-tolerant messaging system without lots of bells and whistles. The new queue manager has quickly transformed into a full-fledged event streaming platform.
Pulsar - A Competitor
Pulsar is another competitor.

Where Does the Name Kafka Come From?
The explanation is as follows. It is, of course, related to Franz Kafka and how the horrors in his novels resemble the horrors of communication in modern enterprises :)
Kafka's name has its origins in the word Kafkaesque which means, according to the Cambridge dictionary, something extremely unpleasant, frightening, and confusing, and similar to situations described in the novels of Franz Kafka.

The communication mess in the modern enterprise was a factor to invent such a tool.
Message Broker
Kafka works with the publish-subscribe model and is frequently used in Big Data projects. At its core, Kafka is a Message Broker. There are other Message Broker technologies as well; some of them are RabbitMQ and ActiveMQ. A sketch of the publish-subscribe flow is below.
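A minimal sketch of this flow with the plain Java client; the topic name "orders", the group id, and the localhost broker address are assumptions:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class PubSubSketch {
  public static void main(String[] args) {
    // Publisher side: send one message to the "orders" topic
    Properties producerProps = new Properties();
    producerProps.put("bootstrap.servers", "localhost:9092");
    producerProps.put("key.serializer", StringSerializer.class.getName());
    producerProps.put("value.serializer", StringSerializer.class.getName());
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      producer.send(new ProducerRecord<>("orders", "order-1", "created"));
    }

    // Subscriber side: an independent consumer polls the same topic
    Properties consumerProps = new Properties();
    consumerProps.put("bootstrap.servers", "localhost:9092");
    consumerProps.put("group.id", "order-consumers");
    consumerProps.put("auto.offset.reset", "earliest");
    consumerProps.put("key.deserializer", StringDeserializer.class.getName());
    consumerProps.put("value.deserializer", StringDeserializer.class.getName());
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
      consumer.subscribe(List.of("orders"));
      for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
        System.out.printf("key=%s value=%s%n", record.key(), record.value());
      }
    }
  }
}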

Zookeeper
To be fault tolerant, Kafka has to work together with ZooKeeper. ZooKeeper is essentially a "configuration store". The explanation is as follows:
First we need to start zookeeper which is a distributed configuration store. Kafka uses this to keep information about which Kafka node is the controller, it also stores the configuration for topics. This is where the status of what data has been read is stored so that if we stop and start we don’t lose any data.

Competitors
Besides Kafka, Kinesis and Storm are also used for streaming. In addition, since Kafka does not adapt very well to cloud architectures, KubeMQ is a competitor too.

Kafka vs JMS
With Fast JMS for Apache Pulsar it is possible to match Kafka's performance. An article comparing these two technologies is here.

StateStore
Besides topics, Kafka also offers a key-value store capability (through Kafka Streams). This is called a StateStore.
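A minimal sketch, assuming a Kafka Streams application: counting events per key materializes a queryable key-value StateStore (the application id, topic, and store names are made up for illustration):
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateStoreSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app"); // assumed app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    StreamsBuilder builder = new StreamsBuilder();
    // Counting events per key materializes a key-value StateStore named "order-counts"
    builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
           .groupByKey()
           .count(Materialized.as("order-counts"));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();

    // The store can then be queried like a local key-value database
    // (in real code, wait until the instance reaches the RUNNING state first)
    ReadOnlyKeyValueStore<String, Long> store = streams.store(
        StoreQueryParameters.fromNameAndType("order-counts",
            QueryableStoreTypes.keyValueStore()));
    System.out.println(store.get("order-1"));
  }
}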

Big Data Processing
There are various options. The grouping is as follows.
Distributed Computing Framework
Apache Spark, Apache Flink, Apache Storm

Distributed Engine Based On a Distributed Cache
Apache Ignite, Hazelcast Jet

Third-Party Library Based On a Distributed Processing System
Kafka Streams, Apache Pulsar Functions
What Is the Difference Between Message Driven and Event Driven?
The two terms are often thought of as the same thing, but there is a small difference. The explanation is as follows: messages are deleted after they are processed, whereas events are retained and can be replayed.
... there are different characteristics that are worth considering:

Messaging: Messages transport a payload and messages are persisted until consumed. Message consumers are typically directly targeted and related to the producer who cares that the message has been delivered and processed.

Events: Events are persisted as a replayable stream history. ... An event is a record of something that has happened and so can’t be changed. (You can’t change history.)
What Is an Event?
The explanation is as follows:
It has a key, value, timestamp, and optional metadata headers. A key is used not only for identification but also for routing and aggregation operations for events with the same key.
The explanation is as follows:
if the message has no key attached, then data is sent using a round-robin algorithm. The situation is different when the event has a key attached. Then the events always go to the partition which holds this key. It makes sense from the performance perspective. We usually use ids to get information about objects, and in that case, it is faster to get it from the same broker than to look for it on many brokers.
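A minimal sketch of such an event built with the Java producer API; the topic, key, and header values are made-up examples:
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.internals.RecordHeader;

public class KeyedEventSketch {
  public static void main(String[] args) {
    // An event = key + value (+ a timestamp, set automatically if not given)
    // + optional metadata headers. Every event with key "user-42" is routed
    // to the same partition of "orders", so its relative order is preserved.
    ProducerRecord<String, String> event =
        new ProducerRecord<>("orders", "user-42", "order created");
    event.headers().add(
        new RecordHeader("source", "checkout-service".getBytes(StandardCharsets.UTF_8)));
    // A record created without a key, new ProducerRecord<>("orders", "order created"),
    // would instead be spread across partitions round-robin style
  }
}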
What Is a Topic?
The explanation is as follows. It is where events are stored:
storage for events. The analogy to a folder in a filesystem, where the topic is like a folder that organizes what is inside. An example name of the topic, which keeps all orders events in the e-commerce system, can be orders
What Is a Partition?
A topic is split into partitions. Each partition is an ordered, append-only log and is the unit of parallelism: different partitions can live on different brokers and be consumed in parallel.

What Is Replication?
Each partition is copied to a configurable number of brokers (the replication factor). One replica is the leader and serves reads and writes; if its broker fails, a follower takes over. This is what makes Kafka fault tolerant.
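A minimal sketch tying these pieces together with the Java AdminClient; the topic name, partition count, replication factor, and broker address are assumptions:
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
  public static void main(String[] args) throws Exception {
    try (AdminClient admin = AdminClient.create(
        Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
      // An "orders" topic with 3 partitions, each partition replicated to 2 brokers
      NewTopic orders = new NewTopic("orders", 3, (short) 2);
      admin.createTopics(List.of(orders)).all().get();
    }
  }
}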

Message Format
Kafka only knows byte arrays; in other words, we use our own serialization format. The explanation is as follows:
Kafka can store and process anything, including XML. The Kafka brokers are dumb. They don't care about data formats. The implementation of Kafka under the hood stores and processes only byte arrays. This approach follows the design principle of dumb pipes and smart endpoints (coined by Martin Fowler for microservice architectures). Dumb brokers are one of the architectural reasons why Kafka scales and performs so well.
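Since the broker only ever sees byte arrays, serialization happens entirely on the client. A minimal sketch of a custom Serializer; the class name and encoding choice are made up for illustration:
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.serialization.Serializer;

// The broker only sees the byte[] this returns; the format is entirely ours
public class UpperCaseStringSerializer implements Serializer<String> {
  @Override
  public byte[] serialize(String topic, String data) {
    if (data == null) return null;
    return data.toUpperCase().getBytes(StandardCharsets.UTF_8);
  }
}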
Avro Serialization
To use Apache Avro serialization we include the following dependency:
<dependency>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-avro-serializer</artifactId>
  <version>5.5.1</version>
</dependency>
Let application.properties be as follows:
# Apache Kafka Configurations
spring.kafka.bootstrap-servers=localhost:9092
spring.kafka.consumer.group-id=person_consumer_app
spring.kafka.topic-name=person_topic

# Schema Registry URL
schema.registry.url=http://localhost:8081

# No. of Concurrent Consumers in the Consumer-Group
spring.kafka.listener.concurrency=3
We do it as follows. Here,
org.apache.kafka.common.serialization.StringSerializer is used for key serialization.
io.confluent.kafka.serializers.KafkaAvroSerializer is used for value serialization.

Additionally, org.apache.kafka.clients.producer.ProducerConfig and
io.confluent.kafka.serializers.KafkaAvroSerializerConfig are used for the props values.
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import io.confluent.kafka.serializers.KafkaAvroSerializerConfig;

@Configuration
public class KafkaProducerConfig {

  @Value("${spring.kafka.bootstrap-servers}")
  private String bootstrapServers;
  @Value("${schema.registry.url}")
  private String schemaRegistryUrl;

  @Bean
  public Map<String, Object> producerConfigs() {
    Map<String, Object> props = new HashMap<>();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
    props.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
    return props;
  }

  @Bean
  public ProducerFactory<String, PersonDto> producerFactory() {
    return new DefaultKafkaProducerFactory<>(producerConfigs());
  }

  @Bean
  public KafkaTemplate<String, PersonDto> kafkaTemplate() {
    return new KafkaTemplate<>(producerFactory());
  }
}
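A minimal usage sketch for the configuration above; the service class, the id field on PersonDto, and its getId() accessor are assumptions:
import org.springframework.beans.factory.annotation.Value;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class PersonProducer {

  @Value("${spring.kafka.topic-name}")
  private String topicName;

  private final KafkaTemplate<String, PersonDto> kafkaTemplate;

  public PersonProducer(KafkaTemplate<String, PersonDto> kafkaTemplate) {
    this.kafkaTemplate = kafkaTemplate;
  }

  public void send(PersonDto person) {
    // The id is used as the record key so that events for the same person
    // always land on the same partition (the getId() accessor is an assumption)
    kafkaTemplate.send(topicName, String.valueOf(person.getId()), person);
  }
}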
File System for Topics
The explanation is as follows:
In Kafka, data for a topic is stored in dedicated files and directories, but as a result, Kafka has trouble scaling because I/O will be scattered across the disk as these files are flushed from the page cache to disk periodically.
Offset
The explanation is as follows:
- the messages don't have separate IDs (they are addressed by their offset in the log).
- the system doesn't check the consumers of each topic or message.
- Kafka doesn't maintain any indexes and doesn't allow random access (it just delivers the messages in order, starting with the offset).
- the system doesn't have deletes and doesn't buffer the messages in userspace (but there are various configurable storage strategies).
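A minimal sketch of addressing messages by offset with the Java consumer; the topic, partition, and offset values are made-up examples:
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "seek-demo");
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition partition = new TopicPartition("orders", 0);
      consumer.assign(List.of(partition));  // manual assignment, no group rebalancing
      consumer.seek(partition, 42L);        // jump straight to offset 42
      // delivery then continues in order from that offset
      for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
      }
    }
  }
}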

Key Partition
A composite key can be used for key-based partitioning, as sketched below.
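A minimal sketch, assuming the key is built by concatenating two illustrative fields (customerId and region):
import org.apache.kafka.clients.producer.ProducerRecord;

public class CompositeKeySketch {
  static ProducerRecord<String, String> recordFor(String customerId, String region,
                                                  String payload) {
    // All records sharing the same (customerId, region) pair hash to the same partition
    String compositeKey = customerId + ":" + region;
    return new ProducerRecord<>("orders", compositeKey, payload);
  }
}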

Retention Period Within the Cluster - Message Replay
In other words, old and already-processed messages can still be accessed for a certain period using a cursor. The explanation is as follows:
The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.
Another explanation is as follows:
... unlike most messaging systems, the message queue in Kafka is persistent. The data sent is stored until a specified retention period has passed, either a period of time or a size limit. The message stays in the queue until the retention period/size limit is exceeded, meaning the message is not removed once it’s consumed. Instead, it can be replayed or consumed multiple times, which is a setting that can be adjusted.
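Retention is configured per topic. A minimal sketch of setting a two-day retention with the Java AdminClient; the topic name and broker address are assumptions:
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionSketch {
  public static void main(String[] args) throws Exception {
    try (AdminClient admin = AdminClient.create(
        Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
      // retention.ms = 2 days; records older than this become eligible for deletion,
      // whether or not they have been consumed
      AlterConfigOp setRetention = new AlterConfigOp(
          new ConfigEntry("retention.ms", String.valueOf(2L * 24 * 60 * 60 * 1000)),
          AlterConfigOp.OpType.SET);
      admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
    }
  }
}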
The replay feature must be used carefully; otherwise it can cause errors. The explanation is as follows:
If you are using replay in Kafka, ensure that you are using it in the correct way and for the correct reason. Replaying an event multiple times that should just happen a single time; e.g. if you happen to save a customer order multiple times, is not ideal in most usage scenarios. Where a replay does come in handy is when you have a bug in the consumer that requires deploying a newer version, and you need to re-processing some or all of the messages.
Log compaction
The explanation is as follows:
Log compaction ensures that Kafka always retains the last known value for each message key within the queue for a single topic partition. Kafka simply keeps the latest version of a message and delete the older versions with the same key.

Log compaction can be seen as a way of using Kafka as a database. You set the retention period to “forever” or enable log compaction on a topic, and voila, the data is stored forever.

An example of where we use log compaction is when we are showing the latest status of one cluster among thousands of clusters running. Instead of storing whether a cluster is responding or not all the time, we store the final status. The latest information is available immediately, such as how many messages are currently in the queue.
In other words, if there are records with the same key, only the latest value is kept and the older ones are deleted. A figure that illustrates this better is here.
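A minimal sketch of creating a compacted topic with the Java AdminClient, in the spirit of the cluster-status example above; all names are assumptions:
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicSketch {
  public static void main(String[] args) throws Exception {
    try (AdminClient admin = AdminClient.create(
        Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
      // Only the latest value per key survives compaction, so the topic
      // behaves like a key-value snapshot of the latest cluster statuses
      NewTopic statusTopic = new NewTopic("cluster-status", 1, (short) 1)
          .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                          TopicConfig.CLEANUP_POLICY_COMPACT));
      admin.createTopics(List.of(statusTopic)).all().get();
    }
  }
}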

Storage Tiers in Kafka
The explanation is as follows. There are two tiers, a local storage tier and a remote storage tier:
Kafka mainly uses disks for log retention. The size and speed of the disks required is dependent on the retention period configured on the data. Because of Kafka underlying ability to replay messages from the start, different applications use the ability of configuring longer retention periods.

However, having longer retention periods also has its own strain on the disk in terms of backups, migration, etc. This increases the risk on consumers on the same cluster which depend on the fairly recent messages and don’t need a longer retention period for more than 2–3 days.

The Kafka team has come up with a concept of local storage tier and remote storage tier to alleviate the problem.

The retention period and the disk type can be configured separately for the different tiers. For example, for the local storage tier, we can use a smaller 100 GB SSD disk with a retention of 2 days. For the remote tier, we can use HBase or S3 with a retention period of 6 months.

For applications which need older data then the data in the local storage, the Kafka brokers will itself talk to the remote tier and fetch the data accordingly. This allows the same Kafka cluster and topics to serve different kinds of applications.

There Can Be Different Subscribers for the Same Topic
The explanation is as follows:
...Kafka terminology, such as Topic, publisher, and subscriber, as well as Kafka’s ability to have different subscribers for the same topic.
Reading messages is the subscriber's job. Kafka does not have a "keep messages until everyone has read them" kind of behavior. The explanation is as follows:
...it is the consumer's responsibility to consume messages when they are available in a topic.

It is analogous to subscribing and watching content from satellite channel. The subscriber should tune to the channel while the content is being played to watch.
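A minimal sketch: two consumers with different group.id values each independently receive every message of the same topic (all names are assumptions):
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoSubscribersSketch {
  static KafkaConsumer<String, String> subscriberFor(String groupId) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", groupId); // a different group gets its own copy of the stream
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(List.of("orders"));
    return consumer;
  }

  public static void main(String[] args) {
    // "billing" and "analytics" each read all of "orders" independently;
    // Kafka tracks a separate offset position per consumer group
    KafkaConsumer<String, String> billing = subscriberFor("billing");
    KafkaConsumer<String, String> analytics = subscriberFor("analytics");
  }
}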
Consumer Group
Moved to the Consumer Group article.

API
Moved to the Kafka API article.

Kafka Connect
Moved to the Kafka Connect article.

Commands
Moved to the Apache Kafka Komutları article.
