Yazılım Çorbası: Apache Flink - Stream Processing Engine

Introduction

Apache Flink is used for data streaming.

What Does the Word Flink Mean?

The explanation is as follows:

The name Flink derives from the German word flink which means fast or agile (hence the logo, which is a red squirrel — a common sight in Berlin, where Apache Flink was partially created). Flink sprung from Stratosphere, a research project conducted by several European universities between 2010 and 2014.

Another explanation is as follows:

Apache Flink, known initially as Stratosphere, is a distributed stream processing engine initiated by a group of researchers at TU Berlin. Since its initial release in May 2011, Flink has gained immense popularity in both academia and industry, and it is now one of the best-known stream processing systems.

Some of its features are as follows:

1. Unified streaming and batch APIs

2. Connectivity to one or multiple Kafka clusters

3. Transactions across Kafka and Flink

4. Complex Event Processing

5. Standard SQL support

6. Machine Learning with Kafka, Flink, and Python

Apache Flink vs Apache Spark

The explanation is as follows:

One of the key distinctions between Flink and Spark lies in their approach to data processing. Spark operates primarily as a batch processing system, with streaming capabilities built on top of its batch engine. In contrast, Flink is designed as a true streaming engine, with batch processing treated as a special case of streaming. This makes Flink more adept at handling real-time data, a critical aspect in many data lakehouse use cases.

Flink's event-time processing is another feature that gives it an edge over Spark. While Spark also supports event-time processing, Flink's handling of late events and watermarks is more sophisticated, which is crucial for ensuring accurate real-time analytics.

In terms of fault tolerance, both frameworks offer robust mechanisms. However, Flink's lightweight asynchronous checkpointing mechanism causes less performance impact compared to Spark's more resource-intensive approach.
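The event-time and watermark behavior mentioned above can be illustrated with a small plain-Python model. This is only a sketch under stated assumptions, not Flink's API: the class name `EventTimeWindower` and its fields are hypothetical, the watermark simply trails the largest timestamp seen by a fixed `max_out_of_orderness`, and a window fires once its end is at or before the watermark.

```python
from dataclasses import dataclass, field


@dataclass
class EventTimeWindower:
    """Toy tumbling-window counter driven by event time (hypothetical helper)."""

    window_size: int           # window length, same unit as the timestamps
    max_out_of_orderness: int  # how far the watermark trails the newest event
    watermark: int = -1
    open_windows: dict = field(default_factory=dict)  # window start -> count
    results: list = field(default_factory=list)       # fired (start, count)
    late_events: list = field(default_factory=list)   # events that arrived too late

    def on_event(self, timestamp: int) -> None:
        start = timestamp - timestamp % self.window_size
        if start + self.window_size <= self.watermark:
            # The window this event belongs to has already fired.
            self.late_events.append(timestamp)
            return
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        # The watermark only moves forward and trails the newest timestamp.
        self.watermark = max(self.watermark, timestamp - self.max_out_of_orderness)
        # Fire every window whose end is at or before the watermark.
        for window_start in sorted(self.open_windows):
            if window_start + self.window_size <= self.watermark:
                count = self.open_windows.pop(window_start)
                self.results.append((window_start, count))


windower = EventTimeWindower(window_size=10, max_out_of_orderness=5)
for ts in [1, 4, 12, 3, 17, 2, 25]:
    windower.on_event(ts)
print(windower.results, windower.late_events)
```

In this run the out-of-order event with timestamp 3 is still counted in the first window because the watermark had not passed the window end yet, while the event with timestamp 2 arrives after the window fired and is recorded as late.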

Why Didn't the Classic Streaming Stack Take Off?

There is an interesting explanation here. Flink is counted as a classic streaming technology, and the technologies considered classic only do stream processing.

The classic streaming stack had a good mix of stream processing technologies, from Storm, Samza, Spark Streaming to Flink. They all had their glory days, but somehow, those marvelous technologies failed to reach the great majority of software developers.

In my opinion, the following are the primary reasons for the classic streaming stack not reaching the level of average developers.

- Overly complicated technology: Many stream processing frameworks require you to master specialized skills such as distributed systems and performance engineering.
- Limited only to the JVM: Almost all notable stream processors were built on top of JVM languages like Java and Scala, requiring you to learn JVM specifics like JVM performance tuning and debugging Out of Memory issues.
- Higher footprint on infrastructure: Stream processors demand higher CPU and memory. Also, long-term storage was needed for backfilling and historical analysis.

There are also modern streaming technologies. These are the streaming database products.

Unified streaming and batch APIs

In other words, the same API is used whether the job is batch or streaming.

Complex Event Processing with FlinkCEP

The explanation is as follows:

The goal of complex event processing (CEP) is to identify meaningful events in real-time situations and respond to them as quickly as possible. CEP usually does not send continuous events to other systems but detects when something significant occurs. A common use case for CEP is handling late-arriving events or the non-occurrence of events.

The big difference between CEP and event stream processing (ESP) is that CEP generates new events to trigger action based on situations it detects across multiple event streams with events of different types (situations that build up over time and space). ESP detects patterns over event streams with homogeneous events (i.e. patterns over time). Pattern matching is a technique to implement either approach, but the features look different.

FlinkCEP is an add-on for Flink to do complex event processing. The powerful pattern API of FlinkCEP allows you to define complex pattern sequences you want to extract from your input stream. After specifying the pattern sequence, you apply them to the input stream to detect potential matches. This is also possible with SQL via the MATCH_RECOGNIZE clause.
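To make the idea of deriving a new, higher-level event from a pattern in the input stream more concrete, here is a minimal plain-Python sketch. It does not use the FlinkCEP pattern API. The function name `detect_repeated_failures`, the event types `login_failed` and `login_ok`, and the threshold values are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def detect_repeated_failures(events, threshold=3, within=60):
    """Emit an alert when one user fails to log in `threshold` times
    within `within` time units, with no successful login in between.

    `events` is an iterable of (timestamp, user, event_type) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    alerts = []
    for ts, user, kind in events:
        if kind == "login_ok":
            recent[user].clear()  # a success breaks the pattern
            continue
        if kind != "login_failed":
            continue              # unrelated event types are ignored
        failures = recent[user]
        failures.append(ts)
        # Drop failures that fell out of the time window.
        while failures and ts - failures[0] > within:
            failures.popleft()
        if len(failures) >= threshold:
            # The derived "complex" event: (user, first failure, last failure).
            alerts.append((user, failures[0], ts))
            failures.clear()
    return alerts
```

The input carries many low-level events, but the output only contains an entry when the significant situation occurs, which is the behavior the quoted text attributes to CEP.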

Flink Architecture

The explanation is as follows:

Apache Flink has two main components:

1. JobManager. Its mission is to distribute work onto multiple TaskManagers.
2. TaskManager. Its mission is to perform the work of a Flink job.

JobManager

The explanation is as follows:

The JobManager is the master node that coordinates the execution of Flink applications on the cluster. The JobManager consists of three sub-components:

1. JobGraph Generator: The JobGraph Generator takes the application code and configuration and translates it into a JobGraph, which is a logical representation of the data flow and operators of the application
2. Scheduler: The Scheduler takes the JobGraph and assigns it to one or more TaskManagers based on the available resources and parallelism settings. The Scheduler also handles failures and reschedules tasks if needed
3. Checkpoint Coordinator: The Checkpoint Coordinator periodically triggers checkpoints for the application state and coordinates with the TaskManagers to store the checkpoints in a durable storage such as HDFS or S3

TaskManager

The explanation is as follows:

The TaskManager is the worker node that executes the tasks of Flink applications on the cluster. The TaskManager consists of two sub-components:

1. Task Slot: The Task Slot is a unit of resource allocation that determines how many tasks a TaskManager can run concurrently. Each Task Slot has a fixed amount of CPU cores and memory assigned to it
2. Task: The Task is the actual execution unit that runs the operators and functions of Flink applications on the Task Slot. Each Task has an input buffer, an output buffer, and a state backend that manages its local state
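The relationship between parallelism and task slots described above can be sketched in a few lines of plain Python. This is a deliberately simplified model, not Flink's scheduler: the function `assign_subtasks` is hypothetical, and it assumes each parallel subtask occupies exactly one free slot.

```python
def assign_subtasks(task_managers, parallelism):
    """Assign `parallelism` subtasks to free slots.

    `task_managers` maps a TaskManager name to its number of slots.
    Returns a list of (subtask_index, task_manager, slot_index).
    """
    free_slots = [
        (name, slot)
        for name, slot_count in task_managers.items()
        for slot in range(slot_count)
    ]
    if parallelism > len(free_slots):
        # In this model the job cannot start without enough slots.
        raise ValueError(
            f"need {parallelism} slots but only {len(free_slots)} are available"
        )
    return [
        (index, name, slot)
        for index, (name, slot) in enumerate(free_slots[:parallelism])
    ]
```

For example, with two TaskManagers of two slots each, a job with parallelism 3 fills both slots of the first TaskManager and one slot of the second, while parallelism 5 is rejected.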

Data Model

The explanation is as follows. DataSet is for batch jobs, while DataStream is for streaming jobs.

Flink’s data model is based on two core concepts: DataSet and DataStream.

1. A DataSet represents a static collection of data elements that can be processed in parallel using batch operators.
2. A DataStream represents a dynamic stream of data elements that can be processed in real-time using stream operators.

The explanation is as follows:

Flink can automatically detect whether a data source is bounded or unbounded by checking its properties and behavior. For example, if a file source has an end-of-file marker or a database source has a query limit, Flink will treat it as a bounded source. If a socket source has no end-of-stream signal or a Kafka source has no end-of-partition marker, Flink will treat it as an unbounded source.

Depending on the type of the data source, Flink will apply different execution strategies for DataSet and DataStream. For bounded sources, Flink will execute DataSet operators in batch mode, which means that it will process the entire data set in one go and produce a final result. For unbounded sources, Flink will execute DataStream operators in streaming mode, which means that it will process the data elements as they arrive and produce incremental results.
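The difference between a final result for bounded input and incremental results for unbounded input can be sketched with plain Python generators. This is not Flink code. The function names `running_totals` and `final_total` are hypothetical, and the sketch only shows how the same logic can serve both modes when batch is treated as a stream that ends.

```python
from itertools import count, islice


def running_totals(source):
    """Streaming style: emit an updated total after every element."""
    total = 0
    for value in source:
        total += value
        yield total


def final_total(source):
    """Batch style: consume a bounded source and return one final result."""
    total = 0
    for total in running_totals(source):
        pass  # keep only the last incremental result
    return total


# Bounded input: one final answer.
print(final_total([1, 2, 3]))

# Unbounded input: results keep arriving, so we only look at the first four.
print(list(islice(running_totals(count(1)), 4)))
```

Calling `final_total` on an unbounded source would never return, which is why an engine has to know whether a source is bounded before it chooses an execution strategy.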

Modes

I moved this to the post "Apache Flink Modları" (Apache Flink Modes).

Flink's Place in the Architecture

In the overall architecture, Flink is regarded, in a sense, as a complement to Kafka. The explanation is as follows:

Many complementary stream processing engines like Apache Flink and SaaS offerings have emerged.

Resiliency

The explanation is as follows:

Another interesting topic would be how to be resilient using Apache Flink. This is where checkpoints and savepoints would appear. The current implementations of Checkpoints and Savepoints are pretty much the same. Their main differences are:

Checkpoints

- Their use case is for self healing in case of unexpected job failures
- They are created, owned and released by Flink (without user interaction)
- Don’t survive job termination (except retained Checkpoints)

Savepoints

- Pretty similar to checkpoints but with extra data info
- Their use case is for updates in Flink version, parallelism changes, maintenance windows and so on
- They are created, owned and released by user
- Survive job termination

Checkpoints

The explanation is as follows:

Checkpoints make state in Flink fault tolerant by allowing state and the corresponding stream positions to be recovered, thereby giving the application the same semantics as a failure-free execution.

Savepoints, pretty similar to checkpoints, allow rescaling the cluster or a version upgrade.
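The recovery idea above, where state is restored together with the matching stream position, can be shown with a small plain-Python model. It is a sketch only and does not reflect how Flink implements checkpoints: the class `CountingJob` and its methods are hypothetical, and the input stream is assumed to be replayable from any offset, much like a Kafka topic.

```python
import copy


class CountingJob:
    """Toy stateful job that counts words from a replayable stream."""

    def __init__(self):
        self.offset = 0   # position in the input stream
        self.counts = {}  # operator state: word -> count

    def process(self, stream, upto):
        """Process elements from the current offset up to (not including) `upto`."""
        while self.offset < upto:
            word = stream[self.offset]
            self.counts[word] = self.counts.get(word, 0) + 1
            self.offset += 1

    def checkpoint(self):
        """Snapshot the state together with the stream position."""
        return {"offset": self.offset, "counts": copy.deepcopy(self.counts)}

    @classmethod
    def restore(cls, snapshot):
        """Rebuild a job from a snapshot, discarding later progress."""
        job = cls()
        job.offset = snapshot["offset"]
        job.counts = copy.deepcopy(snapshot["counts"])
        return job


stream = ["a", "b", "a", "c", "a"]

job = CountingJob()
job.process(stream, upto=2)
snapshot = job.checkpoint()
job.process(stream, upto=4)  # progress made after the checkpoint is lost in a crash

recovered = CountingJob.restore(snapshot)
recovered.process(stream, upto=len(stream))
print(recovered.counts)
```

Because the offset is stored with the counts, the elements processed after the snapshot are replayed exactly once after recovery, and the final counts match a run that never failed.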

Stateful Functions

The explanation is as follows:

Stateful Functions is an API that turns Apache Flink into a platform for running event-driven applications and dramatically reduces developers’ headaches. The API provides four major types of components:
- Stateful Function — Stateful event producer and consumer
- Ingress — Input port of the application
- Ingress router — Component that routes ingested events to functions
- Egress — Output port of the application

The general flow is the following:

1. Ingress ingests incoming events into the application.
2. Ingress Router delivers an event to a particular Function.
3. The Function consumes an event and can produce another one.
4. The Function sends an event to the Egress.
5. Egress delivers the message to a message broker.
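The five-step flow above can be modeled in plain Python to show how the pieces relate. This is not the Stateful Functions SDK. The names `GreeterFunction` and `run_application`, the `greet` event type, and the list standing in for the message broker are all assumptions made for the sketch.

```python
from collections import defaultdict


class GreeterFunction:
    """Stateful function: remembers how often each user has been seen."""

    def __init__(self):
        self.visits = defaultdict(int)  # per-user state

    def invoke(self, event):
        user = event["user"]
        self.visits[user] += 1
        # The function consumes one event and produces another one.
        return f"Hello {user}, visit #{self.visits[user]}"


def run_application(incoming_events):
    """Drive events through ingress, router, function, and egress."""
    routes = {"greet": GreeterFunction()}  # ingress router table
    egress = []                            # stands in for a message broker
    for event in incoming_events:          # 1. ingress ingests the event
        function = routes[event["type"]]   # 2. router picks the function
        result = function.invoke(event)    # 3. function consumes and produces
        egress.append(result)              # 4-5. result leaves via the egress
    return egress
```

The state lives inside the function instance rather than in the caller, so the second event for the same user sees the count left behind by the first.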

Connector Structure

1. We can read a message from Kafka, process it, and write the response back to Kafka. This way Flink takes care of long-running operations.

2. Using the Flink JDBC Connector together with windowing, we can read from Kafka and write to the database in batches.
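The second pattern, collecting records per window and writing each window to the database as one batch, can be sketched with the Python standard library. This is not the Flink JDBC Connector. An in-memory SQLite database stands in for the target database, a plain list stands in for the Kafka topic, and the function `write_in_windows` and the table `readings` are hypothetical.

```python
import sqlite3
from itertools import groupby


def write_in_windows(records, connection, window_size=10):
    """Group (timestamp, value) records into tumbling windows and
    write each window with a single batched insert.

    `records` is assumed to be sorted by timestamp.
    Returns the number of batches written.
    """
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings (ts INTEGER, value REAL)"
    )
    batches = 0
    # Records with the same window index belong to the same tumbling window.
    for _, window in groupby(records, key=lambda r: r[0] // window_size):
        connection.executemany("INSERT INTO readings VALUES (?, ?)", list(window))
        connection.commit()  # one commit per window instead of one per record
        batches += 1
    return batches


connection = sqlite3.connect(":memory:")
records = [(1, 1.0), (5, 2.0), (12, 3.0), (25, 4.0), (27, 5.0)]
print(write_in_windows(records, connection))
```

Five records end up in the table, but the database only sees three write operations, which is the benefit of batching per window.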

The flink Command

The run Option

Example

We do it like this:

flink run target/my-flink-project-0.1.jar

Example

We do it like this:

cd /opt/flink/bin && \
./flink run \
  -c com.example.flink.DataStreamJob \
  /opt/flink/jobs/redpanda-flink-mongodb-1.0-SNAPSHOT.jar

Maven Archetype

We do it like this:

mvn archetype:generate \
   -DarchetypeGroupId=org.apache.flink \
   -DarchetypeArtifactId=flink-quickstart-java \
   -DarchetypeVersion=1.13.0 \
   -DgroupId=com.example \
   -DartifactId=my-flink-project \
   -Dversion=0.1 \
   -Dpackage=com.example \
   -DinteractiveMode=false

The explanation is as follows:

This command creates a new Maven project with the Flink dependencies and a sample Flink job.


Monday, October 4, 2021
