Tuesday, May 30, 2023

What Is a Streaming Database?

Introduction
The explanation is as follows:
As their name suggests, streaming databases can ingest data from streaming data sources and make them available for querying immediately. They extend the stateful stream processing and bring additional features from the databases world, such as columnar file formats, indexing, materialized views, and scatter-gather query execution. Streaming databases have two variations: incrementally updated materialized views and real-time OLAP databases.
Stream Processing vs Streaming Database
1. State Persistence
2. Querying the state
3. Placement of state manipulation logic

1. State Persistence
The explanation is as follows. In other words, a streaming database behaves like a database and stores data in structures called segments or pages.
Both technologies are equally capable of ingesting data from streaming data sources like Kafka, Pulsar, Redpanda, Kinesis, etc., producing analytics while the data is still fresh. They also have solid watermarking strategies to deal with late arriving data.

But they are quite different when it comes to persisting the state.

Stream processors partition the state and materialize it into the local disk for performance. This local state is periodically replicated to a remote “state backend” for fault tolerance. This process is called checkpointing in most stateful streaming implementations.

On the other hand, streaming databases follow a similar approach to many databases. They first write the ingested data into disk-backed “segments,” a column-oriented file format optimized for OLAP queries. Segments are replicated across the entire cluster for scalability and fault tolerance.
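
The checkpointing that the quote attributes to stream processors can be illustrated with a small sketch. It is not taken from any particular engine: the CheckpointedCounter class, the JSON snapshot file, and the "checkpoint every 3 events" rule are all assumptions made for illustration, and a temporary directory stands in for the remote state backend.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Hypothetical operator: keeps per-key counts locally, snapshots them periodically."""

    def __init__(self, checkpoint_path, checkpoint_every=3):
        self.checkpoint_path = checkpoint_path  # stands in for the remote state backend
        self.checkpoint_every = checkpoint_every
        self.state = {}   # local working state
        self.offset = 0   # number of events applied so far

    def process(self, key):
        self.state[key] = self.state.get(key, 0) + 1
        self.offset += 1
        if self.offset % self.checkpoint_every == 0:
            self.checkpoint()

    def checkpoint(self):
        # Write to a temporary file and rename it, so a crash cannot leave a partial snapshot.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"offset": self.offset, "state": self.state}, f)
        os.replace(tmp_path, self.checkpoint_path)

    @classmethod
    def restore(cls, checkpoint_path, checkpoint_every=3):
        counter = cls(checkpoint_path, checkpoint_every)
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                snapshot = json.load(f)
            counter.state = snapshot["state"]
            counter.offset = snapshot["offset"]
        return counter


def run_demo():
    events = ["a", "b", "a", "c", "a"]
    with tempfile.TemporaryDirectory() as backend_dir:
        path = os.path.join(backend_dir, "checkpoint.json")
        counter = CheckpointedCounter(path)
        for key in events:
            counter.process(key)
        # Simulate a crash: the local state is lost, only the last snapshot survives.
        recovered = CheckpointedCounter.restore(path)
        # Replay the events that arrived after the checkpointed offset.
        for key in events[recovered.offset:]:
            recovered.process(key)
        return recovered.state, recovered.offset
```

Calling run_demo() recovers from the snapshot taken after the third event and replays the remaining two. In a real deployment the replay would come from the source (for example a Kafka offset), which this sketch only imitates with a list slice.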
2. Querying the state
The explanation is as follows. In other words, a streaming database again behaves like a database and answers queries with constructs from the database world, such as query planners.
Since the state is partitioned across multiple instances, a single stream processor node only holds a subset of the entire state. You must contact multiple nodes to run an interactive query against the full state. In that case, how do you know what nodes to contact?

Fortunately, many stream processors provide endpoints to run interactive queries against the state, such as State stores in Kafka Streams. However, they are not as scalable as they promise. Alternatively, stream processors can write the aggregated state to a read-optimized store, such as a key-value database, to offload the query complexity.

When it comes to querying the state, streaming databases behave similarly to regular OLAP databases. They leverage query planners, indexes, and smart query pruning techniques to improve query throughput and reduce latency.

Once a streaming database receives a query, the query broker scatters it across the nodes hosting the relevant segments. The query is executed locally to the node. The broker gathers the results, stitches them together, and returns them to the caller.
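
The scatter-gather flow described above can be sketched roughly as follows. The node names, the SEGMENTS layout, and the country-based pruning metadata are hypothetical; real streaming databases use far richer segment metadata and network calls instead of a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: node name -> rows of the segment that node hosts.
SEGMENTS = {
    "node-1": [{"country": "TR", "amount": 10}, {"country": "DE", "amount": 5}],
    "node-2": [{"country": "DE", "amount": 7}],
    "node-3": [{"country": "US", "amount": 3}, {"country": "TR", "amount": 1}],
}

# Hypothetical segment metadata the broker consults to prune irrelevant nodes.
SEGMENT_COUNTRIES = {
    node: {row["country"] for row in rows} for node, rows in SEGMENTS.items()
}


def execute_locally(node, country):
    """Runs on one node: partial sum and count over its own segment only."""
    amounts = [row["amount"] for row in SEGMENTS[node] if row["country"] == country]
    return {"sum": sum(amounts), "count": len(amounts)}


def broker_query(country):
    """Scatter the query to the relevant nodes, then gather and merge the partials."""
    targets = [node for node, seen in SEGMENT_COUNTRIES.items() if country in seen]
    # max(1, ...) because ThreadPoolExecutor rejects max_workers=0.
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        partials = list(pool.map(lambda node: execute_locally(node, country), targets))
    return {
        "sum": sum(p["sum"] for p in partials),
        "count": sum(p["count"] for p in partials),
        "nodes_contacted": len(targets),
    }
```

For example, broker_query("TR") contacts only node-1 and node-3, because the metadata shows that node-2 holds no matching rows. The merge step works here because sum and count are decomposable; an aggregate such as a median would need a different gather strategy.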
3. Placement of state manipulation logic
The explanation is as follows. In other words, stream processors require you to write code, although SQL can sometimes do the job. Streaming databases, on the other hand, hand the work of manipulating the data to an external application and do not get involved.
Stream processors require you to know the state manipulation logic beforehand and bake it into the data processing flow. For example, to calculate the running total of events, you must first write the logic as a stream processing job, compile it, package it, and deploy it across all instances of the stream processor.

Conversely, streaming databases have an ad-hoc and human-centric approach toward state manipulation. They offload the state manipulation logic to the consumer application, which could be a human, a dashboard, an API, or a data-driven application. Ultimately, consumers decide what to do with the state rather than deciding it beforehand.
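
The contrast in the quote can be shown with a toy example. The first function fixes the running-total logic before any data arrives, as a stream processing job would. The second part captures the raw events and lets the consumer decide on the query later. SQLite is used here only as a stand-in for a queryable store, it is not a streaming database, and the EVENTS data, table name, and column names are made up for the sketch.

```python
import sqlite3

# Hypothetical input stream of (user_id, amount) events.
EVENTS = [("ali", 5), ("ayse", 3), ("ali", 2)]


def running_total_job(stream):
    """Stream-processor style: the aggregation is decided before deployment."""
    total = 0
    for _user_id, amount in stream:
        total += amount
        yield total


def capture_events(stream):
    """Streaming-database style: store the events first, query them later."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", stream)
    return conn


def ad_hoc_query(conn, sql):
    """The consumer supplies the question at query time."""
    return conn.execute(sql).fetchall()
```

With these definitions, list(running_total_job(EVENTS)) can only ever answer the one question it was written for, while the captured table can answer a grand total, a per-user breakdown, or any other query the consumer thinks of afterwards, for example ad_hoc_query(conn, "SELECT user_id, SUM(amount) FROM events GROUP BY user_id ORDER BY user_id").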
When to Use Which
The explanation is as follows. In other words, if we know in advance how the data will be transformed, stream processors may be a good fit; if we do not know what we will query, we need a streaming database.
Use stream processors when
Stateful stream processors are good when you know exactly how to manipulate the state ahead of time. 

If you need fast access to the materialized state, you can write the state to a read-optimized database and run queries.

Use streaming databases when
Streaming databases are ideal for use cases where you can’t predict the data access patterns ahead of time. They first capture the incoming streams and allow you to query on demand with random access patterns. Streaming databases are also good when you don’t need heavy transformations on the incoming data, and your pipeline terminates at the serving layer.
Can Stream Processing and a Streaming Database Be Used Together?
The explanation is as follows:
Materialize, RisingWave, and DeltaStream are emerging technologies in this space trying to bring stream processing and streaming databases together as a self-serve platform.






