Tuesday, May 30, 2023

Snowflake Cloud-native Streaming Database

Introduction
The explanation is as follows:
In recent years, many cloud-native systems have gradually surpassed and subverted their big data system counterparts. One of the most well-known examples is Snowflake. Snowflake brings cloud warehouses to the next level with its innovative architecture: separating storage and computing. It rapidly dominated the market that belonged to the data analytic systems (such as Impala) in the big data era.

What is a Streaming Database?

Introduction
The explanation is as follows:
As their name suggests, streaming databases can ingest data from streaming data sources and make them available for querying immediately. They extend the stateful stream processing and bring additional features from the databases world, such as columnar file formats, indexing, materialized views, and scatter-gather query execution. Streaming databases have two variations: incrementally updated materialized views and real-time OLAP databases.
Stream Processing vs Streaming Database
1. State Persistence
2. Querying the state
3. Placement of state manipulation logic

1. State Persistence
The explanation is as follows. In other words, a streaming database behaves like a database and stores data in structures called segments/pages.
Both technologies are equally capable of ingesting data from streaming data sources like Kafka, Pulsar, Redpanda, Kinesis, etc., producing analytics while the data is still fresh. They also have solid watermarking strategies to deal with late arriving data.

But they are quite different when it comes to persisting the state.

Stream processors partition the state and materialize it into the local disk for performance. This local state is periodically replicated to a remote “state backend” for fault tolerance. This process is called checkpointing in most stateful streaming implementations.

On the other hand, streaming databases follow a similar approach to many databases. They first write the ingested data into disk-backed “segments,” a column-oriented file format optimized for OLAP queries. Segments are replicated across the entire cluster for scalability and fault tolerance.
2. Querying the state
The explanation is as follows. In other words, a streaming database again behaves like a database and, for querying, uses constructs from the database world such as query planners.
Since the state is partitioned across multiple instances, a single stream processor node only holds a subset of the entire state. You must contact multiple nodes to run an interactive query against the full state. In that case, how do you know what nodes to contact?

Fortunately, many stream processors provide endpoints to run interactive queries against the state, such as State stores in Kafka Streams. However, they are not as scalable as they promise. Alternatively, stream processors can write the aggregated state to a read-optimized store, such as a key-value database, to offload the query complexity.

When it comes to querying the state, streaming databases behave similarly to regular OLAP databases. They leverage query planners, indexes, and smart query pruning techniques to improve query throughput and reduce latency.

Once a streaming database receives a query, the query broker scatters it across the nodes hosting the relevant segments. The query is executed locally to the node. The broker gathers the results, stitches them together, and returns them to the caller.
3. Placement of state manipulation logic
The explanation is as follows. In other words, stream processors require writing code; sometimes this can also be handled with SQL. Streaming databases, on the other hand, hand the state manipulation work off to an external application and stay out of it.
Stream processors require you to know the state manipulation logic beforehand and bake it into the data processing flow. For example, to calculate the running total of events, you must first write the logic as a stream processing job, compile it, package it, and deploy it across all instances of the stream processor.

Conversely, streaming databases have an ad-hoc and human-centric approach toward state manipulation. They offload the state manipulation logic to the consumer application, which could be a human, a dashboard, an API, or a data-driven application. Ultimately, consumers decide what to do with the state rather than deciding it beforehand.
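To make the first point concrete: the "running total" mentioned above has to be expressed as a job before deployment. Below is a minimal Kafka Streams sketch, assuming illustrative topic names (events and event-counts); changing the logic later means recompiling, repackaging, and redeploying.
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class RunningTotalJob {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();

    // The aggregation logic is fixed here, at build time
    builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
           .groupByKey()
           .count()                                   // stateful step: running total per key
           .toStream()
           .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "running-total-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    new KafkaStreams(builder.build(), props).start();
  }
}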
When to Use Which
The explanation is as follows. In other words, if we already know how the data will be transformed, stream processors can be a good fit; but if we don't know in advance what we will query, we need a streaming database.
Use stream processors when
Stateful stream processors are good when you know exactly how to manipulate the state ahead of time. 

If you need fast access to the materialized state, you can write the state to a read-optimized database and run queries.

Use streaming databases when
Streaming databases are ideal for use cases where you can’t predict the data access patterns ahead of time. They first capture the incoming streams and allow you to query on demand with random access patterns. Streaming databases are also good when you don’t need heavy transformations on the incoming data, and your pipeline terminates at the serving layer.
Can Stream Processing and Streaming Databases Be Combined?
The explanation is as follows:
Materialize, RisingWave, and DeltaStream are emerging technologies in this space trying to bring stream processing and streaming databases together as a self-serve platform.







Apache Storm - The First Stream Processing System

Introduction
The explanation is as follows. Instead of Hadoop, which works on a batch processing model, it brought the stream processing model.
For many experienced engineers, Apache Storm may be the first stream processing system they have ever used. Apache Storm is a distributed stream computing engine written in Clojure, a niche JVM-based language many junior developers may not know. Storm was open-sourced by Twitter in 2011.

In the big-data era dominated by Hadoop, the emergence of Storm blew many developers’ minds in data processing. Traditionally, the way users process data is first to import a large amount of data into HDFS, and then use batch computing engines such as Hadoop to analyze the data. With Storm, data could be processed on-the-fly, immediately after it flows into the system. With Storm, data processing latency was drastically reduced: Storm users could receive the latest results in just a few seconds without waiting for hours or even days.
Before Apache Storm
The first academic papers appeared in 2002, and several products followed. The explanation is as follows:
Just a few years after being studied in academia, stream processing technology was adopted by large enterprises. The top three database vendors, Oracle, IBM, and Microsoft, consecutively launched their stream processing solutions known as Oracle CQL, IBM System S, and Microsoft SQLServer StreamInsight. Interestingly, instead of developing a standalone stream processing system, all these vendors have chosen to integrate stream processing functionality into their existing systems.

Shortcomings of Apache Storm
The explanation is as follows. Its most important shortcoming was the lack of a SQL interface.
Apache Storm was groundbreaking at its time. However, the initial design of Storm was far from perfect. It lacked many basic functionalities that modern stream processing systems, by default, have to provide: state management, exactly-once semantics, dynamic scaling, SQL interface, etc. But it inspired many talented engineers to build next-generation stream processing systems. Just a few years after Storm emerged, many new stream processing systems were invented: Apache Spark Streaming, Apache Flink, Apache Samza, Apache Heron, Apache S4, and many more. Spark Streaming and Flink eventually stood out and became legends in the stream processing field.
The Relationship Between Apache Kafka and Apache Storm
Kafka is only the message broker; Storm is the part that processes the messages. The flow is as follows:
Realtime application -> Kafka -> Storm -> NoSQL -> d3js

What is Stream Processing?

Introduction
When people hear stream processing, the first things that come to mind are low-latency workloads such as stock trading, fraud detection, and ad monetization.

Is Stream Processing Something New?
Stream processing is not new. The explanation is as follows:
Stream processing has been studied for over twenty years, and several advanced stream processing systems have been developed.
Why Isn't Everyone Doing Stream Processing?
Some of the reasons are as follows:
1. “We Don’t Need Fresh Results”
2. “Stream Processing is Too Costly”
3. “Stream Processing is Challenging to Learn”
4. “Stream Processing Systems Are Tough to Maintain”
Regarding “Stream Processing is Challenging to Learn”: writing complex Java code is no longer necessary. SQL interfaces have been developed for stream processing, and these are what people use now.

Regarding “Stream Processing Systems Are Tough to Maintain”: dynamic scaling and fault recovery mechanisms now exist, and everything is easier. In practice, though, I still think keeping a stream processing system up and running is genuinely painful.

Using Stream Processing to Move Data
Stream processing products can be used to move data from point A to point B, but that is not their primary purpose.

Structure of Stream Processing
The explanation is as follows:
A stream processing application is a DAG (Direct Acyclic Graph), where each node is a processing step. You write a DAG by writing individual processing functions that perform operations when data flow passes through them. These functions can be stateless operations like transformations or filtering or stateful operations like aggregations that remember the result of the previous execution. Stateful stream processing is used to derive insights from streaming data.
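A small sketch of such a DAG using Flink's DataStream API, with a hard-coded input (the event strings and job name are illustrative assumptions): a stateless filter and map are followed by a stateful per-key count.
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DagExample {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.fromElements("page_view:home", "click:buy", "page_view:cart")
       // stateless step: keep only click events
       .filter(event -> event.startsWith("click"))
       // stateless step: map to (eventType, 1)
       .map(event -> Tuple2.of(event.split(":")[0], 1))
       .returns(Types.TUPLE(Types.STRING, Types.INT))
       // stateful step: running count per event type
       .keyBy(t -> t.f0)
       .sum(1)
       .print();

    env.execute("stateless-and-stateful-dag");
  }
}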

Stages of Stream Processing
It consists of two stages.

The ideal flow is:
Ingest -> Enrich -> Transform -> Predict -> Act

1. Stream ingestion (ETL)
Stream ingestion and stream ETL are used synonymously. In the classic sense, data read from a source is processed, transformed, and written to another destination.

2. Streaming analytics
This can be thought of as serving the newest, freshest information. The window can be the last 30 minutes, 1 hour, or 7 days.

Stream Processing vs Streaming Database
The biggest difference is how the data is stored. I moved this to the Streaming Database post.

Stream Processing vs Real Time OLAP
Stream processing questions look like this:
Monitor accounts that make over 100 transactions within a 10-minute period
Monitor the money spent by users in each city within a 1-hour period
Real-time OLAP questions look like this:
Calculate the sales volume of bottled water in the past 1 day
Calculate the number of users in each city in the past 6 hours
Stream Processing and Data Ping Pong
Data ping pong means data unnecessarily leaving the broker it belongs to, getting processed, and then coming back into the broker. In-broker data transformations can be used instead, and WebAssembly (Wasm) can be used for this.








Tuesday, May 23, 2023

Amazon Web Service (AWS) MSK - Managed Streaming for Apache Kafka

Introduction
The explanation is as follows:
Amazon MSK, standing for Amazon Managed Streaming for Apache Kafka, is a fully managed service that makes it easy to build and run applications that use Apache Kafka to process streaming data. However, connecting to MSK from outside of your Virtual Private Cloud (VPC) requires exposing your brokers to the Kafka clients, which may reside in another AWS account.

There are several approaches to achieve it, as mentioned in the AWS article Secure connectivity patterns to access Amazon MSK across AWS Regions.


Monday, May 22, 2023

Google Cloud Pub/Sub

Introduction
The explanation is as follows. I think it is much the same kind of thing as AWS Simple Queue Service.
Pub/Sub is a messaging solution provided by GCP.
Google Cloud Pub/Sub vs Kafka
Google Cloud Pub/Sub provides at-least-once messaging semantics.

Usage
The explanation is as follows:
1. Create a new project in the Google Cloud Console.
2. Enable the Pub/Sub API for your project.
3. Create a new topic.
4. Create a new subscription to that topic.
5. Download a private key file for your service account.
Example - Spring Boot
In the application.properties file we do the following:
spring.cloud.gcp.project-id=your-gcp-project-id
spring.cloud.gcp.pubsub.credentials.location=file:path/to/your/service/account/key.json
Publisher
We do the following:
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PublisherController {

  private final PubSubTemplate pubSubTemplate;

  public PublisherController(PubSubTemplate pubSubTemplate) {
    this.pubSubTemplate = pubSubTemplate;
  }

  @PostMapping("/publish")
  public String publishMessage(@RequestParam("message") String message) {
    pubSubTemplate.publish("my-topic", message);
    return "Message published successfully.";
  }
}
Subscriber
We do the following:
import org.springframework.cloud.gcp.pubsub.support.BasicAcknowledgeablePubsubMessage;
import org.springframework.cloud.gcp.pubsub.support.GcpPubSubHeaders;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.stereotype.Service;

@Service
public class SubscriberService {

  // Listens on the channel defined in PubSubConfig below
  @ServiceActivator(inputChannel = "myInputChannel")
  public void messageReceiver(String payload,
      @Header(GcpPubSubHeaders.ORIGINAL_MESSAGE) BasicAcknowledgeablePubsubMessage message) {
    System.out.println("Message received from Google Cloud Pub/Sub: " + payload);
    // Acknowledge the message upon successful processing
    message.ack();
  }
}
Configure the input channel
We do the following:
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.cloud.gcp.pubsub.core.PubSubTemplate;
import org.springframework.cloud.gcp.pubsub.integration.inbound.PubSubInboundChannelAdapter;
import org.springframework.cloud.gcp.pubsub.support.converter.SimplePubSubMessageConverter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.channel.DirectChannel;
import org.springframework.messaging.MessageChannel;

@Configuration
public class PubSubConfig {
  @Bean
  public MessageChannel myInputChannel() {
    return new DirectChannel();
  }
  @Bean
  public PubSubInboundChannelAdapter messageChannelAdapter(
    @Qualifier("myInputChannel") MessageChannel inputChannel,
    PubSubTemplate pubSubTemplate) {
    PubSubInboundChannelAdapter adapter = new PubSubInboundChannelAdapter(pubSubTemplate,
      "my-subscription");
    adapter.setOutputChannel(inputChannel);
    adapter.setPayloadConverter(new SimplePubSubMessageConverter());
    return adapter;
  }
}
The explanation is as follows:
In the above configuration, we create a PubSubInboundChannelAdapter that listens to the "my-subscription" subscription and forwards the incoming messages to the "myInputChannel" channel.




Thursday, May 18, 2023

Security Assertion Markup Language - SAML

Introduction
The explanation is as follows:
SAML (Security Assertion Markup Language) is another standard for exchanging authentication and authorization data between parties, specifically between an identity provider (IdP) and a service provider (SP). It is commonly used in enterprise applications to provide single sign-on (SSO) functionality.




Wednesday, May 17, 2023

Unicode Planes - Basic Multilingual Plane (BMP)

Introduction
This plane has existed since the first Unicode standard. Later, when Unicode became insufficient, additional planes were added. It corresponds to the first 65,536 or so characters. The explanation is as follows:
The Basic Multilingual Plane (BMP) is the first and most commonly used 65,536 code points of Unicode, covering the codes of the characters for nearly all modern writing even across languages, from English to Mandarin and Hindi.
A further explanation is as follows:
The Unicode Standard has since been changed to allow for characters whose representation requires more than 16 bits. The range of Unicode code points is now U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is called the basic multilingual plane (BMP), and characters whose code points are greater than U+FFFF are called supplementary characters. Such characters are generally rare, but some are used, for example, as part of Chinese and Japanese personal names. To support supplementary characters without changing the char primitive data type and causing incompatibility with previous Java programs, supplementary characters are defined by a pair of Unicode code units called surrogates. 
The numeric values of Unicode characters are grouped as follows: the first 7 bits are the Group, the next 8 bits the Plane, the next 8 bits the Row, and the last 8 bits the Cell. The first plane (group = 0, plane = 0) is called the Basic Multilingual Plane (BMP). The most commonly used characters fall into the Basic Multilingual Plane.

The Basic Multilingual Plane itself consists of sub-blocks. To find out which block a character belongs to, the Character.UnicodeBlock.of(char) method can be used in Java.

Not every character in Turkish belongs to the same Unicode block. Although most are in the BASIC_LATIN block, some characters are in the LATIN_1_SUPPLEMENT and LATIN_EXTENDED_A blocks.
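A quick way to check this in Java (a minimal sketch; the trailing comments show the expected output):
public class UnicodeBlockExample {
  public static void main(String[] args) {
    for (char c : new char[] {'i', 'ı', 'İ', 'I', 'ç', 'ğ'}) {
      System.out.println(c + " -> " + Character.UnicodeBlock.of(c));
    }
    // i -> BASIC_LATIN, ı -> LATIN_EXTENDED_A, İ -> LATIN_EXTENDED_A
    // I -> BASIC_LATIN, ç -> LATIN_1_SUPPLEMENT, ğ -> LATIN_EXTENDED_A
  }
}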

2.1 BMP Basic Latin Block
All characters in the BASIC_LATIN block take up 1 byte with UTF-8 encoding. In other words, this block is the same as the ASCII range 0-127.

2.2 BMP Latin-1 Supplement
The Latin-1 Supplement contains the characters from 128 to 255. This block is for the characters of Western European languages. Unlike Basic Latin, each character requires 8 bits. This block is the same as ISO-8859-1. ISO-8859-1 is not a very recent standard; the Euro character, for example, does not exist in this codec. There are newer codecs, such as ISO-8859-15, that include such characters. Turkish uses the ISO-8859-9 codec.
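For comparison, a small Java sketch showing how many bytes such characters take when encoded as UTF-8: Basic Latin stays at 1 byte, while Latin-1 Supplement and Latin Extended-A characters take 2 bytes.
import java.nio.charset.StandardCharsets;

public class Utf8LengthExample {
  public static void main(String[] args) {
    for (String s : new String[] {"a", "ç", "ğ"}) {
      System.out.println(s + " -> " + s.getBytes(StandardCharsets.UTF_8).length + " byte(s)");
    }
    // a -> 1 byte(s), ç -> 2 byte(s), ğ -> 2 byte(s)
  }
}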

CP1252 is also like Latin-1. The explanation is as follows:
CP1252 is not a multibyte character set. It is a Windows variant of the Latin-1 or ISO-8859-1 character set.
Interesting Latin-1 Characters
The Acute Accent Character
It is displayed as ´.
Name          Code point  Script  Block               UTF-8
GRAVE ACCENT  U+000060    Common  Basic Latin         60
ACUTE ACCENT  U+0000B4    Common  Latin 1 Supplement  C2 B4
The Micro Character
It is displayed as µ.

Other Characters
The table below shows the characters that fall outside the BASIC_LATIN block.
The lowercase "ı" and uppercase "I" are in different blocks. Likewise, the lowercase "i" and uppercase "İ" are in different blocks.

ç LATIN_1_SUPPLEMENT
ğ LATIN_EXTENDED_A
ı LATIN_EXTENDED_A
i BASIC_LATIN
ö LATIN_1_SUPPLEMENT
ş LATIN_EXTENDED_A
ü LATIN_1_SUPPLEMENT

Ç LATIN_1_SUPPLEMENT
Ğ LATIN_EXTENDED_A
I BASIC_LATIN
İ LATIN_EXTENDED_A
Ö LATIN_1_SUPPLEMENT
Ş LATIN_EXTENDED_A
Ü LATIN_1_SUPPLEMENT

2.3 BMP Latin Extended-A
The Latin Extended-A block contains the Unicode characters from 256 to 383.

2.4 BMP General Punctuation
RIGHT-TO-LEFT OVERRIDE
We do the following (JavaScript). The string literal contains the invisible U+202E RIGHT-TO-LEFT OVERRIDE character, which makes the source display misleadingly; the code simply binds String's toString to "hello world!", so calling greet() returns "hello world!".
const greet = "‮".toString.bind("hello world!");
greet(); // "hello world!"

Monday, May 15, 2023

Twitter Snowflake Unique ID Generator

Introduction
Twitter has shut down its Snowflake service, but I wanted to take notes on it because it is still intellectually interesting.
The layout is as follows:
| 1 bit | 41 bits   | 5 bits         | 5 bits     | 12 bits         |
|-------|-----------|----------------|------------|-----------------|
| 0     | timestamp | data center id | machine id | sequence number |
The explanation is as follows:
1. Sign Bit:
The sign bit is never used. Its value is always zero. This bit is just taken as a backup that will be used at some time in the future.

2. Timestamp Bits:
This time is the epoch time. Previously the benchmark time for calculating the time elapsed was from 1st Jan 1970. But Twitter changed this benchmark time to 4th November 2010.

3. Data center Bits:
5 bits are reserved for this, which means that there can be (2⁵) = 32 data centers.

4. Machine ID Bits:
5 bits are reserved for this, which again means that there can be (2⁵) = 32 machines per data center.

5. Sequence No Bits:
These bits are kept to ensure the uniqueness property to be maintained when multiple requests land on the same machine on the same data center at the same timestamp. So, these bits are used for generating sequence numbers for IDs that are generated at the same timestamp. The sequence number is reset to zero at every millisecond. Since we have reserved 12 bits for this, we can have (2¹²) = 4096 sequence numbers which are certainly more than the IDs that are generated every millisecond by every single machine.
Timestamp Bits
The explanation is as follows:
41 bits which gives us 69 years for any custom epoch
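A quick sanity check of that figure: 2^41 milliseconds is roughly 2.2 × 10^12 ms, which is about 69.7 years, so IDs stay unique for roughly 69 years after the chosen custom epoch.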
Sequence No Bits:
The explanation is as follows:
The sequence number is incremented by 1 for every ID generated on the machine. Resets to 0 every millisecond. In theory, we can support 2¹² = 4096 new IDs every millisecond.
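A minimal single-machine Java sketch of the layout described above. The custom epoch constant matches the commonly cited Twitter value, but the class and field names are illustrative, and the sketch ignores the clock-drift handling and coordination a production service would need.
public class SnowflakeIdGenerator {
  // Custom epoch: 4 November 2010, in milliseconds (Twitter's benchmark time)
  private static final long CUSTOM_EPOCH = 1288834974657L;

  private final long datacenterId; // 5 bits -> 0..31
  private final long machineId;    // 5 bits -> 0..31

  private long lastTimestamp = -1L;
  private long sequence = 0L;      // 12 bits -> 0..4095

  public SnowflakeIdGenerator(long datacenterId, long machineId) {
    this.datacenterId = datacenterId;
    this.machineId = machineId;
  }

  public synchronized long nextId() {
    long timestamp = System.currentTimeMillis();
    if (timestamp == lastTimestamp) {
      sequence = (sequence + 1) & 0xFFF;   // wrap at 4096
      if (sequence == 0) {                 // exhausted this millisecond, wait for the next one
        while (timestamp <= lastTimestamp) {
          timestamp = System.currentTimeMillis();
        }
      }
    } else {
      sequence = 0;
    }
    lastTimestamp = timestamp;

    return ((timestamp - CUSTOM_EPOCH) << 22) // 41 bits of timestamp
        | (datacenterId << 17)                // 5 bits of data center id
        | (machineId << 12)                   // 5 bits of machine id
        | sequence;                           // 12 bits of sequence number
  }
}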
Similar Approaches Based On Snowflake
The explanation is as follows:
Baidu UID generator
Sonyflake




Thursday, May 4, 2023

Streaming SQL

Introduction
The explanation is as follows:
Streaming SQL is a type of query language used for analyzing data in real-time streaming systems. It is designed to process continuous data streams from sources such as social media, sensors, or other IoT devices, and allows users to query this data on-the-fly.

Unlike traditional SQL, which operates on static data, streaming SQL queries are designed to process data as it arrives, allowing for real-time analysis and insights. Streaming SQL queries can be used to filter, aggregate, and transform data in real-time, and can be used to trigger alerts or other automated actions based on specific events or conditions.

Some popular streaming SQL technologies include Apache Flink, Apache Kafka, and Apache Storm. These technologies are commonly used in applications such as fraud detection, real-time analytics, and predictive maintenance, among others.
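As a rough illustration of the idea, here is a sketch using Apache Flink's Table API. The orders table, its columns, and the Kafka connector options are assumptions made up for the example; the second query keeps updating its results as new events arrive.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class StreamingSqlSketch {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

    // Hypothetical Kafka-backed source table of order events
    tEnv.executeSql(
        "CREATE TABLE orders (" +
        "  user_id STRING, amount DOUBLE, ts TIMESTAMP(3)," +
        "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
        ") WITH (" +
        "  'connector' = 'kafka', 'topic' = 'orders'," +
        "  'properties.bootstrap.servers' = 'localhost:9092'," +
        "  'properties.group.id' = 'demo', 'scan.startup.mode' = 'earliest-offset'," +
        "  'format' = 'json')");

    // Continuous aggregation over 1-minute tumbling windows
    tEnv.executeSql(
        "SELECT user_id, TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start, SUM(amount) AS total " +
        "FROM orders " +
        "GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)").print();
  }
}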

Wednesday, May 3, 2023

H2 Database INFORMATION_SCHEMA

Introduction
The tables are as follows:
ENUM_VALUES
INDEXES
INDEX_COLUMNS
INFORMATION_SCHEMA_CATALOG_NAME
IN_DOUBT
LOCKS
QUERY_STATISTICS
RIGHTS
ROLES
SESSIONS
SESSION_STATE
SETTINGS
SYNONYMS
USERS
CHECK_CONSTRAINTS
COLUMNS
COLUMN_PRIVILEGES
CONSTRAINT_COLUMN_USAGES
DOMAINS
DOMAIN_CONSTRAINTS
ELEMENT_TYPES
FIELDS
KEY_COLUMN_USAGE
PARAMETERS
REFERENTIAL_CONSTRAINTS
ROUTINES
SCHEMATA
SEQUENCES
TABLES
TABLE_CONSTRAINTS
TABLE_PRIVILEGES
TRIGGERS
VIEWS
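A small sketch that queries one of these views from Java over JDBC, using an in-memory H2 database; the CUSTOMER table is created only for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2InformationSchemaExample {
  public static void main(String[] args) throws Exception {
    // In-memory H2 database
    try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:test", "sa", "");
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE CUSTOMER (ID INT PRIMARY KEY, NAME VARCHAR(100))");
      // List user tables via INFORMATION_SCHEMA.TABLES
      try (ResultSet rs = stmt.executeQuery(
          "SELECT TABLE_NAME, TABLE_TYPE FROM INFORMATION_SCHEMA.TABLES " +
          "WHERE TABLE_SCHEMA = 'PUBLIC'")) {
        while (rs.next()) {
          System.out.println(rs.getString("TABLE_NAME") + " - " + rs.getString("TABLE_TYPE"));
        }
      }
    }
  }
}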