Sunday, December 29, 2019

Apache Spark - For Data-Processing-Intensive Applications

Introduction
Development of Spark started in 2009. The explanation is as follows:
Apache Spark is a distributed processing engine that allows you to process large datasets across a cluster of computers. It was developed at the University of California, Berkeley in 2009 and was later donated to the Apache Software Foundation, making it an open-source project.
Instead of the map-reduce model that comes with Hadoop, a scatter-gather approach is used. An article explaining Spark settings is here.


Cost
An article about Spark costs is here. According to that article, jobs run through the SQL interface provided by data warehouses can work out cheaper than running them on Spark.

Connectors
The explanation is as follows. In other words, Apache Spark supports a large number of connectors:
..., Spark can access all kinds of data sources, such as object stores in the cloud (S3, ABS, …), relational databases via JDBC and much more via custom connectors (Kafka, HBase, MongoDB, Cassandra, …). This wide range of connectors, coupled with the extensive support for data transformations, supports both simple data ingestion, pure data transformation or any combination such as ETL and ELT — Spark's flexibility supports very different types of workflows to fit your exact needs.
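As a rough illustration, here is a minimal Java sketch that reads from a relational database over the built-in JDBC source and from an object store; the connection URL, table, credentials, and bucket path are hypothetical placeholders:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("connectors-demo")
    .getOrCreate();

// Relational database via the built-in JDBC source
Dataset<Row> orders = spark.read()
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop") // hypothetical
    .option("dbtable", "public.orders")                   // hypothetical
    .option("user", "reader")
    .option("password", "secret")
    .load();

// Object store in the cloud (here S3 via the s3a connector)
Dataset<Row> events = spark.read().parquet("s3a://my-bucket/events/"); // hypothetical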

Spark and Kubernetes
The explanation is as follows:
In 2016, Google and Chinese technology company Huawei announced a joint project to run Spark on Kubernetes. The project aimed to make it easier for users to run Spark in a container environment and to make Kubernetes a more robust platform for large-scale data processing.
Spark Is For Big Data
Another explanation is as follows:
The primary objective of Spark is processing huge datasets, and it also supports advanced analytics with SQL queries, machine learning and graph algorithms.
Data Model
A Dataset is used as the data model. It gives you an object together with the methods that can be applied to it, as in the sketch below.
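A minimal Java sketch of this idea; the input file and column names are hypothetical:
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("dataset-demo").getOrCreate();

// A Dataset is an object you call methods on; Spark turns the calls into a query plan
Dataset<Row> people = spark.read().json("people.json"); // hypothetical file
people.filter(col("age").gt(21))
      .groupBy("city")
      .count()
      .show();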

Streams API
The explanation is as follows:
The popular Apache Spark analytics engine for data processing provides two APIs for stream processing:

1. Spark Streaming
2. Spark Structured Streaming
The explanation is as follows. In other words, the Streams API is, in a sense, like SQL:
Apache Spark is a powerful framework for building flexible and scalable data processing applications. Spark offers a relatively simple yet powerful API that provides the capabilities of classic SQL SELECT statements in a clean and flexible API. 
The explanation is as follows:
Spark Streaming is a distinct Spark library that was built as an extension of the core API to provide high-throughput and fault-tolerant processing of real-time streaming data. It allows you to connect to many data sources, execute complex operations on those data streams, and output the transformed data into different systems.
Under the hood, Spark Streaming abstracts over the continuous stream of input data with the Discretized Streams (or DStreams) API. A DStream is just a sequence, chunk, or batch of immutable, distributed data structure used by Apache Spark known as Resilient Distributed Datasets (RDDs).
As you can see from the following diagram, each RDD represents data over a certain time interval. Operations carried out on the DStreams will cascade to all the underlying RDDs.
The diagram is as follows:
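To make the DStream model concrete, here is the classic word-count pattern as a Java sketch; the socket host and port are placeholders, and it assumes a plain-text source on that socket:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

SparkConf conf = new SparkConf().setAppName("dstream-wordcount").setMaster("local[2]");
// Each 5-second batch of input becomes one RDD inside the DStream
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
JavaPairDStream<String, Integer> pairs = lines
    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1));
pairs.reduceByKey(Integer::sum).print();

jssc.start();
jssc.awaitTermination(); // blocks; throws InterruptedException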
Programming Language
The explanation is as follows:
Spark Streaming provides API libraries in Java, Scala, and Python. Remember that Kafka Streams only supports writing stream processing programs in Java and Scala.
Flowman
It enables ETL for Apache Spark using YAML definitions.

Windowing support
The explanation is as follows:
Spark Streaming provides support for sliding windowing. This is a time-based window characterized by two parameters: the window interval (length) and the sliding interval at which each windowed operation will be performed. Compare that to Kafka Streams, which supports four different types of windowing.
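In code, those two parameters map directly onto the API. A sketch, reusing the pairs DStream and Durations import from the word-count sketch above:
// Counts over the last 60 seconds, re-evaluated every 10 seconds
JavaPairDStream<String, Integer> windowedCounts = pairs.reduceByKeyAndWindow(
    Integer::sum,           // reduce function applied per key
    Durations.seconds(60),  // window interval (length)
    Durations.seconds(10)); // sliding interval
windowedCounts.print();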
Spark Structured Streaming
The explanation is as follows:
.. a framework built on the Spark SQL engine that helps process data in micro-batches. Unlike Spark Streaming, Spark Structured Streaming processes data incrementally and updates the result as more data arrives.
The explanation is as follows:
Since the Spark 2.x release, Spark Structured Streaming has become the major streaming engine for Apache Spark. It’s a high-level API built on top of the Spark SQL API component, and is therefore based on dataframe and dataset APIs that you can quickly use with an SQL query or Scala operation. Like Spark Streaming, it polls data based on time duration, but unlike Spark Streaming, rows of a stream are incrementally appended to an unbounded input table
The explanation is as follows:
... the programming interface of SQL DataFrames and Structured Streaming DataFrames is not the same. Structured Streaming is a lot more limited in terms of what can be done ...
The diagram is as follows:
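A minimal Java sketch of the incremental model, adapted from the standard streaming word-count pattern; the socket host and port are placeholders:
import java.util.Arrays;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

SparkSession spark = SparkSession.builder().appName("structured-wordcount").getOrCreate();

// Each new line arriving on the socket is a row appended to an unbounded input table
Dataset<Row> lines = spark.readStream()
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load();

Dataset<String> words = lines.as(Encoders.STRING())
    .flatMap((FlatMapFunction<String, String>) l -> Arrays.asList(l.split(" ")).iterator(),
             Encoders.STRING());

// The aggregation result is updated incrementally as new rows arrive
// (start() and awaitTermination() throw checked exceptions in Spark 3.x)
StreamingQuery query = words.groupBy("value").count()
    .writeStream()
    .outputMode("complete")
    .format("console")
    .start();
query.awaitTermination();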


Spark Connect
The explanation is as follows:
The introduction of Spark Connect in v3.4 has brought about a new client-server architecture for Apache Spark.
The diagram is as follows:
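A sketch, assuming the Spark Connect JVM client artifact (spark-connect-client-jvm); the endpoint is a hypothetical placeholder and the builder API may differ slightly between Spark versions:
import org.apache.spark.sql.SparkSession;

// The session is a thin gRPC client; the driver now runs on the server side
SparkSession spark = SparkSession.builder()
    .remote("sc://spark-connect-host:15002") // hypothetical endpoint
    .getOrCreate();

spark.sql("SELECT 1 AS probe").show();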


Storage
Unlike Hadoop, which uses only HDFS, different storage systems can be used; in other words, Spark is storage agnostic. The explanation is as follows:
It integrates easily with HIVE and HDFS and provides a seamless experience of parallel data processing. 
Skewed Join
The explanation of a skewed join is as follows. An article about speeding these joins up is here.
A Dataset is considered to be skewed for a Join operation when the distribution of join keys across the records in the dataset is skewed towards a small subset of keys. For example when 80% of records in the datasets contribute to only 20% of Join keys.
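One common mitigation in Spark 3.x is to let Adaptive Query Execution split the oversized partitions of a skewed join automatically; salting the join keys by hand is the classic alternative. A sketch using the standard configuration keys:
// Enable AQE and its skew-join handling (Spark 3.x configuration keys)
spark.conf().set("spark.sql.adaptive.enabled", "true");
spark.conf().set("spark.sql.adaptive.skewJoin.enabled", "true");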
Spark Application
A typical Spark application looks like the following. Let's call it a "simple load-transform-save job": data is loaded from a data source, transformed, and saved back.
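A minimal Java sketch of such a job; the paths and column names are hypothetical placeholders:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.current_timestamp;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Load
Dataset<Row> input = spark.read().parquet("hdfs:///data/input");
// Transform
Dataset<Row> transformed = input
    .filter(col("amount").gt(0))
    .withColumn("processed_at", current_timestamp());
// Save
transformed.write().mode(SaveMode.Overwrite).parquet("hdfs:///data/output");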

A slightly more complex variant, where "two datasets are merged into one and then saved", looks like this:
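A sketch of that merge variant, assuming the two sources share a compatible schema; the paths are hypothetical:
Dataset<Row> a = spark.read().parquet("hdfs:///data/source-a");
Dataset<Row> b = spark.read().parquet("hdfs:///data/source-b");

// Merge by column name, then save the combined dataset
a.unionByName(b)
 .write().mode(SaveMode.Overwrite).parquet("hdfs:///data/merged");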
Usage

Example
Some of the concepts involved are as follows.

1. HDFS
Short for Hadoop Distributed File System. It is a distributed file system.

2. Apache Hive
The explanation is as follows:
Apache Hive is the database facility running over HDFS. It allows querying data with HQL (SQL-like language).

Regular databases (e.g. PostgreSQL, Oracle) act as an abstraction layer over the local file system, while Apache Hive acts as an abstraction over HDFS. That's it.
3. Spark Workers
The explanation is as follows:
... Apache Spark workers run on multiple nodes and store the intermediate results in RAM. It’s written in Scala but it also supports Java and Python. 
The diagram is as follows:

Here the data producer is Apache Hive, the data consumer is the Aerospike database, and the Apache Spark application is the jar file that we deploy.
With Gradle we do it like this:
ext {
    set('testcontainersVersion', '1.16.2')
    set('sparkVersion', '3.2.1')
    set('slf4jVersion', '1.7.36')
    set('aerospikeVersion', '5.1.11')
}

dependencies {
    annotationProcessor 'org.springframework.boot:spring-boot-configuration-processor'
    implementation('org.springframework.boot:spring-boot-starter-validation') {
        exclude group: 'org.slf4j'
    }
    implementation("com.aerospike:aerospike-client:${aerospikeVersion}") {
        exclude group: 'org.slf4j'
    }
    compileOnly "org.apache.spark:spark-core_2.13:${sparkVersion}"
    compileOnly "org.apache.spark:spark-hive_2.13:${sparkVersion}"
    compileOnly "org.apache.spark:spark-sql_2.13:${sparkVersion}"
    compileOnly "org.slf4j:slf4j-api:${slf4jVersion}"

    testImplementation 'org.apache.derby:derby'
    testImplementation "org.apache.spark:spark-core_2.13:${sparkVersion}"
    testImplementation "org.apache.spark:spark-hive_2.13:${sparkVersion}"
    testImplementation "org.apache.spark:spark-sql_2.13:${sparkVersion}"
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
    testImplementation "org.slf4j:slf4j-api:${slf4jVersion}"
    testImplementation 'org.codehaus.janino:janino:3.0.8'
    testImplementation 'org.testcontainers:junit-jupiter'
    testImplementation 'org.awaitility:awaitility:4.2.0'
    testImplementation 'org.hamcrest:hamcrest-all:1.3'
}
The explanation is as follows:
All Spark dependencies have to be marked as compileOnly. It means that they won't be included in the assembled .jar file. Apache Spark will provide the required dependencies at runtime. If you include them as implementation scope, that may lead to hard-to-track bugs during execution.
If we look at the dependencies, the explanation is as follows:
First come the Apache Spark dependencies. The spark-core artefact is the root. The spark-hive artefact enables retrieving data from Apache Hive. And the spark-sql dependency gives us the ability to query data from Apache Hive using SQL.
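Putting those dependencies to work, a sketch of the job itself: a Hive-enabled session queries the warehouse with plain SQL, and each partition of the result would then be written to Aerospike using the aerospike-client API. The table and column names are hypothetical:
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("hive-to-aerospike")
    .enableHiveSupport() // needs spark-hive on the classpath
    .getOrCreate();

// spark-sql lets us query Apache Hive with plain SQL
Dataset<Row> customers = spark.sql("SELECT id, name FROM warehouse.customers");

customers.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
    // Open one AerospikeClient per partition here and write each row (sketch)
});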








