Sunday, December 29, 2019

Apache Spark - For Data-Processing-Intensive Applications

Introduction
Development of Spark started in 2009. The explanation is as follows:
Apache Spark is a distributed processing engine that allows you to process large datasets across a cluster of computers. It was developed at the University of California, Berkeley in 2009 and was later donated to the Apache Software Foundation, making it an open-source project.
Instead of the map-reduce model that comes with Hadoop, a scatter-gather approach is used. An article explaining Spark configuration settings is here.


Cost
An article about the cost of Spark is here. According to the article, jobs run through the SQL interface provided by data warehouses can turn out cheaper than Spark.

Components

Connectors
The explanation is as follows. In other words, Apache Spark supports a wide range of connectors.
..., Spark can access all kinds of data sources, such as object stores in the cloud (S3, ABS, …), relational databases via JDBC and much more via custom connectors (Kafka, HBase, MongoDB, Cassandra, …). This wide range of connectors, coupled with the extensive support for data transformations, supports both simple data ingestion, pure data transformation or any combination such as ETL and ELT — Spark's flexibility supports very different types of workflows to fit your exact needs.
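As a small illustration, a minimal Java sketch that reads from a relational database over JDBC and writes to an object store; the connection URL, credentials, table, and bucket name are all invented for the example:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConnectorExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("connector-example")
                .getOrCreate();

        // Read from a relational database via the built-in JDBC source
        Dataset<Row> orders = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/shop") // hypothetical database
                .option("dbtable", "public.orders")
                .option("user", "reader")
                .option("password", "secret")
                .load();

        // Write the same rows to an object store through the s3a connector
        orders.write().mode("overwrite").parquet("s3a://my-bucket/orders/"); // hypothetical bucket
    }
}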

Spark and Kubernetes
The explanation is as follows:
In 2016, Google and Chinese technology company Huawei announced a joint project to run Spark on Kubernetes. The project aimed to make it easier for users to run Spark in a container environment and to make Kubernetes a more robust platform for large-scale data processing.
Spark Is for Big Data
Another explanation is as follows:
The primary objective of Spark is processing huge data and also supports advanced analytics with SQL queries, machine learning and graph algorithms.
Data Model
The DataSet is used. It provides an object together with the methods that can be applied to it.
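A minimal sketch of the Dataset API in Java; the input file and column names are made up:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class DatasetExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dataset-example")
                .getOrCreate();

        // A Dataset is an object whose methods (filter, groupBy, ...) describe the computation
        Dataset<Row> people = spark.read().json("people.json"); // hypothetical input
        people.filter(col("age").gt(21))
              .groupBy("city")
              .count()
              .show();
    }
}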

Streams API
The explanation is as follows:
The popular Apache Spark analytics engine for data processing provides two APIs for stream processing:

1. Spark Streaming
2. Spark Structured Streaming
The explanation is as follows. In other words, the Streams API is, in a sense, like SQL.
Apache Spark is a powerful framework for building flexible and scalable data processing applications. Spark offers a relatively simple yet powerful API that provides the capabilities of classic SQL SELECT statements in a clean and flexible API. 
The explanation is as follows:
Spark Streaming is a distinct Spark library that was built as an extension of the core API to provide high-throughput and fault-tolerant processing of real-time streaming data. It allows you to connect to many data sources, execute complex operations on those data streams, and output the transformed data into different systems.
Under the hood, Spark Streaming abstracts over the continuous stream of input data with the Discretized Streams (or DStreams) API. A DStream is just a sequence, chunk, or batch of immutable, distributed data structure used by Apache Spark known as Resilient Distributed Datasets (RDDs).
As you can see from the following diagram, each RDD represents data over a certain time interval. Operations carried out on the DStreams will cascade to all the underlying RDDs.
The diagram is as follows.
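A minimal DStream sketch in Java, a word count over a local socket; the host, port, and 5-second batch interval are arbitrary choices for the example:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class DStreamExample {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("dstream-example").setMaster("local[2]");
        // Every 5-second batch of input becomes one RDD inside the DStream
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
        lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKey(Integer::sum)
             .print();

        ssc.start();
        ssc.awaitTermination();
    }
}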
Programming Language
The explanation is as follows:
Spark Streaming provides API libraries in Java, Scala, and Python. Remember that Kafka Streams only supports writing stream processing programs in Java and Scala.
Flowman
It enables ETL for Apache Spark using YAML.

Windowing support
The explanation is as follows:
Spark Streaming provides support for sliding windowing. This is a time-based window characterized by two parameters: the window interval (length) and the sliding interval at which each windowed operation will be performed. Compare that to Kafka Streams, which supports four different types of windowing.
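In code, the two parameters are passed straight to the windowed operation. Continuing the word-count stream from the DStream sketch above (JavaPairDStream comes from the same package; window and slide durations are arbitrary):
// Counts over the last 30 seconds, recomputed every 10 seconds
JavaPairDStream<String, Integer> windowedCounts =
        lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
             .mapToPair(word -> new Tuple2<>(word, 1))
             .reduceByKeyAndWindow(
                     Integer::sum,
                     Durations.seconds(30),  // window interval (length)
                     Durations.seconds(10)); // sliding interval
windowedCounts.print();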
Spark Structured Streaming
The explanation is as follows:
.. a framework built on the Spark SQL engine that helps process data in micro-batches. Unlike Spark Streaming, Spark Structured Streaming processes data incrementally and updates the result as more data arrives.
The explanation is as follows:
Since the Spark 2.x release, Spark Structured Streaming has become the major streaming engine for Apache Spark. It’s a high-level API built on top of the Spark SQL API component, and is therefore based on dataframe and dataset APIs that you can quickly use with an SQL query or Scala operation. Like Spark Streaming, it polls data based on time duration, but unlike Spark Streaming, rows of a stream are incrementally appended to an unbounded input table.
The explanation is as follows:
... the programming interface of SQL DataFrames and Structured Streaming DataFrames is not the same. Structured Streaming is a lot more limited in terms of what can be done ...
The diagram is as follows.
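A minimal Structured Streaming sketch in Java, the same socket word count but expressed against the unbounded input table (host and port are arbitrary):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamingExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("structured-example")
                .getOrCreate();

        // Each incoming line becomes a new row appended to an unbounded input table
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // The aggregation result is updated incrementally as rows arrive
        Dataset<Row> counts = lines.groupBy("value").count();

        StreamingQuery query = counts.writeStream()
                .outputMode("update")
                .format("console")
                .start();
        query.awaitTermination();
    }
}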


Spark Connect
The explanation is as follows:
The introduction of Spark Connect in v3.4 has brought about a new client-server architecture for Apache Spark.
The diagram is as follows.
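A minimal sketch, assuming the spark-connect-client-jvm artefact is on the classpath and a Spark Connect server is reachable; the host name is made up, and 15002 is the documented default port:
import org.apache.spark.sql.SparkSession;

public class SparkConnectExample {
    public static void main(String[] args) {
        // The client talks to a remote Spark Connect server instead of
        // embedding a full Spark driver in the application process
        SparkSession spark = SparkSession.builder()
                .remote("sc://spark-connect-host:15002") // hypothetical server
                .getOrCreate();

        spark.sql("SELECT 1").show();
    }
}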


Storage
Unlike Hadoop, which is tied to HDFS, different storage systems can be used. In other words, it is storage agnostic. The explanation is as follows:
It integrates easily with HIVE and HDFS and provides a seamless experience of parallel data processing. 
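Being storage agnostic shows up directly in the code: the read calls stay the same and only the URI scheme changes. A fragment with made-up paths, given a SparkSession named spark built with enableHiveSupport():
// Same API, different storage systems - only the scheme differs
Dataset<Row> fromHdfs  = spark.read().parquet("hdfs://namenode:8020/data/events");
Dataset<Row> fromS3    = spark.read().parquet("s3a://my-bucket/data/events");
Dataset<Row> fromLocal = spark.read().parquet("file:///tmp/data/events");

// Hive tables come through the same Dataset API
Dataset<Row> fromHive = spark.table("analytics.events");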
Skewed Join
The explanation of a Skewed Join is as follows. An article about speeding these up is here.
A Dataset is considered to be skewed for a Join operation when the distribution of join keys across the records in the dataset is skewed towards a small subset of keys. For example when 80% of records in the datasets contribute to only 20% of Join keys.
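One remedy since Spark 3 is Adaptive Query Execution, which can split oversized partitions of a skewed join automatically. A fragment showing the relevant settings; the values are the documented defaults, not tuning advice, and the datasets are hypothetical:
// Enable AQE and its skew-join handling (Spark 3.x)
spark.conf().set("spark.sql.adaptive.enabled", "true");
spark.conf().set("spark.sql.adaptive.skewJoin.enabled", "true");
// A partition counts as skewed when it is both skewedPartitionFactor times
// larger than the median partition and above the byte threshold
spark.conf().set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5");
spark.conf().set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m");

Dataset<Row> joined = facts.join(dims, "customer_id"); // hypothetical datasets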
Spark Application
A typical Spark application looks like this. Let's call it a "simple load-transform-save job": data is loaded from a data source, transformed, and saved back.
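A sketch of such a job in Java; paths and column names are invented, and static imports of col and to_date from org.apache.spark.sql.functions are assumed:
// Load from a data source, transform, save the result back
Dataset<Row> raw = spark.read().parquet("s3a://bucket/raw/orders");          // load
Dataset<Row> cleaned = raw.filter(col("amount").gt(0))                       // transform
                          .withColumn("day", to_date(col("ts")));
cleaned.write().mode("overwrite").parquet("s3a://bucket/clean/orders");      // save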

A slightly more involved variant, where "two datasets are merged into one and then saved", looks like this.
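Sketched the same way (the join key and paths are invented):
// Two datasets are merged into one and then saved
Dataset<Row> users  = spark.read().parquet("s3a://bucket/users");
Dataset<Row> orders = spark.read().parquet("s3a://bucket/orders");

Dataset<Row> enriched = orders.join(users, orders.col("user_id").equalTo(users.col("id")));
enriched.write().mode("overwrite").parquet("s3a://bucket/enriched-orders");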
Usage

Example
Some of the concepts involved are as follows.

1. HDFS
It stands for Hadoop Distributed File System. It is a distributed file system.

2. Apache Hive
The explanation is as follows:
Apache Hive is the database facility running over HDFS. It allows querying data with HQL (SQL-like language).

Regular databases (e.g. PostgreSQL, Oracle) act as an abstraction layer over the local file system. While Apache Hive acts as an abstraction over HDFS. That’s it.
3. Spark Workers
The explanation is as follows:
... Apache Spark workers run on multiple nodes and store the intermediate results in RAM. It’s written in Scala but it also supports Java and Python. 
The diagram is as follows.

Here the data producer is Apache Hive, the data consumer is the Aerospike database, and the Apache Spark application is the jar file we deploy.
With Gradle we do it like this:
// Version constants used by the dependency declarations below
ext {
    set('testcontainersVersion', '1.16.2')
    set('sparkVersion', '3.2.1')
    set('slf4jVersion', '1.7.36')
    set('aerospikeVersion', '5.1.11')
}

dependencies {
    annotationProcessor 'org.springframework.boot:spring-boot-configuration-processor'
    implementation('org.springframework.boot:spring-boot-starter-validation') {
        exclude group: 'org.slf4j'   // the Spark runtime brings its own SLF4J
    }
    implementation("com.aerospike:aerospike-client:${aerospikeVersion}") {
        exclude group: 'org.slf4j'
    }
    // Provided by the Spark runtime on the cluster, hence compileOnly
    compileOnly "org.apache.spark:spark-core_2.13:${sparkVersion}"
    compileOnly "org.apache.spark:spark-hive_2.13:${sparkVersion}"
    compileOnly "org.apache.spark:spark-sql_2.13:${sparkVersion}"
    compileOnly "org.slf4j:slf4j-api:${slf4jVersion}"

    // Tests run Spark locally, so the same artefacts are needed on the test classpath
    testImplementation 'org.apache.derby:derby'
    testImplementation "org.apache.spark:spark-core_2.13:${sparkVersion}"
    testImplementation "org.apache.spark:spark-hive_2.13:${sparkVersion}"
    testImplementation "org.apache.spark:spark-sql_2.13:${sparkVersion}"
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
    testImplementation "org.slf4j:slf4j-api:${slf4jVersion}"
    testImplementation 'org.codehaus.janino:janino:3.0.8'
    testImplementation 'org.testcontainers:junit-jupiter'
    testImplementation 'org.awaitility:awaitility:4.2.0'
    testImplementation 'org.hamcrest:hamcrest-all:1.3'
}
The explanation is as follows:
All Spark dependencies have to be marked as compileOnly. It means that they won't be included in the assembled .jar file. Apache Spark will provide the required dependencies at runtime. If you include them in the implementation scope, that may lead to hard-to-track bugs during execution.
If we look at the dependencies, the explanation is as follows:
First come the Apache Spark dependencies. The spark-core artefact is the root. The spark-hive artefact enables retrieving data from Apache Hive. And the spark-sql dependency gives us the ability to query data from Apache Hive with SQL.
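Putting the pieces together, a heavily simplified sketch of such a job: it reads from Apache Hive through spark-hive/spark-sql and writes each partition to Aerospike with the Java client. The namespace, set, table, and column names are all invented for the example:
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HiveToAerospikeJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hive-to-aerospike")
                .enableHiveSupport() // makes Hive tables queryable from Spark
                .getOrCreate();

        // Data producer: Apache Hive, queried with SQL
        Dataset<Row> users = spark.sql("SELECT id, name FROM warehouse.users");

        // Data consumer: Aerospike; one client per partition, so that no
        // connection object has to be serialized to the workers
        users.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            AerospikeClient client = new AerospikeClient("aerospike-host", 3000);
            try {
                while (rows.hasNext()) {
                    Row row = rows.next();
                    Key key = new Key("test", "users", row.getLong(0)); // hypothetical schema
                    client.put(null, key, new Bin("name", row.getString(1)));
                }
            } finally {
                client.close();
            }
        });
    }
}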