Friday, October 1, 2021

Apache Pinot - A Streaming Database for Real-time Analytics

Introduction
The explanation is as follows. It was created at LinkedIn, where it was first used to power user-facing analytics features such as ‘Who Viewed My Profile?’ and ‘Talent Search’.
Apache Pinot was created, as was Kafka, at LinkedIn to power analytics for business metrics and user-facing dashboards. Since then, it has evolved into the most performant and scalable analytics platform for high-throughput event-driven data.
Once it became known that Apache Pinot could comfortably handle petabyte-scale data, it was later adopted at Uber for the “Restaurant Manager” dashboard.

What Is Apache Pinot?
The explanation is as follows. It is an OLAP database that can pull data from many sources, both historical and real-time, and lets us query all of it with SQL.
Apache Pinot is a distributed OLAP store that can ingest data from various sources such as Kafka, HDFS, S3, GCS, and so on and make it available for querying in real-time. It also features a variety of intelligent indexing techniques and pre-aggregation techniques for low latency.
The answers are fast and fresh. The explanation is as follows:
Answers contain fresh data — As soon as the data is ingested, Pinot makes it available for querying, typically within seconds. So, you won’t get any stale data in the answers.

Answers will be quick — Pinot makes sure that you will always get an answer within milliseconds of latency, even when it is super busy or has to scan billions of records to find the answer.

Can answer multiple questions concurrently — You may not be the only one querying Pinot. It could be hundreds or even millions of users querying Pinot concurrently. But Pinot makes sure that it scales and stays available to accommodate all those questions.
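For instance, a typical user-facing query is a filtered aggregation like the one below. This is only a sketch; the pageViews table, its columns, and the epoch-millisecond timestamps are illustrative assumptions, not part of the original example.
-- Top countries by page views over the last hour
-- (now() returns the current time in epoch milliseconds)
SELECT country, COUNT(*) AS views
FROM pageViews
WHERE viewedAt > now() - 3600000
GROUP BY country
ORDER BY views DESC
LIMIT 10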
Which Problems Does Pinot Solve?
1. User-facing analytics
It is used for dashboards and personalization.
The explanation is as follows:
Dashboards — Pinot has been purpose-built to power user-facing applications and dashboards that are supposed to be accessed by millions of users concurrently. While doing so, Pinot maintains stringent SLAs, which are typically in milliseconds range to ensure a pleasant user experience.

Personalization — Apart from that, Pinot is good at performing real-time content recommendations. For example, Pinot powers the news feed of a LinkedIn user, which is based on the impression discounting technique. You can feed clickstream, view stream, and user activity data to Pinot to generate content recommendations on the fly.
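As a rough sketch of that idea, the impression counts that feed such a discounting model could be collected with a query like the one below. The feedImpressions table and its columns are hypothetical.
-- Count how often each member has already seen each item in the
-- last 24 hours, so the feed can down-rank over-shown content
SELECT memberId, itemId, COUNT(*) AS impressions
FROM feedImpressions
WHERE impressedAt > now() - 86400000
GROUP BY memberId, itemId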
2. Ad-hoc querying and exploratory data analysis
It is used by data analysts and data scientists.

3. Operational intelligence and time-series data processing
It is used as a time-series database (TSDB).
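For example, a time-series style query can bucket a metric per minute with Pinot's DATETIMECONVERT function. The sketch below assumes the steps table that is built later in this post.
-- Total steps per one-minute bucket over the last hour
SELECT DATETIMECONVERT(loggedAt, '1:MILLISECONDS:EPOCH', '1:MINUTES:EPOCH', '1:MINUTES') AS minuteBucket,
       SUM(steps) AS totalSteps
FROM steps
WHERE loggedAt > now() - 3600000
GROUP BY DATETIMECONVERT(loggedAt, '1:MILLISECONDS:EPOCH', '1:MINUTES:EPOCH', '1:MINUTES')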

How Does Storage Work?
The explanation is as follows. It uses columnar storage.
Under the covers, it features columnar storage with intelligent indexing techniques and pre-aggregation techniques. This makes Pinot an ideal choice for real-time, low-latency OLAP workloads. For example, BI dashboards, fraud detection, and ad-hoc data analysis are a few use cases where Pinot excels.
The explanation is as follows:
Segments — Raw data ingested by Pinot is broken into small data shards, and each shard is converted into a unit known as a segment. A segment is the centerpiece in Pinot’s architecture which controls data storage, replication, and scaling.

Tables and schemas — One or more segments form a table, which is the logical container for querying Pinot using SQL/PQL. A table has rows, columns, and a schema that defines the columns and their data types.

Tenants — A table is associated with a tenant. All tables belonging to a particular logical namespace are grouped under a single tenant name and isolated from other tenants.

If you are familiar with log-structured storage like Kafka, a segment resembles a physical partition while a table represents a topic. Both topics and tables are expected to grow indefinitely over time. Therefore, they are partitioned into smaller units so that they can be distributed across multiple nodes.
The figure is as follows:


What Are the Architectural Components?
The figure is as follows:
Components
The explanation is as follows. The Broker, Server, and Controller can be seen in the figure above.
A typical Pinot cluster has multiple distributed system components: Controller, Broker, Server, and Minion. In production, they are deployed independently for scalability. 
1. Pinot Controller
The explanation is as follows. In the figure, the Ingestion Job connects to the Controller.
You access a Pinot cluster through the Controller, which manages the cluster’s overall state and health. The Controller provides RESTful APIs to perform administrative tasks such as defining schemas and tables. Also, it comes with a UI to query data in Pinot.
2. Pinot Broker
The explanation is as follows. This is the component that processes queries.
Brokers are the components that handle Pinot queries. They accept queries from clients and forward them to the right servers. They collect results from the servers and consolidate them into a single response that is sent back to the client.
The Broker makes the query execute in a scatter-gather fashion. The explanation is as follows:
Pinot executes queries in a scatter-gather manner, unlike databases that leverage materialized views, where query results have been precomputed.
What Is the Scatter-Gather Execution Model?
The explanation is as follows:
Queries are received by brokers, which check the request against the segment-to-server routing table and scatter the request between real-time and offline servers.

The two server types then process the request by filtering and aggregating the queried data and return the results to the broker. Finally, the broker consolidates each response into one and responds to the client.
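To make this concrete, the comments below walk a single aggregation query (against the steps table built later in this post) through that scatter-gather path.
-- 1. A broker receives this query and consults its
--    segment-to-server routing table
-- 2. The query is scattered to the real-time and offline servers
--    that host the relevant segments
-- 3. Each server filters and aggregates its local segments
-- 4. The broker gathers the partial results, merges them, and
--    returns a single response to the client
SELECT country, SUM(steps) AS total
FROM steps
GROUP BY country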
3. Pinot Server
The explanation is as follows:
Servers host the data segments and serve queries off the data they host. There are two types of servers — offline and real-time.

Offline servers typically host immutable segments. They ingest data from sources like HDFS and S3. Real-time servers ingest from streaming data sources like Kafka and Kinesis.
4. Pinot Minion
The explanation is as follows:
Minion is an optional component that can run background tasks such as “purge” for GDPR (General Data Protection Regulation).
Schema
The explanation is as follows. That is, these are the fields that will be queried with SQL.
When creating a real-time table, there are two things you need to prepare. First, you have to create a schema that describes the fields that you intend to query using SQL. Typically, these schemas are described as JSON, and you can create multiple tables that inherit the same underlying schema. 
The explanation is as follows:
First, we need to create a Schema to define the columns and data types of the Pinot table. In a typical schema, we can categorize columns as follows.

Dimensions: Typically used in filters and group by clauses for slicing and dicing into data.
Metrics: Typically used in aggregations, represents the quantitative data.
Time: Optional column represents the timestamp associated with each row.
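The sketch below, written against the steps schema defined next, shows how each category is typically used in a query.
SELECT country,                    -- dimension: slicing and grouping
       SUM(steps) AS total         -- metric: quantitative aggregation
FROM steps
WHERE loggedAt > now() - 86400000  -- time: filtering by timestamp
GROUP BY country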
Example - steps-schema.json
We do it like this:
{
  "schemaName": "steps",
  "dimensionFieldSpecs": [
    {
      "name": "userId",
      "dataType": "INT"
    },
    {
      "name": "userName",
      "dataType": "STRING"
    },
    {
      "name": "country",
      "dataType": "STRING"
    },
    {
      "name": "gender",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "steps",
      "dataType": "INT"
    }
  ],
  "dateTimeFieldSpecs": [{
    "name": "loggedAt",
    "dataType": "LONG",
    "format" : "1:MILLISECONDS:EPOCH",
    "granularity": "1:MILLISECONDS"
  }]
}  
Table Definition
The explanation is as follows. It specifies whether the table is a real-time or a batch table. The data source definition is made here.
The second thing you need to create is your table definition. The table definition describes what kind of table you want to create, for instance, for real-time or batch. In this case, we’re creating a real-time table, which requires a data source definition so that Pinot can ingest events from Kafka.
It also specifies how the indexing will be done. The explanation is as follows:
The table definition is also where we describe how Pinot should index the data it ingests from Kafka. Indexing is an important topic in Pinot, as with almost any database, but it is especially important when we talk about scaling real-time performance. For example, text indexing is an important part of querying Wikipedia changes. We may want to create a query using SQL that returns multiple different categories using a partial text match. Pinot supports text indexing that makes performance extremely fast for queries that need arbitrary text search.
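As a sketch, assuming a hypothetical wikipediaChanges table whose changeComment column has a text index configured, such a partial text match could look like this. The table and column names are illustrative only.
-- TEXT_MATCH requires a text index on the queried column
SELECT changeId, changeComment
FROM wikipediaChanges
WHERE TEXT_MATCH(changeComment, 'sports OR politics')
LIMIT 10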
Example - steps-table.json
We do it like this:
{
  "tableName": "steps",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "loggedAt",
    "timeType": "MILLISECONDS",
    "schemaName": "steps",
    "replicasPerPartition": "1"
  },
  "tenants": {},
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.topic.name": "steps",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.broker.list": "localhost:9876",
      "realtime.segment.flush.threshold.time": "3600000",
      "realtime.segment.flush.threshold.size": "50000",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  },
  "metadata": {
    "customConfigs": {}
  }
}
Now we can query it like this:
SELECT * FROM steps LIMIT 10;
We can query the records from the last 24 hours like this:
SELECT userName, country, SUM(steps) AS total
FROM steps
-- loggedAt is stored as epoch milliseconds, so subtract 24 hours in milliseconds
WHERE loggedAt > now() - 86400000
GROUP BY userName, country
ORDER BY total DESC
We can query the top 10 users with the highest step totals like this:
SELECT userName, country, SUM(steps) AS total
FROM steps
GROUP BY userName, country
ORDER BY total DESC
LIMIT 10
What Are the Plugins?
The explanation is as follows:
One of the primary advantages of using Pinot is its pluggable architecture. The plugins make it easy to add support for any third-party system, which can be an execution framework, a filesystem, or an input format.

In this tutorial, we will use three such plugins to easily ingest data and push it to our Pinot cluster. The plugins we will be using are 

- pinot-batch-ingestion-spark
- pinot-s3
- pinot-parquet
The pinot-admin Command
The AddTable Option
Example
We do it like this:
bin/pinot-admin.sh AddTable \
 -schemaFile /tmp/fitness-leaderboard/steps-schema.json \
 -tableConfigFile /tmp/fitness-leaderboard/steps-table.json \
 -exec

{"status":"Table steps_REALTIME succesfully added"}
The StartKafka Option
Example
We do it like this:
bin/pinot-admin.sh  StartKafka -zkAddress=localhost:2123/kafka -port 9876

