Yazılım Çorbası: Google Cloud Spanner

What Is Cloud Spanner

The explanation is as follows. It lets you work with very large amounts of data at global scale with strong transactional consistency.

Cloud Spanner is a fully managed, mission-critical, relational (SQL), globally distributed database with very high availability. It provides strong transactional consistency at global scale. It can scale to petabytes of data with automatic sharding.

Here are some of the important features:
1. Scales horizontally for reads and writes: In comparison, Cloud SQL provides read replicas BUT you cannot horizontally scale write operations with Cloud SQL!
2. Regional and Multi-Regional configurations
3. Expensive (compared to Cloud SQL): You pay for nodes & storage

Spanner SQL

The explanation is as follows:

Although Google Spanner's SQL dialect is inspired by MySQL, it is not fully compatible with it.

Cloud SQL vs Cloud Spanner

See also the Cloud SQL post. The explanation is as follows:

Use Cloud Spanner (expensive) instead of Cloud SQL for relational transactional applications if:

1. You have huge volumes of relational data (TBs) OR
2. You need infinite scaling for a growing application (to TBs) OR
3. You need a global (distributed across multiple regions) database OR
4. You need higher availability (99.999%)

Google TrueTime

Google Cloud Spanner uses Google TrueTime for time synchronization.

Distributed Joins

Distributed joins are commonly considered too expensive to use for real-time transaction processing. That is because, besides joining data, they also frequently require moving or shuffling data between nodes in a cluster, which can significantly affect query response times and database throughput. However, there are certain optimizations that can completely eliminate the need to move data to enable faster joins.

Even so, there are four strategies for performing a distributed join. They are as follows:

1. Shuffle join

2. Broadcast join

3. Co-located join

4. Pre-computed join

The explanation is as follows:

Shuffle and broadcast joins are more suitable for batch or near real-time analytics. For example, they are used in Apache Spark as the main join strategies. Co-located and pre-computed joins are faster and can be used for online transaction processing with real-time applications. They frequently rely on organizing data based on unique storage schemes supported by a database.

Join Steps

A distributed join consists of three steps. The explanation is as follows:

- The first step is to move data between nodes in the cluster, such that rows that can potentially be combined based on a join condition end up on the same nodes. Data movement is usually achieved by shuffling or broadcasting data.
- The second step is to compute a join result locally on each node. This usually involves one of the fundamental join algorithms, such as a nested-loop, sort-merge, or hash join algorithm.
- The last step is to merge or union local join results and return the final result. In many cases, it is possible to optimize a distributed join by eliminating one or even two steps from this process.
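The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database implements it: the cluster is modeled as a list of row lists (one per node), rows are hypothetical `(key, value)` tuples with integer keys, and the function names `move`, `local_hash_join`, and `distributed_join` are made up for this example.

```python
from collections import defaultdict


def move(partitions, num_nodes):
    """Step 1: route every (key, value) row to the node chosen by its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for part in partitions:
        for key, value in part:
            nodes[hash(key) % num_nodes].append((key, value))
    return nodes


def local_hash_join(left_rows, right_rows):
    """Step 2: a plain hash join executed on one node."""
    table = defaultdict(list)
    for key, value in left_rows:
        table[key].append(value)
    return [(key, lv, rv) for key, rv in right_rows for lv in table.get(key, [])]


def distributed_join(left_parts, right_parts, num_nodes):
    left_nodes = move(left_parts, num_nodes)
    right_nodes = move(right_parts, num_nodes)
    local_results = [local_hash_join(l, r) for l, r in zip(left_nodes, right_nodes)]
    # Step 3: union the per-node results into the final answer.
    return [row for result in local_results for row in result]


# Hypothetical sample data: each inner list is the data held by one node.
customers = [[(1, "Ada")], [(2, "Linus")]]
orders = [[(2, "keyboard"), (1, "laptop")], [(1, "mouse")]]
result = distributed_join(customers, orders, num_nodes=2)
print(sorted(result))
```

Skipping `move` corresponds to a co-located join, and skipping both `move` and `local_hash_join` corresponds to a pre-computed join.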

Shuffle join

The explanation is as follows:

A shuffle join re-distributes rows from both tables among nodes based on join key values, such that all rows with the same join key value are moved to the same node. Depending on a particular algorithm used to compute joins, a shuffle join can be a shuffle hash join, shuffle sort-merge join, and so forth.
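As a rough sketch of a shuffle sort-merge join, the Python below re-distributes both tables by join key and counts how many rows had to change nodes. It assumes small non-negative integer keys, the same number of nodes for both tables, and hypothetical function names; real engines shuffle over the network and spill to disk.

```python
def shuffle(partitions):
    """Send each row to node `key % num_nodes`; return new layout and rows moved."""
    num_nodes = len(partitions)
    shuffled = [[] for _ in range(num_nodes)]
    moved = 0
    for source, rows in enumerate(partitions):
        for key, value in rows:
            target = key % num_nodes  # assumes non-negative integer keys
            shuffled[target].append((key, value))
            if target != source:
                moved += 1
    return shuffled, moved


def sort_merge_join(left_rows, right_rows):
    """Local sort-merge join of two lists of (key, value) rows."""
    left_rows, right_rows = sorted(left_rows), sorted(right_rows)
    out, i, j = [], 0, 0
    while i < len(left_rows) and j < len(right_rows):
        lk, rk = left_rows[i][0], right_rows[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key, then pair it
            # with every left row that has the same key.
            j_end = j
            while j_end < len(right_rows) and right_rows[j_end][0] == lk:
                j_end += 1
            while i < len(left_rows) and left_rows[i][0] == lk:
                out.extend((lk, left_rows[i][1], rv) for _, rv in right_rows[j:j_end])
                i += 1
            j = j_end
    return out


def shuffle_sort_merge_join(left_parts, right_parts):
    left_nodes, left_moved = shuffle(left_parts)
    right_nodes, right_moved = shuffle(right_parts)
    rows = [row for l, r in zip(left_nodes, right_nodes) for row in sort_merge_join(l, r)]
    return rows, left_moved + right_moved


# Hypothetical sample data: one inner list per node.
users = [[(1, "Ada"), (2, "Linus")], [(3, "Grace")]]
visits = [[(3, "/home")], [(1, "/docs"), (2, "/blog"), (1, "/faq")]]
joined, rows_moved = shuffle_sort_merge_join(users, visits)
print(joined, rows_moved)
```

In this sample, three of the seven rows cross node boundaries before any joining starts, which is the cost the shuffle step adds.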

Broadcast join

The explanation is as follows:

A broadcast join moves data stored in only one table, such that all rows from the smallest table are available on every node. Depending on a particular algorithm used to compute joins, a broadcast join can be a broadcast hash join, broadcast nested-loop join, and so forth.
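A broadcast hash join can be sketched as follows. The sketch assumes the small table fits in memory on every node; the function name and the `copies_held` counter (total copies of small-table rows across the cluster after the broadcast) are illustrative inventions.

```python
def broadcast_hash_join(big_parts, small_parts):
    """Copy the whole small table to every node; the big table never moves."""
    small_rows = [row for part in small_parts for row in part]

    # Every node receives the same small table and builds a hash table from it.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    results = []
    for big_rows in big_parts:  # each iteration models the work of one node
        results.extend(
            (key, big_value, small_value)
            for key, big_value in big_rows
            for small_value in lookup.get(key, ())
        )

    copies_held = len(small_rows) * len(big_parts)
    return results, copies_held


# Hypothetical sample data: orders keyed by currency id, plus a tiny currency table.
orders = [[(1, "laptop"), (2, "keyboard")], [(1, "mouse"), (3, "desk")]]
currencies = [[(1, "USD")], [(2, "EUR")]]
joined, copies_held = broadcast_hash_join(orders, currencies)
print(joined, copies_held)
```

The cost grows with the size of the small table times the number of nodes, which is why the strategy only pays off when one side is genuinely small.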

Co-located join

The explanation is as follows:

A co-located join does not need to move data at all because data is already stored such that all rows with the same join key value reside on the same node. Data still needs to be joined using a nested-loop, sort-merge, or hash join algorithm.
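A minimal sketch of the idea, assuming both tables are already partitioned the same way so that position `n` in each list is the data held by node `n`. The helper `is_colocated` and the sample data are hypothetical; the local step here uses a nested-loop join, one of the algorithms named above.

```python
def is_colocated(left_nodes, right_nodes):
    """True if every join key lives on exactly one node across both tables."""
    owner = {}
    for node_id, (left_rows, right_rows) in enumerate(zip(left_nodes, right_nodes)):
        for key, _ in left_rows + right_rows:
            if owner.setdefault(key, node_id) != node_id:
                return False
    return True


def colocated_join(left_nodes, right_nodes):
    """No data movement: each node joins only the rows it already stores."""
    results = []
    for left_rows, right_rows in zip(left_nodes, right_nodes):
        # Local nested-loop join on one node.
        results.extend(
            (lk, lv, rv)
            for lk, lv in left_rows
            for rk, rv in right_rows
            if lk == rk
        )
    return results


# Hypothetical sample data, already stored so that matching keys share a node.
customers = [[(2, "Linus")], [(1, "Ada")]]
orders = [[(2, "keyboard")], [(1, "laptop"), (1, "mouse")]]
assert is_colocated(customers, orders)
joined = colocated_join(customers, orders)
print(joined)
```

If `is_colocated` returned `False`, the local joins would silently miss matches, so the storage layout has to guarantee co-location up front.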

Pre-computed join

The explanation is as follows:

A pre-computed join does not need to move data or compute joins locally on each node because data is already stored in a joined form. This type of join skips data movement and join computation and goes directly to merging and returning results.
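One way to picture a pre-computed join is to do the join once at write time and store the already-joined row. The sketch below assumes a hypothetical in-memory `customers` dictionary and integer customer ids; `store_order` and `read_join` are illustrative names, not an API of any database.

```python
from itertools import chain


def store_order(joined_nodes, customers, order):
    """Write path: join at write time and store the joined row on one node."""
    customer_id, item = order
    node = customer_id % len(joined_nodes)  # assumes non-negative integer ids
    joined_nodes[node].append((customer_id, customers[customer_id], item))


def read_join(joined_nodes):
    """Read path: only the merge step remains, with no movement and no join."""
    return list(chain.from_iterable(joined_nodes))


# Hypothetical sample data.
customers = {1: "Ada", 2: "Linus"}
joined_nodes = [[], []]
for order in [(1, "laptop"), (2, "keyboard"), (1, "mouse")]:
    store_order(joined_nodes, customers, order)

result = read_join(joined_nodes)
print(result)
```

The trade-off is that the work shifts to writes: the customer name is duplicated in every joined row and would need updating if it changed.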

All Together

Schematically, it looks as follows. Here, the co-located join and pre-computed join types do not move data between nodes.

Co-located Join with Google Cloud Spanner

The explanation is as follows:

Co-located joins can perform significantly faster than shuffle and broadcast joins because they avoid moving data between nodes in a cluster. To use co-located joins, a distributed database needs to have a mechanism to specify which related data entities must be stored together on the same node. In Google Cloud Spanner, this mechanism is called table interleaving.

Logically independent tables can be organized into parent-child hierarchies by interleaving tables. This results in a data locality relationship between parent and child tables, such that one or more rows from a child table are physically stored together with one row from a parent table. For two tables to be interleaved, the parent table primary key must also be included as the prefix of the child table primary key. In other words, the child table primary key must consist of the parent table primary key followed by additional columns.
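The primary key rule can be checked mechanically, and sorting rows by key gives a feel for why interleaved rows end up next to each other. The Python below is only an illustration under stated assumptions: `can_interleave` is a hypothetical helper that checks the prefix condition alone, the column names and rows are made up, and the sorted list is a mental model rather than Spanner's actual storage format.

```python
def can_interleave(parent_pk, child_pk):
    """True if the parent's primary key columns are a prefix of the child's."""
    return list(child_pk[: len(parent_pk)]) == list(parent_pk)


print(can_interleave(["CustomerId"], ["CustomerId", "OrderId"]))  # True
print(can_interleave(["CustomerId"], ["OrderId", "CustomerId"]))  # False

# Rows keyed by their full primary key value (hypothetical data).
parent_rows = {(1,): "Ada", (2,): "Linus"}
child_rows = {(1, 10): "laptop", (2, 20): "keyboard", (1, 11): "mouse"}

# Because the child key starts with the parent key, ordering all rows by key
# places each parent row directly before its own child rows.
layout = sorted({**parent_rows, **child_rows}.items())
for key, value in layout:
    print(key, value)
```

In this model, a join between a parent row and its children only touches one contiguous key range, which is what makes the co-located join possible.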

Suppose we have the following tables: