Thursday, July 25, 2019

Apache Cassandra - Column-Oriented Database - Availability Matters

Introduction
For Big Data, seven Vs are important
The five Vs of Big Data have expanded to seven – Volume, Velocity, Variety, Variability, Veracity, Visualization, and Value.
The methods used to store big data are as follows
Normalization
Partitioning
Horizontal Sharding
Vertical Sharding
Data Replication
ScyllaDB vs Apache Cassandra
The explanation is as follows
ScyllaDB is also a wide-column store-based NoSQL database like Cassandra and is also API compatible with Amazon DynamoDB and Cassandra. The major difference between ScyllaDB and Cassandra is that ScyllaDB is written in C++ while Cassandra is written in Java, and instead of relying on the page cache it uses its own row cache, making it more optimized and faster than Cassandra.

What Does Column-Oriented DB Mean?

Wide-Column Store
The explanation is as follows. Cassandra supports this; each row can have a different set of columns.
A wide-column store like Apache Cassandra or ScyllaDB allows rows (as minimal units of replication, analogous to rows as records in Postgres) to store a great, but most importantly, variable number of columns.

These are great for the sort of data later to be used in aggregations and statistical analysis where events come in large, occasionally inconsistent batches: a fairly fixed schema but of variable width.

The variable width of rows is the concept that, some might argue, allows flexibility in terms of the events it can store: one event (row) can have fields name(string), address(string), and phone(string), with the next event having name(string), shoe_size(int), and favorite_color(string). Both events can be stored as rows in the same column family (analogous to a table in PostgreSQL or MySQL).
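In CQL this kind of wide partition is usually modeled with a partition key plus a clustering key; a minimal sketch (the sensor_events schema here is hypothetical and only roughly matches the secondary index example later in this post):

-- One partition per sensor, an open-ended number of rows (events) per partition
CREATE TABLE sensor_events (
    sensor_id  text,
    event_time timestamp,
    latitude   double,
    longitude  double,
    payload    text,
    PRIMARY KEY (sensor_id, event_time)  -- sensor_id: partition key, event_time: clustering key
);

-- All events of one sensor live in the same (wide) partition, ordered by event_time
SELECT * FROM sensor_events WHERE sensor_id = 'sensor-1' AND event_time > '2019-07-01';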
The Name Cassandra
The explanation is as follows. I don't know why the name of a seer nobody believed was chosen :)
In Greek Mythology, Cassandra was a priestess of Apollo. Cassandra was cursed to have prophecies that were never to be believed.
Versions
In 2011 there was version 0.7
In 2016 there was version 3.0
In 2021 there is version 4.0

Why It Was Created
The explanation is as follows
Initially developed as an open source alternative to Amazon DynamoDB and Google Cloud Bigtable, Cassandra has had a major impact on our industry.
Cassandra's Best Use Case
The explanation is as follows
When does Cassandra work best? In append-only scenarios, like time-series data or Event Sourcing architecture (e.g. based on Akka Persistence). Be careful, it’s not a general-purpose database.
It is fast for some read operations after data has been appended. The explanation is as follows
It’s a great tool and we like it, but too often we see teams run into trouble using it. We recommend using Cassandra carefully. Teams often misunderstand the use case for Cassandra, attempting to use it as a general-purpose data store when in fact it is optimized for fast reads on large data sets based on predefined keys or indexes. (…)
The reason writes are fast is the storage engine that is used. The explanation is as follows
Storage Engine Type: LSM Tree
As a rule of thumb, LSM Trees are faster for writes than B-Trees (so if your application is write heavy you could consider this option), though for reads it's always recommended to do a benchmark with your particular workload. Old benchmarks have been inconclusive.
Read Speed
The explanation is as follows
In Cassandra, reads can be more expensive than writes due to the distributed nature of the database.

When data is written to Cassandra, it is stored in a partition.

Each partition is replicated across multiple nodes in the cluster to ensure fault tolerance and high availability.

When a write occurs, the data is written to the node responsible for that partition and then propagated to the replicas.

Reads, on the other hand, require coordination between multiple nodes in the cluster.

When a read request is made, Cassandra must first determine which nodes are responsible for the data being requested, and then retrieve the data from those nodes.

This coordination and data retrieval process can be slower and more complex than a write operation, especially if the data being requested is stored across many different nodes.

Write Speed
A question is as follows
Q : Why Writes in Cassandra Are So Fast?
A : To achieve this high performance, Cassandra has a unique write pattern. Cassandra has a few different data structures that it uses:

- Commit Log (Disk)
- Memtable (Memory)
- SSTable (Disk)
All three of these data structures are involved in every write process.
1. The Commit Log is like a WAL. If a write operation is left half-finished, the data can be recovered from this file
2. If the write to the Memtable, i.e. to memory, also succeeds, this node sends an acknowledgment for the request.

Update Scenario
Some questions and answers are as follows. In other words, two clients can concurrently update different columns of the same row.
Q : What if two different clients want to update mobile and email separately in a concurrent fashion?
A : When you create the above object in Cassandra, you issue the following command:

CREATE TABLE user (
    name text PRIMARY KEY,
    mobile text,
    email text,
    address text
);
INSERT and UPDATE requests happen as below:

INSERT INTO user (name, mobile, email, address)
VALUES ('kousik', '9090909090', 'knath@test.com', 'xyz abc');

UPDATE user SET email = 'knath222@test.com' WHERE name = 'kousik';

UPDATE user SET mobile = '7893839393' WHERE name = 'kousik';

Cassandra is designed to handle each column separately. You can issue individual update columns as you do in traditional relational databases. Metadata like last update time is maintained for each column. Update to a particular column for some matching data affects only that column. Thus updates to the columns are finer-grained.
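To see this per-column metadata, the writetime() function returns the timestamp stored for each column; a minimal sketch using the user table above:

-- Each column carries its own timestamp, so the two concurrent updates above do not overwrite each other
SELECT email, writetime(email) AS email_ts, mobile, writetime(mobile) AS mobile_ts
FROM user WHERE name = 'kousik';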
Another question is as follows
Q: What happens if two different clients update the same column for the same key?
A: Cassandra applies Last Write Wins ( LWW ) strategy to resolve the conflicting updates. Since fine-grained granular updates are on individual columns, it’s not practically possible that all clients end up updating the same column concurrently - their updates would be distributed across columns. Thus Cassandra survives conflicting updates even though clocks are coarsely synchronized to NTP, although it’s a good practice to always keep the clocks synchronized to NTP with the highest possible accuracy, a good such article can be found here.

Q. Still there is a chance that Cassandra loses data due to conflicting updates in the same column, right?
A. Yes, technically it’s possible, however, since updates are spread across columns, the effect should be less, better if you have clocks synced to NTP through the appropriate daemon, that should help as well.
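The LWW behaviour can be sketched with explicit client-side timestamps (the USING TIMESTAMP values below are made up for illustration):

-- Two clients write the same column; the write with the higher timestamp wins
UPDATE user USING TIMESTAMP 1564000000000001 SET email = 'client-a@test.com' WHERE name = 'kousik';
UPDATE user USING TIMESTAMP 1564000000000002 SET email = 'client-b@test.com' WHERE name = 'kousik';

-- Returns 'client-b@test.com' regardless of the order in which the replicas received the writes
SELECT email FROM user WHERE name = 'kousik';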

Replication
Note: You can see the Consistent Hashing post.
Beyond the node found via consistent hashing, the data is also copied to some other nodes. Replication always breaks the consistency rule; in other words, Eventual Consistency is provided. On this topic you can also see the BASE Properties For Distributed Database Transactions post. The explanation is as follows
The disadvantage of non-transactional platforms (Cassandra, for example) is that although one could avoid the locking of shared resources by linearly scaling out the data across the cluster (assuring availability), it comes at the cost of consistency.
In addition, leaderless replication is used for replication. Wherever there is replication, conflict resolution is also required. The explanation is as follows
Replication:

Replication Type: Leaderless Replication.

Considering you have n replicas, every write will eventually go to all the nodes, but you can decide a number w, which is the number of replicas that must be synchronously updated for every write request. Now if your application is write heavy, you could make w=1; then, to offset the inconsistency this creates until eventual consistency is reached, every read request will read from a larger number of nodes (let's call this number r). Generally it's suggested to keep w+r>n (and this in no way ensures strong consistency), but you can go for w+r<n if availability is what is more important for your application. The point here is that you can move across a spectrum from very high availability/very low consistency to very low availability/high consistency by tweaking w and r.

Consistency: Eventual Consistency, it’s imperative to note here that irrespective of what database configuration parameters you give, strong consistency is just not guaranteed in cassandra. This is another characteristic that can rule out this database.

Conflict Resolution: A leaderless model calls for conflict resolution. The supported strategy is LWW (Last Write Wins). Will this do?
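In cqlsh the w/r trade-off above surfaces as the per-request consistency level; a minimal sketch using the user table from the update example:

CONSISTENCY QUORUM;  -- a majority of replicas must acknowledge (more consistency)
INSERT INTO user (name, email) VALUES ('kousik', 'knath@test.com');

CONSISTENCY ONE;     -- a single replica is enough (more availability, lower latency)
SELECT email FROM user WHERE name = 'kousik';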
Single Point of Failure
The explanation is as follows.
“There isn’t any central node in Cassandra. Every node is a peer, there is no master – there is no single point of failure.”
The explanation is as follows
It is easy to achieve strong consistency in master based distributed systems. However, it also means that there is a compromise on the system's availability if the master is down. Cassandra is a master-less system and trades-off availability over consistency. It falls under the AP category of the CAP theorem, and hence is highly available and eventually consistent by default.
What Is Sharding?
There are two kinds of sharding
1. Vertical Sharding
It can be thought of as storing tables in different databases so that they can be read more easily.

2. Horizontal Sharding
It can be thought of as splitting the same table into N pieces and storing them on different machines.

What Is Horizontal Sharding?
Cassandra supports this. The table is split into pieces and stored in different databases. The Primary Key or an indexed field is used as the sharding key. The explanation is as follows.
For sharding the data, a key is required and is known as a shard key. This shard key is either an indexed field or an indexed compound field that exists in every document in the collection.
Things To Watch Out For With Sharding
Redistributing the data once again brings extra load. The explanation is as follows.
Once sharding is employed, redistributing data is an important problem. Once your database is sharded, it is likely that the data is growing rapidly. Adding an additional node becomes a regular routine. It may require changes in configuration and moving large amounts of data between nodes. It adds both performance and operational burden.
Cassandra does not worry about this. The explanation is as follows
Rebalancing: Cassandra uses a strategy to make the number of partitions proportional to the number of nodes. If a new node is added, some partitions are chosen to be split in half and are transferred to this new node. This very closely resembles ‘consistent hashing’.
Keyspace
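A keyspace mainly carries the replication settings discussed above; a minimal sketch (the keyspace and datacenter names are hypothetical):

CREATE KEYSPACE IF NOT EXISTS my_keyspace
WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 3};

USE my_keyspace;  -- tables such as user or sensor_events are created inside a keyspace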
The figure is as follows


Primary Key - Exists
I moved this to the Primary Key post

Foreign Key - Does Not Exist
The explanation is as follows
There is no foreign key in Cassandra. As a result, it does not provide the concept of Referential Integrity.
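Since there is no foreign key or join, related data is usually denormalized into query-specific tables; a minimal sketch with hypothetical tables:

-- The same order data is duplicated into two tables, one per query pattern
CREATE TABLE orders_by_user (
    user_name text,
    order_id  uuid,
    total     decimal,
    PRIMARY KEY (user_name, order_id)
);

CREATE TABLE orders_by_id (
    order_id  uuid PRIMARY KEY,
    user_name text,
    total     decimal
);

-- The application writes to both tables itself; Cassandra does not enforce referential integrity between them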
Secondary Index
The explanation is as follows
Secondary Indexes: Partitioning of secondary indexes is by document. Each partition maintains its own secondary index. Write only needs to deal with the partition in which you are writing the document. Also called local index. Reading requires scatter-gather. Read queries need to be made to all secondary indexes. Thus the read queries are quite expensive. Even parallel queries are prone to tail latency amplification.
The explanation is as follows
If the query needs to be performed on a column that is not the partition or clustering key, we can create a secondary index

A secondary index is stored in a separate column family. It works best for columns with a medium level of distinct values, and it is not replicated to other nodes. Keep in mind that as the volume of data increases, secondary index queries become slower, and they should not be used on frequently updated columns.
Example
We do it like this
CREATE INDEX latitude_index ON sensor_events(latitude);

SELECT * FROM sensor_events WHERE latitude=48.95562;

Cassandra File Locking
Cassandra locks files only for write purposes.

Lightweight Transactions
The explanation is as follows
Paxos has been a long-established consensus protocol and was adopted by Cassandra in 2013 for what was called “lightweight transactions.” Lightweight because it ensures that a single partition data change is isolated in a transaction, but more than one table or partition is not an option. In addition, Paxos requires multiple round trips to gain a consensus, which creates a lot of extra latency and fine print about when to use lightweight transactions in your application.

The Raft protocol was developed as the next generation to replace Paxos and several systems such as Etcd, CockroachDB and DynamoDB adopted it. It reduced round trips by creating an elected leader.

The downside for Cassandra in this approach is that leaders won’t span data centers, so multiple leaders are required (see Spanner). Having an elected leader also violates the “shared-nothing” principles of Cassandra and would layer new requirements on handling failure. If a node goes down, a new leader has to be elected.
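At the CQL level a lightweight transaction is a compare-and-set expressed with IF; a minimal sketch using the user table from earlier:

-- Insert only if the row does not exist yet (a Paxos round runs under the hood)
INSERT INTO user (name, mobile, email, address)
VALUES ('kousik', '9090909090', 'knath@test.com', 'xyz abc')
IF NOT EXISTS;

-- Conditional update: only applied if the current value still matches
UPDATE user SET email = 'knath222@test.com'
WHERE name = 'kousik'
IF email = 'knath@test.com';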
ACID Transactions - New
The explanation is as follows. You can see the Accord Consensus Algorithm post
ACID transactions are coming to Apache Cassandra. Globally available, general-purpose transactions that work the way Cassandra works. This isn’t some trick with fine print or application of some old technique.

It’s due to an extraordinary computer science breakthrough called Accord (pdf) from a team at Apple and the University of Michigan. 
ACID Transactions - Old
Previously, Apache Cassandra did not support ACID Transactions. The explanation is as follows
Cassandra does not provide ACID properties. It only provides the AID properties.
The explanation is as follows
Cassandra provides weak transactions.

Atomicity: Provided across a single node. Not provided when several statements execute across multiple nodes. This means ‘all or none’ doesn’t really exist if a transaction spans multiple nodes.

Consistency: Implements Paxos, which is an implementation of total order broadcast. TOB guarantees the same order of operations across replicas but doesn’t guarantee when the messages will be delivered, so you can see stale values on some replicas.

Isolation: Paxos provides isolation in Compare and set operations.

Durability: Provides durability using multiple replicas.
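On the CQL side, the closest thing to ‘all or none’ is a logged batch; a minimal sketch (it groups mutations, it is not a general-purpose ACID transaction):

BEGIN BATCH
    UPDATE user SET email  = 'knath222@test.com' WHERE name = 'kousik';
    UPDATE user SET mobile = '7893839393' WHERE name = 'kousik';
APPLY BATCH;
-- A logged batch guarantees that all mutations are eventually applied;
-- isolation is only provided when the rows belong to the same partition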
Write Operation
The explanation is as follows
Before delving into the various steps employed in writing data in Cassandra, let us first learn some of the key terms. They are:

Commit log: The commit log is basically a transactional log. It’s an append-only file. We use it when we encounter any system failure, for transactional recovery. Commit log offers durability.

Memtable: Memtable is a memory cache that stores a copy of the data in memory. It collects writes and serves reads for data that has not yet been stored to disk. Generally, each node has a memtable for each CQL table.

SSTable (Sorted Strings Table): These are the immutable, actual files on the disk. This is a persistent file format used by various databases to take the in-memory data stored in memtables.

It then orders it for fast access and stores it on disk in a persistent, ordered, immutable set of files.

Immutable means SSTables are never modified. It is the final destination of the data in memtable.
The explanation is as follows
1. First of all, we can send writes to any random node in the cluster (called the Coordinator Node).
2. We then write to the commit log, and then the data is written to a memory structure called the memtable. The memtable stores writes, sorted, until it reaches a configurable limit, and then flushes them.
3. Every write includes a timestamp.
4. We put the memtable in a queue when the memtable content exceeds the configurable threshold or the commit log space, and then flush it to the disk (SSTable).
5. The commit log is shared among tables. SSTables are immutable, and we cannot write to them again after flushing the memtable. Thus, a partition is typically stored across multiple SSTable files.
The figure is as follows


Read Operation
The explanation is as follows
Read operation in Cassandra takes O(1) complexity.
The explanation is as follows
1. First of all, Cassandra checks whether the data is present in the memtable. If it exists, Cassandra combines it with the data in the SSTables and returns the result.
2. If the data is not present in the memtable, Cassandra will try to read it from the SSTables, using various optimisations.
3. After that, Cassandra checks the row cache. The row cache, if enabled, stores in memory a subset of the partition data stored on disk in the SSTables.
4. Then it uses the bloom filter (which indicates whether a partition key may exist in an SSTable) to determine if a particular SSTable contains the key.
5. If the bloom filter indicates the key may exist in an SSTable, Cassandra subsequently checks the key cache. The key cache is an off-heap memory structure that stores the partition index.
6. If a partition key is present in the key cache, the read process skips the partition summary and partition index and goes directly to the compression offset map.
7. Once the compression offset map identifies the key, the desired data can be fetched from the correct SSTable.
8. If the data is still not available, the coordinator node will request a read repair.
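The read path can be observed per query with cqlsh request tracing; a minimal sketch (using the hypothetical sensor_events table from earlier):

TRACING ON;  -- cqlsh prints the trace events for the query (replicas contacted, SSTables read, etc.)
SELECT * FROM sensor_events WHERE sensor_id = 'sensor-1';
TRACING OFF;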
The figure is as follows


Cassandra Query Language - CQL
I moved this to the Cassandra Query Language - CQL post
