Yazılım Çorbası: Data Models

Giriş

Yazıyı (10 Data Models Every Data Engineer Must Know (Before They Break Production)) ilk olarak burada gördüm.

OLTP

PostgreSQL, MySQL, Oracle gibi veri tabanları. Bunları tasarlarken normalization yapılır. Bunlar için

2000'li yıllarda ER Diagrams ve 3NF (Third Normal Form) kullanılıyordu. Halen de kullanılıyor

OLAP

Şu sorulara cevap vermek zor olduğu için OLAP veri tabanlarına ihtiyaç var.

-- Revenue by product category for Q3 2024
-- Requires joining 5 tables just to get to the numbers
SELECT
    p.category_id,
    SUM(oi.quantity * oi.unit_price) AS total_revenue
FROM order_items oi
JOIN orders o      ON oi.order_id    = o.order_id
JOIN customers c   ON o.customer_id  = c.customer_id
JOIN products p    ON oi.product_id  = p.product_id
JOIN payments pay  ON o.payment_id   = pay.payment_id
WHERE o.order_date BETWEEN '2024-07-01' AND '2024-09-30'
  AND pay.status = 'completed'
GROUP BY p.category_id;

OLAP veri tabanı için iki tane yöntem var

1. Road A: Inmon — Build the Enterprise Core First

Veri pek çok farklı sistemden geliyor ve aynı şey için farklı ID, ve isimler verilmiş. Bill Inmon şunu teklif ediyor.

His suggestion is to build a single, normalized enterprise data warehouse that integrates all source systems first — still in 3NF, same structural logic as OLTP — and treat that as the only one authoritative version of Mr. Data for the whole company.

Next, we create data marts using that clean core as the foundation. These are smaller analytical views designed for specific teams. For example, the e-commerce team gets a mart shaped for their specific needs, and finance gets another. We make sure Mr. Data remains one consistent record, so he stays the same in every report the company runs.

2. Road B: Kimball — Flatten It, Ship It, Iterate

Açıklaması şöyle

Ralph Kimball looked at the same problem and started from the other end: what do analysts actually ask?

He noticed that every analytical question has the same shape: How much of X, by Y, over time Z?

- Revenue by product category by quarter.
- Returns by region by month.
- Orders by customer segment by year.

The “how much” is always a number being aggregated. The ‘by’ clauses are your filters and groupings.

Therefore, Kimball thought that we could build the schema around that shape, so that analysts don’t need to reconstruct it through multiple joins.

Açıklaması şöyle

- The “how much” becomes a fact table — one row per order line item, holding Mr. Data’s quantities, prices, and totals.
- The “by” dimensions — product, customer, date — become dimension tables surrounding it.
The technique that makes this possible is denormalization — the opposite of the OLTP and Inmon approach.

Yani ortaya fact tablosunu alıyoruz ve dimension tabloları ile JOIN yapıyoruz

Örnek

Şöyle yaparız. Buna start schema deniyor

-- Revenue by product category for 2024
-- Two joins. That's it.
SELECT
    p.category,
    d.quarter,
    SUM(f.total_revenue) AS revenue
FROM fact_orders f
JOIN dim_product  p ON f.product_key = p.product_key
JOIN dim_date     d ON f.date_key    = d.date_key
WHERE d.year = 2024
GROUP BY p.category, d.quarter
ORDER BY revenue DESC;

Açıklaması şöyle

When we put the fact table in the center and the dimensions around it, we get a star schema. Mr. Data’s order still exists, but now it’s been flattened and reshaped so an analyst can reach it in two joins instead of five.

10. Star Schema: The Legacy Workhorse (That Fails at Scale)

2010'lu yıllardan itibaren Star Schemas and Dimensions çıktı. Burada amaç correctness değil, amaç analitik işler için kolay sorgulama. Açıklaması şöyle.

Star schemas are intuitive and analyst-friendly, but at scale they become a performance bottleneck, especially with massive fact tables, high-cardinality dimensions, and near-real-time workloads.

Fact Table:

Event'leri içerir

Dimension Table

Eventiçindeki descriptions bilgisini içerir.

Örnek

Normal dimension

Suppose you have customers.

Fact table

CustomerKey Amount
1                100
2                50

Customer dimension

CustomerKey Name City
1                Alice Ankara
2                Bob        İzmir

Degenerate dimension eğer event içindeki bir bilgi için dimension yani description bulamıyorsak, fact tablosunda olduğu gibi saklanması anlamına geliyor. OrderNumber, InvoiceNumber, ReceiptNumber,

TransactionId gibi bilgiler dimension tablosuna girmez çünkü detaylandıracak bir şey yok. Bu yüzden Fact tablosunda kalmaya devam eder

9. Snowflake Schema: Over-Engineered & Slow

Açıklaması şöyle.

Snowflake schemas optimize storage, not query performance. In modern analytics (cloud OLAP, dashboards, ad-hoc queries), compute is the bottleneck, not disk. Excessive normalization explodes join depth and kills latency.

8. Data Vault: The Enterprise Monster (When You Need Auditability)

2020'li yıllarda çıktı. Açıklaması şöyle.

Data Vault excels at auditability, lineage, and full historization, critical for regulated industries (banking, healthcare). But its multi-layer architecture makes it fundamentally unsuited for low-latency analytics.

Burada veri kaynağının çok sık değişmesi ve audit, tarihçe istekleri önemli

7. Wide-Column Stores (Cassandra, Bigtable) for Time-Series Chaos

Açıklaması şöyle.

Wide-column databases dominate high-velocity ingest (IoT, metrics, logs) where writes never stop. But they sacrifice query flexibility, no joins, limited filtering, and rigid access patterns. You win on writes, lose on exploration.

6. Graph Models (Neo4j, TigerGraph) for Hidden Relationships

Açıklaması şöyle.

When insight lives in relationships (fraud rings, social influence, network hops), relational joins collapse under recursive depth. Graph databases treat relationships as first-class citizens, making multi-hop traversals fast and natural.

5. Streaming Event Sourcing (Kafka + CDC)

Açıklaması şöyle.

Batch ETL is fundamentally incompatible with real-time systems. CDC turns database mutations into immutable events, enabling near-zero-latency pipelines, replayable state, and system-wide consistency across microservices.

4. Columnar Storage (Parquet, Delta Lake) for Cheap, Fast Analytics

Parquet bir örnek

Açıklaması şöyle.

Row-based databases are optimized for point lookups, not scans. Analytics workloads read a few columns across billions of rows, exactly what columnar storage is built for. The result: orders-of-magnitude faster queries at a fraction of the cost.

Örnek

Şöyle yaparız

CREATE TABLE sales_parquet (
    order_id BIGINT,
    region   STRING,
    amount   DECIMAL(10,2),
    order_ts TIMESTAMP
)
USING PARQUET
PARTITIONED BY (region, order_date);

SELECT
    region,
    SUM(amount) AS total_sales
FROM sales_parquet
WHERE order_date = '2025-12-25'
  AND region = 'US'
GROUP BY region;

Açıklaması şöyle.

Why this is fast
- Only amount and region columns are read
- Only the order_date=2025-12-25 and US partitions are scanned
- All other files are skipped entirely

3. Multi-Model Hybrids (When SQL + NoSQL Collide)

2026 ve sonrasında artık şu kavramlar önemli

Iceberg, dbt, Data Contracts, and LLM-Aware Schemas

Ayrıca çoklu modeller de önemli. Açıklaması şöyle. Burada veri tabanının JSONB sütunları desteklemesi önemli

Real-world data is rarely one shape. Modern apps mix relational facts, semi-structured JSON, and relationships. Multi-model databases let you query everything in one place, without forcing awkward ETL or duplicating data.

1. The Unified Serving Layer (The Future of Production Data)

One dataset. Many engines. Zero rewrites. Açıklaması şöyle

Modern data stacks fracture data across OLTP, OLAP, search, and streaming systems, creating sync lag and duplicated logic. A Unified Serving Layer uses one logical data layer (Iceberg/Hudi/Delta) with multiple access modes: SQL analytics, near-real-time reads, ML, and even graph/search workloads.

Yazılım Çorbası

18 Şubat 2026 Çarşamba

Data Models

Hiç yorum yok:

Yorum Gönder

Blog Arşivi