15 Eylül 2021 Çarşamba

Data Warehouse Nedir - Süzülmüş Verinin Sakladığı Yer

Giriş
Belli bir amaca göre süzülmüş veri anlamına gelir.

Snowflake Schema
Açıklaması şöyle
Snowflake Schema in Data Warehouse Model

- The snowflake schema is a variant of the star schema.

- Here, the centralized fact table is connected to multiple dimensions.
In the snowflake schema, dimensions are present in a normalized form in multiple related tables.

- The snowflake structure materialized when the dimensions of a star schema are detailed and highly structured, having several levels of relationship, and the child tables have multiple parent tables.

-  The snowflake effect affects only the dimension tables and does not affect the fact tables.

===========================+

- For Example: (please refer the image)
- The Employee dimension table now contains the attributes: EmployeeID, EmployeeName, DepartmentID, Region, Territory.

- The DepartmentID attribute links with the Employee table with the Department dimension table. The Department dimension is used to provide detail about each department, such as the Name and Location of the department.

-  The Customer dimension table now contains the attributes: CustomerID, CustomerName, Address, CityID.

- The CityID attributes link the Customer dimension table with the City dimension table. The City dimension table has details about each city such as CityName, Zipcode, State, and Country.

- The main difference between star schema and snowflake schema is that the dimension table of the snowflake schema is maintained in the normalized form to reduce redundancy.

- The advantage here is that such tables (normalized) are easy to maintain and save storage space.

- However, it also means that more joins will be needed to execute the query. This will adversely impact system performance.

===========================+
- Advantages:
- There are two main advantages of snowflake schema given below:

- It provides structured data which reduces the problem of data integrity.
- It uses small disk space because data are highly structured.

===========================+
- Disadvantages:
- Snowflaking reduces space consumed by dimension tables but compared with the entire data warehouse the saving is usually insignificant.

- Avoid snowflaking or normalization of a dimension table, unless required and appropriate.

- Do not snowflake hierarchies of one dimension table into separate tables.

- Hierarchies should belong to the dimension table only and should never be snowflakes.
Multiple hierarchies that can belong to the same dimension have been designed at the lowest possible detail.
Şeklen şöyle


Data Warehouse Üreticileri
Açıklaması şöyle
- Traditional data warehouses are Teradata, Oracle Exadata, IBM DB2 Warehouse etc.

- Cloud DWH are Amazon Redshift, Google Big Query and Snowflake Cloud Data warehouse.
Apache Hudi
Açıklaması şöyle
Traditional data warehouses often deploy Hadoop to store data and provide batch analysis. Kafka is used separately to distribute Hadoop data to other data processing frameworks, resulting in duplicated data. Hudi helps effectively solve this problem; we always use Spark pipelines to insert new updates into the Hudi tables, then incrementally read the update of Hudi tables. In other words, Hudi tables are used as the unified storage format to access data.
Business Intelligence
Şeklen şöyle


Data Lake vs Data Warehouse
Şeklen şöyle. Data Lake ham veri, Data Warehouse is işlenmiş veri.

Açıklaması şöyle
Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data, the purpose for which is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.
Buradaki açıklamaya göre Data Warehouse birinci kuşak çözüm. Data Lake ise ikinci kuşak çözüm
The first generation: proprietary enterprise data warehouse and business intelligence platforms; solutions with large price tags that have left companies with equally large amounts of technical debt; Technical debt in thousands of unmaintainable ETL jobs, tables and reports that only a small group of specialized people understand, resulting in an under-realized positive impact on the business.

The second generation: big data ecosystem with a data lake as a silver bullet; complex big data ecosystem and long running batch jobs operated by a central team of hyper-specialized data engineers have created data lake monsters that at best has enabled pockets of R&D analytics; over promised and under realized.

Hiç yorum yok:

Yorum Gönder