Data Engineering Fundamentals


Data Sources

Data Formats

==data serialization== : the process of converting a data structure into a format that can be stored or transmitted and then reconstructed later (source: Wikipedia)
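A minimal sketch of the round trip, assuming JSON as the target format (the notes do not name one) and a made-up `record` used only for illustration:

```python
import json

# Hypothetical record, used only for illustration.
record = {"id": 1, "name": "Ada", "scores": [9.5, 8.0]}

# Serialize: in-memory dict -> JSON text that can be stored or transmitted.
payload = json.dumps(record)

# Deserialize: JSON text -> an equivalent in-memory structure.
restored = json.loads(payload)

print(payload)
print(restored == record)
```

JSON is only one possible choice here; the same store-then-reconstruct idea applies to other text or binary formats.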

Row-Major VS Column-Major Format

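row-major -> consecutive elements of a row are stored next to each other ([[Contiguous data]]), which makes it faster to read and write whole data observations (default layout in NumPy)

column-major -> consecutive elements of a column are stored next to each other, which makes it faster to read a whole feature; better for analysis purposes ([[Pandas Python]])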
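To make the two layouts concrete, here is a small sketch in plain Python (no NumPy or pandas, so it only illustrates the ordering, not the actual performance difference). The `table`, `row_slice`, and `col_slice` names are hypothetical:

```python
# Hypothetical 2 x 3 table: 2 observations (rows), 3 features (columns).
table = [
    [1, 2, 3],
    [4, 5, 6],
]
n_rows, n_cols = len(table), len(table[0])

# Row-major: the values of each row sit next to each other.
row_major = [value for row in table for value in row]

# Column-major: the values of each column sit next to each other.
col_major = [table[r][c] for c in range(n_cols) for r in range(n_rows)]


def row_slice(flat, r):
    """Row r is one contiguous slice of the row-major layout."""
    return flat[r * n_cols:(r + 1) * n_cols]


def col_slice(flat, c):
    """Column c is one contiguous slice of the column-major layout."""
    return flat[c * n_rows:(c + 1) * n_rows]


print(row_major)                # [1, 2, 3, 4, 5, 6]
print(col_major)                # [1, 4, 2, 5, 3, 6]
print(row_slice(row_major, 1))  # [4, 5, 6]
print(col_slice(col_major, 2))  # [3, 6]
```

Reading one observation is a single contiguous slice in the row-major layout, while reading one feature is a single contiguous slice in the column-major layout.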

Data Models

Relational Models

relations should be normalized (1NF, 2NF, etc.); this helps reduce redundancy and improve data integrity

[!caution] One major downside is that the data can end up spread across many relations, and joining them back together can be costly
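A small sketch of that trade-off using Python's built-in `sqlite3` module. The `publisher` and `book` tables and their rows are hypothetical, chosen only to show normalized data and the join needed to read it back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized schema: publisher details are stored once and
# referenced by id, instead of being repeated on every book row.
conn.executescript("""
    CREATE TABLE publisher (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        country TEXT NOT NULL
    );
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        publisher_id INTEGER NOT NULL REFERENCES publisher(id)
    );
    INSERT INTO publisher VALUES (1, 'Acme Press', 'FR');
    INSERT INTO book VALUES (1, 'Book A', 1), (2, 'Book B', 1);
""")

# The cost of normalization: reading a full record requires a join.
rows = conn.execute("""
    SELECT book.title, publisher.name, publisher.country
    FROM book
    JOIN publisher ON publisher.id = book.publisher_id
    ORDER BY book.id
""").fetchall()

print(rows)
conn.close()
```

Changing the publisher's country touches a single row, which is the integrity benefit; the join is the price paid at read time, and it grows with the number of relations involved.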

NoSQL Models

Structured VS Unstructured Data

| Structured Data | Unstructured Data |
| --------------- | ----------------- |
| Data Warehouse  | Data Lake         |

Data Storage Engines & Processing

Transactional VS Analytical Processing

Modes of Dataflow

Batch Processing VS Stream Processing