olist-end-to-end-data-engineering-project

Description

A personal project designed to practice and sharpen my skills in Airflow, Spark, Docker and Python. It focuses on generating fact tables from the Olist dataset to surface business insights that support data-driven decisions across the e-commerce operation.

Infrastructure

Pipeline

ETL stands for Extract, Transform, Load. An ETL pipeline is a set of processes used to collect data from various sources, transform it into a format that is suitable for analysis or storage, and then load it into a target destination, such as a database, data warehouse, or data lake.

In this project, we focus on generating an orders fact table from the dataset provided by Olist in order to analyze the company's sales performance.

Component        Usage
Minio            Serves as the data lake where raw data is stored before being processed.
Airflow          Orchestrates the ETL pipeline.
Spark & Python   PySpark is used to process the raw data and generate the orders fact table.
MySQL            Used as the data warehouse where the orders fact table is stored.
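
To make the flow concrete, here is a minimal sketch of what the Spark job could look like. It assumes the raw Olist CSVs sit in the olist bucket under their original file names and that the fact table matches the query in the next section; the MinIO endpoint, database name and credentials are illustrative values taken from the setup section below.

from pyspark.sql import SparkSession, functions as F

# Session configured to read from the MinIO data lake over s3a.
# Endpoint and credentials are assumptions based on the setup section below.
spark = (
    SparkSession.builder
    .appName("orders_fact")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "admin")
    .config("spark.hadoop.fs.s3a.secret.key", "adminadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Extract: raw Olist CSVs from the data lake (object names assumed to be the
# original dataset file names).
orders = spark.read.csv("s3a://olist/olist_orders_dataset.csv", header=True, inferSchema=True)
items = spark.read.csv("s3a://olist/olist_order_items_dataset.csv", header=True, inferSchema=True)

# Transform: one row per order item with its purchase date, price and freight.
orders_fact = (
    orders.join(items, "order_id")
    .select(
        "order_id",
        F.to_date("order_purchase_timestamp").alias("order_date"),
        "price",
        "freight_value",
    )
)

# Load: write the fact table into the MySQL data warehouse.
(
    orders_fact.write.format("jdbc")
    .option("url", "jdbc:mysql://database:3306/olist")
    .option("dbtable", "orders_fact")
    .option("user", "admin")
    .option("password", "admin")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("overwrite")
    .save()
)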

Orders Fact Table

With this fact table, we can analyze the sales performance of the company. For example, we can calculate the average order value per month:

select year(order_date)           as year,
       month(order_date)          as month,
       avg(price)                 as avg_total_price,
       avg(freight_value)         as avg_total_freight,
       avg(price + freight_value) as avg_total_order
from orders_fact
group by year, month
order by year, month;

+------+-------+------------------+-------------------+------------------+
| year | month | avg_total_price  | avg_total_freight | avg_total_order  |
+------+-------+------------------+-------------------+------------------+
| 2016 | 9     | 44.56            | 14.57             | 59.13            |
| 2016 | 10    | 136.38           | 20.11             | 156.50           |
| 2016 | 12    | 10.9             | 8.72              | 19.62            |
| 2017 | 1     | 125.98           | 17.67             | 143.65           |
| 2017 | 2     | 126.76           | 19.98             | 146.74           |
| 2017 | 3     | 124.78           | 19.23             | 144.02           |
| 2017 | 4     | 134.10           | 19.56             | 153.66           |
| 2017 | 5     | 122.36           | 19.37             | 141.73           |
| 2017 | 6     | 120.86           | 19.52             | 140.37           |
| 2017 | 7     | 110.21           | 19.24             | 129.45           |
| 2017 | 8     | 116.90           | 19.19             | 136.09           |
| 2017 | 9     | 129.25           | 19.87             | 149.12           |
| 2017 | 10    | 124.81           | 19.75             | 144.55           |
| 2017 | 11    | 116.59           | 19.49             | 136.08           |
| 2017 | 12    | 117.93           | 18.97             | 136.90           |
| 2018 | 1     | 115.74           | 19.16             | 134.91           |
| 2018 | 2     | 110.03           | 18.60             | 128.64           |
| 2018 | 3     | 119.66           | 20.92             | 140.58           |
| 2018 | 4     | 124.97           | 20.45             | 145.42           |
| 2018 | 5     | 125.74           | 19.34             | 145.08           |
| 2018 | 6     | 122.23           | 22.26             | 144.49           |
| 2018 | 7     | 126.27           | 23.01             | 149.28           |
| 2018 | 8     | 117.92           | 20.51             | 138.43           |
| 2018 | 9     | 145.00           | 21.46             | 166.46           |
+------+-------+------------------+-------------------+------------------+

Many other questions can be answered with this dataset, but additional fact tables need to be built to answer them. Exploring these questions yields valuable insights into customer behavior, operational efficiency, product performance, and overall business health, which helps in making data-driven decisions to improve various aspects of the e-commerce operation.

Setup

1) Set up Docker

docker-compose up

2) Set up the Spark and MySQL connections in Airflow

Follow this guide and use the values below.

Airflow login      admin
Airflow password   admin

Spark connection
Connection Id      spark-conn
Connection Type    spark
Host               spark://spark
Port               7077

MySQL connection
Connection Id      mysql-conn
Connection Type    mysql
MySQL login        admin
MySQL password     admin
Host               database
Port               3306
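
With both connections in place, the DAG only needs to point a SparkSubmitOperator at spark-conn. A rough sketch is shown below; the DAG id, schedule and application path are illustrative, not the project's actual file layout.

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="olist_orders_fact",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # Airflow 2.4+ name; older versions use schedule_interval
    catchup=False,
) as dag:
    # Submits the PySpark job to the cluster behind the spark-conn connection.
    generate_orders_fact = SparkSubmitOperator(
        task_id="generate_orders_fact",
        conn_id="spark-conn",
        application="/opt/airflow/jobs/orders_fact.py",  # illustrative path
        verbose=True,
    )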

3) Upload the dataset to the data lake

Use the MinIO web interface to upload the dataset into the olist bucket.

MinIO login       admin
MinIO password    adminadmin
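
If you prefer scripting the upload, the minio Python client can do the same job. The snippet below assumes the MinIO API is exposed on localhost:9000 and that the CSVs sit in a local data/ folder; both are assumptions about your local setup.

from pathlib import Path

from minio import Minio

# Credentials match the MinIO login above; the endpoint/port is an assumption
# about how docker-compose exposes the service.
client = Minio("localhost:9000", access_key="admin", secret_key="adminadmin", secure=False)

if not client.bucket_exists("olist"):
    client.make_bucket("olist")

# Upload every CSV of the Olist dataset from a local data/ folder (assumed path).
for csv_file in Path("data").glob("*.csv"):
    client.fput_object("olist", csv_file.name, str(csv_file))
    print(f"uploaded {csv_file.name}")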

4) Download the JARs

Download the following JARs and place them in the jars folder (the sketch after this list shows how they end up on the Spark classpath):

mysql-connector-j >> used to connect to the data warehouse (MySQL)

hadoop-aws & aws-java-sdk >> used to connect to the data lake (MinIO via s3a)
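
How the JARs reach Spark depends on how the job is submitted. One option, sketched below, is to point spark.jars at the downloaded files when building the session; the file names and versions are examples only, use whichever versions you actually downloaded.

from pyspark.sql import SparkSession

# Example jar list only: versions and paths depend on what was downloaded.
jars = ",".join([
    "jars/mysql-connector-j-8.3.0.jar",       # JDBC driver for the MySQL warehouse
    "jars/hadoop-aws-3.3.4.jar",              # s3a filesystem used to read from MinIO
    "jars/aws-java-sdk-bundle-1.12.262.jar",  # AWS SDK required by hadoop-aws
])

spark = (
    SparkSession.builder
    .appName("orders_fact")
    .config("spark.jars", jars)
    .getOrCreate()
)

Alternatively, the SparkSubmitOperator shown earlier accepts a jars argument, which puts the same comma-separated list on the spark-submit command line.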

Possible Improvements

For the moment, the orders fact table pipeline is designed to recreate the tables every time it runs. This is not ideal for a production environment, where we would want to update the tables incrementally.
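
A hedged sketch of what an incremental load could look like, continuing the PySpark sketch from the Pipeline section: it treats order_date as a high-water mark, reuses the same illustrative JDBC options, and appends only the rows that are newer than what the warehouse already holds. A real implementation would also need to handle late or updated orders.

from pyspark.sql import functions as F

# spark and orders_fact come from the sketch in the Pipeline section.
jdbc_options = {
    "url": "jdbc:mysql://database:3306/olist",  # illustrative database name
    "user": "admin",
    "password": "admin",
    "driver": "com.mysql.cj.jdbc.Driver",
}

# Find the latest order_date already loaded into the warehouse.
max_loaded = (
    spark.read.format("jdbc")
    .options(**jdbc_options)
    .option("dbtable", "(select max(order_date) as max_date from orders_fact) t")
    .load()
    .collect()[0]["max_date"]
)

# Keep only rows newer than the high-water mark; on the first run, load everything.
new_rows = orders_fact if max_loaded is None else orders_fact.filter(F.col("order_date") > max_loaded)

# Append the new rows instead of overwriting the whole table.
(
    new_rows.write.format("jdbc")
    .options(**jdbc_options)
    .option("dbtable", "orders_fact")
    .mode("append")
    .save()
)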
