End to End Data Engineering Project

Background

While looking for ways to learn more about data analytics and engineering, I came across the End-to-End Data Engineering Project course on LinkedIn Learning. The course is very practical and shows how to build a data pipeline using modern tools and techniques. The author uses macOS and I use Windows, so as I followed along I ran into issues setting up the same pipeline. That is why I decided to write this article: to show the steps for setting up the data pipeline on Windows.

The data pipeline is made up of three parts:

Ingestion - Using Airbyte to load the data from the Postgres database into the data warehouse in BigQuery

Transformation - Using dbt to transform the data from the raw dataset into the staging dataset in BigQuery

Orchestration - Using Dagster to link the Ingestion and Transformation steps into a single pipeline

Project Setup

First, download Docker Desktop for Windows. The complete instructions are on this page. Make sure that the Docker-WSL integration is enabled. WSL (Windows Subsystem for Linux) lets us run Linux on Windows. I used Ubuntu 22.04, which was the default WSL distribution on my machine.
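A quick way to confirm the setup from the Ubuntu terminal is sketched below. This is my own check, not something from the course: the `is_wsl_kernel` helper is a hypothetical name, and it relies on the assumption that WSL kernels report "microsoft" in their release string.

```shell
# Returns success when the given kernel release string looks like a WSL kernel.
# Assumption: WSL kernels contain "microsoft" (or "Microsoft") in `uname -r`.
is_wsl_kernel() {
  case "$1" in
    *[Mm]icrosoft*) return 0 ;;
    *) return 1 ;;
  esac
}

if is_wsl_kernel "$(uname -r)"; then
  echo "Running inside WSL"
else
  echo "Not running inside WSL"
fi

# With the Docker-WSL integration enabled, the docker CLI should be on PATH here.
if command -v docker >/dev/null 2>&1; then
  # Prints the engine version, or a hint if the engine cannot be reached.
  docker version --format '{{.Server.Version}}' \
    || echo "docker CLI found, but the engine is not reachable"
else
  echo "docker CLI not found; check the WSL integration settings in Docker Desktop"
fi
```

On the Windows side, running `wsl -l -v` in PowerShell lists the installed distributions and the WSL version each one uses.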

Next, we set up our Python environment to use dbt and Dagster for the project. Again, we use the Ubuntu terminal, as dbt threw some errors when I ran it directly on Windows.

The tutorial uses venv to create a virtual environment. I decided to use virtualenv and virtualenvwrapper instead, as I want my Ubuntu installation to have the same Python setup I usually use. This way, I can create and switch between different virtual environments inside Ubuntu in case I have another project.

I followed this tutorial to set up my Python environment on Ubuntu.
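For reference, the virtualenvwrapper part of that setup usually comes down to a few lines in `~/.bashrc`. The fragment below is a minimal sketch, assuming the tools were installed with `python3 -m pip install --user virtualenv virtualenvwrapper`, which on my understanding places the helper script under `~/.local/bin`. Both paths are assumptions and may differ on your machine.

```shell
# ~/.bashrc fragment for virtualenvwrapper (paths below are assumptions).

# Directory where all virtual environments will be stored.
export WORKON_HOME="$HOME/.virtualenvs"

# Interpreter that virtualenvwrapper itself should run with.
export VIRTUALENVWRAPPER_PYTHON="$(command -v python3)"

# Load the helper functions (mkvirtualenv, workon, ...) only if the script
# exists at the assumed location, so a missing install does not break the shell.
if [ -f "$HOME/.local/bin/virtualenvwrapper.sh" ]; then
  . "$HOME/.local/bin/virtualenvwrapper.sh"
fi
```

After reloading the shell, `mkvirtualenv <name>` creates a new environment and `workon <name>` switches to an existing one, which is what makes it easy to keep one environment per project.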

After setting up Ubuntu and Python, the remaining steps are basically the same as in the tutorial. However, I used the Ubuntu terminal to run all the scripts and commands that the author used, instead of the Windows command prompt.

Note also that the author provides a different Docker image for the Postgres database, because the original image used in the tutorial targets Docker on Apple Silicon. The alternative Docker image can be pulled and started as follows:

$ docker run --name big-star-container -d -p 5432:5432 lilearningproject/big-star-postgres-multi:latest -c "wal_level=logical"
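Once the command above has run, it can be useful to check that the container is actually up before pointing Airbyte at it. The snippet below is a small sketch of my own, not part of the course. It only uses the container name from the command above and standard docker subcommands.

```shell
container="big-star-container"  # name given by --name in the docker run command

if ! command -v docker >/dev/null 2>&1; then
  echo "docker CLI not found in this shell"
elif [ -n "$(docker ps -q --filter "name=${container}" 2>/dev/null)" ]; then
  # Container is running: show its status and the published port mapping.
  docker ps --filter "name=${container}" --format '{{.Names}}: {{.Status}}'
  docker port "$container"
else
  # Not running: it may exist but be stopped (for example after a reboot).
  echo "${container} is not running; if it already exists, try: docker start ${container}"
fi
```

Running `docker run` a second time with the same `--name` fails because the name is already taken, so `docker start big-star-container` is the way to bring an existing container back up.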

Next Steps

I’m looking to apply what I learned from this tutorial to build another data pipeline using other tools like Fivetran, AWS, etc. Of course, I’ll write up the steps and possibly create a tutorial to showcase alternative architectures.


Kenneth Infante
