Vantino regularly develops data analytics pipelines for its customers. This post describes how we migrated a custom data pipeline, initially deployed on premises, to Google Cloud Platform (GCP).
If you're a data analytics engineer, you know that most data pipelines follow the general ETL pattern. For those new to the field, ETL stands for extract, transform, and load, which pretty much describes what happens to data on its way to analytics dashboards. Let's review our pipeline step by step.
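Conceptually, the three ETL stages can be shown in a few lines of plain Python (the sample records and the transformation rule here are invented purely for illustration):

```python
# Minimal, generic ETL sketch (illustrative only; the sample records
# and the transformation rule are invented for this example).

def extract():
    # In a real pipeline this would read from a source system or file.
    return [{"name": "alice", "amount": "10"}, {"name": "bob", "amount": "32"}]

def transform(rows):
    # Normalize types and casing before loading.
    return [{"name": r["name"].title(), "amount": int(r["amount"])} for r in rows]

def load(rows, warehouse):
    # In a real pipeline this would write to a warehouse such as BigQuery.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)  # [{'name': 'Alice', 'amount': 10}, {'name': 'Bob', 'amount': 32}]
```

Every tool in the stack below maps onto one of these three stages.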
First, we needed to validate the incoming data, and we chose Great Expectations for that. Validation is a step worth taking if you want your data to transform smoothly later, especially when it comes from an untrusted source, and it saves time and resources because only "good" data is pushed on to the transformation step. Then we used Dataflow for fast data processing and dbt (Data Build Tool) for data transformation.
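The kind of checks this validation gate performs can be sketched in plain Python (the field names and rules below are hypothetical; in the real pipeline, Great Expectations lets you declare equivalent rules as "expectations" rather than writing them by hand):

```python
# Plain-Python sketch of a validation gate (hypothetical fields and
# rules; in our pipeline these are declared as Great Expectations
# expectations instead of hand-written checks).

def is_valid(row):
    checks = [
        row.get("user_id") is not None,                # id must be present
        isinstance(row.get("amount"), (int, float)),   # amount must be numeric
        isinstance(row.get("amount"), (int, float)) and row["amount"] >= 0,
    ]
    return all(checks)

rows = [
    {"user_id": 1, "amount": 9.99},
    {"user_id": None, "amount": 5.0},  # rejected: missing user_id
    {"user_id": 2, "amount": -3},      # rejected: negative amount
]

good = [r for r in rows if is_valid(r)]
print(good)  # [{'user_id': 1, 'amount': 9.99}]
```

Only the rows that pass every check move on to the transformation step; the rest can be logged and quarantined for inspection.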
dbt is an open-source Python application; it is straightforward to use and relatively fast.
As a data warehouse we used BigQuery, and with Apache Airflow we managed to create an efficient workflow: Airflow pipelines are defined in Python, which is powerful for such tasks. Finally, to display the data we used Looker, which is designed to work with the BigQuery warehouse, so the two integrate smoothly.
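In Airflow, a pipeline is a DAG of tasks, each running only after its upstream dependencies succeed. The ordering logic we rely on can be sketched in pure Python (the task names are illustrative, and this is not actual Airflow code; in an Airflow DAG file the same dependencies would be wired between operators):

```python
# Pure-Python sketch of our task ordering (illustrative task names;
# not actual Airflow code). Each task lists its upstream dependencies.

dependencies = {
    "validate": [],                       # Great Expectations gate
    "process": ["validate"],              # Dataflow job
    "transform": ["process"],             # dbt run against BigQuery
    "refresh_dashboards": ["transform"],  # Looker sees fresh tables
}

def topological_order(deps):
    # Schedule each task only after all of its upstream tasks are done.
    # Assumes the graph is acyclic, as any valid DAG must be.
    done, order = set(), []
    while len(done) < len(deps):
        for task, upstream in deps.items():
            if task not in done and all(u in done for u in upstream):
                done.add(task)
                order.append(task)
    return order

print(topological_order(dependencies))
# ['validate', 'process', 'transform', 'refresh_dashboards']
```

Airflow handles this scheduling (plus retries, backfills, and monitoring) for us, but the dependency graph above is the mental model behind the pipeline.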
So this is the workflow we chose for building an efficient and reliable data analytics pipeline on GCP. It is relatively fast, cost-effective, and delivers great results in the end.