How to Create a Scalable Data Analytics Pipeline on GCP

This article is for those who've already decided to deploy their IT infrastructure on Google Cloud Platform (GCP) and would like to understand which services may be useful for implementing a data analytics pipeline on GCP.

Vantino regularly develops data analytics pipelines for its customers. This is a description of how we migrated a custom data pipeline, which we had initially deployed on premises, to GCP.

The Workflow

If you're a data analytics engineer, you know that data pipelines follow a general pattern in computing called ETL. For those new to the field, ETL stands for extract, transform, and load: the three stages data passes through on its way to analytics dashboards. Let's review them step by step.
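The three ETL stages can be sketched in a few lines of plain Python. All names here (the sample rows, the in-memory "warehouse" list) are illustrative, not part of any GCP API:

```python
# Minimal ETL sketch: extract raw rows, transform them, load them into
# a destination. In a real pipeline each stage would be a separate
# service (source system -> Dataflow/DBT -> BigQuery).

def extract():
    # Stand-in for reading from an API, file drop, or source database.
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "5.00"},
    ]

def transform(rows):
    # Cast types and derive fields before loading.
    return [
        {"order_id": r["order_id"], "amount_cents": round(float(r["amount"]) * 100)}
        for r in rows
    ]

def load(rows, warehouse):
    # Stand-in for an insert into a warehouse table such as BigQuery.
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

The rest of the article maps each of these stages onto a managed GCP service.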

First, we needed to validate the data, and we chose the Great Expectations software to do that. Validation is a step worth taking if you want your data to transform smoothly later, especially when it comes from an untrusted source, and it saves time and resources because only "good" data reaches the transformation step. We then used Dataflow for fast data processing and Data Build Tool (DBT) for data transformation.
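Great Expectations expresses such rules as declarative "expectation suites"; the underlying idea of a validation gate can be sketched with plain Python checks. The column names and rules below are purely illustrative:

```python
# Sketch of the "validate before transform" gate: split incoming rows
# into rows that pass all checks and rows that are rejected, so only
# "good" data continues downstream.

def validate(rows):
    good, bad = [], []
    for row in rows:
        checks = [
            row.get("user_id") is not None,               # required field present
            isinstance(row.get("amount"), (int, float)),  # correct type
            row.get("amount", 0) >= 0,                    # no negative amounts
        ]
        (good if all(checks) else bad).append(row)
    return good, bad

rows = [
    {"user_id": 1, "amount": 10.5},
    {"user_id": None, "amount": 3.0},  # rejected: missing user_id
    {"user_id": 2, "amount": -1.0},    # rejected: negative amount
]
good, bad = validate(rows)
# Only the `good` rows are pushed to the transformation step;
# the `bad` rows can be logged or quarantined for inspection.
```

In the real pipeline, Great Expectations plays the role of `validate` with a far richer, configurable rule set.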

DBT is an open-source Python application that is straightforward to use and relatively fast.

As a data warehouse, we used BigQuery, and with Apache Airflow we created an efficient workflow: Airflow is written in Python, which is well suited to such orchestration tasks. Finally, to display the data, we used Looker, which integrates tightly with BigQuery, so the two work smoothly together.
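Airflow models a workflow as a directed acyclic graph (DAG) of tasks and runs each task only after its upstream dependencies succeed. As a rough, library-free sketch of that ordering, the standard library's graphlib can topologically sort a set of hypothetical task names mirroring the steps above:

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: validate -> process -> transform -> load -> display.
# In an actual Airflow DAG each of these would be an operator, and the
# edges would be declared with Airflow's dependency syntax.
dag = {
    "process_dataflow": {"validate_great_expectations"},
    "transform_dbt": {"process_dataflow"},
    "load_bigquery": {"transform_dbt"},
    "refresh_looker": {"load_bigquery"},
}

# static_order() yields the tasks in a valid execution order.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Validation always comes first and the Looker refresh last, which is exactly the guarantee Airflow gives the pipeline at scheduling time.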

So this is the workflow we chose for building an efficient and reliable data analytics pipeline on GCP. It is relatively fast, cost-effective, and delivers high-quality results.

Contact us to get a data management consultation. For more details, visit our Data & BI Consulting page.