How to Create a Scalable Data Analytics Pipeline on GCP

27 Mar, 2023

Building a Data Analytics Pipeline on Google Cloud Platform: A Guide to Essential Services for IT Infrastructure Deployment

Vantino regularly develops data analytics pipelines for its customers. This is a description of how we migrated a custom data pipeline, which we had initially deployed on premises, to GCP.

The Migration to GCP

In the world of business analytics, data is key. For a Swiss financial company with offices in Austria and Germany, having a reliable and efficient data analytics pipeline is essential to make informed decisions and staying competitive.

Version 1 of the company's data analytics pipeline relied on its ERP (Abacus) and CRM systems to upload CSV reports on a daily basis. These reports included important information such as employees' timesheets and tasks. Before the data was added to a Postgresql database, it was validated by Python scripts against a predefined data schema using the tableschema Python library. This ensured that the data was consistent and accurate. Once the data was validated, it was made available in the Apache Superset visualization platform for analysis and reporting.

However, to take advantage of scalability and reliability, the company has migrated its data analytics pipeline to the Google Cloud Platform (GCP). In version 2, CSV reports are loaded directly into BigQuery, Google's cloud-native data warehouse. This eliminates the need for a custom analytics server and allows the company to easily query and analyze large amounts of data in real time.

To facilitate the migration to GCP, the company uses an ETL (Extract, Transform, Load) process. Specifically, CSV files are loaded into Google Cloud Storage (GCS) with the help of Google Cloud Functions, transformed using the DBT (Data Build Tool) Python library, and then loaded into BigQuery. DBT allows data teams to transform and manage their data in a consistent and maintainable way, and by defining its data schema and transformations in code, the company is able to automate and manage its data pipeline more efficiently.

Apache Airflow, an open-source platform for orchestrating and scheduling data pipelines, is used to manage the flow of data from their ERP and CRM systems to BigQuery. This ensures that the data is loaded into BigQuery regularly and in the correct format.

To further optimize its data pipeline, the company uses GCP Dataflow, a fully managed service for transforming and enriching large datasets in near real-time. This allows the company to perform complex transformations on its data without the need for dedicated infrastructure or time-consuming manual processes.

With the data loaded into BigQuery, it is available for analysis and visualization in Apache Superset. This allows the company to continue using its existing dashboards and visualizations while benefiting from the increased performance and scalability of GCP.

Thanks to the migration to the Google Cloud Platform, the Swiss financial company has been able to improve the efficiency and reliability of its data analytics pipeline. The company can now easily manage, transform, and analyze its data in real time to make more informed decisions and stay ahead of the competition.

Contact us to get a data management consultation. For more details, visit our Data & BI Consulting page.