Data Engineering

Automated ETL Batch Pipeline

Project Overview:

ETL pipelines are crucial in the retail industry because they help retailers manage and leverage their data effectively. They extract batches of data from different sources, transform them into a consistent format, and load them into a database for analysis and reporting.

Quality checks are also often implemented to ensure data accuracy and to help identify and address data issues early in the process.

The goal of this project is to automate an ETL batch pipeline by:

  • Consuming data from different sources and applying validation checks
  • Storing the pre-processed data in Azure Data Lake
  • Writing a microservice in Scala and Spark to transform the data and extract useful metrics
  • Storing the metrics in Cassandra
  • Automating the pipeline with Airflow
  • Adding Grafana or Prometheus for metrics visualisation and monitoring
  • Optionally, adding a TLS encryption layer to secure data in transit
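
As a rough illustration of the validation and transformation steps listed above, here is a minimal sketch in plain Scala 2.13+ collections. It deliberately leaves out Spark, ADLS, and Cassandra so that it runs on its own; in the actual microservice the same logic would more likely be expressed on Spark DataFrames or Datasets. The `Sale` record, the field names (`store_id`, `sku`, `quantity`, `unit_price`), the validation rules, and the revenue-per-store metric are all assumptions made for the example, since the brief does not define a schema or a list of metrics.

```scala
// Hypothetical record shape: the project brief does not specify a schema.
final case class Sale(storeId: String, sku: String, quantity: Int, unitPrice: BigDecimal)

object BatchSketch {

  // Validation check: Left(reason) for a rejected row, Right(sale) for a clean one.
  // The rules (non-empty ids, positive quantity, non-negative price) are assumed.
  def validate(raw: Map[String, String]): Either[String, Sale] =
    for {
      storeId   <- raw.get("store_id").filter(_.nonEmpty).toRight("missing store_id")
      sku       <- raw.get("sku").filter(_.nonEmpty).toRight("missing sku")
      quantity  <- raw.get("quantity").flatMap(_.toIntOption).filter(_ > 0)
                     .toRight("invalid quantity")
      unitPrice <- raw.get("unit_price")
                     .flatMap(s => scala.util.Try(BigDecimal(s)).toOption)
                     .filter(_ >= 0)
                     .toRight("invalid unit_price")
    } yield Sale(storeId, sku, quantity, unitPrice)

  // Transformation: one example metric, total revenue per store.
  def revenueByStore(sales: Seq[Sale]): Map[String, BigDecimal] =
    sales.groupMapReduce(_.storeId)(s => s.unitPrice * s.quantity)(_ + _)

  def main(args: Array[String]): Unit = {
    // A tiny in-memory batch standing in for data read from a real source.
    val batch = Seq(
      Map("store_id" -> "S1", "sku" -> "A", "quantity" -> "2", "unit_price" -> "9.50"),
      Map("store_id" -> "S1", "sku" -> "B", "quantity" -> "1", "unit_price" -> "1.00"),
      Map("store_id" -> "S2", "sku" -> "A", "quantity" -> "0", "unit_price" -> "9.50"),
      Map("store_id" -> "",   "sku" -> "C", "quantity" -> "3", "unit_price" -> "2.00")
    )

    // Split the batch into rejected rows (with reasons) and accepted rows.
    val (rejected, accepted) = batch.map(validate).partitionMap(identity)

    println(s"Rejected rows: $rejected")
    println(s"Revenue by store: ${revenueByStore(accepted)}")
  }
}
```

Returning an `Either` from the validation step keeps the rejection reason next to each bad row, so rejected records can be logged or counted for monitoring rather than silently dropped.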


Scala, Spark, Cassandra, ADLS, Airflow, Jenkins, Grafana/Prometheus

Expiration date: 5 December 2023

