Data Engineering Pipeline on Azure
Abstract
The increasing volume and variety of data generated by modern organizations demand efficient and scalable data processing solutions. This project presents the development of a Data Engineering Pipeline using Microsoft Azure, designed to automate data ingestion, transformation, and analysis in a cloud environment. The pipeline leverages services including Azure Data Factory, Azure Data Lake Storage Gen2, Azure Databricks, and Azure Synapse Analytics, following a three-layer architecture: Bronze (raw data), Silver (cleaned and transformed data), and Gold (curated data for analytics). Data is extracted from APIs, processed using PySpark in Databricks, stored in Synapse for analytical queries, and visualized using Power BI dashboards. This implementation demonstrates how cloud-based tools can create efficient, automated, and scalable data workflows, enabling organizations to derive actionable insights from large datasets.
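As an illustration of the Bronze, Silver, and Gold layering described above, the following is a minimal sketch in plain Python rather than the PySpark code the project runs in Databricks. The record fields (`order_id`, `region`, `amount`), the cleaning rules, and the function names are all hypothetical and chosen only to show how each layer builds on the previous one.

```python
from collections import defaultdict

# Hypothetical raw API payload; the field names are illustrative assumptions.
RAW_RECORDS = [
    {"order_id": "1", "region": "north", "amount": "10.50"},
    {"order_id": "2", "region": "south", "amount": "4.00"},
    {"order_id": "2", "region": "south", "amount": "4.00"},   # duplicate row
    {"order_id": "3", "region": "north", "amount": None},     # missing value
    {"order_id": "4", "region": " North ", "amount": "2.25"},  # untidy text
]


def to_bronze(records):
    """Bronze: store the records exactly as ingested, with no cleaning."""
    return [dict(r) for r in records]


def to_silver(bronze):
    """Silver: drop incomplete rows, deduplicate, normalize text, cast types."""
    seen = set()
    silver = []
    for row in bronze:
        # Skip rows where any required field is missing.
        if any(row.get(k) is None for k in ("order_id", "region", "amount")):
            continue
        # Keep only the first occurrence of each order_id.
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "region": row["region"].strip().lower(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver):
    """Gold: aggregate cleaned rows into per-region totals for analytics."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["region"]] += row["amount"]
    return dict(totals)


bronze = to_bronze(RAW_RECORDS)
silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In a Databricks setting, each function would typically correspond to a Spark job that reads from one storage layer and writes to the next, with the Gold output loaded into the analytical store; that mapping is an assumption about a typical layout, not a description of this project's code.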