COMP63301 Data Engineering Concepts
Data engineering typically follows a lifecycle whose phases include data acquisition, profiling, cleaning, integration, modelling, and usage.
This unit introduces the stages of the data engineering lifecycle and the concepts, tasks and techniques associated with them. It examines selected aspects of the lifecycle in more depth, e.g., transformation, modelling and visualisation, and addresses cross-cutting topics such as security, trust, and robustness. We will investigate pain points, trade-offs, limitations and evaluation criteria that can inform the development of data engineering pipelines in practice.
This is a draft handbook to accompany the teaching of COMP63301 in the Department of Computer Science at The University of Manchester.
Reading list
The reading list for COMP63301 includes:
- Joe Reis, Matt Housley (2022): Fundamentals of Data Engineering: Plan and Build Robust Data Systems, O’Reilly Media, ISBN 9781098108304
- Wes McKinney (2022): Python for Data Analysis, 3rd Edition, O’Reilly Media, ISBN 9781098104030
- The Turing Way
- Snakemake tutorial
Related publications include:
- Davide Chicco et al. (2022): Eleven quick tips for data cleaning and feature engineering
- Felix Mölder et al. (2021): Sustainable data analysis with Snakemake
Aims
The unit aims to provide students with an understanding of the concepts that underpin data engineering and with experience of applying those concepts. Data engineering provides the processes and mechanisms that enable value to be obtained from data. These processes and mechanisms can be viewed as forming a data engineering lifecycle, and this unit explores the concepts behind the stages of that lifecycle, which include data transformation and visualisation.
Intended Learning Outcomes
Explain the Data Engineering (DE) lifecycle, related concepts, challenges and research questions.
Identify relevant data properties, understanding the shape of data and its representation of the world.
Apply selected DE techniques for data integration, cleaning, transformation and visualisation, ensuring data quality for the purpose of data analysis.
Critically analyse data engineering technologies.
Discuss trade-offs between various design options.
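As a rough illustration of the cleaning, integration and transformation techniques named in the learning outcomes, the following minimal sketch chains three small steps over toy data. It is not taken from the unit materials: the records, the field names (`station`, `temp_c`), the lookup table and the function names are all hypothetical, and only the Python standard library is assumed.

```python
# Hypothetical toy records; station codes and fields are illustrative only.
raw_readings = [
    {"station": " A1 ", "temp_c": "21.5"},
    {"station": "a1", "temp_c": "n/a"},    # unparseable temperature
    {"station": "B2", "temp_c": "19.0"},
    {"station": "B2", "temp_c": "19.0"},   # exact duplicate
]
# Hypothetical second source, used for the integration step.
stations = {"A1": "Manchester", "B2": "Salford"}


def clean(records):
    """Normalise station codes; drop unparseable values and duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        station = rec["station"].strip().upper()
        try:
            temp = float(rec["temp_c"])
        except ValueError:
            continue  # skip readings whose temperature cannot be parsed
        key = (station, temp)
        if key in seen:
            continue  # skip records already seen after normalisation
        seen.add(key)
        cleaned.append({"station": station, "temp_c": temp})
    return cleaned


def integrate(records, lookup):
    """Attach the city from the second source to each reading."""
    return [{**rec, "city": lookup.get(rec["station"])} for rec in records]


def transform(records):
    """Derive a Fahrenheit column from the Celsius value."""
    return [
        {**rec, "temp_f": round(rec["temp_c"] * 9 / 5 + 32, 1)}
        for rec in records
    ]


result = transform(integrate(clean(raw_readings), stations))
for row in result:
    print(row)
```

Keeping each step as a separate function makes the trade-offs easier to discuss: for example, whether deduplication should happen before or after integration, or whether unparseable values should be dropped or flagged.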