Learn how Amazon Managed Workflows for Apache Airflow (MWAA) offers an efficient and easy way to build ETL pipelines that are scalable and cost-effective.
Data engineers tend to struggle with scaling when designing an Extract-Transform-Load (ETL) pipeline. What if our 500MB data set becomes 50GB tomorrow? Are we ready to handle that load? How much is it going to cost? Should we be worrying about this uncertain future now? How can we be prepared?
Data engineers and architects lose sleep over these questions. In this article we present an efficient and easy way to build ETL pipelines that scale while paying only for actual usage in an AWS environment.
ETL involves not only the pipelines that do the actual ETL work, but also the orchestration of those pipelines. Many ETL pipelines on AWS are built on EC2 instances, where orchestration and pipeline execution share and compete for resources. The image below shows a very common pattern: one primary instance running Airflow and several worker instances in an Auto Scaling group handling the load.
Although this configuration is stable, it has several design flaws:
So, how do we overcome the issues mentioned above? What is the best approach to distribute the load without having to worry about it?
AWS offers its customers an out-of-the-box managed service for Apache Airflow with high availability by default.
This managed service is fully configurable and provides several benefits over the configuration mentioned above:
MWAA integrates seamlessly with other AWS services. In this example we will use ECS Fargate as the easiest way to distribute the pipeline load, paying only for what is used, without worrying about allocating resources or race conditions that can lead to performance problems.
The following image shows an example of a highly scalable architecture that can orchestrate and distribute the pipeline loads transparently.
The architecture is composed of one or more Docker images containing all the code needed to execute the tasks triggered by the MWAA environment. The Docker images are stored in Amazon ECR and executed through ECS Fargate.
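As an illustration, building a task image and pushing it to ECR could look like the following sketch; the account ID, region, and repository name are placeholders to replace with your own values.

```bash
# Authenticate Docker with your ECR registry (placeholder account ID and region).
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

# Build the task image from the directory containing your ETL code and Dockerfile.
docker build -t etl-tasks .

# Tag and push the image so ECS Fargate task definitions can reference it.
docker tag etl-tasks:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/etl-tasks:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/etl-tasks:latest
```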
This architecture has the following benefits:
The AWS MWAA local runner (aws-mwaa-local-runner) is a utility that lets us test our code locally before uploading it to our cloud environments.
To use this utility, you need the Docker engine and client running on your local machine, together with Docker Compose.
Use the following instructions to install it:
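A minimal sketch of the installation, assuming you clone the aws-mwaa-local-runner repository from GitHub:

```bash
# Clone the aws-mwaa-local-runner repository and switch into it.
git clone https://github.com/aws/aws-mwaa-local-runner.git
cd aws-mwaa-local-runner
```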
Build the Docker container image using the following command:
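Based on the helper script shipped with aws-mwaa-local-runner:

```bash
# Build the local MWAA Docker image.
./mwaa-local-env build-image
```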
Note: it takes several minutes to build the Docker image locally.
Next, run a local Apache Airflow environment whose configuration is a close representation of MWAA.
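Assuming the same helper script, the environment is started with:

```bash
# Start the local Airflow environment; the UI is then available at
# http://localhost:8080 (the default credentials in the README are admin / test).
./mwaa-local-env start
```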
To stop the local environment, press Ctrl+C in the terminal and wait until the local runner and Postgres containers have stopped.
If the process fails with the error "dag_stats_table already exists", you'll need to reset your database using the following command:
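The local runner provides a reset command for its local Postgres metadata database:

```bash
# Reset the local Postgres database used by the local runner.
./mwaa-local-env reset-db
```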
Amazon ECS Fargate and Amazon MWAA are particularly useful for building and deploying containerized applications and workflows in the cloud. We'll explore how to make an ECS Fargate call from MWAA and what actually happens behind the scenes. Whether you're an experienced developer or just getting started with AWS, following the steps in this guide will give you practical insights and tips to help you get the most out of these powerful cloud services.
In the containerOverrides section, set the command that you want to execute inside the container.
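As a sketch (not the exact DAG from this example), a task that launches a Fargate container from MWAA could look like the following. The cluster, task definition, container name, subnets, and security group are placeholder values, and depending on your apache-airflow-providers-amazon version the operator may be named ECSOperator instead of EcsRunTaskOperator.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="etl_fargate_example",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_etl_task = EcsRunTaskOperator(
        task_id="run_etl_task",
        cluster="etl-cluster",                    # placeholder ECS cluster name
        task_definition="etl-task-definition",    # placeholder task definition
        launch_type="FARGATE",
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxxxxxx"],
                "securityGroups": ["sg-xxxxxxxx"],
                "assignPublicIp": "DISABLED",
            }
        },
        # The containerOverrides section sets the command to run inside
        # the container for this particular task execution.
        overrides={
            "containerOverrides": [
                {
                    "name": "etl-container",
                    "command": ["python", "etl.py", "--date", "{{ ds }}"],
                }
            ]
        },
    )
```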
Moving to a managed Airflow service like MWAA provides many benefits, including scalability, reliability, security, and cost-effectiveness, all of which are important for managing workflows in the cloud.
Using containerized solutions for isolating tasks triggered by Airflow is a best practice for managing complex workflows. By using containerization, you can ensure that tasks are executed in a separate environment that is isolated from Airflow's internal workers. This can help improve performance and prevent failures.
Furthermore, it is important to avoid running heavy workloads through BashOperators and PythonOperators in Airflow's internal workers, as they can cause failures or degrade performance. Instead, trigger the different workloads on appropriate remote platforms, such as an ECS cluster, EKS pods, or AWS Glue, and collect the state once the task finishes.
Finally, Caylent recommends re-designing deployment activities and CI/CD to build an integrated solution that rebuilds the Docker image or deploys to the S3 bucket whenever a DAG is modified, which is a best practice for managing workflows in a scalable and reliable manner. By automating the deployment process, you can ensure that changes are implemented quickly and efficiently while minimizing the risk of errors or downtime.
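As a rough sketch of what such a pipeline step might run (registry, repository, and bucket names are placeholders, and ECR authentication is assumed to have happened as shown earlier):

```bash
# Rebuild and push the task image when the ETL code changes.
docker build -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/etl-tasks:latest .
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/etl-tasks:latest

# Sync modified DAG files to the S3 bucket configured for the MWAA environment.
aws s3 sync ./dags s3://my-mwaa-environment-bucket/dags/
```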
If you're interested in learning more, you can take a look at CI/CD on an Amazon ECS cluster with CodePipeline. That step-by-step guide provides instructions on setting up CodePipeline, CodeCommit, and CodeBuild to automate deployment and run Airflow on an ECS cluster. Automating deployment can help deliver updates more efficiently and improve your DevOps skills using AWS technologies.
Jorge Goldman is an Engineering Manager with over 12 years of experience in diverse areas from SRE to Data Science. Jorge is passionate about Big Data problems in the real world. He graduated with a Bachelor's degree in Software Engineering and a Master's degree in Petroleum Engineering and Data Science. He is always looking for opportunities to improve existing architectures with new technologies. His mission is to deliver sophisticated technical solutions without compromising quality or security. He enjoys contributing to the community through open-source projects, articles, and lectures, and loves to guide Caylent's customers through challenging problems.
Guillermo Britos is a senior software engineer with over 5 years of experience in creating and deploying software solutions for diverse industries, including education, finance, and consulting. He is skilled in cloud computing technologies, including AWS, GCP, and Azure. As a proponent of DevOps methodologies, he has implemented CI/CD pipelines for various projects. Guillermo always keeps a rubber duck on his desk for debugging purposes, because, as he says, "talking to a rubber duck is the best way to find bugs in your code."