Scale ETL with AWS Managed Workflows

Learn how Amazon Managed Workflows for Apache Airflow (MWAA) offers an efficient and easy way to build ETL pipelines that are scalable and cost-effective.

Data engineers tend to struggle with scaling when designing an Extract-Transform-Load (ETL) pipeline. What if our 500 MB dataset becomes 50 GB tomorrow? Are we ready to handle that load? How much will it cost? Should we be worrying about this uncertain future now? How can we be prepared?

Data engineers and architects lose sleep over these questions. In this article, we present an efficient and easy way to build ETL pipelines that scale while paying only for actual usage in an AWS environment.

Let’s talk about orchestration first: The benefits of AWS Managed Airflow

ETL involves not only the pipelines that do the actual ETL work, but also the orchestration of those pipelines. Many ETL pipelines on AWS are built on EC2 instances in which orchestration and execution of the pipeline share and compete for resources. The image below shows a very common pattern: one primary instance running Airflow and several worker instances, behind an Auto Scaling group, to handle the load.

This configuration, even though it is stable, has a number of design flaws:

  1. Configuring all the moving parts tends to be tedious and time-consuming
  2. It requires a lot of AWS expertise to configure every element of the architecture securely
  3. Scaling is hard, as the worker EC2 instances are not provisioned automatically
  4. Jobs compete for resources (memory and vCPUs) on the worker EC2 instances
  5. If an EC2 instance crashes for any reason, all the worker jobs running on it are killed, leaving your pipeline in an inconsistent state
  6. No high availability, as the Airflow instance is deployed in a single AZ
  7. Deploying new code is a nightmare, and there is no easy way to roll back

So, how do we overcome all the issues mentioned above? What is the best approach for distributing the load without having to worry about it?

Amazon Managed Workflows for Apache Airflow (MWAA)

AWS offers its customers an out-of-the-box managed service for Apache Airflow with high availability by default.

This managed service is fully configurable and provides several benefits over the setup described above:

  1. Easy deployment: just a couple of clicks in the console, or an Infrastructure as Code (IaC) tool like CloudFormation (see the sketch after this list)
  2. High Availability (HA): no need to set up a complex architecture to achieve HA
  3. Automatic logging: system metrics are automatically logged to CloudWatch
  4. Secure: encryption is handled for you; data is encrypted with AWS KMS
  5. Change management: DAG changes are easily deployed through S3, with versioning
  6. Seamless integration: connects to other AWS services such as Athena, Glue, and ECS/EKS
  7. High capacity: three environment sizes, holding up to 1,000 DAGs in the largest configuration
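To give a sense of how little setup is involved, here is a minimal sketch that provisions an MWAA environment with boto3 rather than CloudFormation. The region, environment name, Airflow version, bucket, role ARN, subnets, and security group are placeholder assumptions you would replace with your own values.

    # Minimal sketch: creating an MWAA environment with boto3.
    # Every name, ARN, subnet, and security group below is a placeholder.
    import boto3

    mwaa = boto3.client("mwaa", region_name="us-east-1")

    response = mwaa.create_environment(
        Name="etl-orchestrator",
        AirflowVersion="2.4.3",                  # pick a version supported by MWAA
        EnvironmentClass="mw1.small",            # one of the three available sizes
        MaxWorkers=5,
        ExecutionRoleArn="arn:aws:iam::123456789012:role/mwaa-execution-role",
        SourceBucketArn="arn:aws:s3:::my-mwaa-dags-bucket",
        DagS3Path="dags",                        # prefix in the bucket that holds the DAG files
        NetworkConfiguration={
            "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],   # two private subnets
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
        },
    )
    print(response["Arn"])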

But what about the workers then? How do I distribute the pipeline load?

MWAA integrates seamlessly with other AWS resources. In this example, we will use ECS Fargate as the easiest way to distribute the pipeline load without worrying about resource allocation or contention that can lead to performance problems, while paying only for what is used.

The following image shows an example of a highly scalable architecture that can orchestrate and distribute the pipeline loads transparently.

The architecture is composed of one or more Docker images that contain all the code needed to execute the tasks triggered by the MWAA environment. The Docker images are pushed to Amazon ECR and run on ECS Fargate.
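As a purely hypothetical illustration, the code baked into such an image can be as simple as an entrypoint script that reads its parameters from the command MWAA passes in (the script name, argument, and ETL logic below are made up for the example):

    # Hypothetical entrypoint for the task image, e.g. invoked as:
    #   python extract_task.py --date 2023-01-01
    import argparse
    import sys


    def run(date: str) -> None:
        # Placeholder for the real extract/transform/load work for one partition.
        print(f"Processing partition {date}")


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="ETL task executed on ECS Fargate")
        parser.add_argument("--date", required=True, help="Partition date to process")
        args = parser.parse_args()
        try:
            run(args.date)
        except Exception as exc:
            # A non-zero exit code marks the ECS task (and the Airflow task) as failed.
            print(f"Task failed: {exc}", file=sys.stderr)
            sys.exit(1)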

This architecture has the following benefits:

  1. All task executions are completely isolated from each other and do not compete for resources
  2. Tasks can run with different task definitions, in which we explicitly set the amount of memory and the number of vCPUs to allocate
  3. Docker images can be easily updated and redeployed through CI/CD
  4. In case of an error, it is very easy to roll back to a previous image
  5. The workload scales horizontally
  6. Simpler architecture (less maintenance overhead)
  7. Developers can run local environments on their machines that match the target architecture

MWAA local runner: test the code before deploying to a hosted environment

The AWS MWAA local runner is a utility that lets us test our code locally before uploading it to our cloud environments.

To use this utility, the Docker engine and client must be running on your local machine, together with Docker Compose.

Use the following instructions to install it:
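Assuming you are working from the official aws-mwaa-local-runner repository on GitHub:

    git clone https://github.com/aws/aws-mwaa-local-runner.git
    cd aws-mwaa-local-runner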

Build the Docker container image using the following command:
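    # helper script shipped at the root of the aws-mwaa-local-runner repository
    ./mwaa-local-env build-image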

Note: it takes several minutes to build the Docker image locally.

Run a local Apache Airflow environment that closely mirrors an MWAA environment by configuration:
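    # starts the local runner container together with a local Postgres metadata database
    ./mwaa-local-env start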

To stop the local environment, press Ctrl+C in the terminal and wait until the local runner and the Postgres containers have stopped.

Accessing the Airflow UI
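Once the local environment is running, the Airflow UI should be reachable at http://localhost:8080. Recent versions of the local runner create a default admin/test login, but check the repository's README for the credentials that match your version.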

Deploying the first DAG to the local environment

  1. Add DAG code to the dags/ folder.
  2. Add Python dependencies to requirements/requirements.txt
  3. To test a requirements.txt without running Apache Airflow, use the following script:
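    # installs the requirements into an ephemeral container, without starting Airflow
    ./mwaa-local-env test-requirements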

If the process fails with the error "dag_stats_table already exists", you'll need to reset your database using the following command:
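    # wipes the local metadata database used by the local runner
    ./mwaa-local-env reset-db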

Now the real truth: How to use an ECS Fargate call from MWAA

Amazon ECS Fargate and Amazon MWAA are particularly useful for building and deploying containerized applications and workflows in the cloud. We'll explore how to use an ECS Fargate call from MWAA and uncover the real truth behind this process. Whether you're an experienced developer or just getting started with AWS, the steps in this guide will give you valuable insights and practical tips to help you get the most out of these powerful cloud services.

  1. Import ECSOperator into the code
  2. Set the launch_type property to 'FARGATE'
  3. Set the cluster property to the name of your ECS cluster
  4. Set the task_definition property to the task definition you want to run
  5. Set the network configuration so the task runs with the proper security group and subnets
  6. In the containerOverrides section, set the name property to the name of the container
  7. In the containerOverrides section, set the command that you want to execute inside the container
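The sketch below puts these steps together in a minimal DAG. The cluster name, task definition, subnets, security group, container name, and command are placeholder assumptions to be replaced with your own values, and on newer versions of the Amazon provider package the operator has been renamed to EcsRunTaskOperator.

    # Minimal sketch of an MWAA DAG that launches an ETL task on ECS Fargate.
    # All infrastructure names below are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.ecs import ECSOperator  # step 1

    with DAG(
        dag_id="etl_on_fargate",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        run_etl_task = ECSOperator(
            task_id="run_etl_task",
            launch_type="FARGATE",                       # step 2
            cluster="etl-fargate-cluster",               # step 3: your ECS cluster
            task_definition="etl-task-definition",       # step 4: registered task definition
            network_configuration={                      # step 5: subnets and security group
                "awsvpcConfiguration": {
                    "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
                    "securityGroups": ["sg-0123456789abcdef0"],
                }
            },
            overrides={
                "containerOverrides": [
                    {
                        "name": "etl-container",         # step 6: container name from the task definition
                        "command": ["python", "extract_task.py", "--date", "2023-01-01"],  # step 7
                    }
                ]
            },
        )

With this pattern, MWAA only tracks the state of the Fargate task; the heavy lifting runs in the isolated container, so the Airflow workers themselves stay lightly loaded.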

Conclusion

Moving to a managed Airflow service like MWAA provides many benefits, including scalability, reliability, security, and cost-effectiveness, all of which are important for managing workflows in the cloud.

Using containerized solutions for isolating tasks triggered by Airflow is a best practice for managing complex workflows. By using containerization, you can ensure that tasks are executed in a separate environment that is isolated from Airflow's internal workers. This can help improve performance and prevent failures.

Furthermore, it is important to avoid running heavy workloads with BashOperators and PythonOperators on Airflow's own workers and memory, as they can cause failures or degrade performance. Instead, it is recommended to trigger those workloads on the appropriate remote platforms, such as an ECS cluster, EKS pods, or Glue, and collect their state once the task finishes.

Finally, Caylent suggests redesigning deployment activities and CI/CD into an integrated solution that rebuilds the Docker image or deploys to the S3 bucket whenever a DAG is modified, which is a best practice for managing workflows in a scalable and reliable manner. By automating the deployment process, you can ensure that changes are implemented quickly and efficiently while minimizing the risk of errors or downtime.

If you're interested in learning more about CI/CD on an Amazon ECS cluster with code pipeline, take a look at the step-by-step guide, which provides instructions on setting up CodePipeline, CodeCommit, and CodeBuild to automate deployment and run Airflow on the ECS cluster. Automating deployment can help deliver updates more efficiently and improve your DevOps skills using AWS technologies.

Jorge Goldman

Jorge Goldman is a Sr. Big Data Architect with over 12 years of experience in diverse areas, from SRE to Data Science. Jorge is passionate about Big Data problems in the real world. He graduated with a Bachelor's degree in Software Engineering and a Master's degree in Petroleum Engineering and Data Science. He is always looking for opportunities to improve existing architectures with new technologies. His mission is to deliver sophisticated technical solutions without compromising quality or security. He enjoys contributing to the community through open-source projects, articles, and lectures, and loves to guide Caylent's customers through challenging problems.

Guillermo Britos

Guillermo Britos is a senior software engineer with over 5 years of experience in creating and deploying software solutions for diverse industries, including education, finance, and consulting. He is skilled in cloud computing technologies, including AWS, GCP and Azure. As a proponent of DevOps methodologies, he has implemented CI/CD pipelines for various projects. Guillermo always keeps a rubber duck on his desk for debugging purposes, because, as he says, "talking to a rubber duck is the best way to find bugs in your code."
