Explore how we helped a customer in the financial sector migrate High-Performance Computing (HPC) workloads from an on-premises Slurm cluster to AWS ParallelCluster, including the process, the challenges, and the benefits.
High-Performance Computing (HPC) has become an essential tool in many industries, including the financial sector. The capital markets industry, in particular, relies heavily on computational power to process vast amounts of data, run complex simulations, and execute sophisticated trading algorithms. Our recent project involved migrating an on-premises Slurm cluster to AWS ParallelCluster for a customer in this industry. This blog post shares our perspective on the process, the benefits of AWS for HPC workloads, and why we recommend this approach.
Our customer operates in the capital markets industry, which demands high computational power, low latency, and robust data security. They had been running their HPC workloads on an on-premises Slurm cluster, which was becoming increasingly difficult to manage and scale. The customer's primary challenges were:
AWS ParallelCluster is an open-source cluster management tool that simplifies the deployment and management of HPC clusters on AWS.
It is ideal for traditional HPC workloads, such as simulations, large-scale data analysis, and scientific computing, whereas a solution like Ray is better suited to machine learning, AI, data processing, and general-purpose distributed applications. Here's why we recommended ParallelCluster for our customer:
The customer had initially migrated their workload from an on-premises Slurm setup to AWS ParallelCluster v2 to leverage cloud scalability and manageability. When they needed the newer features available in v3, they engaged Caylent to upgrade their environment to AWS ParallelCluster v3. This upgrade involved a series of strategic steps and optimizations to enhance performance. Key aspects of the upgrade included:
1. Assessment and Planning: We began with a thorough assessment of the existing HPC environment. This included understanding the current workload, resource utilization, and performance bottlenecks. We also defined the migration strategy, including timelines and milestones.
2. Implementing the ParallelCluster API: AWS provides a CloudFormation template to create the ParallelCluster API. After setting up this API, we created a Terraform provider to interface with it. This allowed us to manage ParallelCluster resources using Terraform, which the customer was already using.
3. Developing Terraform Modules: To simplify the deployment process, we developed two Terraform modules:
locals.tf file that would then deploy the parallel cluster itself.
4. Configuring Custom Scripts: We set up custom scripts for both startup and post-node configuration:
5. Testing and Optimization: Before going live, we conducted extensive testing to ensure that the new HPC environment met the customer’s performance and reliability requirements. We optimized the Slurm configuration for the AWS environment, adjusting parameters to maximize performance and efficiency.
6. Go-Live and Monitoring: After successful testing, we transitioned the HPC workloads to AWS ParallelCluster. We set up monitoring and alerting to ensure ongoing performance and quickly address any issues that arose.
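To make the steps above a little more concrete, here is a minimal Python sketch of what a ParallelCluster v3 cluster configuration for this kind of setup might look like: a head node, two Slurm queues, and custom scripts that run at node start and after node configuration. It is an illustration only, not the configuration used in the project. The subnet ID, key name, bucket, script names, region, OS, and node counts are all hypothetical placeholders; only the instance types are taken from the cost scenario in this post.

```python
import json


def slurm_queue(name, instance_type, max_count, subnet_id):
    """Return one SlurmQueues entry with a single compute resource."""
    return {
        "Name": name,
        "ComputeResources": [
            {
                # Compute resource names cannot contain dots, so swap them out.
                "Name": instance_type.replace(".", "-"),
                "InstanceType": instance_type,
                "MinCount": 0,  # scale to zero when idle (assumption)
                "MaxCount": max_count,
            }
        ],
        "Networking": {"SubnetIds": [subnet_id]},
    }


def build_cluster_config(subnet_id, key_name, scripts_bucket):
    """Assemble a ParallelCluster v3 style config as a plain dict."""
    return {
        "Region": "us-east-1",      # hypothetical region
        "Image": {"Os": "alinux2"},  # hypothetical OS choice
        "HeadNode": {
            "InstanceType": "m5.large",
            "Networking": {"SubnetId": subnet_id},
            "Ssh": {"KeyName": key_name},
            "CustomActions": {
                # Startup script (hypothetical file name).
                "OnNodeStart": {
                    "Script": f"s3://{scripts_bucket}/on-node-start.sh"
                },
                # Post-node-configuration script (hypothetical file name).
                "OnNodeConfigured": {
                    "Script": f"s3://{scripts_bucket}/on-node-configured.sh"
                },
            },
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                slurm_queue("queue1", "c5.2xlarge", 4, subnet_id),
                slurm_queue("queue2", "r5.4xlarge", 2, subnet_id),
            ],
        },
    }


# All three arguments are placeholders, not real resources.
config = build_cluster_config(
    subnet_id="subnet-0123456789abcdef0",
    key_name="example-key",
    scripts_bucket="example-hpc-scripts",
)
print(json.dumps(config, indent=2))
```

In practice the same structure would normally be kept as a YAML file and passed to `pcluster create-cluster --cluster-configuration`, or submitted through the ParallelCluster API, which is what a Terraform-based workflow like the one described here would wrap.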
To give a concrete example of the costs associated with running a cluster on AWS, let's consider the following scenario:
Head node: m5.large EC2 instance.
Queue 1: c5.2xlarge EC2 instances.
Queue 2: r5.4xlarge EC2 instances.
Let's calculate the monthly costs assuming the cluster runs 8 hours per day:
1. Head Node (m5.large):
2. Queue 1 (c5.2xlarge):
3. Queue 2 (r5.4xlarge):
Total Monthly Cost:
Total: $833.28
This scenario assumes the cluster is not running 24/7 but only for 8 hours a day, which is a typical workday scenario. The costs can vary based on actual usage patterns and instance types selected. By leveraging AWS's flexibility, our customer can optimize costs further by using reserved instances or spot instances.
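The arithmetic behind a scenario like this can be sketched in a few lines of Python. The node counts and hourly rates below are our assumptions, since the text here does not state them: one head node, four c5.2xlarge nodes, two r5.4xlarge nodes, a 30-day month, and on-demand Linux rates of $0.096, $0.34, and $1.008 per hour. Under those assumptions the total works out to the $833.28 quoted above, but actual prices vary by region and over time, so check current AWS pricing before relying on these figures.

```python
from decimal import Decimal

HOURS_PER_DAY = 8    # from the scenario: the cluster runs 8 hours per day
DAYS_PER_MONTH = 30  # assumption: a 30-day month

# (label, instance type, node count, hourly USD rate)
# Node counts and rates are assumptions for illustration only.
NODES = [
    ("Head node", "m5.large", 1, Decimal("0.096")),
    ("Queue 1", "c5.2xlarge", 4, Decimal("0.34")),
    ("Queue 2", "r5.4xlarge", 2, Decimal("1.008")),
]


def monthly_cost(count, hourly_rate,
                 hours_per_day=HOURS_PER_DAY, days=DAYS_PER_MONTH):
    """Cost of `count` instances running `hours_per_day` for `days` days."""
    return count * hourly_rate * hours_per_day * days


# Per-line costs keyed by label, then the overall monthly total.
costs = {label: monthly_cost(count, rate) for label, _, count, rate in NODES}
total = sum(costs.values(), Decimal("0"))

for label, instance_type, count, _ in NODES:
    print(f"{label} ({count} x {instance_type}): ${costs[label]:.2f}")
print(f"Total: ${total:.2f}")
```

Using `Decimal` rather than floats keeps the cents exact, which matters when the result is compared against a quoted figure.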
The migration to AWS ParallelCluster brought several immediate benefits to our customer:
Migrating HPC workloads to AWS ParallelCluster offers significant advantages, especially for industries like capital markets that require high computational power, scalability, and robust security. Our experience with this migration project demonstrated the value of AWS in providing a flexible, cost-effective, and high-performance HPC environment.
If you're considering a similar move, AWS ParallelCluster, combined with Slurm's processing capabilities and customizable features, can be a game-changer for your HPC needs. By leveraging AWS's vast infrastructure, you can focus on what matters most: driving innovation and achieving your business goals. Get in touch with us to get started.
Mohmmad El-Zaghah is a Principal Architect at Caylent with over 12 years of experience in the tech industry. As a 6x AWS Certified expert, he specializes in migration strategies to AWS and leads DevOps transformations and implementations. Mohmmad is also a subject matter expert in AWS CDK, leveraging his deep knowledge to design scalable, efficient cloud solutions. With six years of experience in the Microsoft Windows ecosystem, he brings a unique blend of expertise across multiple platforms. His strategic approach and technical prowess have been instrumental in driving successful migrations and optimizing cloud infrastructures for high performance.
As a Cloud Architect at Caylent with 9 years of experience in technology, Leticia has been immersed in the world of AWS since 2018, holding 7 certifications on the platform. Passionate about cloud architecture, she brings deep experience to imagining and implementing impactful solutions for clients across a wide range of industries. In addition to technology, she is also a gaming enthusiast and finds joy in outdoor adventures with her husband and children.