08.14.20

Run The World Case Study

By JP La Torre

Company Introduction

Run The World is an online event platform designed for organizers, attendees, speakers, and sponsors worldwide. It helps people build virtual events through simple plug-and-play templates, enabling them to create engaging online events for their attendees.

Run The World helps professional organizations hold more frequent meetings and engage more members with little financial cost or setup time. The platform offers exciting formats and proprietary technology that enable attendees to interact with speakers and socialize with each other in a fun and engaging way.

Project Dates

March 2020 – July 2020

Challenge

Run The World was referred to Caylent by Andreessen Horowitz (a16z), which led the California-based company’s recent $10.8 million Series A round. The company was experiencing a severe production outage in its Kubernetes environment hosted on AWS. Rapid growth in events and in the number of online attendees at its virtual conferences, due in part to the global COVID-19 pandemic, had fully destabilized the company’s production environment.

Solution

Run The World contacted us on a Friday morning in March in the midst of a production outage. Within three hours of the phone call, the Caylent DevOps team was triaging and remediating the situation, which was caused in part by a misconfigured service mesh and huge amounts of inbound traffic across several of the company’s hosted events. Many pots of coffee later, Run The World’s production environment was redeployed and back online that same day.

Immediately after the production environment was stabilized through Caylent’s SWAT team approach, our team shifted to solutions architecture and began planning a major platform upgrade for the company. The new roadmap created by Caylent called for migrating the company’s self-hosted Kubernetes cluster, initially deployed with kops, to Amazon Elastic Kubernetes Service (EKS), with Terraform managing the infrastructure as code.

Adopting EKS provided Run The World with highly available, reliable, and resilient Kubernetes clusters that leverage Amazon EC2 Spot Instances. We were also able to implement an AWS Savings Plan for considerable cost optimization without sacrificing scalability or availability. The team also carried out a number of GraphQL tweaks to reduce latency and noise and improve performance. For further improvements, the team configured and deployed Elasticsearch, Prometheus, and Jaeger operators to streamline monitoring and alerting (with Slack integration and notifications as well as SSL certs), including:

  • Management and monitoring of multiple clusters
  • Easy upgrades to new stack versions
  • Scaling cluster capacity up and down
  • Adjusting cluster configuration
  • Dynamically scaling local storage 
  • Scheduling backups

Everything proposed in the roadmap was designed to protect the company from future outages and downtime by improving the resilience of the infrastructure and introducing scalability to ensure a smooth end user experience during large online events. 

As part of the roadmap, the Caylent team built and deployed brand new EKS Production, Staging, and Development environments, with Terraform managing the infrastructure as code. Using Terraform, the team implemented IAM Roles for Service Accounts (IRSA) to allow pods running in the EKS cluster to automatically assume an IAM role when using other AWS resources. We created custom Helm charts to improve Kubernetes deployments and performed service load testing using k6 to prepare the system for large events and traffic volumes. Additionally, we migrated the company’s database to Amazon Aurora so the DB cluster can handle future scaling.
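To make the IRSA idea concrete, here is a minimal sketch of the kind of IAM trust policy that lets a single Kubernetes service account assume a role. It is written in Python purely for illustration and is not the Terraform used on the project. The account ID, OIDC provider string, namespace, and service account name are all hypothetical placeholders.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Build an IAM trust policy scoped to one Kubernetes service account.

    All argument values are illustrative; oidc_provider is the cluster's
    OIDC issuer without the https:// prefix.
    """
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The cluster's OIDC provider is the federated principal.
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Only tokens issued for this service account qualify.
                        f"{oidc_provider}:sub": subject,
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = irsa_trust_policy(
        account_id="111122223333",  # placeholder account ID
        oidc_provider="oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE",  # placeholder issuer
        namespace="default",
        service_account="api",
    )
    print(json.dumps(policy, indent=2))
```

The service account is then annotated with the role's ARN (the `eks.amazonaws.com/role-arn` annotation), so pods that use it receive temporary credentials for that role rather than inheriting the node's permissions.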

The volume of work covered in a short period of time on Run The World’s behalf also included:

  • Building a NET Diagram outlining platform and AWS architecture
  • Implementing Horizontal Pod Autoscaler (HPA) in the EKS environment
  • Creating developer documentation using Notion for future reference and learning
  • Configuring and building SES, SNS, SQS in Terraform
  • Building, deploying, and managing GitLab pipelines for the development team
  • Implementing Amazon ElastiCache backups
  • Setting up and running liveness/readiness probes to ensure pods run smoothly
  • Assisting in the onboarding and development of a new internal DevOps hire
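For readers unfamiliar with the Horizontal Pod Autoscaler mentioned in the list above, the sketch below reproduces the scaling rule described in the Kubernetes documentation: the desired replica count is the current count multiplied by the ratio of the observed metric to its target, rounded up. The function name, the replica bounds, and the example numbers are illustrative assumptions, not Run The World's actual settings.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Approximate the HPA rule: ceil(current_replicas * current / target).

    The bounds and tolerance are illustrative defaults for this sketch.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go outside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target scale out to 6 pods.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

Setting the minimum and maximum replica counts is the main design decision: the minimum keeps capacity warm ahead of a large event, and the maximum caps cost if traffic spikes.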

Results

In addition to providing swift tactical DevOps SWAT team remediation of the production outage, the team at Caylent spent the next several months executing a transformative platform re-architecture. That work helped stabilize the company’s environments, provided significant ROI on the engagement, and allowed Run The World to expand its user base and host increasingly large online events with tens of thousands of simultaneous attendees. Prior to the partnership with Caylent in 2020, this would have been unachievable.

Testimonial

Caylent is a great team to work with; the team are lifesavers who helped us recover our company’s service in a very short amount of time. We love how communicative they are with us and how they provide detailed explanations on all the projects they are trying to do ahead on our behalf. This work frees up our engineering team who can fully focus on building the product instead and move faster on testing new features.

Caylent’s team leader Agustin, who worked closely with us, is very responsive, organized, and willing to adapt to our schedule where necessary for important deployments. Kevin, Caylent’s VP of engineering, also spent his valuable time helping us to train our engineers and interview potential DevOps candidates. 

With the support from Caylent, Run The World’s service stability is now significantly increased. We never experienced any unintentional down time and thanks to Caylent there are many crucial monitoring and alerts systems built in place to mitigate risk. With the new CI/CD deployment process, the product team is able to spend their time on product releases. All this wraps up to mean we had a great time collaborating with Caylent.      

⁠— Xuan Jiang, CTO

About Caylent

Caylent provides a critical DevOps-as-a-Service function to high-growth companies looking for expert support with Kubernetes, cloud security, cloud infrastructure, and CI/CD pipelines. Our managed and consulting services are a more cost-effective option than hiring in-house, and we scale as your team and company grow. Check out some of the use cases, learn how we work with clients, and read more about our DevOps-as-a-Service offering.