
Multi-Region Disaster Recovery for QLDB

Application Modernization

Learn how to implement disaster recovery capabilities for your Amazon Quantum Ledger Database (QLDB) to improve the availability of your applications across different regions or accounts.

Disaster Recovery

Disaster recovery is the process of maintaining or reestablishing vital infrastructure and systems following a natural or human-induced disaster. It involves the policies, tools, and procedures needed to recover or continue operating critical IT infrastructure, software, and systems. It is considered a subset of business continuity, focusing explicitly on ensuring that the IT systems that support critical business functions are operational as soon as possible after a disruptive event. There are several approaches to disaster recovery, depending on the expected recovery time and the available budget. In this blog, we are going to focus on two main ones:

  • Backup & Restore: this strategy refers to making copies of the data periodically and saving them in a different location or on a different device. If a disaster happens, these backups can be used to recover the saved data. This is the “coldest” approach because hours can pass between the moment the disaster happens and the moment the application is up again: the entire application has to be redeployed, along with the database, and the backup files have to be restored. On the other hand, it can be the cheapest option, because there are no replicated services running; only a single database, a single set of application instances, and the backup storage are billed.
  • Warm Standby: in this case, there are two application servers and two databases running at the same time and replicated online, but only one of them is operative. The other services that may be involved, such as an EC2 instance or a container running the application code, are on standby, ready to take over as soon as a disaster happens. The advantage of this strategy is that the recovery time, and consequently the application and database downtime, is greatly reduced. At the same time, the cost of the solution is higher, potentially up to double the cost of the standalone system.

QLDB

Amazon Quantum Ledger Database (QLDB) is an AWS-managed ledger database that provides a complete and cryptographically verifiable history of all changes made to your application data.

Ledger databases differ from traditional databases in that data is written in an append-only way, providing full data lineage. The data in these databases is immutable and verifiable. The same can be achieved in traditional databases, but it requires custom development.

QLDB is useful for applications in which data integrity, completeness, and verifiability are critical: for example, logistics applications that need to track movements between carriers and across borders, or finance applications that track critical data such as credit and debit transactions.

Although QLDB is highly available across multiple Availability Zones by design, it does not offer native support for data snapshots, cross-region replication, or point-in-time recovery. It only offers the ability to export the journal, including every statement executed between two points in time, to S3 in different formats, including ION and JSON.

This lack of native DR support is the main reason we decided to build a manual implementation of the strategies mentioned above for QLDB, which is useful when an application needs to be highly available across different regions or accounts.

The following sections describe the techniques designed to provide disaster recovery capabilities for QLDB.

Backup Strategy

The solution we designed to take regular backups looks like the following:

[Architecture diagram: an EventBridge rule triggers a Lambda function that exports the QLDB journal to an S3 bucket]

This architecture is composed of an EventBridge rule that triggers a Lambda function at a fixed interval. The interval depends on the RPO required by the customer or application; in our case, it was triggered hourly.
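As a rough sketch of how that schedule could be wired up with boto3 (the rule name, function ARN, and account/region below are placeholders, not the values from our deployment):

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder names and ARNs.
RULE_NAME = "qldb-hourly-backup"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:qldb-export"

# Trigger the export Lambda every hour (the RPO chosen in this example).
events.put_rule(Name=RULE_NAME, ScheduleExpression="rate(1 hour)")
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "qldb-export-lambda", "Arn": FUNCTION_ARN}],
)

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-eventbridge-hourly-backup",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=f"arn:aws:events:us-east-1:123456789012:rule/{RULE_NAME}",
)
```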

The Lambda function is built in Python and uses boto3’s QLDB client to trigger the export to S3 for a given start and end date and time (the export_journal_to_s3 method).
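A minimal sketch of such a handler, assuming a one-hour RPO; the ledger name, bucket, and export role ARN are placeholders read from hypothetical environment variables:

```python
import datetime
import os

import boto3

# Placeholder configuration, read from hypothetical environment variables.
LEDGER_NAME = os.environ["LEDGER_NAME"]
EXPORT_BUCKET = os.environ["EXPORT_BUCKET"]
EXPORT_ROLE_ARN = os.environ["EXPORT_ROLE_ARN"]  # role allowed to write to the bucket

qldb = boto3.client("qldb")


def lambda_handler(event, context):
    # Export the journal data committed during the last hour (our RPO).
    end_time = datetime.datetime.now(datetime.timezone.utc)
    start_time = end_time - datetime.timedelta(hours=1)

    response = qldb.export_journal_to_s3(
        Name=LEDGER_NAME,
        InclusiveStartTime=start_time,
        ExclusiveEndTime=end_time,
        RoleArn=EXPORT_ROLE_ARN,
        S3ExportConfiguration={
            "Bucket": EXPORT_BUCKET,
            "Prefix": f"backups/{start_time:%Y/%m/%d/%H}/",
            "EncryptionConfiguration": {"ObjectEncryptionType": "SSE_S3"},
        },
        OutputFormat="JSON",
    )
    return {"exportId": response["ExportId"]}
```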

The destination S3 bucket, in this case configured with S3 Object Lock in compliance mode, will store the hourly backup files and will also have cross-region replication configured to copy these files to another region. If cross-account replication is also needed for security or regulatory reasons, it can be configured without major issues.
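For reference, a sketch of enabling that replication with boto3, assuming versioning is already enabled on both buckets and that the bucket names and replication role are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket names and IAM role for replication.
s3.put_bucket_replication(
    Bucket="qldb-backups-primary",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/qldb-backup-replication",
        "Rules": [
            {
                "ID": "replicate-qldb-backups",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": "backups/"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::qldb-backups-dr"},
            }
        ],
    },
)
```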

An example of the content of these files is shown below.
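The snippet below is an abridged, illustrative sketch of a single journal block from a JSON export; field values are placeholders, and fields such as hashes and document revisions are omitted:

```json
{
  "blockAddress": { "strandId": "JdxjkR9bSYB5jMHWcI464T", "sequenceNo": 1234 },
  "transactionId": "D35qctdJRU1L1N2VYxbLnR",
  "blockTimestamp": "2024-06-01T12:00:00.000Z",
  "transactionInfo": {
    "statements": [
      {
        "statement": "INSERT INTO Vehicle VALUE {'VIN': '1N4AL11D75C109151'}",
        "startTime": "2024-06-01T11:59:58.000Z"
      },
      {
        "statement": "SELECT * FROM Vehicle WHERE VIN = '1N4AL11D75C109151'",
        "startTime": "2024-06-01T11:59:59.000Z"
      }
    ]
  },
  "revisions": ["..."]
}
```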

Note that the export files contain ALL statements executed against the database, including SELECTs; the export cannot be configured to include only DDL/DML statements, so this has to be taken into account when restoring the data into the target database.

This backup workflow is independent of whether it is being used for the Backup & Restore or Warm Standby DR approaches. What can be modified depending on the chosen strategy is the interval between each QLDB export to S3.

Recovery Strategy

Option 1, Backup & Restore

For this recovery approach, there are two Lambda functions responsible for restoring the data into the new QLDB ledger. In this example, we show two regions to illustrate that the data can be restored both in the source/original region (us-east-1 in this case) and in the secondary region. The bucket in the secondary region can also be used to create a new ledger in a region other than the primary or secondary one.

The two-step Lambda approach consists of:

  1. A first Lambda that lists the .json files in the backup bucket and publishes all of them to an SQS queue. This distributes the restoration load across many Lambda invocations instead of a single execution, which may not be sufficient depending on the database size (see the first sketch after this list).

  2. A second Lambda that reads the incoming files from the SQS queue, one by one, and selects and runs only the DML statements (INSERT, UPDATE, DELETE) against the target QLDB ledger (see the second sketch after this list).
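A minimal sketch of the first, fan-out Lambda, assuming the bucket, prefix, and queue URL come from hypothetical environment variables:

```python
import json
import os

import boto3

# Placeholder configuration.
BACKUP_BUCKET = os.environ["BACKUP_BUCKET"]
BACKUP_PREFIX = os.environ.get("BACKUP_PREFIX", "backups/")
QUEUE_URL = os.environ["RESTORE_QUEUE_URL"]

s3 = boto3.client("s3")
sqs = boto3.client("sqs")


def lambda_handler(event, context):
    # List every exported .json journal file and fan it out to SQS so the
    # restoration work is spread across many Lambda invocations.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BACKUP_BUCKET, Prefix=BACKUP_PREFIX):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".json"):
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=json.dumps(
                        {"bucket": BACKUP_BUCKET, "key": obj["Key"]}
                    ),
                )
```

And a sketch of the second, restoration Lambda, using the pyqldb driver. The target ledger name is a placeholder, the parsing assumes each file deserializes into journal blocks shaped like the example shown earlier, and the replay assumes the recorded statements can be executed as-is (i.e., they were not parameterized):

```python
import json
import os

import boto3
from pyqldb.driver.qldb_driver import QldbDriver

TARGET_LEDGER = os.environ["TARGET_LEDGER"]  # placeholder ledger name
DML_PREFIXES = ("INSERT", "UPDATE", "DELETE")

s3 = boto3.client("s3")
driver = QldbDriver(ledger_name=TARGET_LEDGER)


def lambda_handler(event, context):
    # Each SQS record points at one exported journal file.
    for record in event["Records"]:
        message = json.loads(record["body"])
        raw = s3.get_object(Bucket=message["bucket"], Key=message["key"])["Body"].read()
        export = json.loads(raw)
        blocks = export if isinstance(export, list) else [export]

        # Re-run only the DML statements, in the order they appear in the file.
        for block in blocks:
            statements = block.get("transactionInfo", {}).get("statements", [])
            for stmt in statements:
                text = stmt.get("statement", "").strip()
                if text.upper().startswith(DML_PREFIXES):
                    driver.execute_lambda(
                        lambda executor, s=text: executor.execute_statement(s)
                    )
```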

In this approach, the RTO tends to be higher and grows roughly linearly with the number of statements to re-run. On the other hand, the cost is significantly lower, as we only pay for the backup storage plus a single ledger’s storage and reads/writes.

Option 2, Warm Standby

In this case, the workflow is similar to the previous one. The only difference is that the first Lambda, the SQS fan-out, is not necessary, because the DR bucket is configured to publish a message to the queue through S3 event notifications whenever a new backup file arrives.
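A sketch of how the DR bucket could be wired to the queue; the bucket name and queue ARN below are placeholders, and the queue’s access policy must also allow S3 to send messages:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and queue ARN.
s3.put_bucket_notification_configuration(
    Bucket="qldb-backups-dr",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": "arn:aws:sqs:us-west-2:123456789012:qldb-restore-queue",
                "Events": ["s3:ObjectCreated:*"],
                # Only exported .json journal files should trigger a restore message.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".json"}]}
                },
            }
        ]
    },
)
```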

After that step, the restoration Lambda works the same as in the Backup & Restore strategy; the only adjustment is parsing the S3 event notification payload to obtain the bucket and key instead of our own message format.

This solution has some pros and cons compared to the previous one. As a pro, it offers a very fast RTO, as the database is already loaded and running. As a con, the cost of the solution is higher: we have to pay double for QLDB storage and for the writes to the database, because we keep two copies of the data (or more, depending on the number of backup regions we configure). On the other hand, we pay only once for reads, as these go only to the active database.

Option 3, Intermediate approach

To take advantage of some of the pros and avoid some of the cons of the two previous solutions, we can build an intermediate solution in which we keep the backup database “mostly” up to date.

This approach is useful when there is a lot of historical data to reprocess. The idea is to pre-load only the older backups and skip the most recent ones. This way, we pay less than double for storage and writes (the backup database holds less data), and the RTO is significantly reduced because the database is nearly up to date.

This can be achieved by executing an initial load of the older backups using the Backup & Restore option, filtering which files are sent to the second region. Once the backup database is loaded with the older data, the rest of the solution is similar to the original one, with all the involved services (QLDB, S3) waiting for a disaster before loading the newer backup files. One possible way to implement the filtering is shown below.
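A sketch of that filtering, assuming a hypothetical seven-day cutoff and placeholder names; only objects older than the cutoff are enqueued for the initial load, leaving the newer files to be replayed only if a disaster actually happens:

```python
import datetime
import json
import os

import boto3

# Placeholder configuration; pre-load everything older than 7 days.
CUTOFF = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=7)
BACKUP_BUCKET = os.environ["BACKUP_BUCKET"]
QUEUE_URL = os.environ["RESTORE_QUEUE_URL"]

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BACKUP_BUCKET, Prefix="backups/"):
    for obj in page.get("Contents", []):
        # LastModified is a timezone-aware datetime, so it compares directly.
        if obj["Key"].endswith(".json") and obj["LastModified"] < CUTOFF:
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"bucket": BACKUP_BUCKET, "key": obj["Key"]}),
            )
```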

Disclaimer

This proposed architecture is only a possible solution to the lack of DR support in QLDB.

Conclusion

Even though QLDB is a relatively new service, we consider that it fits the business requirements it was created for. It is managed by AWS, so you don’t have to worry about handling infrastructure configurations, and it is highly available.

The main point against it is the lack of a native DR strategy and native backup support. We believe that AWS will eventually develop something in that area, but for now, this is our suggestion for working around the limitation.

The main purpose of this blog was to give a brief idea of the available functionality we can use to build a homemade strategy in case a disaster happens in our production environment. We know it can be improved or adjusted depending on the use case and on the database load or size, and we recommend doing so if these solutions are not feasible as described in your environment.

One recommendation we can give is to build a process, perhaps as a nightly job, to clean and merge the content of the backup files, for example by deleting the transactions we don’t need to replay, such as the SELECTs. This way, there will be fewer and smaller backup files to load, which reduces storage and processing costs.
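A sketch of such a compaction step, assuming the same illustrative file layout as above and a hypothetical "compacted/" prefix for the output:

```python
import json

import boto3

s3 = boto3.client("s3")


def compact_backup_file(bucket: str, key: str) -> None:
    # Drop SELECT statements from an exported journal file so that only
    # replayable DML remains, then write the smaller file back under a new prefix.
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    export = json.loads(raw)
    blocks = export if isinstance(export, list) else [export]

    for block in blocks:
        info = block.get("transactionInfo")
        if info and "statements" in info:
            info["statements"] = [
                s
                for s in info["statements"]
                if not s.get("statement", "").strip().upper().startswith("SELECT")
            ]

    s3.put_object(Bucket=bucket, Key=f"compacted/{key}", Body=json.dumps(blocks))
```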

Caylent can help set you up for innovation and success on the AWS Cloud. Get in touch with our team to discuss how we can help you achieve your goals.

Franco Balduzzi

Franco Balduzzi is a Sr Data Engineer with 6 years of experience developing data solutions for different industries, including medical, consulting, and finance/banking. He graduated in Systems Engineering in 2019, and his strongest skills are data-related AWS and GCP cloud computing technologies. He likes to build highly available and performant end-to-end solutions. Being a proactive person, he is always open to helping and contributing with his partners, because he maintains that “you like to be helped when you are in a rush”.

Jorge Goldman

Jorge Goldman is an Engineering Manager with over 12 years of experience in diverse areas, from SRE to Data Science. Jorge is passionate about Big Data problems in the real world. He graduated with a Bachelor's degree in Software Engineering and a Master's degree in Petroleum Engineering and Data Science. He is always looking for opportunities to improve existing architectures with new technologies. His mission is to deliver sophisticated technical solutions without compromising quality or security. He enjoys contributing to the community through open-source projects, articles, and lectures, and loves to guide Caylent's customers through challenging problems.
