Understanding Disaster Recovery in AWS
What Is a Disaster?
A disaster is any event that negatively impacts your system’s availability, performance, or business continuity. It could be physical damage, cyber attacks, accidental deletions, or regional outages.
The Goal of Disaster Recovery (DR)
DR is about being ready before disaster hits. It’s the process of designing systems that can recover quickly and efficiently with minimal data loss. In AWS, DR isn’t just about data backups—it’s about designing resilient architectures and using cloud-native tools to get your systems back online fast.
Key Concepts: RTO and RPO
Two terms you’ll see again and again:
- RPO (Recovery Point Objective): How much data you can afford to lose (measured in time). E.g. “We can tolerate losing the last 5 minutes of data.”
- RTO (Recovery Time Objective): How long it takes to get back up and running. E.g. “We must be online within 30 minutes.”
Traditional vs Cloud DR Scenarios
- On-Prem → On-Prem: Traditional, costly, and inflexible.
- On-Prem → AWS: Hybrid DR using AWS as a backup and recovery environment.
- AWS Region A → AWS Region B: Full cloud-native DR strategy, offering high automation and rapid recovery.
AWS Disaster Recovery Strategies
Backup and Restore
This is the simplest and most cost-effective option. Backups are stored in S3 or S3 Glacier and can be replicated across regions. Restoration takes the longest of the four strategies.
- RTO: Hours to days
- RPO: Minutes to hours
- Use Case: Non-critical systems or startups minimizing costs
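If you want to automate this pattern, a minimal boto3 sketch (volume ID and regions are placeholders) could snapshot an EBS volume and copy it into a second region:

```python
import boto3

# Hypothetical IDs/regions for illustration only
SOURCE_REGION = "us-east-1"
DR_REGION = "eu-west-1"
VOLUME_ID = "vol-0123456789abcdef0"

ec2 = boto3.client("ec2", region_name=SOURCE_REGION)
ec2_dr = boto3.client("ec2", region_name=DR_REGION)

# 1. Snapshot the volume in the primary region
snap = ec2.create_snapshot(
    VolumeId=VOLUME_ID,
    Description="Nightly DR backup",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# 2. Copy the completed snapshot into the DR region
copy = ec2_dr.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=snap["SnapshotId"],
    Description="DR copy of nightly backup",
)
print("DR snapshot:", copy["SnapshotId"])
```

In a real setup you would schedule this (or use Data Lifecycle Manager / AWS Backup) rather than running it by hand.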
Pilot Light
A minimal version of your environment is always running in AWS—just enough to power the most critical functions.
- RTO: Tens of minutes
- RPO: Minutes
- Use Case: Businesses needing faster recovery without the cost of a full standby environment
Warm Standby
A scaled-down version of the full system runs in AWS, ready to scale up quickly.
- RTO: Minutes
- RPO: Sub-minute to minutes
- Use Case: Medium-to-high availability systems
Multi-Site / Hot Site
Production workloads run in two or more locations simultaneously.
- RTO: Seconds to minutes
- RPO: Near zero
- Use Case: Mission-critical applications that require maximum uptime
AWS Tips for Better DR
Backups
- Use EBS snapshots, RDS backups, and S3 versioning
- Implement S3 lifecycle policies and Cross Region Replication (see the sketch after this list)
- Use Snowball or Storage Gateway for large on-premises backups
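For the lifecycle tip above, here is a hedged sketch of a rule that tiers backup objects to Glacier after 30 days and expires them after a year (bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; adjust retention to your compliance needs
s3.put_bucket_lifecycle_configuration(
    Bucket="my-backup-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-backups",
                "Status": "Enabled",
                "Filter": {"Prefix": "backups/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```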
High Availability
- Deploy multi-AZ and multi-region setups where possible
- Route 53 can route traffic to a healthy region (see the failover sketch after this list)
- Use Site-to-Site VPN as a backup to Direct Connect
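As a rough illustration of the Route 53 point above, a health check plus a primary/secondary failover record pair might look like this in boto3 (hosted zone ID, domain names, and endpoints are placeholders):

```python
import boto3

r53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000000000"  # placeholder hosted zone

# Health check against the primary region's endpoint
hc = r53.create_health_check(
    CallerReference="primary-hc-001",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# PRIMARY answers while the health check passes; SECONDARY takes over when it fails
r53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "primary.example.com"}],
                    "HealthCheckId": hc["HealthCheck"]["Id"],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "SetIdentifier": "secondary",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "dr.example.com"}],
                },
            },
        ]
    },
)
```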
Replication
- RDS cross-region replicas, Aurora Global Databases
- Continuous replication from on-prem to AWS with DMS
- File-level replication via Storage Gateway
Automation
- Use CloudFormation or Elastic Beanstalk to spin up infrastructure
- Set up CloudWatch Alarms to trigger failover or reboot EC2 (sketched after this list)
- Lambda can automate customized recovery workflows
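For example, a CloudWatch alarm can invoke the built-in EC2 recover action when the system status check fails. A minimal sketch with a placeholder instance ID and region:

```python
import boto3

REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

cw = boto3.client("cloudwatch", region_name=REGION)

# Recover the instance onto healthy hardware when the system status check fails
cw.put_metric_alarm(
    AlarmName=f"recover-{INSTANCE_ID}",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[f"arn:aws:automate:{REGION}:ec2:recover"],
)
```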
Embracing Chaos
Borrow a page from Netflix’s “Simian Army”: test DR by simulating failures. This helps you find and fix weaknesses before a real disaster.
AWS Services for Migration and Recovery
AWS Database Migration Service (DMS)
DMS is designed to migrate databases with minimal downtime. You can move data between on-premises and AWS, or between AWS services—supporting both homogeneous migrations (e.g. PostgreSQL to PostgreSQL) and heterogeneous migrations (e.g. Oracle to MySQL).
Key Features
- Supports one-time migrations or ongoing replication (great for minimizing cutover time)
- Works with most commercial and open-source DBs (Oracle, SQL Server, MySQL, PostgreSQL, etc.)
- You don’t need to install agents on the source or target databases
How It Works
- DMS uses a replication instance, which is an EC2 instance under the hood (you manage the specs, AWS manages the patching)
- The replication instance connects to your source and target databases and performs the migration tasks
- You configure endpoints for source and destination, then create and manage migration tasks (full load, ongoing changes, or both)
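A hedged boto3 sketch of that setup (identifiers, hostnames, and credentials are placeholders; in practice you would also supply a subnet group and security groups):

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Replication instance that will run the migration tasks
ri = dms.create_replication_instance(
    ReplicationInstanceIdentifier="dr-replication-instance",
    ReplicationInstanceClass="dms.t3.medium",
    AllocatedStorage=100,
)

# Source endpoint (on-prem Oracle) and target endpoint (RDS MySQL)
source = dms.create_endpoint(
    EndpointIdentifier="onprem-oracle",
    EndpointType="source",
    EngineName="oracle",
    ServerName="oracle.corp.example.com",  # placeholder hostname
    Port=1521,
    DatabaseName="ORCL",
    Username="dms_user",
    Password="***",
)
target = dms.create_endpoint(
    EndpointIdentifier="rds-mysql",
    EndpointType="target",
    EngineName="mysql",
    ServerName="mydb.abc123.us-east-1.rds.amazonaws.com",  # placeholder
    Port=3306,
    Username="admin",
    Password="***",
)
```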
When Source and Target Engines Differ
If the source and target databases use different engines, you need to use the AWS Schema Conversion Tool (SCT). DMS only moves the data—not the schema (like tables, indexes, stored procedures, etc.).
Examples:
- Oracle → MySQL: Use SCT to convert schema, then DMS to migrate the data
- SQL Server → PostgreSQL: Same deal—SCT handles schema translation
SCT will tell you which parts of the schema can be auto-converted and where manual work is required (especially if you’re using vendor-specific functions).
Example: Ongoing Replication from On-Prem Oracle to RDS MySQL
Let’s say your Oracle database is running in your corporate data center, and you want to continuously replicate data into an Amazon RDS for MySQL instance. Here’s a high-level setup:
- Set up network connectivity: You need connectivity between AWS and your data center. Typically, use a Site-to-Site VPN or AWS Direct Connect.
- Provision a DMS replication instance: This is an EC2 instance managed by DMS. Place it in a VPC that can reach both source and target.
- Install Oracle client drivers on the replication instance (automated by AWS if you select the right engine version).
- Create source and target endpoints: DMS needs login credentials and connection details for both the Oracle DB and RDS MySQL.
- Use SCT to convert the Oracle schema to MySQL: Apply the converted schema to the RDS target DB before data replication starts.
- Create a DMS task: Set it to do a full load + ongoing replication (using Oracle’s redo logs for change data capture).
- Monitor and validate: Use DMS’ validation tools to compare source/target data and ensure accuracy.
This setup allows you to keep the on-prem Oracle database live while the data is streamed into RDS. When you’re ready, you can cut over to the AWS-hosted DB with minimal disruption.
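Continuing the DMS sketch above (ARNs below stand in for the endpoints and replication instance created earlier), the migration task itself would request a full load plus change data capture:

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Placeholder ARNs from the endpoints/instance created in the previous sketch
SOURCE_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:ONPREMORACLE"
TARGET_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:RDSMYSQL"
INSTANCE_ARN = "arn:aws:dms:us-east-1:123456789012:rep:DRREPLICATION"

# Full load of existing data, then ongoing change data capture from the redo logs
task = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-mysql-cdc",
    SourceEndpointArn=SOURCE_ARN,
    TargetEndpointArn=TARGET_ARN,
    ReplicationInstanceArn=INSTANCE_ARN,
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-app-schema",
            "object-locator": {"schema-name": "APP", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

# Start once the task reports a ready status
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```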
Limitations and Considerations
- Replication latency depends on the network between the replication instance and your source DB
- DDL changes (e.g. new columns) aren’t automatically handled unless explicitly enabled
- Some data types and vendor-specific functions may not convert cleanly during schema conversion
- Always test and rehearse migration workflows before doing it live
RDS & Aurora Migrations
Aurora is fully compatible with MySQL and PostgreSQL, which makes migrations fairly straightforward — whether you’re coming from RDS or from an external database. The general tools used for migration include RDS snapshots, read replica promotion, DMS (for live migrations), and sometimes direct S3-based restores.
Migrating from RDS to Aurora
If you’re already using RDS for MySQL or PostgreSQL, migrating to Aurora involves minimal effort. You have two main options:
- Option 1: Restore from RDS Snapshots
You can take a snapshot of your RDS instance and restore it directly as a new Aurora database. This is a simple lift-and-shift and works well for one-time migrations.
- Option 2: Promote an Aurora Read Replica
You can create an Aurora Read Replica from your RDS MySQL or PostgreSQL database. Once replication is caught up (i.e. replication lag is zero), you can promote the replica to be a standalone Aurora cluster.
This method is ideal if you want near-zero downtime, but it does take time and incurs extra cost while both databases are running.
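A hedged sketch of the snapshot route (Option 1), assuming placeholder identifiers and a MySQL engine version that Aurora can ingest:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Snapshot the existing RDS MySQL instance (placeholder identifiers)
snap = rds.create_db_snapshot(
    DBInstanceIdentifier="legacy-mysql",
    DBSnapshotIdentifier="legacy-mysql-pre-aurora",
)
rds.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="legacy-mysql-pre-aurora"
)

# Restore the snapshot as a new Aurora MySQL cluster (DB snapshots are referenced by ARN)
rds.restore_db_cluster_from_snapshot(
    DBClusterIdentifier="aurora-mysql-cluster",
    SnapshotIdentifier=snap["DBSnapshot"]["DBSnapshotArn"],
    Engine="aurora-mysql",
)

# An Aurora cluster still needs at least one instance to serve queries
rds.create_db_instance(
    DBInstanceIdentifier="aurora-mysql-instance-1",
    DBClusterIdentifier="aurora-mysql-cluster",
    DBInstanceClass="db.r6g.large",
    Engine="aurora-mysql",
)
```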
Migrating from External MySQL to Aurora MySQL
You’ve got a couple of solid paths here:
- Option 1: S3-Based Migration Using Percona XtraBackup
- Use Percona XtraBackup to take a backup of your source MySQL database.
- Upload the backup files to an S3 bucket.
- Use that S3 bucket to restore into a new Aurora MySQL database.
This is a faster method compared to logical dumps and is ideal for larger datasets.
- Option 2: mysqldump Utility
- Use mysqldump to export your data.
- Import it into a newly created Aurora MySQL instance.
This is easier but much slower — good for smaller databases or dev environments.
- Option 3: Use AWS DMS
If both source and target databases are live and network-accessible, you can use AWS Database Migration Service (DMS) for continuous replication. DMS works well if you want to keep both environments in sync for a period of time (e.g. for testing before full cutover).
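For the XtraBackup route (Option 1 above), Aurora can ingest the backup files straight from S3. A hedged sketch with placeholder bucket, IAM role, and credentials:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Restore a Percona XtraBackup stored in S3 into a new Aurora MySQL cluster
rds.restore_db_cluster_from_s3(
    DBClusterIdentifier="aurora-from-xtrabackup",
    Engine="aurora-mysql",
    MasterUsername="admin",
    MasterUserPassword="***",             # placeholder
    SourceEngine="mysql",
    SourceEngineVersion="8.0.32",         # version of the source MySQL server
    S3BucketName="my-xtrabackup-bucket",  # placeholder bucket
    S3Prefix="backups/2024-06-01",
    S3IngestionRoleArn="arn:aws:iam::123456789012:role/aurora-s3-restore",
)
```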
Migrating from External PostgreSQL to Aurora PostgreSQL
PostgreSQL migrations follow a similar structure:
- Option 1: Snapshot-Like Migration via S3
- Take a PostgreSQL backup.
- Upload it to S3.
- Use the aws_s3 extension in Aurora PostgreSQL to import the backup.
This is efficient and allows you to leverage S3 as an intermediate storage layer.
- Option 2: Use AWS DMS
As with MySQL, you can use DMS to perform live migration from an external PostgreSQL source into Aurora PostgreSQL. Continuous replication helps with low-downtime or blue/green deployment strategies.
- Option 3: Aurora Read Replica and Promotion (for RDS PostgreSQL)
This works exactly like the MySQL version — you create an Aurora Read Replica from your RDS PostgreSQL, and once replication lag hits zero, promote it to its own cluster.
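One way to express the replica-and-promote flow with boto3, assuming placeholder identifiers and source ARN (promote only once replication lag has reached zero):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# 1. Create an Aurora PostgreSQL cluster that replicates from the RDS instance
rds.create_db_cluster(
    DBClusterIdentifier="aurora-pg-replica",
    Engine="aurora-postgresql",
    ReplicationSourceIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:db:legacy-postgres"  # placeholder
    ),
)
rds.create_db_instance(
    DBInstanceIdentifier="aurora-pg-replica-1",
    DBClusterIdentifier="aurora-pg-replica",
    DBInstanceClass="db.r6g.large",
    Engine="aurora-postgresql",
)

# 2. Later, once replica lag is zero, detach it as a standalone cluster
rds.promote_read_replica_db_cluster(DBClusterIdentifier="aurora-pg-replica")
```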
Architectural Considerations
- Aurora Global Databases: These allow a primary DB in one region with read replicas in others. They’re great for cross-region DR and global apps. In the event of a failure, you can promote a secondary region to be the new primary.
- Backtracking (Aurora MySQL only): Aurora MySQL supports backtracking, which lets you roll back your DB to a previous state without restoring from backups. This is great for recovering from logical errors without full restore downtime.
- The Aurora Read Replica promotion process isn’t instantaneous — plan for the time and cost.
- DMS requires that both source and target databases are accessible and supported. It won’t convert incompatible schema types — for that, use AWS Schema Conversion Tool (SCT).
- Always monitor replication lag if you’re using read replicas. Promotion before catching up can result in data loss.
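Replica lag is exposed as a CloudWatch metric (AuroraReplicaLag for Aurora replicas), so you can poll or alarm on it before promoting. A minimal sketch with a placeholder instance identifier:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")

# Average Aurora replica lag (milliseconds) over the last 15 minutes
stats = cw.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="AuroraReplicaLag",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "aurora-pg-replica-1"}],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=15),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], "ms")
```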
On-Premise Migration Strategies
If you’re running workloads in an on-premises data center and want to either migrate to AWS or set up disaster recovery capabilities, AWS offers several tools and strategies to help you make that transition smoothly.
Virtual Machine (VM) Migrations to AWS
You can migrate your existing virtual machines from on-prem into AWS and even bring them back again if needed:
- Amazon Linux 2 as a VM: You can download Amazon Linux 2 in .iso format and run it on your on-prem hypervisor (e.g. VMware, KVM, VirtualBox, or Microsoft Hyper-V). This is useful for consistency between dev/test environments and AWS.
- VM Import/Export: Use this to import your existing on-prem VMs into Amazon EC2 and run them there as instances. You can also export them back out to your data center if needed.
- Disaster Recovery Repo: Build a DR strategy by storing critical VM images in AWS as a cold standby, ready to launch into EC2 during a failure event.
Planning and Assessing Your Migration
Before diving into the actual move, AWS provides tools to help you assess what’s running in your data center:
- AWS Application Discovery Service: Automatically collects detailed info on your on-prem servers—like CPU usage, network activity, and software inventory. It helps you plan what to migrate and how to size your AWS resources.
- Server Utilization & Dependency Mapping: Understand which services talk to each other and track how your infrastructure performs, which is essential for avoiding surprises post-migration.
- AWS Migration Hub: Use this to centralize and track the status of all your migration projects across different AWS services.
Migrating On-Prem Databases
- AWS Database Migration Service (DMS): This service lets you replicate data between:
- On-premise and AWS
- AWS to AWS (e.g. between regions)
- AWS back to on-premise (useful for DR testing or hybrid scenarios)
DMS supports a wide variety of engines, including Oracle, MySQL, PostgreSQL, SQL Server, and even DynamoDB. If your source and target databases use different engines (e.g. Oracle to Aurora MySQL), you’ll need SCT to convert the schema and code objects (like stored procedures and triggers) before you can migrate the data.
Migrating Entire Servers
- AWS Server Migration Service (SMS): This service performs incremental replication of your live on-premises servers (including OS and application state) into AWS. It’s ideal for large-scale server migration projects and can help minimize downtime during cutover.
AWS Backup
AWS Backup is a fully managed, centralized backup service that helps you automate and consolidate backups across AWS services—without the complexity of writing custom scripts or managing scattered backup processes manually.
Why Use AWS Backup?
Instead of setting up individual backup solutions for EC2, RDS, or S3, AWS Backup lets you manage everything from one place. It supports a broad set of AWS services:
- Compute & Storage: EC2, EBS, S3
- Databases: RDS (all engines), Aurora, DynamoDB, DocumentDB, Neptune
- File Systems: EFS, FSx (for Windows and Lustre)
- Hybrid Storage: AWS Storage Gateway (Volume Gateway)
It also supports cross-region backups, allowing you to store copies in different AWS Regions for disaster recovery. Even better—it supports cross-account backups, which is a great practice for isolating backup data from the source environment to protect against accidental or malicious deletion.
Backup Plans: Automate Everything
AWS Backup uses Backup Plans, which are essentially blueprints that define how and when backups happen. Plans are flexible and tag-driven, which means you can apply rules to resources automatically based on tags (like Environment=Production).
A typical backup plan includes:
- Backup frequency: Choose from predefined intervals (e.g. every 12 hours, daily, weekly) or define your own with cron expressions.
- Backup windows: Define when the backup operation should run.
- Retention periods: Choose how long to keep backups (from days to years—or forever).
- Transition to cold storage: Automatically move older backups to cheaper storage tiers (like Glacier), based on your cost and compliance needs.
- Point-in-Time Recovery (PITR): Available for supported services like RDS and DynamoDB, so you can restore to a precise moment just before a failure.
Whether you’re backing up critical databases, file systems, or entire EC2 instances, these features make it easy to design a robust and compliant backup lifecycle.
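A hedged sketch of a daily backup plan with a cold-storage transition and a tag-driven selection (plan name, vault, and role ARN are placeholders):

```python
import boto3

backup = boto3.client("backup", region_name="us-east-1")

# Daily backups, moved to cold storage after 30 days, deleted after a year
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "production-daily",
        "Rules": [
            {
                "RuleName": "daily-0300-utc",
                "TargetBackupVaultName": "Default",
                "ScheduleExpression": "cron(0 3 * * ? *)",
                "StartWindowMinutes": 60,
                "CompletionWindowMinutes": 360,
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": 30,
                    "DeleteAfterDays": 365,
                },
            }
        ],
    }
)

# Attach every resource tagged Environment=Production to the plan
backup.create_backup_selection(
    BackupPlanId=plan["BackupPlanId"],
    BackupSelection={
        "SelectionName": "by-environment-tag",
        "IamRoleArn": "arn:aws:iam::123456789012:role/aws-backup-default-role",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "Environment",
                "ConditionValue": "Production",
            }
        ],
    },
)
```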
Vault Lock
Security is a major concern in disaster recovery planning, and AWS Backup doesn’t cut corners here. AWS Backup Vault Lock enforces a WORM (Write Once, Read Many) model, which makes sure that once a backup is created, it can’t be modified or deleted—not even by the root user.
Vault Lock protects your backups from:
- Accidental deletions
- Malicious tampering
- Unintended retention changes
Once Vault Lock is enabled, your backup data is truly immutable—making it a powerful tool against ransomware and internal threats.
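Enabling Vault Lock is a single API call. A minimal sketch with a placeholder vault name and retention values (once the ChangeableForDays grace period expires, the lock itself becomes immutable):

```python
import boto3

backup = boto3.client("backup", region_name="us-east-1")

backup.put_backup_vault_lock_configuration(
    BackupVaultName="production-vault",  # placeholder vault
    MinRetentionDays=30,    # recovery points can't be deleted earlier than this
    MaxRetentionDays=365,   # nor retained longer than this
    ChangeableForDays=3,    # after 3 days the lock configuration becomes immutable
)
```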
AWS Application Migration Service (MGN)
MGN lets you rehost applications from pretty much any source—physical servers, VMs, or cloud-hosted systems—into AWS EC2.
But before you migrate, it helps to know what you’re dealing with. That’s where the Application Discovery Service steps in.
Why use it? To gather insight into your existing environment—especially when you don’t have a full inventory or want to map dependencies.
Two ways to discover:
- Agentless Discovery Connector (usually deployed in vCenter):
- Collects inventory, VM configurations, performance stats (CPU, memory, disk)
- Agent-based Discovery Agent:
- Deeper insight: system configs, processes, and network connections between systems
Once collected, all this data shows up in AWS Migration Hub, where you can plan and track your migration project centrally.
Now you can proceed with the migration itself using AWS Application Migration Service (MGN). What it does:
- Replicates your source machines into AWS with continuous block-level replication
- After testing, it spins up EC2 instances from the replicated volumes
- The original server can remain online until cutover (minimizing downtime)
- Supports Linux and Windows, with wide compatibility
Why MGN?
- Lift-and-shift solution that simplifies migrating applications to AWS
- Fully managed and agent-based
- More cost-effective and scalable than the legacy SMS
- Handles complex environments without having to rearchitect immediately
- Converts physical, virtual and cloud-based servers to run natively on AWS
- Minimal downtime, reduced costs
This is ideal when you want to migrate fast, without changing how the app is built.
Bonus Tip: You can combine MGN with CloudWatch Alarms, Systems Manager, or Lambda to automate post-migration steps like installing agents or patching.
Transferring Large Datasets into AWS
Transferring huge datasets (think 100s of TBs) into AWS needs careful planning. The options range from quick-and-easy internet transfers to physical appliance shipping. Choose based on time, bandwidth, and use case. Let’s say you need to move 200TB to AWS. With just a 100 Mbps line:
- Over internet/Site-to-Site VPN: Easy to start, but expect ~185 days
- Over AWS Direct Connect (1 Gbps): Faster, but setup takes time — roughly 18.5 days of transfer
- With AWS Snowball: Hardware appliances sent to you — typically 1 week for full cycle
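The rough arithmetic behind those estimates (decimal units, full link utilization, no protocol overhead):

```python
def transfer_days(dataset_tb: float, link_mbps: float) -> float:
    """Naive transfer time: dataset size in bits divided by the line rate."""
    bits = dataset_tb * 1e12 * 8          # terabytes -> bits (decimal units)
    seconds = bits / (link_mbps * 1e6)    # megabits per second -> bits per second
    return seconds / 86_400               # seconds -> days

print(transfer_days(200, 100))    # ~185 days over a 100 Mbps line
print(transfer_days(200, 1_000))  # ~18.5 days over 1 Gbps Direct Connect
```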
How do these options stack up against each other?
1. Over the Internet / VPN
- Good for small or ongoing trickles of data
- Encryption and secure tunneling required
- Slow and not ideal for huge one-time moves
2. AWS Direct Connect
- Dedicated network line to AWS
- 1–10 Gbps (or higher) speeds
- Takes time to provision, but once up, it’s reliable and private
3. AWS Snow Family (Snowball / Snowmobile)
- AWS ships you physical devices to load data on-site
- Snowball Edge (up to 80TB per device) — rugged, secure, efficient
- Great for “big bang” migrations
- Supports offline encryption and tracking
- Use 2–3 in parallel for faster turnaround
Ongoing syncs?
- Use DataSync or DMS over Direct Connect or VPN for incremental data flows
Pro Tip: You can chain Snowball for the initial load and then switch to DMS or DataSync for ongoing replication.
VMware Cloud on AWS
VMware Cloud on AWS lets you run your existing VMware environments in the AWS Cloud without needing to refactor. It’s a powerful bridge for enterprises that want hybrid cloud flexibility. Many enterprises rely on VMware to run their data centers. Rewriting every app for the cloud can be costly and slow. But what if you could just lift those VMware-based apps as-is into AWS and keep using the same tools? That’s what VMware Cloud on AWS does.
Key Features
- Seamlessly extend your vSphere-based workloads into AWS
- Use familiar VMware tools (vCenter, vMotion, NSX, vSAN) — just now on EC2-backed infrastructure
- AWS provides the underlying compute, storage, and networking
Use Cases
- Cloud Bursting: Temporarily expand capacity into AWS during peak loads
- Disaster Recovery: Use VMware Cloud on AWS as your DR target for on-prem workloads
- Data Center Extension or Exit: Migrate in stages, or decommission on-prem data centers gradually
Bonus: Integrated with other AWS services — use S3 for backups, connect with AWS Direct Connect, or run analytics on data using native AWS services like Athena or Redshift.
Real-world scenario: A financial services firm running mission-critical apps on vSphere wants to expand to a second region for resilience but doesn’t want to rewrite the app. With VMware Cloud on AWS, they replicate their VMs and fail over easily — all without retraining staff.