Regularly tested BCP and DR plans, evenly distributed, fully-independent, physically distanced data centres.
Disaster recovery is the process of recovering from unforeseen, unplanned events that impact the business. Various factors influence a disaster recovery plan; the most critical are the physical distance and the latency between the DC (primary data centre) and the DRC (disaster recovery site).
The DC and DRC sites must not reside in the same disaster zone. For example, as per ISO 27001, the DC and DRC must be at least 40 km away from each other. Some availability zones on cloud might not abide by this rule. If you have a DC/DRC requirement, be sure to validate the physical distance between your cloud provider's availability zones.
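One way to validate the separation is to compute the great-circle distance between the two sites from their coordinates. The sketch below uses the haversine formula; the coordinates are hypothetical placeholders (cloud providers do not always publish exact zone locations), and the 40 km default simply mirrors the figure quoted above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def meets_min_distance(dc, drc, min_km=40.0):
    """True if the two sites are at least min_km apart (straight-line distance)."""
    return haversine_km(*dc, *drc) >= min_km


# Hypothetical site coordinates, for illustration only.
dc_site = (-6.20, 106.80)
drc_site = (-6.90, 107.60)

print(f"distance: {haversine_km(*dc_site, *drc_site):.1f} km")
print("far enough:", meets_min_distance(dc_site, drc_site))
```

Straight-line distance is only a first check; whether two sites share a disaster zone also depends on local hazards such as flood plains or shared power and network paths.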
Likewise, data replication or mirroring between sites is a must for faster recovery and lower data loss after a disaster, and for that the network latency between DC and DRC must be < 1 ms. This can be achieved with dedicated interlinks between the two sites. Ensure that the availability zones on your cloud are connected via dedicated links. Do not rely on site-to-site VPNs, as they guarantee neither bandwidth nor latency.
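Distance and latency pull against each other, so it helps to estimate the latency floor that distance alone imposes. The sketch below assumes light travels through optical fibre at roughly 200,000 km/s (about 200 km per millisecond) and treats the 1 ms budget as a round-trip figure; both are assumptions of this example, and the function names are illustrative.

```python
FIBRE_SPEED_KM_PER_MS = 200.0  # approx. speed of light in optical fibre


def min_round_trip_ms(path_km):
    """Lower bound on round-trip time over a fibre path of the given length."""
    return 2 * path_km / FIBRE_SPEED_KM_PER_MS


def fits_latency_budget(path_km, budget_ms=1.0):
    """True if propagation delay alone stays within the latency budget."""
    return min_round_trip_ms(path_km) <= budget_ms


for km in (40, 100, 150):
    print(f"{km} km: >= {min_round_trip_ms(km):.2f} ms round trip, "
          f"fits 1 ms budget: {fits_latency_budget(km)}")
```

This is only a lower bound: real fibre routes are longer than the straight-line distance, and switches, routers and the replication software add their own delay, so measured latency on the actual link is what counts.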
Here are some more key components that must be included in a DR plan:
Business Continuity Plan
A business continuity plan (BCP) is a document that outlines how a business will continue operating during an unplanned disruption in service.
This plan must document the potential risk factors and the corresponding mitigation policies. It must be reviewed regularly to keep it in line with changes to the architecture.
The BCP is a key element of the DR process, as it defines the availability level of the business and hence of the software infrastructure.
RTO/RPO
RTO (Recovery Time Objective) - how fast you can recover
RPO (Recovery Point Objective) - how much data you can recover
RTO and RPO must be documented and in line with your business availability and SLA requirements.
For example, an SLA of 99.99% allows a yearly downtime of 52m 35s. Your RTO therefore becomes approximately 1 hr (strictly, a little under it): in the event of a disaster, you must be capable of recovering within that time.
Likewise, an RPO of, say, 5 mins means you must be able to recover the data up to 5 mins before the disaster occurred; in other words, up to 5 mins of data may be lost.
The good news is that you decide the RPO and RTO numbers. So commit only to what you can deliver, or you might default on legal and regulatory terms if you cannot prove the numbers with actual results.
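The downtime figure behind an SLA percentage is simple arithmetic, sketched below. The function names are illustrative, and the year length of 365.25 days is an assumption chosen because it reproduces the 52m 35s figure used above.

```python
from datetime import timedelta


def allowed_downtime(sla_percent, period_days=365.25):
    """Downtime budget for an availability SLA over the given period."""
    period_seconds = period_days * 24 * 60 * 60
    return timedelta(seconds=period_seconds * (100.0 - sla_percent) / 100.0)


def rto_fits_sla(rto, sla_percent):
    """True if one outage lasting the full RTO stays within the yearly budget."""
    return rto <= allowed_downtime(sla_percent)


budget = allowed_downtime(99.99)
minutes, seconds = divmod(int(budget.total_seconds()), 60)
print(f"99.99% SLA -> {minutes}m {seconds}s of downtime per year")
print("50 min RTO fits:", rto_fits_sla(timedelta(minutes=50), 99.99))
print("60 min RTO fits:", rto_fits_sla(timedelta(hours=1), 99.99))
```

Note that a strict 60-minute RTO slightly exceeds the 99.99% yearly budget, so an RTO of about 1 hr should be read as a rounded figure; a single full-length outage would also use up the whole year's allowance.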
Backups
Online and scheduled backups, or off-site backups, must be in place for critical systems and data: a weekly full backup, daily diffs and two-hourly transaction backups, or better.
Backups must be encrypted where necessary. Use low-cost encrypted archive storage where it is available.
Where required, the backup policy must include specific provisions for transactional DBs and auth systems to ensure consistency at restore.
Image, file and DB backups must be in place where required.
These backups must be tested and restored regularly to prove they work.
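With a full/diff/transaction schedule, a restore has to replay the backups in the right order. The sketch below assumes that "diffs" are differential backups (each one holds all changes since the last full backup), so the chain is: latest full, then the latest diff after it, then every transaction backup after that. The tuple layout and names are hypothetical, not tied to any backup tool.

```python
from datetime import datetime


def restore_chain(backups, restore_point):
    """Return the backups to apply, in order, to get as close as possible to restore_point.

    backups: iterable of (kind, taken_at) with kind in {"full", "diff", "txn"}.
    """
    usable = sorted((b for b in backups if b[1] <= restore_point), key=lambda b: b[1])
    fulls = [b for b in usable if b[0] == "full"]
    if not fulls:
        raise ValueError("no full backup available before the restore point")
    base = fulls[-1]
    chain, start = [base], base[1]
    diffs = [b for b in usable if b[0] == "diff" and b[1] > base[1]]
    if diffs:
        # Only the most recent differential is needed (assumption: diffs are cumulative).
        chain.append(diffs[-1])
        start = diffs[-1][1]
    # Transaction backups taken after the last full/diff are replayed in time order.
    chain.extend(b for b in usable if b[0] == "txn" and b[1] > start)
    return chain


backups = [
    ("full", datetime(2024, 1, 7, 0, 0)),   # weekly full
    ("diff", datetime(2024, 1, 8, 0, 0)),   # daily diffs
    ("txn", datetime(2024, 1, 8, 2, 0)),
    ("diff", datetime(2024, 1, 9, 0, 0)),
    ("txn", datetime(2024, 1, 9, 2, 0)),    # two-hourly transaction backups
    ("txn", datetime(2024, 1, 9, 4, 0)),
    ("txn", datetime(2024, 1, 9, 6, 0)),
]
failure = datetime(2024, 1, 9, 5, 0)
chain = restore_chain(backups, failure)
for kind, taken_at in chain:
    print(kind, taken_at)
print("data lost:", failure - chain[-1][1])
```

With backups alone, the worst-case data loss equals the interval between transaction backups (two hours in this schedule), which is why a tighter RPO also depends on replication between the sites.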
Compliance
Ensure your DR plan is in line with compliance and regulatory needs such as:
Data Residency & Localisation
Data Privacy & Confidentiality
Data Sharing Policies
Technically, a DR site can be anywhere as long as it provides the required connectivity and latency. For instance, GCP Singapore can be the DR site for GCP Jakarta, and technically there is nothing wrong with that. It becomes a problem when you are in an industry like banking, where, as per OJK (the Financial Services Regulator of Indonesia) regulation, data must not leave the country.
That is why you must ensure you abide by the law of the land. Have a well-defined and detailed NDA with cloud providers on data privacy and localisation.
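A residency rule like this can be turned into an automated check on the DR configuration. The sketch below is illustrative only: the lookup table and function name are hypothetical, and it covers just the two GCP regions from the example (asia-southeast1 is Singapore, asia-southeast2 is Jakarta).

```python
# Hypothetical lookup: cloud region id -> country code where the data is stored.
REGION_COUNTRY = {
    "asia-southeast1": "SG",  # GCP Singapore
    "asia-southeast2": "ID",  # GCP Jakarta
}


def dr_region_allowed(primary_region, dr_region, in_country_only):
    """True if the DR region satisfies the data residency rule."""
    if not in_country_only:
        return True
    return REGION_COUNTRY[primary_region] == REGION_COUNTRY[dr_region]


# A regulated workload in Jakarta cannot use Singapore as its DR site.
print(dr_region_allowed("asia-southeast2", "asia-southeast1", in_country_only=True))
# An unregulated workload can.
print(dr_region_allowed("asia-southeast2", "asia-southeast1", in_country_only=False))
```

A check like this only encodes the rule; which rule applies to a given workload still has to come from your legal and compliance review.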
DR Drills
DR drills must be conducted at least once a year to test and prove the RTO/RPO numbers.
Regularly tested BCP and DR plans on evenly distributed, fully independent sites need to be recorded and certified by auditors, especially for applications dealing with essential services like banking and healthcare.
This also builds confidence in in-house processes.
You get DR certified only when you prove what you have defined as your RTO and RPO.
For example, if you have defined your RTO/RPO as 1hr/10mins, you must show that you can recover from any disaster within 1 hr and that the recovered data is current to within 10 mins of the disaster time.
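The outcome of a drill can be recorded as three timestamps and compared against the targets. The sketch below assumes the drill log captures when the disaster was declared, when service was restored, and the timestamp of the most recent data present after recovery; the function and field names are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_drill(disaster_at, restored_at, last_data_at, rto, rpo):
    """Compare the recovery time and data loss achieved in a drill with the RTO/RPO targets."""
    achieved_rto = restored_at - disaster_at   # how long recovery took
    achieved_rpo = disaster_at - last_data_at  # how much data was lost
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


# Illustrative drill against the 1hr/10mins targets.
result = evaluate_drill(
    disaster_at=datetime(2024, 3, 1, 10, 0),
    restored_at=datetime(2024, 3, 1, 10, 48),
    last_data_at=datetime(2024, 3, 1, 9, 53),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=10),
)
print(result)
```

Keeping these numbers from every drill gives auditors the evidence that the documented RTO/RPO has actually been achieved.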
Summary
Applications should be evenly distributed across fully independent, physically distanced data centres.
Multi Site Data Replication for DR.
You will need dedicated interlinks between data centres to achieve the < 1 ms latency needed for data replication.
BCP and DR plans should be tested regularly.
Have a well-defined and detailed NDA with cloud providers on data privacy and localisation.