Cloud Platform Enterprise is designed for high availability, with guaranteed 99.95% uptime. This page describes how Acquia delivers high availability for Cloud Platform Enterprise and Site Factory applications.
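To put the 99.95% figure in perspective, the short sketch below converts an uptime percentage into a downtime budget. The measurement period is an assumption here (a 30-day month and a 365-day year are shown for illustration); the period that actually applies is defined by the SLA attached to your subscription, and `downtime_budget_minutes` is a hypothetical helper name.

```python
def downtime_budget_minutes(uptime_pct: float, period_days: float) -> float:
    """Return the maximum downtime, in minutes, that an uptime percentage allows."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - uptime_pct / 100)


# Assumed measurement periods; the real SLA period may differ.
print(round(downtime_budget_minutes(99.95, 30), 1))   # 21.6 minutes per 30-day month
print(round(downtime_budget_minutes(99.95, 365), 1))  # 262.8 minutes per 365-day year
```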
High-availability architecture
Cloud Platform is built on Amazon Web Services (AWS) infrastructure, which is physically remote from Acquia’s offices. Cloud Platform subscribers may choose the geographic region for their application’s location.
Each region has several AWS Availability Zones, which are separate yet interconnected data centers within the major regions. Acquia Cloud Platform Enterprise offers high availability by using several AWS Availability Zones in one AWS region, with redundant infrastructure serving each layer of the technology stack. The following are the five main components of a Drupal application hosted on Cloud Platform Enterprise:
- Platform CDN (optional) for global cached content delivery
- Regionally-based reverse proxy caching and load balancing infrastructure (Nginx and Varnish®)
- Application layer infrastructure (Apache, PHP, Drupal code, cron, SSH, and Memcached)
- File system infrastructure
  - Red Hat Gluster (Cloud Classic)
  - AWS EFS (Cloud Next)
- Database infrastructure
  - Percona MySQL (Cloud Classic)
  - AWS Aurora provisioned with MySQL (Cloud Next)
The Cloud Platform edge layer handles all public traffic for Drupal applications. Platform CDN is available to eligible Cloud Platform Enterprise subscribers as a global option for cached content delivery. Load balancing infrastructure in the application’s primary region provides additional cached content delivery and high-availability routing to the underlying application-layer infrastructure.
Regional load balancers are configured by default in a hot-cold configuration, with one load balancer handling traffic and another in a different AWS Availability Zone available for failover by Acquia in the event of an emergency. An alternative configuration is available for Cloud Platform Professional or Cloud Platform Enterprise subscribers using Acquia’s legacy SSL certificate installation method, which triggers both load balancers to become active and serve traffic simultaneously as soon as you repoint your DNS to the CNAME listed on the Domains page of the Cloud Platform interface.
On Cloud Platform Enterprise and Site Factory, production environments will always leverage high-availability infrastructure in a single region for application, file system, and database services. For applications running on Cloud Next technologies, non-production environments will also leverage high-availability infrastructure for these three layers.
Acquia’s high-availability network file system operates in a hot-hot configuration with both nodes continuously syncing with each other.
Acquia’s high-availability database layer operates in a hot-standby configuration, with the active node handling all read/write activity while replicating MySQL data to the passive node. If the active node becomes impaired, the Cloud Platform environment automatically fails over to the hot-standby node using a Domain Name System (DNS)-based failover process. Normal data sync operations resume to the previously active node, now functioning as the hot standby, as soon as the system detects that it is healthy again.
Cloud Platform disaster recovery
In the event of a major disaster impacting the availability or performance of a production application’s primary availability zone or region on Acquia Cloud Platform Enterprise or Site Factory, Acquia will make every reasonable effort to restore subscriber services as quickly as possible using alternative availability zones or regions.
Acquia’s Cloud Platform Enterprise and Site Factory products also include disaster recovery services in the event of catastrophic disk failure across all available production environment nodes or total data center loss impacting a production environment. To facilitate this service, Acquia takes disaster recovery snapshots of all data every hour and retains them on a diminishing schedule for three months.
Data stored in disaster recovery snapshots includes application code, static files, and databases. Integrated backup facilities use Amazon EBS and automations that programmatically store snapshots in Amazon S3 buckets (Amazon’s highly available cloud storage). For environments running on Cloud Next technologies, native and custom logic associated with AWS EFS and AWS Aurora Provisioned are also utilized while making snapshots.
Backup data is stored in the same region (for example, US-East, US-West, or EU-West) where the production application is located. Amazon S3 repositories are distributed amongst various Availability Zones (data centers) and several devices within each Availability Zone for redundancy.
The Recovery Point Objective (RPO) for production environments running on Cloud Platform Enterprise and Site Factory is one hour. For environments with file systems or databases exceeding 500 GB in size, the most recent recovery point available may be over one hour old due to snapshot task durations.
To achieve this RPO, Cloud Platform Enterprise and Site Factory production environments generate and store disaster recovery snapshots according to the following schedule:
- Hourly Snapshots: Up to three hours old, taken at the start of each hour.
- Daily Snapshots: Up to seven days old, taken as soon after midnight as possible each day.
- Weekly Snapshots: Up to four weeks old, taken as soon after midnight as possible on Sundays.
- Monthly Snapshots: Up to three months old, taken as soon after midnight as possible on the first of the month.
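The diminishing schedule above can be sketched as a function that decides which tier, if any, still retains a given snapshot. This is only an illustration, not Acquia's implementation: it assumes snapshots land exactly on the hour, treats the 00:00 snapshot as the "midnight" one, and approximates three months as 90 days. The name `retention_tier` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def retention_tier(taken_at: datetime, now: datetime) -> Optional[str]:
    """Return the tier that retains a snapshot, or None if it has expired."""
    age = now - taken_at
    if age < timedelta(0):
        raise ValueError("snapshot timestamp is in the future")
    if age <= timedelta(hours=3):
        return "hourly"
    # Assumption: the snapshot taken at hour 0 is the daily/weekly/monthly one.
    is_midnight = taken_at.hour == 0
    if is_midnight and age <= timedelta(days=7):
        return "daily"
    # datetime.weekday() returns 6 for Sunday.
    if is_midnight and taken_at.weekday() == 6 and age <= timedelta(weeks=4):
        return "weekly"
    # Assumption: "three months" is approximated as 90 days.
    if is_midnight and taken_at.day == 1 and age <= timedelta(days=90):
        return "monthly"
    return None


now = datetime(2024, 3, 20, 12, 0)
print(retention_tier(datetime(2024, 3, 20, 10, 0), now))  # hourly
print(retention_tier(datetime(2024, 3, 3, 0, 0), now))    # weekly (a Sunday)
print(retention_tier(datetime(2024, 3, 18, 5, 0), now))   # None
```

Checking the tiers from shortest to longest retention keeps the logic readable: a snapshot is reported under the first tier that still covers it.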
The Recovery Time Objective (RTO) on snapshot recovery operations is up to 1 hour per 50 GB of data in the file system or database, whichever is larger. Cloud Platform does not offer a formal RTO for recovery operations that are unrelated to data recovery, as such incidents are application specific. Such cases might require subscriber intervention as well. For more details about terms and conditions, refer to the Cloud Platform infrastructure uptime SLA associated with your subscription.
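As a rough worked example of the snapshot-recovery RTO, the sketch below scales the "1 hour per 50 GB" figure linearly against the larger of the two data sets. Linear scaling is an assumption made for illustration, and `rto_upper_bound_hours` is a hypothetical helper; the SLA remains the authoritative source.

```python
def rto_upper_bound_hours(
    file_system_gb: float, database_gb: float, gb_per_hour: float = 50.0
) -> float:
    """Estimate the snapshot-recovery RTO ceiling from the larger data set."""
    if file_system_gb < 0 or database_gb < 0:
        raise ValueError("sizes must be non-negative")
    # Only the larger of the file system and the database counts.
    return max(file_system_gb, database_gb) / gb_per_hour


# A 120 GB file system with a 300 GB database: the database dominates.
print(rto_upper_bound_hours(120, 300))  # 6.0
```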
Acquia’s ability to achieve the specified RTO and RPOs may be impacted by disasters, regional incidents, dependencies on prompt subscriber intervention, or other circumstances outside of Acquia’s direct control.
On the Acquia Cloud Next platform, non-production environments also have disaster recovery capabilities for up to 30 days.
All Cloud Platform subscribers are advised to take regular backups of application code, databases, and files for use in any non-disaster scenarios.
Enhanced disaster recovery with multi-region failover
For subscribers with mission-critical applications that can’t afford any downtime in the event of a disaster, Acquia offers an optional enhanced configuration for Cloud Platform Enterprise applications that continuously replicates all production infrastructure and data to a secondary region.
Edge layer resiliency and failovers
Eligible subscribers on Cloud Platform Enterprise are advised to point DNS for all customer domains to Acquia’s Platform CDN for maximum resiliency on all applications, especially if those applications are used by people outside of the application’s primary region. Using Acquia’s Platform CDN provides extra resiliency in the event of high traffic activity on your application or localized networking issues in specific parts of the world.
For subscribers without Platform CDN available or in use, only regional edge-layer infrastructure is available for cached content delivery and load balancing purposes. This layer continues to function with Platform CDN, Cloud Edge CDN, Cloud Edge Security, or a third-party CDN in use.
Cloud Platform Professional and Cloud Platform Enterprise subscribers leveraging the “legacy” SSL certificate installation method for custom SSL certificates will automatically gain the benefits of hot-hot load balancers on Acquia’s regional edge layer, increasing resiliency in the unlikely event of infrastructure impairment. In this scenario, Acquia’s edge layer infrastructure will automatically stop routing traffic to an impaired load balancer node.
Subscribers without this configuration will have at least two Cloud Platform load balancing nodes available in a hot-cold pairing. In the event that the primary node becomes impaired, Acquia will initiate a failover to the secondary load balancer to restore service. No DNS changes are required when this happens, but the application may be unavailable for several minutes while this failover takes place.
After the failover process, impacted applications may experience slower performance than usual while the new primary load balancer’s caches are replenished.
Database backups and failovers
On Cloud Platform Enterprise and Site Factory production environments running on Cloud Classic infrastructure, the active database in a database pair is marked with a DNS pointer. If the DNS infrastructure detects the active database is not responding, the following steps occur:
- The DNS infrastructure attempts to mark the passive database as the active database.
- Any queries requiring changes to the database are handled by the functioning database.
- Acquia repairs the unresponsive database.
- The repaired database is re-synchronized with the currently active database using MySQL Binlogs.
- The DNS pointer is reassigned, or failed back, to the original active database.
Once both databases are responsive, data is synchronized between them, and the DNS pointer has been failed back, high availability has been restored.
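The Cloud Classic sequence described above can be modeled as a small state machine. The sketch below is purely illustrative and does not reflect Acquia's internal tooling: the class `DatabasePair`, its method names, and the node names `db-a` and `db-b` are all hypothetical, and the binlog replay is reduced to a flag.

```python
class DatabasePair:
    """Toy model of an active/passive database pair behind a DNS pointer."""

    def __init__(self, original: str = "db-a", standby: str = "db-b") -> None:
        self.original = original
        self.standby = standby
        self.healthy = {original: True, standby: True}
        self.in_sync = True
        self.dns_pointer = original  # the node currently marked active

    def health_check(self) -> str:
        """Repoint DNS at the other node if the active node is unresponsive."""
        if not self.healthy[self.dns_pointer]:
            other = self.standby if self.dns_pointer == self.original else self.original
            if self.healthy[other]:
                self.dns_pointer = other
                self.in_sync = False  # the failed node now lags behind
        return self.dns_pointer

    def repair_and_resync(self, node: str) -> None:
        """Stand-in for repairing a node and replaying binlogs onto it."""
        self.healthy[node] = True
        self.in_sync = True

    def fail_back(self) -> None:
        """Return the DNS pointer to the original node once it is caught up."""
        if self.in_sync and self.healthy[self.original]:
            self.dns_pointer = self.original

    def is_highly_available(self) -> bool:
        return (
            all(self.healthy.values())
            and self.in_sync
            and self.dns_pointer == self.original
        )


pair = DatabasePair()
pair.healthy["db-a"] = False      # the active database stops responding
print(pair.health_check())        # db-b
pair.repair_and_resync("db-a")
pair.fail_back()
print(pair.dns_pointer, pair.is_highly_available())  # db-a True
```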
For environments running on Cloud Next technologies, Acquia leverages more advanced database failover functionality with similar logic to the process on Cloud Classic environments.