
Using multi-region failover

Important

This feature is not yet available to applications running on Acquia Cloud Next technologies.

Through multi-region failover, Acquia provides Continuity-as-a-Service using a hot cloud recovery model. With multi-region failover, your Production application has a cloned version of its full stack in a secondary failover region. In the event of a failure or substantial impairment in your primary region, you can switch your application immediately to the clone in the secondary region. Multi-region failover is available for Cloud Platform Enterprise applications as an add-on service at an extra cost.

To use multi-region failover, you must also use a CDN service, such as Edge. This is important to avoid an interruption in service; the CDN can continue to serve cached content while your application is switching over to the secondary region.

Important

Multi-region failover on Cloud Platform supports only a single application on a single codebase. Multi-region failover does not support multisites, multiple databases, or multiple codebases on a common production server cluster, due to the difficulties associated with properly re-syncing these types of applications after a full or partial failover.

How it works

[Figure: Multi-region failover]

When you choose multi-region failover for your Cloud Platform Enterprise application, Acquia duplicates your Production environment in a different region from your primary region. For example, if your application is hosted in the US-East region, your secondary application might be created in the US-West region. The secondary hardware cluster is configured to receive the same code deployments as the primary cluster, so that it is always running the same code. In addition, multi-region failover uses database replication to keep the database servers in the primary and secondary regions in sync. Any change to the database in either region immediately syncs to the database servers in the other region.

During normal operations, a special one-way rsync process runs continuously on the primary region application. It ensures that any files added to the primary region are also sent to the servers in the secondary (failover) region.
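As a rough illustration of what a one-way file sync looks like, the following Python sketch builds an rsync command that pushes files from a primary directory to a secondary host. The host name, paths, and choice of flags are hypothetical; this page does not describe the exact command Acquia runs.

```python
import shlex
from typing import List


def build_one_way_rsync(source_dir: str, dest_host: str, dest_dir: str) -> List[str]:
    """Build an rsync command that copies files in one direction only.

    All arguments are hypothetical placeholders, not Acquia's real settings.
    """
    # A trailing slash tells rsync to copy the directory's contents
    # rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    destination = f"{dest_host}:{dest_dir.rstrip('/')}/"
    # -a is archive mode (recursive, preserves permissions and timestamps);
    # -z compresses data in transit. This sketch omits --delete, so files
    # are only added or updated on the destination, never removed.
    return ["rsync", "-az", source, destination]


command = build_one_way_rsync(
    "/var/www/example/files",
    "secondary.example.com",
    "/var/www/example/files",
)
print(shlex.join(command))
```

The sketch only assembles the command. Because the copy runs in one direction, files written directly to the secondary region are not sent back by this process.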

The combination of the synced code, databases, and files means that the failover region and primary region are functionally identical. The main difference, aside from the location of the hardware, is that each region is assigned its own distinct Elastic IP (EIP) address.

The failover process

In the event of an emergency in your application’s primary region, the Acquia multi-region failover configuration ensures that there is an alternative functional version of your live production application. An emergency in this sense is an event that leaves the primary hosting region, in part or in whole, impaired or inoperative in such a way that Acquia’s support teams cannot restore full service in the primary region immediately or within a reasonable amount of time.

The multi-region failover configuration should not be used to reduce the impact of routine maintenance or upsizes, to mitigate the impact of high-traffic events if your primary region’s hardware reaches capacity, or to attempt to work around incidents where adverse code, file, or database changes have been deployed to your Production application.

In the event of an emergency, you can begin the failover process at any time; you do not need Acquia’s assistance. If your application uses Acquia Edge CDN, you can request that Acquia Support assist with the failover process. In any case, you should notify Acquia as soon as possible, so that Acquia does not take any conflicting actions in addressing the emergency.

To initiate the failover process, configure your application’s CDN settings to point to the Elastic IP address of the secondary region, instead of the primary region. You can find the Elastic IP addresses on the Domains page of the Cloud Platform interface. After the CDN changes take effect, requests to the application will be handled by servers in the secondary region, instead of the primary region.
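The decision behind that CDN change can be sketched in a few lines of Python. This is a minimal illustration, not an Acquia or CDN API: the `Region` class, the function names, and the IP addresses (documentation-only ranges from RFC 5737) are all assumptions. The actual origin update depends on your CDN provider and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    elastic_ip: str


# Placeholder addresses; use the EIPs shown on the Domains page
# of the Cloud Platform interface.
PRIMARY = Region("primary", "192.0.2.10")
SECONDARY = Region("secondary", "198.51.100.20")


def choose_origin(primary_impaired: bool) -> Region:
    """Return the region whose EIP the CDN origin should point to."""
    return SECONDARY if primary_impaired else PRIMARY


def origin_update(current_ip: str, primary_impaired: bool) -> dict:
    """Describe the CDN origin change, if any, without applying it."""
    target = choose_origin(primary_impaired)
    return {
        "region": target.name,
        "origin_ip": target.elastic_ip,
        "change_needed": current_ip != target.elastic_ip,
    }


print(origin_update(PRIMARY.elastic_ip, primary_impaired=True))
```

Keeping the decision separate from the provider-specific update call makes it easy to review the intended change before anyone applies it.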

Since the caches in the secondary region will be empty at first, performance may be slower immediately following failover until the caches rebuild.

Cron and failover

Cron jobs in Cloud Platform are set to run on servers in the primary region, and they do not transfer to servers in the secondary region upon failover. After failover, Acquia Support can edit cron jobs to run in the secondary region instead of the primary region. For more information, see Using scheduled jobs to maintain your application.

Operating while in failover

While your application is being served from the secondary region, many common Cloud Platform workflow tasks may not function properly. The secondary region includes a clone of your Production environment, but not other environments (such as Development and Staging). Workflow tasks that are designed to facilitate communications between servers in the same region won’t work between environments in different regions. In other instances (such as full or partial region-wide failures), tasks may fail because Acquia’s code repository or task management servers in those regions are also impaired.

Important

For these reasons, while your application is operating in the secondary region, do not attempt any file or database copy, code commit, or code deployment tasks.
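Teams that automate their workflows may want a guard that refuses these tasks while the application is in failover. The Python sketch below is one way to express that rule; the task names and the `serving_region` value are hypothetical labels, not Cloud Platform identifiers.

```python
# Hypothetical labels for the tasks named in the warning above.
BLOCKED_DURING_FAILOVER = frozenset(
    {"file_copy", "database_copy", "code_commit", "code_deploy"}
)


def is_task_allowed(task: str, serving_region: str) -> bool:
    """Return False for a blocked task while the secondary region is serving.

    Tasks outside the blocked set are not evaluated by this check, so a
    True result does not guarantee the task will work during failover.
    """
    return not (serving_region == "secondary" and task in BLOCKED_DURING_FAILOVER)


for task in sorted(BLOCKED_DURING_FAILOVER):
    print(task, is_task_allowed(task, "secondary"))
```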

The failback process

After the emergency in the primary region has been resolved, you will need to restore your application to its previous configuration so that it is again served from the primary region. This process is called failback.

Before initiating failback to the primary region, notify Acquia Support to confirm the date and time of the failback. At the time of the failback, Acquia will perform one final manual sync of the application’s files between the secondary and primary regions to ensure that there are no issues or inconsistencies. Acquia will then authorize you to proceed with the CDN failback to the primary region, pointing the CDN settings to the Elastic IP address of the primary region. If your application uses Acquia Edge CDN, you can request that Acquia Support assist with the failback.
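The order of these steps matters, so a runbook or script can enforce it. The following Python sketch models the sequence described above; the step names are hypothetical labels chosen for this example.

```python
# Hypothetical step labels, in the order described above.
FAILBACK_STEPS = (
    "notify_support_and_confirm_time",
    "acquia_final_file_sync",
    "acquia_authorizes_failback",
    "point_cdn_to_primary_eip",
)


def next_failback_step(completed):
    """Return the next step, or None when failback is finished.

    Raises ValueError if the completed steps are not a prefix of
    FAILBACK_STEPS, that is, if a step was skipped or reordered.
    """
    done = list(completed)
    if done != list(FAILBACK_STEPS[: len(done)]):
        raise ValueError(f"steps out of order: {done}")
    if len(done) == len(FAILBACK_STEPS):
        return None
    return FAILBACK_STEPS[len(done)]


print(next_failback_step(["notify_support_and_confirm_time"]))
```

With this check in place, an attempt to repoint the CDN before Acquia's final file sync and authorization raises an error instead of proceeding.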

Similar to when your application first fails over to the secondary region, caches in the primary region may be stale at the time of failback, so site performance may be reduced while the caches rebuild.

Multi-sites and multi-region failover

Multi-region failover is also available to Cloud Platform Enterprise subscribers with multi-site applications. However, this functionality is not supported for Cloud Platform Professional or Acquia Site Factory applications.

In the event of a failover, subscribers with applications using multiple databases must ensure that all Production sites on that application are failed over to the secondary region to prevent any risk of data loss after the failover, or as a result of the failback process.
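One way to confirm that no site was missed is to compare each site's CDN origin with the secondary region's Elastic IP address. The Python sketch below assumes a hypothetical mapping of site domains to origin IPs that you maintain yourself; the domains and addresses (documentation-only ranges from RFC 5737) are placeholders.

```python
from typing import Dict, List


def sites_not_on_secondary(site_origins: Dict[str, str], secondary_eip: str) -> List[str]:
    """Return the sites whose CDN origin is not the secondary region's EIP."""
    return sorted(
        site for site, origin_ip in site_origins.items() if origin_ip != secondary_eip
    )


# Hypothetical domains and placeholder addresses.
origins = {
    "www.example.com": "198.51.100.20",
    "shop.example.com": "192.0.2.10",
    "blog.example.com": "198.51.100.20",
}
print(sites_not_on_secondary(origins, "198.51.100.20"))  # ['shop.example.com']
```

An empty result means every listed site points to the secondary region.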

SSL and multi-region failover

Applications configured for multi-region failover should use only the standard method for SSL certificate management. The legacy installation method is not supported for this configuration.

Using multi-region failover with other Acquia products

The following features and Acquia products are incompatible with multi-region failover configurations:

  • Shield VPC
  • Secure VPN
  • Elastic Load Balancers (ELBs) (Legacy SSL install method)
  • Resilient Edge Clusters
  • Acquia Search
  • Node.js
  • Digital Asset Manager

Further, no Acquia Marketing Cloud products support the multi-region failover functionality.

All applications requiring any of these features or services must be architected to ensure that sites can continue to serve critical content without these features in the event of a regional impairment and failover.

Testing the failover process

Acquia tests multi-region failovers during the setup process. After this functionality is in place, Acquia does not support any additional testing and will not provide assistance with failing servers over or back in events unrelated to an emergency in your primary region.