Cloud Platform

Stack Metrics - Load balancing

Your load balancers are responsible for handling requests from visitors to your application or from your content delivery network (CDN). This infrastructure uses Nginx to route HTTP requests to Varnish®, which determines whether it has a cached version of your application’s content. Any requests that can’t be served from the Varnish cache are routed round-robin to infrastructure in your web tier.
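
To see whether a particular request was served by Varnish, you can inspect the response headers. The following sketch is illustrative only: the URL is a placeholder, and the header names (Via, X-Varnish, X-Cache, Age) are common Varnish conventions rather than guaranteed parts of your configuration.

```python
# Hypothetical probe: fetch a URL and print the response headers that
# commonly reveal whether Varnish served the response from cache.
# Header names (Via, X-Varnish, X-Cache, Age) follow common Varnish
# conventions and may differ on your configuration.
import urllib.request

URL = "https://www.example.com/"  # placeholder; use one of your own pages

with urllib.request.urlopen(URL) as response:
    headers = response.headers
    for name in ("Via", "X-Varnish", "X-Cache", "Age"):
        print(f"{name}: {headers.get(name, '(not present)')}")
```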

Note

For customers with shared load balancer infrastructure, the displayed metrics reflect all activity on the load balancers, not only the traffic for your applications. As a result, when troubleshooting issues, graph data should be used only for context.

Load balancer metrics

The following graphs are available:

  • Requests count: Total number of requests made to Nginx and Varnish.

  • HTTP responses: HTTP response codes, grouped by status range (2xx, 3xx, 4xx, 5xx).

  • Varnish cache hit rate: Varnish cache hit rate as a percentage of total requests.

  • CPU and memory usage: CPU usage and memory usage for each of the environment’s load balancers, as a percentage of the total available.

Troubleshooting

Use the information in this section to help you troubleshoot load balancer-related issues with your Cloud Platform application.

Requests count

Depending on an application’s HTTPS traffic and Varnish configuration, requests can hit Varnish, Nginx, or both. Although HTTPS requests pass through Nginx before reaching Varnish, HTTP requests bypass Nginx entirely.

Request data gives a basic sense of traffic patterns; for more exact metrics about traffic to your websites, we recommend a third-party analytics service. The data is useful for identifying spikes in traffic (which can be associated with promotional events or an attack) and sudden drops in traffic (which can be associated with issues in your network, such as DNS, the CDN, or the load balancers).
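
To spot such spikes or drops in your own data, you can count requests per minute in a downloaded access log. The sketch below is a minimal example that assumes the common/combined log format and a local file name; adjust both for your environment.

```python
# Minimal sketch: count requests per minute in an access log to spot
# spikes or sudden drops in traffic. Assumes the common/combined log
# format with timestamps like [10/Oct/2023:13:55:36 +0000]; adjust the
# regular expression for your actual log format.
import re
from collections import Counter

LOG_PATH = "access.log"  # placeholder path to a downloaded log file
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2})")  # truncated to the minute

per_minute = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = TIMESTAMP.search(line)
        if match:
            per_minute[match.group(1)] += 1

for minute, count in sorted(per_minute.items()):
    print(f"{minute}  {count} requests")
```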

For more information about Varnish, see Using Varnish.

HTTP response codes

The following response code ranges indicate the success or failure of content requests to the load balancer tier of your stack:

  • 2xx Success responses indicate successful content requests.

  • 3xx Redirect responses indicate redirected traffic, which will then return a different response code, potentially including another 3xx redirect.

  • 4xx Client Error responses indicate client (browser) side errors, such as page not found.

  • 5xx Server Error responses indicate a server-side error caused by a failure in the application or its infrastructure.

As a general rule, an optimized website will have few 3xx or 4xx responses, and should have no 5xx response codes.
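
To see where your traffic falls across these ranges, you can tally status codes from a downloaded access log. The following minimal sketch assumes the common/combined log format and a local file name, both of which are illustrative.

```python
# Minimal sketch: group response codes from an access log into
# 2xx/3xx/4xx/5xx classes. Assumes the common/combined log format,
# where the status code is the first field after the quoted request.
import re
from collections import Counter

LOG_PATH = "access.log"  # placeholder path to a downloaded log file
STATUS = re.compile(r'" (\d{3}) ')  # status code follows the quoted request

classes = Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = STATUS.search(line)
        if match:
            classes[match.group(1)[0] + "xx"] += 1

total = sum(classes.values()) or 1
for status_class in sorted(classes):
    share = 100 * classes[status_class] / total
    print(f"{status_class}: {classes[status_class]} ({share:.1f}%)")
```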

  • Elevated 3xx response codes: Analyze your logs to determine what is causing the 3xx responses. The most common sources are 301 and 302 redirects. Redirects with a 301 response code use fewer resources because Varnish can cache them; however, Varnish can’t cache 302 redirects. For more information about the items that Varnish will not cache, see Varnish headers.

  • Elevated 4xx response codes: These codes either mean that content has been removed and is no longer available, or that someone is deliberately attempting to load content that does not exist. Examine your log files to determine the source (as in the sketch after this list), and then either use a Drupal module to reduce the impact of these messages, or configure your CDN to block the source of the requests, if malicious.

  • Elevated 5xx response codes: Review Stack Metrics to determine if the reported information corresponds to any tiers running out of CPU or memory, or sudden spikes in infrastructure error counts. If Cloud Platform is the source of the issue, Acquia Support will open a ticket on your behalf and contact you with more information. Otherwise, you should assume a non-platform source for the problem, and determine what change might have triggered the errors. These errors are often due to changes in a website’s code, content, or configuration. You can also use an Application Performance Monitoring service (such as New Relic) to attempt to determine which specific layer (PHP, Memcache, database, or external calls) may be impacting the health of your application. For more information, see HTTP 5xx status codes on Cloud Platform.
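
For the elevated 3xx and 4xx cases, the first step is identifying which URLs and clients generate the responses. The following sketch is one way to do that for 4xx responses from a downloaded access log; the common/combined log format and the local file path are assumptions, so adjust them for your environment.

```python
# Minimal sketch: list the paths and client IPs that generate the most
# 4xx responses, to help locate missing content or abusive clients.
# Assumes the common/combined log format; adjust the pattern as needed.
import re
from collections import Counter

LOG_PATH = "access.log"  # placeholder path to a downloaded log file
LINE = re.compile(r'^(\S+) .*?"(?:GET|POST|HEAD) (\S+)[^"]*" (4\d{2}) ')

paths, clients = Counter(), Counter()
with open(LOG_PATH) as log:
    for line in log:
        match = LINE.search(line)
        if match:
            ip, path, _status = match.groups()
            clients[ip] += 1
            paths[path] += 1

print("Top 4xx paths:", paths.most_common(5))
print("Top 4xx clients:", clients.most_common(5))
```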

For applications with load balancers shared with other applications, it is important to note that the response codes are recorded at the infrastructure level. Due to this behavior, other applications using the same load balancers may be the source of elevated response codes.

Varnish cache hit rate

This number indicates what percentage of requests to your load balancers are being served from Varnish. Depending on the needs of your websites, this number may range from a small value (for websites with mostly authenticated traffic or no caching) to almost 100% (for websites with no authenticated traffic and long durations set for their external caching values). Since HTTPS requests pass through Nginx before reaching Varnish, applications with a high percentage of HTTPS requests will see a Varnish cache rate that is lower than the request rate.

A resilient application will have a cache hit rate of 80% or greater, while a rate greater than 95% is considered to be exceptionally resilient to spikes in traffic. It is normal for an application’s cache hit rate to fluctuate throughout the day. During off-peak hours, applications will generally serve fresh content as caches from peak hours begin to expire, and requests for popular pages happen less frequently. To reduce the amount of traffic hitting your other infrastructure, attempt to increase your website’s cache hit rate to the highest possible value.
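
As a rough spot check of how cacheable your key pages are, you can fetch each page twice and look for cache indicators on the second response. The sketch below uses placeholder URLs and relies on the Age header and the X-Cache header; X-Cache is a common Varnish convention rather than a guaranteed part of every configuration, so treat this as an approximation rather than the same measurement shown in the graph.

```python
# Illustrative sketch: fetch a list of key pages twice and report how
# many appear to be served from cache on the second fetch. The Age
# header and the X-Cache: HIT convention are common cache indicators,
# but they are assumptions here, not guaranteed for every setup.
import urllib.request

URLS = [  # placeholders; substitute your own high-traffic pages
    "https://www.example.com/",
    "https://www.example.com/about",
]

def looks_cached(url):
    with urllib.request.urlopen(url) as response:
        x_cache = (response.headers.get("X-Cache") or "").upper()
        age = int(response.headers.get("Age") or 0)
        return "HIT" in x_cache or age > 0

cached = 0
for url in URLS:
    looks_cached(url)        # first fetch warms the cache
    if looks_cached(url):    # second fetch should be a hit if cacheable
        cached += 1

print(f"{cached}/{len(URLS)} pages appear to be served from cache on refetch")
```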

Varnish cache hit rates less than 80%

Customers with Cloud Platform Plus, Cloud Platform Premium, and Cloud Platform Elite subscriptions can purchase the Cloud Platform Performance Boost add-on for applications with a Varnish cache hit rate below 80%. For more information, contact your Account Manager.

CPU and memory usage

Load balancers are generally resilient and able to serve hundreds or thousands of requests per minute without any measurable increase in CPU or memory usage. However, spikes in traffic, sustained high traffic, or large cached media files can all result in load balancers running out of CPU or memory. Although high CPU usage usually has only a minor impact on website performance, running out of memory can lead to infrastructure impairment and should be avoided. Acquia monitors for impaired infrastructure, but infrastructure can continue serving traffic indefinitely even when CPU or memory is at or near capacity. If you notice that your load balancers routinely reach CPU or memory capacity, create a Support ticket to determine what options are available to increase your load balancer capacity.