Using EC2 health checks to take down your own site

A load balancer sits in front of a pool of servers, often web servers. It receives requests and routes each one to one of the available servers so as to distribute the load among them.

To ensure that the load balancer only routes requests to healthy servers, it can use a health check. For example, a load balancer managing a pool of web servers can periodically send an HTTP request to a predefined URL on each web server. If the request succeeds, that web server instance is deemed healthy. If the request fails, the instance is deemed unhealthy and no more requests are routed to it.
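The check described above can be sketched in a few lines of Python. This is a rough illustration, not how any particular load balancer is implemented: the server names and the `fake_probe` function are hypothetical stand-ins for real instances and a real HTTP request. Real load balancers also typically wait for several consecutive failures before marking an instance unhealthy, which this sketch leaves out.

```python
from typing import Callable, List


def healthy_servers(servers: List[str], probe: Callable[[str], int]) -> List[str]:
    """Return the servers whose health check probe returned a 2xx status."""
    healthy = []
    for server in servers:
        try:
            status = probe(server)
        except Exception:
            # A connection error or timeout counts as a failed check.
            continue
        if 200 <= status < 300:
            healthy.append(server)
    return healthy


def fake_probe(server: str) -> int:
    """Hypothetical probe: pretend web-2 answers the health check with an error."""
    statuses = {"web-1": 200, "web-2": 500, "web-3": 200}
    return statuses[server]


pool = healthy_servers(["web-1", "web-2", "web-3"], fake_probe)
print(pool)  # ['web-1', 'web-3']
```

Only the servers left in `pool` would receive traffic until the next round of checks.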

Read more about how this works on AWS

Inherent in this system is a subtle way to turn a small bug into a major incident. Whether you set yourself up for downtime hinges on how you answer one question: which route on the web servers should the health check use?

The Wrong Answer

Just pick a common URL and use that: the home page of the site, for example.

Why This Is A Terrible Idea

Imagine your web servers are perfectly healthy and happily processing requests. Then one day a small bug, which has been lurking in your application code for some time, rears its ugly head and causes requests to one specific route to result in an error.

If that route happens to be the one your health check uses, the load balancer will send a request to the web server, receive an error, deem the web server unhealthy and stop routing requests to it.

Every server runs the same buggy code, so every server fails in the same way. One after another, the load balancer will send each of your web servers a health check request, the request will fail, and the web server instance will be removed from the pool of available web servers. When the last web server fails the health check, your site will go down.

You just went from having one page unavailable to having your whole site unavailable.
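To make the failure mode concrete, here is a small simulation. Everything in it is hypothetical: the server names, the routes, and the `handle_request` function standing in for an application whose home page has a bug. It compares what is left in the pool when the health check targets the home page with what is left when it targets a dedicated route.

```python
from typing import List

SERVERS = ["web-1", "web-2", "web-3"]


def handle_request(path: str) -> int:
    """Hypothetical application code, identical on every server."""
    if path == "/":
        return 500  # the lurking bug: only the home page is broken
    if path in ("/about", "/health"):
        return 200
    return 404


def servers_in_service(health_check_path: str) -> List[str]:
    """Keep only the servers that answer the health check with a 200."""
    return [s for s in SERVERS if handle_request(health_check_path) == 200]


pool_home = servers_in_service("/")            # health check on the home page
pool_dedicated = servers_in_service("/health")  # health check on its own route

print(pool_home)       # []
print(pool_dedicated)  # ['web-1', 'web-2', 'web-3']
```

With the home page as the health check route the pool is empty, so `/about` becomes unreachable too even though it works fine. With the dedicated route, only the home page is affected.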

Do not use a route that involves a bunch of application code. Even a small bug in that route can mean a big problem.

How To Do Health Checks

The route should contain the absolute minimum amount of code. Bugs in your application code should not result in requests to the health check route failing. That likely means you are going to create a specific health check route.
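As a minimal sketch of such a route, the WSGI application below answers `/health` and does nothing else. The `/health` path and the `health_app` and `call` names are assumptions made for illustration. In a real application the route would live alongside your other routes, but it would share as little code with them as possible.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def health_app(environ: Dict, start_response: Callable) -> Iterable[bytes]:
    """WSGI app with a dedicated health check route and no application logic."""
    if environ.get("PATH_INFO") == "/health":
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def call(path: str) -> Tuple[str, bytes]:
    """Invoke the app directly, without a network socket, and capture the result."""
    captured: List[str] = []

    def start_response(status: str, headers: list) -> None:
        captured.append(status)

    environ = {"PATH_INFO": path, "REQUEST_METHOD": "GET"}
    body = b"".join(health_app(environ, start_response))
    return captured[0], body


print(call("/health"))  # ('200 OK', b'ok')
```

To try it over HTTP, the app could be served with `wsgiref.simple_server.make_server` from the standard library.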

It is possible that your special health check route will verify that code can run at all and nothing more. More likely, you will also want it to confirm that the server can reach the resources it depends on, since a failure there indicates a connectivity problem. Is the database accessible? If an API is required, is it available?
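One way to structure this is to give the health check a small set of named dependency checks and report failure if any of them fails. The sketch below is an assumption about how you might organize it, not a prescribed design: `run_health_check`, the check names, and the use of a 503 status for failure are all illustrative choices.

```python
from typing import Callable, Dict, Tuple


def run_health_check(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run each dependency check; return an HTTP status and per-check results."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises (timeout, refused connection) counts as failed.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


def unreachable_api() -> bool:
    """Hypothetical check that simulates an API that cannot be reached."""
    raise ConnectionError("api unreachable")


status_ok, _ = run_health_check({"database": lambda: True, "api": lambda: True})
status_bad, details = run_health_check({"database": lambda: True, "api": unreachable_api})

print(status_ok)   # 200
print(status_bad)  # 503
print(details)     # {'database': True, 'api': False}
```

Each check should test connectivity only, so that the health check fails when the server cannot do its job and not when some unrelated application code has a bug.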

If you are going to verify the availability of an API that you control, you will probably want to give it a special health check route too. The web servers can use the API's health check route as part of their own health checks. You don’t want a bug in the API’s application-level code taking down your site either.

If your web or API servers need to verify the availability of a database as part of their health check, you can run a query like the one below. It verifies that the database is reachable without depending on any tables or data being present.

SELECT 'works!';
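Here is a sketch of how a health check might run that query from application code. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for whatever database you actually use; the `database_is_available` name is hypothetical, and with another database you would swap in its driver and its error type.

```python
import sqlite3


def database_is_available(conn: sqlite3.Connection) -> bool:
    """Return True if a trivial query succeeds; no tables or data are needed."""
    try:
        row = conn.execute("SELECT 'works!'").fetchone()
    except sqlite3.Error:
        # Any database error means the health check should fail.
        return False
    return row == ("works!",)


conn = sqlite3.connect(":memory:")  # stand-in for a real database connection
available = database_is_available(conn)
conn.close()

# Simulate a lost connection by querying after the connection is closed.
after_close = database_is_available(conn)

print(available)    # True
print(after_close)  # False
```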