Unexpected HTTP 504 errors from AWS Elastic Load Balancers (ELBs)

Tuesday, December 29, 2015

Introduction

Elastic Load Balancers (ELBs) make use of multiple concurrent connections to backend applications to improve throughput, and will also attempt to make use of HTTP keepalives to mitigate reconnection overhead. These performance gains introduce multiple points where timeouts must be configured.

A misconfigured application / ELB pair can cause an HTTP 504 status code to be returned from an ELB to the client. Additionally, the request may not register in the application’s logs, or on a packet capture against the application.

Basic Diagnostics

There are two immediate causes for receiving a 504 from an ELB:

The application actually took longer than the ELB’s connection timeout (the idle timeout, in AWS terminology) to respond. This is a slow timeout: the 504 is typically returned only after the full timeout has elapsed, which is 60 seconds by default for an ELB. In this case, it is necessary either to increase the ELB’s connection timeout or to improve application performance.
The application did not respond to the ELB at all, instead closing its connection when data was requested. This is a fast timeout: the 504 is typically returned in a matter of milliseconds, well under the ELB’s timeout setting.

A useful diagnostic step is to enable access logs on the ELB in question. An ELB exhibiting the fast timeout behavior will log a line similar to the following, with -1 values for performance metrics on failing requests:

2015-12-11T13:42:07.736195Z my-elb 10.0.0.1:59893 - -1 -1 -1 504 0 0 0 "GET http://my-elb/ HTTP/1.1" "curl/7.19.7" - -
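
As a rough sketch of how this check could be automated, the following Python splits an access log line into fields and flags the fast-timeout signature. The field labels and both function names are chosen here for illustration, and the field order is assumed to match the sample line above; adjust it if your log format differs.

```python
import shlex

# Illustrative labels for the fields of the sample log line, in order.
# These names are assumptions made for this sketch, not an official API.
FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol",
]

TIMING_FIELDS = FIELDS[4:7]


def parse_elb_log_line(line):
    """Split one access log line into a dict keyed by FIELDS."""
    # shlex keeps the double-quoted request and user agent as single tokens.
    # A user agent containing a stray quote character would need extra care.
    return dict(zip(FIELDS, shlex.split(line)))


def is_fast_504(entry):
    """True when the ELB returned a 504 and logged -1 for all three timings."""
    return (
        entry.get("elb_status_code") == "504"
        and all(entry.get(name) == "-1" for name in TIMING_FIELDS)
    )
```

Feeding each line of a downloaded log file through parse_elb_log_line and counting the entries for which is_fast_504 returns True gives a quick measure of how often the fast timeout occurs.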

Understanding Idle Timeouts

It’s fairly simple to determine the application’s default idle timeout: telnet to its listening port, and do not send any data. The time that elapses before the server closes the connection is roughly the idle timeout. The example below shows a default timeout of 60 seconds (plus about one second of overhead) on an Apache web server:

[root@app01 ~]$ time telnet localhost 8080
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Connection closed by foreign host.

real    1m1.044s
user    0m0.003s
sys     0m0.002s

In this case, adjusting the Timeout configuration parameter in httpd.conf will change the timeout:

#
# Timeout: The number of seconds before receives and sends time out.
#
Timeout 60

Because this timeout is set to 60 seconds, the ELB’s timeout needs to be set to less than 60 seconds.
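
The telnet measurement above can also be scripted, which helps when checking many backends. This is a minimal sketch using only the Python standard library; the function name measure_idle_timeout and the max_wait parameter are hypothetical names introduced for this example.

```python
import socket
import time


def measure_idle_timeout(host, port, max_wait=120.0):
    """Connect, send nothing, and return the seconds until the peer closes.

    Returns None if the peer has not closed the connection within max_wait
    seconds of silence.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=max_wait) as sock:
        try:
            # recv() returns b"" once the server closes its side.
            while sock.recv(4096):
                pass
        except socket.timeout:
            return None
        except ConnectionResetError:
            # Some servers reset rather than close cleanly; treat as closed.
            pass
    return time.monotonic() - start
```

Calling measure_idle_timeout("localhost", 8080) against the Apache instance from the example should return a value close to the telnet result, assuming the server behaves as shown in that transcript.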

Understanding Keepalive Timeouts

A more confusing scenario occurs when an ELB continues to exhibit the above 504 behavior even when the connection timeout appears to be set appropriately. This is likely due to Apache’s idle and keepalive timeouts being set differently: a connection which has had no data written to it will time out after a different period of time than a connection waiting for more data to be written to it.

This can again be tested with telnet, by faking an HTTP session with the Connection: Keep-Alive header set:

[root@app01 ~]$ time telnet localhost 8080
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.0
Connection: Keep-Alive

HTTP/1.1 404 Not Found
Date: Tue, 29 Dec 2015 17:16:36 GMT
Server: Apache
Content-Length: 257
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL / was not found on this server.</p>
<hr> <address>Apache Server at 127.0.0.1 Port 80</address>
</body></html>
Connection closed by foreign host.

real    0m5.377s
user    0m0.000s
sys     0m0.007s

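This time around, the connection times out in just five seconds. Given an Apache idle timeout and ELB timeout of 60 seconds, this leaves a window of up to 55 seconds on each kept-alive connection during which the ELB still considers the connection usable although Apache has already closed it; a request sent in that window produces a fast 504. Adjusting the KeepAliveTimeout configuration parameter in httpd.conf will change this: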

#
# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
#
KeepAliveTimeout 5
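
The keepalive timeout is also advertised in the Keep-Alive response header seen in the telnet session, so it can be read without waiting for the connection to close. The sketch below parses that header value and computes the 55-second window from the example; parse_keep_alive and stale_window are hypothetical helper names, and the 60-second ELB timeout is taken from the example above.

```python
def parse_keep_alive(value):
    """Parse a Keep-Alive header value such as 'timeout=5, max=100'."""
    params = {}
    for part in value.split(","):
        name, sep, raw = part.strip().partition("=")
        raw = raw.strip()
        # Keep only well-formed name=integer pairs.
        if sep and raw.isdigit():
            params[name.strip().lower()] = int(raw)
    return params


def stale_window(elb_timeout, keepalive_timeout):
    """Seconds during which the ELB may reuse a connection the backend closed."""
    return max(0, elb_timeout - keepalive_timeout)


header = parse_keep_alive("timeout=5, max=100")
window = stale_window(60, header["timeout"])
```

A window of zero means the ELB gives up on an idle connection no later than the backend does.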

Conclusion

In short, an ELB’s connection timeout must be set lower than both the application’s idle and keepalive timeouts to prevent spurious 504s from being generated.
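
This rule is simple enough to encode as a check, for instance in a deployment test. The following is a sketch only; the function name and its parameters are hypothetical, and all values are in seconds.

```python
def find_timeout_problems(elb_timeout, app_idle_timeout, app_keepalive_timeout):
    """Return a list of messages describing violations of the rule above."""
    problems = []
    if elb_timeout >= app_idle_timeout:
        problems.append(
            f"ELB timeout ({elb_timeout}s) is not lower than the "
            f"application idle timeout ({app_idle_timeout}s)"
        )
    if elb_timeout >= app_keepalive_timeout:
        problems.append(
            f"ELB timeout ({elb_timeout}s) is not lower than the "
            f"application keepalive timeout ({app_keepalive_timeout}s)"
        )
    return problems
```

With the values from the examples in this post (an ELB timeout of 60, an Apache Timeout of 60 and a KeepAliveTimeout of 5), both checks fail; an empty list means the configuration satisfies the rule.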

In the above example, it’s tempting to simply set the ELB’s connection timeout to four seconds. This has the unintended side effect of negating some of the performance gains achieved by keepalives, because connections are now kept open for less time than they otherwise could be. With proper monitoring, it’s straightforward to determine an application’s top-percentile response times and tailor timeouts to them, ensuring that both keepalives and timeouts are used effectively. The community Chef cookbook for Apache exposes these as two parameters, making them trivial to set on a per-application basis:

default['apache']['timeout'] = 60
default['apache']['keepalivetimeout'] = 15
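
One way to obtain a top-percentile response time from collected measurements is the nearest-rank method, sketched below. The percentile function is a hypothetical helper written for this example, and the samples are whatever response times (in seconds) your monitoring provides; how much headroom to add on top of the result is a judgment call not covered here.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of samples; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer form of ceil(pct * n / 100), avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

For example, percentile(response_times, 99) returns the value that 99 percent of the recorded requests did not exceed, which can serve as a starting point for the timeout values.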

Also keep in mind that these timeouts are an expression of an application’s non-functional requirements. Should the application enforce a fifteen-second SLA, these timeouts should be set to fifteen seconds. Likewise, should the requirements change (say to five seconds), the implementation of the timeouts should change to five seconds as well.