Let’s Encrypt for load-balanced services

The Let’s Encrypt certificate authority was launched for production use in April 2016 and aims to make encrypted connections on the Internet ubiquitous. Historically certificate authorities have demanded payment for issuing X.509 certificates and the issuing/renewal process was largely manual.

Let’s Encrypt issues domain-validated certificates free of charge. Other validation types requiring human interaction, such as Organization Validation or Extended Validation (most browsers show part of the address bar in green for these), are not available. The validation process is fully automated and, once configured, does not need manual intervention for certificate renewal. Let’s Encrypt issues certificates valid for 90 days and most client programs renew certificates well before they expire (usually after 60 days, i.e. 30 days before expiry).
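To make the renewal timing concrete, here is a minimal sketch of the arithmetic. The 90-day lifetime and the 60-day renewal point are the figures given above; the issue date and the function name renewal_window are arbitrary, chosen only for illustration.

```python
from datetime import date, timedelta

VALIDITY = timedelta(days=90)     # certificate lifetime stated in the text
RENEW_AFTER = timedelta(days=60)  # typical client renewal point stated in the text


def renewal_window(issued: date) -> tuple[date, date]:
    """Return (renew_on, expires_on) for a certificate issued on `issued`."""
    return issued + RENEW_AFTER, issued + VALIDITY


# Arbitrary example date, not taken from any real certificate.
renew_on, expires_on = renewal_window(date(2016, 4, 12))
print("renew on:", renew_on)
print("expires on:", expires_on)
print("safety margin in days:", (expires_on - renew_on).days)
```

The margin between the two dates is what gives a failed renewal time to be noticed and retried before users see an expired certificate.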

In this post we describe how to configure load-balanced services such that they can make use of the Let’s Encrypt certificate authority. The assumption is that the reader is familiar with how Let’s Encrypt works. acmetool is used as the Let’s Encrypt client.

Implementation overview

We’re going to implement Let’s Encrypt certificates for a system with three machines acting as a load-balancing frontend for a service backend. The service’s entry point is the domain name svc.example.com and the servers use the IPv6 address range 2001:db8::/32. IPv4 is left as an exercise for the reader. The frontend machines run HAProxy for load-balancing and keepalived to host virtual IP addresses. nginx is used to serve challenge responses for domain validations and to redirect plain-text requests to HTTPS (the service is only available over TLS).

The network looks as follows:

IPv6             Description
2001:db8::1      Virtual IP
2001:db8::2      Virtual IP
2001:db8::3      Virtual IP
2001:db8::f001   Frontend LB0
2001:db8::f002   Frontend LB1
2001:db8::f003   Frontend LB2
2001:db8::b001   Backend 1
2001:db8::b020   Backend 20

There are as many virtual IP addresses as there are load balancers. This way traffic is spread across all load balancers in normal operation, not only during an outage. The DNS record for the service looks as follows:

svc.example.com.  3600  IN  AAAA  2001:db8::1
svc.example.com.  3600  IN  AAAA  2001:db8::2
svc.example.com.  3600  IN  AAAA  2001:db8::3

The main challenge in such a setup is that the service can’t know in advance which load balancer the Let’s Encrypt validation process will connect to when validating the domain (svc.example.com). Not only is there round-robin DNS, but individual load balancers may suffer an outage or be under maintenance. Hugo Landau’s deployment guide for IRC networks was a significant inspiration during the development of this configuration. It proposes a centralized service to which the individual machines submit their challenge responses and to which they redirect incoming validation requests. In a setup where all machines are controlled by one party, such a central service can easily be avoided, along with any cross-machine state. In our configuration the load balancers are stateless with regard to the Let’s Encrypt validation. Every load balancer has its own key/certificate combination and is responsible for its renewal.

Example certificate renewal

[Figure: Network diagram for load-balancing setup with Let's Encrypt]

  1. LB1 determines that it needs to request a new certificate and sends a request to the Let’s Encrypt certificate authority.
  2. As part of the ACME conversation the CA responds with a randomly generated token and a challenge response that the server must make available over HTTP at the path /.well-known/acme-challenge/TOKEN. acmetool writes the response to /var/run/acme/acme-challenge/TOKEN and conducts a self-test.
  3. LB1 lets the CA know that the challenge response is available.
  4. CA resolves the DNS record svc.example.com and finds the IP addresses 2001:db8::1, 2001:db8::2 and 2001:db8::3 in random order. It connects to the first address. The address, say 2001:db8::2, is currently hosted on LB0. The request GET /.well-known/acme-challenge/TOKEN is sent.
  5. LB0 receives the request and finds that it must act as a reverse proxy with 2001:db8::f001, 2001:db8::f002 and 2001:db8::f003 being the upstream servers. In this example it first connects to itself, to LB0, and issues a request for /.well-known/acme-challenge/TOKEN.
  6. The token is not found in /var/run/acme/acme-challenge/ on LB0 and an HTTP/1.1 404 error is returned. The next upstream server must be tried.
  7. LB1 is now sent the same request and finds the response in /var/run/acme/acme-challenge/TOKEN.
  8. The challenge response is returned from LB1 to LB0 with a status code of HTTP/1.1 200, indicating success.
  9. LB0 forwards the response to the Let’s Encrypt CA. The CA verifies the response and, if successful, proceeds to issue a certificate.
  10. The certificate is made available for download. acmetool retrieves the certificate, stores it permanently in its state directory and invokes the relevant hooks. Services are reloaded to make use of the newly issued certificate.
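
Steps 5 to 8 can be modelled in a few lines. This is only a sketch of the retry logic, not of Nginx itself: the in-memory dictionary stands in for the challenge directories on the three balancers, and the names CHALLENGE_FILES, serve_local and proxy_challenge as well as the response body are made up for illustration.

```python
# Hypothetical stand-in for /var/run/acme/acme-challenge/ on each balancer.
CHALLENGE_FILES = {
    "2001:db8::f001": {},                            # LB0: holds no token
    "2001:db8::f002": {"TOKEN": "TOKEN.response"},   # LB1: requested the certificate
    "2001:db8::f003": {},                            # LB2: holds no token
}


def serve_local(balancer: str, token: str) -> tuple[int, str]:
    """Answer like a balancer's local challenge host: 200 with body, or 404."""
    body = CHALLENGE_FILES[balancer].get(token)
    return (200, body) if body is not None else (404, "")


def proxy_challenge(upstreams: list[str], token: str) -> tuple[int, str, list[str]]:
    """Try each upstream in order and stop at the first 200.

    Returns (status, body, upstreams tried). If no upstream has the token,
    the last failure is returned (404 for an empty upstream list).
    """
    tried: list[str] = []
    status, body = 404, ""
    for upstream in upstreams:
        tried.append(upstream)
        status, body = serve_local(upstream, token)
        if status == 200:
            break
    return status, body, tried


status, body, tried = proxy_challenge(list(CHALLENGE_FILES), "TOKEN")
print(status, body, tried)
```

In this model the balancer that receives the CA's request does not need to know which machine started the renewal. It simply asks every balancer, itself included, until one of them has the file.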

Keepalived

In case of an outage or maintenance the remaining load balancer or balancers take over the IP addresses of the balancer not in service, ensuring users are not affected. To this effect the priorities for each IP address are shifted by one per node. The state is set to BACKUP for all IP addresses as there’s no concept of a real primary machine; they’re all equal. Additionally, setting the state to BACKUP instead of MASTER prevents an IP address conflict when a machine comes online (another machine with a lower priority is still serving the IP address until synchronization takes place).

Keepalived configuration on LB0:

vrrp_instance master1 {
  interface eth0
  state BACKUP
  # LB0: 101, LB1: 102, LB2: 103
  priority 101
  authentication { … }
  virtual_ipaddress {
    2001:db8::1/32 dev eth0
  }
}

vrrp_instance master2 {
  interface eth0
  state BACKUP
  # LB0: 102, LB1: 103, LB2: 101
  priority 102
  authentication { … }
  virtual_ipaddress {
    2001:db8::2/32 dev eth0
  }
}

vrrp_instance master3 {
  interface eth0
  state BACKUP
  # LB0: 103, LB1: 101, LB2: 102
  priority 103
  authentication { … }
  virtual_ipaddress {
    2001:db8::3/32 dev eth0
  }
}
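
The shifted priorities in the comments above follow a simple pattern. The sketch below reproduces them and shows which node would hold which virtual IP. It assumes that VRRP elects the live node with the highest priority for each instance. The names priority and holders are ours and not part of keepalived.

```python
NODES = ["LB0", "LB1", "LB2"]
VIPS = ["2001:db8::1", "2001:db8::2", "2001:db8::3"]
BASE_PRIORITY = 101  # lowest priority used in the configuration above


def priority(vip_index: int, node_index: int) -> int:
    """Priority of a node for a virtual IP, shifted by one per node."""
    return BASE_PRIORITY + (vip_index + node_index) % len(NODES)


def holders(alive: set[str]) -> dict[str, str]:
    """Map each virtual IP to the live node with the highest priority.

    Assumption: the highest-priority live node wins the election.
    Raises ValueError if no node is alive.
    """
    result = {}
    for v, vip in enumerate(VIPS):
        candidates = [
            (priority(v, n), node) for n, node in enumerate(NODES) if node in alive
        ]
        result[vip] = max(candidates)[1]
    return result


print(holders({"LB0", "LB1", "LB2"}))  # normal operation
print(holders({"LB0", "LB1"}))         # LB2 out of service
```

Under that assumption every node holds exactly one virtual IP while all three are up, and the address of a failed node moves to one of the remaining nodes.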

HAProxy

HAProxy is configured to only provide HTTP over TLS (HTTPS) because plain-text HTTP is served by Nginx. Communication with the backend uses TLS too, but depending on the network layout and/or service requirements plain-text HTTP may be acceptable. See the HAProxy documentation for available options.

frontend fe-service
  bind 2001:db8::1:443 ssl crt /etc/ssl/… ciphers …
  bind 2001:db8::2:443 ssl crt /etc/ssl/… ciphers …
  bind 2001:db8::3:443 ssl crt /etc/ssl/… ciphers …
  default_backend be-service
  mode http
  option httplog
  option forwardfor

backend be-service
  balance roundrobin
  cookie SERVERID insert indirect nocache
  option httpchk GET /status HTTP/1.0
  option httplog
  server be1 2001:db8::b001:443 check ssl verify ca-file /…/ca.pem cookie be1
  …
  server be20 2001:db8::b020:443 check ssl verify ca-file /…/ca.pem cookie be20

Nginx

acmetool is configured to use the webroot mode and to store challenge responses in /var/run/acme/acme-challenge/ (the default path). Define a separate virtual host in Nginx to serve challenge responses on a dedicated port. This port serves only information also available via port 80, hence no further protection is needed.

server {
  listen *:31080;
  server_name svc.example.com;

  location /.well-known/acme-challenge/ {
    alias /var/run/acme/acme-challenge/;
    default_type text/plain;
  }

  location / {
    deny all;
    root /usr/share/nginx/html/nonexist;
    autoindex off;
  }
}

Configure an upstream group with all load balancers (including the local machine):

upstream acme-backend {
  server [2001:db8::f001]:31080 fail_timeout=5s;
  server [2001:db8::f002]:31080 fail_timeout=5s;
  server [2001:db8::f003]:31080 fail_timeout=5s;
}

The main virtual host on port 80 redirects all requests to HTTP over TLS except those for ACME challenge responses (/.well-known/acme-challenge/…). ACME challenge responses are handled by a reverse proxy configuration that attempts to fetch the file from each load balancer in turn until one produces an HTTP 200 response or all upstream servers (the balancers) have been tried. Connection attempts to unreachable upstream servers time out after 5 seconds.

server {
  listen *:80;
  server_name svc.example.com;

  location /.well-known/acme-challenge/ {
    proxy_next_upstream error timeout invalid_header http_500 http_502
      http_503 http_504 http_403 http_404;
    proxy_pass            http://acme-backend;
    proxy_read_timeout    5s;
    proxy_connect_timeout 5s;
    proxy_redirect        off;
    proxy_set_header      Host $server_name;
  }

  location / {
    return 301 https://$host$request_uri;
  }
}

Test requests

Once everything is up and running, a file placed in /var/run/acme/acme-challenge/ on any one load balancer can be requested through every balancer.

To test this a text file is written to /var/run/acme/acme-challenge/:

root@lb0 $ echo Hello World | tee /var/run/acme/acme-challenge/lb0
Hello World

lb0 is a unique, per-host name used for the test; it has no special meaning. Retrieve the content via another machine, e.g. LB2. LB2 internally retries the request on all load balancers, but only the upstream request to LB0 returns a successful status code and a response body.

user@local $ curl 'http://[2001:db8::f003]/.well-known/acme-challenge/lb0'
Hello World

A file not existing on any of the machines will produce an error:

user@local $ curl 'http://[2001:db8::f003]/.well-known/acme-challenge/foobar'
…
<head><title>404 Not Found</title></head>
…

Let’s Encrypt will now be able to verify the service domain, svc.example.com. When one of the load balancers needs a new certificate a request is issued to Let’s Encrypt and the validation takes place via any of the balancers and the internal reverse proxy configuration.
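
The manual curl checks above can be repeated against every frontend with a small script. This is a sketch using only the Python standard library; the helper names challenge_url, http_status and check_frontends are hypothetical, and the fetch parameter exists only so the logic can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Frontend addresses from the network table above.
FRONTENDS = ["2001:db8::f001", "2001:db8::f002", "2001:db8::f003"]


def challenge_url(address: str, name: str) -> str:
    """Build the plain-HTTP challenge URL; IPv6 literals need brackets."""
    return f"http://[{address}]/.well-known/acme-challenge/{name}"


def http_status(url: str, timeout: float = 20.0) -> int:
    """Return the HTTP status code, or 0 if the host could not be reached."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:   # 4xx/5xx answers still carry a status code
        return err.code
    except OSError:            # connection refused, unreachable, timed out
        return 0


def check_frontends(name: str, fetch=http_status) -> dict[str, int]:
    """Request the same challenge file through every frontend."""
    return {address: fetch(challenge_url(address, name)) for address in FRONTENDS}
```

With the test file from above in place on LB0, calling check_frontends("lb0") should report status 200 for all three frontends if the reverse proxy configuration works as intended, while a name that exists nowhere should yield 404 everywhere.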

acmetool

Configuring acmetool is outside the scope of this document. However, a couple of hints:

  • Use the haproxy hook included with acmetool to install newly received certificates. Configure HAProxy to use /var/lib/acme/live/$domain/haproxy for key and certificate.
  • In scenarios where plain-text HTTP and HTTP over TLS are handled by the same service (e.g. Apache or Nginx) there is a chicken-or-egg problem where the service can’t be started due to the lack of a certificate. We solved this by first inserting a self-signed certificate in order to start the service. Once a signed certificate is received from Let’s Encrypt the service is reloaded.

We have engineered a Puppet module for acmetool and intend to release it at a later point in time.

VSHN AG

VSHN AG offers system engineering and configuration management services from their office in Zurich, Switzerland.