Using Let's Encrypt TLS certificates on the UniFi Cloud Key

Warning: Since writing this post, UniFi has released a new version of the controller and cloud key software referred to as UniFi OS. The instructions below are for versions of the controller before this new version. For a short update, see this post.

I’ve been running UniFi equipment at home for over a year, now, and it’s generally been great. The hardware performs splendidly, the network is stable, and I can indulge my desire to tinker, creating segregated VLANs, setting up an LTE WAN fallback, etc. However, through all that time, there’s been a nagging annoyance that I’ve finally gotten around to solving.

I originally hosted the UniFi Network Controller — hereafter just “controller” — on my Raspberry Pi 3 B, but later bought a Cloud Key Gen2 Plus because maintaining the installation on the Pi was something of a chore, and I wanted to use UniFi Protect, Ubiquiti’s security camera offering. On both devices, though, I noticed that while the controller was accessed via HTTPS, the TLS certificate it used was a hard-coded self-signed cert generated by Ubiquiti. My browser was rightfully spooked, dutifully warning me that my “connection was not private.”

A screenshot of an error page in Brave browser which reads "Your
connection is not private. Attackers might be trying to steal your information
from 192.168.1.3 (for example, passwords, messages, or credit cards)." There is
also a "Learn more" link, and the error identifier
"NET::ERR_CERT_COMMON_NAME_INVALID." Two buttons at the bottom read
"Advanced" and "Back to safety."

Though my browser, Brave, caches my bypass of this warning for a little while, it still eventually returns. The cached “yes I know what I’m doing” response can also expire while I’m using the controller UI, which results in the confusing effect of the site’s AJAX calls silently failing, so the controller appears to lock up until I reload and once again reassure the browser all is well.

Of course I am far from the first person to notice this issue:

There are many more. They range in age, with some as early as 2015, and approach, using all manner of certificates, with manual, automated or “sort of” automated installation. I looked at this situation and decided it was time to try my hand at it, in part because none of the existing solutions seemed to satisfy my goals. The ideal setup would:

Renew automatically
Be low or zero cost
Have easy, idempotent setup
Require minimal modifications to the Cloud Key
Not require exposing the Cloud Key to the internet

With these objectives in mind I set out to build something to suit. While I used a number of resources working on this, I should give special credit to this post by Gerd Naschenweng, which describes an end result fairly similar to what I did here.

Planning for automatic renewal with ACME v2

Let’s Encrypt is by now well known. It’s a non-profit certificate authority (CA) which issues free TLS certificates with short lifetimes, and provides first-class support for renewal automation. Certificate issuance is handled via the IETF-designed ACME protocol which defines mechanisms for CAs to automatically verify ownership of a domain. Let’s Encrypt lists a ton of ACME clients, and I was certain I could find one that suited my needs.

After a little poking around, I found that while Certbot is the recommended ACME client, the Python environment on the Cloud Key isn’t great, and the default apt source lists pointed to an old version of Certbot. I could have installed newer Python versions and added the necessary apt sources, but I decided to give acme.sh, an ACME client written entirely in Bash, a try instead. It’s both beautiful and terrible to behold its source, 7,454 lines of Bash script as of this writing. As I worked with it, I found its documentation lacking — I spent no small amount of time reading its source — but it definitely got the job done.

Installation

NB: The inline code samples below may be out of date compared with the source. If you’re going to use some of this code, go to the source repository.

That choice made, I started on the idempotent setup script. First, some sanity checks to ensure I couldn’t do something too foolish.

set -eux

if [[ $SHELL != '/bin/bash' ]]; then
    echo "wrong shell, expected /bin/bash, got $SHELL"
    exit 1
fi

# Do a lazy check to try to prevent accidentally runs on a non-cloud key device
if ! [[ -d /usr/lib/unifi ]]; then
    echo 'this script should only be run on the cloud key'
    exit 1
fi

Next, installing acme.sh. The README recommends a pipe-to-shell “online” installation method, but it wouldn’t allow me to set the flags I wanted, and I’m no fan of piping source from the internet directly into my shell anyways. The next recommendation involves cloning the repo, but the Cloud Key doesn’t have git, and I would prefer to avoid installing it.

Instead, I ended up downloading a gzipped tarball from GitHub, effectively a poor man’s clone. With trap, it’s easy to ensure that your temporary working directory is cleaned up.

# Subshell to automatically pop cd and rm working dir
(
# Make a working dir
tmpdir=$(mktemp -d)
trap "rm -rf $tmpdir" EXIT
cd $tmpdir

# Get acme.sh
curl --silent \
    --location \
    'https://github.com/acmesh-official/acme.sh/archive/master.tar.gz' | \
    tar -xz

cd acme.sh-master

# Run installer
./acme.sh \
    --debug 3 \
    --install \
    --nocron \
    --noprofile \
    --auto-upgrade
)

I didn’t need to run acme.sh interactively, nor did I want its auto-generated cron entry, hence the corresponding flags. After this ran successfully, it populated ~/.acme.sh/ with a handful of files and directories, including the script itself at ~/.acme.sh/acme.sh.

Configuration

acme.sh, like most ACME clients, accepts hooks which allow you to do things before or after certs are renewed. The easiest way to implement these hooks is more shell scripts, so the next thing my setup script does is write those hook scripts to disk. First is the prehook file:

# Create the prehook file
cat <<'EOF' >~/.acme.sh/prehook.bash
#!/bin/bash

set -eux

mkdir -p ~/.acme.sh/backups
cd ~/.acme.sh/backups

tar -zcvf ~/.acme.sh/backups/tls-backup-$(date --iso-8601=seconds).tgz /etc/ssl/private/*
EOF

The prehook script runs before every attempt to renew a certificate. In this case, it zips up all the current TLS secrets and dumps them into a backup directory inside ~/.acme.sh. In case something goes catastrophically wrong, it means I at least have something to manually restore.

Next is the reload hook.

# Create the reload file
cat <<'EOF' >~/.acme.sh/reload.bash
#!/bin/bash

set -eux

cd /etc/ssl/private

(
trap 'rm -f /etc/ssl/private/cloudkey.p12' EXIT

CERT_PFX_PATH=/etc/ssl/private/cloudkey.p12 ~/.acme.sh/acme.sh \
    --to-pkcs12 \
    --domain unifi.ravron.com \
    --password aircontrolenterprise

# keytool's src alias is the name of the entry, or just its index, starting at
# 1, if not present. acme.sh's --to-pkcs12 doesn't know how to set the name, so
# we have to use the index instead.
keytool -importkeystore \
    -deststorepass aircontrolenterprise \
    -destkeypass aircontrolenterprise \
    -destkeystore unifi.keystore.jks \
    -srckeystore cloudkey.p12 \
    -srcstoretype PKCS12 \
    -srcstorepass aircontrolenterprise \
    -srcalias 1 \
    -destalias unifi \
    -noprompt
)

md5sum unifi.keystore.jks > unifi.keystore.jks.md5

chown root:ssl-cert cloudkey.crt cloudkey.key unifi.keystore.jks.md5
chown unifi:ssl-cert unifi.keystore.jks

chmod 640 cloudkey.crt cloudkey.key unifi.keystore.jks unifi.keystore.jks.md5

# Test nginx config, then reload
/usr/sbin/nginx -t
service nginx reload

# unifi and unifi-protect don't obey reload, so we have to restart them
# entirely. Surprise, surprise.
service unifi restart

# If this system has unifi-protect, restart it too
systemctl is-active --quiet unifi-protect.service && service unifi-protect restart

# If there are more than five backups, delete all but the most recent 5. Do this
# in the reload hook so that we don't delete backups on failed renewals.
rm -f $(ls -1 tls-backup-*.tgz | head -n -5)
EOF

This one’s a lot more complicated. The reload hook runs after all certs have been renewed, if the renewal was successful. It’s intended to restart the services that need to pick up the new configuration. In this case, it also does a bit of custom key munging to get the new certificate chain and private key into the Java keystore format that the controller, and UniFi Protect, need. It’s similar to many other scripts intended to update the controller’s keystore, but it also is careful to fix permissions, update the checksum file, and restart the relevant services as efficiently as possible. The final command cleans up all but the most recent five backup files. I do this here, rather than in the prehook, to ensure that no backups are removed until the renewal has finished successfully.

Issuing the certificate

Version 2 of the ACME protocol offers three different challenge types. The third, TLS-ALPN-01, is for pretty niche uses, and not appropriate here. In short, here’s how the other two work:

HTTP-01: the CA gives the client a nonce to put on the webserver that the domain being validated points to, and the CA then fetches the nonce and confirms it is correct.
DNS-01: the CA gives the client a nonce to put in a TXT DNS record, and the CA then runs a DNS lookup and confirms it is correct

You’ll note that HTTP-01 requires that the client requesting the certificate be accessible from the internet so that the CA’s infrastructure can make an HTTP request to it. This is exactly what I was trying to avoid, so I ruled out HTTP-01 and turned to DNS-01.

acme.sh supports DNS-01 via whole bunch of shell script “modules” in its dnsapi subdirectory. After Let’s Encrypt provides the challenge nonce that should be put into a TXT record on the domain being validated, acme.sh uses the selected module to interact with the DNS provider’s API and add the record. My domain’s DNS provider is Route53, and acme.sh has a module to support it. However, there is of course a catch: editing Route53 records means that I have to provide access to my AWS account so that acme.sh can add a TXT record to the unifi.ravron.com subdomain.

As usual, I want to ensure that any authorization I issue is as limited as reasonably possible. There’s a short wiki page that describes how to set up the necessary IAM permissions for acme.sh, and I largely followed it, using the appendix’s “more restrictive” policy. However, I was a little nervous giving my Cloud Key access to do effectively anything with the ravron.com domain, and I did look for a way to scope down those permissions. I found this excellent analysis of schemes to reduce authorization scope from the EFF, and I decided to implement one of its suggested mitigations: I created a new hosted zone in Route53 and delegated authority for unifi.ravron.com from the ravron.com hosted zone to the new hosted zone, following the documentation. Then, I restricted the IAM policy to acting only on that hosted zone, so that the worst an attacker could do if they stole the new AWS account credentials would be mess with the DNS records for unifi.ravron.com. While that wouldn’t be great, it’s much better than exposing ravron.com.

With the new IAM user’s access keys in hand, I needed to pass them securely to acme.sh. acme.sh stores the IAM credentials in its configuration files after first use, but to get them initially, it reads them out of the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. So, simply export those, being careful to disable xtrace logging:

set +x
export AWS_ACCESS_KEY_ID="$1"
export AWS_SECRET_ACCESS_KEY="$2"
set -x

(I’ll explain how the credentials get into $1 and $2 a little later)

Finally, we’re ready to issue a certificate.

# Issue a cert. This won't actually do anything if the cert does not need
# renewal. It will always update the configuration, saving hooks, AWS creds,
# output filepaths, etc. Note there strict rate limits on production cert
# generation. When testing this script, add `--staging` to the command below.
# Let's Encrypt's staging servers will be used indefinitely until you remove the
# `--staging` option and re-run the setup script.
~/.acme.sh/acme.sh \
    --issue \
    --dns dns_aws \
    --domain unifi.ravron.com \
    --pre-hook ~/.acme.sh/prehook.bash \
    --reloadcmd ~/.acme.sh/reload.bash \
    --fullchain-file /etc/ssl/private/cloudkey.crt \
    --key-file /etc/ssl/private/cloudkey.key \
    --accountemail 'ravron@posteo.net' || true

The command is largely self-explanatory. Note though that while you’re testing, you should use the --staging flag, which causes acme.sh to communicate with Let’s Encrypt’s staging environment, so the certificate issued won’t be accepted by browsers. That’s important because the production environment has somewhat stringent rate limits to prevent abuse, while the staging environment’s rate limits are dramatically looser. Once you’ve tested the script and got it configured, remove the --staging flag and re-run the setup to ensure that future renewals will use the production environment.

The command takes a little while, because it must communicate with Let’s Encrypt, then modify my DNS records and sleep a while before asking Let’s Encrypt to check them. Eventually, it generates a certificate and runs the reload hook to put it into the Java keystore and restart the various services on the Cloud Key.

Automating renewal

The final step is to automate renewal using systemd. acme.sh can automatically generate a cron file, but I prefer systemd for its flexibility and because it’s what I use on my other machines. There’s a helpful wiki page on the topic, at least, and I largely followed its advice.

cat <<'EOF' >/etc/systemd/system/acme.service
[Unit]
Description=Renew Let's Encrypt certificates using acme.sh
After=network-online.target

[Service]
Type=oneshot
ExecStart=/root/.acme.sh/acme.sh --cron --home /root/.acme.sh
# acme.sh returns 2 when renewal is skipped (i.e. certs up to date)
SuccessExitStatus=0 2
EOF

# Stop and disable any existing timer
systemctl disable --now acme.timer || true

cat <<'EOF' >/etc/systemd/system/acme.timer
[Unit]
Description=Daily renewal of Let's Encrypt's certificates

[Timer]
OnCalendar=daily
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now acme.timer

Here I write out the service file, then the timer file. The timer will run daily, with a one-hour randomized delay, and invoke the oneshot service. The service in turn calls acme.sh --cron, which checks each issued certificate to see if it’s approaching its not-valid-after date, and renews it if so. All of the hooks and filepaths used during the original --issue command are saved and re-used, so there’s no need to specify them again here.

It’s now easy to see when the next renewal check will happen using systemctl status:

$ systemctl status acme.timer
● acme.timer - Daily renewal of Let's Encrypt's certificates
   Loaded: loaded (/etc/systemd/system/acme.timer; enabled; vendor preset: enabled)
   Active: active (waiting) since Wed 2020-09-09 19:42:59 PDT; 39min ago
  Trigger: Thu 2020-09-10 00:40:58 PDT; 4h 18min left

And, as with any systemd unit, it’s also easy to check the logs with journalctl -u acme.service.

Finishing touches

There’s still a couple things left to be sorted out. First, how does this script get run on the Cloud Key? And second, more importantly, how do the AWS credentials get into the positional parameters I mentioned earlier?

The answer is simple, if unexciting: a wrapper script grabs the keys using the aws tool, then invokes the main setup script remotely via SSH.

set -eu

KEY_ID=$(aws configure get profile.acme.aws_access_key_id)
SECRET_KEY=$(aws configure get profile.acme.aws_secret_access_key)

ssh uck 'bash -s' < remote-uck-setup.bash "$KEY_ID" "$SECRET_KEY"

You can find both the main script I’ve discussed and the wrapper on GitHub. These links will be the best choice for up-to-date sources.

Issues I encountered

I ran into a few issues while setting all this up, and I want to cover them briefly here so that you might avoid what I did not.

First was rate limiting. I didn’t realize how strict the rate limits set on the Let’s Encrypt production environment were until I ran into them. The one that’ll get you is the limit of five renewals per unique set of domain names per week. Once I ran into this, I was forced to use the staging certs until a week had elapsed. Don’t make this mistake, it’s annoying — test with staging until you’ve got everything working perfectly.

Second was HTTP Strict Transport Security, or HSTS. ravron.com includes the strict-transport-security header:

$ curl -sI https://ravron.com | grep strict
strict-transport-security: max-age=63072000; includeSubDomains; preload

which tells the browser that this domain and all its subdomains must not be loaded insecurely. This became a problem after I ran into the rate limiting issue I described earlier, because once I started using a staging certificate, my browser wouldn’t let me load unifi.ravron.com, no matter how I insisted. Of course, I could still access it via its LAN IP addresses, and once I got a production certificate installed again, all was well. I simply found it amusing that the HSTS I intended for use on this site was also (correctly, if confusingly) being applied to this internal-only domain.

Finally, less an issue than a last step, I did have to set up a static DNS entry on my network’s DNS server such that unifi.ravron.com resolved to my controller’s IP.

If you run into any issues, or find this guide lacking, send me an email, and I’ll update the page to be more helpful. If you’re feeling even more charitable, you can open a PR against this page, too. Both my email and the link to open a PR are in the footer.

Resources

I picked pieces from a wide variety of resources while getting this working. Here’s a rather disorganized list so you can do your own research.