Earlier this week our service experienced a serious outage lasting 2 hours and 36 minutes. During this time, all HTTPS traffic to our API and website failed due to an expired TLS certificate.
API reliability is something we take extremely seriously at IPinfo, and we’re very sorry for the outage. Our API handles well over 100 billion requests a month, and we understand that even a minute of downtime can have a significant negative impact on our customers and users. This outage is one of the biggest and most severe in our 12-year history. Below we describe what happened and what we’re going to change going forward.
We use cert-manager to automate TLS certificate lifecycle management. Our setup:
All our public-facing certificates are valid for 90 days and renew halfway through, 45 days before expiry (T-45 days). This ensures that:
In addition to our main ipinfo.io domain, we own other domains, including ipinfo.net, ipinfo.org, and many others. All of these domains share the same TLS certificate.
We rely on Google Cloud for almost all of our infrastructure, but recently decided to adopt Cloudflare for serving static web assets. As part of our experimentation and testing phase, we allocated the ipinfo.org domain to Cloudflare during early development.
When we pointed ipinfo.org's nameserver (NS) records to Cloudflare, Google Cloud DNS was no longer the authoritative nameserver for that domain. This meant that any DNS records cert-manager attempted to create in our Google Cloud DNS zone for ipinfo.org were no longer publicly visible. As a result, the DNS-01 ACME challenge records required by Let's Encrypt for certificate validation couldn't be verified.
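The delegation mismatch described above can be sketched as a small check: given the NS records a zone publicly advertises, confirm they still belong to the DNS provider where challenge records are being written. This is a minimal sketch, not our tooling. The helper names and the sample nameserver hostnames are assumptions for illustration, and the NS lookup itself (which would need a DNS library or a tool such as dig) is left out.

```python
# Hypothetical helpers: decide whether a zone's public NS records still point
# at the DNS provider that the ACME DNS-01 solver writes records to.

def normalize(hostname: str) -> str:
    """Lower-case a DNS name and drop the trailing root dot."""
    return hostname.strip().lower().rstrip(".")


def delegated_to(ns_records: list[str], provider_suffix: str) -> bool:
    """Return True only if every NS record ends with the provider suffix.

    An empty NS list is treated as "not delegated".
    """
    suffix = "." + normalize(provider_suffix).lstrip(".")
    names = [normalize(ns) for ns in ns_records]
    return bool(names) and all(name.endswith(suffix) for name in names)


# Illustrative NS sets (assumed values, not taken from the incident).
google_ns = ["ns-cloud-a1.googledomains.com.", "ns-cloud-a2.googledomains.com."]
cloudflare_ns = ["ada.ns.cloudflare.com.", "bob.ns.cloudflare.com."]

print(delegated_to(google_ns, "googledomains.com"))      # True
print(delegated_to(cloudflare_ns, "googledomains.com"))  # False
```

A zone that returns False here is one where records created in the original provider will never be seen by a public resolver, which is the condition that made the DNS-01 challenge unverifiable.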
Our certificate issuance process requires validation for all domains and subdomains listed in the certificate request. Since Let's Encrypt couldn't confirm ownership of ipinfo.org through the DNS challenge, the entire multi-domain certificate renewal failed, leading to the expiration of our TLS certificate.
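The all-or-nothing behavior can be modeled in a few lines. This is only an illustration of the rule that every name on a shared certificate must validate. The function names and the validation results below are hypothetical, and no real ACME client is involved.

```python
# Hypothetical model: a multi-domain certificate can be issued only if
# every domain named in the request passes validation.

def blocking_domains(validation: dict[str, bool]) -> list[str]:
    """Return the domains whose validation failed, sorted by name."""
    return sorted(domain for domain, ok in validation.items() if not ok)


def can_issue(validation: dict[str, bool]) -> bool:
    """A shared certificate is issuable only when no domain is blocking."""
    return bool(validation) and not blocking_domains(validation)


# Assumed results for illustration: one domain fails its DNS-01 challenge.
results = {"ipinfo.io": True, "ipinfo.net": True, "ipinfo.org": False}

print(can_issue(results))         # False
print(blocking_domains(results))  # ['ipinfo.org']
```

One failing name is enough to block renewal for every other name on the certificate, including the main domain.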
cert-manager had been logging errors, but we didn’t have any monitoring or alerting in place for this, so these errors were missed. Had any of them been seen, or had better alerting been set up, we would have noticed the issue and fixed it long before it impacted customers, because we attempt to renew 45 days before certificate expiration.
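One way to catch a stalled renewal independently of cert-manager's logs is to watch the served certificate itself. With a 90-day lifetime and renewal at T-45 days, a healthy certificate should always have more than about 45 days left, so anything noticeably below that suggests renewal has been failing. The sketch below is an assumption about how such a check could look, not the alerting we run. The function names, the 3-day grace period, and the status strings are all hypothetical.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> datetime:
    """Read the expiry time (notAfter) of the certificate served by host.

    If the certificate has already expired, the TLS handshake raises a
    verification error, which is itself a signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    seconds = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def renewal_status(
    not_after: datetime,
    now: datetime,
    renew_before_days: int = 45,  # renewal point from the setup described above
    grace_days: int = 3,          # assumed slack before raising an alert
) -> str:
    """Classify a certificate as 'ok', 'renewal-overdue', or 'expired'."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining < timedelta(days=renew_before_days - grace_days):
        return "renewal-overdue"
    return "ok"


# Fixed timestamps so the example does not depend on the network or clock.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(renewal_status(now + timedelta(days=60), now))  # ok
print(renewal_status(now + timedelta(days=30), now))  # renewal-overdue
print(renewal_status(now - timedelta(days=1), now))   # expired
```

In practice the result of `fetch_not_after("ipinfo.io")` would be passed to `renewal_status` on a schedule, with "renewal-overdue" paging someone while there are still weeks left to fix the problem.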
All timestamps UTC:
ipinfo.io
ipinfo.org
ipinfo.org domains removed from certificate request
We are taking the following steps to prevent this from happening again:
Ben founded IPinfo in 2013 with the goal of providing reliable, easily accessible IP address data. As IPinfo CEO, he is committed to constantly improving that data and how customers can use it.