Date: 2025-09-17

Earlier this week our service experienced a serious outage lasting 2 hours and 36 minutes. During this time, all HTTPS traffic to our API and website failed due to an expired TLS certificate.

API reliability is something we take extremely seriously at IPinfo, and we’re very sorry for the outage. Our API handles well over 100 billion requests a month, and we understand that even a minute of downtime can have a significant negative impact on our customers and users. This outage is one of the most severe in our 12-year history. Below we describe what happened and what we’re going to change going forward.

Graph showing IPinfo load balancer traffic around the time of the incident. Only HTTP traffic made it through during the outage.

What Happened

We use cert-manager to automate TLS certificate lifecycle management. Our setup:

Single multi-domain certificate covering all production domains

DNS-01 challenge validation method (required for wildcard certificates)

Let's Encrypt as the Certificate Authority

Pre-issuance validation ensures certificates are valid before Load Balancer deployment (this was the main reason for adoption)

All our public-facing certificates are set to last 90 days, and renew half-way through (i.e. T-45 days). This ensures that:

Certificate renewal is NOT a yearly (or rarer) occasion; it happens often enough for everyone on the Engineering team to be exposed to it

We have enough time (45 days) to attempt the issuance, in case there's any issue
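To make the renewal window concrete, here is a minimal sketch of the date arithmetic. It assumes the validity period is exactly 90 days and uses the expiry timestamp from the timeline later in this post; the names `LIFETIME`, `RENEW_BEFORE`, and `renewal_window` are made up for this example and are not part of cert-manager.

```python
from datetime import datetime, timedelta, timezone

LIFETIME = timedelta(days=90)      # certificate validity described above
RENEW_BEFORE = timedelta(days=45)  # renewal starts halfway through (T-45 days)


def renewal_window(not_after: datetime) -> tuple[datetime, datetime]:
    """Return (issued_at, renewal_starts_at) for a certificate expiring at not_after."""
    return not_after - LIFETIME, not_after - RENEW_BEFORE


# Expiry timestamp taken from the timeline in this post (UTC).
expiry = datetime(2025, 9, 17, 4, 48, 57, tzinfo=timezone.utc)
issued_at, renewal_starts_at = renewal_window(expiry)

print("issued (assumed):", issued_at.date())          # 2025-06-19
print("renewal should start:", renewal_starts_at.date())  # 2025-08-03
```

Under these assumptions, renewal attempts for the certificate that expired on 2025-09-17 would have started around 2025-08-03.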

In addition to our main ipinfo.io domain, we own other domains, including ipinfo.net, ipinfo.org, and many others. All of these domains share the same TLS certificate.

We rely on Google Cloud for almost all of our infrastructure, but recently decided to adopt Cloudflare for serving static web assets. As part of our experimentation and testing phase, we allocated the ipinfo.org domain to Cloudflare during early development stages.

When we pointed ipinfo.org's nameserver (NS) records to Cloudflare, Google Cloud DNS was no longer the authoritative nameserver for that domain. This meant that any DNS records cert-manager attempted to create in our Google Cloud DNS zone for ipinfo.org were no longer publicly visible. As a result, the DNS-01 ACME challenge records required by Let's Encrypt for certificate validation couldn't be verified.
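A delegation mismatch like this can be caught by comparing each domain's public NS records with the provider that the DNS-01 solver writes to. The sketch below is illustrative only: the suffix table, the function names, and the sample NS data are assumptions made for this example, and a real check would obtain the NS records from a live DNS lookup.

```python
# Assumed suffixes for recognising each provider's nameservers.
# Verify these against your own zones before relying on them.
PROVIDER_SUFFIXES = {
    "google-cloud-dns": ".googledomains.com",
    "cloudflare": ".ns.cloudflare.com",
}


def delegated_provider(ns_records: list[str]) -> str:
    """Name the provider that all NS records point at, or 'mixed/unknown'."""
    hosts = [ns.rstrip(".").lower() for ns in ns_records]
    for provider, suffix in PROVIDER_SUFFIXES.items():
        if hosts and all(host.endswith(suffix) for host in hosts):
            return provider
    return "mixed/unknown"


def domains_with_wrong_delegation(ns_by_domain: dict[str, list[str]],
                                  expected: str = "google-cloud-dns") -> list[str]:
    """Return domains whose NS records do not match the provider
    that the DNS-01 challenge records are written to."""
    return sorted(domain for domain, ns in ns_by_domain.items()
                  if delegated_provider(ns) != expected)


# Illustrative NS data (made up for this example, not real lookups).
ns_by_domain = {
    "ipinfo.io": ["ns-cloud-a1.googledomains.com.", "ns-cloud-a2.googledomains.com."],
    "ipinfo.org": ["ada.ns.cloudflare.com.", "bob.ns.cloudflare.com."],
}
print(domains_with_wrong_delegation(ns_by_domain))  # ['ipinfo.org']
```

Any domain this returns would have its challenge records written to a zone that public resolvers no longer consult.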

Our certificate issuance process requires validation for all domains and subdomains listed in the certificate request. Since Let's Encrypt couldn't confirm ownership of ipinfo.org through the DNS challenge, the entire multi-domain certificate renewal failed, leading to the expiration of our TLS certificate.
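The all-or-nothing behaviour can be modelled in a few lines. This is a simplified sketch of the logic described above, not the ACME protocol or cert-manager's implementation; `issuance_succeeds` and the domain lists are hypothetical.

```python
def issuance_succeeds(requested: list[str], validated: set[str]) -> bool:
    """Model of all-or-nothing issuance: every requested name must validate."""
    return all(name in validated for name in requested)


requested = ["ipinfo.io", "ipinfo.net", "ipinfo.org"]
# ipinfo.org is missing because its DNS-01 challenge could not be verified.
validated = {"ipinfo.io", "ipinfo.net"}

# One failing domain blocks the certificate for every domain.
before = issuance_succeeds(requested, validated)

# Dropping the failing domain from the request lets issuance go through.
after = issuance_succeeds([n for n in requested if n != "ipinfo.org"], validated)

print(before, after)  # False True
```

The second call mirrors the mitigation recorded in the timeline, where ipinfo.org was removed from the certificate request.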

cert-manager had been logging errors, but we didn’t have any monitoring or alerting in place for this, so these errors were missed. Had anyone seen these errors, or had better alerting been set up, we would have noticed this issue and fixed it long before it impacted customers, because we attempt to renew 45 days before certificate expiration.
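One way to catch a stalled renewal without depending on cert-manager's logs is an external check on the certificate actually being served. The sketch below uses only the Python standard library. The 7-day `GRACE` threshold and the function names are assumptions chosen for illustration, not a description of our monitoring.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

RENEW_BEFORE = timedelta(days=45)  # renewal point described in this post
GRACE = timedelta(days=7)          # hypothetical slack before alerting


def fetch_not_after(host: str, port: int = 443) -> datetime:
    """Read the notAfter timestamp of the certificate served by host.

    With the default context an already expired certificate raises
    ssl.SSLCertVerificationError, which should itself be treated as an alert.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def renewal_overdue(not_after: datetime, now: datetime) -> bool:
    """True if less than (RENEW_BEFORE - GRACE) of validity remains,
    meaning the renewal that should have happened at T-45 days did not."""
    return not_after - now < RENEW_BEFORE - GRACE
```

A scheduled job could call `renewal_overdue(fetch_not_after("ipinfo.io"), datetime.now(timezone.utc))` and page someone when it returns `True`. With these thresholds the alert would fire with roughly 38 days of validity left.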

Timeline

All timestamps UTC:

2025-09-17 04:48:57 – TLS certificate expires

2025-09-17 04:49:00 – Monitoring alerts fire in Slack for ipinfo.io

2025-09-17 04:51:00 – Engineers begin investigating the certificate renewal failure; standard troubleshooting procedures attempted without success; investigation continues

2025-09-17 07:11:00 – Incident escalated to senior infrastructure engineers for deeper investigation

2025-09-17 07:22:00 – Root cause identified: logs show cert-manager unable to complete DNS-01 challenge for ipinfo.org

2025-09-17 07:24:00 – Mitigation applied: ipinfo.org domains removed from certificate request

2025-09-17 07:25:17 – New certificate successfully issued by Let's Encrypt

2025-09-17 07:27:00 – Load Balancer begins to gradually serve new certificate

2025-09-17 07:28:00 – All monitoring alerts resolve

2025-09-17 07:31:00 – Service fully restored, confirmed by external TLS validation
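The intervals between these events can be computed directly from the timestamps. The short sketch below does so; the event labels are made up for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the timeline above (all UTC).
events = {
    "expired": "2025-09-17 04:48:57",
    "escalated": "2025-09-17 07:11:00",
    "root_cause": "2025-09-17 07:22:00",
    "new_cert": "2025-09-17 07:25:17",
    "restored": "2025-09-17 07:31:00",
}
times = {name: datetime.strptime(stamp, FMT) for name, stamp in events.items()}


def elapsed(start: str, end: str) -> str:
    """Return the time between two named events as H:MM:SS."""
    return str(times[end] - times[start])


print(elapsed("expired", "new_cert"))      # 2:36:20
print(elapsed("expired", "restored"))      # 2:42:03
print(elapsed("escalated", "root_cause"))  # 0:11:00
```

The 2 hours and 36 minutes quoted at the top of this post appears to correspond to the interval between certificate expiry and issuance of the new certificate. Full restoration was confirmed about six minutes after that.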

What We’re Doing Going Forward

We are taking the following steps to prevent this from happening again: