4 hours ago by Ben Dowling 3 min read

Post-Mortem: September 17th API Outage

Earlier this week our service experienced a serious outage lasting 2 hours and 36 minutes. During this time, all HTTPS traffic to our API and website failed due to an expired TLS certificate.

API reliability is something we take extremely seriously at IPinfo, and we’re very sorry for the outage. Our API handles well over 100 billion requests a month, and we understand that even a minute of downtime can have a significant negative impact on our customers and users. This outage is one of the most severe in our 12-year history. Below we describe what happened, and what we’re going to change going forward.

Graph showing IPinfo load balancer traffic around the time of the incident. Only HTTP traffic made it through during the outage.

What Happened

We use cert-manager to automate TLS certificate lifecycle management. Our setup:

  • Single multi-domain certificate covering all production domains
  • DNS-01 challenge validation method (required for wildcard certificates)
  • Let's Encrypt as the Certificate Authority
  • Pre-issuance validation ensures certificates are valid before Load Balancer deployment (this was the main reason for adoption).

All our public-facing certificates are valid for 90 days and are renewed halfway through that period (i.e. 45 days before expiry). This ensures that:

  1. Certificate renewal is NOT a yearly (or rarer) occasion; it happens often enough that everyone on the Engineering team is exposed to it
  2. We have enough time (45 days) to attempt the issuance, in case there's any issue.
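
As a rough illustration of that schedule, the sketch below works backwards from an expiry timestamp to the implied issuance and renewal dates. It assumes a lifetime of exactly 90 days and uses the expiry time from the timeline in this post; `renewal_schedule` is a hypothetical helper, not part of cert-manager.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the certificate lifetime is exactly 90 days.
LIFETIME = timedelta(days=90)


def renewal_schedule(not_after: datetime, lifetime: timedelta = LIFETIME):
    """Return (implied issuance time, halfway renewal time) for a certificate."""
    issued_at = not_after - lifetime
    renew_at = not_after - lifetime / 2  # halfway through the lifetime
    return issued_at, renew_at


# Expiry timestamp taken from the incident timeline (UTC).
expiry = datetime(2025, 9, 17, 4, 48, 57, tzinfo=timezone.utc)
issued_at, renew_at = renewal_schedule(expiry)
print("issued :", issued_at.isoformat())
print("renewal:", renew_at.isoformat())
```

Under these assumptions the first renewal attempt would have been due in early August, which is the window in which a failing renewal could have been caught.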

In addition to our main ipinfo.io domain, we own other domains, including ipinfo.net, ipinfo.org, and many others. All of these domains share the same TLS certificate.

We rely on Google Cloud for almost all of our infrastructure, but recently decided to adopt Cloudflare for web static asset serving. As part of our experimentation and testing phase, we allocated the ipinfo.org domain to Cloudflare during early development stages.

When we pointed ipinfo.org's nameserver (NS) records to Cloudflare, Google Cloud DNS was no longer the authoritative nameserver for that domain. This meant that any DNS records cert-manager attempted to create in our Google Cloud DNS zone for ipinfo.org were no longer publicly visible. As a result, the DNS-01 ACME challenge records required by Let's Encrypt for certificate validation couldn't be verified.
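
A pre-flight check along the following lines could flag this kind of delegation drift. This is only a sketch: the nameserver list is passed in rather than looked up, the `googledomains.com` suffix is an assumption about how Google Cloud DNS nameservers are named, the Cloudflare hostnames are illustrative, and both helper functions are hypothetical.

```python
from typing import Iterable, List

# Assumption: nameservers of the DNS provider that cert-manager writes to
# end in this suffix (Google Cloud DNS uses ns-cloud-*.googledomains.com).
EXPECTED_NS_SUFFIX = "googledomains.com"


def acme_challenge_name(domain: str) -> str:
    """Name of the TXT record a DNS-01 challenge is validated against.

    A wildcard such as *.example.com is validated at the base domain.
    """
    base = domain[2:] if domain.startswith("*.") else domain
    return "_acme-challenge." + base.rstrip(".")


def stray_nameservers(nameservers: Iterable[str],
                      expected_suffix: str = EXPECTED_NS_SUFFIX) -> List[str]:
    """Return the nameservers that do NOT belong to the expected provider."""
    stray = []
    for ns in nameservers:
        host = ns.rstrip(".").lower()
        if host != expected_suffix and not host.endswith("." + expected_suffix):
            stray.append(host)
    return stray


# Illustrative NS values for a zone that has been moved to another provider.
delegated_to = ["ada.ns.cloudflare.com.", "bob.ns.cloudflare.com."]
problems = stray_nameservers(delegated_to)
if problems:
    print(f"{acme_challenge_name('ipinfo.org')} will not be publicly visible; "
          f"zone is delegated to {problems}")
```

In practice the NS records would come from a live DNS query, and the check would run for every domain on the certificate before renewal is attempted.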

Our certificate issuance process requires validation for all domains and subdomains listed in the certificate request. Since Let's Encrypt couldn't confirm ownership of ipinfo.org through the DNS challenge, the entire multi-domain certificate renewal failed, leading to the expiration of our TLS certificate.
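
The all-or-nothing behaviour can be modelled in a few lines. The sketch below is a simplified model, not cert-manager's implementation: the list of names is hypothetical (our real certificate covers more), and `registered_domain` naively takes the last two labels, which would be wrong for suffixes such as `co.uk`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def renewal_succeeds(names: Iterable[str], validated: Set[str]) -> bool:
    """A certificate is issued only if EVERY requested name was validated."""
    return all(name in validated for name in names)


def registered_domain(name: str) -> str:
    """Naive assumption: the registered domain is the last two labels."""
    labels = name.removeprefix("*.").rstrip(".").split(".")
    return ".".join(labels[-2:])


def group_by_domain(names: Iterable[str]) -> Dict[str, List[str]]:
    """Split one multi-domain request into one request per registered domain."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[registered_domain(name)].append(name)
    return dict(groups)


# Hypothetical names; ipinfo.org is the one whose challenge cannot be verified.
requested = ["ipinfo.io", "*.ipinfo.io", "ipinfo.net", "ipinfo.org"]
validated = {"ipinfo.io", "*.ipinfo.io", "ipinfo.net"}

print("single certificate:", renewal_succeeds(requested, validated))
for domain, names in group_by_domain(requested).items():
    print(domain, "->", renewal_succeeds(names, validated))
```

With a single request, one unverifiable name blocks renewal for every domain; with per-domain requests, only the affected domain fails.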

cert-manager had been logging errors, but we didn’t have any monitoring or alerting in place for them, so they were missed. Because we attempt renewal 45 days before certificate expiration, had any of these errors been seen, or had better alerting been set up, we would have noticed and fixed the issue long before it impacted customers.

Timeline

All timestamps UTC:

  • 2025-09-17 04:48:57 – TLS certificate expires
  • 2025-09-17 04:49:00 – Monitoring alerts fire in Slack for ipinfo.io
  • 2025-09-17 04:51:00 – Engineers begin investigating the certificate renewal failure; standard troubleshooting procedures are attempted without success, and the investigation continues
  • 2025-09-17 07:11:00 – Incident escalated to senior infrastructure engineers for deeper investigation
  • 2025-09-17 07:22:00 – Root cause identified: logs show cert-manager unable to complete DNS-01 challenge for ipinfo.org
  • 2025-09-17 07:24:00 – Mitigation applied: ipinfo.org domains removed from certificate request
  • 2025-09-17 07:25:17 – New certificate successfully issued by Let's Encrypt
  • 2025-09-17 07:27:00 – Load Balancer begins to gradually serve new certificate
  • 2025-09-17 07:28:00 – All monitoring alerts resolve
  • 2025-09-17 07:31:00 – Service fully restored, confirmed by external TLS validation
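
For reference, the intervals implied by these timestamps can be computed directly. The short sketch below uses only the times listed above; the event labels are our own shorthand.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the timeline above (all UTC).
TIMELINE = {
    "expired": "2025-09-17 04:48:57",
    "escalated": "2025-09-17 07:11:00",
    "issued": "2025-09-17 07:25:17",
    "restored": "2025-09-17 07:31:00",
}


def elapsed_since_expiry(event: str) -> timedelta:
    """Time between certificate expiry and the given timeline event."""
    start = datetime.strptime(TIMELINE["expired"], FMT)
    return datetime.strptime(TIMELINE[event], FMT) - start


for event in ("escalated", "issued", "restored"):
    print(f"{event}: {elapsed_since_expiry(event)}")
# escalated: 2:22:03
# issued: 2:36:20
# restored: 2:42:03
```

The interval from expiry to issuance of the new certificate matches the 2 hours and 36 minutes quoted at the top of this post, and most of that time had elapsed before the incident was escalated.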

What We’re Doing Going Forward

We are taking the following steps to prevent this from happening again:

  1. Have separate certificates for each domain, so if there are any renewal issues the impact will be limited to that specific domain. 
  2. Implement an independent TLS certificate expiration monitor process.
    1. This ensures redundancy beyond our current monitoring setup, which has at times proven unreliable because of false positives.
    2. Our certificates renew at the 45-day mark (halfway through their 90-day lifespan), so we're adding daily validation checks that alert if any certificate is within 30 days of expiration and has not been successfully renewed.
  3. Improve our runbooks: a streamlined troubleshooting guide for certificate-related incidents, to ensure rapid diagnosis and resolution even for engineers less familiar with our certificate infrastructure.
  4. Review and improve our Incident Escalation policy, so that escalations happen more quickly in the future.
  5. Review and improve our alerting thresholds.
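
To give a sense of what the independent expiration monitor in step 2 might look like, here is a minimal sketch using only the Python standard library. It is not our production monitor: the function names are hypothetical, the 30-day threshold is taken from the step above, and how alerts are delivered is left out.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone

# Assumption: alert once fewer than 30 days of validity remain.
ALERT_THRESHOLD = timedelta(days=30)


def parse_not_after(not_after: str) -> datetime:
    """Convert a 'notAfter' string from getpeercert() to an aware UTC datetime."""
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)


def should_alert(not_after: str, now: datetime,
                 threshold: timedelta = ALERT_THRESHOLD) -> bool:
    """True if the certificate expires in less than `threshold` from `now`."""
    return parse_not_after(not_after) - now < threshold


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Fetch the served certificate's 'notAfter' field over a verified TLS connection.

    An already expired or otherwise invalid certificate raises
    ssl.SSLCertVerificationError, which a monitor should also treat as an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check_host(host: str) -> bool:
    """Return True if `host` needs attention; intended to be run daily per domain."""
    try:
        not_after = fetch_not_after(host)
    except (ssl.SSLError, OSError):
        return True  # unreachable or failing TLS is itself worth alerting on
    return should_alert(not_after, datetime.now(timezone.utc))
```

Because the check looks at the certificate actually being served rather than at cert-manager's own state, it would still fire if renewal fails silently for any reason.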

About the author

Ben Dowling

Ben founded IPinfo in 2013 with the goal of providing reliable, easily accessible IP address data. As IPinfo CEO, he is committed to constantly improving that data and how customers can use it.