bun's fetch() will silently kill your production app

this is a story about a few days of debugging a fetch timeout issue in production. if you’re running bun 1.3.x in docker with outgoing HTTPS calls, this might save you a few sleepless nights

background

API server running in docker swarm, multiple replicas behind nginx. the app makes outgoing HTTPS calls to several external services: some on cloud run, plus push notifications and an SMS gateway

originally on bun 1.2.18 as a compiled binary (bun build --compile). functionally fine, fetch to external services worked without issues

but bun 1.2.18 had a GC HeapHelper bug where each process ate ~28-50% CPU even when idle. running 5 replicas meant 150-250% CPU just for garbage collection doing nothing

eventually upgraded to bun 1.3.13 to fix the GC CPU issue. CPU dropped immediately

this is what CPU and memory looked like on bun 1.2.18

grafana service CPU and memory usage on bun 1.2.18 showing API CPU mean 213%

the symptom

after upgrading to bun 1.3.13, outgoing external API calls started timing out at exactly 15 seconds. not always, intermittently. sometimes 50% of requests would time out, sometimes 0%

the tricky part: it only happened under real traffic load. every isolated test passed perfectly

the debugging journey

it’s the third-party service

first instinct, the external API is slow. but curl from the same container? 65-200ms. every time. so it’s not the service

it’s DNS

searched bun issues, found oven-sh/bun#10731 about c-ares DNS issues. tried:

  • BUN_DNS_RESOLVER=getaddrinfo at build time
  • dns.setDefaultResultOrder('ipv4first') in code
  • disabled IPv6 via sysctl

DNS from container was fine (1-20ms). none of these fixed it
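a minimal sketch of how the DNS timing could be checked from inside the runtime rather than with shell tools. timeLookup is a hypothetical helper name, and the hostname you pass is whatever service your app calls. it only uses node:dns, which bun also exposes:

```typescript
import { promises as dnsPromises, setDefaultResultOrder } from 'node:dns'

// prefer IPv4 (A) results over IPv6 (AAAA) when a lookup returns both
setDefaultResultOrder('ipv4first')

// hypothetical helper: resolve one hostname and report how long it took
async function timeLookup(hostname: string): Promise<{ address: string; family: number; ms: number }> {
  const start = performance.now()
  const { address, family } = await dnsPromises.lookup(hostname)
  return { address, family, ms: performance.now() - start }
}
```

if the numbers here stay in the low milliseconds while fetch() still times out, DNS is unlikely to be the culprit, which matches what we saw.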

it’s the compiled binary

we were using bun build --compile. installed the bun runtime in the same container and ran isolated tests, results were fine. switched to runtime mode

but under real traffic it still happened

it’s connection pooling

found oven-sh/bun#9034 about keep-alive reuse on dead sockets. added keepalive: false to all fetch calls

didn’t fix it

it’s IPv6

discovered the cloud run target had no AAAA record. a curl test over IPv6 returned HTTP 000 after 1-6 seconds. bun was trying IPv6 first!

disabled IPv6, added ipv4first. helped a little, timeouts went from 50% to ~5%. but still happened

it needs HTTP/2

found oven-sh/bun#13586, where a bun maintainer confirmed bun fetch uses HTTP/1.1 while some servers respond faster to HTTP/2. tried {protocol: "http2"} (experimental in 1.3.14)

isolated test worked great. under load still timed out. experimental feature wasn’t stable

the breakthrough

installed bun runtime in a container and ran a comparison test:

// bun's fetch()
const viaFetch = await fetch(url, { headers })
// vs node:https (httpsRequest is the wrapper shown in the fix section)
const viaHttps = await httpsRequest(url, { headers })

both returned identical results in isolated tests. but here’s what happened under real traffic:

client       isolated   under load
fetch()      65ms       15,000ms timeout
node:https   55ms       55ms
curl         65ms       65ms

node:https and curl were completely unaffected by concurrent load. only bun’s fetch() degraded
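a rough sketch of a harness for this kind of comparison. measure and percentile are hypothetical names, not what we actually ran. it takes any async function, so the same code can time fetch() and a node:https wrapper. one big caveat: it only generates concurrent outgoing calls, and in our case the problem needed real incoming Bun.serve() traffic, so a harness like this would have to run inside the serving process to have a chance of showing it:

```typescript
// run `count` calls of `request`, at most `concurrency` in flight,
// and return the observed latencies in ms, sorted ascending
async function measure(
  request: () => Promise<unknown>,
  count: number,
  concurrency: number,
): Promise<number[]> {
  const latencies: number[] = []
  let started = 0

  async function worker(): Promise<void> {
    while (started < count) {
      started++
      const t0 = performance.now()
      try {
        await request()
      } catch {
        // a failed or timed-out call still contributes its elapsed time
      }
      latencies.push(performance.now() - t0)
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()))
  return latencies.sort((a, b) => a - b)
}

// nearest-rank percentile over an ascending array
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))]
}
```

usage would look something like percentile(await measure(() => fetch(url), 200, 20), 99), then the same call with the node:https wrapper swapped in.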

this is what outgoing and incoming latency looked like during the issue:

grafana outgoing latency before fix showing 15 second timeouts

grafana incoming latency before fix

something interesting here: outgoing hit 15 seconds but incoming P99 only showed ~3-4 seconds, even though the request that triggered the outgoing fetch should also take 15 seconds. the reason is that the incoming histogram (from hono-prometheus) only has buckets up to 10 seconds, then jumps to +Inf. histogram_quantile can’t interpolate inside the +Inf bucket: it interpolates linearly within finite buckets and never reports more than the highest finite bound (10s here), so anything slower than 10 seconds is under-reported. our outgoing histogram has a custom bucket at 15 seconds, so it shows the actual value
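to make the bucket effect concrete, here is a simplified re-implementation of the documented histogram_quantile behaviour for one classic histogram: linear interpolation inside a finite bucket, and the highest finite bound when the quantile lands in +Inf. the bucket counts are made up for illustration (100 requests, 3 of them at 15s) and are not taken from our dashboards:

```typescript
// one cumulative bucket: `count` observations were <= `le` seconds
interface Bucket { le: number; count: number }

// simplified histogram_quantile for a single classic histogram
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2 || q < 0 || q > 1) {
    throw new Error('need 0 <= q <= 1 and at least two buckets')
  }
  const sorted = [...buckets].sort((a, b) => a.le - b.le)
  const total = sorted[sorted.length - 1].count
  if (total === 0) return NaN
  const rank = q * total
  const i = sorted.findIndex((b) => b.count >= rank)
  // quantile lands in +Inf: there is no upper edge to interpolate toward,
  // so the highest finite bound is returned
  if (sorted[i].le === Infinity) return sorted[i - 1].le
  const lowerBound = i === 0 ? 0 : sorted[i - 1].le
  const lowerCount = i === 0 ? 0 : sorted[i - 1].count
  const inBucket = sorted[i].count - lowerCount
  if (inBucket === 0) return lowerBound
  return lowerBound + (sorted[i].le - lowerBound) * ((rank - lowerCount) / inBucket)
}

// made-up traffic: 100 requests, 3 of them took 15s
const incoming: Bucket[] = [
  { le: 1, count: 90 }, { le: 5, count: 95 }, { le: 10, count: 97 },
  { le: Infinity, count: 100 },
]
const outgoing: Bucket[] = [
  { le: 1, count: 90 }, { le: 5, count: 95 }, { le: 10, count: 97 },
  { le: 15, count: 100 }, { le: Infinity, count: 100 },
]

const p99Incoming = histogramQuantile(0.99, incoming) // 10: capped at the last finite bound
const p99Outgoing = histogramQuantile(0.99, outgoing) // about 13.3: interpolated inside the 10-15s bucket
```

same underlying traffic, different answers, purely because of where the buckets end. the exact value a real dashboard shows also depends on the traffic mix and the rate window.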

the pattern

the degradation was progressive, not sudden:

  • 0-10 min: ~100ms (normal)
  • 10-20 min: ~2-4s (degrading)
  • 20-30 min: ~8-15s (timeout)

and it affected all external HTTPS calls made via fetch(). internal HTTP calls and the worker process (sequential, no Bun.serve()) were unaffected

the last point was key. same bun version, same fetch(), same docker container, but the worker (which processes jobs sequentially, no Bun.serve()) never had this issue

root cause

bun 1.3.x has a regression in its native fetch() implementation when used for outgoing HTTPS requests under concurrent Bun.serve() incoming load. the internal TLS connection pool degrades over time

this doesn’t happen with:

  • node:https (different networking stack in bun)
  • curl (completely separate)
  • bun 1.2.18 (different fetch internals)
  • sequential processing (no concurrent Bun.serve() pressure)

the fix

replace fetch() with node:https for external HTTPS calls:

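import https from 'node:https'

// minimal option/response shapes used by this helper
interface RequestOptions {
  method?: string
  headers?: Record<string, string>
  body?: string
  timeout?: number // milliseconds
}

interface HttpResponse {
  status: number
  data: string
}

function httpsRequest(url: string, options: RequestOptions = {}): Promise<HttpResponse> {
  return new Promise((resolve, reject) => {
    const u = new URL(url)
    const req = https.request({
      hostname: u.hostname,
      port: u.port || 443, // respect a non-default port in the URL
      path: u.pathname + u.search,
      method: options.method || 'GET',
      headers: options.headers || {},
      family: 4, // force IPv4 (see the IPv6 section above)
    }, (res) => {
      let data = ''
      res.setEncoding('utf8') // decode multi-byte characters split across chunks
      res.on('data', (chunk) => data += chunk)
      res.on('end', () => resolve({ status: res.statusCode || 0, data }))
      res.on('error', reject)
    })
    req.on('error', reject)
    // socket idle timeout, not a cap on total request time
    req.setTimeout(options.timeout || 15000, () => {
      req.destroy(new Error('Request timeout'))
    })
    if (options.body) req.write(options.body)
    req.end()
  })
}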

after deploying this change, external API calls went from 15s timeout to consistent 50-200ms. zero degradation over time

after (using node:https):

grafana outgoing latency after fix showing all services under 1 second

grafana incoming latency after fix showing P99 back to hundreds of ms
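since internal HTTP calls were never affected, the swap only has to cover external HTTPS. one way to keep that decision in a single place is a small routing helper. pickClient and the internalHosts list are hypothetical, a sketch of the idea rather than what runs in our codebase:

```typescript
// hypothetical helper: decide which client an outgoing call should use.
// external HTTPS goes through node:https; everything else stays on fetch()
function pickClient(url: string, internalHosts: string[] = []): 'node:https' | 'fetch' {
  const u = new URL(url)
  const isExternal = !internalHosts.includes(u.hostname)
  return u.protocol === 'https:' && isExternal ? 'node:https' : 'fetch'
}
```

a call site can then branch on the result and use either the node:https wrapper or plain fetch(), which also makes it easy to flip back once the runtime is fixed.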

lessons learned

  1. isolated tests lie. the issue only appeared under real concurrent load. no amount of bun script.js testing could reproduce it

  2. curl from container proves the network is fine. if curl works but your runtime doesn’t, the problem is in the runtime

  3. worker vs API comparison was the breakthrough. same code, same container, different concurrency model. if one works and the other doesn’t, it’s a concurrency bug

  4. node:https is a valid escape hatch in bun. bun implements node.js APIs using a different code path than native fetch(). when one breaks, the other might work

  5. monitor outgoing latency separately. we had prometheus histograms for outgoing calls. without this, we’d never have noticed the progressive degradation

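relevant bun issues (all referenced above): oven-sh/bun#10731 (c-ares DNS), oven-sh/bun#9034 (keep-alive reuse on dead sockets), oven-sh/bun#13586 (HTTP/1.1 vs HTTP/2)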

if you’re hitting similar issues, try node:https before blaming your infrastructure

