Long-Lived Connections
Long-lived connections are one of those topics that sound simple until you try to explain why a live UI “feels stuck” while every health check is green.
The usual failure mode: the client still thinks the WebSocket is open, but a load balancer dropped the flow minutes ago because idle timeouts didn’t line up. Nothing crashes. Metrics look fine. The connection just quietly lies.
This post is my notes on what long-lived connections are, where they show up, and why the boring middlebox details matter as much as the application code. Most of it comes from reading, small experiments, and getting surprised in homelab setups.
What we’re actually talking about
Most of the internet still runs on TCP, and TCP has a personality trait: once it’s open, it wants to stay open.
A long-lived connection is exactly what it sounds like. You open a socket once and reuse it for a while instead of handshaking, requesting, and closing for every single interaction. On paper that’s obvious. In production it’s a trade: you save repeated setup cost, but you inherit state. Something, somewhere, has to remember that this connection exists, who’s on the other end, and when it’s safe to kill it.
If you’ve only ever built request/response APIs, you can go years without thinking about this. The moment you add live UI, streaming, chat, or a connection pool to Postgres, you’re in the club.
The naive model (and why it hurts)
Here’s the version of HTTP we all learned in tutorials:
- TCP handshake (one round trip; TLS adds one more with TLS 1.3, two with TLS 1.2)
- Send request, read response
- Close the connection
Clean. Easy to reason about. Also expensive if you’re doing it hundreds of times a second.
In a small browser polling experiment I ran locally (one request per second), the interesting cost wasn’t the JSON. It was setup. Each poll opened a fresh connection because keep-alive was off on one hop. Turning it on and fixing a stray Connection: close header cut latency noticeably. Same logic, fewer handshakes.
For a dashboard that fires twenty parallel requests on page load, or a mobile app that wakes up and syncs, the tax adds up fast. You’re not paying for bytes so much as for round trips and setup.
HTTP keep-alive is the first escape hatch. Same TCP connection, multiple requests to the same host. Your browser does this by default. Your reverse proxy probably does too. You usually only notice it when something in the chain disables it, or when idle timeouts fight each other.
HTTP/2 and HTTP/3 push further: many logical streams share fewer transport connections. Multiplexing is great until you realize your load balancer needs to actually understand the protocol end-to-end, not just pass bytes through and hope.
The trade is always the same: fewer handshakes and less CPU spent on setup, in exchange for remembering things. Who’s connected. How long they’ve been idle. What happens when they go away without saying goodbye.
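To make the handshake tax concrete, here’s a minimal sketch that counts how many TCP connections a throwaway local server sees with and without connection reuse. It uses only the Python standard library; `CountingHandler`, `count_connections`, and the `/poll` path are names I made up for illustration, not anything from a real setup.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections open
    connections_seen = 0           # bumped once per accepted TCP connection
    lock = threading.Lock()

    def setup(self):
        # One handler instance is created per TCP connection,
        # so setup() runs once per connection, not once per request.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections_seen += 1

    def do_GET(self):
        body = b'{"ok": true}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Without Content-Length the client cannot tell where the
        # response ends, and the connection cannot be reused.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def count_connections(n_requests, reuse):
    """Send n_requests GETs to a local server; return TCP connections it saw."""
    CountingHandler.connections_seen = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        shared = http.client.HTTPConnection(host, port, timeout=5) if reuse else None
        for _ in range(n_requests):
            conn = shared if reuse else http.client.HTTPConnection(host, port, timeout=5)
            conn.request("GET", "/poll")
            conn.getresponse().read()  # drain the body before the next request
            if not reuse:
                conn.close()
        if shared is not None:
            shared.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections_seen
```

Calling `count_connections(5, reuse=False)` and then `count_connections(5, reuse=True)` shows the difference in handshakes for identical application logic. It says nothing about latency on a real network, only about how many times setup happens.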
The zoo of long-lived patterns
Not all long-lived connections are the same shape. Here’s how I mentally sort them.
| Pattern | Direction | Typical use |
|---|---|---|
| HTTP keep-alive | Request/response | Browsers, REST APIs, nginx → app |
| WebSockets | Bidirectional | Live dashboards, games, collab tools |
| Server-Sent Events | Server → client | Notifications, log tailing, “good enough” push |
| gRPC / HTTP/2 streams | Multiplexed RPC | Service-to-service, internal APIs |
| Connection pools | App → database | Postgres, Redis, anything with max_connections |
HTTP keep-alive
This is the baseline. It’s so default now that the interesting bugs are subtle: a middleware that forces Connection: close, a health check that opens a new connection per probe and exhausts ephemeral ports, a client library that pools incorrectly across hosts.
If your API “works in curl” but stutters in the browser, it’s worth checking whether you’re accidentally paying the full connection tax on every call.
WebSockets
WebSockets upgrade HTTP into a bidirectional channel. They’re great when the UI needs to change the moment something happens on the server. They’re less great when you reach for them because you heard they’re “real-time” and you only needed occasional server push.
It’s tempting to run generic RPC over WebSockets because the client library is already there. It works until you need proper backpressure, versioning, or straightforward HTTP debugging. Then you miss HTTP.
Server-Sent Events (SSE)
SSE keeps a one-way stream open over ordinary HTTP. I like SSE more than I expected. For “server tells the client something changed” it’s often enough, and you don’t have to redesign your auth or routing around a separate protocol.
The constraints are real though: text framing, one direction, and some proxies buffer SSE oddly. Know your path.
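The “text framing” part is small enough to sketch. This is a hedged illustration of how one event is laid out on the wire in the `text/event-stream` format: optional `id:` and `event:` fields, one `data:` line per payload line, and a blank line to end the event. The helper name `format_sse` is mine, not from any library.

```python
def format_sse(data, event=None, event_id=None):
    """Serialize one Server-Sent Event as text/event-stream bytes."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Each payload line gets its own "data:" field; the browser's
    # EventSource joins them back together with newlines.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the event.
    return ("\n".join(lines) + "\n\n").encode("utf-8")
```

The `id:` field matters for reconnects: browsers send the last one they saw back in a `Last-Event-ID` header, which gives the server a place to resume from.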
gRPC and HTTP/2 streams
Inside the cluster, multiplexed RPC over long-lived connections is usually the right default. One connection, many in-flight calls, less handshake noise.
The footgun is at the edge. Terminate TLS at an L7 LB that doesn’t speak HTTP/2 to the backend correctly, or pin HTTP/1.1 somewhere in the chain, and you’ll spend a week reading grpc-go logs that look fine in isolation.
Database connection pools
Pools are long-lived connections wearing a trench coat. Opening Postgres isn’t free: auth, memory for backend state, sometimes surprising latency on cold start. Pools amortize that and cap how many concurrent queries can hit the database at once.
A common pool footgun: each replica opens pool_size connections, you scale replicas up, and suddenly replicas × pool_size blows past max_connections on the database. The app looks healthy. The database isn’t.
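The arithmetic is trivial, which is probably why it gets skipped. A small sketch, with illustrative numbers and hypothetical helper names (`total_connections`, `max_safe_replicas`); the `reserved` parameter is my assumption for slots you want to keep free for admin or migration sessions.

```python
def total_connections(replicas, pool_size):
    """Worst case: every replica has its pool fully open."""
    return replicas * pool_size


def max_safe_replicas(max_connections, pool_size, reserved=0):
    """Largest replica count whose pools fit under max_connections.

    `reserved` slots are held back (admin sessions, migrations, and so on).
    """
    usable = max_connections - reserved
    if pool_size <= 0 or usable < 0:
        raise ValueError("pool_size must be positive and reserved <= max_connections")
    return usable // pool_size


# Illustrative numbers: 100 connections on the database, 3 held back,
# 20 connections per replica.
limit = max_safe_replicas(max_connections=100, pool_size=20, reserved=3)
print(limit)                         # 4
print(total_connections(limit, 20))  # 80
```

With these numbers a fifth replica would want 100 connections against 97 usable slots, which is exactly the scale-up surprise described above.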
When the path lies to you
Middleboxes have opinions, and they don’t always tell you.
Picture this:
- The client opens a WebSocket through an L7 load balancer with a 60-second idle timeout.
- The user goes to get coffee. Traffic is quiet for 90 seconds.
- The LB drops the flow. No RST packet drama. Just… gone.
- The client still thinks it’s connected until the next ping fails, or until the user clicks something and nothing happens.
Your server logs might show a clean disconnect eventually. Your metrics won’t scream. The user experience is “why is this broken until I refresh?”
What I try to remember when building anything long-lived:
- Line up idle timeouts: the client’s ping interval should be shorter than the LB’s idle timeout, and the LB’s idle timeout shorter than the app server’s. At minimum, know which layer kills first.
- Send application-level pings on WebSockets and similar. TCP keepalive exists, but on Linux the default waits two hours of idle time before the first probe, and it’s easy to tune wrong.
- Treat reconnect as normal user flow, not an edge case. Back off, resync state, and don’t assume a socket that has been quiet for a while is still alive.
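The first rule in the list above is easy to turn into a check you can run against your own config values. This is a sketch under the assumption that you can express each layer’s behavior as a single number of seconds; `check_timeouts` and its parameter names are hypothetical.

```python
def check_timeouts(client_ping_s, lb_idle_s, server_idle_s):
    """Return a list of ordering problems; an empty list means the chain lines up."""
    problems = []
    if client_ping_s >= lb_idle_s:
        problems.append(
            f"client pings every {client_ping_s}s but the LB drops idle flows "
            f"after {lb_idle_s}s"
        )
    if lb_idle_s >= server_idle_s:
        problems.append(
            f"LB idle timeout ({lb_idle_s}s) is not shorter than the server's "
            f"({server_idle_s}s), so the server may close first"
        )
    return problems


# The coffee-break scenario: 90 seconds of silence through a 60-second LB.
print(check_timeouts(client_ping_s=90, lb_idle_s=60, server_idle_s=75))
# A chain that lines up prints an empty list.
print(check_timeouts(client_ping_s=25, lb_idle_s=60, server_idle_s=75))
```

Real paths have more than three layers, and some of them (NAT, corporate firewalls) won’t tell you their timeout, so treat this as a way to write the known numbers down in one place rather than proof that the path is safe.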
NAT gateways and corporate firewalls play the same game on longer timescales. If your mobile app “randomly disconnects” after five minutes in the background, I wouldn’t start by blaming your Go scheduler. I’d draw the path and ask what’s allowed to go idle.
The bill you pay in file descriptors and RAM
On Linux, a connection is a file descriptor. The common default soft limit of 1024 (what ulimit -n reports) comes from a world where a process opened a handful of files. A busy API server is not that world.
Each connection drags along more than an integer in a table:
Process memory grows with open connections
┌──────────────────────────────────────────┐
│ App heap (session objects, buffers)      │
├──────────────────────────────────────────┤
│ TLS state per connection                 │
├──────────────────────────────────────────┤
│ Kernel socket buffers (send/recv)        │
└──────────────────────────────────────────┘
      ▲                   ▲
      │                   └── each FD = cost
      └── idle WebSockets still count
Kernel buffers. TLS session state. Whatever your framework hangs off conn in a map. Thousands of idle WebSockets can be a real memory problem, the quiet kind that doesn’t spike CPU but slowly eats the box until the kernel’s OOM killer picks a victim. (Yes, that’s why this site is called oomkill. I’m not above a pun.)
Cap concurrent connections, expire idle ones on purpose, and when you’re load testing, watch open FD count and connection age, not only RPS and latency. The scary graphs are often the boring ones.
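For the “watch open FD count” part, here’s a rough sketch of what a process can report about itself. It assumes Linux (it reads `/proc/self/fd`) and uses the Unix-only `resource` module; the function name `fd_usage` and the shape of the returned dict are my own.

```python
import os
import resource  # Unix-only standard library module


def fd_usage():
    """Report this process's open file descriptors against its limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # Linux-specific. Listing the directory briefly opens one FD itself,
        # so the count can be one higher than the steady state.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None  # no /proc here (for example macOS)
    return {"open": open_fds, "soft_limit": soft, "hard_limit": hard}


usage = fd_usage()
print(usage)
```

Exporting these numbers as gauges next to RPS and latency is one way to make the boring graph exist before you need it.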
Failure modes that don’t look like HTTP errors
Request/response fails loudly: 502, timeout, retry, done. Long-lived stuff fails sideways.
Half-open TCP after a network partition is the classic. One side thinks the connection is fine because it hasn’t tried to write yet. The honest fix is usually “send something or use a heartbeat.” I learned that the hard way in lab setups.
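The heartbeat idea boils down to tracking when the peer last proved it was there. A minimal sketch, independent of any WebSocket library: `HeartbeatMonitor` is a hypothetical helper, and the injectable `clock` is only there so the logic can be exercised without sleeping.

```python
import time


class HeartbeatMonitor:
    """Tracks when the peer was last heard from and flags stale connections."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_seen = clock()

    def record_activity(self):
        # Call on every message or pong received from the peer.
        self._last_seen = self._clock()

    def is_stale(self):
        # True once the peer has been silent for longer than timeout_s.
        return self._clock() - self._last_seen > self.timeout_s


# Simulated time, so the example runs instantly.
now = [0.0]
monitor = HeartbeatMonitor(timeout_s=30, clock=lambda: now[0])
now[0] = 31.0
print(monitor.is_stale())   # True: 31s of silence against a 30s budget
monitor.record_activity()
print(monitor.is_stale())   # False: the peer just proved it is alive
```

In a real client you would send a ping on an interval shorter than `timeout_s` and tear the socket down when `is_stale()` turns true, rather than waiting for a write to fail.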
Sticky sessions plus long-lived connections plus Kubernetes rollouts is another favorite. User pinned to pod A. Pod A terminates. Client reconnects, maybe to pod B with empty in-memory state. Works great in staging with one replica.
Thundering herd on reconnect after a deploy is the one that punishes you for success. Everyone drops at once, everyone comes back at once, auth service and database see a spike that has nothing to do with steady-state traffic. If reconnect isn’t jittered and bounded, you’re load testing yourself accidentally.
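“Jittered and bounded” can look like this. The sketch uses exponential backoff with full jitter (each delay is drawn uniformly between zero and a capped, doubling ceiling), which is one common choice and not the only one. The defaults for `base_s`, `cap_s`, and `attempts` are placeholders, and `backoff_delays` is a hypothetical name.

```python
import random


def backoff_delays(base_s=0.5, cap_s=30.0, attempts=8, rng=random.random):
    """Yield reconnect delays: uniform in [0, min(cap_s, base_s * 2**attempt))."""
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


for attempt, delay in enumerate(backoff_delays()):
    print(f"attempt {attempt}: wait {delay:.2f}s before reconnecting")
```

The randomness is what spreads a fleet of clients out after a deploy; the cap keeps any single client from waiting forever; the finite `attempts` forces a decision about what happens when reconnecting keeps failing.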
What I want in a client now is boring on purpose: reconnect with backoff, resync from a known cursor or version, never trust a socket that hasn’t proven it’s alive recently, and make idempotent resume someone’s actual job, not a comment in the README.
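Here’s one way “resync from a known cursor” and “idempotent resume” can fit together, as a sketch. It assumes the server stamps every event with a strictly increasing sequence number and can replay from a given point; `ResumableFeed` and `replay_since` are hypothetical names, and the in-memory list stands in for whatever log the server really keeps.

```python
class ResumableFeed:
    """Client-side state that survives reconnects and applies each event once."""

    def __init__(self):
        self.cursor = 0    # highest sequence number applied so far
        self.applied = []  # payloads in the order they were applied

    def apply(self, seq, payload):
        if seq <= self.cursor:
            return False   # duplicate from a replay; safe to ignore
        self.applied.append(payload)
        self.cursor = seq
        return True


def replay_since(log, cursor):
    """Stand-in for the server: return events with seq greater than cursor."""
    return [(seq, payload) for seq, payload in log if seq > cursor]


log = [(1, "a"), (2, "b"), (3, "c")]
feed = ResumableFeed()
feed.apply(1, "a")
feed.apply(2, "b")
# Connection drops here. On reconnect the server replays with some overlap.
for seq, payload in replay_since(log, cursor=feed.cursor - 1):
    feed.apply(seq, payload)
print(feed.applied)  # ['a', 'b', 'c']
```

Because duplicates are dropped by sequence number, the server can err on the side of replaying too much, which is a far easier guarantee to provide than exactly-once delivery.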
What I’d tell past-me
Reuse connections when setup is expensive. That’s why keep-alive and pools exist. Match timeouts across client, proxy, and server, or you’ll debug ghosts while dashboards stay green. Assume anything that lives longer than a single page view will disconnect at the worst time, and write the client accordingly.
I’m still learning this stuff. Follow-up posts might dig into gRPC through nginx, pool sizing math, or sticky sessions, if I can reproduce the weird parts in a small setup first.