Connection Pooling Best Practices for MySQL at Scale

Most MySQL incidents we have been paged for over the years fall into one of two buckets: a query that should never have shipped, or a connection layer that quietly collapsed under its own weight. The first kind gets all the attention because it produces a dramatic slow log. The second kind is sneakier. The database looks healthy — CPU is fine, disk is fine, the slow log is empty — and yet the application is throwing Too many connections and timing out. By the time we get the call, the team has usually already done the worst possible thing: raised max_connections and restarted.

Connection pooling is the discipline that prevents that 3 a.m. page. It is also one of the most misunderstood parts of running MySQL at scale, because the intuitive answer — “more connections means more throughput” — is almost exactly backwards. After two decades of operating MySQL fleets for payments platforms, ad-tech firms, and a few household-name consumer apps, we have come to treat the connection layer as a first-class architectural concern, not an afterthought you configure once and forget. Here is how we think about it at MinervaDB, and the practices that have held up under real production load.

Why a MySQL connection is expensive

To pool connections well, you have to understand what a connection actually costs. MySQL uses a thread-per-connection model by default: each client connection is serviced by its own operating-system thread inside mysqld. That thread is not free. It reserves session-level buffers — sort_buffer_size, join_buffer_size, read_buffer_size, the net buffers, and more — and the math compounds fast. A configuration that looks reasonable per session can reserve gigabytes of memory once you multiply it across a few thousand connections. We have watched servers with plenty of RAM start swapping not because of the buffer pool, but because someone left sort_buffer_size at 256M and then opened four thousand connections.

Establishing the connection is expensive too. Every new connection pays for a TCP handshake, the authentication exchange, TLS negotiation if you are encrypting traffic (you should be), and the allocation of those per-thread structures. Under steady state that cost is invisible. Under a thundering herd — a deploy that restarts every application pod at once, or a downstream timeout that makes every worker retry simultaneously — the connection-establishment cost becomes the bottleneck, and MySQL spends its time doing handshakes instead of running your queries. This is the “connection storm,” and raising max_connections makes it worse, not better, because you have simply given the storm a larger door to stampede through.

The official MySQL documentation on connection handling is worth reading closely if you want the mechanics. The short version: connections are a finite, costly resource, and the job of a pool is to create a small number of them, keep them warm, and hand them out fast.

The counterintuitive truth about pool size

Understanding Connection Pooling

The single most common mistake we see is oversized pools. A team reasons that if 50 connections are good, 500 must be better, and they configure their application pool accordingly. Then they add a few more application instances, each with its own pool, and suddenly the database is fielding several thousand connections to serve a workload that needs a fraction of that.

The reason this hurts is that a database server can only do so much concurrent work. If your box has 16 cores and your queries are CPU-bound, roughly 16 of them can make progress at any instant; the rest are queued, context-switching, and contending for the same locks and latches. Beyond a certain point, adding connections increases contention and reduces total throughput while inflating tail latency. The HikariCP maintainers wrote the clearest explanation of this we have read, and we send to clients constantly: About Pool Sizing. The PostgreSQL community’s notes on connection counts reach the same conclusion from a different angle, and the physics applies equally to MySQL.

A reasonable starting formula, derived from Little’s Law, is:

pool_size = (core_count * 2) + effective_spindle_count

For a modern server on NVMe storage, the “spindle” term is small, so you often land near two to four connections per core as a starting point — not per application instance, but in aggregate against the database. That number shocks people. It should. The right pool is almost always smaller than your instinct, and the way to serve more users is to make each connection do its work faster and release it sooner, not to hoard more connections. Treat the formula as a hypothesis to load-test, not gospel; your real number depends on whether your workload is CPU-bound, lock-bound, or I/O-bound. But start small and grow only with evidence.

Two layers of pooling, and when to use each

At scale there are two places pooling happens, and mature architectures use both deliberately.

Application-side pools

This is the pool living inside your application process — HikariCP for the JVM, the built-in pools in Go’s database/sql, SQLAlchemy’s QueuePool in Python, and so on. It is the first line of defense and the one you control most directly. A few non-negotiable settings, using HikariCP names because they are the clearest:

maximumPoolSize   = 10        # per instance; size in aggregate, see below
minimumIdle       = 10        # keep it equal to max to avoid churn
connectionTimeout = 3000      # fail fast (ms), do not let callers hang
idleTimeout       = 600000    # 10 min
maxLifetime       = 1700000   # ~28 min; MUST be below MySQL wait_timeout
keepaliveTime     = 120000    # validate idle conns before MySQL reaps them

The setting people forget is maxLifetime. MySQL will close a connection that has been idle longer than wait_timeout (default 28800 seconds, though you should lower it). If your pool hands out a connection MySQL has already silently killed, the application gets a confusing error on a query that looks perfectly valid. Always set maxLifetime comfortably below MySQL’s wait_timeout so the pool retires connections on its own terms rather than discovering they are dead. We have debugged more than one “intermittent, unreproducible” production error that turned out to be exactly this mismatch.

The aggregate point deserves emphasis. If you run 40 application instances each with maximumPoolSize = 50, you have told MySQL to expect 2,000 connections. Size the pool against the total the database can handle and divide by your instance count. In an autoscaling or serverless world this is genuinely hard — which is the whole reason the second layer exists.

Proxy-side pooling and multiplexing

When you cannot control the number of clients — serverless functions that each open a connection, dozens of microservices, an autoscaling fleet that triples during peak — you put a pooling proxy between the application and MySQL. In our practice the workhorse is ProxySQL, which maintains a small, warm pool of backend connections and multiplexes many frontend client connections onto them. Ten thousand idle application connections can ride on a few hundred backend connections, because most of those clients are not actually executing a query at any given moment.

Multiplexing is powerful and it has sharp edges. ProxySQL can only safely reuse a backend connection for another client when the connection carries no session state that would leak. The moment a client opens a transaction, creates a temporary table, sets a user variable, uses LAST_INSERT_ID(), or issues certain SET statements, ProxySQL must pin that backend connection to that client until the state clears. Applications that wrap everything in transactions or lean on session variables defeat multiplexing without realizing it, and then wonder why the proxy is opening far more backend connections than expected. Audit your application’s use of session state before you count on multiplexing to save you. On managed infrastructure, Amazon RDS Proxy offers a similar capability with the same caveats around “pinning.”

Tuning MySQL to cooperate with your pools

Pooling is a two-sided contract. The server has to be configured to match.

Set max_connections to a number you have actually validated against the server’s memory, not an aspirational ceiling. The honest way to size it is to measure your real per-connection memory footprint under load and leave generous headroom for the InnoDB buffer pool and the OS. On cloud instances this matters even more, because RDS and Aurora derive a default max_connections from instance memory, and that default is often higher than the box can comfortably serve once your session buffers are accounted for.

Lower wait_timeout and interactive_timeout from the eight-hour default to something sane — a few minutes for most web workloads — so that leaked or abandoned connections get reclaimed instead of squatting. Make sure your pool’s maxLifetime stays below whatever you choose. Raise back_log if you expect bursty connection arrival, so the OS queues incoming connection requests rather than refusing them during a spike.

For workloads with very high connection counts and short queries, consider a thread pool instead of thread-per-connection. The thread pool in MySQL Enterprise, and the equivalent in Percona Server, bounds the number of active OS threads regardless of how many connections exist, which protects the server from collapsing under thousands of mostly-idle sessions. It is not a substitute for a connection pool — it solves a different problem — but in high-fan-out architectures the two work well together. A representative server-side baseline:

[mysqld]
max_connections      = 800
wait_timeout         = 300
interactive_timeout  = 300
back_log             = 1000
# size session buffers conservatively; they are PER connection
sort_buffer_size     = 2M
join_buffer_size     = 1M
# give the buffer pool the memory, not a thousand idle threads
innodb_buffer_pool_size = 24G

The anti-patterns that cause real outages

A handful of mistakes account for most of the connection-related incidents we are called in to clean up.

Opening a connection per request. Frameworks and ORMs sometimes make this trivially easy and catastrophically slow. If your code opens and closes a raw connection on every HTTP request, you are paying the full handshake cost constantly and you have no pool at all. Use the framework’s pool, configure it deliberately, and reuse.

Serverless without a proxy. Lambda-style functions that connect directly to MySQL are a connection-storm generator: each concurrent invocation opens its own connection, and a traffic spike becomes a connection spike one-for-one. This is precisely the case a pooling proxy exists to absorb. Put RDS Proxy or ProxySQL in front, always.

Idle-in-transaction connections. A connection that has begun a transaction and then gone to lunch holds locks and pins a backend connection in any multiplexing setup. Watch for this explicitly; it is a frequent cause of mysterious lock waits and exhausted pools.

Forgetting the read replicas. Teams carefully size the pool to the primary and forget that every replica also has its own connection budget. If you route reads to replicas, each replica needs its own pool sizing and its own monitoring.

What to monitor, and what the numbers mean

You cannot manage what you do not measure, and connection health has a small set of metrics that tell you almost everything. From SHOW GLOBAL STATUS:

  • Threads_connected — how many connections exist right now. Trend it; a steady climb that never recedes is a leak.
  • Threads_running — how many are actually executing. This is the number that matters for load. If Threads_connected is 800 but Threads_running is 12, your pools are oversized and mostly idle.
  • Max_used_connections — the high-water mark since startup. Compare it to max_connections to see how close you came to the ceiling.
  • Aborted_connects and Connection_errors_max_connections — non-zero and rising means clients are being refused. This is your early warning.

On the application side, your pool exposes its own telemetry — active connections, idle connections, and, most importantly, the time callers spend waiting to acquire a connection from the pool. That wait time is the single best signal that your pool is too small for the offered load, or that connections are being held too long by slow queries. When pool-acquisition latency climbs, the fix is usually faster queries, not a bigger pool.

A pragmatic checklist

If you take nothing else from this, take the working checklist we apply on every MySQL performance audit:

  • Size pools in aggregate against what the database can serve, starting near two to four connections per core, and grow only with load-test evidence.
  • Set maxLifetime (or your pool’s equivalent) safely below MySQL’s wait_timeout.
  • Fail fast on acquisition with a short connectionTimeout so a stalled database does not turn into stalled application threads.
  • Put a pooling proxy in front of any autoscaling or serverless fleet, and audit your session-state usage so multiplexing actually works.
  • Lower the server timeouts, size max_connections to measured memory, and consider a thread pool for high-fan-out workloads.
  • Monitor Threads_running, pool wait time, and connection errors — not just Threads_connected.

Connection pooling rarely gets the glory that query optimization does, but at scale it is just as decisive. The teams that run MySQL calmly are not the ones with the biggest max_connections; they are the ones who treat connections as the scarce, costly resource they are and engineer the layer accordingly.

MinervaDB is a vendor-neutral data engineering and operations firm. We run 24×7 MySQL support and consulting for organizations from high-growth startups to the Fortune 500, and connection-layer architecture is one of the first things we assess. If your MySQL fleet is fighting connection storms or you are unsure whether your pools are sized correctly, book a consultation or explore our remote DBA services.

About MinervaDB Corporation 277 Articles
Full-stack Database Infrastructure Architecture, Engineering and Operations Consultative Support(24*7) Provider for PostgreSQL, MySQL, MariaDB, MongoDB, ClickHouse, Trino, SQL Server, Cassandra, CockroachDB, Yugabyte, Couchbase, Redis, Valkey, NoSQL, NewSQL, SAP HANA, Databricks, Amazon Resdhift, Amazon Aurora, CloudSQL, Snowflake and AzureSQL with core expertize in Performance, Scalability, High Availability, Database Reliability Engineering, Database Upgrades/Migration, and Data Security.