Azure SQL Caching Strategies for Sub-Millisecond Response Times

Azure SQL · Performance Engineering

Azure SQL Caching Strategies for Sub-Millisecond Response Times

Sub-millisecond response on Azure SQL is not a single feature you switch on — it is a layered caching architecture you engineer deliberately. This is a senior DBA’s guide to the caching strategies that take Azure SQL Database from “fast enough” to genuinely sub-millisecond, with the T-SQL and configuration to apply each one.

MDB MinervaDB Azure SQL Engineering Team 

A query that reads from memory answers in microseconds; a query that waits on remote storage answers in milliseconds. Sub-millisecond Azure SQL performance is, almost entirely, the art of making sure the data your hot path needs is already in memory when it asks for it.

Caching is not one mechanism in Azure SQL Database — it is a hierarchy of them, from the application tier down to the storage engine. At the top sits the application cache, where a value never touches the database at all. Below it lies In-Memory OLTP, where memory-optimized tables eliminate latching and locking entirely. Beneath that, the buffer pool keeps recently used pages resident, and only when all of these miss does a request fall through to local SSD and finally remote storage. Each layer is roughly an order of magnitude slower than the one above it, so the entire discipline of low-latency engineering is the discipline of serving as many requests as possible from the highest layer.

This guide works down that hierarchy strategy by strategy, with the configuration and T-SQL to implement each. The figures used throughout are representative and illustrative — your latencies depend on tier, hardware, schema, and workload — and the method, as on every MinervaDB SQL Server and Azure SQL performance audit, is to measure the cache hit ratios and wait statistics, change one layer, and measure again. Sub-millisecond response is earned one cache layer at a time.

It is worth being explicit about why this matters beyond the engineering. Sub-millisecond latency is increasingly a product requirement rather than a nicety: it is what makes real-time personalization, fraud scoring, gaming leaderboards, and interactive analytics feel instant, and it is frequently written into customer-facing service-level agreements. At the same time, the naive route to low latency — simply provisioning a larger, more expensive Azure SQL tier — inflates cloud spend without addressing the root cause. A well-engineered caching hierarchy delivers the latency the business needs at a fraction of the cost of brute-forcing it with hardware, which is precisely why caching strategy is one of the highest-return investments a data team can make.

Understanding Azure SQL Caching Strategies

The Azure SQL caching hierarchy

Before the strategies, fix this mental model. Every read resolves at one of these layers, and the layer it reaches determines its latency. The goal of every strategy that follows is to push more of your traffic toward the top.

Application cache — Azure Cache for Redis
The request never reaches the database
< 1 ms
In-Memory OLTP — memory-optimized tables
Lock- and latch-free access in the engine
~ sub-ms
Buffer pool — cached data pages
Recently used pages held in memory
~ 1–2 ms
Local SSD — Business Critical tier
Fast local storage on a cache miss
~ low ms
Remote storage — General Purpose tier
The slowest path; every miss lands here
~ 5–10+ ms

Representative latency ranges to illustrate the hierarchy; actual figures depend on service tier, workload, and configuration.

1The buffer pool

Right-size the buffer pool so the working set stays in memory

The buffer pool is Azure SQL’s foundational cache: pages read from storage are held in memory so subsequent reads avoid I/O entirely. Sub-millisecond response begins with ensuring your hot working set — the data and index pages your frequent queries actually touch — fits comfortably in the memory your service tier provides. When it does not, the engine evicts pages under pressure and the same query oscillates between a fast in-memory read and a slow storage read, producing exactly the inconsistent latency that breaks an SLA.

The levers are tier and design. The Business Critical and Premium tiers provide more memory and local SSD than General Purpose, which is why latency-critical workloads belong there; the choice of tier is documented across the Azure SQL vCore service tiers. On the design side, narrower rows and well-chosen indexes mean more of the working set fits in the same memory. Monitor the buffer cache hit ratio and Page Life Expectancy, and treat a falling PLE as the early warning that the working set no longer fits.

There is an important interaction with every other strategy in this guide: anything that reduces the on-disk size of your hot data increases how much of it the buffer pool can hold. Data compression, columnstore, narrow data types, and the removal of unused indexes all act as force multipliers on the buffer pool, raising the hit ratio without buying a single gigabyte of additional memory. This is why a tuning engagement that begins with schema and index design so often improves cache residency as a side effect — reading less data and caching more of it are two sides of the same coin.

T-SQL
-- Buffer cache hit ratio and Page Life Expectancy
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN
  ('Buffer cache hit ratio', 'Page life expectancy');

-- Which objects dominate the buffer pool right now?
SELECT TOP 10 obj.name, COUNT(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors bd
JOIN sys.allocation_units au ON bd.allocation_unit_id = au.allocation_unit_id
JOIN sys.partitions p ON au.container_id = p.hobt_id
JOIN sys.objects obj ON p.object_id = obj.object_id
GROUP BY obj.name ORDER BY cached_pages DESC;
2In-Memory OLTP

Adopt In-Memory OLTP for the hottest tables

When the buffer pool is not enough — when even an in-memory page read carries too much latching and locking overhead for an extreme-throughput hot path — In-Memory OLTP is the Azure SQL answer. Memory-optimized tables live entirely in memory in a lock-free, latch-free structure, using optimistic multi-version concurrency control instead of locks. For session state, hot counters, ingestion buffers, and other high-contention OLTP patterns, the result is genuinely sub-millisecond access with dramatically higher throughput than disk-based tables.

In-Memory OLTP is available on the Premium and Business Critical tiers, and adopting it is a deliberate design choice rather than a switch: you create a memory-optimized table type, size the data carefully against your memory budget, and choose durability. The capability and its constraints are described in the Microsoft documentation on In-Memory OLTP. Reserve it for the genuinely hot tables that justify the memory; it is a precision tool, not a default.

T-SQL
-- A memory-optimized table for an extreme-throughput hot path
CREATE TABLE dbo.SessionState
(
    session_id  UNIQUEIDENTIFIER NOT NULL,
    payload     VARBINARY(8000)  NOT NULL,
    updated_at  DATETIME2      NOT NULL,
    CONSTRAINT PK_Session PRIMARY KEY NONCLUSTERED HASH (session_id)
        WITH (BUCKET_COUNT = 1048576)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
MinervaDB Insight

Hash indexes on memory-optimized tables demand a correctly sized BUCKET_COUNT — roughly the number of distinct keys. Too few buckets causes collisions and serializes access, quietly erasing the benefit. We size this from real cardinality during every performance audit.

3Native compilation

Natively compile the hottest stored procedures

Memory-optimized tables remove the storage and locking overhead; natively compiled stored procedures remove the interpretation overhead. A conventional T-SQL procedure is interpreted statement by statement at execution time, whereas a natively compiled procedure is translated into machine code when it is created, collapsing the per-statement cost to almost nothing. For short, high-frequency transactions against memory-optimized tables, the combination is what delivers consistent sub-millisecond execution at scale.

The trade-off is flexibility: natively compiled procedures support a focused subset of T-SQL and are best suited to the tight, well-defined hot paths that run millions of times rather than to sprawling business logic. Identify the handful of procedures that dominate your transaction volume and compile those, leaving the rest interpreted. The feature is detailed in the documentation on natively compiled stored procedures.

T-SQL
-- Native compilation for a tight, high-frequency transaction
CREATE PROCEDURE dbo.UpsertSession
    @id UNIQUEIDENTIFIER, @payload VARBINARY(8000)
WITH NATIVE_COMPILATION, SCHEMABINDING
AS BEGIN ATOMIC WITH
  (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    UPDATE dbo.SessionState SET payload = @payload, updated_at = SYSUTCDATETIME()
      WHERE session_id = @id;
END;
4Application tier

Cache at the application tier with Azure Cache for Redis

The fastest database query is the one never issued. The top of the caching hierarchy lives outside the database entirely: an application-tier cache that serves frequent reads before they ever reach Azure SQL. Azure Cache for Redis is the managed, in-memory store designed for exactly this, delivering microsecond-to-low-millisecond reads and removing load from the database so the queries that must reach it run faster too.

The dominant pattern is cache-aside: the application checks Redis first, and on a miss reads from Azure SQL and populates the cache with a sensible expiry. The engineering discipline is in invalidation — deciding how and when cached values expire so users never see stale data — and in choosing what to cache, which should be the read-heavy, slowly changing, high-traffic data that yields the largest hit ratio. Done well, the application cache absorbs the majority of reads, and the database is left to do only the work that genuinely requires it.

Two design decisions determine whether the cache helps or hurts. The first is the eviction and expiry policy: too aggressive and the hit ratio collapses, too lax and users see stale data, so the policy must reflect how tolerant each piece of data is of staleness. The second is the failure mode — the application must degrade gracefully to the database when the cache is unavailable, never treating a cache outage as a data outage. Treat the cache as an accelerator that the system can live without, size it so the working set of hot keys fits in memory, and monitor its hit ratio as closely as you monitor the database itself.

5Analytical caching

Use columnstore and batch mode for analytical hot paths

Caching is not only for point lookups. When the hot path is analytical — aggregations and scans over large tables for dashboards and reporting — the right cache is a columnstore index. Columnstore stores data by column rather than by row, compresses it heavily so far more of it fits in memory, and engages batch-mode execution that processes rows in vectorized chunks. The combination routinely delivers order-of-magnitude gains on analytical queries while shrinking the memory footprint, which in turn improves cache residency.

For hybrid workloads that serve both transactions and analytics from the same data, a nonclustered columnstore index over a rowstore table gives the analytical queries their columnar cache without disturbing the OLTP path. The compression effect is itself a caching strategy: by reducing the bytes each query must read, columnstore increases the share of the working set that stays resident in memory, pushing more reads up the hierarchy.

T-SQL
-- Columnstore for analytical queries over a large table
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders
  ON dbo.Orders (order_date, region, product_id, amount);

-- Confirm batch-mode execution in the actual plan
SET STATISTICS XML ON;
6Read replicas

Serve cached reads from read-only replicas

Sub-millisecond response under load requires that read traffic does not contend with writes for the same memory and CPU. Azure SQL’s read scale-out capability, available on the Business Critical and Hyperscale tiers, routes read-only workloads to replicas that each maintain their own warm buffer pool. Directing reporting, dashboard, and analytical queries to a replica leaves the primary’s cache dedicated to the write path, so both improve.

Routing is a connection-string change — set the application intent to read-only — which makes this one of the highest-leverage, lowest-effort caching strategies available. Each replica becomes an independent cache of the data, multiplying the effective memory available to serve reads. For read-heavy applications, distributing reads across replicas is often the difference between a primary that thrashes its cache under mixed load and a fleet that keeps every tier of the hierarchy warm.

Connection string
-- Route read-only workloads to a replica's warm cache
Server=tcp:myserver.database.windows.net;Database=mydb;
ApplicationIntent=ReadOnly;
Authentication=Active Directory Default;
7Pre-computation

Pre-compute results with indexed views

Some expensive results are worth computing once and caching in the database itself. An indexed view — a view with a clustered index — materializes its result set on disk and in the buffer pool, and the engine maintains it automatically as the underlying data changes. For aggregations and joins that the same dashboard runs thousands of times, an indexed view turns a repeated scan-and-aggregate into a fast read of a small, pre-computed, cached result.

The cost is write-time maintenance, so indexed views suit read-heavy data where the aggregate is queried far more often than the base rows change. Used judiciously on the right high-frequency aggregates, they are a powerful in-database cache that the optimizer can even apply automatically to queries that did not name the view. The requirements and rules are covered in the documentation on creating indexed views.

8Plan stability

Stabilize the plan cache with Query Store

The procedure and plan cache is a caching layer in its own right: Azure SQL compiles an execution plan once and reuses it, so a stable, good plan is itself a form of caching that avoids recompilation and guarantees consistent latency. The risk is plan instability — parameter-sensitive plans and recompilations that swap a fast plan for a slow one with no code change, producing the intermittent latency spikes that are hardest to diagnose. Sub-millisecond targets cannot tolerate that variance.

Query Store, enabled by default on Azure SQL Database, records plans and runtime statistics over time and lets you force a known-good plan when the optimizer drifts. It is the mechanism that keeps the plan cache trustworthy: identify the regressed query, compare the plans, and pin the one that meets your latency target. Pair this with sound parameterization so the engine caches and reuses plans rather than compiling a new one for every literal value.

T-SQL
-- Find regressed queries, then force the good plan
SELECT q.query_id, p.plan_id, rs.avg_duration, rs.count_executions
FROM sys.query_store_runtime_stats rs
JOIN sys.query_store_plan p ON p.plan_id = rs.plan_id
JOIN sys.query_store_query q ON q.query_id = p.query_id
ORDER BY rs.avg_duration DESC;

EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;

How to prove you reached sub-millisecond

Every strategy here ends with the same instruction: measure. Sub-millisecond is a claim you must substantiate with the distribution, not the average, because the average hides the cache misses that define the user experience. Capture latency percentiles — p95 and p99 — for the hot path, and pair them with the cache-effectiveness metrics that explain them: buffer cache hit ratio, Page Life Expectancy, and, where you run an application cache, the Redis hit ratio.

Read the wait statistics to confirm where time goes. A workload dominated by PAGEIOLATCH waits is missing the buffer-pool cache and reading from storage; one dominated by lock waits is a concurrency problem that In-Memory OLTP or snapshot isolation addresses. The Azure SQL platform exposes these through the dynamic management views and platform metrics, and Query Store records the per-query latency trend. Capture these before and after each change so that “sub-millisecond” is a measured fact in your telemetry rather than an aspiration in a slide.

Make cache-hit ratio a first-class, continuously tracked metric rather than something you check during an incident. Establish a baseline for the buffer cache hit ratio, Page Life Expectancy, and any application-cache hit ratio during normal operation, and alert when they degrade, because a falling hit ratio is the leading indicator of latency trouble that has not yet become visible to users. Tracking these alongside p99 latency turns caching from a one-time tuning exercise into an ongoing operational discipline — exactly the telemetry our engineers establish at the start of an engagement and hand over as a living dashboard, so the gains are sustained long after the work is done.

T-SQL
-- Where is the time actually going?
SELECT TOP 15 wait_type, wait_time_ms,
       100.0 * wait_time_ms / SUM(wait_time_ms) OVER () AS pct
FROM sys.dm_db_wait_stats   -- database-scoped on Azure SQL
ORDER BY wait_time_ms DESC;

The caching strategy in one view

Eight strategies, one principle: serve every request from the highest cache layer you can.

  • Start with the buffer pool. Fit the working set in memory; the right tier and narrow design keep it resident.
  • Escalate to In-Memory OLTP and native compilation for the genuinely hot, high-contention paths that justify the memory.
  • Cache outside the database with Azure Cache for Redis so the busiest reads never reach Azure SQL at all.
  • Cache analytics with columnstore and offload reads to replicas so the primary’s cache serves the write path.
  • Pre-compute with indexed views and keep the plan cache stable with Query Store to remove latency variance.
  • Prove it with percentiles and hit ratios. Sub-millisecond is a measured distribution, not an average.

Frequently asked questions

Can Azure SQL Database really deliver sub-millisecond response times?

Yes, for the right workload and design. Reads served from an application cache or from In-Memory OLTP memory-optimized tables routinely complete in well under a millisecond, and buffer-pool hits are typically a small number of milliseconds. The key is engineering the caching hierarchy so the hot path resolves at the highest layer — sub-millisecond is achievable for hot OLTP and cached reads, not as a blanket guarantee for every query.

What is the single most impactful Azure SQL caching strategy?

It depends on the workload, which is why measurement comes first. For read-heavy applications, an application-tier cache such as Azure Cache for Redis usually yields the largest gain because it removes traffic from the database entirely. For high-contention OLTP, In-Memory OLTP is transformative. For most databases, simply ensuring the working set fits in the buffer pool — the right tier and a lean schema — delivers the broadest improvement.

Which Azure SQL service tier do I need for low-latency caching?

The Business Critical and Premium tiers provide more memory and fast local SSD, and they are required for In-Memory OLTP and for read scale-out replicas, so latency-critical workloads generally belong there. General Purpose uses remote storage and is more cost-effective for workloads that tolerate higher latency. The right tier is a function of how much of your working set must stay in memory and your latency SLA.

When should I use In-Memory OLTP versus Azure Cache for Redis?

They sit at different layers and often complement each other. Azure Cache for Redis serves reads at the application tier so they never reach the database, ideal for read-heavy, slowly changing data. In-Memory OLTP accelerates write-heavy, high-contention transactions inside the database. Read-heavy patterns favor Redis; transaction-heavy hot tables favor In-Memory OLTP; many low-latency systems use both.

How do I monitor cache effectiveness on Azure SQL?

Track the buffer cache hit ratio and Page Life Expectancy through the performance-counter DMVs, watch wait statistics for storage-related waits that indicate cache misses, and use Query Store for per-query latency trends. Where you run an application cache, monitor its hit ratio too. Report latency as percentiles rather than averages so that cache misses are visible rather than hidden in the mean.

When should we engage a specialist for Azure SQL performance?

When latency targets are at risk, when costs are climbing because the workload is over-provisioned to compensate for poor cache design, or when a migration to Azure SQL has not delivered the expected performance. MinervaDB provides Azure SQL and SQL Server performance audits, cloud DBA services, and 24×7 Remote DBA operations for teams that want a senior bench without the in-house headcount.

M

MinervaDB Engineering Team

MinervaDB Inc. is a vendor-neutral database engineering firm delivering consulting, performance engineering, and 24×7 Remote DBA operations across Azure SQL, Microsoft SQL Server, PostgreSQL, MySQL, MongoDB, and the wider data landscape. Read more on the MinervaDB blog or explore cloud DBA services.

MinervaDB · Azure SQL Performance Engineering

Chasing sub-millisecond on Azure SQL?

MinervaDB engineers design and tune the full Azure SQL caching hierarchy — buffer pool, In-Memory OLTP, Redis, columnstore, and replicas — to hit your latency and cost targets, verified against your production telemetry. A thirty-minute conversation is enough to scope the work.

About MinervaDB Corporation 276 Articles
Full-stack Database Infrastructure Architecture, Engineering and Operations Consultative Support(24*7) Provider for PostgreSQL, MySQL, MariaDB, MongoDB, ClickHouse, Trino, SQL Server, Cassandra, CockroachDB, Yugabyte, Couchbase, Redis, Valkey, NoSQL, NewSQL, SAP HANA, Databricks, Amazon Resdhift, Amazon Aurora, CloudSQL, Snowflake and AzureSQL with core expertize in Performance, Scalability, High Availability, Database Reliability Engineering, Database Upgrades/Migration, and Data Security.