The Memory Speed Problem

Modern CPUs can execute billions of operations per second. Main RAM (DRAM), by contrast, has a latency of roughly 60–100 nanoseconds. At 3 GHz, that's 180–300 clock cycles wasted waiting for data. Without caching, your CPU would spend most of its time idle. Cache hierarchies exist to bridge this gap.

The Cache Hierarchy: L1, L2, and L3

Caches are small, fast memory banks built directly into the CPU die. They operate in a hierarchy, trading size for speed:

Level | Typical Size           | Latency      | Shared?
L1    | 32–64 KB per core      | ~4 cycles    | Per core
L2    | 256 KB – 1 MB per core | ~12 cycles   | Per core
L3    | 8 MB – 64 MB+          | ~40 cycles   | Shared across cores
RAM   | Gigabytes              | ~200+ cycles | Shared (system)

When the CPU needs data, it checks L1 first, then L2, then L3, then RAM. A cache hit at L1 is roughly 50x faster than fetching from RAM.

Cache Lines and Spatial Locality

Caches don't load individual bytes — they load cache lines, typically 64 bytes at a time. When you access one element of an array, the surrounding elements are loaded into cache automatically. This is called spatial locality, and it's why sequential array access is much faster than random access.

Consider iterating a 2D array in C:

// Cache-friendly: row-major order (sequential memory access)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += matrix[i][j];

// Cache-unfriendly: column-major order (jumps in memory)
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];

For large matrices, the row-major version can be dramatically faster because it accesses memory sequentially, maximizing cache line reuse.

Temporal Locality

The other key principle is temporal locality: data accessed recently is likely to be accessed again soon. Caches exploit this by keeping recently used lines resident. Algorithms that revisit the same data (like repeated passes over a small working set) naturally benefit.
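The "repeated passes over a small working set" pattern can be sketched as follows. The function name and sizes are illustrative; the point is that once the buffer fits in cache, every pass after the first is served from L1/L2 rather than RAM.

```c
#include <stddef.h>

/* Repeated passes over one small buffer exploit temporal locality:
 * the first pass pulls the data into cache, and later passes re-read
 * the same cache-resident lines. A 32 KB buffer (4096 doubles) fits
 * in a typical L1 data cache. */
double sum_passes(const double *buf, size_t n, int passes) {
    double sum = 0.0;
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
    return sum;
}
```

By contrast, making a single pass over a buffer many times larger than the last-level cache gets no such benefit: each line is evicted before it is ever reused.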

Cache Thrashing and Misses

A cache miss occurs when requested data isn't in any cache level and must be fetched from RAM. Frequent misses kill performance. Common causes:

  • Large data structures that don't fit in cache, accessed randomly (e.g., pointer-chasing through a linked list).
  • False sharing in multi-threaded code: two threads on different cores modify different data that happen to share a cache line, forcing constant invalidation.
  • Stride patterns that skip over data in ways that prevent prefetchers from working effectively.
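The standard fix for false sharing is to pad per-thread data out to the cache-line size. A minimal sketch, assuming the common 64-byte line (C itself guarantees nothing about line sizes; the figure is a convention on current x86 and many ARM parts):

```c
#include <stdalign.h>

/* One counter per thread. Without the alignment, adjacent longs would
 * share a 64-byte cache line, and writes from different cores would
 * bounce that line back and forth. alignas(64) pads each counter to
 * its own line, so the threads never invalidate each other. */
struct padded_counter {
    alignas(64) long value;  /* struct size rounds up to 64 bytes */
};

struct padded_counter counters[4];  /* e.g. one slot per worker thread */
```

The cost is wasted space (56 of every 64 bytes here), which is why this treatment is reserved for data that is actually written concurrently by different cores.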

Writing Cache-Friendly Code

  1. Prefer arrays over linked lists for sequential traversal. Array elements are contiguous in memory; linked list nodes are scattered.
  2. Use struct-of-arrays (SoA) over array-of-structs (AoS) when processing only a subset of fields — common in game engines and scientific computing.
  3. Keep hot data together. Fields accessed together should be in the same struct, near each other in memory.
  4. Loop tiling (blocking). Break large matrix operations into smaller tiles that fit in L1/L2 cache.
  5. Minimize pointer indirection. Each pointer dereference is a potential cache miss.
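Loop tiling (item 4) is worth a concrete sketch. A matrix transpose is a good example because the naive version is cache-friendly on the source but hostile on the destination; tiling keeps both sides resident. The sizes below are illustrative and assume N is a multiple of TILE.

```c
#define N 1024
#define TILE 32  /* 32 x 32 doubles = 8 KB per tile; two tiles fit
                    comfortably in a 32 KB L1 data cache */

static double A[N][N], B[N][N];

/* Tiled (blocked) transpose: work on TILE x TILE blocks so the source
 * rows and the destination columns both stay cache-resident while a
 * block is processed, instead of streaming the whole destination
 * column-by-column. */
static void transpose_tiled(double src[N][N], double dst[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

The same blocking idea applies to matrix multiplication and convolution: pick a tile size so the working set of the inner loops fits in L1 or L2, and tune it by measurement rather than assumption.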

How to Measure Cache Behavior

You don't have to guess — tools can show you exactly what's happening:

  • perf stat (Linux): reports cache misses, hits, and miss rates per run.
  • Valgrind Cachegrind: simulates cache behavior and annotates source code with miss counts.
  • Intel VTune / AMD uProf: detailed hardware performance counter analysis.

Conclusion

Cache behavior is often the difference between an algorithm that's theoretically fast and one that's fast in practice. Understanding how cache lines work, what causes misses, and how to structure data for sequential access gives you a lever that algorithmic complexity analysis alone doesn't capture. Profile first, optimize second — but know what you're optimizing for.