Measuring the size of the cache line empirically

Daniel Lemire's blog

Our computers do not read or write memory in units of bits or even bytes. Rather, memory is accessed in small blocks called “cache lines”. For a given system, the cache-line size is usually fixed and small (e.g., 16 to 256 bytes). All Intel/AMD x64 systems I have used rely on a 64-byte cache line. My current Apple laptop with its ARM-based M2 processor relies on a 128-byte cache line.
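The operating system can often report this number directly, which makes for a useful cross-check of any empirical measurement. Here is a minimal sketch (not from the original post), assuming a Linux system with glibc; on macOS you would query `sysctlbyname("hw.cachelinesize", ...)` instead:

```c
#include <stdio.h>
#include <unistd.h>

// Minimal sketch: ask the OS for the L1 data-cache line size.
// _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; other C libraries
// may not report it (sysconf then returns 0 or -1).
int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0) {
        printf("reported cache line: %ld bytes\n", line);
    } else {
        printf("cache-line size not reported by sysconf\n");
    }
    return 0;
}
```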

How can you measure the size of the cache line? I asked on Twitter/X and a few people (e.g., Peter Cawley, Will Bickford) pointed to false sharing. I will come back to false sharing in a future blog post; I do not want to rely on it here because it depends on parallelism. I want a simpler solution.

Many people (Robert Clausecker, Sam Westrick, Tomasz Kowalczewski, Vinoth Deivasigamani, Sergey Slotin and others) proposed using a ‘strided access’ benchmark.

I finally decided to try a strided copy: from a large array, I copy one byte every N bytes to another large array. To keep things ‘fair’, I repeat this process N times. Thus, irrespective of N, I do the same number of operations: I copy the same number of bytes. The problem should be largely “memory access bound” as long as N is not too small. I start N at 16.
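The post does not show the code inline, but the core loop can be sketched as follows; the function name and types are mine, and I assume the simplest reading of the description (copy one byte every N bytes, repeat the pass N times):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the strided copy described above (my own naming, not the
// original code): for a given stride, copy one byte every `stride`
// bytes, and repeat the whole pass `stride` times so that the total
// number of copied bytes does not depend on the stride.
void strided_copy(uint8_t *dst, const uint8_t *src, size_t size, size_t stride) {
    for (size_t rep = 0; rep < stride; rep++) {
        for (size_t i = 0; i < size; i += stride) {
            dst[i] = src[i];
        }
    }
}
```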

If N is larger than twice the cache line, then I can effectively skip one cache line out of two; since I do the same number of operations, I expect the speed to be greater. If N is smaller than the cache line, then every cache line must be accessed. Having a stride value just above the cache-line size should not be sufficient to see large gains, but you expect the speed to nearly double once you reach twice the size of the cache line.

Sadly, several other factors come into play on a modern system with such a benchmark. There is more than just the cache-line size as a variable! So we need to verify the model experimentally.

I wrote the benchmark in C, but the actual C compiler is unlikely to be relevant. The original code was in Go and I got the same results, but I switched to C to make sure I avoided Go-specific issues. Interestingly, ChatGPT converted the Go code to C code automatically for me, with just a couple of small mistakes. Mind you, it is deliberately simple code.

I run each experiment, for each stride size, 10 times, recording the maximum, the minimum and the average. I use my laptop with its Apple M2 processor and an Intel-based server. I do not require a particular memory alignment when allocating memory, and I make no other attempt to control the results.
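A measurement loop in the same spirit might look like the following sketch; the clock choice and the details of the harness are my assumptions, not the original code:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// The strided copy from the earlier sketch.
void strided_copy(uint8_t *dst, const uint8_t *src, size_t size, size_t stride);

// Wall-clock time in seconds (CLOCK_MONOTONIC is my choice here).
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Run the copy 10 times for a given stride and report the minimum,
// average and maximum elapsed time.
void measure(uint8_t *dst, const uint8_t *src, size_t size, size_t stride) {
    double min = 1e30, max = 0.0, sum = 0.0;
    for (int run = 0; run < 10; run++) {
        double start = now_seconds();
        strided_copy(dst, src, size, stride);
        double elapsed = now_seconds() - start;
        if (elapsed < min) min = elapsed;
        if (elapsed > max) max = elapsed;
        sum += elapsed;
    }
    printf("stride %zu: min %.3f s, avg %.3f s, max %.3f s\n",
           stride, min, sum / 10.0, max);
}
```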

The numbers on the server are quite nice, with hardly any difference between the average, the maximum and the minimum. If your stride is 129, you are 66% faster than when your stride is 64. This suggests that the cache-line size is 64 bytes. Observe how a stride that is a multiple of 64 (e.g., 128 or 256) is slightly unfortunate: we see a visible performance dip, likely due to a cache-aliasing issue.

The results on my laptop are much less clean, even though I kept the laptop otherwise unused during the benchmarking. In this instance, if your stride is 257, you are 2.3 times faster than when your stride is 128. It suggests that the cache-line size is 128 bytes. Just like on the Intel system, a stride of 128 is unfortunate: there is a visible performance dip.

Thus you can empirically measure the size of the cache line with a strided copy over a large array. As soon as you use a stride that is twice the cache line, you should be more than 50% faster.
