Estimating your memory bandwidth

Daniel Lemire's blog

One of the limitations of a computer is its memory bandwidth. For the scope of this article, I define “memory bandwidth” as the maximal number of bytes you can bring from memory to the CPU per unit of time. E.g., if your system has 5 GB/s of bandwidth, you can read up to 5 GB from memory in one second.

To measure this memory bandwidth, I propose to read data sequentially. E.g., you may use a function that sums the byte values in a large array. It is not necessary to sum every byte value: you can skip some because the processor moves data in units of cache lines. I do not know of a system that uses cache lines smaller than 64 bytes, so reading one value every 64 bytes ought to be enough.

// Sums one byte every `skip` bytes over the index range [start, end).
uint64_t sum(const uint8_t *data,
    size_t start, size_t end, size_t skip) {
  uint64_t total = 0;
  for (size_t i = start; i < end; i += skip) {
    total += data[i];
  }
  return total;
}
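
To turn such a function into a bandwidth figure, one option is to time a single pass and divide the size of the array by the elapsed time. The following is a minimal sketch, not the benchmark used for the results in this post: the names strided_sum, Measurement and measure_single_thread are hypothetical, and it assumes that each touched cache line is transferred in full, so the whole array counts toward the volume.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Strided sum over the index range [start, end): one byte per `skip` bytes.
uint64_t strided_sum(const uint8_t *data, size_t start, size_t end,
                     size_t skip) {
  assert(skip > 0); // a zero stride would never terminate
  uint64_t total = 0;
  for (size_t i = start; i < end; i += skip) {
    total += data[i];
  }
  return total;
}

// Hypothetical result type: the checksum is returned so that the compiler
// cannot discard the loop as dead code.
struct Measurement {
  uint64_t checksum;
  double gigabytes_per_second;
};

// Times one pass over the buffer with a single thread.
// Assumption: every touched cache line is transferred in full, so the
// volume counted is buffer.size(), not the number of bytes summed.
Measurement measure_single_thread(const std::vector<uint8_t> &buffer,
                                  size_t cache_line = 64) {
  auto begin = std::chrono::steady_clock::now();
  uint64_t checksum =
      strided_sum(buffer.data(), 0, buffer.size(), cache_line);
  auto finish = std::chrono::steady_clock::now();
  double seconds = std::chrono::duration<double>(finish - begin).count();
  return {checksum, static_cast<double>(buffer.size()) / seconds / 1e9};
}
```

For the number to reflect memory rather than cache, the buffer should be much larger than the last-level cache, and it is probably wise to repeat the measurement several times and keep the best timing.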

A single thread may not be enough to maximize the bandwidth usage: your system surely has several cores. Thus we should use multiple threads. The following C++ code divides the input into consecutive segments and assigns one thread to each segment, dividing up the task as fairly as possible:

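// Each thread scans one consecutive segment. When data_volume is not a
// multiple of threads_count, the last few bytes are left unread.
std::vector<std::thread> threads;
size_t segment_length = data_volume / threads_count;
size_t cache_line = 64;
for (size_t i = 0; i < threads_count; i++) {
  // arguments: data, start index, end index (exclusive), stride
  threads.emplace_back(sum, data, segment_length*i,
       segment_length*(i+1), cache_line);
}
// wait for every thread to finish its segment
for (std::thread &t : threads) {
  t.join();
}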
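
A complete measurement also needs a timer around the threads and a way to keep each thread's result alive. Here is one possible sketch, under the same assumption that the whole array counts toward the volume transferred. The names ParallelMeasurement and measure_parallel are hypothetical and this is not the code behind the numbers reported here. Unlike the fragment above, it hands any leftover bytes to the last thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical result type for the sketch.
struct ParallelMeasurement {
  uint64_t checksum;
  double gigabytes_per_second;
};

// Splits the buffer into consecutive segments, scans each segment in its
// own thread with a stride of one cache line, and times the whole pass.
// The timing includes the cost of starting and joining the threads.
ParallelMeasurement measure_parallel(const std::vector<uint8_t> &buffer,
                                     size_t threads_count,
                                     size_t cache_line = 64) {
  assert(threads_count > 0 && cache_line > 0);
  std::vector<uint64_t> partial(threads_count, 0); // one slot per thread
  std::vector<std::thread> threads;
  threads.reserve(threads_count);
  const uint8_t *data = buffer.data();
  const size_t segment_length = buffer.size() / threads_count;

  auto begin = std::chrono::steady_clock::now();
  for (size_t t = 0; t < threads_count; t++) {
    size_t start = segment_length * t;
    // The last thread also takes the remainder of the division.
    size_t end = (t + 1 == threads_count) ? buffer.size()
                                          : start + segment_length;
    threads.emplace_back([data, start, end, cache_line, &partial, t] {
      uint64_t local = 0;
      for (size_t i = start; i < end; i += cache_line) {
        local += data[i];
      }
      partial[t] = local; // written once, after the scan is finished
    });
  }
  for (std::thread &th : threads) {
    th.join();
  }
  auto finish = std::chrono::steady_clock::now();

  uint64_t checksum = 0;
  for (uint64_t p : partial) {
    checksum += p;
  }
  double seconds = std::chrono::duration<double>(finish - begin).count();
  return {checksum, static_cast<double>(buffer.size()) / seconds / 1e9};
}
```

Calling measure_parallel on the same large buffer with 1, 2, 4, 8, ... threads and printing gigabytes_per_second gives a curve of bandwidth against thread count. Segments are not aligned to cache-line boundaries here, so a line that straddles two segments may be read by both threads; on a large array the effect should be negligible.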

I ran this code on a server with two Intel Ice Lake processors. I find that the more threads I use, the more bandwidth I am able to get, up to around 15 threads. I start out at 15 GB/s and I go up to over 130 GB/s. Once I reach about 20 threads, it is no longer possible to get more bandwidth out of the system. The system has a total of 64 cores, over two CPUs. My program does not pin threads to cores. I have transparent huge pages enabled by default on this Linux system.

My benchmark ought to make it easy for the processor to maximize bandwidth usage, so I would not expect more complicated software to hit a bandwidth limit with as few as 20 threads.

My source code is available.
