Rolling your own fast matrix multiplication: loop order and …

Daniel Lemire's blog

If you must multiply matrices, you should use dedicated libraries. However, we sometimes need to roll our own code. In C++, you can quickly write your own Matrix template:

#include <cstddef> // size_t
#include <memory>  // std::unique_ptr

template <typename T>
struct Matrix {
  // Value-initialize the buffer so that the accumulation (+=) in the
  // multiplication routines below starts from zero.
  Matrix(size_t rows, size_t cols)
      : data(new T[rows * cols]()), rows(rows), cols(cols) {}

  // Row-major storage: element (i, j) lives at offset i * cols + j.
  T &operator()(size_t i, size_t j) { return data.get()[i * cols + j]; }

  const T &operator()(size_t i, size_t j) const {
    return data.get()[i * cols + j];
  }

  std::unique_ptr<T[]> data;
  size_t rows;
  size_t cols;
};
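
For instance, here is a minimal usage sketch of this template:

#include <iostream>

int main() {
  Matrix<double> m(2, 2); // entries start out zeroed
  m(0, 0) = 1.0;
  m(1, 1) = 2.0;
  std::cout << m(0, 0) + m(1, 1) << std::endl; // prints 3
  return 0;
}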

How do you implement a matrix multiplication? A matrix multiplication is a sequence of three nested loops. If you do not want to get fancy, you therefore have six possibilities:

template <typename T>
void multiply_ikj(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_ijk(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t i = 0; i < a.rows; i++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kij(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t j = 0; j < b.cols; j++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_kji(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t k = 0; k < a.cols; k++) {
    for (size_t j = 0; j < b.cols; j++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jki(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t k = 0; k < a.cols; k++) {
      for (size_t i = 0; i < a.rows; i++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

template <typename T>
void multiply_jik(const Matrix<T> &a, const Matrix<T> &b, Matrix<T> &c) {
  for (size_t j = 0; j < b.cols; j++) {
    for (size_t i = 0; i < a.rows; i++) {
      for (size_t k = 0; k < a.cols; k++) {
        c(i, j) += a(i, k) * b(k, j);
      }
    }
  }
}

If you use an optimizing compiler and you tell it to compile specifically for your processor, you should get some fast code, at least in some instances. Which order is best?
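
With GCC or Clang, targeting the host processor typically amounts to a command such as the following (the file name is just a placeholder):

g++ -O3 -march=native -o matmul matmul.cpp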

The exact result depends on your data type (double, float, int), on the size of the matrices, on your compiler and on your hardware. I wrote a benchmark where I use 100 by 100 matrices containing double values. I use GCC 12 (with full optimization, -O3) and an Intel Ice Lake processor. I tell the compiler to optimize for the exact processor I have, so I expect it to use advanced AVX-512 instructions when possible.
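
A minimal sketch of such a timing harness, using std::chrono (not the actual benchmark, and with an arbitrary way of filling the matrices), could look like this:

#include <chrono>
#include <iostream>

// Assumes the Matrix template and the multiply_* routines above.
int main() {
  const size_t n = 100;
  Matrix<double> a(n, n), b(n, n), c(n, n);
  for (size_t i = 0; i < n; i++) {
    for (size_t j = 0; j < n; j++) {
      a(i, j) = 1.0 * i + j;
      b(i, j) = 1.0 * i - j;
    }
  }
  auto start = std::chrono::steady_clock::now();
  multiply_ikj(a, b, c);
  auto stop = std::chrono::steady_clock::now();
  std::cout << "ikj: "
            << std::chrono::duration<double, std::micro>(stop - start).count()
            << " microseconds" << std::endl;
  return 0;
}

In practice, you would repeat the measurement many times, make sure the compiler cannot optimize the work away, and report the best or median run.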

The net result in my experiment is that the best orderings are ikj and ijk. That is, the textbook order (ijk) is one of the best. The worst ordering is jik.

If you were to compute the matrix multiplication naively by hand, you would need 100 times 100 times 100 multiplications, that is, 1 million multiplications and 1 million additions. Interestingly, the best orderings (ikj and ijk) need only roughly a quarter of a million instructions to load the data, do the multiplications and additions, and store the results. This is possible because AVX-512 instructions operate on 8 double values at a time, and a fused multiply-add instruction performs both the multiplication and the addition in a single step.
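
As a rough sanity check, here is the back-of-the-envelope arithmetic; it counts only the arithmetic instructions, not the loads and stores, so it gives a lower bound:

#include <cstdio>

int main() {
  const long n = 100;
  const long multiplications = n * n * n;  // 1,000,000 (and as many additions)
  const long doubles_per_register = 8;     // 512 bits / 64 bits per double
  // One AVX-512 fused multiply-add instruction covers 8 multiplications and 8 additions.
  const long fma_instructions = multiplications / doubles_per_register; // 125,000
  std::printf("scalar multiplications: %ld\n", multiplications);
  std::printf("AVX-512 FMA instructions needed (at least): %ld\n", fma_instructions);
  return 0;
}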
