Matrix Multiplication Optimization

Abstract:

This example shows two ways of performing matrix multiplication: a simple one that gives moderate performance, and a slightly more complex version that achieves optimal performance using the e-gcc compiler. The optimized code is partially unrolled to allow the compiler to take advantage of the architecture's double-word load/store instructions and to avoid unnecessary pipeline stalls.
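
To make the difference between the two approaches concrete, here is a minimal sketch, not the code of this example itself. It assumes square, row-major N x N matrices, and the size N = 8 and the function names matmul_naive_sketch and matmul_unrolled_sketch are hypothetical, chosen only for illustration. The unrolled variant computes two adjacent output elements per inner iteration, so the two reads from b and the two writes to c touch neighbouring floats. A compiler targeting hardware with double-word loads and stores may be able to combine them, provided the data is suitably aligned. The two independent accumulators also give the scheduler more freedom to hide floating-point latency.

```c
#include <stddef.h>

/* Assumption: square N x N row-major matrices. N is a hypothetical size
   picked for illustration; it must be even for the unrolled variant. */
#define N 8

/* Naive variant: one multiply-add per inner iteration, computing c = a * b. */
void matmul_naive_sketch(const float *restrict a, const float *restrict b,
                         float *restrict c)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
    }
}

/* Partially unrolled variant: the j loop advances by 2, so each inner
   iteration updates two adjacent outputs c[i][j] and c[i][j+1]. */
void matmul_unrolled_sketch(const float *restrict a, const float *restrict b,
                            float *restrict c)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j += 2) {
            /* Two independent accumulators, one per output element. */
            float sum0 = 0.0f;
            float sum1 = 0.0f;
            for (int k = 0; k < N; k++) {
                float aik = a[i * N + k];        /* loaded once, used twice */
                sum0 += aik * b[k * N + j];      /* adjacent elements of b: */
                sum1 += aik * b[k * N + j + 1];  /* candidates for a paired load */
            }
            c[i * N + j]     = sum0;             /* adjacent stores to c */
            c[i * N + j + 1] = sum1;
        }
    }
}
```

Both functions accumulate over k in the same order, so for the same inputs they should produce the same matrix. Whether the compiler actually emits paired loads and stores depends on the target, the optimization flags, and data alignment, so it is worth checking the generated assembly.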

Naive Code:

unsigned matmul_naive(float * restrict a, float * restrict b, float * restrict c)
{
	int i, j, k;

	for (i=0; i

Optimized Code:

unsigned matmul(float * restrict aa, float * restrict bb, float * restrict cc)
{
    int i = 0;

	for (i=0; i

Compile Switches:
-Wall -O3 -std=c99 -mlong-calls -mfp-mode=round-nearest -ffp-contract=fast -funroll-loops