Optimal Opus

A multi-dimensional ADT from first principles

Or, digging into the guts of NumPy

Complex fractal chaos grid lock. Source: Pixy

The world of numerical computing has seen an unprecedented explosion of interest over the past decade. With it came million+ LOC codebases, professionally maintained, that help with everything from statistical analysis to machine learning and everything in between. It wasn’t always like that, however. The work to build this open source pyramid started with laying a single brick, and that brick’s name is NumPy.

No, you can’t just use nested lists

When engineering numerical programs, a nested linked list over numeric values should top your list of things to never ever implement (haha, get it?). There are two main reasons why nested lists/arrays/vectors won’t work:

  1. They’re incompatible with the type system and
  2. they’re absolutely horrible performance-wise

To showcase the first issue clearly, imagine we started off trying to tackle only the field of matrix algebra using 64-bit floats (i.e. 1-tensors and 2-tensors). You could use the following vocabulary types:

typedef struct {
  double *data;
  size_t len;
} vector;

typedef struct {
  double **data;
  size_t rows;
  size_t cols;
} matrix;

This looks reasonable at first, but now you’re stuck with a really annoying issue: matrix and vector aren’t related in the type system. This means that defining a single polymorphic function for an operation that is semantically compatible with both types isn’t possible1:

// No polymorphism: the types are unrelated even though the operation is semantically identical!

void
scalar_multiply_and_add_vector
(vector vec, double mul, double add) {
  double *ptr = vec.data;
  for (size_t idx = 0; idx < vec.len; idx++, ptr++)
    *ptr = (*ptr * mul) + add;
}

void
scalar_multiply_and_add_matrix
(matrix mat, double mul, double add) {
  double **ptr = mat.data;
  for (size_t row = 0; row < mat.rows; row++, ptr++) {
    double *_ptr = *ptr;
    for (size_t col = 0; col < mat.cols; col++, _ptr++) {
      *_ptr = (*_ptr * mul) + add;
    }
  }
}

This issue might be workable if all you ever need are vectors and matrices, but the duplication grows with every new dimension you wish to support! The other issue with this code is performance: we’ve introduced a double indirection of memory addresses with the matrix, and each new dimension adds yet another level of indirection. Each dereference is relatively cheap on its own, but chained in series inside nested for loops they ruin cache locality and drag down performance. For these reasons nested arrays are infeasible.

Defining a generic multi-dimensional array type

At their core, NumPy, Fortran, Julia, Tensorflow and others all provide a multi-dimensional array implementation under a single data type2. In order to do so, however, you have to separate the information relating to dimensionality from the data type holding the underlying data. If you only want to support homogeneous arrays (which covers almost every practical application of multi-dimensional structures), then you can pack all the values into a single, contiguous array.

In reality, most well-designed libraries support multiple execution contexts and parallel workloads, meaning not all the data is guaranteed to exist in a contiguous range of memory addresses or even on the same machine! Hadoop is a great example of this.

So if the information about the dimensions is not encoded into the data type containing the data, we need to store it adjacent to the underlying data. The state of a multi-dimensional index consists of the following:

  • The number of dimensions
  • The length of each dimension
  • The number of elements to skip in the flat array in order to increment a dimension’s index by 1 (i.e. the stride, for each dimension)

We can encapsulate this state like so:

typedef struct {
  size_t nd;                    /* number of dimensions                       */
  size_t size;                  /* total number of elements                   */
  size_t dimensions[MAX_DIMS];  /* length of each dimension                   */
  size_t strides[MAX_DIMS];     /* elements to skip per +1 step in each dim   */
} extents_t;

By defining a maximum for the number of dimensions we support (through a pre-processor C macro) we can avoid using the heap and keep everything related to indexing on the stack.
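To make the rest of this concrete, here is a purely illustrative example of filling in these extents for a 3×4 row-major matrix; the MAX_DIMS value of 8 and the name mat_extents are assumptions I’m making for the sketches that follow, not anything prescribed:

/* assuming, above the struct definition: #define MAX_DIMS 8 */

/* A 3x4 row-major matrix: stepping one column moves 1 element in the
   flat array, stepping one row skips a whole row of 4 elements. */
extents_t mat_extents = {
  .nd         = 2,
  .size       = 12,          /* 3 * 4 elements in total   */
  .dimensions = { 3, 4 },
  .strides    = { 4, 1 },    /* row stride, column stride */
};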

Defining a basic getter

In order to find the correct flattened location we need to use the strides we defined inside the struct to tell us how far we need to “jump” over other cells for each dimension. This means we need to accumulate the products of each dimension’s index with its stride (effectively a dot product between the index and the strides):

inline size_t
getter(extents_t extents, size_t *location)
{
  size_t _index = 0;
  size_t *_stride = extents.strides, *_loc = location;
  for (size_t _dim = 0; _dim < extents.nd; _dim++)
    _index += (*_loc++) * (*_stride++);
  return _index;
}
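As a quick sanity check (reusing the hypothetical mat_extents from the sketch above), the element at [1, 2] of a 3×4 row-major matrix should land at flat offset 1·4 + 2·1 = 6:

/* inside some function, with mat_extents in scope */
size_t loc[MAX_DIMS] = { 1, 2 };         /* multi-dimensional index */
size_t flat = getter(mat_extents, loc);  /* == 6                    */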

Slicing

So far we’ve defined a new vocabulary type called extents_t and a basic getter to access any single element using this multi-dimensional index. Looking at singular indices on their own can only take you so far; at some point, you need the ability to create iterators over your data.

One-dimensional slicing

We should start with the simple base case of a one-dimensional slice. These types of slices are trivial to implement because you only need to increment a single dimension. We can still write an inline-able helper function to demonstrate though:

/*
It is your responsibility to ensure that you don't run past the valid data range
*/
inline size_t *
increment_iterator(size_t *start, size_t dim)
{
  size_t *ptr = start;
  ptr[dim]++;
  return ptr;
}
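For instance, a sketch that walks across row 1 of the hypothetical 3×4 matrix from earlier, bumping only the column dimension on each step:

/* inside some function, with mat_extents in scope */
size_t cursor[MAX_DIMS] = { 1, 0 };           /* start of row 1                  */
for (size_t col = 0; col < 4; col++) {
  size_t flat = getter(mat_extents, cursor);  /* 4, 5, 6, 7 for a row-major 3x4  */
  /* ... read or write data[flat] here ... */
  increment_iterator(cursor, 1);              /* step the column dimension       */
}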

N-dimensional slicing: Why it’s hard

So 1D slicing was quite trivial as expected, but what about multi-dimensional slices? This is where the true dragon of the ADT implementation lies because there are multiple considerations to make as a library author:

  • The traversal order is no longer trivial. For example, for the slice [0:20, 0:20] should we visit [10, 0] before or after the index [0, 10]?
  • The state of the iterator is no longer trivially copyable because, in general, it is a variable-length array. That pushes us towards heap-allocating the iterator’s state, i.e. iterator construction may incur an allocation (and potentially a syscall).
  • We need an algorithm that can sequentially generate the Cartesian product of n sets.
  • We need to balance state management with performance: if we want to traverse 1 million elements it would be unwise to pre-generate every multi-dimensional index we will need.

Let’s start by working out what state we need to keep. Consider the slice [0:2, 0:2, 0:2]. If the data is column-major, then the sequential order of iteration we want to produce is

  1. [0,0,0]
  2. [1,0,0]
  3. [0,1,0]
  4. [1,1,0]
  5. [0,0,1]
  6. [1,0,1]
  7. [0,1,1]
  8. [1,1,1]

Notice that as we generate new values we need to reset the indices back to the start of the slice. This implies we need to store the start and end range for each dimension because we will need to revisit a given index of a dimension multiple times. Let’s define a vocabulary type for this information:

/*
vocabulary type for the set of all ordinals to iterate in a single dimension.
NOTE: the inclusions are as follows: start <= ordinal < stop
*/
typedef struct {
  size_t start, stop;
} slice_t;

In theory, an array of slice_t is all we need to store for generating the nth element in the sequence. However, if we don’t internally store pre-computed values like the strides and subset lengths, our iterator increment would have to recompute them on every call (work that scales with the number of dimensions), which is highly unwise as it will be called inside nested for loops. Hence we need to store the extents of the subset slice too. The resulting state now looks like this:

typedef struct {
  extents_t extents;
  slice_t slices[MAX_DIMS];
} mdspace_t;

Now we have a type that can fully define a multi-dimensional subspace of our initial space. If you sit down and think about it, you can also make subspaces of subspaces (just like in math!). This means we accidentally got composability by doing practically nothing other than being careful with our vocabulary.
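As a sketch of how such a subspace might be built (the helper name mdspace_create is mine for illustration, not part of any established API): the subset simply inherits the parent’s strides, while its dimension lengths shrink to the slice lengths.

/* Hypothetical constructor: derive a subspace from a parent extents_t and
   per-dimension slices. Assumes the slices were validated against the parent. */
mdspace_t
mdspace_create(extents_t parent, const slice_t *slices)
{
  mdspace_t sub;
  sub.extents.nd = parent.nd;
  sub.extents.size = 1;
  for (size_t _idx = 0; _idx < parent.nd; _idx++) {
    sub.slices[_idx] = slices[_idx];
    sub.extents.dimensions[_idx] = slices[_idx].stop - slices[_idx].start;
    sub.extents.strides[_idx] = parent.strides[_idx];  /* strides carry over */
    sub.extents.size *= sub.extents.dimensions[_idx];
  }
  return sub;
}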

Now we need to deduce a mapping from the sequential iteration index $k$ to some multi-dimensional index $\vec{\ell}$. We want this because it’s the natural isomorphism we will use to convert something we can sequentially step through inside a for loop back into a valid location in our multi-dimensional subspace.

Creating a (stateless) Mapping

The simplest observation we can make is that, with every new iteration of our sequential list, we are increasing the total sum of Cartesian indices by 1. This means that if we start our index at $\vec{b}$ and iterate $k$ times to the index $\vec{\ell}$ we know that

$$\sum_{i}(\ell_i - b_i) = k$$

So how do we want to partition the $k$ increments to the index? At this point we can actually pick any binning algorithm we want so long as it doesn’t increment a dimension past the maximum value defined inside the slice_t of that dimension. However, we will get the best performance if we pick the binning algorithm that ensures that we traverse the flat data array in a sequential order. The best way we can currently ensure this happens is by always filling the dimensions with the highest stride count first in a trickle-down fashion. So a next step could look something like this instead:

$$\ell_i - b_i = \left\lfloor \frac{k}{S_i} \right\rfloor$$

where $S_i$ is the stride of the $i$th dimension. This works but only for a very limited subset of the sequence. To see why let’s try iterating through the slice [0:2, 0:2] which is row-major:

  • [0, 0]
  • [0, 1]
  • [1, 2] (already out of bounds!)

uh-oh. What happened? We were supposed to see [1, 0] but it got skipped. It got skipped because once we had $k=2$ we should have promoted the smaller stride’s index by resetting it back to the start. Whenever mathematicians need some periodic/cyclic integer incrementing they will immediately think of the modulus operator. The main issue we were trying to solve was this promotion problem, so the most intuitive thing to take the modulus over would be the total number of elements in the slice’s dimension. So our newest attempt has now reached the following mapping:

$$\ell_i = b_i + \left( \left\lfloor \frac{k}{S_i} \right\rfloor \quad\texttt{modulo}\quad N_i \right)$$

where, for the $i$th dimension:

  • $N_i$: the number of elements (i.e. the cardinality of the dimension)
  • $S_i$: the stride (i.e. the number of elements in the flat array between successive increments of the dimension’s index)
  • $b_i$: the beginning of the range of indices
  • $\ell_i$: the $i$th component of the $k$th Cartesian index.

We can now create a straightforward implementation:

/*
Modifies ell in-place to compute the kth index in multi-dimensional form
*/
void
kth_index(size_t *ell, mdspace_t mdspace, size_t k)
{
  size_t _offset, _S, _N, _nd;
  _nd = mdspace.extents.nd;
  for (size_t _idx = 0; _idx < _nd; _idx++)
  {
    _S = mdspace.extents.strides[_idx];
    _N = mdspace.extents.dimensions[_idx];
    _offset = (k / _S) % _N;  /* integer division already floors */
    ell[_idx] = mdspace.slices[_idx].start + _offset;
  }
}
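A quick way to convince yourself the mapping works is to replay the [0:2, 0:2, 0:2] column-major example from earlier. The sketch below leans on the hypothetical mdspace_create helper from above and assumes a fully contiguous column-major parent (strides {1, 2, 4}):

#include <stdio.h>

int main(void)
{
  extents_t parent = {
    .nd = 3, .size = 8,
    .dimensions = { 2, 2, 2 },
    .strides    = { 1, 2, 4 },   /* column-major: the first dimension is densest */
  };
  slice_t all[3] = { {0, 2}, {0, 2}, {0, 2} };
  mdspace_t space = mdspace_create(parent, all);

  size_t ell[MAX_DIMS];
  for (size_t k = 0; k < space.extents.size; k++) {
    kth_index(ell, space, k);
    printf("%zu. [%zu, %zu, %zu]\n", k + 1, ell[0], ell[1], ell[2]);
  }
  /* prints the eight indices in exactly the column-major order listed earlier */
  return 0;
}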

Improving performance using stateful iteration

If you were observant about how we are sequentially iterating through the indices you may have noticed a potential algorithm for predicting the sequence of n-dimensional indices. The basic outline of the algorithm looks like this:

  1. Visit the dimension with the smallest stride
  2. If the index is equal to the maximum allowed value of the slice, reset it to the minimum value and visit the dimension with the next smallest stride with the intent to increment its index
  3. Repeat step 2 recursively until you reach a dimension whose index can be incremented without needing a reset.

However, if we want to implement this algorithm we will need to store a multi-dimensional cursor holding our current position. While we’re at it, we can also keep track of how many elements we’ve visited:

typedef struct {
  mdspace_t mdspace;       /* the subspace being traversed       */
  size_t iter[MAX_DIMS];   /* current multi-dimensional cursor   */
  size_t k;                /* number of elements visited so far  */
} mditer_t;

We also need a constructor to help manage the state:

mditer_t
mditer_create(mdspace_t mdspace) {
  mditer_t retval;
  for (size_t _idx = 0; _idx < mdspace.extents.nd; _idx++)
    retval.iter[_idx] = mdspace.slices[_idx].start;
  retval.k = 0;
  retval.mdspace = mdspace;
  return retval;
}

At this point we should implement the algorithm. To simplify the code, I’m not going to implement the generic version of the algorithm which requires sorting the dimensions by their stride count. Instead, I’m going to implement an optimized version which only works for column-major layouts:

void
mditer_next_f(mditer_t *mditer) {
  size_t nd, *iter_p;
  slice_t *slice_p;
  nd = mditer->mdspace.extents.nd;
  slice_p = mditer->mdspace.slices;
  iter_p = mditer->iter;
  for (size_t _idx = 0; _idx < nd; _idx++) {
    if ((*iter_p + 1) >= slice_p->stop) {
      *(iter_p++) = (slice_p++)->start;
    } else {
      (*iter_p)++;
      break;
    }
  }
  mditer->k++;
}

Implementing the row-major version (i.e. the C order), as well as a generic, slower version which doesn’t assume any order of traversal, is left as an exercise for the reader.

Kernel dispatching

It’s now time to reap the rewards. As a very simple exercise, we can write a dispatcher specialized to a multi-dimensional array of doubles and apply a kernel to each element of the subspace. This is trivial thanks to our iterator work in the slicing section:

// Make sure that mditer is pre-initialized with
// mditer_create!!
void
dispatch_kernel_double(
  double *data,
  mditer_t mditer,
  double (*kernel)(double))
{
  size_t _size = mditer.mdspace.extents.size;
  for (size_t _k = mditer.k; _k < _size; _k++)
  {
    size_t _loc = getter(mditer.mdspace.extents, mditer.iter);
    double x = data[_loc];
    data[_loc] = kernel(x);
    mditer_next_f(&mditer);
  }
}
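To put it all together, here is roughly what an end-to-end use might look like: squaring every element of a 2×2×2 column-major array by dispatching a tiny kernel over the full slice. The square kernel and the mdspace_create helper are again just illustrative assumptions:

static double square(double x) { return x * x; }

int main(void)
{
  double data[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };

  /* the same hypothetical 2x2x2 column-major space as in the earlier sketch */
  extents_t parent = {
    .nd = 3, .size = 8,
    .dimensions = { 2, 2, 2 },
    .strides    = { 1, 2, 4 },
  };
  slice_t all[3] = { {0, 2}, {0, 2}, {0, 2} };

  mditer_t it = mditer_create(mdspace_create(parent, all));
  dispatch_kernel_double(data, it, square);
  /* data is now { 0, 1, 4, 9, 16, 25, 36, 49 } */
  return 0;
}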

Further topics & Pitfalls

There’s lots of additional functionality that would be required to take an effort like this seriously; tools like NumPy, Tensorflow and others are written by entire dev teams funded by Fortune 500 conglomerates, so that’s no surprise. I did want to mention these additional features though, to give the reader a sense of what more can be done from here:

  • We can write tooling for equally splitting a subspace into multiple, non-overlapping subspaces as a way to parallelize the workload of traversing the data structure.
  • We can add support for multiple execution contexts and make our iterators compatible with accelerators like GPUs or Google’s TPUs.
  • We can add a Python FFI so that non-C programmers can make use of our iteration library
  • We can add more vocabulary to describe owning vs. non-owning types. This way, we can also hold a pointer to the underlying data without leaving its ownership ambiguous.
  • We can add wrapper functions to math libraries like BLAS, LAPACK, Intel MKL, FFTW and potentially others, which would allow our users to make use of these libraries by simply providing the subspace to iterate over rather than doing it themselves.
  • We can add another variable to our vocabulary type to signify whether the data is C-like contiguous, Fortran-like contiguous or not contiguous at all. This is already deducible from the strides variable but it’s slightly more cumbersome. This would allow us to avoid hard-coding an iterator stepping method and make our code more agnostic to layout.

For the purposes of this blog, however, I think what I’ve already shown meets the MVP bar, so for now I’ll put it to rest.

Footnotes

  1. at least natively - you could always cook up some crazy macros in C but that’s not exactly the point here

  2. Yes, I am aware of the sparse & ragged flavors of the core data types/operations these libraries/frameworks offer but no, I will not be discussing those in this blog post.