Parallel Iterators Part 1: Foundations
19 February 2016
Since giving a talk about Rayon at the Bay Area Rust meetup, I’ve been working off and on on the support for parallel iterators. The basic idea of a parallel iterator is that I should be able to take an existing iterator chain, which operates sequentially, and easily convert it to work in parallel. As a simple example, consider this bit of code that computes the dot-product of two vectors:
vec1.iter()
.zip(vec2.iter())
.map(|(i, j)| i * j)
.sum()
Using parallel iterators, all I have to do to make this run in parallel is change the iter calls into par_iter:
vec1.par_iter()
.zip(vec2.par_iter())
.map(|(i, j)| i * j)
.sum()
This new iterator chain is now using Rayon’s parallel iterators instead of the standard Rust ones. Of course, implementing this simple idea turns out to be rather complicated in practice. I’ve had to iterate on the design many times as I tried to add new combinators. I wanted to document the design, but it’s too much for just one blog post. Therefore, I’m writing up a little series of blog posts that cover the design in pieces:
- This post: sequential iterators. I realized while writing the other two posts that it would make sense to first describe sequential iterators in detail, so that I could better highlight where parallel iterators differ. This post therefore covers the iterator chain above and shows how it is implemented.
- Next post: parallel producers.
- Final post: parallel consumers.
Review: sequential iterators
Before we get to parallel iterators, let’s start by covering how Rust’s sequential iterators work. The basic idea is that iterators are lazy, in the sense that constructing an iterator chain does not actually do anything until you “execute” that iterator, either with a for loop or with a method like sum. In the example above, that means that the calls in the chain vec1.iter().zip(...).map(...) are all operations that just build up an iterator, without actually doing anything. Only when we call sum do we start actually doing work.
In sequential iterators, the key to this is
the Iterator
trait. This trait is actually very simple; it
basically contains two members of interest:
trait Iterator {
type Item; // The type of item we will produce
fn next(&mut self) -> Option<Self::Item>; // Request the next item
}
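To make this concrete, you can drive any iterator by hand just by calling next until it returns None. Here is a quick sketch using the real standard library types (the variable names are just for illustration):
let v = vec![10, 20];
let mut iter = v.iter();            // iterates over `&i32` references
assert_eq!(iter.next(), Some(&10));
assert_eq!(iter.next(), Some(&20));
assert_eq!(iter.next(), None);      // no more items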
The idea is that, for each collection, we have a method that will
return some kind of iterator type which implements this Iterator
trait. So let’s walk through all the pieces of our example iterator
chain one by one (I’ve highlighted the steps in comments below):
vec1.iter() // Slice iterator (over `vec1`)
.zip(vec2.iter()) // Zip iterator (over two slice iterators)
.map(|(i, j)| i * j) // Map iterator
.sum() // Sum executor
Slice iterators
The very start of our iterator chain was a call to vec1.iter(). Here vec1 is a slice of integers, so it has a type like &[i32]. (A slice is a subportion of a vector or array.) But the iter() method (and the iterator it returns) is defined generically for slices of any type T. The method looks something like this (because this method applies to all slices in every crate, you can only write an impl like this in the standard library):
impl<T> [T] {
fn iter(&self) -> SliceIter<T> {
SliceIter { slice: self }
}
}
It creates and returns a value of the struct SliceIter, which is the type of the slice iterator (in the standard library, this type is std::slice::Iter, though it’s implemented somewhat differently). The definition of SliceIter looks something like this:
pub struct SliceIter<'iter, T: 'iter> {
slice: &'iter [T],
}
The SliceIter type has only one field, slice, which stores the slice we are iterating over. Each time we produce a new item, we will update this field to contain a subslice with just the remaining items. If you’re wondering what the 'iter notation means, it represents the lifetime of the slice, meaning the span of the code where that reference is in use. In general, lifetimes can be elided within function signatures and bodies, but they must be made explicit in type definitions. In any case, without going into too much detail here, the net effect of this annotation is to ensure that the iterator does not outlive the slice that it is iterating over.
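To see what this protects us from, here is a hypothetical sketch (assuming we are in the same module as SliceIter, so its field is visible) that the borrow checker will reject, precisely because the iterator would outlive the slice it borrows:
let iter = {
    let data = vec![1, 2, 3];
    SliceIter { slice: &data[..] } // ERROR: `data` does not live long enough
};
// If this compiled, `iter` would now be pointing at freed memory.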
Now, to use SliceIter as an iterator, we must implement the Iterator trait. We want to yield up a reference &T to each item in the slice in turn. The idea is that each time we call next, we will peel off a reference to the first item in self.slice, and then adjust self.slice to contain only the remaining items. That looks something like this:
impl<'iter, T> Iterator for SliceIter<'iter, T> {
// Each round, we will yield up a reference to `T`. This reference
// is valid for as long as the iterator is valid.
type Item = &'iter T;
fn next(&mut self) -> Option<&'iter T> {
// `split_first` gives us the first item (`head`) and
// a slice with the remaining items (`tail`),
// returning None if the slice is empty.
if let Some((head, tail)) = self.slice.split_first() {
self.slice = tail; // update slice w/ the remaining items
Some(head) // return the first item
} else {
None // no more items to yield up
}
}
}
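Here is a small usage sketch (again assuming we are in the same module as SliceIter, so we can construct it directly; in real code you would just call iter() on the slice):
let data = [1, 2, 3];
let mut iter = SliceIter { slice: &data[..] };
assert_eq!(iter.next(), Some(&1)); // yields references into `data`
assert_eq!(iter.next(), Some(&2));
assert_eq!(iter.next(), Some(&3));
assert_eq!(iter.next(), None);     // slice is now empty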
Zip iterators
Ok, so let’s return to our example iterator chain:
vec1.iter()
.zip(vec2.iter())
.map(|(i, j)| i * j)
.sum()
We’ve now seen how vec1.iter() and vec2.iter() work, but what about zip? The zip iterator is an adapter that takes two other iterators and walks over them in lockstep. The return type of zip then is going to be a type ZipIter that just stores two other iterators:
pub struct ZipIter<A: Iterator, B: Iterator> {
a: A,
b: B,
}
Here the generic types A
and B
represent the types of the
iterators being zipped up. Each iterator chain has its own type that
determines exactly how it works. In this example we are going to zip
up two slice iterators, so the full type of our zip iterator will be
ZipIter<SliceIter<'a, i32>, SliceIter<'b, i32>>
(but we never have
to write that down, it’s all fully inferred by the compiler).
When implementing the Iterator trait for ZipIter, we just want the next method to draw the next item from a and b and pair them up, stopping when either is empty:
impl<A: Iterator, B: Iterator> Iterator for ZipIter<A,B> {
type Item = (A::Item, B::Item);
fn next(&mut self) -> Option<(A::Item, B::Item)> {
if let Some(a_item) = self.a.next() {
if let Some(b_item) = self.b.next() {
// If both iterators have another item to
// give, pair them up and return it to
// the user.
return Some((a_item, b_item));
}
}
None
}
}
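Continuing the same sketch, we can zip two of our hand-rolled slice iterators and watch the pairing stop as soon as the shorter side runs dry:
let (xs, ys) = ([1, 2, 3], [4, 5]);
let mut zip = ZipIter {
    a: SliceIter { slice: &xs[..] },
    b: SliceIter { slice: &ys[..] },
};
assert_eq!(zip.next(), Some((&1, &4)));
assert_eq!(zip.next(), Some((&2, &5)));
assert_eq!(zip.next(), None); // `ys` is exhausted, so we stop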
Map iterators
The next step in our example iterator chain is the call to map:
vec1.iter()
.zip(vec2.iter())
.map(|(i, j)| i * j)
.sum()
Map is another iterator adapter, this time one that applies a function to each item we are iterating over, and then yields the result of that function call. The MapIter type winds up with three generic types:
- ITER, the type of the base iterator;
- MAP_OP, the type of the closure that we will apply at each step (in Rust, closures each have their own unique type);
- RET, the return type of that closure, which will be the type of the items that we yield on each step.
The definition looks like this:
pub struct MapIter<ITER, MAP_OP, RET>
where ITER: Iterator,
MAP_OP: FnMut(ITER::Item) -> RET
{
base: ITER,
map_op: MAP_OP
}
(As an aside, here I’ve switched to using a where clause to write out the constraints on the various parameters. This is just a stylistic choice: I find it easier to read if they are separated out.)
In any case, I want to focus on the second where clause for a second:
where MAP_OP: FnMut(ITER::Item) -> RET
There’s a lot packed in here. First, we said that MAP_OP
was the
type of the closure that we are going to be mapping over: FnMut
is
one of Rust’s standard closure traits;
it indicates a function that will be called repeatedly in a sequential
fashion (notice I said sequential; we’ll have to adjust this later
when we want to generalize to parallel execution). It’s called FnMut
because it takes an &mut self
reference to its environment, and thus
it can mutate data from the enclosing scope.
The where clause also indicates the argument and return type of the closure. MAP_OP will take one argument, ITER::Item – this is the type of item that our base iterator produces – and it will return values of type RET.
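As a quick aside, here is a sketch of what makes a closure FnMut rather than Fn: it mutates a variable from its environment each time it is called (the names here are made up for illustration):
let mut calls = 0;
{
    // `double` captures `calls` by mutable reference, so it is `FnMut`.
    let mut double = |x: i32| { calls += 1; x * 2 };
    assert_eq!(double(3), 6);
    assert_eq!(double(4), 8);
}
assert_eq!(calls, 2); // the closure mutated its environment twice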
OK, now let’s write the iterator itself:
impl<ITER, MAP_OP, RET> Iterator for MapIter<ITER, MAP_OP, RET>
    where ITER: Iterator,
          MAP_OP: FnMut(ITER::Item) -> RET
{
// We yield up whatever type `MAP_OP` returns:
type Item = RET;
fn next(&mut self) -> Option<RET> {
match self.base.next() {
// No more items in base iterator:
None => None,
// If there is an item...
Some(item) => {
// ...apply `map_op` and return the result:
Some((self.map_op)(item))
}
}
}
}
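The behavior is just like the standard library’s map adapter, which we can poke at directly with next; a quick sketch using the real std iterators on our running example:
let vec1 = vec![1, 2, 3];
let vec2 = vec![4, 5, 6];
let mut products = vec1.iter().zip(vec2.iter()).map(|(i, j)| i * j);
assert_eq!(products.next(), Some(4));  // 1 * 4
assert_eq!(products.next(), Some(10)); // 2 * 5
assert_eq!(products.next(), Some(18)); // 3 * 6
assert_eq!(products.next(), None);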
Pulling it all together: the sum operation
The final step is the actual summation. This turns out to be fairly
straightforward. The actual sum
method is designed to work over any
kind of type that can be added in a generic way, but in the interest
of simplicity, let me just give you a version of sum
that works on
integers (I’ll also write it as a free-function rather than a method):
fn sum<ITER: Iterator<Item=i32>>(mut iter: ITER) -> i32 {
let mut result = 0;
while let Some(v) = iter.next() {
result += v;
}
result
}
Here we take in some iterator of type ITER. We don’t care what kind of iterator it is, but it must produce integers, which is what the Iterator<Item=i32> bound means. Next we repeatedly call next to draw all the items out of the iterator; at each step, we add them up.
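Putting the whole chain together with this simplified sum gives us the dot product (just a sketch; the real sum method on Iterator behaves the same way here):
let vec1 = vec![1, 2, 3];
let vec2 = vec![4, 5, 6];
let dot = sum(vec1.iter().zip(vec2.iter()).map(|(i, j)| i * j));
assert_eq!(dot, 32); // 1*4 + 2*5 + 3*6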
One last little detail
There is one last piece of the iterator puzzle that I would like to cover, because I make use of it in the parallel iterator design. In my example, I created iterators explicitly by calling iter:
vec1.iter()
.zip(vec2.iter())
.map(|(i, j)| i * j)
.sum()
But you may have noticed that in idiomatic Rust code, this explicit call to iter can sometimes be elided. For example, if I were actually writing that iterator chain, I wouldn’t call iter() from within the call to zip:
vec1.iter()
.zip(vec2)
.map(|(i, j)| i * j)
.sum()
Similarly, if you are writing a simple for loop that just goes over a container or slice, you can often elide the call to iter:
for item in vec2 {
process(item);
}
So what is going on here? The answer is that we have another trait
called IntoIterator
, which defines what types can be converted
into iterators:
trait IntoIterator {
// the type of item our iterator will produce
type Item;
// the iterator type we will become
type IntoIter: Iterator<Item=Self::Item>;
// convert this value into an iterator
fn into_iter(self) -> Self::IntoIter;
}
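This trait is also what powers the for loop itself: a loop like the one over vec2 above roughly desugars into a call to into_iter followed by repeated calls to next (this is a simplified sketch of the real expansion, which uses match and loop):
let mut iter = IntoIterator::into_iter(vec2);
while let Some(item) = iter.next() {
    process(item);
}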
Naturally, anything which is itself an iterator implements IntoIterator automatically – it just gets “converted” into itself, since it is already an iterator. Container types also implement IntoIterator. The usual convention is that the container type itself implements IntoIterator so as to give ownership of its contents: e.g., converting Vec<T> into an iterator takes ownership of the vector and gives back an iterator yielding ownership of its T elements. However, converting a reference to a vector (e.g., &Vec<T>) gives back references to the elements (&T). Similarly, converting a borrowed slice like &[T] into an iterator also gives back references to the elements (&T). We can implement IntoIterator for &[T] like so:
impl<'iter, T> IntoIterator for &'iter [T] {
// as we saw before, iterating over a slice gives back references
// to the items within
type Item = &'iter T;
// the iterator type we defined earlier
type IntoIter = SliceIter<'iter, T>;
fn into_iter(self) -> SliceIter<'iter, T> {
self.iter()
}
}
Finally, the zip
helper method uses IntoIterator
to convert its
argument into an iterator:
trait Iterator {
...
fn zip<B>(self, other: B) -> ZipIter<Self, B::IntoIter>
where B: IntoIterator
{
ZipIter { a: self, b: other.into_iter() }
}
}
Taking a step back
Now that we’ve covered the whole iterator chain, let’s take a moment to reflect on some interesting properties of this whole setup. First, notice that as we create our iterator chain, nothing actually happens until we call sum. That is, you might expect that calling vec1.iter().zip(vec2.iter()) would go and allocate a new vector that contains pairs from both slices, but, as we’ve seen, it does not. It just creates a ZipIter that holds references to both slices. In fact, no vector of pairs is ever created (unless you ask for one by calling collect). Thus iteration can be described as lazy, since the various effects described by an iterator take place at the last possible time.
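For example, this sketch (using the real std types) builds the zip without doing any work, and only allocates a vector of pairs when we explicitly ask for one with collect:
let vec1 = vec![1, 2, 3];
let vec2 = vec![4, 5, 6];
// Nothing is computed or allocated here; we just built a small struct.
let pairs_iter = vec1.iter().zip(vec2.iter());
// Only now does a vector of pairs get allocated and filled in:
let pairs: Vec<(&i32, &i32)> = pairs_iter.collect();
assert_eq!(pairs, vec![(&1, &4), (&2, &5), (&3, &6)]);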
The other neat thing is that while all of this code looks very abstract, it actually optimizes to something very efficient. This is a side effect of all those generic types that we saw before. They basically ensure that the resulting iterator has a type that describes precisely what it is going to do. The compiler will then generate a custom copy of each iterator function tailored to that particular type. So, for example, we wind up with a custom copy of ZipIter that is specific to iterating over slices, and a custom copy of MapIter that is specific to multiplying the results of that particular ZipIter. These copies can then be optimized independently. The end result is that our dot-product iteration chain winds up being optimized into some very tight assembly; in fact, it even gets vectorized. You can verify this yourself by looking at this example on play and clicking the “ASM” button (but don’t forget to select “Release” mode). Here is the inner loop you will see:
.LBB0_8:
movdqu (%rdi,%rbx,4), %xmm1
movdqu (%rdx,%rbx,4), %xmm2
pshufd $245, %xmm2, %xmm3
pmuludq %xmm1, %xmm2
pshufd $232, %xmm2, %xmm2
pshufd $245, %xmm1, %xmm1
pmuludq %xmm3, %xmm1
pshufd $232, %xmm1, %xmm1
punpckldq %xmm1, %xmm2
paddd %xmm2, %xmm0
addq $4, %rbx
incq %rax
jne .LBB0_8
Neat.
Recap
So let’s review the critical points of sequential iterators:
- They are lazy. No work is done until you call next, and then the iterator does the minimal amount of work it can to produce a result.
- They do not allocate (unless you ask them to). None of the code we wrote here requires allocating any memory or builds up any intermediate data structures. Of course, if you use an operation like collect, which accumulates the iterator’s items into a vector or other data structure, building that data structure will require allocating memory.
- They are generic and highly optimizable. Each iterator combinator uses generic type parameters to represent the types of the prior iterator that it builds on, as well as any closures that it references. This means that the compiler will make a custom copy of the iterator specialized to that particular task, which is very amenable to optimization.
  - This is in sharp contrast to iterators in languages like Java, which are based on virtual dispatch and generic interfaces. The design is similar, but the resulting code is very different.
So in summary, you get to write really high-level, convenient code with really low-level, efficient performance.