A tour of the Parallel JS implementation (Part 1)
20 March 2013
I am going to write a series of blog posts giving a tour of the current Parallel JS implementation in SpiderMonkey. These posts are intended to serve partly as documentation for the code. The plan is to begin high level and work my way down to the nitty gritty details, so here we go!
I will start my discussion at the level of the intrinsic ForkJoin() function. As an intrinsic function, ForkJoin() is not an API intended for use by end-users. Rather, it is available only to self-hosted code and is intended to serve as a building block for other APIs (ParallelArray among them).
The ForkJoin function
The idealized view of how ForkJoin works is that you give it a callback fjFunc and it will invoke fjFunc once on each worker thread. Each call to fjFunc should therefore do one slice of the total work.
In reality, though, the workings of ForkJoin are rather more complex. The problem is that we can only execute Ion-compiled code in parallel. Moreover, we can only handle Ion-compiled code that avoids the interpreter, since most of the pathways in the interpreter are not thread-safe. This means that we require accurate type information. However, we can only get accurate type information by running the code (and, moreover, we can’t run the code in parallel because the type monitoring infrastructure is not thread-safe).
What we wind up doing therefore is using sequential executions whenever we can’t run in parallel. This might be because there isn’t enough type information, or it might be because the code contains operations that require parts of the interpreter that are not thread-safe.
In the general case, a single call to ForkJoin can move back and forth between parallel and sequential execution many times, gathering more information and potentially recompiling at each step. After a certain number of bailouts, however, we will just give up and execute the remainder of the operations sequentially.
The ForkJoin function is designed such that the caller does not have to care whether the execution was done in parallel or sequentially. Either way, presuming that the callback fjFunc is properly written, the same results will have been computed in the end.
The arguments to ForkJoin
ForkJoin expects one or two arguments:
ForkJoin(fjFunc) // Either like this...
ForkJoin(fjFunc, feedbackFunc) // ...or like this.
Both arguments are functions. fjFunc defines the operation that will execute in parallel and feedbackFunc is used for reporting on whether bailouts occurred and why. feedbackFunc is optional and may be undefined or null; not passing in a feedback function will result in slightly faster execution, as less data is gathered. The ForkJoin function does not return anything, and neither fjFunc nor feedbackFunc is expected to return any values; instead, fjFunc and feedbackFunc are expected to mutate values in place to produce their output.
fjFunc: The parallel operation
The signature of fjFunc is as follows:
fjFunc(sliceId, numSlices, warmup)
Here sliceId and numSlices are basically the thread id and the thread count respectively (though we purposefully distinguish between the slice, a unit of work, and the worker thread: today there is always one slice per worker thread, but someday we may improve the scheduler to support work-stealing or other more intelligent strategies for dividing work, and then this would not necessarily be true).
The warmup flag indicates whether the function is being called in warmup mode. As will be explained in the next section, we expect fjFunc to generally track how much work it has done so far. When warmup is true, the function should do “some” of the remaining work, but not too much. When warmup is false, it should attempt to do all the remaining work. Thus, if fjFunc successfully returns when warmup is false, then ForkJoin can assume that all of the work for that slice has been completed.
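To make this contract concrete before the real example later in this post, here is a minimal sketch of what a well-behaved fjFunc might look like. The names slicePositions, sliceEnds, and processItem are made up purely for illustration; the actual ParallelArray callbacks are shown below.
// Illustrative sketch only: slicePositions/sliceEnds are assumed shared
// arrays recording each slice's progress, processItem stands in for the
// actual per-item work.
function exampleFjFunc(sliceId, numSlices, warmup) {
  var pos = slicePositions[sliceId];
  var end = sliceEnds[sliceId];

  // In warmup mode, do only a small amount of the remaining work.
  if (warmup && end > pos)
    end = pos + 1;

  while (pos < end) {
    processItem(pos);
    // Record progress so a later call can resume where this one left off
    // (the real self-hosted code uses unsafe intrinsics for such writes).
    slicePositions[sliceId] = ++pos;
  }
}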
Warmups and bailouts
On the very first call to ForkJoin, it is very likely that the callback fjFunc has never been executed and therefore no type information is available. In that case, ForkJoin will begin by invoking the callback fjFunc sequentially (i.e., with the normal interpreter) and with the warmup argument set to true. We currently invoke fjFunc once for each slice. As we just said, because warmup is true, each call to fjFunc should do some of the work in its slice but not all (fjFunc is responsible for tracking how much work it has done; I’ll explain that in a second). Once the calls to fjFunc return, and presuming no exceptions are thrown, ForkJoin will attempt compilation for parallel execution.
Presuming compilation succeeds, ForkJoin will attempt parallel execution. This means that we will spin up worker threads and invoke fjFunc in each one. This time, the warmup argument will be set to false, so fjFunc should try and do all the rest of the work that remains in each slice. If all of these invocations are successful, then the ForkJoin procedure is done.
However, it is possible that one or more of those calls to fjFunc may bail out, meaning that it attempts some action that is not permitted in parallel mode. There are many possible reasons for bailouts, but they generally fall into one of three categories:
- The type information could be incomplete, leading to a failed type guard;
- The script might have attempted some action that is not (yet?) supported in parallel mode even though it seems like it might be theoretically safe, such as access to a JS proxy, built-in C++ function, or DOM object;
- The script might have attempted to mutate shared state.
What we do in response to a bailout is to fall back to another sequential, warmup phase. As part of this fallback, we typically invalidate the parallel version of fjFunc that bailed out, meaning that we’ll recompile it later. Next, just as we did in the initial warmup, we invoke fjFunc using the normal interpreter once for each slice with the warmup argument set to true.
Once this “recovery” warmup phase has completed, we will re-attempt parallel execution. The idea is that we now have more accurate type and profiling information, so we should be able to compile successfully this time.
This process of alternating parallel execution with sequential recovery continues until either (1) a parallel run completes without error (in which case we’re done) or (2) we have bailed out three times (a fairly arbitrary number, obviously, that we will probably want to tune). Once we’ve had three bailouts, we’ll give up and just invoke fjFunc sequentially with warmup set to false.
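Putting the pieces together, the overall control flow looks roughly like the following JS-flavored pseudocode. This is only a sketch of the alternation just described; the real driver lives in C++, and the helpers compileForParallelExecution and runOnWorkerThreads are names invented here for illustration.
// Rough sketch of the warmup/bailout alternation; not the real implementation.
function forkJoinSketch(fjFunc, numSlices) {
  var bailouts = 0;
  while (true) {
    // Sequential warmup: run each slice a little in the normal interpreter
    // to gather type information.
    for (var i = 0; i < numSlices; i++)
      fjFunc(i, numSlices, true);

    // Try to compile for parallel execution and, if that succeeds, run every
    // slice to completion (warmup = false) on the worker threads.
    if (compileForParallelExecution(fjFunc) &&
        runOnWorkerThreads(fjFunc, numSlices))
      return; // parallel run completed without bailing out

    // A bailout occurred: after too many failures, give up and finish the
    // remaining work sequentially; otherwise loop back for another warmup.
    if (++bailouts >= 3) {
      for (var i = 0; i < numSlices; i++)
        fjFunc(i, numSlices, false);
      return;
    }
  }
}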
An example: ParallelArray.map
To make this more concrete, I want to look in more detail at how ParallelArray.map is implemented in the self-hosted code. All the other ParallelArray functions work in a similar fashion, so I will just focus on this one.
The semantics of a parallel map are simple: when the user writes pa.map(kernelFunc), a new ParallelArray is returned with the result of invoking kernelFunc on each element in pa. In effect, this is just like Array.map, except that the order of each iteration is undefined.
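For example, a call along these lines returns a new ParallelArray of doubled values, with no guarantee about the order in which the individual elements are computed:
var pa = new ParallelArray([1, 2, 3, 4]);
var doubled = pa.map(function (x) { return x * 2; });
// doubled is a new ParallelArray containing 2, 4, 6, 8;
// the order in which the kernel is invoked on each element is undefined.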
Our implementation works by dividing the array to be mapped into chunks, which are groups of 32 elements. These chunks are then divided evenly amongst the N worker threads. The implementation relies on shared mutable state to track how many chunks each thread has been able to process thus far. There is a private array called info that stores, for each slice, a start index, an end index, and a current index. The start and end indices simply reflect the range of chunks assigned to the worker. The current index, which is initially the same as the start index, indicates the next chunk that the worker should attempt to process. This array is shared across all threads and is unsafely mutated using special intrinsics (thus bypassing the normal restrictions against mutating shared state).
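A plausible sketch of how this info array might be laid out and initialized is shown below. The three-entries-per-slice layout and the body of prepareInfoArray are illustrative guesses based on the description above; the actual self-hosted code may arrange things differently.
// Illustrative guess at the layout of `info`: three entries per slice
// (start chunk, current chunk, end chunk).
var CHUNK_SIZE = 32;   // chunks are groups of 32 elements, per the text
var CHUNK_SHIFT = 5;   // log2(CHUNK_SIZE)
var SLICE_FIELDS = 3;
function SLICE_START(sliceId) { return sliceId * SLICE_FIELDS + 0; }
function SLICE_POS(sliceId)   { return sliceId * SLICE_FIELDS + 1; }
function SLICE_END(sliceId)   { return sliceId * SLICE_FIELDS + 2; }

function prepareInfoArray(length, numSlices) {
  // Total number of CHUNK_SIZE-element chunks needed to cover `length` items.
  var numChunks = (length + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
  var info = [];
  for (var i = 0; i < numSlices; i++) {
    // Divide the chunks as evenly as possible among the slices.
    var start = Math.floor(i * numChunks / numSlices);
    var end = Math.floor((i + 1) * numChunks / numSlices);
    info[SLICE_START(i)] = start;
    info[SLICE_POS(i)] = start;   // current position begins at the start
    info[SLICE_END(i)] = end;
  }
  return info;
}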
The ParallelArray map function is built on ForkJoin. A simplified version looks something like this:
function map(kernelFunc) {
  // Capture `this` so the nested mapSlice closure can refer to it:
  var self = this;

  // Compute the bounds each slice will have to operate on:
  var length = self.length;
  var numSlices = ForkJoinSlices();
  var info = prepareInfoArray(length, numSlices);

  // Create the result buffer:
  var buffer = NewDenseArray(length);

  // Perform the computation itself, writing into `buffer`:
  ForkJoin(mapSlice);

  // Package up the buffer in a parallel array and return it:
  return NewParallelArray(buffer);

  function mapSlice(sliceId, numSlices, warmup) {
    // ... see below ...
  }
}
Here is the source to the map callback function mapSlice:
function mapSlice(sliceId, numSlices, warmup) {
  var chunkPos = info[SLICE_POS(sliceId)];
  var chunkEnd = info[SLICE_END(sliceId)];

  if (warmup && chunkEnd > chunkPos)
    chunkEnd = chunkPos + 1;

  while (chunkPos < chunkEnd) {
    var indexStart = chunkPos << CHUNK_SHIFT;
    var indexEnd = std_Math_min(indexStart + CHUNK_SIZE, length);

    // Process current chunk:
    for (var i = indexStart; i < indexEnd; i++)
      UnsafeSetElement(buffer, i, kernelFunc(self.get(i), i, self));

    UnsafeSetElement(info, SLICE_POS(sliceId), ++chunkPos);
  }
}
This same code is used both for parallel execution and the sequential fallback. Each time mapSlice is invoked, it will use the sliceId it is given to look up the current chunk (info[SLICE_POS(sliceId)]; SLICE_POS is a macro that computes the correct index). It will then process that chunk and update the shared array with the index of the next chunk (results are unsafely written into the result array buffer; note that this result is not yet exposed to non-self-hosted code). If we are in warmup mode, it will stop and return once it has processed a single chunk. Otherwise, it keeps going and processes the remaining chunks, updating the shared array at each point.
The purpose of updating the shared array is to record our progress in the case of a bailout. If a bailout occurs, it means that after processing some portion of the current chunk, the function will simply exit. As a result, the “current chunk” will not be incremented, and the next time that mapSlice is invoked with that same sliceId, it will pick up and start re-processing the same chunk. This does mean that if a bailout occurs we will process some portion of the chunk twice, once in parallel mode and then again in sequential mode after the bailout. This is unobservable to the end user, though, because parallel executions are guaranteed to be pure and thus the user could not have modified shared state or made any observable changes.
The various worker threads will unsafely mutate this shared array to track their progress. Unsafe mutations make use of the intrinsic UnsafeSetElement(array, index, value), which is more-or-less equivalent to array[index] = value except that (1) it assumes the index is in bounds; (2) it assumes that array is a dense array or a typed array; (3) it does not do any data race detection. In other words, you have to know what you’re doing. The same intrinsic is also used to store the intermediate results.
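For intuition, a purely illustrative JS approximation of the effect of UnsafeSetElement is shown below; the real intrinsic is implemented natively and, unlike this sketch, performs none of the usual checks.
// Purely illustrative; the real UnsafeSetElement is a native intrinsic.
function UnsafeSetElementSketch(array, index, value) {
  // The caller promises that `index` is in bounds, that `array` is a dense
  // or typed array, and that any cross-thread races are its own problem.
  array[index] = value;
}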
feedbackFunc: Reporting on bailouts
The precise API for feedbackFunc is to some extent still being hammered out. Right now this function is used primarily for unit testing, so that we can be sure that parallelization works when we think it should. The eventual goal is to display information in the profiler or other dev tools indicating what happened. This post is already long enough, so I’ll defer a discussion of the precise process by which we gather bailout information. Suffice to say that in the event of a bailout, each thread records the cause of the bailout (e.g., “write to illegal object” or “type guard failure”) along with a stack trace showing the script and its position.
Note
This note describes the code as it is found on our branch. This differs slightly from what is currently landed on trunk. In particular, there have been some recent refactorings that renamed the ParallelDo function to ForkJoin. This is because we recently refactored the source so that ParallelDo.cpp and ForkJoin.cpp, originally two distinct but tightly interwoven layers, are now fused into one abstraction.