Baby Steps

A blog about programming, and tiny ways to improve it.

Guaranteeing Parallel Execution

One common criticism of the work on ParallelJS is that the API itself does not guarantee parallel execution. Instead, our approach has been to offer methods whose definition makes parallel execution possible, but we have left it up to the engines to define the exact set of JavaScript that will be safe for parallel execution.

Now, I definitely think it is a good idea to clearly define the subset of JavaScript that our engine will be able to execute in parallel. As I wrote in my preivous post, I want to do this both via documentation and via developer tools that provide live feedback. In some cases, I think, the rules will probably depend on type inference or other dynamic analysis techniques that are subtle and hard to explain, but live feedback should be helpful in detecting and resolving those cases.

Nonetheless, I do not think that the formal specification of ParallelJS should include these sorts of details. In my view, this would be similar to having the ECMAScript committee define what patterns in JavaScript will be efficiently JITted and which will not. This is ultimately going to vary depending on the implementation.

In particular, the JavaScript subset that will be acceptable is going to vary substantially depending on what techniques are used to implement the parallelization. On the extreme end, if we had an implementation based on transactional memory, you could imagine that the full JavaScript language might be accepted. If you think that’s science fiction, consider that newer Intel chips will have hardware support for a limited form of transactional memory. On the other extreme, engines that utilize the GPU will only support a very limited subset, one that most likely excludes memory allocation.

I find the precedent of asm.js to be a more promising approach. The formal specification should only state what the the parallel methods are. Preferably, this specification should be loose enough to accommodate as many different parallel execution techniques as possible, but strict enough to prevent wide divergence between engines. I have argued in the past for “equivalent to some sequential execution”, and I still think that’s the right standard, but there’s room for discussion on this point.

Meanwhile, there are can be several independent specifications that provide guidance as to what subset of JavaScript should be supported to parallelize in different ways. Writing such a specification now is probably immature, I think it would be better to have multiple JavaScript engines involved so that the specification is not tailored to SpiderMonkey.

It will be challenging, I think, to come up with a specification that offers the very strong guarantees that “asm.js” can offer (no recompilation, no bailouts, etc). This is because “asm.js” is a very narrow slice of JS intended to be output by compilers, not by humans. It excludes, for example, all normal JavaScript objects. Now, a specification like this might be useful in a parallel context as well; it could serve as the backend for other languages. But I would hope that we have some broader specifications that define code that humans can write.

It is also important to point out that “asm.js” builds upon a lot of precedent. Smart folk working on projects like Emscripten and Mandreel have already done a lot of the leg work to define the idioms that “asm.js” codifies. I hope that as ParallelJS evolves we’ll also evolve a common set of idioms and “ways of doing things” that we can then formalize.

A Tour of the Parallel JS Implementation (Part 1)

I am going to write a series of blog posts giving a tour of the current Parallel JS implementation in SpiderMonkey. These posts are intended to serve partly as documentation for the code. The plan is to begin high level and work my way down to the nitty gritty details, so here we go!

I will start my discussion at the level of the intrinsic ForkJoin() function. As an intrinsic function, ForkJoin() is not an API intended for use by end-users. Rather, it is available only to self-hosted code and is intended to serve as a building block for other APIs (ParallelArray among them).

Parallel JS Lands

The first version of our work on ParallelJS has just been promoted to mozilla-central and thus will soon be appearing in a Nightly Firefox build near you. I find this pretty exciting. In honor of the occassion, I wanted to take a moment to step back and look both at what has landed now, what we expect to land soon, and the overall trajectory we are aiming for.

Splitting the PJs API

Lately, I’ve been thinking about the ParallelJS API that we want to expose. In particular, I’ve been considering offering methods on the normal array type for basic parallel operations. I think this opens up some interesting doors.

Note: To give credit where credit is due, I should note that a lot of the ideas in this post originate with other members of the Parallel JS team (Shu-yu Guo, Dave Herman, Felix Klock). But I don’t want to speak for them, since we seem to each have our own opinions on the best arrangement, so I’m writing the post from the first person singular (“I”) and not a team perspective (“we”). This does not imply “ownership” of the ideas within.

The basic idea

The basic idea is to add “unordered” or parallel variants of the standard higher-order methods to JavaScript arrays as well as to typed arrays (and binary data arrays when those become available). For example, in addition to map() and reduce(), we’d offer unorderedMap() and unorderedReduce() (in the case of typed arrays, I think we’d have to add map() as well).

The semantics of the unordered variants are the same as their ordered cousins, except that the ordering in which they perform their iterations is not defined. However, if you used the unordered variants, we will attempt parallel execution where possible.

Why call the methods “unordered”?

I chose the (admittedly somewhat clunky) prefix unordered because I want to emphasize the fundamental contract our parallel execution engine offers, which is that parallel execution is equivalent to some sequential ordering, but it doesn’t say which one. This is a somewhat controversial design, but I still feel it’s the right one. In any case, it’s basically orthgonal to this post.

Note that there is no reason we can’t someday try parallel execution for the ordered map() as well. However, we’d have to be very careful to avoid introducing overhead in the case that parallelization fails or would change the semantics of the program. The use of the unordered variant effectively serves as a hint that parallelization is likely to pay off.

What about immutability?

Some readers will remember that ParallelArray objects are immutable while normal JS arrays are not. This is true but it’s not a big obstacle. During any parallel operation, mutations to pre-existing objects are forbidden and must be detected; in the case of a call like array.unorderedMap(func), the array array that is being mapped is itself a pre-existing object and thus would be at least temporarily immutable.

There are of course some good reasons to have immutable data, particularly if we wind up doing GPU operations, in which case memory will have to be transferred back and forth, and we may have to worry about invalidation. If this ever becomes an issue, we can accommodate these more advanced use cases either by the existing freezing interfaces that JS provides or through the multi-dimensional API described below.

What are the benefits of this API?

The biggest benefit of this approach, I think, is that it’s about the simplest way to offer parallelism. You can work with the JS array types we all know and love (or hate, as you prefer). Moreover, integration with existing codebases becomes easier. If you have some loops that are performing pure transformations, such as filtering out records on some criteria, you can change them to execute in parallel just by changing the name of the method you use. On other or older browsers, it’s trivial to polyfill unorderedMap as equivalent to map.

What does this mean for ParallelArray?

Right now, the ParallelArray API serves two masters. It tries to be a very lightweight one-dimensional array but it also tries to be a fairly powerful multi-dimensional matrix. If we offer parallel transformations on normal arrays, that frees up ParallelArray so that it can be targeted at more advanced use cases. In particular, it can be (1) always multi-dimensional and (2) type-annotated to permit efficient storage when you have a matrix of scalar values like bytes or ints. I am right now working on another post regarding some ideas relating to how we can handle the multi-dimensional case; it was originally part of this post but this post was rapidly becoming too long.

Interfacing With C Functions in Rust

One of the things that I’ve been working on for some time now is the proper integration of C functions. As with virtually every other facet of the design of Rust, we’ve been slowly moving from a model where Rust tried to hide low-level details for you to one where Rust offers tight control over what’s going on, with the type system intervening only as needed to prevent segfaults or other strange behavior. This blog post details what I consider to be the best proposal so far; some of the finer points are a bit vague, however.

Extern Function Types

One thing we need is a type for a simple function pointer. Rust’s function types to date have always been closure types, meaning that they referred to the combination of a function pointer and some environment. So we have added an “extern fn” type, which is written as follows:

extern "ABI" fn(T) -> U

Here the "ABI" string must be some ABI that is supported by the Rust compiler. The most common values will be either C or Rust, I imagine, but stdcall (or pascal) may be used occasionally as well, and who knows what we’ll support in the future.

I imagine that the default for “ABI” should be “C”, as it will be the most common thing people really want to use. Calls to any extern function with non-Rust ABI is an unsafe action.

Extern Blocks and Function Declarations

We are moving towards a model where function declarations are placed within extern blocks. This looks something like:

extern "C" {
    fn foo();
    fn bar();
}

In this case, the type of foo and bar would be extern "C" fn().

The reason that we declare extern functions in extern blocks, as opposed to individually, is that on some platforms it is necessary to load blocks of functions that are defined by a common library together.

“crust” functions

In addition to being able to call C functions from within Rust, it is useful to be able to call Rust functions from within C. To this end the compiler will permit Rust fns to be declared with a specific ABI like so:

extern "C" fn crust(t: T) -> U {
}

If you declare a function as having a non-Rust ABI, then this implies a few things:

  • A reference to crust() will have type extern "C" fn(T) -> U.
  • We cannot catch and process failure for you, since the propagation of failure results is ABI specific. Thus is the Rust code within an external function fails, it will cause the process to abort. We may later add some way to catch failure so that you can propagate it yourself (perhaps by returning false, etc).

Stack Switching

Now we come to the interesting (and tricky) part. Internally, Rust makes use of a split stack approach where stack segments are allocated dynamically as the stack grows. This allows us to have a very large number of threads without exhausting our address space (particularly on 32-bit systems). This also allows your programs to recurse as long as there is memory available, which is sometimes useful. It is not, however, what C expects. C functions just expect to have a big chunk of stack available. Hopefully infinite.

Therefore, whenever we recurse into C code, we must make sure that a lot of stack is available. The way we do this today is somewhat magical: functions declared as extern are not in fact the raw C function, but rather a wrapper around the C function that will switch over from the Rust stack (which may be small) to a very big stack. This was more-or-less an ok solution back before we had the idea of getting a raw pointer to a C function and so forth but it’s not very appealing now. Also it can be a performance bottleneck.

The new proposal is to say that when you call an extern "C" fn, nothing magical happens. The stack stays just as it was. To perform the stack switching, we offer a function in the runtime (perhaps a number of functions) called prepare_extern_call(), which can be used like so:

let my_c_function: extern "C" fn() = ...;
do prepare_extern_call {
    my_c_function()
}

Of course, it would be easy to forget to use this function, which would be a recipe for stackfaults. Therefore, we will also offer a lint-mode check that defaults to error. This check will trigger if we see a call to a function of non-Rust ABI that is not lexically enclosing within a call to prepare_extern_call.

There will be variants of prepare_extern_call that allow you to specify the amount of stack size to guarantee more precisely if you prefer, along with other options as those arise.

Auto-generating wrappers

It is our expectation that most people will not directly call C functions. Instead, you will wrap them in a Rust-friendly wrapper that performs some sanity checking, converts from Rust types, etc. This wrapper will also perform the stack switching shown above.

In some cases, though, writing such wrappers can be tedious, so we can supply some annotations in the compiler that will autogenerate these wrappers. This is basically a macro. I am envisioning something like this:

#[auto_wrap] // autogenerate wrappers for enclosing functions
extern "C" {
    #[no_wrap] // ...not this one, I'll do it by hand
    fn my_func1(x: *char) -> bool;

    fn my_func2();
}

fn my_func1(x: ~str) -> bool {
    do x.as_c_string |p| {
        do prepare_extern_call {
        }
    }
}

which would then expand into:

extern "C" {
    fn my_func1(x: *char) -> bool;
    fn my_func2();
}

fn my_func2() -> bool {
   do prepare_extern_call {
       my_func2()
   }
}

One issue that is obvious here is the name collisions. I’m not sure how to resolve that. It seems like the older way of native functions within their own module (extern "C" mod foo) would solve it. Well, we’ll do something. And the precise details of this auto-generation remain to be resolved. But you get the idea.

Destructors and Finalizers in Rust

Rust features destructors and, as of this moment, they are simply not sound with respect to many other features of the language, such as borrowed and managed pointers. The problem is that destructors are granted unlimited access to arbitrary data, but the type system and runtime do not take that into account. I propose to fix this by limiting destructors to owned types, meaning types that don’t contain borrowed or managed pointers.

Revised for Loop Protocol

In Rust today, there is a special for syntax designed to support interruptible loops. Since we introduced it, this has proven to be a remarkable success. However, I think we can improve it very slightly.

Current for protocol

The current “for protocol” is best explained by giving an example of how to implement it for slices:

fn each<E>(v: &[E], f: &fn(&E) -> bool) {
    let mut i = 0;
    let n = v.len();
    while i < n {
        if !f(&v[i]) {
            return;
        }
        i += 1
    }
}

As you can see, the idea is that the last parameter to the each() method is a function of type &fn(&E) -> bool, which means that it is given a pointer to an element in the collection and it returns true or false. The return value indicates whether we should continue iterating.

A little known fact is that the for statement returns whatever the each() method returns. This means that each() methods typically have unit return type so that the Rust compiler doesn’t require a semicolon, which would be used to disregard the result of the for expression.

Problems

The biggest problem with this protocol is that it is not easily composable. In particular, imagine that I have a simple tree like this:

struct Tree<E> {
    elem: E,
    children: ~[Tree<E>]
}

Now let’s try to implement the pre-order traversal method for such a tree. You might think you could do it like this:

fn each<E>(t: &Tree<E>, f: &fn(&E) -> bool) {
    if !f(&t.elem) {
        return;
    }

    for t.children.each |child| { each(child, f); }
}

While this will compile, it will not work as expected. For example, this program:

fn main() {
    let t = Tree {
        elem: 0,
        children: ~[
            Tree { elem: 1, children: ~[
                Tree { elem: 2, children: ~[] }
            ] },
            Tree { elem: 3, children: ~[] }
        ]
    };

    for each(&t) |e| {
        io::println(fmt!("%d", *e));
        if *e == 1 { break; }
    }
}

should print “0” and “1”, but it prints “0”, “1”, and “3”. The reason is that while each() does indeed return early when the iteration function returns false, it doesn’t abort the entire iteration, only the current subtree.

One way to fix this is to wrap the each() function with an inner each function that returns a bool to indicate whether execution should stop:

fn each1<E>(t: &Tree<E>, f: &fn(&E) -> bool) {
    each_inner(t, f);

    fn each_inner<E>(t: &Tree<E>, f: &fn(&E) -> bool) -> bool {
        if !f(&t.elem) {
            return false;
        }

        for t.children.each |child| {
            if !each_inner(child, f) {
                return false;
            }
        }

        return true;
    }
}

Making each() composable

I think that we should change the standard each signature to:

fn each<E>(c: &Coll<E>, f: &fn(&E) -> bool) -> bool

Here the return value of each is always a boolean, and it will be false if the last call to f() returned false, and true otherwise. This makes it easier to write composed each() methods. We would also adjust for statements so that they always return unit and do not return the result of each().

Under this definition, we could write the tree iterator as follows:

fn each2<E>(t: &Tree<E>, f: &fn(&E) -> bool) -> bool {
    f(&t.elem) && t.children.each(|c| each2(c, f))
}

This is clearly an improvement over each1()!

Lifetime Notation Redux

In a previous post I outlined some of the options for updating our lifetime syntax. I want to revist those examples after having given the matter more thought, and also after some discussions in the comments and on IRC.

My newest proposal is that we use <> to designate lifetime parameters on types and we lean on semantic analysis (the resolve pass, more precisely) to handle the ambiguity between a lifetime name and a type name. Before I always wanted to have the distinction between lifetimes and types be made in the parser itself, but I think this is untenable. This proposal has the advantage that the most common cases are still written as they are today.

Here is the example from the previous post in my proposed notation:

struct StringReader<&self> {
                 // ^~~~~ Lifetime parameter designated with &
    value: &self/str,
        // ^~~~~~~~~ Same as today.
    count: uint
}

impl StringReader {
    fn new(value: &self/str) -> StringReader<&self> {
                             //              ^~~~~
                             // Interpreted as a lifetime reference due to
                             // the declaration of StringReader, which states
                             // that first parameter is a lifetime.
        StringReader { value: value, count: 0 }
    }
}

fn value(s: &v/StringReader<&v>) -> &v/str {
         // ^~~~~~~~~~~~~~~~~~~     ^~~~~~
         // As today, lifetime names that appear in a function declaration
         // do not have to be declared anywhere and are implicitly scoped
         // to the containing function declaration.
    return s.value;
}

fn remaining(s: &StringReader<&> -> uint {
             // ^~~~~~~~~~~~~~~~
             // A bare & in a fn decl means "use a fresh name",
             // so this is equivalent to &x/StringReader<&y>.
             // This may be the right thing, see Option 2 below.
    return s.value.len() - s.count;
}

What follows are miscellaneous notes and thoughts. There are a few options that could be tweaked, which I have noted.

Considerations

The only way I have found to distinguish lifetime names purely in the parser that is also visually appealing is to use braces to designate lifetimes (options 7 and 8 in my previous post). As a reminder, the impl of StringReader would look like:

impl StringReader {
    fn new(value: &{self} str) -> StringReader{self} {
        StringReader { value: value, count: 0 }
    }
}

The major problem here is that, as bstrie pointed out on IRC, it’s ambiguous: the {self} which appears in the return type could be interpreted as the function body. His proposed fix was to use whitespace sensitivity, so that StringReader{self} and StringReader {self} are parsed differently, but whitespace sensitivity is something we have always tried to avoid.

I personally find it appealing to use <> both for lifetime and type parameters, because I think it gives the right intution. A lifetime-parameterized declaration is just like a type-parameted declaration with regard to how it works in the type system.

OPTION 1: I opted to include & in the lifetime parameters to a type for consistency (this way, a lifetime name is always preceded by &). However, they are not strictly necessary and they are visually heavy. We could remove them, which would mean you have &v/StringReader<v> and StringReader<self> and not &v/StringReader<&v> and StringReader<&self>. However, the default would still have to be written &, so you’d still have StringReader<&>.

The default lifetime &

In this proposal, the “default lifetime” & would only be usable inside a function declaration or function body. In a function declaration, it means “use a fresh lifetime. In the function body it means “use inference”.

OPTION 2: It would be possible to make & a little smarter, as it is today. Today it means “use a fresh name unless & appears on a nested type, then use the lifetime you are nested within”. If we took that interpretation, then &StringReader<&> would be equivalent to &x/StringReader<&x> and not &x/StringReader<&y>. This is more likely to be what the user wanted, though I don’t think it makes much difference in practice. I’d probably just want to experiment a bit here: start with the simpler version, as I proposed here, and then see how many type errors we get

OPTION 3: We could also allow users to leave off the <> if the only parameter is a lifetime parameter, in which case it would be equivalent to <&>. This means that you could write &StringReader instead of &StringReader<&>.

OPTION 4: The one place that I opted to eschew explicit declarations is on functions. If we wanted, we could always require that all named lifetimes be declared, which would mean that the function value() above would be written:

fn value<&v>(s: &v/StringReader<v>) -> &v/str {
    return s.value;
}

I can’t decide about this option. It strikes me as a reasonably simple story, which appeals to me, but it’s also fairly heavyweight.

How complex can it get?

UPDATE: Per bstrie’s request, here is an example of a type that uses both lifetime and type parameters with trait bounds:

struct Foo<&self, T: Reader+Eq> {
    value: &self/T,
    count: uint
}

fn operate<R: Reader+Eq>(f: Foo<&, R>)
{
    ...
}

The Case for Deterministic Results

In my last post, I made the case against having a deterministic semantics. I’ve gotten a fair amount of feedback saying that, for a Web API, introducing nondeterminism is a very risky idea. Certainly the arguments are strong. Therefore, I want to take a moment and make the case for determinism.

Deterministic or Not?

One of the interesting questions with respect to Parallel JS is what the semantics ought to be if you attempt a parallel operation with a kernel function that has side-effects. There are basically three reasonable options:

  1. Deterministic results where possible: The function behaves “as if” it executed sequentially, executing the kernel from 0 to n, just like Array.map.
  2. Error: An exception is thrown.
  3. Non-determinstic results: The function behaves “as if” it executed sequentially, but the items were mapped in an unspecified order.

The branch currently implements option 3: I believe it is the most consistent and will yield the best performance. However, reasonable people can differ on this point, so I want to make my case.