11 June 2012

Last Thursday and Friday I had the good fortune of presenting a paper of mine at HotPar 2012. The paper is called Parallel Closures: A New Twist on an Old Idea; it basically describes what has evolved to become the PJs (Parallel JavaScript) model, though it does so in the context of a static checker built in Java.

I really enjoyed the workshop: the presenters were generally very good and the audience was lively. I also appreciate that they make an effort to encourage good conversation; for example, at lunchtime the tables are labeled with topics for discussion (“memory models”, say, or, “schedulers”). It’s always hard for young folk like myself to get connected with the older, more knowledgable people in the audience, so everything helps. (“Hi Hans Boehm, love your work on C++ memory models”)

I had originally though to take notes on the individual presentations and then write up little summaries. But it turns out that HotPar takes minutes and—truth be told—such blog posts are always a bit dull. So I thought I’d just focus on a theme that was raised throughout that I find very interesting.

The theme is the interplay between data layout and performant parallelism. I’ll first describe some of the papers and presentations that touched on it, then share some thoughts.

Specializing for NUMA architectures

For example, an ex-colleague of mine, Zoltan Majo, presented a paper on specializing memory layout for NUMA architectures, in which he described his efforts to improve the performance of an image classification algorithm which groups similar images together. The high-level summary of the algorithm is that it extracts “features” from the images and creates a hashtable mapping from each feature to images that have that feature. Finally, for each image, it iterates over each such feature, extracts the other images with that feature, and compares them against the reference image.

This is a very memory intensive benchmark. The way it was coded up resulted in very poor performance in NUMA architectures (NUMA stands for “non-uniform memory access”; a NUMA architecture is one in which the memory banks are divided so that the time to access them will vary depending on the CPU where the memory access occurs). The reason for this is that the images are effectively randomly scattered between the two CPUs. Then each CPU iterates over half of the reference images; processing any given reference image will involve memory stored at all of the memory banks. Hence we are effectively guaranteed to have a large portion of “remote” accesses (access to memory that is not local to the particular CPU). Zoltan’s solution is simple but very clever: instead of having the value in the hashtable be a set of images, he has the value be a list of sets of images, divided up based on the CPU where the image is located. Instead of processing all of the images under a given hash on the same CPU, he then sends a closure over to each CPU to just process the images that are local to that CPU. So basically the processing for a single reference image ends up getting spread into several closures operating over all the CPUs, each one only comparing against the images that are local to the given CPU.

Virtualizing the layout and storage mechanisms

In a panel discussion, Andrew Brownsword described the problems that he encounters in his day job of building video games and high-performance computing (HPC) systems. He stated that the single most important factor in getting good performance is data organization and access patterns. And yet this is completely ignored by most programming models. I think this is very true; it’s been bothering me for some time. More thoughts on that later.

One thing he described that he would particularly like to have would be the ability to define logical data structures but easily change how they are represented in memory; for example, converting from an array of structs to a parallel arrays for each data field and so forth. I was a bit dissapointed in that I was not able to get more concrete examples of what he meant by this, even when I approached him after the panel. However, perhaps I will pursue him over e-mail. I think this could be highly relevant to Rust and Servo. (Databases, of course, have this ability, so maybe we should look there for ideas on how to proceed).

Leo Meyerovich’s work on webpage layout

This was not presented on HotPar, but it’s been on my mind for some time. Leo does brilliant work on parallelizing web-page layout. One of the takeaway lessons from his work is the importance of data structure layout. He spends a lot of time optimizing the layout for SIMD and cache access patterns and shows tremendous performance benefits from doing so. The reason that this bothers me so much is that it is rather unclear to me how much this can be done in the face of dynamic changes to the web page layout, not to mention how well such changes can be encapsulated. This touches on the themes of the prior talk: it is a shame if we are not able to separate out the final layout of the data structures from the code which touches them. And yet for high performance it is generally necessary to specialize code to the data structures and tasks at hand.

Leo’s solution is to use a high-level language based on attribute grammars and to synthesize the code. This is a very nice solution but it is unclear whether it is sufficiently general. Still I think any attempt to virtualize the accessor code will ultimately be somewhat generated.

My own paper

I think that my own paper is a good anti-example of this trend. That is, one of the outright goals in my work was to support current programming encapsulation practices in the face of parallelism. I was trying to make it possible to use closures to safely build parallel control-flow; one of the features of a closure is that it encapsulates the precise data that it accesses. In a parallel world, though, it’s asking for trouble to allow closures to run off and access any data at all: this is how data races are born. So my proposal is to distinguish parallel closures, in which the encapsulated environment is read-only, from traditional, sequential closures, which have full access to the encapsulated environment.

I think this is an incredibly powerful idea, if I do so say so myself. It offers comparable or stronger correctness guarantees than typical effect systems (e.g., the Disciplined Concurrent Programming Using Tasks with Effects model which was also present at HotPar) without the need for any effect annotation. The basic concept is simple and can be explained in about 5 minutes to virtually anyone, even in a noisy bar (experimentally verified numerous times at JSConf). Try to explain the concept of a parameterized effect system to someone who doesn’t even understand type systems. This is not a knock on effect systems; I love effect systems. But I think that the expressivity to complexity ratio is far higher for parallel closures.

All that said, the idea of encapsulating the set of data that a parallel closure will acceses seems to be fairly opposed to the idea of tightly integrating the data-to-be-accessed with the task model itself. I can imagine that an effect system has a certain advantage here: the effects are visible.

A system like Disciplined Concurrent Programming Using Tasks with Effects, in particular, will require the ability to reify the effects, thus making them accessible to the scheduler. It seems like it would be fairly straightforward to extend the scheduler on that model to be NUMA-aware: it could examine the set of objects to be written and try to pick the best core to execute each task. However, there will be some practical challenges: the effects of a task only mention the data to be accessed, but not how many times it will be accessed. Furthermore, the size of each region is not specified, nor is the amount of data from a given region that will actually be touched.


So what does a more data-aware parallel model look like? I am not 100% sure, but I am starting to have some thoughts on the matter. This is going to be part pattern and part language design. It seems like the data structure which will need to be accessed in parallel needs to be encapsulated into an object. So, we might have a ParallelTable class or some such. This class will not expose its layout directly but will rather provide accessors for manipulating its contents; this ensures that the internal layout of the class can be changed at any time (perhaps even dynamically).

In fact, it seems best that the class does not even offer methods for directly accessing its fields. Rather, it offers the ability to spawn a task which will be given access to the fields of the class. This is more-or-less pursuant to Zoltan’s NUMA system. Processing an image in JavaScript would then involve an API something like:

img.process(function(view) {
    // closure that can work with the image, using the
    // `view` object

The image API would automatically invoke this closure over on the node where the image is stored. Of course, if you have to process two images, one will always be remote access. So for that you would end up writing code like:

img1.process(function(view1) {
    img2.process(function(view2) {
        // view1 is remote

APIs like RiverTrail also fit well into this paradigm. You have an array type but the user is encouraged to write transformations of this array as kernel functions applied over the array as a whole.