References

25 April 2012

I want to do an introduction to the regions system I’ve been working on. This is work-in-progress, so some of the details are likely to change. Also, I’m going to try some new terminology on for size: although it has a long history in the literature, I think the term “region” is not particularly accurate, so I am going to use the term “lifetime” or “pointer lifetime” and see how it fits.

In this post I’m just going to show some examples of how the new features can be used. In the next post, I’ll lift the curtain a bit and explain how the checks work.

Introduction

Rust has always (at least, as long as I’ve been around) had three sorts of pointers: @T, which is a task-local pointer into the heap; ~T, a unique pointer into the heap, generally (but not exclusively) used for sending data between tasks; and reference mode arguments, used to give a function a temporary pointer.

The goal of this work is to replace reference mode arguments with something more flexible. Reference mode arguments work quite well for many purposes, but they have one primary limitation: they cannot be stored into data structures.

So, in this branch, we (conceptually at least) remove reference mode arguments from the Rust Pointer Pantheon and replace them with reference types, written &T (this is actually a shorthand, as we will see later). I will refer to a variable of reference type as a reference.

References are basically generic pointers. They can point anywhere: into the stack, into the @ heap, into the ~ heap, even into the inside of a record or vector. They can point at anything that a C pointer could point at and can be used in many of the same ways; however, they are free from the errors that C permits. The type checker guarantees that references are always valid, so you can’t have a reference into freed memory, or into a stack frame that has been popped, and so forth.

We’ll get to the full details of how the safety check works later (probably in a separate post). First, I want to give some examples of using references.

Using references

Simple references and borrowing

Let’s create a record type point for use in our examples:

type point = { x: uint, y: uint };

Now, imagine that we have a function which wants to compute the slope of two points. It doesn’t particularly care where those points are allocated. You could write it like so:

fn slope(p1: &point, p2: &point) -> float {
    let y = (p2.y - p1.y) as float;
    let x = (p2.x - p1.x) as float;
    ret y / x;
}

OK, that was fairly straightforward. Now let’s look at how slope() might be called. First, assume that we have some routine which takes a vector of pairs of points allocated on the heap and computes the maximum slope of any of those pairs. Why you would want such a function, I don’t know, but this is how you would write it:

fn max_slope(ps: [(@point, @point)]) -> float {
    ps.max { |(p1, p2)| slope(p1, p2) }
}

You’ll notice that slope() is called with p1 and p2, which have type @point, not &point. The type checker happily accepts this, however, because a reference can point anywhere, including into the heap. This process of converting of kind of pointer to another is called borrowing.

The reason it’s called borrowing is that, in effect, the callee (slope()) borrows a reference from the caller (max_slope()). It is the caller’s job to ensure that this reference remains valid for the duration of the callee. In this case, there is no extra work required to make that true, but in some cases the compiler may be required to increment a ref count or maintain a GC root (this is highly dependent on how the @ heap is managed, naturally).

You can also borrow ~ pointers. This basically works the same as with @ pointers, except that the unique value cannot be moved away (for example, sent to another task) while it is borrowed. The reason for that is that, for the duration of the borrowing, the unique pointer is no longer unique. So if you sent it to another task, for example, then two tasks would have access to it. Even within a single task, if you gave the pointer away, then there would be multiple copies each claiming to be unique, which would lead to double frees and other badness. The key invariant that borrowing maintains is that, while a ~T may be temporarily aliased, all of the aliases are references, not other ~T pointers. So we can always identify the true owner once the borrowing expires.

Right now, borrowing can only occur in method calls. The borrowing lasts for the duration of the method call. In the future, borrowing will also be possible in alt expressions and when assigning a local variable with let. In the former case, the borrowing will last for the duration of the alt expression. In the latter case, the borrow will last until the local variable goes out of scope (until the end of the enclosing block, in other words).

Taking the address of local variables

Sometimes we wish to give away pointers into our local stack. For example, there is a routine today called vec::push(x, y) which has the effect of appending the value y onto the vector x (in place). This can be implemented using references like so:

fn push<T:copy>(v: &mut [T], elt: T) {
    *v = *v + [T];
}

Here the argument &mut [T] indicates a mutable reference: that is, a reference which can be used to modify the data it points at. The requirement to explicitly declare which pointers may be used for modification stems from Rust’s desire to make mutation explicit, and is analogous to the existing @mut T and ~mut T types.

To call push, we might write code like this:

fn accum() {
    let mut v = [1, 2, 3];
    vec::push(&v, 4);
    vec::push(&v, 5);
}

Here we used the & operator to take the address of a local variable so that we could pass it into the push() routine.

An aside: I believe that in the current implementation of the compiler you would have to write vec::push(&mut v, 4)—that is, you would have to declare when taking the address of v that you intend to mutate through this pointer. I believe there is no reason we can’t lift this restriction, however, and allow the compiler to figure it out for itself. (I rather prefer the explicit form in theory, because I like to make it clear when things are being modified, but I suspect it will be annoying in practice)

Copying into the stack

Right now, if you wish to create a record literal on the stack, you have to manipulate it by value. So you might write code like:

fn create_point() {
    let p1 = { x: 3u, y: 4u };
    let p2 = { x: 5u, y: 10u };
    let p3 = if cond {p1} else {p2};
    ...
}

Here the type of p{1,2,3} is point. But often we wish to manipulate values by pointer. In this case, that would make p3 a cheaper copy, for example. Using references, we can write something like this:

fn create_point() {
    let p1 = &{ x: 3u, y: 4u };
    let p2 = &{ x: 5u, y: 10u };
    let p3 = if cond {p1} else {p2};
    ...
}

Here we used the same & operator, but with an rvalue (an expression that is not assignable). This simply allocates space on the stack and copies the value into it. The corresponding type of p{1,2,3} would then be &point, where & is a reference into the stack of create_point().

Placing references into structures

Next let’s look at a case where we wish to store a reference into a structure. This example comes out of the Rust compiler, but it’s a common pattern in practice.

In the Rust compiler, there is a phase of processing called encode in which we generate the metadata for a compiled crate. During this encoding, we have a struct encode_ctxt that stores the various context which is required. Because this structure is only needed during this one phase, it is allocated on the stack, and we pass it from function to function using references (today, using a reference mode argument).

The code to create this encode context looks something like the following:

type encode_ctxt = { /* contents are not important */ };

fn begin_encoding(...) {
    let ecx = &{ /* allocate an encode context */ };
    for items_to_encode.each { |item|
        encode_item(ecx, item);
    }
}

Here you see that begin_encoding() creates a variable ecx, storing the data onto the stack. This context is then passed to each call to encode_item().

What can happen then is that some subpart of the encoding requires its own context. For example, in our metadata encoding, we sometimes have to serialize the AST for an inlinable function. This requires quite a bit more state, but it’s state that is specific to the inlining itself. So we can define a type inline_ctxt that will include both the encoding context ecx along with some other fields:

type inline_ctxt/& = {
    ecx: &encode_ctxt,
    ...
};

What you see here is that the type inline_ctxt is declared like any other record, but it has this /& following the name. This is a declaration that the type will contain references. The record itself then simply embeds the &encode_ctxt as any other field. Note: It’s possible that the /& might become inferred in the future rather than being explicit.

Now I can write functions that create and use the inlined context as follows:

 fn encode_inlined_item(ecx: &encode_ctxt, ...) {
     let icx = &{ecx: ecx, ...};
     ...
     some_helper_func(icx, ...);
     ...
 }

 fn some_helper_func(icx: &inline_ctxt, ...) {
     // ... can use icx, icx.ecx, etc ...
 }

References in boxes

In the previous example, we create a structure on the stack which contained a reference to some data living in an activation somewhere up the stack. It is also possible to place references into heap objects. For example, I could have allocated the inline_ctxt on the heap like so:

 fn encode_inlined_item(ecx: &encode_ctxt, ...) {
     let icx = @{ecx: ecx, ...};
     ...
     some_helper_func(icx, ...);
     ...
 }

 fn some_helper_func(icx: @inline_ctxt, ...) {
     // ... can use icx, icx.ecx, etc ...
 }

In this case, there is not really much reason to do this, as the lifetime of the inline_ctxt is bound to the stack frame that created it. But it can be convenient in a number of scenarios:

a long computation might make use of internal data that can be collected before the computation itself completes, and this internal data may need to contain references;
allocating values that you plan to return to your caller is most conveniently done with an @ pointer.

This last point is interesting. Basically, in most of our examples we’ve been allocating things on the stack—but you can’t return stuff that’s on your stack up to your caller, clearly (and if you try, in Rust at least, you’ll find that a type error results).

Arenas

One very common C trick for speeding up allocation is to make use of memory pools, also called arenas. If you happen to have a lot of allocations which you plan to do but which will all get freed at one point, then you can allocate a big block of memory and just hand it out piece by piece. Once the pass is done, you free the memory all at once. The key is that you never track whether an individual allocation has completed or not, so you avoid a lot of overhead. The problem with arenas is that, as typically implemented, they are unsafe, because you might free the arena but still hold on to pointers that point into the arena. This is where lifetimes come in.

Using a reference, we can allocate memory in arenas and be sure that the reference will not outlive the arena itself. For example, this function will allocate a new point in an arena and return it:

fn alloc_point(pool: &arena) -> &point {
    ret new (pool) { x: 3u, y: 4u };
}

In this case, the type checker will assign the allocated point the same lifetime as the arena itself. So the point can be used so long as the arena is valid.

Lifetimes

At this point, I’ve shown you a lot of examples of how references can be used, but I have given basically no intution for how it is that the compiler can prevent a reference from being used when it is no longer valid.

The basic idea is that every reference type &T is in fact shorthand for a type written &a.T, where a is some kind of lifetime. The lifetime of a reference defines when it is valid. These lifetimes correspond to the dynamic execution of some function, block, expression, whatever.

To make this clearer, let’s look at an example. Suppose I have this simple function. I have also shown the various lifetimes (named a…c) graphically along the right-hand side.

fn scoped_lifetimes(x: @uint) { // a
    let y = 3u;                 // |
                                // |
    if cond {                   // | b
        let z = 4u;             // | |
                                // | | c
        borrow(x) /* 1 */       // | | |
                                // | | -
    }                           // | -
}                               // -

fn borrow(x: &uint) {...}

There are three distinct lifetimes in the function scoped_lifetimes(), each nested within one another. The outermost one is a, which corresponds to the entire function activation. The expression &y, which takes the address of the local variable y, would have type &a.uint.

The next lifetime is b, which corresponds to the “then-block” of the if statement. The expression &z would have the type &b.uint, because after the if statement concludes the variable z is no longer in scope.

Finally, the lifetime c corresponds just to the call to borrow(x). Here, the variable x is coerced into a region pointer with lifetime &c.uint.

Now let’s examine borrow() a bit more closely. The definition of borrow is in fact shorthand for something like the following:

                         //  d
                         //  .
                         //  .
fn borrow(x: &d.uint) {  //  |
    ...                  //  |
}                        //  |
                         //  .
                         //  -

In other words, the &uint type we saw before in fact expands to a lifetime with a unique name; we’ll call this name d (in fact, all uses of & within the types of a function’s parameters or its return type are references to a special region called the anonymous region—it acts just like a named region, except that it doesn’t have a name).

The lifetime d is a bit different from the other lifetimes we’ve seen, as it appears within the function declaration itself: it is in fact a lifetime parameter. That is, it corresponds to some lifetime which the caller will specify—the callee, borrow() in this case, doesn’t know precisely how long the lifetime d lasts, it only knows that d includes the entire execution of the callee. I’ve tried to depict this in my ASCII art diagram using dots to represent the unknown duration, with the pipes | representing what is known for certain. In the call to borrow(x) which we saw before, the lifetime parameter d would be mapped to the lifetime c from scoped_lifetimes().

Detecting errors

The compiler uses these symbolic lifetimes to prevent problems. Consider something simple like this:

fn give_away() -> &uint {
    let y = 3u;
    ret &y;
}

Here there is an error because the function is attempted to return a pointer into its own stack frame. To see how the compiler detects this, consider the lifetimes involved:

                            // a
                            // .
                            // .
fn give_away() -> &a.uint { // | b
    let y = 3u;             // | |
    ret &y;                 // | |
}                           // | -
                            // .
                            // -

Here I have called the anonymous lifetime parameter a. The expression &y has type &b.uint, which does not match the expected type &a.uint, and so we get a type error. This type error is warning us that the lifetime of the pointer we are trying to return (b) is shorter than the lifetime which was declared (a).

Ta ta for now

There’s more to tell, but I’ll stop here, as this post is already plenty long.