Regions-lite...ish
15 February 2012
I was talking to brson today about the possibility of moving Rust to a regions system. He pointed out that the complexity costs may be high. I was trying to make a slimmer version where explicit region names were never required. This is what I came up with. The truth is, it’s not that different from the original: adding back region names wouldn’t change much. But I’m posting it anyway because it includes a description of how to handle regions in types and I think it’s the most complete and correct proposal at the moment.
The summary
You would have four kinds of Rust pointers:
@MT --- pointers to task-local, boxed data
~MT --- pointers to unique data
&MT --- safe references
*MT --- unsafe, C-like pointers
Here the M
refers to a mutability qualifier (default, mut
, or
const
) and T
refers to a type.
&MT
types, called references, are the new addition. A reference is
a pointer which always points at memory whose validity is guaranteed
by some outer stack frame. The idea is that a caller can give a
callee a reference to some memory that the callee may use but which
may not escape the callee. This memory may be on the caller’s stack
frame or it may be a reference into the task or exchange heaps which
the caller is going to keep valid. This guarantee is upheld by the
type system.
Reference types may appear anywhere. However, if they are used within
another aggregate type such as a record, enum, or class, they “infect”
their container so that it too is considered to be a reference. This
is done by introducing a new kind into the type system, ref
(actually, this is sort of a negative kind: more formally, there is a
kind heap
which contains all types but for those that may
transitively include a reference). This kind may or may not be user
visible: see the section on generics for a discussion of the options.
Coercion between pointer types
The type &MT
is not a supertype of @MT
and ~MT
, but it is
coercable. In the case of @
, we could probably make it a true
subtype, but at the moment a box pointer includes a header, ref count,
etc and so is not binary compatible with a &
pointer, which would be
just a pointer to the box body. If we changed our representation so
that @
pointers point directly to the box and the header is stored
at a negative offset, then we could allow @T
to be a subtype of
&T
.
The type ~MT
, however, can never be a subtype. ~
is not a region.
Rather, the data at the other end of the pointer logically belongs to
a region of its own. So we can allow ~MT
to be coerced to &MT
,
but the region will be a fresh region, and access to the ~MT
pointer
must be prevented for the scope of that fresh region. This is called
“borrowing” a unique pointer. It is only possible for “unique paths”,
where a “unique path” is a path of identifiers a.b.c...z
that is the
only path by which the unique variable can be reached (in practice,
this means that a
must be a local variable and all of the fields
b...z
must have unique or interior type). All of the prefixes of
the unique path must be considered borrowed as well. I am not going
into great detail on the handling of uniques here: it should be quite
similar to what we have today in practice.
Tracking validity of references
Although the user never needs to write it explicitly, each instance of
a reference type is internally associated with a region. There is one
region for every block in the code. In addition, each function/method
has a special region called caller
. For simplicity I do not
consider classes nor impls; it is relatively straightforward to extend
the system to such cases.
Regions are arranged into a tree derived from the structure of the
blocks in the source code. The region caller
is a superregion of
all the internal regions to a function.
In the implementation / formal version of the type system, these
regions are represented explicitly. So a user-written type &MT
expands to a type r&MT
where r
is the node id of the block or of
the function itself (in the case of the caller
region). The region r
is derived from the position where &MT
appears and by inference:
- if
&MT
appears within a parameter list,r
is thecaller
region. - if
&MT
appears on the type of a local variable, inference is used. - if
&MT
appears in a type declaration, see section below.
In general, the type a&T
is a subtype of b&T
if b
is a subregion
a
. The reason is that, because a
is a superregion of b
, the
pointer a&T
is always valid whenever the region b
is valid.
References in type declarations
The rules for which region is assigned when the user writes &MT
omitted one important case: what happens when this type appears in a
type declaration? Consider the following example:
type crate_ctxt = {
mut_map: &map<...>,
node_map: &map<...>,
another_map: &map<...>,
yet_another_map: &map<...>
};
In such cases, the region for the internal references will be assigned when the type is used. For example:
fn trans_foo(ccx: &crate_ctxt) {...}
Here, the type of ccx
will be expanded to:
caller&{mut_map: caller&map<...>, node_map: caller&map<...>, ... }
In effect, types which contain references (transitively) are implicitly parameterized by a region parameter. There is only one such parameter. When the type is instantiated in a specific context, the value for that parameter is provided based on the context.
Taking the address of variables and so forth
The unary operator &M
can be be used with both lvalues and rvalues.
When used with an lvalue, it takes the address of the lvalue. The
mutability qualifier provided must agree with the mutability of the
lvalue. When used with an rvalue, it creates temporary space on the
stack and copies the rvalue into it.
Here is an example of taking the address of lvalues:
fn foo() { // region for this block is "r"
let x = 3;
let mut y = 4;
let px1 = &x; // OK: yields type r&int
let px2 = &const x; // OK: yields type r&const int
let px2 = &mut x; // Error: x is immutable
let py1 = &y; // Error: y is mutable.
let py1 = &const y; // OK: yields type r&const int
let py1 = &mut y; // OK: yields type r&mut int
}
Here is an example of taking the address of rvalues:
fn foo() { // region for this block is "r"
let p1 = &{x: 3, y: 4}; // OK: yields type r&{x:int,y:int}
let p2 = &mut {x: 3, y: 4}; // OK: yields type r&mut {x:int,y:int}
}
Limitations on references
In order to guarantee that reference types do not escape the callee, the type system imposes some limitations:
- Reference types may not be returned.
- Reference types may not be closed over (copied/moved into a closure or interface instance).
- Generic type variables cannot be bound to reference types unless
the generic type variable is of the
ref
kind.
I will cover each restriction in turn. First, though, I want to more precisely define what the type checker considers to be a reference type. The definition is inductive:
- a reference
&MT
; - a type whose definition may contain a reference (e.g.,
@&T
or{x: &T}
, or a class with a field of reference type); - a generic variable with bound
ref
.
Reference types may not be returned
The danger here is that the callee may pass back a reference to the caller that is no longer valid. This is relatively straightforward to prevent: do not allow the return type of a function to be a reference type.
Reference types may not be closed over
It is not allowed to copy a reference type into a boxed/unique closure
nor is it allowed to cast a reference type to a boxed or unique iface.
The reason is that these are the points where the type system loses
the ability to track the constituent types and so we cannot
distinguish a fn@()
that closes over a reference type from other
fn@()
types.
Generic types
There is of course a concern that the limitations on reference types
might be circumvented through the use of generics. This is prevented
through the use of a type kind ref
. A generic type variable may not
be bound to a reference type unless it includes the bound ref
.
Moreover, any generic type variables bound by ref
are considered
reference types and therefore must obey the above restrictions.
A note on variance
In general, ptr types like &MT
or @MT
are covariant in T if M
is
not mut
. This is different from references today which are always
covariant in T; the current behavior is what leads to the type hole
pointed out in the mailing list.