Vectors, slices, and functions, oh my!
14 May 2012
I wanted to bring together the various ideas around vectors and function types into one post. The goals of these changes are
- to achieve orthogonality of the pointer types, so that leading
&
,@
, and~
sigils are the only way to indicate the kind of pointer that is in use; - to help pare down on the proliferation of subtle variantions on types, such as the 5 different function types currently available.
The proposal
The Rust type system would be described by the following grammar. In
this grammar, I have included all optional portions except for region
bounds. I indicated those types which could have a lifetime bound
associated with them by writing (/&r)
in the description (a lifetime
bound indicates the lifetime of any pointers embedded within the type
itself; this is not related to the changes I am discussing here so I
won’t go into detail):
M = mut | const | "" // immutable by default
K = send | copy | move | "" // move by default
T = () | int | uint | float | ... // scalar types
| @MT | ~MT | &r.MT | *MT // pointer types
| id<T*> // enum, resource, class (/&r)
| id // type variable
| [T] // slice type (/&r)
| substr // string slice type (/&r)
| (T*) // tuple type
| (T "*" N) // fixed-size array
| {(M id: T)*} // anonymous record types
| U // dynamically sized types
U = fn:K(T*) -> T // closure types (/&r)
| id:K<T*> // iface instances (/&r)
| vec<MT> // vector type
| str // string type
Dynamically sized types
The types described by U
are separated out because, unlike the other
types listed, they have “dynamic size”—that is, the size of an
instance of U
will vary from instance to instance. As a result, the
U
types are somewhat “second-class” when compared to the other
types:
- Type variables cannot be bound to dynamically sized types.
- Expressions whose type is a dynamically sized type are generally prohibited.
There is one exception to these rules. Literal expressions of
dynamically sized types are permitted, as the compiler can readily
compute their size. The literal forms of the various types U
are:
Type Literal form
---- ------------
fn:K(T*) -> T fn:K(x, y) -> T { ... }
fn:K(T*) -> T id (where id is a fn item)
id:K<T*> iface(v)
vec<MT> [M ...]
str "..."
If it seems useful, we could lift the restriction that type variables cannot be bound to dynamically sized types and instead use some sort of kind to mark variables that may accept dynamically sized types (or to mark those that may not, depending on what we feel the defaults ought to be).
Vectors and slices
These basically work in the same way as currently proposed, but the
syntax has changed. A vector is written vec<T>
; note that unlike
other types, vector type parameters have a mutability, so you might
have vec<mut u8>
, for example.
Slices can be created by using the [:]
operator, which works just as
in Python. So [1, 2, 3][-1:]
returns a slice containing [3]
,
[1, 2, 3][1:-1]
returns [2]
and [1, 2, 3][2:1]
returns an empty
slice. The slice operator can be applied to both vectors and slices.
We could conceivably allow it to be overloaded.
Vectors may be added to slices. The type of the resulting vector is
taken from the left-hand side. So adding a @vec<mut u8>
to a [u8]
yields a (longer) @vec<mut u8>
.
Fixed-length arrays
The type (T * N)
represents a fixed-length array. Here T
is
another type and N
is a constant expression. This is primarily
intended for C compatibility: a fixed-length array has no length field
and is simply represented by N
instances of the type T
laid out
one after the other. In most ways it is precisely equivalent to a
tuple. There is no literal form for such arrays: they are in fact
supertypes of tuples of equivalent size, and so share the tuple syntax
(v1, ..., vN)
. We can introduce a macro for repeating the same
element N
times to avoid repetition.
Fixed-length arrays are indexable and sliceable but their contents are not modifiable. If modification is desired one can create a simple one-entry record.
Note: I think this idea of having fixed-length arrays and tuples be closely related makes sense. I’m mostly trying to keep things simple and not introduce too much machinery for an edge case. But maybe there is a problem with it.
Closure and iface instance bounds
Both the closure and iface instance types feature a bound K
called a
“kind bound”. These types are unlike the other types because they are
“opaque” to the compiler: that is, the compiler does not know the
types of the data that is contained within. The bound K
puts a
restriction on those types so that additional operations can be
permitted.
For example, if you have a closure x
of type fn:send()
, then the
compiler knows that whatever data is closed over by x
is sendable.
The compiler can therefore permit x
to be sent between tasks.
Similarly the iface type to_str:send
describes an instance of some
type which is sendable and implements the to_str
interface.
If no bound is specified, the default is move
(which is the most
general). This simply states that the closure may close over
arbitrary data.
As today, the “sugared closure” form {|x| ...}
would be inferred to
some form of “pointer to closure”. That is, it could result in @fn
,
~fn
or &fn
, depending on the expected type.
There is no type that represents a “closure that accesses variables by
reference and not by copy”. Sugared closures become the only way to
construct such a closure: if the expected type is &fn()
they will
construct a “access-by-reference” closure and the if the expected type
is @fn()
or ~fn()
they will not. This seems non-ideal but
equivalent to the situation today.
Bare functions
The type of “bare functions” (that is, function items which do not
close over anything) is simply fn:send(T*) -> T
. To use a bare
function as a closure, you must prefix the bare function with an
appropriate sigil (&
, @
, or ~
).
For example, the following snippet uses a function inc()
as the
argument to vec::map
:
fn inc(x: int) -> int { x + 1 }
fn inc_all(vs: [int]) -> [int] {
vec::map(vs, &inc)
}
Here the expression &inc
has the (fully elaborated) type
&static.fn:send(int) -> int
. This is a subtype of the expected type
that vec::map
requires: &fn:move(int) -> int
.
To send a bare function between tasks you might write:
fn task_body() { ... }
fn spawn_task() {
task::spawn(~task_body)
}
Representing closures
The representation of closures will change somewhat. Before a closure was the pair of a function pointer with an environment pointer. Now a closure will be a pointer to a structure like:
struct closure {
void *fptr,
type_desc *td,
... // (closed over data)
};
I thought at first that LLVM might not be able to track a function pointer in this case, but experiments suggest that it can. In general, LLVM does a good job of tracking and constant propagating alloca’d memory with precision.
Conceivably, it might be slower to perform an extra load before the
call (that is, to call c->fptr
and not c.fptr
) if the closure
pointer is not in cache. But reasoning about the cache without
experimenting is always risky, and this particular load seems unlikely
to matter, as the closure will soon be accessing its environment, and
in that case you’d have to bring the environment into cache anyhow.
There is one unambiguous, if minor, downside. In the old scheme, bare
functions used as a closure of type fn@
or fn~
could pair the
function pointer with a NULL environment. But in this new scheme a
@fn
or ~fn
will require allocation, because the runtime will
expect to be able to free such pointers like any other pointer. Such
pointers are quite rare though compared to &fn
types, and something
like &inc
can be pre-allocated statically (actually we could use
tagged pointer tricks, I suppose, but it doens’t seem worthwhile).
Pros and cons
To me the real question is whether the system feels simpler on net given the introduction of dynamic size types. I think it does, but obviously this is a subjective question. To me, the benefits of the following are pretty substantial:
- one function type instead of five;
- types like
@fn(int) -> uint
and@vec<int>
seem to have a clear meaning once you are accustomed to@
meaning pointer, vsfn@(int) -> uint
and[int]/@
; - the difference between vectors and slices (and strings and substrings) is clear, currently I think there is plenty of room for confusion.