Treating vectors like any other container

14 November 2013

Some astute comments on a recent thread to rust-dev got me thinking about our approach to vectors. Until now, we have focused on having built-in support for vectors via the vector type (~[T]) and slice types (&[T]). However, another possible approach would be to move vector support out of the language (almost) entirely and into standard libraries. I wanted to write out a post exploring this idea; I find it brings some simplifications and reduces the need for DST. Seems like an idea worth considering. Consider this a thought experiment, not exactly a proposal.

Summary

In a nutshell, the idea is to change the following types:

Current          Proposed

~[T]        ==>  Vector<T>
&'a [T]     ==>  Slice<'a, T>
&'a mut [T] ==>  MutSlice<'a, T>
~str        ==>  String
&'a str     ==>  Substring<'a>

Vector, Slice, etc would all be normal types defined in the standard library, though they would also be lang items (meaning that the compiler knows about them). It’s possible that we could define the type [T] as sugar for Vector<T>, but to keep things clearer I’m going to ignore this for now and just use explicit type names.

This approach has some interesting implications. For one thing, it sidesteps the need for DST when it comes to vectors. Instead, vectors work just like existing collections like HashMap work: the collection (Vector<T>, in this case) specifies the ownership internally. To put it another way, the only kind of vector is ~[T] (those of you who have been following Rust for more than the last year or so will find this sounds familiar).

If you want to pass around a vector by reference, you’d have two choices. You could pass a &Vector<T> around, but that’s not quite as good as passing a Slice<T> (in both cases there is a lifetime parameter that I have omitted since you wouldn’t typically need to write it). It will be easy, and perhaps automatic like today, to coerce a Vector<T> into a Slice<T> (more on that later).

Vector literal syntax

Right now, if you write ~[1, 2, 3] you get a vector allocator on the heap. If you write [1, 2, 3] you get a fixed-size vector ([int, ..3]). If you write &[1, 2, 3] you get an explicit slice – which is basically a fixed-size vector, stored on the stack, and then immediately coerced to a slice. Given the automatic coercion rules, &[...] is not a particularly useful expression, and I generally just write [...] and let the coercion rules take care of the rest.

If vectors move into the library, then all this syntax no longer works. It’s not clear to me what we should replace it with. One possibility is to take the “generalized literal” approach, which is similar to how C++11 handles things, and also similar to how Haskell handles integers. This would mean that you would define a builder trait, probably something like:

trait Builder<T> {
   fn new(length: uint) -> Self;
   fn push(&mut self, t: T);
   fn finish(self) -> Self;
}

A literal like [1, 2, 3] would basically be desugared into some code like:

{
    let mut builder = Builder::new(3);
    builder.push(1);
    builder.push(2);
    builder.push(3);
    builder.finish()
}

This would rely on type inference to select the appropriate type of container.

(In fact, we already have something very much like builder, called FromIterator:

pub trait FromIterator<A> {
    fn from_iterator<T: Iterator<A>>(iterator: &mut T) -> Self;
}

So it might be that we just use FromIterator instead, though it makes it somewhat less obvious precisely what [1, 2, 3] desugars into. Or perhaps we replace FromIterator and just make collect use something like builder.)

Taking this approach would mean that we could easily define new vector types and have them fit in seamlessly. We could also use the same trait to have hashmap literals, though if we left the syntax as is, you’d have to write something like:

let mut map = [(key1, value1), (key2, value2)];

But it’d be easy enough to add some sugar where expr1 -> expr2 is equivalent to (expr1, expr2), which would permit:

let mut map = [key1 -> value1, key2 -> value2];

(Languages like Scala, Smalltalk, or Haskell, where operators can be freely defined, often play tricks like this.)

Slicing

In the DST proposal, slicing and borrowing are the same operation. That is, given a variable x of type ~[T], one could write &*x to obtain a slice of type &[T]. This also implies that any autoborrowing rules we have extend naturally to autoslicing. (Note that this also implies, I believe, that if we permit overloading the Deref operator, then random types can also make themselves coercable to a slice by implementing a Deref that yields [T].)

With Vector<T>, that won’t work, so we must consider the question though of how and when slices are created, and in particular how to preserve the current behavior of coercing vectors into slices when the vector appears as a method receiver or function call argument (it may be that we don’t want to preserve this behavior, but it’s good to know how we would so – I’m actually not the biggest fan of autoborrowing for function call arguments, though I introduced the idea).

My rough idea is to make slicing something that is more overloadable, and hence something that user-defined types might tie into. It is essentially the “multi-object” version of overridable deref (subject of another upcoming blog post). So where a smart pointer gets deref’d into an &T (borrowed pointer to a T), a collection can get sliced into a Slice<T> (borrowed pointer to multiple T values).

We might imagine defining two traits:

trait Dice<T> { // "It slices! It dices!"
    fn slice<'a>(&'a self) -> Slice<'a, T>;
    fn slice_from<'a>(&'a self, from: uint) -> Slice<'a, T>;
    fn slice_to<'a>(&'a self, to: uint) -> Slice<'a, T>;
    fn slice_between<'a>(&'a self, from: uint, to: uint) -> Slice<'a, T>;
}

trait MutDice<T> {
    fn mut_slice<'a>(&'a mut self) -> MutSlice<'a, T>;
    fn mut_slice_from<'a>(&'a mut self, from: uint) -> MutSlice<'a, T>;
    fn mut_slice_to<'a>(&'a mut self, to: uint) -> MutSlice<'a, T>;
    fn mut_slice_between<'a>(&'a mut self, from: uint, to: uint) -> MutSlice<'a, T>;
}

(Incidentally, we could also add a slice operator like v[a..b] that corresponds to the Dice trait. But that’s neither here nor there.)

It is not clear to me just how much coercion we want to preserve. For arguments, we’d check for formal arguments of type Slice<U> where the actual argument is of some type T != Slice and we’d search for an implementation of Dice<U> for T (an analogous process works for MutSlice).

Method lookup would work in a similar way. We would autoderef until we reached a point where we don’t know how to deref the receiver type R. At the time, we would search for an implementation of Dice<U> for R (where U is a fresh variable). If we find one, we can then search for methods on Slice<U>. Similarly we can search for MutDice<U>. I don’t want to go into details here because I have an in-progress post talking about adding a user-defined deref that covers many of the same issues. It’s a bit messy but I think it can work, and moreover if we’re going to be supporting user-defined deref, the process will already largely be in place, and this would just add on to that.

Compile-time constants

Whenever we start discussing moving language features into libraries, it raises the specter of how to deal with compile-time constants. Right now, one can place a static array in a compile-time constant easily enough:

static FIRST_FEW_PRIMES: &'static [uint] = &[1, 2, 3, 5];

It’s not clear how this would work under a builder scenario, nor with slicing being a trait. (Enqueue requests for compile-time functional evaluation and constexprs.)

Fixed-length arrays

Similarly to compile-time constants, fixed-length arrays currently occupy something of a magic territory. This is due to our inability to generate impls that are parameterized by an integer. This means that they do not fit well with generic Builder and Dice traits and so on. Incidentally, fixed length arrays are naturally the only kind we can really generate on the stack or at compile-time, so they are sort of a primitive notion to begin with. (Enqueue requests for impls parameterized by integers, which actually seems fairly doable to me, at least in a limited sense.)

Match patterns

Currently we have pattern matching on vectors, so you can write something like match vec { [a, b, ..c] => ... }. This doesn’t work well if vectors aren’t built into the language. This syntax isn’t that widely used so I think we’d just have to drop it. Note that similar complications arise when trying to pattern match against smart pointer.

Another option is to have the compiler issue calls to the slice trait in order to process such a pattern, but there are several complications that arise:

  • Allowing user-defined code to be invoked as part of pattern matching exposes evaluation order and adds complications for refutable patterns. This might be more of a theoretical concern – perhaps just saying that it’s undefined when, if, and how often that code might be invoked is fine, but maybe it’s not. It’s at least kind of scary.

    • On a related note, right now, because match code is built into the compiler, we allow it to inspect freely without worrying about whether data is mutably borrowed or not. We know, after all, that during a match the only possibility for side effects is when the guard condition runs. If usre-defined code might run, though, we’d have to be more careful.
  • Currently you can move out of vectors that are being matched as well, but I can’t see how that would be possible if we build on the slicing mechanism.

UPDATE: After some discussion on IRC, I decided I overstated the difficulty here. The straightforward way to support user operators in patterns is just to fix the evaluation order as “first to last, depth first”. We would have a naive code generation pass as a fallback. In the event that no slice patterns are found, we can try something more optimized. This is a good way to structure pattern testing in any event; the current code tries to handle all cases in an optimal way, and as a result is quite complex, it’d probably be simpler if we limited it to the easier (and more frequent) cases.

Index operation

We currently support overloading the index operator but the trait is not defined correctly. Indexes are lvalues but this is not reflected in the trait definition, and hence if we do not change the trait we would not be able to write an expression like:

vec[3] = foo

Clearly a problem. Note that we will want to fix this regardless of what happens with the rest of the proposal.

I propose we change the Index trait to:

trait Index<I, E> {
    fn index<'a>(&'a self, index: I) -> &'a E;
}

Here you can see that index returns a borrowed pointer to the element. This allows us to write expressions like &vec[3]; the compiler will automatically insert a dereference if the expression is vec[3].

For mutable indexing, the trait would look as follows:

trait MutIndex {
    fn mut_index<'a>(&'a self, index: I) -> &'a mut E;
    fn assign_index(&'a mut self, index: I, value: E);
}

The first method would be used for an expression like &mut vec[3]. The second would be used for an expression like vec[3] = foo. The reason to separate the two is so that hashtables can use map[key] = index as the notation for inserting a key even if the key is not already present (this avoids the need to allocate a key-value pair with an uninitialized value).

Moves and indexing

One feature that works today but would not work with these index traits is moving out of a vector by index. However, as it turns out, I think that we won’t be able to support this long term in any case, so this is no loss. Let me briefly explain what I mean. Today, you can write code like:

let x = ~[~"Hello, ~"World"];
let y = x[0];

Right now the second assignment moves the string out of x[0] and thus renders the vector x completely inaccessible (because it’s 0th element is missing). (If x were a vector of integers or other types that do not need to be freed, of course, then x would remain accessible.)

Using the index trait, however, moves like that are not possible. This is because the access x[0] would be translated to (roughly) *x.index(0). x.index(0) would be returning a borrowed pointer to a string (&~str) and hence the move would be a move out of a borrowed pointer, which is not permitted.

Nonetheless, I think that these sorts of moves will have to become illegal anyhow as part of issue 5016. The reason is that they are currently implementing by zeroing out x[0]. But we plan to stop doing that and instead have the compiler track precisely which things on the stack need to be freed. We can’t do that and permit vector indices. But this is no great loss; moving out of a vector will be accomplished through methods like pop() etc.

Conclusion

I’ve outlined the key interactions I thought of. Here is a kind of summary of the effects. I don’t feel “concluded” right now, in the sense that I don’t know which approach I prefer. I am worried that we must resolve the niggly details of compile-time constants and fixed-length vectors before we could adopt this approach.

I don’t feel that either has the advantage of being conceptually cleaner or simpler. This approach makes vectors much more like all other collections, which is nice; a DST-based approach permits generalizing “pointer-to-one” (&T, ~T) to “pointer-to-many” (&[T], ~[T]) and brings them both under the same framework.

Advantages

Simpler type system. The type system is certainly simplified by this change. We move vectors out of the language core and into a library.

Reduced need for DST. There may still be some need for DST as part of dealing with objects (I’ll cover this in a later post), but this will arise less frequently. DST has a few sharp edges of its own – mostly annotation burden – so this is perhaps good. Because it will be less important, this might in turn mean that DST can be deferred till post 1.0.

More extensible. Allowing other kinds of vectors to opt into slicing in particular helps to put collections on common footing. However, in a DST world, extensible slicing can also be achieved by having types overload the deref operator to yield [T].

Literal notation. Flexible literal notation is neat and something people commonly request. Of course, it’s worth pointing out that it can be achieved today using macros. For example, seq!(a, b, c) would expand to the same code as above, and perhaps map!(k1 -> v1, k2 -> v2) could cover the map case. Macros would be much simpler on the compiler but often feel second class (queue Dave saying “I told you so”), though as we use them more and more commonly for fairly fundamental things like fail! and assert! perhaps this is no longer true.

Disadvantages

Compile-time constants and fixed-length arrays. It’s unclear to what extent we can make these work, at least without implementing other features we might prefer to hold off on.

Orthogonality of allocation method. One thing we give up with this approach is the ability to have vectors whose storage is managed by a smart pointer. Using a DST-like approach, it’s at least plausible to have a type like Gc<[uint]>. Using this library based approach, we’d either need to make a GcVector type, or else use GC<Vector<uint>>.

More type hints. Building on type inference will mean that sometimes you have to give more type hints than today.

Notation. Slice<T> is significantly longer than &[T]. One could imagine making the type [T] be sugar for Slice<T>, but then one must accommodate the region (['a T?) and perhaps mutability (['a mut T]?). This problem was solved neatly by DST.

Vector match patterns. There seems to be no smooth way to integrate this current feature.