Vectors, strings, and slices
23 April 2012
We’ve been discussing a lot about how to manage vectors and strings in Rust. Graydon sent out an excellent proposal which allows for a great number of use cases to be elegant handled. However, I find the syntax somewhat misleading. I’ve proposed one alternative on the mailing list, but I now find I don’t like it, so I thought I’d brainstorm a bit and try to find something better.
There are really three use cases:
- vectors and strings, which are either allocated on the task heap
(
@
) or exchange heap (~
); - slices, which are a cheap, stack-bound way to represent subvecs and substrings;
- fixed-length vectors, which are mainly for C compatibility.
In this post I’m going to focus on the first two cases. The last use
case (fixed-length vectors) is, I think, quite distinct from the first
two, and we should separate it out. I have also omitted one use case
which Graydon’s proposal included: fixed-length strings. I don’t
think that the type str/10
is of much use, as it refers to a
by-value string that is always 10 characters long. This is not like
a fixed-length buffer that can hold strings up to 10 characters
long. Rather, as I understand it, it can only store strings of
exactly 10 characters. How often are we likely to want that?
Representation
The representation of a vector or string is something like:
struct rust_vec<T> {
int fill; // How many bytes are used
int alloc; // How many bytes are allocated
T data[0]; // Inline data
};
Note in particular that this structure does not have a fixed size. Rather, it will vary depending on how many items are present. This indirection is efficient but causes us a bit of trouble.
The representation of a slice which Graydon proposed is something like this:
struct slice<T> {
T *data;
int length;
};
Basically, the pair of a pointer and a length of memory.
As an aside, Fixed-length vectors, have yet a third representation:
T[N]
. In other words, just a C-like vector. So you can see that
slices, “vectors as a whole”, and fixed-length vectors are quite
different things to the compiler.
Proposal the first
One idea might be something like this:
Proposed Graydon Representation
[T] [T] slice<T>
vec<T> rust_vec<T>
@vec<T> [T]/@ rust_box<rust_vec<T>>*
~vec<T> [T]/~ rust_vec<T>*
substr str slice<char>
str rust_vec<char>
@str str/@ rust_box<rust_vec<char>>*
~str str/~ rust_vec<char>*
The literal forms would basically stay the same as they are today. So
[x1, x2]
has the type vec<T>
and "foo"
has the type str
.
I have intentionally drawn a big distinction between the type of a
slice ([T]
) and the type of a vector (vec<T>
). I think these
things are similar but different and people might be easily confused
if the notation is too similar.
There are two types here that cannot be expressed in the original
system: vec<T>
and str
. There is a good reason that these types
are inexpressible: they do not have a fixed size. So, allowing them
as types introduces a certain danger. For example, a function like
the following could not be compiled:
fn foo(x: @vec<T>) {
let y = *x;
...
}
After all, the size of the stack frame could not be correctly
calculated, it would depend on how much data was in x
. It is particularly
annoying to deal with this situation due to the possibility of writing
generic functions like
fn gen_foo<U>(x: @U) {
let y = *x;
...
}
There are two solutions to this, which are really the same solution in
different guises. The simplest solution is to say that type variables
cannot be bound to the types vec<T>
and str
. This would prevent
us from calling gen_foo()
with a vector or a string. We’d also have
some kind of special treatment around assignments so that let x = [1, 2, 3]
ends up with a slice, I guess. Have to think a bit about
that.
Alternatively, one could have a bound that indicates data of a known
size. This kind would be required to manipulate instances of T
by
value. But this could rapidly become annoying. You might prefer to
have the default be that types do have a known type and you have to
say when the type variable might not. This is also ok although
generally type bounds enable operators, not disable them.
Both solutions are somewhat annoying and I know that Graydon was trying to avoid them in his design.
Proposal the second
If we wanted to avoid the possibility of types whose size is not
known, then we have to take a different tack. We can’t have the @
be a prefix anymore. I’d still rather it come near the front of the
type, and not tacked on the end. So far, my preferred notation for
this is something like the following:
Proposed Graydon Representation
[]T [T] slice<T>
[@]T [T]/@ rust_box<rust_vec<T>>*
[~]T [T]/~ rust_vec<T>*
"" str slice<char>
"@" str/@ rust_box<rust_vec<char>>*
"~" str/~ rust_vec<char>*
Yes, that’s right, I just proposed using ""
as the way to write the
type for strings. Pretty wacky, I know. But it seems like we need a
type name that has two parts, a begin and an end, so that we can stick
the @
and ~
inside of them. An alternative might be to use more
words (str
vs tstr
vs ustr
or something).
The literal forms for vectors could be something like [@|...]
and
[~|...]
. I don’t know about strings.
What about fixed-length vectors?
I don’t know. We could do [N]T
, but then it kind of looks like a
slice. I personally lean towards something like T * N
. We also
need an expression form. To be honest, I don’t care about this that
much, it seems like macros could solve it too.
Summary…
None of these ideas seem perfect. I’m mostly tossing them out there to ensure we keep talking about it.
p.s., I just realized as I read over the post that I forgot about
mutability. Something like [mut T]
cannot be written as vec<mut T>
, at least not today. Sigh. Well, I’ll post this blog post
anyway. As I said, nothing here is perfect, just wanted to capture
some of the things I’ve been thinking about.