Rust pattern: Iterating an over a Rc<Vec<T>>
2 September 2018
This post examines a particular, seemingly simple problem: given
ownership of a Rc<Vec<u32>>
, can we write a function that returns an
impl Iterator<Item = u32>
? It turns out that this is a bit harder
than it might at first appear – and, as we’ll see, for good
reason. I’ll dig into what’s going on, how you can fix it, and how we
might extend the language in the future to try and get past this
challenge.
The goal
To set the scene, let’s take a look at a rather artifical function
signature. For whatever reason, this function has to take
ownership of an Rc<Vec<u32>>
and it wants to return an impl Iterator<Item = u32>
1 that iterates over that vector.
fn iterate(data: Rc<Vec<u32>>) -> impl Iterator<Item = u32> {
... // what we want to write!
}
(This post was inspired by a problem we hit in the NLL working group. The details of that problem were different – for example, the vector in question was not given as an argument but instead cloned from another location – but this post uses a simplified example so as to focus on interesting questions and not get lost in other details.)
First draft
The first thing to notice is that our function takes ownership of a
Rc<Vec<u32>>
– that is, a reference counted2 vector of
integers. Presumably, this vector is reference counted because it is
shared amongst many places.
The fact that we have ownership of a Rc<Vec<u32>>
is precisely
what makes our problem challenging. If the function were taking a
Vec<u32>
, it would be rather trivial to write: we could invoke
data.into_iter()
and be done with it (try it on
play).
Alternatively, if the function took a borrowed vector of type
&Vec<u32>
, there would still be an easy solution. In that case, we
couldn’t use into_iter
, because that requires ownership of the
vector. But we could write data.iter().cloned()
–
data.iter()
gives us back references (&u32
) and the
cloned()
adapter then “clones” them to give us back a u32
(try it on play).
But we have a Rc<Vec<u32>>
, so what can we do? We can’t invoke
into_iter
, since that requires complete ownership
of the vector, and we only have partial ownership (we share this
same vector with whoever else has an Rc
handle). So let’s try using
.iter().cloned()
, like we did with the shared reference:
// First draft
fn iterate(data: Rc<Vec<u32>>) -> impl Iterator<Item = u32> {
data.iter().cloned()
}
If you try that on playground, you’ll find you get this error:
error[E0597]: `data` would be dropped while still borrowed
--> src/main.rs:4:5
|
4 | data.iter().cloned()
| ^^^^ borrowed value does not live long enough
5 | }
| - borrowed value only lives until here
|
= note: borrowed value must be valid for the static lifetime...
This error is one of those frustrating error messages – it says exactly what the problem is, but it’s pretty hard to understand. (I’ve filed #53882 to improve it, though I’m not yet sure what I think it should say.) So let’s dig in to what is going on.
iter() borrows the collection it is iterating over
Fundamentally, the problem here is that when we invoke iter
,
it borrows the variable data
to create a reference (of type &[u32]
).
That reference is then part of the iterator that is getting returned.
The problem is that the memory that this reference refers to is owned
by the iterate
function, and when iterate
returns, that memory will
be freed. Therefore, the iterator we give back to the caller will refer
to invalid memory.
If we kind of ‘inlined’ the iter
call a bit, what’s going on would look like this:
fn iterate(data: Rc<Vec<u32>>) -> impl Iterator<Item = u32> {
let iterator = Iterator::new(&data); // <-- call to iter() returns this
let cloned_iterator = ClonedIterator::new(iterator); <-- call to cloned()
cloned_iterator
}
Here you can more clearly see that data
is being borrowed in the
first line.
drops in Rust are deterministic
Another crucial ingredient is that the local variable data
will be
“dropped” when iterate
returns. “Dropping” a local variable means
two things:
- We run the destructor, if any, on the value within.
- We free the memory on the stack where the local variable is stored.
Dropping in Rust proceeds at fixed point. data
is a local variable,
so – unless it was moved before that point – it will be dropped when
we exit its scope. (In the case of temporary values, we use a set of
syntactic rules to decide its scope.) In this case, data
is a
parameter to the function iterate
, so it is going to be dropped when
iterate
returns.
Another key thing to understand is that the borrow checker does not
“control” when drops happen – that is controlled entirely by the
syntactic structure of the code.3 The borrow checker then comes after
and looks to see what could go wrong if that code were executed. In
this case, it seems that we have a reference to data
that will be
returned, but – during the lifetime of that reference – data
will
be dropped. That is bad, so it gives an error.
What is the fundamental problem here?
This is actually a bit of a tricky problem to fix. The problem here is
that Rc<Vec<u32>>
only has shared ownership of the Vec<u32>
within – therefore, it does not offer any API that will return you a
Vec<u32>
value. You can only get back &Vec<u32>
values – that is,
references to the vector inside.
Furthermore, the references you get back will never be able to
outlive the Rc<Vec<u32>>
value they came from! That is, they will
never be able to outlive data
. The reason for this is simple: once
data
gets dropped, those references might be invalid.
So what all of this says is that we will never be able to return an
iterator over data
unless we can somehow transfer ownership of
data
back to our caller.
It is interesting to compare this example with the alternative signatures we looked at early on:
- If
iterate
took aVec<u32>
, then it would have full ownership of the vector. It can useinto_iter
to transfer that ownership into an iterator and return the iterator. Therefore, ownership was given back to the caller. - If
iterate
took a&Vec<u32>
, it never owned the vector to begin with! It can useiter
to create an iterator that references into that vector. We can return that iterator to the caller without incident because the data it refers to is owned by the caller, not us.
How can we fix it?
As we just saw, to write this function we need to find some way to
give ownership of data
back to the caller, while still yielding up
an iterator. One way to do it is by using a move
closure, like so
(playground):
fn iterate(data: Rc<Vec<u32>>) -> impl Iterator<Item = u32> {
let len = data.len();
(0..len).map(move |i| data[i])
}
So why does this work? In the first line, we just read out the length
of the data
vector – note that, in Rust, any vector stored in a
Rc
is also immutable (only a full owner can mutate a vector), so we
know that this length can never change. Now that we have the length
len
, we can create an iterator 0..len
over the integers from 0
to len
. Then we can map from each index i
to the data using
data[i]
– since the data inside is just an integer, it gets copied
out.
In terms of ownership, the key point is that here the closure is
taking ownership of data
. The closure is then placed into the
iterator, and the iterator is returned. So indeed ownership of the
vector is passing back to the caller as part of the iterator.
What about if I don’t have integers?
You could use the same trick to return an iterator of any type, but you must be able to clone it. For example, you could iterate over strings (playground):
fn iterate(data: Rc<Vec<String>>) -> impl Iterator<Item = String> {
let len = data.len();
(0..len).map(move |i| data[i].clone())
}
Why is it important that we clone it? Why can’t we return references?
This falls out from how the Iterator
trait is designed. If you look
at the definition of iterator, it states that it gives ownership
of each item that it iterates over:
trait Iterator {
type Item;
fn next<'s>(&'s self) -> Option<Self::Item>;
// ^^ This would normally be written
// `&self`, but I'm giving it a name
// so I can refer to it below.
}
In particular, the next
function borrows self
only for the
duration of the call to next
. Self::Item
, the return type, does
not mention the lifetime 's
of the self reference, so it cannot
borrow from self
. This means that I can write generic code where we
extract an item, drop the iterator, and then go on using the item:
fn dump_first<I>(some_iter: impl Iterator<Item = I>)
where
I: Debug,
{
// Get an item from the iterator.
let item = some_iter.next();
// Drop the iterator early.
std::mem::drop(some_iter);
// Keep using the item.
println!("{:?}", item);
}
Now, imagine what would happen it we permitted the closure to
return move |i| &data[i]
and we then passed the resulting iterator
to dump_first
:
- We would first extract a reference into
data
and store it initem
. - We would then drop the iterator, which in turn would drop
data
, potentially freeing the vector (if this is the lastRc
handle). - Finally, we would then go on to use
item
, which has a reference into the (now possibly freed) vector.
So, the lesson is: if you want to return an iterator over borrowed
data, per the design of the Iterator
trait, you must be iterating
over a borrowed reference to begin with (i.e., iterate
would need
to take a &Rc<Vec<u32>>
, &Vec<u32>
, or &[u32]
).
How could we extend the language to help here?
Self references
This is an interesting question. If we focus just on the original
problem – that is, how to return an impl Iterator<Item = u32>
–
then most obvious thing is the idea of extending the lifetime system
to permit “self-references” – for example, it would be nice if you
could have a struct that owns some data (e.g., our Rc<Vec<u32>>
) and
also had a reference into that data (e.g., the result of invoking
iter
). This might allow us a nicer way of writing the solution to
our original problem (returning an impl Iterator<Item = u32>
). In
particular, what we effectively did in our solution was to use an
integer as a kind of “reference” into the vector – each step, we
index again. Since indexing is very cheap, this is fine for iterating
over a vector, but it wouldn’t work with (say) a Rc<HashMap<K, V>>
.
My personal hope is that once we wrap up work on the MIR borrow-checker (NLL) – and we are starting to get close! – we can start to think about self-references and how to model them in Rust. I’d like to transition to a Polonius-based system first, though.
Auxiliary values
Another possible direction that has been kicked around is having some
way for a function to return data that its caller must store, which
can then be referenced by the “real” return value. The idea would be
that iterate
would somehow “store” the Rc<Vec<u32>>
into its
caller’s stack frame, and then return an iterator over
that. Ultimately, this is very similar to the “self-reference”
concept: the difference is that, with self-references, iterate
has
to return one value that stores both the Rc<Vec<u32>>
and the
iterator over it. With this “store data in caller” approach, iterate
would return just the iterator, but would specify that the iterator
borrows from this other value (the Rc<Vec<u32>>
) which is returned
in a separate channel.
Interestingly, this idea of returning “auxiliary” values might permit
us to return an iterator that gives back references – even though I
said that was impossible, per the design of the Iterator
trait. How
could that work? Well, the problem fundamentally is that we want a
signature like this, where the iterator yields up &T
references:
fn iterate<T>(data: Rc<Vec<T>>) -> impl Iterator<Item = &T>
Right now, we can’t have this signature, because we have no lifetime
to assign to the &T
type. In particular, the answer to the question
“where are those references borrowing from?” is that they are
borrowing from the function iterate
itself, which won’t work (as
we’ve seen).
But if we had some “auxiliary” slot of data that we could fill and then reference,
we might be able to give it a lifetime – let’s call it 'aux
. Then we could
return impl Iterator<Item = &'aux T>
.
Anyway, this is just wild, irresponsible speculation. I don’t have concrete ideas for how this would work4. But it’s an interesting thought.
Discussion
I’ve opened a users thread to discuss this blog post (along with other Rust pattern blog posts).
Footnotes
This just means it wants to return “some iterator that yields up
u32
values”. ↩︎Also worth nothing: in Rust, reference counted data is typically immutable. ↩︎
In other words, lifetime inference doesn’t affect execution order. This is crucial – for example, it is the reason we can move to NLL without breaking backwards compatibility. ↩︎
In terms of the underlying semantics, though, I imagine it could be a kind of sugar atop either self-references or [out pointers]. But that’s sort of as far as I got. =) [out pointers]: https://internals.rust-lang.org/t/thoughts-about-additional-built-in-pointer-types/959 ↩︎