On reference-counting and leaks
29 April 2015
What’s a 1.0 release without a little drama? Recently, we discovered that there was an oversight in one of the standard library APIs that we had intended to stabilize. In particular, we recently added an API for scoped threads – that is, child threads which have access to the stack frame of their parent thread.
The flaw came about because, when designing the scoped threads API, we failed to consider the impact of resource leaks. Rust’s ownership model makes it somewhat hard to leak data, but not impossible. In particular, using reference-counted data, you can construct a cycle in the heap, in which case the components of that cycle may never be freed.
Some commenters online have taken this problem with the scoped threads API to mean that Rust’s type system was fundamentally flawed. This is not the case: Rust’s guarantee that safe code is memory safe is as true as it ever was. The problem was really specific to the scoped threads API, which was making flawed assumptions; this API has been marked unstable, and there is an RFC proposing a safe alternative.
That said, there is an interesting, more fundamental question at play
here. We long ago decided that, to make reference-counting practical,
we had to accept resource leaks as a possibility. But some
recent proposals have suggested that we should place
limits on the Rc
type to avoid some kinds of reference leaks. These
limits would make the original scoped threads API safe. However,
these changes come at a pretty steep price in composability: they
effectively force a deep distinction between “leakable” and
“non-leakable” data, which winds up affecting all levels of the
system.
This post is my attempt to digest the situation and lay out my current thinking. For those of you who don’t want to read this entire post (and I can’t blame you, it’s long), let me just copy the most salient paragraph from my conclusion:
This is certainly a subtle issue, and one where reasonable folk can disagree. In the process of drafting (and redrafting…) this post, my own opinion has shifted back and forth as well. But ultimately I have landed where I started: the danger and pain of bifurcating the space of types far outweighs the loss of this particular RAII idiom.
All right, for those of you who want to continue, this post is divided into three sections:
- Section 1 explains the problem and gives some historical background.
- Section 2 explains the “status quo”.
- Section 3 covers the proposed changes to the reference-counted type and discusses the tradeoffs involved there.
Section 1. The problem in a nutshell
Let me start by summarizing the problem that was uncovered in more detail. The root of the problem is an interaction between the reference-counting and threading APIs in the standard library. So let’s look at each in turn. If you’re familiar with the problem, you can skip ahead to section 2.
Reference-counting as the poor man’s GC
Rust’s standard library includes the Rc
and Arc
types which are
used for reference-counted data. These are widely used, because they
are the most convenient way to create data whose ownership is shared
amongst many references rather than being tied to a particular stack
frame.
Like all reference-counting systems, Rc
and Arc
are vulnerable to
reference-count cycles. That is, if you create a reference-counted box
that contains a reference to itself, then it will never be
collected. To put it another way, Rust gives you a lot of safety
guarantees, but it doesn’t protect you from memory leaks (or
deadlocks, which turns out to be a very similar problem).
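To make this concrete, here is a minimal sketch of a reference-count cycle; the Node type is purely illustrative:
use std::cell::RefCell;
use std::rc::Rc;

// Each node may hold a strong reference to another node.
struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

impl Drop for Node {
    fn drop(&mut self) {
        println!("node dropped"); // never printed for the cycle below
    }
}

fn main() {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(a.clone())) });
    *a.next.borrow_mut() = Some(b.clone());
    // `a` and `b` go out of scope here, but each node still holds a strong
    // reference to the other, so neither count ever reaches zero: the
    // destructors never run and the memory is leaked.
}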
The fact that we don’t protect against leaks is not an accident. This
was a deliberate design decision that we made while transitioning from
garbage-collected types (@T
and @mut T
) to user-defined reference
counting. The reason is that preventing leaks requires either a
runtime with a cycle collector or complex type-system tricks. The
option of a mandatory runtime was out, and the type-system tricks we
explored were either too restrictive or too complex. So we decided to
make a pragmatic compromise: to document the possibility of leaks
(see, for example, this section of the Rust reference manual)
and move on.
In practice, the possibility of leaks is mostly an interesting technical caveat: I’ve not found it to be a big issue in practice. Perhaps because problems arose so rarely in practice, some things—like leaks—that should not have been forgotten were… partially forgotten. History became legend. Legend became myth. And for a few years, the question of leaks seemed to be a distant, settled issue, without much relevance to daily life.
Thread and shared scopes
With that background on Rc
in place, let’s turn to threads.
Traditionally, Rust threads were founded on a “zero-sharing”
principle, much like Erlang. However, as Rust’s type system evolved,
we realized we could do much better – the
same type system rules that ensured memory safety in sequential
code could be used to permit sharing in parallel code as well,
particularly once we adopted RFC 458 (a brilliant insight by
pythonesque).
The basic idea is to start a child thread that is tied to a particular scope in the code. We want to guarantee that before we exit that scope, the thread will be joined. If we can do this, then we can safely permit that child thread access to stack-allocated data, so long as that data outlives the scope; this is safe because Rust’s type-system rules already ensure that any data shared between multiple threads must be immutable (more or less, anyway).
So the question then is: how can we designate the scope of the child
threads, and how can we ensure that the children will be joined when
that scope exits. The original proposal was based on closures, but in
the time since it was written, the language has shifted to using more
RAII, and hence the scoped
API is based on RAII. The idea is pretty
simple. You write a call like the following:
fn foo(data: &[i32]) {
...
let guard = thread::scoped(|| /* body of the child thread */);
...
}
The scoped
function takes a closure which will be the body of the
child thread. It returns to you a guard value: running the destructor
of this guard will cause the thread to be joined. This guard is always
tied to a particular scope in the code. Let’s call the scope 'a
. The
closure is then permitted access to all data that outlives 'a
. For
example, in the code snippet above, 'a
might be the body of the
function foo
. This means that the closure could safely access the
input data
, because that must outlive the fn body. The type system
ensures that no reference to the guard exists outside of 'a
, and
hence we can be sure that the guard will go out of scope sometime before
the end of 'a
and thus trigger the thread to be joined. At least
that was the idea.
The conflict
By now perhaps you have seen the problem. The scoped API is only
safe if we can guarantee that the guard’s destructor runs, so that the
thread will be joined; but, using Rc
, we can leak values, which
means that their destructors never run. So, by combining Rc
and
scoped
, we can cause a thread to be launched that will never be
joined. This means that this thread could run at any time and try to
access data from its parent’s stack frame – even if that parent has
already completed, and thus the stack frame is garbage. Not good!
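To make the conflict concrete, here is a rough sketch of the sort of program that caused the trouble, written against the old thread::scoped API (which has since been removed; the exact signatures are from memory, so treat this as an illustration rather than code that compiles today):
use std::cell::RefCell;
use std::rc::Rc;
use std::thread;

// A type that owns a join guard and can also own a strong reference to itself.
struct Cycle<'a> {
    guard: RefCell<Option<thread::JoinGuard<'a, ()>>>,
    me: RefCell<Option<Rc<Cycle<'a>>>>,
}

fn oops() {
    let data = [1, 2, 3];
    let cycle = Rc::new(Cycle {
        guard: RefCell::new(None),
        me: RefCell::new(None),
    });
    // Close the cycle: this Rc now keeps itself alive forever.
    *cycle.me.borrow_mut() = Some(cycle.clone());
    // Spawn a scoped thread that borrows `data`, and stash its guard inside
    // the leaked cycle.
    *cycle.guard.borrow_mut() = Some(thread::scoped(|| {
        println!("{:?}", &data);
    }));
    // When `oops` returns, the cycle (and the guard inside it) is leaked. The
    // guard's destructor never runs, the thread is never joined, and it may
    // later read `data` from a stack frame that no longer exists.
}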
So where does the fault lie? From the point of view of history, it
is pretty clear: the scoped
API was ill designed, given that Rc
already existed. As I wrote, we had long ago decided that the most
practical option was to accept that leaks could occur. This implies
that if the memory safety of an API depends on a destructor running,
you can’t relinquish ownership of the value that carries that
destructor (because the end-user might leak it).
It is totally possible to fix the scoped
API, and in fact there is
already an RFC showing how this can be done (I’ll summarize it
in section 2, below). However, some people feel that the decision we
made to permit leaks was the wrong one, and that we ought to have some
limits on the Rc API to prevent leaks, or at least prevent some
leaks. I’ll dig into those proposals in section 3.
Section 2. What is the impact of leaks on the status quo?
So, if we continue with the status quo, and accept that resource leaks
can occur with Rc
and Arc
, what is the impact of that? At first
glance, it might seem that the possibility of resource leaks is a huge
blow to RAII. After all, if you can’t be sure that the destructor will
run, how can you rely on the destructor to do cleanup? But when you
look closer, it turns out that the problem is a lot more narrow.
“Average Rust User”
I think it’s helpful to come at this problem from two different
perspectives. The first is: what do resource leaks mean for the
average Rust user? I think the right way to look at this is that the
user of the Rc
API has an obligation to avoid cycle leaks or break
cycles. Failing to do so will lead to bugs – these could be resource
leaks, deadlocks, or other things. But leaks cannot lead to memory
unsafety. (Barring invalid unsafe code, of course.)
It’s worth pointing out that even if you are using Rc
, you don’t
have to worry about memory leaks due to forgetting to decrement a
reference or anything like that. The problem really boils down to
ensuring that you have a clear strategy for avoiding cycles, which
usually means maintaining an “ownership DAG” of strong references (though in
some cases, breaking cycles explicitly may also be an option).
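For the common parent/child case, the usual way to maintain that DAG is to make back-pointers weak. Here is a minimal sketch; the Node type is just an illustration:
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Parents own their children through strong references; children point back
// through weak references, so no strong cycle can form.
struct Node {
    parent: RefCell<Weak<Node>>,
    children: RefCell<Vec<Rc<Node>>>,
}

fn main() {
    let parent = Rc::new(Node {
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(Vec::new()),
    });
    let child = Rc::new(Node {
        parent: RefCell::new(Rc::downgrade(&parent)),
        children: RefCell::new(Vec::new()),
    });
    parent.children.borrow_mut().push(child);
    // Dropping `parent` drops the child as well: the back-pointer is weak, so
    // it keeps neither the parent nor the child alive.
}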
“Author of unsafe code”
The other perspective to consider is the person who is writing unsafe code. Unsafe code frequently relies on destructors to do cleanup. I think the right perspective here is to view a destructor as akin to any other user-facing function: in particular, it is the user’s responsibility to call it, and they may accidentally fail to do so. Just as you have to write your API to be defensive about users invoking functions in the wrong order, you must be defensive about them failing to invoke destructors due to a resource leak.
It turns out that the majority of RAII idioms are actually perfectly memory safe even if the destructors don’t run. For example, if we examine the Rust standard library, it turns out that all of the destructors therein are either optional or can be made optional:
- Straight-forward destructors like Box or Vec leak memory if they are not freed; clearly no worse than the original leak.
- Leaking a mutex guard means that the mutex will never be released. This is likely to cause deadlock, but not memory unsafety.
- Leaking a RefCell guard means that the RefCell will remain in a borrowed state. This is likely to cause a thread panic, but not memory unsafety.
- Even fancy iterator APIs like drain, which was initially thought to be problematic, can be implemented in such a way that leaking the iterator leaks the remaining values, but never causes memory unsafety.
In all of these cases, there is a guard value that mediates access to some underlying value. The type system already guarantees that the original value cannot be accessed while the guard is in scope. But how can we ensure safety outside of that scope in the case where the guard is leaked? If you look at the cases above, I think they can be grouped into two patterns:
- Ownership: Things like Box and Vec simply own the values they are protecting. This means that if they are leaked, those values are also leaked, and hence there is no way for the user to access them.
- Pre-poisoning: Other guards, like MutexGuard, put the value they are protecting into a poisoned state that will lead to dynamic errors (but not memory unsafety) if the value is accessed without the destructor having run. In the case of MutexGuard, the “poisoned” state is that the mutex is locked, which means a later attempt to lock it will simply deadlock unless the MutexGuard has been dropped. (A toy sketch of this pattern appears just below.)
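To illustrate the pre-poisoning pattern, here is a toy, single-threaded lock. It is not a real mutex – the names and details are invented for illustration (and it panics rather than deadlocks on a second lock attempt) – but it shows the shape of the idea:
use std::cell::Cell;

// The lock is acquired (the "poisoned" state) before the guard is handed out,
// and only the guard's destructor releases it. Leaking the guard therefore
// leaves the lock held forever: later users are refused, but they can never
// obtain a second, overlapping guard.
struct ToyLock {
    locked: Cell<bool>,
    value: Cell<i32>,
}

struct ToyGuard<'a> {
    lock: &'a ToyLock,
}

impl ToyLock {
    fn lock(&self) -> ToyGuard<'_> {
        assert!(!self.locked.get(), "lock already held");
        self.locked.set(true); // poison first...
        ToyGuard { lock: self } // ...then hand out the guard
    }
}

impl<'a> ToyGuard<'a> {
    fn set(&self, v: i32) {
        self.lock.value.set(v);
    }
}

impl<'a> Drop for ToyGuard<'a> {
    fn drop(&mut self) {
        self.lock.locked.set(false); // only the destructor releases the lock
    }
}

fn main() {
    let lock = ToyLock { locked: Cell::new(false), value: Cell::new(0) };
    let guard = lock.lock();
    guard.set(42);
    // If `guard` were leaked (say, trapped in an Rc cycle), `locked` would
    // stay true and the next lock() would refuse rather than hand out access.
}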
What makes scoped threads different?
So if most RAII patterns continue to work fine, what makes scoped different? I think there is a fundamental difference between scoped and these other APIs; this difference was well articulated by Kevin Ballard:
thread::scoped
is special because it’s using the RAII guard as a proxy to represent values on the stack, but this proxy is not actually used to access those values.
If you recall, I mentioned above that all the guards serve to mediate
access to some value. In the case of scoped
, the guard is mediating
access to the result of a computation – the data that is being
protected is “everything that the closure may touch”. The guard, in
other words, doesn’t really know the specific set of affected data,
and it thus cannot hope to either own or pre-poison the data.
In fact, I would take this a step farther, and say that I think that in this kind of scenario, where the guard doesn’t have a connection to the data being protected, RAII tends to be a poor fit. This is because, generally, the guard doesn’t have to be used, so it’s easy for the user to accidentally drop the guard on the floor, causing the side-effects of the guard (in this case, joining the thread) to occur too early. I’ll spell this out a bit more in the section below.
Put more generally, accepting resource leaks does mean that there is a Rust idiom that does not work. In particular, it is not possible to create a borrowed reference that can be guaranteed to execute arbitrary code just before it goes out of scope. What we’ve seen though is that, frequently, it is not necessary to guarantee that the code will execute – but in the case of scoped, because there is no direct connection to the data being protected, joining the thread is the only solution.
Using closures to guarantee code execution when exiting a scope
If we can’t use an RAII-based API to ensure that a thread is joined, what can we do? It turns out that there is a good alternative, laid out in RFC 1084. The basic idea is to restructure the API so that you create a “thread scope” and spawn threads into that scope (in fact, the RFC lays out a more general version that can be used not only for threads but for any bit of code that needs guaranteed execution on exit from a scope). This thread scope is delineated using a closure. In practical terms, this means that starting a scoped thread looks something like this:
fn foo(data: &[i32]) {
...
thread::scope(|scope| {
let future = scope.spawn(|| /* body of the child thread */);
...
});
}
As you can see, whereas before a call to thread::scoped started a new
thread immediately, calling thread::scope just creates a thread scope – it doesn’t
itself start any threads. A borrowed reference to the thread scope is
passed to a closure (here it is the argument scope
). The thread
scope offers a method spawn
that can be used to start a new thread
tied to a specific scope. This thread will be joined when the closure
returns; as such, it has access to any data that outlives the body of
the closure. Note that the spawn
method still returns a future to
the result of the spawned thread; this future is similar to the old
join guard, because it can be used to join the thread early. But this
future doesn’t have a destructor. If the thread is not joined through
the future, it will still be automatically joined when the closure
returns.
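To make the role of the returned future concrete, here is a hedged sketch of using it, written against the proposed API described above; the join method on the future is my reading of the proposal, not a function that existed in the standard library at the time:
use std::thread;

fn sum(data: &[i32]) -> i32 {
    let mut total = 0;
    thread::scope(|scope| {
        let (left, right) = data.split_at(data.len() / 2);
        // Spawn a child thread to sum the left half while we sum the right
        // half on this thread.
        let left_future = scope.spawn(move || left.iter().sum::<i32>());
        let right_sum: i32 = right.iter().sum();
        // Joining through the future retrieves the child's result early; if we
        // skipped this, the scope would still join the thread for us when the
        // closure returns, we just couldn't observe its return value.
        total = left_future.join() + right_sum;
    });
    total
}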
In the case of this particular API, I think closures are a better fit than RAII. In particular, the closure serves to make the scope where the threads are active clear and explicit; this in turn avoids certain footguns that were possible with the older, RAII-based API. To see an example of what I mean, consider this code that uses the old API to do a parallel quicksort:
fn quicksort(data: &mut [i32]) {
if data.len() <= 1 { return; }
let pivot = data.len() / 2;
let index = partition(data, pivot);
let (left, right) = data.split_at_mut(index);
let _guard1 = thread::scoped(|| quicksort(left));
let _guard2 = thread::scoped(|| quicksort(right));
}
I want to draw attention to one snippet of code at the end:
let (left, right) = data.split_at_mut(index);
let _guard1 = thread::scoped(|| quicksort(left));
let _guard2 = thread::scoped(|| quicksort(right));
Notice that we have to make dummy variables like _guard1
and
_guard2
. If we left those variables off, then the thread would be
immediately joined, which means we wouldn’t get any actual
parallelism. What’s worse, the code would still work, it would just
run sequentially. The need for these dummy variables, and the
resulting lack of clarity about just when parallel threads will be
joined, is a direct result of using RAII here.
Compare that code above to using a closure-based API:
thread::scope(|scope| {
let (left, right) = data.split_at_mut(index);
scope.spawn(|| quicksort(left));
scope.spawn(|| quicksort(right));
});
I think it’s much clearer. Moreover, the closure-based API opens the
door to other methods that could be used with scope
, like
convenience methods to do parallel maps and so forth.
Section 3. Can we prevent (some) resource leaks?
Ok, so in the previous two sections, I summarized the problem and discussed the impact of resource leaks on Rust. But what if we could avoid resource leaks in the first place? There have been two RFCs on this topic: RFC 1085 and RFC 1094.
The two RFCs are quite different in the details, but share a common
theme. The idea is not to avoid all resource leaks altogether; I think
everyone recognizes that this is not practical. Instead, the goal is
to try and divide types into two groups: those that can be safely
leaked, and those that cannot. You then limit the Rc
and Arc
types
so that they can only be used with types that can safely be leaked.
This approach seems simple but it has deep ramifications. It means
that Rc
and Arc
are no longer fully general container
types. Generic code that wishes to operate on data of all types
(meaning both types that can and cannot leak) can’t use Rc
or Arc
internally, at least not without some hard choices.
Rust already has a lot of precedent for categorizing types. For
example, we use a trait Send
to designate “types that can safely be
transferred to other threads”. In some sense, dividing types into
leak-safe and not-leak-safe is analogous. But my experience has been
that every time we draw a fundamental distinction like that, it
carries a high price. This distinction “bubbles up” through APIs and
affects decisions at all levels. In fact, we’ve been talking about one
case of this rippling effect throughout this post – the fact that we
have two reference-counting types, one atomic (Arc
) and one not
(Rc
), is precisely because we want to distinguish thread-safe and
non-thread-safe operations, so that we can get better performance when
thread safety is not needed.
What this says to me is that we should be very careful when introducing blanket type distinctions. The places where we use this mechanism today – thread-safety, copyability – are fundamental to the language, and very important concepts, and I think they carry their weight. Ultimately, I don’t think resource leaks quite fit the bill. But let me dive into the RFCs in question and try to explain why.
RFC 1085 – the Leak trait
The first of the two RFCs is RFC 1085. This RFC introduces a
trait called Leak
, which operates exactly like the existing Send
trait. It indicates “leak-safe” data. Like Send
, it is implemented
by default. If you wish to make leaks impossible for a type, you can
explicitly opt out with a negative impl like impl !Leak for MyType
.
When you create a Rc<T>
or Arc<T>
, either T: Leak
must hold, or
else you must use an unsafe constructor to certify that you will not
create a reference cycle.
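In rough code, the shape of the proposal looks something like the sketch below; the trait and both constructors are my rendering of the RFC’s description, not anything that exists in the standard library (and the real proposal implements Leak automatically, which this sketch does not show):
// Marker trait for data that may safely be leaked.
unsafe trait Leak {}

struct Rc<T> {
    value: T, // plus a reference count, omitted here
}

impl<T> Rc<T> {
    // The safe constructor requires leak-safe data...
    fn new(value: T) -> Rc<T> where T: Leak {
        Rc { value }
    }

    // ...while the unsafe constructor lets the caller promise that no
    // reference cycle will ever be created with this box.
    unsafe fn new_unchecked(value: T) -> Rc<T> {
        Rc { value }
    }
}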
The fact that Leak
is automatically implemented promises to make it
mostly invisible. Indeed, in the prototype that Jonathan Reem
implemented, he found relatively little fallout in the standard
library and compiler. While encouraging, I still think we’re going to
encounter problems of composability over time.
There are a couple of scenarios where the Leak
trait will, well,
leak into APIs where it doesn’t seem to belong. One of the most
obvious is trait objects. Imagine I am writing a serialization
library, and I have a Serializer
type that combines an output stream
(a Box<Writer>
) along with some serialization state:
struct Serializer {
output_stream: Box<Writer>,
serialization_state: u32,
...
}
So far so good. Now someone else comes along and would like to use my
library. They want to put this Serializer
into a reference counted
box that is shared amongst many users, so they try to make a
Rc<Serializer>
. Unfortunately, this won’t work. This seems somewhat
surprising, since weren’t all types supposed to be Leak
by
default?
The problem lies in the Box<Writer>
object – an object is designed
to hide the precise type of Writer
that we are working with. That
means that we don’t know whether this particular Writer
implements
Leak
or not. For this client to be able to place Serializer
into
an Rc
, there are two choices. The client can use unsafe
code, or
I, the library author, can modify my Serializer
definition as
follows:
struct Serializer {
output_stream: Box<Writer+Leak>,
serialization_state: u32,
...
}
This is what I mean by Leak
“bubbling up”. It’s already the case
that I, as a library author, want to think about whether my types can
be used across threads and try to enable that. Under this proposal, I
also have to think about whether my types should be usable in Rc
,
and so forth.
Now, if you avoid trait objects, the problem is smaller. One advantage of generics is that they don’t encapsulate what type of writer you are using and so forth, which means that the compiler can analyze the type to see whether it is thread-safe or leak-safe or whatever. Until now we’ve found that many libraries avoid trait objects partly for this reason, and I think that’s good practice in simple cases. But as things scale up, encapsulation is a really useful mechanism for simplifying type annotations and making programs concise and easy to work with.
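For contrast, a generic version of the earlier Serializer might look like the sketch below (using std::io::Write in place of the post’s Writer trait, purely so the snippet is self-contained):
use std::io::Write;

// Because the concrete writer type W is not erased behind a trait object, the
// compiler can work out for each Serializer<W> whether it is Send, Leak, and
// so on, without the library author adding bounds like Writer+Leak by hand.
struct Serializer<W: Write> {
    output_stream: W,
    serialization_state: u32,
}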
There is one other point. RFC 1085 also includes an unsafe
constructor for Rc
, which in principle allows you to continue using
Rc
with any type, so long as you are in a position to assert that no
cycles exist. But I feel like this puts the burden of unsafety into
the wrong place. I think you should be able to construct
reference-counted boxes, and truly generic abstractions built on
reference-counted boxes, without writing unsafe code.
My allergic reaction to requiring unsafe
to create Rc
boxes stems
from a very practical concern: if we push the boundaries of unsafety
too far out, such that it is common to use an unsafe keyword here and
there, we vastly weaken the safety guarantees of Rust in
practice. I’d rather that we increase the power of safe APIs at the
cost of more restrictions on unsafe code. Obviously, there is a
tradeoff in the other direction, because if the requirements on unsafe
code become too subtle, people are bound to make mistakes there too,
but my feeling is that requiring people to consider leaks doesn’t
cross that line yet.
RFC 1094 – avoiding reference leaks
RFC 1094 takes a different tack. Rather than dividing types arbitrarily into leak-safe and not-leak-safe, it uses an existing distinction, and says that any type which is associated with a scope cannot leak.
The goal of RFC 1094 is to enable a particular “mental model”
about what lifetimes mean. Specifically, the RFC aims to ensure that
if a value is limited to a particular scope 'a
, then the value will
be destroyed before the program exits the scope 'a
. This is very
similar to what Rust currently guarantees, but stronger: in current
Rust, there is no guarantee that your value will be destroyed, there
is only a guarantee that it will not be accessed outside that
scope. Concretely, if you leak an Rc
into the heap today, that Rc
may contain borrowed references, and those references could be invalid
– but it doesn’t matter, because Rust guarantees that you could never
use them.
In order to guarantee that borrowed data is never leaked,
RFC 1094 requires that to construct a Rc<T>
(or Arc<T>
),
the condition T: 'static
must hold. In other words, the payload of a
reference-counted box cannot contain borrowed data. This by itself is
very limiting: lots of code, including the rust compiler, puts
borrowed pointers into reference-counted structures. To help with
this, the RFC includes a second type of reference-counted box,
ScopedRc
. To use a ScopedRc
, you must first create a
reference-counting scope s
. You can then create new ScopedRc
instances associated with s
. These ScopedRc
instances carry their
own reference count, and so they will be freed normally as soon as
that count drops to zero. But if they should get placed into a cycle,
then when the scope s
is dropped, it will go along and “cycle
collect”, meaning that it runs the destructor for any ScopedRc
instances that haven’t already been freed. (Interestingly, this is
very similar to the closure-based scoped thread API, but instead of
joining threads, exiting the scope reaps cycles.)
I originally found this RFC appealing. It felt to me that it avoided
adding a new distinction (Leak
) to the type system and instead
piggybacked on an existing one (borrowed vs non-borrowed). This seems
to help with some of my concerns about “ripple effects” on users.
However, even though it piggybacks on an existing distinction
(borrowed vs static), the RFC now gives that distinction additional
semantics it didn’t have before. Today, those two categories can be
considered on a single continuum: for all types, there is some
bounding scope (which may be 'static
), and the compiler ensures that
all accesses to that data occur within that scope. Under RFC 1094,
there is a discontinuity. Data which is bounded by 'static
is
different, because it may leak.
This discontinuity is precisely why we have to split the type Rc
into two types, Rc
and ScopedRc
. In fact, the RFC doesn’t really
mention Arc
much, but presumably there will have to be both
ScopedRc
and ScopedArc
types. So now where we had only two
types, we have four, to account for this new axis:
|-----------------++--------+-----------|
|                 || Static | Borrowed  |
|-----------------++--------+-----------|
| Thread-safe     || Arc    | ScopedArc |
| Not-thread-safe || Rc     | ScopedRc  |
|-----------------++--------+-----------|
And, in fact, the distinction doesn’t end here. There are
abstractions, such as channels, that are built on Arc
. So this means
that this same categorization will bubble up through those
abstractions, and we will (presumably) wind up with Channel
and
ChannelScoped
(otherwise, channels cannot be used to send borrowed
data to scoped threads, which is a severe limitation).
Section 4. Conclusion.
This concludes my deep dive into the question of resource leaks. It
seems to me that the tradeoffs here are not simple. The status quo,
where resource leaks are permitted, helps to ensure composability by
allowing Rc
and Arc
to be used uniformly on all types. I think
this is very important as these types are vital building blocks.
On a historical note, I am particularly sensitive to concerns of composability. Early versions of Rust, and in particular the borrow checker before we adopted the current semantics, were rife with composability problems. This made writing code very annoying – you were frequently refactoring APIs in small ways to account for this.
However, this composability does come at the cost of a useful RAII pattern. Without leaks, you’d be able to use RAII to build references that reliably execute code when they are dropped, which in turn allows RAII-like techniques to be used more uniformly across all safe APIs.
This is certainly a subtle issue, and one where reasonable folk can disagree. In the process of drafting (and redrafting…) this post, my own opinion has shifted back and forth as well. But ultimately I have landed where I started: the danger and pain of bifurcating the space of types far outweighs the loss of this particular RAII idiom.
Here are the two most salient points to me:
- The vast majority of RAII-based APIs are either safe or can be made safe with small changes. The remainder can be expressed with closures.
  - With regard to RAII, the scoped threads API represents something of a “worst case” scenario, since the guard object is completely divorced from the data that the thread will access.
  - In cases like this, where there is often no need to retain the guard, but dropping it has important side-effects, RAII can be a footgun and hence is arguably a poor fit anyhow.
- The cost of introducing a new fundamental distinction (“leak-safe” vs “non-leak-safe”) into our type system is very high and will be felt up and down the stack. This cannot be completely hidden or abstracted away.
  - This is similar to thread safety, but leak-safety is far less fundamental.
Bottom line: the cure is worse than the disease.