Unsafe code and shared references
1 February 2017
In a previous post, I talked about a proposed approach to drafting the
unsafe code guidelines. Specifically, I advocated for the approach of having
an executable specification of Rust with additional checks that
will signal when undefined behavior has occurred. In this post, I want
to try to dive into that idea a bit more and give some more specifics
of the approach I have in mind. In this post, I’m going to focus on the
matter of the proper use of shared references &T
– I’ll completely
ignore &mut T
for now, since those are much more complicated
(because they require a notion of uniqueness).
For the time being, I’m going to continue to talk about this executable specification as a kind of “enhanced miri”. I think probably the right formal way to express it is not as code but rather as an operational semantics, which is basically a mathematical description of an interpreter. But at the same time I think we should keep in mind other ways of implementing those same checks (e.g., as a valgrind plugin).
I’m also going to focus on single-thread semantics for now. It seems best to start there, and extend to the multithreaded case only once we have a good handle on how we think the sequential semantics ought to roughly work (perhaps using an operationally-based model like promises as a starting point).
How to use shared references wrong
In Rust, a shared reference is more than a pointer. It’s also a kind
of promise to the type system. Specifically, when you create a
shared reference, the data that it refers to (“referent”) is
considered borrowed, which means that it is supposed to be
immutable (except under an UnsafeCell) and valid so long
as the reference is in use. When you’re writing safe Rust, of
course, the borrow checker ensures these properties for you:
fn foo() {
    let mut i = 0;
    let p = &i;
    i += 1; // <-- Error! `i` is shared, cannot mutate.
    println!("{}", *p);
}
But what about unsafe code? Certainly it is possible to violate either
of these properties. For now, I’m going to focus on mutating
borrowed data when you are not supposed to; in fact, freeing or moving
borrowed data can be seen as a kind of mutation (overwriting the data
with uninitialized). So here is a running example of an unsafely
implemented function util::increment(), which takes in a &usize and increments it:
pub fn increment(u: &usize) {
    unsafe {
        let p: *const usize = u;
        let q: *mut usize = p as *mut usize;
        *q += 1;
    }
}
Now, clearly, this is a sketchy function, and I think most would agree
that it should be considered illegal, at least under some
executions. In particular, if nothing else, its existence will
interfere with the compiler’s ability to optimize. To see why, imagine
a caller like this one; let’s further assume that the source of
increment()
is unavailable for analysis (perhaps it is part of
another crate, or a different codegen-unit within the current crate).
fn innocent() {
    let i = &22;
    println!("i = {}", *i);
    increment(i);
    println!("i = {}", *i);
}
Ideally, the compiler ought to be able to deduce – even without
knowing what increment()
does – that *i
equals 22
throughout
this function execution. After all, the underlying temporary that i
points at is clearly only accessed through a shared reference, which
ought to be immutable. But, of course, that is not a valid assumption:
increment()
is violating its contract. So if we perform
optimizations, such as replacing all uses of i
with the constant
22, those will be visible to the end-user. In typical C fashion,
this can be justified if we say that the program encounters undefined
behavior, but how can we make that more precise?
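To make the danger concrete, here is a sketch (mine, not the output of any actual compiler) of what innocent() might look like after such a constant-propagation pass:

fn innocent() { // hypothetical "optimized" version
    let i = &22;
    println!("i = {}", 22); // `*i` replaced with the constant 22
    increment(i);
    println!("i = {}", 22); // no longer reflects the write that `increment()` performed
}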
Instrumenting to detect failures
Earlier we mentioned that the key property of a shared reference is that the borrowed memory will remain both immutable and valid for the lifetime of the reference. In my mental model, a borrow works kind of like a (compile-time) read-write lock: when you borrow data to create a shared reference, you have acquired a “read-lock” on that data. As a first stab at what our “augmented interpreter” might look like, let’s see if we can realize that intuition. (Spoiler: this will turn out to be the wrong approach.)
The basic idea is that the interpreter would track a “reader count”
for every bit of memory. When we create a reference (i.e., when we
execute &i), that will instruct the interpreter to increment that
counter. The compiler would also generate “release” instructions when
the borrow goes out of scope which would decrement the lock count
again.
So in a sense our augmented program would look like this. The new assertions are written in comments; the interpreter would understand them, even if regular Rust execution does not:
fn innocent() {
    let i = &22;
    println!("i = {}", *i);
    // acquire_read_lock(i);
    increment(i);
    // release_read_lock(i);
    println!("i = {}", *i);
}
Now, once we’ve inserted those instructions, then presumably
increment()
would dynamically fail as it attempted to execute *q += 1, because the memory was “read-locked”.
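To make the bookkeeping a bit more concrete, here is a rough sketch of the interpreter-side state that this scheme implies. The names (ReadLocks, acquire_read_lock, and so on) are placeholders of my own invention, and real addresses would of course come from the interpreter’s memory model:

use std::collections::HashMap;

type Address = usize;

// Hypothetical interpreter state: how many outstanding "read locks"
// cover each memory address.
struct ReadLocks {
    counts: HashMap<Address, usize>,
}

impl ReadLocks {
    // Executed when a shared borrow of the data at `addr` is created.
    fn acquire_read_lock(&mut self, addr: Address) {
        *self.counts.entry(addr).or_insert(0) += 1;
    }

    // Executed when that borrow goes out of scope.
    fn release_read_lock(&mut self, addr: Address) {
        *self.counts.get_mut(&addr).expect("no lock held") -= 1;
    }

    // Executed on every write; signals undefined behavior if the target
    // is currently read-locked (this is what `*q += 1` would trip over).
    fn check_write(&self, addr: Address) {
        if self.counts.get(&addr).copied().unwrap_or(0) > 0 {
            panic!("undefined behavior: write to read-locked memory at {:#x}", addr);
        }
    }
}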
Dealing with unsafe abstractions
So, this idea of a read-write lock seems reasonable so far – why did I say that this would turn out to be the wrong approach? Well, one catch is that it’s not sufficient in general to just freeze a single integer. Rather, when something gets borrowed, we have to freeze all the memory reachable from the point of borrow. That turns out to be problematic: given that Rust is built on unsafe abstractions, it’s not really possible to enumerate all that memory. To see what I mean, consider this program:
fn foo() {
    let mut x: Vec<Vec<i32>> = vec![vec![]];
    let y = &x; // borrow `x`
    // ...
}
Here, the reference y
borrows x
, which is a vector of
vectors. This implies that not only is the vector x
itself frozen,
so are all the vectors within x
, and so are all the integers in all
those vectors. This means that if the program were to create an unsafe
pointer and navigate to any one of those vectors and try to mutate it,
we should error out.
To enforce this, presumably the compiler would have to insert
something like the acquire_read_lock(&x)
we saw before. This
instruction would cause the interpreter to navigate to all the memory
reachable from x
– but how can it do that? Vectors, after all, are
not a built-in concept in Rust. The Vec
type is just a struct that
stores an unsafe pointer to its data, ultimately looking something like this:
struct Vec<T> {
    data: *mut T,
    len: usize,
    capacity: usize,
}
It’s clear that we can freeze the fields of the Vec, but it’s less clear how we can freeze the vector’s data. Is it safe or reasonable for us to dereference data? How do we know that the memory that data refers to is initialized? (In fact, since vectors over-allocate, some portion of that data is basically guaranteed to be uninitialized.)
We actually encountered similar issues when thinking about how to
integrate tracing GCs (another topic that would make for a good blog
post!). The bottom line is that whatever scheme you create, people
will always want some way to apply their own custom logic (e.g., maybe the pointer isn’t stored as a *mut T, it’s actually a usize
and you can only extract it by doing an xor
with some other
values). So it’d really be best if we can avoid the need to
“interpret” an unsafe data structure in any way.
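As a contrived, hypothetical illustration of the kind of custom logic I mean, imagine a container that deliberately obscures its data pointer; no general-purpose interpreter could know that scrambled should be followed as a pointer, much less how to decode it:

use std::marker::PhantomData;

const MASK: usize = 0xDEAD_BEEF; // arbitrary mask, purely for illustration

// A contrived container that hides its allocation behind an XOR'd integer.
// An interpreter asked to "freeze everything reachable" from a borrow of
// this struct has no way to know that `scrambled` is really a pointer.
struct Obfuscated<T> {
    scrambled: usize, // actually `ptr as usize ^ MASK`
    len: usize,
    _marker: PhantomData<T>,
}

impl<T> Obfuscated<T> {
    fn data(&self) -> *mut T {
        (self.scrambled ^ MASK) as *mut T
    }
}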
A second approach: cannot observe a violation
There is another way to think about the freezing guarantees. Instead
of eagerly locking all the memory that is reachable through a
reference, we might instead declare that the compiler should not be
able to observe any writes. Under this model, modifying the referent
of an &i32
is not – in and of itself – undefined behavior. It only
becomes undefined behavior when we later read through that reference and observe that the referent was written after the reference was created.
One way to express this is to imagine that there is a global counter
WRITES
tracking the number of writes to memory. Every time we write
to a memory address m, the interpreter will increment WRITES and
store the new value to a global map LAST_WRITE[m]
– this map
records, for each address, the last time it was written. When we
create a shared reference r
, we can also read the current value of
WRITES
and associate this value with the reference as
TIME_STAMP[r]
(you can think of it as some extra metadata that gets
carried along somehow).
Now, when we read from a shared reference r
that refers to the
memory address m, we can check that LAST_WRITE[m] <= TIME_STAMP[r], which tells us that the memory has not been written
since the reference r
was created (this may actually be stricter
than we want, but let’s start here).
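Here is a rough sketch of the bookkeeping just described. The struct and method names are my own invention; only WRITES, LAST_WRITE, and TIME_STAMP come from the model above:

use std::collections::HashMap;

type Address = usize;
type RefId = usize; // stand-in for "a particular reference value"

struct WriteTracker {
    writes: u64,                       // the global WRITES counter
    last_write: HashMap<Address, u64>, // LAST_WRITE[m]
    time_stamp: HashMap<RefId, u64>,   // TIME_STAMP[r]
}

impl WriteTracker {
    // Every write to memory bumps WRITES and records the new value
    // for the address that was written.
    fn record_write(&mut self, m: Address) {
        self.writes += 1;
        self.last_write.insert(m, self.writes);
    }

    // Creating a shared reference snapshots the current WRITES value.
    fn record_borrow(&mut self, r: RefId) {
        self.time_stamp.insert(r, self.writes);
    }

    // Reading through reference `r` at address `m` checks that the memory
    // has not been written since `r` was created.
    fn check_read(&self, r: RefId, m: Address) {
        let last_write = self.last_write.get(&m).copied().unwrap_or(0);
        assert!(
            last_write <= self.time_stamp[&r],
            "undefined behavior: memory written after the reference was created"
        );
    }
}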
So, coming back to our running example, the code might look like this, with comments indicating the meta-operations that are happening:
fn innocent() {
    let i = &22;
    // TIME_STAMP[i] = WRITES
    // assert(LAST_WRITE[i] <= TIME_STAMP[i])
    println!("i = {}", *i);
    increment(i);
    // assert(LAST_WRITE[i] <= TIME_STAMP[i])
    println!("i = {}", *i);
}

fn increment(u: &usize) {
    unsafe {
        let p: *const usize = u;
        let q: *mut usize = p as *mut usize;
        // WRITES += 1;
        // LAST_WRITE[q] = WRITES;
        *q += 1;
    }
}
Now we can clearly see that the second assertion in innocent()
will
fail, since LAST_WRITE[i] is going to be equal to TIME_STAMP[i]+1. This indicates that some form of undefined behavior
occurred.
Unsafety levels
One of the premises of the Tootsie Pop model is that we can leverage the fact that Rust separates safe from unsafe code to allow for more optimization without making it harder to reason about unsafe code. Although many specific details of the TPM proposal were flawed, I think this basic idea is still necessary if we are to achieve the level of optimization that I would like to achieve while avoiding the problem of unsafe code becoming very hard to reason about. I plan to write more on this specific topic (“safety levels”) in a follow-up post; for now, I want to take for granted that we have some way to distinguish “safe” functions from “unsafe” functions, and just talk about how we can reflect that designation using assertions, and in turn use those assertions to drive optimization.
Consider this variant of the example from
my previous post about trusting types. Let’s assume that the
function patsy()
here is “safe code”:
fn patsy() {
    let i = &22;
    let v = *i;
    increment(i);
    println!("i = {}", v);
}
In this code, the author has loaded *i
before calling
increment(), but the result is not used until afterwards. The
question is, given that this is safe code, can we optimize this code
by deferring the load until later? This kind of optimization could be
useful in improving register allocation and stack size, for example:
fn patsy() { // "optimized"
    let i = &22;
    increment(i);
    let v = *i; // this is moved here
    println!("i = {}", v);
}
In general, my goal is that we can drive whether an optimization is
legal based purely on the assertions and metadata that we are using to
instrument the code when we check for undefined behavior. The idea is
then similar to how C optimization works: we can perform an
optimization if we can show that it only affects executions that would
have resulted in an assertion failure anyhow. So let’s see what our
instrumented patsy()
looks like so far:
fn patsy() { // instrumented
    let i = &22;
    // TIME_STAMP[i] = WRITES
    // assert(LAST_WRITE[i] <= TIME_STAMP[i])
    let v = *i;
    increment(i);
    println!("i = {}", v);
}
Based only on these assertions, there is no way to justify the
optimization I want to perform. After all, increment()
is free to
update LAST_WRITE[i]
because there is no assertion that states
otherwise.
What went wrong? The disconnect is actually strongly related to my
previous post on observational equivalence – I would like to
optimize patsy()
on the basis that increment()
, being declared as
a safe function, will only do things that safe code could do (or,
rather, safe code augmented with the capabilities we define for unsafe
code). That’s a pretty strong assumption – since it assumes we can
fully describe the possible things the code might do – but we can
weaken it by saying that, since increment()
is declared safe, I
should get to assume that any code that its callers could write
that type-checks will not trigger undefined behavior. But that is
clearly false, as we saw in the previous section: if the caller simply
moves the let v = *i
line down to after increment(), an assertion failure occurs.
We can capture some of this intuition by saying that, in safe code, we
add additional assertions at function boundaries. The idea is that
when safe code calls a function (and, by definition, that function
must be safe, since calling an unsafe function requires an unsafe
block), it can rely on that function not to disturb the types that it
has access to. So imagine that after every function call in a safe
function, we assert that all our publicly accessible state is still
valid. In this case, since i
is an in-scope reference whose lifetime
has not expired (in particular, even in an NLL world, its lifetime would include the call to increment()), that means that the memory it
refers to must not have changed:
fn patsy() { // instrumented
    let i = &22;
    // TIME_STAMP[i] = WRITES
    // assert(LAST_WRITE[i] <= TIME_STAMP[i])
    let v = *i;
    increment(i);
    // assert(LAST_WRITE[i] <= TIME_STAMP[i])
    println!("i = {}", v);
}
Running with these augmented semantics, we see that the call to increment() will yield an assertion failure once it returns to patsy(), even though we never access *i
again. This in turn justifies our compiler’s
decision to move let v = *i
below the call.
It’s clear that, even with these stronger assertions, we are not able
to fully check that some bit of unsafe code is a valid safe
abstraction. In other words, we can show that it did not disturb the
local variables of its caller function in any immediate way, but it
may well have disturbed them in some way that will show up later (for
example, increment
might not immediately mutate *i
, but it might
make an alias of i
that will be used later to perform an illegal
mutation). However, we can hopefully show that the abstraction is
safe enough for the compiler to do the optimizations we would like
to do.
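For example, here is a hypothetical variant of increment() that the per-call assertions would not catch: it passes any check made immediately after it returns, but stashes a raw pointer that a later call uses to mutate the referent while the original borrow may still be live in the caller.

use std::cell::Cell;

thread_local! {
    // Hypothetical stash that lets one call smuggle a pointer to another.
    static STASH: Cell<*const usize> = Cell::new(std::ptr::null());
}

// Passes any assertion made right after the call...
pub fn increment_later(u: &usize) {
    STASH.with(|s| s.set(u as *const usize));
}

// ...but this later call performs the illegal mutation, long after the
// original call site's assertions have already succeeded.
pub fn do_it() {
    STASH.with(|s| unsafe {
        let q = s.get() as *mut usize;
        *q += 1;
    });
}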
Ginning up metadata for false references
Another question that you quickly run into in this approach – and
it’s a question we have to answer no matter what! – is what to do
about references that are created in “unorthodox” ways. For example,
what happens if I make a reference by transmuting a usize
(note: not
recommended):
// Don't do this at home, kids.
fn wacked<T>(x: &T) -> &T {
    let i: usize = x as *const T as usize;
    let y: &T = unsafe { std::mem::transmute(i) };
    y
}
If the reference x
has some “identity” as a reference, you can
imagine that the machine might preserve that identity when x
is cast
to a usize, in which case TIME_STAMP[y] == TIME_STAMP[x]. Or
perhaps the time stamp is reset. This is all strongly related to C
memory models (e.g., this one), which also have to
define this sort of thing (related question: at what point does a
pointer gain a numeric address?).
In any case, I’m not sure just what the right answer is here, but I
like how focusing on something executable makes the issue at hand very
concrete. It also seems likely that, as we thread this data through an
actual interpreter, these questions will naturally arise (i.e., “hmm,
we have to create a Reference
value here, what should we use for the
time-stamp?”), which will help give us confidence that we have
covered the various corner cases.
Conclusion
The aim of this post is not to make a specific proposal (not yet), but to try to illustrate further the approach I have in mind for specifying an executable form of the unsafe code guidelines. The key components are:
- An augmented interpreter that has meta-variables like WRITES and LAST_WRITE that track the global state of writes to memory.
  - This interpreter will also have additional metadata for Rust values, such as a time-stamp for references.
- An augmented compilation that includes assertions that can employ these meta-variables at well-defined points:
  - Before memory accesses and after function calls seem like likely candidates.
  - This compilation might take into account the “safety level” of a function as well.
- Using these assertions both to check for undefined behavior and to justify optimizations.
There is certainly plenty of work to be done. For example, we have to
work out just how to handle “reborrows” (i.e., &*x where x: &T) – it seems clear that the resulting reference should get the
“time-stamp” of the one from which it is borrowed.
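A minimal illustration of the reborrow case, with the assumed rule spelled out in comments:

fn reborrow(x: &u32) {
    // Assumed rule: a reborrow inherits the time-stamp of the reference
    // it is derived from, rather than taking a fresh snapshot of WRITES.
    let y: &u32 = &*x;
    // i.e., TIME_STAMP[y] = TIME_STAMP[x], not TIME_STAMP[y] = WRITES
    println!("{}", *y);
}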
Going further, the approach we outlined here isn’t quite enough to
handle &mut T
, since there we have to reason about the path by which
memory was reached, and not just the state of the memory itself. I
imagine though that we might be able to handle this by creating a
fresh id for each mutable borrow. When a memory cell is accessed, we
would track the id of the borrow that performed the access, and then when an
&mut
is used (or the validity of an &mut
is asserted, in safe
code) we would check that all publicly accessible memory is either
older than the reference or has the proper ID associated with it. Or
something like that, anyway.
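Purely as a sketch of that speculative idea, and leaving out the “older than the reference” case, the extra bookkeeping for &mut T might look something like this (all names are mine):

use std::collections::HashMap;

type Address = usize;
type BorrowId = u64;

struct MutTracker {
    next_id: BorrowId,
    // For each address, the id of the mutable borrow (if any) that
    // last accessed it.
    last_accessor: HashMap<Address, BorrowId>,
}

impl MutTracker {
    // Each `&mut` borrow gets a fresh id when it is created.
    fn fresh_borrow(&mut self) -> BorrowId {
        self.next_id += 1;
        self.next_id
    }

    // Record which borrow touched a given memory cell.
    fn record_access(&mut self, m: Address, id: BorrowId) {
        self.last_accessor.insert(m, id);
    }

    // When the `&mut` with id `id` is used (or its validity is asserted,
    // in safe code), check that the memory it governs was only touched
    // via that same borrow, if it was touched at all.
    fn check_use(&self, m: Address, id: BorrowId) {
        if let Some(&other) = self.last_accessor.get(&m) {
            assert!(other == id, "memory accessed through a different borrow");
        }
    }
}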
(Also, there is a lot of related work in this area, much of which I am not that familiar with. Robert Krebbers’s thesis formalizing the C standard is certainly relevant (I’m happy to say that when I spoke with him at POPL, he seemed to agree with the overall approach I am advocating here, though of course we didn’t get down to much detail). Projects like CompCert also leap to mind.)