Async Interview #5: Steven Fackler
20 January 2020
Hello! For the latest async interview, I spoke with Steven Fackler (sfackler). sfackler has been involved in Rust for a long time and is a member of the Rust libs team. He is also the author of a lot of crates, most notably tokio-postgres.
I particularly wanted to talk to sfackler about the `AsyncRead` and `AsyncWrite` traits. These traits are on everybody’s list of “important things to stabilize”, particularly if we want to create more interop between different executors and runtimes. On the other hand, in [tokio-rs/tokio#1744], the tokio project is considering adopting its own variant traits that diverge significantly from those in the futures crate, precisely because they have concerns over the design of the traits as-is. This seems like an important area to dig into!
Video
You can watch the video on YouTube. I’ve also embedded a copy here for your convenience:
One note: something about our setup meant that I was hearing a lot of echo. I think you can sometimes hear it in the recording, but not nearly as bad as it was live. So if I seem a bit spacey, or take very long pauses, you might know the reason why!
Background: concerns on the async-read trait
So what are the concerns that are motivating [tokio-rs/tokio#1744]? There are two of them:
- the current traits do not permit using uninitialized memory as the backing buffer;
- there is currently no way to test whether a given reader supports vectorized operations.
This blog post will focus on uninitialized memory
sfackler and I spent most of our time talking about uninitialized memory. We did also discuss vectorized writes, and I’ll include some notes on that at the end, but by and large sfackler felt that the solutions there are much more straightforward.
Important: The same issues arise with the sync `Read` trait
Interestingly, neither of these issues is specific to `AsyncRead`. As defined today, the `AsyncRead` trait is basically just the async version of `Read` from `std`, and both of these concerns apply there as well. In fact, part of why I wanted to talk to sfackler specifically is that he is the author of an excellent paper document that covers the problem of using uninitialized memory in great depth. A lot of what we talked about on this call is also present in that document. Definitely give it a read.
`Read` interface doesn’t support uninitialized memory
The heart of the `Read` trait is the `read` method:

```rust
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>
```
This method reads data and writes it into `buf`, and then – assuming no error – returns `Ok(n)`, where `n` is the number of bytes written. Ideally, we would like it if `buf` could be an uninitialized buffer. After all, the `Read` trait is not supposed to be reading from `buf`, it’s just supposed to be writing into it – so it shouldn’t matter what data is in there.
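To make the cost concrete, here is a sketch (mine, not from the interview) of the pattern callers are forced into today: since `read` requires an initialized buffer, the buffer must be zeroed up front, even though a well-behaved impl will overwrite it anyway.

```rust
use std::io::Read;

// A minimal sketch of today's caller-side pattern; the 16K size is
// arbitrary, chosen only for illustration.
fn read_chunk(reader: &mut impl Read) -> std::io::Result<Vec<u8>> {
    // We pay to zero 16K bytes that a well-behaved `Read` impl will
    // immediately overwrite.
    let mut buf = vec![0u8; 16 * 1024];
    let n = reader.read(&mut buf)?;
    buf.truncate(n); // keep only the bytes actually read
    Ok(buf)
}
```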
Problem 1: The impl might read from `buf`, even if it shouldn’t
However, in practice, there are two problems with using uninitialized memory for `buf`. The first one is relatively obvious: although it isn’t supposed to, the `Read` impl can trivially read from `buf` without using any unsafe code:
```rust
impl Read for MyReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let x = buf[0];
        ...
    }
}
```
Reading from an uninitialized buffer is Undefined Behavior and could cause crashes, segfaults, or worse.
Problem 2: The impl might not really initialize the buffer
There is also a second problem that is often overlooked: when the `Read` impl returns, it returns a value `n` indicating how many bytes of the buffer were written. In principle, if `buf` was uninitialized to start, then the first `n` bytes should be written now – but are they? Consider a `Read` impl like this one:
```rust
impl Read for MyReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        Ok(buf.len())
    }
}
```
This impl has no unsafe code. It claims that it has initialized the entire buffer, but it hasn’t done any writes into `buf` at all! Now if the caller tries to read from `buf`, it will be reading uninitialized memory, and causing UB.
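To see the hazard from the caller’s side, here is a sketch (mine). With today’s `Read` the caller must initialize `buf`, so this is fine; but if `buf` were allowed to be uninitialized, the final line would read uninitialized memory whenever the impl lies about `n`:

```rust
use std::io::{self, Read};

fn caller(reader: &mut impl Read) -> io::Result<()> {
    let mut buf = [0u8; 1024]; // imagine this were uninitialized
    let n = reader.read(&mut buf)?;
    // The caller has no choice but to trust the impl's claim about `n`.
    println!("{:?}", &buf[..n]);
    Ok(())
}
```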
One subtle point here: the problem isn’t just that the `read` impl could lie about how many bytes it has written. The problem is that it can lie without ever using any unsafe code at all. So if you are auditing your code for `unsafe` blocks, you would overlook this.
Constraints and solutions
There have been a lot of solutions proposed to this problem. sfackler and I talked about all of them, I think, but I’m going to skip over most of the details. You can find them either in the video or in sfackler’s paper document, which covers much of the same material.
In this post, I’ll just cover what we said about three of the options:
- First, adding a `freeze` operation.
  - This is in some ways the simplest, as it requires no change to `Read` at all.
  - Unfortunately, it has a number of limitations and downsides.
- Second, adding a second `read` method that takes a `&mut dyn BufMut` trait object.
  - This is the solution initially proposed in [tokio-rs/tokio#1744].
  - It has much to recommend it, but it requires virtual calls in a core API, although initial benchmarks suggest such calls are not a performance problem.
- Finally, creating a struct `BufMut` in the stdlib for dealing with partially initialized buffers, and adding a `read` method for that.
  - This overcomes some of the downsides of using a trait, but at the cost of flexibility.
Digression: how to think about uninitialized memory
Before we go further, let me digress a bit. I think the common understanding of uninitialized memory is that “it contains whatever values happen to be in there at the moment”. In other words, you might imagine that when you first allocate some memory, it contains some value – but you can’t predict what that is.
This intuition turns out to be incorrect, for a number of reasons. Compiler optimizations are part of it. In LLVM, for example, an uninitialized variable is not assigned to a fixed stack slot or anything like that. It is instead a kind of free-floating “uninitialized” value, and – whenever needed – it is mapped to whatever register or stack slot happens to be convenient at the time, so as to produce the most optimal code. What this means in practice is that each time you try to read from it, the compiler will substitute some value, but it won’t necessarily be the same value every time. This behavior is justified by the C standard, which states that reading uninitialized memory is “undefined behavior”.
This can cause code to go quite awry. The canonical example in my mind is the case of a bounds check. You might imagine, for example, that code like this would suffice for legally accessing an array:
```rust
let index = compute_index();
if index < length {
    return &array[index];
} else {
    panic!("out of bounds");
}
```
However, if the value returned by `compute_index` is uninitialized, this is incorrect, because in that case `index` will also be “the uninitialized value”, and hence each access to it conceptually yields a different value. So the value that we compare against `length` might not be the same value that we use to index into the array one line later. Woah.
But, as sfackler and I discussed, there are actually other layers, even below the compiler, that rely on uninitialized memory never being read. For example, in the Linux kernel, the virtual memory system has a flag called `MADV_FREE`. This flag is used to mark virtual memory pages that are considered uninitialized. For each such virtual page, the kernel is free to change the physical memory page at will – until the virtual page is written to. At that point, the memory is potentially initialized, and so the virtual page is pinned. What this means in practice is that when you get memory back from your allocator, each read from that memory may yield different values, unless you’ve written to it first.
For all these reasons, it is best to think of uninitialized memory not as having “some random value” but rather as having the value “uninitialized”. This is a special value that can, sometimes, be converted to a random value when it is forced to (but, if accessed multiple times, it may yield different values each time).
If you’d like a deeper treatment, I recommend Ralf’s blog post.
Possible solution to read: Freeze operation
So, given the above, what is the freeze operation, and how could it help with handling uninitialized memory in the `read` API?
The general idea is that we could have a primitive called `freeze` that, given some (potentially) uninitialized value, converts any uninitialized bits into “some random value”. We could use this to fix our indexing, for example, by “freezing” the index before we compare it against the length:
```rust
let index = freeze(compute_index());
if index < length {
    return &array[index];
} else {
    panic!("out of bounds");
}
```
In a similar way, if we have a reference to an uninitialized buffer, we could conceivably “freeze” that reference to convert it into a reference to random bytes, which we could then safely use to invoke `read`. The idea would be that callers do something like this:
```rust
let uninitialized_buffer = ...;
let mut buffer = freeze(uninitialized_buffer);
let n = reader.read(&mut buffer)?;
...
```
If we could do this, it would be great, because the existing `read` interface wouldn’t have to change at all!
There are a few complications, though. First off, there is no such `freeze` operation in LLVM today. There is talk of adding one, but that operation wouldn’t quite do what we need. For one thing, it freezes the value it is applied to, but it doesn’t apply through a reference. So you could use it to fix our array bounds checking example, but you can’t use it to fix `read` – we don’t need to freeze the `&mut [u8]` reference, we need to freeze the memory it refers to.
Secondly, that primitive would only apply to compiler optimizations. It wouldn’t protect against kernel optimizations like `MADV_FREE`. To handle those, we would have to do something extra, such as writing one byte per memory page (sketched after the list below). That’s conceivable, of course, but there are some downsides:
- It feels fragile. What if Linux adds some new optimizations in the future? How will we work around those?
- It feels disappointing. After all, `MADV_FREE` was presumably added because it allows things to be faster – and we all agree that, given a “well-behaved” `Read` implementation, using uninitialized memory should be reasonable.
- It can be expensive. sfackler pointed out that it is sometimes common to “over-provision” your read buffers, such as creating a 16MB buffer, so as to avoid blocking. This is fairly cheap in practice, but only thanks to optimizations (like `MADV_FREE`) that allow that memory to be lazily allocated and so forth. If we start writing a byte into every page of a 16MB buffer, you’re going to notice the difference.
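For a sense of what “writing one byte per memory page” might involve, here is a rough sketch. The hard-coded page size and the volatile writes are my assumptions; a real implementation would query the OS for the page size and would have to cope with possibly uninitialized memory:

```rust
// Hypothetical sketch: pin the buffer's pages, so that kernel
// optimizations like MADV_FREE cannot swap the physical pages out
// from under us, by writing one byte into each page.
fn touch_pages(buf: &mut [u8]) {
    const PAGE_SIZE: usize = 4096; // assumed; query the OS in real code
    for i in (0..buf.len()).step_by(PAGE_SIZE) {
        // Volatile writes keep the compiler from eliding the stores.
        unsafe { std::ptr::write_volatile(&mut buf[i], 0) };
    }
}
```

For a 16MB buffer, that is 4096 such writes, each one forcing a page to be committed – exactly the cost that `MADV_FREE` is meant to avoid.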
For these reasons, sfackler felt like `freeze` isn’t the right answer here. It might be a useful primitive for things like array bounds checking, but it would be better if we could modify the `Read` trait in such a way that we permit the use of “unfrozen” uninitialized memory.
Incidentally, this is a topic we’ve hit on in previous async interviews. [cramertj and I talked about it][ctj2], for example. My own opinion has shifted – at first, I thought a freeze primitive was obviously a good idea, but I’ve come to agree with sfackler that it’s not the right solution here.
Fallback and efficient interoperability
If we don’t take the approach of adding a `freeze` primitive, then this implies that we are going to have to extend the `Read` trait with a second method. Let’s call it `read2` for short. And this raises an interesting question: how are we going to handle backwards compatibility?
In particular, `read2` is going to have a default, so that existing impls of `Read` are not invalidated. And this default is going to have to fall back to calling `read`, since that is the only method that we can guarantee to exist. Since `read` requires a fully initialized buffer, this means that `read2` will have to zero its buffer if it may be uninitialized. This by itself is ok – it’s no worse than today.
The problem is that some of the solutions discussed in sfackler’s doc can wind up having to zero the buffer multiple times, depending on how things play out. And this could be a big performance cost. That is definitely to be avoided.
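To make the fallback concrete, here is a sketch of what the defaulted method might look like; the `Read2` trait name and the `read2` signature are hypothetical stand-ins, not a settled design:

```rust
use std::io::{self, Read};
use std::mem::MaybeUninit;

// Hypothetical extension trait, purely to illustrate the fallback path.
trait Read2: Read {
    fn read2(&mut self, buf: &mut [MaybeUninit<u8>]) -> io::Result<usize> {
        // The default must zero the buffer so that it is safe to hand
        // to `read`, which requires initialized memory.
        for byte in buf.iter_mut() {
            byte.write(0);
        }
        // SAFETY: every byte was just initialized above.
        let buf = unsafe {
            std::slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut u8, buf.len())
        };
        self.read(buf)
    }
}
```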
Possible solution to read: Take a trait object, and not a buffer
Another proposed solution, in fact the one described in [tokio-rs/tokio#1744], is to modify `read` so that it takes a trait object (in the case of the `Read` trait, we’d have to add a new, defaulted method):

```rust
fn read_buf(&mut self, buf: &mut dyn BufMut) -> io::Result<()>
```
The idea here is that `BufMut` is a trait that lets you safely access a potentially uninitialized set of buffers:

```rust
pub trait BufMut {
    fn remaining_mut(&self) -> usize;
    unsafe fn advance_mut(&mut self, cnt: usize);
    unsafe fn bytes_mut(&mut self) -> &mut [u8];
    ...
}
```
You might wonder why the definition takes a `&mut dyn BufMut`, rather than a `&mut impl BufMut`. Taking `impl BufMut` would mean that the code is specialized to the particular sort of buffer you are using, which would potentially be quite a bit faster. However, it would also make `Read` not “dyn-safe”[^1], and that’s a non-starter.
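To get a feel for how an implementor would use those methods, here is a toy reader over an in-memory slice; the `SliceReader` type is invented for illustration:

```rust
use std::io;

// Sketch: how the proposed `read_buf` might be written against the
// `BufMut` trait shown above.
struct SliceReader<'a> {
    data: &'a [u8],
}

impl<'a> SliceReader<'a> {
    fn read_buf(&mut self, buf: &mut dyn BufMut) -> io::Result<()> {
        // SAFETY: we only ever *write* into `bytes_mut()`, and we
        // advance by exactly the number of bytes we wrote.
        unsafe {
            let dst = buf.bytes_mut(); // virtual call
            let n = dst.len().min(self.data.len());
            dst[..n].copy_from_slice(&self.data[..n]);
            buf.advance_mut(n); // virtual call
            self.data = &self.data[n..];
        }
        Ok(())
    }
}
```

Note that both `bytes_mut` and `advance_mut` go through the vtable here – that is the performance hazard discussed below.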
There are some nifty aspects to this proposal. One of them is that the same trait can to some extent “paper over” vectorized writes, by distributing the data written across buffers in a chain.
But there are some downsides. Perhaps most important is that requiring virtual calls to write into the buffer could be a significant performance hazard. Thus far, measurements don’t bear that out, but it seems like a cost that can only be recovered by heroic compiler optimizations, and that’s the kind of thing we prefer to avoid.
Moreover, the ability to be generic over vectorized writes may not be as useful as you might think. Often, the caller wants to know whether the underlying `Read` supports vectorized writes, since it would operate quite differently in that case. Therefore, it doesn’t really hurt to have two `read` methods, one for normal writes and one for vectorized writes.
Variant: use a struct, instead of a trait
The variant that sfackler prefers is to replace the `BufMut` trait with a struct.[^2] The API of this struct would be fairly similar to the trait above, except that it wouldn’t make much attempt to unify vectorized and non-vectorized writes.
Basically, we’d have a struct that encapsulates a “partially initialized slice of bytes”. You could create such a struct from a standard slice, in which case all bytes are initialized, or you could create it from a slice of “maybe initialized” bytes (e.g., `&mut [MaybeUninit<u8>]`). There could also be convenience methods to create a `BufMut` that refers to the uninitialized tail of bytes in a `Vec` (i.e., pointing into the vector’s internal buffer).
The safe methods of the `BufMut` API would permit:
- writing to the buffer, which will track the bytes that were initialized;
- getting access to a slice, but only one that is guaranteed to be initialized.
There would also be unsafe methods for getting access to memory that may be uninitialized, or for asserting that you have initialized a big swath of bytes (e.g., after handing the buffer off to the kernel to be written to).
The buffer has state: it can track what has been initialized. This means that any given part of the buffer will get zeroed at most once, which ensures that the fallback from the new `read2` method to the old `read` method is reasonably efficient.
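Here is a minimal sketch of what such a struct might look like. The field names and the two watermarks (`filled` for what has been written, `initialized` for what has ever been initialized) are my guesses at the shape of the design, not a settled API:

```rust
use std::mem::MaybeUninit;

// Sketch of a struct encapsulating a partially initialized buffer.
pub struct BufMut<'a> {
    data: &'a mut [MaybeUninit<u8>],
    filled: usize,      // bytes written so far
    initialized: usize, // high-water mark of initialized bytes
}

impl<'a> BufMut<'a> {
    pub fn uninit(data: &'a mut [MaybeUninit<u8>]) -> Self {
        BufMut { data, filled: 0, initialized: 0 }
    }

    /// Safe write: copies bytes in and advances both watermarks.
    pub fn put_slice(&mut self, src: &[u8]) {
        let end = self.filled + src.len();
        assert!(end <= self.data.len(), "write past end of buffer");
        for (dst, &b) in self.data[self.filled..end].iter_mut().zip(src) {
            dst.write(b);
        }
        self.filled = end;
        self.initialized = self.initialized.max(end);
    }

    /// Safe read access, limited to bytes known to be initialized.
    pub fn filled(&self) -> &[u8] {
        // SAFETY: every byte below `self.filled` was written above.
        unsafe {
            std::slice::from_raw_parts(self.data.as_ptr() as *const u8, self.filled)
        }
    }
}
```

Because the `initialized` watermark persists, a fallback that needs zeroed memory only ever has to zero each byte once.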
Sync vs async, how to proceed
So, given the above thoughts, how should we proceed with `AsyncRead`? sfackler felt that the question of how to handle uninitialized output buffers was basically “orthogonal” to the question of whether and when to add `AsyncRead`. In other words, sfackler felt that the `AsyncRead` and `Read` traits should mirror one another, which means that we could add `AsyncRead` now, and then add a solution for uninitialized memory later – or we could do the reverse.
One minor question has to do with defaults. Currently the `Read` trait requires an implementation of `read` – any new method (`read_uninit` or whatever we call it) will therefore have to have a default implementation that invokes `read`. But this is sort of the wrong incentive: we’d prefer that users implement `read_uninit`, and implement `read` in terms of the new method. We could conceivably reverse the defaults for the `AsyncRead` trait to this preferred style. Alternatively, sfackler noted that we could give both `read` and `read_uninit` default implementations, each written in terms of the other. In that case, users would have to implement one or the other (implementing neither would lead to an infinite loop, and we would likely want a lint for that case).
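Here is a sketch of that “mutually defaulted” idea; the trait and method names are illustrative only:

```rust
use std::io;
use std::mem::MaybeUninit;

// Both methods get defaults, each defined in terms of the other, so an
// implementor overrides whichever is natural. Overriding neither would
// recurse forever, hence the suggested lint.
trait ReadSketch {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // SAFETY: `MaybeUninit<u8>` is layout-compatible with `u8`, and
        // `read_uninit` never de-initializes bytes it is given.
        let uninit = unsafe {
            std::slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut MaybeUninit<u8>, buf.len())
        };
        self.read_uninit(uninit)
    }

    fn read_uninit(&mut self, buf: &mut [MaybeUninit<u8>]) -> io::Result<usize> {
        for byte in buf.iter_mut() {
            byte.write(0); // the zeroing fallback sketched earlier
        }
        // SAFETY: every byte was just initialized above.
        let init = unsafe {
            std::slice::from_raw_parts_mut(buf.as_mut_ptr() as *mut u8, buf.len())
        };
        self.read(init)
    }
}
```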
We also discussed what it would mean if tokio adopted its own `AsyncRead` trait that diverged from std. While not ideal, sfackler felt like it wouldn’t be that big a deal either way, since it ought to be possible to efficiently interconvert between the two. The main constraint is having some kind of stateful entity that can remember the amount of uninitialized data, thus preventing the inefficient fallback behavior.
Is the ability to use uninitialized memory even a problem?
We spent a bit of time at the end discussing how one could gain data on this problem. There are two things that would be nice to know.
First, how big is the performance impact from zeroing? Second, how ergonomic is the proposed API to use in practice?
Regarding the performance impact, I asked the same question on [tokio-rs/tokio#1744], and I did get back some interesting results, [which I summarized in this hackmd at the time][tokio-hackmd]. In short, hyper’s benchmarks show a fairly sizable impact, with uninitialized data getting speedups[^3] of 1.3-1.5x. Other benchmarks, though, are much more mixed, showing either no difference or small differences on the order of 2%. Within the stdlib, we found about a [7% impact on microbenchmarks][#26950].
Still, sfackler raised another interesting data point (both on the thread and in our call). He pointed to #23820, a PR which rewrote `read_to_end` in the stdlib. The older implementation was simple and obvious, but suffered from massive performance cliffs related to the need to zero buffers. The newer implementation is fast, but much more complex. Using one of the APIs described above would permit us to avoid this complexity.
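To see where those cliffs come from, here is a sketch of the “simple and obvious” strategy (not the actual stdlib code): grow the vector in chunks and let `resize` zero each chunk before reading into it. Because the vector is truncated after every call, the same region can be zeroed over and over:

```rust
use std::io::{self, Read};

// Naive read_to_end: correct, but pays for zeroing on every iteration.
fn read_to_end_naive<R: Read>(reader: &mut R, out: &mut Vec<u8>) -> io::Result<usize> {
    let start = out.len();
    loop {
        let len = out.len();
        // Grow by a fixed chunk; `resize` zero-fills the new tail.
        out.resize(len + 4096, 0);
        let n = reader.read(&mut out[len..])?;
        out.truncate(len + n);
        if n == 0 {
            return Ok(len - start);
        }
    }
}
```

With a buffer type that remembers how much of the allocation is already initialized, each byte is zeroed at most once and the simple structure can be kept.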
Regarding ergonomics, as ever, that’s a tricky thing to judge. It’s hard to do better than prototyping as well as offering the API on nightly for a time, so that people can try it out and give feedback.
Having the API on nightly would also help us to make branches of frameworks like tokio and async-std so we can do bigger measurements.
Higher levels of interoperability
sfackler and I talked a bit about what the priorities should be beyond `AsyncRead`. One of the things we talked about is whether there is a need for higher-level traits or libraries that expose more custom information beyond “here is how to read data”. One example that has come up from time to time is the need to know, for example, the URL or other information associated with a request.
Another example might be the role of crates like `http`, which aims to define Rust types for things like HTTP header codes that are fairly standard. These would be useful types to share across all HTTP implementations and libraries, but will we be able to achieve that sort of sharing without offering the crate as part of the stdlib (or at least as part of the Rust org)? I don’t think we had a definitive answer here.
Priorities beyond async read
We next discussed what other priorities the Rust org might have around Async I/O. For sfackler, the top items would be
- better support for GATs and async fn in traits;
- some kind of generator or syntactic support for streams;
- improved diagnostics, particularly around send/sync.
Conclusion
sfackler and I focused quite heavily on the `AsyncRead` trait and how to manage uninitialized memory. I think that it would be fair to summarize the main points of our conversation as:
- we should add `AsyncRead` to the stdlib and have it mirror `Read`;
- in general, it makes sense for the synchronous and asynchronous versions of the traits to be analogous;
- we should extend both traits with a method that takes a `BufMut` struct to manage uninitialized output buffers, as the other options all have a crippling downside;
- we should extend both traits with a “do you support vectorized output?” callback as well;
- beyond that, the Rust org should focus heavily on diagnostics for async/await, but streams and async fns in traits would be great too. =)
Comments?
There is a thread on the Rust users forum for this series.
Appendix: Vectorized reads and writes
There is one minor subthread that I’ve skipped over – vectorized reads and writes. I skipped it in the blog post because this problem is somewhat simpler. The standard `read` interface takes a single buffer to write the data into. But a vectorized interface takes a series of buffers – if there is more data than will fit in the first one, then the data will be written into the second one, and so on until we run out of data or buffers. Vectorized reads and writes can be much more efficient in some cases.
Unfortunately, not all readers support vectorized reads. For that reason, the “vectorized read” method has a fallback: by default, it just calls the normal read method using the first non-empty buffer in the list. This is theoretically equivalent, but obviously it can be a lot less efficient – imagine that I have supplied one buffer of size 1K and one buffer of size 16K. The default vectorized read method will just always use that single 1K buffer, which isn’t great – but still, there’s not much to be done about it. Some readers just cannot support vectorized reads.
The problem, then, is that there is currently no way to detect whether a reader supports vectorized reads. Being able to detect this would allow the caller to choose between a “vectorized” call path, where it tries to supply many buffers, or a single-buffer call path, where it just allocates one big buffer.
Apparently hyper will do this today, but using a heuristic: if a call to the vectorized read method returns just enough data to fit in the first buffer, hyper guesses that in fact vectorized reads are not supported, and switches dynamically to the “one big buffer” strategy. (Neat.)
There is perhaps a second issue, one of ergonomics: since the vectorized read method has a default implementation, it is easy to forget to implement it, even when you could have done so.
In any case, this problem is relatively easy to solve: we basically need to add a new method to the trait, something like:

```rust
fn supports_vectorized_reads(&self) -> bool
```
Deciding whether or not to supply a default is a bit trickier. If you don’t supply a default, then everybody has to implement the method, even if they just want the default behavior. But if you do, people who should have implemented it may forget to do so – this is particularly unfortunate for readers that wrap another reader, which is a pretty common case.
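Here is a caller-side sketch of how such a capability query might be used. `read_vectored` is the real method on `Read`; `supports_vectorized_reads` is the hypothetical method from above, stubbed out as a free function here:

```rust
use std::io::{self, IoSliceMut, Read};

// With the capability query, the caller can commit to a strategy up
// front instead of guessing, as hyper does today.
fn read_into<R: Read>(reader: &mut R, small: &mut [u8], big: &mut [u8]) -> io::Result<usize> {
    if supports_vectorized_reads(reader) {
        // Vectorized path: hand over both buffers at once.
        let mut bufs = [IoSliceMut::new(small), IoSliceMut::new(big)];
        reader.read_vectored(&mut bufs)
    } else {
        // Single-buffer path: just use the one big buffer.
        reader.read(big)
    }
}

// Stand-in for the proposed trait method.
fn supports_vectorized_reads<R: Read>(_reader: &R) -> bool {
    false // conservative default, mirroring the trait's likely default
}
```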
Footnotes
[^1]: Most folks say “object-safe” here, but I’m trying to shift our terminology to talk more about the `dyn` keyword.
[^2]: Carl Lerche proposed something similar on the tokio thread here.
[^3]: I am defining a “speedup” here as the ratio `U/Z`, where `U` and `Z` are the throughputs with uninitialized and zeroed buffers, respectively.

[tokio-rs/tokio#1744]: https://github.com/tokio-rs/tokio/issues/1744
[tokio-hackmd]: https://hackmd.io/ukeyehx7Ta-6KhaVRFi2mg#Measuring-the-impact
[#26950]: https://github.com/rust-lang/rust/pull/26950