Async Interview #6: Eliza Weisman

11 February 2020

Hello! For the latest async interview, I spoke with Eliza Weisman (hawkw, mycoliza on twitter). Eliza first came to my attention as the author of the tracing crate, which is a nifty crate for doing application level tracing. However, she is also a core maintainer of tokio, and she works at Buoyant on the linkerd system. linkerd is one of a small set of large applications that were build using 0.1 futures – i.e., before async-await. This range of experience gives Eliza an interesting “overview” perspective on async-await and Rust more generally.

Video

You can watch the video on YouTube. I’ve also embedded a copy here for your convenience:

The days before question mark

Since I didn’t know Eliza as well, we started out talking a bit about her background. She has been using Rust for 5 years, and I was amused by how she characterized the state of Rust when she got started: pre-“question mark” Rust. Indeed, the introduction of the ? operator does feel one of those “turning points” in the history of Rust, and I’m quite sure that async-await will feel similarly (at least for some applications).

One interesting observation that Eliza made is that it feels like Rust has reached the point where there is nothing critically missing. This isn’t to say there aren’t things that need to be improved, but that the number of “rough edges” has dramatically decreased. I think this is true, and we should be proud of it – though we also shouldn’t relax too much. =) Getting to learn Rust is still a significant hurdle and there are still a number of things that are much harder than they need to be.

One interesting corrolary of this is that a number of the things that most affect Eliza when writing Async I/O code are not specific to async I/O. Rather, they are more general features or requirements that apply to a lot of different things.

Tokio’s needs

We talked some about what tokio needs from async Rust. As Eliza said, many of the main points already came up in my conversation with Carl:

async functions in traits would be great, but they’re hard
stabilizing streams, async read, and async write would be great

Communicating stability

One thing we spent a fair while discusing is how to best communicate our stability story. This goes beyond “semver”. semver tells you when a breaking change has been made, of course, but it doesn’t tell whether a breaking change will be made in the future – or how long we plan to do backports, and the like.

The easiest way for us to communicate stability is to move things to the std library. That is a clear signal that breaking changes will never be made.

But there is room for us to set “intermediate” levels of stability. One thing that might help is to make a public stability policy for crates like futures. For example, we could declare that the futures crate will maintain compatibility with the current Stream crate for the next year, or two ears.

These kind of timelines would be helpful: for example, tokio plans to maintain a stable interface for the next 5 years, and so if they want to expose traits from the futures crate, they would want a guarantee that those traits would be supported during that period (and ideally that futures would not release a semver-incompatible version of those traits).

Depending on community crates

When we talk about interoperability, we are often talking about core traits like Future, Stream, and AsyncRead. But as we move up the stack, there are other things where having a defined standard could be really useful. My go to example for this is the http crate, which defines a number of types for things like HTTP error codes. The types are important because they are likely to find their way in the “public interface” of libraries like hyper, as well as frameworks and things. I would like to see a world where web frameworks can easily be converted between frameworks or across HTTP implementations, but that would be made easier if there is an agreed upon standard for representing the details of a HTTP request. Maybe the http crate is that already, or can become that – in any case, I’m not sure if the stdlib is the right place for such a thing, or at least not for some time. It’s something to think about. (I do suspect that it might be useful to move such crates to the Rust org? But we’d have to have a good story around maintainance.) Anyway, I’m getting beyond what was in the interview I think.

Tracing

We talked a fair amount about the tracing library. Tracing is one of those libraries that can do a large number of things, so it’s kind of hard to concisely summarize what it does. In short, it is a set of crates for collecting scoped, structured, and contextual diagnostic information in Rust programs. One of the simplest use cases is to collect logging information, but it can also be used for things like profiling and any number of other tasks.

I myself started to become interesting in tracing as a possible tool to help for debugging and analyzing programs like rustc and chalk, where the “chain” that leads to a bug can often be quite complex and involve numerous parts of the compiler. Right now I tend to just dump gigabytes of logs into files and traverse them with grep. In so doing, I lose all kinds of information (like hierarchical information about what happens during what) that would make my life easier. I’d love a tool that let me, for example, track “all the logs that pertain to a particular function” while also making it easy to find the context in which a particular log occurred.

The tracing library got its start as a structured replacement for various hacky layers atop the log crate that were in use for debugging linkerd. Like many async applications, debugging a linkerd session involves correlating a lot of events that may be taking place at distinct times – or even distinct machines – but are still part of one conceptual “thread” of control.

tracing is actually a “front-end” built atop the “tracing-core” crate. tracing-core is a minimal crate that just stores a thread-local containing the current “event subscriber” (which processes the tracing events in some way). You don’t interact with tracing-core directly, but it’s important to the overall design, as we’ll see in a bit.

The tracing front-end contains a bunch of macros, rather like the debug! and info! you may be used to from the log crate (and indeed there are crates that let you use those debug! logs directly). The major one is the span! macro, which lets you declare that a task is happening. It works by putting a “placeholder” on the stack: when that placeholder is dropped, the task is done:

let s: Span = span!(...); // create a span `s`
let _guard = s.enter(); // enter `s`, so that subsequent events take place "in" `s`
let t: Span = span!(...); // create a *subspan* of `s` called `t`
...

Under the hood, all of these macros forward to the “subscripber” we were talking about later. So they might receive events like “we entered this span” or “this log was generated”.

The idea is that events that happen inside of a span inherit the context of that span. So, to jump back to my compiler example, I might use a span to indicate which function is currently being type-checked, which would then be associated with any events that took place.

There are many different possible kinds of subscribers. A subscriber might, for example, dump things out in real time, or it might just collectevents and log them later. Crates like tracing-timing record inter-event timing and make histograms and flamegraphs.

Integrating tracing with other libraries

It seems clear that tracing would work best if it is integrated with other libaries. I believe it is already integrated into tokio, but one could also imagine integrating tracing with rayon, which distributes tasks across worker threads to run in parallel. The goal there would be that we “link” the tasks so that events which occur in a parallel task inherit the context/span information from the task which spawned them, even though they’re running on another thread.

The idea here is not only that Rayon can link up your application events, but that Rayon can add its own debugging information using tracing in a non-obtrusive way. In the ‘bad old days’, tokio used to have a bunch of debug! logs that would let you monitor what was going on – but these logs were often confusing and really targeting internal tokio developers.

With the tracing crate, the goal is that libraries can enrich the user’s diagnostics. For example, the hyper library might add metadata about the set of headers in a request, and tokio might add information about which thread-pool is in use. This information is all “attached” to your actual application logs, which have to do with your business logic. Ideally, you can ignore them most of the time, but if that sort of data becomes relevant – e.g., maybe you are confused about why a header doesn’t seem to be being detected by your appserver – you can dig in and get the full details.

Integrating tracing with other logging systems

Eliza emphasized that she would really like to see more interoperability amongst tracing libraries. The current tracing crate, for example, can be easily made to emit log records, making it interoperable with the log crate (there is also a “logger” that implements the tracing interface).

Having a distinct tracing-core crate means that it possible for there to be multiple facades that build on tracing, potentially operating in quite different ways, which all share the same underlying “subscriber” infrastructure. (rayon uses the same trick; the rayon-core crate defines the underlying scheduler, so that multiple versions of the rayon ParallelIterator traits can co-exist without having multiple global schedulers.) Eliza mentioned that – in her ideal world – there’d be some alternative front-end that is so good it can replaces the tracing crate altogether, so she no longer has to maintain the macros. =)

RAII and async fn doesn’t always play well

There is one feature request for async-await that arises from the tracing library. I mentioned that tracing uses a guard to track the “current span”:

let s: Span = span!(...); // create a span `s`
let _guard = s.enter(); // enter `s`, so that subsequent events take place "in" `s`
...

The way this works is that the guard returned by s.enter() adds some info into the thread-local state and, when it is dropped, that info is withdrawn. Any logs that occur while the _guard is still live are then decorated with this extra span information. The problem is that this mechanism doesn’t work with async-await.

As explained in the tracing README, the problem is that if an async await function yields during an await, then it is removed from the current thread and suspended. It will later be resumed, but potentially on another thread altogether. However, the _guard variable is not notified of these events, so (a) the thread-local info remains set on the original thread, where it may not longer belong and (b) the destructor which goes to remove the info will run on the wrong thread.

One way to solve this would be to have some sort of callback that _guard can receive to indicate that it is being yielded, along with another callback for when an async fn resumes. This would probably wind up being optional methods of the Drop trait. This is basically another feature request to making RAII work well in an async environment (in addition to the existing problems with async drop that boats described here).

Priorities as a linkerd hacker

I asked Eliza to think for a second about what priorities she would set for the Rust org while wearing her “linkerd hacker” hat – in other words, when acting not as a library designer, but as the author of an that relies on Async I/O. Most of the feedback here though had more to do with general Rust features than async-await specifically.

Eliza pointed out that linkerd hasn’t yet fully upgraded to use async-await, and that the vast majority of pain points she’s encountered thus far stem from having to use the older futures model, which didn’t integrate well with rust borrows.

The other main pain point is the compilation time costs imposes by the deep trait hierarchies created by tower’s service and layer traits. She mentioned hitting a type error that was so long it actually crashed her terminal. I’ve heard of others hitting similar problems with this sort of setup. I’m not sure yet how this is best addressed.

Another major feature request would be to put more work into procedural macros, especially in expression position. Right now proc-macro-hack is the tool of choice but – as the name suggests – it doesn’t seem ideal.

The other major point is that support for cargo feature flags in tooling is pretty minimal. It’s very easy to have code with feature flags that “accidentally” works – i.e., I depend on feature flag X, but I don’t specify it; it just gets enabled via some other dependency of mine. This also makes testing of feature flags hard. rustdoc integration could be better. All true, all challenging. =)

Comments?

There is a thread on the Rust users forum for this series.