Auto-serialization in Rust

9 February 2012

I’ve been working on implementing Cross-Crate Inlining. The major task here is to serialize the AST. This is conceptually trivial but in practice a major pain. It’s an interesting fact that the more tightly you type your data, the more of a pain it (generally) is to work with in a generic fashion. Of the functional-ish languages I’ve used, Scala actually makes things relatively easy through a combination of reflection and dynamic typing (interfaces like Product come to mind).

Anyway, Rust does not (yet?) have reflection, but I have been working on a program that will autogenerate the serialization code for our AST based on the type definitions themselves. Normally, I would probably do this with some Python program and a bunch of hacky regular expressions. But instead I am taking advantage of one of Rust’s nicer (and somewhat unusual, although becoming less so) features: the fact that the Rust compiler is itself a library. (An aside: I plan to implement this serialization code as a syntax extension or macro once those systems mature.)

To use serializer, you provide it with a crate file and a set of type names. It will then generate Rust code that serializes instances of those types. Internally, it invokes the compiler to parse and type-check the crate, using the compile_upto() function, which allows you to compile a given input only up to a certain point (in this case, up to the point where the type-checking phase has completed).

An aside: This is the point where the beauty of crate files becomes more apparent: a crate is a self-contained specification that not only contains a listing of the source modules and so forth, but also the external crates that are required, default compilation options, etc. Having all of this mess encapsulated in a crate means that it is trivial for a tool like serializer to recreate the compilation environment for your package: just provide it with a crate file. If this were a C program, you’d also have to supply a random smattering of gcc options, which you would in turn have to figure out how to extract from your makefile, not to mention the makefiles from external packages that you are using. Ugh.
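
Just to make that concrete, a crate file is only a handful of lines. A hypothetical one (all of the names here are made up, but the shape is right) might look like:

#[link(name = "serializer", vers = "0.1")];

use std;    // external crates this crate depends on
use rustc;

mod serializer;  // the source modules that make up the crate
mod ty_walk;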

Once serializer has parsed and type-checked your source, it is provided with a crate AST and a type context (ty::ctxt). Using these two things, it’s fairly straightforward to locate the definitions for the types we are supposed to serialize and walk over them, generating code as we go.

The actual code works by walking ty::t instances. ty::t is the type used in the Rust compiler to represent types. This is distinct from ast::ty, which is the syntax tree that represents a type. ty::t is modeled after the type system in the abstract, which makes it easier to work with. The other reason to walk ty::t instances and not ast::ty is that there is no AST available for types defined in external crates (such as option::t, defined in libcore).

Basically, for each unique ty::t that we encounter, we generate a function of the form:

fn serialize<C: serialization::ctxt>(cx: C, t: T) {
    ...
}

Here, T is the type represented by the ty::t. The variable cx is a serialization context, defined by the interface serialization::ctxt, which looks like this:

mod serialization {
    iface ctxt {
        fn emit_u64(x: u64);
        fn emit_i64(x: i64);
    
        fn emit_record(f: fn());
        fn emit_field(f_name: str, f_id: uint, f: fn());

        fn emit_enum(e_name: str, f: fn());
        fn emit_variant(v_name: str, v_id: uint, f: fn());
        
        ...
    }
}

So, for example, the serialization function for a type {x: uint, y: uint} would look something like:

fn serialize1<C: serialization::ctxt>(cx: C, &&v: {x: uint, y: uint}) {
    cx.emit_record {||
        cx.emit_field("x", 0) {||
            cx.emit_u64(v.x as u64);
        }
        cx.emit_field("y", 1) {||
            cx.emit_u64(v.y as u64);
        }
    }
}
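
Note that the generated function is generic over the serialization context, so hooking it up to an actual output format just means writing an implementation of serialization::ctxt. Here is a minimal sketch of such an implementation (not something the tool generates; I’m just making it up to show the shape) that dumps a textual trace with io::println, where a real context would write some compact binary format:

type debug_ctxt = {indent: uint}; // dummy field; a real context would hold a writer

impl of serialization::ctxt for debug_ctxt {
    fn emit_u64(x: u64) { io::println(#fmt["%u", x as uint]); }
    fn emit_i64(x: i64) { io::println(#fmt["%d", x as int]); }

    fn emit_record(f: fn()) { io::println("record {"); f(); io::println("}"); }
    fn emit_field(f_name: str, f_id: uint, f: fn()) {
        io::println(#fmt["field %s (#%u):", f_name, f_id]);
        f();
    }

    fn emit_enum(e_name: str, f: fn()) { io::println("enum " + e_name); f(); }
    fn emit_variant(v_name: str, v_id: uint, f: fn()) {
        io::println(#fmt["variant %s (#%u)", v_name, v_id]);
        f();
    }
}

Serializing a value is then just a call like serialize1({indent: 0u}, {x: 22u, y: 44u}).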

Now, to deserialize, we generate similar code for a deserialization interface:

fn deserialize1<C: deserialization::ctxt>(cx: C) -> {x: uint, y: uint} {
    cx.read_record {||
        let x = cx.read_field("x", 0u) {||
            cx.read_u64() as uint
        };
        let y = cx.read_field("y", 1u) {||
            cx.read_u64() as uint
        };
        {x: x, y: y}
    }
}

The deserialization interface looks like:

mod deserialization {
    iface ctxt {
        fn read_u64() -> u64;
        fn read_i64() -> i64;
    
        fn read_record<T>(f: fn() -> T) -> T;
        fn read_field<T>(f_name: str, f_id: uint, f: fn() -> T) -> T;

        fn read_enum<T>(f: fn(uint) -> T) -> T;
        
        ...
    }
}

A somewhat more interesting case concerns enums. Let’s consider the enum option<R>, where R is the record type we’ve been working with. It would be serialized as:

type R = {x: uint, y: uint};
fn serialize2<C: serialization::ctxt>(cx: C, &&v: option<R>) {
    cx.emit_enum("std::option::t<R>") {||
        alt v {
            none {
                cx.emit_variant("std::option::none", 0u) {||
                }
            }
            some(r) {
                cx.emit_variant("std::option::some, 1u) {||
                    serialize1(cx, r); // link to the previous code we saw
                }
            }
        }
    }
}

The deserializer, meanwhile, would look like:

fn deserialize2<C: deserialization::ctxt>(cx: C) -> option<R> {
    cx.read_enum {|v_id|
        alt v_id {
            0u { // std::option::none
                std::option::none
            }
            
            1u { // std::option::some
                std::option::some(deserialize1(cx))
            }
            
            _ {
                fail #fmt["Unexpected discriminant %u for option::option",
                    v_id];
            }
        }
    }
}
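
Finally, to give a sense of how these pieces compose, here is roughly what would be generated for a hypothetical record type that embeds the enum, say {id: uint, pos: option<R>}. The generator just calls back into the functions it has already emitted:

fn serialize3<C: serialization::ctxt>(cx: C, &&v: {id: uint, pos: option<R>}) {
    cx.emit_record {||
        cx.emit_field("id", 0u) {||
            cx.emit_u64(v.id as u64);
        }
        cx.emit_field("pos", 1u) {||
            serialize2(cx, v.pos); // reuse the enum serializer from above
        }
    }
}

The deserializer composes the same way: deserialize3 would call deserialize2, which in turn calls deserialize1.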