On types and type schemes

23 April 2012

After my recent dalliance in Matters of a Truly Trivial Nature, I’d like to return to Matters Most Deep and Profound. I’m running up against an interesting question with regions that has to do with the nature of function types like fn(&int): up until now, I’ve assumed that this refers to a function that takes an integer pointer in some region that is specified by the caller. That is, it is a kind of shorthand for a type that might be written like fn<r>(&r.int), where the <r> indicates that the function type is parameterized by the region r.

But first, a digression on types and type schemes…

This notation is analogous to a generic function, like:

fn identity<T>(t: T) -> T { ret t; }

However, there is an important distinction. In Rust, as in ML, parameterization only occurs on named items. So, you can have a named function identity defined generically, but you cannot have a type fn<T>(T) -> T.

This is an interesting and subtle point. In fact, all Rust types are monotypes, meaning types that refer to exactly one thing. Now, it may not be known precisely what that thing is, but there must be a name for it. So, the type of the t parameter is T, which is a type variable. This is a monotype, it can only refer to one thing: the type T. It just so happens, however, that we do not know when typechecking identity what that type T is.

The type of the identity function itself, however, cannot be represented as a monotype. We cannot name a specific type for its parameter and return value, it could safely be used with any type. To accommodate this concept, ML introduced the idea of a type scheme, also called a polytype. (at least by Wikipedia, I’ve never heard the term before. but it seems logical.)

A type scheme is basically a type along with a set of bound type variables. A bound type variable is one that is defined within the scheme itself. So, if you have the type scheme fn<T>(T) -> T, then the variable T is said to be bound in this scheme. In a scheme like fn<T>(T, U) -> T, the variable T is bound, but the variable U is called free, as it is not defined in the scheme. Note that in a monotype all variables are free, as monotypes do not define any type variables.

So, in a way, a like fn<r>(&r.T) is really a monotype, although it does not bind type variables. But it can still refer to many concrete types. This entails complexity.

…and now back to regions.

So, the question is, should region variables be bound or free within a function type? It certainly makes life simpler if they are always free, and it still results in a fairly expressive system.

But first let’s examine why bound regions make life complex. To help keep things clear, I will use the explicit “bound region” notation I introduced earlier, even though it’s not an actual Rust type, and I will eschew anonymous regions. This means that the notation for writing function types and so forth will be a bit heavier than it would be in “real life”.

I will use a few conventions: lowercase letters early in the alphabet like a, b, and c refer to bound regions. Lowercase letters late in the alphabet (r, s) refer to free regions. Plus, I generally drop the types of a region pointer if they are not important, so let &r be shorthand for something like &r.int.

Subtyping of bound regions

Renaming of bound regions. Imagine a type A=fn<a>(&a) and a type B=fn<b>(&b). Is A a subtype of B? Clearly, the answer should be yes: they are basically the same type, as you could rename the bound variable a in A to b, and they would be precisely the same. So here we find that we have to consider possible renamings of bound regions when considering subtyping.

Instantiating bound regions. OK, well, now imagine the type C=fn(&r). Here, the region variable r is not bound but rather free. So in this case, is A a subtype of C? I would argue that the answer should be yes: after all, if you instantiate A with the value a for r, you get fn(&r.T), the same as C. So we ought to consider possible instantiations of bound variables as well.

Coallescing bound regions. Finally, one more example:

fn<a,b>(&a, &b)     <:    fn<c>(&c, &c)

Here the subtype is more flexible than the supertype. The subtype accepts two region pointers in any two regions, but the supertype requires that they be in the same region.

…with type variables, too

Now, just to make things more fun, imagine we throw in a type variable X into the mix. Here we play the role of the inference engine, which is trying to find a value for X. So the question becomes, is there any type that I can assign to X which would make the subtype relation true?

Referring to free regions. Let’s start with a simple example. Consider:

fn(&r)    <:    fn(X)

In this case, r is free, so we can assign X the value of &r and everything should be fine.

Bound regions. Ok, what if the subtype refers to a bound region?

fn<a>(&a)    <:    fn(X)

We can still handle this case, but it requires a region variable as well. In other words, if we create a region variable R, then we can substitute that region variable for a and obtain:

fn(&R) <: fn(X)

Now we can assign X to &R and then assume that the inference engine will find a suitable region for R. This will be based on constraints from the rest of the program.

Multiple parameters. Of course, if there are multiple parameters, there may be interactions between them:

fn<a>(&a, &a)    <:    fn(X, &r)

In this case, because r appears free in the supertype, X can be assigned &r. That would mean that a can be instantiated with r and the subtyping relation holds.

Bound regions within the supertype. What if the region in the supertype is bound, not free?

fn<a>(&a, &a)    <:    fn<c>(X, &r)

In this case, there is no value of X which is suitable. This is perhaps not obvious: you might think that &c would be a fine value for X. But that means that the value of X would refer to the region c, which is bound within the type. It’s a scoping violation. The name c has no meaning outside of the supertype, whereas the type X (which appears free) does have meaning. So X cannot refer to regions bound within the supertype.

Woah.

Yeah, it’s complex. I haven’t come up with an elegant implementation for the inference engine that accommodates all of these scenarios. One option is to not handle all of these cases. I also ought to read up in The Literature as well as the implementations of other languages (e.g., Haskell, Scala) to see what they do in similar scenarios. Still, I dislike the idea of having things in our type system that require citations to explain.

So, can we just drop bound type variables in function types?

Actually, I think we definitely could (not yet sure if we should). Most things I’ve thought of will “just work”, and when they won’t, there is workaround via interfaces (more on that later).

First, an example of something that works:

fn iter(v: [T], f: fn(&T)) { 
    uint::range(0, v.len()) { |i|
        f(&v[i]);
    }
}

This function iterates over each item in the slice v and invokes the function f (I am assuming Graydon’s work on slices and vectors is complete). If we fully expand this type to see all the regions involved, you end up with:

fn iter(v: [T]/&a, f: fn/&a(&a.T)) { ... }

This signature is probably a bit confusing. As usual, I find the best way to think about regions is as lifetimes (in fact, I am considering changing my terminology over to use the word lifetime exclusively). So what this notation means is that there is some span of time a in which the vector data and the function closure is valid. The function itself expects a pointer which is also valid for this same span of time (in this case, that pointer will be a pointer into the vector contents, so its lifetime comes from there). This span of time a will generally be the call to iter() itself.

What doesn’t work?

Basically, what doesn’t work is when you want to have a function whose arguments can have lifetimes whose lifetime is not yet known. This most commonly occurs when functions are stored into records. One example that comes to mind is the hash and eq functions that we use to implement hashtables right now.

Currently, our hashtables are defined with a structure something like:

type hash<K,V> = {
    hashfn: fn(&K) -> uint,
    eqfn: fn(&K, &K) -> bool,
    ...
};

Here you see that the hashfn takes a pointer to the key K and returns the hash (a uint). The eqfn takes two keys and returns a boolean if they are equal.

The key point here is that the lifetimes for the key arguments are not known and cannot be known in advance. The data for the hashtable is stored in a structure on the heap and so its lifetime is not stack-based and hence has no region; for any given hashtable operation, the current array will be borrowed and thus tied to the stack of that particular operation, but these future operations cannot be given a name.

Did you say something about a workaround?

Yes, we can use ifaces to work around this problem. Imagine an iface:

iface hash_key_ops<K> {
    fn hash(k: &K) -> uint;
    fn eq(k1: &K, k2: &K) -> bool;
}

I am mildly abusing ifaces here because the “self” would not be a particular key but rather some singleton object representing the hash function itself. For example, I might define:

enum murmur_hash { murmur_hash };
impl of hash_key_ops<str> for murmur_hash {
    fn hash(k: &str) -> uint { ... }
    fn eq(k1: &str, k2: &str) -> bool { k1 == k2 }
}

Now we could define the hashtable like:

type hash<K,V> = {
    ops: hash_key_ops/@<K>,
    ...
};

fn new_hash<K,V>(ops: hash_key_ops/@<K>) { ... }

Now whenever we want to hash a key we can invoke tbl.ops.hash(key). The key point is that the named functions in an iface, just like function items, can have polytypes even though normal function types are monotypes. Then each time we invoke hash() we would instantiate the bound regions with fresh region variables.

Of course, if we were going to use ifaces with hashtables, we might rather define the iface over the key type itself. That raises some interesting issues about instance coherence which I plan to discuss in a blog post Real Soon Now, but if you’re curious about that you may also want to read my mailing list post on the topic as it is still my preferred solution.