Modeling graphs in Rust using vector indices

6 April 2015

After reading nrc’s blog post about graphs, I felt inspired to write up an alternative way to code graphs in Rust, based on vectors and indicates. This encoding has certain advantages over using Rc and RefCell; in particular, I think it’s a closer fit to Rust’s ownership model. (Of course, it has disadvantages too.)

I’m going to describe a simplified version of the strategy that rustc uses internally. The actual code in Rustc is written in a somewhat dated “Rust dialect”. I’ve also put the sources to this blog post in their own GitHub repository. At some point, presumably when I come up with a snazzy name, I’ll probably put an extended version of this library up on crates.io. Anyway, the code I cover in this blog post is pared down to the bare essentials, and so it doesn’t support (e.g.) enumerating incoming edges to a node, or attach arbitrary data to nodes/edges, etc. It would be easy to extend it to support that sort of thing, however.

The high-level idea

The high-level idea is that we will represent a “pointer” to a node or edge using an index. A graph consists of a vector of nodes and a vector of edges (much like the mathematical description G=(V,E) that you often see):

pub struct Graph {
    nodes: Vec<NodeData>,
    edges: Vec<EdgeData>,
}

Each node is identified by an index. In this version, indices are just plain usize values. In the real code, I prefer a struct wrapper just to give a bit more type safety.

pub type NodeIndex = usize;

pub struct NodeData {
    first_outgoing_edge: Option<EdgeIndex>,
}

Each node just contains an optional edge index, which is the start of a linked list of outgoing edges. Each edge is described by the following structure:

pub type EdgeIndex = usize;

pub struct EdgeData {
    target: NodeIndex,
    next_outgoing_edge: Option<EdgeIndex>
}

As you can see, an edge contains a target node index and an optional index for the next outgoing edge. All edges in a particular linked list share the same source, which is implicit. Thus there is a linked list of outgoing edges for each node that begins in the node data for the source and is threaded through each of the edge datas.

The entire structure is shown in this diagram, which depicts a simple example graph and the various data structures. Node indices are indicated by a number like N3 and edge indices by a number like E2. The fields of each NodeData and EdgeData are shown.

Graph:
    N0 ---E0---> N1 ---E1---> 2
    |                         ^
    E2                        |
    |                         |
    v                         |
    N3 ----------E3-----------+
    
Nodes (NodeData):
  N0 { Some(E0) }     
  N1 { Some(E1) }
  N2 { None     } 
  N3 { Some(E2) } 
  
Edges:
  E0 { N1, Some(E2) }
  E1 { N2, None     }
  E2 { N3, None     }
  E3 { N2, None     }

Growing the graph

Writing methods to grow the graph is pretty straightforward. For example, here is the routine to add a new node:

impl Graph {
    pub fn add_node(&mut self) -> NodeIndex {
        let index = self.nodes.len();
        self.nodes.push(NodeData { first_outgoing_edge: None });
        index
    }
}

This routine will add an edge between two nodes (for simplicity, we don’t bother to check for duplicates):

impl Graph {
    pub fn add_edge(&mut self, source: NodeIndex, target: NodeIndex) {
        let edge_index = self.edges.len();
        let node_data = &mut self.nodes[source];
        self.edges.push(EdgeData {
            target: target,
            next_outgoing_edge: node_data.first_outgoing_edge
        });
        node_data.first_outgoing_edge = index;
    }
}

Finally, we can write an iterator to enumerate the successors of a given node, which just walks down the linked list:

impl Graph {
    pub fn successors(&self, source: NodeIndex) -> Successors {
        let first_outgoing_edge = self.nodes[source].first_outgoing_edge;
        Successors { graph: self, current_edge_index: first_outgoing_edge }
    }
}

pub struct Successors<'graph> {
    graph: &'graph Graph,
    current_edge_index: Option<EdgeIndex>,
}

impl<'graph> Iterator for Successors<'graph> {
    type Item = NodeIndex;
    
    fn next(&mut self) -> Option<NodeIndex> {
        match self.current_edge_index {
            None => None,
            Some(edge_num) => {
                let edge = &self.graph.edges[edge_num];
                self.current_edge_index = edge.next_outgoing_edge;
                Some(edge.target)
            }
        }
    }
}

Advantages

This approach plays very well to Rust’s strengths. This is because, unlike an Rc pointer, an index alone is not enough to mutate the graph: you must use one of the &mut self methods in the graph. This means that can track the mutability of the graph as a whole in the same way that it tracks the mutability of any other data structure.

As a consequence, graphs implemented this way can easily be sent between threads and used in data-parallel code (any graph shared across multiple threads will be temporarily frozen while the threads are active). Similarly, you are statically prevented from modifying the graph while iterating over it, which is often desirable. If we were to use Rc nodes with RefCell, this would not be possible – we’d need locks, which feels like overkill.

Another advantage of this apprach over the Rc approach is efficiency: the overall data structure is very compact. There is no need for a separate allocation for every node, for example (since they are just pushes into a vector, additions to the graph are O(1), amortized). In fact, many C libaries that manipulate graphs also use indices, for this very reason.

Disadvantages

The primary disadvantage comes about if you try to remove things from the graph. The problem then is that you must make a choice: either you reuse the node/edge indices, perhaps by keeping a free list, or else you leave a placeholder. The former approach leaves you vulnerable to “dangling indices”, and the latter is a kind of leak. This is basically exactly analogous to malloc/free. Another similar problem arises if you use the index from one graph with another graph (you can mitigate that with fancy type tricks, but in my experience it’s not really worth the trouble).

However, there are some important qualifiers here:

It frequently happens that you don’t have to remove nodes or edges from the graph. Often you just want to build up a graph and use it for some analysis and then throw it away. In this case the danger is much, much less.
The danger of a “dangling index” is much less than a traditional dangling pointer. For example, it can’t cause memory unsafety.

Basically I find that this is a theoretical problem but for many use cases, it’s not a practical one.

The big exception would be if a long-lived graph is the heart of your application. In that case, I’d probably go with a Rc (or maybe Arc) based approach, or perhaps even a hybrid – that is, use indices as I’ve shown here, but reference count the indices too. This would preserve the data-parallel advantages.

Conclusion

The key insights in this approach are:

indices are often a compact and convenient way to represent complex data structures;
they play well with multithreaded code and with ownership;
but they also carry some risks, particularly for long-lived data structures, where there is an increased change of indices being misused between data structures or leaked.