I recently re-read my original post about Weave - the data-pipeline language I’ve been developing in my spare time.
That post ended with the words “Weave is nearly fully implemented - and I’m nearly through the book! Another few weeks and I believe I’ll have Weave ready[…]”
Oh the Hubris of my youth.
In reality, it has taken me another seven months since those words were typed before I could consider Weave to be even remotely “ready” for “anything”.
What Happened
Well, firstly - it turns out Closures are a damn sight harder to implement in Rust than in C.
See, I’ve been following along with the wonderful resource Crafting Interpreters, which implements an interpreter in C (also Java, but that’s before the training wheels come off). The language in the book, “Lox”, supports closures, and the C interpreter in the book uses an elegant implementation called Upvalues, which in turn came from Lua.
An Upval uses some very clean, fast double-pointers to keep track of values on the call stack. When a value referenced by an Upval would be popped off the call stack, it just gets copied to the heap and the Upval’s internal pointer gets updated. And all the places that hold a pointer to the Upval itself continue working transparently.
Beautiful!
But in C, we can do wild things like that with pointers. We can point them at our call stack. Or change them willy-nilly, and we’re just trusted to know what we’re doing. And in fairness, sometimes we do! But Rust doesn’t play that way.
In which Rust Says “No”.
Rust has a different approach to memory management than C (or Java or Python or any other mainstream language). It has no garbage collector like Java/Python/etc.; instead, memory lifetimes are tied to scope: when a variable goes out of scope, its data is freed.
Compared to C (well, pointers specifically), that makes it very easy to reason about when some bit of memory will be freed!
But wait, we don’t only use data inside the same function that created it. We programmers love to pass bits all over the place to different functions, oftentimes slicing and dicing as we go. If a function has a variable holding some bit of data that was passed in, what happens in Rust when that variable goes out of scope?
The answer is the core of the difference in Rust’s approach to memory management.
Every piece of data has a single “owner”. That owning variable is the one that triggers ‘free’ when it goes out of scope. The owner also has write access to that memory (if it’s not a constant) - but it can hand out read-only access at will. This is called a “borrow” and is essentially a read-only pointer to the borrowed data. You can also move ownership to a new variable or function, which immediately invalidates the original variable.
The Rust compiler has a borrow-checker whose job it is to validate that these constraints are always met - one owner, optionally with multiple borrowers.
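Here’s a toy example of those rules in action - nothing Weave-specific, just plain Rust:

```rust
fn main() {
    let owner = String::from("weave"); // `owner` owns this heap-allocated String
    let borrow = &owner;               // a read-only borrow; `owner` still owns the data
    println!("{}", borrow);            // fine: both the owner and the borrow are alive

    let new_owner = owner;             // ownership moves; `owner` is invalidated
    // println!("{}", owner);          // compile error: use of a moved value
    println!("{}", new_owner);
} // `new_owner` goes out of scope here, and the String is freed
```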
This is where we collide with our C-implementation of Upvalues.
In C, we have a double pointer:
upval_ptr -> value_ptr -> [memory]
We pass around copies of upval_ptr to anyone who needs access to what’s in [memory]. Remember, initially, [memory] is a value on the call stack - a variable inside a function scope that will be lost when we return from the function. When we exit the function, we check to see if anybody’s hanging onto an upval_ptr to [memory] (the mechanism of that tracking is out of scope for this post); if so, we copy [memory] to the heap and update value_ptr to point there instead. Without any further changes, any of the other functions holding a copy of the upval_ptr will dereference to the heap location should they need to.
But in Rust, we have a problem.
Who owns the upval_ptr whose value_ptr needs to be changed out from under everyone who is holding a reference to it?
In C, those upval_ptrs are just more values in the call stack. In Rust, there’s no equivalent way to pass those around. If we pick one of the internal Upvalue objects to be the owner - well, there’s no guarantee we can pick the right one at compile time. We can pass closures around, after all - the topmost parent in a stack could be outlived by a closure created within one of its children and returned upward - or perhaps not! So none of our Upval objects can have ownership of the closed-over variable. Which means none of them can change an internal “value_ptr” equivalent.
The upvalue mechanism was simply incompatible with Rust semantics.
What about unsafe?
Rust provides a safety hatch - a mechanism by which we can bypass the protections and guarantees provided by the borrow system and manipulate pointers ourselves. As implied above, this uses a keyword: unsafe.
Given a block in which to work unsafely, you have free access to the internals of the memory system, manipulating pointers and doing whatever you’d like! I could absolutely have used unsafe here and honestly, perhaps that’s even the right thing to do. The Upval mechanism here is clean, fast, and pretty damn safe.
But - I didn’t want to. :D I decided to keep plugging away at things and try to find a Rust-safe approach to this.
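For the curious, here’s a rough sketch of the kind of thing unsafe permits - a raw-pointer Upvalue in the clox/Lua style. The names and types here are made up for illustration; this is not Weave’s actual code:

```rust
// A hypothetical raw-pointer Upvalue, sketched to show what `unsafe` allows.
struct Upvalue {
    location: *mut f64, // points at a slot on the VM stack while the variable is live
    closed: f64,        // the value is copied here when the stack slot is about to die
}

impl Upvalue {
    // Called when the referenced stack slot is about to be popped.
    fn close(&mut self) {
        unsafe {
            self.closed = *self.location; // copy the stack value into our own storage
        }
        self.location = &mut self.closed; // repoint at that storage
        // Note the foot-gun: if this Upvalue ever moves, `location` now dangles.
    }

    fn get(&self) -> f64 {
        unsafe { *self.location }
    }
}

fn main() {
    // Simulate a stack slot and an upvalue pointing into it.
    let mut stack_slot: f64 = 1.5;
    let mut upval = Upvalue { location: &mut stack_slot, closed: 0.0 };
    assert_eq!(upval.get(), 1.5);

    upval.close();                // the slot is about to "die" - hoist its value
    assert_eq!(upval.get(), 1.5); // still readable, now from `closed`
}
```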
Time Passes…
I spent a great deal of time thinking about and exploring alternative approaches to Weave’s memory system.
Complicating matters, I didn’t want to:
- a) force Weave programmers to deal with manual memory management
- b) implement a Garbage Collector
I wanted to piggyback off of Rust’s ownership semantics to avoid the whole mess.
Narrator: He didn’t.
So anyway, I implemented an Arena memory manager for Heap allocations.
Into the Arena
Arenas are miniature heaps that you tie to a function’s scope - conceptually, a mini heap allocated on the stack. When the function exits, everything in that arena pops off with it. (Welcome to the Matrix: there is no stack. There’s just memory.) Arenas can be used as a stand-in for heap allocations by just… allocating a big chunk and doing with it what you please. For instance, in a VM implementation, one could create an Arena for holding Upvalue slots…
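To make that concrete, here’s a minimal sketch of an index-based arena - purely illustrative, with made-up names rather than Weave’s actual types:

```rust
// A bare-bones arena: values go in, handles (indices) come out. Nothing is ever
// freed individually - the whole arena drops at once when it goes out of scope.
struct Arena<T> {
    slots: Vec<T>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new() }
    }

    // "Allocate" by pushing the value and handing back a plain index instead of a
    // pointer - indices can be copied freely, which keeps the borrow checker happy.
    fn alloc(&mut self, value: T) -> usize {
        self.slots.push(value);
        self.slots.len() - 1
    }

    fn get(&self, handle: usize) -> &T {
        &self.slots[handle]
    }
}

fn main() {
    let mut upvalues: Arena<f64> = Arena::new();
    let handle = upvalues.alloc(42.0);    // an "upvalue slot"
    println!("{}", upvalues.get(handle)); // dereference via the handle
} // the whole arena, and everything in it, is dropped here
```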
…And out Again
And now you have a low-performance memory allocation system that copies and writes but never frees, and your VM OOMs within a few seconds of processing a recursive function.
Now, you walk away from the project for a few months and idly think about how else you might avoid writing a Garbage Collector.
YATHITEW (Yet Again, the Hard Way is the Easy Way)
In the end, I gave in. Rather than twisting the VM into a pretzel trying to map the internal memory model (Rust’s) onto the external system (Weave’s), I decided to continue on with a mark and sweep garbage collector.
Sweeping Up
A Mark and Sweep GC is one of the simplest forms of Garbage Collection. At each collection event, every element of the heap starts out “unmarked”. Then each variable in the VM is examined, and every bit of memory pointed to by a live variable is marked. Then every bit of memory pointed to by our “marked” set is marked in turn - and so on, until we’ve traced through every reachable bit of memory in the current state.
Once we’ve finished tracing, any memory still unmarked can be deleted - swept away.
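In sketch form - again illustrative, with made-up names rather than Weave’s real types - the two phases look something like this:

```rust
// A toy mark-and-sweep over a handle-indexed heap.
struct HeapObject {
    marked: bool,
    children: Vec<usize>, // handles of other heap objects this one references
}

struct Heap {
    objects: Vec<Option<HeapObject>>, // None = an already-swept slot
}

impl Heap {
    // Mark phase: start from the roots (the VM's live variables) and trace
    // through everything reachable from them.
    fn mark(&mut self, roots: &[usize]) {
        let mut worklist: Vec<usize> = roots.to_vec();
        while let Some(handle) = worklist.pop() {
            if let Some(obj) = self.objects[handle].as_mut() {
                if !obj.marked {
                    obj.marked = true;
                    worklist.extend(obj.children.iter().copied());
                }
            }
        }
    }

    // Sweep phase: anything still unmarked is unreachable, so free it.
    // Clear the marks on survivors so the next collection starts fresh.
    fn sweep(&mut self) {
        for slot in &mut self.objects {
            match slot {
                Some(obj) if obj.marked => obj.marked = false,
                _ => *slot = None,
            }
        }
    }
}

fn main() {
    // Object #0 is a root and references #1; object #2 is unreachable.
    let mut heap = Heap {
        objects: vec![
            Some(HeapObject { marked: false, children: vec![1] }),
            Some(HeapObject { marked: false, children: vec![] }),
            Some(HeapObject { marked: false, children: vec![] }),
        ],
    };
    heap.mark(&[0]);
    heap.sweep();
    assert!(heap.objects[0].is_some()); // reachable: survives
    assert!(heap.objects[2].is_none()); // unreachable: swept away
}
```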
Current state
My VM now has a manually-managed heap (and a separate memory region for the Stack). Heap-allocated things like closures and Containers (Weave’s compound data type) are now safely created and then collected by the garbage collector once they’re no longer used.
In spite of my reluctance, with memory management finally accomplished, the way was clear to finish the damn thing.
In a trice, I added the read/write and data pipeline operators which were the original raison d’être for Weave and suddenly… it was a real, live language.