Weave Performance


Last night, I decided to spend some time working on Weave’s performance. It was… fine for a toy project, but I was a little disappointed in how slow my interpreter ran. Fast enough to show promise, but too slow to use for real work, basically.

Sitting down with Claude, I began refactoring portions of the VM.

  • replaced name-based lookups of global variables with slots - compiler-assigned integer indices - for O(1) lookup
  • optimized the Container type with internal hashmaps for key-value lookups*
  • micro-optimized the function call semantics to reduce / remove memory cloning operations
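The first of those - slot-based globals - can be sketched roughly like this in Rust, the VM’s implementation language. This is a hypothetical Compiler/Vm pair for illustration, not Weave’s actual code: the compiler resolves each global name to an integer slot once at compile time, so the VM indexes a Vec at runtime instead of hashing a name on every access.

```rust
use std::collections::HashMap;

// Hypothetical compile-time side: name -> slot resolution happens once.
struct Compiler {
    slots: HashMap<String, usize>,
}

impl Compiler {
    fn slot_for(&mut self, name: &str) -> usize {
        let next = self.slots.len();
        // Same name always resolves to the same slot.
        *self.slots.entry(name.to_string()).or_insert(next)
    }
}

// Hypothetical runtime side: globals are a plain Vec, indexed by slot.
struct Vm {
    globals: Vec<i64>,
}

fn main() {
    let mut c = Compiler { slots: HashMap::new() };
    let x = c.slot_for("x");
    let y = c.slot_for("y");
    assert_eq!(c.slot_for("x"), x); // re-resolving a name is stable

    let mut vm = Vm { globals: vec![0; c.slots.len()] };
    vm.globals[x] = 41; // O(1) indexed store, no hashing at runtime
    vm.globals[y] = 1;
    assert_eq!(vm.globals[x] + vm.globals[y], 42);
}
```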

Everything helped… a bit. I got to the point that a recursive fibonacci(20) calculation took 0.6s instead of 0.8s. A ~25% improvement in call time overall… but it barely made a dent in the wall-clock time - and almost no change at all in my “real-world” benchmark manipulating a large (2M row) CSV.

# Helper lambdas for reading CSV and writing JSON files
as_csv = ^(f) { read(f, :csv) }
write_json = ^(f) { ^(data) { write(f, data, :json) } }

# A helper lambda for filter, to find customers with my name
is_a_lucas = ^(row) { row["First Name"] == "Lucas" || row["Last Name"] == "Lucas" }

# A filter function - normally defined in an external module, but for this example, replicated here
fn filter(cond) {
  ^(data) {
    out = []
    data *> { if cond(row) { out << row } }
    out
  }
}

# Read a CSV file, extract all the folks named "Lucas" and save it in JSON format:
"customers.csv" |> as_csv  # read in the CSV file
    |> filter(is_a_lucas)  # Take all the customers with my name
    |> ^(filtered_customers) { [lucases: filtered_customers] } # JSON needs a top-level key, so we'll add one
    |> write_json("lucases.json")  # Save the json file

Processing the full 2M-row file took over a full minute of clock time, so I had to limit my benchmark to just 100K rows, which started at 1.25s and had improved to a mere 1.12s.

And then something weird happened - my CSV benchmark segfaulted. Somewhere in the memory-usage optimizations, I’d accidentally created a dangling reference. (Rust protects you from this unless you use unsafe… as you might when you decide to use raw pointers to avoid copying VM internals around… Oops.) As it turned out, though, the segfault was triggered in some old debugging code from the very beginning of the project - nothing I needed now.

It was just a block in a debug log statement that ran inside the core loop of the VM…

In fact, it ran just before executing instructions. Each and every one.

A debug block which decompiled the current stack frame. And the rest of the call stack. And the next instruction for good measure - then converted them all to a String for the logs.

In my defense, it had been very helpful in the beginning when I was making sure the bytecode for loops was inserting jump targets correctly.
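The lesson generalizes beyond my VM: the cost wasn’t the logging itself but eagerly building the debug String before every single instruction, whether or not anyone was reading it. A minimal Rust sketch of the shape of the bug (hypothetical - not the VM’s actual dispatch loop):

```rust
// Hypothetical interpreter loop illustrating eager-formatting cost.
fn run(instructions: &[u32], debug: bool) -> u64 {
    let mut acc = 0u64;
    for &ins in instructions {
        // The bug's shape: building the dump unconditionally pays the full
        // formatting + allocation cost even when the line is never logged:
        //   let dump = format!("frame dump... next instruction: {ins}");
        //
        // The fix: only pay when debugging is actually on.
        if debug {
            eprintln!("next instruction: {ins}");
        }
        acc = acc.wrapping_add(ins as u64); // stand-in for executing `ins`
    }
    acc
}

fn main() {
    assert_eq!(run(&[1, 2, 3, 4], false), 10);
}
```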

I deleted the decompilation code and recompiled.

I reran my benchmarks. “That can’t be right,” I said. Most of my benchmarks now completed so fast that time couldn’t resolve their sub-millisecond runtimes. I had Claude validate the benchmarks too.

“That can't be right,” it said - “the test took 0.00s vs the expected 0.67s.”

A few more test cases later, the improvement was confirmed - over 100x faster.

I stared at my laptop and the new results. I started laughing in hysterical shock, holding my face in my hands. I’d spent HOURS on optimization only to get a pathetic 0.2s improvement.

…and all I’d ever had to do was delete two damn lines of debug logging.

Remember to clean up after yourself, kids. Those debug strings aren’t free. Especially. ESPECIALLY in the hot loop of an interpreter.

What’s the performance like now?

Much better, thanks! For instance, the recursive fibonacci(20) is among the sub-1ms benchmarks.

I need to create a proper benchmark suite, and try implementing some standard benchmarks in Weave itself… but with the hour growing late, I decided to try implementing my 2M-row CSV -> filter -> JSON benchmark in Weave, Python and Ruby, to get a very rough idea of where Weave stood now.

CSV to JSON Benchmark

(Lower is better)

Ruby      Python   Weave
148.43s   6.50s    5.37s

Whoa. In this single arena, this one data point, Weave actually outperformed Python - admittedly only by a hair and with a large number of caveats, but still - that’s incredibly encouraging!

To be explicitly clear: I do not believe this result means that Weave is anywhere near Python or Ruby in terms of real-world performance or usability. There are so many reasons that Weave is not ready for prime time. There’s no debugger, error messages are often useless, runtime issues have no stack trace (you do at least get the single line where the error occurred), not to mention I’m real nervous about my direct pointer usage after hitting that segfault… maybe I didn’t track the memory access as well as I’d thought I did.

All of those things are required for a real language, and most of them will cost performance, at least a little. Building stack traces for exception handling, for instance, is often one of the more expensive parts of an interpreter. It’s why exceptions are considered “expensive” in many languages - and why it’s notable that Python’s exception handling is fast enough to use for flow control.

I haven’t provided the exact Weave + Python + Ruby code I used for these results because it doesn’t matter. My claim is not that Weave is actually on par with Python in any way.

Just that - in this one case, in this very specific race, my little Weave interpreter has shown that it’s as fast as Python. Which means it’s got the potential to be fast enough for real work.

WTF is up with Ruby?

Turns out that Ruby’s stdlib implementation of CSV parsing is notoriously slow. I had no idea when I conceived this competition. I have since found that there are gems out there for fast CSV parsing in Ruby - FasterCSV or RubyZSV, for instance.

While I didn’t bother testing them, ZSV claims an approximate 5x improvement on large files. Assuming that holds (and that CSV parsing completely dominated the Ruby runtime), Ruby with ZSV would still have been (~150s / 5 == ~30s) significantly slower than Python or Weave.

Regardless… it happens that CSV parsing is a soft underbelly for Ruby and I just happened to hit it in this benchmark.

* On Containers and internal Hashmaps

Container keys were originally implemented as a list of Pairs parallel to the list of Values within the Container. When indexing into a Container, the VM linearly searches the Key list for the matching Key string (or symbol) and then, taking the index from its Pair, retrieves the Value at the index.

The thought was that replacing the List of (Key, Index) Pairs in the Container implementation with an actual Map of Key -> Index would be faster, given the O(1) lookup… But in practice, this actually slowed the VM!

The reason was that the cost of building the HashMaps - one per Container, so one per row of a 2M row CSV - outweighed the savings during lookup! Given the relatively small number of keys in my test data (12), the O(1) lookups never paid back the O(n) cost of constructing the HashMaps in the first place.

I ended up keeping the optimization, but it only kicks in after 16 keys. Why 16? Eh, it just sort of felt right. I’m adding more benchmarks and test cases to find the right balance. Given the impossibility of controlling string key lengths (I mean, I could, but I’m not going to), and that string interning would give better results for Key use cases anyway, I’m not going to invest too much time here until it proves itself to be a bottleneck in the future.
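For the curious, the hybrid behaves something like this sketch - a hypothetical KeyIndex type in Rust, not the actual Container internals. Small containers keep the parallel (Key, Index) Pair list and do a linear scan; once the key count crosses the threshold (16, the same guess described above), the index is promoted to a HashMap.

```rust
use std::collections::HashMap;

const MAP_THRESHOLD: usize = 16; // a guess, not a measured optimum

// Hypothetical hybrid key index for a Container.
enum KeyIndex {
    Small(Vec<(String, usize)>),   // parallel (key, value-index) pairs
    Large(HashMap<String, usize>), // built only once the container grows
}

impl KeyIndex {
    fn insert(&mut self, key: String, idx: usize) {
        if let KeyIndex::Small(pairs) = self {
            if pairs.len() < MAP_THRESHOLD {
                pairs.push((key, idx));
                return;
            }
            // Crossed the threshold: promote the pair list to a HashMap.
            let promoted: HashMap<String, usize> = pairs.drain(..).collect();
            *self = KeyIndex::Large(promoted);
        }
        if let KeyIndex::Large(map) = self {
            map.insert(key, idx);
        }
    }

    fn get(&self, key: &str) -> Option<usize> {
        match self {
            // Linear scan: cheap for a handful of keys, no map to build.
            KeyIndex::Small(pairs) => pairs
                .iter()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, i)| *i),
            // O(1) hashed lookup once the container is large.
            KeyIndex::Large(map) => map.get(key).copied(),
        }
    }
}

fn main() {
    let mut ix = KeyIndex::Small(Vec::new());
    for i in 0..20 {
        ix.insert(format!("k{i}"), i);
    }
    assert_eq!(ix.get("k3"), Some(3));
    assert_eq!(ix.get("k19"), Some(19)); // still found after promotion
    assert!(matches!(ix, KeyIndex::Large(_))); // 20 keys > 16 threshold
}
```

Lookups behave identically on either side of the threshold; only the cost profile changes, which is what makes the crossover point tunable later.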