Performance · Optimization · Engineering
I make slow software fast.
Performance work across the stack: algorithms, application code, build
pipelines, infrastructure. I profile what's actually slow and ship the
fix on the codebase you already have. As one worked example, this site walks the same haversine kernel from a naïve pandas `.apply` all the way down to a Zig + AVX-512 implementation sustaining 150 GB/s, with the maintenance cost of each rung made explicit.
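The first two rungs of that ladder can be sketched in a few lines. This is illustrative code, not the site's exact kernel: the same haversine formula evaluated row-at-a-time through pandas `.apply`, then once over whole NumPy columns.

```python
import numpy as np
import pandas as pd

R = 6371.0  # mean Earth radius, km

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; works on scalars or whole arrays."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp = np.radians(lat2 - lat1)
    dl = np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    "lat1": [55.676], "lon1": [12.568],   # Copenhagen
    "lat2": [59.913], "lon2": [10.752],   # Oslo
})

# V0: pandas .apply — one Python-level function call per row.
slow = df.apply(lambda r: haversine(r.lat1, r.lon1, r.lat2, r.lon2), axis=1)

# V1: the identical formula on whole NumPy columns — one vectorised pass.
fast = haversine(df.lat1.to_numpy(), df.lon1.to_numpy(),
                 df.lat2.to_numpy(), df.lon2.to_numpy())

assert np.allclose(slow.to_numpy(), fast)
```

Same maths, same answers; only the dispatch cost per row changes, which is where the first orders of magnitude come from.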
What I've built
Real systems with real numbers. Click any card for the writeup.
Engineer on the optimisation arc
Provstiskyen: performance work on a 10-year SaaS
50s → 18s
App startup
Profiled and fixed the cold-start path on a 44,000-line R Shiny production app: 50-second logins down to 18 seconds, and 35-minute deploys down to 80 seconds, all on the existing codebase. The full rewrite that came later was made possible by a year of targeted optimisation work first.
Backtesting engine and scanner
Thoth
<2 ms
per-ticker backtest
Hot-loop discipline at the small scale: a 13-strategy backtest of the US equities universe finishes in seconds. Pure Polars expressions, threaded bulk runner, regime-gated strategies. The kind of code-level performance work I bring to bigger systems.
Anonymised, Danish specialty retailer
Inventory simulation arena
9 invariants
hot-loop locks
Polars-based simulation engine where seven candidate inventory strategies compete on years of real demand data. Nine source-level invariants guard the hot loop from accidental refactor regressions: the optimisation contract is written into the test suite.
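A source-level invariant can be as blunt as a test that reads the hot-loop module and fails if a refactor reintroduces a known-slow construct. A minimal sketch, with hypothetical file names and banned patterns (not the arena's actual nine invariants):

```python
import re
from pathlib import Path

# Patterns that must never appear in the hot loop, each with a reason
# a reviewer can act on. Purely illustrative choices.
BANNED = {
    r"\.apply\(": "Python-level row loop",
    r"iterrows\(": "row-wise iteration",
    r"to_pandas\(": "round-trip out of Polars inside the hot loop",
}

def check_hot_loop(source: str) -> list[str]:
    """Return one human-readable violation per banned pattern found."""
    return [
        f"{reason}: pattern {pattern!r} found"
        for pattern, reason in BANNED.items()
        if re.search(pattern, source)
    ]

# In the real suite this would read the engine module, e.g.:
#   violations = check_hot_loop(Path("engine/hot_loop.py").read_text())
assert check_hot_loop("out = df.group_by('sku').agg(pl.col('qty').sum())") == []
assert check_hot_loop("df.apply(step, axis=1)") != []
```

The point is that the optimisation contract fails CI loudly instead of degrading silently in a benchmark nobody reruns.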
This site
Tachyon
9,100 → 0.29 ns/pair
Python V0 → Zig V7
The same haversine kernel walked from a naïve pandas `.apply` through C++, Rust, Zig SIMD, and finally an analyzer-driven V7 in Zig that reads its own compiled assembly to land at 150 GB/s, plus a WebGPU compute lab in the browser. End-to-end demo of the optimisation work I do for clients.
Horus / Neper / Maat
Home GitOps cluster
4 nodes
ARM64 GitOps cluster
Bare-metal Kubernetes on 4× Raspberry Pi 4 with Flux, Cilium, Tailscale, an in-cluster Zot registry, and MinIO. The infrastructure layer of the optimisation work; same patterns I apply to bigger clusters at work.
How I work
- Measure first. The diagnostic usually delivers more value than the fix, because most teams never had numbers to argue from.
- Cheap fix before expensive rewrite. A targeted profile and a 100-line change ship in days. A rewrite takes quarters and might not converge. Most slow code has a cheap fix waiting in the existing codebase, and the discipline is finding it.
- Defensible methods. Numbers come from real percentile distributions, not vibes; every choice of tool gets justified on paper before it goes anywhere near production.
- Phased rollout with rollback. Shadow mode first, then a single canary, then staged batches. Every step has a defined revert path before it ships.
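"Real percentile distributions, not vibes" reduces to something this small in practice. A sketch with synthetic latency samples (the lognormal shape and seed are illustrative): report p50/p95/p99 rather than a mean, because the mean hides the tail users actually feel.

```python
import random
import statistics

# Synthetic latency samples, ms. Lognormal is a common rough shape for
# service latencies: a tight body with a long right tail.
random.seed(7)
samples_ms = [random.lognormvariate(3.0, 0.6) for _ in range(10_000)]

qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points: qs[k] ≈ p(k+1)
p50, p95, p99 = qs[49], qs[94], qs[98]
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")

assert p50 < p95 < p99  # the tail, not the mean, is the argument
```

Before-and-after comparisons then happen percentile by percentile, which is what makes a claimed improvement defensible.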
Now
What's next
Day-job performance-tuning work is the bulk of the week. Outside of that, I keep pushing on Provstiskyen's Analyse module: the last piece before the legacy R Shiny app can finally retire.
If something you run is slower than it should be, get in touch.