Performance · Optimization · Engineering

I make slow software fast.

Performance work across the stack: algorithms, application code, build pipelines, infrastructure. I profile what's actually slow and ship the fix on the codebase you already have. As one worked example, this site walks the same haversine kernel from a naïve pandas .apply all the way down to a Zig + AVX-512 implementation sustaining 150 GB/s, with the maintenance cost of each rung made explicit.

Read the flagship case → · See all work →

Haversine · 1M pairs · 9950X3D

Δ 46k×

V0 · pandas .apply · baseline
V1 · numpy vectorised · 229×
V2 · Polars planner · 351×
Z1 · Zig naïve scalar · 505×
Z2 · Zig polynomial · 403×
Z3 · Zig AVX2 SIMD4 · 1.4k×
Z4 · Zig AVX-512 SIMD8 · 3.2k×
Z5 · + multithreading · 21k×
Z6 · + Estrin's scheme · 26k×
Z7 · + FMA + pool · 46k×

What I've built

Real systems with real numbers. Click any card for the writeup.

Deep dives →

Engineer on the optimisation arc

Provstiskyen: performance work on a 10-year SaaS

50s → 18s

App startup

Profiled and fixed the cold-start path on a 44,000-line R Shiny production app: 50-second logins down to 18, and 35-minute deploys down to 80 seconds, all on the existing codebase. The full rewrite that came later was made possible by a year of targeted optimisation work first.

Optimization · Fullstack · DevOps

R Shiny · FastAPI · Polars · Docker · GKE · MariaDB

This site

Tachyon

9,100 → 0.29 ns/pair

Python V0 → Zig V7

The same haversine kernel, walked from a naïve pandas `.apply` through C++, Rust, and Zig SIMD to an analyzer-driven V7 in Zig that reads its own compiled assembly to land at 150 GB/s. The site also includes a WebGPU compute lab that runs in the browser. It is an end-to-end demo of the optimisation work I do for clients.

Optimization · DevOps · Fullstack

Python · Zig · Rust · C++ · WebGPU · FastAPI · Astro · Fly.io
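
For readers who have not met the kernel before, here is a minimal scalar sketch of the haversine distance in plain Python. It is not the site's V0 code: the function names, the degrees-in/kilometres-out convention, and the 6371 km mean Earth radius are all assumptions made for illustration. It only shows the arithmetic that every later version has to reproduce.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the site


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp against rounding so asin never sees a value above 1.
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def haversine_batch(pairs):
    """Hypothetical row-at-a-time driver: one Python call per (lat1, lon1, lat2, lon2)."""
    return [haversine_km(*p) for p in pairs]
```

Calling the function once per row, as `haversine_batch` does, has the same shape as a row-wise `.apply`: the per-call interpreter overhead dominates the trigonometry, which is the cost the vectorised and SIMD versions remove.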

Horus / Neper / Maat

Home GitOps cluster

4 nodes

ARM64 GitOps cluster

Bare-metal Kubernetes on 4× Raspberry Pi 4 with Flux, Cilium, Tailscale, an in-cluster Zot registry, and MinIO. The infrastructure layer of the optimisation work; same patterns I apply to bigger clusters at work.

DevOps · Fullstack

Kubernetes · Flux · Cilium · Tailscale · MinIO · Zot · ARM64

How I work

Measure first. The diagnostic usually delivers more value than the fix, because most teams never had numbers to argue from.
Cheap fix before expensive rewrite. A targeted profile and a 100-line change ships in days. A rewrite takes quarters and might not converge. Most slow code has a cheap fix waiting in the existing codebase, and the discipline is finding it.
Defensible methods. Numbers come from real percentile distributions, not vibes; every choice of tool gets justified on paper before it goes anywhere near production.
Phased rollout with rollback. Shadow mode first, then a single canary, then staged batches. Every step has a defined revert path before it ships.
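
As a rough illustration of "measure first" and "real percentile distributions", the sketch below times a callable repeatedly and reports p50, p95, and p99 instead of a single average. The helper names `measure_ns` and `summarize`, the run counts, and the choice of inclusive percentiles are assumptions for this example, not a description of the tooling used on client work.

```python
import statistics
import time


def measure_ns(fn, runs=200, warmup=20):
    """Call fn() `warmup` times untimed, then return `runs` per-call timings in ns."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples):
    """Return p50, p95, and p99 of the samples (inclusive method, interpolated)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

A typical use would be `summarize(measure_ns(lambda: sum(range(10_000))))`. Comparing the tail percentiles before and after a change shows whether the fix helped the slow cases or only the median.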

Now

What's next

Day-job performance-tuning work is the bulk of the week. Outside of that, I keep pushing on Provstiskyen's Analyse module: the last piece before the legacy R Shiny app can finally retire.

If something you run is slower than it should be, get in touch.