
Performance · Optimization · Engineering

I make slow software fast.

Performance work across the stack: algorithms, application code, build pipelines, infrastructure. I profile what's actually slow and ship the fix on the codebase you already have. As one worked example, this site walks the same haversine kernel from a naïve pandas .apply all the way down to a Zig + AVX-512 implementation sustaining 150 GB/s, with the maintenance cost of each rung made explicit.

Haversine · 1M pairs · 9950X3D

Δ 46k×

  1. V0
    pandas .apply
  2. V1
    numpy vectorised
    229×
  3. V2
    Polars planner
    351×
  4. Z1
    Zig naïve scalar
    505×
  5. Z2
    Zig polynomial
    403×
  6. Z3
    Zig AVX2 SIMD4
    1.4k×
  7. Z4
    Zig AVX-512 SIMD8
    3.2k×
  8. Z5
    + multithreading
    21k×
  9. Z6
    + Estrin's scheme
    26k×
  10. Z7
    + FMA + pool
    46k×
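
To make the first step of the ladder concrete, here is a minimal sketch of a one-pair-at-a-time haversine next to a NumPy-vectorised one. It is an illustration only, not the code benchmarked on this site: the scalar function stands in for the row-wise work a pandas `.apply` performs, and the function names, the Earth-radius constant, and the random test data are all assumptions.

```python
import math

import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; the site's constant is not stated


def haversine_scalar(lat1, lon1, lat2, lon2):
    """One pair per call: the shape of work a row-wise .apply does."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp so rounding near antipodal points cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def haversine_vectorised(lat1, lon1, lat2, lon2):
    """Whole arrays per call: each NumPy operation loops in compiled code."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlam = np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.minimum(1.0, np.sqrt(a)))


# Hypothetical input: 1,000 random coordinate pairs in degrees.
rng = np.random.default_rng(0)
n = 1_000
lat1, lat2 = rng.uniform(-90, 90, (2, n))
lon1, lon2 = rng.uniform(-180, 180, (2, n))

slow = np.array([haversine_scalar(*pair) for pair in zip(lat1, lon1, lat2, lon2)])
fast = haversine_vectorised(lat1, lon1, lat2, lon2)
```

Both versions compute the same formula; the difference is only where the loop runs. Checking that `slow` and `fast` agree before timing either one is the cheap guard that keeps a faster rung from being a wrong rung.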

What I've built

Real systems with real numbers. Click any card for the writeup.

Engineer on the optimisation arc

Provstiskyen: performance work on a 10-year SaaS

50s → 18s

App startup

Profiled and fixed the cold-start path on a 44,000-line R Shiny production app: logins went from 50 seconds to 18, and deploys from 35 minutes to 80 seconds, all on the existing codebase. The full rewrite that came later was only possible because a year of targeted optimisation work came first.

Optimization Fullstack DevOps
R Shiny FastAPI Polars Docker GKE MariaDB

Backtesting engine and scanner

Thoth

<2 ms

per-ticker backtest

Hot-loop discipline at the small scale: a 13-strategy backtest of the US equities universe finishes in seconds. Pure Polars expressions, threaded bulk runner, regime-gated strategies. The kind of code-level performance work I bring to bigger systems.

Optimization Fullstack Data
FastAPI Polars PostgreSQL TimescaleDB Astro React

Anonymised, Danish specialty retailer

Inventory simulation arena

9 invariants

hot-loop locks

Polars-based simulation engine where seven candidate inventory strategies compete on years of real demand data. Nine source-level invariants guard the hot loop from accidental refactor regressions: the optimisation contract is written into the test suite.

Optimization Fullstack Data
FastAPI Polars MariaDB React Pixi Vite
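
One way a source-level invariant can live in a test suite is sketched below, using Python's standard `ast` module to scan hot-loop source for constructs that would reintroduce row-by-row work. This is an assumed illustration, not the project's nine invariants: the ban list, the function name, and the sample sources are all hypothetical.

```python
import ast

# Hypothetical ban list: method names that imply per-row Python execution.
BANNED_CALLS = {"apply", "iter_rows", "map_elements"}


def hot_loop_violations(source: str) -> list[str]:
    """Return one message per construct that breaks the assumed hot-loop contract."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.For, ast.While)):
            violations.append(f"line {node.lineno}: Python-level loop")
        elif (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in BANNED_CALLS
        ):
            violations.append(f"line {node.lineno}: call to .{node.func.attr}()")
    return violations


# Hypothetical hot-loop sources: one expression-only, one with a row loop.
GOOD_SOURCE = '''
def step(df, pl):
    return df.with_columns((pl.col("stock") - pl.col("demand")).alias("stock"))
'''

BAD_SOURCE = '''
def step(df):
    for row in df.iter_rows():
        pass
    return df
'''

good_report = hot_loop_violations(GOOD_SOURCE)
bad_report = hot_loop_violations(BAD_SOURCE)
```

In a real suite the source string would come from the module under test rather than a literal, and the test would simply assert that the report is empty. This sketch does not flag comprehensions, so a stricter contract would need more node types.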

This site

Tachyon

9,100 → 0.29 ns/pair

Python V0 → Zig V7

The same haversine kernel walked from a naïve pandas `.apply` through C++, Rust, Zig SIMD, and finally an analyzer-driven V7 in Zig that reads its own compiled assembly to land at 150 GB/s, plus a WebGPU compute lab in the browser. End-to-end demo of the optimisation work I do for clients.

Optimization DevOps Fullstack
Python Zig Rust C++ WebGPU FastAPI Astro Fly.io

Horus / Neper / Maat

Home GitOps cluster

4 nodes

ARM64 GitOps cluster

Bare-metal Kubernetes on 4× Raspberry Pi 4 with Flux, Cilium, Tailscale, an in-cluster Zot registry, and MinIO. The infrastructure layer of the optimisation work, built on the same patterns I apply to bigger clusters at work.

DevOps Fullstack
Kubernetes Flux Cilium Tailscale MinIO Zot ARM64

How I work

  • Measure first. The diagnostic usually delivers more value than the fix, because most teams never had numbers to argue from.
  • Cheap fix before expensive rewrite. A targeted profile and a 100-line change ship in days. A rewrite takes quarters and might not converge. Most slow code has a cheap fix waiting in the existing codebase, and the discipline is finding it.
  • Defensible methods. Numbers come from real percentile distributions, not vibes; every choice of tool gets justified on paper before it goes anywhere near production.
  • Phased rollout with rollback. Shadow mode first, then a single canary, then staged batches. Every step has a defined revert path before it ships.
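
As a sketch of what "percentile distributions" can mean in practice, the snippet below times a callable repeatedly and reports p50, p95, and p99 instead of a single average. It uses only the Python standard library; the helper names, repeat counts, and the sample workload are assumptions for illustration, not the harness behind the numbers on this page.

```python
import statistics
import time


def measure_ns(fn, repeats=200, warmup=20):
    """Time fn() repeatedly and return the per-call durations in nanoseconds."""
    for _ in range(warmup):  # let caches and lazy initialisation settle first
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples


def percentiles(samples):
    """Summarise samples as p50/p95/p99 using inclusive linear interpolation."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical workload: summing a small range.
report = percentiles(measure_ns(lambda: sum(range(10_000))))
```

Comparing `report` before and after a change shows whether the fix moved the typical case, the tail, or both, which a mean alone would hide.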

Now

What's next

Day-job performance-tuning work is the bulk of the week. Outside of that, I keep pushing on Provstiskyen's Analyse module: the last piece before the legacy R Shiny app can finally retire.

If something you run is slower than it should be, get in touch.