Self-hosted LLM deployments, model routing, private inference, AI observability. For businesses that want the power of modern AI without sending customer data to a vendor, without per-token surprises, and without lock-in to any single model provider.
OpenAI / Anthropic / Gemini APIs are powerful and easy to start with — and exactly the wrong answer for a surprising number of real business use cases. Customer data leaving your perimeter. Unpredictable per-token costs that explode with usage. Vendor lock-in to a single provider. Rate limits that kneecap batch jobs. Compliance teams asking questions nobody can answer.
Self-hosting AI solves most of this, but it's a different skill set. GPU provisioning, model selection (Llama? Mistral? Qwen?), quantization tradeoffs, inference server choice (vLLM, TGI, Ollama, llama.cpp?), routing across models, evaluation, observability, cost control. Most teams either don't have the expertise in-house or don't want to grow it as a side project.
That's the work I do. I design and deploy private AI infrastructure that matches what you actually need, not a wholesale clone of OpenAI. Often the right answer is a mix: self-hosted for the bulk, sensitive, or cheap work; external APIs for the hard edge cases; a router deciding between them.
GPU provisioning (on-prem, bare metal, or cloud). Model deployment with vLLM / TGI / Ollama / llama.cpp depending on workload. Quantization and batching tuned to your cost/latency tradeoff.
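To make that concrete, here's roughly what a single-model deployment looks like through vLLM's Python API. This is a sketch: the model, quantization scheme, and GPU split are illustrative, not a recommendation.

```python
# Minimal vLLM deployment sketch. Model name, quantization, and GPU count
# are placeholders; the right values come out of the workload assessment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-AWQ",  # AWQ-quantized 72B build
    quantization="awq",
    tensor_parallel_size=2,        # shard across 2 GPUs
    gpu_memory_utilization=0.90,   # leave headroom for KV cache spikes
)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```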
Route requests across models based on cost, latency, and quality. Fall back to external APIs when the self-hosted model can't handle a request. One unified interface for your applications.
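A stripped-down sketch of the routing idea, using the OpenAI-compatible client both stacks expose. The endpoint URL, model names, and the needs_frontier flag are placeholders; a real router scores requests rather than taking a boolean.

```python
# Routing sketch: cheap/sensitive work stays local, hard cases and
# failures escalate to an external API. All names are illustrative.
from openai import OpenAI

local = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")
frontier = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(messages, needs_frontier: bool = False):
    if not needs_frontier:
        try:
            return local.chat.completions.create(
                model="qwen2.5-72b-instruct", messages=messages, timeout=30
            )
        except Exception:
            pass  # self-hosted down or overloaded: fall through
    return frontier.chat.completions.create(model="gpt-4o", messages=messages)
```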
ChatGPT-style internal assistant running on your infrastructure. Connects to your internal docs, knowledge bases, databases. Nobody outside your network sees the prompts.
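Under the hood that's retrieval plus a local model. A minimal sketch, with a stubbed-out retriever standing in for whatever index you actually run (pgvector, Elasticsearch, etc.):

```python
# Internal-assistant sketch. `search_docs` is a hypothetical stand-in for
# your real document index; the endpoint and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")

def search_docs(query: str, top_k: int = 5) -> list[str]:
    return ["<relevant internal snippet>"]  # stub: replace with your index

def answer(question: str) -> str:
    context = "\n\n".join(search_docs(question))
    resp = client.chat.completions.create(
        model="qwen2.5-72b-instruct",
        messages=[
            {"role": "system",
             "content": f"Answer using only this internal context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```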
Trace every request. Log inputs, outputs, latencies, costs. Automated evaluation against golden sets. Catch regressions before users do.
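In practice that starts with a thin wrapper around every model call. A simplified sketch; the field names and per-token rates are illustrative, and the JSON line would go to your log pipeline rather than stdout:

```python
# Tracing sketch: one structured record per request. Rates are made up.
import json, time, uuid

COST_PER_1K_TOKENS = {"qwen2.5-72b-instruct": 0.0, "gpt-4o": 0.01}

def traced(client, model, messages):
    t0 = time.monotonic()
    resp = client.chat.completions.create(model=model, messages=messages)
    print(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "model": model,
        "latency_s": round(time.monotonic() - t0, 3),
        "prompt_tokens": resp.usage.prompt_tokens,
        "completion_tokens": resp.usage.completion_tokens,
        "cost_usd": resp.usage.total_tokens / 1000
                    * COST_PER_1K_TOKENS.get(model, 0.0),
        "output": resp.choices[0].message.content,
    }))
    return resp
```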
Per-team quotas. Per-endpoint budgets. Alerting on anomalous spend. No more surprise $50k monthly bills.
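The enforcement logic itself is simple. A toy sketch with invented budgets; a real deployment persists spend in a database and alerts before blocking:

```python
# Per-team budget gate. Numbers and in-memory storage are placeholders.
from collections import defaultdict

MONTHLY_BUDGET_USD = {"support": 500.0, "marketing": 200.0}  # examples
spend = defaultdict(float)  # team -> spend so far this month

def charge(team: str, cost_usd: float) -> None:
    if spend[team] + cost_usd > MONTHLY_BUDGET_USD.get(team, 0.0):
        raise RuntimeError(f"{team} is over budget; request blocked")
    spend[team] += cost_usd

charge("support", 0.42)      # fine
# charge("marketing", 500)   # would raise: exceeds the team's budget
```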
Your deployment, your repo, your infrastructure. Deployment scripts, upgrade runbooks, rollback procedures — all yours. Open-source stack throughout.
No magic proprietary AI platforms. Everything runs on tools you can understand, inspect, and replace. Your stack, your rules.
Discovery is free. Workload assessment (Phase 1) is always fixed-price. Later phases scoped based on what Phase 1 reveals.
Independent senior engineer. 20+ years of infrastructure, five years running production AI for my own systems and for clients.
I've been deploying servers since Apache was new. Docker since it was beta. Kubernetes in anger, then sensibly out of it for smaller teams. When LLMs became deployable in 2023, I was already in a position to integrate them into business systems rather than just call an API.
What I bring is old-school infrastructure discipline applied to modern AI. GPU budgets. Observability. Cost control. Fallback strategies. The kind of production thinking that turns a cool demo into a system you can bet a business on.
Based in Chicago. Working worldwide. Direct contracts or US entity.
At scale, often yes; sometimes 5-10x cheaper at steady state. If you're spending under $2-5k/month on APIs, it's probably not worth it. I model actual costs honestly before you commit; sometimes the answer is 'stay on the API for now, revisit in 6 months'.
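For a feel of the arithmetic, a deliberately crude break-even sketch; every number below is invented for illustration, not a quote:

```python
# Back-of-envelope break-even: monthly API bill vs amortized GPUs + upkeep.
api_monthly = 8_000.0         # current API spend, USD/month (example)
gpu_capex = 60_000.0          # GPU server purchase (assumption)
amortization_months = 24
opex_monthly = 1_200.0        # power, colo, spares, upkeep (assumption)

self_hosted_monthly = gpu_capex / amortization_months + opex_monthly
print(f"self-hosted ${self_hosted_monthly:,.0f}/mo vs API ${api_monthly:,.0f}/mo")
# -> self-hosted $3,700/mo vs API $8,000/mo. Run the same numbers at
#    $2,000/mo of API spend and the hardware costs more than it saves.
```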
For most business tasks, Llama 3 70B, Qwen 2.5 72B, or Mistral Large are within 10-20% of GPT-4 at a fraction of the inference cost. For reasoning-heavy work, you'll still want frontier APIs. I design with both.
Not necessarily. Options: (1) on-prem bare-metal GPUs (best economics at scale), (2) reserved cloud GPU instances (AWS, Lambda, RunPod), (3) hybrid — own some, burst to cloud. I model all three before you commit.
Self-hosting actually simplifies compliance because data never leaves your perimeter. I work with your compliance team on architecture documentation, audit trails, data handling. I'm not a compliance lawyer but I've built systems that passed audits.
Yes. Most inference stacks expose OpenAI-compatible APIs, so your existing code barely changes. I can also build custom integrations where needed.
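In most cases the whole migration diff is the client constructor. A sketch with a placeholder URL and model name:

```python
# "Barely changes" in practice: point the same OpenAI client at your
# self-hosted endpoint. URL and model name are placeholders.
from openai import OpenAI

# Before: client = OpenAI()                      # hosted API
client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="qwen2.5-72b-instruct",
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```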
You own the deployment, repo, and infrastructure. If open models stall and you want to move back to APIs, the router already supports that — pull a lever, done. No lock-in by design.
Yes. Standard mutual NDA before anything sensitive is discussed.
Yes — I'm Chicago-based but work across US / EU / Asia. Contracts adaptable to your jurisdiction.
No proprietary lock-in to any platform I built. Every piece is open-source or industry-standard.
Written cost model showing break-even vs your current spend before you commit to migration.
Before cutover, I run real traffic through both old and new systems and compare outputs.
You see every AI call — inputs, outputs, latencies, costs — from day one. No black boxes.
Every deployment can be operated by your team after handoff. Full docs, runbooks, walkthroughs.
If self-hosting doesn't make sense for your scale, I'll tell you. Better to turn away a deployment contract than sell you the wrong thing.
Discovery calls are free. Bring your current AI usage patterns, pain points, or just questions — we'll work through what makes sense for you specifically.