20+ years infrastructure engineering · Self-hosted, private, observable · Remote · Fluent in US / EU timezones · Based in Chicago, USA · Worldwide delivery

Own your AI stack. End to end.

Self-hosted LLM deployments, model routing, private inference, AI observability. For businesses that want the power of modern AI without sending customer data to a vendor, without per-token surprises, and without lock-in to any single model provider.

The AI-as-API model has real limits.

OpenAI / Anthropic / Gemini APIs are powerful and easy to start with — and exactly the wrong answer for a surprising number of real business use cases. Customer data leaving your perimeter. Unpredictable per-token costs that explode with usage. Vendor lock-in to a single provider. Rate limits that kneecap batch jobs. Compliance teams asking questions nobody can answer.

Self-hosting AI solves most of this, but it's a different skill set. GPU provisioning, model selection (Llama? Mistral? Qwen?), quantization tradeoffs, inference server choice (vLLM, TGI, Ollama, llama.cpp?), routing across models, evaluation, observability, cost control. Most teams either don't have the expertise in-house or don't want to grow it as a side project.

That's the work I do. I design and deploy private AI infrastructure matched to what you actually need, not a one-to-one replica of OpenAI. Often the right answer is a mix: self-hosted for the bulk, sensitive, or cost-driven work; an external API for the hard edge cases; and a router deciding between them.

You should call me if:
  • You're spending $5k+/month on AI APIs and costs are scaling faster than value
  • You have data regulations that make third-party API usage complicated (HIPAA, GDPR, internal policy)
  • You want to run AI on-premise / in your own cloud account for governance reasons
  • You're locked into one model provider and it's starting to bite (price hikes, outages, deprecations)
  • You need evaluation and observability around AI outputs — not just "trust the model"
  • You want a private alternative to ChatGPT for internal use that doesn't expose your prompts to anyone

Private AI infrastructure, deployed and observable.

Self-hosted inference

GPU provisioning (on-prem, bare metal, or cloud). Model deployment with vLLM / TGI / Ollama / llama.cpp depending on workload. Quantization and batching tuned to your cost/latency tradeoff.
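
To make that concrete, here is a minimal sketch of the simplest case: one open model served through vLLM's offline Python API for bulk work. The model id and settings are illustrative, not a recommendation; real deployments are tuned per workload, and long-running services sit behind vLLM's OpenAI-compatible HTTP server instead.

```python
# Minimal vLLM sketch: one open model for bulk/batch inference.
# Model id and settings are illustrative, not a tuned production config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # pick per workload; quantized checkpoints load the same way
    gpu_memory_utilization=0.90,                  # leave headroom for the KV cache
    max_model_len=8192,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```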

Model routing & fallback

Route requests between models based on cost, latency, quality. Fall back to external APIs when self-hosted can't handle it. Single unified interface for your applications.
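
A hedged sketch of the route-then-fallback idea is below. Hostnames, model names, and the `hard` flag are assumptions for illustration; a production router (LiteLLM or custom) also weighs cost, latency, retries, and per-team budgets before picking a target.

```python
# Route-then-fallback, sketched: try the self-hosted endpoint first,
# fall back to an external provider when it is down or the task is flagged hard.
from openai import OpenAI, APIConnectionError, APIStatusError

local = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")
external = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(prompt: str, hard: bool = False) -> str:
    if not hard:
        try:
            r = local.chat.completions.create(
                model="llama-3-70b-instruct",
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return r.choices[0].message.content
        except (APIConnectionError, APIStatusError):
            pass  # self-hosted unavailable or erroring: fall through to the external API
    r = external.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content
```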

Private assistant deployments

ChatGPT-style internal assistant running on your infrastructure. Connects to your internal docs, knowledge bases, databases. Nobody outside your network sees the prompts.
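
Sketched below is the retrieval step behind such an assistant, under assumptions: a pgvector-backed `doc_chunks` table, an `embed()` helper returning a numpy vector, and an OpenAI-compatible local model. All names are illustrative, not a fixed design.

```python
# Retrieval step of a private internal assistant, sketched.
# Assumes a `doc_chunks(content text, embedding vector)` table and an
# embed() helper that returns a numpy array matching the column dimension.
import psycopg
from pgvector.psycopg import register_vector

def answer(question: str, client, embed) -> str:
    with psycopg.connect("dbname=assistant") as conn:
        register_vector(conn)
        rows = conn.execute(
            "SELECT content FROM doc_chunks ORDER BY embedding <=> %s LIMIT 5",
            (embed(question),),
        ).fetchall()
    context = "\n\n".join(r[0] for r in rows)
    resp = client.chat.completions.create(
        model="llama-3-70b-instruct",
        messages=[
            {"role": "system", "content": "Answer using only this internal context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```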

Observability & evaluation

Trace every request. Log inputs, outputs, latencies, costs. Automated evaluation against golden sets. Catch regressions before users do.
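
The evaluation side, sketched under assumptions: a golden.jsonl file of prompt/expected pairs and a naive substring check. Real evals use task-specific scoring or an LLM judge, and push scores, latencies, and costs into the tracing stack rather than printing them.

```python
# Golden-set regression check, sketched. File format and pass criterion
# are illustrative; results normally feed Langfuse/Grafana, not stdout.
import json, time

def run_golden_set(client, model: str, path: str = "golden.jsonl") -> float:
    passed = total = 0
    for line in open(path):
        case = json.loads(line)              # {"prompt": "...", "expected": "..."}
        start = time.perf_counter()
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        latency = time.perf_counter() - start
        ok = case["expected"].lower() in r.choices[0].message.content.lower()
        passed += ok
        total += 1
        print(f"{case['prompt'][:40]!r}  latency={latency:.2f}s  pass={ok}")
    return passed / max(total, 1)
```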

Cost & usage controls

Per-team quotas. Per-endpoint budgets. Alerting on anomalous spend. No more surprise $50k monthly bills.
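
A toy sketch of the enforcement idea follows. Team names, prices, and in-memory state are illustrative; in practice this lives in the gateway, is backed by persistent metering, and alerts well before the hard cutoff.

```python
# Per-team budget check, toy version (illustrative prices and teams).
from collections import defaultdict

MONTHLY_BUDGET_USD = {"support": 500.0, "research": 2_000.0}
spend_usd = defaultdict(float)

def charge(team: str, prompt_tokens: int, completion_tokens: int,
           usd_per_1k_in: float = 0.0005, usd_per_1k_out: float = 0.0015) -> None:
    cost = prompt_tokens / 1000 * usd_per_1k_in + completion_tokens / 1000 * usd_per_1k_out
    spend_usd[team] += cost
    budget = MONTHLY_BUDGET_USD.get(team, 0.0)
    if spend_usd[team] > budget:
        raise RuntimeError(f"{team!r} is over its monthly AI budget (${budget:,.0f})")
```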

Full source code & runbooks

Your deployment, your repo, your infrastructure. Deployment scripts, upgrade runbooks, rollback procedures — all yours. Open-source stack throughout.

Four phases. No black boxes.

01
1 week

Workload assessment

My part
Audit current AI usage — what models, what volumes, what costs, what data flows. Identify what can move to self-hosted, what stays external, what changes either way.
Your part
Share current API usage patterns, traffic volumes, data sensitivity classifications. Straight answers about compliance and budget constraints.
Deliverable
Workload map + self-hosting fit assessment + fixed-price Phase 2 proposal
02
1-2 weeks

Infrastructure design

My part
Pick the hardware target (GPU class, on-prem vs cloud). Select model candidates for each workload. Design routing, observability, and cost-control layers. Prototype the hardest piece against a real workload.
Your part
Provision infrastructure (I can advise or do it). Sign off on model choices and architecture.
Deliverable
Architecture diagram + benchmark results + deployment runbook + Phase 3 scope
03
3-6 weeks

Deployment & migration

My part
Stand up the full stack. Migrate existing workloads incrementally — shadow traffic first, canary rollout, full cutover. Observability wired in from day one.
Your part
Coordinate team on migration timing. Validate outputs match your current setup on real queries.
Deliverable
Production AI infrastructure with full observability + completed migration
04
ongoing or clean exit

Operate & hand off

My part
Retained engineer for model updates, evaluations, optimization — or handoff to your team with complete documentation.
Your part
Decide the shape of ongoing support.
Deliverable
Retained-engineer agreement OR clean handoff package

Open-source, boring, proven.

No magic proprietary AI platforms. Everything runs on tools you can understand, inspect, and replace. Your stack, your rules.

Inference
vLLM · TGI · Ollama · llama.cpp · ExLlama
vLLM for throughput. Ollama for simplicity. llama.cpp for CPU/edge. Chosen per workload, not dogma.
Models
Llama 3 · Mistral · Qwen · Phi · DeepSeek · Gemma
Model selection is task-specific. 7B for cheap bulk. 70B+ for hard reasoning. Quantized when latency matters.
Routing & orchestration
LiteLLM · custom routers · OpenAI-compatible gateways
Your apps see OpenAI-shape APIs. The router decides which model / provider actually runs the request.
Observability
Langfuse · OpenTelemetry · Grafana · Loki · custom tracing
Trace every AI call. Evaluate every output. Spot quality regressions and cost anomalies automatically.
Infrastructure
Docker · Kubernetes · Nomad · Linux · GPU drivers · CUDA
Docker for most deployments. Kubernetes if you already run it. Bare metal when the GPU math demands it.
Vector / retrieval
pgvector · Qdrant · Weaviate · Milvus · LanceDB
pgvector first unless you need something it can't do. Separate vector DB only when retrieval is the bottleneck.

Who this works for.

Good fit if:

  • You're already spending real money ($5k+/month) on AI APIs
  • Data sensitivity or compliance makes third-party AI a real problem, not a theoretical one
  • You want ownership — your models, your infrastructure, your data flows
  • You have a technical team that can operate the result (I hand off clean)
  • You understand self-hosting carries an operational cost alongside the savings
  • You want a strategic advisor, not a vendor with a proprietary platform to sell

Not a fit if:

  • You want a managed AI platform with zero ops overhead — pay OpenAI / Anthropic, that's their product
  • Your AI usage is $200/month — self-hosting economics don't work at that scale
  • You have zero technical team — you need someone to run the infrastructure, not just deploy it
  • You're shopping for the cheapest GPU-hosting setup — I optimize for correctness and observability, not raw price
  • You want me to pretend a 7B model is going to replace GPT-4 — it won't, and I'll tell you so

Every engagement scoped individually.

Discovery is free. Workload assessment (Phase 1) is always fixed-price. Later phases scoped based on what Phase 1 reveals.

Assessment
$5k – $12k · 1-2 weeks
Workload audit, fit assessment, architecture recommendation, cost modeling. You get a written report whether or not we continue.
Deployment
$25k – $80k · 4-8 weeks
Full stack setup, migration from existing APIs, observability, routing, first workload in production
Ongoing
$4k – $12k / month
Retained engineer for model updates, new workloads, evaluation, optimization. Predictable monthly cost.
Self-hosting saves money at scale but costs time and expertise upfront. I'll tell you honestly whether it makes sense for your situation before you commit to deployment.

Because I've been running this infrastructure for my own business.

Self-hosted · GPU inference · Open models · Private deployment
Problem
My own consulting practice runs on AI — content generation, research, code analysis, internal agents. Doing that on per-call APIs became expensive fast, and I had data I didn't want shared.
Approach
Built a self-hosted stack: GPU server running multiple open models behind a router, observability through Langfuse, fallback to external APIs when local can't handle it. Everything in Docker, monitored, reproducible.
Outcome
Predictable monthly infrastructure cost. Complete data privacy. Ability to route traffic intelligently across models based on task. This is what I build for clients — tested first on my own operation.

Dmytro Klymentiev

Independent senior engineer. 20+ years infrastructure, five years running production AI at personal and client scale.

I've been deploying servers since Apache was new and Docker since it was in beta. I've run Kubernetes in anger, and sensibly moved smaller teams off it. When open LLMs became deployable in 2023, I was already positioned to integrate them into business systems rather than just call an API.

What I bring is old-school infrastructure discipline applied to modern AI. GPU budgets. Observability. Cost control. Fallback strategies. The kind of production thinking that turns a cool demo into a system you can bet a business on.

Based in Chicago. Working worldwide. Direct contracts, or through a US entity if you need one.

Questions you might have.

Does self-hosting really save money vs OpenAI?

At scale, often yes — sometimes 5-10x cheaper at steady state. Below $2-5k/month on APIs, probably not worth it. I model actual costs honestly before you commit — sometimes the answer is 'stay on API for now, revisit in 6 months'.

Which open models are actually good enough?

For most business tasks, Llama 3 70B, Qwen 2.5 72B, or Mistral Large are within 10-20% of GPT-4 at a fraction of the inference cost. For reasoning-heavy work, you'll still want frontier APIs. I design with both.

Do I need to buy GPUs?

Not necessarily. Options: (1) on-prem bare-metal GPUs (best economics at scale), (2) reserved cloud GPU instances (AWS, Lambda Labs, RunPod), (3) hybrid — own some, burst to cloud. I model all three before you commit.

What about compliance — HIPAA, GDPR, SOC2?

Self-hosting actually simplifies compliance because data never leaves your perimeter. I work with your compliance team on architecture documentation, audit trails, data handling. I'm not a compliance lawyer but I've built systems that passed audits.

Can you integrate with our existing apps?

Yes. Most inference stacks expose OpenAI-compatible APIs, so your existing code barely changes. I can also build custom integrations where needed.
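
For example, pointing the standard OpenAI Python client at a self-hosted OpenAI-compatible endpoint is usually the whole change; the hostname and model name below are placeholders.

```python
# Existing OpenAI-client code, repointed at a self-hosted endpoint.
# Only the base URL (and model name) change; hostname is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="llama-3-70b-instruct",
    messages=[{"role": "user", "content": "Hello from inside the perimeter."}],
)
print(resp.choices[0].message.content)
```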

What if open models get worse or you lock me in?

You own the deployment, repo, and infrastructure. If open models stall and you want to move back to APIs, the router already supports that — pull a lever, done. No lock-in by design.

Will you sign an NDA?

Yes. Standard mutual NDA before anything sensitive is discussed.

Do you work with clients outside the US?

Yes — I'm Chicago-based but work across US / EU / Asia. Contracts adaptable to your jurisdiction.

What I guarantee, in writing.

01

Open-source stack

No proprietary platform of mine to get locked into. Every piece is open-source or industry-standard.

02

Cost modeling before deployment

Written cost model showing break-even vs your current spend before you commit to migration.

03

Shadow-traffic validation

Before cutover, I run real traffic through both old and new systems and compare outputs.

04

Complete observability

You see every AI call — inputs, outputs, latencies, costs — from day one. No black boxes.

05

Clean handoff option

Every deployment can be operated by your team after handoff. Full docs, runbooks, walkthroughs.

06

Straight economics

If self-hosting doesn't make sense for your scale, I'll tell you. Better to turn away a deployment contract than sell you the wrong thing.

Thirty minutes to talk through your AI stack.

Discovery calls are free. Bring your current AI usage patterns, pain points, or just questions — we'll work through what makes sense for you specifically.

Last updated: 2026-04-23 · by Dmytro Klymentiev