20+ years infrastructure engineering · Self-hosted, private, observable · Remote · Fluent in US / EU timezones · Based in Chicago, USA · Worldwide delivery

Own your AI stack. End to end.

Self-hosted LLM deployments, model routing, private inference, AI observability. For businesses that want the power of modern AI without sending customer data to a vendor, without per-token surprises, and without lock-in to any single model provider.

The AI-as-API model has real limits.

OpenAI / Anthropic / Gemini APIs are powerful and easy to start with — and exactly the wrong answer for a surprising number of real business use cases. Customer data leaving your perimeter. Unpredictable per-token costs that explode with usage. Vendor lock-in to a single provider. Rate limits that kneecap batch jobs. Compliance teams asking questions nobody can answer.

Self-hosting AI solves most of this, but it's a different skill set. GPU provisioning, model selection (Llama? Mistral? Qwen?), quantization tradeoffs, inference server choice (vLLM, TGI, Ollama, llama.cpp?), routing across models, evaluation, observability, cost control. Most teams either don't have the expertise in-house or don't want to grow it as a side project.

That's the work I do. I design and deploy private AI infrastructure matched to what you actually need, not a one-to-one replica of OpenAI. Often the right answer is a mix: self-hosted for the bulk, sensitive, or cost-driven work; an external API for the hard edge cases; and a router deciding between them.

You should call me if:
  • You're spending $5k+/month on AI APIs and costs are scaling faster than value
  • You have data regulations that make third-party API usage complicated (HIPAA, GDPR, internal policy)
  • You want to run AI on-premise / in your own cloud account for governance reasons
  • You're locked into one model provider and it's starting to bite (price hikes, outages, deprecations)
  • You need evaluation and observability around AI outputs — not just "trust the model"
  • You want a private alternative to ChatGPT for internal use that doesn't expose your prompts to anyone

Private AI infrastructure, deployed and observable.

Self-hosted inference

GPU provisioning (on-prem, bare metal, or cloud). Model deployment with vLLM / TGI / Ollama / llama.cpp depending on workload. Quantization and batching tuned to your cost/latency tradeoff.
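
To make that concrete, here is a minimal sketch of the simplest case: one open model served through vLLM's offline Python API for bulk work. The model id and settings are illustrative, not a recommendation; real deployments are tuned per workload, and long-running services sit behind vLLM's OpenAI-compatible HTTP server instead.

```python
# Minimal vLLM sketch: one open model for bulk/batch inference.
# Model id and settings are illustrative, not a tuned production config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # pick per workload; quantized checkpoints load the same way
    gpu_memory_utilization=0.90,                  # leave headroom for the KV cache
    max_model_len=8192,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```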

Model routing & fallback

Route requests between models based on cost, latency, quality. Fall back to external APIs when self-hosted can't handle it. Single unified interface for your applications.
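
A hedged sketch of the route-then-fallback idea is below. Hostnames, model names, and the `hard` flag are assumptions for illustration; a production router (LiteLLM or custom) also weighs cost, latency, retries, and per-team budgets before picking a target.

```python
# Route-then-fallback, sketched: try the self-hosted endpoint first,
# fall back to an external provider when it is down or the task is flagged hard.
from openai import OpenAI, APIConnectionError, APIStatusError

local = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")
external = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(prompt: str, hard: bool = False) -> str:
    if not hard:
        try:
            r = local.chat.completions.create(
                model="llama-3-70b-instruct",
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return r.choices[0].message.content
        except (APIConnectionError, APIStatusError):
            pass  # self-hosted unavailable or erroring: fall through to the external API
    r = external.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content
```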

Private assistant deployments

ChatGPT-style internal assistant running on your infrastructure. Connects to your internal docs, knowledge bases, databases. Nobody outside your network sees the prompts.
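
Sketched below is the retrieval step behind such an assistant, under assumptions: a pgvector-backed `doc_chunks` table, an `embed()` helper returning a numpy vector, and an OpenAI-compatible local model. All names are illustrative, not a fixed design.

```python
# Retrieval step of a private internal assistant, sketched.
# Assumes a `doc_chunks(content text, embedding vector)` table and an
# embed() helper that returns a numpy array matching the column dimension.
import psycopg
from pgvector.psycopg import register_vector

def answer(question: str, client, embed) -> str:
    with psycopg.connect("dbname=assistant") as conn:
        register_vector(conn)
        rows = conn.execute(
            "SELECT content FROM doc_chunks ORDER BY embedding <=> %s LIMIT 5",
            (embed(question),),
        ).fetchall()
    context = "\n\n".join(r[0] for r in rows)
    resp = client.chat.completions.create(
        model="llama-3-70b-instruct",
        messages=[
            {"role": "system", "content": "Answer using only this internal context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```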

Observability & evaluation

Trace every request. Log inputs, outputs, latencies, costs. Automated evaluation against golden sets. Catch regressions before users do.
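
The evaluation side, sketched under assumptions: a golden.jsonl file of prompt/expected pairs and a naive substring check. Real evals use task-specific scoring or an LLM judge, and push scores, latencies, and costs into the tracing stack rather than printing them.

```python
# Golden-set regression check, sketched. File format and pass criterion
# are illustrative; results normally feed Langfuse/Grafana, not stdout.
import json, time

def run_golden_set(client, model: str, path: str = "golden.jsonl") -> float:
    passed = total = 0
    for line in open(path):
        case = json.loads(line)              # {"prompt": "...", "expected": "..."}
        start = time.perf_counter()
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        latency = time.perf_counter() - start
        ok = case["expected"].lower() in r.choices[0].message.content.lower()
        passed += ok
        total += 1
        print(f"{case['prompt'][:40]!r}  latency={latency:.2f}s  pass={ok}")
    return passed / max(total, 1)
```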

Cost & usage controls

Per-team quotas. Per-endpoint budgets. Alerting on anomalous spend. No more surprise $50k monthly bills.
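
A toy sketch of the enforcement idea follows. Team names, prices, and in-memory state are illustrative; in practice this lives in the gateway, is backed by persistent metering, and alerts well before the hard cutoff.

```python
# Per-team budget check, toy version (illustrative prices and teams).
from collections import defaultdict

MONTHLY_BUDGET_USD = {"support": 500.0, "research": 2_000.0}
spend_usd = defaultdict(float)

def charge(team: str, prompt_tokens: int, completion_tokens: int,
           usd_per_1k_in: float = 0.0005, usd_per_1k_out: float = 0.0015) -> None:
    cost = prompt_tokens / 1000 * usd_per_1k_in + completion_tokens / 1000 * usd_per_1k_out
    spend_usd[team] += cost
    budget = MONTHLY_BUDGET_USD.get(team, 0.0)
    if spend_usd[team] > budget:
        raise RuntimeError(f"{team!r} is over its monthly AI budget (${budget:,.0f})")
```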

Full source code & runbooks

Your deployment, your repo, your infrastructure. Deployment scripts, upgrade runbooks, rollback procedures — all yours. Open-source stack throughout.

Four phases. No black boxes.

01
1 week

Workload assessment

My part
Audit current AI usage — what models, what volumes, what costs, what data flows. Identify what can move to self-hosted, what stays external, what changes either way.
Your part
Share current API usage patterns, traffic volumes, data sensitivity classifications. Straight answers about compliance and budget constraints.
Deliverable
Workload map + self-hosting fit assessment + fixed-price Phase 2 proposal
02
1-2 weeks

Infrastructure design

My part
Pick the hardware target (GPU class, on-prem vs cloud). Select model candidates for each workload. Design routing, observability, and cost-control layers. Prototype the hardest piece against a real workload.
Your part
Provision infrastructure (I can advise or do it). Sign off on model choices and architecture.
Deliverable
Architecture diagram + benchmark results + deployment runbook + Phase 3 scope
03
3-6 weeks

Deployment & migration

My part
Stand up the full stack. Migrate existing workloads incrementally — shadow traffic first, canary rollout, full cutover. Observability wired in from day one.
Your part
Coordinate team on migration timing. Validate outputs match your current setup on real queries.
Deliverable
Production AI infrastructure with full observability + completed migration
04
ongoing or clean exit

Operate & hand off

My part
Retained engineer for model updates, evaluations, optimization — or handoff to your team with complete documentation.
Your part
Decide the shape of ongoing support.
Deliverable
Retained-engineer agreement OR clean handoff package

Open-source, boring, proven.

No magic proprietary AI platforms. Everything runs on tools you can understand, inspect, and replace. Your stack, your rules.

Inference
vLLM · TGI · Ollama · llama.cpp · ExLlama
vLLM for throughput. Ollama for simplicity. llama.cpp for CPU/edge. Chosen per workload, not dogma.
Models
Llama 3 · Mistral · Qwen · Phi · DeepSeek · Gemma
Model selection is task-specific. 7B for cheap bulk. 70B+ for hard reasoning. Quantized when latency matters.
Routing & orchestration
LiteLLM · custom routers · OpenAI-compatible gateways
Your apps see OpenAI-shape APIs. The router decides which model / provider actually runs the request.
Observability
Langfuse · OpenTelemetry · Grafana · Loki · custom tracing
Trace every AI call. Evaluate every output. Spot quality regressions and cost anomalies automatically.
Infrastructure
Docker · Kubernetes · Nomad · Linux · GPU drivers · CUDA
Docker for most deployments. Kubernetes if you already run it. Bare metal when the GPU math demands it.
Vector / retrieval
pgvector · Qdrant · Weaviate · Milvus · LanceDB
pgvector first unless you need something it can't do. Separate vector DB only when retrieval is the bottleneck.

Who this works for.

Good fit if:

  • You're already spending real money ($5k+/month) on AI APIs
  • Data sensitivity or compliance makes third-party AI a real problem, not a theoretical one
  • You want ownership — your models, your infrastructure, your data flows
  • You have a technical team that can operate the result (I hand off clean)
  • You understand self-hosting carries an operational cost alongside the savings
  • You want a strategic advisor, not a vendor with a proprietary platform to sell

Not a fit if:

  • You want a managed AI platform with zero ops overhead — pay OpenAI / Anthropic, that's their product
  • Your AI usage is $200/month — self-hosting economics don't work at that scale
  • You have zero technical team — you need someone to run the infrastructure, not just deploy it
  • You're shopping for the cheapest GPU-hosting setup — I optimize for correctness and observability, not raw price
  • You want me to pretend a 7B model is going to replace GPT-4 — it won't, and I'll tell you so

Every engagement scoped individually.

Discovery is free. Workload assessment (Phase 1) is always fixed-price. Later phases scoped based on what Phase 1 reveals.

Assessment
$5k – $12k · 1-2 weeks
Workload audit, fit assessment, architecture recommendation, cost modeling. You get a written report whether or not we continue.
Deployment
$25k – $80k · 4-8 weeks
Full stack setup, migration from existing APIs, observability, routing, first workload in production
Ongoing
$4k – $12k / month
Retained engineer for model updates, new workloads, evaluation, optimization. Predictable monthly cost.
Self-hosting saves money at scale but costs time and expertise upfront. I'll tell you honestly whether it makes sense for your situation before you commit to deployment.

Because I've been running this infrastructure for my own business.

Self-hosted · GPU inference · Open models · Private deployment
Problem
My own consulting practice runs on AI — content generation, research, code analysis, internal agents. Doing that on per-call APIs became expensive fast, and I had data I didn't want shared.
Approach
Built a self-hosted stack: GPU server running multiple open models behind a router, observability through Langfuse, fallback to external APIs when local can't handle it. Everything in Docker, monitored, reproducible.
Outcome
Predictable monthly infrastructure cost. Complete data privacy. Ability to route traffic intelligently across models based on task. This is what I build for clients — tested first on my own operation.

Dmytro Klymentiev

Independent senior engineer. 20+ years infrastructure, five years running production AI at personal and client scale.

I've been deploying servers since Apache was new and Docker since it was in beta. I've run Kubernetes in anger, and sensibly moved smaller teams off it. When open LLMs became deployable in 2023, I was already positioned to integrate them into business systems rather than just call an API.

What I bring is old-school infrastructure discipline applied to modern AI. GPU budgets. Observability. Cost control. Fallback strategies. The kind of production thinking that turns a cool demo into a system you can bet a business on.

Based in Chicago. Working worldwide. Direct contracts, or through a US entity if you need one.

Questions you might have.

Does self-hosting really save money vs OpenAI?

At scale, often yes — sometimes 5-10x cheaper at steady state. Below $2-5k/month on APIs, probably not worth it. I model actual costs honestly before you commit — sometimes the answer is 'stay on API for now, revisit in 6 months'.

Which open models are actually good enough?

For most business tasks, Llama 3 70B, Qwen 2.5 72B, or Mistral Large are within 10-20% of GPT-4 at a fraction of the inference cost. For reasoning-heavy work, you'll still want frontier APIs. I design with both.

Do I need to buy GPUs?

Not necessarily. Options: (1) on-prem bare-metal GPUs (best economics at scale), (2) reserved cloud GPU instances (AWS, Lambda Labs, RunPod), (3) hybrid — own some, burst to cloud. I model all three before you commit.

What about compliance — HIPAA, GDPR, SOC2?

Self-hosting actually simplifies compliance because data never leaves your perimeter. I work with your compliance team on architecture documentation, audit trails, data handling. I'm not a compliance lawyer but I've built systems that passed audits.

Can you integrate with our existing apps?

Yes. Most inference stacks expose OpenAI-compatible APIs, so your existing code barely changes. I can also build custom integrations where needed.
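
For example, pointing the standard OpenAI Python client at a self-hosted OpenAI-compatible endpoint is usually the whole change; the hostname and model name below are placeholders.

```python
# Existing OpenAI-client code, repointed at a self-hosted endpoint.
# Only the base URL (and model name) change; hostname is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://llm.internal:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="llama-3-70b-instruct",
    messages=[{"role": "user", "content": "Hello from inside the perimeter."}],
)
print(resp.choices[0].message.content)
```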

What if open models get worse or you lock me in?

You own the deployment, repo, and infrastructure. If open models stall and you want to move back to APIs, the router already supports that — pull a lever, done. No lock-in by design.

Will you sign an NDA?

Yes. Standard mutual NDA before anything sensitive is discussed.

Do you work with clients outside the US?

Yes — I'm Chicago-based but work across US / EU / Asia. Contracts adaptable to your jurisdiction.

What I guarantee, in writing.

01

Open-source stack

No proprietary platform of mine to get locked into. Every piece is open-source or industry-standard.

02

Cost modeling before deployment

Written cost model showing break-even vs your current spend before you commit to migration.

03

Shadow-traffic validation

Before cutover, I run real traffic through both old and new systems and compare outputs.

04

Complete observability

You see every AI call — inputs, outputs, latencies, costs — from day one. No black boxes.

05

Clean handoff option

Every deployment can be operated by your team after handoff. Full docs, runbooks, walkthroughs.

06

Straight economics

If self-hosting doesn't make sense for your scale, I'll tell you. Better to turn away a deployment contract than sell you the wrong thing.

Thirty minutes to talk through your AI stack.

Discovery calls are free. Bring your current AI usage patterns, pain points, or just questions — we'll work through what makes sense for you specifically.

Last updated: 2026-04-23 · by Dmytro Klymentiev