Custom agents · LLM workflows · Production-grade · Evaluation · Observability · Cost control · Remote · Available across US / EU timezones · Based in Chicago, USA · Worldwide delivery

Software that actually does work. Not just talks about it.

Custom AI agents, LLM-powered internal tools, workflows where the model is one component in a real system. I build AI applications that solve specific business problems — not generic chatbots, not demos that fall apart in production.

The gap between AI demos and AI that ships.

Everyone has seen the demo: ask an AI agent to book a flight or summarize a doc, watch it succeed. Most of those demos break the moment they hit real data, real edge cases, or real cost constraints. Moving from 'demo that works 80% of the time' to 'production system that a business can rely on' is a different project entirely.

The real work is the unglamorous part: prompt engineering with version control, evaluation on gold-standard datasets, fallback logic for when the model gets it wrong, observability so you can debug failures, cost tracking so you don't wake up to a $30k monthly bill, integration with your real systems (CRM, database, APIs), and a UX that matches the actual confidence level the AI has.

That's where I work. I build AI agents that connect to your real data, do real work, and survive contact with production. Often the result looks less like a ChatGPT interface and more like an ordinary web app with AI quietly running specific decisions underneath.

You should call me if:
  • You have a specific repetitive task where AI could do most of the work — but you need it to be accurate, not just plausible
  • You've tried building with LangChain / CrewAI / AutoGen and the results don't survive real workloads
  • You want an internal AI assistant that actually knows your business — not just a ChatGPT wrapper
  • You need an agent to take actions (send email, update CRM, generate document) not just answer questions
  • Your team has tried AI features and they failed in edge cases — you need someone who can diagnose and fix
  • You're planning a product with AI at the core and need an engineer who has shipped AI, not just prototyped it

Agents that do real work, built to production standards.

Custom AI agents

Task-specific agents — lead qualification, customer support triage, document processing, data enrichment, research. Connected to your real systems, not sandbox toys.

LLM-powered internal tools

Web apps with AI running specific decisions underneath. Looks like an ordinary tool, behaves a lot smarter. Your team uses it without knowing or caring which model is behind it.

RAG & knowledge retrieval

AI assistants that actually know your company — grounded in your docs, wiki, tickets, and databases. Real citations, so factual answers trace back to sources instead of hallucinations.

Multi-model orchestration

Use the right model for each step. Cheap fast model for classification. Expensive slow model for complex reasoning. Your code doesn't care — the router decides.
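
In practice the router is a few lines of plain code. A minimal sketch, assuming LiteLLM as the client layer (it shows up in the stack below); the model names are placeholders, not recommendations:

    from litellm import completion

    # Map each pipeline step to a model tier. The mapping is illustrative;
    # the real one comes out of evaluation, not guesswork.
    ROUTES = {
        "classify": "openai/gpt-4o-mini",                    # cheap, fast
        "reason":   "anthropic/claude-3-5-sonnet-20241022",  # capable, slower
    }

    def run_step(step: str, messages: list[dict]) -> str:
        # Calling code names the step; the router picks the model.
        model = ROUTES.get(step, ROUTES["classify"])
        response = completion(model=model, messages=messages)
        return response.choices[0].message.content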

Evaluation & observability

Gold-standard test sets. Automated evaluation on every deploy. Full tracing of every AI call. Catch regressions before users do.

Full source code & ownership

Your agents, your prompts, your eval data, your deployment. No SaaS dependency beyond the model provider. Clean handoff if you want to operate it yourself.

Four phases. Eval-driven. No demo-ware.

01
1 week

Problem scoping

My part
Map the specific task. Identify where AI actually helps vs where deterministic code is better. Build the first eval set from real examples. Estimate feasibility honestly.
Your part
Share real data, real edge cases, and examples of both good and bad outcomes. The eval set is only as good as the examples you give me.
Deliverable
Problem specification + eval dataset v1 + feasibility report + fixed-price Phase 2 proposal
02
1-3 weeks

Prototype & evaluate

My part
Build the first working version. Measure against the eval set. Iterate on prompts, models, retrieval strategy. Show real numbers — not vibes.
Your part
Review outputs on real cases. Flag what's wrong, not just what's missing. Expand the eval set with failures we find.
Deliverable
Working prototype + evaluation results + production architecture proposal
03
3-8 weeks

Production build

My part
Integration with your real systems (CRM, DB, APIs). Cost controls. Observability. Fallback logic. Deploy to staging. Shadow traffic validation.
Your part
Coordinate integration with your team. Sign off on cost/quality tradeoffs. Test on real users / real workflows.
Deliverable
Production-ready system with full observability and evaluation pipeline
04
ongoing or clean exit

Launch & iterate

My part
Retained engineer for prompt updates, new cases, model upgrades, performance tuning — or handoff to your team.
Your part
Decide the shape of ongoing support based on how much the system needs to evolve.
Deliverable
Retained-engineer agreement OR handoff package with eval pipeline your team can run

Model-agnostic, framework-light, production-focused.

I avoid heavy frameworks that abstract away what the model actually sees. Prompts in version control. Real code, not drag-and-drop chains.

Models
GPT-4 class · Claude · Llama 3 · Qwen · Gemini · fine-tunes
Model choice depends on task. Frontier APIs for hard reasoning. Open models for bulk / sensitive work. Often both, routed per step.
Orchestration
Custom Python / TypeScript · LiteLLM · Pydantic AI · light LangChain
Usually custom code — more control, clearer failure modes. Frameworks only where they save real time without hiding important behavior.
RAG & retrieval
pgvector · Qdrant · LlamaIndex · custom ranking
Retrieval is often the bottleneck. I tune embeddings, ranking, and context composition — not just 'install a vector DB'.
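
For a sense of scale, the retrieval core itself is small. A sketch assuming a pgvector column and the OpenAI embeddings API, with hypothetical table and column names; the tuning around this query is where the time goes:

    import psycopg
    from openai import OpenAI

    client = OpenAI()

    def retrieve(question: str, k: int = 8) -> list[tuple]:
        # Embed the question with the same model used to index the docs.
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=question
        ).data[0].embedding
        with psycopg.connect("dbname=kb") as conn:
            # <=> is pgvector's cosine-distance operator.
            return conn.execute(
                "SELECT source_id, chunk FROM docs "
                "ORDER BY embedding <=> %s::vector LIMIT %s",
                (str(emb), k),
            ).fetchall()
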
Evaluation
Custom eval harnesses · LLM-as-judge · golden datasets · regression tests
You can't improve what you can't measure. Every agent has an eval set that runs on every deploy.
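
Concretely, the deploy gate can be as plain as one test over the golden set. In this sketch, agent() is a hypothetical import for the system under test, and the grader is exact match (real sets mix in LLM-as-judge for open-ended outputs):

    import json
    from myagent import agent  # hypothetical: the system under test

    def grade(output: str, expected: str) -> bool:
        return output.strip() == expected.strip()

    def test_agent_meets_accuracy_floor():
        with open("evals/golden.jsonl") as f:
            cases = [json.loads(line) for line in f]
        passed = sum(grade(agent(c["input"]), c["expected"]) for c in cases)
        accuracy = passed / len(cases)
        # The deploy fails here, before a user ever sees the regression.
        assert accuracy >= 0.92, f"accuracy {accuracy:.1%} below floor"
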
Observability
Langfuse · OpenTelemetry · custom tracing · Grafana
Trace every agent call. See prompts, outputs, costs, latency. Debug production by reading logs, not guessing.
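
The tracing itself doesn't need to be exotic. A sketch of the pattern: a decorator around every model call, with log_trace() standing in for whatever sink is in play (Langfuse, OpenTelemetry, or plain structured logs), and an OpenAI-shaped response assumed:

    import functools
    import time

    def traced(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            response = fn(*args, **kwargs)
            log_trace(                      # hypothetical sink
                call=fn.__name__,
                prompt=kwargs.get("messages"),
                output=response.choices[0].message.content,
                usage=response.usage,       # token counts, hence cost
                latency_ms=(time.perf_counter() - start) * 1000,
            )
            return response
        return wrapper
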
Integration
OpenAI-compatible APIs · webhooks · queue workers · scheduled jobs
Agents live inside real systems. CRM, ticketing, email, databases — the AI is one component, not the whole app.
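
A sketch of what that looks like in practice: the agent as a queue worker fed by a webhook, assuming Redis as the broker. Queue names and the handler are hypothetical:

    import json
    import redis

    r = redis.Redis()

    def worker():
        while True:
            # A CRM webhook (or a scheduled job) pushes work onto this queue.
            _, raw = r.blpop("agent:jobs")
            job = json.loads(raw)
            result = handle_lead(job)  # hypothetical agent entry point
            r.lpush("agent:results", json.dumps(result))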

Who this works for.

Good fit if:

  • You have a specific task — not 'add AI somewhere in the product'
  • You understand AI gets things wrong sometimes and need someone to engineer around that, not hand-wave it
  • You have real data to build an eval set from (hundreds of examples, ideally thousands)
  • Your budget is $20k+ for a proper agent build — honest AI work is not cheap
  • You want the agent integrated into your existing systems, not a standalone chatbot
  • You care about cost, quality, and ownership — not just the fanciest demo

Not a fit if:

  • You want a ChatGPT-style chat widget on your website — that's a product someone already sells, not a custom build
  • You have no eval data and no way to generate any — AI without evaluation is guessing in public
  • You need an agent that's 'creative' and 'surprising' — my work is about reliability, not novelty
  • Your use case is content generation for SEO spam — not my work
  • You expect 100% accuracy — no AI system hits that, and if someone promises it they're lying

Every engagement scoped individually.

Discovery is free. Problem scoping (Phase 1) is fixed-price and produces a go / no-go answer whether or not we continue.

Scoping
$5k – $10k · 1 week
Problem specification, eval dataset v1, feasibility report. You get a written answer whether the problem is AI-solvable at the quality you need.
Build
$25k – $80k · 4-10 weeks
Working production agent integrated with your systems — evaluation pipeline included. Most single-purpose agents land here.
Platform
$80k+ · 2-6 months
Multi-agent system, complex orchestration, or AI at the core of a product. Typically phased across multiple quarters.
AI work is evaluation-heavy. Half the cost of a good agent is often the eval infrastructure and the iteration cycle. Skimp on evaluation and you ship something that embarrasses you in production.

I run AI agents in my own business every day.

Production agents · Custom orchestration · Evaluation · Observability
Problem
My own consulting practice depends on AI doing real work — content generation, code review, research aggregation, internal assistants. Generic ChatGPT UX didn't cut it.
Approach
Built custom orchestration that routes work across models, uses RAG on my own knowledge base, maintains eval sets for each type of task, and runs in production 24/7. Costs tracked per call. Outputs traced and logged.
Outcome
Working AI infrastructure in daily use. Models upgraded, prompts improved, eval sets growing — because it's my operation on the line. This is what I build for clients — tested first on my own business.

Dmytro Klymentiev

Independent senior engineer. 20+ years of software engineering, five of them in production AI — across agents, RAG, custom integrations, self-hosted deployments.

I'm an engineer who discovered AI late, by the standards of the hype cycle — meaning I came in with twenty years of discipline around production code, evaluation, and observability. That turns out to be exactly the skill set AI applications need.

What I build isn't glamorous. It's agents that handle lead intake for real businesses, RAG systems that actually return correct answers, workflows where LLMs are one trusted component in a larger deterministic system. The stuff that makes a business operation better, not the stuff that goes on a demo reel.

Based in Chicago. Working worldwide. Direct contracts or through a US entity.

Questions you might have.

Why custom AI instead of using an existing product?

Custom makes sense when your workflow is specific enough that no product fits, or when your data is sensitive enough that you can't use SaaS AI. For generic use cases — just writing, just meeting notes, just summarization — use the product. Custom for everything else.

Which models should we use?

Depends on the task. Most production agents use 2-3 models: a cheap fast one for routing/classification, a capable one for the main work, maybe a frontier model for edge cases. I'll recommend the mix based on your workload and budget.

How do you handle AI getting things wrong?

Evaluation first. I build eval sets from real data, measure accuracy honestly, and design fallback logic for failures — retry, escalate to human, route to different model, or refuse to act. No 'trust the model and hope'.
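
As a sketch, the fallback ladder is ordinary control flow. Here run_model(), validate(), and escalate_to_human() are hypothetical hooks, and the model names are placeholders:

    def answer_with_fallback(question: str) -> str:
        # Try the cheap model first, then a stronger one.
        for model in ("cheap-model", "stronger-model"):
            draft = run_model(model, question)
            if validate(draft):  # schema check, citation check, or judge
                return draft
        # Refuse to act rather than guess.
        return escalate_to_human(question)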

What about hallucinations?

For factual tasks (RAG, data extraction), I ground models in your actual data with citations. For generative tasks (drafting, brainstorming), hallucination is sometimes the feature — we constrain where it matters, accept where it doesn't. Evaluation tells us which is which.
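
For the factual side, the constraint often lives in the prompt itself. A sketch; the wording is illustrative and the refusal token is a per-project convention:

    GROUNDED_PROMPT = """Answer using ONLY the sources below.
    Cite every claim as [source_id]. If the sources do not contain
    the answer, reply exactly: INSUFFICIENT_SOURCES.

    Sources:
    {sources}

    Question: {question}"""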

Can the agent take actions on its own?

Yes — with guardrails. I typically design tiered autonomy: read-only actions automatic, write actions staged for review, high-stakes actions always human-approved. Where to draw those lines depends on your risk tolerance and the specific task.
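
A sketch of how the tiers land in code. The actions, defaults, and helpers are illustrative; the real table comes out of a risk conversation with you:

    AUTO, REVIEW, HUMAN_ONLY = "auto", "review", "human_only"

    TIERS = {
        "crm.read":       AUTO,        # read-only: runs automatically
        "email.send":     REVIEW,      # write: staged for human review
        "payment.refund": HUMAN_ONLY,  # high-stakes: human approves first
    }

    def execute(action: str, payload: dict):
        tier = TIERS.get(action, HUMAN_ONLY)  # unknown actions escalate
        if tier == AUTO:
            return run(action, payload)       # hypothetical executor
        return stage_for_review(action, payload,
                                block=(tier == HUMAN_ONLY))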

Do you use LangChain / CrewAI / AutoGen?

Rarely. I prefer plain code for orchestration — easier to debug, less hidden behavior, faster to change. Frameworks make sense when they save real time without hiding important choices; usually they don't.

What ongoing work does an agent need?

Prompt updates as models change. New examples added to eval sets. Performance tuning. New edge cases as usage grows. Typically 10-30% of the initial build cost per year as a retained engineer — or nothing if your team takes it over internally.

Will you sign an NDA?

Yes. Standard mutual NDA before anything sensitive is discussed.

What I guarantee, in writing.

01

Eval-driven development

Every agent has an evaluation set. I measure accuracy on real data before claiming anything works.

02

Full observability

Every AI call traced — prompt, output, latency, cost. You debug by reading logs, not guessing.

03

Honest feasibility

If a task isn't solvable at the quality level you need, I'll tell you in Phase 1 and refund if you want out. Better than an expensive failure.

04

No framework lock-in

Custom code, not proprietary platforms. Your agent doesn't depend on me or my tools continuing to exist.

05

Cost controls from day one

Per-call budgets, usage alerts, rate limits. No surprise bills.
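
A sketch of the budget gate. The helpers are hypothetical; the point is that the call fails closed instead of quietly running up a bill:

    DAILY_BUDGET_USD = 50.0  # illustrative limit

    def guarded_call(fn, *args, **kwargs):
        if spend_today() >= DAILY_BUDGET_USD:   # hypothetical usage store
            alert_ops("LLM budget exhausted")   # the alert fires first
            raise RuntimeError("budget exceeded; call blocked")
        response = fn(*args, **kwargs)
        record_spend(estimate_cost(response))   # tokens -> dollars
        return response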

06

Human-in-the-loop where it matters

I don't ship fully autonomous agents on high-stakes actions. Review gates are a feature, not a limitation.

Thirty minutes. Bring a specific task.

Discovery calls are free. Come with an actual problem — a workflow you'd like to automate, a decision you'd like AI to make, a tool you'd like your team to have. We'll work through whether AI is the right answer, and what it would take.

Last updated: 2026-04-23 · by Dmytro Klymentiev