Custom agents · LLM workflows · Production-grade · Evaluation · Observability · Cost control · Remote · Available across US / EU timezones · Based in Chicago, USA · Worldwide delivery

Software that actually does work. Not just talks about it.

Custom AI agents, LLM-powered internal tools, workflows where the model is one component in a real system. I build AI applications that solve specific business problems — not generic chatbots, not demos that fall apart in production.

The gap between AI demos and AI that ships.

Everyone has seen the demo: ask an AI agent to book a flight or summarize a doc, watch it succeed. Most of those demos break the moment they hit real data, real edge cases, or real cost constraints. Moving from 'demo that works 80% of the time' to 'production system that a business can rely on' is a different project entirely.

The real work is the unglamorous part: prompt engineering with version control, evaluation on gold-standard datasets, fallback logic for when the model gets it wrong, observability so you can debug failures, cost tracking so you don't wake up to a $30k monthly bill, integration with your real systems (CRM, database, APIs), and a UX that matches the actual confidence level the AI has.

That's where I work. I build AI agents that connect to your real data, do real work, and survive contact with production. Often the result looks less like a ChatGPT interface and more like an ordinary web app with AI quietly running specific decisions underneath.

You should call me if:
  • You have a specific repetitive task where AI could do most of the work — but you need it to be accurate, not just plausible
  • You've tried building with LangChain / CrewAI / AutoGen and the results don't survive real workloads
  • You want an internal AI assistant that actually knows your business — not just a ChatGPT wrapper
  • You need an agent to take actions (send email, update CRM, generate document) not just answer questions
  • Your team has tried AI features and they failed in edge cases — you need someone who can diagnose and fix
  • You're planning a product with AI at the core and need an engineer who has shipped AI, not just prototyped it

Agents that do real work, built to production standards.

Custom AI agents

Task-specific agents — lead qualification, customer support triage, document processing, data enrichment, research. Connected to your real systems, not sandbox toys.

LLM-powered internal tools

Web apps with AI running specific decisions underneath. Looks like an ordinary tool, behaves a lot smarter. Your team uses it without knowing or caring which model is behind it.

RAG & knowledge retrieval

AI assistants that actually know your company — grounded in your docs, wiki, tickets, and databases. Real citations, so factual answers trace back to sources instead of hallucinations.

Multi-model orchestration

Use the right model for each step. Cheap fast model for classification. Expensive slow model for complex reasoning. Your code doesn't care — the router decides.
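
In practice the router is a few lines of plain code. A minimal sketch, assuming LiteLLM as the client layer (it shows up in the stack below); the model names are placeholders, not recommendations:

    from litellm import completion

    # Map each pipeline step to a model tier. The mapping is illustrative;
    # the real one comes out of evaluation, not guesswork.
    ROUTES = {
        "classify": "openai/gpt-4o-mini",                    # cheap, fast
        "reason":   "anthropic/claude-3-5-sonnet-20241022",  # capable, slower
    }

    def run_step(step: str, messages: list[dict]) -> str:
        # Calling code names the step; the router picks the model.
        model = ROUTES.get(step, ROUTES["classify"])
        response = completion(model=model, messages=messages)
        return response.choices[0].message.content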

Evaluation & observability

Gold-standard test sets. Automated evaluation on every deploy. Full tracing of every AI call. Catch regressions before users do.

Full source code & ownership

Your agents, your prompts, your eval data, your deployment. No SaaS dependency beyond the model provider. Clean handoff if you want to operate it yourself.

Four phases. Eval-driven. No demo-ware.

01
1 week

Problem scoping

My part
Map the specific task. Identify where AI actually helps vs where deterministic code is better. Build the first eval set from real examples. Estimate feasibility honestly.
Your part
Share real data, real edge cases, and examples of both good and bad outcomes. The eval set is only as good as the examples you give me.
Deliverable
Problem specification + eval dataset v1 + feasibility report + fixed-price Phase 2 proposal
02
1-3 weeks

Prototype & evaluate

My part
Build the first working version. Measure against the eval set. Iterate on prompts, models, retrieval strategy. Show real numbers — not vibes.
Your part
Review outputs on real cases. Flag what's wrong, not just what's missing. Expand the eval set with failures we find.
Deliverable
Working prototype + evaluation results + production architecture proposal
03
3-8 weeks

Production build

My part
Integration with your real systems (CRM, DB, APIs). Cost controls. Observability. Fallback logic. Deploy to staging. Shadow traffic validation.
Your part
Coordinate integration with your team. Sign off on cost/quality tradeoffs. Test on real users / real workflows.
Deliverable
Production-ready system with full observability and evaluation pipeline
04
ongoing or clean exit

Launch & iterate

My part
Retained engineer for prompt updates, new cases, model upgrades, performance tuning — or handoff to your team.
Your part
Decide the shape of ongoing support based on how much the system needs to evolve.
Deliverable
Retained-engineer agreement OR handoff package with eval pipeline your team can run

Model-agnostic, framework-light, production-focused.

I avoid heavy frameworks that abstract away what the model actually sees. Prompts in version control. Real code, not drag-and-drop chains.

Models
GPT-4 class · Claude · Llama 3 · Qwen · Gemini · fine-tunes
Model choice depends on task. Frontier APIs for hard reasoning. Open models for bulk / sensitive work. Often both, routed per step.
Orchestration
Custom Python / TypeScript · LiteLLM · Pydantic AI · light LangChain
Usually custom code — more control, clearer failure modes. Frameworks only where they save real time without hiding important behavior.
RAG & retrieval
pgvector · Qdrant · LlamaIndex · custom ranking
Retrieval is often the bottleneck. I tune embeddings, ranking, and context composition — not just 'install a vector DB'.
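
For a sense of scale, the retrieval core itself is small. A sketch assuming a pgvector column and the OpenAI embeddings API, with hypothetical table and column names; the tuning around this query is where the time goes:

    import psycopg
    from openai import OpenAI

    client = OpenAI()

    def retrieve(question: str, k: int = 8) -> list[tuple]:
        # Embed the question with the same model used to index the docs.
        emb = client.embeddings.create(
            model="text-embedding-3-small", input=question
        ).data[0].embedding
        with psycopg.connect("dbname=kb") as conn:
            # <=> is pgvector's cosine-distance operator.
            return conn.execute(
                "SELECT source_id, chunk FROM docs "
                "ORDER BY embedding <=> %s::vector LIMIT %s",
                (str(emb), k),
            ).fetchall()
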
Evaluation
Custom eval harnesses · LLM-as-judge · golden datasets · regression tests
You can't improve what you can't measure. Every agent has an eval set that runs on every deploy.
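
Concretely, the deploy gate can be as plain as one test over the golden set. In this sketch, agent() is a hypothetical import for the system under test, and the grader is exact match (real sets mix in LLM-as-judge for open-ended outputs):

    import json
    from myagent import agent  # hypothetical: the system under test

    def grade(output: str, expected: str) -> bool:
        return output.strip() == expected.strip()

    def test_agent_meets_accuracy_floor():
        with open("evals/golden.jsonl") as f:
            cases = [json.loads(line) for line in f]
        passed = sum(grade(agent(c["input"]), c["expected"]) for c in cases)
        accuracy = passed / len(cases)
        # The deploy fails here, before a user ever sees the regression.
        assert accuracy >= 0.92, f"accuracy {accuracy:.1%} below floor"
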
Observability
Langfuse · OpenTelemetry · custom tracing · Grafana
Trace every agent call. See prompts, outputs, costs, latency. Debug production by reading logs, not guessing.
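
The tracing itself doesn't need to be exotic. A sketch of the pattern: a decorator around every model call, with log_trace() standing in for whatever sink is in play (Langfuse, OpenTelemetry, or plain structured logs), and an OpenAI-shaped response assumed:

    import functools
    import time

    def traced(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            response = fn(*args, **kwargs)
            log_trace(                      # hypothetical sink
                call=fn.__name__,
                prompt=kwargs.get("messages"),
                output=response.choices[0].message.content,
                usage=response.usage,       # token counts, hence cost
                latency_ms=(time.perf_counter() - start) * 1000,
            )
            return response
        return wrapper
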
Integration
OpenAI-compatible APIs · webhooks · queue workers · scheduled jobs
Agents live inside real systems. CRM, ticketing, email, databases — the AI is one component, not the whole app.
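
A sketch of what that looks like in practice: the agent as a queue worker fed by a webhook, assuming Redis as the broker. Queue names and the handler are hypothetical:

    import json
    import redis

    r = redis.Redis()

    def worker():
        while True:
            # A CRM webhook (or a scheduled job) pushes work onto this queue.
            _, raw = r.blpop("agent:jobs")
            job = json.loads(raw)
            result = handle_lead(job)  # hypothetical agent entry point
            r.lpush("agent:results", json.dumps(result))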

Who this works for.

Good fit if:

  • You have a specific task — not 'add AI somewhere in the product'
  • You understand AI gets things wrong sometimes and need someone to engineer around that, not hand-wave it
  • You have real data to build an eval set from (hundreds of examples, ideally thousands)
  • Your budget is $20k+ for a proper agent build — honest AI work is not cheap
  • You want the agent integrated into your existing systems, not a standalone chatbot
  • You care about cost, quality, and ownership — not just the fanciest demo

Not a fit if:

  • You want a ChatGPT-style chat widget on your website — that's a product someone already sells, not a custom build
  • You have no eval data and no way to generate any — AI without evaluation is guessing in public
  • You need an agent that's 'creative' and 'surprising' — my work is about reliability, not novelty
  • Your use case is content generation for SEO spam — not my work
  • You expect 100% accuracy — no AI system hits that, and if someone promises it they're lying

Every engagement scoped individually.

Discovery is free. Problem scoping (Phase 1) is fixed-price and produces a go / no-go answer whether or not we continue.

Scoping
$5k – $10k · 1 week
Problem specification, eval dataset v1, feasibility report. You get a written answer whether the problem is AI-solvable at the quality you need.
Build
$25k – $80k · 4-10 weeks
Working production agent integrated with your systems — evaluation pipeline included. Most single-purpose agents land here.
Platform
$80k+ · 2-6 months
Multi-agent system, complex orchestration, or AI at the core of a product. Typically phased across multiple quarters.
AI work is evaluation-heavy. Half the cost of a good agent is often the eval infrastructure and the iteration cycle. Skimp on evaluation and you ship something that embarrasses you in production.

I run AI agents in my own business every day.

Production agents · Custom orchestration · Evaluation · Observability
Problem
My own consulting practice depends on AI doing real work — content generation, code review, research aggregation, internal assistants. Generic ChatGPT UX didn't cut it.
Approach
Built custom orchestration that routes work across models, uses RAG on my own knowledge base, maintains eval sets for each type of task, and runs in production 24/7. Costs tracked per call. Outputs traced and logged.
Outcome
Working AI infrastructure in daily use. Models upgraded, prompts improved, eval sets growing — because it's my operation on the line. This is what I build for clients — tested first on my own business.

Dmytro Klymentiev

Independent senior engineer. 20+ years of software engineering, five of them in production AI — across agents, RAG, custom integrations, self-hosted deployments.

I'm an engineer who discovered AI late, by the standards of the hype cycle — meaning I came in with twenty years of discipline around production code, evaluation, and observability. That turns out to be exactly the skill set AI applications need.

What I build isn't glamorous. It's agents that handle lead intake for real businesses, RAG systems that actually return correct answers, workflows where LLMs are one trusted component in a larger deterministic system. The stuff that makes a business operation better, not the stuff that goes on a demo reel.

Based in Chicago. Working worldwide. Direct contracts or through a US entity.

Questions you might have.

Why custom AI instead of using an existing product?

Custom makes sense when your workflow is specific enough that no product fits, or when your data is sensitive enough that you can't use SaaS AI. For generic use cases — just writing, just meeting notes, just summarization — use the product. Custom for everything else.

Which models should we use?

Depends on the task. Most production agents use 2-3 models: a cheap fast one for routing/classification, a capable one for the main work, maybe a frontier model for edge cases. I'll recommend the mix based on your workload and budget.

How do you handle AI getting things wrong?

Evaluation first. I build eval sets from real data, measure accuracy honestly, and design fallback logic for failures — retry, escalate to human, route to different model, or refuse to act. No 'trust the model and hope'.
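
As a sketch, the fallback ladder is ordinary control flow. Here run_model(), validate(), and escalate_to_human() are hypothetical hooks, and the model names are placeholders:

    def answer_with_fallback(question: str) -> str:
        # Try the cheap model first, then a stronger one.
        for model in ("cheap-model", "stronger-model"):
            draft = run_model(model, question)
            if validate(draft):  # schema check, citation check, or judge
                return draft
        # Refuse to act rather than guess.
        return escalate_to_human(question)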

What about hallucinations?

For factual tasks (RAG, data extraction), I ground models in your actual data with citations. For generative tasks (drafting, brainstorming), hallucination is sometimes the feature — we constrain where it matters, accept where it doesn't. Evaluation tells us which is which.
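
For the factual side, the constraint often lives in the prompt itself. A sketch; the wording is illustrative and the refusal token is a per-project convention:

    GROUNDED_PROMPT = """Answer using ONLY the sources below.
    Cite every claim as [source_id]. If the sources do not contain
    the answer, reply exactly: INSUFFICIENT_SOURCES.

    Sources:
    {sources}

    Question: {question}"""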

Can the agent take actions on its own?

Yes — with guardrails. I typically design tiered autonomy: read-only actions automatic, write actions staged for review, high-stakes actions always human-approved. Where to draw those lines depends on your risk tolerance and the specific task.
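
A sketch of how the tiers land in code. The actions, defaults, and helpers are illustrative; the real table comes out of a risk conversation with you:

    AUTO, REVIEW, HUMAN_ONLY = "auto", "review", "human_only"

    TIERS = {
        "crm.read":       AUTO,        # read-only: runs automatically
        "email.send":     REVIEW,      # write: staged for human review
        "payment.refund": HUMAN_ONLY,  # high-stakes: human approves first
    }

    def execute(action: str, payload: dict):
        tier = TIERS.get(action, HUMAN_ONLY)  # unknown actions escalate
        if tier == AUTO:
            return run(action, payload)       # hypothetical executor
        return stage_for_review(action, payload,
                                block=(tier == HUMAN_ONLY))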

Do you use LangChain / CrewAI / AutoGen?

Rarely. I prefer plain code for orchestration — easier to debug, less hidden behavior, faster to change. Frameworks make sense when they save real time without hiding important choices; usually they don't.

What ongoing work does an agent need?

Prompt updates as models change. New examples added to eval sets. Performance tuning. New edge cases as usage grows. Typically 10-30% of the initial build cost per year as a retained engineer — or nothing if your team takes it over internally.

Will you sign an NDA?

Yes. Standard mutual NDA before anything sensitive is discussed.

What I guarantee, in writing.

01

Eval-driven development

Every agent has an evaluation set. I measure accuracy on real data before claiming anything works.

02

Full observability

Every AI call traced — prompt, output, latency, cost. You debug by reading logs, not guessing.

03

Honest feasibility

If a task isn't solvable at the quality level you need, I'll tell you in Phase 1 and refund if you want out. Better than an expensive failure.

04

No framework lock-in

Custom code, not proprietary platforms. Your agent doesn't depend on me or my tools continuing to exist.

05

Cost controls from day one

Per-call budgets, usage alerts, rate limits. No surprise bills.
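
A sketch of the budget gate. The helpers are hypothetical; the point is that the call fails closed instead of quietly running up a bill:

    DAILY_BUDGET_USD = 50.0  # illustrative limit

    def guarded_call(fn, *args, **kwargs):
        if spend_today() >= DAILY_BUDGET_USD:   # hypothetical usage store
            alert_ops("LLM budget exhausted")   # the alert fires first
            raise RuntimeError("budget exceeded; call blocked")
        response = fn(*args, **kwargs)
        record_spend(estimate_cost(response))   # tokens -> dollars
        return response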

06

Human-in-the-loop where it matters

I don't ship fully autonomous agents on high-stakes actions. Review gates are a feature, not a limitation.

Thirty minutes. Bring a specific task.

Discovery calls are free. Come with an actual problem — a workflow you'd like to automate, a decision you'd like AI to make, a tool you'd like your team to have. We'll work through whether AI is the right answer, and what it would take.

Last updated: 2026-04-23 · by Dmytro Klymentiev