
Evaluating Production LLM Agents: Beyond Unit Tests

Agent failures usually happen at the span level, not in the final output. This guide covers RAGAS metrics, span level evaluation, LangSmith setup, and target scores for 2026.


Part 01 · The Core Problem

Why evaluating agents differs from evaluating LLM calls

A single LLM call either answers the question well or it does not. An agent run makes 20 to 100 decisions in sequence. A failure at step 7 can produce a final output that looks plausible but is completely wrong.

Quick Answer

Short answer: agent evaluation has to happen at the span level (every tool call, retrieval decision, and reasoning step), not only on the final output. Output evaluation catches failures only after they have already propagated through the pipeline.

The standard for chatbot evaluation (does the output answer the question, is it factually correct, does it match the style guide) is insufficient for agents. An agent that retrieves the wrong document, calls the right tool with the wrong parameters, or misclassifies user intent at step 3 will often still produce a confident-looking final output. By the time you evaluate that output, the error has already propagated through the remaining steps.

Roughly half of agentic AI projects are predicted to be cancelled in 2026 for lack of proper evaluation infrastructure. Teams ship, get inconsistent results, cannot diagnose the cause, and lose trust in the system. The fix is not a better model; it is better measurement at the step level.

Part 02 · Failure Categories

The three failure categories you must measure

Retrieval failures

The agent retrieves the wrong documents, retrieves too few, or retrieves contextually irrelevant chunks. Downstream reasoning is then built on bad information. RAGAS context precision and context recall measure this. Target context precision above 0.80 and context recall above 0.75.

Reasoning failures

The agent has the right context but draws the wrong conclusion, misclassifies intent, or picks the wrong tool for the task. These failures are hard to measure automatically and usually require a separate judge model or a curated evaluation dataset with known-correct reasoning paths.
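Where a judge model is the practical route, a minimal sketch looks like the following. It assumes the OpenAI Python SDK; the judge model name, the rubric, and the PASS/FAIL protocol are illustrative choices, not a fixed standard.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading a single reasoning step of an agent.
Question: {question}
Retrieved context: {context}
Agent's reasoning step: {step}
Known-correct reasoning path: {reference}

Reply PASS if the step is consistent with the reference path, otherwise FAIL,
followed by one sentence of justification."""

def judge_reasoning_step(question: str, context: str, step: str, reference: str) -> bool:
    # Temperature 0 keeps the judge as deterministic as the model allows.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative judge model; pick your own
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, context=context, step=step, reference=reference
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("PASS")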

Action failures

The agent calls the right tool with the wrong parameters, calls the wrong tool, or takes an action that is technically valid but contextually inappropriate. The only way to catch these consistently is span level logging of every tool call: its parameters, its return value, and the agent's next reasoning step.
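As a plain-library illustration of that logging (tracing platforms like LangSmith do this for you, see Part 04), here is a minimal sketch; the span fields and the print-based sink are placeholders for a real trace store.

import json
import time
import uuid

def run_tool_with_span(tool_name: str, params: dict, fn) -> object:
    # Wraps one tool call in a span: parameters in, return value out,
    # plus latency and error status.
    span = {"span_id": str(uuid.uuid4()), "name": tool_name, "params": params}
    start = time.perf_counter()
    try:
        result = fn(**params)
        span["status"] = "ok"
        span["output"] = repr(result)
        return result
    except Exception as exc:
        span["status"] = "error"
        span["error"] = str(exc)
        raise
    finally:
        span["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        print(json.dumps(span))  # placeholder sink; ship to your trace store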

Part 03 · RAGAS Metrics

Five RAGAS metrics for production RAG agents

RAGAS production metrics: definitions and targets
Metric | What it measures | Target
Faithfulness | Claims in the answer are supported by the retrieved context | Above 0.90
Answer relevancy | The answer addresses what the question actually asked | Above 0.85
Context precision | Retrieved chunks are relevant to the question | Above 0.80
Context recall | All information needed for the answer was retrieved | Above 0.75
Answer correctness | The answer is factually accurate against ground truth | Above 0.80

RAGAS runs without ground truth labels for faithfulness, answer relevancy, and context precision. That makes it practical to run on live production traffic, where you do not have a human-verified correct answer for every query. Context recall and answer correctness require ground truth, so use them on a curated evaluation set during development rather than on live traffic.
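A minimal development-time sketch, assuming the classic ragas evaluate API (imports and column names vary across ragas versions); the sample row is illustrative.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    context_recall,
    faithfulness,
)

# A curated evaluation set: the ground_truth column enables context recall
# and answer correctness on top of the reference-free metrics.
eval_set = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Policy: refunds are available within 30 days of purchase."]],
    "ground_truth": ["Customers may request a refund within 30 days of purchase."],
})

result = evaluate(
    eval_set,
    metrics=[faithfulness, answer_relevancy, context_recall, answer_correctness],
)
print(result)  # compare each score against the targets in the table above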

Part 04 · Span level evaluation

Measure at the step, not the output

Span level evaluation logs every intermediate step of an agent run as a named span, with its inputs, outputs, latency, and token cost. For LangGraph based agents, LangSmith captures this by default.

Every tool call is a span. Every retrieval is a span. Every reasoning step is a span. When an agent run produces a wrong result, you open the trace in LangSmith, find the span where the error originated, and read the inputs, outputs, and context at that step directly. You do not guess; you look.

This is the property that separates debuggable production systems from brittle ones. Without span level observability, a wrong agent output is a mystery. With it, a wrong output becomes a single span you can identify, reproduce, and fix.

Span level evaluation flow: each agent step (retrieval, reasoning, tool call) is logged as a named span. RAGAS and judge models evaluate spans asynchronously. Dashboards surface threshold violations.
Span level evaluation catches failures at the exact step where they originate. Output evaluation sees only the final result, after the failure has propagated.
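For steps LangSmith does not pick up automatically, the langsmith SDK's traceable decorator turns a function into a named span. A minimal sketch, assuming the documented decorator; the function bodies are placeholders.

from langsmith import traceable

@traceable(run_type="retriever", name="retrieve_docs")
def retrieve_docs(query: str) -> list[str]:
    # Placeholder retrieval; inputs, outputs, and latency are recorded
    # on the span automatically.
    return ["chunk about refund policy"]

@traceable(run_type="tool", name="lookup_order")
def lookup_order(order_id: str) -> dict:
    # Each call becomes one span in the trace, so a bad parameter here
    # is visible at exactly this step.
    return {"order_id": order_id, "status": "shipped"}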

Part 05 · Evaluation Stack

LangSmith plus RAGAS plus DeepEval: the 2026 production stack

LangSmith for observability

Captures every span automatically for LangGraph based agents. Stores traces. Supports RAGAS integration. Lets you run evaluators on live traffic samples and historical traces. The minimum viable setup for any production agent.
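A minimal setup sketch, assuming LangSmith's documented environment variables; the project name is a placeholder.

import os

# With these set, LangChain / LangGraph runs are traced to LangSmith
# without further instrumentation code.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "prod-agent"  # placeholder project name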

RAGAS for retrieval quality

Reference-free metrics for faithfulness, answer relevancy, and context precision on live traffic. Run them asynchronously on a 5 to 10 percent sample of production queries. Alert when a metric drops below its threshold.
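A hedged sketch of that sampling loop, again assuming the classic ragas API; send_alert is a placeholder for your paging or Slack hook, and the whole function is meant to run in a background worker, off the response path.

import random

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

SAMPLE_RATE = 0.10          # evaluate roughly 10 percent of production queries
FAITHFULNESS_TARGET = 0.90
RELEVANCY_TARGET = 0.85

def send_alert(message: str) -> None:
    print("ALERT:", message)  # placeholder; wire to PagerDuty / Slack

def maybe_evaluate(question: str, answer: str, contexts: list[str]) -> None:
    if random.random() > SAMPLE_RATE:
        return
    sample = Dataset.from_dict({
        "question": [question], "answer": [answer], "contexts": [contexts],
    })
    scores = evaluate(sample, metrics=[faithfulness, answer_relevancy])
    if scores["faithfulness"] < FAITHFULNESS_TARGET:
        send_alert(f"faithfulness {scores['faithfulness']:.2f} below 0.90")
    if scores["answer_relevancy"] < RELEVANCY_TARGET:
        send_alert(f"answer relevancy {scores['answer_relevancy']:.2f} below 0.85")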

DeepEval for behavioral testing

A test suite framework for evaluating agent behavior against curated datasets. Run it in CI/CD on every deployment to catch regressions before they reach production. Covers hallucination detection, robustness against prompt injection, and custom behavioral metrics.
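A minimal DeepEval sketch, assuming its documented pytest-style API; the test case content and the threshold are illustrative.

from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def test_agent_does_not_hallucinate_refund_policy():
    test_case = LLMTestCase(
        input="What is the refund window?",
        actual_output="Refunds are accepted within 30 days of purchase.",
        context=["Policy: refunds are available within 30 days of purchase."],
    )
    # Fails the deployment if the hallucination score exceeds the threshold.
    assert_test(test_case, [HallucinationMetric(threshold=0.5)])

Run it in the deployment pipeline with deepeval test run (or plain pytest) so a regression blocks the release.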

Part 06 · Production Checklist

The minimum evaluation setup before you ship

Production evaluation checklist for LLM agents
Requirement | Tool | Frequency
Span level tracing for all agent runs | LangSmith | Always on
Faithfulness above 0.90 | RAGAS via LangSmith | Async, 10 percent sample
Answer relevancy above 0.85 | RAGAS via LangSmith | Async, 10 percent sample
Behavioral regression tests | DeepEval in CI/CD | Every deployment
Tool call schema validation | Custom validator in the pipeline | Every tool call
Human review queue for low-confidence runs | LangSmith dataset | Weekly
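For the tool call schema validation row, a minimal sketch using pydantic; the tool name and its fields are hypothetical examples.

from pydantic import BaseModel, Field, ValidationError

class SearchDocsArgs(BaseModel):
    # Hypothetical schema for a "search_docs" tool.
    query: str = Field(min_length=1)
    top_k: int = Field(default=5, ge=1, le=50)

TOOL_SCHEMAS: dict[str, type[BaseModel]] = {"search_docs": SearchDocsArgs}

def validate_tool_call(tool_name: str, raw_args: dict) -> BaseModel:
    # Reject unknown tools and malformed parameters before execution, so
    # action failures are caught at the call site rather than downstream.
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        raise ValueError(f"unknown tool: {tool_name}")
    try:
        return schema(**raw_args)
    except ValidationError as exc:
        raise ValueError(f"invalid arguments for {tool_name}: {exc}") from exc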

FAQ

Frequently Asked Questions

How do I evaluate AI agents in production?

Run span level tracing to capture every intermediate step, tool call, and retrieval decision. Use RAGAS metrics asynchronously on a sample of live traffic to monitor faithfulness and answer relevancy. Run behavioral regression tests with DeepEval on every deployment. Do not block the response pipeline on evaluation; run it asynchronously.

What is span level evaluation for LLM agents?

Span level evaluation logs every intermediate step of an agent run (every tool call, retrieval step, and reasoning step) as a named span with its inputs, outputs, and context. Evaluating at the span level lets you identify exactly which step introduced an error, instead of reverse-engineering it from the final output.

Which RAGAS metrics should I use for a production RAG agent?

Start with faithfulness and answer relevancy; both are reference-free and can run on live traffic without ground truth labels. Target faithfulness above 0.90 and answer relevancy above 0.85. Add context precision and context recall with a curated evaluation dataset to measure retrieval quality specifically.

Is LangSmith the best evaluation tool for LangGraph agents?

LangSmith is the most integrated option for LangGraph based agents: it captures spans automatically without instrumentation code, supports RAGAS integration natively, and provides a dataset interface for running evaluations on historical traces. For teams on other frameworks, Arize Phoenix and Langfuse are strong alternatives with similar capabilities.
