Production LLM Agents Ki Evaluation: Unit Tests Se Aage
Agent ki failures aksar final output par nahi balke span level par hoti hain. Yeh guide RAGAS metrics, span level evaluation, LangSmith setup aur 2026 ke target scores cover karti hai.
Hissa 01 · Markazi Masla
Agents ki evaluation LLM calls ki evaluation se kyun mukhtalif hai
Ek waahid LLM call ya to sawal ka acha jawab deta hai ya nahi. Ek agent run sequence mein 20 se 100 decisions leta hai. Step 7 par failure aisa final output de sakti hai jo dekhne mein qabil-e-yaqeen ho lekin mukammal taur par ghalat ho.
Foran Jawab
Chhota jawab: Agent evaluation ko span level par hona chahiye — har tool call, retrieval decision aur reasoning step — sirf final output par nahi. Output evaluation failures ko tab pakarta hai jab woh pehle hi pipeline mein phail chuki hoti hain.
Chatbot evaluation ka standard — kya output sawal ka jawab deta hai, kya woh factually drust hai, kya woh style guide se milta hai — agents ke liye nakaafi hai. Ek agent jo ghalat document retrieve kare, sahi tool ko ghalat parameters ke saath call kare, ya step 3 par user intent ko misclassify kare — aksar pur-aitmaad lagne wala final output de dega. Jab tak aap output evaluate karenge, error baqi steps mein pehle hi phail chuki hogi.
Munaasib evaluation infrastructure ki kami ki wajah se 2026 mein taqreeban aadhe agentic AI projects cancel hone ki peshangoi hai. Teams ship karti hain, inconsistent results paati hain, wajah diagnose nahi kar paatin aur system par bharosa kho deti hain. Hal behtar model nahi — step level par behtar measurement hai.
Hissa 02 · Failure Categories
Jin teen failure categories ki measurement zaroori hai
Retrieval failures
Agent ghalat documents retrieve karta hai, bohot kam retrieve karta hai, ya contextually irrelevant chunks retrieve karta hai. Phir downstream reasoning ghalat information par khari ho jati hai. RAGAS context precision aur context recall is ko napte hain. Context precision ka target 0.80 se upar aur context recall ka 0.75 se upar rakhein.
Reasoning failures
Agent ke paas sahi context hota hai magar woh ghalat conclusion nikalta hai, intent ko misclassify karta hai, ya task ke liye ghalat tool chunta hai. Yeh failures automatically napna mushkil hain aur aksar ek alag judge model ya curated evaluation dataset chahiye hota hai jis mein known-correct reasoning paths hon.
Action failures
Agent sahi tool ko ghalat parameters ke saath call karta hai, ghalat tool call karta hai, ya technically valid magar contextually inappropriate action leta hai. In ko consistently pakarne ka wahid tareeqa har tool call ka span level logging hai — uske parameters, return value aur agent ka agla reasoning step.
Hissa 03 · RAGAS metrics
Production RAG agents ke liye paanch RAGAS metrics
| Metric | Kya napta hai | Target |
|---|---|---|
| Faithfulness | Jawab mein dawe retrieved context se support hote hain | 0.90 se upar |
| Answer relevancy | Jawab us cheez ko address karta hai jo sawal ne pucha | 0.85 se upar |
| Context precision | Retrieved chunks sawal se relevant hain | 0.80 se upar |
| Context recall | Jawab ke liye darkar saari information retrieve ho gayi | 0.75 se upar |
| Answer correctness | Jawab ground truth ke muqable mein factually durust hai | 0.80 se upar |
RAGAS, faithfulness, answer relevancy aur context precision ke liye ground truth labels ke baghair chalta hai. Yeh isay live production traffic par chalana practical banata hai, jahan har query ke liye human-verified sahi jawab nahi hote. Context recall aur answer correctness ko ground truth chahiye, isliye in ko live traffic par nahi balke development ke dauran ek curated evaluation set par istemal karein.
Hissa 04 · Span level evaluation
Output par nahi, step par napein
Span level evaluation agent run ke har intermediate step ko ek named span ke taur par uske inputs, outputs, latency aur token cost ke saath log karti hai. LangGraph par mabni agents ke liye LangSmith default mein yehi capture karta hai.
Har tool call ek span hai. Har retrieval ek span hai. Har reasoning step ek span hai. Jab agent run ghalat result de, aap LangSmith mein trace kholte hain, woh span dhoondhte hain jahan se error shuru hua, aur us step par maujood inputs, outputs aur context seedha parhte hain. Aap andaaza nahi lagate — aap dekhte hain.
Yehi woh khasoosiyat hai jo debuggable production systems ko bhurbure systems se alag karti hai. Span level observability ke baghair ek ghalat agent output ek pheli hai. Iske saath, ghalat output ek waahid span ban jata hai jise aap shanakht, reproduce aur fix kar sakte hain.
Hissa 05 · Evaluation stack
LangSmith plus RAGAS plus DeepEval: 2026 ka production stack
Observability ke liye LangSmith
LangGraph par mabni agents ke liye har span automatically capture karta hai. Traces store karta hai. RAGAS integration support karta hai. Aap ko live traffic samples aur historical traces par evaluators chalane deta hai. Kisi bhi production agent ke liye minimum viable setup.
Retrieval quality ke liye RAGAS
Live traffic par faithfulness, answer relevancy aur context precision ke reference-free metrics. Production queries ke 5 se 10 percent sample par asynchronously chalayein. Threshold se neeche metric girne par alert karein.
Behavioral testing ke liye DeepEval
Curated datasets ke muqable agent ke behavior ko evaluate karne ka test suite framework. Regressions ko production tak pohnchne se pehle pakarne ke liye har deployment par CI/CD mein chalayein. Hallucination detection, prompt injection ke khilaf mazbooti aur custom behavioral metrics cover karta hai.
Hissa 06 · Production checklist
Ship karne se pehle kam az kam evaluation setup
| Taqaza | Tool | Tawatur |
|---|---|---|
| Tamam agent runs ke liye span level tracing | LangSmith | Hamesha on |
| Faithfulness 0.90 se upar | RAGAS via LangSmith | Async, 10 percent sample |
| Answer relevancy 0.85 se upar | RAGAS via LangSmith | Async, 10 percent sample |
| Behavioral regression tests | CI/CD mein DeepEval | Har deployment |
| Tool call schema validation | Pipeline mein custom validator | Har tool call |
| Kam confidence wale runs ke liye human review queue | LangSmith dataset | Haftawaar |
FAQ
Aksar Poochay Janay Walay Sawalat
Production mein AI agents ka jaiza kaise lein?
Har intermediate step, tool call aur retrieval decision capture karne ke liye span level tracing chalayein. Faithfulness aur answer relevancy monitor karne ke liye RAGAS metrics ko live traffic ke sample par asynchronously istemal karein. Har deployment par DeepEval se behavioral regression tests chalayein. Response pipeline ko evaluation par block na karein — asynchronously chalayein.
LLM agents ke liye span level evaluation kya hai?
Span level evaluation agent run ke har intermediate step — har tool call, retrieval step aur reasoning step — ko inputs, outputs aur context ke saath ek named span ke taur par log karti hai. Span level par evaluate karne se aap exactly shanakht kar sakte hain ke kis step ne error paida ki, bajaye iske ke aap final output se reverse-engineer karein.
Production RAG agent ke liye kaun se RAGAS metrics istemal karoon?
Faithfulness aur answer relevancy se shuru karein — donon reference-free hain aur ground truth labels ke baghair live traffic par chal sakte hain. Faithfulness ka target 0.90 se upar aur answer relevancy ka 0.85 se upar rakhein. Retrieval quality ko khaas taur par napne ke liye curated evaluation dataset ke saath context precision aur context recall add karein.
Kya LangGraph agents ke liye LangSmith behtareen evaluation tool hai?
LangSmith, LangGraph par mabni agents ke liye sab se zyada integrated option hai — yeh instrumentation code ke baghair spans automatically capture karta hai, RAGAS integration ko natively support karta hai, aur historical traces par evaluations chalane ke liye dataset interface deta hai. Doosre frameworks par kaam karne wali teams ke liye Arize Phoenix aur Langfuse milti julti capability rakhne wale mazboot alternatives hain.
Aksar Pochay Janay Walay Sawaal
- Production mein AI agents ki evaluation kaise ki jati hai?
- Span level tracing set up karein taake har intermediate step, har tool call aur har retrieval decision capture ho. Live traffic ke sample par RAGAS metrics asynchronously chalayein aur faithfulness aur answer relevancy monitor karein. Har deployment par DeepEval ke saath behavioral regression tests run karein.
- LLM agents ki span level evaluation kya hoti hai?
- Span level evaluation mein agent run ka har intermediate step — tool call, retrieval step, reasoning step — apne inputs, outputs aur context ke saath ek named span ke taur par log hota hai. Span level par evaluate karne se yeh seedha pata chal jata hai ke ghalti kis step par hui, na ke final output se ulta nikala jaye.
- Production RAG agent ke liye konse RAGAS metrics use karne chahiye?
- Faithfulness aur answer relevancy se shuru karein — donon reference free hain aur live traffic par bina ground truth labels chal jate hain. Faithfulness 0.90 se ooper aur answer relevancy 0.85 se ooper ka target rakhein. Retrieval quality ko specifically maapne ke liye curated evaluation dataset par context precision aur context recall add kar lein.
- Kya LangGraph agents ke liye LangSmith best evaluation tool hai?
- LangGraph par bane agents ke liye LangSmith sab se zyada integrated option hai — bina instrumentation code ke spans automatically capture karta hai, RAGAS integration natively support karta hai aur historical traces par evaluation chalane ke liye dataset interface bhi deta hai. Doosre frameworks par chalne wali teams ke liye Arize Phoenix aur Langfuse strong alternatives hain.