
Posted about 7 hours ago
Design and implement an evaluation harness for AI agents. Define and instrument key metrics (task completion, tool-call correctness, hallucination detection, silent failure identification), run agents against controlled scenarios, and score their behavior systematically.
Technologies: Python, LangChain/LangGraph, RAGAS, DeepEval, guardrails, context & memory management, Git, Docker.
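
As a rough illustration of the harness described above, here is a minimal sketch in plain Python. Every name in it (`ToolCall`, `AgentRun`, `Scenario`, `score`, `evaluate`, `fake_agent`) is hypothetical and chosen for this example only; none comes from LangChain, RAGAS, or DeepEval. It assumes an agent can be wrapped as a function that takes a prompt and returns its answer, its tool calls, and whether it reported success. Hallucination detection is left out, since it typically needs a grounding source or a judge model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class AgentRun:
    answer: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    reported_success: bool = True  # what the agent claims about its own run


@dataclass
class Scenario:
    prompt: str
    expected_answer: str            # substring a correct answer must contain
    expected_tools: list[ToolCall]  # tool calls a correct run makes, in order


def score(scenario: Scenario, run: AgentRun) -> dict[str, bool]:
    """Score one run against one controlled scenario."""
    completed = scenario.expected_answer.lower() in run.answer.lower()
    tools_ok = run.tool_calls == scenario.expected_tools
    # Silent failure: the agent reports success although the task was not done.
    silent_failure = run.reported_success and not completed
    return {
        "task_completion": completed,
        "tool_call_correctness": tools_ok,
        "silent_failure": silent_failure,
    }


def evaluate(agent: Callable[[str], AgentRun],
             scenarios: list[Scenario]) -> dict[str, float]:
    """Run the agent on every scenario and average each metric."""
    rows = [score(s, agent(s.prompt)) for s in scenarios]
    return {m: sum(r[m] for r in rows) / len(rows) for m in rows[0]}


def fake_agent(prompt: str) -> AgentRun:
    """Stand-in agent (assumption): handles weather, silently fails otherwise."""
    if "weather" in prompt:
        call = ToolCall("get_weather", {"city": "Paris"})
        return AgentRun("It is 18 C in Paris.", [call])
    return AgentRun("Done.")  # claims success without calling any tool


scenarios = [
    Scenario("What is the weather in Paris?", "18 C",
             [ToolCall("get_weather", {"city": "Paris"})]),
    Scenario("Book a table for two.", "booked",
             [ToolCall("book_table", {"guests": 2})]),
]

report = evaluate(fake_agent, scenarios)
print(report)
```

The substring match and exact tool-call comparison are deliberately crude stand-ins. A real harness would likely replace them with richer checks, but the shape stays the same: fixed scenarios in, per-metric scores out.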