Overview
- Project: design and implement an automated quality evaluation framework for production chatbot and mailbot systems, with the goal of improving their reliability and answer quality.
- The project is part of the Care Technology team and evaluates production systems against real customer interactions.
Responsibilities / What you will do
- Build a maintainable benchmark test dataset from real customer interactions, and implement a multi-layer evaluation engine (rule-based, embedding-based, LLM-based); a sketch of such a layered engine appears after this list.
- Implement false-positive controls using intent-scoped rules, multi-reference answers, and weighted scoring; add automated regression detection and integrate it into the CI/CD pipeline as a quality gate (see the gate sketch after this list).
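To make the layered engine concrete, here is a minimal Python sketch of the escalation pattern such an engine might follow: a cheap rule layer and an embedding-similarity layer each either decide or defer to the more expensive LLM-judge layer. Everything here is an assumption for illustration, not the team's actual design: the `Verdict` and layer names and the 0.8/0.3 thresholds are invented, and the bag-of-words cosine is a stand-in for a real embedding model.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool
    layer: str   # which layer decided
    score: float


def rule_layer(answer: str, required_phrases: list[str]) -> Verdict | None:
    """Cheapest layer: deterministic checks scoped to the test case's intent."""
    if all(p.lower() in answer.lower() for p in required_phrases):
        return Verdict(True, "rule", 1.0)
    return None  # inconclusive: defer to the embedding layer


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def embedding_layer(answer: str, references: list[str]) -> Verdict | None:
    """Middle layer: similarity against every reference answer (the
    multi-reference idea), keeping the best match. Bag-of-words cosine
    is a placeholder for a real embedding model."""
    best = max(_cosine(Counter(answer.lower().split()),
                       Counter(r.lower().split())) for r in references)
    if best >= 0.8:       # clearly equivalent to a reference answer
        return Verdict(True, "embedding", best)
    if best < 0.3:        # clearly off-topic: no LLM call needed
        return Verdict(False, "embedding", best)
    return None  # ambiguous middle band: defer to the LLM judge


def llm_judge_layer(answer: str, references: list[str]) -> Verdict:
    """Most expensive layer: an LLM grading call, left unimplemented here."""
    raise NotImplementedError("wire this to your LLM provider")


def evaluate(answer: str, references: list[str],
             required_phrases: list[str]) -> Verdict:
    """Walk the layers from cheapest to most expensive."""
    verdict = rule_layer(answer, required_phrases)
    if verdict is None:
        verdict = embedding_layer(answer, references)
    if verdict is None:
        verdict = llm_judge_layer(answer, references)
    return verdict
```

The ordering is the cost control: most cases should be settled by the first two layers, so the LLM judge only sees the ambiguous middle band.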
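The CI/CD quality gate from the second bullet can start as a small script that runs after the benchmark suite and fails the pipeline via its exit code. The file names, JSON layout, and the 2-point tolerance below are illustrative assumptions, not part of the posting.

```python
import json
import sys
from pathlib import Path

# Illustrative quality gate: compare the current benchmark pass rate
# against a committed baseline and fail the CI job on regression.
BASELINE_FILE = Path("eval/baseline.json")
RESULTS_FILE = Path("eval/latest_results.json")
TOLERANCE = 0.02  # tolerate a 2-point dip before the gate trips


def pass_rate(results: dict) -> float:
    cases = results["cases"]
    return sum(1 for c in cases if c["passed"]) / len(cases)


def main() -> int:
    baseline = json.loads(BASELINE_FILE.read_text())
    current = json.loads(RESULTS_FILE.read_text())
    base, cur = pass_rate(baseline), pass_rate(current)
    print(f"baseline pass rate={base:.3f}  current={cur:.3f}")
    if cur + TOLERANCE < base:
        print("Regression detected: failing the quality gate.")
        return 1  # non-zero exit fails the CI step
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

In GitHub Actions this would be a single step (for example, `python quality_gate.py`) whose non-zero exit blocks the merge.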
Technical environment
- Programming Language: Python; AI Stack: LLM APIs, Retrieval-Augmented Generation (RAG), Embeddings; APIs: OpenAI or similar LLM services.
- DevOps: Git, CI/CD (GitHub Actions); Cloud: AWS; Visualization: Streamlit / Grafana / Web dashboards; Cost control: token consumption tracking, tiered evaluation strategy (see the cost-ledger sketch after this list).
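Token consumption tracking can be as simple as a ledger that every LLM call reports to, combined with a hard per-run budget so an evaluation run cannot silently overspend. The per-1K-token prices and the budget below are made-up illustrative numbers, not provider pricing.

```python
import time
from dataclasses import dataclass, field

# Illustrative prices per 1K tokens; real numbers depend on the
# provider and model actually used.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}


@dataclass
class TokenLedger:
    budget_usd: float
    spent_usd: float = 0.0
    calls: list = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Log one LLM call and enforce the run budget."""
        cost = (input_tokens * PRICE_PER_1K["input"]
                + output_tokens * PRICE_PER_1K["output"]) / 1000
        self.spent_usd += cost
        self.calls.append((time.time(), input_tokens, output_tokens, cost))
        if self.spent_usd > self.budget_usd:
            raise RuntimeError(
                f"Evaluation budget exceeded: "
                f"${self.spent_usd:.4f} > ${self.budget_usd:.2f}")


ledger = TokenLedger(budget_usd=5.00)
ledger.record(input_tokens=1200, output_tokens=300)  # one LLM-judge call
print(f"spent so far: ${ledger.spent_usd:.4f}")
```

The ledger and the tiered strategy reinforce each other: the cheaper layers exist precisely so that `record()` calls stay rare.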
Design & Maintenance
- Design a scalable test maintenance system (config-driven tests, versioning, human-in-the-loop review) and implement cost-aware evaluation and optimization (tiered test execution, controlled LLM usage); a sketch of a config-driven test case follows this list.
- Deliver complete technical documentation and ensure the framework is maintainable and can be integrated into existing pipelines.
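One plausible shape for config-driven tests is to keep cases in versioned config files that a small loader turns into typed objects, so reviewers add or retire cases through data changes rather than code changes. The JSON layout, field names, and the "status" convention here are assumptions for illustration (YAML would work the same way).

```python
import json
from dataclasses import dataclass

# Illustrative config file contents: cases are versioned data, reviewed
# by humans, and retired cases stay in the file for history.
CONFIG = """
{
  "version": "2024-06-01",
  "cases": [
    {
      "id": "refund-policy-001",
      "intent": "refund_request",
      "question": "How do I get a refund?",
      "references": ["You can request a refund within 30 days."],
      "required_phrases": ["30 days"],
      "status": "active"
    }
  ]
}
"""


@dataclass
class TestCase:
    id: str
    intent: str
    question: str
    references: list
    required_phrases: list
    status: str


def load_cases(raw: str) -> list:
    data = json.loads(raw)
    # Retired cases remain in the file but are skipped at run time.
    return [TestCase(**c) for c in data["cases"] if c["status"] == "active"]


for case in load_cases(CONFIG):
    print(case.id, case.intent)
```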
Candidate profile / Qualifications
- Final-year student in Software Engineering with strong foundations in software development, REST APIs, and web technologies (HTTP, JSON).
- Strong Python programming skills; experience with, or interest in, software testing and automated testing of AI systems; interest in NLP and LLMs.
- Comfortable with Git and basic cloud concepts; analytical mindset and strong problem-solving skills; good documentation and communication skills in English; able to work autonomously on a complex end-to-end technical project.
How to apply
- Apply via BambooHR or the link shared in the post (see application_link).