AI Evaluation Data Scientist - AI/ML/LLM - (Hybrid) - Madrid
AI Evaluation Data Scientist

A fantastic opportunity for a driven AI Data Scientist to join a leading Quantum AI company that works on cutting-edge solutions to make AI faster, greener, and more accessible. You'll be working alongside world-leading experts in quantum computing and AI, with the opportunity to work on challenging projects and shape the future of Generative AI systems.

This is initially a 9-month fixed-term contract, with scope to extend. Hybrid working from sites in Madrid or Barcelona.

Responsibilities:
• Design and lead the evaluation strategy for our Agentic AI and RAG systems, turning customer workflows and business needs into measurable metrics and clear success criteria.
• Contribute to the end-to-end design of Agentic AI and RAG systems, injecting a data-and-evaluation perspective into retrieval strategies, orchestration policies, tool usage, and memory to solve complex, real-world problems across industries.
• Develop task-based, multi-step evaluations that reflect how the different components of our systems (retrieval, planning, tool use, memory) perform in real-world scenarios across cloud and edge deployments.
• Develop and refine rigorous evaluation frameworks that reflect real-world performance, going beyond model benchmarks to assess task success, reasoning capabilities, factual consistency, reliability, and user success metrics across diverse problem domains.
• Build and maintain a reproducible evaluation pipeline, including datasets, scenarios, configs, test suites, versioned assets, and automated runs to track regressions and improvements over time.
• Curate and generate high-quality datasets for evaluation, including synthetic and adversarial data, to strengthen coverage and robustness.
• Implement and calibrate LLM-as-a-judge evaluations, aligning automated scoring with human feedback and ensuring fairness, robustness, and representativeness.
• Perform deep error analyses and ablations to uncover failure patterns, maintain a taxonomy of failure modes (reasoning, grounding, hallucinations, tool failures), and provide actionable insights to engineers to improve model and system performance.
• Partner with ML specialists to create a data flywheel, where evaluation continuously informs new dataset creation, improvements to prompts, tool usage, model training, and system refinements, quantifying improvements over time.
• Define and monitor operational metrics (latency, cost, reliability) to ensure evaluations align with production and customer expectations.
• Maintain high engineering standards, including clear documentation, reproducible experiments, robust version control, and well-structured ML pipelines.
• Contribute to team learning and mentorship, guiding junior engineers and sharing expertise in LLM development, evaluation, and deployment best practices.
• Participate in code reviews, offering thoughtful, constructive feedback to maintain code quality, readability, and consistency.

Required Minimum Qualifications:
• Master's or Ph.D. in Computer Science, Machine Learning, Data Science, Physics, Engineering, or a related technical field, with relevant industry experience.
• Solid hands-on experience (3+ years for mid-level, 5+ years for senior) working as a Data Scientist, ML Engineer, or Research Scientist on applied AI/ML projects deployed in production environments.
• Strong background in evaluation of machine learning systems, ideally with experience in LLMs, RAG pipelines, or multi-agent systems.
• Proven ability to design and implement evaluation methodologies that go beyond static benchmarks, capturing real-world task success, reasoning, and robustness.
• Hands-on experience with dataset creation and curation (including synthetic data generation) for training and evaluation.
• Proven experience with agent-based architectures (task decomposition, tool use, reasoning workflows), RAG architectures (retrievers, vector databases, rerankers), and orchestration frameworks (LangGraph, LlamaIndex).
• Strong problem-solving skills, with the ability to navigate ambiguity and design practical solutions to open-ended user or business needs.
• Strong software engineering skills, with proficiency in Python, Docker, and Git, and experience building robust, modular, and scalable ML codebases.
• Familiarity with common ML and data libraries and frameworks (e.g., PyTorch, HuggingFace, LangGraph, LlamaIndex, Pandas).
• Experience with cloud platforms (ideally AWS).
• Fluent in English.

By applying to this role, you understand that we may collect your personal data and store and process it on our systems. For more information, please see our Privacy Notice ()