Python Team Lead - Remote
Barcelona
Impress is the #1 AI-driven chain of orthodontic clinics with digital processes in Europe, and in the top 3 worldwide. Born in Barcelona in 2019, we have grown to pioneer leading care, flagship clinics, and state-of-the-art technology across 10 countries and more than 200 clinics, and we have completely disrupted the orthodontics market by cutting treatment costs in half, making treatment more affordable. To do this, we develop both web and mobile products for users as well as internal software for clinics, and together with the ML department we are changing the whole industry. Our business model, a true combination of medical expertise and digitalization, has been recognized by Forbes among the fastest-growing HealthTech companies, and we are currently listed as a LinkedIn Top 10 startup!

Role overview:
We are looking for an ML Platform Lead to own the infrastructure and deployment layer that powers our machine learning products. This is a senior technical leadership role sitting at the intersection of ML engineering, cloud infrastructure, and product impact: the person in this role turns trained models into reliable, scalable, cost-efficient production services and builds the platform that lets the entire ML team move faster. You will own end-to-end delivery, from provisioning cloud infrastructure and designing event-driven pipelines to optimizing GPU inference and shipping multi-environment production deployments. You will manage a small team of ML platform engineers, collaborate across product, ML research, and cloud teams, and be accountable for the reliability, performance, and cost of the ML platform.
Requirements:
• 6+ years of experience in ML engineering, MLOps, or ML platform roles with production responsibility
• Strong hands-on experience with cloud infrastructure: AWS (Lambda, Batch, Step Functions, S3, EC2, ECS/EKS, DynamoDB, RDS) and/or GCP (Cloud Run, Artifact Registry)
• Proficiency with Infrastructure as Code: Terraform and/or Terragrunt for managing multi-environment cloud deployments
• Experience deploying and operating ML inference services in production (Triton Server, TorchServe, FastAPI, or equivalent)
• Experience with Docker: multi-stage builds, image optimization, container registries
• Strong understanding of GPU compute for ML workloads: instance selection, cost optimization, inference profiling
• Demonstrated ability to deliver end-to-end ML services, from infrastructure provisioning to production deployment
• Experience with event-driven architectures: SNS, SQS, webhooks, or equivalent message-passing systems
• Track record of quantifiable business impact: cost reductions, automation rates, throughput improvements
• Experience training deep learning models (PyTorch, segmentation models, computer vision)
• Familiarity with model optimization techniques: ONNX/TorchScript conversion, quantization, batching strategies
• Experience with ArgoCD or other GitOps tooling for Kubernetes
• Background in healthcare, dental tech, or other regulated / precision-critical domains
• Experience with Terragrunt for managing multi-account, multi-environment Terraform at scale

Responsibilities:

1. Platform Ownership:
• Design, build, and maintain the ML serving and deployment platform across AWS and GCP in a hybrid cloud setup
• Own multi-environment infrastructure (dev, prelive, live) using Terraform and Terragrunt, ensuring reproducibility and consistency across all stages
• Manage ML services running on Kubernetes, including migrations, health checks, secrets management, and observability
• Evaluate and adopt new infrastructure approaches (serverless GPU compute, dynamic scaling, scale-to-zero architectures) to balance performance and cost

2. ML Deployment & Optimization:
• Deploy and maintain ML inference services using Nvidia Triton Server, FastAPI, and containerized workloads
• Optimize GPU inference pipelines: profiling, batching, model conversion (ONNX/TorchScript), and GPU instance selection
• Own the Docker image strategy, including multi-stage builds, layer caching, and size optimization for complex multi-repository ML projects
• Integrate ML services into event-driven architectures using AWS SNS, SQS, Lambda, Step Functions, and Batch

3. Cost & Efficiency:
• Continuously identify and execute infrastructure cost optimizations, from right-sizing compute to serverless job migration to dynamic GPU scaling
• Define and track cost-per-inference and cost-per-case metrics across services; own the monthly infra spend for the ML platform
• Build GPU scaling strategies (including scale-to-zero) that reduce idle costs as workload patterns change

4. Team Leadership:
• Define clear ownership areas for team members and build their technical skills and independence
• Set the technical direction for the ML platform, balancing short-term delivery with long-term maintainability

5. Cross-Functional Collaboration:
• Partner with ML researchers to productize new models: wrapping models into APIs, selecting the right serving infrastructure, and managing the path from prototype to production
• Work closely with product managers and clinical stakeholders to align ML platform capabilities with business needs
• Collaborate with the Cloud/DevOps team on shared infrastructure, security practices, and capacity planning

Benefits:
• Vacation days to ensure you have time to relax and recharge
• Sick leave coverage for when you need to focus on your health
• Spanish public holidays on top of your vacation days
• A company-provided laptop and the equipment you need to excel in your role
• Exclusive teeth aligner benefits as part of your employee perks
• A modern, vibrant workplace where collaboration and creativity thrive
• A team that feels more like a family, with plenty of laughs and support along the way