Workload Orchestration Engineer
3 days ago
Madrid
At Roche you can be entirely yourself and are valued for your unique qualities. Our culture fosters personal expression, open dialogue, and genuine connections. Here you are valued, accepted, and respected for who you are. This creates an environment in which you can grow both personally and professionally. Together we want to prevent, stop, and cure diseases and ensure that everyone has access to healthcare, today and in the future. Become part of Roche, where every voice counts.

Job description

As a Workload Orchestration Engineer within the Accelerated Compute Engineering (ACE) team, you will be responsible for overseeing and advancing our workload orchestration tech stack across both our High-Performance Computing (HPC) and industry-leading AI Factory platforms. With the rapid expansion of our compute infrastructure, efficiently scheduling, managing, and maximizing the utilization of our CPU and GPU environments is paramount.

You will own the deployment, configuration, and fine-tuning of orchestration platforms that schedule massive, parallel computational workloads. By implementing robust scheduling policies for traditional scientific workflows and modern containerized AI workloads, you will bridge the gap between heavy compute capacity and efficient execution. Your work will directly ensure that Roche’s researchers, data scientists, and engineers can seamlessly run large-scale AI model training and computational science simulations.

Description of the area

Hosting and Infrastructure (HI) provides mission-critical on-premises infrastructure, cloud hosting, connectivity, and technology products that enable all functions at every Roche site to develop, innovate, connect, and deliver compliant digital products across the Roche Enterprise.
The Value Streams - Accelerated Compute Engineering (ACE) Team is focused on driving both customer success and platform success by acting as a center of excellence and delivery for the High Performance Compute and AI Infrastructure supporting AI and HPC use cases across Roche. This team facilitates seamless onboarding and adoption for business vertical customers who need accelerated compute. It supports infrastructure consumers whose needs center on high availability, seamless data transfer, flexibility, speed, and the rapidly changing demands of AI, and it helps them achieve rapid time-to-value.

Job Responsibilities

Orchestration Stack Deployment & Governance
• Design, implement, and maintain the SLURM Workload Manager ecosystem across our HPC cluster architectures, ensuring high availability and optimal resource distribution.
• Deploy and manage Run:ai as the core orchestration and virtualization layer for the AI Factory, enabling fractional GPU allocation and dynamic resource allocation.
• Evaluate, architect, and implement SLURM Slinky integrations where required to seamlessly bridge Kubernetes-based AI orchestration with traditional HPC cluster resources.

Containerization & Workload Optimization
• Define best practices and frameworks for containerized scientific execution, utilizing Singularity/Apptainer and/or Enroot to provide secure, reproducible performance environments for HPC.
• Translate user and workload requirements into optimized scheduling parameters (e.g., topology-aware scheduling, multi-node scaling).
• Actively profile and tune scheduling queues, quality-of-service (QoS) parameters, and fair-share policies to maximize multi-tenant efficiency.
Platform Reliability & Telemetry
• Partner with Observability Engineers to implement continuous monitoring, telemetry, and reporting dashboards to track scheduler efficiency, queue wait times, and hardware utilization rates.
• Troubleshoot complex workload failures, including distributed training synchronization issues, MPI communication bottlenecks, and driver incompatibilities.
• Maintain configuration-as-code models for the scheduling tier, leveraging automation to deploy cluster policies uniformly.

Qualifications

Education / Experience
• Bachelor’s or an advanced degree in Computer Science, Applied Mathematics, Computational Engineering, or a similar technical discipline.
• 5+ years of systems engineering experience, with a heavy emphasis on workload scheduling, resource management, and cluster optimization for multi-tenant environments.
• Deep technical familiarity with Enterprise Linux operating systems and distributed systems architecture.
• HPC Scheduling & Tooling: Expert-level proficiency in administering SLURM, including complex partition designs, accounting, and plug-in management. Highly proficient with Singularity for container runtime execution.
• AI Orchestration: Hands-on experience or deep architectural understanding of Run:ai, Kubernetes, and containerized GPU scheduling paradigms.
• Infrastructure Literacy: Solid understanding of high-speed interconnects (InfiniBand, RoCE) and multi-node communication architectures (MPI, NCCL) as they relate to job placement.
• Automation: Proficiency in automating scheduler configurations and telemetry gathering, or in infrastructure automation tooling.
Leadership & Mindset
• Lean & Agile Mindset: Highly focused on driving efficiency, reducing idle compute time, and creating frictionless pathways for user workload submissions.
• Collaboration & Advocacy: Outstanding capability to translate scientific and AI model workflow challenges into scalable scheduler configurations.
• Intellectual Curiosity: A strong passion for remaining ahead of industry trends regarding GPU slicing, fractionalization, and the convergence of AI workloads with traditional HPC schedulers.

A healthier future drives us to innovate. More than 100,000 employees worldwide work together to advance science and ensure that everyone has access to healthcare, today and for future generations. Through our commitment, more than 26 million people are treated with our medicines and more than 30 billion tests are conducted with our diagnostics products. We encourage one another to explore new possibilities, foster creativity, and set high goals in order to deliver life-changing healthcare solutions. Together we can build a healthier future.

Roche is an equal opportunity employer.