Senior DevOps Engineer
25 days ago
London
Senior DevOps Engineer – AI & Cloud Infrastructure

Type: Permanent / Full-Time (Employment or Contract considered)
Location: Remote or Hybrid
Time Zones: UK, Europe, North America–friendly

The Opportunity

We’re working with a high-growth tech start-up building a next-generation AI cloud platform, focused on fast, reliable inference for large language models and other compute-intensive workloads. The platform combines modern cloud infrastructure, Kubernetes, GPU clusters, and developer-first tooling to support mission-critical AI systems operating across multiple regions.

They’re now looking for a Senior DevOps Engineer to take ownership of the infrastructure backbone: someone who enjoys operating complex systems at scale and working closely with infrastructure, ML, and product engineering teams.

What You’ll Be Doing

AI Cloud Infrastructure
• Design, build, and operate highly available, secure infrastructure supporting AI inference, fine-tuning, and data processing workloads
• Manage multi-region Kubernetes clusters, including GPU-heavy environments
• Implement autoscaling strategies across heterogeneous compute fleets

Infrastructure as Code & Automation
• Own and evolve infrastructure-as-code using tools such as Terraform, Helm, and similar
• Automate provisioning of compute, networking, and storage
• Build tooling to spin environments up and down for experiments, benchmarks, and customer deployments

CI/CD & Release Engineering
• Design and maintain CI/CD pipelines across backend, infrastructure, and ML components
• Implement safe deployment strategies (e.g. blue/green, canary releases)
• Partner with engineers to improve build speed, test reliability, and deployment confidence

Observability, Reliability & SRE
• Build and operate observability stacks (metrics, logging, tracing)
• Define and monitor SLOs / SLAs for latency, availability, and reliability
• Create runbooks, playbooks, and incident response processes for production systems

Security & Best Practices
• Implement best practices around secrets management, access control, and network security
• Support secure, multi-tenant environments for enterprise customers
• Help foster a culture of operational excellence, ownership, and reliability

What They’re Looking For

Essential
• 4–8+ years’ experience in DevOps, SRE, Platform, or Infrastructure Engineering
• Strong experience running production systems on major cloud platforms (AWS, GCP, or Azure)
• Deep hands-on experience with Kubernetes in production
• Strong Infrastructure-as-Code skills (Terraform or equivalent)
• Proficiency in at least one scripting or programming language (e.g. Python, Go, Bash)
• Solid understanding of networking, security fundamentals, and distributed systems
• Proven experience building reliable, observable, automated systems

Nice to Have
• Experience supporting GPU-based workloads or ML infrastructure
• Exposure to AI / ML platforms, inference systems, or data pipelines
• Familiarity with modern CI/CD tooling and GitOps approaches
• Experience with observability tooling (metrics, logs, tracing)
• Background in cloud platforms, AI infrastructure, or high-scale SaaS environments

Why Join
• Work on core infrastructure powering cutting-edge AI systems
• High impact and ownership over architecture and tooling decisions
• Collaboration with senior engineers across infrastructure, ML, and product
• Competitive compensation, equity, and long-term growth potential
• Flexible remote / hybrid working