Senior / Lead / Principal Platform Engineer (DevOps / Cloud Infrastructure)
2 days ago
Los Angeles
Job Description

Location: West Hollywood / Los Angeles, CA
Work Model: On-site (5 days per week)
Employment Type: Full-Time
Compensation: $200,000–$300,000+ USD (depending on experience and seniority), plus a competitive sign-on bonus.

Applicants must be legally authorized to work in the United States. Visa sponsorship is not available for this role.

About the Opportunity

Our client is a well-funded, early-stage AI company building a next-generation intelligence platform for high-stakes, real-world decision making. The platform ingests and fuses data from satellite feeds, autonomous sensors, logistics networks, enterprise systems, and open-source intelligence (OSINT) to power production AI/ML workloads, knowledge graphs, and intelligent decision-making systems.

This is not a traditional SaaS, DevOps, or chatbot company. The engineering team is building production AI infrastructure where reliability, scalability, security, and developer productivity are mission-critical.

We're looking for a Senior, Lead, or Principal Platform Engineer who enjoys building platforms, not simply maintaining them. You'll own the cloud infrastructure, Kubernetes platform, CI/CD and GitOps workflows, infrastructure automation, and internal developer platform that enables engineering teams to build and deploy production AI systems at scale. This is a highly collaborative, hands-on engineering role with significant ownership and influence over the platform architecture.

The Role

As a Platform Engineer, you'll design, build, and operate the infrastructure that powers complex AI/ML workloads, while creating the internal tooling and platform capabilities that help software engineers move faster and more reliably. The ideal candidate has a strong software engineering foundation, deep cloud infrastructure expertise, and experience owning production Kubernetes environments from design through day-to-day operations.

Key Responsibilities

Platform Engineering
• Design, build, and operate scalable cloud infrastructure supporting production AI/ML workloads.
• Own Kubernetes infrastructure, including architecture, networking, security, upgrades, scaling, and operational reliability.
• Build and evolve an internal developer platform that improves engineering productivity and deployment velocity.
• Develop self-service infrastructure and automation that enables engineering teams to ship software quickly and safely.
• Continuously improve developer experience through platform engineering best practices.

Cloud Infrastructure & DevOps
• Design and implement modern CI/CD and GitOps workflows for production environments.
• Build reusable Infrastructure-as-Code solutions using Terraform and related tooling.
• Architect highly available, resilient, and cost-efficient cloud infrastructure.
• Drive adoption of containerization, Kubernetes, and cloud-native infrastructure across engineering teams.
• Support AI-powered development workflows using tools such as Claude Code, Cursor, GitHub Copilot, or similar technologies.

AI Infrastructure
• Build and optimize infrastructure supporting GPU-accelerated machine learning workloads.
• Improve GPU provisioning, scheduling, utilization, and resource management.
• Support scalable infrastructure for model training, inference, and AI services deployed in production.
• Partner closely with AI engineers to optimize platform performance and reliability.

Reliability & Operations
• Lead the investigation and resolution of complex production incidents across cloud infrastructure, Kubernetes, networking, and applications.
• Perform root-cause analysis and implement long-term improvements that increase reliability.
• Build comprehensive monitoring, alerting, logging, and observability solutions.
• Drive platform reliability, performance optimization, and operational excellence.

Collaboration & Architecture
• Partner with software engineers, AI engineers, security teams, and technical leadership on platform architecture decisions.
• Produce technical design documentation for major infrastructure initiatives.
• Champion engineering best practices around automation, scalability, security, testing, and reliability.
• Evaluate emerging technologies that improve infrastructure capabilities and developer productivity.

Required Qualifications
• Bachelor's degree in Computer Science, Software Engineering, Information Technology, or a related technical discipline (Master's preferred).
• 5+ years of experience building and operating production cloud infrastructure, Platform Engineering, DevOps, or Site Reliability Engineering (SRE) environments.
• Strong software engineering foundation with experience building automation, tooling, services, or developer platforms using Python, Go, Bash, or similar languages.
• Demonstrated ownership of production Kubernetes clusters, including architecture, networking, upgrades, scaling, and operational support.
• Hands-on experience designing and building Infrastructure-as-Code solutions using Terraform, including authoring reusable modules.
• Strong experience designing and building CI/CD and GitOps pipelines, not simply maintaining existing pipelines.
• Deep experience with Google Cloud Platform (GCP) and/or AWS.
• Strong understanding of containerization technologies including Docker and Kubernetes.
• Experience building and operating production-scale distributed systems.
• Strong troubleshooting skills across cloud infrastructure, Kubernetes, networking, and applications.
• Experience with observability platforms such as Prometheus, Grafana, Datadog, ELK, or equivalent.
• Excellent communication and collaboration skills.

Preferred Qualifications

Experience with one or more of the following is highly desirable:
• AI/ML infrastructure and GPU-accelerated workloads.
• NVIDIA GPU infrastructure and CUDA environments.
• Internal developer platforms and self-service infrastructure.
• GitOps methodologies.
• AI-native development tools such as Claude Code, Cursor, GitHub Copilot, or Codex.
• Security-focused environments including DevSecOps practices.
• Air-gapped, sovereign, or highly regulated deployment environments.
• Defense, aerospace, government, or other mission-critical industries.
• FedRAMP, ITAR, CMMC, or similar compliance frameworks.
• Serverless architectures and distributed systems.

What We're Looking For

Successful candidates will demonstrate:
• A platform engineering mindset with experience designing, building, and owning infrastructure, not simply maintaining existing environments.
• A strong software engineering foundation and passion for automation.
• Experience building platforms and internal tooling that improve developer productivity.
• Excellent systems thinking across cloud infrastructure, Kubernetes, networking, security, and distributed systems.
• A high level of ownership and comfort working in fast-moving environments with significant technical responsibility.
• A pragmatic approach to balancing reliability, scalability, security, and developer experience.

Compensation & Benefits
• Base salary: $200,000–$300,000+, depending on experience and seniority.
• Competitive sign-on bonus.
• Comprehensive benefits package.
• Opportunity to join a well-funded, high-growth AI company at an early stage with significant technical ownership.
• Long-term career growth with opportunities to take on broader platform and infrastructure leadership responsibilities as the organization continues to scale.

Why Join?
• Build production infrastructure powering real-world AI systems, not internal IT or traditional enterprise DevOps.
• Own the Kubernetes platform, developer experience, and cloud infrastructure that enables AI engineers to move faster.
• Work alongside a highly technical engineering team solving challenging platform and infrastructure problems.
• Support GPU-accelerated AI/ML workloads deployed in production.
• Help shape the technical foundation of a rapidly growing AI company where engineering quality, ownership, and innovation are highly valued.

If you're passionate about Platform Engineering, cloud infrastructure, Kubernetes, automation, and building the systems that power next-generation AI applications, we'd love to hear from you.