DevOps Engineer
4 days ago
Santa Clara
Job Summary
We are seeking a highly capable Senior DevOps Engineer / Platform Engineer to build, operationalize, and scale the infrastructure and deployment foundation for a strategic site-builder / network automation platform. This role will focus on creating reliable CI/CD pipelines, production-grade Kubernetes deployment patterns, managed database services, observability, environment reproducibility, secrets management, and Infrastructure as Code across development, testing, staging, and production environments.
This engineer will play a critical role in moving the platform from an early-stage, partially manual operating model into a repeatable, supportable, and production-ready DevOps model. The environment includes Kubernetes-hosted services, AWS managed services, workflow orchestration with Temporal, integration with Nautobot, Argo-based promotion flows, and the supporting tooling required for debugging, snapshotting, local development, and production support.
This is a hands-on engineering role for someone who can design the right platform patterns, implement them directly, and establish a durable operating model between development and DevOps teams.

Key Responsibilities

Platform Deployment & CI/CD
• Design, implement, and maintain CI/CD pipelines for testing, staging, and production environments.
• Build and maintain deployment workflows that support safe and seamless promotion across environments.
• Improve and maintain Argo-based deployment workflows to enable controlled release progression from test to staging to production.
• Establish baseline deployment mechanisms for the site-builder application and related services.
• Standardize Kubernetes application packaging and deployment patterns, with a strong preference toward Helm-based lifecycle management for complex services and third-party components.
• Migrate existing deployments to Helm charts where appropriate.

Kubernetes & Runtime Platform Engineering
• Support the deployment and ongoing operation of services running in Kubernetes.
• Improve runtime reliability, resiliency, and troubleshooting for distributed services operating inside shared Kubernetes clusters.
• Investigate and harden service-to-service connectivity patterns, especially for workflow components such as workers connecting to the Temporal engine.
• Partner with development teams to define production-grade runtime requirements, resource sizing, restart policies, and platform support boundaries.

Infrastructure as Code & Cloud Services
• Design and implement fully declarative Infrastructure as Code for managed cloud services, especially in AWS.
• Provision and maintain managed data services such as RDS/PostgreSQL and MongoDB-compatible document databases across all environments.
• Eliminate manual infrastructure setup where possible and replace it with reproducible, version-controlled deployment patterns.
• Prepare the platform for future scale across multiple environments and regions through repeatable IaC and GitOps-aligned practices.

Data Services, Snapshots & Developer Enablement
• Set up and maintain RDS, MongoDB, Redis/cache services, and related dependencies for all environments.
• Build tooling and operational processes for:
  ◦ production and staging database snapshots,
  ◦ restoring snapshots into development environments,
  ◦ enabling local debugging and development from realistic data states.
• Support creation of local and development environments, including Minikube-based environment-as-code approaches that mirror production behavior as closely as practical.
• Improve platform reproducibility so engineers can quickly stand up close-to-production development environments.

Workflow Orchestration & Temporal Support
• Lead the setup, deployment, and operational support of Temporal for workflow orchestration.
• Support production operations for Temporal, including troubleshooting performance issues, restarts, scaling concerns, and resource shortages.
• Establish maintainable deployment patterns for Temporal using supported packaging and lifecycle management approaches.
• Partner with engineering teams to ensure workflow platform reliability and upgradeability over time.

Observability, Reliability & Incident Readiness
• Design and maintain observability across testing, staging, and production using tools such as Prometheus and Grafana.
• Define and implement monitoring for:
  ◦ service and cluster utilization,
  ◦ CPU, memory, storage,
  ◦ IOPS / throughput metrics,
  ◦ database connections and session counts,
  ◦ cache hit / miss / coverage metrics,
  ◦ RDS and MongoDB utilization,
  ◦ service health and alerting.
• Build and maintain logging, tracing, and correlation capabilities, separated appropriately by environment.
• Create tools to support deep debugging and operational inspection, including raw database reads, cleanup of unused volumes, and emergency cache invalidation.

Security, Access & Secrets Management
• Maintain secrets management processes across environments.
• Build tooling for short-lived internal token generation and long-lived secret rotation.
• Support secure access from deployed services to active production devices and southbound systems.
• Help establish credential management patterns for southbound integrations and device-facing access.
• Partner with related teams to define safe operational limits and controls for service integrations.

External Integrations & Platform Support
• Support integration patterns with Nautobot and help define safe client-side behaviors such as rate limiting, retry/backoff, and service protection mechanisms.
• Partner with application teams to understand and mitigate integration issues such as rate limiting or request rejection.
• Support staging and testing by enabling virtual device environments where needed.
• Contribute to end-to-end acceptance testing and production readiness activities.

Operating Model & Cross-Functional Execution
• Help define an effective operating model between Development and DevOps, whether via RACI, embedded Agile delivery, or a hybrid support model.
• Support deployment readiness, incident management, environment ownership boundaries, and lifecycle responsibilities.
• Work closely with software engineering, infrastructure, application owners, and partner teams to drive production readiness and sustainable operations.

Required Qualifications
• Bachelor’s degree in Computer Science, Engineering, Information Systems, or equivalent practical experience.
• 7+ years of experience in DevOps, Platform Engineering, SRE, or Infrastructure Engineering roles.
• Strong hands-on experience with Kubernetes in production environments.
• Strong experience building and maintaining CI/CD pipelines for multi-environment software delivery.
• Strong experience with ArgoCD, GitOps workflows, or equivalent deployment tooling.
• Strong experience with Helm and Kubernetes package/deployment lifecycle management.
• Experience with AWS managed services, especially RDS/PostgreSQL, document databases, and related infrastructure.
• Strong experience with Infrastructure as Code, such as Terraform and/or similar declarative tooling.
• Experience with Prometheus, Grafana, and modern observability practices.
• Experience with Redis/cache services, secrets management, and operational debugging.
• Strong Linux, networking, and distributed systems troubleshooting skills.
• Strong scripting and automation skills in one or more languages such as Python, Bash, or Go.
• Proven ability to work cross-functionally and operate effectively in environments where ownership boundaries are still evolving.

Preferred Qualifications
• Experience with Temporal deployment and production operations.
• Experience supporting developer platforms with local environment reproducibility using Minikube, kind, or similar tools.
• Experience with MongoDB / DocumentDB operations and restore workflows.
• Experience integrating with Nautobot, NetBox, or similar infrastructure source-of-truth platforms.
• Experience operating in shared-cluster environments with multi-team tenancy and constrained access models.
• Experience designing platform patterns for internal products that must scale across regions or multiple deployment footprints.
• Familiarity with network automation or infrastructure orchestration platforms is a plus.

What Success Looks Like
• CI/CD pipelines are reliable, repeatable, and support safe promotion across all environments.
• Kubernetes deployments are standardized, maintainable, and production ready.
• Managed infrastructure is defined as code rather than through manual setup.
• Temporal, databases, cache layers, and observability tooling are stable and supportable.
• Development teams can reproduce realistic environments locally for faster debugging and delivery.
• Secrets, access patterns, and operational tooling are mature enough to support production-scale operations.
• The DevOps operating model is clearly defined and enables faster deployments with less operational risk.