Manager, DevOps, SRE & AI Infrastructure
3 months ago
San Jose
Job Description

AppZen is the leader in autonomous spend-to-pay software. Its patented artificial intelligence accurately and efficiently processes information from thousands of data sources so that organizations can better understand enterprise spend at scale to make smarter business decisions. It seamlessly integrates with existing accounts payable, expense, and card workflows to read, understand, and make real-time decisions based on your unique spend profile, leading to faster processing times and fewer instances of fraud or wasteful spend. Global enterprises, including one-third of the Fortune 500, use AppZen’s invoice, expense, and card transaction solutions to replace manual finance processes and accelerate the speed and agility of their businesses.

At AppZen, we value candidates who are actively using AI tools to enhance productivity, automate repetitive tasks, and solve problems more efficiently. Across all roles, we are looking for team members who leverage AI in meaningful ways to drive impact in their work. To learn more, visit us at .

In this role, success means making AppZen’s cloud platform more reliable, resilient, and easy to operate, with strong monitoring and observability that meets enterprise customer expectations. You will build repeatable pipelines for AI and LLM workflows, including versioned prompts, automated testing, and quality checks. You’ll ensure safe, traceable, and cost-conscious operation of multi-step AI workflows. Success also means fostering a culture of reliability across the engineering team with clear service-level objectives and blameless post-mortems.
Finally, you’ll grow and mentor your team, helping engineers advance while effectively using AI tools to improve day-to-day work and overall productivity.

Responsibilities:
• Own and operate AppZen’s cloud and Kubernetes infrastructure, including AWS services (VPC, IAM, EKS, RDS/Aurora, S3) and Terraform-based provisioning.
• Lead reliability and observability efforts with metrics, logs, tracing, alerts, and incident response to continuously improve service uptime and performance.
• Define and manage service-level objectives (SLOs), error budgets, and reliability practices across multi-service architectures.
• Build and scale AI and LLM pipelines, ensuring safe execution, cost/latency controls, and high-quality outputs for production workflows.
• Establish LLMOps quality and evaluation practices, including automated testing, regression monitoring, and tracking AI-specific failure modes.
• Lead, mentor, and grow a high-performing DevOps/SRE team, promoting AI-assisted engineering tools and automation best practices.
• Drive cross-functional collaboration across Engineering, Product, and Customer Success on platform, reliability, and AI infrastructure initiatives.
• Ensure security and compliance readiness, including SOC 2/ISO 27001 standards, access controls, and best practices for sensitive data handling.

Must Have:
• 8+ years of experience in Platform Engineering, DevOps, SRE, or Cloud Infrastructure, with hands-on coding (Python and/or Go preferred).
• 3+ years of hands-on SRE experience, including SLOs, error budgets, incident management, and reliability improvements across production SaaS systems.
• Experience operating AI/LLM systems in production, including workflows, latency/cost optimization, and failure handling.
• 2+ years of engineering management with direct reports on DevOps or SRE teams.
• Deep expertise in AWS, Kubernetes (EKS), Linux, and networking, with infrastructure-as-code experience using Terraform.
• Strong observability and CI/CD practices (metrics, tracing, alerting, GitHub Actions, ArgoCD, Jenkins, etc.).
• Experience with SQL and NoSQL databases (PostgreSQL/Aurora, Cassandra, DynamoDB, or MongoDB).
• Active practitioner of AI-assisted engineering tools (GitHub Copilot, Cursor, Claude, ChatGPT, or similar) to improve workflows and team productivity.

AI Tools Adoption (Required):
• Demonstrated use of AI-assisted development tools such as GitHub Copilot, Cursor, Claude, or ChatGPT in day-to-day DevOps, SRE, or infrastructure work.
• Ability to describe concrete ways AI tools have improved engineering velocity, reduced operational toil, or improved incident response.
• Interest in evaluating and piloting new AI-driven tooling for infrastructure automation, alert triage, and operational workflows.

Nice-to-Have:
• Experience with GPU infrastructure and ML workload orchestration (e.g., Slurm, Ray) and performance tuning.
• Familiarity with agentic AI frameworks (LangChain, LangGraph, CrewAI, AutoGen) and RAG/vector DB patterns.
• Exposure to LLM monitoring and evaluation tools (Langfuse, Braintrust, or equivalents).
• Knowledge of ERP ecosystems (SAP, Oracle, Workday, Coupa, Jaggaer) relevant to enterprise integrations.
• Relevant AWS certifications (Solutions Architect, DevOps Engineer, SysOps Administrator).
• Experience with secure agent runtimes: container isolation, sandboxing, or virtualization for safe tool execution.

What We Offer:
• High-impact role at a category-defining AI company, building the infrastructure that powers AppZen’s next generation of autonomous finance agents.
• Competitive compensation, including base salary, annual performance bonus, and equity.
• Modern cloud-native stack with autonomy to choose tools and shape platform architecture.
• Collaborative, high-trust engineering culture that values reliability and AI-first development.