Principal DevOps Engineer
2 days ago
San Jose
Job Description

AppZen is the leader in autonomous spend-to-pay software. Its patented artificial intelligence accurately and efficiently processes information from thousands of data sources so that organizations can better understand enterprise spend at scale to make smarter business decisions. It seamlessly integrates with existing accounts payable, expense, and card workflows to read, understand, and make real-time decisions based on your unique spend profile, leading to faster processing times and fewer instances of fraud or wasteful spend. Global enterprises, including one-third of the Fortune 500, use AppZen’s invoice, expense, and card transaction solutions to replace manual finance processes and accelerate the speed and agility of their businesses.

At AppZen, we value candidates who are actively using AI tools to enhance productivity, automate repetitive tasks, and solve problems more efficiently. Across all roles, we are looking for team members who leverage AI in meaningful ways to drive impact in their work. To learn more, visit us at .

As Principal DevOps Engineer, you are the most senior individual contributor on the team. You set the technical direction, own the hardest infrastructure and reliability problems end-to-end, and lift the entire org through architecture, code, design reviews, and mentorship. You partner closely with the DevOps Manager and engineering leadership on roadmap and standards, but your scorecard is technical outcomes, not headcount. Expect roughly 70-80% deep hands-on engineering (Terraform, Kubernetes, Postgres, Elasticsearch, pipelines, incident command) and 20-30% technical leadership: design reviews, mentorship, cross-team alignment, and writing the standards others build against.
What You'll Do:

Set Technical Direction
* Own the architecture for AppZen's cloud platform: AWS topology, Kubernetes design, datastore strategy, CI/CD, and observability; make the long-horizon calls and write the design docs the rest of engineering builds against.
* Lead deep design reviews; set bar-raising standards for reliability, security, performance, and cost across infrastructure code and production systems.
* Identify the highest-leverage platform investments (toil reduction, reliability, developer velocity) and drive them from idea to rollout.

Run Reliable, Secure Cloud Infrastructure
* Drive AWS architecture and operations across multiple regions and accounts; own multi-account landing-zone, IAM, and network patterns.
* Set the Terraform module and IaC patterns the team uses; lead the hardest migrations and cleanups personally.
* Partner with Security on SOC 2, ISO 27001, GDPR, and customer audit requirements; design controls for IAM, network, and secrets management.
* Drive cloud cost engineering: visibility, forecasting, and optimization (Savings Plans, rightsizing, multi-tenant efficiency).

Operate and Scale Critical Production Datastores
* Be the team's go-to expert on PostgreSQL in production: schema and index strategy, query tuning, vacuum/bloat, replication, failover, point-in-time recovery, and major-version upgrades on RDS / Aurora.
* Own scaling and reliability of Elasticsearch / OpenSearch: shard and index design, JVM/heap tuning, snapshot strategy, hot-warm tiers, and incident response under heavy ingest or query load.
* Set patterns for supporting datastores: Redis (caching, queues), Kafka or SQS/SNS (streaming and async), and S3-backed data lakes, including HA, durability, and disaster recovery.
* Lead capacity planning, performance benchmarking, data-tier cost optimization, backup/restore drills, and customer data isolation for multi-tenant workloads.

Evolve the Kubernetes and Container Platform
* Own the architecture of our EKS-based Kubernetes platform: cluster lifecycle, autoscaling, multi-tenancy, and workload isolation.
* Define the golden paths service teams use (Helm, Kustomize, and GitOps tooling such as ArgoCD or Flux) and personally build the trickiest pieces.
* Set patterns for service mesh, ingress, and zero-downtime deployments.

Own CI/CD and the Developer Platform
* Architect internal developer platform capabilities so product teams ship safely and quickly without infra friction.
* Drive the design of build, test, and deploy pipelines (e.g., GitHub Actions, Jenkins, ArgoCD); enforce supply-chain security and artifact provenance.
* Set the bar for DORA metrics (lead time, deploy frequency, change failure rate, and MTTR) and own the highest-impact improvements.

Drive Observability and SRE Practice
* Architect the observability stack (e.g., Datadog, Prometheus, Grafana, OpenTelemetry); define metrics, logs, and tracing standards across services.
* Define and operationalize SLOs and error budgets in partnership with service owners.
* Act as incident commander for high-severity events; lead blameless post-mortems and convert learnings into durable systemic fixes.

Multiply the Team
* Mentor senior and staff engineers; raise the bar through code and design reviews, pairing, and writing the reference docs and runbooks others learn from.
* Represent Cloud Engineering in cross-team forums; influence Product Engineering, Security, and Data on architecture and reliability decisions without authority.
* Help the DevOps Manager hire: calibrate the technical bar, design interview loops, and close senior candidates.

What You Bring:
* 10+ years of experience in DevOps, SRE, infrastructure, or platform engineering, with at least 3 years operating at a Staff or Principal level (or equivalent technical leadership scope).
* Deep, hands-on AWS expertise across compute, networking, IAM, data, and observability services; demonstrated ownership of multi-account, multi-region SaaS architectures.
* Strong production experience with Kubernetes (preferably EKS), including upgrades, autoscaling, and securing multi-tenant clusters.
* Demonstrated hands-on operations experience with PostgreSQL at scale (query and index tuning, replication, HA/failover, backups, and version upgrades) and with Elasticsearch / OpenSearch (cluster sizing, shard strategy, ingest tuning, and incident response).
* Working knowledge of additional datastores commonly used in SaaS: Redis, Kafka or other message brokers, and object storage; comfortable evaluating trade-offs between managed services (RDS, Aurora, ElastiCache, MSK, OpenSearch Service) and self-managed options.
* Expert with Terraform and modern IaC patterns; clear opinions on module design, state management, and PR-driven workflows.
* Strong scripting and automation skills in at least one of Python, Go, or Bash; comfortable contributing real code, not just reviewing.
* Track record of designing and operating CI/CD pipelines at scale (GitHub Actions, Jenkins, ArgoCD, or similar).
* Experience running production workloads under SOC 2 or comparable compliance frameworks; comfortable partnering with Security on audits and remediation.
* Demonstrated technical leadership without formal authority: writing decision-grade design docs, mentoring engineers, and influencing across teams. You enjoy lifting others through your work.

Nice To Have:
* Experience supporting AI/ML or data-heavy SaaS workloads (GPU fleets, vector stores, large async pipelines).
* Familiarity with service mesh (Istio, Linkerd) and progressive delivery (Argo Rollouts, feature flags).
* Background scaling FinOps practices and managing cloud spend at $5M+ annual run-rate.
* Experience operating multi-tenant SaaS with strict data isolation requirements for enterprise finance customers.
* Exposure to multi-cloud or hybrid-cloud environments (Azure, GCP).
* Open-source contributions, conference talks, or internal tech-leadership artifacts (eng wikis, RFCs, paved-road frameworks).

AppZen is committed to fair and equitable compensation practices. The base pay range for this role is posted above. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to skill set, depth of experience, certifications, and specific work location. This may be different in other locations due to differences in the cost of labor. The total compensation package for this position may also include an annual performance bonus, stock, benefits, and/or other applicable incentive compensation plans.

We are an equal opportunity employer and value diversity. All employment is decided on the basis of qualifications, merit, and business need. You can find our Privacy Notice linked at the bottom of our appzen.com website.

We may use artificial intelligence tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans.