Senior Site Reliability Engineer
20 hours ago
Jersey City
Job Description

F2OnSite is seeking a Senior Site Reliability Engineer. Please send a resume only if you meet the requirements. Send resumes to . Thank you!

Role Overview: As a Senior Site Reliability Engineer, you'll bring software engineering practices to operations: building the reliability framework, defining Service Level Objectives (SLOs), and automating toil away. You'll own the health and performance of container platforms (EKS and OpenShift), middleware platforms (Kafka, Redis), and the CI/CD and observability pipelines that power modern, distributed applications.

Key Responsibilities:
• Platform Operations:
  • Administer and optimize Kubernetes clusters on Amazon EKS and Red Hat OpenShift.
  • Manage platform lifecycle, upgrades, scaling, and security controls.
• Middleware Management:
  • Operate and tune event platforms like Apache Kafka.
  • Administer in-memory data stores like Redis Enterprise clusters.
  • Administer and maintain the 3scale API Gateway platform.
• Automation:
  • Fine-tune Infrastructure-as-Code (IaC) pipelines and platform components.
  • Automate manual operations through IaC and configuration management tools/platforms.
• Observability & Instrumentation:
  • Design and implement monitoring dashboards and alerts with Prometheus, Grafana, the ELK stack, and Splunk.
  • Instrument distributed Java, Node.js, and Python apps: embed tracing, metrics, and logs at code level to meet SLOs.
• Reliability Engineering:
  • Define SLIs/SLOs and manage error budgets, using data-driven insights to balance reliability and feature velocity.
  • Lead on-call rotations and incident response, and conduct blameless root cause analysis to drive continuous improvement.
• Performance & Capacity:
  • Forecast and right-size resource usage across clusters and middleware.

Requirements:
• 12+ years of overall industry experience.
• 6+ years in SRE, DevOps, Platform, or Production Engineering roles.
• EKS and/or OpenShift administration certification (CKA, AWS Certified Kubernetes Administrator, Red Hat Certified OpenShift Administrator, or equivalent).
• Hands-on experience with Kubernetes internals, networking, Helm charts, and Operators.
• Middleware expertise: deploying, scaling, and securing Kafka and Redis clusters.
• Strong IaC toolchain experience: Helm, ArgoCD, Terraform, Ansible, or equivalent tools/platforms.
• Observability mastery: Prometheus, Grafana, ELK/Splunk, or equivalent tools/platforms.
• Enforce container security and policy governance using tools like OPA/Gatekeeper and Kyverno, and scanners such as Trivy, Clair, and Snyk, integrated with CI/CD and admission controls for automated compliance.
• Implement Kubernetes network segmentation using NetworkPolicy and/or Calico, securing east-west traffic and minimizing blast radius to protect service reliability.
• Programming/scripting proficiency in Python, shell scripting, Groovy, or a similar automation language.
• Demonstrable experience instrumenting distributed applications (Java, Node.js, Python) with metrics, logs, and tracing libraries.
• Proven track record of running large-scale production systems with minimal downtime.
• Service mesh experience (Istio, Linkerd).
• Chaos engineering foundations (Chaos Monkey, LitmusChaos).
• Familiarity with security/compliance in regulated environments.

• You'll be the architect of reliability guardrails, building automation and pipelines that free developers and engineers from manual ops.
• You'll define and enforce SLO-driven releases, leveraging error budgets to strike the right balance between innovation and uptime.
• You'll own end-to-end instrumentation: from container runtime metrics through Kafka-backed event flows to application-level traces in code.

Company Description

F2OnSite is the fastest growing IT field services company in the United States, with hundreds of employee technicians in over 40 states. F2OnSite performs service on computers, printers, point of sale systems, servers, and other hardware technologies, including installations, migrations, deployments, and break/fix. Learn more at F2onsite.com.

WHAT WE DO: Our focus is hardware: desktops, laptops, servers, printers, POS systems, and LCDs. We have hundreds of team members across the US who work onsite at customer locations, providing hardware break/fix services, migrating data, installing computers, moving printers, and installing and fixing servers and POS systems. We close thousands of service calls each week and do whatever it takes to get our customers up and running again. We specialize in all types of technology, projects, desktop support, and more.