Site Reliability Engineer
5 months ago
Falls Church
Job Description
Tax Analysts is seeking a Site Reliability Engineer (SRE) to help establish and shape our reliability engineering practice from the ground up. This is a unique opportunity to join a mission-driven organization and play a key role in ensuring the reliability, scalability, and performance of our AWS-hosted business applications. As part of a cross-functional engineering team, you will work to improve observability, automate operational processes, and lead incident response and continuous improvement efforts. This role is ideal for a mid-level engineer with cloud and software engineering experience who is eager to deepen their expertise in site reliability engineering, learn from senior staff, and help build a culture of reliability.

ESSENTIAL DUTIES AND RESPONSIBILITIES:
• Help define and implement service-level indicators (SLIs) and service-level objectives (SLOs) for cloud-based applications.
• Build, configure, and maintain monitoring, alerting, and dashboarding solutions using AWS CloudWatch, X-Ray, and third-party tools such as DataDome.
• Leverage advanced AWS observability tools (e.g., CloudWatch Synthetics, Contributor Insights) to proactively monitor system health.
• Contribute to the development and implementation of a structured on-call support process as our reliability practice evolves.
• Implement, monitor, and maintain site protection and bot mitigation solutions, including DataDome, to defend against automated attacks and ensure application availability, and analyze their performance during incident postmortems.
• Investigate and resolve incidents, security events, and operational anomalies; perform root cause analysis; and run the postmortem process.
• Identify repetitive or manual operational tasks (‘toil’) and design scripts or automations using AWS Lambda and CloudFormation to improve efficiency and reliability.
• Assist in the maintenance and enhancement of CI/CD pipelines and automated deployment processes.
• Work closely with development,
QA, cloud, and DevOps teams to ensure reliability, scalability, and security are integrated into system and application designs.
• Contribute to the documentation of systems, processes, incident learnings, compliance, and reliability best practices.
• Stay current with emerging AWS, SRE, and observability technologies, and make recommendations to adopt new tools or approaches that improve system resilience and operational excellence.
• Participate in the evaluation and rollout of new AWS services and features that can benefit system reliability or team efficiency.
• Perform other related duties as assigned to support the team and organizational objectives.

KNOWLEDGE & SKILLS:
• Strong analytical, troubleshooting, and problem-solving abilities.
• Hands-on experience with AWS CloudWatch (metrics, logs, dashboards, alarms) for proactive monitoring and alerting.
• Familiarity with AWS X-Ray for distributed tracing and in-depth troubleshooting of microservices architectures.
• Experience leveraging tools like CloudWatch Synthetics and Contributor Insights for canary testing and log analytics.
• Knowledge of AWS CloudTrail for auditing and investigating API calls and security events.
• Experience using AWS Athena for ad-hoc querying and analysis of logs during incident investigations and postmortems.
• Proficiency with AWS CloudFormation for reliable and repeatable infrastructure provisioning.
• Experience automating operational tasks and workflows using AWS Lambda or similar event-driven services.
• Understanding of AWS services such as API Gateway, CloudFront, and Elastic Load Balancer (ELB) to ensure availability, scalability, and optimal performance of distributed systems.
• Experience working with site protection and bot mitigation solutions (such as DataDome or Cloudflare).
• Working knowledge of scripting or programming languages such as Python, Bash, or Node.js for automation and tooling.
• Excellent communication and documentation skills; ability
to collaborate effectively with cross-functional teams.
• Eagerness to learn and adopt new tools, technologies, and best practices in cloud reliability and operations.

Requirements
• Bachelor’s degree in computer science, engineering, or a related field; equivalent professional experience considered.
• 3+ years of professional experience in cloud engineering, DevOps, infrastructure, or observability roles (AWS required).
• Experience implementing SRE principles (prior work in an SRE role is a plus).
• Experience with monitoring, incident response, or reliability work in a production environment.
• Experience working in an Agile development environment, collaborating within cross-functional teams.

Benefits
• Health/Dental/Vision
• 401(k): immediately vested
• Tuition assistance
• Qualified employer under the Public Service Loan Forgiveness (PSLF) program
• Generous Paid Time Off
• Dog-friendly office
• Private gym onsite
• Medical, Dental, Vision Insurance
• Health Savings Account (HSA)
• Flexible Spending Account (FSA)
• Employee Assistance Program (EAP)
• Life and AD&D Insurance
• Disability Insurance
• Pet Insurance
• Trade Publication/News Subscription Reimbursement
• Exercise Room
• Paid Holidays
• Vacation and Sick Leave
• Parental Leave

Tax Analysts is an Equal Employment Opportunity Employer.