Glasgow City
Job Title: AWS Site Reliability Engineer (Data Platform)

Role Summary
We are looking for an AWS Site Reliability Engineer (SRE) to support and scale a cloud-native data platform built on AWS, Snowflake, and Databricks. The role focuses on driving reliability through automation, disaster recovery (DR) testing, resiliency engineering, observability, and proactive SLO/SLI/SLA management.

Key Responsibilities
• Design, build, and maintain automation for infrastructure provisioning, platform operations, and incident response using IaC and CI/CD.
• Lead resiliency and disaster recovery planning, including regular DR drills, failure testing, and recovery validation across AWS and data platform components.
• Define, implement, and manage SLIs, SLOs, and SLAs for critical data pipelines and platform services; use error budgets to guide reliability improvements.
• Build and operate robust observability solutions (metrics, logs, traces, alerts) for AWS services, Snowflake, and Databricks workloads.
• Partner with data engineering and platform teams to embed reliability-by-design into architecture and delivery practices.
• Perform root cause analysis (RCA) and drive continuous improvement to reduce toil and improve platform availability and performance.

Skills and Experience
• Practical knowledge of SRE principles, including SLO/SLI/SLA design and error budgets.
• Strong experience with AWS (e.g., EC2, S3, IAM, VPC, CloudWatch) in production environments.
• Experience with observability tools and monitoring/alerting best practices.
• Hands-on experience with automation and IaC (Terraform, CloudFormation, CDK) and scripting (Python, Bash).
• Experience running DR tests, chaos engineering, or resiliency testing in cloud environments.
• Familiarity with CI/CD pipelines and GitOps practices.
• Background supporting large-scale data or analytics platforms.