Acquire-Site Reliability Engineer
2 days ago
Denver
Job Description:

Site Reliability Engineer

About Acquire Learning

Acquire Learning is a learning management platform built specifically for ABA (Applied Behavior Analysis) therapy. Clinicians and behavior technicians use Acquire every day while working with clients on the autism spectrum, and the data captured in the platform shapes real treatment decisions.

We are a small, product-focused team building in a HIPAA-regulated environment. Reliability is not a checkbox here. When a clinician is mid-session with a child, the platform working and behaving predictably is the difference between productive therapy and a disrupted session. Your work will directly affect the communities we serve.

About the Role

Acquire Learning is hiring its first dedicated Site Reliability Engineer. This is a mid-level role with a clear path to Lead SRE at Acquire as the company grows. We are looking for someone with real-world SRE, DevOps, infrastructure, and production-engineering experience who is ready to take meaningful ownership now and grow into the person responsible for how reliability works at Acquire.

You will report to the Lead Engineer and collaborate regularly with our CEO and CTO. You will not be inheriting a mature SRE team or a thick runbook library. You will help build them. From day one, you will be the person most focused on keeping the production environment healthy, the release pipeline trustworthy, and the reliability surface of the codebase improving, while also raising release-quality risk clearly and acting as the customer-facing escalation point when production issues need investigation.

This role is hands-on across both infrastructure and code. In the early days, you will:

- Run and improve our deploy pipelines, observability stack, and infrastructure as code.
- Triage and respond to production alerts and customer-reported issues.
- Collaborate with engineering on the reliability and architecture areas of the codebase in Node.js and TypeScript: repair scripts, migrations, observability instrumentation, index management, job lifecycle, deploy tooling, e2e and release automation.
- Own release-quality signals and partner with engineering on the QA tooling and test automation that protect clinicians from regressions.

This is intentionally a broad role today, because Acquire needs someone who can move across infrastructure, code, and release reliability with judgment. As the company and team grow, the role narrows toward Lead SRE: setting reliability strategy, owning incident response, and shaping how operations, observability, and release engineering work at Acquire.

Where We Need Help

We have a real product in production, real clinical workflows depending on it, and a small team carrying a lot of operational work. These are some of the areas where we need stronger ownership over the next stage:

- HIPAA-aware operations: strengthening PHI-safe logging, audit trails, production-data handling, access controls, incident evidence, vendor/tooling review, and the runbooks that support a regulated healthcare environment.
- Multi-tenant architecture: helping us operate safely as Acquire grows beyond a single internal deployment, including tenant isolation, tenant-scoped diagnostics, organization-safe migrations, repair scripts, alerts, and support workflows.
- Disaster recovery and restore confidence: improving backup verification, Atlas snapshot / point-in-time restore confidence, rollback procedures, break-glass paths, and recovery drills.
- Data integrity operations: making migrations, index changes, repair scripts, and production diagnostics safer to run, easier to audit, and harder to misuse under pressure.
- Security and production access hygiene: tightening IAM, secrets management, least-privilege access, deploy permissions, dependency monitoring, and the boundaries around who can touch production systems.
- Incident response maturity: defining practical severity levels, SLOs, alerting standards, post-incident follow-up, and the difference between "known noisy" and "wake somebody up."
- Scale and cost visibility: keeping an eye on AWS and MongoDB costs, slow queries, index health, capacity planning, and performance regressions before they turn into customer-facing problems.

What You'll Do

Production reliability and incident response

- Own day-to-day production health across our AWS environment, MongoDB Atlas clusters, and supporting infrastructure.
- Triage and respond to production alerts (CloudWatch, Sentry, Google Chat ops-alerts channel), including first-pass investigation, root-cause analysis, and incident communication.
- Maintain and extend our observability stack: CloudWatch metric filters, SNS alert routing, Sentry projects across web and native apps, and the alerting templates that feed our ops channels.
- Lead incident retros for the issues you own and turn them into runbooks, alerting improvements, and follow-up work.
- Help close gaps in HIPAA-aware observability and production readiness, including PHI-safe logging, auditability, access controls, incident evidence, and regulated-environment runbooks.

DevOps and release engineering

- Operate and improve our deploy pipelines (GitHub Actions across backend, webapp, and native app), including release coordination, migration verification, and post-deploy validation.
- Maintain and extend Terraform-managed infrastructure, including authoring infrastructure changes, reviewing infrastructure pull requests for safety, and improving module boundaries.
- Improve CI signal quality: test reliability, build times, environment parity, and release-readiness checks.
- Coordinate native mobile app releases through TestFlight and Google Play Console, including release verification on real devices.
- Improve rollback, restore, and break-glass procedures so production incidents have a clear path from detection to recovery.

Reliability-focused codebase work (TypeScript / Node.js)

This is real engineering work, scoped to the reliability and architecture areas of the codebase, done in collaboration with engineering. Examples of work that lives here:

- Authoring and maintaining break-glass repair scripts under documented procedures.
- Writing data migrations and keeping the migration audit and index management surfaces clean.
- Extending observability and logging instrumentation across backend and frontend (CloudWatch metric filters, structured logging, Sentry breadcrumbs, alert template wiring).
- Hardening tenant-aware operational tooling so diagnostics, repair scripts, migrations, and alerts stay safely scoped by organization.
- Maintaining background job lifecycle and tombstone hygiene.
- Improving deploy and release tooling.
- Extending e2e (Playwright) and integration test coverage in the areas of the product where regressions hurt clinicians most.
- Reviewing pull requests in the reliability and infrastructure areas of the codebase.

You will not be expected to ship product features. You will be expected to be a credible collaborator with engineering on the reliability and architecture surfaces of the code, and to grow that ownership over time.

Release-quality and QA partnership

- Own release-gate health: validate the release candidate, walk core clinical workflows on web, iOS, and Android, and raise risk clearly before code reaches clinicians and technicians in the field.
- Maintain and run our existing automated test suites (Playwright, Jest, Postman or comparable tools) as part of the release process.
- Partner with engineering to extend automation coverage in the areas of the product where regressions hurt clinicians most.
- Drive practical release checklists and post-release verification.
- Use AI tools thoughtfully on QA and release work: test drafting, log triage, regression idea generation, support-case investigation, and workflow automation.

Customer-facing support escalations

- Act as the first internal point of contact for customer-reported issues that reach our support inbox or escalation channels.
- Reproduce, isolate, and document reported issues clearly enough that engineering can act quickly.
- Triage support cases by urgency and clinical impact, and own the loop back to the customer through resolution.
- Build and maintain internal support runbooks for the most common workflows and recovery scenarios.

Who You Are

- You have 3-5 years of professional experience that meaningfully includes SRE, DevOps, production engineering, platform engineering, or infrastructure work.
- You are comfortable in cloud environments (AWS strongly preferred) and have real experience operating production systems, not just provisioning them.
- You have meaningful exposure to CI/CD pipelines, Terraform or comparable IaC, MongoDB or another production database, and observability tooling (CloudWatch, Datadog, Sentry, Grafana, or similar).
- You are comfortable reading and writing Node.js and TypeScript at the level needed for operational and reliability work: repair scripts, migrations, instrumentation, job lifecycle, deploy tooling, and integration / e2e test code. You are not expected to ship feature controllers; you are expected to be a credible engineering collaborator on reliability and architecture work.
- You are comfortable on the command line, in production logs, and in cloud consoles, and can talk through how you have used these in real incidents.
- You can run a deploy, verify it landed cleanly, and recognize when something is off.
- You have experience working production incidents or customer-facing escalations in a real product environment.
- You can reproduce, document, and prioritize an issue clearly enough that an engineer can act on it immediately.
- You think carefully about production access, customer data, tenant boundaries, and the difference between a helpful diagnostic and an accidental data leak.
- You are pragmatic about reliability and quality: you know when to investigate deeper, when to escalate, when to automate, and when to ask better technical or product questions.
- You are excited by a broad role today and a clearer Lead SRE trajectory tomorrow.
- You are comfortable on a small team where some process already exists, some process needs to be created, and everyone stays close to the product.
- You care about the mission and understand that reliability and quality issues in this product can affect real clinical work.

Using AI Tools Practically

We use AI tools where they genuinely help. You do not need to be an AI expert, and we are not looking for someone who treats AI as a substitute for judgment. We are looking for someone who is comfortable experimenting with tools like Claude, Cursor, or similar systems to move faster on operational work while still checking the result like an engineer. That might mean:

- Scanning production logs to triage an alert or support case.
- Drafting or refining repair scripts, runbooks, and automated tests.
- Investigating customer-reported issues against application logs and metrics.
- Generating release-day verification checklists.
- Summarizing bug patterns, alert noise, or release risk.
- Authoring or reviewing infrastructure and reliability code changes.
- Automating repetitive operational, release, and support workflows.

Practical use matters more than hype. If a tool helps you find the sharp edge faster, great. If it creates uncertainty, you should know when to slow down and verify.

Bonus Points

- Experience in healthcare, HIPAA-regulated environments, education technology, clinical software, or another regulated industry.
- Experience with multi-tenant SaaS systems, especially tenant isolation, tenant-scoped logging, support tooling, migrations, or data repair workflows.
- Experience with disaster recovery, restore drills, incident command, SLOs, or production access programs.
- Familiarity with AWS (ECS, CloudFront, IAM, CloudWatch, SNS), MongoDB Atlas, Terraform, GitHub Actions, Node.js, TypeScript, Playwright, Sentry, or ClickUp.
- Experience supporting React, React Native, or Node.js / Express applications in production, including working in monorepos (pnpm, Turborepo, or comparable).
- Experience coordinating native mobile app releases through TestFlight and Google Play Console.
- Experience helping introduce SRE, DevOps, observability, or QA practices on a small team or early-stage product.
- A prior step from senior IC operations / DevOps work into broader reliability ownership.

The First Year and the Path to Lead SRE

In the first few months you will learn the product, the infrastructure, and the highest-impact reliability surfaces. You will take over ownership of the alert response loop, become a reliable second pair of hands on deploys and incidents, build out an internal support runbook, and start contributing to the reliability and architecture areas of the codebase under engineering collaboration. Over the first year, you will take increasing ownership of:

- Reliability strategy and incident response.
- Observability, alerting, and post-incident hygiene.
- Release engineering, deploy tooling, and release-quality signals.
- The reliability and architecture areas of the codebase, including repair scripts, migrations, instrumentation, and job lifecycle work.
- The operational and QA playbooks that keep Acquire trustworthy as it scales.

The goal is to grow into Lead SRE at Acquire: setting how reliability, operations, and release engineering work as the company and team grow.
Compensation will be revisited as the role and ownership scope grow into that trajectory.

Compensation & Benefits

- Base salary: $90,000-$105,000, with compensation expected to be revisited as the role grows into Lead SRE ownership.
- Semi-annual performance reviews with potential for raises.
- Location: Denver, CO
- Hybrid schedule: 3-4 days per week in office
- Health benefits
- PTO
- Additional benefits