AWS Observability or Grafana Architect
2 days ago
Warren
Location: Warren, NJ (Onsite)

We are seeking a highly skilled AWS Observability Architect with deep, hands-on expertise in designing and implementing enterprise-grade observability platforms on AWS, with Grafana as the primary observability tool and OpenTelemetry as the instrumentation standard. This is a technical specialist role requiring genuine implementation experience, not platform familiarity.

The ideal candidate has personally architected and delivered large-scale observability solutions for production AWS environments: building telemetry pipelines, designing dashboards that operations teams actually use, and creating alerting frameworks that reduce MTTR rather than add noise. You understand the full observability stack, from application instrumentation with OpenTelemetry SDKs through to Grafana dashboards consumed by SREs, on-call engineers, and engineering leadership.

This role sits at the intersection of cloud infrastructure, software engineering discipline, and operational excellence. It requires someone who can design an enterprise observability architecture in the morning, write a Grafana dashboard query in the afternoon, and advise a development team on OpenTelemetry instrumentation strategy the next day.

Key Responsibilities

Observability Architecture & Strategy
• Define and own the enterprise observability architecture for AWS environments, establishing the target-state design across the four pillars of observability: metrics, logs, traces, and events.
• Design end-to-end telemetry pipelines, from instrumentation at the application and infrastructure layer through collection, processing, storage, and visualisation, with Grafana as the enterprise observability platform.
• Develop observability standards and reference architectures that define how AWS workloads across compute (EC2, EKS, ECS, Lambda), storage, networking, and managed services should be instrumented, collected, and visualised consistently across the organisation.
• Establish signal-to-noise discipline across the observability platform by designing alerting frameworks that surface actionable signals, eliminate false positives, and ensure on-call engineers are alerted only when human intervention is genuinely required.
• Define observability maturity roadmaps for client environments: assess current-state coverage, identify gaps, and build a phased improvement plan from reactive monitoring to proactive, AIOps-ready observability.
• Drive FinOps for observability by optimising telemetry data volumes, retention policies, and Grafana Enterprise licensing costs so that the observability platform itself does not become a significant cost centre.

Grafana Enterprise Implementation
• Architect, deploy, and operate Grafana Enterprise or Grafana SaaS as the primary observability platform, including high-availability Grafana deployment on AWS (EKS-based or managed via Grafana Cloud), data source federation, RBAC configuration, and enterprise plugin management.
• Design and implement Grafana data source integrations across the AWS observability ecosystem:
    • Amazon CloudWatch: metrics, logs, and alarms as a core AWS data source
    • Grafana Mimir: scalable, long-term Prometheus-compatible metrics storage
    • Grafana Loki: cost-efficient, label-based log aggregation at scale
    • Grafana Tempo: distributed tracing storage and trace-to-log-to-metric correlation
    • Amazon Managed Service for Prometheus (AMP): AWS-native Prometheus metrics
    • Amazon OpenSearch Service: log analytics and full-text search use cases
    • Elasticsearch / OpenSearch: integration with existing log infrastructure
• Build and maintain a Grafana dashboard library covering infrastructure health, application performance, SLO/SLA tracking, capacity planning, cost visibility, incident response, and executive reporting, using reusable, variable-driven, and consistently styled templates.
• Implement Grafana alerting at enterprise scale, including alert routing, notification policies, silence management, and integration with PagerDuty, Opsgenie, ServiceNow, and Slack for multi-channel incident notification.
• Configure Grafana RBAC and team structures, designing role hierarchies, folder permissions, and data source access controls that enable self-service dashboarding for development teams while protecting sensitive operational data.
• Deploy and manage Grafana OnCall for on-call scheduling and alert routing, or integrate Grafana alerting with existing incident management platforms.
• Implement Grafana SLO (Service Level Objectives): define, track, and report error budgets across production services to enable data-driven reliability decisions.
• Manage Grafana as code, using Grafana's provisioning capabilities (YAML/JSON), the Terraform provider, and Grizzly/Grafonnet for dashboard version control, environment promotion, and GitOps-based dashboard management.

OpenTelemetry Implementation
• Define and lead the organisation's OpenTelemetry (OTel) instrumentation strategy, establishing standards for automatic and manual instrumentation across application stacks running on AWS.
• Design and deploy the OpenTelemetry Collector as the central telemetry processing layer, including:
    • Collector deployment patterns: agent (DaemonSet on EKS), gateway (centralised), and sidecar configurations
    • Receiver configuration: OTLP, Prometheus, Jaeger, Zipkin, AWS X-Ray, CloudWatch, Fluent Bit
    • Processor pipeline design: batch processing, memory limiting, attribute enrichment, tail-based sampling, and resource detection processors
    • Exporter configuration: routing telemetry to Grafana Mimir (metrics), Grafana Loki (logs), Grafana Tempo (traces), AMP, and CloudWatch
• Instrument AWS workloads with OpenTelemetry SDKs across languages (Java, Python, Node.js, Go), including auto-instrumentation for containerised EKS workloads, Lambda instrumentation using OTel Lambda layers, and ECS task definition instrumentation.
• Implement distributed tracing using OpenTelemetry: establish trace propagation standards across microservices, configure context propagation (W3C Trace Context, B3), and ensure end-to-end trace visibility from frontend to backend to database.
• Design OTel-based log correlation, enriching logs with trace IDs and span IDs to enable trace-to-log navigation in Grafana and support faster RCA during incidents.
• Implement OTel-based metric instrumentation, defining custom business and application metrics alongside system metrics and following OTel semantic conventions for consistent metric naming and attribute tagging across services.
• Define sampling strategies for distributed traces, including head-based sampling for development environments and tail-based sampling (via the OTel Collector) for production environments, balancing observability coverage with storage cost.
• Manage the OTel Collector as infrastructure, including horizontal scaling, resource limits, high-availability deployment, collector health monitoring, and pipeline performance optimisation.

AWS Observability Services Integration
• Design the integration architecture between AWS-native observability services and Grafana, positioning Grafana as the unified observability plane while leveraging AWS-native services as data sources:
    • Amazon CloudWatch: metrics, logs, alarms, dashboards, Contributor Insights, and Synthetics
    • Amazon Managed Grafana (AMG): evaluating and advising on AMG vs self-managed Grafana deployment decisions
    • Amazon Managed Service for Prometheus (AMP): remote write from the OTel Collector and Prometheus agents, recording rules, and Alertmanager integration
    • AWS X-Ray: ingesting X-Ray traces into Grafana Tempo or directly via the Grafana X-Ray data source
    • AWS CloudTrail: audit log integration for security and compliance observability
    • VPC Flow Logs: network observability integration for security monitoring and traffic analysis
• Implement infrastructure-level observability for core AWS services: EC2 (CloudWatch agent, Node Exporter via OTel), EKS (kube-state-metrics, cAdvisor, OTel DaemonSet), RDS (Enhanced Monitoring, Performance Insights), Lambda (OTel Lambda layer, custom metrics), and API Gateway (access logs, CloudWatch metrics).
• Design business and synthetic monitoring, implementing Grafana Synthetic Monitoring or CloudWatch Synthetics for endpoint availability, API health, and user journey monitoring with Grafana alerting integration.

Experience
• 10+ years of overall experience in cloud infrastructure, platform engineering, or DevOps.
• 5+ years of hands-on AWS experience in production environments, not in advisory or oversight roles.
• 3+ years of hands-on Grafana Enterprise or SaaS implementation experience: designing, deploying, and operating Grafana at enterprise scale, including the LGTM stack (Loki, Grafana, Tempo, Mimir).
• Proven experience implementing OpenTelemetry in production environments, including OTel Collector deployment, SDK-based instrumentation, and distributed tracing implementation.
• Demonstrated experience building production-grade observability pipelines, from instrumentation through collection, processing, storage, and visualisation.
• Hands-on experience with PromQL for metrics querying and alerting, including complex queries, recording rules, and alert expression design.
• Experience with LogQL (Grafana Loki) for log querying and log-based alerting.
• Hands-on experience deploying observability infrastructure on Kubernetes (EKS), including Prometheus Operator, OTel DaemonSets, Grafana deployment, and persistent storage configuration.
• Experience with Grafana as code: provisioning dashboards, data sources, and alert rules via YAML, Terraform, or Grafonnet.