Platform Engineer (Mid-Level) - Fintech
5 days ago
Miami
Job Description

Career Renew is recruiting a Platform Engineer (Mid-Level) - Fintech for one of its clients. This is a fully remote role for candidates based in the US, Latam, or Europe, as long as they can work EST hours.

Build and operate a secure, reliable, and developer-friendly platform on AWS and Kubernetes. You’ll drive GitOps with ArgoCD, Infrastructure as Code with Terraform/Terragrunt, and reusable GitLab CI pipelines. You’ll partner with product teams on platform-related issues and support advanced IT initiatives beyond standard help desk scope (e.g., Tailscale VPN design and operations). Bash is required; Python is a plus.

What you’ll do:

AWS infrastructure and IaC
* Provision and manage AWS infrastructure using Terraform and Terragrunt with modular patterns, environment overlays, remote state, and CI validation.
* Implement multi-account foundations, IAM (including IRSA), VPC networking, ALB/NLB, Route 53, ACM, and baseline security guardrails.
* Establish drift detection, tagging standards, cost visibility, and budgets with actionable alerts.

Kubernetes and GitOps
* Operate EKS clusters (lifecycle, upgrades, scaling, node groups) with strong change control.
* Use Amazon EKS Auto Mode for compute provisioning (Karpenter-backed) and tune capacity types, priorities, and disruption policies.
* Manage workloads via ArgoCD (app-of-apps, sync waves, health checks) and maintain Helm charts/values schemas for services and platform add-ons.
* Standardize core add-ons: AWS Load Balancer Controller, ExternalDNS, Metrics Server, External Secrets.

CI/CD and developer experience
* Build reusable GitLab CI pipelines using templates, includes, rules, needs, caching, and artifacts with minimal duplication.
* Implement environment promotions, trunk-based workflows, and ephemeral Review Apps where appropriate.
* Operate and optimize GitLab Runners (autoscaling, caching, security for protected branches) to keep feedback fast.

Security, reliability, and observability
* Shift-left security with IaC and container scanning, image policies, and signed artifacts where applicable.
* Implement secrets management via External Secrets with AWS Secrets Manager/SSM.
* Deliver metrics, logs, and tracing with Datadog; define SLOs and actionable alerting.
* Participate in incident response and on-call rotations; drive post-incident remediation and improvements.

Partnering and enablement
* Serve as a partner for product teams on infra, CI/CD, and Kubernetes issues; triage and resolve platform incidents.
* Publish clear docs, internal workshops, and templates that enable self-service and consistent delivery.

How you’ll work
* Treat infrastructure and environments as code with Git-based workflows and reviews.
* Use GitOps for cluster and app config with clear promotion paths across environments.
* Keep CI/CD DRY with shared templates and fast, reliable pipelines.
* Maintain paved paths: golden Helm charts, environment baselines, and up-to-date runbooks.
* Operate observability-first with SLOs and continuous improvement after incidents.

Tools you’ll use
* AWS (EKS, IAM, VPC, ALB/NLB, Route 53, S3, CloudWatch)
* Terraform, Terragrunt
* Kubernetes (EKS), Helm
* ArgoCD
* GitLab CI and GitLab Runners
* Docker and ECR (or equivalent registry)
* Datadog
* ExternalDNS, External Secrets (AWS Secrets Manager/SSM)
* Bash; Python or Go a plus
* Tailscale

Minimum qualifications
* 3–6 years in DevOps/Platform/SRE roles operating production systems.
* Strong Terraform and Terragrunt experience (module design, environment orchestration, CI validation).
* Kubernetes (preferably EKS) operations and Helm proficiency.
* Hands-on ArgoCD and GitOps workflows.
* GitLab CI experience building reusable pipelines with templates and includes.
* Proficient in Bash; Python is a plus.
* Solid Linux and networking fundamentals (TCP/IP, DNS, HTTP, TLS).
* Practical AWS experience: EC2, EKS, IAM, VPC, ALB/NLB, Route 53, S3, CloudWatch.

Success looks like

30 days
* Access and environment readiness completed; local tooling configured (AWS, kubectl, Helm, GitLab).
* Architecture and workflows understood: EKS + ArgoCD deployment model, GitLab CI patterns, Terraform/Terragrunt repo layout.
* Shadowed platform on-call and common runbooks; able to follow deploy/rollback and ArgoCD sync flows.
* First contributions landed:
  * Minor Helm values change or chart fix merged and deployed to a non-prod environment via GitOps.
  * Small GitLab pipeline improvement (e.g., cache, rules, or template include) merged in one service.
  * Documentation updates to clarify onboarding steps or a common troubleshooting path.
* Partnered with a developer to resolve a CI/CD or containerization issue end-to-end.
* Tailscale familiarity established: reviewed ACLs, SSO setup, and runbooks; completed a supervised ACL or device access change.

60 days
* Owns a small platform enhancement from design to rollout:
  * Publishes a reusable GitLab CI template (e.g., build/test/scan or image caching) and helps 1–2 teams adopt it.
  * Introduces plan-only GitLab MR pipelines for a subset of Terraform/Terragrunt stacks, with artifacts and reviewer gates.
* Delivers a non-prod change independently through GitOps that touches multiple components (e.g., platform add-on config via Helm/ArgoCD).
* Handles a rotation of developer support tickets unassisted (pipelines failing, Helm chart quirks, EKS access), with clear comms and root-cause notes.
* Improves at least one runbook or dashboard that accelerates triage for recurring issues.
* Executes a Tailscale change safely (ACL refinement, subnet router validation, or MagicDNS entry) and documents the rationale and rollback.

90 days
* Recognized as the primary contact for 1–2 product teams on platform topics; drives issues from intake to resolution without hand-holding.
* Terraform/Terragrunt CI expanded:
  * Consistent MR plan across targeted stacks and environments, with formatting/linting and state hygiene.
  * Non-prod apply path landed with approvals and change records; production path scoped and scheduled.
* Reusable GitLab pipeline patterns adopted by multiple services; duplication reduced and feedback time improved.
* Leads a small reliability improvement (e.g., noisy alert reduction, flaky job fix, or ArgoCD app health check tuning) with measurable impact.
* Ships a modest platform change with change management (e.g., EKS add-on version bump, ALB/Ingress tweak, or External Secrets integration for a service).
* Tailscale operational comfort demonstrated: independently handles a user connectivity or access case and updates docs for future self-service.

Compensation & Package
* Health Stipend
* Office Stipend
* Laptop
* Unlimited Leave Options