Senior Platform Engineer
6 days ago
Barcelona
About Us
Axiomatic AI is building a new class of AI systems designed to reason with the rigor of the scientific method. By combining deep learning with formal logic and physics-based modeling, we create verifiable, interpretable AI systems that collaborate with and support human researchers in high-stakes scientific and engineering workflows.
Our mission, 30x30, is to deliver a 30x improvement in the speed, accessibility, and cost of semiconductor and photonic hardware development by 2030. We aim to revolutionize hardware design and simulation in these industries and are building a team of highly motivated professionals to bring these innovations from research into commercial products.

Position Overview
As a Senior Platform Engineer at Axiomatic, you will own the reliability, deployment, and operational excellence of our AI platform. This role focuses primarily on infrastructure, CI/CD, and operations, with additional responsibilities for automation and tooling development.

You Will
• Lead deployment strategies and CI/CD pipelines across multiple environments
• Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premises deployments
• Own infrastructure as code using Terraform to automate provisioning and configuration
• Build comprehensive observability systems: monitoring, metrics, logging, and alerting
• Implement security controls, compliance frameworks, and data governance policies
• Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
• Ensure system reliability, performance, and scalability
• Drive incident response, postmortems, and continuous improvement
• Troubleshoot infrastructure and application issues across multiple environments

Deployment & CI/CD
• Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
• Own the full deployment lifecycle: build, test, release, and rollback strategies
• Implement blue-green deployments, canary releases, and progressive rollouts
• Build automated deployment tooling and workflows
• Ensure zero-downtime deployments and rollback capabilities
• Optimize build and deployment performance
• Manage artifact repositories and container registries

Infrastructure & Cloud Operations
• Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
• Architect and deploy on-premises solutions for enterprise customers (Linux-based)
• Manage Kubernetes clusters, container orchestration, and networking
• Implement disaster recovery, backup strategies, and business continuity
• Optimize cloud costs and resource utilization
• Define and track SLIs, SLOs, and error budgets for critical services

Infrastructure as Code
• Write and maintain Terraform modules for infrastructure provisioning
• Implement GitOps workflows for infrastructure changes
• Automate infrastructure scaling, updates, and operations
• Ensure reproducible and version-controlled infrastructure

Observability & Monitoring
• Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
• Build dashboards for system health, performance, and business metrics
• Implement distributed tracing for microservices
• Conduct capacity planning and performance analysis
• Drive reliability improvements through data-driven insights

Security & Compliance
• Implement security best practices: identity management, secrets management, network policies
• Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
• Conduct security audits and vulnerability remediation
• Implement data governance policies for AI pipelines and user data
• Ensure compliance with data privacy regulations (GDPR, CCPA)

Automation & Tooling Development
• Write automation scripts and tools in Python for operational tasks
• Build internal tooling for deployments, monitoring, and incident response
• Develop runbooks, automation, and self-healing systems
• Create APIs for infrastructure operations when needed
• Maintain high code quality and testing standards for tooling

Reliability & Incident Management
• Participate in the on-call rotation and lead incident response
• Conduct blameless postmortems and drive action items
• Build and maintain incident response playbooks
• Improve system resilience and failure modes

Collaboration
• Partner with engineering teams on deployment strategies and architecture
• Work with the security team on compliance and governance
• Mentor engineers on operational best practices
• Document systems, procedures, and runbooks

Key Requirements
• 7+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
• Deep experience with CI/CD pipelines, release strategies, and production deployments at scale
• Hands-on experience with Azure and AWS (GCP is a plus)
• Linux system administration, bare-metal provisioning, and networking for on-premises deployments
• Expert proficiency writing and maintaining infrastructure as code with Terraform
• Proven track record building monitoring, alerting, and metrics platforms
• Experience implementing security controls and best practices; security certification preferred (CISSP, CEH, AWS/Azure Security Specialty, or similar)
• Understanding of data privacy, residency requirements, and governance frameworks
• Backend/scripting skills: Python (preferred) or Go for automation, tooling, and operational scripts
• Experience with Kubernetes and container orchestration in production
• Strong Linux/Unix administration and scripting (Bash, Python)
• Familiarity with CI/CD platforms (GitHub Actions, GitLab CI, Jenkins, or similar)
• Version control and GitOps practices
• Strong problem-solving and debugging skills
• Fluent in English (Spanish is a plus)

Nice-to-Have
• Python proficiency for automation and internal tooling
• Experience with cloud AI platforms (Vertex AI, Azure ML, AWS SageMaker)
• Service mesh experience (Istio, Linkerd) or API gateways
• Experience with GPU workloads and ML infrastructure
• FinOps and cloud cost optimization
• Compliance frameworks experience (SOC 2, ISO 27001, HIPAA, FedRAMP)
• Database operations: PostgreSQL, Redis administration
• Experience with FastAPI or similar frameworks for internal tools
• Contributions to open-source infrastructure projects
• Background in hardware or semiconductor industries

Work model & location expectations
• Hybrid work model, open to remote
• Primary location: EU time zones preferred
• On-site expectations: about 2 days per week in the office if hybrid (with flexibility); if remote, occasional travel to our Barcelona or Boston office may be required

Why join us?
At Axiomatic AI, you will be working on technology that drives innovation in AI for scientific and engineering applications in line with our 30x30 mission.
This is your opportunity to contribute to the development of new AI architectures that can reason coherently and produce interpretable and verifiable solutions. You will then see those ideas commercialized into products that will shape the future of hardware and computing, while collaborating with a global team of engineers and AI specialists. We believe in pushing the boundaries of what is possible and continuously seek to redefine AI, with a focus on formal consistency. If you're ready to take your expertise in artificial intelligence and physics to the next level, we want to hear from you!

Worried about not meeting every qualification? Studies show that women and people of color are less likely to apply for jobs unless they meet every listed requirement. At Axiomatic AI, we are dedicated to creating a diverse, inclusive, and authentic workplace. If this role excites you but your background doesn't perfectly match every qualification, we still encourage you to apply. You could be the perfect fit for this position or another opportunity with us.