Solution Architect - NVIDIA Cluster (End-to-End Design & Validation)
9 days ago
London
Job Specification: Solution Architect - NVIDIA Cluster (End-to-End Design & Validation)

Location: London (1 day per week onsite)
Travel: Occasional travel to datacenter sites outside the UK
Engagement: Contract, Inside IR35
Department: Engineering/Advanced Compute

Role Overview

We are seeking a highly skilled Solution Architect with deep experience in designing, validating, and delivering end-to-end NVIDIA GPU clusters in enterprise and hyperscale environments. This individual will own the full life cycle of architectural design, from requirements gathering through implementation oversight and performance validation. They will work closely with engineering, networking, DevOps, security, and datacenter operations teams to ensure high-performance, scalable, and resilient GPU infrastructure for AI, HPC, and ML workloads. The role requires one day per week onsite in London, with occasional international travel to support datacenter design reviews, deployment validation, or site acceptance testing.

Key Responsibilities

Architecture & Design
* Lead the architecture of NVIDIA GPU clusters leveraging technologies such as H100/H200, NVLink, NVSwitch, DGX, HGX, or SuperPod-class designs.
* Produce high-level and low-level designs (HLD/LLD), including compute, network, storage, and power/cooling considerations.
* Validate hardware and platform selections, ensuring architectural alignment with customer requirements and scalability goals.
* Design fabric architectures including InfiniBand (200/400 Gb/s), RoCE, and high-performance east-west traffic patterns.
* Ensure designs adhere to NVIDIA reference architectures (NVAIE, Base Command, DGX SuperPod specs, etc.).

Cluster Integration & Validation
* Define and execute validation test plans for GPU cluster performance, resilience, networking throughput, and workload behaviour.
* Oversee integration of GPU nodes, networking, and storage systems into the existing datacenter environment.
* Collaborate with DevOps/Platform teams to validate cluster orchestration (Kubernetes, Slurm, Bright Cluster Manager, or equivalents).
* Validate firmware, drivers, NCCL, CUDA libraries, and container environments for production readiness.

Deployment & Delivery Oversight
* Provide technical leadership across the full deployment life cycle.
* Partner with datacenter operations to ensure correct rack layouts, cabling, airflow, and power design.
* Support delivery teams during build-out phases, ensuring the design is executed correctly.
* Participate in factory acceptance tests (FAT), site acceptance tests (SAT), and operational readiness reviews.

Stakeholder Collaboration
* Work closely with internal and external teams including network engineering, platform engineering, procurement, and vendors such as NVIDIA, Mellanox, Supermicro, Dell, or HPE.
* Provide technical guidance to customers, partners, and cross-functional engineering teams.
* Communicate complex architectural concepts clearly to both technical and non-technical audiences.

Documentation & Governance
* Produce detailed architecture documents, diagrams, acceptance criteria, and operational runbooks.
* Ensure security, compliance, and governance standards are built into the design.
* Provide knowledge transfer (KT) and training sessions to internal teams where required.

Required Skills & Experience

Technical Expertise
* Proven experience architecting and delivering NVIDIA GPU clusters at scale (AI/ML/HPC environments).
* Strong hands-on understanding of GPU interconnects (NVLink/NVSwitch) and DGX/HGX/SuperPod architectures.
* Deep knowledge of InfiniBand and high-performance networking architectures.
* Experience with cluster orchestration: Kubernetes, Slurm, PBS, or similar.
* Familiarity with AI/ML workload requirements, CUDA, Docker/OCI containers, and NVIDIA software stacks (NCCL, CUDA Toolkit).
* Comfort with Linux systems engineering, hardware validation, and troubleshooting across compute/network layers.

Soft Skills
* Strong communication skills, with the ability to bridge engineering and business discussions.
* Comfortable owning architecture decisions and delivering executive-ready documentation.
* Ability to work autonomously while coordinating with multi-disciplinary teams.
* Problem-solver with strong critical-thinking abilities and a delivery-focused mindset.

Desirable Experience
* Experience with hyperscaler-class deployments or multi-megawatt datacenter environments.
* Work with NVIDIA Base Command Manager or similar cluster management tooling.
* Exposure to data pipelines, storage systems (Lustre, GPUDirect Storage, Ceph), or AI workflow platforms.
* Certifications such as NVIDIA Certified Associate/Expert, Kubernetes certifications (CKA/CKS), or related vendor accreditations.

What We Offer
* Hybrid working: 1 day per week in London
* Opportunity to design next-generation high-performance GPU infrastructure
* Exposure to cutting-edge AI compute at scale