HPC Sr. Scientific Software Engineer (IT@JH Research Computing)
9 days ago
Baltimore
IT@JH Research Computing is seeking an HPC Sr. Scientific Software Engineer who will design, build, and support Johns Hopkins University's high-performance computing and AI research infrastructure. This role integrates elements of both systems and software engineering, ensuring scalable, secure, and reproducible environments for scientific and data-intensive research. The Engineer develops and automates system and application workflows across CPU/GPU clusters, parallel storage, and hybrid cloud platforms. Responsibilities include configuring and optimizing large-scale Linux environments, implementing job scheduling and orchestration frameworks, containerizing applications, and supporting researchers in optimizing performance and reproducibility. Work combines project-based engineering with operational support, requiring both independent problem-solving and close collaboration with the Research Computing team and faculty stakeholders.

Specific Duties & Responsibilities

Software Deployment and Design
• Develop and refine deployment strategies for scientific software on HPC and AI systems.
• Design computational workflows, select optimal software configurations, and use tools like Ansible for automation.
• Assist teams in implementing, tuning, and optimizing AI models and gateway applications (e.g., XDMoD, ColdFront, Open OnDemand, CryoSPARC Live, SBGrid, AI Agents).

Performance Optimization
• Analyze and optimize the performance of AI models and HPC applications, focusing on GPU-enabled computing.
• Implement parallel processing, distributed computing, and resource management techniques for efficient job execution.

Integration and Optimization
• Develop, debug, and maintain software tools, libraries, and frameworks supporting HPC and AI workloads.
• Collaborate with the system team and software vendors (e.g., NVIDIA, Intel, MATLAB) to optimize systems for maximum performance.
• Utilize CUDA, DNN, TensorRT, and Intel Compilers to enhance system performance.

HPC Scientific Software Support
• Manage and support scientific software deployment across HPC, cloud-based, and colocation facilities.
• Oversee installation, configuration, and maintenance of HPC packages with tools like CMake, Make, EasyBuild, Spack, and Lua module files.

Collaboration and Mentorship
• Work closely with cross-functional teams, including researchers, data scientists, and software developers, to address complex HPC/AI challenges.
• Mentor junior engineers and foster a culture of continuous learning.

Technical Support and Training Workshops and Troubleshooting
• Resolve complex technical issues and perform root cause analysis for HPC/AI software challenges.
• Implement effective solutions to prevent recurrence and improve system reliability.
• Provide training workshops for researchers and students, focusing on troubleshooting, optimizing workflows, and effectively using HPC systems.

Learning and Development
• Stay current with advances in HPC and AI technologies and methodologies.
• Incorporate new research findings into existing systems to improve performance and capabilities.

Container Orchestration
• Develop and manage container orchestration strategies to ensure scalability, reliability, and security of applications.
• Oversee the container lifecycle from creation and deployment to scaling and removal.

Documentation and Compliance
• Create comprehensive documentation for system designs, performance metrics, and project status.
• Ensure compliance with security and regulatory standards for all HPC and AI systems.

In Addition to the Duties Described Above
• Design, deploy, and maintain large-scale Linux HPC clusters with CPU/GPU resources, high-speed networks, and distributed storage.
• Develop and maintain automation frameworks for provisioning, monitoring, and software lifecycle management.
• Implement and optimize job scheduling, container orchestration, and workflow automation tools to support diverse research workloads.
• Collaborate with faculty and research teams to parallelize, containerize, and scale computational workflows for multi-GPU and distributed environments.
• Benchmark and tune application performance across architectures, documenting findings and sharing best practices.
• Integrate and support AI/ML frameworks, scientific libraries, and workflow engines (Snakemake, Nextflow, Dask, Ray).
• Ensure system and application reliability through proactive monitoring (Prometheus, Grafana, ELK) and incident response participation.
• Support reproducibility and FAIR data principles through version-controlled, containerized environments.
• Contribute to documentation, training materials, and technical guidance to enhance user experience and self-service capabilities.
• Participate in evaluation and adoption of new technologies to advance performance, efficiency, and sustainability in research computing.

Minimum Qualifications
• PhD in a quantitative discipline.
• Five years of experience in HPC user support, software deployment, and performance optimization within an academic or research environment.
• Additional education may substitute for required experience and additional related experience may substitute for required education beyond a high school diploma/graduation equivalent, to the extent permitted by the JHU equivalency formula.

Preferred Qualifications
• Eight+ years of professional experience in high-performance computing, large-scale systems, or research software engineering.
• Deep proficiency in Linux systems administration, performance tuning, and automation tools (Ansible, Terraform, Jenkins, or similar).
• Experience with cluster management, workload schedulers (e.g., Slurm), and distributed or parallel file systems (e.g., GPFS, Lustre, WekaFS, Ceph).
• Strong background in programming or scripting (Python, Bash, C/C++, Go, or Rust).
• Familiarity with containerization and orchestration technologies used in HPC (Singularity, Apptainer, Docker, Kubernetes).
• Understanding of high-speed interconnects (InfiniBand, 100/400 Gb Ethernet) and storage/data access patterns for AI and analytics.
• Experience developing or maintaining CI/CD pipelines and module environments (Lmod/Spack) for research software.
• Knowledge of GPU computing (CUDA, ROCm), MPI/OpenMP, and AI/ML frameworks.
• Demonstrated ability to collaborate with researchers on performance optimization, workflow design, and reproducible computing.

Classified Title: HPC Sr. Scientific Software Engineer
Job Posting Title (Working Title): HPC Sr. Scientific Software Engineer (IT@JH Research Computing)
Role/Level/Range: ATP/04/PG
Starting Salary Range: $99,800 - $175,000 Annually (Commensurate w/exp.)
Employee group: Full Time
Schedule: Mon-Fri, 8:30am-5pm
FLSA Status: Exempt
Location: Johns Hopkins Bayview
Department name: IT@JH Research Computing
Personnel area: University Administration