Senior HPC Systems and Storage Engineer - 139215
San Diego
Hybrid

Filing Deadline: Thu 4/23/2026 ___

UC San Diego values and welcomes people from all backgrounds. If you are interested in being part of our team, possess the needed licensure and certifications, and feel that you have most of the qualifications and/or transferable skills for a job opening, we strongly encourage you to apply.

UCSD Layoff from Career Appointment: Apply by 4/13/26 for consideration with preference for rehire. All layoff applicants should contact their Employment Advisor.

Reassignment Applicants: Eligible Reassignment clients should contact their Disability Counselor for assistance.

DEPARTMENT OVERVIEW:

The mission of the San Diego Supercomputer Center (SDSC) is to translate innovation into practice. SDSC adopts and partners on innovations in industry and academia in software, hardware, computational and data sciences, and related areas, and translates them into cyberinfrastructure that solves practical problems across scientific domains and societal endeavors. Cyberinfrastructure refers to an accessible, integrated network of high-performance computing, data, and networking resources and expertise focused on accelerating scientific inquiry and discovery.

With more than 250 employees and $30-50M of revenue per year, SDSC is a global leader in the design, development, and operation of cyberinfrastructure. SDSC supports hundreds of multidisciplinary programs spanning a wide variety of domains, from earth sciences and biology to astrophysics, bioinformatics, and health IT. SDSC presently operates multiple large HPC systems, ranging from a 120,000-core x86 general-purpose system, to a system designed explicitly for artificial intelligence and machine learning, to a nationally distributed system open for all of academia to integrate with. SDSC offers research data services across the entire vertical stack, from universally scalable storage to consulting services on FAIR, Big Data, and AI.
SDSC offers a rich set of cloud services on premises, in the commercial cloud, and as hybrid services across both. SDSC has three geographic scopes: a national scope supporting cyberinfrastructure for the entire US research and education community; a California scope with a special focus on convergence research that addresses the three dominant threats to the state (drought, fire, and earthquakes); and a campus scope focused on advancing the global impact of SDSC by advancing the research objectives of UC San Diego faculty, researchers, and students. SDSC impacts researchers at scales from thousands to millions. SDSC annually trains thousands of researchers in cyberinfrastructure tools and software, and supports thousands of individual researchers via Unix accounts on its large HPC systems. SDSC was a leader in developing the Science Gateway concept and continues to be a global leader in its evolution, operating multiple major gateways with user communities ranging from tens of thousands to millions. SDSC's educational programs include online courses that have been attended by more than a million students. SDSC is committed to democratizing access to cyberinfrastructure across all of its geographic scopes, and strives toward a culture that supports our employees to be their best, achieve their goals, and enjoy their lives, both professionally and personally.

SDSC's High-Performance Systems Group is responsible for and operates SDSC's high-performance computing clusters and related systems. The group operates large-scale compute and storage systems funded by the National Science Foundation (current ACCESS resources, and previously via the XSEDE and TeraGrid programs), the UCSD campus (e.g., the Triton Shared Compute Cluster), and other entities; these systems support users from campus and national communities across a broad range of scientific disciplines. The group is part of SDSC's Data-Enabled Scientific Computing (DESC) Division.
The Data-Enabled Scientific Computing (DESC) division within SDSC designs, and jointly proposes with other SDSC researchers, supercomputing systems in response to calls for proposals worth tens of millions of dollars from the National Science Foundation (NSF), various government organizations, and UC entities; it also responds to calls for proposals for cyberinfrastructure (CI) related research, solutions, and support. DESC manages, operates, and troubleshoots issues with advanced, leading-edge, complex, multi-petaflop and multi-petabyte data-intensive supercomputer systems, file systems (Lustre, Ceph, etc.), interconnects (such as InfiniBand, NVLink, Slingshot, and Ethernet), and CI projects housed at SDSC. Research leaders within DESC submit high-performance computing (HPC), high-throughput computing (HTC), AI, CI, data science, computational science, science gateway, and scientific software research proposals and acquire funding from NSF, the National Institutes of Health (NIH), the Department of Energy (DOE), the Department of Defense (DOD), and industry. DESC carries out supercomputing, CI, data science, computational science, and scientific software research and development projects. The division provides consulting and user support to researchers and users from academia, collaborates with them, and works with industrial users.
DESC provides advanced computational science, CI, and scientific software support for the national and UC user communities as part of projects and machines such as:

* Expanse - a five-year, ~$34-million project supported by the NSF that enables tens of thousands of users to use HPC, HTC, and GPUs.

* Voyager - a five-year, ~$12-million project supported by the NSF that enables researchers to experiment with and use AI-focused hardware for scientific applications.

* PNRP - a five-year, ~$12-million project supported by the NSF that enables distributed computing with GPU, FPGA, and CPU resources.

* Cosmos - a five-year, ~$12-million project supported by the NSF that democratizes access to accelerated computing.

* The Triton Shared Compute Cluster (TSCC) - a UCSD condo cluster for UCSD and external researchers, including NIST 800-171 compliant computing.

* CloudBank - a five-year, ~$25-million project supported by the NSF to enable usage of commercial cloud resources.

Various other funded CI research and development, domain science (e.g., biochemistry, bioinformatics, cosmology), and AI/ML projects are directed by DESC researchers. DESC staff and researchers are involved with the NSF-funded Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program. DESC is involved in various HPC/HTC/AI training, workshop, outreach, workforce development, and K-12 student programs and associated NSF-funded projects. The division stays current with HPC, HTC, accelerator, CI, computational science, and scientific software research and technology trends and engages with supercomputer vendors (e.g., Dell, Supermicro, Intel, NVIDIA, AMD, IBM, Hewlett Packard Enterprise, DataDirect Networks, Aeon Computing, Arista)
to remain current with future technologies utilized in supercomputer designs.

POSITION OVERVIEW:

The Senior HPC Systems and Storage Engineer will apply advanced systems and software integration concepts, and location or institutional objectives, to resolve highly complex issues where analysis of systems and software requires an in-depth evaluation of variable factors, and to implement medium to large projects of broad scope and complexity. They will regularly resolve highly complex business process, system functionality, implementation, and system and software integration issues where analysis of situations or data requires an in-depth evaluation of variable factors, selecting tools, methods, techniques, and evaluation criteria to obtain results. They will also give technical presentations to their team, other technical units, and management, and evaluate new technologies, including performing moderate to complex cost/benefit analyses. They may lead a team of systems/infrastructure professionals.

As part of SDSC's High-Performance Systems Group, the Senior HPC Systems and Storage Engineer is responsible for designing, deploying, and operating SDSC HPC compute clusters and their associated storage systems, and for maintaining their performance, reliability, and availability at the national, state, and campus levels. This role requires in-depth knowledge of HPC cluster architecture, Linux systems administration, and the integration of compute, storage, and network systems. The incumbent contributes to the design, deployment, and operation of high-performance HPC systems and storage environments, including parallel file systems operating at scale (tens to hundreds of servers supporting thousands of clients) across high-speed networks such as Ethernet and InfiniBand. In collaboration with senior technical leadership, the incumbent contributes to the architecture and implementation of scalable solutions at the cluster, data center, and campus levels.
They will also plan and execute system lifecycles, including deployment, upgrades, and decommissioning of HPC systems and storage services, and contribute to technical planning and effort estimation for new deployments, proposals, and recharge-based services. Additionally, the incumbent will evaluate and recommend improvements to tools and workflows; participate in the selection and integration of new technologies; work with vendors and SDSC staff to benchmark and evaluate storage systems and cluster platforms; and maintain current knowledge of emerging technologies. The incumbent will develop advanced processes and scripts for system analysis, testing, and automation to improve operational efficiency, scalability, and reliability across compute, storage, and network systems; lead efforts to integrate monitoring and alerting, improving incident detection, response, and user communication; and coordinate across compute and storage platforms to ensure graceful handling of service degradation. The incumbent will additionally oversee collaboration with SDSC security teams to implement best practices for system deployment, identity management, and software updates, promoting consistent security and maintenance across the environment, and oversee development and maintenance of related documentation.

For more information, please visit: ___

QUALIFICATIONS:

* Bachelor's degree in related area and/or equivalent experience/training.

* Proven experience administering and supporting large-scale HPC clusters or other distributed POSIX (Linux) systems, including advanced knowledge of Linux system administration, primarily Red Hat and its derivatives (e.g., Rocky Linux).
* Proven experience designing, deploying, and operating large-scale (petabyte-class) high-performance parallel and distributed file systems (e.g., Lustre, Ceph, BeeGFS, GPFS), as well as enterprise and local file systems (e.g., NFS, ZFS, ext4, XFS) in Linux-based environments, including troubleshooting and performance tuning.

* Demonstrated experience with scripting and automation using languages such as Bash and Python; use of configuration management tools (e.g., Ansible, CFEngine); and version control systems (e.g., Git) to manage and maintain system configurations and infrastructure.

* Advanced knowledge of the HPC middleware stack, including cluster management tools, job schedulers, and resource managers. Examples include Slurm, PBS, HPCM, and Bright Cluster Manager.

* Demonstrated knowledge of TCP/IP networking, including sockets, VLANs, and firewalls.

* Job offer is contingent upon satisfactory clearance based on Background Check results.

* Occasional evenings and weekends may be required.

* On-call rotation may be required.

Pay Transparency Act

Annual Full Pay Range: Unclassified - No data available (will be prorated if the appointment percentage is less than 100%)

Hourly Equivalent: Unclassified - No data available

Factors in determining the appropriate compensation for a role include experience, skills, knowledge, abilities, education, licensure and certifications, and other business and organizational needs. The Hiring Pay Scale referenced in the job posting is the budgeted salary or hourly range that the University reasonably expects to pay for this position. The Annual Full Pay Range may be broader than what the University anticipates paying for this position, based on internal equity, budget, and collective bargaining agreements (when applicable).

If employed by the University of California, you will be required to comply with our Policy on Vaccination Programs, which may be amended or revised from time to time.
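To give candidates a concrete sense of the scripting-and-automation qualification above, here is a purely illustrative sketch (not part of the posting; the helper name is hypothetical, though the input format matches the standard Slurm `sinfo -h -o "%T %D"` output) of the kind of small operational tooling this role involves:

```python
# Illustrative sketch only: tally cluster node counts by state from
# Slurm `sinfo` output. In practice the text would be captured by
# running `sinfo -h -o "%T %D"` via subprocess; here it is passed in
# as a string so the function stays self-contained and testable.
from collections import Counter


def summarize_node_states(sinfo_output: str) -> Counter:
    """Parse lines of '<state> <count>' and return node counts per state.

    Slurm flags transient conditions with a trailing '*' on the state
    name (e.g. 'drained*'); the suffix is stripped so counts aggregate
    under the base state.
    """
    states: Counter = Counter()
    for line in sinfo_output.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # defensively skip malformed lines
        state, count = parts
        states[state.rstrip("*")] += int(count)
    return states


# Example with output from a hypothetical 12-node partition:
sample = """\
idle 8
allocated 3
drained* 1
"""
print(summarize_node_states(sample))
```

A script like this would typically feed a monitoring dashboard or an alerting rule (e.g., page when the drained-node count crosses a threshold), which is the monitoring-and-alerting integration work described in the position overview.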
Federal, state, or local public health directives may impose additional requirements.

To foster the best possible working and learning environment, UC San Diego strives to cultivate a rich and diverse environment, inclusive and supportive of all students, faculty, staff and visitors. For more information, please visit ___.

The University of California is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, protected veteran status, or other protected status under state or federal law. For the University of California's Anti-Discrimination Policy, please visit: ___

UC San Diego is a smoke and tobacco free environment. Please visit ___ for more information.

Misconduct Disclosure Requirement: As a condition of employment, the final candidate who accepts an offer of employment will be required to disclose if they have been subject to any final administrative or judicial decisions within the last seven years determining that they committed any misconduct; or have filed an appeal of a finding of substantiated misconduct with a previous employer.

a. "Misconduct" means any violation of the policies governing employee conduct at the applicant's previous place of employment, including, but not limited to, violations of policies prohibiting sexual harassment, sexual assault, or other forms of harassment, or discrimination, as defined by the employer.

For reference, below are UC's policies addressing some forms of misconduct: