Senior Software Engineer, Storage Platform
19 days ago
Chicago
Job Description

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role

You will be central to building our high-performance storage platform that enables research teams to work with massive datasets for AI training, large-scale simulations, and data-intensive computational workloads. Working closely with product, your platform team members, and infrastructure specialists, you'll design and implement storage systems that deliver high-throughput data access, manage multi-TB to PB-scale datasets, and provide the data protection and lifecycle management required by enterprise and research organizations.
Job Responsibilities

• Storage Platform Architecture: Design and build scalable storage orchestration systems supporting block, object, and file storage optimized for AI training datasets, model checkpoints, simulation data, and large-scale data processing pipelines.
• Research Cluster Storage: Design and implement storage systems for research computing environments, including Kubernetes and SLURM clusters, enabling shared datasets, persistent storage for distributed training, model checkpoints, and high-throughput data access for batch processing workloads.
• High-Performance Data Access: Implement storage solutions that deliver consistently high throughput and low latency for demanding research workloads, including distributed training, large-scale simulations, and real-time data processing.
• Data Pipeline Engineering: Build robust data ingestion, processing, and movement systems that handle massive datasets, support efficient data loading for training pipelines, and enable seamless data access across compute infrastructure.
• Multi-Tiered Storage Orchestration: Build systems that coordinate across NVMe, SSD, and high-capacity storage tiers, placing and moving data based on access patterns and workload requirements.
• Enterprise Data Protection: Implement comprehensive backup, snapshot, replication, and disaster recovery systems that meet enterprise data protection requirements and support zero-data-loss policies.
• Storage APIs & Integration: Create storage management APIs and SDKs that integrate seamlessly with compute platforms, research workloads, and financial data feeds.
• Observability and Performance: Build monitoring and optimization systems that ensure consistent storage performance, track capacity utilization, and provide visibility into data access patterns.
Requirements

• Experience: 5+ years in software engineering with proven experience building storage platforms, distributed storage systems, or data infrastructure for production environments.
• Kubernetes Storage & Container Orchestration: Strong familiarity with Kubernetes storage architecture, persistent volumes, storage classes, and CSI drivers. Understanding of pods, deployments, stateful sets, and how Kubernetes manages storage resources.
• Storage Systems Expertise: Deep understanding of storage architectures, including block storage, object storage, file systems, distributed storage, NVMe, caching strategies, and performance optimization for latency-sensitive workloads.
• Programming Skills: Expert-level Python proficiency. Experience with C/C++, Rust, or Go for performance-critical storage components is highly valued.
• Linux & Systems Programming: Strong experience with Linux in production environments, including file systems, storage subsystems, and kernel-level storage interfaces.
• Data Systems Engineering: Strong background in building data pipelines, ETL processes, and large-scale data processing systems, and in managing data lifecycle at scale.
• Platform & API Design: Proven experience building storage platforms with multi-tenancy, data isolation, and enterprise-grade reliability features.
• Problem Solving & Architecture: Demonstrated ability to solve complex performance and scalability challenges while balancing pragmatic shipping with sound long-term architecture.
• Autonomy and Communication: Comfortable navigating ambiguity, defining requirements collaboratively, and communicating technical decisions through clear documentation.
• Commitment to Growth: Growth mindset with a focus on learning and professional development.

Preferred Qualifications

• Background provisioning or managing storage for research computing environments (Kubernetes, SLURM, or HPC clusters)
• Experience with high-performance storage solutions and their APIs
• Background with object storage systems (S3-compatible APIs) and distributed file systems
• Knowledge of storage networking technologies (NVMe-oF, iSCSI, storage fabric protocols)
• Understanding of AI/ML data pipeline requirements and model training data access patterns
• Experience building storage solutions for regulated industries with audit and compliance requirements
• Knowledge of time-series databases and market data storage optimization is a plus
• Extra points for experience with financial services data infrastructure and regulatory data protection requirements

Key Technologies

• Python, NVMe, Distributed Storage Systems, C/C++, Rust, Go, Kafka, PostgreSQL, Redis, S3 APIs, Docker, Kubernetes, Terraform, FastAPI, Time-Series Databases

Why Moonlite

• Build Next-Generation Infrastructure: Your work will create the platform foundation that enables financial institutions to harness AI capabilities previously impossible with traditional infrastructure.
• Hands-On Ownership: As an early engineer, you'll have end-to-end ownership of projects and the autonomy to influence our product and technology direction.
• Shape Industry Standards: Contribute to defining how enterprise AI infrastructure should work for the most demanding regulated environments.
• Collaborate with Experts: Work alongside seasoned engineers and industry professionals passionate about high-performance computing, innovation, and problem-solving.