Senior Site Reliability Engineer (SRE) - AI Infrastructure Job at Confidential, San Francisco, CA

  • Confidential
  • San Francisco, CA

Job Description

Join a stealth-mode startup building out its AI and cloud platform, powered by thousands of H100s, H200s, and B200s, ready for experimentation, full-scale model training, or inference. As a Platform Engineer/Senior Site Reliability Engineer, you’ll own the reliability, performance, and automation of this GPU-powered infrastructure, ensuring seamless orchestration across environments managed by Slurm, Kubernetes, or direct SSH access. You’ll also support the company’s new products as they come to market.

This is a rare opportunity to work at the intersection of infrastructure and AI, shaping the operational backbone of one of the largest GPU clusters in private deployment.

If you want to build and operate infrastructure for frontier AI workloads, automate systems at petascale, and be part of a founding engineering team, this is the place to do it. Get in touch and apply today! 

Responsibilities:

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
  • Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
  • Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

  • 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
  • Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
  • Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
  • Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
  • Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
  • Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
  • Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Salary & Benefits:

  • $300,000 gross per year 
  • Equity

Job Tags

Permanent employment
