AI Infra Site Reliability Engineer - AI Infrastructure

Hamilton Barnes Associates Limited

Full-time, Networks & Systems
Location
Singapore
Posted
July 03, 2026

Job Description

Ready to architect AI infrastructure that powers next-generation research and cloud platforms?

Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

Build resilient, scalable AI platforms that support startups and drive innovation. Apply today!

Key Responsibilities

  • Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
  • Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
  • Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
  • Collaborate with ML, networking, and platform t...