AI Infra Site Reliability Engineer - AI Infrastructure
Hamilton Barnes Associates Limited
Full-time
Networks & Systems
Location
Singapore
Posted
July 03, 2026
Job Description
Ready to architect AI infrastructure that powers next-generation research and cloud platforms?
Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.
Build resilient, scalable AI platforms that empower startups to innovate. Apply today!
Key Responsibilities
- Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
- Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
- Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
- Collaborate with ML, networking, and platform t...