Job Description
What the Role Entails
- Design and implement high-performance orchestration systems to automate the deployment, scaling, and lifecycle management of large-scale model training and real-time inference services
- Develop scheduling algorithms to optimize the utilization of global GPU/CPU clusters. Focus on improving resource allocation efficiency and managing the complexities of heterogeneous hardware
- Participate in the development of training-related features (incremental training and APIs for online learning) and improve framework efficiency
- Participate in the development of pipelines for offline-to-online synchronization. Maintain strict data consistency and minimize model-update latency so that the most up-to-date models are always served
- Collaborate with cross-functional teams to ensure platform features are available, secure, and compliant for a global user base
Who We Look For
- Currently enrolled in an undergraduate or g...