Computer Officer I (Information Technology Services Centre)

Ming Pao (明報網站) • Hong Kong, Hong Kong

10 days ago

Job description

Ref: 2500016W

Closing date: July 25, 2025

The High-Performance Computing (HPC) team of the Information Technology Services Centre seeks a skilled member to support and enhance the research computing cluster. The position involves architecting and optimizing computing infrastructure for general research computing tasks, Large Language Models (LLMs), and other AI applications.

The appointee will be responsible for (a) designing, deploying, and optimizing HPC infrastructure specifically tailored for AI workloads, focusing on distributed training and inference of large-scale models, and implementing efficient job scheduling strategies and resource allocation for GPU and CPU clusters; (b) developing automation tools and workflows to streamline machine learning operations (MLOps) pipelines, ensuring researchers can focus on innovation rather than infrastructure challenges; (c) serving as a critical bridge between cutting-edge research and computational infrastructure by partnering with research teams to architect solutions for their unique computational challenges in AI model development; (d) optimizing model training performance through advanced parallelization strategies and system-level improvements while staying current with emerging technologies in distributed computing and AI infrastructure; (e) leading technical workshops on HPC best practices for AI/ML workflows and developing comprehensive documentation and training materials for cluster users; and (f) mentoring researchers in the efficient use of computational resources to build a stronger research community capable of leveraging the University's world-class computing infrastructure to its fullest potential.

Applicants should have (i) a Master's degree or higher in Computer Science, Artificial Intelligence, Data Science, or a related field; (ii) at least five years of experience in a technical lead role with demonstrated success in large-scale system implementation and deployment; (iii) strong programming proficiency in Python, C++, or Java; (iv) experience in large-scale system deployment, preferably using AI frameworks such as TensorFlow, PyTorch, and LangChain, particularly in the context of LLM development; (v) experience in parallel computing, GPU programming using CUDA or ROCm, and distributed training optimization techniques; (vi) proficiency in administering complex computational infrastructures; (vii) a proven track record of designing and implementing robust distributed systems that support resilience and dynamic load balancing, architecting solutions to maintain high availability under varying computational demands while ensuring efficient resource utilization across the cluster; (viii) experience with fault-tolerant system design and the ability to implement adaptive scheduling strategies that respond to changing workload patterns to meet the demanding requirements of modern research computing environments; (ix) strong analytical and problem-solving abilities with meticulous attention to system performance; (x) excellent written and verbal communication skills, with the ability to translate complex technical concepts for diverse audiences ranging from graduate students to senior researchers; (xi) a self-directed work style with proven ability to manage multiple projects simultaneously while maintaining high standards; and (xii) a collaborative mindset with experience working effectively in cross-functional research teams.

Appointment will initially be made on a contract basis for two years, renewable subject to good performance and mutual agreement.
