This is a Senior Manager/Associate Director-level position.
We Are:
Nextira, now part of Accenture, builds cloud-based solutions and services with cutting-edge engineering skills, artificial intelligence (AI), machine learning (ML), and data analytics that enable clients to design, build, launch and optimize high-performance computing environments. Nextira joined the Accenture AWS Business Group and our AWS North America delivery practice in June 2023. The Accenture AWS Business Group is part of Accenture Cloud First, Accenture’s industry-leading business group focused on redefining how organizations operate and create value by using cloud, data, and artificial intelligence for total enterprise reinvention.
You Are:
An experienced, highly motivated MLOps Engineering Lead looking to join our team in supporting our large-scale GPU-based AI training and research cluster hosted on various cloud providers such as AWS, Azure, and GCP. The ideal candidate should have expert knowledge of Linux at the kernel level and be able to configure and troubleshoot NVIDIA drivers and utilities, particularly on virtual machines running in the cloud. The candidate should be able to use their expertise to guide experienced ML engineers in their efforts to train large AI models in the cloud.
The Work:
You (Full Stack MLOps Engineering Lead) will lead the design, development, and operational management of cloud-native computing clusters to perform ML training and inference. You will lead the development and delivery of the tooling that ML engineers need to optimize performance and troubleshoot issues with their training workloads.
You will manage HPC (High Performance Computing) clusters, including schedulers such as Slurm, and compute nodes accelerated with NVIDIA GPUs. You’ll help experienced ML engineers configure and manage their Conda environments, optimizing them for their specific AI training and research needs. As an MLOps Engineering Lead, you will design, deploy, and maintain cloud infrastructure with infrastructure-as-code (IaC) tools such as Terraform and AWS CDK.
Our Senior MLOps Engineers communicate clearly and effectively with highly technical users, providing support and guidance on a wide range of technical topics related to the cluster. They use their strong Linux skills to troubleshoot and resolve issues, optimize system performance, and ensure a stable and reliable environment for AI training and research.
Travel may be required for this role. The amount of travel will vary from 0 to 100% depending on business need and client requirements.
Here’s What You Need (Basic Qualifications):
Bachelor's degree or equivalent work experience (minimum 12 years); with an Associate’s degree, a minimum of 6 years of work experience
Minimum of 6 years of professional experience working in a software engineering or DevOps role
Minimum of 4 years of experience in Linux Systems Administration, including kernel tuning, networking, and storage
Minimum of 3 years of experience with at least three of the following: Python, Docker / Kubernetes, C++, GPU stack (e.g. CUDA, SMI, ROCm)
Minimum of 4 years of experience in a platform engineering or developer role in a cloud environment
Minimum of 12 months of experience with IaC tools, e.g. Terraform
Excellent problem-solving and analytical skills
Bonus points if you have (Preferred Qualifications):
Experience in MLOps, Artificial Intelligence (AI), Large Language Models (LLMs), or High Performance Computing (HPC)
Experience with full-stack application development, particularly using cloud provider APIs
Experience with parallel file systems, e.g. Lustre, GPFS, Weka
Experience with managing virtualized Python environments, e.g. conda, pyenv
Experience working in a consulting environment, engaging with client stakeholders at a senior level
Experience leading a team of cloud platform engineers
Strong written and verbal communication skills, with the ability to explain complex technical concepts to both technical and non-technical audiences
Compensation at Accenture varies depending on a wide array of factors, which may include but are not limited to the specific office location, role, skill set, and level of experience.
Information on benefits is here.