NVIDIA is looking for a Senior Cloud Infrastructure Development Engineer to design, develop, and improve a robust and efficient private cloud infrastructure used for batch job execution and for Bazel remote execution and remote caching services for its software groups. As a team, we work with various groups within NVIDIA, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence, and Autonomous Vehicles, to cater to their infrastructure needs. These cloud services run on thousands of servers and execute half a million automated jobs per day, helping with the productivity of thousands of NVIDIA's software engineers worldwide. Are you passionate about infrastructure and looking for complex and meaningful problems? Are you ready to build the next generation of cloud services, design innovative solutions, and mine through data to uncover real problems and fix them? We are excited to have a fun-loving person like you!
What You'll Be Doing
- Design and implement scalable, low-latency, high-throughput, and highly reliable remote execution and remote caching services for Bazel.
- Work on challenging infrastructure problems, such as multi-cluster, multi-data-center services that support low-latency, high-throughput data access.
- Support job execution on a heterogeneous mix of machines in a Kubernetes cluster that includes both NVIDIA GPUs (vGPUs) and Tegra processors.
- Chase system resiliency across databases, storage, network and web servers to achieve high availability goals.
- Implement security best practices for the remote execution cluster, ensuring the integrity and confidentiality of data.
- Analyze data and apply machine learning and deep learning algorithms to improve the performance and predictability of the system.
What We Need to See
- Strong object-oriented programming background; Java or Golang strongly preferred
- Experience developing large-scale cloud infrastructure applications
- Background with relational databases such as MySQL and NoSQL databases such as Elasticsearch and MongoDB
- Background with containers (Docker, Kubernetes), KubeVirt, web services (SOAP/REST), and scalable storage (HDFS/Ceph, Artifactory, object storage)
- Experience working with messaging technologies such as Kafka
- Excellent problem-solving and troubleshooting skills.
- Ability to collaborate across multiple teams and across people working in different time zones.
- BS/MS in Computer Science or Computer Engineering or equivalent experience
- 10+ years of industry experience.
Ways to Stand Out from the Crowd
- Experience working on algorithms, with a demonstrated ability to choose the best possible algorithms to solve complex problems
- Background in the design, implementation, and deployment of major infrastructure features across multiple clusters using incremental rollouts
- Experience with Bazel and Bazel remote execution services, and familiarity with cloud computing platforms (e.g., AWS, GCP, Azure).
- Knowledge of build toolchains and dependency management.
- Previous contributions to open-source projects related to build systems or distributed computing.
- Experience with machine learning and data analytics, and with applying them to infrastructure
The base salary range is 176,000 USD - 333,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.