Senior SRE Software Engineer - Storage and Data Services

Company:  NVIDIA
Location: Santa Clara
Closing Date: 07/11/2024
Salary: 144,000 USD - 270,250 USD Per Annum (base)
Hours: Full Time
Type: Permanent
Job Requirements / Description

Site Reliability Engineering (SRE) is an engineering discipline that involves designing, building, and maintaining large-scale production systems with high efficiency and availability. It encompasses various areas, including software and systems engineering practices, storage, data management, and services. SRE professionals are highly specialized and possess expertise in different domains such as systems, networking, storage, coding, database management, capacity management, continuous delivery, and open-source cloud-enabling technologies like Kubernetes, containers, and virtualization. Their responsibilities encompass ensuring reliable storage solutions, managing data efficiently, and providing related services to support the overall stability and performance of the production systems.


SRE at NVIDIA ensures that our DGX Cloud platform continues to be reliable and performant to meet the needs of our users. SRE is also a mindset and a set of engineering approaches to running efficient production systems, focusing on eliminating manual work through modern automation practices and performance tuning. Limiting time spent on reactive operational work, conducting blameless postmortems, and proactively identifying potential outages are key to product quality and make for interesting, dynamic day-to-day work. SRE's culture of diversity, intellectual curiosity, problem-solving, and openness is important to its success. We encourage collaboration, thinking big, and taking risks in a blame-free environment. We promote self-direction to work on meaningful projects while striving to build an environment that provides the support and mentorship needed to learn and grow.


What You Will Be Doing


  1. Assist in designing, implementing, and supporting large-scale storage clusters and data services.
  2. Build and improve service reliability tools and frameworks - logging, tracing, and alerts.
  3. Work with AI/ML workloads to capture and correlate behavior in large clusters and workflows, which are otherwise hard to understand.
  4. Support services before they go live through activities such as system design consulting, developing software and frameworks, capacity management, launch reviews, and managing data ingress and egress across multiple data centers and public clouds.
  5. Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  6. Scale systems sustainably through mechanisms like AI/ML and automation, and evolve systems by pushing for changes that improve reliability and velocity.
  7. Be part of an on-call rotation to support production systems.

What We Need To See


  1. BS degree in Computer Science or a related technical field involving coding/automation, or equivalent experience.
  2. At least 5 years of equivalent practical experience.
  3. Experience with Infrastructure as Code and Configuration Management Tools: Terraform, CloudFormation, CDK, Ansible.
  4. Proficiency in one or more of the following: Python, Golang.
  5. Knowledge of Public Clouds: AWS, Azure, GCP.
  6. Expertise with Kubernetes, and common Kubernetes Tooling and Approaches such as GitOps and CI/CD.
  7. Skills with Observability: Logging and Metrics with tools such as Prometheus, Mimir, Loki, Graylog, Grafana.

Ways To Stand Out From The Crowd


  1. Demonstrated SRE mindset and customer-first approach, with a focus on customer satisfaction and a passion for ensuring customer success. Experience with Git, code review, pipelines, and CI/CD.
  2. Interest in crafting, analyzing, and fixing large-scale distributed systems. Strong debugging skills with a systematic problem-solving approach to identify complex problems.
  3. Thrive in collaborative environments and enjoy working with various teams. Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack, and Docker. Experience with building Cloud Architectures: Serverless, Containers.

The base salary range is 144,000 USD - 270,250 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. You will also be eligible for equity and benefits.


NVIDIA accepts applications on an ongoing basis.


NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
