Site Reliability Engineer

Company: Altimetrik
Location: Mountain View
Closing Date: 03/11/2024
Hours: Full Time
Type: Permanent
Job Requirements / Description

Design, implement, and maintain complex data systems supporting millions of customers, applying cloud-native principles and best practices to ensure highly available, secure, performant, and scalable database systems

• Build and maintain CI/CD pipelines in Jenkins

• Build and deploy services to Kubernetes clusters using Helm, Kustomize, etc.

• Contribute to infrastructure changes in AWS, drawing on a deep understanding of AWS services

• Take part in on-call rotations for pre-production and production systems supporting millions of users

• Write and review RCA documents to prevent incidents from recurring, and share the lessons learned

• Contribute to major system upgrades, deployment automation, monitoring enhancements, and production changes

• Create operational playbooks, contribute to how-to articles, and gain domain knowledge to drive changes in the team

• Participate in and contribute to FMEA/chaos testing, security remediations, etc.

• Share best practices and patterns for operational excellence and cost optimization

• Reduce or eliminate manual steps by automating as much as possible

• Continuously look for opportunities to increase developer velocity and productivity

Qualifications:

• Bachelor’s or master’s degree in computer science or a related technical field. Equivalent experience will be considered

• 4+ years of hands-on development and operational experience building and maintaining infrastructure in AWS

• Extensive experience with performance monitoring, troubleshooting, and tuning

• Experience with AWS services and hands-on knowledge of cloud hosting

• Experience with scripting languages for DevOps automation

• Experience with at least one of the following programming languages: Java, Python, or Ruby

• Knowledge of Docker, Kubernetes, and Argo CD

• Experience with monitoring and observability using Splunk, Wavefront, AppDynamics, Prometheus, tracing, etc.
