Senior Site Reliability Engineer

Company:  ValueBase Consulting
Location: Boston
Closing Date: 28/10/2024
Salary: £150 - £200 Per Annum
Hours: Full Time
Type: Permanent
Job Requirements / Description

Site Reliability Engineer (SRE)

12+ Months

Boston, MA

$50-$55/HR on W2

W2 Candidates ONLY

Job Description:

We are seeking a talented and experienced Site Reliability Engineer (SRE) to join our dynamic team. As an SRE, you will be responsible for the reliability, performance, and scalability of our services and infrastructure. You will work closely with development teams, operations, and other key stakeholders to ensure our systems are highly available and performant.

Key Responsibilities:

Infrastructure Management:

Design, implement, and manage scalable, reliable infrastructure using cloud platforms (AWS).

Automate infrastructure provisioning, deployment, and scaling to support rapid growth and changes.

Monitor system performance and ensure the availability and reliability of services.

Implement infrastructure as code (IaC) using tools like Terraform, CloudFormation, or similar.

Reliability and Performance:

Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and improve system performance.

Conduct root cause analysis of incidents and implement corrective actions to prevent recurrence.

Optimize system performance and resource utilization through tuning.

Automation and Tooling:

Create and maintain automation scripts and tools to streamline operational processes.

Implement continuous integration and continuous deployment (CI/CD) pipelines to support rapid development and deployment cycles.

Develop and maintain monitoring, alerting, and logging solutions to ensure the health and performance of systems.

Collaboration and Communication:

Collaborate with development teams to design and implement scalable, reliable services.

Provide guidance and support to development teams on best practices for building and maintaining reliable systems.

Communicate effectively with stakeholders to understand requirements and provide updates on system performance and reliability.

Incident Management:

Respond to and resolve incidents in a timely manner, minimizing impact on users and services.

Participate in on-call rotations to provide 24/7 support for critical systems.

Document incidents and resolutions to improve future response and prevent recurrence.
