Site Reliability Engineer (SRE)
12+ Months
Boston, MA
$50-$55/HR on W2
W2 Candidates ONLY
Job Description:
We are seeking a talented and experienced Site Reliability Engineer (SRE) to join our dynamic team. As an SRE, you will be responsible for the reliability, performance, and scalability of our services and infrastructure. You will work closely with development teams, operations, and other key stakeholders to ensure our systems are highly available and performant.
Key Responsibilities:
Infrastructure Management:
Design, implement, and manage scalable, reliable cloud infrastructure on AWS.
Automate infrastructure provisioning, deployment, and scaling to support rapid growth and changes.
Monitor system performance and ensure the availability and reliability of services.
Implement infrastructure as code (IaC) using tools like Terraform, CloudFormation, or similar.
Reliability and Performance:
Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and improve system performance.
Conduct root cause analysis of incidents and implement corrective actions to prevent recurrence.
Tune systems to optimize performance and resource utilization.
Automation and Tooling:
Create and maintain automation scripts and tools to streamline operational processes.
Implement continuous integration and continuous deployment (CI/CD) pipelines to support rapid development and deployment cycles.
Develop and maintain monitoring, alerting, and logging solutions to ensure the health and performance of systems.
Collaboration and Communication:
Collaborate with development teams to design and implement scalable, reliable services.
Provide guidance and support to development teams on best practices for building and maintaining reliable systems.
Communicate effectively with stakeholders to understand requirements and provide updates on system performance and reliability.
Incident Management:
Respond to and resolve incidents in a timely manner, minimizing impact on users and services.
Participate in on-call rotations to provide 24/7 support for critical systems.
Document incidents and resolutions to improve future response and prevent recurrence.