Job Title: - Senior BizOps Engineer
Location: - O'Fallon, MO / St. Louis, MO
Duration: - 6-12+ Months
Experience: - 8+ Years
Job Description: -
The Account Management Services Business Operations (Biz Ops) team is seeking a Business Operations Site Reliability Engineer (SRE) with particular experience in Operational Resiliency.
Overview
The role of the Business Operations organization is to be the production readiness steward for Client products. As Business Operations SREs, we are responsible for ensuring that our platform is stable and healthy. We break down barriers to running our products by fostering developer-run ownership and empowering developers to build resilient products. During the application build phase, we support our developers in software run principles, including operational design, automation, capacity planning, and monitoring, that lead to fault-tolerant, scalable products. We see the big picture and help create and enforce operations standards while facilitating an agile and learning culture.
We support daily operations with a sharp focus on triage and root cause analysis, understanding the business impact of our products and then performing blameless post-mortems. The goal of every Business Operations team is to engage early in the development lifecycle and to proactively manage production and change activities to maximize customer experience and increase the overall value of supported applications. Business Operations teams also focus on risk management by tying all our activities together with an overarching responsibility for compliance and risk mitigation across all our environments.
Ultimately, the role of Business Operations is to align product- and customer-focused priorities with operational needs by providing continuous feedback throughout the lifecycle.
Role
It is not expected that any single candidate would have expertise across all these areas, but a Biz Ops engineer will spend some time during their career on each of these aspects of the role:
• Operational Resiliency:
1. Ensure that health checks are implemented, with automatic rerouting when a site is unhealthy.
2. Make sure every service is built around an active-active-active architecture.
3. Identify, review, and remediate single points of failure.
4. Ensure the application can auto-scale (add instances) on demand using service metrics.
5. Create a strategy for chaos testing/experimentation, including recovery from common failure scenarios.
• Site Reliability Engineering:
o Serve as the primary contact responsible for ensuring application scalability, performance, and resilience.
o Practice sustainable incident response and blameless post-mortems while taking a holistic approach to problem solving and optimizing time to recover.
o Automate data-driven alerts to proactively escalate issues. Work with development teams to establish SLOs and improve reliability.
• DevOps/Automation:
o Tackle complex development, automation, and business process problems. Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
o Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead Client in DevOps automation and best practices.
o Increase automation and tooling to reduce toil and manual intervention.
• ITSM Practices:
o Analyse ITSM activities on the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
About you
The ideal candidate will have experience in many of these areas:
• BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience.
• Coding or scripting exposure.
• Appetite for change and pushing the boundaries of what can be done with automation. Be curious about new technology, infrastructure, and practices to scale our architecture and prepare for future growth.
• Experience with algorithms, data structures, scripting, pipeline management, and software design.
• Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
• Interest in designing, analysing, and troubleshooting large-scale distributed systems.
• Willingness and ability to learn, to take on challenging opportunities, and to work as a member of a matrix-based, diverse, and geographically distributed project team.
• Ability to balance doing things right with fixing things quickly. Flexible and pragmatic, while working towards improving the long-term health of the system.
• Comfortable collaborating with cross-functional teams to ensure that expected system behaviour is understood and that monitoring exists to detect anomalies.