We are seeking a skilled Manager, Reliability and Incident Response, with experience in modern DevOps best practices, continuous deployment, and the use of AI Ops platforms, to join our major incidents team. The Manager, Reliability and Incident Response will coordinate and lead the response to major incidents that impact our ecommerce platform. The successful candidate will ensure timely restoration of service, minimize the impact on customers, and help prevent future incidents by improving AI Ops correlation and advancing service restoration tools.
Key Responsibilities:
- Lead the response to major incidents impacting our ecommerce platform
- Coordinate with technical teams across DevOps, AI Ops, distributed computing, and other areas to prevent future incidents by improving AI Ops correlation and advancing service restoration tools
- Manage communication with stakeholders, including customers, business partners, and senior management, to provide regular updates and manage expectations
- Develop and implement processes for incident management, including escalation procedures, activation of service restoration processes and tools, and validation of AI Ops correlation models
- Continuously review and improve incident management processes to ensure efficiency and effectiveness
- Collaborate with technical teams to identify areas for improvement and implement changes to prevent future incidents
- Conduct incident trend analysis to identify recurring issues and proactively address them
- Manage vendor relationships related to incident management tools and services
- Provide guidance and support to incident management team members and other technical staff