Who We Are
We’re Redis. We built the product that runs the fast apps our world runs on. (If you checked the weather, used your credit card, or looked at your flight status online today, you’re welcome.) At Redis, you’ll work with the fastest, simplest technology in the business—whether you’re building it, telling its story, or selling it to our 10,000+ worldwide customers. We’re creating a faster world with simpler experiences. You in?
Why Would You Love This Job?
- High-Visibility Role: You’ll collaborate with teams and leaders across the entire organization—including Engineers, VPs, SVPs, and occasionally C-level executives—to drive responses that keep our services robust and our customers satisfied.
- Direct Impact on Business & Customers: Your ability to swiftly and effectively manage incidents will protect both customer experiences and revenue, making you a key player in our company’s success.
- Lead Critical Events & Teach Others: From orchestrating major incident bridges to mentoring your team, you’ll shape how we handle urgent challenges and implement changes throughout the production environment.
- Cross-Functional Influence & Growth: By engaging with diverse stakeholders and overseeing complex initiatives, you’ll gain broad organizational exposure, accelerate your professional development, and leave a lasting mark on the company’s operational excellence.
What You Will Do
- Oversee Incident Management
- Direct the entire incident lifecycle, ensuring timely detection, escalation, and resolution.
- Facilitate clear communication among technical teams, customer support, and stakeholders during high-priority incidents.
- Maintain thorough incident documentation, conduct post-incident reviews, and champion ongoing improvements.
- Design and implement standardized procedures for scheduling, assessing, and approving changes across production environments.
- Bring together relevant stakeholders (engineering, operations, product, etc.) to evaluate risks, coordinate timing, and align changes with business objectives.
- Define and monitor change metrics (e.g., success rate, rollback frequency) to inform data-driven decisions.
- Drive the adoption of Zabbix, Prometheus, and similar platforms to proactively identify system anomalies and risks.
- Refine alerts, thresholds, and dashboards to reduce both mean time to detection (MTTD) and mean time to resolution (MTTR).
- Work closely with Problem Management to highlight recurring issues and enable deeper root-cause analysis.
- Partner with Customer Support, Engineering, and Operations to ensure consistent, unified responses and seamless escalations.
- Regularly assess and improve incident and change management frameworks, staying current with industry best practices (ITIL, etc.).
- Document policies, procedures, and guidelines for easy adoption and training across teams.
- Develop and maintain dashboards and KPIs (e.g., in Tableau) to track incident trends, resolution times, change success rates, and system stability.
- Present actionable insights to leadership, shaping both tactical responses and long-term operational strategies.
- Recruit, mentor, and guide a group of incident and change management professionals.
- Foster a culture of accountability, collaboration, and continuous learning.
What You Will Need
- 5+ years of combined experience in Incident Management, Change Management, or related IT Service Management roles, with 2+ years in a leadership capacity.
- Proven track record in mission-critical production environments (SaaS, Cloud, or Enterprise IT), directing both urgent incident resolution and structured change processes.
- Advanced knowledge of log collection and monitoring tools (e.g., Zabbix, Prometheus) for proactive issue detection.
- Proficiency with incident management and project tracking platforms (e.g., Jira, Squadcast, BigPanda) and data visualization tools (e.g., Tableau).
- Excellent communication and stakeholder management skills, capable of uniting diverse teams under high-pressure conditions.
- Strong organizational abilities, including meticulous documentation and the capacity to balance multiple high-priority tasks.
Extra – Great If You Have
- Certifications: ITIL 4 (Service Transition, Service Operation), Prosci, PMP, or similar credentials.
- Automation & Scripting: Basic Python or comparable scripting experience to integrate systems and reduce manual tasks.
- Process Improvement Background: Familiarity with Lean, Six Sigma, or other frameworks to enhance IT service delivery.
- Database Knowledge: Exposure to Redis or other database technologies to inform more effective incident triage.
If you’re ready to make a tangible impact by driving efficient incident response and reliable change management in a dynamic production environment, we’d love to hear from you. In this role, you will shape essential processes, lead a dedicated team, and ensure that our systems remain robust and customer-focused every step of the way!