The Site Reliability Engineer is responsible for ensuring the reliability, availability, and performance of the company’s information technology systems and infrastructure.
This is a highly skilled role that bridges the gap between development and operations to optimize system performance. It requires a proactive mindset to drive innovation and to collaborate on setting strategy for system-wide improvements.
In this role, you will work with the development and operations teams to design, build, and maintain scalable and robust infrastructure, automate processes, and troubleshoot and resolve incidents while providing long-term solutions.
This role is a strategic partner, capable of recognizing and analyzing trends, identifying opportunities, and aligning initiatives with organizational goals. The work directly impacts the stability and efficiency of the company’s production environment, driving continuous improvement and resilience across the organization.
• System Reliability: Ensure high availability, performance, and scalability of production systems and infrastructure.
• Monitoring & Alerting: Design, implement, and maintain monitoring tools, alerts, and dashboards to increase visibility of system performance and to proactively detect and resolve issues before they impact users.
• Strategic Partner: Champion forward-looking strategies that anticipate industry trends and position the company for long-term success. Translate high-level vision into actionable roadmaps and measurable outcomes. Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
• Performance Optimization: Identify bottlenecks and optimize systems and services to improve latency, throughput, and resource usage. Perform capacity planning and resource allocation to ensure optimal system performance and scalability. Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
• Automation & Tooling: Develop automation for routine tasks, deployments, and infrastructure management to reduce manual work and improve reliability.
• Troubleshooting & Diagnostics: Analyze and resolve critical incidents and problems, including system failures, performance issues, and security breaches.
• Incident Management: Respond to Level 3 system outages and performance issues; lead post-incident reviews and implement preventative measures.
• Root Cause Analysis: Perform in-depth analysis of recurring issues and provide permanent preventative solutions to reduce future incidents.
• Documentation: Create and maintain technical documentation, including troubleshooting guides, procedures, and knowledge base articles, and champion the transition to Operations teams.
• Continuous Learning: Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.