SRE Pillars, Processes and Procedure
About the author
Principal Site Reliability Engineer (SRE), startup founder and IT leader with experience building, tuning and supporting mission-critical systems. Platform creator, inquisitive technologist and innovation enthusiast with more than 23 years of experience in the information technology industry.
Experience spans the airline, e-commerce, fintech, hospitality and streaming services industries, as well as state and local government. Has launched several startups, providing an expert technical perspective on critical technology decisions. For more information about the author, please visit his site at www.briandibella.com or LinkedIn.
Pillars of SRE
- Observability: Monitoring and alerting for all application components, covering both application performance and infrastructure metrics.
- Reliability: Measuring reliability as a key performance metric through Service Level Objectives (SLOs) and Service Level Indicators (SLIs). Each SLO implies an error budget, which can determine a development team's release cadence so that application bugs do not impede users or business functions, thereby creating a more reliable experience.
- SRE Culture: Building a team culture that is constructive and blameless, which results in better communication between team members and the ability to openly share opportunities to improve all aspects of the SRE practice. Achieving high performance through psychological safety among team members, which spreads across the enterprise as more teams engage with our SRE team members, thereby clearing a path to remove silos within an organization. Transferring knowledge whenever possible.
- Incident Management: Defining application ownership and escalation paths for expedient resolution of incidents. Measuring incident metrics for detection, resolution and recurring failures.
- Automation: Removing toil by systematically identifying manual operations and replacing those repetitive tasks with automated solutions.
- Continuous Improvement: Continuing to practice the processes and procedures that strengthen the culture of SRE: reducing toil, fostering psychological safety, removing silos, reviewing metrics and sustaining healthy on-call practices.
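To make the error budget idea from the Reliability pillar concrete, here is a minimal sketch in Python. It assumes a request-based SLI (successful requests divided by total requests) and a simple policy in which releases continue only while some budget remains. The names `ErrorBudget` and `error_budget`, and the policy itself, are illustrative assumptions rather than a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_failures: float   # failures the SLO tolerates in the window
    consumed_fraction: float  # share of the budget already spent (1.0 = exhausted)
    release_allowed: bool     # assumed policy: release only while budget remains


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute the error budget for one measurement window.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the share of requests the SLO permits to fail.
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed
    return ErrorBudget(allowed, consumed, consumed < 1.0)


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
    print(error_budget(0.999, 1_000_000, 250))
```

In this sketch a 99.9% SLO over one million requests tolerates roughly 1,000 failures, so 250 failures consume about a quarter of the budget. A team might slow or pause feature releases as `consumed_fraction` approaches 1.0; the exact threshold is a team decision.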
SRE Processes and Procedures in Priority Order
- Implement observability by building dashboards that show application performance metrics for each component and for the application as a whole.
- Create an ownership catalog that defines all components of an application, owners, stakeholders and technical leads.
- Adopt an alert escalation tool, configure teams and begin collecting metrics on Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR) and Mean Time to Failure (MTTF).
- Foster an internal blameless culture that encourages communication between team members and the external teams that SRE supports.
- Identify opportunities for automation during incident resolution, rollback scenarios, testing and all recurring manual technical tasks.
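As an illustration of the incident metrics named in the steps above, the following Python sketch computes MTTD, MTTR and MTTF from a list of incident records. The `Incident` fields and the `incident_metrics` function are hypothetical names. Definitions of these metrics vary between teams, so the sketch assumes that MTTD runs from failure start to detection, MTTR runs from failure start to resolution, and MTTF is the average healthy time between one incident's resolution and the next incident's start.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class Incident:
    started_at: datetime   # when the failure began
    detected_at: datetime  # when an alert fired or someone noticed
    resolved_at: datetime  # when service was restored


def _mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def incident_metrics(incidents: list[Incident]) -> dict[str, Optional[timedelta]]:
    """Return MTTD, MTTR and MTTF under the assumptions stated above."""
    if not incidents:
        raise ValueError("at least one incident is required")
    ordered = sorted(incidents, key=lambda i: i.started_at)

    mttd = _mean_delta([i.detected_at - i.started_at for i in ordered])
    mttr = _mean_delta([i.resolved_at - i.started_at for i in ordered])

    # Healthy time between consecutive incidents; needs at least two incidents.
    gaps = [nxt.started_at - prev.resolved_at
            for prev, nxt in zip(ordered, ordered[1:])]
    mttf = _mean_delta(gaps) if gaps else None

    return {"mttd": mttd, "mttr": mttr, "mttf": mttf}
```

In practice the timestamps would come from the alert escalation tool's incident export. Keeping the calculation in one small function makes it easy to adjust if the team settles on different metric definitions.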