Site Reliability Engineering (SRE)

Site Reliability Engineering (SRE) is a discipline that provides a prescriptive method for measuring and achieving reliability through engineering and operations work. It offers specific methods for achieving the objectives of the DevOps movement. SRE applies a software engineering approach to operational problems: its engineers share skill sets with the development teams but focus on the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of the services they support.

 

Mission Impact

  • Highly available services to customers
  • Services that scale to meet customer workload
  • Accelerated delivery of solutions by pairing systems experts with the software development teams
  • Standardized practices for operations by breaking down silos with shared ownership
  • Decreased cost via continual automation of operational tasks
  • Increased responsiveness through business metrics derived from measuring all work

Features

  • Error budgets provide a defined way of knowing when to focus on features and when to refocus on reliability
  • Proactive monitoring and application performance management
  • System visibility, achieved by working with software development teams to integrate applications with logging and monitoring platforms
  • Team size that scales sub-linearly with supported systems
  • Focus on tooling and automation to effectively support business systems
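
To illustrate how an error budget can signal when to focus on features and when to refocus on reliability, here is a minimal Python sketch. The names (ErrorBudget, slo_target, focus) and the request-based way of counting failures are assumptions made for illustration only; they are not part of any specific ValidaTek tooling, and a real policy would be agreed with the development teams.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one service over one measurement window."""
    slo_target: float      # assumed SLO, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that did not meet the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def focus(self) -> str:
        # Illustrative policy: ship features while budget remains,
        # otherwise refocus on reliability.
        return "features" if self.remaining_fraction > 0 else "reliability"


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(healthy.focus(), round(healthy.remaining_fraction, 2))
```

In this sketch a service that has used only a quarter of its budget stays on feature work, while one that has overspent it switches to reliability work until the budget recovers.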

Use Cases

  • Organizations that operate their own applications as a core business function
  • Organizations struggling to scale their application support teams
  • Organizations working to improve their availability and application delivery speed

Infographic

SRE Graphic

 

Considerations

  • SRE has three key principles:
    • SLOs are defined and enforced
    • SREs have time for engineering work to improve the environment
    • SREs accept work based on their workload
  • Organizational buy-in is required for success
  • Requirements in existing development, test, and operational service contracts may impede some automation processes, and the contracts may need to be modified in the future
  • SREs work with development teams; colocation and collaboration are critical to success
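
The principles that SREs keep time for engineering work and accept work based on their workload can be sketched as a simple capacity check. This is an illustrative assumption, not a prescribed rule: the function name can_accept_ops_work and the 50% default cap on operational work are hypothetical choices, and each organization would set its own threshold.

```python
def can_accept_ops_work(
    committed_ops_hours: float,
    new_ops_hours: float,
    capacity_hours: float,
    max_ops_share: float = 0.5,  # assumed cap; not taken from the text above
) -> bool:
    """Return True if taking on new operational work keeps the share of
    operational hours at or below the cap, leaving time for engineering."""
    if capacity_hours <= 0:
        raise ValueError("capacity_hours must be positive")
    projected_share = (committed_ops_hours + new_ops_hours) / capacity_hours
    return projected_share <= max_ops_share


if __name__ == "__main__":
    # A 40-hour week with 15 hours of operational work already committed.
    print(can_accept_ops_work(15, 4, 40))
    print(can_accept_ops_work(18, 4, 40))
```

A check like this gives the team a measurable basis for declining or deferring operational requests when engineering time would otherwise be squeezed out.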

ValidaTek’s Solution Process

  1. Bootstrap SRE – provide a start towards a single capability
         a. The first SRE will work to understand the operational needs of the system they are assigned to
         b. The first SRE will create a list of recommendations and improvements
  2. Bootstrap team
         a. Based on organizational needs, decide which model to adopt – supporting a single application, spreading across the organization, or transforming an existing team
         b. Establish standards for the organization
  3. Take on a single project to provide focus and measurable improvement
  4. Present practices and successes to leadership as a precursor to further rollout
  5. Use the practices developed to spread SRE across the organization
  6. Deliver additional capabilities to existing and new projects

 

For More Information

Email: [email protected]

Phone: 703-972-2272