SRE essentials: What to expect in site reliability engineering


Over the past 20 years, most leading businesses have adopted cloud computing and distributed systems to develop their applications. An unintended consequence: Traditional IT operations (ITOps) often struggle to handle the complexities of increased workloads and cloud technologies. 

As distributed systems scale, keeping operations and development separate ultimately leads to stagnation. Developers might want to push out new applications or updates, while the operations team, already overwhelmed with keeping tabs on the existing infrastructure, might push back on any risks to the infrastructure.

Site reliability engineering (SRE) is a discipline that offers a more nuanced approach, combining software engineering principles with operational practices to ensure service reliability and optimal performance at scale. The people in this role, site reliability engineers (SREs), simplify and automate tasks that the operations team would otherwise perform manually. Less time spent on tedious, repetitive work opens the door for innovation and business growth.

Site reliability engineering has become an essential component of a modern organization. The benefits include saying goodbye to reactive problem-solving and hello to predictable performance, proactive system design, improved scalability, minimized service disruptions, and new opportunities for improvement. 

Want to know more about the SRE role and the world of site reliability engineering? Let’s start with the basics.

What is site reliability engineering?

Site reliability engineering is the practice of incorporating software engineering tools and principles into IT operations. SREs create and maintain reliable, resilient, efficient, and scalable infrastructure and services. They build systems that are resilient by design and use software and automation to manage an ever-growing infrastructure, a far more sustainable approach than managing it manually.

Site reliability engineers are the people responsible for managing and automating IT operations. With their software engineering expertise, they ensure systems stay resilient and available, automatically remediating any issues. This role oversees the delivery and deployment of new applications, preventing potential outages and interruptions.

History of site reliability engineering

Benjamin Treynor Sloss, vice president of Google Engineering, first coined the term “site reliability engineering” in 2003, saying, “SRE is what happens when you ask a software engineer to design an operations team.” And that’s what he did.1

As a new hire at Google, he was tasked with finding an engineering solution for managing rapidly expanding operations and a massive, distributed infrastructure. At the rate the company’s infrastructure was growing, it would be impossible to hire the necessary number of engineers to manage new services manually while innovating at the same pace. Instead, Treynor’s team balanced innovation with system reliability, fostering a culture of continuous learning and improvement.  

Soon, the growing team of SREs at Google was focused on pushing new features to the production environment while maintaining its stability and reliability. Within a few years, more companies faced the same problem as Google: managing massive, distributed infrastructures while maintaining the availability and reliability of services. The practice of site reliability engineering spread beyond Google and has become key to modern IT operations.

The role of SREs in modern IT infrastructures

In today’s digital-first world, businesses of all sizes have come to rely on highly available, scalable, and resilient systems. One outage, whether a website or a mobile app, can result in financial loss, bad customer experience, and operational inefficiencies. That’s why SREs play an essential role in any company. 

SREs make it easier to keep up with your competition. They can resolve availability issues in minutes (instead of days) and ensure fast page load times, no matter the number of users. 

At enterprise businesses, SREs do the same tasks on a different scale. They automate reliability, optimize performance, and prevent system failures through proactive monitoring and incident management. By fostering collaboration between development and operations teams, SREs create reliable and efficient systems.

Core principles of site reliability engineering

At its core, site reliability engineering is about treating operational issues in production with a software development approach. Other core principles include embracing risk, using automation, and setting up service-level objectives (SLOs) and service-level indicators (SLIs).

Embracing risk

An SRE recognizes that no system can perform perfectly. Failures and outages are expected as part of the innovation process. Instead of trying to avoid failure, SREs focus on understanding an acceptable risk level.

Embracing risk is about finding the tipping point between improving reliability, deploying new code, and managing the potential impact on users. Reliability improvements cost time, energy, and other resources, so SREs invest only up to the level of reliability users actually need; anything beyond that is excess. But how much risk is acceptable? And at what point in the process will the user experience start to degrade?

This core principle of SRE is based on four factors:

  1. An acceptable level of reliability for users (determined by setting up SLOs and SLIs)

  2. Cost of reliability improvements (including automation and tooling)

  3. Risk of not improving 

  4. Cost vs. risk (ascertained through error budgets)

Often, the key to embracing risk is cultural. SREs work in a nonpunitive, blameless culture: they learn from failure and implement preventative measures to continuously enhance system reliability and application performance.

Error budgets
An error budget is a clear metric for understanding and managing risk. It’s the amount of downtime (or the number of errors) a service can experience within a certain time. 

A permissible amount of system unreliability (a.k.a. an error budget) helps to balance innovation and reliability. Engineers are encouraged to take risks, like speeding up deployments and releasing new features, because they have a budget of errors. Once this threshold is reached, engineers stabilize the system, improving reliability.

SRE teams calculate an error budget by determining the acceptable level of errors (or downtime) based on the SLO. In other words, it’s the margin of error permitted by an SLO.
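To make this concrete, here is a minimal sketch (an illustration, not a prescribed formula) of how an availability SLO translates into an error budget of allowed downtime over a rolling window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for an availability SLO over a window.

    For example, a 99.9% SLO leaves a 0.1% error budget:
    everything the SLO does not promise is budget the team may spend.
    """
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

# A 99.9% availability SLO over 30 days permits roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
```

Once incidents have consumed those minutes, the team shifts effort from feature releases to stabilization until the window rolls over.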

Setting up service-level objectives (SLOs) and indicators (SLIs)

Service-level objectives, or SLOs, are the target values for performance over a certain amount of time. By design, both engineering teams and business stakeholders must understand these targets to set clear expectations that guide decision-making. 

Service-level indicators, or SLIs, measure service performance. Usually, SLIs represent user priorities such as latency, availability, throughput, and error rate.

Neither SLOs nor SLIs are static. They evolve over time and require regular review and improvement.
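As a simple illustration (the event counts below are hypothetical), an availability SLI can be computed as the fraction of successful requests and compared against the SLO target:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Availability SLI: the fraction of requests served successfully."""
    return good_events / total_events

slo = 0.999  # target: 99.9% of requests succeed
sli = availability_sli(good_events=998_700, total_events=1_000_000)

# 99.87% measured availability misses the 99.9% objective,
# signaling that the team should spend effort on reliability.
print(sli >= slo)
```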

Developing automation and tooling

Finally, automation. SREs strive to replace manual and repetitive tasks with automation. Reducing toil means improving system reliability and innovating faster and more efficiently.

Some SRE teams spend up to half of their time developing automation tools for deployment, incident response, and testing. Over time, advanced automation capabilities help decrease the cost of scaling while ensuring service reliability and optimal performance.
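As a sketch of the kind of toil such tooling removes, the following hypothetical script probes a service's health endpoint and restarts it when the check fails. The URL, restart command, and thresholds are placeholders, not a real runbook:

```python
import subprocess
import urllib.request

def probe(health_url: str, timeout: float = 5.0):
    """Return the HTTP status of a health endpoint, or None if unreachable."""
    try:
        with urllib.request.urlopen(health_url, timeout=timeout) as resp:
            return resp.status
    except OSError:
        return None

def needs_restart(status) -> bool:
    """Anything other than HTTP 200 triggers remediation."""
    return status != 200

def remediate(health_url: str, restart_cmd: list) -> bool:
    """Restart the service if its health check fails; True if a restart ran."""
    if needs_restart(probe(health_url)):
        subprocess.run(restart_cmd, check=True)  # e.g., a systemctl restart
        return True
    return False
```

In practice this logic would live in a scheduler or alerting pipeline rather than an ad hoc script, but the principle is the same: encode the manual fix once, then let software perform it.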

Key practices in site reliability engineering

When running services, SRE teams focus on key everyday activities such as monitoring and observability, incident management, capacity planning, and change management.

Monitoring and observability

System monitoring and observability are critical for SREs. They provide true visibility into whether services are working, how well they are performing, and where problems originate.

Monitoring helps site reliability engineers detect and address issues quickly.

Observability provides insights into system performance, both in real time and historically, to address unknown unknowns. Traces, logs, and metrics are the main observability signals.

4 golden signals of site reliability engineering

The four golden signals of site reliability engineering are latency, traffic, errors, and saturation. These metrics are foundational for measuring application reliability: the health and performance of a service in a distributed system.

  • Latency is the time it takes a system to respond to a request (either successfully or with an error). High latency signals performance issues that require immediate attention from SREs.

  • Traffic measures the demand on the system. Depending on the system, it could be the number of transactions per second or HTTP requests per second. SREs use traffic to figure out whether the user experience is degrading or not.

  • Errors are the rate of requests that fail. Requests can fail explicitly (such as HTTP 500s), implicitly (e.g., an HTTP 200 with the wrong content), or by policy (e.g., any request that takes longer than one second counts as a failure). Depending on the system, SREs can prioritize one type of error over the others and address recurring issues.

  • Saturation indicates the overall capacity of the system or how “full” the service is. It can be measured in various ways, depending on the resources that are most constrained or the remaining load a system can handle.

The four golden signals help SREs focus on capacity planning, improving system reliability over time, responding to and managing incidents, and finding the root cause of a problem.
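All four signals can be derived from raw request telemetry. Here is a minimal sketch; the record shape, one-second latency policy, and capacity figure are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    status: int

def golden_signals(requests, window_s: float, capacity_rps: float,
                   latency_policy_ms: float = 1000.0):
    """Compute the four golden signals over a window of request records."""
    traffic = len(requests) / window_s                      # requests per second
    durations = sorted(r.duration_ms for r in requests)
    p50 = durations[len(durations) // 2]                    # simple median latency
    # Errors: explicit failures (5xx) plus policy failures (too slow).
    errors = sum(1 for r in requests
                 if r.status >= 500 or r.duration_ms > latency_policy_ms)
    error_rate = errors / len(requests)
    saturation = traffic / capacity_rps                     # fraction of capacity used
    return {"traffic_rps": traffic, "latency_p50_ms": p50,
            "error_rate": error_rate, "saturation": saturation}

reqs = [Request(120, 200), Request(80, 200), Request(1500, 200), Request(95, 500)]
print(golden_signals(reqs, window_s=2.0, capacity_rps=10.0))
```

Note how the slow HTTP 200 counts as an error under the latency policy, even though it "succeeded" at the protocol level.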


Still, the four golden signals on their own won’t make complex distributed systems completely reliable. That’s where distributed tracing comes in. It places all the performance numbers into context.

Incident management

As we mentioned earlier, when it comes to SRE, incidents and outages are unavoidable. However, every incident still requires a response. An effective incident response plan includes triage procedures, clear communication protocols, and escalation paths.

Perhaps just as important are the postmortems. A constructive postmortem is a learning experience rather than an opportunity to point fingers. SREs should keep track of each incident, discover its root cause, and work with the development team to fix the code or build tools (if possible) to prevent it from recurring.

Take a deeper dive into incident management.

Capacity planning

Capacity planning ensures service reliability today and tomorrow. It protects from both over-provisioning and under-provisioning. 

SRE teams forecast demand by analyzing historical usage patterns and predicting future resource needs. They look for inefficiencies, optimize and reallocate resources based on real-time information, and regularly update these plans depending on the changing data. 

Using capacity planning, site reliability engineers ensure systems can handle growth and demand spikes.
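The forecasting step can be as simple as fitting a trend to historical peaks and adding headroom. A naive sketch (the linear model, 30% headroom, and sample figures are assumptions for illustration):

```python
def forecast_capacity(history, horizon: int, headroom: float = 0.3):
    """Linear-trend forecast of peak demand, plus a safety margin.

    `history` is a list of peak usage per period (e.g., daily peak RPS);
    `horizon` is how many periods ahead to provision for.
    """
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    # Least-squares slope and intercept of usage over time.
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    projected = intercept + slope * (n - 1 + horizon)
    return projected * (1 + headroom)

# Demand growing ~10 RPS/day; provision for 30 days out with 30% headroom.
print(forecast_capacity([100, 110, 120, 130], horizon=30))
```

Real capacity plans use richer models (seasonality, launch events), but the shape is the same: project demand, then provision above it without grossly over-provisioning.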

Change management

Changes often lead to outages, as even traditional ITOps can attest. However, instead of fearing outages, and therefore fearing change, SREs embrace change with three best practices:

  • Progressive, controlled rollouts minimize the impact of potential problems and allow for early detection.

  • Monitoring ensures SREs can detect problems accurately in real time.

  • A rollback plan guarantees a safe and quick procedure to roll back changes if problems arise.  

All three of these practices should be automated if possible.
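Tying the three practices together, a progressive rollout can be modeled as a set of traffic stages that advance while the monitored error rate stays within budget and roll back the moment it does not. A hypothetical sketch (stage fractions and the error budget are illustrative):

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]  # canary -> full, fractions of traffic

def advance_or_rollback(stage: int, error_rate: float, budget: float = 0.001):
    """Advance to the next rollout stage while errors stay within budget;
    return -1 (roll back the release) the moment they exceed it."""
    if error_rate > budget:
        return -1
    return min(stage + 1, len(ROLLOUT_STAGES) - 1)

def routes_to_canary(user_id: int, stage: int) -> bool:
    """Deterministically bucket users into the new version by hash,
    so the same user always sees the same version during a stage."""
    return (hash(user_id) % 100) / 100 < ROLLOUT_STAGES[stage]
```

Monitoring supplies the `error_rate` input, the staged fractions keep the blast radius small, and the `-1` branch is the automated rollback path.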

Full-stack observability solution for SREs with Elasticsearch

Elastic Observability provides a unified solution for collecting, monitoring, and analyzing observability metrics across your technology stack. With Elastic Observability, you can collect, store, and visualize observability metrics from any source and speed up problem resolution with Elastic’s Search AI Platform.


Combine conversational search and agentic AI with Elastic Observability for a context-aware chat experience that extends to include your proprietary data and runbooks.

Elastic AI Assistant can help SREs interpret log messages and errors, optimize code, write reports, and even identify and execute a runbook. Accelerate problem resolution, foster collaboration, and unlock knowledge to empower all users and ensure reliability.

Learn more about observability with Elastic.

Sources

1. Google, “Google SRE Book,” 2017.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.