Nir is a technical writer at Squadcast, exploring the areas of DevOps and Site Reliability Engineering. He enjoys writing about new tools that SREs will love.
Major failures are inevitable, even in the best-maintained infrastructure and systems. Being able to quickly classify an incident's severity allows your on-call team to react more effectively.
Imagine a scenario where your on-call team receives critical alerts every 15 minutes, user complaints are piling up on social media, and, since your platform is down, revenue loss increases every minute. How do you get your platform back on track? This is where understanding the severity and priority of incidents can be invaluable. In this article, we’ll look at severity levels and how they can improve your incident response process.
Severity and priority: how are they different?
In most cases, end-user impact is a measure of the severity of an incident. Error information directly from the monitoring tool helps classify the severity level. Each organization will have defined severity levels and procedures that suit them. To begin defining incident severity levels, we must first understand how to categorize them.
Two major questions to ask:
- Are major workflows now affected?
- Does it interfere with a user’s ability to perform an essential task?
Identifying the most critical workflows of your applications or services is one of the first steps in setting severity levels. It helps to identify what counts as an incident. Using “SEV” levels, we can classify incidents according to their severity. Major incidents are given lower SEV numbers (SEV-1 is the most severe) and require a quick response.
Each organization needs to understand its own business, its team, and the SEV level descriptions that work best for it. The table later in this article shows one way to set severity levels for your organization.
It may seem that the severity and priority of an incident are one and the same. Isn’t it reasonable to prioritize handling a catastrophic event over a minor one? In reality, it’s more complicated than that for most businesses.
Once the error information has been received, the incident commander assigns a priority level to the incident. This could be P1 (Priority Level 1) for issues that need to be resolved as soon as possible. Severity speaks to impact on the user, and priority is the order in which on-call engineers will work on issues affecting the infrastructure.
For example, on an e-commerce platform, if customers are unable to check out their cart, this is a high severity incident. In this specific case, it is also a high priority incident. On the other hand, if there is a typo in the brand logo or if the font size is too large, it is a high priority incident without being a high severity incident. Customers can continue to shop on the website.
Conversely, suppose an incident causes your application to crash and prevents users from doing what they need to do. It has a high severity. However, if this incident only affects 0.01% of your users, it may not be treated as the highest priority when there are other incidents that affect a larger number of users.
It is important to know when the two measurements are aligned and when they are not. When an item is high priority, it does not necessarily follow that it is of high severity.
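One way to keep the two measurements apart is to store them as separate fields on each incident. The sketch below is illustrative only: the `Incident` class, its field names, and the `work_order` helper are assumptions made for this example, not part of any particular tool. It sorts the work queue by priority and uses severity only to break ties.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means greater user impact (SEV-1 is the worst)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


@dataclass
class Incident:
    # Hypothetical record; field names are chosen for illustration.
    title: str
    severity: Severity        # impact on the users who hit the problem
    affected_fraction: float  # share of users affected, 0.0 to 1.0
    priority: int             # set by the incident commander; 1 = work on first


def work_order(incidents):
    """Return incidents in the order on-call engineers should pick them up."""
    # Sort by priority first; severity only breaks ties.
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


incidents = [
    Incident("Crash for a small user segment", Severity.SEV1, 0.0001, priority=2),
    Incident("Checkout unavailable", Severity.SEV1, 0.9, priority=1),
    Incident("Typo in brand logo", Severity.SEV5, 1.0, priority=1),
]

for incident in work_order(incidents):
    print(f"P{incident.priority} {incident.severity.name}: {incident.title}")
```

With these sample values, the logo typo (low severity, high priority) is picked up before the crash that affects 0.01% of users (high severity, lower priority), which mirrors the two examples above.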
Setting severity levels for your organization
Not all incidents are the same, and not all companies handle them the same way. In addition to the consequences of an incident, you will need to consider who your users are when establishing severity levels and the accompanying procedures and expectations.
A reliability platform like Squadcast and an e-commerce platform will define severity differently. Because each of them has users with different requirements and tolerance levels, it is essential to first understand what your users expect.
How to determine severity levels?
Consider the following before deciding on severity levels:
Periods of high and low traffic for your service
At certain times of the week, your customer traffic may be low. If an incident occurs at this time, few of your users will be affected. For example, if the shopping cart of an e-commerce site is not functional late at night, few users will be affected.
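As a rough sketch of how traffic periods could feed into classification, the snippet below relaxes the severity by one level when an incident starts inside a quiet window. The window boundaries, the one-level adjustment, and the function names are all assumptions for illustration; your own traffic data should drive the real values.

```python
from datetime import datetime, time

# Hypothetical low-traffic window; derive the real one from your own traffic data.
LOW_TRAFFIC_START = time(1, 0)  # 01:00
LOW_TRAFFIC_END = time(5, 0)    # 05:00


def in_low_traffic_window(moment: datetime) -> bool:
    """True if the given moment falls inside the assumed quiet period."""
    return LOW_TRAFFIC_START <= moment.time() < LOW_TRAFFIC_END


def adjust_severity(base_sev: int, started_at: datetime) -> int:
    """Relax the SEV number by one level during quiet hours, never past SEV-5."""
    if in_low_traffic_window(started_at):
        return min(base_sev + 1, 5)
    return base_sev
```

For example, a cart failure that would normally be SEV-2 would be treated as SEV-3 if it started at 03:00, and would stay at SEV-2 if it started at 14:00. Some teams may prefer never to relax SEV-1 incidents; that is a policy choice this sketch does not make for you.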
The architecture of your infrastructure
You may be using a microservices-based architecture that has multiple redundancies and can easily scale with higher user load. In such a scenario, the failure of a single component will not be considered a high severity incident, as a redundant instance can take over. However, if a service that cannot be easily replicated goes down, such as the authentication service, this becomes a high severity incident, because even if the other components are working correctly, your users will not be able to use the product.
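The reasoning above can be sketched as a small decision function. The `Component` class, its fields, and the specific SEV numbers returned are assumptions for this example; the point is only that redundancy and "does everything depend on this?" are separate inputs.

```python
from dataclasses import dataclass


@dataclass
class Component:
    # Hypothetical description of a service in your architecture.
    name: str
    healthy_replicas: int   # instances still serving traffic
    blocks_all_users: bool  # True if nothing works without it (e.g. authentication)


def severity_for_failure(component: Component) -> int:
    """Map a component failure to a SEV number (1 is the most severe)."""
    if component.healthy_replicas > 0:
        # Redundant instances absorb the load, so user impact is limited.
        return 4
    if component.blocks_all_users:
        # No replica left and every user depends on this component.
        return 1
    # Fully down, but users can still use the rest of the product.
    return 2
```

Under these assumed values, `severity_for_failure(Component("auth", 0, True))` returns 1, while a recommendations service with two healthy replicas returns 4.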
Using SLOs to Determine Severity Levels
Since each service has its own service level objective (SLO) that defines how it is expected to perform, we can use the SLO to determine the severity level. For example, if the SLO of a particular service is based on the transaction rate and the number of successful transactions drops below a certain threshold, we can classify it as a high severity incident.
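A minimal sketch of this idea, assuming the SLO is expressed as a transaction success rate: the target of 99.9% and the 95% floor below which the incident is treated as high severity are made-up numbers, and both function names are hypothetical.

```python
from typing import Optional

SLO_TARGET = 0.999          # hypothetical objective for successful transactions
HIGH_SEVERITY_FLOOR = 0.95  # hypothetical threshold for a high severity incident


def transaction_success_rate(successful: int, total: int) -> float:
    """Fraction of transactions that succeeded in the measurement window."""
    if total == 0:
        return 1.0  # no traffic, so nothing failed
    return successful / total


def severity_from_success_rate(rate: float) -> Optional[int]:
    """Return a SEV number, or None when the SLO is being met."""
    if rate >= SLO_TARGET:
        return None
    if rate < HIGH_SEVERITY_FLOOR:
        return 1  # far below the objective: high severity
    return 3      # SLO violated, but above the high severity floor
```

For instance, 900 successful transactions out of 1,000 give a rate of 0.9, which falls below the assumed floor and would be classified as SEV-1 here.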
Severity definitions are organization-specific. An incident classified as SEV-1 may have a lower severity rating in another organization. There are also cases where some organizations only have three severity levels. The general rule followed is that the higher the number of journeys/workflows affected by the incident, the higher the severity level.
Some organizations may also categorize severity levels based on affected SLIs (Service Level Indicators) or SLOs (Service Level Objectives). The table below lists one of many possible ways to set severity levels.
|Severity|Description|
|---|---|
|SEV-1|Typically, incidents are considered SEV-1 if large-scale outages occur in your infrastructure that negatively affect most users. Critical services are interrupted or unavailable. Database read/write errors, security vulnerabilities, and other issues can fall under this category. If third-party services (such as Google SSO) are down, users may not be able to log in, which may also be considered a SEV-1 issue.|
|SEV-2|Usually, a SEV-2 incident is declared when the user experience is severely affected. This may include unacceptably high latency levels or a significant violation of SLAs/SLOs. These types of incidents can result in a significant loss of revenue for your organization. Any incident affecting more than 70% of users can be classified as SEV-2.|
|SEV-3|This is an incident that has minimal infrastructure impact, but still creates high load or latency issues for your users. This may include unacceptable website load times, shopping cart timeouts, and other similar issues.|
|SEV-4|This is an issue that affects the customer experience but does not have a major impact on the operation of the service. This can include inconsistent page load times, display issues in different browsers, and similar issues.|
|SEV-5|Low-level errors, such as formatting or display issues that do not affect usability, are classified as SEV-5. This can include typos in product descriptions, incorrect colors displayed in brand logos, and other such issues.|
Properly classifying incident severity levels is critical to staying ahead of the game in resolving infrastructure issues. Working with pre-defined severity levels helps on-call teams quickly triage major issues. As we have seen in this article, each organization will have its own way of deciding the severity and priority of incidents.
As the nature and scale of your infrastructure grows and the needs of your user base change over time, you may want to review and modify severity level definitions. Continuous learning is an essential part of good incident response. We hope you find this article helpful in charting the way forward for better incident response in your organization.
Squadcast is an incident management tool specifically designed for site reliability engineering. Get rid of unwanted alerts, get relevant notifications, and integrate popular ChatOps tools. Work collaboratively using virtual war rooms and use automation to eliminate toil.
Feature image via Pixabay