SRE
Challenges in establishing SLOs in a Microservices environment
Though a microservices architecture offers many advantages, SREs may face practical challenges in maintaining and improving the reliability of the underlying system.
Setting up and maintaining SLOs is particularly hard in an environment where hundreds of interdependent microservices work together to serve end users. If you provide a Platform as a Service, where system performance can be impacted or altered by the end users themselves, it becomes harder still.
In this post, I would like to discuss some of the challenges I encountered as an SRE and share some thoughts on improving the adoption of SLOs.
Identifying the Critical User Journeys
Every service owner or team can come up with its own list of critical user journeys (CUJs) for which it is directly responsible. Prioritizing those CUJs, selecting the top N, and implementing them therefore requires coordination across teams and numerous meetings.
I kinda started to hate the word “priority”. Prioritizing is not ignoring the rest of the items on the list; it is a way of balancing the present against the future.
How to solve this efficiently:
- Use a proper prioritization method. This article lists 9 prioritization methods: Product Prioritization Frameworks: The 9 Most Popular (roadmunk.com)
- Pick one SLO per module. How hard this is depends on the application you are dealing with. For example, a Library Management System can comprise N modules, each backed by separate microservices.
I personally use the MoSCoW method, which really helps to get a sense of the timeline.
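To make the idea concrete, here is a minimal sketch of ranking CUJs with the MoSCoW method. The journeys and their bucket assignments are hypothetical examples, not from any real system.

```python
# Rank CUJs by MoSCoW bucket: Must > Should > Could > Won't.
MOSCOW_RANK = {"Must": 0, "Should": 1, "Could": 2, "Wont": 3}

# Hypothetical CUJs for a Library Management System.
cujs = [
    ("Search catalog", "Should"),
    ("Checkout book", "Must"),
    ("Export reports", "Could"),
    ("Login", "Must"),
]

# Sort by bucket; ties keep submission order because sorted() is stable.
prioritized = sorted(cujs, key=lambda c: MOSCOW_RANK[c[1]])
top_n = prioritized[:2]  # pick the top N journeys to implement first

for name, bucket in prioritized:
    print(f"{bucket:<6} {name}")
```

The point is less the sorting than the shared vocabulary: once every team tags its CUJs with the same four buckets, the cross-team meetings can focus on the "Must" list.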
Teams just worry about their services
Most of the time, product development teams worry only about how their own service is working instead of looking at the PaaS at a macro level. CUJs also may not apply in all cases; consider, for example, a team that is responsible only for the data layer.
As a result, every team wants SLOs defined for its own services, and in some cases teams want to group SLOs across their services.
For example, say a team owns 10 services in an application with 100 microservices. First, they expect separate SLOs for those 10 services. Second, within those 10 services, they want to give special treatment to 2 of them.
How to solve this: Looked at from the teams’ perspective, this is in one way or another a valid concern. They want to make sure their quality is high, and they don’t want to listen to the noise created by other services.
- Create awareness around Key Performance Indicators (KPIs). In this case, IMO, teams should focus on their KPIs. Teams can create as many KPIs as they want and build monitors around them. As SREs, we can help teams set up those KPIs, but we do not enforce them.
- Create awareness around functional and performance testing. These reliability numbers can be obtained before the service reaches prod, so teams can be more confident in their services and in how they perform alongside other services.
- Establish team-level or domain-level SLOs, but only a few. Letting teams come up with their own SLIs and SLOs will earn more commitment from them.
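The bridge between a team-level KPI and an SLO is the SLI and its error budget. A minimal sketch, with illustrative numbers (the KPI here is assumed to be a success-count metric):

```python
def sli(good_events: int, total_events: int) -> float:
    """Success-ratio SLI, e.g. non-5xx responses / all responses."""
    return good_events / total_events

def error_budget_remaining(sli_value: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    allowed = 1.0 - slo_target   # total budget, e.g. 0.1% for a 99.9% target
    spent = 1.0 - sli_value      # budget consumed so far
    return (allowed - spent) / allowed

# Illustrative month of traffic for one team's service.
current = sli(good_events=999_100, total_events=1_000_000)
print(f"SLI: {current:.4%}")
print(f"Budget left: {error_budget_remaining(current, slo_target=0.999):.1%}")
```

A team can track this math on its own KPIs without any SLO being formally enforced, which is exactly the awareness the bullet above is after.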
API Performance differences
In some cases, APIs of the same service may behave differently depending on the functionality or input parameters. In these cases, teams may want to exclude those APIs from the SLO calculation, or establish individual SLOs for them.
This is particularly problematic if you have error-rate, throughput, and latency SLOs based on HTTP requests, and it can also skew the site-level SLOs.
How to solve this:
- Do not use HTTP SLOs for such services. The service can publish its own high-level metrics, which can then be used to establish SLOs.
- Improve the performance or reliability of the underlying APIs. The whole point of KPIs, SLIs, and SLOs is to make sure the system performs efficiently. If one or a few APIs have to be excluded, that indicates an actual underlying issue (architectural, functional, or non-functional), and my belief is that it can be fixed.
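Before deciding whether to exclude an API or fix it, it helps to see the skew per endpoint. A sketch with made-up sample data, using a simple nearest-rank percentile:

```python
from collections import defaultdict

def p95(samples):
    """Nearest-rank 95th percentile of a list of latencies (ms)."""
    ordered = sorted(samples)
    idx = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[idx]

# Hypothetical request log: (endpoint, latency in ms).
requests = [
    ("/books", 40), ("/books", 55), ("/books", 48),
    ("/reports/annual", 900), ("/reports/annual", 1200),
]

by_endpoint = defaultdict(list)
for path, latency_ms in requests:
    by_endpoint[path].append(latency_ms)

for path, samples in by_endpoint.items():
    print(f"{path}: p95={p95(samples)}ms over {len(samples)} calls")
```

A slow, heavy endpoint like the report export here would dominate a single service-wide latency SLO, which is exactly why it surfaces as a candidate for either its own SLO or an architectural fix.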
Queue Based Systems
If any services leverage worker queues, then it is necessary to establish custom metrics; you cannot rely on HTTP metrics.
How to solve this:
- Publish custom KPIs and establish SLOs around them
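One common custom KPI for a queue-based service is processing "freshness": how long a message waited before it was handled. A sketch, assuming each processed message records hypothetical `enqueued_at` and `processed_at` timestamps:

```python
def freshness_sli(messages, threshold_s=60.0):
    """Fraction of messages processed within threshold_s of being enqueued."""
    on_time = sum(
        1 for m in messages
        if m["processed_at"] - m["enqueued_at"] <= threshold_s
    )
    return on_time / len(messages)

# Illustrative batch (timestamps in seconds).
batch = [
    {"enqueued_at": 0.0, "processed_at": 12.0},    # fast
    {"enqueued_at": 5.0, "processed_at": 50.0},    # within threshold
    {"enqueued_at": 10.0, "processed_at": 200.0},  # too slow
]
print(f"Freshness SLI: {freshness_sli(batch):.1%}")
```

An SLO could then target, say, 99% of messages processed within 60 seconds, measured over 30 days, with no HTTP metrics involved at all.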
Multiple Environments and Locations
In some cases, SREs have to make sure all environments are operating normally, not just production:
- Dev, Stag, Demo, Perf, Prod environments
- Production Deployments in multiple global locations. These deployments can be isolated or shared.
How to solve this:
- Still rely on global SLOs if you are focused on CUJs, and use KPIs to monitor the individual sites.
- Set up SLOs for all environments, but enforce them only in key environments or just in production.
Managing thousands of SLOs
Based on the above discussion, you may end up with thousands of SLOs. I know the panic you feel when you hear “thousands of SLOs”. Hold on a second and keep reading…
If you are a single SRE team supporting a complex microservices environment, this is a reality, driven by the organization, team, and product structure. Every service may publish tens or hundreds of KPIs, and adding a new environment or region can multiply the number of SLOs.
How to manage the SLOs efficiently:
- If you are using a monitoring system, create the SLOs programmatically. For example, Datadog SLOs can be managed using Terraform, so you don’t need to worry about configuring them manually.
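Whatever the tooling (Terraform, a vendor API, or plain config files), the core idea is generating one definition per service × environment instead of hand-writing each. A language-agnostic sketch in Python; the schema below is illustrative, not any vendor's actual API:

```python
import json
from itertools import product

# Hypothetical inventory; in practice this would come from a service catalog.
services = ["catalog", "checkout", "search"]
environments = ["prod", "stage"]

def slo_definition(service, env, target=99.9):
    """Build one availability-SLO definition for a service/environment pair."""
    return {
        "name": f"{service}-{env}-availability",
        "target": target,
        "timeframe": "30d",
        "query": f"sum:requests.ok{{service:{service},env:{env}}}",
    }

slos = [slo_definition(s, e) for s, e in product(services, environments)]
print(f"Generated {len(slos)} SLO definitions")
print(json.dumps(slos[0], indent=2))
```

Adding a fourth service or a third environment is then a one-line change to the inventory, and the multiplication the previous paragraph warns about stops being manual toil.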