How to create SLI specifications for your Web Services?

Arjit Sharma
4 min read · May 1, 2021

Recently, while working on a product with a microservices-based architecture and a heavy reliance on cloud PaaS services, I struggled to work out the right indicators to tell me whether my services were running successfully or needed attention.

As the developers of a service, we think we know what the best indicators of success would be. However, that’s not true: the users (in our scenario, the QA & PVT teams) are the best judges, and if they are not satisfied with the performance, they will punish you to be a sugar cookie :-).

And as Peter Drucker said, “you can’t manage what you can’t measure”, so getting SLOs based on the right indicators was a must for me.

Image Credits: Seth Eckert: https://dribbble.com/seth_eckert

I got a chance to go through the great content (and books) by Google on Site Reliability Engineering, which explains in detail how to measure metrics to create SLIs and SLOs and then set expectations with customers in SLAs. While reading the books, I started making notes and decided to write this article on what I learned, so that I can refer to it later and so that it can help anyone who doesn’t have much time to go into the details and wants to-the-point examples in easy language.

So here I would like to take a simple sample web service and identify the right SLIs to tell me whether my users are happy using the service. We will write high-level specifications for the identified SLIs, which can then be implemented in monitoring tools (e.g. Azure Monitor, Prometheus) to generate the right set of alerts.

Enough talk, let’s start the actual work.
Let’s say my simple sample website has a user Home page that displays user details and a Today’sToDo page that is used to update and display today’s ToDo tasks for the user. For ease of understanding, here is a sequence diagram of the request and response to get the user’s Home page.

Figure 1: Sequence diagram representing the call to the User’s Home Page

The user journey for accessing the Home page is request/response based, hence in accordance with the SLI Menu (or defined SLI standards) we have to use the Availability and Latency SLIs.
Every typical website user (including me :-) ) expects that whenever they click on the Home page, it loads successfully and quickly.

But the question here is: how do we measure a ‘quick’ and a ‘successful’ response? The Home page loading successfully is my Availability, and the Home page loading quickly is my Latency.

So let’s write the specifications for them…

1. Availability SLI Specification:

Let’s start with the Availability SLI specification, which states:
“The proportion of Valid Requests served Successfully”.

Valid Requests: all HTTP GET requests for /home/{user} and /home/{user}/profileImage.

Successfully: 2XX, 3XX and 4XX (except 429, rate limiting) response codes, measured at the load balancer.

This is what defines “Successful”.

So the Availability SLI specification of my service will be:

Proportion of HTTP GET requests
for /home/{user} or /home/{user}/profileImage
that have 2XX, 3XX or 4XX (excl. 429) status
measured at the load balancer
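
To make the specification concrete, here is a minimal sketch of how it could be evaluated over a batch of load balancer log entries. The `Request` record, its field names, the regular expression (which assumes `{user}` is a single path segment) and the sample data are all my own assumptions for illustration; a real monitoring tool would compute the same ratio with its own query language.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical shape of one load balancer access-log entry."""
    method: str
    path: str
    status: int


# Valid requests: GET /home/{user} and GET /home/{user}/profileImage.
# Assumption: {user} is a single path segment without slashes.
VALID_PATH = re.compile(r"^/home/[^/]+(/profileImage)?$")


def is_valid(req: Request) -> bool:
    return req.method == "GET" and VALID_PATH.match(req.path) is not None


def is_successful(req: Request) -> bool:
    # 2XX, 3XX and 4XX count as successful, except 429 (rate limiting).
    return 200 <= req.status < 500 and req.status != 429


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Proportion of valid requests served successfully (None if no valid traffic)."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return None
    good = sum(1 for r in valid if is_successful(r))
    return good / len(valid)


# Made-up sample data, only to exercise the function.
sample = [
    Request("GET", "/home/alice", 200),
    Request("GET", "/home/alice/profileImage", 304),
    Request("GET", "/home/bob", 404),
    Request("GET", "/home/bob", 429),     # rate limited: counts as bad
    Request("GET", "/home/carol", 503),   # server error: counts as bad
    Request("POST", "/home/alice", 500),  # not a GET: not a valid request
    Request("GET", "/todo/alice", 500),   # different page: not a valid request
]

print(availability_sli(sample))  # 3 good out of 5 valid requests
```

Note that the two ignored requests do not hurt the SLI even though they failed, because they are outside the “Valid Requests” definition.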

2. Latency SLI Specification:

Next is the Latency SLI specification, which states:
“The proportion of Valid Requests that were served faster than a Given Threshold”.

Valid Requests: all HTTP GET requests for /home/{user}.

Given Threshold: X ms (150 ms in our scenario), measured at the load balancer.

This is what defines “Quickly”.

So the Latency SLI specification of my service will be:

The proportion of HTTP GET requests
for /home/{user}
that send their entire response within 150 ms
measured at the load balancer.
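
As a rough sketch, the latency specification can be evaluated the same way: count the valid requests that finished within the threshold and divide by all valid requests. The `TimedRequest` record, its field names and the sample latencies are assumptions of mine, and I treat a response that takes exactly 150 ms as “within” the threshold.

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TimedRequest:
    """Hypothetical load balancer log entry with the time to send the full response."""
    method: str
    path: str
    latency_ms: float


# Valid requests: GET /home/{user} only (profileImage is not part of this SLI).
# Assumption: {user} is a single path segment without slashes.
HOME_PATH = re.compile(r"^/home/[^/]+$")
THRESHOLD_MS = 150.0  # the threshold used in this article's scenario


def latency_sli(
    requests: Sequence[TimedRequest], threshold_ms: float = THRESHOLD_MS
) -> Optional[float]:
    """Proportion of valid requests served within the threshold (None if no valid traffic)."""
    valid = [
        r for r in requests
        if r.method == "GET" and HOME_PATH.match(r.path) is not None
    ]
    if not valid:
        return None
    # Assumption: exactly threshold_ms still counts as fast enough.
    fast = sum(1 for r in valid if r.latency_ms <= threshold_ms)
    return fast / len(valid)


# Made-up sample data, only to exercise the function.
timed_sample = [
    TimedRequest("GET", "/home/alice", 42.0),
    TimedRequest("GET", "/home/bob", 150.0),
    TimedRequest("GET", "/home/carol", 149.9),
    TimedRequest("GET", "/home/dave", 310.5),               # too slow
    TimedRequest("GET", "/home/alice/profileImage", 900.0),  # not in scope
]

print(latency_sli(timed_sample))  # 3 fast out of 4 valid requests
```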

With these two SLI specifications in place, it is easy to structure aspirational SLOs for my service, which would be:

Figure 2: Aspirational Service Level Objectives based on SLI specifications
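
Once an SLO target exists, checking it is just a comparison between the measured SLI and the target. The sketch below shows that comparison; the `Slo` type, the function name and especially the target values (99% and 95%) are placeholders I made up for illustration, not the values from Figure 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO: the SLI must be at least `target` over the measurement window."""
    name: str
    target: float  # required proportion of good events, between 0 and 1


def meets_slo(slo: Slo, good_events: int, valid_events: int) -> bool:
    """Return True when the measured SLI (good / valid) reaches the SLO target."""
    if valid_events == 0:
        # Assumption: a window with no valid traffic does not violate the SLO.
        return True
    return good_events / valid_events >= slo.target


# Placeholder targets, NOT taken from the article's figure.
availability_slo = Slo("home page availability", target=0.99)
latency_slo = Slo("home page latency <= 150 ms", target=0.95)

print(meets_slo(availability_slo, good_events=995, valid_events=1000))
print(meets_slo(latency_slo, good_events=940, valid_events=1000))
```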

This is a very basic example, just to give an overview of how we write the SLI specifications of a simple web service. My next step will be to take a slightly more complex example with multiple calls and group them to write SLI specifications.

I would be more than happy to get inputs, suggestions and claps to make it better for everyone who reads it :-)

Sources:

  1. Site Reliability Engineering Books: https://sre.google/books/
  2. Class SRE implements DevOps: YouTube playlist
  3. Google SRE Homepage: https://sre.google/
  4. Acronym Sheet (SLI / SLO / SLA): Google Cheat Sheet
