Include the most relevant metrics for measuring the service your system provides.
When you look at historical data, do not rely on an overall view; analyse every time slot instead. For example, if there is a peak on a specific day, traffic was probably high at that time, and that is exactly where we need to guarantee reliability. This should be considered even if it occurs only once.
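A minimal sketch of why the overall view can hide a bad slot. The event data, the one-hour slot size, and the helper names `sli_per_slot` and `sli_overall` are assumptions made up for illustration, not part of any real system.

```python
from collections import defaultdict

def sli_per_slot(events, slot_seconds=3600):
    """Success ratio per time slot; events are (timestamp_seconds, ok) pairs."""
    good = defaultdict(int)
    total = defaultdict(int)
    for ts, ok in events:
        slot = int(ts // slot_seconds) * slot_seconds
        total[slot] += 1
        if ok:
            good[slot] += 1
    return {slot: good[slot] / total[slot] for slot in sorted(total)}

def sli_overall(events):
    """Success ratio over the whole period, ignoring time slots."""
    events = list(events)
    return sum(1 for _, ok in events if ok) / len(events)

# Hypothetical day: 23 quiet hours (100 requests each, all OK) and one
# peak hour (hour 18) with 1000 requests, of which 200 fail.
events = []
for hour in range(24):
    if hour == 18:
        events += [(hour * 3600 + i, i >= 200) for i in range(1000)]
    else:
        events += [(hour * 3600 + i, True) for i in range(100)]

overall = sli_overall(events)
per_slot = sli_per_slot(events)
worst_slot = min(per_slot, key=per_slot.get)
print(f"overall: {overall:.3f}")
print(f"worst slot starts at hour {worst_slot // 3600}: {per_slot[worst_slot]:.3f}")
```

With this made-up data the whole-day ratio stays near 94% while the peak hour sits at 80%, which is the slot where most customers were actually present.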
Examples:
How should percentiles be treated? Can we aggregate them over time? When can they be aggregated, and when not?
Can we average percentiles across many modules and over a long period of data analysis?
If we do not use logs, what is the benefit of counting events as soon as they happen and evaluating the SLI based on these fresh results?
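One way to explore the percentile questions above is to compare an average of per-module percentiles with the percentile of the pooled raw data. The latency values and module names below are invented, and the nearest-rank definition of a percentile is an assumption (other definitions interpolate and give slightly different numbers).

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds.
module_a = [10] * 990 + [500] * 10   # busy module: 1000 requests
module_b = [10] * 8 + [900] * 2      # quiet module: 10 requests

p99_a = percentile(module_a, 99)
p99_b = percentile(module_b, 99)
averaged_p99 = (p99_a + p99_b) / 2
pooled_p99 = percentile(module_a + module_b, 99)

print("p99 per module:", p99_a, p99_b)
print("average of the two p99 values:", averaged_p99)
print("p99 of the pooled requests:", pooled_p99)
```

The two numbers disagree because the average gives the quiet module the same weight as the busy one. Raw samples or histogram bucket counts can be merged across modules and time windows and the percentile recomputed; already-computed percentiles generally cannot.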
Metrics have two faces:
1. The proportion of successfully retrieved SKU IDs associated with products, for /api/getSKUs, that have a valid ID retrieved from the server side, measured at the client side.
The proportion of .... for .... that have ... measured at ...
https://developer.android.com/google/play/billing/integrate
1. The proportion of valid retrieved SKU IDs associated with products, for /api/getSKUs, that have a valid ID retrieved from the server side, measured at the client side.
- This shows the rate of SKU IDs presented to the customer. If we have 10 products in our store, this SLI specifies how many of them are presented to the customer.
- The validity of a SKU ID should be evaluated at the client side, since this confirms that the received IDs are the ones that match the server side.
- The implementation should be instrumented at the client side, by counting the SKU IDs and validating them.
- This SLI is estimated every time the customer wants to see the available products in the store.
- Every time the customer refreshes or revisits this stage, this rate should be updated.
- This SLI is active while the customer is viewing the list of products and needs to be stopped when the customer moves to the next stage, which is selecting a specific product and asking for more details.
- This SLI needs to be implemented on the client side.
- The gap for this SLI includes loss of connectivity when the client loses its internet connection, and the customer's device used for this application.
2. The proportion of HTTP GET requests for /api/completePurchase that have 0 or 1 as the response code value, measured at the client side.
- This SLI shows the rate of successful transactions, associating a successful transaction with response codes 0 and 1: success, and the user pressed back or cancelled a dialog.
- This SLI should be instrumented at the client side and count the number of 0 and 1 responses against the total number of response codes.
- This SLI is estimated every time the customer presses the "buy" button on the screen.
- This SLI is active for the duration of the "buy" transaction.
- The gap for this SLI includes loss of connectivity, as in that case the application should report that connectivity was lost.
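Both SLIs above are ratios of good events to total events counted on the client, so they can be sketched with one small counter. This is only an illustration under stated assumptions: `RatioSLI`, `record_sku_listing`, and `record_purchase` are hypothetical names, and the sketch assumes the client has an expected list of SKU IDs to compare against. The "good" codes 0 and 1 are taken from the notes above.

```python
from dataclasses import dataclass

@dataclass
class RatioSLI:
    """Counts good and total events; the SLI is good / total."""
    good: int = 0
    total: int = 0

    def record(self, ok: bool) -> None:
        self.total += 1
        if ok:
            self.good += 1

    def value(self):
        # No events yet means the SLI is undefined, not 0% or 100%.
        return None if self.total == 0 else self.good / self.total

def record_sku_listing(sli, expected_ids, received_ids):
    """SLI 1: each expected SKU ID is one event, good if the client received it."""
    received = set(received_ids)
    for sku_id in expected_ids:
        sli.record(sku_id in received)

# Per the notes: 0 = success, 1 = user pressed back or cancelled a dialog.
GOOD_PURCHASE_CODES = {0, 1}

def record_purchase(sli, response_code):
    """SLI 2: each purchase attempt is one event, good if the code is 0 or 1."""
    sli.record(response_code in GOOD_PURCHASE_CODES)

# Hypothetical usage: 10 products in the store, 8 reach the client.
sku_sli = RatioSLI()
expected = [f"sku_{i}" for i in range(10)]
record_sku_listing(sku_sli, expected, expected[:8])

# Hypothetical purchase attempts with made-up response codes.
purchase_sli = RatioSLI()
for code in [0, 1, 0, 6, 2]:
    record_purchase(purchase_sli, code)

print("SKU listing SLI:", sku_sli.value())
print("Purchase SLI:", purchase_sli.value())
```

Counting at the moment each event happens keeps the ratio fresh; a new `RatioSLI` can be started whenever the customer refreshes the product list or starts another purchase.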
Set an alert ONLY when manual intervention is required, not for informative purposes. Alerts mean action, and now. If this is not possible, separate the alerts that require action from the alerts where no action is required.
Monitoring should be displayed in one place.
NOTES
Get the difference between service satisfaction and content satisfaction.
How error budget deals with customer trust: I believe customers should be able to set our error budget, not us.
Also, postmortems should be quick and should not generate historic toil.