What should we monitor


Notes from a talk given by Sneha Inguva

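  • Counters - cumulative metric that only ever increases (it resets to zero when the process restarts)
  • Gauges - single value that can go up and down
  • Histograms - sample observations and count them in configurable buckets
  • Summaries - also sample observations, but calculate quantiles directly over a sliding time window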
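To make the difference between the first three types concrete, here is a minimal sketch in plain Python. The class names `Counter`, `Gauge`, and `Histogram` and their methods are illustrative assumptions for these notes, not the API of any particular metrics library.

```python
import bisect


class Counter:
    """Cumulative metric: the value only ever increases."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Single value that can go up and down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations in buckets defined by their upper bounds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket at the end catches everything above the last bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.sum += value
```

For example, a `Histogram([0.1, 0.5, 1.0])` that observes 0.1, 0.3, and 2.0 ends up with one observation in the first bucket, one in the second, none in the third, and one in the overflow bucket.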

4 Golden signals - Google SRE

  • Latency - histogram + summaries
  • Traffic - counter + rate()
  • Errors - counter + rate()
  • Saturation - gauge
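Traffic and errors are both tracked as a counter with a rate applied on top. The sketch below shows the idea of turning raw counter samples into a per-second rate, assuming the `rate()` in these notes refers to a Prometheus-style query function. The function `simple_rate` is a hypothetical, simplified stand-in: it handles counter resets but does none of the extrapolation a real query engine may perform.

```python
def simple_rate(samples, window_seconds, now):
    """Per-second increase of a counter over the last `window_seconds`.

    `samples` is a list of (timestamp_seconds, counter_value) pairs,
    sorted by timestamp. Returns None if there is not enough data.
    """
    start = now - window_seconds
    in_window = [(t, v) for t, v in samples if start <= t <= now]
    if len(in_window) < 2:
        return None

    increase = 0.0
    for (_, previous), (_, current) in zip(in_window, in_window[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter went backwards, so assume it was reset to zero
            # and count everything accumulated since the reset.
            increase += current

    elapsed = in_window[-1][0] - in_window[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

Because a counter only grows, its raw value says little on its own. The rate is what shows whether traffic or errors are rising right now.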

RED - Subset of the 4 golden signals

  • R - request rate
  • E - error rate
  • D - duration
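As a rough illustration, the three RED numbers can be derived from a batch of request records collected over a fixed window. The record shape `(status_code, duration_seconds)`, the function name `red_summary`, the choice of counting 5xx responses as errors, and the 95th percentile are all assumptions made for this sketch.

```python
def red_summary(requests, window_seconds):
    """Compute request rate, error rate, and p95 duration.

    `requests` is a list of (status_code, duration_seconds) pairs
    observed during a window of `window_seconds`.
    """
    total = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(duration for _, duration in requests)

    if durations:
        # Nearest-rank percentile, using integer maths to avoid float rounding.
        rank = (95 * total + 99) // 100
        p95 = durations[rank - 1]
    else:
        p95 = None

    return {
        "request_rate": total / window_seconds,  # requests per second
        "error_rate": errors / window_seconds,   # failed requests per second
        "p95_duration": p95,                     # seconds
    }
```

In practice the duration would usually come from a histogram or summary rather than from raw records, but the quantities being reported are the same.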

USE - Brendan Gregg USE-ful metrics

  • U - Utilization
  • S - Saturation
  • E - Errors
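USE is applied per resource (CPU, memory, disk, network). A minimal sketch of what one USE reading for a single resource could look like follows. The `ResourceSample` fields and the `use_report` function are hypothetical, and using queue length as the saturation measure is only one possible choice.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work waiting because the resource was busy
    error_count: int         # error events seen during the interval


def use_report(sample):
    """Summarise one sample as utilization, saturation, and errors."""
    return {
        "utilization": sample.busy_seconds / sample.interval_seconds,
        "saturation": sample.queue_length,
        "errors": sample.error_count,
    }
```

A resource that was busy for 45 of the last 60 seconds with three items queued would report a utilization of 0.75 and a saturation of 3.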

Real life examples

  • Cluster CPU reservation
  • Node memory utilization
  • Load balancer connection error rate - check whether there were any 500 errors in the last hour
  • Service HTTP request duration
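The load balancer check can be sketched as a simple filter over recent events. The event shape `(timestamp, status_code)` and the function name `server_errors_in_last_hour` are assumptions for illustration, and treating the whole 5xx range as "500 errors" is an interpretation of the note. A real setup would more likely express this as an alert rule over a counter.

```python
from datetime import datetime, timedelta


def server_errors_in_last_hour(events, now):
    """Return the events from the last hour that carry a 5xx status code.

    `events` is an iterable of (timestamp, status_code) pairs, where
    `timestamp` is a datetime comparable with `now`.
    """
    cutoff = now - timedelta(hours=1)
    return [
        (timestamp, status)
        for timestamp, status in events
        if cutoff <= timestamp <= now and 500 <= status <= 599
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)
    events = [
        (datetime(2024, 1, 1, 10, 30, 0), 500),  # too old, ignored
        (datetime(2024, 1, 1, 11, 15, 0), 200),  # recent but not an error
        (datetime(2024, 1, 1, 11, 45, 0), 502),  # recent server error
    ]
    recent = server_errors_in_last_hour(events, now)
    print("alert" if recent else "ok", len(recent))
```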
