Grafana tutorial: simple synthetic monitoring for applications
Often there’s a focus on how a service is running from the perspective of the organization. But what does service health monitoring look like from the perspective of a user?
There are many metrics that indicate the overall health of a container, VM, or application, but on their own they do not indicate whether the system is functioning correctly.
Often these metrics (CPU, disk, memory) are too narrow, and they can be poor indicators. High CPU usage may be desirable, and bursts of memory usage may be normal.
Synthetic metrics address the user experience, whether measuring a simple API call or authenticating into an application and viewing a dashboard.
In this example, we’ll use hosted Grafana since the entire process is well-known. This will demonstrate the common steps and metrics collected that can be used to monitor service health and, as a by-product, show where bottlenecks exist.
Here’s the final dashboard:
What are synthetic metrics?
Synthetic metrics are measurements collected while a script performs the steps, often multi-stage, required to complete an API call or transaction.
A set of metrics for an API call typically includes:
- Time to connect to API (connect latency)
- Duration of request (response latency)
- Size of response payload
- Result code of request (200, 204, 400, 500, etc.)
- Success/Failure state of the request
That’s a very high-level synthetic and can be used as a model for more complex API calls.
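A minimal Python sketch of these measurements for a single call, using only the standard library. The function name `probe`, its parameters, and the dictionary keys are illustrative assumptions, not the author's script. Connect latency is not measured separately here, and `urlopen` follows redirects on its own, so the status code recorded is that of the final response.

```python
import time
import urllib.error
import urllib.request


def probe(url, expected_status=200, timeout=10.0, opener=urllib.request.urlopen):
    """Request url once and return basic synthetic measurements.

    `opener` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code and a body.
        body = err.read()
        status = err.code
    except OSError:
        # DNS failure, refused connection, timeout: no HTTP status at all.
        body, status = b"", 0
    duration_ms = int((time.monotonic() - start) * 1000)
    return {
        "result": 1 if status == expected_status else 0,  # success/failure
        "duration": duration_ms,                          # response latency
        "status_code": status,                            # HTTP result code
        "content_size": len(body),                        # payload size in bytes
    }
```

Calling `probe("https://example.com/")` would return one dictionary per request, which can then be written to whichever time series database is in use.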
Taking this idea further, an API call may require authentication before making a request. The user making the request may have a valid authentication token but not the authorization to make some API calls.
A “read only” user would not be modifying data but could make some useful queries.
Why use synthetics?
User experience is the most important aspect of service offerings. As long as the user can perform their tasks according to expectations, a service is healthy.
From the SRE viewpoint, a service can be “degraded” but remain operational:
- A database could be degraded (Two out of three nodes in a cluster are healthy, but the third is offline)
- Kafka replication may not be working, but enough nodes are online to continue working
- Cassandra storage may be running out (It always does over time, particularly when you are on-call next)
- Kubernetes Masters are offline (This does happen, even in the best of clouds)
From the user experience, none of the above issues matter as long as the service is functioning.
Synthetic metrics with hosted Grafana
A very basic Python script will be used to traverse the 10 steps required to log in and validate a session with a hosted Grafana instance. The metrics generated by the script are in Graphite format and will be sent to a hosted metrics instance with tags enabled.
The same script can be adapted to send this data to InfluxDB or provide a metrics API that can be scraped by Prometheus.
Time series databases
Grafana offers hosted metrics for both Graphite and Prometheus. The script currently generates metrics suitable for Graphite with tags enabled.
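If Prometheus is the target instead, the same measurement could be rendered in the Prometheus text exposition format and served for scraping. The sketch below only shows the rendering step. The metric name `synthetic_duration_ms` and the label names are hypothetical, and the function names are not part of the original script.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def to_prometheus(name, value, labels):
    """Render one gauge sample, with a TYPE line, in text exposition format."""
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"# TYPE {name} gauge\n{name}{{{label_text}}} {value}\n"


# Hypothetical metric and label names, with values taken from step 1 of this run.
sample = to_prometheus(
    "synthetic_duration_ms",
    567,
    {"step": "step_01", "instance_name": "bkgann3"},
)
```

The resulting text would be returned from an HTTP endpoint that Prometheus scrapes; how that endpoint is served is left out of this sketch.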
10 steps to success
There are 10 steps for the entire process, with a final step that parses the result and ensures the login has succeeded.
To discover these steps, Chrome Developer Tools and Postman were used together to duplicate the process.
Step 1: Target: https://bkgann3.grafana.net
SYNTHETIC GET - STEP 1: https://bkgann3.grafana.net
SYNTHETIC GET - STEP 1: https://bkgann3.grafana.net DURATION: 567ms
Step 2: Target: https://bkgann3.grafana.net/login
SYNTHETIC GET - STEP 2: https://bkgann3.grafana.net/login
SYNTHETIC GET - STEP 2: https://bkgann3.grafana.net/login DURATION: 155ms
Step 3: Target: https://bkgann3.grafana.net/login/grafana_com
SYNTHETIC GET - STEP 3: https://bkgann3.grafana.net/login/grafana_com
SYNTHETIC GET - STEP 3: https://bkgann3.grafana.net/login/grafana_com DURATION: 138ms
Step 4: Target: https://grafana.com/oauth2/authorize?access_type=online&client_id=4579dc0323c2042eb808&redirect_uri=https%3A%2F%2Fbkgann3.grafana.net%2Flogin%2Fgrafana_com&response_type=code&scope=user%3Aemail&state=PuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%3D
SYNTHETIC GET - STEP 4: https://grafana.com/oauth2/authorize?access_type=online&client_id=4579dc0323c2042eb808&redirect_uri=https%3A%2F%2Fbkgann3.grafana.net%2Flogin%2Fgrafana_com&response_type=code&scope=user%3Aemail&state=PuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%3D
SYNTHETIC GET - STEP 4: https://grafana.com/oauth2/authorize?access_type=online&client_id=4579dc0323c2042eb808&redirect_uri=https%3A%2F%2Fbkgann3.grafana.net%2Flogin%2Fgrafana_com&response_type=code&scope=user%3Aemail&state=PuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%3D DURATION: 158ms
Step 5: Target: https://grafana.com/auth/sign-in?to=%2Foauth2%2Fauthorize%3Faccess_type%3Donline%26amp%253Bclient_id%3D4579dc0323c2042eb808%26amp%253Bredirect_uri%3Dhttps%253A%252F%252Fbkgann3.grafana.net%252Flogin%252Fgrafana_com%26amp%253Bresponse_type%3Dcode%26amp%253Bscope%3Duser%253Aemail%26amp%253Bstate%3DPuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%253D
SYNTHETIC GET - STEP 5: https://grafana.com/auth/sign-in?to=%2Foauth2%2Fauthorize%3Faccess_type%3Donline%26amp%253Bclient_id%3D4579dc0323c2042eb808%26amp%253Bredirect_uri%3Dhttps%253A%252F%252Fbkgann3.grafana.net%252Flogin%252Fgrafana_com%26amp%253Bresponse_type%3Dcode%26amp%253Bscope%3Duser%253Aemail%26amp%253Bstate%3DPuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%253D
SYNTHETIC GET - STEP 5: https://grafana.com/auth/sign-in?to=%2Foauth2%2Fauthorize%3Faccess_type%3Donline%26amp%253Bclient_id%3D4579dc0323c2042eb808%26amp%253Bredirect_uri%3Dhttps%253A%252F%252Fbkgann3.grafana.net%252Flogin%252Fgrafana_com%26amp%253Bresponse_type%3Dcode%26amp%253Bscope%3Duser%253Aemail%26amp%253Bstate%3DPuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%253D DURATION: 174ms
Step 6: Target: https://grafana.com/api/login
SYNTHETIC POST - STEP 6: https://grafana.com/api/login DURATION: 383ms
Step 7: Target: https://grafana.com/api/oauth2/clients/4579dc0323c2042eb808
SYNTHETIC GET - STEP 7: https://grafana.com/api/oauth2/clients/4579dc0323c2042eb808
SYNTHETIC GET - STEP 7: https://grafana.com/api/oauth2/clients/4579dc0323c2042eb808 DURATION: 145ms
Step 8: Target: https://grafana.com/api/oauth2/grants?clientId=4579dc0323c2042eb808
SYNTHETIC GET - STEP 8: https://grafana.com/api/oauth2/grants?clientId=4579dc0323c2042eb808
SYNTHETIC GET - STEP 8: https://grafana.com/api/oauth2/grants?clientId=4579dc0323c2042eb808 DURATION: 146ms
{u'instanceId': 73788, u'name': u'bkgann3.grafana.net', u'links': [{u'href': u'/oauth2/clients/4579dc0323c2042eb808', u'rel': u'self'}, {u'href': u'/orgs/bkgann', u'rel': u'org'}], u'url': u'https://bkgann3.grafana.net', u'orgSlug': u'bkgann', u'id': u'4579dc0323c2042eb808', u'orgName': u'Brian Gann', u'orgId': 127614, u'updatedAt': u'2019-01-15T00:01:15.000Z', u'redirectUri': u'https://bkgann3.grafana.net/login/grafana_com', u'createdAt': u'2018-12-14T22:02:08.000Z', u'description': u''}
Step 9: Target: https://grafana.com/api/oauth2/authorize
SYNTHETIC POST - STEP 9: https://grafana.com/api/oauth2/authorize DURATION: 162ms
Step 10: Target: https://bkgann3.grafana.net/login/grafana_com?code=a5c9c606f8fbfa61367c3806899ae9ad70d430f5&state=PuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%3D
SYNTHETIC GET - STEP 10: https://bkgann3.grafana.net/login/grafana_com?code=a5c9c606f8fbfa61367c3806899ae9ad70d430f5&state=PuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%3D
SYNTHETIC GET - STEP 10: https://bkgann3.grafana.net/login/grafana_com?code=a5c9c606f8fbfa61367c3806899ae9ad70d430f5&state=PuaU_YRJSko1-yV1UtBCM_9rUMeVOMBjBmfCmG9DT7U%3D DURATION: 738ms
Step 10: Target: https://bkgann3.grafana.net
SYNTHETIC GET - STEP 11: https://bkgann3.grafana.net
SYNTHETIC GET - STEP 11: https://bkgann3.grafana.net DURATION: 162ms
YES!
Metrics generated
The following metrics are generated:
name | unit | description |
---|---|---|
result | boolean | 0 for failure, 1 for success |
duration | milliseconds | time to perform request |
status_code | integer | HTTP response code |
content_size | bytes | size of content returned |
hosted_grafana.step_01.result;step=hosted_grafana.step_01;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_01.duration;step=hosted_grafana.step_01;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 567 1560461215
hosted_grafana.step_01.status_code;step=hosted_grafana.step_01;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 302 1560461215
hosted_grafana.step_01.content_size;step=hosted_grafana.step_01;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 29 1560461215
hosted_grafana.step_02.result;step=hosted_grafana.step_02;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_02.duration;step=hosted_grafana.step_02;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 155 1560461215
hosted_grafana.step_02.status_code;step=hosted_grafana.step_02;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 200 1560461215
hosted_grafana.step_02.content_size;step=hosted_grafana.step_02;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 30408 1560461215
hosted_grafana.step_03.result;step=hosted_grafana.step_03;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_03.duration;step=hosted_grafana.step_03;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 138 1560461215
hosted_grafana.step_03.status_code;step=hosted_grafana.step_03;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 302 1560461215
hosted_grafana.step_03.content_size;step=hosted_grafana.step_03;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 289 1560461215
hosted_grafana.step_04.result;step=hosted_grafana.step_04;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_04.duration;step=hosted_grafana.step_04;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 158 1560461215
hosted_grafana.step_04.status_code;step=hosted_grafana.step_04;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 302 1560461215
hosted_grafana.step_04.content_size;step=hosted_grafana.step_04;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 682 1560461215
hosted_grafana.step_05.result;step=hosted_grafana.step_05;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_05.duration;step=hosted_grafana.step_05;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 174 1560461215
hosted_grafana.step_05.status_code;step=hosted_grafana.step_05;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 200 1560461215
hosted_grafana.step_05.content_size;step=hosted_grafana.step_05;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 30761 1560461215
hosted_grafana.step_06.result;step=hosted_grafana.step_06;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_06.duration;step=hosted_grafana.step_06;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 383 1560461215
hosted_grafana.step_06.status_code;step=hosted_grafana.step_06;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 200 1560461215
hosted_grafana.step_06.content_size;step=hosted_grafana.step_06;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 5257 1560461215
hosted_grafana.step_07.result;step=hosted_grafana.step_07;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_07.duration;step=hosted_grafana.step_07;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 145 1560461215
hosted_grafana.step_07.status_code;step=hosted_grafana.step_07;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 200 1560461215
hosted_grafana.step_07.content_size;step=hosted_grafana.step_07;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 538 1560461215
hosted_grafana.step_08.result;step=hosted_grafana.step_08;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_08.duration;step=hosted_grafana.step_08;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 146 1560461215
hosted_grafana.step_08.status_code;step=hosted_grafana.step_08;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 200 1560461215
hosted_grafana.step_08.content_size;step=hosted_grafana.step_08;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 941 1560461215
hosted_grafana.step_09.result;step=hosted_grafana.step_09;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_09.duration;step=hosted_grafana.step_09;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 162 1560461215
hosted_grafana.step_09.status_code;step=hosted_grafana.step_09;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 200 1560461215
hosted_grafana.step_09.content_size;step=hosted_grafana.step_09;runner=ares.local;request_method=POST;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 430 1560461215
hosted_grafana.step_10.result;step=hosted_grafana.step_10;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_10.duration;step=hosted_grafana.step_10;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 738 1560461215
hosted_grafana.step_10.status_code;step=hosted_grafana.step_10;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 302 1560461215
hosted_grafana.step_10.content_size;step=hosted_grafana.step_10;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 24 1560461215
hosted_grafana.step_11.result;step=hosted_grafana.step_11;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=boolean 1 1560461215
hosted_grafana.step_11.duration;step=hosted_grafana.step_11;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=ms 162 1560461215
hosted_grafana.step_11.status_code;step=hosted_grafana.step_11;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=integer 200 1560461215
hosted_grafana.step_11.content_size;step=hosted_grafana.step_11;runner=ares.local;request_method=GET;info=HostedGrafanaSynthetic;instance_name=bkgann3;org_id=4068;mtype=gauge;unit=B 40699 1560461215
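Each line above has the shape `name;tag=value;... value timestamp`. A small Python sketch that produces the same shape is shown here; the function name `graphite_line` is an assumption for illustration, and the tag values are copied from the step 1 duration line.

```python
def graphite_line(name, value, timestamp, tags):
    """Build one tagged Graphite line: name;tag=value;... value timestamp."""
    tag_text = "".join(f";{key}={val}" for key, val in tags.items())
    return f"{name}{tag_text} {value} {timestamp}"


# Tags in the same order as the listing (dicts preserve insertion order).
tags = {
    "step": "hosted_grafana.step_01",
    "runner": "ares.local",
    "request_method": "GET",
    "info": "HostedGrafanaSynthetic",
    "instance_name": "bkgann3",
    "org_id": 4068,
    "mtype": "gauge",
    "unit": "ms",
}
line = graphite_line("hosted_grafana.step_01.duration", 567, 1560461215, tags)
```

Using one timestamp for every line of a run, as the listing does, keeps all steps of that run aligned on the same point in time.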
Basic dashboard
A dashboard can be built with Grafana using a combination of SingleStat panels and Graph panels.
The top portion of the dashboard displays the overall health of a hosted Grafana instance:
The next section displays OK/CRIT for each stage of the synthetic operation:
The last section gives more detail in graph format:
Queries
The Graphite queries are very basic. A few are shown below. See the supplied dashboard JSON for more details.
Step 1 Results
averageSeries(seriesByTag('name=hosted_grafana.step_01.result'))
Step 1 Durations
alias(averageSeries(seriesByTag('name=hosted_grafana.step_01.duration')), 'duration')
Time to Login - COLD (clean browser, no cookies, no session, etc.) is the sum of the durations of steps 1 through 10.
sumSeries(seriesByTag('name=~hosted_grafana.(step_0\d|step_10).duration'))
Time to Login - HOT (cookie/session/etc cached) is the time it takes to hit the instance when a session is already active.
alias(sumSeries(seriesByTag('name=~hosted_grafana.step_10.duration')), 'duration')
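As a cross-check of the cold login figure, the same selection can be reproduced in Python against the durations reported for this run. This is only an illustration of the arithmetic; the names `COLD_STEPS` and `cold_login_ms` are assumptions, and Python's `fullmatch` stands in for Graphite's own regex handling.

```python
import re

# Durations in milliseconds, copied from the metrics listing for this run.
durations_ms = {
    "hosted_grafana.step_01.duration": 567,
    "hosted_grafana.step_02.duration": 155,
    "hosted_grafana.step_03.duration": 138,
    "hosted_grafana.step_04.duration": 158,
    "hosted_grafana.step_05.duration": 174,
    "hosted_grafana.step_06.duration": 383,
    "hosted_grafana.step_07.duration": 145,
    "hosted_grafana.step_08.duration": 146,
    "hosted_grafana.step_09.duration": 162,
    "hosted_grafana.step_10.duration": 738,
    "hosted_grafana.step_11.duration": 162,
}

# Selects steps 01 through 10 and leaves out step 11.
COLD_STEPS = re.compile(r"hosted_grafana\.(step_0\d|step_10)\.duration")


def cold_login_ms(durations):
    """Sum the durations of the series whose names match the cold-login pattern."""
    return sum(ms for name, ms in durations.items() if COLD_STEPS.fullmatch(name))


total = cold_login_ms(durations_ms)  # 2766 ms for this run
```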
Deep dive into creating synthetics
Here’s the general process used to figure out each step for a hosted Grafana login. The script that performs each step is written in Python, but could easily be written in other languages.
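Every step follows the same pattern: make a request, compare the status code with the expected one, and record a 0 or 1 result. A hedged sketch of that comparison is below. The expected codes are taken from the status_code metrics of the run shown earlier and use the step numbers of that listing; the names `EXPECTED_STATUS`, `step_results`, and `first_failure` are assumptions for illustration.

```python
# Expected HTTP status per step, as reported by the status_code metrics above.
EXPECTED_STATUS = {
    1: 302, 2: 200, 3: 302, 4: 302, 5: 200, 6: 200,
    7: 200, 8: 200, 9: 200, 10: 302, 11: 200,
}


def step_results(observed):
    """Map each step to 1 (status matched) or 0 (mismatch or step not reached)."""
    return {
        step: 1 if observed.get(step) == expected else 0
        for step, expected in EXPECTED_STATUS.items()
    }


def first_failure(results):
    """Return the lowest failing step number, or None if every step passed."""
    for step in sorted(results):
        if results[step] == 0:
            return step
    return None
```

Reporting the first failing step is one way to point at where in the login flow to start looking.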
Step 1
Starting with Chrome and DevTools showing, enable “Preserve log” and visit the destination, in this case https://bkgann3.grafana.net.
In DevTools you’ll see a 302 (redirect) as the response. The response also includes the redirect_url. With those two items, we can test step 1 by connecting, checking for a 302 HTTP response code (anything else is an error), and capturing the redirect_url, which we’ll use in the next step.
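A sketch of this check in Python, assuming the redirect target arrives in the standard HTTP `Location` header. `http.client` is used because it does not follow redirects automatically, so the 302 itself can be observed. The function names are illustrative and not taken from the original script.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def fetch_no_redirect(url, timeout=10.0):
    """GET url without following redirects; return (status, headers, body)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read()
        return resp.status, dict(resp.getheaders()), body
    finally:
        conn.close()


def next_url_from_redirect(current_url, status, headers):
    """Require a 302 and return the absolute URL to visit next."""
    if status != 302:
        raise ValueError(f"expected 302, got {status}")
    lowered = {key.lower(): value for key, value in headers.items()}
    location = lowered.get("location")
    if not location:
        raise ValueError("302 response without a Location header")
    # Location may be relative, so resolve it against the current URL.
    return urljoin(current_url, location)
```

Usage would look like `status, headers, _ = fetch_no_redirect(url)` followed by `next_url_from_redirect(url, status, headers)`, with the returned URL feeding the next step.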
Step 2
Connecting to the redirect_url from step 1, we’ll be sent to the login path, in this case https://bkgann3.grafana.net/login. We get a 200 response from this step. Anything else is an error.
Step 3
This step requires minor digging through the web page. Inspecting the login button gives us our next URL to query: https://bkgann3.grafana.net/login/grafana_com
Connecting to that URL will respond with a 302 (redirect) and another URL to visit.
Step 4
The redirect from step 3 is to use OAuth with grafana.com. This request returns another 302 with a redirect URL.
Step 5
Next we’ll post our login info to https://grafana.com/api/login, expecting a 200 response.
Step 6
Next we’ll query for the client ID: https://grafana.com/api/oauth2/clients/4579dc0323c2042eb808 This will return a 200 response, with a payload we’ll use next.
Step 7
To get the OAuth “grants,” we next query https://grafana.com/api/oauth2/grants?clientId=4579dc0323c2042eb808
This responds with a 200 and a session cookie.
Step 8
Next we use the cookie and authorize by querying https://grafana.com/api/oauth2/authorize?access_type=online&client_id=4579dc0323c2042eb808 (There is more in the query; the important part is to get a 200 response and the session.)
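Carrying the session cookie from one request to the next can be handled with a cookie jar. The sketch below uses the standard library's `http.cookiejar` with a `urllib` opener; the helper names `make_session` and `cookie_names` are assumptions, and the original script may manage cookies differently. Note that this opener follows redirects automatically, so it suits the steps that expect a 200.

```python
import http.cookiejar
import urllib.request


def make_session():
    """Return (opener, jar); cookies set by any response are stored in jar
    and sent back on later requests made through the same opener."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def cookie_names(jar):
    """List the names of the cookies currently held, for debugging a step."""
    return sorted(cookie.name for cookie in jar)
```

Every request made with `opener.open(...)` then shares the same jar, which is what lets a later step reuse the session established by an earlier one.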
Step 9
We’ll now use the code and session cookie from step 8 and try to log in again: https://bkgann3.grafana.net/login/grafana_com?code=X&state=Y
We get a 302 redirect and a URL again, which is exactly where we started!
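Building the callback URL from a code and a state value is mostly a matter of URL encoding. A minimal sketch follows, with hypothetical helper names; `X` and `Y=` are placeholders, not real values.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def callback_url(instance_url, code, state):
    """Build the login callback URL with code and state percent-encoded."""
    query = urlencode({"code": code, "state": state})
    return f"{instance_url.rstrip('/')}/login/grafana_com?{query}"


def code_and_state(url):
    """Recover (code, state) from a callback URL, decoding percent escapes."""
    params = parse_qs(urlsplit(url).query)
    return params["code"][0], params["state"][0]


url = callback_url("https://bkgann3.grafana.net", "X", "Y=")
# url == "https://bkgann3.grafana.net/login/grafana_com?code=X&state=Y%3D"
```

The trailing `=` of the state value becomes `%3D`, which matches the encoding visible in the logged URLs earlier in the article.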
Step 10
Now that we have authorization and a valid session established, we can connect and get back a 200 response.
Step 11
The 11th step is to parse the body of step 10 for a successful login string, which is easy to locate:
"isSignedIn": true
If we see this string in the body, we’ve completed our login successfully.
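This final check can be sketched as a small function. The regular expression tolerates differences in whitespace around the colon, which is an assumption about how the page may be rendered; the name `login_succeeded` is illustrative.

```python
import re

# Matches "isSignedIn": true with any amount of whitespace around the colon.
SIGNED_IN = re.compile(r'"isSignedIn"\s*:\s*true')


def login_succeeded(body):
    """Return 1 if the response body shows a signed-in session, else 0."""
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    return 1 if SIGNED_IN.search(body) else 0
```

Returning 1 or 0 keeps the value consistent with the boolean `result` metric recorded for the other steps.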
Wrapping it all up
In this example, measuring the end-user experience provides real feedback on site reliability.
Granular metrics like CPU, disk, and memory are also collected but only leveraged by an SRE when looking for opportunities to optimize the service. The synthetics can provide insight as to where to start looking.
The synthetic script can be cloned from this repo.
Use it to monitor your own experience with hosted Grafana or adapt it for your application!