In Numbers We Trust

Every design decision that can be made can be made better with numbers backing it. You might have heard of being data-driven as the latest addition to buzzword bingo, but let’s skip the cringe-worthy abstract generalizations and look at the concrete details of how we brought data into our development workflow at Semantics3.

Whether for measuring routine performance or for identifying anomalous behavior in underlying systems, we have come to treat recording application metrics as an essential requirement. Identifying and tracking key application metrics has helped us test hypotheses and estimate performance impacts when we roll out changes to our systems.

To simplify onboarding for developers looking to quantify their applications, we run an always-on, configurable measurement platform, inspired by Etsy’s work on measuring anything and everything. In this post, we’ll describe the thought process, design choices and implementation details of the system that we currently maintain for monitoring our applications.

We started with a few basic requirements that we wanted to cover:

  1. Support basic measurements, including counters, timers, and uniqueness identifiers (sets)
  2. Allow tagging of data points with arbitrary metadata to permit roll-ups, slices and other meaningful aggregates
  3. Have a minimal impact on underlying application performance, i.e. avoid Heisenbugs
  4. Visualize collected data with an easy-to-use, extensible interface
  5. Achieve all of the above with an as-simple-as-possible (but not simpler) implementation.

Overview of our measurement platform

After considering a number of options, we decided to standardize our measurement platform using the stack below:

pipeline to enlightenment
Each application that we run reports its metrics to a central StatsD server. After the data has been aggregated over a fixed time interval, the metrics, together with useful aggregates, are flushed to an InfluxDB database. Finally, queries and visualizations are built using Grafana.
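To make the "aggregate, then flush" step concrete, here is a minimal sketch of per-interval aggregation for the three metric types listed in our requirements. It is a simplified illustration and not StatsD's actual implementation; the `Aggregator` class, its method names and the 10-second interval are all assumptions made for the example.

```javascript
// Simplified sketch of per-interval aggregation (illustrative, not StatsD's code).
class Aggregator {
  constructor(flushIntervalMs) {
    this.flushIntervalMs = flushIntervalMs;
    this.reset();
  }

  // Start a fresh, empty interval.
  reset() {
    this.counters = new Map(); // name -> running total
    this.timers = new Map();   // name -> array of durations in ms
    this.sets = new Map();     // name -> Set of unique values
  }

  count(name, delta = 1) {
    this.counters.set(name, (this.counters.get(name) || 0) + delta);
  }

  time(name, ms) {
    if (!this.timers.has(name)) this.timers.set(name, []);
    this.timers.get(name).push(ms);
  }

  unique(name, value) {
    if (!this.sets.has(name)) this.sets.set(name, new Set());
    this.sets.get(name).add(String(value));
  }

  // Summarize everything seen since the last flush, then start a new interval.
  flush() {
    const seconds = this.flushIntervalMs / 1000;
    const out = { counters: {}, timers: {}, sets: {} };
    for (const [name, total] of this.counters) {
      out.counters[name] = { count: total, rate: total / seconds };
    }
    for (const [name, values] of this.timers) {
      const sum = values.reduce((a, b) => a + b, 0);
      out.timers[name] = {
        count: values.length,
        lower: Math.min(...values),
        upper: Math.max(...values),
        mean: sum / values.length,
      };
    }
    for (const [name, members] of this.sets) {
      out.sets[name] = { count: members.size };
    }
    this.reset();
    return out;
  }
}

// Hypothetical metric names, with a 10-second flush interval.
const agg = new Aggregator(10000);
agg.count('requests');
agg.count('requests', 4);
agg.time('latency', 120);
agg.time('latency', 80);
agg.unique('users', 'alice');
agg.unique('users', 'alice');
agg.unique('users', 'bob');
const summary = agg.flush();
console.log(JSON.stringify(summary));
```

In the real pipeline, the summary produced at each flush is what gets handed to the database backend, rather than every individual data point.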

Gathering metrics

StatsD, open-sourced by Etsy, is a collector daemon that runs on Node.js. You can send stats such as counters, timers and gauges to this daemon over UDP (or TCP). The StatsD daemon collects these stats, aggregates them and periodically flushes them to a configurable database backend.
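Under the hood, each stat travels as a short plain-text line of the form `name:value|type`. The sketch below formats such lines and sends one over UDP using only Node's built-in `dgram` module. The helper names `formatMetric` and `sendMetric`, the metric names and the host in the commented call are illustrative assumptions; port 8125 is StatsD's default.

```javascript
const dgram = require('dgram');

// StatsD type suffixes for the metric kinds mentioned above.
const TYPES = { counter: 'c', timer: 'ms', gauge: 'g', set: 's' };

// Build one StatsD protocol line, e.g. "requests:1|c".
function formatMetric(name, value, kind) {
  const type = TYPES[kind];
  if (!type) throw new Error(`unknown metric kind: ${kind}`);
  return `${name}:${value}|${type}`;
}

// Fire-and-forget: send a single datagram, then close the socket.
// Delivery is not confirmed; that is the nature of UDP.
function sendMetric(line, host, port = 8125) {
  const socket = dgram.createSocket('udp4');
  socket.send(Buffer.from(line), port, host, () => socket.close());
}

console.log(formatMetric('requests', 1, 'counter'));
console.log(formatMetric('executionTime', 320, 'timer'));

// Example call (hypothetical host name):
// sendMetric(formatMetric('requests', 1, 'counter'), 'statsd.example.internal');
```

In practice a client library wraps exactly this kind of formatting and socket handling, so application code rarely needs to touch the wire format directly.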

The key feature that made us choose StatsD was its support for a wide variety of metric types. Knowing which metrics to collect is often a stumbling block when starting to build a measurement solution. By starting with a solution generic enough to cover many different kinds of measurement, we avoid having to add support for new metric types later.

The loss of delivery guarantees when using UDP was a trade-off that we were willing to make in exchange for a lightweight solution (which was primarily intended for sampling application performance).
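Sampling can also be made explicit on the client side. The StatsD protocol lets a client send only a fraction of its counter events and annotate each one with the sample rate (`|@0.1`), so that the server can scale the count back up. The sketch below shows the idea; the `sampledCounter` helper and the metric name are assumptions for illustration, and the random source is injectable only to keep the example deterministic.

```javascript
// Emit a counter line for roughly `sampleRate` of all calls.
// Returns null when the event is dropped on the client.
function sampledCounter(name, sampleRate, random = Math.random) {
  if (random() >= sampleRate) return null;
  return `${name}:1|c|@${sampleRate}`;
}

// With a fixed "random" value below the rate, the event is kept.
console.log(sampledCounter('pageViews', 0.1, () => 0.05));
// With a value at or above the rate, the event is dropped.
console.log(sampledCounter('pageViews', 0.1, () => 0.5));
```

This trades some precision for less network traffic, which fits the same lightweight approach as choosing UDP in the first place.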

A number of client libraries are available for your language of choice. These libraries make reporting data from applications trivial.

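#!/usr/bin/env node
var SDC = require('statsd-client');
// host is the address of the StatsD server
var sdc = new SDC({host: ''});

// increment a counter (metric name is illustrative)
sdc.increment('questionsAsked');
// record timing, in milliseconds
sdc.timing('executionTime', 7500000);
// record unique values
sdc.set('ultimateAnswer', 42);

// close the underlying socket once done reporting
sdc.close();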

Storing and efficiently querying the data

Once StatsD had collected the data, we needed a database to store the gathered metrics. With many options to choose from, such as Graphite, RRDtool, Prometheus and InfluxDB, the choice can be a difficult one. We decided to go with InfluxDB because its support for arbitrary tags, horizontal scaling capabilities and low disk IO requirements were important to us.

To flush the aggregated data from StatsD to InfluxDB, you need to configure a StatsD backend that knows how to write to InfluxDB.

We started out with the widely used bernd/statsd-influxdb-backend. Later, we noticed that it lacked support for InfluxDB tags, which make recorded measurements easier to slice through more efficient GROUP BY queries. We therefore use our own customized version of this backend, available at ramananbalakrishnan/statsd-influxdb-backend.
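To illustrate why tags matter, here is a minimal sketch of how a tagged data point looks in InfluxDB's line protocol, along with an InfluxQL query that groups by one of the tags. How our customized backend maps StatsD keys to tags is not covered here; the `toLineProtocol` helper, the measurement name and the tag names are all assumptions made for the example.

```javascript
// Escape the characters that line protocol treats as separators.
function escapeTag(s) {
  return String(s).replace(/[,= ]/g, '\\$&');
}

function escapeMeasurement(s) {
  return String(s).replace(/[, ]/g, '\\$&');
}

// Build "measurement,tag1=a,tag2=b value=N" with tags in sorted key order.
function toLineProtocol(measurement, tags, value) {
  const tagPart = Object.keys(tags)
    .sort()
    .map((key) => `,${escapeTag(key)}=${escapeTag(tags[key])}`)
    .join('');
  return `${escapeMeasurement(measurement)}${tagPart} value=${value}`;
}

// A hypothetical request counter tagged with host and HTTP status.
const point = toLineProtocol('http_requests', { status: 418, host: 'web 1' }, 1);
console.log(point);

// With the status stored as a tag, a per-status roll-up is a single query.
const query =
  'SELECT sum("value") FROM "http_requests" ' +
  'WHERE time > now() - 1h GROUP BY "status"';
console.log(query);
```

Without tags, the status code would have to be embedded in the measurement name and parsed back out at query time.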

For those interested, some of the non-default settings that we use in our StatsD collector’s config.js are listed below.

  backends: ['./backends/console', 'statsd-influxdb-backend'],
  debug: false,
  legacyNamespace: false,
  keyNameSanitize: false,
  deleteIdleStats: true

Visualizing the data via dashboards

Now, all we needed was an effective graphing tool to visualize our metrics data. We chose Grafana, an open source real-time graphing front-end. With support for numerous graphing formats, designing pretty graphs becomes effortless with Grafana.

it’s so beautiful! source:

Fortunately, InfluxDB is one of the many data sources officially supported by Grafana. Getting Grafana to talk to InfluxDB takes only a few clicks.

Grafana’s query builder makes it easy to modify the parameters of a graph, and dashboards can be created on the fly within minutes.

building is easy, just need to protect the exhaust port now

This efficient interface is also an important reason for the widespread adoption of data analysis within our team.

A critical aspect of moving fast is gathering feedback from each deployment. This stack, we believe, has given us just that: an effective platform where developers can go from asking a question (How many requests returned HTTP 418?) to getting it answered (2324) quickly, on a simple dashboard, with a few lines of code.