Thank you to Kofi Jedamzik for contributing to this project and blog post.
At Groupon, the majority of applications use a system called Grapher for monitoring. It can plot simple rrd graphs with one minute resolution. However, during the Kill Bill migration, we encountered multiple deficiencies with the existing solution:
- Grapher offers limited support for templates, which makes it difficult to reuse and maintain the rrd graph definitions.
- Time range and time zone are stored in a cookie, so sharing a graph link most likely leads to a different graph for the recipient (we ended up sharing screenshots instead).
- It is time consuming to change and add new metrics because most of them are based on Splunk searches or cron jobs.
- The biggest deficiency is the high cardinality of our metrics: we need a lot of context around them. For example, we wanted to be notified when a specific payment method in a specific country for a specific client starts failing, but the number of combinations caused capacity problems in our Splunk cluster.
As a result, we started looking for a simple solution to improve the situation. The starting point was the Dropwizard metrics library, which has established itself as the de facto standard for metrics in Java-based applications. The library makes it very easy to measure different metrics within your application. It supports five metric types: Gauges, Counters, Histograms, Meters and Timers. You can also use numerous modules to instrument common libraries like Jetty, Logback, Log4j, Apache HttpClient, Ehcache, JDBI and Jersey.
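As a minimal sketch of how the library is used (the class and metric names below are illustrative, not taken from Kill Bill), a Timer wraps an operation and records both its call count and its latency distribution:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class PaymentMetrics {
    private static final MetricRegistry registry = new MetricRegistry();
    // "payment.create.duration" is a made-up metric name for this example
    private static final Timer createPaymentTimer =
            registry.timer(MetricRegistry.name("payment", "create", "duration"));

    static void createPayment() {
        final Timer.Context context = createPaymentTimer.time();
        try {
            // ... the actual payment logic would go here ...
        } finally {
            context.stop(); // records the elapsed time in the timer
        }
    }

    public static void main(String[] args) {
        createPayment();
        System.out.println("count=" + createPaymentTimer.getCount());
    }
}
```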
The collected metrics are kept in metric registries; on top of those, you can use reporters to publish your metrics. We use the JmxReporter to expose metrics as JMX MBeans and the Metrics Servlets to publish the metrics as JSON objects via HTTP.
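Wiring up the JmxReporter takes only a few lines; the sketch below (using the metrics 3.x package layout, with a freshly created registry for illustration) exposes everything in the registry as MBeans visible in e.g. JConsole:

```java
import com.codahale.metrics.JmxReporter;
import com.codahale.metrics.MetricRegistry;

public class ReporterSetup {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        // Publish every metric in the registry as a JMX MBean
        JmxReporter reporter = JmxReporter.forRegistry(registry).build();
        reporter.start();
        System.out.println("JMX reporter started");
        // ... application runs here ...
        reporter.stop();
    }
}
```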
Storing metrics in a time series database
We also just started experimenting with InfluxDB, a new time series database with promising features like tags and fields.
Tags are indexed and allow fast querying by tag values, which should give us the ability to break metrics down by numerous attributes. Another noteworthy feature is its mechanism for downsampling stored data, called Continuous Queries: they let you aggregate and precompute expensive queries on the fly.
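As an illustration of such downsampling (the database, measurement and query names here are hypothetical), a continuous query that precomputes one-hour means of raw duration values, grouped by all tags, could look like:

```sql
CREATE CONTINUOUS QUERY "cq_duration_1h" ON "killbill"
BEGIN
  SELECT mean("value") INTO "duration_1h" FROM "duration"
  GROUP BY time(1h), *
END
```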
Unfortunately, InfluxDB reporting is not yet supported by the metrics library, but as InfluxDB supports the Graphite line protocol, we use the metrics-graphite reporter to stream the metrics directly into InfluxDB. The drawback is that we have to parse the metric name to extract the metadata. This can be done with InfluxDB’s Graphite plugin, which lets you extract tags from metric names by using a template. For example, instead of

duration,host=myhost,method=create-payment,country=de value=1234

we send

duration.myhost.create-payment.de 1234

and configure a template like measurement.host.method.country to extract the tags. This feature is a bit limited, as you can only use wildcards and the dot separator; regular expressions would be a nice feature for the future.
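For illustration only (this is a sketch of the idea, not InfluxDB's actual implementation), the mapping a template performs can be expressed in a few lines of plain Java: the dot-separated template is zipped against the dot-separated metric name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GraphiteTemplate {
    // Pair each template field with the corresponding metric name segment
    static Map<String, String> extract(String template, String metric) {
        String[] keys = template.split("\\.");
        String[] values = metric.split("\\.");
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < keys.length && i < values.length; i++) {
            result.put(keys[i], values[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("measurement.host.method.country",
                "duration.myhost.create-payment.de"));
        // -> {measurement=duration, host=myhost, method=create-payment, country=de}
    }
}
```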
Visualizing time series
To visualize the time series, we use Grafana. It supports a variety of time series data sources including InfluxDB and has lots of visualization options, for example annotations, which allow you to mark events like restarts or deployments in your graphs.
Dashboard templating lets you create dynamic visualizations; it even supports variables to change query parameters.
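For example (measurement and variable names are illustrative), a templated Grafana panel can query InfluxDB with a variable such as $country that the viewer switches from a dropdown, alongside Grafana's built-in time filter:

```sql
SELECT mean("value") FROM "duration"
WHERE "country" =~ /^$country$/ AND $timeFilter
GROUP BY time(1m)
```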
This stack was easy to deploy and mitigated a few of our pain points. Compared to the rrd graph syntax, it’s very nice to have a full-featured graph editor. The dashboard definitions can simply be exported and synchronized with other instances via Grafana’s HTTP API. Sharing graphs is no longer a problem; you can even choose between UTC and the browser’s time zone.
But there is still work to do. Instead of creating metrics only via Splunk queries, we now measure them directly in the application. This makes the whole pipeline less error-prone, but the drawback is that aggregation has to take place in InfluxDB or in the application. In addition, the need for Graphite templates to extract metadata from metric names means frequent configuration changes whenever we change or add metrics. So we are currently figuring out the best way to get support for tags and fields into Kill Bill directly.