Evolving Realtime Metrics and Monitoring at Groupon

July 9, 2013


A sudden drop in successful orders: should we freak out? Alarms go off, and escalated emails and pages go out.

(Graph: sudden drop in order creation)

A few clicks through our realtime metrics dashboards quieted the storm.

We opened the Orders Dashboard and noticed that the percentage of orders resulting in an error had spiked. A click brought up the breakdown by error message, which showed a surge in “attempt to buy a deal too many times,” and another click queried the top deals with purchase attempts in that error state: the Vera Bradley Hard-Shell iPhone Case had just sold out, suggesting that all that glitters may, in fact, be gold.

Many things can go wrong in large-scale distributed e-commerce systems, but which are technical errors and which are the outcomes of the normal course of our business? How quickly can you be alerted to a problem and then dive into the root cause? Our engineering teams have the fun challenge of dreaming up, building, and shipping new experiences for our customers on web, mobile, and email. We have high needs for uptime, performance, and understanding the state of our systems. Measuring the health of our data centers, APIs, client applications, and backend systems, as well as behavioral and business signals, is critical.

To give us these views, we’ve built and pieced together a set of systems for realtime monitoring of system performance and business impact that can help us diagnose problems and send alerts. We have roughly 10 TB of structured and unstructured log data flowing from machines and applications in our data center and from client-side user events. We need an easy way (1) for developers to log events, (2) to forward logs, (3) to flexibly query the raw files, (4) to schedule and aggregate queries, and (5) to view, create, and edit graphs and alerts on top of this data stream. We also wanted this to be the same graphing/alerting tool that we use to track system/OS/service metrics, so that we can drill into a problem from a business issue down through the layers of the system and see that, perhaps, it was caused by a lack of CPU resources.
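
As a rough illustration of need (1), a developer-facing logging helper can be as simple as appending one JSON object per line to a file that the forwarding layer picks up. The module shape, field names, and file path below are assumptions for illustration, not our actual logging library:

```python
# Minimal sketch of a structured-event logger: one JSON object per line,
# appended to a file that a log forwarder would watch in a real deployment.
import json
import time

LOG_PATH = "events.log"  # assumed location; a forwarder would tail this file

def log_event(event_type, **fields):
    """Append one structured event as a single JSON line."""
    event = {"ts": time.time(), "event": event_type}
    event.update(fields)
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

# Example: record a failed purchase attempt so it can later be counted per minute.
log_event("order_error", error="attempt to buy a deal too many times", deal_id=12345)
```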

Rather than building graphing functionality atop a monitoring system (à la Nagios with Cacti), we have built a metrics platform with a flexible broker component that pushes metric events to different systems for graphing, alerting, and other analysis.

All metric data is treated the same in this system, whether it’s the number of amps currently flowing out of a power outlet in the datacenter or the number of Groupon Getaways packages we have sold in the last minute. Any time-series data may be pushed into this system, visualized with graphing tools, and alerted on if rules are triggered.
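
To make that concrete, here is a minimal sketch of how uniformly the pipeline can treat a facilities reading and a business counter; the field and metric names are assumptions for illustration, not the actual wire format:

```python
# Every metric, regardless of source, reduces to the same (name, timestamp, value) shape.
from collections import namedtuple
import time

MetricEvent = namedtuple("MetricEvent", ["name", "timestamp", "value"])

samples = [
    MetricEvent("datacenter.pdu42.outlet7.amps", time.time(), 3.2),
    MetricEvent("getaways.packages_sold.per_minute", time.time(), 117),
]

# Downstream graphing and alerting only ever see this uniform shape.
for m in samples:
    print(m.name, m.timestamp, m.value)
```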

Primary Components

(Diagram: Metrics System Architecture)

  • Splunk – Application and system log data from all datacenter hosts is forwarded to a central 40-node Splunk cluster. We use Splunk to do ad-hoc searching and to transform the raw event data contained within logs into statistics. For instance, each line of our webserver access logs contains data about one web request event. By counting different types of events (e.g. by virtual host name or request URI) for a specific time slice (e.g. the last minute) on a fixed schedule (e.g. every minute), we can transform the raw log event stream into time-series metrics. We have written a custom search command for Splunk to push these metrics directly from Splunk to a Monitord Server (described below).
  • Monitord Agent – Executes metric collection scripts periodically, and pushes results to the central Monitord Server cluster. The Monitord Agent is primarily responsible for collecting metrics for system resources such as CPU, memory, disk, and network. Additionally, the Monitord Agent runs scripts to collect and push statistics from specific daemon software like MySQL, HAProxy, or Nginx. Since the Monitord Agent is really just a distributed scheduler that runs scripts, the realm of possible data collection is limitless.
  • Monitord Server – This cluster of machines receives check script results from the Monitord Agent and metric data from our Splunk custom command, and pushes the data to a number of destinations. One destination is our Ganglia-based metrics storage system. Another destination is our Nagios alerting infrastructure. We may add other destinations as needed, for instance to migrate to a new metrics storage backend or to utilize a new analysis tool. (A rough sketch of this fan-out pattern follows this list.)
  • Ganglia – We use Ganglia to persist the time-series metrics data for use in graphing. Our Ganglia server receives approximately 300,000 metric updates per minute. To handle this load, we utilize rrdcached to buffer writes, and store the data on a FusionIO solid state disk.
  • Nagios – Our use of Nagios is atypical in that it only receives passive check results. Nagios does not execute any active checks; its job is only to send notifications and provide a “current state of the world” display. The Nagios configuration is generated automatically from our centralized, comprehensive host configuration management system.
  • Grapher – The web UI that comes with Ganglia is very limited, especially in terms of building new graph definitions. We wrote our own tool that sits atop Ganglia’s RRD file tree to make it easy to browse the system hierarchy, and define graph templates and dashboards.
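
The fan-out behavior of the Monitord Server described above can be sketched roughly as follows; the class and method names are illustrative assumptions, not the real service:

```python
# Sketch of a broker that accepts metric events and fans them out to
# pluggable destinations (time-series storage, alerting, future backends).
import time

class GangliaDestination:
    def publish(self, name, timestamp, value):
        print(f"gmetric-style update: {name}={value} @ {timestamp}")

class NagiosDestination:
    def publish(self, name, timestamp, value):
        print(f"passive check result for {name}: {value}")

class MetricBroker:
    def __init__(self, destinations):
        self.destinations = destinations

    def push(self, name, timestamp, value):
        # Every event goes to every destination; adding a new backend
        # (e.g. during a storage migration) is just one more entry in the list.
        for dest in self.destinations:
            dest.publish(name, timestamp, value)

broker = MetricBroker([GangliaDestination(), NagiosDestination()])
broker.push("orders.errors.too_many_attempts.per_minute", int(time.time()), 42)
```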

A few noteworthy design choices:

  • Metric alerting thresholds are metrics themselves. This allows us to plot the alert threshold in a graph just like any other data series. This eliminates the need to update the graph definition when thresholds change and allows us to use the same graph template for many possible threshold values. Additionally, the alerting threshold may be dynamic based on time of day or other functions. The graph below shows the current value (green area), historical norm (blue line), and alerting thresholds (yellow/red lines). The historical norm is calculated as the median value from samples around that time of day, on that day of the week, over the past three weeks. The alerting thresholds in this case are fixed deltas from the historical norm (a rough sketch of this calculation follows this list).
(Graph: Groupons Per Minute)

  • In our Grapher visualization tool, graphs are not assigned directly to hosts. Instead, graph templates specify a set of required metrics. If the user has browsed to a host/cluster/grid that has those metrics, the graph will be drawn. This way, if we decide that a new view of CPU statistics is valuable, we define it once and it will show up for any host, present or future, that provides the necessary metrics.
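
As referenced above, here is a rough sketch of how a dynamic, historically derived threshold like this might be computed. The window size and delta values are assumed for illustration:

```python
# Historical norm = median of samples taken around the same time of day, on the
# same day of the week, over the past three weeks; alert thresholds are fixed
# deltas from that norm and can be emitted as metrics themselves.
from datetime import datetime, timedelta
from statistics import median

def historical_norm(history, now, window_minutes=15, weeks=3):
    """history maps minute-resolution datetimes to metric values (e.g. Groupons per minute)."""
    samples = []
    for week in range(1, weeks + 1):
        anchor = now - timedelta(weeks=week)  # same weekday and time of day
        for offset in range(-window_minutes, window_minutes + 1):
            t = anchor + timedelta(minutes=offset)
            if t in history:
                samples.append(history[t])
    return median(samples) if samples else None

def alert_thresholds(norm, warn_delta=50, crit_delta=100):
    """Warning/critical thresholds as fixed deltas below the historical norm."""
    return norm - warn_delta, norm - crit_delta

# Example usage with synthetic data: a flat 500-per-minute history for the past 22 days.
now = datetime(2013, 7, 9, 12, 0)
history = {now - timedelta(minutes=m): 500 for m in range(1, 60 * 24 * 22)}
norm = historical_norm(history, now)
warn, crit = alert_thresholds(norm)
print(norm, warn, crit)
```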

This system isn’t without its shortcomings, though…

  • A single Ganglia server has proven to scale past our initial expectations, especially after the addition of rrdcached (reduced IO operations by 90%) and FusionIO (increased IO capacity by 100x+). However, it remains a critical single point of failure. We have daily backups of metric data, but any downtime in the system has a direct effect on our ability to gain insight into the running state of our system. We are evaluating a move to Graphite, OpenTSDB, or another platform where we can have a cluster of hosts storing and providing metric data in a scalable, fault-tolerant manner.
  • Our Grapher tool is built around Ganglia’s RRD files, and uses RRDTool to render graph images in PNG form or dump metric data in XML or JSON form. RRDTool is a very capable piece of software, but the capabilities of modern web browsers are such that we can provide a far richer experience by rendering graphs on the client.
  • A shortcoming of using Ganglia as a storage system is that the metric space is basically flat, with a single Grid / Cluster / Host hierarchy. The reality of metric data is that there are a number of ways people want to slice it, whether by service, hardware type, or something as abstract as user location. We are looking to move to a system that supports tagged metrics, so that we can filter and aggregate metrics across many different dimensions (see the sketch after this list).
  • The success of Splunk in our environment has been a double-edged sword. On the positive side, it has taken off in a big way here, with most of our 50 development teams being totally self-sufficient in crafting Splunk searches to push metrics to the Monitord Server. On the downside, that success has been far-reaching, so we occasionally deal with outages caused by overuse of Splunk resources. Overall, though, it has allowed us to gain deeper insight into our mountain of raw log data faster than any other solution we’ve looked at.
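
As a sketch of what tagged metrics buy us over a fixed Grid / Cluster / Host hierarchy, the same series can be filtered or aggregated along any dimension. The tag names and values below are invented for illustration:

```python
# With tags attached to each data point, slicing by service, hardware type,
# or any other dimension is just a filter, not a new hierarchy.
points = [
    {"metric": "requests.per_minute", "value": 1200,
     "tags": {"host": "web01", "service": "checkout", "hw": "blade"}},
    {"metric": "requests.per_minute", "value": 900,
     "tags": {"host": "web02", "service": "checkout", "hw": "pizza_box"}},
    {"metric": "requests.per_minute", "value": 400,
     "tags": {"host": "api01", "service": "deals_api", "hw": "blade"}},
]

def aggregate(points, metric, **tag_filter):
    """Sum a metric across all points whose tags match the given filter."""
    return sum(
        p["value"]
        for p in points
        if p["metric"] == metric
        and all(p["tags"].get(k) == v for k, v in tag_filter.items())
    )

print(aggregate(points, "requests.per_minute", service="checkout"))  # slice by service
print(aggregate(points, "requests.per_minute", hw="blade"))          # slice by hardware type
```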

This system has evolved here over the last two years, and we see lots of room for future evolution. On the immediate horizon are a new metrics storage platform, improving the Grapher user experience by moving to client-side graph rendering, and using other stream-analysis tools alongside Splunk to mine the information gold from our raw log stream. As it stands now though, this system is a key piece of Groupon infrastructure for anyone in the company involved in building or supporting our products.


2 thoughts on “Evolving Realtime Metrics and Monitoring at Groupon”

  1. Very interesting article and a timely issue! >> (1) for developers to log events, (2) to forward logs, (3) to flexibly query the raw files, (4) to schedule and aggregate queries, and (5) to view, create, and edit graphs and alerts on top of this data stream. What about the first point? Will it be covered in a future post? What is the right policy for enforcing correct use of logging by developers? Do you have a specific logging infrastructure, or is it an open source project with custom code? Splunk and lots of logs also means Big Data: what about using Hadoop? Thank you!

    by stefano on July 15, 2013 at 1:41 am
  2. Hi Stefano -- Thanks for the feedback and questions! We will cover logging in more detail in a future post. Today's environment is a mix of structured and unstructured logs. We are trying to move toward structured logs by publicizing standards, making easy logging libraries available to developers, and by developing a standard event taxonomy. This will allow us to do more things automatically and with higher accuracy. You're right about Big Data -- we push several terabytes of log data through Splunk each day. We use Hadoop extensively at Groupon, but for the purposes of real-time analysis (latency less than 1 minute) it doesn't fit the bill. However, we are evaluating other technologies like Storm and Impala to help scale real time analysis.

    by Zack Steinkamp on July 15, 2013 at 9:59 am
