The Groupon website sees a lot of action. Millions of people view, click on, and buy deals on the site every day. Naturally, all of those views, clicks, and purchases provide pieces of information that we could use to make our site better. And if the site is better, hopefully it will see even more action. Accordingly, we needed a platform that would allow us to record and interpret all of this information.
Recently, our very own Gene McKenna wrote about an experiment we ran to de-index the Groupon website from Google. The discussion of that article raised many questions: what do we use to track our incoming traffic? How do we determine referrals? Are we using Google Analytics? To learn more about the tracking system we developed, read on.
Tracking at Groupon started with simple impression tracking using pixels. It soon became apparent that this would not be sufficient, since we also wanted to track customer interactions with the site. Also, the management and analysis of tracking pixels had grown unwieldy. To solve this problem, we began by evaluating existing tools such as Google Analytics. We decided against an existing tool for a few reasons: we wanted exclusive access to our data, we wanted flexibility in data collection, and we have awesome developers who move fast and deliver great products. So, back in 2012, we built something: a custom tracking and analytics system that would allow us to record and interpret information from the site. When we were building this tool, there were a number of design decisions that had to be made.
First, we had to determine what kinds of questions needed answering. We started with the basics – what are people buying? What are people clicking on? How many things did we sell today? Is this new feature better than the old feature? But we also wanted to answer more complex questions about full customer sessions, such as which field on the checkout form a customer clicked on first or which field they were on when they abandoned. How do we answer all of these questions? Obviously, it all starts with tracking. If we don’t have data to work with, we won’t be able to learn the answers to any of these questions.
What to Track
These questions have the same two core pieces in common: customers and the site. However, tracking these independently is not sufficient. We don’t just want to know that a customer visited 10 pages, we want to know which 10 pages they visited. We want to know what the customer clicked on. We want to know what they purchased. So we started with a library that is designed for linking that information together and logging it. At Groupon, we call it TrackingHub, and we will investigate some of the design decisions that went into this library. If you would like more detail about the history behind this solution, you should check out the talk Tracking All the Things by Groupon’s Chris Powers.
We could benefit by knowing several things about the customer, for example, what operating system they are on, which browser they are using, whether they are using a mobile device, how many times they have been to the site, etc. But first we need some way to reasonably identify a customer.
To identify the customer, we need to know if the customer is logged in; if they are, we know their customer identifier (something that ties the customer to the database) and we should track that. Note that we must not use personally identifiable information (like a permalink that includes the customer’s name) for the identifier due to privacy concerns. If the customer is not logged in, we can at least track a unique identifier for the customer’s browser. (Creating universally unique identifiers on the client is outside the scope of this post, but the Wikipedia page is a good place to start investigating the concerns.) If the customer logs in later, then we may be able to reasonably associate the customer with the browser, but this is not always the case, since a browser may be shared between customers. Nevertheless, for our uses, this is usually a close enough approximation.
Data about the browser as found in the user agent string is also very helpful – especially when tracking down bugs that might be specific to a certain browser version. There are libraries that we can use to easily parse meaningful values from the user agent string, such as browser family and OS. When we’re done, we will have something like this for the customer data:
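As a sketch, the collected customer data might be built up like this. The field names and the rough user-agent check are illustrative, not TrackingHub’s actual schema:

```javascript
// Build the customer portion of a tracking payload. The user agent is
// passed in so the function stays testable outside a browser.
// All field names here are illustrative; the real schema may differ.
function buildCustomerData(userAgent, consumerId, browserId) {
  // Very rough user-agent sniffing for illustration only; a real
  // implementation would use a dedicated UA-parsing library.
  var browser = /Chrome/.test(userAgent) ? 'Chrome'
              : /Firefox/.test(userAgent) ? 'Firefox'
              : 'Other';
  return {
    consumerId: consumerId || null, // database id when logged in, never PII
    browserId: browserId,           // UUID stored in a long-lived cookie
    browser: browser,
    mobile: /Mobile/.test(userAgent),
    userAgent: userAgent
    // ... any other metadata ...
  };
}
```

In the browser, this would be called once at page load with `navigator.userAgent` and the identifiers read from cookies.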
The Customer’s Session
The customer’s session information is also important to track if we want to be able to distinguish between an individual customer’s visits to the site. On the Groupon website, we set a session cookie that expires after thirty minutes of session inactivity. Again, we needed a universally unique identifier to identify the session.
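A sliding-expiration cookie like that can be sketched as follows; on every page view the expiry is pushed thirty minutes out, so the session ends after thirty minutes of inactivity. The cookie name and format here are illustrative:

```javascript
// Sketch of a sliding-expiration session cookie. The name
// 'trackingSession' is illustrative, not Groupon's actual cookie.
var SESSION_TTL_MS = 30 * 60 * 1000; // thirty minutes

function sessionCookieString(sessionId, nowMs) {
  // Each call pushes the expiry out, implementing the inactivity timeout.
  var expires = new Date(nowMs + SESSION_TTL_MS);
  return 'trackingSession=' + sessionId +
         '; path=/; expires=' + expires.toUTCString();
}

// In the browser this would run on each page load, reusing the existing
// session UUID if the cookie is still present, or minting a new one:
// document.cookie = sessionCookieString(existingOrNewUuid(), Date.now());
```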
The other data we want to track with the session is the referral. This is where the tracking data becomes very useful for search engine optimization and marketing. Without the referral information, we would not be able to tell the difference between customers coming from search engines, e-mails, or other external sources. Unfortunately, this is also where it gets a little tricky. There are a couple of ways to tell where a customer is coming from.
The first way to determine where a customer is coming from is through the HTTP referer header. This header generally contains the URL of the web page that the customer just navigated from. If we see that this is some page other than a Groupon page, then we know that the customer clicked a link on that page to arrive on our site. Unfortunately, as Gene mentioned in his recent article, this header is not all that reliable: browsers are not required to send it at all, let alone guarantee its accuracy. I recommend reading Gene’s article for more information about how we have been dealing with this limitation.
The second method for collecting referrals is via URL query parameters. If we have control over the links that customers are clicking on — as in the case of advertisements and e-mails — we can add query parameters to these URLs that we can then parse and log in our tracking code.
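Parsing those parameters can be as simple as the sketch below. The parameter names (`utm_source`, `utm_medium`, `utm_campaign`) are the common convention; the post does not say which names Groupon actually uses:

```javascript
// Sketch: pull referral parameters out of the landing-page URL.
// Parameter names follow the common utm_* convention and are assumptions.
function parseReferralParams(pageUrl) {
  var params = new URL(pageUrl).searchParams;
  var referral = {};
  ['utm_source', 'utm_medium', 'utm_campaign'].forEach(function (name) {
    if (params.has(name)) referral[name] = params.get(name);
  });
  return referral;
}
```

In the browser, this would be called with `location.href` when the tracking library boots, and the result logged with the session.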
Once we have all of our session tracking code set up, it could end up looking something like this:
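Here is one possible shape (field names are illustrative, not TrackingHub’s actual schema):

```javascript
// Sketch of the session portion of a tracking payload. The referral
// information comes from the referer header and/or URL query parameters,
// as described above. Field names are illustrative.
function buildSessionData(sessionId, referrerUrl, referralParams) {
  return {
    sessionId: sessionId,             // UUID from the 30-minute session cookie
    referrerUrl: referrerUrl || null, // HTTP referer header, if any
    referral: referralParams || {}    // parsed query parameters, if any
  };
}
```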
Along with the customer and session information, we will need basic information about the site itself, or the page that the customer is visiting. First and foremost, we need to track the page URL. This is vital for determining page views and related metrics. The next thing we will need is a unique identifier for the page view. For this identifier, we typically concatenate the session identifier with a timestamp and an index. It is also useful to keep track of the previous page’s identifier, so that we can easily follow the customer’s navigation through pages on the site. Other useful fields could include the page’s locale or language, the application that generated the page (if your website comprises more than one application, like ours), or even the specific version of your tracking library (we have found this very useful for debugging purposes). When we’re finished, the setup for the page information will probably look something like this:
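One possible sketch, with illustrative field names; the page-view identifier concatenates the session id, a timestamp, and a per-session index as described above:

```javascript
// Sketch of the page portion of a tracking payload (illustrative names).
function buildPageData(sessionId, index, prevPageId, nowMs) {
  // Page-view id: session id + timestamp + per-session page index.
  var pageId = sessionId + '-' + nowMs + '-' + index;
  return {
    pageId: pageId,
    prevPageId: prevPageId || null, // lets us follow navigation paths
    url: typeof location !== 'undefined' ? location.href : null,
    locale: 'en_US',                // illustrative value
    app: 'web',                     // which application served the page
    libVersion: '1.0.0'             // tracking-library version, for debugging
  };
}
```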
So now that we have all of this data collected, what do we do with it? We include it with the events we want to track, of course. First and foremost, we want to send a page load event to be logged to the server. We will definitely want to include all of the above information with this event. That event would simply look something like this:
```javascript
{
  customer: { /* customer data from above */ },
  session: { /* session data from above */ },
  page: { /* page data from above */ }
}
```
From there, we have a couple of options. We can include all of the customer, session, and page data with each tracking message for easy log inspection and cross-referencing, or we can cut down on storage and include just the page identifier with each message. Including just the page identifier still allows for cross-referencing through the page load event.
Once we have this basic tracking set up, the possibilities are pretty much endless. We could log performance events. We could build a library for tracking views of and clicks on specific HTML elements and site content that we care about (there is more detail about this library in Chris’s talk). We could log when a customer makes a purchase. But how do we make sure all of this gets to the server?
Logging Tracking Data
Once we are generating these events on the client, we need to get them to the server. We could simply make a POST request for each tracking event. And if we only had a couple of events per page that might not be a bad solution. However, on Groupon’s website, we have many events that need to be logged, so we had to implement a different solution. We collect events as they happen on the client, then, every few seconds, we take any unsent events we have collected and send them in a single POST request to the server. This reduces the number of tracking requests made to the server, but there are some edge cases to consider.
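That batching loop can be sketched as below. The send mechanism is injected (for example, a wrapper around an XMLHttpRequest POST) so the queue itself stays transport-agnostic; the names are illustrative:

```javascript
// Sketch of a client-side event batcher: events accumulate in memory and
// are flushed to the server in one POST every few seconds.
function EventQueue(send, intervalMs) {
  this.events = [];
  this.send = send; // injected function that POSTs a batch to the server
  var self = this;
  if (intervalMs > 0) {
    this.timer = setInterval(function () { self.flush(); }, intervalMs);
  }
}

EventQueue.prototype.track = function (event) {
  this.events.push(event); // collect events as they happen
};

EventQueue.prototype.flush = function () {
  if (this.events.length === 0) return; // nothing to send
  var batch = this.events;
  this.events = [];        // clear before sending
  this.send(batch);        // one POST for the whole batch
};

EventQueue.prototype.stop = function () {
  if (this.timer) clearInterval(this.timer);
};
```

Usage might look like `var queue = new EventQueue(postToTrackingEndpoint, 3000);` followed by `queue.track(event)` calls from anywhere in the page.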
The first thing to consider is, what happens when a customer navigates away from the current page before the latest collected events have been sent? To solve this problem at Groupon, we persist the unsent events in browser local storage, if available, and then send them at the regular interval on the next page.
But what if the next page is not a Groupon page? Well, we have a couple of options. We could accept losing the last few customer interactions before the customer left the page, or we could force the POST request to the server on page unload. The specific solution will likely depend on the tolerance for lost events and the tolerance for additional page load latency. A similar problem occurs when a customer navigates from an HTTP page to an HTTPS page. For security reasons, browser local storage cannot be shared across protocols. Therefore, it is necessary to use one of the above solutions, or some other scheme involving cookie-based storage or something similar.
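The persistence half of that scheme might look like the following sketch. The storage object is injected (it would be `window.localStorage` in the browser) so the logic is testable, and the storage key is illustrative:

```javascript
// Sketch of persisting unsent events across page navigations.
var PENDING_KEY = 'tracking.pendingEvents'; // illustrative key name

function saveUnsent(storage, events) {
  if (events.length > 0) {
    storage.setItem(PENDING_KEY, JSON.stringify(events));
  }
}

// On the next page load, pull any persisted events back into the queue.
function restoreUnsent(storage) {
  var raw = storage.getItem(PENDING_KEY);
  if (!raw) return [];
  storage.removeItem(PENDING_KEY); // avoid double-sending
  return JSON.parse(raw);
}

// In the browser, saveUnsent would typically run in an unload/pagehide
// handler, and restoreUnsent when the tracking library boots.
```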
After you have your client-side events logged on the server, you can really start to have some fun. There are many tools available for indexing, searching, and analyzing log files. We use Splunk to do ad hoc analysis of our logs and some real-time reporting at Groupon, but there are other open source alternatives such as logstash, kibana, and Graylog2.
Once you have something like that up and running, you can do cool things like create an ad hoc graph showing when a problem was fixed in a certain country.
Of course, once we have all this data, we can also load it into whatever data warehouse tools we want to use, generate regular reports on the data, and analyze historical data to our heart’s content. Furthermore, it is completely within our power to build our own real-time systems for analyzing the data and extracting meaning from it.
For example, we have a real-time system written in Clojure that uses Apache Storm to process over half a million daily A/B experiment events. The system associates these events with deal purchases and customer interactions with the site in order to give us near instantaneous experiment results. If there is an error with an experiment, we can see it immediately and react appropriately. If things are running well, we only have to wait until we have collected enough data for meaningful results before we can make a decision based on the experiment.
Owning your own data and tracking and analytics platform definitely has its advantages. It is flexible – we can track whatever we want in whatever format we want, and once we have the data, our only constraints are those of any software project.
If you are interested in learning more about these tools, please check the Groupon Engineering blog and look out for our open source announcements. If working on these awesome products sounds fun, check out our Engineering openings.