Building a Custom Tracking and Analytics Platform: Information is Power

at August 20th, 2014

The Groupon website sees a lot of action. Millions of people view, click on, and buy deals on the site every day. Naturally, all of those views, clicks, and purchases provide pieces of information that we could use to make our site better. And if the site is better, hopefully it will see even more action. Accordingly, we needed a platform that would allow us to record and interpret all of this information.

Recently, our very own Gene McKenna wrote about an experiment we ran to de-index the Groupon website from Google. There were many questions that followed in the discussion of the article: what do we use to track our incoming traffic? How do we determine referrals? Are we using Google Analytics? To learn more about the tracking system we developed, read on.

Tracking at Groupon started with simple impression tracking using pixels. It soon became apparent that this would not be sufficient, since we also wanted to track customer interactions with the site. Also, the management and analysis of tracking pixels had grown unwieldy. To solve this problem, we began by evaluating existing tools such as Google Analytics. We decided against an existing tool for a few reasons: we wanted exclusive access to our data, we wanted flexibility in data collection, and we have awesome developers who move fast and deliver great products. So, back in 2012, we built something: a custom tracking and analytics system that would allow us to record and interpret information from the site. When we were building this tool, there were a number of design decisions that had to be made.

Tracking

First, we had to determine what kind of questions needed answering. We started with the basics – what are people buying? What are people clicking on? How many things did we sell today? Is this new feature better than the old feature? But we also wanted to answer more complex questions about full customer sessions, such as which field on the checkout form a customer clicked on first or which field they were on when they abandoned. How do we answer all of these questions? Obviously, it all starts with tracking. If we don’t have data to work with, we won’t be able to learn the answers to any of these questions.

What to Track

Common Tracking

These questions have the same two core pieces in common: customers and the site. However, tracking these independently is not sufficient. We don’t just want to know that a customer visited 10 pages, we want to know which 10 pages they visited. We want to know what the customer clicked on. We want to know what they purchased. So we started with a library that is designed for linking that information together and logging it. At Groupon, we call it TrackingHub, and we will investigate some of the design decisions that went into this library. If you would like more detail about the history behind this solution, you should check out the talk Tracking All the Things by Groupon’s Chris Powers.

The Customer

We could benefit by knowing several things about the customer, for example, what operating system they are on, which browser they are using, whether they are using a mobile device, how many times they have been to the site, etc. But first we need some way to reasonably identify a customer.

To identify the customer, we need to know if the customer is logged in; if they are, we know their customer identifier (something that ties the customer to the database) and we should track that. Note that we must not use personally identifiable information (like a permalink that includes the customer’s name) for the identifier due to privacy concerns. If the customer is not logged in, we can at least track a unique identifier for the customer’s browser. (Creating universally unique identifiers on the client is outside the scope of this post, but the wikipedia page is a good place to start investigating the concerns.) If the customer logs in later, then we may be able to reasonably associate the customer with the browser, but this is not always the case, since a browser may be shared between customers. Nevertheless, for our uses, this is usually a close enough approximation.

Data about the browser as found in the user agent string is also very helpful – especially when tracking down bugs that might be specific to a certain browser version. There are libraries that we can use to easily parse meaningful values from the user agent string, such as browser family and OS. When we’re done, we will have something like this for the customer data:

{
  userAgent: window.navigator.userAgent,
  browser: getBrowserFamily(),
  browserVersion: getBrowserVersion(),
  os: getOS(),
  device: getDevice(),
  customerId: getConsumerId(),
  loggedIn: isLoggedIn(),
  browserId: getBrowserId()
  // ... Any other metadata ...
}
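
One possible (simplified) implementation of the getBrowserFamily() helper referenced above is just a handful of user agent checks. A real implementation would typically lean on a dedicated user agent parsing library, as mentioned above, so treat this purely as an illustration:

function getBrowserFamily() {
  var ua = window.navigator.userAgent;
  // order matters: Chrome user agents also contain "Safari"
  if (/Chrome\//.test(ua)) { return 'Chrome'; }
  if (/Firefox\//.test(ua)) { return 'Firefox'; }
  if (/MSIE |Trident\//.test(ua)) { return 'IE'; }
  if (/Safari\//.test(ua)) { return 'Safari'; }
  return 'Other';
}
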
The Customer’s Session

The customer’s session information is also important to track if we want to be able to distinguish between an individual customer’s visits to the site. On the Groupon website, we set a session cookie that expires after thirty minutes of session inactivity. Again, we needed a universally unique identifier to identify the session.

The other data we want to track with the session is the referral. This is where the tracking data becomes very useful for search engine optimization and marketing. Without the referral information, we would not be able to tell the difference between customers coming from search engines, e-mails, or other external sources. Unfortunately, this is also where it gets a little tricky. There are a couple of ways to tell where a customer is coming from.

The first way to determine where a customer is coming from is through the HTTP referer header. This header generally contains the URL of the web page that the customer just navigated from. If we see that this is some page other than a Groupon page, then we know that the customer clicked on a link on the previous page in order to arrive on our site. Unfortunately, this header is not all that reliable, as Gene mentioned in his recent article, due to the fact that browsers do not necessarily have to send this header at all, let alone guarantee its accuracy. I recommend reading Gene’s article for more information about how we have been dealing with this limitation.

Another problem to consider when using the HTTP referer header is client-side redirects. If the customer is redirected to another page via JavaScript before any tracking messages have been sent to the server, then that referer header will be lost. There are several possible solutions to this. The first and least desirable is to delay the redirect until the tracking code has loaded on the page and sent the first message. Of course, this adds undesirable (and probably unacceptable) latency onto the customer’s total page load time. The preferred solution is to move the redirect to the server and either preserve the referer header on the redirected request or log the referrer on the server with an identifier that links it to the customer’s session. This is easy to do if the session identifier is always set on the server and passed to the client in a cookie.
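
To make the server-side option concrete, here is a minimal sketch, assuming Express and cookie-parser (these framework choices and the logger are illustrative assumptions, not a description of our actual stack), of a redirect endpoint that records the referrer against the session before redirecting:

var express = require('express');
var cookieParser = require('cookie-parser');

// stand-in for the real tracking logger
var trackingLog = { write: function (event) { console.log(JSON.stringify(event)); } };

var app = express();
app.use(cookieParser());

app.get('/promo', function (req, res) {
  trackingLog.write({
    event_name: 'server-redirect',
    sessionId: req.cookies.sessionId,      // session identifier set by the server in a cookie
    referrer: req.get('referer') || null   // incoming referer, captured before it is lost
  });
  res.redirect(302, '/deals');
});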

The second method for collecting referrals is via URL query parameters. If we have control over the links that customers are clicking on — as in the case of advertisements and e-mails — we can add query parameters to these URLs that we can then parse and log in our tracking code.
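
A possible implementation of the parseReferralParams() helper used below simply pulls recognized referral parameters out of the query string; the utm_-style parameter names here are illustrative, not necessarily the ones we use:

function parseReferralParams() {
  var params = {};
  var query = window.location.search.replace(/^\?/, '');
  if (!query) { return params; }
  query.split('&').forEach(function (pair) {
    var parts = pair.split('=');
    var key = decodeURIComponent(parts[0]);
    var value = decodeURIComponent(parts[1] || '');
    // keep only the parameters we recognize as referral information
    if (/^utm_/.test(key)) {
      params[key] = value;
    }
  });
  return params;
}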

Once we have all of our session tracking code set up, it could end up looking something like this:

{
  id: getSessionId(),
  referral: {
    referrer: document.referrer,
    referralParams: parseReferralParams()
  }
}

The Page

Along with the customer and session information, we will need basic information about the site itself, or the page that the customer is visiting. First and foremost, we need to track the page URL. This is vital for determining page view and related metrics. The next thing we will need is a unique identifier for the page view. For this identifier, we typically concatenate the session identifier with a timestamp and an index. It is also useful to keep track of the previous page’s identifier, so that we can easily follow the customer’s navigation through pages on the site. Other things to include with this information could include the page’s locale or language, the application that generated the page (if your website comprises more than one application, like ours), or even the specific version of your tracking library (we have found this very useful for debugging purposes). When we’re finished, the setup for the page information will probably look something like this:

{
  id: getPageId(),
  parentPageId: getParentPageId(),
  url: window.location.href,
  country: getCountry(),
  app: getApp(),
  libraryVersion: getVersion()
}

Event Tracking

So now that we have collected all of this data, what do we do with it? We include it with the events we want to track, of course. First and foremost, we want to send a page load event to be logged to the server. We will definitely want to include all of the above information with this event. That event would simply look something like this:

{
  event_name: "tracking-init",
  customer: // customer data from above
  session: // session data from above
  page: // page data from above
}

From there, we have a couple of options. We can include all of the customer, session, and page data with each tracking message for easy log inspection and cross-referencing or we can cut down on storage and just include the page identifier with each message. Including just the page identifier still allows for cross-referencing through the page load event.
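
For example, a slimmed-down interaction event might carry only the page identifier (the event and field names here are illustrative):

{
  event_name: "deal-click",
  pageId: getPageId(),
  dealId: "half-off-sushi",
  timestamp: Date.now()
}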

Once we have this basic tracking set up, the possibilities are pretty much endless. We could log performance events. We could build a library for tracking views of and clicks on specific HTML elements and site content that we care about (there is more detail about this library in Chris’s talk). We could log when a customer makes a purchase. But how do we make sure all of this gets to the server?

Logging Tracking Data

Once we are generating these events on the client, we need to get them to the server. We could simply make a POST request for each tracking event. And if we only had a couple of events per page that might not be a bad solution. However, on Groupon’s website, we have many events that need to be logged, so we had to implement a different solution. We collect events as they happen on the client, then, every few seconds, we take any unsent events we have collected and send them in a single POST request to the server. This reduces the number of tracking requests made to the server, but there are some edge cases to consider.

The first thing to consider is, what happens when a customer navigates away from the current page before the latest collected events have been sent? To solve this problem at Groupon, we persist the unsent events in browser local storage, if available, and then send them at the regular interval on the next page.
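
A simplified sketch of this batching-plus-persistence approach (the names, endpoint, and interval are illustrative, not our production values) could look like this:

var STORAGE_KEY = 'unsentTrackingEvents';
var eventQueue = [];

// restore anything the previous page did not manage to send
try {
  eventQueue = JSON.parse(window.localStorage.getItem(STORAGE_KEY)) || [];
} catch (e) { /* localStorage unavailable or unparsable; start fresh */ }

function persist() {
  try {
    window.localStorage.setItem(STORAGE_KEY, JSON.stringify(eventQueue));
  } catch (e) { /* best effort only */ }
}

function track(event) {
  eventQueue.push(event);
  persist();
}

function flush() {
  if (!eventQueue.length) { return; }
  var batch = eventQueue.splice(0, eventQueue.length);
  persist(); // the queue is now empty in storage as well
  var xhr = new XMLHttpRequest();
  xhr.open('POST', '/tracking/events');
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify(batch));
}

// send any unsent events every few seconds
setInterval(flush, 3000);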

But what if the next page is not a Groupon page? Well, we have a couple of options. We could not worry about the last few customer interactions before the customer left the page, or we could force the POST request to the server on page unload. The specific solution will likely depend on the tolerance for lost events and the tolerance for additional page load latency. A similar problem occurs when a customer navigates from an HTTP page to an HTTPS page. For security reasons, browser local storage cannot be shared across protocols. Therefore, it is necessary to use one of the above solutions, or some other scheme involving Cookie-based storage or something similar.

Analytics

After you have your client-side events logged on the server, you can really start to have some fun. There are many tools available for indexing, searching, and analyzing log files. We use Splunk to do ad hoc analysis of our logs and some real-time reporting at Groupon, but there are other open source alternatives such as logstash, kibana, and Graylog2.

Once you have something like that up and running, you can do cool things like create an ad hoc graph for when a problem was fixed in a certain country:

[Screenshot: Splunk ad hoc graph]

Of course, once we have all this data, we can also load it into whatever data warehouse tools we want to use, generate regular reports on the data, and analyze historical data to our heart’s content. Furthermore, it is completely within our power to build our own real-time systems for analyzing the data and extracting meaning from it.

For example, we have a real-time system written in Clojure that uses Apache Storm to process over half a million daily A/B experiment events. The system associates these events with deal purchases and customer interactions with the site in order to give us near instantaneous experiment results. If there is an error with an experiment, we can see it immediately and react appropriately. If things are running well, we only have to wait until we have collected enough data for meaningful results before we can make a decision based on the experiment.

Summary

Owning your own data and tracking and analytics platform definitely has its advantages. It is flexible – we can track whatever we want in whatever format we want, and once we have the data, our only constraints are those of any software project.

If you are interested in learning more about these tools, please check the Groupon Engineering blog and look out for our open source announcements. If working on these awesome products sounds fun, check out our Engineering openings.


Gofer – HTTP clients for Node.js

at August 14th, 2014

We recently transitioned the main Groupon website to Node.js and we’ve documented how we dismantled the monoliths and some of the integration testing tools such as testium in earlier blog posts. One strong requirement for the new architecture was that all features would be implemented on top of HTTP services. Our best practices for working with those services led to the creation of gofer: A wrapper around request that adds some additional instrumentation and makes it easier to manage configuration for different backend services.

We’d like to announce that gofer is now open source!

The README file contains a detailed walkthrough on how to use the library in case you just want to see it in action. Read on for some thoughts about gofer, calling HTTP services from node, and resilience in general.

Configuration

One goal was to create a format that fits nicely into one self-contained section of our configuration. At the same time, we wanted to have one unified config for all HTTP calls an app would make. Having a common pattern that all HTTP clients follow means that certain best practices can be globally enforced. For example, we have monitoring across apps that can tell us quite a bit about platform health beyond pure error rates.

The result looks something like this:

globalDefaults:
  connectTimeout: 100
  timeout: 2000
myApi:
  baseUrl: "https://my-api"
  qs:
    client_id: "SOME_CLIENT_ID"
github:
  baseUrl: "https://api.github.com/v3"
  clientId: "XXX"
  clientSecret: "XYZ"

The semantics of this config are similar to passing an object into request.defaults. “Every config setting is just a default value” means that it is relatively easy to reason about the effect of configuration settings. The options passed into request for a call against myApi are just the result of merging global defaults, the myApi section of the config, and the explicit options on top of each other. For example:

myApi.fetch({ uri: '/cats' });

Would be roughly equivalent to the following (given the above configuration):

request({
  connectTimeout: 100, // from globalDefaults
  timeout: 2000,
  baseUrl: 'https://my-api', // from service defaults
  qs: { client_id: 'SOME_CLIENT_ID' },
  uri: '/cats' // explicit options
});

If you just checked the request docs and couldn’t find all the options, then that’s because gofer supports a handful of additional options. connectTimeout is described in more detail below (“Failing fast”) and is always available.

baseUrl is implemented using an “option mapper”. Option mappers are functions we can register for specific services that take a merged options object and return a transformed one. It’s an escape hatch for when configuring the request options directly isn’t reasonable. If a service requires custom headers or has more complicated base URL logic, we’ve got it covered with option mappers.

The following option mapper takes an accessToken option and turns it into the OAuth2 header Github’s API expects:

function addOAuth2Token(options) {
  if (options.accessToken) {
    options.headers = options.headers || {};
    options.headers.authorization = 'token ' + options.accessToken;
    delete options.accessToken;
  }
  return options;
}

For every incoming request we create new instances of the API clients, passing in the requestId (among other things) as part of the global defaults. This makes sure that all API requests contain the proper instrumentation.

Failing fast

Every open connection costs resources. Not only on the current level but also further down in the stack. While Node.js is quite good at managing high numbers of concurrent connections, it’s not always wise to take it for granted. Sometimes it’s better to just fail fast instead of letting resources pile up. Potential candidates for failing fast include:

Connect timeout (connectTimeout)

This should, in most cases, be very short. It’s the time it takes to acquire a socket and connect to the remote service. If the network connectivity to (or the health of) the service is bad, this prevents every single connection from hanging around for the maximum time. For example, misconfigured firewalls can cause connect timeouts.

Timeout (timeout)

The time it takes to receive a response. The caveat is that this only captures the arrival of the headers. If the service writes the HTTP headers but then (for any reason) takes its time to actually deliver the body, this timeout will not catch it.

Sockets queueing

gofer currently does not support this

By default node will queue sockets when httpAgent.maxSockets connections are already open. A common solution to this is to just set maxSockets to a very high number, in some cases even Infinity. This certainly removes the risk of sockets queueing, but it passes all load down to the next level without any reasonable limits. Another option is to choose a value for maxSockets that is considered a healthy level of concurrency and to fail requests immediately once that level is reached. This (“load shedding”) is what Hystrix does, for example. gofer at least reports on socket queueing, and our policy is to monitor and treat this as a warning condition. We might add an option to actually fail fast on socket queueing in the future.
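
For reference, here is what capping concurrency at the agent level looks like in plain Node (a sketch only; as noted above, gofer itself currently reports on socket queueing rather than failing fast):

var http = require('http');

// pick a level of concurrency we consider healthy for this backend
var agent = new http.Agent();
agent.maxSockets = 50;

// requests beyond maxSockets queue up on agent.requests, which is the
// kind of thing we monitor and treat as a warning condition
http.get({ host: 'my-api', port: 80, path: '/cats', agent: agent }, function (res) {
  res.resume();
});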

Error handling

So one of the service calls failed. What now?

Full failure

This is the easiest option. Just pass the error up the stack (or into next when using express) and render an error page. The advantage is that no further resources are wasted on the current request, the obvious disadvantage is that it severely affects the user’s experience.

Graceful degradation

This is a very loaded term; in this case it’s meant as “omit parts of the current page instead of failing completely”. For example, some of our pages contain personal recommendations as secondary elements. We can provide the core features of these pages even when the service calls behind personal recommendations fail. This can greatly improve the resilience of pages, but at the price of a degraded user experience.
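
As a sketch of what this can look like in a route handler (the service clients, callback shapes, and Express-style render call are illustrative assumptions, not our actual code):

// `deals` and `recommendations` stand in for configured gofer clients
function renderDealPage(req, res, next) {
  deals.fetch({ uri: '/deals/' + req.params.id }, function (err, deal) {
    if (err) { return next(err); }          // core content: fail the whole request
    recommendations.fetch({ uri: '/recommendations' }, function (recErr, recs) {
      if (recErr) { recs = null; }          // secondary content: degrade gracefully
      res.render('deal', { deal: deal, recommendations: recs });
    });
  });
}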

Caching

Caching is not only valuable to reduce render times or to reduce load on underlying services, it can also help bridge error conditions. We built a caching library called cached on top of existing node.js modules that adds support for generating cached values in the background. Example of using cached for wrapping a service call:

var _ = require('lodash');       // lodash (or underscore) for _.partial below
var cached = require('cached');  // the caching library described above
// myService is a configured gofer client, as in the earlier examples

var entityId = 'xy';
var loadEntity = cached.deferred(_.partial(myService.fetch, '/entity/' + entityId));
cached('myService').getOrElse(entityId, loadEntity, function(err, entity) {
  // 80% of time this works every time
});

By configuring very high expiry times and low freshness times, we make sure that we have reasonably fresh data while pushing the actual service call latency out of the request dispatch and keeping the app responsive should the service ever go down.

The big gotcha is that this doesn’t do any cache invalidation but expiry, so it’s not easily applicable to data where staleness is not acceptable.

Instrumentation

For instrumentation we use a central “gofer hub” that is shared between clients. It has two general responsibilities:

  1. Add headers for transaction tracing (X-Request-ID). This idea might have originated in the Rails world [citation needed]; Heroku has a nice summary. Additionally, every API request is assigned a unique fetchId, which is passed down as the X-Fetch-ID header.

  2. Emit lifecycle events for all requests

The basic setup looks like this:

var http = require('http');
var Hub = require('gofer/hub');
var MyService = require('my-service-gofer');
var OtherService = require('other-service-gofer');

var hub = new Hub();

hub.on('success', function() {});
hub.on('socketQueueing', function() {});
// ...

http.createServer(function(req, res) {
  var requestId = req.headers['x-request-id'] || generateUUID();
  var config = { globalDefaults: { requestId: requestId } };

  var myService = new MyService(config, hub);
  var otherService = new OtherService(config, hub);

  myService.fetch('/some-url').pipe(res);
}).listen(process.env.PORT || 3000);

All available lifecycle events and the data they provide can be found in the API docs for gofer. At the very least they contain the requestId and the fetchId.

serviceName, endpointName, methodName

To more easily group requests, we use a hierarchy of service, endpoint, and method. This allows us to build graphs and monitoring with different levels of precision and to drill down when necessary to quickly find the cause of problems. By default they are the config section the gofer uses (serviceName), the path of the resource you access (endpointName), and the HTTP verb used (methodName).

To get nicer endpointNames we define all API calls we want to use in advance. This can be a little verbose but has the added benefit of being a free spell checker.

var goferFor = require('gofer');
var MyService = goferFor('myService'); // serviceName = myService
MyService.registerEndpoints({
  // endpointName = cats
  cats: function(request) {
    return {
      index: function(cb) {
        return request({ uri: '/cats', methodName: 'index' }, cb);
      },
      show: function(id, cb) {
        return request({ uri: '/cats/' + id, methodName: 'show' }, cb);
      },
      save: function(id, data, cb) {
        return request({ method: 'PUT', uri: '/cats/' + id, json: data }, cb);
      }
    };
  }
});

We can then use an instance of MyService like this:

// will be logged with serviceName=myService, endpointName=cats, methodName=index
myService.cats.index(cb);
// will be logged with serviceName=myService, endpointName=cats, methodName=put
myService.cats.save('kitty', { fur: 'fluffy' }, cb);

Since we didn’t provide an explicit methodName for the PUT call, it defaults to the HTTP method.

Want to give it a try?

You can find gofer on github and on npm today. There’s still a lot we want to improve and we’re excited to hear about your ideas on how to make API-based node apps simpler and more resilient. Let’s continue this discussion down in the comments or in Github issues!


Groupon Engineering Supports RailsGirls Summer of Code and Open Source

at August 12th, 2014

Rails Girls Summer of Code is a global fellowship program aimed at bringing more diversity into Open Source. Successful applicants from all over the world are paid a monthly stipend, from July-September, to work on Open Source projects of their choice. Groupon Engineering is supporting this great effort by contributing as a Bronze Sponsor and recently hosted Rails Girls Santiago.

[Photo: Rails Girls Santiago]

The workshop was started in Helsinki in 2010 as a one-time event, but the need for this sort of program was so strong that Rails Girls has now become an international event series, hosted by local chapters around the world. Rails Girls Summer of Code is happening for the second year and is about helping newcomers to the world of programming expand their knowledge and skills by contributing to a worthwhile Open Source project. The focus is not on producing highly sophisticated code, but rather on participants learning highly transferable skills from their project work.

Congratulations to our Berlin and Santiago teams for celebrating diversity and engaging in our global tech community! For more information on how you can be a part of this great organization check them out here: Rails Girls Summer of Code

The ETHOS Team

Engineering, Standards, Culture, Education and Engagement at Groupon


Groupon Engineering launches Otterconf’14

at August 10th, 2014

Groupon Engineering launched its first global front end developer conference: Otterconf’14. Groupon developers from our dev centers around the world flew into our Chicago office for two days of collaborating and sharing best practices around I-Tier, our global web platform.

Andrew Bloom, Adam Geitgey, Dan Gilbert and Antonios Gkogkakis, who all work on I-Tier, gloss the significance of Otterconf’14:

I-Tier (Interaction Tier) is our distributed web front end architecture where pages, or sometimes sets of pages, work as separate applications. I-Tier is a great innovation that enabled us to re-architect our existing monolith, migrating Groupon’s U.S. web traffic from a monolithic Ruby on Rails application to a new Node.js stack with substantial results.

Because of I-Tier, page loads are significantly faster across the Groupon site. Our development teams can also develop and ship features faster and with fewer dependencies on other teams, and we can eliminate redundant implementations of the same features across the countries where Groupon is available. I-Tier also includes localization services, configuration services, a global routing layer, consumer behavior tracking and A/B testing.

Check out Jeff Ayars, VP of Engineering at Groupon, on our international rollout of I-Tier:

#otterconf, #grouponeng, #groupon, #I-Tier


Maven and GitHub: Forking a GitHub repository and hosting its Maven artifacts

at July 2nd, 2014

Two tools that the Android team frequently uses here are GitHub and Maven (and soon to be Gradle, but that’s another story). Groupon has one of the most widely deployed mobile apps in the world, and with more than 80 million downloads and 54% of our global transactions coming from mobile, it’s important that we have the right tools to keep our operation running smoothly all the time.

The Groupon Android team recently tried to move our continuous integration builds from a Mac Mini on one of our desks to the companywide CI cluster. We got held up during the process because the build simply wouldn’t run on the new VMs. We discovered the problem was a bug in the Maven Android plugin, which neglected to escape the filesystem path when executing ProGuard and caused our build to fail for paths that contained unusual characters. Instead of looking to change the paths of all of our builds company wide, we focused on what it would take to fix the bug in the plugin. Read on for details on how we managed this process.

Fixing the bug itself was easy. The harder part was figuring out how to distribute the fix while we waited for the plugin maintainer to review and merge it in. One of the great advantages of GitHub is that it makes it easy to fork a project and fix bugs. However, if you’re using Maven, you also need a way to host your fork’s artifacts in a Maven repo, and there isn’t a way to automatically do that with GitHub.

You could host your own Nexus server, but that’s a lot of overhead for just a simple fork if you don’t already have one. You could set up a cloud storage solution to hold your Maven repo, but the internet is already littered with defunct and broken Maven repo links – why add one more that you’ll have to maintain forever?

A better solution to this problem is to use GitHub to host your Maven repositories. A Maven repository is, at its heart, just a structured set of files and directories that are publicly available via HTTP, and GitHub allows you to do this easily with its raw download support. The same technique is used by GitHub itself to serve up GitHub Pages websites.

The basic solution involves three steps:

  1. Create a branch called mvn-repo to host your Maven artifacts.
  2. Use the Github site-maven-plugin to push your artifacts to Github.
  3. Configure Maven to use your remote mvn-repo as a maven repository.

There are several benefits to using this approach:

  • It ties in naturally with the deploy target so there are no new Maven commands to learn. Just use mvn deploy as you normally would.
  • Maven artifacts are kept separate from your source in a branch called mvn-repo, much like github pages are kept in a separate branch called gh-pages (if you use github pages).
  • There’s no overhead of hosting a separate Maven Nexus or cloud storage server, and your maven artifacts are kept close to your github repo so it’s easy for people to find one if they know where the other is.

The typical way you deploy artifacts to a remote maven repo is to use mvn deploy, so let’s patch into that mechanism for this solution.

First, tell maven to deploy artifacts to a temporary staging location inside your target directory. Add this to your pom.xml:

<distributionManagement>
    <repository>
        <id>internal.repo</id>
        <name>Temporary Staging Repository</name>
        <url>file://${project.build.directory}/mvn-repo</url>
    </repository>
</distributionManagement>

<plugins>
    <plugin>
        <artifactId>maven-deploy-plugin</artifactId>
        <version>2.8.1</version>
        <configuration>
            <altDeploymentRepository>internal.repo::default::file://${project.build.directory}/mvn-repo</altDeploymentRepository>
        </configuration>
    </plugin>
</plugins>

Now try running mvn clean deploy. You’ll see that it deployed your maven repository to target/mvn-repo. The next step is to get it to upload that directory to github.

Add your authentication information to ~/.m2/settings.xml so that the github site-maven-plugin can push to github:

<!-- NOTE: MAKE SURE THAT settings.xml IS NOT WORLD READABLE! -->
<settings>
  <servers>
    <server>
      <id>github</id>
      <username>YOUR-USERNAME</username>
      <password>YOUR-PASSWORD</password>
    </server>
  </servers>
</settings>

(As noted, please make sure to chmod 700 settings.xml to ensure no one can read your password in the file.)

Then tell the github site-maven-plugin about the new server you just configured by adding the following to your pom:

<properties>
    <!-- github server corresponds to entry in ~/.m2/settings.xml -->
    <github.global.server>github</github.global.server>
</properties>

Finally, configure the site-maven-plugin to upload from your temporary staging repo to your mvn-repo branch on github:

<build>
    <plugins>
        <plugin>
            <groupId>com.github.github</groupId>
            <artifactId>site-maven-plugin</artifactId>
            <version>0.9</version>
            <configuration>
                <message>Maven artifacts for ${project.version}</message>  <!-- git commit message -->
                <noJekyll>true</noJekyll>                                  <!-- disable webpage processing -->
                <outputDirectory>${project.build.directory}/mvn-repo</outputDirectory> <!-- matches distribution management repository url above -->
                <branch>refs/heads/mvn-repo</branch>                       <!-- remote branch name -->
                <includes><include>**/*</include></includes>
                <merge>true</merge>                                        <!-- don't delete old artifacts -->
                <repositoryName>YOUR-REPOSITORY-NAME</repositoryName>      <!-- github repo name -->
                <repositoryOwner>YOUR-GITHUB-USERNAME</repositoryOwner>    <!-- github username  -->
            </configuration>
            <executions>
              <!-- run site-maven-plugin's 'site' target as part of the build's normal 'deploy' phase -->
              <execution>
                <goals>
                  <goal>site</goal>
                </goals>
                <phase>deploy</phase>
              </execution>
            </executions>
        </plugin>
    </plugins>
</build>

The mvn-repo branch does not need to exist; it will be created for you.

Now run mvn clean deploy again. You should see maven-deploy-plugin “upload” the files to your local staging repository in the target directory, then site-maven-plugin committing those files and pushing them to the server.

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building DaoCore 1.3-SNAPSHOT
[INFO] ------------------------------------------------------------------------
...
[INFO] --- maven-deploy-plugin:2.5:deploy (default-deploy) @ greendao ---
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/1.3-SNAPSHOT/greendao-1.3-20121223.182256-3.jar (77 KB at 2936.9 KB/sec)
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/1.3-SNAPSHOT/greendao-1.3-20121223.182256-3.pom (3 KB at 1402.3 KB/sec)
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/1.3-SNAPSHOT/maven-metadata.xml (768 B at 150.0 KB/sec)
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/maven-metadata.xml (282 B at 91.8 KB/sec)
[INFO] 
[INFO] --- site-maven-plugin:0.7:site (default) @ greendao ---
[INFO] Creating 24 blobs
[INFO] Creating tree with 25 blob entries
[INFO] Creating commit with SHA-1: 0b8444e487a8acf9caabe7ec18a4e9cff4964809
[INFO] Updating reference refs/heads/mvn-repo from ab7afb9a228bf33d9e04db39d178f96a7a225593 to 0b8444e487a8acf9caabe7ec18a4e9cff4964809
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.595s
[INFO] Finished at: Sun Dec 23 11:23:03 MST 2012
[INFO] Final Memory: 9M/81M
[INFO] ------------------------------------------------------------------------

Visit github.com in your browser, select the mvn-repo branch, and verify that all your binaries are now there.

[Screenshot: the mvn-repo branch in the GitHub repository]

Congratulations!

You can now deploy your maven artifacts to a poor man’s public repo simply by running mvn clean deploy.

There’s one more step you’ll want to take, which is to configure any poms that depend on your pom to know where your repository is. Add the following snippet to any project’s pom that depends on your project:

<repositories>
    <repository>
        <id>YOUR-PROJECT-NAME-mvn-repo</id>
        <url>https://raw.github.com/YOUR-USERNAME/YOUR-PROJECT-NAME/mvn-repo/</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
        </snapshots>
    </repository>
</repositories>

Now any project that requires your jar files will automatically download them from your Github Maven repository.

Here at Groupon, we used this technique to fix our bug in the Maven Android plugin, and then easily shared our fork with the wider Android community until the fix was incorporated upstream. This works particularly well for projects that are no longer being maintained, since those projects may never get around to merging in your pull requests.

We hope this makes forking Maven repos in GitHub as easy for you as it has been for us!


Introducing Geekon Talk – Seattle

at July 2nd, 2014

[Image: GEEKon Talk logo]

Groupon Engineering Seattle is an important hub for us, with continued expansion and growth. Groupon Engineering thrives on the exchange and cross-connection of ideas, leading to new technologies, approaches, and breakthroughs.

Accordingly, we’re excited to be launching our Geekon Talk series in our Seattle office.

The Geekon Talk series provides a platform for outside speakers to come into Groupon to present ideas and information on everything related to tech, entrepreneurship, tools, product design and start-ups.

Join us for our kick-off talk Tuesday, July 8. David Blevins will provide a code-first tour of TomEE, including quickly bootstrapping REST projects, doing proper testing with Arquillian, and setting up environments.

Apache TomEE is the Java EE version of Apache Tomcat and offers out-of-the-box integration with JAX-RS, JAX-WS, JMS, JPA, CDI, EJB and more. Finally, Tomcat goes beyond just Servlets and JSP. Save time chasing down blog posts, eject libs from your webapps, and start development with a more complete stack that is extremely well tested.

This will be a fun and lively talk with many engineering cross-connections. Sign up here.

When: Tuesday, July 8, 12:00pm-1:00pm Pacific.
Where: Groupon Seattle, 505 Fifth Avenue South, Suite 310, Seattle, WA 98104.
What: Lively presentation and conversation (food will also be provided).


Mobile Test Engineering – Odo

at June 26th, 2014

Earlier this year we told you about Odo, an innovative mobile test engineering tool we developed here at Groupon to overcome some of the challenges involved in testing our mobile app, which more than 80 million people have downloaded worldwide. We’re excited to tell you that Odo is now available at our Groupon Github home!

The struggle is real

Our engineering teams had to come up with a way to build engaging mobile experiences quickly for our end users. Along the way, our Mobile Test Engineering team started to come across common problems in the space, and we were not able to find an existing solution that would fit our needs. Specifically, we attempted to tackle these challenges:

  1. Development and testing of new features without dependent service APIs being available yet.
  2. We gravitated towards using realistic data, but needed more than a traditional mock server. While a mock may work, you will need to update the stub data whenever the service changes. In a SOA world, the maintenance can come at a huge cost.
  3. We needed a mock server with an HTTP-based interface so we could integrate with our end-to-end test suites, as well as a mock server that could easily integrate into development builds. We wanted to be able to share the manipulated data corpus across dev and test. What’s the point in reinventing the wheel, right?
  4. We had to simulate complex scenarios that may not be possible using static data. Example: three requests are made to an API. The first two are successful, but the third fails.

Oh boy! It’s Odo!

Odo is a man-in-the-middle for client and server communication. It can easily be used as a mock/stub server or as a proxy server. Odo is used to modify the data as it passes through the proxy server to simulate a condition needed for test. For example, we can request a list of Groupon deals and set the “sold out” condition to true to verify our app is displaying a sold out deal correctly. This allows us to still preserve the state of the original deal data that other developers and testers may be relying on.

The behaviors within Odo are configurable through a REST API and additional overrides can be added through plugins, so Odo is dynamic and flexible. A tester can manually change the override behaviors for a request, or configurations can even be changed at runtime from within an automated test.

Odo In A Nutshell

As a request passes through Odo, the request’s destination host and URI are matched against the hosts and paths defined by the configuration. If a request matches a defined source host, the request is updated with the destination host. If the host name doesn’t match, the request is sent along to its original destination. If the URI does not match a defined path, the request is executed for its new host. If the host and path match an enabled path, the enabled overrides are applied during its execution phase. There are two different types of overrides in Odo.

A Request Override is an override applied to a request coming from the mobile client on its way to the server. Generally, an override here will be for adding/editing a parameter or modifying request headers.

A Response Override executes on the data received from the server before passing data back to the client. This is where we can mock an API endpoint, change the HTTP response code, modify the response data to simulate a “sold out” Groupon deal, change the price, etc.

Automation Example

Client client = new Client("API Profile", false);
client.setCustomResponse("Global", "response text");
client.addMethodToResponseOverride("Global", "com.groupon.proxy.internal.Common.delay");
client.setMethodArguments("Global", "com.groupon.proxy.internal.Common.delay", 1, 100);

In this example, we are applying a custom response (stub) with value “response text” to a path defined with a friendly name “Global”. Then we add a 100ms delay to that same path. Go nuts! Try adding a 10 second delay to simulate the effects of network latency on your app.

Benefits

Odo brings benefits to several different areas of development:

  • Test automation owners can avoid stub data maintenance, and tests gain the ability to dynamically configure Odo to simulate complex or edge-case scenarios.

  • Manual Testers gain the same ability to configure Odo through the UI. No coding knowledge is required.

  • Developers can use Odo to avoid dependency blocks. If a feature depends on an API currently in development, Odo can be used to simulate the missing component and unblock development.

  • Multiple client support. With this feature, a single Odo instance can be run, but have different behaviors for each client connected. This allows a team to run a central Odo instance, and the team members can modify Odo behavior without affecting other members’ configurations.

Broader Odo Adoption

With Odo’s flexibility, our Test Engineering teams adopted the usage of it for testing our interaction tier applications built on Node.js (you know, the ones powering the frontend of groupon.com). We even have an internal fork that scales so we can use it as part of our capacity testing efforts. We’ll push that update in the near future as well.

Simple as 1-2-3

Our Github page provides some getting started instructions to get you up and running quickly. All you need is Java(7+) and odo.war from our release. There is a sample to give you an idea of how to apply everything Odo has to offer to your projects.

Links

Github – https://github.com/groupon/odo

Readme – https://github.com/groupon/odo/blob/master/README.md

Download – https://github.com/groupon/odo/releases

Community Collaboration

We have a roadmap of new features we’ll be adding to Odo over time. We’d love your feedback and contribution towards making Odo a great project for everyone.


Kratos – Automated Performance Regression Analyzer

at June 26th, 2014

Groupon’s architecture comprises a number of services that are either real-time or offline computation pipelines. Like any large data-centric technology company, we have scaling challenges both in terms of growing data (more bytes) and increasing traffic (more pageviews). With that scale, understanding the performance of our services becomes critical.

We constantly track performance deltas between code running in production and release candidates running in test environments. Release candidates may include both infrastructure and algorithmic changes. Manual inspection of hundreds of key metrics, like end-to-end latencies to serve a user request, is cumbersome and prone to human error, yet we must ensure new release candidates meet our SLA.

Introducing Kratos

Kratos is a platform that does differential analysis of performance metrics that originate from two different environments: production and release candidate. We had three clear goals in mind for it:

  • Providing a way to automatically measure, track and identify regressions in key performance metrics as changes are made to services in production.
  • Providing an early (pre-commit/pre-code-push) feedback mechanism to developers on the impact their changes have on service performance.
  • Building a streamlined way of revisiting performance baselines for release candidates so that we are not in for a surprise when launching to production.

Architecture

At Groupon, all of our services have their performance-related metrics written in real time to a super cool platform that converts them into graphs. Before new code goes live, it is hosted in either a developer VM or a dedicated staging environment. We bombard this service with a replay of a few hours of production traffic while the performance graphs are drawn in real time. Typical metrics include latency metrics (e.g. service latency, caching latency), counter metrics (e.g. requests, failures), and system stats (e.g. cpu, disk, and memory utilization).

[Diagram: Kratos architecture]

Let’s look at all the data stores that are needed by Kratos.

Service Metadata

Contains metadata about every service that plugs into the platform, such as the name of the service and the URL where the service’s graphs reside.

Metrics Metadata

Contains metadata about the metrics that need to be compared: the name of each metric, its value, the corresponding graph endpoints, the service it is associated with, and the acceptable delta for every metric.

Performance Analyzer Run Metadata

Stores all data about the current performance run: the value of every metric, the differentials, and whether each differential is within the threshold for that metric.

Metrics Pipeline

Metrics Filter
Applies a set of filters to select all metrics related to a service by querying the service and metrics metadata stores.
Metrics Crawler
The selected metrics are then passed on to the Metrics Crawler, which extracts metric values from the graphs and pushes them forward to the Metrics Cleanser.
Metrics Cleanser
The Cleanser looks at the time series of each metric, rejects any noise (inconsistencies in the graph or lack of data), and accounts for it in the computation.

Differential Analyzer

The final piece in the pipeline is the Differential Analyzer, which computes the deltas between the production and release-candidate metrics and then flushes them into the data store.

Performance Data Set

This is the core data structure passed around between the various stages of the metrics pipeline and used by every stage to store metadata.
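
The differential check itself is conceptually simple; here is a sketch (in JavaScript, purely for illustration, since Kratos is an internal platform) of comparing one metric between production and a release candidate against its acceptable delta:

function analyzeMetric(metric) {
  // metric: { name, productionValue, candidateValue, acceptableDeltaPercent }
  var deltaPercent =
    (metric.candidateValue - metric.productionValue) / metric.productionValue * 100;
  return {
    name: metric.name,
    productionValue: metric.productionValue,
    candidateValue: metric.candidateValue,
    deltaPercent: deltaPercent,
    withinThreshold: Math.abs(deltaPercent) <= metric.acceptableDeltaPercent
  };
}

// e.g. a 95th percentile latency that regressed by 12% against a 10% budget
console.log(analyzeMetric({
  name: 'p95_latency_ms',
  productionValue: 180,
  candidateValue: 201.6,
  acceptableDeltaPercent: 10
}));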

Findings

We did a study on how our current approach (manual inspection of graphs) compares to the new Kratos Platform. Here is what we found.

Scaling with increasing # of metrics

The x-axis is time in weeks. As you go from left to right in the graph, it is clear that the time taken to analyze and report performance results has gone down drastically (a more than 30x gain). The gain is also consistent as the number of metrics increases.

[Graph: time taken to analyze and report performance results over time]

Number of performance regressions caught in local/staging performance tests

As the adoption of the platform increased, the number of performance regressions identified during local/staging automated performance tests has increased. This has resulted in fewer production crashes/regressions.

The x-axis is time in weeks and the y-axis is the number of performance regressions identified (pre-production) with the old vs. the new approach.

[Graph: performance regressions caught pre-production, old vs. new approach]

Kratos improves the quality of our services and eliminates human error in judging the performance of a service. It does this by providing a framework that performs a differential analysis of the performance graphs between two versions of a service and produces and stores the result in a user-friendly fashion.

We’ve seen great results so far, with a 20x gain in the time taken to analyze results. We also have greater ease in catching issues before production deploys – the Relevance Service was at 100% availability during the last holiday season, owing to data feedback from Kratos regarding performance metrics.


PCI at Groupon – the Tokenizer

at June 17th, 2014

Any successful e-commerce company invariably has to become PCI compliant. The Payment Card Industry (PCI) is a credit card industry consortium that sets standards and protocols for dealing with credit cards. One of these standards, targeted at merchants, is called the PCI-DSS, or PCI Data Security Standard. It is a set of rules for how credit card data must be handled by a merchant in order to prevent that data from getting into the wrong hands.

Groupon has designed a PCI solution called the Tokenizer, a purpose-built system with performance, scalability, and fault tolerance in mind, as a way to streamline this process, and we are seeing great results.

The PCI Security Standards Council certifies independent auditors to assess merchants’ compliance with the PCI-DSS. Companies that handle more than six million credit card transactions in a year are held to the highest standard, known as PCI Level 1.

When faced with becoming PCI-compliant, most e-commerce businesses that need access to the original credit card number usually follow one of two typical paths:

  • Put all of their code and systems under PCI scope. Development velocity grinds to a halt due to the order-of-magnitude increase in overhead compared with typical agile processes.
  • Carve up their app into in-scope (checkout page, credit card number storage, etc.) and out-of-scope (everything else) portions. The once-monolithic application becomes a spaghetti mess of separation and interdependency between very different systems. The “checkout” page stagnates and becomes visually inconsistent with the rest of the site due to the order-of-magnitude increase in overhead noted above.

Groupon faced this dilemma a few years back. Prior to designing a solution, we set out our design goals:

  1. Have as few devices/services in PCI scope as possible.
  2. Make as few changes to existing services as possible.
  3. Maximize protection of our customers’ credit card data.

After many hours meditating at the Deer Head Center for Excellence (see the end of this post) we arrived at our solution.

Groupon’s PCI Solution

For the core tenet of the solution we would focus on the credit card data that we had to protect. Rather than re-implement our checkout application, we would shield the application from the credit card data. How would we do this? Read on…

Prior to becoming PCI compliant, users would submit their credit card information directly to the checkout application. With the PCI-compliant solution in place, there is an HTTP proxy layer that transforms the customer’s web request, replacing the credit card number with an innocuous token. The modified request is forwarded to the checkout app, which processes the request and responds to the user via the proxy. Since the proxy’s job is to replace sensitive data with a token, we call it the Tokenizer.

[Diagram: Tokenizer high-level architecture]

Enter the Tokenizer

With this architecture, our PCI footprint is minimal, flexible, and rarely needs to change. The Tokenizer is a purpose-built system, built with performance, scalability, and fault-tolerance in mind. Its only job is to find credit card numbers in specific kinds of requests, replace the credit card numbers with something else, store the encrypted credit card number, and forward the request to another system.

The Tokenizer does not render any HTML or error messages — all of that comes from the application that it is proxying to. Therefore, the user interface components are under complete control of systems that are not in PCI scope, thereby preserving the normal high iteration velocity of a modern e-commerce company.

“Yes, But…” or All About Format-Preserving Tokens

You may be asking, “So the credit card number is replaced with something else right? What is it replaced with? How can you satisfy your design goal of having minimal changes to the rest of the system?”

A credit card number is defined as a numeric sequence of 13 to 19 digits. Credit card companies are assigned a number space in which they may issue cards. You may have noticed that most VISA cards start with 4, MasterCard with 5, etc.

There is a special number space that begins with "99" that is by definition unassigned. The tokens that the Tokenizer generates are 19 digits long, and begin with "99". Because the token follows the official definition of a credit card, it is called a Format-Preserving Token.

Within the 19 digit number, we encode a token version, the original card type, datacenter that created the token, a checksum digit, a sequence number, and the original last four digits of the customer-entered credit card number.

Groupon Token Breakdown:

9911211234567891234
| || ||        lastfour
| || |seqno(9)
| || luhn
| |card type(2)
| version/colo
cardspace

The only modification required of other systems at Groupon is that they need to recognize these tokens as valid credit cards, and look at the card type field to know the original card type, rather than running their normal pattern matching code for Visa, Mastercard, Discover, AmEx, etc. Usually this is a minor change within a CreditCard class.
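
As a sketch (JavaScript only for illustration; the field widths are taken from the breakdown above) of how such a CreditCard class might recognize a token and read out its fields:

function parseGrouponToken(number) {
  // tokens are 19 digits and live in the unassigned "99" card space
  if (!/^99\d{17}$/.test(number)) {
    return null; // not a token; fall through to the normal card type pattern matching
  }
  return {
    versionColo: number.slice(2, 3),   // token version / datacenter that created it
    cardType:    number.slice(3, 5),   // original card type code
    luhn:        number.slice(5, 6),   // checksum digit
    seqNo:       number.slice(6, 15),  // sequence number
    lastFour:    number.slice(15, 19)  // last four digits the customer entered
  };
}

console.log(parseGrouponToken('9911211234567891234'));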

Database schemas can stay the same, masked card display routines (only showing the card type and last four digits) may stay the same, even error cases can stay the same.

Error Handling

Since the Tokenizer has pattern matching rules for detecting card types, it can detect invalid card numbers as well. Rather than encrypting and storing invalid card numbers, the Tokenizer will generate a token that is invalid in the same way that the original input was invalid and replace the invalid card number with the generated invalid token.

For instance, if the credit card number is too short (e.g. a VISA with 15 digits instead of the typical 16), the Tokenizer will generate a token that is too short (e.g. 18 digits instead of 19). If the card number the customer typed in does not pass the Luhn checksum test, then the token will not pass the Luhn checksum test. If the card is an unrecognizable type, then the token will be an unrecognizable type.

This allows all end-user messaging around card errors to be generated by the out-of-scope systems that the Tokenizer is proxying to. The checkout app will see an invalid card number, and return an appropriate error message via the Tokenizer back to the user. Less PCI-scoped code means that development teams may iterate on the product with less friction.

Security

Most similar tokenizing systems will use a cryptographic hashing algorithm like SHA-256 to transform the card number into something unrecognizable. The issue with hashed card numbers is that the plaintext space is relatively small (numbers with well known prefixes having 13-19 digits that pass the Luhn checksum test). Therefore, reverse-engineering of hashed credit card numbers is fairly simple with today’s computers. If the hashed tokens fall into the wrong hands, the assumption needs to be that the card numbers have been exposed.

This risk can be mitigated by using a salted hash or HMAC algorithm instead, with the salt treated as an encryption key and split between layers of the system, but the downside of having a token in a very different form than the original card number remains.

Aside from the card type and last four digits, the Groupon token is not derived from the card number at all. There is no way, given any number of tokens, to generate any number of credit card numbers. The information simply isn’t there for the bad guys to use. Less information is more secure.

Closing The Loop

At this point we have described how card numbers are replaced with tokens by the Tokenizer. Now we have many backend systems that have tokens stored instead of credit card numbers. You may be wondering how we collect money from tokens. Seems like squeezing water from a rock, right?

There is a complementary, internal-only system called the Detokenizer. The Detokenizer functions in a very similar way to the tokenizer, just in reverse.

We use our payment processor’s client library to construct a request back to the processor. Usually these requests are SOAP over HTTP. In the credit card number field in the SOAP payload, the out-of-scope system will insert the token. Naturally, we cannot send this request as-is to the payment processor, since the token is meaningless to them.

<?xml> <transaction> <card_number>990110000000121111</card_number>…

“Bzzzzt! That’s not a credit card number!”

Rather than sending the request directly to the payment processor, it is sent via the Detokenizer proxy. The Detokenizer recognizes the destination of the request and looks in a specific place in the payload (the card number field) for the token. If it’s found, the token is used to look up the original encrypted card number from the database, the card number is decrypted, and it replaces the token in the original request payload.

<?xml> <transaction> <card_number>4111111111111111</card_number>…

The Detokenizer then forwards the HTTP request to the payment processor and Groupon gets paid. Make it rain, Groupon.

Overall

Groupon’s current production footprint is on the order of thousands of servers. Our global PCI scope extends to one small source code repository, a dozen application servers, four database servers, and some network gear. Development, code deploys, and configuration changes follow the regimented rules of the PCI-DSS. But since the in-scope footprint is so small, the impact on the rest of the organization is minimal to nil.

There is a very small group of developers and operations staff who maintain these systems on a part-time basis. The rest of the time, these employees enjoy working on more high-velocity projects for the company. This design has meant that Groupon can still run at the speed of a startup, even though it is compliant with these strict industry rules.

And finally, The Deer Head Center For Excellence is Groupon’s in-office cocktail bar. Put up a monitor and serve up your own Deer Head.

[Photo: the Deer Head]


DotCi and Docker

at June 12th, 2014

This week, I presented at Dockercon14 on a new project that we just open-sourced at Groupon called DotCi. DotCi is a Jenkins plugin that makes job management easy with built-in GitHub integration, push-button job creation, and YAML powered build configuration and customization. It comes prepackaged with Docker support as well, which means bootstrapping a new build environment from scratch can take as little as 15 minutes. DotCi has been a critical tool for us internally for managing build and release pipelines for the wide variety of technologies in our SOA landscape. We found it so useful that we wanted to share the benefits of DotCi with the wide world of Jenkins users out there. Here are just a few of those benefits:

  • Deep Integration with Source Control – for us that’s Github Enterprise
  • Integration with Github webhooks
  • Feedback sent to the committer or pusher via Email, Hipchat, Campfire, etc.
  • Setting of commit and pull request statuses on Github
  • Push button build setup
  • Dynamic dependency loading at build runtime
  • Easy, version controlled build configuration
  • Simple parallelization
  • Customizable behavior based on environment variables (e.g. branch name)
  • Docker support!

Groupon Engineering is an early adopter of Docker and we were featured in their news this week.

Go check out DotCi and the Plugin Expansion pack on our public Github:

https://github.com/groupon/DotCi
https://github.com/groupon/DotCi-Plugins-Starter-Pack

And try out the 15-minute Cloud CI setup with DotCi, Docker, Jenkins and Digital Ocean here.

Please contact me for comments and any questions.

Happy Testing!
