Meet the Groupon Engineering Team: Aaron Bedra and Sean Massa

at September 26th, 2014

Aaron Bedra, Senior Fellow and software security expert, talks with Sean Massa, Software Engineer, about the new UI Engineering team and what drives their work at Groupon.

Sean and Aaron head a number of dev and engineering groups in Chicago’s dev community. Here is a look into some of the groups they lead:

Check out these groups at the links below:

Geekfest Chicago, Chicago Node.js, Chicago Ember.js, Anti-Harassment Policy, Open Source Hangout


Optimizing Redis Storage

at September 22nd, 2014

One of the tasks of our Optimize team is to build a real-time analytics engine for our A/B testing framework, which involves analyzing which consumers experience each variant of each experiment. Along with this, we need to look at each deal sold in the system and properly attribute the sale to the experiments the consumer visited on the way to that purchase, since any of them might have influenced the buying decision. Based on this visiting and buying data, the different product teams can then determine which of the experiment variants they want to keep.

In order to improve any consumer-facing Groupon product, experiments are run in which a random sample of consumers is placed into a testing group and shown one or more variants of the original (control) experience, and their responses are tallied. This A/B testing data comes to our cluster in the form of several separate messages. Some indicate the consumer, browser, and device when an experiment variant is encountered; others indicate when a consumer purchased a deal. It is then the job of this cluster to correlate the actions taken by that consumer to see if the variant is better than the control. Did the larger image lead to more purchases? Did the location of the button cause more people to click on it? Read on to learn more about our approach and results.

All these experiments need to be classified and the consumer actions attributed. The basic systems diagram looks like this:

UCS Basic System Design

The web and mobile clients send their data, in their own formats, into a Kafka cluster. The Unified-Click Stream reads from that cluster and generates a curated, uniformly formatted stream of messages, which is then the input to the Finch Analytics, where the attribution is done.

Recently, several of our production systems started using Clojure, and given that Storm is written primarily in Clojure, it seemed like a very good fit for the problem of processing messages in real time. There are several topologies in our cluster – one unifies the format of the incoming data, another enriches it with quasi-static data, and a third simply counts these events based on the contents of the messages. Currently, we’re processing more than 50,000 messages a second, but with Storm we have the ability to easily scale that up as the load increases. What proved to be a challenge was maintaining the shared state: it could not be stored in any one of the bolts, as there are 30 instances of the counting bolt spread across five machines in the cluster, so we had to have an external shared state.

All of our boxes are located in our datacenter, and because we’re processing real-time data streams, we’re running on bare metal boxes – not VMs. Our tests showed that if we used the traditional Redis persistence options (snapshotting based on time and update-count limits), a Redis box in our datacenter with 24 cores and 96 GB of RAM was more than capable of handling the load from these 30 bolts. In fact, the CPU usage was consistently below 20% of one of the 24 cores, and all other system metrics were equally low – plenty of headroom to do a little more work on the Redis server and still keep ahead of the flow.

Redis is primarily a key/value store, with the addition of primitive data types such as HASH, LIST, and SET that allow slightly nested structures and richer operations on the cache. And while its ability to recover after a crash with its data intact is a valuable step up over Memcached, it really makes you think about how to store data in a useful and efficient layout. The initial structure we chose for Redis was pretty simple. We needed to have a Redis SET of all the experiment names that were active. It turns out that there can be many experiments in the codebase, but only some are active. Others may have completed and just haven’t been removed from the code. To support this active list, we had a single key:

    finch|all-experiments => SET (names)

and then for each active experiment name, we had a series of counts: how many consumer interactions there have been with this experiment, how many errors there were on the page when dealing with the experiment, and even a count of the basic errors encountered in the stream itself – each updated with Redis’ atomic INCR command:

    finch|<expr-name>|counts|experiment => INT
    finch|<expr-name>|counts|errors => INT
    finch|<expr-name>|counts|null-b-cookies => INT

The next step was to keep track of all the experiments seen by all the consumers. As mentioned previously, this includes the browser they were using (Chrome 29.0, IE 9.0, etc.), the channel (a.k.a. line of business) the deal is from (Goods, Getaways, etc.), and the name of the variant they experienced. The consumer is represented by their browser ID:

    finch|<expr-name>|tuples => SET of [<browser>|<channel>|<name_of_variant>]
    finch|<expr-name>|variant|<browser>|<channel>|<name_of_variant> => SET of browserId

The Redis SET of tuples – browser name and version, channel, and the name of the variant they saw – was important so that we didn’t have to scan the key space looking for the SETs of browser IDs. That matters because Redis is very efficient at selecting a value by key, but horribly inefficient if it has to scan all the keys. While that function (KEYS) exists in the Redis command set, it is very clearly indicated as not to be used in a production system because of the performance implications.

Finally, we needed to attribute the sales and who bought them, again based on these tuples:

    finch|<expr-name>|orders|<browser>|<channel>|<name_of_variant>|orders => INT
    finch|<expr-name>|orders|<browser>|<channel>|<name_of_variant>|qty => INT
    finch|<expr-name>|orders|<browser>|<channel>|<name_of_variant>|revenue => FLOAT
    finch|<expr-name>|orders|<browser>|<channel>|<name_of_variant>|consumers => SET of uuid

As you can see, the lack of general, multi-level, nested structures in Redis means a lot has to be accomplished by how you name your keys, which makes this all appear far more complicated than it really is. At the same time, we purposefully chose to use the atomic Redis operations for incrementing values to keep performance up. This may seem like a lot of data to hold in Redis, but it led to very fast access to the shared state, and Redis’ atomic operations meant that we could have all 30 instances of the bolt hitting the same Redis instance and updating the data concurrently. Performance was high – the analytics derived from this data could be generated in roughly five seconds – so the solution seemed to be working perfectly.

Until we had been collecting data for a few days.

The memory usage on our Redis machine seemed to be constantly climbing.

Finch Experiment Memory Growth

First it passed 20 GB, then 40 GB, and then it crashed the 96 GB machine. The problem stemmed from the fact that while an experiment was active we were accumulating data for it. While the integers weren’t the problem, this one particular kind of SET was:

    finch|<expr-name>|variant|<browser>|<channel>|<name_of_variant> => SET of browserId

Over time there would be millions of unique visitors, with more than a hundred active experiments at any one time, and even multiple browserIDs per consumer. Add it all up, and these Redis SETs would hold hundreds of millions of entries, and they would continue to grow as more visitors came to the site and experienced the experiments. What we needed was a much more efficient way to store this data.

Wondering what Redis users do when they need to optimize storage, we did some research and found this blog post by the Engineering group at Instagram. We also found this post on the Redis site, which reinforces the same point and gives tuning parameters for storing data efficiently in a HASH. Armed with this knowledge, we set about refactoring our data structures to see what gains we could get.

Our first change was to pull the ‘counts’ into a HASH. Rather than using the following:

    INCR finch|<expr-name>|counts|experiment
    INCR finch|<expr-name>|counts|errors
    INCR finch|<expr-name>|counts|null-b-cookies

we switched to this:

    HINCRBY finch|<expr-name>|counts experiment 1
    HINCRBY finch|<expr-name>|counts errors 1
    HINCRBY finch|<expr-name>|counts null-b-cookies 1

Clearly, we were not the first to go this route, as Redis has an equivalent atomic increment command for HASH fields (HINCRBY). It was a very simple task of breaking up the original key and switching to the HASH form of the command.

Placing the sales counts in a HASH (except the SET of consumer UUIDs, which can’t fit within a HASH) was also just a simple matter of breaking up the key and using HINCRBY (and HINCRBYFLOAT for the revenue). Continuing along these lines, we saw we could do a similar refactor and switched from a SET of browserIDs to a HASH whose keys are the browserIDs – just as unique, and we can use the Redis command HKEYS to get the complete list. Going further, we realized the values of the new HASH could contain some of the data that was in other structures:

    finch|<browserID> => app-chan => <browser>|<channel>
    finch|<browserID> => trips|<expr-name>|<name_of_variant> => 0

where that zero is just a dummy value for the HASH field.

With this new structure, we can count the unique browserIDs in an experiment by using the Redis EXISTS command to check whether we have already seen this browserID in the form of the above HASH, and if not, we increment the count of unique entries in:

    finch|<expr-name>|tuples => <browser>|<channel>|<name_of_variant> => INT

At the same time, we get control over the ever-growing set of browserIDs that was filling up Redis in the first place by not keeping the full history of browserIDs, just the count. We realized we could have each browserID expire after a period of inactivity and let it get added back in when the consumer returns to Groupon. Therefore, we can use the Redis EXPIRE command on the:

    finch|<browserID>

HASH, and then after some pre-defined period of inactivity, the browserID data would just disappear from Redis. This last set of changes – moving away from a SET to a HASH, counting the visits as opposed to counting the members of a SET, and then EXPIRE-ing the data after a time – made the most significant difference to the storage requirements.
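To make the new flow concrete, here is a minimal sketch of the per-visit bookkeeping written against the node_redis client – purely illustrative, since our production code lives in the Storm bolts; it checks the specific trips field with HEXISTS rather than the whole HASH, and the 30-day expiry window is an assumption:

    var redis = require('redis');
    var client = redis.createClient();

    // Key names follow the layout described in this post.
    function recordVisit(exprName, browserId, browser, channel, variant) {
      var browserKey = 'finch|' + browserId;
      var tripField  = 'trips|' + exprName + '|' + variant;
      var tupleField = browser + '|' + channel + '|' + variant;

      client.hexists(browserKey, tripField, function (err, seen) {
        if (err) { return console.error(err); }
        if (!seen) {
          // First visit by this browser to this experiment/variant:
          // bump the unique-visitor count kept in the tuples HASH.
          client.hincrby('finch|' + exprName + '|tuples', tupleField, 1);
        }
        client.hset(browserKey, 'app-chan', browser + '|' + channel);
        client.hset(browserKey, tripField, 0);        // dummy value, as noted above
        client.expire(browserKey, 30 * 24 * 60 * 60); // drop after 30 days of inactivity
      });
    }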

So what have we really done? We had a workable solution to our shared state problem using Redis, but the space required was very large and the cost of keeping it working was going to be a lot more hardware. So we researched a bit, read a bit, and learned about the internals of Redis storage. We then did a significant data refactoring of the information in Redis – careful to keep every feature we needed, and whenever possible, reduce the data retained.

The end effect? The Redis CPU usage doubled, which was still very reasonable – about 33% of one core. The Redis storage dropped to 9 GB – less than 1/10th of the original. The latency in loading a complete experiment data set rose slightly – about 10% on average, depending on the size and duration of the experiment. Everything we liked about Redis – fast, simple, robust, and persistent – we were able to keep. Our new-found understanding of the cost, in both CPU time and memory space, of the different Redis data structures has enabled us to use it far more efficiently. As with any tool, the more you know about it – including its internal workings – the more you will be able to do with it.


Groupon Mobile Helps Launch Apple Pay

at September 9th, 2014

With nearly 92 million app downloads worldwide and more than half of our transactions occurring on mobile devices, Groupon is one of the largest mobile commerce companies in the world. We continue to evolve with mobile and today we’re announcing our integration with Apple Pay, a new mobile payments service that will provide an easy, secure and private way to make purchases from the Groupon app on iPhone 6 and iPhone 6 Plus. Groupon is one of the first mobile commerce companies to announce integration with Apple Pay and the new payment option will be available to customers beginning in October.

“We’re thrilled to work with Apple to streamline how our customers pay in the Groupon app,” said Groupon Head of Mobile, Don Chennavasin. Groupon’s integration with Apple Pay allows customers to seamlessly check out with a single touch and pay using Touch ID from their iPhone 6 or iPhone 6 Plus. It also allows new customers to create user accounts and make purchases on the fly with just a few taps.

“Bringing Apple Pay to the Groupon mobile app makes it easier than ever to find and buy the best things around you,” said Groupon CEO Eric Lefkofsky.

Groupon’s consistently top-rated iOS app is one of the 25 most-downloaded free apps on the App Store. Groupon’s next updated app will launch alongside Apple Pay in October.

applepayscreenshot

 

 


Building a Custom Tracking and Analytics Platform: Information is Power

at August 20th, 2014

The Groupon website sees a lot of action. Millions of people view, click on, and buy deals on the site every day. Naturally, all of those views, clicks, and purchases provide pieces of information that we could use to make our site better. And if the site is better, hopefully it will see even more action. Accordingly, we needed a platform that would allow us to record and interpret all of this information.

Recently, our very own Gene McKenna wrote about an experiment we ran to de-index the Groupon website from Google. There were many questions that followed in the discussion of the article: what do we use to track our incoming traffic? How do we determine referrals? Are we using Google Analytics? To learn more about the tracking system we developed, read on.

Tracking at Groupon started with simple impression tracking using pixels. It soon became apparent that this would not be sufficient, since we also wanted to track customer interactions with the site. Also, the management and analysis of tracking pixels had grown unwieldy. To solve this problem, we began by evaluating existing tools such as Google Analytics. We decided against an existing tool for a few reasons: we wanted exclusive access to our data; we wanted flexibility in data collection; and we have awesome developers who move fast and deliver great products. So, back in 2012, we built something: a custom tracking and analytics system that would allow us to record and interpret information from the site. When we were building this tool, there were a number of design decisions that had to be made.

Tracking

First, we had to determine what kind of questions needed answering. We started with the basics – what are people buying? What are people clicking on? How many things did we sell today? Is this new feature better than the old feature? But we also wanted to answer more complex questions about full customer sessions, such as which field on the checkout form a customer clicked on first or which field they were on when they abandoned. How do we answer all of these questions? Obviously, it all starts with tracking. If we don’t have data to work with, we won’t be able to learn the answers to any of these questions.

What to Track

Common Tracking

These questions have the same two core pieces in common: customers and the site. However, tracking these independently is not sufficient. We don’t just want to know that a customer visited 10 pages, we want to know which 10 pages they visited. We want to know what the customer clicked on. We want to know what they purchased. So we started with a library that is designed for linking that information together and logging it. At Groupon, we call it TrackingHub, and we will investigate some of the design decisions that went into this library. If you would like more detail about the history behind this solution, you should check out the talk Tracking All the Things by Groupon’s Chris Powers.

The Customer

We could benefit by knowing several things about the customer, for example, what operating system they are on, which browser they are using, whether they are using a mobile device, how many times they have been to the site, etc. But first we need some way to reasonably identify a customer.

To identify the customer, we need to know if the customer is logged in; if they are, we know their customer identifier (something that ties the customer to the database) and we should track that. Note that we must not use personally identifiable information (like a permalink that includes the customer’s name) for the identifier due to privacy concerns. If the customer is not logged in, we can at least track a unique identifier for the customer’s browser. (Creating universally unique identifiers on the client is outside the scope of this post, but the wikipedia page is a good place to start investigating the concerns.) If the customer logs in later, then we may be able to reasonably associate the customer with the browser, but this is not always the case, since a browser may be shared between customers. Nevertheless, for our uses, this is usually a close enough approximation.

Data about the browser as found in the user agent string is also very helpful – especially when tracking down bugs that might be specific to a certain browser version. There are libraries that we can use to easily parse meaningful values from the user agent string, such as browser family and OS. When we’re done, we will have something like this for the customer data:

{
  userAgent: window.navigator.userAgent,
  browser: getBrowserFamily(),
  browserVersion: getBrowserVersion(),
  os: getOS(),
  device: getDevice(),
  customerId: getConsumerId(),
  loggedIn: isLoggedIn(),
  browserId: getBrowserId()
  // ... Any other metadata ...
}

The Customer’s Session

The customer’s session information is also important to track if we want to be able to distinguish between an individual customer’s visits to the site. On the Groupon website, we set a session cookie that expires after thirty minutes of session inactivity. Again, we needed a universally unique identifier to identify the session.
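A minimal Express sketch of such a rolling session cookie might look like the following – the cookie name and the random identifier are stand-ins, not the exact Groupon implementation:

var express = require('express');
var cookieParser = require('cookie-parser');
var crypto = require('crypto');

var app = express();
app.use(cookieParser());

// Rolling session: reuse the existing cookie if present, otherwise mint a new
// identifier (random hex stands in for a proper UUID here), and refresh the
// thirty-minute inactivity window on every request.
app.use(function (req, res, next) {
  var sessionId = req.cookies.session_id || crypto.randomBytes(16).toString('hex');
  res.cookie('session_id', sessionId, { maxAge: 30 * 60 * 1000 });
  req.sessionId = sessionId;
  next();
});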

The other data we want to track with the session is the referral. This is where the tracking data becomes very useful for search engine optimization and marketing. Without the referral information, we would not be able to tell the difference between customers coming from search engines, e-mails, or other external sources. Unfortunately, this is also where it gets a little tricky. There are a couple of ways to tell where a customer is coming from.

The first way to determine where a customer is coming from is through the HTTP referer header. This header generally contains the URL of the web page that the customer just navigated from. If we see that this is some page other than a Groupon page, then we know that the customer clicked on a link on the previous page in order to arrive on our site. Unfortunately, this header is not all that reliable, as Gene mentioned in his recent article, due to the fact that browsers do not necessarily have to send this header at all, let alone guarantee its accuracy. I recommend reading Gene’s article for more information about how we have been dealing with this limitation.

Another problem to consider when using the HTTP referer header is client-side redirects. If the customer is redirected to another page via Javascript before any tracking messages are sent to the server, then that referer header will be lost. There are several possible solutions to this. The first and least desirable is to ensure that the tracking code loads on the page and sends the first message before the redirect fires. Of course, this adds undesirable (and probably unacceptable) latency onto the customer’s total page load time. The preferred solution is to move the redirect to the server and either preserve the referer header on the redirected request or log the referrer on the server with an identifier that links it to the customer’s session. This is easy to do if the session identifier is always set on the server and passed to the client in a cookie.
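A sketch of that server-side approach, again in Express – the route and the to parameter are hypothetical, and the session_id cookie is the one from the sketch above:

// Hypothetical redirect endpoint: log the referer against the session before
// redirecting, so the information survives even though no tracking code runs
// on this intermediate hop. In practice the target would be validated against
// a whitelist rather than redirected to blindly.
app.get('/redirect', function (req, res) {
  console.log(JSON.stringify({
    event_name: 'redirect',
    session: req.cookies.session_id,
    referrer: req.get('Referer') || ''
  }));
  res.redirect(req.query.to);
});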

The second method for collecting referrals is via URL query parameters. If we have control over the links that customers are clicking on — as in the case of advertisements and e-mails — we can add query parameters to these URLs that we can then parse and log in our tracking code.
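The parseReferralParams() used in the snippet below could be as simple as pulling an agreed-upon set of parameters out of the query string – the utm_ prefix here is just an assumption about what those links carry:

// One possible parseReferralParams(): collect the marketing parameters that we
// put on advertisement and e-mail links ourselves.
function parseReferralParams() {
  var params = {};
  window.location.search.replace(/^\?/, '').split('&').forEach(function (pair) {
    if (!pair) { return; }
    var parts = pair.split('=');
    var key = decodeURIComponent(parts[0]);
    if (/^utm_/.test(key)) {
      params[key] = decodeURIComponent(parts[1] || '');
    }
  });
  return params;
}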

Once we have all of our session tracking code set up, it could end up looking something like this:

{
  id: getSessionId(),
  referral: {
    referrer: document.referrer,
    referralParams: parseReferralParams()
  }
}

The Page

Along with the customer and session information, we will need basic information about the site itself, or the page that the customer is visiting. First and foremost, we need to track the page URL. This is vital for determining page view and related metrics. The next thing we will need is a unique identifier for the page view. For this identifier, we typically concatenate the session identifier with a timestamp and an index. It is also useful to keep track of the previous page’s identifier, so that we can easily follow the customer’s navigation through pages on the site. Other things to include with this information are the page’s locale or language, the application that generated the page (if your website comprises more than one application, like ours), or even the specific version of your tracking library (we have found this very useful for debugging purposes). When we’re finished, the setup for the page information will probably look something like this:

{
  id: getPageId(),
  parentPageId: getParentPageId(),
  url: window.location.href,
  country: getCountry(),
  app: getApp(),
  libraryVersion: getVersion()
}

Event Tracking

So now we have all of this data collected, what do we do with it? We include it with the events we want to track, of course. First and foremost, we want to send a page load event to be logged to the server. We will definitely want to include all of the above information with this event. That event would simply look something like this:

{
  event_name: "tracking-init",
  customer: { /* customer data from above */ },
  session: { /* session data from above */ },
  page: { /* page data from above */ }
}

From there, we have a couple of options. We can include all of the customer, session, and page data with each tracking message for easy log inspection and cross-referencing, or we can cut down on storage and just include the page identifier with each message. Including just the page identifier still allows for cross-referencing through the page load event.
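For example, a follow-up event that relies on the page identifier for cross-referencing might carry little more than its own payload (the event and field names here are purely illustrative):

{
  event_name: "deal-click",   // illustrative event name
  pageId: getPageId(),        // enough to join back to the page load event
  dealUuid: getDealUuid()     // event-specific payload
}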

Once we have this basic tracking set up, the possibilities are pretty much endless. We could log performance events. We could build a library for tracking views of and clicks on specific HTML elements and site content that we care about (there is more detail about this library in Chris’s talk). We could log when a customer makes a purchase. But how do we make sure all of this gets to the server?

Logging Tracking Data

Once we are generating these events on the client, we need to get them to the server. We could simply make a POST request for each tracking event. And if we only had a couple of events per page that might not be a bad solution. However, on Groupon’s website, we have many events that need to be logged, so we had to implement a different solution. We collect events as they happen on the client, then, every few seconds, we take any unsent events we have collected and send them in a single POST request to the server. This reduces the number of tracking requests made to the server, but there are some edge cases to consider.

The first thing to consider is, what happens when a customer navigates away from the current page before the latest collected events have been sent? To solve this problem at Groupon, we persist the unsent events in browser local storage, if available, and then send them at the regular interval on the next page.

But what if the next page is not a Groupon page? Well, we have a couple of options. We could accept losing the last few customer interactions before the customer left the page, or we could force the POST request to the server on page unload. The specific solution will likely depend on the tolerance for lost events and the tolerance for additional page load latency. A similar problem occurs when a customer navigates from an HTTP page to an HTTPS page. For security reasons, browser local storage cannot be shared across protocols. Therefore, it is necessary to use one of the above solutions, or some other scheme involving cookie-based storage or something similar.
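Put together, the batching loop might look something like the following sketch – the endpoint, queue key, and five-second interval are illustrative, and it assumes local storage is available (see the caveats above):

var QUEUE_KEY = 'tracking-queue';

// Queue an event in local storage so it survives navigation to the next page.
function enqueue(event) {
  var queue = JSON.parse(window.localStorage.getItem(QUEUE_KEY) || '[]');
  queue.push(event);
  window.localStorage.setItem(QUEUE_KEY, JSON.stringify(queue));
}

// Every few seconds, send everything collected since the last flush in a
// single POST request.
function flush() {
  var queue = JSON.parse(window.localStorage.getItem(QUEUE_KEY) || '[]');
  if (!queue.length) { return; }
  window.localStorage.setItem(QUEUE_KEY, '[]');
  var xhr = new XMLHttpRequest();
  xhr.open('POST', '/tracking');
  xhr.setRequestHeader('Content-Type', 'application/json');
  xhr.send(JSON.stringify(queue));
}

setInterval(flush, 5000);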

Analytics

After you have your client-side events logged on the server, you can really start to have some fun. There are many tools available for indexing, searching, and analyzing log files. At Groupon we use Splunk to do ad hoc analysis of our logs and some real-time reporting, but there are open source alternatives such as Logstash, Kibana, and Graylog2.

Once you have something like that up and running, you can do cool things like create an ad hoc graph showing when a problem was fixed in a certain country:

splunk_screenshot_1

Of course, once we have all this data, we can also load it into whatever data warehouse tools we want to use, generate regular reports on the data, and analyze historical data to our heart’s content. Furthermore, it is completely within our power to build our own real-time systems for analyzing the data and extracting meaning from it.

For example, we have a real-time system written in Clojure that uses Apache Storm to process over half a million daily A/B experiment events. The system associates these events with deal purchases and customer interactions with the site in order to give us near instantaneous experiment results. If there is an error with an experiment, we can see it immediately and react appropriately. If things are running well, we only have to wait until we have collected enough data for meaningful results before we can make a decision based on the experiment.

Summary

Owning your own data and tracking and analytics platform definitely has its advantages. It is flexible – we can track whatever we want in whatever format we want, and once we have the data, our only constraints are those of any software project.

If you are interested in learning more about these tools, please check the Groupon Engineering blog and look out for our open source announcements. If working on these awesome products sounds fun, check out our Engineering openings.


Gofer – HTTP clients for Node.js

at August 14th, 2014

We recently transitioned the main Groupon website to Node.js and we’ve documented how we dismantled the monoliths and some of the integration testing tools such as testium in earlier blog posts. One strong requirement for the new architecture was that all features would be implemented on top of HTTP services. Our best practices for working with those services led to the creation of gofer: a wrapper around request that adds some additional instrumentation and makes it easier to manage configuration for different backend services.

We’d like to announce that gofer is now open source!

The README file contains a detailed walkthrough on how to use the library in case you just want to see it in action. Read on for some thoughts about gofer, calling HTTP services from Node, and resilience in general.

Configuration

One goal was to create a format that fits nicely into one self-contained section of our configuration. At the same time we wanted to have one unified config for all HTTP calls an app would make. Having a common pattern that all HTTP clients follow means that certain best practices can be globally enforced. For example, we have monitoring across apps that can tell us quite a bit about platform health beyond pure error rates.

The result looks something like this:

globalDefaults:
  connectTimeout: 100
  timeout: 2000
myApi:
  baseUrl: "https://my-api"
  qs:
    client_id: "SOME_CLIENT_ID"
github:
  baseUrl: "https://api.github.com/v3"
  clientId: "XXX"
  clientSecret: "XYZ"

The semantics of this config are similar to passing an object into request.defaults. “Every config setting is just a default value” means that it is relatively easy to reason about the effect of configuration settings. The options passed into request for a call against myApi are just the result of merging global defaults, the myApi section of the config, and the explicit options on top of each other. For example:

myApi.fetch({ uri: '/cats' });

Would be roughly equivalent to the following (given the above configuration):

request({
  connectTimeout: 100, // from globalDefaults
  timeout: 2000,
  baseUrl: 'https://my-api', // from service defaults
  qs: { client_id: 'SOME_CLIENT_ID' },
  uri: '/cats' // explicit options
});

If you just checked the request docs and couldn’t find all the options, then that’s because gofer supports a handful of additional options. connectTimeout is described in more detail below (“Failing fast”) and is always available.

baseUrl is implemented using an “option mapper”. Option mappers are functions we can register for specific services that take a merged options object and return a transformed one. It’s an escape hatch for when configuring the request options directly isn’t reasonable. If a service requires custom headers or has more complicated base URL logic, we’ve got it covered with option mappers.

The following option mapper takes an accessToken option and turns it into the OAuth2 header Github’s API expects:

function addOAuth2Token(options) {
  if (options.accessToken) {
    options.headers = options.headers || {};
    options.headers.authorization = 'token ' + options.accessToken;
    delete options.accessToken;
  }
  return options;
}

For every incoming request we create new instances of the API clients, passing in the requestId (among other things) as part of the global defaults. This makes sure that all API requests contain the proper instrumentation.

Failing fast

Every open connection costs resources. Not only on the current level but also further down in the stack. While Node.js is quite good at managing high numbers of concurrent connections, it’s not always wise to take it for granted. Sometimes it’s better to just fail fast instead of letting resources pile up. Potential candidates for failing fast include:

Connect timeout (connectTimeout)

This should, in most cases, be very short. It’s the time it takes to acquire a socket and connect to the remote service. If the network connectivity to (or the health of) the service is bad, this prevents every single connection from hanging around for the maximum time. For example, wrong firewall setups can cause connect timeouts.

Timeout (timeout)

The time it takes to receive a response. The caveat is that this only captures the arrival of the headers. If the service writes the HTTP headers but then (for any reason) takes its time to actually deliver the body, this timeout will not catch it.

Sockets queueing

gofer currently does not support this

By default node will queue sockets when httpAgent.maxSockets connections are already open. A common solution to this is to just set maxSockets to a very high number, in some cases even Infinity. This certainly removes the risk of sockets queueing, but it passes all load down to the next level without any reasonable limits. Another option is to choose a value for maxSockets that is considered a healthy level of concurrency and to fail requests immediately once that level is reached. This (“load shedding”) is what Hystrix does, for example. gofer at least reports on socket queueing, and our policy is to monitor and treat this as a warning condition. We might add an option to actually fail fast on socket queueing in the future.
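Until then, one way to keep concurrency bounded when calling request directly is a dedicated agent per backend – a sketch, not something gofer exposes today, and the limit of 64 is arbitrary:

var http = require('http');
var request = require('request');

// Cap the number of concurrent sockets to this backend. Anything beyond the
// limit queues on the agent instead of piling more load onto the service --
// queueing that should itself be monitored, as described above.
var boundedAgent = new http.Agent();
boundedAgent.maxSockets = 64;

request({ uri: 'http://my-api/cats', agent: boundedAgent }, function (err, res, body) {
  // handle the response
});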

Error handling

So one of the service calls failed. What now?

Full failure

This is the easiest option. Just pass the error up the stack (or into next when using express) and render an error page. The advantage is that no further resources are wasted on the current request, the obvious disadvantage is that it severely affects the user’s experience.

Graceful degradation

This is a very loaded term; in this case it’s meant as “omit parts of the current page instead of failing completely”. For example, some of our pages contain personal recommendations as secondary elements. We can provide the core features of these pages even when the service calls behind personal recommendations fail. This can greatly improve the resilience of pages, but at the price of a degraded user experience.

Caching

Caching is not only valuable for reducing render times or load on underlying services, it can also help bridge error conditions. We built a caching library called cached on top of existing node.js modules that adds support for generating cached values in the background. Here’s an example of using cached to wrap a service call:

var _ = require('lodash');
var cached = require('cached'); // Groupon's node-cached module

var entityId = 'xy';
// myService is a configured gofer instance, as in the examples above
var loadEntity = cached.deferred(_.partial(myService.fetch, '/entity/' + entityId));
cached('myService').getOrElse(entityId, loadEntity, function(err, entity) {
  // 80% of the time this works every time
});

By configuring very high expiry times and low freshness times, we make sure that we have reasonably fresh data while pushing the actual service call latency out of the request dispatch and keeping the app responsive should the service ever go down.

The big gotcha is that this doesn’t do any cache invalidation, only expiry, so it’s not easily applicable to data where staleness is not acceptable.

Instrumentation

For instrumentation we use a central “gofer hub” that is shared between clients. It has two general responsibilities:

  1. Add headers for transaction tracing (X-Request-ID). This idea might have originated in the Rails world [citation needed]; Heroku has a nice summary. Additionally, every API request is assigned a unique fetchId which is passed down as the X-Fetch-ID header.

  2. Emit lifecycle events for all requests

The basic setup looks like this:

var http = require('http');
var Hub = require('gofer/hub');
var MyService = require('my-service-gofer');
var OtherService = require('other-service-gofer');

var hub = new Hub();

hub.on('success', function() {});
hub.on('socketQueueing', function() {});
// ...

http.createServer(function(req, res) {
  var requestId = req.headers['x-request-id'] || generateUUID();
  var config = { globalDefaults: { requestId: requestId } };

  var myService = new MyService(config, hub);
  var otherService = new OtherService(config, hub);

  myService.fetch('/some-url').pipe(res);
}).listen(process.env.PORT || 3000);

All available lifecycle events and the data they provide can be found in the API docs for gofer. At the very least they contain the requestId and the fetchId.

serviceName, endpointName, methodName

To more easily group requests, we use a hierarchy of service, endpoint, and method. This allows us to build graphs and monitoring with different levels of precision and to drill down when necessary to quickly find the cause of problems. By default they are the config section the gofer uses (serviceName), the path of the resource you access (endpointName), and the HTTP verb used (methodName).

To get nicer endpointNames we define all API calls we want to use in advance. This can be a little verbose but has the added benefit of being a free spell checker.

var goferFor = require('gofer');
var MyService = goferFor('myService'); // serviceName = myService
MyService.registerEndpoints({
  // endpointName = cats
  cats: function(request) {
    return {
      index: function(cb) {
        return request({ uri: '/cats', methodName: 'index' }, cb);
      },
      show: function(id, cb) {
        return request({ uri: '/cats/' + id, methodName: 'show' }, cb);
      },
      save: function(id, data, cb) {
        return request({ uri: '/cats/' + id, json: data }, cb);
      }
    };
  }
});

We can then use an instance of MyService like this:

// will be logged with serviceName=myService, endpointName=cats, methodName=index
myService.cats.index(cb);
// will be logged with serviceName=myService, endpointName=cats, methodName=put
myService.cats.save('kitty', { fur: 'fluffy' }, cb);

Since we didn’t provide an explicit methodName for the PUT call, it defaults to the HTTP method.

Want to give it a try?

You can find gofer on github and on npm today. There’s still a lot we want to improve and we’re excited to hear about your ideas on how to make API-based node apps simpler and more resilient. Let’s continue this discussion down in the comments or in Github issues!


Groupon Engineering Supports RailsGirls Summer of Code and Open Source

at August 12th, 2014

Rails Girls Summer of Code is a global fellowship program aimed at bringing more diversity into Open Source. Successful applicants from all over the world are paid a monthly stipend, from July-September, to work on Open Source projects of their choice. Groupon Engineering is supporting this great effort by contributing as a Bronze Sponsor, and recently hosted Rails Girls Santiago.

railsgirlssantiago photo

The workshop was started in Helsinki in 2010 as a one-time event, but the need for this sort of program was so strong that Rails Girls has now become an international event series, hosted by local chapters around the world. Rails Girls Summer of Code is happening for the second year and is about helping newcomers to the world of programming expand their knowledge and skills by contributing to a worthwhile Open Source project. The focus is not on producing highly sophisticated code, but rather on participants learning highly transferable skills from their project work.

Congratulations to our Berlin and Santiago teams for celebrating diversity and engaging in our global tech community! For more information on how you can be a part of this great organization check them out here: Rails Girls Summer of Code

The ETHOS Team

Engineering, Standards, Culture, Education and Engagement at Groupon


Groupon Engineering launches Otterconf’14

at August 10th, 2014

Groupon Engineering launched its first global front end developer conference: Otterconf’14. Groupon developers from our dev centers around the world flew into our Chicago office for two days of collaborating and sharing best practices around I-Tier, our global web platform.

Andrew Bloom, Adam Geitgey, Dan Gilbert and Antonios Gkogkakis, who all work on I-Tier, explain the significance of Otterconf’14:

I-Tier (Interaction Tier) is our distributed web front end architecture where pages or sometimes sets of pages work as separate applications. I-Tier is a great innovation that enabled us to re-architect our existing monolith, migrating Groupon’s U.S. web traffic from a monolithic Ruby on Rails application to a new Node.js stack, with substantial results.

Because of I-Tier, page loads are significantly faster across the Groupon site. Our development teams can also develop and ship features faster and with fewer dependencies on other teams, and we can eliminate redundant implementations of the same features in the different countries where Groupon is available. I-Tier also includes localization services, configuration services, a global routing layer, consumer behavior tracking and A/B testing.

Check out Jeff Ayars, VP of Engineering at Groupon, on our international rollout of I-Tier:

#otterconf, #grouponeng, #groupon, #I-Tier


Maven and GitHub: Forking a GitHub repository and hosting its Maven artifacts

at July 2nd, 2014

Two tools that the Android team frequently uses here are GitHub and Maven (and soon to be Gradle, but that’s another story). Groupon has one of the most widely deployed mobile apps in the world, and with more than 80 million downloads and 54% of our global transactions coming from mobile, it’s important that we have the right tools to keep our operation running smoothly all the time.

The Groupon Android team recently tried to move our continuous integration builds from a Mac Mini on one of our desks to the companywide CI cluster. We got held up during the process because the build simply wouldn’t run on the new VMs. We discovered the problem was a bug in the Maven Android plugin, which neglected to escape the filesystem path when executing ProGuard and caused our build to fail for paths that contained unusual characters. Instead of looking to change the paths of all of our builds company wide, we focused on what it would take to fix the bug in the plugin. Read on for details on how we managed this process.

Fixing the bug itself was easy. The harder part was figuring out how to distribute the fix while we waited for the plugin maintainer to review and merge it in. One of the great advantages of GitHub is that it makes it easy to fork a project and fix bugs. However, if you’re using Maven, you also need a way to host your fork’s artifacts in a Maven repo, and there isn’t a way to automatically do that with GitHub.

You could host your own Nexus server, but that’s a lot of overhead for just a simple fork if you don’t already have one. You could set up a cloud storage solution to hold your Maven repo, but the internet is already littered with defunct and broken Maven repo links – why add one more that you’ll have to maintain forever?

A better solution to this problem is to use GitHub to host your Maven repositories. A Maven repository is, at its heart, just a structured set of files and directories that are publicly available via HTTP, and GitHub allows you to do this easily with its raw download support. The same technique is used by GitHub itself to serve up GitHub Pages websites.
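For example, an artifact with the (hypothetical) coordinates com.example:mylib:1.0.0 is just served from a directory layout like this:

    com/example/mylib/maven-metadata.xml
    com/example/mylib/1.0.0/mylib-1.0.0.pom
    com/example/mylib/1.0.0/mylib-1.0.0.jar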

The basic solution involves three steps:

  1. Create a branch called mvn-repo to host your Maven artifacts.
  2. Use the Github site-maven-plugin to push your artifacts to Github.
  3. Configure Maven to use your remote mvn-repo as a maven repository.

There are several benefits to using this approach:

  • It ties in naturally with the deploy target so there are no new Maven commands to learn. Just use mvn deploy as you normally would.
  • Maven artifacts are kept separate from your source in a branch called mvn-repo, much like github pages are kept in a separate branch called gh-pages (if you use github pages).
  • There’s no overhead of hosting a separate Maven Nexus or cloud storage server, and your maven artifacts are kept close to your github repo so it’s easy for people to find one if they know where the other is.

The typical way you deploy artifacts to a remote maven repo is to use mvn deploy, so let’s patch into that mechanism for this solution.

First, tell maven to deploy artifacts to a temporary staging location inside your target directory. Add this to your pom.xml:

<distributionManagement>
    <repository>
        <id>internal.repo</id>
        <name>Temporary Staging Repository</name>
        <url>file://${project.build.directory}/mvn-repo</url>
    </repository>
</distributionManagement>

<build>
    <plugins>
        <plugin>
            <artifactId>maven-deploy-plugin</artifactId>
            <version>2.8.1</version>
            <configuration>
                <altDeploymentRepository>internal.repo::default::file://${project.build.directory}/mvn-repo</altDeploymentRepository>
            </configuration>
        </plugin>
    </plugins>
</build>

Now try running mvn clean deploy. You’ll see that it deployed your maven repository to target/mvn-repo. The next step is to get it to upload that directory to github.

Add your authentication information to ~/.m2/settings.xml so that the github site-maven-plugin can push to github:

<!-- NOTE: MAKE SURE THAT settings.xml IS NOT WORLD READABLE! -->
<settings>
  <servers>
    <server>
      <id>github</id>
      <username>YOUR-USERNAME</username>
      <password>YOUR-PASSWORD</password>
    </server>
  </servers>
</settings>

(As noted, please make sure to chmod 700 settings.xml to ensure no one can read your password in the file.)

Then tell the github site-maven-plugin about the new server you just configured by adding the following to your pom:

<properties>
    <!-- github server corresponds to entry in ~/.m2/settings.xml -->
    <github.global.server>github</github.global.server>
</properties>

Finally, configure the site-maven-plugin to upload from your temporary staging repo to your mvn-repo branch on github:

<build>
    <plugins>
        <plugin>
            <groupId>com.github.github</groupId>
            <artifactId>site-maven-plugin</artifactId>
            <version>0.9</version>
            <configuration>
                <message>Maven artifacts for ${project.version}</message>  <!-- git commit message -->
                <noJekyll>true</noJekyll>                                  <!-- disable webpage processing -->
                <outputDirectory>${project.build.directory}/mvn-repo</outputDirectory> <!-- matches distribution management repository url above -->
                <branch>refs/heads/mvn-repo</branch>                       <!-- remote branch name -->
                <includes><include>**/*</include></includes>
                <merge>true</merge>                                        <!-- don't delete old artifacts -->
                <repositoryName>YOUR-REPOSITORY-NAME</repositoryName>      <!-- github repo name -->
                <repositoryOwner>YOUR-GITHUB-USERNAME</repositoryOwner>    <!-- github username  -->
            </configuration>
            <executions>
              <!-- run site-maven-plugin's 'site' target as part of the build's normal 'deploy' phase -->
              <execution>
                <goals>
                  <goal>site</goal>
                </goals>
                <phase>deploy</phase>
              </execution>
            </executions>
        </plugin>
    </plugins>
</build>

The mvn-repo branch does not need to exist; it will be created for you.

Now run mvn clean deploy again. You should see the maven-deploy-plugin “upload” the files to your local staging repository in the target directory, then the site-maven-plugin commit those files and push them to the server.

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building DaoCore 1.3-SNAPSHOT
[INFO] ------------------------------------------------------------------------
...
[INFO] --- maven-deploy-plugin:2.5:deploy (default-deploy) @ greendao ---
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/1.3-SNAPSHOT/greendao-1.3-20121223.182256-3.jar (77 KB at 2936.9 KB/sec)
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/1.3-SNAPSHOT/greendao-1.3-20121223.182256-3.pom (3 KB at 1402.3 KB/sec)
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/1.3-SNAPSHOT/maven-metadata.xml (768 B at 150.0 KB/sec)
Uploaded: file:///Users/mike/Projects/greendao-emmby/DaoCore/target/mvn-repo/com/greendao-orm/greendao/maven-metadata.xml (282 B at 91.8 KB/sec)
[INFO] 
[INFO] --- site-maven-plugin:0.7:site (default) @ greendao ---
[INFO] Creating 24 blobs
[INFO] Creating tree with 25 blob entries
[INFO] Creating commit with SHA-1: 0b8444e487a8acf9caabe7ec18a4e9cff4964809
[INFO] Updating reference refs/heads/mvn-repo from ab7afb9a228bf33d9e04db39d178f96a7a225593 to 0b8444e487a8acf9caabe7ec18a4e9cff4964809
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.595s
[INFO] Finished at: Sun Dec 23 11:23:03 MST 2012
[INFO] Final Memory: 9M/81M
[INFO] ------------------------------------------------------------------------

Visit github.com in your browser, select the mvn-repo branch, and verify that all your binaries are now there.

mvn-repo

Congratulations!

You can now deploy your Maven artifacts to a poor man’s public repo simply by running mvn clean deploy.

There’s one more step you’ll want to take, which is to configure any poms that depend on your pom to know where your repository is. Add the following snippet to any project’s pom that depends on your project:

<repositories>
    <repository>
        <id>YOUR-PROJECT-NAME-mvn-repo</id>
        <url>https://raw.github.com/YOUR-USERNAME/YOUR-PROJECT-NAME/mvn-repo/</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>always</updatePolicy>
        </snapshots>
    </repository>
</repositories>

Now any project that requires your jar files will automatically download them from your Github Maven repository.

Here at Groupon, we used this technique to fix our bug in Maven Android plugin, and then easily shared our fork with the wider Android community until the fix was incorporated upstream. This works particularly well for projects that are no longer being maintained since those projects may never get around to merging in your pull requests.

We hope this makes forking Maven repos in GitHub as easy for you as it has been for us!


Introducing Geekon Talk – Seattle

at July 2nd, 2014

GEEKon_Talk_logo

Groupon Engineering Seattle is an important hub for us with continued recent expansion and growth. Groupon Engineering generally thrives on the exchange and cross-connection of ideas leading to new technologies, approaches and breakthroughs.

Accordingly, we’re excited to be launching our Geekon Talk series in our Seattle office.

The Geekon Talk series provides a platform for outside speakers to come into Groupon to present ideas and information on everything related to tech, entrepreneurship, tools, product design and start-ups.

Join us for our kick-off talk Tuesday, July 8. David Blevins will provide a code-first tour of TomEE, including quickly bootstrapping REST projects, doing proper testing with Arquillian, and setting up environments.

Apache TomEE is the Java EE version of Apache Tomcat and offers out-of-the-box integration with JAX-RS, JAX-WS, JMS, JPA, CDI, EJB and more. Finally Tomcat goes beyond just Servlets and JSP. Save time chasing down blog posts, eject libs from your webapps and start development with a more complete stack that is extremely well tested.

This will be a fun and lively talk with many engineering cross-connections. Sign up here.

When: Tuesday, July 8, 12:00pm -1:00pm Pacific. Where: Groupon Seattle, 505 Fifth Avenue South, Suite 310, Seattle, WA, 98104. What: Lively presentation and conversation (food will also be provided.)


Mobile Test Engineering – Odo

at June 26th, 2014

Earlier this year we told you about Odo, an innovative mobile test engineering tool we developed here at Groupon to overcome some of the challenges involved in testing our mobile app, which more than 80 million people have downloaded worldwide. We’re excited to tell you that Odo is now available at our Groupon Github home!

The struggle is real

Our engineering teams had to come up with a way to build engaging mobile experiences quickly for our end users. Along the way, our Mobile Test Engineering team started to come across common problems in this space, and we were not able to find an existing solution that would fit our needs. Specifically, we attempted to tackle some of these challenges:

  1. Development and testing of new features without dependent service APIs being available yet.
  2. We gravitated towards using realistic data, but needed more than a traditional mock server. While a mock may work, you will need to update the stub data whenever the service changes. In a SOA world, that maintenance can come at a huge cost.
  3. We needed a mock server with an HTTP-based interface so we could integrate with our end-to-end test suites, as well as one that could easily integrate into development builds. We wanted to be able to share the manipulated data corpus across dev and test. What’s the point in reinventing the wheel, right?
  4. We had to simulate complex scenarios that may not be possible using static data. Example: three requests are made to an API. The first two are successful, but the third fails.

Oh boy! It’s Odo!

Odo is a man-in-the-middle for client and server communication. It can easily be used as a mock/stub server or as a proxy server. Odo is used to modify the data as it passes through the proxy server to simulate a condition needed for test. For example, we can request a list of Groupon deals and set the “sold out” condition to true to verify our app is displaying a sold out deal correctly. This allows us to still preserve the state of the original deal data that other developers and testers may be relying on.

The behaviors within Odo are configurable through a REST API and additional overrides can be added through plugins, so Odo is dynamic and flexible. A tester can manually change the override behaviors for a request, or configurations can even be changed at runtime from within an automated test.

Odo In A Nutshell

As a request passes through Odo, the request’s destination host and URI are matched against the hosts and paths defined by the configuration. If a request matches a defined source host, the request is updated with the destination host. If the host name doesn’t match, the request is sent along to its original destination. If the URI does not match a defined path, the request is executed for its new host. If the host and path match an enabled path, the enabled overrides are applied during its execution phase. There are two different types of overrides in Odo.

A Request Override is an override applied to a request coming from the mobile client on its way to the server. Generally, an override here will be for adding/editing a parameter or modifying request headers.

A Response Override executes on the data received from the server before passing data back to the client. This is where we can mock an API endpoint, change the HTTP response code, modify the response data to simulate a “sold out” Groupon deal, change the price, etc.

Automation Example

Client client = new Client("API Profile", false);
client.setCustomResponse("Global", "response text");
client.addMethodToResponseOverride("Global", "com.groupon.proxy.internal.Common.delay");
client.setMethodArguments("Global", "com.groupon.proxy.internal.Common.delay", 1, 100);

In this example, we are applying a custom response (stub) with value “response text” to a path defined with a friendly name “Global”. Then we add a 100ms delay to that same path. Go nuts! Try adding a 10 second delay to simulate the effects of network latency on your app.

Benefits

Odo brings benefits to several different areas of development:

  • Test automation owners can avoid stub data maintenance, and tests gain the ability to dynamically configure Odo to simulate complex or edge-case scenarios.

  • Manual Testers gain the same ability to configure Odo through the UI. No coding knowledge is required.

  • Developers can use Odo to avoid dependency blocks. If a feature depends on an API currently in development, Odo can be used to simulate the missing component and unblock development.

  • Multiple client support. With this feature, a single Odo instance can be run, but have different behaviors for each client connected. This allows a team to run a central Odo instance, and the team members can modify Odo behavior without affecting other members’ configurations.

Broader Odo Adoption

With Odo’s flexibility, our Test Engineering teams have also adopted it for testing our interaction tier applications built on Node.js (you know, the ones powering the frontend of groupon.com). We even have an internal fork that scales so we can use it as part of our capacity testing efforts. We’ll push that update in the near future as well.

Simple as 1-2-3

Our Github page provides some getting started instructions to get you up and running quickly. All you need is Java (7+) and odo.war from our release. There is a sample to give you an idea of how to apply everything Odo has to offer to your projects.

Links

Github – https://github.com/groupon/odo

Readme – https://github.com/groupon/odo/blob/master/README.md

Download – https://github.com/groupon/odo/releases

Community Collaboration

We have a roadmap of new features we’ll be adding to Odo over time. We’d love your feedback and contribution towards making Odo a great project for everyone.