It’s a crucial question in ecommerce: How likely is customer X to buy product Y? For Local, we must of course consider the physical locations of both X and Y. This is the location relevance problem, which is one of the most important ingredients in determining the best deals for each of our users. When we send out emails or return search query results, the deals that we display have to be relevant. To solve this problem we need to know our users’ propensity to travel for the different services that we offer, and having an accurate measurement of these travel patterns helps us to understand demand and thus optimize our sales force.
We cannot simply assume that users will want to stay in their home neighborhoods. People want to get out and explore, and we want them to check Groupon first. One approach is to determine location relevance based on simple distance, but this is an over-simplification. We know that people flock to local hotspots and avoid certain neighborhoods. They are also more likely to travel farther for a rare service, a pricey restaurant, and many leisure activities such as museums and waterparks. Fortunately, we can capture these trends directly from the data. Here’s how.
Getting the data
For this analysis, we require data pertaining to where our users are located and what they have purchased. For this we leverage the fact that users can voluntarily provide us with their locations in the form of zipcodes or full addresses. For each historical order data point, there are several variables that we want to track due to their importance in determining whether or not someone is willing to travel for a Local deal. For this analysis we consider the following:
- the location of the user
- the location of the merchant
- the service that the merchant provides
- the price of the deal
In order to compute useful empirical probabilities regarding how likely users are to travel, we need to group the data points into bins. For the locations, we partition the world into markets (metropolitan areas plus the surrounding suburbs), which are further divided into “submarkets.” For the deals, we have a hierarchical taxonomy with three levels of merchant attribution: the type of services offered (e.g. milkshakes or body wraps), the merchant type (e.g. “Bakery & Desserts” or ”Spa Services”), and the general merchant category (e.g. “Food & Drink” or “Beauty / Wellness / Healthcare”). We further define price bins such as “$0-25″ on the low end and “$100+” on the high end.
Thus for each user we have a market and a submarket, and for each deal we also have a market and a submarket, in addition to a service (with the associated merchant type and category) and a price bin. Then for each <service, price bin, deal location, user location> combination we can empirically determine the odds that the user will be willing to travel for the type of deal in question, based on what has occurred in the past.
Based on these odds we can determine the most popular travel patterns, which will tell us where each city’s hotspots are located. We can further define an effective radius for each individual service, thereby determining how far users are typically willing to travel for, say, aerobics versus paintball.
Not so fast…
There are several subtleties to this analysis. For starters, many users and merchants will have multiple locations in our database. This can happen for instance when a user has multiple addresses registered with us and when merchants have multiple locations. To work around this, we assume the <user location, deal location> pair that corresponds to the shortest distance. In other words, if a user buys a deal that is closer to their home than to their workplace, we assume that they’re traveling from the former. Similarly, if someone buys a deal that has multiple locations, we assume that they’re going to redeem it at the location that is closest to them.
A bigger problem is data sparsity. Given the extremely broad variety of services that we offer, we find that some of our <service, price bin, deal location> combinations have too few data points, and thus a poor sampling of the relevant locations of the travel-willing users. For instance, in Chicago and the surrounding suburbs we have 14 submarkets. Thus if we wanted to determine which submarkets’ users are buying mid-price downtown yoga deals, we need to have far more than 14 data points to get a good sampling. We work around this problem by using geography-independent fallbacks, utilizing our taxonomy. For instance if we lack sufficient data at the <service, price bin> level, then we collapse out the price bin and only consider the service. If we still lack sufficient data, we then fall back a level in our taxonomy and use the merchant type or the even more granular merchant category.
Another important issue is outlier deals. Especially amazing deals might draw users from a much wider radius than is typical, which would skew our results. To deal with this we use outlier removal to exclude the very top- and bottom- performing historical deals from our dataset.
For each <service, price bin, user location, deal location> combination, our result is the probability that a user from that location would be willing to travel for a deal of that service, price bin, and location. To be sure that we’re not just seeing noise and that these travel flows are actual organic tendencies, we say that a flow is important enough to be deemed “travel-worthy” if this probability reaches a threshold of 20% or more. This level was found to be aggressive enough to leave us with only the truly statistically significant flows, yet low enough to give us sufficient useful information on the travel patterns for each city and service.
As expected, we find that users are indeed inclined to travel beyond their home neighborhoods, and that those travel propensities depend on where they live. For instance, the median distance Chicago users are willing to travel is about 5 miles. However, this median depends strongly on user submarket, and is under 2 miles for downtown users but approximately 12 miles for users from the South suburbs.
Where are these users traveling, and for what? Our results tell us these hotspots as a function of service, and we find that they depend on a combination of merchant density and merchant quality. For more common services, such as steakhouses, we find that users generally travel from areas of lower to higher merchant density. However, users are also willing to travel for particularly great merchants, and this preference is more likely to dictate the travel patterns for more unique services, like museums.
For example, one service in our “Food & Drink” category is “Cupcake.” We can query our results to give us the travel worthy patterns for this service for Chicago and the lowest price bin.
Chicago’s cupcake hotspots, here marked with red dots and defined as submarkets having at least 3 travel patterns ending there, all contain regions with the highest densities of cupcake-providing merchants. Similarly, for steakhouses we find that the submarkets containing the Magnificent Mile and Naperville are the major hotspots, as any Chicagoan might expect.
Which <service, deal location> combinations draw the most travelers? In Chicago in particular, we find that people are flocking to a Murder Mystery dinner spot in the West suburbs, a Hawaiian restaurant in the North suburbs, and the Field Museum and Segway tours downtown.
Despite our strict travel-worthy threshold, we still must verify that these travel patterns are actual organic travel tendencies, as opposed to being due to gaps in our inventory. Fortunately using census data for each submarket we find a strong correlation between the resident and business densities and the number of travel patterns that end there, indicating that our hotspots reflect actual travel patterns and are not biased by Groupon offerings. Still there are two behaviors that we need to stay on the lookout for:
- regions with low business density but high travel worthiness that are getting more than their fair share of deals
- regions with high business density but low travel worthiness that are getting less than their fair share of deals
In this way we can keep an eye on our inventory and troubleshoot as needed.
As mentioned above, we can also define an effective radius for each <service, price bin, location> combination, to determine how far users from said location are willing to travel for certain types of deals. We define this to be the 75th percentile of all of the relevant user-deal distances in our historical data set. By doing so we find that the lowest-radius services tend to include everyday fitness activities like aerobics classes, gym memberships, and spinning classes, whereas the highest-radius services tend to include weekend leisure activities such as white water rafting, off-roading, and skiing.
This analysis was performed entirely with subscription and order data, and thus it was limited to a study of the interplay of merchant and user home locations. The expansion of our mobile business provides a huge opportunity for further tracking of travel patterns. Assuming users have given us their explicit consent to track it, GPS data gives us a much finer-grained picture of their behavior, thereby enabling us to learn where users are when they open the app and where they are when they place orders. Coupled with Gnome, we can further empower merchants to build stronger ties with customers who routinely travel to their neighborhoods.