Groupon recently announced gross billings of $1,492,882,000 for Q2 2016: that's roughly $16 million charged through our systems every single day of the quarter. There is a lot of complexity involved in processing that volume, which we're going to explore in this blog post.
Before going into details, let's first review how our payment system is set up. Note that Groupon is partitioned into several datacenters, each serving a specific region of the world, but for simplicity we will focus here on our North America deployment only.
Two years ago, Groupon decided to switch to Kill Bill, the open-source billing and payments system. All of the traffic is managed by seven Kill Bill instances (virtual machines) sharing a single MySQL database (dedicated hardware, running on SSDs, typical master/slave replication). We run it on Java 8 and Tomcat 8, a setup very similar to the open-source Kill Bill Docker images we publish, which we hope to migrate to internally one day.
Regarding our choice of database, since it is a hot topic these days: we love PostgreSQL too! But we simply have more in-house MySQL expertise. Besides, our Kill Bill philosophy is to keep the DAO layer as simple as possible: no DELETE, no UPDATE (except for two-phase-commit scenarios), no JOIN, no stored procedures. We simply store and retrieve data using straightforward queries, while relying on the database (e.g. InnoDB) to provide strong transactional semantics.
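To illustrate the insert-only idea, here is a minimal sketch (using SQLite for self-containment; the table and column names are made up for this post and are not Kill Bill's actual schema): instead of updating a payment row in place, every state change is appended as a new row, and reads simply fetch the latest one.

```python
import sqlite3

# In-memory database for illustration; the real system uses MySQL/InnoDB.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payment_states (
        record_id INTEGER PRIMARY KEY AUTOINCREMENT,
        payment_id TEXT NOT NULL,
        state TEXT NOT NULL
    )
""")

def record_state(payment_id, state):
    # Append-only: no UPDATE, no DELETE -- every change is a new row.
    conn.execute("INSERT INTO payment_states (payment_id, state) VALUES (?, ?)",
                 (payment_id, state))

def current_state(payment_id):
    # The latest row wins; a straightforward single-table query, no JOIN.
    row = conn.execute(
        "SELECT state FROM payment_states WHERE payment_id = ? "
        "ORDER BY record_id DESC LIMIT 1", (payment_id,)).fetchone()
    return row[0] if row else None

record_state("pmt-1", "AUTH_PENDING")
record_state("pmt-1", "AUTH_SUCCESS")
print(current_state("pmt-1"))  # AUTH_SUCCESS
```

A nice side effect of this style is that the full history of every payment is preserved for free, which matters for auditing and reconciliation.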
Only a couple of internal services trigger payments, all of them using the open-source Ruby client.
Kill Bill offers a pluggable architecture: different plugin types (Payment Plugins, Control Plugins, Notification Plugins, etc.) can be used to modify the behavior of the system at runtime. At Groupon, each Kill Bill instance runs ten plugins: four open-source ones (integrations with our payment gateways, plus the analytics plugin, which generates data for analytics and finance) and six Groupon-specific ones (with very little business logic, they are mainly used to integrate other internal systems, like our message bus).
Thanks to our proxy tokenizer, Kill Bill falls outside of PCI scope (we only see our Groupon tokens).
Our first challenge is making sure the system is always up and running. While this is obviously a requirement common to all production systems, any downtime has a direct and significant financial impact: if it takes 15 minutes to be woken up by PagerDuty, open your laptop and log in to the VPN, and the system is down during that time, that's a potential loss of roughly $170,000. And unlike subscription-based businesses, we cannot always retry the payments at a later time, because not all payment methods can be stored on file (e.g. PayPal).
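A quick back-of-the-envelope check of that figure (the day count is an assumption: Q2 spans April through June, i.e. 91 days):

```python
quarterly_billings = 1_492_882_000  # Q2 2016 gross billings, USD
days_in_quarter = 91                # April + May + June

per_day = quarterly_billings / days_in_quarter
per_minute = per_day / (24 * 60)
outage_minutes = 15

print(round(per_day))                      # roughly $16.4M charged per day
print(round(per_minute * outage_minutes))  # roughly $170K at risk in 15 minutes
```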
Luckily, the payment system doesn’t need to sustain a high throughput by today’s standards: in the US, on a typical day, Kill Bill processes on average 7.5 payment transactions per second, with a peak of 12.5 payment transactions per second.
Typical daily traffic in North America
Given our setup, each node needs to process only about one or two payments per second. With seven nodes, we're largely over-provisioned, but this gives us peace of mind for daily bursts and hardware failures.
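The per-node figures follow directly from the traffic numbers above:

```python
avg_tps, peak_tps = 7.5, 12.5  # observed North America payment transactions/second
nodes = 7

print(avg_tps / nodes)   # just over 1 payment/second per node on average
print(peak_tps / nodes)  # under 2 payments/second per node at peak
```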
Holiday periods (Thanksgiving, Christmas, etc.) put a heavy load on the system, however, which requires us to run regular load tests in production. Thanks to Kill Bill's multi-tenancy feature, where data is compartmentalized into individual tenants, we use a dedicated test tenant so the test data doesn't impact financial reporting. Our last round verified we can sustain 120 payments per second.
Additionally, we face the typical challenges of running a JVM (e.g. GC tuning), with the twist that, because about half of our Kill Bill plugins are written in JRuby, the set of flags we can enable is limited (we have had issues when enabling invokedynamic, for example). Having seven nodes lets us A/B test these settings: as an example, we currently have two groups of two nodes with different JVM flags, which we monitor against our baseline to reduce GC latency.
Monitoring the impact of JVM changes on Tuesday
Multiple merchant accounts
Groupon has several very different lines of business: we sell vouchers, physical goods, and vacation packages. For accounting reasons, each line of business uses a dedicated merchant account. Also, acquirers and payment processors prefer that businesses with different characteristics, such as volume, average order size, product category, and payment method, use different accounts and different merchant category codes. Their fraud controls and reward programs can be more precise when orders are more alike.
To solve this, we use a Kill Bill Control Plugin to determine at runtime which line of business the payment is for and to tell the payment plugin to look up the credentials for that specific merchant account. This selection needs to be sticky: any follow-up transactions for the same payment (capture, refund, etc.) must be associated with the same merchant account. The association also needs to be persisted and reflected in our financial reports.
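The sticky selection can be sketched as follows (a simplified illustration only: the account names and the in-memory store are hypothetical, and the real logic lives in a Control Plugin that persists its choice to MySQL):

```python
# Hypothetical mapping from line of business to merchant account.
MERCHANT_ACCOUNTS = {
    "vouchers": "merchant-vouchers-us",
    "goods": "merchant-goods-us",
    "travel": "merchant-travel-us",
}

# payment_id -> merchant account, persisted so that follow-up transactions
# (capture, refund, ...) reuse the account chosen at authorization time.
_assignments = {}

def merchant_account_for(payment_id, line_of_business):
    if payment_id in _assignments:
        return _assignments[payment_id]            # sticky: reuse prior choice
    account = MERCHANT_ACCOUNTS[line_of_business]  # first transaction: select...
    _assignments[payment_id] = account             # ...and persist the choice
    return account

auth = merchant_account_for("pmt-42", "goods")
refund = merchant_account_for("pmt-42", "goods")
print(auth == refund)  # True: both hit the same merchant account
```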
Multiple payment providers
Because we accept so many different payment methods in so many different countries, we cannot rely on a single payment processor. Another Kill Bill Control Plugin routes each request at runtime to the right provider. Because Kill Bill hides the complexity of the different gateways behind a single Payment API, the routing is transparent to the clients.
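At its simplest, such routing is a lookup keyed on request attributes. This sketch uses made-up processor names and an invented two-key routing table; the real plugin considers more dimensions:

```python
# Hypothetical routing table: (country, payment method) -> processor.
ROUTES = {
    ("US", "credit_card"): "processor_a",
    ("US", "paypal"): "paypal_gateway",
    ("DE", "credit_card"): "processor_b",
}
DEFAULT_PROCESSOR = "processor_a"

def route(country, payment_method):
    # Clients never see this choice: they call the single Kill Bill
    # Payment API, and a Control Plugin picks the gateway at runtime.
    return ROUTES.get((country, payment_method), DEFAULT_PROCESSOR)

print(route("DE", "credit_card"))  # processor_b
```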
On that topic, testing is done in several stages. First, each plugin is independently unit tested, outside of Kill Bill, using TestNG or RSpec. Second, we maintain an internal repository (codename Kill-IT) of integration tests, which contains not only tests against Kill Bill directly (using the open-source Java client), but also against our clients (such as the Orders system, which in turn calls Kill Bill in our QA environment). Finally, we work with various QA teams on end-to-end testing against our website (desktop and mobile versions) as well as our mobile applications, because not all payment methods are available through all clients or in all countries, Apple Pay being a good example.
Multiple payment methods
We support over 100 payment methods in dozens of countries, forcing us to implement various payment flows, such as synchronous and asynchronous (e.g. 3-D Secure) credit card transactions, or hosted payment pages. We designed the Kill Bill API to be generic, so that introducing a new flow is mostly transparent to our clients (except, in some scenarios, to the front-end and mobile applications, which need to support various redirects).
Additionally, not all payment methods behave the same. Most requests have high latency (a typical API call to a gateway takes anywhere between 1 and 2 seconds), but there is also huge variance, even for credit cards.
Credit card latency
An authorization call, for example, is synchronous all the way to the card issuer: timing depends on the issuing bank (we've seen requests take 10 seconds or more for some local credit unions, for instance).
Latency depending on the issuing bank
We have to keep this in mind when tweaking the various timeout parameters between the clients and Kill Bill, and between Kill Bill and our providers.
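A sketch of the arithmetic behind that tuning (the latency figures come from the observations above; the specific margins are illustrative assumptions, not our production values):

```python
# Timeouts must nest: client -> Kill Bill -> gateway. Each outer timeout
# needs headroom over the inner one, or the outer caller gives up while
# the inner call (and the money movement) is still in flight.
gateway_timeout = 15.0   # seconds; above the ~10s outliers from some issuers
killbill_overhead = 2.0  # assumed budget for plugins and persistence
client_margin = 3.0      # assumed network and queuing margin

killbill_timeout = gateway_timeout + killbill_overhead
client_timeout = killbill_timeout + client_margin

print(killbill_timeout, client_timeout)  # 17.0 20.0
```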
Moreover, depending on the payment method and country, we have to obey local laws. For example, in some European countries it is mandatory to ask the user's permission before storing a card on file.
Finally, we constantly look at our data to understand how we can minimize our transaction fees while maximizing our acceptance rates. Sometimes, transactions are refused by the bank for no obvious reason (very often, the customer's card issuing bank will reject the transaction and return a generic error message like "Do Not Honor"), which can be frustrating to the user who is trying to check out on the website.
We know, for example, that various card types (credit vs. debit, high-end rewards cards vs. low-end cards) perform differently depending on the issuing bank. Each issuer also performs differently depending on the line of business.
Part of our work is to do everything we can to ensure the transaction goes through. This means making subtle changes to the request sent to the payment gateway, selecting the right acquirer, being smart about how we retry, etc. At a high level, this is done in an Experiment (Control) Plugin; we will describe the technical details in a future blog post.
Any optimization of our traffic typically involves experiments with control groups validated by chi-square tests (if that's your thing, we're hiring!). Sometimes we even have to pick up the phone and talk to the banks, explaining how much work we do to prevent fraud and that our transactions are indeed legitimate.
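For readers less familiar with the statistics: a chi-square test on a 2x2 contingency table tells you whether an observed difference in acceptance rates between a control group and a variant is likely to be real or just noise. A minimal sketch, with entirely made-up counts:

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table:
                 approved  declined
        control     a         b
        variant     c         d
    """
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Made-up experiment: did a retry tweak move the acceptance rate?
control = (9_000, 1_000)  # approved, declined
variant = (9_150, 850)
stat = chi_square_2x2(*control, *variant)

# 3.841 is the 95th percentile of chi-square with 1 degree of freedom.
print(stat > 3.841)  # True -> the difference is statistically significant
```

In practice one would reach for a statistics library rather than hand-rolling the formula, but the computation itself is this simple.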
There is a lot more that hasn't been covered, such as rolling deployments, failover across datacenters, fraud systems, financial reporting, etc., but I hope this gives you a glimpse of what it takes to process payments at scale. Feel free to reach out if you have any specific questions!