Groupon @ GHC: Pratyusha J

October 6th, 2016

Groupon is excited to be a sponsor for the Grace Hopper Celebration. Leading up to the conference, we’ll introduce you to some of our employees who are attending. Make sure you say hello when you see us!

Pratyusha J.
Software Development Engineer, iOS

What does being a woman in tech mean to you?

As a woman in technology, I look forward to encouraging young girls to enter technical fields, promoting diversity, and building better products.

What does GHC mean to you?

GHC is a great place to meet and interact with successful women in tech, and I am looking forward to it.

Where will we be seeing you around GHC?

I will be at the career fair and the emerging technology track.


Groupon @ GHC: Alan D.

October 5th, 2016

Groupon is excited to be a sponsor for the Grace Hopper Celebration. Leading up to the conference, we’ll introduce you to some of our employees who are attending. Make sure you say hello when you see us!

Alan D.
Sr. Engineering Manager & New Father of a Daughter

What does GHC mean to you?

We have to do a better job of identifying when and why young girls lose interest in STEM and creating a plan to keep them engaged. The Grace Hopper Celebration is a safe place where we can gather, talk about these types of issues, and devise that plan.

Where will we be seeing you around GHC?

I can be found at the Groupon career fair booth.


Groupon @ GHC: Mariko W.

October 4th, 2016

Groupon is excited to be a sponsor for the Grace Hopper Celebration. Leading up to the conference, we’ll introduce you to some of our employees who are attending. Make sure you say hello when you see us!

Mariko W.
Software Engineer

Why is it important to you for (more) women to work in technology?

Technology can help almost everyone and anyone in the world, so it’s important to understand the needs, problems, and solutions from diverse perspectives. A diverse team will build more influential and inclusive solutions for a wider audience. Unfortunately, women are still severely underrepresented in engineering today. More women and other minorities joining the field means solving more meaningful problems for everyone.

What does GHC mean to you?

I attended my first Grace Hopper Celebration when I was a junior in college, and it was a big motivation for me to become a software engineer. After a little over two years in computer science, I was doubting whether it was the right field for me. Not only was it a challenging field of study, but with women comprising no more than 10% of my class, it was difficult to meet friends and mentors in a similar situation. It was inspiring to see so many fellow students in the same boat and experienced professionals who were already making contributions in the field.

Every year since then, Grace Hopper has given me the courage to keep working on becoming a better engineer because I’ve seen so many great examples. I’m happy to attend Grace Hopper this year as a professional software engineer, and I hope to encourage women who are in a similar situation as I was when I was a student.

Where will we be seeing you around GHC?

I’ll be at the career fair and most of the software engineering track sessions.


Groupon @ GHC: Shilpa S.

October 3rd, 2016

Groupon is excited to be a sponsor for the Grace Hopper Celebration. Leading up to the conference, we’ll introduce you to some of our employees who are attending. Make sure you say hello when you see us!

Shilpa S.
Software Development Engineer

What does GHC mean to you?

As a woman in engineering, what makes me happy is seeing new engineers, especially female engineers, focus on growth, innovation, and learning rather than on fear of bias. We know we have created a better environment for them when their worries are similar to those of their male allies. The Grace Hopper Celebration is a time to regroup, re-energize, and reconfirm our place in the world of engineering. It’s like a vacation one really needs.

Where will we be seeing you around GHC?

I can be found at the Groupon career fair booth.


The Grace Hopper Celebration is coming!

September 28th, 2016

Groupon is pleased to be a sponsor of the Anita Borg Institute’s Grace Hopper Celebration of Women in Computing again this year. I’m Emily Wilson, a manager in the engineering department here at Groupon, and I’m looking forward to sharing the experiences of Groupon employees, myself included, at this year’s conference.

Grace Hopper Celebration (GHC) is the world’s largest conference for women technologists. Last year, the conference was attended by nearly 12,000 people. As someone who didn’t anticipate a tech-centric career, I was amazed to look around and appreciate the number of women in attendance, while keeping in mind that women are still underrepresented in technology. I had a blast, and I’m counting down the days to GHC ’16!

Keep checking in here leading up to, during, and after GHC. Follow @grouponjobs on Twitter as well for real-time updates during the conference. And if you too will be at GHC, stop by our booth at the GHC Expo to learn more about what we’re up to!


Processing Payments At Scale

August 23rd, 2016

Groupon recently announced gross billings of $1,492,882,000 for Q2 2016 — that’s roughly $17M charged through our systems every single day of the quarter. Processing that volume involves a lot of complexity, which we’re going to explore in this blog post.

Setup overview

Before going into details, let’s first review how our payment system is set up. Note that Groupon is partitioned into several datacenters, each serving a specific region of the world, but for simplicity we will focus here on our North America deployment only.

Two years ago, Groupon decided to switch to Kill Bill, the open-source billing and payments system. All of the traffic is managed by seven Kill Bill instances (virtual machines) sharing a single MySQL database (dedicated hardware, running on SSDs, with typical master/slave replication). We run it on Java 8 and Tomcat 8, a setup very similar to the open-source Kill Bill Docker images we publish, which we hope to migrate to internally one day.

Regarding our choice of database, since it is a hot topic these days: we love PostgreSQL too! But we simply have more in-house MySQL expertise. Besides, our Kill Bill philosophy is to keep the DAO layer as simple as possible: no DELETE, no UPDATE (except for two-phase-commit scenarios), no JOIN, no stored procedures. We simply store and retrieve data using straightforward queries, relying on the database (e.g. InnoDB) to provide us with strong transactional semantics.
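
To illustrate that philosophy, here is a deliberately simplified sketch of an insert-only DAO. The table and class are invented for this post, not Kill Bill’s actual schema or code: every state change is a new row, and reads simply pick the latest one.

    require 'mysql2' # assumed client library for this sketch

    class PaymentStateDao
      def initialize(client) # a Mysql2::Client
        @client = client
      end

      # Every state change is an INSERT: no UPDATEs, no DELETEs. The full
      # history is preserved, and InnoDB gives us transactional writes.
      def record_state(payment_id, state)
        @client.prepare(
          'INSERT INTO payment_states (payment_id, state, created_at) VALUES (?, ?, NOW())'
        ).execute(payment_id, state)
      end

      # Reads stay equally simple: the latest row wins. No JOINs, no
      # stored procedures.
      def current_state(payment_id)
        row = @client.prepare(
          'SELECT state FROM payment_states WHERE payment_id = ? ORDER BY id DESC LIMIT 1'
        ).execute(payment_id).first
        row && row['state']
      end
    end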

Only a couple of internal services trigger payments, all of them using the open-source Ruby client.
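
For a flavor of what triggering a payment looks like, here is a minimal sketch based on the killbill_client gem’s documented API; the credentials and account id are placeholders, and exact signatures may vary between versions:

    require 'killbill_client'

    KillBillClient.url = 'http://127.0.0.1:8080'

    # Placeholder credentials: Kill Bill is multi-tenant, so every call
    # carries a tenant API key/secret alongside basic-auth credentials.
    options = {
      username:   'admin',
      password:   'password',
      api_key:    'tenant_key',
      api_secret: 'tenant_secret'
    }

    account_id = '268983f2-5443-47e4-a967-b8962fc699c5' # placeholder

    # Authorize $10.00 against the account's default payment method.
    transaction          = KillBillClient::Model::Transaction.new
    transaction.amount   = '10.00'
    transaction.currency = 'USD'
    transaction.auth(account_id,
                     nil,               # nil selects the default payment method
                     'payment_service', # audit fields: user, reason, comment
                     'checkout',
                     nil,
                     options)
    # Captures and refunds follow the same Transaction-based pattern.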

Kill Bill offers a pluggable architecture: different plugin types (Payment Plugins, Control Plugins, Notification Plugins, etc.) can be used to modify the behavior of the system at runtime. At Groupon, each Kill Bill instance runs ten plugins: four open-source ones (the plugins integrating our payment gateways, plus the analytics plugin, which generates data for analytics and finance) and six Groupon-specific ones (these contain very little business logic and are mainly used to integrate other internal systems, like our message bus).

[Diagram: Groupon’s Kill Bill deployment]

Thanks to our proxy tokenizer, Kill Bill falls outside of PCI scope (we only see our Groupon tokens).

High availability

Our first challenge is making sure the system is always up and running. While this is obviously a requirement common to all production systems, any downtime here has a direct and significant financial impact: if it takes 15 minutes to be woken up by PagerDuty, open your laptop, and log in to the VPN, and the system is down during that time, that’s a potential loss of $175,000. And unlike subscription-based businesses, we cannot always retry the payments later, because not all payment methods can be stored on file (PayPal, for example).

Luckily, the payment system doesn’t need to sustain a high throughput by today’s standards: in the US, on a typical day, Kill Bill processes on average 7.5 payment transactions per second, with a peak of 12.5 payment transactions per second.

[Graph: typical daily traffic in North America]

Given our setup, each node needs to process only one or two payments per second. With seven nodes, we’re largely over-provisioned, but this gives us peace of mind for daily bursts and hardware failures.

Holiday periods (Thanksgiving, Christmas, etc.) put a heavy load on the system, however, which requires us to run regular load tests in production. Thanks to Kill Bill’s multi-tenancy feature, which compartmentalizes data into individual tenants, we use a dedicated test tenant so the test data doesn’t impact financial reporting. Our last round verified that we can sustain 120 payments per second.

Additionally, we face the typical challenges of running a JVM (e.g. GC tuning), with the twist that because about half of our Kill Bill plugins are written in JRuby, the set of flags we can enable is limited (we have had issues when enabling invokedynamic, for example). Having seven nodes lets us A/B test these settings: for instance, we currently have two groups of two nodes with different JVM flags, which we monitor against our baseline to reduce GC latency.

[Graph: monitoring the impact of JVM changes on Tuesday]

Multiple merchant accounts

Groupon has several very different lines of business: we sell vouchers, physical goods, and vacation packages. For accounting reasons, each line of business uses a dedicated merchant account. Also, acquirers and payment processors prefer that businesses with different characteristics — such as volume, average order size, product category, and payment method — use different accounts and different merchant category codes: their fraud controls and reward programs can be more precise when the orders are more alike.

To solve this, we use a Kill Bill Control Plugin to determine at runtime which line of business the payment is for and to tell the payment plugin to look up the credentials for that specific merchant account. This selection needs to be sticky, meaning any follow-up transactions for the same payment (capture, refund, etc.) have to be associated with the same merchant account. The association also needs to be persisted and reflected in our financial reports.
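
Conceptually, the selection looks something like the following simplified sketch. This is plain Ruby written for this post, not the actual control-plugin API, and the in-memory hash stands in for real persistence:

    MERCHANT_ACCOUNTS = {
      'voucher' => 'groupon_vouchers_mid',
      'goods'   => 'groupon_goods_mid',
      'travel'  => 'groupon_travel_mid'
    }.freeze

    STICKY_CHOICES = {} # payment_id => merchant account (a database in reality)

    def merchant_account_for(payment_id, line_of_business)
      # Sticky: follow-up transactions (capture, refund, etc.) reuse the
      # account chosen for the initial transaction.
      STICKY_CHOICES[payment_id] ||= MERCHANT_ACCOUNTS.fetch(line_of_business)
    end

    merchant_account_for(42, 'goods')  # => "groupon_goods_mid"
    merchant_account_for(42, 'travel') # => "groupon_goods_mid" (sticky)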

Multiple payment providers

Because we accept so many different payment methods in so many different countries, we cannot rely on a single payment processor. Another Kill Bill Control Plugin is used at runtime to route each request to the right provider. Because Kill Bill hides the complexity of having different gateways behind a single Payment API, the routing is transparent to the clients.
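
The routing decision itself can be as simple as a lookup keyed on attributes of the request. Again, this is an invented sketch rather than the plugin’s real code:

    ROUTES = {
      ['credit_card', 'US'] => 'gateway_a_plugin',
      ['paypal',      'US'] => 'paypal_plugin',
      ['credit_card', 'DE'] => 'gateway_b_plugin',
      ['sepa_debit',  'DE'] => 'direct_debit_plugin'
    }.freeze

    def plugin_for(payment_method, country)
      ROUTES.fetch([payment_method, country]) { 'default_gateway_plugin' }
    end

    plugin_for('credit_card', 'DE') # => "gateway_b_plugin"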

On that topic, testing is done in several stages. First, each plugin is independently unit tested, outside of Kill Bill, using TestNG or RSpec. Second, we maintain an internal repository (codename Kill-IT) of integration tests, which contains not only tests against Kill Bill directly (using the open-source Java client) but also against our clients (such as the Orders system, which in turn calls Kill Bill in our QA environment). Finally, we work with various QA teams on end-to-end testing against our website (desktop and mobile versions) as well as our mobile applications, because not all payment methods are available through all clients or in all countries; Apple Pay is a good example.

Multiple payment methods

We support over 100 payment methods in dozens of countries, forcing us to implement various payment flows, such as synchronous and asynchronous (e.g. 3-D Secure) credit card transactions, or hosted payment pages. We designed the Kill Bill API to be generic, so that introducing a new flow is mostly transparent to our clients (except, in some scenarios, to the front-end and mobile applications, which need to support various redirects).

Additionally, not all payment methods behave the same. Most requests have high latency (a typical API call to a gateway can take anywhere between 1 and 2 seconds), but there is also huge variance, even among credit cards.

[Graph: credit card latency]

An authorization call, for example, is synchronous all the way to the card issuer: timing depends on the issuing bank (we’ve seen requests take 10 seconds or more for some local credit unions, for instance).

[Graph: latency depending on the issuing bank]

We have to keep this in mind when tuning the various timeout parameters between the clients and Kill Bill, and between Kill Bill and our providers.

Moreover, depending on the payment method and country, we have to obey local laws. For example, in some European countries it is mandatory to ask the user’s permission before storing a card on file.

Experimentation

Finally, we constantly look at our data to understand how we can minimize our transaction fees while maximizing our acceptance rates. Sometimes transactions are refused by the bank for no obvious reason (very often, the customer’s issuing bank will reject the transaction and return a generic error message like “Do Not Honor”), which can be frustrating for a user who is trying to check out on the website.

We know, for example, that various card types (credit vs. debit, high-end rewards cards vs. low-end cards) perform differently depending on the issuing bank. Each issuer also performs differently depending on the line of business.

Part of our work is to do everything we can to ensure the transaction goes through. This means making subtle changes to the request sent to the payment gateway, selecting the right acquirer, being smart about how we retry, etc. At a high level, this is done in an Experiment (Control) plugin; we will describe the technical details in a future blog post.

Any optimization of our traffic typically involves experiments with control groups validated by chi-square tests (if that’s your thing, we’re hiring!). Sometimes we even have to pick up the phone and talk to the banks, explaining how much work we do to prevent fraud and that our transactions are indeed legitimate.
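
For the curious, here is a toy version of such a validation in Ruby. The counts are invented; a real analysis would run on our actual acceptance data:

    # 2x2 chi-square test: is the acceptance-rate difference between the
    # control group and the experiment group statistically significant?
    observed = [
      [9_100, 900], # control:    accepted, declined
      [9_250, 750]  # experiment: accepted, declined
    ]

    row_totals = observed.map(&:sum)
    col_totals = observed.transpose.map(&:sum)
    total      = row_totals.sum.to_f

    chi_square = 0.0
    observed.each_with_index do |row, i|
      row.each_with_index do |obs, j|
        expected    = row_totals[i] * col_totals[j] / total
        chi_square += (obs - expected)**2 / expected
      end
    end

    puts chi_square.round(2) # 14.86 > 3.84 (p = 0.05, 1 df) => significant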

Conclusion

There is a lot more that hasn’t been covered here, such as rolling deployments, failover across data centers, fraud systems, and financial reporting, but I hope this gives you a glimpse of what it takes to process payments at scale. Feel free to reach out if you have any specific questions!


Screwdriver: Improving Platform Resiliency at Groupon

August 23rd, 2016

“Bob is an engineer. He gets his service tested for fault tolerance and resiliency. He feels confident. Be like Bob.”

How confident do you feel about your service not going down in the middle of the night or during your favorite holiday? Having allocated new resources for the estimated increase in holiday traffic, would you still feel confident? We, the Screwdriver team, aim to build that confidence among engineers with our latest tool: Screwdriver, a fault-tolerance testing tool. Fault tolerance is the property that enables a system to continue operating properly when some of its components fail. Our goal is to certify all of our services as fault tolerant.

Problem
At Groupon, thousands of nodes serve hundreds of interdependent microservices. We face many common failure modes: node failures, network failures, and increases in network latency. Understanding how a given system and its dependent services fail, and being prepared for such events, is crucial in today’s world of microservices. Testing for these kinds of faults and assessing the robustness and resiliency of the system is very important, as any system downtime can result in the loss of millions of dollars.

Objective
The objective of our team is to simulate commonly occurring faults and failure scenarios, to understand system behavior during the simulation, and to take steps to prevent or mitigate such failures. We replicate these failure scenarios by injecting faults ourselves, in a controlled manner, using automated scripts. We understand the behavior of a service and its dependents by observing the monitors for the machine under test and for its dependent services, ensuring that they keep operating properly.

Architecture

[Diagram: Screwdriver architecture]

Components
Topology Translation Service
Understanding the architecture of a given service is very important before injecting a fault. It helps us answer questions like “What does the service stack consist of?”, “Is caching handled by Varnish or Redis?”, and “Which set of machines should we inject faults on to simulate a rack failure?”. The Topology Service not only identifies a machine’s characteristics and picks an apt fault for it; it also knows the associated monitors used to observe the service and its dependents before, during, and after the fault injection.

To persist the topologies, we needed a database that could help us visualize a topology and understand the dependencies between services. Storing and querying topologies in a SQL datastore would involve multiple joins across several tables, especially when querying dependent services several levels deep. We also needed an efficient, natural way to query for the subset of machines to inject faults on. We took a deep dive into graph database solutions and found that they met all of these requirements efficiently.

For example, one can add a dependency relation between ‘Service A’ and ‘Service B’ with a readable query:

         MERGE (a:Service {name: 'A'})-[:DEPENDS_ON]->(b:Service {name: 'B'})

Similarly…

         MERGE (b:Service {name: 'B'})-[:DEPENDS_ON]->(c:Service {name: 'C'})

Now it becomes easy to query the dependency chain for a given service. In the above example, finding all services that depend on ‘Service C’ (directly or transitively, up to three levels deep) looks something like this in Cypher, Neo4j’s query language:

         MATCH (s:Service)-[:DEPENDS_ON*1..3]->(c:Service {name: 'C'})
         RETURN s

This query returns the dependent services ‘A’ and ‘B’.

The Topology Translation Service needs a database that supports querying entities and their relationships in a natural way, which made a graph database the natural choice. It helps us visualize the data as a graph and supports querying objects with relations multiple levels deep. We are using the Neo4j graph database for the Topology Service. Neo4j is one of the leading graph databases available and supports all of the Topology Service’s requirements. It comes with a built-in web app for querying objects with Cypher queries, which also helps us visualize the objects as a graph.

[Screenshot: service topology visualized as a graph in Neo4j]

Capsule
One of the primary features of Screwdriver is injecting a fault on a given machine. Our requirement was that fault injection be as lightweight as possible. It should also be self-deployable: able to run on any given machine, kill itself on completion, and sustain itself in case of communication failure. On every fault-injection request, a Capsule is built to meet all of these requirements. It exposes a secure REST API through which we can control the fault and stop it if necessary. Faults are configured as Java objects and run as bash scripts, providing a layer of abstraction. For additional security, we want to ensure that a Capsule is not replicated and run on an unintended machine. To address this, each Capsule is built with an expiration time and signed with machine-specific information, which it validates on startup before running.

Metric Adapter
At Groupon, we have multiple metric and event pipelines. All machines are equipped with agents that monitor the host at both the system level and the application level. The monitors use the metrics published by each host and alert on any outliers based on custom thresholds. We built a loosely coupled plugin called the Metric Adapter that can work with any given metrics system, such as RRDtool or Splunk. We leverage this tool to gather metrics that we can use to analyze the machine, observing its behavior while a fault is injected via the Capsule.

Anomaly Detector
One of the challenges in injecting a fault is ensuring that we have control over the fault injection’s lifecycle. This is critical, especially when we are dealing with production traffic. With the help of monitor metadata provided by the Topology Service, we can watch the monitors for anomalies during any fault injection. An anomaly for a given system is any behavior that deviates from the intended or expected behavior. We built the Anomaly Detector to understand the behavior of a machine under fault testing and to detect any such anomalies.

The Anomaly Detector is designed to trigger a kill command to the Capsule if the fault is misbehaving. One of its challenges is figuring out whether an observed anomaly was caused by the fault execution or is a regular pattern on the Machine Under Test (MUT). To give the Anomaly Detector more context, we observe behavior patterns on the machine both before and after the fault execution.
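
In Ruby-flavored pseudocode, the core check might look like this. It is an illustrative sketch only; the real detector, its metrics client, and its thresholds are internal to Screwdriver:

    class AnomalyDetector
      KILL_THRESHOLD = 3.0 # during-fault error rate vs. pre-fault baseline

      def initialize(metrics, capsule)
        @metrics = metrics # e.g. a Metric Adapter client
        @capsule = capsule # handle to the Capsule's REST API
      end

      def check(host)
        baseline = @metrics.error_rate(host, window: :before_fault)
        current  = @metrics.error_rate(host, window: :during_fault)
        return if current.zero? # nothing looks wrong

        ratio = baseline.zero? ? Float::INFINITY : current / baseline
        # Abort the fault and restore the MUT if the deviation is too large.
        @capsule.kill! if ratio >= KILL_THRESHOLD
      end
    end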

Tower
The central component of Screwdriver, known as the Tower, oversees all of the above components. The Tower is responsible for building and deploying the Capsule to the MUT and for starting the Anomaly Detector after injecting the fault. The Tower also records the timeline of a given fault injection, known as a Fault Run, and can generate a report for the run.

Playbook Store
Screwdriver supports seven different faults out of the box. In our conversations with multiple teams, we noticed that different teams have different fault requirements and should not be restricted to the predefined faults. To make Screwdriver more extensible, we built a Playbook service for defining custom faults that can be easily integrated into Capsules. Using this Playbook service, one can define the set of scripts and commands to inject a fault, and also to abort it.

Test Run

[Graph: error rate before, during, and after the fault injection]

We successfully tested fault injection on one of our staging clusters (Elasticsearch) at Groupon, and we are happy to share the results. We asked the Tower to build a Capsule to bring down one of the nodes, and once the fault started running, we observed a spike in the error rates on the monitors. The Anomaly Detector triggered the killing of the fault due to the increased error rate, and as the graph shows, the error rate then started decreasing. With the fault terminated, the Capsule brought the machine back up to its running state. Interestingly, the error rate peaked even higher when the node rejoined the cluster.

This information was invaluable in understanding the behavior of the system when subjected to rare, dangerous, yet controlled scenarios, and we are excited that we could help fellow engineers simulate such a scenario before it happens in production. Using Screwdriver, we were able to control the lifecycle of a fault from injection to recovery. As a result, we were able to identify architectural issues such as single points of failure and missing caches.

Future Plans
We are currently focusing on more fault executions and on certifying our services at Groupon as fault tolerant. For upcoming releases, we want to make Screwdriver more flexible, smarter, and more productive.

Any operations engineer would agree that an incident should be tracked in the finest detail. Tracking a fault execution is necessary to understand the issues in a system and make refinements. We are looking to store Fault Runs as events that can help us analyze and understand patterns across fault executions on any given machine. The Fault Runs would help us understand:

  • How well a service performs with fewer worker machines during heavy production traffic, and whether it improves when worker machines are added for the same traffic.
  • How many misses are observed in a cache layer when a new cold cache host is introduced into a set of warm cache hosts.

We believe collecting such data will provide useful insights into a given machine, and into Groupon-wide infrastructure as a whole. We would love to share those insights in future blog posts.

From the Screwdriver team @ Groupon (Adithya Nagarajan, Ajay Vaddadi, Amruta Badami, Kavin Arasu)


Geekfest Palo Alto: WhatsApp’s Secret Sauce: Erlang from behind the Trenches by Francesco Cesarini

May 9th, 2016

We are pleased to be hosting our 5th Geekfest talk in Palo Alto this Thursday (May 12th). This talk is open to Groupon employees as well as tech enthusiasts across the Bay Area. If interested, read the details below and RSVP on our Meetup.com page.

Schedule:
5:45 PM – Networking and Food
6:15 PM – Talk Begins
7:00 PM – Q&A and Networking
8:00 PM – Event Ends

Description:
Erlang is a programming language designed for the Internet Age, although it pre-dates the Web. It is a language designed for multi-core computers, although it pre-dates them too. It is a “beacon language”, to quote Haskell guru Simon Peyton-Jones, in that it more clearly than any other language demonstrates the benefits of concurrency-oriented programming.

In this talk, Francesco will introduce Erlang from behind the trenches, looking at how its history influenced its constructs. He will do so from a personal perspective, with anecdotes from his time as an intern at the Ericsson computer science lab in the language’s formative years, and later when working on the OTP R1 release.

About the Speaker:
Francesco Cesarini is the founder of Erlang Solutions Ltd. He has used Erlang on a daily basis since 1995, starting as an intern at Ericsson’s computer science laboratory, the birthplace of Erlang. He moved on to Ericsson’s Erlang training and consulting arm, working on the first release of OTP and applying it to turnkey solutions and flagship telecom applications. In 1999, soon after Erlang was released as open source, he founded Erlang Solutions, which has become the world leader in Erlang-based consulting, contracting, training, and systems development. Francesco has worked on major Erlang-based projects both within and outside Ericsson and, as Technical Director, has led the development and consulting teams at Erlang Solutions.

He is also the co-author of ‘Erlang Programming’ and ‘Designing for Scalability with Erlang/OTP’, both published by O’Reilly, and lectures at Oxford University.

Hope to see you there!
http://www.meetup.com/Geekfest-Palo-Alto/events/230871932/


Geekfest Palo Alto: Intro to Geospatial in Redis 3.2 by Dave Nielsen of Redis Labs

April 26th, 2016

We are pleased to be hosting our 4th Geekfest talk in Palo Alto this Thursday (April 28th). This talk is open to Groupon employees as well as tech enthusiasts across the Bay Area. If interested, read the details below and RSVP on our Meetup.com page: http://www.meetup.com/Geekfest-Palo-Alto/events/230497052/

Schedule:
6:30pm Networking
7:00pm Geospatial Talk
8:00pm Q&A
8:30pm End

Description: Join Dave for this talk, where he demonstrates how to speed up mobile apps and web-scale systems with Redis geospatial data structures and functions.

In his talk, Dave will demo an open-source geospatial app that depends solely on Redis. The data structures covered include geospatial indexes; the functions demoed are GEOADD, GEOHASH, GEOPOS, GEODIST, GEORADIUS, and GEORADIUSBYMEMBER.
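
If you would like to play along before the talk, here is a minimal sketch using the redis-rb gem against a Redis 3.2 server; the key, members, and coordinates are made up:

    require 'redis' # gem install redis (3.3+ for the GEO commands)

    redis = Redis.new

    # GEOADD stores members with their longitude/latitude.
    redis.geoadd('offices', -122.1430, 37.4419, 'palo_alto')
    redis.geoadd('offices', -122.4194, 37.7749, 'san_francisco')

    # GEODIST computes the distance between two members.
    redis.geodist('offices', 'palo_alto', 'san_francisco', 'km') # => ~"44.4"

    # GEORADIUS finds all members within a radius of a point.
    redis.georadius('offices', -122.28, 37.61, 50, 'km') # => both offices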

About the Speaker

Dave works for Redis Labs organizing workshops, hackathons, meetups, and other events to help developers learn when and how to use Redis. Dave is also the co-founder and lead organizer of CloudCamp, a community of cloud computing enthusiasts in over 100 cities around the world. Dave graduated from Cal Poly, San Luis Obispo and has worked in developer relations for 12 years at companies like PayPal, Strikeiron, and Platform D. Dave gained modest notoriety when he proposed to his girlfriend in the book “PayPal Hacks.”


Groupon’s Workflow Service, “Backbeat”, Now Open Source

April 14th, 2016

Groupon operates globally in 28 countries. In order to stay financially nimble at this scale, the company needs automated processes for just about everything. Groupon’s Financial Engineering Developers (FED) team developed and maintains our system for automating merchant payments. The system has paid millions of merchants billions of dollars. One of the tools that helped us reach this scale is a workflow service we created called Backbeat. We recently open-sourced this service and want to share how it helped us and how it can help you.

Let’s see how Backbeat helps out with a simplified example of Groupon’s merchant payment process. For this example, we will say there are three distinct steps to making a payment:

1. Get Sold Vouchers – We query our inventory systems to discover Groupon vouchers that were sold for a particular contract.

2. Calculate Payment – We get the relevant payment term on the contract from the contract service; the payment term describes how we pay the merchant for that particular contract. We multiply the number of unpaid vouchers by the amount we owe the merchant per voucher, then take this total and create a payment object in our database.

3. Send to Bank – Here we take any payment objects that have not been sent to the bank and call the bank’s API, telling them where we want to send the money and how much. If successful, we mark the payments as “sent_to_bank” so that we do not accidentally send them again.

These separate steps, or activities as we call them, need to happen in a specific order, and they are best organized into a simple workflow. A workflow is the series of activities necessary to complete a task. Our workflow for this example looks like this:

[Diagram: the Make Payment workflow: Get Sold Vouchers → Calculate Payment → Send to Bank]

We decided to break the payment process into separate steps because it is easier to retry an individual activity if it fails. For example, if the bank’s API cannot be reached for whatever reason, we want to retry the “Send to Bank” activity, but not the whole “Make Payment” process. It also allows for resource-level throttling: as a client of the bank’s API, you may only be allowed X connections to their banking service. Now that “Send to Bank” is its own step, we can have it run asynchronously and limit the number of processes that can run that type of activity.

An alternative approach would be to maintain the workflow implicitly in our application code. Various activities would check the status of previous activities before executing, and we could use ZooKeeper or database locks to ensure activities don’t race. That is a lot of bookkeeping for authoring new activities. Additional bookkeeping increases the risk of introducing a bug, and in payments, that might mean making a double payment or none at all! Splitting the business logic from the workflow state provides more visibility into the state of the system, simplifies authoring new workflows, and makes the system less brittle when changes are required.
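
To make the contrast concrete, here is a hypothetical sketch of the three activities as explicit, individually retryable units. The class names, the register_child helper, and the service objects are all invented for illustration; the real client API lives in the Github repo linked at the end of this post:

    # Hypothetical sketch, not the actual Backbeat client API.
    class GetSoldVouchers
      def run(contract_id)
        vouchers = InventoryService.sold_vouchers(contract_id)
        # Register the next activity as a blocking child: Backbeat will
        # only run it after this activity reports "completed".
        register_child(CalculatePayment, contract_id, vouchers, mode: :blocking)
      end
    end

    class CalculatePayment
      def run(contract_id, vouchers)
        term    = ContractService.payment_term(contract_id)
        payment = Payment.create!(amount: vouchers.count * term.amount_per_voucher)
        register_child(SendToBank, payment.id, mode: :blocking)
      end
    end

    class SendToBank
      def run(payment_id)
        payment = Payment.find(payment_id)
        # An exception from the bank call marks this activity "errored",
        # so Backbeat retries just this step, not the whole workflow.
        BankApi.send_payment(payment)
        payment.update!(state: 'sent_to_bank') # guard against double sends
      end
    end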

Introducing Backbeat

Backbeat is a free-to-use, open-source workflow service. It allows applications to process activities across N processes or machines while maintaining the order and time in which activities run. Activities can spawn new activities as they run and are automatically retried if they fail. Activities can be blocking or non-blocking, so you can run things in parallel or sequentially at any level of a workflow.

It is Backbeat’s responsibility to tell the application when it is time to run an activity; the application responds with the status of the activity (completed or errored), along with any child activities it wants to run. When Backbeat tells the client application to run an activity, the application can put it onto a queue and process it later. Here is a sequence diagram of a client interacting with Backbeat for our “Make Payment” example above:

[Sequence diagram: a client interacting with Backbeat for the Make Payment workflow]

In our example we ran things sequentially, one activity after another. There may be cases where you want to run some activities in parallel and, once all of those activities finish, perform some other business logic:

[Diagram: parallel activities followed by a joining activity]

Backbeat supports this by allowing activities to have different modes: blocking, non-blocking, or fire-and-forget. It has other features as well, like scheduling activities to run in the future and linking activities across workflows.
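
Continuing the hypothetical sketch from above (the mode names and the register_child helper remain illustrative, not the real client API), a parent activity might fan out like this:

    class AfterPayment
      def run(payment_id)
        # Non-blocking children can run in parallel with each other.
        register_child(NotifyMerchant, payment_id, mode: :non_blocking)
        register_child(UpdateLedger,   payment_id, mode: :non_blocking)
        # One plausible join point: a blocking child that runs once the
        # parallel branch has completed.
        register_child(Reconcile,      payment_id, mode: :blocking)
        # Fire-and-forget: nothing waits on the outcome.
        register_child(EmitMetrics,    payment_id, mode: :fire_and_forget)
        # Activities can also be scheduled to run in the future.
        register_child(ConfirmCleared, payment_id,
                       mode: :blocking, run_at: Time.now + 24 * 3600)
      end
    end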

Backbeat was written entirely in Ruby and uses PostgreSQL for its data store, with Redis and Sidekiq for async processing and scheduling. Several engineering teams at Groupon are using Backbeat in production to solve their asynchronous workflow problems, and we invite you to start using it for your applications today. The software is free to use and customize, with directions on how to set it up on your own server. We have a Ruby client that makes it easy for your Ruby applications to interact with Backbeat, and we have clients for other languages in development. We invite the open-source community to contribute and work with us to continually improve it.

Here is the Github Repo and Wiki Page for getting started.