The Grace Hopper Celebration is coming!

at September 28th, 2016

Groupon is pleased to be a sponsor of the Anita Borg Institute’s Grace Hopper Celebration of Women in Computing again this year. I’m Emily Wilson, a manager in the engineering department here at Groupon, and I’m looking forward to sharing the experiences of Groupon employees, including my own, at this year’s conference.

Grace Hopper Celebration (GHC) is the world’s largest conference for women technologists. Last year, the conference was attended by nearly 12,000 people. As someone who didn’t anticipate a tech-centric career, I found it amazing to look around and appreciate the number of women in attendance while keeping in mind that women are still underrepresented in technology. I had a blast, and I’m counting down the days to GHC ’16!

Keep checking in here leading up to, during, and after GHC. Follow @grouponjobs on Twitter as well for real-time updates during the conference. And if you too will be at GHC, stop by our booth at the GHC Expo to learn more about what we’re up to!


Processing Payments At Scale

at August 23rd, 2016

Groupon recently announced gross billings of $1,492,882,000 for Q2 2016 — that’s about $17M that our systems charged every single day of the quarter. Processing that volume involves a lot of complexity, which we’re going to explore in this blog post.

Setup overview

Before going into details, let’s first review how our payment system is set up. Note that Groupon is partitioned into several datacenters, each taking care of a specific region of the world, but for simplicity, we will focus here on our North America deployment only.

Two years ago, Groupon decided to switch to Kill Bill, the open-source billing and payments system. All of the traffic is managed by seven Kill Bill instances (virtual machines) sharing a single MySQL database (dedicated hardware, running on SSDs, with typical master/slave replication). We run it on Java 8 and Tomcat 8, a setup very similar to the open-source Kill Bill Docker images we publish, which we hope to migrate to internally one day.

Regarding our choice of database, since it is a hot topic these days: we love PostgreSQL too! But we simply have more in-house MySQL expertise. Besides, our Kill Bill philosophy is to keep the DAO layer as simple as possible: no DELETE, no UPDATE (except for two-phase commit scenarios), no JOIN, no stored procedures. We simply store and retrieve data using straightforward queries, while relying on the database (e.g. InnoDB) to provide us with strong transactional semantics.
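To make that philosophy concrete, here is a minimal, hypothetical sketch of what an insert-only persistence layer can look like with plain JDBC (the table and column names are illustrative, not Kill Bill’s actual schema): state changes are recorded as new rows, reads are simple single-table selects, and the database never executes a DELETE, UPDATE, or JOIN.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class PaymentTransactionDao {

        private final Connection connection;

        public PaymentTransactionDao(Connection connection) {
            this.connection = connection;
        }

        // Record a new state: always an INSERT, never an UPDATE or DELETE.
        public void recordTransaction(String paymentId, String state, long amountInCents) throws Exception {
            try (PreparedStatement stmt = connection.prepareStatement(
                    "insert into payment_transactions (payment_id, state, amount_in_cents, created_date) values (?, ?, ?, now())")) {
                stmt.setString(1, paymentId);
                stmt.setString(2, state);
                stmt.setLong(3, amountInCents);
                stmt.executeUpdate();
            }
        }

        // Reads are straightforward selects against a single table: no JOIN, no stored procedure.
        public List<String> getTransactionStates(String paymentId) throws Exception {
            List<String> states = new ArrayList<>();
            try (PreparedStatement stmt = connection.prepareStatement(
                    "select state from payment_transactions where payment_id = ? order by created_date asc")) {
                stmt.setString(1, paymentId);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        states.add(rs.getString("state"));
                    }
                }
            }
            return states;
        }
    }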

Only a couple of internal services are triggering the payments, all of them using the open-source Ruby client.

Kill Bill offers a pluggable architecture: different plugin types (Payment Plugins, Control Plugins, Notification Plugins, etc.) can be used to modify the behavior of the system at runtime. At Groupon, each Kill Bill instance runs ten plugins: four open-source ones (the plugins integrating our payment gateways, plus the analytics plugin, which generates data for analytics and finance) and six Groupon-specific ones (with very little business logic, they are mainly used to integrate other internal systems, such as our message bus).


Thanks to our proxy tokenizer, Kill Bill falls outside of PCI scope (we only see our Groupon tokens).

High availability

Our first challenge is making sure the system is always up and running. While this is obviously a requirement common to all production systems, any downtime has a direct and significant financial impact: if it takes 15 minutes to be woken up by PagerDuty, open your laptop, and log in to the VPN, and if during that time the system is down, that’s a potential loss of $175,000. And unlike subscription-based businesses, we cannot always retry the payments at a later time, because not all payment methods can be stored on file (e.g. PayPal).

Luckily, the payment system doesn’t need to sustain a high throughput by today’s standards: in the US, on a typical day, Kill Bill processes on average 7.5 payment transactions per second, with a peak of 12.5 payment transactions per second.

[Figure: Typical daily traffic in North America]

Given our setup, each node needs to process only about one or two payments per second. With seven nodes, we’re largely over-provisioned, but this gives us peace of mind for daily bursts and hardware failures.

Holiday periods (Thanksgiving, Christmas, etc.) put a heavy load on the system, however, which requires us to run regular load tests in production. Thanks to Kill Bill’s multi-tenancy feature, where data can be compartmentalized into individual tenants, we use a dedicated test tenant so the test data doesn’t impact financial reporting. Our last round verified we can sustain 120 payments per second.

Additionally, we face the typical challenges of running a JVM (e.g. GC tuning), with the twist that because about half of our Kill Bill plugins are written in JRuby, the flags we can enable are limited (we have had issues when enabling invokedynamic, for example). Having seven nodes lets us A/B test these settings: as an example, we currently have two groups of two nodes with different JVM flags, which we’re monitoring against our baseline to reduce GC latency.

[Figure: Monitoring the impact of JVM changes on Tuesday]

Multiple merchant accounts

Groupon has several very different lines of business: we sell vouchers, physical goods, and vacation packages. For accounting reasons, each line of business uses a dedicated merchant account. Also, acquirers and payment processors prefer that businesses with different characteristics — such as volume, average order size, product category, and payment method — use different accounts and different merchant category codes. Their fraud controls and reward programs can be more precise when the orders are more similar to each other.

To solve this, we use a Kill Bill Control Plugin to determine at runtime which line of business the payment is for and to tell the payment plugin to look up the credentials for that specific merchant account. This selection needs to be sticky, meaning any follow-up transactions for the same payment (capture, refund, etc.) will have to be associated with the same merchant account. This association also needs to be persisted and reflected in our financial reports.
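As an illustration only (this is neither the actual Kill Bill plugin API nor our production code, and the account names are invented), the sticky selection boils down to something like this: pick the account from the line of business on the first transaction, persist the choice, and reuse it for every follow-up transaction on the same payment.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of sticky merchant account selection inside a control-plugin-style hook.
    public class MerchantAccountSelector {

        // Line of business -> merchant account identifier (values are made up for the example).
        private static final Map<String, String> ACCOUNTS = new HashMap<>();
        static {
            ACCOUNTS.put("LOCAL_VOUCHERS", "merchant-account-vouchers");
            ACCOUNTS.put("GOODS", "merchant-account-goods");
            ACCOUNTS.put("TRAVEL", "merchant-account-travel");
        }

        // Stand-in for the persistent store that makes the choice sticky across capture and refund.
        private final Map<UUID, String> accountByPayment = new ConcurrentHashMap<>();

        public String selectAccount(UUID paymentId, String lineOfBusiness) {
            // Follow-up transactions reuse whatever account was chosen for the initial authorization.
            return accountByPayment.computeIfAbsent(paymentId, id -> {
                String account = ACCOUNTS.get(lineOfBusiness);
                if (account == null) {
                    throw new IllegalArgumentException("Unknown line of business: " + lineOfBusiness);
                }
                return account;
            });
        }
    }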

Multiple payment providers

Because we accept so many different payment methods in so many different countries, we cannot rely on a single payment processor. Another Kill Bill Control Plugin is used at runtime to route each request to the right provider. Because Kill Bill hides the complexity of integrating different gateways behind a single Payment API, the routing is transparent to the clients.
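Conceptually, that routing decision is little more than a lookup keyed by attributes of the request; a deliberately simplified sketch (the keys and plugin names below are invented, and the real logic lives in the Control Plugin):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: map (country, payment method) to the payment plugin for that provider.
    public class ProviderRouter {

        private final Map<String, String> routes = new HashMap<>();

        public ProviderRouter() {
            // Example entries; the actual routing rules are richer and data-driven.
            routes.put("US:CREDIT_CARD", "gateway-a-plugin");
            routes.put("US:PAYPAL", "paypal-plugin");
            routes.put("DE:SEPA_DEBIT", "gateway-b-plugin");
        }

        public String route(String country, String paymentMethod) {
            String plugin = routes.get(country + ":" + paymentMethod);
            if (plugin == null) {
                throw new IllegalStateException("No provider configured for " + country + "/" + paymentMethod);
            }
            return plugin;
        }
    }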

On that topic, testing is done in several stages. First, each plugin is independently unit tested, outside of Kill Bill, using TestNG or RSpec. Second, we maintain an internal repository (codename Kill-IT) of integration tests, which contains not only tests against Kill Bill directly (using the open-source Java client) but also against our clients (such as the Orders system, which in turn calls Kill Bill in our QA environment). Finally, we work with various QA teams for end-to-end testing against our website (desktop and mobile versions) as well as our mobile applications, because not all payment methods are available through all clients or in all countries, Apple Pay being a good example.

Multiple payment methods

We support over 100 payment methods in dozens of countries, forcing us to implement various payment flows, such as synchronous and asynchronous (e.g. 3-D Secure) credit card transactions, or hosted payment pages. We designed the Kill Bill API to be generic, so that introducing a new flow is mostly transparent to our clients (except, in some scenarios, to the front-end and mobile applications, which need to support various redirects).

Additionally, not all payment methods behave the same. Most requests have a high latency (a typical API call with a gateway can take anywhere between 1 and 2 seconds), but there is also a huge variance, even for credit cards.

[Figure: Credit card latency]

An authorization call, for example, is synchronous against the card issuer: timing will depend on the issuing bank (we’ve seen requests taking 10 seconds or more for some local credit unions, for instance).

[Figure: Latency depending on the issuing bank]

We have to keep this in mind when tweaking the various timeout parameters between the clients and Kill Bill, and between Kill Bill and our providers.
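As a hedged illustration of that tuning (the library choice and the values are ours for this example, not necessarily what our clients run), a caller talking to Kill Bill or to a gateway might size its timeouts with Apache HttpClient so that a slow issuing bank is not cut off prematurely:

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class PaymentHttpClientFactory {

        public static CloseableHttpClient create() {
            RequestConfig config = RequestConfig.custom()
                    .setConnectTimeout(2_000)           // establishing the TCP connection should be fast
                    .setSocketTimeout(30_000)           // but leave room for slow issuers (10s+ responses)
                    .setConnectionRequestTimeout(1_000) // don't wait forever for a pooled connection
                    .build();
            return HttpClients.custom()
                    .setDefaultRequestConfig(config)
                    .build();
        }
    }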

Moreover, depending on the payment method and country, we have to obey local laws. As an example, in some European countries, it is mandatory to ask the user for permission before storing a card on file.

Experimentation

Finally, we constantly look at our data to understand how we can minimize our transaction fees while maximizing our acceptance rates. Sometimes, transactions are refused by the bank for no obvious reason (very often, the customer’s card-issuing bank will reject the transaction and return a generic error message like “Do Not Honor”), which can be frustrating to the user who is trying to check out on the website.

We know, for example, that various card types (credit vs. debit, high-end rewards cards vs. low-end cards) perform differently depending on the issuing bank. Each issuer also performs differently depending on the line of business.

Part of our work is to do everything to ensure the transaction goes through. This means making subtle changes to the request sent to the payment gateway, selecting the right acquirer, being smart in how we retry, etc. At a high level, this is done in an Experiment (Control) plugin — we will describe the technical details in a future blog post.

Any form of optimization on our traffic will typically involve various experiments with control groups validated by chi-square tests (if that’s your thing, we’re hiring!). Sometimes, we even have to pick up the phone and talk to the banks, explaining to them how much work we’re doing to prevent fraud and that our transactions are indeed legitimate.
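As a toy illustration of the kind of check involved (the counts below are invented), a chi-square test on a 2x2 contingency table comparing acceptance rates between a control group and an experiment group looks like this:

    // Toy chi-square test on a 2x2 contingency table (approved vs. declined, control vs. experiment).
    // The counts are made up; a real experiment also cares about effect size and sample sizing.
    public class ChiSquareExample {

        public static void main(String[] args) {
            long[][] observed = {
                    {9_200, 800},   // control:    approved, declined
                    {9_350, 650}    // experiment: approved, declined
            };

            double total = 0;
            double[] rowTotals = new double[2];
            double[] colTotals = new double[2];
            for (int r = 0; r < 2; r++) {
                for (int c = 0; c < 2; c++) {
                    rowTotals[r] += observed[r][c];
                    colTotals[c] += observed[r][c];
                    total += observed[r][c];
                }
            }

            double chiSquare = 0;
            for (int r = 0; r < 2; r++) {
                for (int c = 0; c < 2; c++) {
                    double expected = rowTotals[r] * colTotals[c] / total;
                    double diff = observed[r][c] - expected;
                    chiSquare += diff * diff / expected;
                }
            }

            // With one degree of freedom, chi-square > 3.84 corresponds to p < 0.05.
            System.out.printf("chi-square = %.2f, significant at 5%%: %b%n", chiSquare, chiSquare > 3.84);
        }
    }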

Conclusion

There is a lot more we haven’t covered, such as rolling deployments, failover across data centers, fraud systems, and financial reporting, but I hope this gives you a glimpse of what it takes to process payments at scale. Feel free to reach out if you have any specific questions!


Screwdriver: Improving Platform Resiliency at Groupon

at August 23rd, 2016

“Bob is an engineer. He gets his service tested for fault tolerance and resiliency. He feels confident. Be like Bob”

How confident do you feel about your service not going down in the middle of the night or during your favorite holiday? Having allocated new resources for the estimated increase in holiday traffic, would you still feel confident? We, the Screwdriver team, aim to build that confidence among engineers with our latest tool, Screwdriver, a fault tolerance testing tool. Fault tolerance is the property that enables a system to continue operating properly when some of its components fail. Our goal is to certify all of our services as fault tolerant.

Problem
At Groupon, there are thousands of nodes serving hundreds of interdependent microservices. We face many commonly occurring failures, such as node failures, network failures, and increases in network latency. Understanding how a given system and its dependent services behave under such failures, and being prepared for them, is crucial in today’s world of microservices. Testing for these faults and assessing the robustness and resiliency of a system is very important, as any downtime can result in the loss of millions of dollars.

Objective
Our team’s objective is to simulate commonly occurring faults and failure scenarios, understand the system’s behavior during the simulation, and take steps to prevent or mitigate such failures. We replicate these scenarios by injecting faults ourselves in a controlled manner using automated scripts, and we observe the behavior of a service and its dependent services through their monitors to ensure that they keep operating properly.

Architecture

[Screwdriver architecture diagram]

Components
Topology Translation Service
Understanding the architecture of a given service is very important before injecting a fault. It helps us answer questions like “What does the service stack consist of?”, “Is caching handled by Varnish or Redis?”, and “Which set of machines should we inject a fault on to simulate a rack failure?”. The Topology Service not only identifies the machine characteristics and picks an apt fault for them, it also knows the associated monitors used to observe the service and its dependent services before, during, and after the fault injection.

To persist the topologies, we needed a database that could help us visualize a topology and understand the dependencies between services. Storing and querying topologies in a SQL datastore would involve multiple joins across several tables, especially when querying dependent services several levels deep. We also needed a more efficient and natural way of querying for the subset of machines to inject a fault on. We took a deep dive into graph databases and found that they met all of these requirements efficiently.

For example, one can add a dependency relation between ‘Service A’ and ‘Service B’ using a more readable query.

         SERVICE_A ----DEPENDS_ON---> SERVICE_B

Similarly…

         SERVICE_B ----DEPENDS_ON---> SERVICE_C

Now it becomes easier to query the dependency chain around a given service. In the above example, services ‘A’ and ‘B’ both depend (directly or transitively) on ‘Service C’, and a Cypher query to find them might look like:

MATCH (dependent:Service)-[:DEPENDS_ON*1..3]->(c:Service {name: 'Service C'})
RETURN dependent.name

The query traverses DEPENDS_ON relationships up to three levels deep and returns ‘Service A’ and ‘Service B’ as a result.

The Topology Translation Service needs to query entities and their relationships in a natural way, so a graph database was the natural choice: it lets us visualize the data as a graph and supports querying objects through relationships multiple levels deep. We use the Neo4j graph database for the Topology Service. Neo4j is one of the leading graph databases, supports all of the Topology Service’s requirements, and comes with a built-in web app for running Cypher queries and visualizing the results as a graph.


Capsule
One of the primary features of Screwdriver is injecting a fault on a given machine. Our requirement was that fault injection be as lightweight as possible. It should also be self-deployable so it can run on any given machine, kill itself on completion, and remain self-sustaining in case of a communication failure. On every fault injection request, a Capsule is built to satisfy these requirements. It exposes a secure REST API through which we can control the fault and stop it if necessary. Faults are configured as Java objects and are run as bash scripts, providing a layer of abstraction. For additional security, we want to ensure that a Capsule cannot be replicated and run on an unintended machine. To address this, each Capsule is built with an expiration time and signed with machine-specific information, which it validates on startup before running.
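A minimal sketch of the idea, with every class, field, and script name invented for illustration (this is not Screwdriver’s actual code): the fault is described as an object, executed as a bash script, and refuses to run if the Capsule has expired or landed on the wrong machine.

    import java.io.IOException;
    import java.net.InetAddress;
    import java.time.Instant;

    // Hypothetical sketch of a Capsule-style fault: configured as a Java object, executed as a bash script.
    public class Fault {

        private final String name;
        private final String injectScript;   // bash snippet that injects the fault (illustrative)
        private final String abortScript;    // bash snippet that reverts it
        private final Instant expiresAt;     // the Capsule refuses to run after this point
        private final String targetHostname; // the Capsule refuses to run on any other machine

        public Fault(String name, String injectScript, String abortScript,
                     Instant expiresAt, String targetHostname) {
            this.name = name;
            this.injectScript = injectScript;
            this.abortScript = abortScript;
            this.expiresAt = expiresAt;
            this.targetHostname = targetHostname;
        }

        public void inject() throws IOException {
            validate();
            new ProcessBuilder("bash", "-c", injectScript).inheritIO().start();
        }

        public void abort() throws IOException {
            new ProcessBuilder("bash", "-c", abortScript).inheritIO().start();
        }

        private void validate() throws IOException {
            if (Instant.now().isAfter(expiresAt)) {
                throw new IllegalStateException("Capsule expired, refusing to inject fault " + name);
            }
            String localHostname = InetAddress.getLocalHost().getHostName();
            if (!localHostname.equals(targetHostname)) {
                throw new IllegalStateException("Capsule built for " + targetHostname + ", not " + localHostname);
            }
        }
    }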

Metric Adapter
At Groupon, we have multiple metric and event pipelines. All machines are equipped with agents that monitor the host at both the system level and the application level. The monitors use the metrics published by each host and alert on any outliers based on custom thresholds. We built a loosely coupled plugin called the Metric Adapter that can be adapted to any given metrics system, such as RRDtool or Splunk. We use it to gather the metrics we need to analyze a machine, so we can observe its behavior while the Capsule is injecting a fault.
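Conceptually, the adapter is a small interface with one implementation per backend; a hypothetical sketch (the names are ours, not Screwdriver’s):

    import java.time.Instant;
    import java.util.List;

    // Hypothetical sketch of the adapter abstraction: one implementation per metrics backend.
    public interface MetricAdapter {

        // A single observation of a metric at a point in time.
        class DataPoint {
            public final Instant timestamp;
            public final double value;

            public DataPoint(Instant timestamp, double value) {
                this.timestamp = timestamp;
                this.value = value;
            }
        }

        // Fetch a metric (e.g. error rate) for a host over a time window, regardless of whether
        // the data lives in RRDtool, Splunk, or another pipeline.
        List<DataPoint> fetch(String host, String metricName, Instant from, Instant to);
    }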

Anomaly Detector
One of the challenges in injecting a fault is making sure that we keep control over the lifecycle of the fault injection. This is a critical step in fault tolerance testing, especially when we are dealing with production traffic. With the help of the monitor metadata provided by the Topology Service, we can watch the monitors to identify any anomalies during a fault injection. An anomaly for a given system can be defined as any behavior of that system that deviates from the intended or expected behavior. We built the Anomaly Detector to understand the behavior of a machine under fault testing and to help us detect any anomalies.

The Anomaly Detector is designed to trigger a kill command to the Capsule if the fault is misbehaving. One of its challenges is figuring out whether an observed anomaly was caused by the fault execution or is a regular pattern on the Machine Under Test (MUT). To give the Anomaly Detector more context, we also observe the behavior patterns on the machine before and after the fault execution.
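A rough sketch of the approach (the structure and the three-sigma threshold are invented for illustration): compare the metric observed during the fault against a baseline collected beforehand, and ask the Capsule to abort when the deviation gets too large.

    import java.util.List;

    // Hypothetical sketch: flag an anomaly when a value observed during the fault deviates
    // too far from the baseline collected before the fault was injected.
    public class AnomalyDetector {

        private final double baselineMean;
        private final double baselineStdDev;

        public AnomalyDetector(List<Double> baselineSamples) {
            double sum = 0;
            for (double sample : baselineSamples) {
                sum += sample;
            }
            this.baselineMean = sum / baselineSamples.size();

            double squaredDiffs = 0;
            for (double sample : baselineSamples) {
                squaredDiffs += (sample - baselineMean) * (sample - baselineMean);
            }
            this.baselineStdDev = Math.sqrt(squaredDiffs / baselineSamples.size());
        }

        // True when the observation is more than three standard deviations from the baseline,
        // in which case the Tower would tell the Capsule to abort the fault.
        public boolean isAnomalous(double observedDuringFault) {
            return Math.abs(observedDuringFault - baselineMean) > 3 * baselineStdDev;
        }
    }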

Tower
The central component of Screwdriver, known as the Tower, oversees all of the above components. The Tower is responsible for building the Capsule, deploying it to the MUT, and starting the Anomaly Detector after the fault is injected. The Tower also records the timeline of a given fault injection, known as a Fault Run, and can generate a report for the run.

Playbook Store
Screwdriver supports seven different faults out of the box. In conversations with multiple teams, we noticed that different teams have different fault requirements and should not be restricted to only the predefined faults. To make Screwdriver more extensible, we built a Playbook service for defining custom faults that can be easily integrated into Capsules. Using the Playbook service, one can define the set of scripts and the commands to inject a fault, as well as to abort it.

Test Run

[Graph: error rate during the Elasticsearch fault injection test]

We successfully tested fault injection on one of our staging clusters (Elasticsearch) at Groupon, and we are happy to share the results. We asked the Tower to build a Capsule to bring down one of the nodes, and once the fault started running, we observed a spike in the error rates on the monitors. The Anomaly Detector triggered a kill of the fault due to the increase in error rate and, as we can see in the graph, the error rate started decreasing. As a result of terminating the fault, the Capsule brought the machine back up to its running state. Interestingly, the error rate peaked even higher when the node rejoined the cluster.

This information was beneficial in understanding the behavior of the system when subjected to rare, dangerous, yet controlled scenarios, and we are excited that we could help fellow engineers simulate such a scenario before it happens in production. Using Screwdriver, we were able to control the lifecycle of a fault from injection to recovery. As a result, we were able to identify architectural issues such as single points of failure and missing caches.

Future Plans
We are currently focusing on executing more kinds of faults and certifying our services at Groupon as fault tolerant. In upcoming releases, we want to make Screwdriver more flexible, smarter, and more productive.

Any operations engineer would agree that an incident should be tracked down to the finest detail. Tracking a fault execution is necessary to understand the issues in a system and make refinements. We are looking to store Fault Runs as events that can help us analyze and understand the patterns across fault executions on any given machine. The Fault Runs would help us understand:

  • How well a service performs with fewer worker machines during heavy production traffic, and whether it improves as the number of worker machines increases for the same traffic.
  • How many misses are observed in a cache layer when a new cold cache host is introduced into a set of warm cache hosts.

We believe collecting such data will provide useful insights into a given machine as well as the Groupon-wide infrastructure, and we would love to share those insights in future blog posts.

From the Screwdriver team @ Groupon (Adithya Nagarajan, Ajay Vaddadi, Amruta Badami, Kavin Arasu)


Geekfest Palo Alto: WhatsApp’s Secret Sauce: Erlang from behind the Trenches by Francesco Cesarini

at May 9th, 2016

We are pleased to be hosting our 5th Geekfest talk in Palo Alto this Thursday (May 12th). This talk is open to Groupon employees as well as tech enthusiasts across the Bay Area. If interested, read details below and RSVP on our Meetup.com page.

Schedule:
5:45 PM – Networking and Food
6:15 PM – Talk Begins
7:00 PM – Q&A and Networking
8:00 PM – Events End

Description:
Erlang is a programming language designed for the Internet Age, although it pre-dates the Web. It is a language designed for multi-core computers, although it pre-dates them too. It is a “beacon language”, to quote Haskell guru Simon Peyton-Jones, in that it more clearly than any other language demonstrates the benefits of concurrency-oriented programming.

In this talk, Francesco will introduce Erlang from behind the trenches, looking at how its history influenced its constructs. He will be doing so from a personal perspective, with anecdotes from his time as an intern at the Ericsson computer science lab, at a time when the language itself was still being heavily shaped, and later when working on the OTP R1 release.

About the Speaker:
Francesco Cesarini is the founder of Erlang Solutions Ltd. He has used Erlang on a daily basis since 1995, starting as an intern at Ericsson’s computer science laboratory, the birthplace of Erlang. He moved on to Ericsson’s Erlang training and consulting arm working on the first release of OTP, applying it to turnkey solutions and flagship telecom applications. In 1999, soon after Erlang was released as open source, he founded Erlang Solutions, who have become the world leaders in Erlang based consulting, contracting, training and systems development. Francesco has worked in major Erlang based projects both within and outside Ericsson, and as Technical Director, has led the development and consulting teams at Erlang Solutions.

He is also the co-author of ‘Erlang Programming’ and ‘Designing for Scalability with Erlang/OTP’ both published by O’Reilly and lectures at Oxford University.

Hope to see you there!
http://www.meetup.com/Geekfest-Palo-Alto/events/230871932/


Geekfest Palo Alto: Intro to Geospatial in Redis 3.2 by Dave Nielsen of Redis Labs

at April 26th, 2016

We are pleased to be hosting our 4th Geekfest talk in Palo Alto this Thursday (April 28th). This talk is open to Groupon employees as well as tech enthusiasts across the Bay Area. If interested, read details below and RSVP on our Meetup.com page: http://www.meetup.com/Geekfest-Palo-Alto/events/230497052/

Schedule:
6:30pm Networking
7:00pm Geospatial Talk
8:00pm Q&A
8:30pm End

Description: Join Dave in this talk where he demonstrates how to speed up mobile apps and web scale systems with Redis geospatial data structures and functions.

In his talk, Dave will demo an open source Geospatial app that depends solely on Redis. Data structures include Geospatial Indexes. Functions demoed are GEOADD, GEOHASH, GEOPOS, GEODIST, GEORADIUS, GEORADIUSBYMEMBER .

About the Speaker

Dave works for Redis Labs organizing workshops, hackathons, meetups and other events to help developers learn when and how to use Redis. Dave is also the co-founder and lead organizer of CloudCamp, a community of Cloud Computing enthusiasts in over 100 cities around the world. Dave graduated from Cal Poly: San Luis Obispo and has worked in developer relations for 12 years at companies like PayPal, Strikeiron and Platform D. Dave gained modest notoriety when he proposed to his girlfriend in the book “PayPal Hacks.”


Groupon’s Workflow Service, “Backbeat”, Now Open Source

at April 14th, 2016

Groupon operates globally in 28 countries. In order to stay financially nimble at this scale, the company needs automated processes for just about everything. Groupon’s Financial Engineering Developers (FED) team developed and maintains our system for automating merchant payments. The system has paid millions of merchants billions of dollars. One of the tools that helped us reach this scale is a workflow service we created called Backbeat. We recently open-sourced this service and want to share how it helped us and how it can help you.

Let’s see how Backbeat helps out with a simplified example of Groupon’s merchant payment process. For this example, we will say there are three distinct steps for making a payment:

1. Get Sold Vouchers – We query our inventory systems to discover Groupon vouchers that were sold for a particular contract.

2. Calculate Payment – We get the relevant payment term on the contract from the contract service. Payment term describes how we pay the merchant for that particular contract. We multiply the number of unpaid vouchers by the amount we owe the merchant per voucher. We take this total and create a payment object in our database.

3. Send to Bank – Here we take any payment objects that have not been sent to the bank and call the bank’s API, telling them where we want to send the money and how much. If successful, we mark the payments as “sent_to_bank” so that we do not accidentally send them again.

These separate steps, or activities as we call them, need to happen in a specific order. They are best organized into a simple workflow. A workflow is the series of activities that are necessary to complete a task. Our workflow for this example would look like this:

[Diagram: the Make Payment workflow: Get Sold Vouchers → Calculate Payment → Send to Bank]

We decided to break the payment process into separate steps because it is easier to retry an individual activity if it fails. For example, if the bank’s API cannot be reached for whatever reason, we will want to retry the “Send to Bank” activity, but not the whole “Make Payment” process. It also allows for resource-level throttling. For example, it may be the case that as a client of the bank API you are only allowed X number of connections to their banking service. Now that “Send to Bank” is its own step, we can have it run asynchronously and limit the number of processes that can run that type of activity.

An alternative approach may be to maintain the workflow implicitly in our application code. Various activities would check the status of previous activities before executing, and we could use Zookeeper or database locks to ensure activities don’t race. This is a lot of bookkeeping for authoring new activities. Additional bookkeeping increases the risk of introducing a bug, and in payments, that might mean making a double payment or none at all! Splitting the business logic from the workflow state provides more visibility into the state of the system, simplifies authoring new workflows, and makes the system less brittle when changes are required.

Introducing Backbeat

Backbeat is a free-to-use, open-source workflow service. It allows applications to process activities across N processes/machines while maintaining the order and time in which activities will run. Activities can spawn new activities as they run and will be automatically retried if they fail. Activities can be blocking or non-blocking, so you can run things in parallel or sequentially at any level in a workflow.

It is Backbeat’s responsibility to tell the application when it is time to run an activity, and the application responds with the status of the activity (completed or errored) along with any child activities it may want to run. When Backbeat tells the client application to run an activity, the application can put it onto a queue and process it later. Here is a sequence diagram of a client interacting with Backbeat for our “Make Payment” example above:

[Sequence diagram: a client interacting with Backbeat for the Make Payment workflow]
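To make the contract concrete, here is a schematic sketch of what a client-side activity might return (Backbeat’s real clients are Ruby, and none of these type or method names come from the actual client library; they only illustrate the status-plus-children response described above):

    import java.util.Collections;
    import java.util.List;

    // Hypothetical illustration of the client/server contract: run the activity, then report
    // its status along with any child activities that should run next.
    public class CalculatePaymentActivity {

        public ActivityResult run(String contractId) {
            try {
                // ... business logic: find unpaid vouchers, compute the amount, create the payment object ...
                // On success, report completed and ask Backbeat to schedule "Send to Bank" as a child activity.
                return ActivityResult.completed(Collections.singletonList("send_to_bank"));
            } catch (Exception e) {
                // On failure, report errored; Backbeat retries this activity, not the whole workflow.
                return ActivityResult.errored(e.getMessage());
            }
        }

        public static class ActivityResult {
            public final String status;                // "completed" or "errored"
            public final List<String> childActivities; // activities to spawn next
            public final String error;

            private ActivityResult(String status, List<String> childActivities, String error) {
                this.status = status;
                this.childActivities = childActivities;
                this.error = error;
            }

            public static ActivityResult completed(List<String> children) {
                return new ActivityResult("completed", children, null);
            }

            public static ActivityResult errored(String reason) {
                return new ActivityResult("errored", Collections.emptyList(), reason);
            }
        }
    }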

In our example we were running things sequentially, one activity after another. There may be cases where you want to run some activities in parallel and, when all of those activities finish, perform some other business logic:

[Diagram: activities running in parallel, followed by an activity that runs once they all complete]

Backbeat can do this by allowing activities to have different modes like blocking, non-blocking, or fire-and-forget. It has other features like scheduling activities to run in the future and linking activities across workflows.

Backbeat was written entirely in Ruby and uses PostgreSQL for its data store. It uses Redis along with Sidekiq for its async processing and scheduling. Several engineering teams at Groupon are using Backbeat in production to solve their asynchronous workflow problems. We invite you to start using it for your applications today. The software is free to use and customize, with directions on how to set it up on your own server. We have a Ruby client that allows your Ruby applications to interact easily with Backbeat, and we have clients for other languages in development. We invite the open-source community to contribute and work with us to continually improve it.

Here is the Github Repo and Wiki Page for getting started.


Codeburner – security-focused static code analysis for everyone

at March 11th, 2016


Last year, the Application Security team set out to improve upon a challenging situation: with a single security team and such a large developer community, how do we keep on top of security analysis for the ever-increasing mountain of code?

The answer came about as the result of a GeekOn project to trigger automated static code analysis based on internal deployment notifications.

After some time in development adding features and getting things just right, we’re proud to announce the open source release of Codeburner!

What is Codeburner?

Codeburner uses the OWASP pipeline project to run multiple open source and commercial static analysis tools against your code, and provides a unified web interface to sort and act on the issues it finds.

Since the core backend and scanning engine is built on Rails, Codeburner also provides a full REST API for easy integration with other tools or an existing CI process.

Key Features:

  • Asynchronous scanning (via Sidekiq) that scales
  • Advanced false positive filtering
  • Publish issues via GitHub or JIRA
  • Track statistics and graph security trends in your applications
  • Integrates with a variety of open source and commercial scanning tools
  • Full REST API for extension and integration with other tools, CI processes, etc.

Documentation

You can find full documentation for Codeburner at http://groupon.github.io/codeburner

Get Involved!

If you’d like to contribute, fork us on GitHub and check out the Developer Guide.


Girl Scouts explore the world of STEM at Groupon’s 4th Scout Out Engineering

at February 15th, 2016


This President’s Day, we welcomed 100 3rd – 6th grade scouts to Groupon, 85 in our Chicago office and 15 in Palo Alto, with the support of almost 50 employee volunteers from the Engineering and Product teams. Throughout the day, the girls got to work with our technical employees to walk through activities on Code.org, practice the process of iteration in a hands-on bridge building activity, hear from a panel of female employees, and say hello to our friends across the country all while earning their very own “Scout Out Engineering” merit badge.


Chicago volunteer and Software Development Engineer, Shilpa, mentioned, “It’s exciting to see so many young girls here, because I don’t want to be the only girl on the Engineering team. Diversity on our Engineering teams is so important. People solve problems that they can see; and when we have diverse teams, we are more likely to solve a problem from multiple angles and build a better product.”


I could not be more proud of the work that employee volunteers have done to grow this program. We’ve traditionally hosted Scout Out Engineering in October with one event in Chicago, and this year there was so much excitement around supporting STEM education and women in technology that we added a February event to the calendar and expanded programming to the West Coast.


In addition to expanding the program, this year even the grown-ups got in on the fun, as we know it’s so important for students to have the opportunity to continue learning outside of the classroom. Girl Scout moms Diana and Carmen hopped on Code.org, and they were hooked. “As Girl Scout parents and troop leaders, this type of program helps our girls to see themselves as the engineers of tomorrow. It gives the girls the opportunity to see different types of STEM careers and meet a number of grown-up scouts in the jobs that they could fill in the future. It was fun to hear the girls saying ‘I want to do this when I grow up!’ as they worked with the Groupon volunteers,” both moms said.


Palo Alto volunteer and fellow Software Development Engineer, Sarah, said, “It’s really cool to see girls have the opportunity to be immersed into this kind of environment at such a young age. Growing up, I didn’t have opportunities to meet people like me, so I’m excited to volunteer and help open the doors for young girls to stay interested in STEM.”

STEM Education is an area that Groupon’s Social Responsibility team has committed to as we work to help people and communities thrive and prosper. Our portfolio of community initiatives is designed to help us achieve greater impact and value for society, our employees and the business. Our hope for STEM Education is for more students in diverse and under-served communities to have access and be inspired to pursue education and careers in science, technology, engineering, and math. We are excited to continue this work with the Girl Scouts and show them how cool technology really is!



Geekfest Palo Alto Meetup: Machine Learning from Lukas Biewald

at January 15th, 2016


Groupon’s Palo Alto office is excited to be the newest chapter participating in Geekfest! Our first talk in this monthly series will be from Lukas Biewald of CrowdFlower. Lukas will be speaking about machine learning, active learning, and human-in-the-loop computing, exploring how artificial intelligence systems interact with human intelligence systems.

The event is Tuesday evening, January 19th at the Groupon office in Palo Alto. See our Meetup page for details and to RSVP.

Geekfest is a technology-agnostic software developer meetup. We strive to have both technical presentations as well as presentations about intra- and inter-personal relationships. We have chapters in Chicago, Seattle and now Palo Alto. If you are interested in speaking at Geekfest please email us at geekfest@groupon.com.

Hope to see you there on Tuesday!


Kill Bill Metrics

at December 3rd, 2015

Thank you to Kofi Jedamzik for contributing to this project and blog post.

[Figure: CyberSource rate]

Status quo

At Groupon, the majority of applications use a system called Grapher for monitoring. It can plot simple RRD graphs with one-minute resolution. However, during the Kill Bill migration, we encountered multiple deficiencies with the existing solution:

  • Grapher offers limited support for templates, which makes it difficult to reuse and maintain the RRD graph definitions.
  • The time range and time zone are stored in a cookie, so sharing a graph link will most likely show a different graph (we ended up sharing screenshots instead).
  • It is time consuming to change and add new metrics because most of them are based on Splunk searches or cron jobs.
  • The biggest deficiency is the high cardinality of our metrics: we need a lot of context around our metrics. For example, we wanted to get notified when a specific payment method in a specific country for a specific client starts failing. But the number of combinations caused capacity problems in our Splunk cluster.

As a result, we started looking for a simple solution to improve the situation. The starting point was the Dropwizard Metrics library, which has established itself as the de facto standard for metrics in Java-based applications. The library makes it very easy to measure different metrics within your application. It supports five different metric types: Gauges, Counters, Histograms, Meters, and Timers. You can also use numerous modules to instrument common libraries like Jetty, Logback, Log4j, Apache HttpClient, Ehcache, JDBI, and Jersey.

The collected metrics are kept in metric registries; on top of those, you can use reporters to publish your metrics. We use the JmxReporter to expose metrics as JMX MBeans and the Metrics servlets to publish the metrics as JSON objects via HTTP.
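For example, instrumenting a payment call with a Timer and wiring up the JmxReporter only takes a few lines; a minimal sketch (the metric name and the simulated work are ours):

    import com.codahale.metrics.JmxReporter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.Timer;

    public class PaymentMetrics {

        private static final MetricRegistry REGISTRY = new MetricRegistry();
        private static final Timer CREATE_PAYMENT_TIMER = REGISTRY.timer("payment.create-payment");

        public static void main(String[] args) throws Exception {
            // Expose every metric in the registry as a JMX MBean.
            JmxReporter reporter = JmxReporter.forRegistry(REGISTRY).build();
            reporter.start();

            // Time an operation; a Timer gives us both rates (a Meter) and latency percentiles (a Histogram).
            try (Timer.Context ignored = CREATE_PAYMENT_TIMER.time()) {
                Thread.sleep(150); // stand-in for the actual gateway call
            }
        }
    }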

Storing metrics in a time series database

Additionally, we just started experimenting with InfluxDB. InfluxDB is a new time series database with promising features like tags and fields.

Tags are indexed and allow fast querying by tag values, which should give us the ability to break metrics down by numerous attributes. Another noteworthy feature is its mechanism for downsampling stored data, called Continuous Queries: it lets you aggregate and precompute expensive queries on the fly.

Unfortunately, InfluxDB reporting is not yet supported by the metrics library, but since InfluxDB supports the Graphite line protocol, we use the metrics-graphite reporter to stream the metrics directly into InfluxDB. The drawback is that we have to parse the metric name to extract the metadata. This can be done with InfluxDB’s Graphite plugin, which lets you extract tags from metric names by using a template. For example, instead of duration,host=myhost,method=create-payment,country=de value=1234, we send duration.myhost.create-payment.de 1234 and configure a template like measurement.host.method.country to extract the tags. This feature is a bit limited, as you can only use wildcards and the dot separator; regular expressions would be a nice feature for the future.
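Concretely, the reporting side is just the standard metrics-graphite reporter pointed at InfluxDB’s Graphite listener, with the metadata encoded as dotted segments of the metric name (the host name, port, and reporting interval below are illustrative):

    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;

    import com.codahale.metrics.MetricFilter;
    import com.codahale.metrics.MetricRegistry;
    import com.codahale.metrics.graphite.Graphite;
    import com.codahale.metrics.graphite.GraphiteReporter;

    public class InfluxDbGraphiteReporting {

        public static void main(String[] args) {
            MetricRegistry registry = new MetricRegistry();

            // Encode the metadata as dotted segments: measurement.host.method.country
            registry.timer("duration.myhost.create-payment.de");

            // InfluxDB's Graphite plugin listens for the Graphite line protocol.
            Graphite graphite = new Graphite(new InetSocketAddress("influxdb.example.com", 2003));
            GraphiteReporter reporter = GraphiteReporter.forRegistry(registry)
                    .convertRatesTo(TimeUnit.SECONDS)
                    .convertDurationsTo(TimeUnit.MILLISECONDS)
                    .filter(MetricFilter.ALL)
                    .build(graphite);
            reporter.start(1, TimeUnit.MINUTES);
        }
    }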

Visualizing time series

To visualize the time series, we use Grafana. It supports a variety of time series data sources including InfluxDB and has lots of visualization options, for example annotations, which allow you to mark events like restarts or deployments in your graphs.

Dashboard templating lets you create dynamic visualizations; it even supports variables to change query parameters.

[Screenshot: Grafana annotations]

Summary

This stack was easy to deploy and mitigated a few of our pain points. Compared to the RRD graph syntax, it’s very nice to have a full-featured graph editor.

[Screenshot: the Grafana graph editor]

The dashboard definitions can simply be exported and synchronized with other instances via Grafana’s HTTP API. Sharing graphs is not a problem anymore; you can even choose between UTC and the browser’s time zone.

But there is still work to do: instead of only creating metrics via Splunk queries, we now measure them directly in the application. This makes the whole pipeline less error-prone, but the drawback is that the aggregation has to take place in InfluxDB or in the application. In addition, the need for Graphite templates to extract metadata from the metric name means frequent config changes whenever we change or add metrics. So, we are currently figuring out the best way to get support for tags and fields into Kill Bill directly.