We recently completed a year-long project to migrate Groupon’s U.S. web traffic from a monolithic Ruby on Rails application to a new Node.js stack with substantial results.
Groupon’s entire U.S. web frontend had been a single Rails codebase since its inception. The frontend codebase quickly grew large, which made it difficult to maintain and challenging to ship new features. As a solution to this gigantic monolith, we decided to re-architect the frontend by splitting it into small, independent and more manageable pieces. At the center of this project, we rebuilt each major section of the website as an independent Node.js application. We also rebuilt the infrastructure to make all the independent apps work together. Interaction Tier (I-Tier) was the result.
Some of the highlights of this great architecture migration include the following:
- Page loads are significantly faster across the site
- Our development teams can develop and ship features faster and with fewer dependencies on other teams
- We can eliminate redundant implementations of the same features in different countries where Groupon is available
This post is the first in a series about how we re-architected the site and the great benefits we’re seeing that will be key to driving Groupon business forward. Read on for the full story.
A Little History
Groupon started as a single web page that showed one deal each day to people in Chicago. A typical deal might be a discount at a local restaurant or a ticket to a local event. Each deal had a “tipping point” – the minimum number of people that had to buy the deal for it to be valid. If enough people bought the deal to reach the tipping point, everyone got the discount. Otherwise, no one got the discount.
The site was originally built as a Ruby on Rails application. Rails was a great choice in the beginning as it was one of the easiest ways for the very small development team we had to get our site up and running quickly. It was also easy to implement new features on Rails; this was a huge asset for us in the early days as the feature set was constantly evolving.
The original Rails architecture was very simple:
However, we quickly outgrew being able to serve all of our traffic through a single Rails application pointing to a single database cluster. We added more frontend servers and database replicas and put everything behind a CDN, but that worked only until the database writes became a bottleneck. Processing orders caused a number of database writes; as a result, we decided to move that code out of our Rails app and into a new service with its own database cluster.
We kept following this pattern of breaking out existing backend functionality into new services, but the rest of the website (views, controllers, assets, etc.) remained part of the original Rails application:
This architecture change bought us time, but we knew it would only be temporary. The codebase was still manageable for the small development team we had at that time, and it allowed us to keep the site from falling over during peak traffic.
Around this time, Groupon began expanding internationally. Over a short period, we went from operating just in the U.S. to operating in 48 different countries. Along the way, we also acquired several international companies such as CityDeal. Each acquisition came with its own pre-existing software stack.
The CityDeal architecture was similar to Groupon’s architecture, but it was a totally separate implementation built by a different team. As a result, there were differences in design and technology: Java instead of Ruby, Apache instead of nginx, PostgreSQL instead of MySQL.
As often happens with fast-growing companies, we had to choose between slowing down to integrate the different stacks and keeping both systems, knowing that we were taking on technical debt we would have to repay later. We made an intentional decision to keep the U.S. and European implementations separate at first in exchange for growing the business faster. And as more acquisitions followed, more complexity was added to the architecture.
We also built mobile clients for iPhone, iPad, Android and Windows Mobile; we definitely did not want to build a different mobile app for each country where Groupon operated. Instead, we decided to build an API layer on top of each of our backend software platforms; our mobile clients connected to whichever API endpoint matched the user’s country:
This worked well for our mobile team. They were able to build a single mobile app that worked across all of our countries.
But there was still a catch. Whenever we built a new product or feature, we built it first for the web and then later built an API so that the feature could be implemented on mobile. We were repeating our efforts.
Now that nearly half of our business is mobile in the U.S., we need to build with a mindset of mobile first. Accordingly, we want an architecture where a single backend can serve mobile and web clients with minimal development effort.
As Groupon continued to evolve and new products were launched, the frontend Ruby codebase grew larger. There were too many developers working in the same codebase. It got to the point where it was difficult for developers to run the application locally. Test suites slowed down and flaky tests became a real problem. And since it was a single codebase, the entire application had to be deployed at once. When a production issue required a rollback, everyone’s changes would get rolled back instead of just the broken feature. In short, we had all the problems of a monolithic codebase that had grown too large.
But we had this problem multiple times over. Not only did we have to deal with the U.S. codebase, but we had many of the same problems with the European codebase. We needed to totally re-architect the frontend.
Rebuilding the entire frontend is a risky endeavor. It takes a lot of time, involves a lot of different people, and there’s a real chance that you won’t come up with anything that’s any better than the old system. Or worse, it takes too long and you give up halfway through with no results to show for the effort.
But we had great success in the past re-architecting smaller pieces of our infrastructure. For example, both our mobile website and our merchant-facing website had been rebuilt with great results. This experience gave us a good starting point and from it we set out clear goals for this project.
Goal 1: Unify our frontends
With multiple software stacks implementing the same features in different countries, we weren’t able to move as fast as we wanted. We needed to eliminate redundancy in our software stack.
Goal 2: Put mobile on the same level as web
Since nearly half of our business in the U.S. is mobile, we couldn’t afford to build a web version and a mobile version. We needed an architecture where web was just another client using the same APIs as our mobile apps.
Goal 3: Make the site faster
Our site was slower than we wanted. In the rush to handle the growth of the site, the U.S. frontend had accumulated tech debt which made it challenging to optimize. We wanted a solution that didn’t require so much code to serve a request. We wanted something simple.
Goal 4: Let teams move independently
When Groupon was first launched, the site was indeed simple. But since then, we’ve added many new product lines, supported by development teams located around the world. We wanted each team to be able to build and deploy their features independently and quickly. We needed to break the interdependency between product teams that existed because everything was in a single codebase.
First, we decided to split each major feature of the website into a separate web application:
We built a web application framework in Node.js that included common features needed by each application to make it easy for our teams to build out these individual web apps.
Sidebar: Why Node.js?
Before building our new frontend layer, we evaluated several different software stacks to see which would be the best fit for us.
We were looking for a solution to a very specific problem – efficiently handling many incoming HTTP requests, making parallel API requests to service each of those HTTP requests, and rendering the results into HTML. We also wanted something that we could confidently monitor, deploy and support.
We wrote prototypes using several software stacks and tested them. We’ll post a more detailed follow-up with the specifics, but overall we found Node.js to be a good fit for this very specific problem.
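The request pattern described in this sidebar can be sketched roughly as follows. This is a minimal TypeScript illustration, not our production code: `fetchUser`, `fetchDeals`, `renderDealsPage` and the data shapes are hypothetical names invented for the example, and the two fetch functions return canned data where a real application would call the API over HTTP.

```typescript
// Hypothetical data shapes, invented for this sketch.
type Deal = { title: string; price: number };
type User = { name: string };

// Stand-ins for backend API calls. A real app would issue HTTP requests here.
function fetchUser(userId: string): Promise<User> {
  return Promise.resolve({ name: `user-${userId}` });
}

function fetchDeals(city: string): Promise<Deal[]> {
  return Promise.resolve([{ title: `Dinner for two in ${city}`, price: 25 }]);
}

// Escape text before interpolating it into HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Handle one page request: start both API calls at once, wait for both,
// then render the combined result into HTML.
async function renderDealsPage(userId: string, city: string): Promise<string> {
  const [user, deals] = await Promise.all([fetchUser(userId), fetchDeals(city)]);
  const items = deals
    .map((d) => `<li>${escapeHtml(d.title)}: $${d.price}</li>`)
    .join("");
  return `<h1>Deals for ${escapeHtml(user.name)}</h1><ul>${items}</ul>`;
}
```

Because both calls are started before either is awaited, the time spent waiting on the API is close to the slower of the two calls rather than their sum.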
Next, we added a routing layer on top that forwarded users to the appropriate application based on the page they were visiting:
We built the Groupon routing service (which we call Grout) as an nginx module. It allows us to do lots of cool things like conduct A/B tests between different implementations of the same app on different servers.
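Grout itself is an nginx module, so the following TypeScript is only a sketch of the routing idea, not its implementation. It assumes a route table keyed by path prefix and weighted upstreams for A/B tests; the host names, weights and function names are all hypothetical.

```typescript
// Hypothetical route table: path prefix -> weighted upstream apps.
type Upstream = { host: string; weight: number };
type Route = { prefix: string; upstreams: Upstream[] };

const routes: Route[] = [
  {
    prefix: "/deals",
    upstreams: [
      { host: "deals-app-a.internal", weight: 90 },
      { host: "deals-app-b.internal", weight: 10 }, // alternative implementation under test
    ],
  },
  { prefix: "/checkout", upstreams: [{ host: "checkout-app.internal", weight: 100 }] },
  { prefix: "/", upstreams: [{ host: "home-app.internal", weight: 100 }] },
];

// Small deterministic string hash, so a visitor always lands in the same bucket.
function hash(text: string): number {
  let h = 0;
  for (let i = 0; i < text.length; i++) {
    h = (h * 31 + text.charCodeAt(i)) >>> 0;
  }
  return h;
}

// A prefix matches the exact path or any path below it.
function matches(prefix: string, path: string): boolean {
  if (prefix === "/") return path.startsWith("/");
  return path === prefix || path.startsWith(prefix + "/");
}

// Pick the route with the longest matching prefix.
function matchRoute(path: string): Route | undefined {
  return routes
    .filter((r) => matches(r.prefix, path))
    .sort((a, b) => b.prefix.length - a.prefix.length)[0];
}

// Map the visitor's hash onto the cumulative weights of the upstreams.
function chooseUpstream(route: Route, visitorId: string): Upstream {
  const total = route.upstreams.reduce((sum, u) => sum + u.weight, 0);
  let bucket = hash(visitorId) % total;
  for (const upstream of route.upstreams) {
    if (bucket < upstream.weight) return upstream;
    bucket -= upstream.weight;
  }
  return route.upstreams[route.upstreams.length - 1];
}

function resolveHost(path: string, visitorId: string): string | undefined {
  const route = matchRoute(path);
  return route ? chooseUpstream(route, visitorId).host : undefined;
}
```

Hashing a stable visitor identifier, rather than choosing at random on each request, keeps a visitor on the same implementation for the duration of a test.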
And to make all of these independent web apps work smoothly together, we’ve built separate services for sharing layouts and style, maintaining shared configuration and managing A/B test treatments. We’ll post more details on these services in the future.
All of this sits in front of our API and nothing in the frontend layer is allowed to talk to a database or backend service directly. This allows us to build a single federated API layer that serves both our web and mobile apps:
We are working on unifying our backend systems, but for the short term we still need to support our U.S. and European backends. So we designed our frontend to work for both backends at the same time:
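One way to picture a frontend that works against both backends is a small lookup from the visitor's country to the API that serves it. The sketch below is an assumption about how such a lookup could look, not a description of our API layer: the country list, the `example.com` base URLs and the `dealsUrl` helper are placeholders.

```typescript
// Hypothetical mapping of countries to backend platforms.
type Backend = "us" | "europe";

const backendByCountry: Record<string, Backend> = {
  US: "us",
  DE: "europe",
  FR: "europe",
  GB: "europe",
};

// Placeholder base URLs, not real endpoints.
const apiBaseUrl: Record<Backend, string> = {
  us: "https://api-us.example.com",
  europe: "https://api-eu.example.com",
};

// Build the API URL for a deals request. The frontend app only ever calls
// the API; it never reaches a database or backend service directly.
function dealsUrl(country: string, city: string): string {
  const backend = backendByCountry[country.toUpperCase()];
  if (backend === undefined) {
    throw new Error(`No backend configured for country: ${country}`);
  }
  return `${apiBaseUrl[backend]}/deals?city=${encodeURIComponent(city)}`;
}
```

With the lookup isolated in one place, the application code stays the same for every country, and unifying the backends later would mean changing the mapping rather than each app.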
We’ve just finished migrating our U.S. frontend from Ruby to our new Node.js infrastructure. The old monolithic frontend was split up into approximately 20 separate web apps, each of which was a clean rewrite. We’re currently serving 50,000 requests per minute (50k rpm, or roughly 830 requests per second) off of these servers on an average day, but we expect multiples of that traffic during the holiday season. And that number will increase greatly as we migrate over traffic from our other 48 countries.
These are the benefits we’ve seen so far:
- Page loads are faster across the board—typically by 50%. Part of this is due to technology changes and part of this is because we had a chance to rewrite all of our web pages to be much slimmer. And we still expect to make significant gains here as we roll out additional changes.
- We’re serving the same amount of traffic with less hardware compared to the old stack.
- Teams are able to deploy changes to their applications independently.
- We’ve been able to make site-wide feature and design changes much more quickly than we would have been able to with our old architecture.
Overall, this migration has made it possible for our development teams to ship pages more quickly with fewer interdependencies and removed some of the performance limitations of our old platform. But we have many more improvements planned for the future and we’ll be posting details soon.