I-Tier: Dismantling the Monolith

October 30, 2013

We recently completed a year-long project to migrate Groupon’s U.S. web traffic from a monolithic Ruby on Rails application to a new Node.js stack with substantial results.

Groupon’s entire U.S. web frontend had been a single Rails codebase since its inception. That codebase quickly grew large, which made it difficult to maintain and challenging to ship new features. To tame this gigantic monolith, we decided to re-architect the frontend by splitting it into small, independent, more manageable pieces. At the center of this project, we rebuilt each major section of the website as an independent Node.js application, and we rebuilt the infrastructure to make all of those independent apps work together. The result is the Interaction Tier (I-Tier).

Some of the highlights of this great architecture migration include the following:

  • Page loads are significantly faster across the site
  • Our development teams can develop and ship features faster and with fewer dependencies on other teams
  • We can eliminate redundant implementations of the same features in different countries where Groupon is available

This post is the first in a series about how we re-architected the site and the benefits we’re seeing, which will be key to driving Groupon’s business forward. Read on for the full story.

A Little History

Groupon started as a single web page that showed one deal each day to people in Chicago. A typical deal might be a discount at a local restaurant or a ticket to a local event. Each deal had a “tipping point” – the minimum number of people that had to buy the deal for it to be valid. If enough people bought the deal to reach the tipping point, everyone got the discount. Otherwise, no one got the discount.

The site was originally built as a Ruby on Rails application. Rails was a great choice in the beginning: it was one of the easiest ways for our very small development team to get the site up and running quickly. It was also easy to implement new features in Rails, which was a huge asset in the early days as the feature set was constantly evolving.

The original Rails architecture was very simple:

[Architecture diagram]

However, we quickly outgrew being able to serve all of our traffic through a single Rails application pointing to a single database cluster. We added more frontend servers and database replicas and put everything behind a CDN, but that only worked until the database writes became a bottleneck. Processing orders generated many of those writes, so we decided to move that code out of our Rails app and into a new service with its own database cluster.

We kept following this pattern, breaking existing backend functionality out into new services, but the rest of the website (views, controllers, assets, etc.) remained part of the original Rails application:

[Architecture diagram]

This architecture change bought us time, but we knew it would only be temporary. The codebase was still manageable for the small development team we had at the time, and it allowed us to keep the site from falling over during peak traffic.

Going Global

Around this time, Groupon began expanding internationally. Over a short period, we went from operating just in the U.S. to operating in 48 different countries. Along the way, we also acquired several international companies such as CityDeal. Each acquisition came with its own pre-existing software stack.

The CityDeal architecture was similar to Groupon’s, but it was a totally separate implementation built by a different team. As a result, there were differences in design and technology—Java instead of Ruby, Apache instead of nginx, PostgreSQL instead of MySQL.

[Architecture diagram]

As is common with fast-growing companies, we had to choose between slowing down to integrate the different stacks or keeping both systems, knowing that we were taking on technical debt we would have to repay later. We made an intentional decision to keep the U.S. and European implementations separate at first in exchange for growing the business faster. And as more acquisitions followed, more complexity was added to the architecture.

Mobile

We also built mobile clients for iPhone, iPad, Android and Windows Mobile; we definitely did not want to build a different mobile app for each country where Groupon operated. Instead, we decided to build an API layer on top of each of our backend software platforms; our mobile clients connected to whichever API endpoint matched the user’s country:

[Architecture diagram]

This worked well for our mobile team. They were able to build a single mobile app that worked across all of our countries.

But there was still a catch. Whenever we built a new product or feature, we built it first for the web and then later built an API so that the feature could be implemented on mobile. We were repeating our efforts.

Now that nearly half of our U.S. business comes from mobile, we need to build with a mobile-first mindset. Accordingly, we want an architecture where a single backend can serve mobile and web clients with minimal development effort.

Multiple Monoliths

As Groupon continued to evolve and new products were launched, the frontend Ruby codebase grew larger. There were too many developers working in the same codebase. It got to the point where it was difficult for developers to run the application locally. Test suites slowed down and flaky tests became a real problem. And since it was a single codebase, the entire application had to be deployed at once. When a production issue required a rollback, everyone’s changes would get rolled back instead of just the broken feature. In short, we had all the problems of a monolithic codebase that had grown too large.

But we had this problem multiple times over. Not only did we have to deal with the U.S. codebase, but we had many of the same problems with the European codebase. We needed to totally re-architect the frontend.

Rewrite Everything!

Rebuilding the entire frontend is a risky endeavor. It takes a lot of time, involves a lot of different people, and there’s a real chance that you won’t come up with anything better than the old system. Or worse: it takes too long and you give up halfway through with no results to show for the effort.

But we had great success in the past re-architecting smaller pieces of our infrastructure. For example, both our mobile website and our merchant-facing website had been rebuilt with great results. This experience gave us a good starting point, and from it we set clear goals for this project.

Goal 1: Unify our frontends

With multiple software stacks implementing the same features in different countries, we weren’t able to move as fast as we wanted. We needed to eliminate redundancy in our software stack.

Goal 2: Put mobile on the same level as web

Since nearly half of our business in the U.S. is mobile, we couldn’t afford to build a web version and a mobile version. We needed an architecture where web was just another client using the same APIs as our mobile apps.

Goal 3: Make the site faster

Our site was slower than we wanted. In the rush to handle the growth of the site, the U.S. frontend had accumulated technical debt that made it challenging to optimize. We wanted a solution that didn’t require so much code to serve a request. We wanted something simple.

Goal 4: Let teams move independently

When Groupon first launched, the site was indeed simple. But since then, we’ve added many new product lines, each supported by development teams located around the world. We wanted each team to be able to build and deploy its features independently and quickly. We needed to break the interdependencies between product teams that existed because everything was in a single codebase.

Approach

First, we decided to split each major feature of the website into a separate web application:

[Architecture diagram]

To make it easy for our teams to build out these individual web apps, we built a web application framework in Node.js that includes the common features needed by each application.
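To make the pattern concrete, here is a minimal, hypothetical sketch of the idea rather than our actual internal framework: a shared module provides the plumbing every app needs, and each team's app only adds its own pages. The Express dependency and all names and ports below are illustrative assumptions.

```js
// Illustrative sketch only -- Groupon's actual framework is internal and not
// described in this post. The idea: a shared module provides the plumbing that
// every app needs (logging, health checks, etc.), and each team's app just adds
// its own pages. Express and all names/ports here are hypothetical.
const express = require('express');

function createApp({ serviceName }) {
  const app = express();

  // Common request logging shared by every app.
  app.use((req, res, next) => {
    console.log(`[${serviceName}] ${req.method} ${req.url}`);
    next();
  });

  // Common health-check endpoint for monitoring and deploys.
  app.get('/healthcheck', (req, res) => res.send('ok'));

  return app;
}

// A team-owned app builds on the shared base and only defines its own routes.
const app = createApp({ serviceName: 'deals' });
app.get('/deals/:id', (req, res) => res.send(`<h1>Deal ${req.params.id}</h1>`));
app.listen(4001);
```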

Sidebar: Why Node.js?

Before building our new frontend layer, we evaluated several different software stacks to see which would be the best fit for us.

We were looking for a solution to a very specific problem – efficiently handling many incoming HTTP requests, making parallel API requests to service each of those HTTP requests, and rendering the results into HTML. We also wanted something that we could confidently monitor, deploy and support.
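As a rough illustration of that request pattern, here is a hedged sketch in modern Node.js; the built-in fetch() (Node 18+), the API hostnames, and the response shapes are all assumptions for illustration, not Groupon's code.

```js
// Sketch of the core frontend job: accept an HTTP request, fan out parallel API
// calls, and render the results into HTML. Hostnames and fields are made up.
const http = require('http');

async function renderDealPage(dealId) {
  // Fire the API requests in parallel rather than one after another.
  const [deal, reviews] = await Promise.all([
    fetch(`https://api.example.com/deals/${dealId}`).then((r) => r.json()),
    fetch(`https://api.example.com/deals/${dealId}/reviews`).then((r) => r.json()),
  ]);
  // Render the combined results into HTML (a real app would use templates).
  return `<html><body><h1>${deal.title}</h1><p>${reviews.length} reviews</p></body></html>`;
}

http.createServer((req, res) => {
  const dealId = req.url.split('/').pop();
  renderDealPage(dealId)
    .then((html) => {
      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end(html);
    })
    .catch(() => {
      res.writeHead(502);
      res.end('upstream error');
    });
}).listen(3000);
```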

We wrote prototypes using several software stacks and tested them. We’ll post a more detailed follow-up with the specifics, but overall we found Node.js to be a good fit for this very specific problem.

Approach, continued…

Next, we added a routing layer on top that forwarded users to the appropriate application based on the page they were visiting:

[Architecture diagram]

We built the Groupon routing service (which we call Grout) as an nginx module. It allows us to do lots of cool things like conduct A/B tests between different implementations of the same app on different servers.
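Grout itself is an nginx module, so the following is a conceptual sketch only, written in Node.js to show the routing idea in a few lines. The http-proxy dependency, app ports, URL prefixes, and bucketing rule are all hypothetical.

```js
// Conceptual sketch of prefix-based routing plus A/B bucketing (not Grout).
const http = require('http');
const httpProxy = require('http-proxy'); // npm install http-proxy

const proxy = httpProxy.createProxyServer({});

// Map URL prefixes to the independent web apps that own those pages.
const routes = [
  { prefix: '/deals', target: 'http://localhost:4001' },
  { prefix: '/checkout', target: 'http://localhost:4002' },
];

// A/B test: send a fraction of /deals traffic to an alternate implementation.
// (A real router would bucket by a sticky cookie or user id, not per request.)
const abTest = { prefix: '/deals', variantTarget: 'http://localhost:4003', fraction: 0.1 };

http.createServer((req, res) => {
  const route = routes.find((r) => req.url.startsWith(r.prefix));
  if (!route) {
    res.writeHead(404);
    return res.end();
  }
  let target = route.target;
  if (req.url.startsWith(abTest.prefix) && Math.random() < abTest.fraction) {
    target = abTest.variantTarget;
  }
  proxy.web(req, res, { target });
}).listen(8080);
```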

And to make all of these independent web apps work smoothly together, we’ve built separate services for sharing layouts and style, maintaining shared configuration and managing A/B test treatments. We’ll post more details on these services in the future.

All of this sits in front of our API and nothing in the frontend layer is allowed to talk to a database or backend service directly. This allows us to build a single federated API layer that serves both our web and mobile apps:

[Architecture diagram]

We are working on unifying our backend systems, but in the short term we still need to support our U.S. and European backends. So we designed our frontend to work with both backends at the same time:

[Architecture diagram]
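One hypothetical way to picture the dual-backend design: the frontend apps never touch a database directly; they go through an API client that is pointed at the U.S. or European backend depending on the user's country. The hostnames and country mapping below are made up for illustration; this is not our actual client library.

```js
// Sketch of country-based backend selection inside a frontend app.
const API_BASE_BY_COUNTRY = {
  US: 'https://api-us.example.com',
  DE: 'https://api-eu.example.com',
  FR: 'https://api-eu.example.com',
};

function apiBaseFor(countryCode) {
  // Fall back to the U.S. backend for unknown countries in this sketch.
  return API_BASE_BY_COUNTRY[countryCode] || API_BASE_BY_COUNTRY.US;
}

// Usage inside a page handler (Node 18+ built-in fetch assumed):
async function getDeals(countryCode) {
  const res = await fetch(`${apiBaseFor(countryCode)}/deals?country=${countryCode}`);
  return res.json();
}
```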

Results

We’ve just finished migrating our U.S. frontend from Ruby to our new Node.js infrastructure. The old monolithic frontend was split into approximately 20 separate web apps, each of which was a clean rewrite. We’re currently serving about 50,000 requests per minute from these servers on an average day, but we expect multiples of that traffic during the holiday season. And that number will increase greatly as we migrate over traffic from our other 48 countries.

These are the benefits we’ve seen so far:

  • Page loads are faster across the board—typically by 50%. Part of this is due to technology changes and part of this is because we had a chance to rewrite all of our web pages to be much slimmer. And we still expect to make significant gains here as we roll out additional changes.
  • We’re serving the same amount of traffic with less hardware compared to the old stack.
  • Teams are able to deploy changes to their applications independently.
  • We’ve been able to make site-wide feature and design changes much more quickly than we would have been able to with our old architecture.

Overall, this migration has made it possible for our development teams to ship pages more quickly with fewer interdependencies and removed some of the performance limitations of our old platform. But we have many more improvements planned for the future and we’ll be posting details soon.



56 thoughts on “I-Tier: Dismantling the Monolith”

  1. Nice work Adam, Sean, Todd, Jan, Tristan and the whole itier team. Congrats.

    by Par Trivedi on October 30, 2013 at 2:11 pm
  2. A very interesting read! It appears that the re-architecture is pretty much the same as what's happening to most web apps these days; it looks like your standard SOA (Service Oriented Architecture), i.e. a mostly centralised web service API that serves multiple different client platforms (mobile, desktop, etc.). One thing that I didn't see in the write-up is any caching layer, or is that done semi-automatically via nginx routing? If you are using caching, what is your caching policy and where is that controlled? Not too sure about the use of node.js, as the bottleneck will always be the database (IMHO) one way or another.

    by Coding Ninja on October 30, 2013 at 3:08 pm
  3. Impressive work, and I can't wait for the follow-up. As a rails developer myself, I wonder, what would you have done differently with your initial single-app architecture that would have made breaking it up like this easier?

    by Steve on October 30, 2013 at 4:56 pm
  4. Adam, I'm impressed at the work that went into this. Working with multiple technology stacks and trying to avoid frankenstein software is very challenging.

    by Ryan Nickell on October 30, 2013 at 6:08 pm
  5. In general the single app architecture is the problem, not the language/runtime/web framework. I'm sure there are ways that we could have improved the monolith, but having apps with limited responsibility gave us more value. We could have used Rails to get to this new architecture, but we felt Node with our home-grown application stack gave us more flexibility.

    by Sean McCullough on October 30, 2013 at 8:34 pm
  6. We have caching at various levels in our infrastructure, from in-app data caching to service response caching, but due to the heavily dynamic nature of our pages doing page-level caching wasn't useful.

    by Sean McCullough on October 30, 2013 at 8:35 pm
  7. I'm curious about the "API Client Library": is it a browser JavaScript framework that talks directly to the API? Or is it a server-side library that makes HTTP (or other) calls to the API?

    by Kieve Chua on October 30, 2013 at 9:06 pm
  8. What languages/frameworks/platforms are you using for Backend Services?

    by Jorge on October 30, 2013 at 10:40 pm
  9. It's a server-side library, but client-side communication is possible too via other means when needed. To keep the diagrams from becoming too complex, I just tried to point out major pieces.

    by Adam Geitgey on October 30, 2013 at 11:10 pm
  10. Java and Ruby are the most commonly used technologies in our backend systems, but we also have a good mix of other technologies in use in various systems, like Clojure, Python, etc.

    by Adam Geitgey on October 30, 2013 at 11:15 pm
  11. Very nice! Congrats! I am curious about configuration management. 20 web apps is not a small number.

    by Jesse on October 31, 2013 at 1:19 am
  12. Hi! Did you look at EventMachine + Fibers? Doesn't it look simpler, with similar efficiency?

    by Peter on October 31, 2013 at 5:23 am
  13. Very impressive note on the past and the approach taken. Isn't the database layer another bottleneck? With two different data centers taking in data through two different codebases, how difficult is that to combine? Was it left separate intentionally? What possible breaking points did you think through that stopped you from getting your hands on the database? And how friendly would NoSQL be for certain types of data you were handling, if that was considered?

    by Ananth on October 31, 2013 at 8:49 am
  14. I'd like to see more benefits of node to mention such as coffeescript, one language for frontend and backend, async, client side mvc etc.

    by Peking2 on October 31, 2013 at 9:29 am
  15. I compared EM with node, EM is a little frustrating and node just works. I did some research and found many people tried EM first but ended up with node.

    by Peking2 on October 31, 2013 at 9:32 am
  16. How about scala?

    by Peking2 on October 31, 2013 at 9:34 am
  17. We did evaluate using Sinatra with EM, and the performance was similar. The third party module support wasn't great, and we felt that fibers aren't a great way to express concurrency.

    by Sean McCullough on October 31, 2013 at 10:25 am
  18. I'm working on another post to dive into more of the technical details of using Node and the challenges of ramping an engineering team up onto a new stack. Stay tuned!

    by Sean McCullough on October 31, 2013 at 10:27 am
  19. We have other teams working on combining the backend systems as a separate project. The diagrams here are a very high-level approximation of the backend just to give you an idea of the flow of data. There's a lot of different pieces in the backend. Some of those pieces use NoSql databases and some don't.

    by Adam Geitgey on October 31, 2013 at 10:33 am
  20. That's a great point! Configuration management becomes critical in this kind of architecture. We are building a solution that we'll cover in a future post.

    by Adam Geitgey on October 31, 2013 at 10:35 am
  21. Thanks for taking the time to explain what this i-Tier we have been hearing about for months is all about. The explanation is understandable even for a non technical audience. Thanks!

    by Regis Bectarte on October 31, 2013 at 3:24 pm
  22. great read - look forward to your followup. Sean McCullough's inline comments here are gems as well ps. love that you linked to 'Simple Made Easy' :)

    by Steve Gentile on November 1, 2013 at 8:36 am
  23. Does your company develop web mobile or native applications?

    by Jimmy on November 1, 2013 at 11:23 am
  24. Very interesting. We're facing a similar quandary with our monolithic web app. Out of curiosity--why an nginx routing module? Why not simply use node for that as well?

    by Ted Jenkins on November 1, 2013 at 3:18 pm
  25. We already had a lot of logic baked into our nginx config that we needed to maintain through the transition. Adding the routing layer to the existing infrastructure was safer.

    by Sean McCullough on November 1, 2013 at 9:08 pm
  26. So is node rendering everything serverside or just pumping json out to clientside?

    by Troy on November 1, 2013 at 9:28 pm
  27. Thanks a lot for this article. Good to see architecture in action. I see in your paper that you managed to dismantle your monolithic web frontend. But what about the API component? Is it one component as it appears on your schemas? One runtime? One team in charge of it? Do you put new versions of your services into production without stopping the others? And finally, what is the underlying technology for this API? Thanks!

    by Jan Wit on November 2, 2013 at 7:44 am
  28. Excellent write up on solving a really tough problem. It seems like a great solution. Diversifying your applications rather than lumping them all together, not to mention drastically improving your page load time. Have you noticed any difference or change in customer interaction after the changes? Particularly with a faster website...

    by Andrew on November 2, 2013 at 8:01 am
  29. One of the challenges of breaking single monolithic app into multiple services/tiny web apps, is the set of common Models or SW-CCC (System Wide Cross Cutting Concerns) such as Logs, Stats, Exception reporting, Security (User Model), etc.. What was the approach taken to share that across the services?

    by Avnerner on November 3, 2013 at 1:27 am
  30. It's a very nice story about the complex work that sometimes we need to resolve!

    by Gonzalo on November 3, 2013 at 1:27 pm
  31. Great article! Looking forward to the next post about the rewrite, especially on what Node.js technologies were used! Were there any downsides experienced in this rewrite using Node?

    by Surge on November 3, 2013 at 4:30 pm
  32. Using Nginx OpenResty with Lua solved a similar multiple upstreams problem for us. It made the Nginx config really short, plus Lua is likely a lot easier to maintain than a custom Nginx module.

    by Mark Selby on November 3, 2013 at 10:28 pm
  33. Nice to see more companies see the value in Node.js, I am hoping that local companies here in the UK take notice, as I am one of a few in Northern Ireland using it in production. Looking forward to hearing about the specifics on what stack you used.

    by Chris Johnson on November 4, 2013 at 5:22 am
  34. Great article. Would love to know more about config management. When do you plan to post about configs?

    by Manasi on November 4, 2013 at 12:33 pm
  35. We render the pages mostly server-side and enhance them with client-side rendering if the product calls for it.

    by Sean McCullough on November 5, 2013 at 9:13 am
  36. Our Ruby monolith had two facets: web frontend and REST API. The REST API is still there, but we're starting down the road to breaking it up into smaller pieces. Much of the data model has been moved out into smaller services and the API is quickly becoming an aggregation point.

    by Sean McCullough on November 5, 2013 at 9:14 am
  37. Hi - This is great stuff, thanks for the post. One question: within your specialized front end applications, how do you handle duplicate code - especially javascript? You mentioned that you built an application framework in Node.js which contains the shared components for each of the web apps. Do you store shared client side components in this same framework to cut down on duplicate implementations of JS assets? Thanks again for the great post. It's great to hear a success story for a large Node application. Makes me want to try it out...

    by Ron Williams on November 5, 2013 at 9:51 am
  38. Curious if you looked into Celluloid::IO on the Ruby side. I'm gonna guess no...

    by Tony Arcieri on November 5, 2013 at 5:36 pm
  39. Great article! Looking forward to more detailed discussion of Node.js: the features and the challenges.

    by Jim on November 6, 2013 at 12:10 pm
  40. Thanks for sharing... I am just beginning to consider ways to break apart a large (13 years) Java web app, and start anew. Front-end first, wrapper back-end services, etc.

    by Jon Kern on November 6, 2013 at 12:40 pm
  41. What view engine are you using with Node or is it all JSON with client side rendering?

    by Guy Ellis on November 7, 2013 at 2:07 pm
  42. Very cool!

    by Robert Lawson on November 7, 2013 at 3:46 pm
  43. The site is certainly faster! As for improved customer engagement, we usually measure that over a longer period of time so it's harder to tell. We do know that our customers do engage with the site better when it's up, and the new architecture makes it easier to keep the site up.

    by Sean McCullough on November 8, 2013 at 4:02 pm
  44. Also the site is faster primarily because we gutted a lot of legacy code paths on our backend and we rewrote all our HTML/CSS/JS to be much slimmer. That in and of itself was a huge win.

    by Sean McCullough on November 8, 2013 at 4:03 pm
  45. We have some solutions in place for managing logs, stats, and exceptions for our current SOA, but they're far from perfect. There are a few platform teams working on building out a solid distributed tracing system to help get a better view of all this. Cross cutting business requirements have been handled a few different ways. User authentication is distributed via module to all teams that need that functionality. The Layout Service hosts many web features that are used in all applications. But we've found that it's actually nice not having these big concerns across our website anymore because regressions in these crucial pieces are often difficult to fix (e.g. if you have a before filter where you try to get the active user on every request and for some reason that call gets slow, your entire site gets slow in the same way).

    by Sean McCullough on November 8, 2013 at 4:07 pm
  46. All server side. It's the best fit for our read-heavy limited-step product. We're using our own view/rendering pipeline (a fork of https://github.com/quackingduck/bulk-hogan/) and we use mustache for our templating.

    by Sean McCullough on November 8, 2013 at 6:05 pm
  47. Thank you for your post! I think that it's crazy to try to use something based on JavaScript for building ultimate production-ready solutions. You wrote that all is cool. Great! But you didn't write anything about problems with the new architecture and using node.js. Why?

    by Vladimir on November 25, 2013 at 12:45 pm
  48. How does your Node implementation work with your CDN (Akamai)? What gets cached differently under Node versus your previous RoR implementation?

    by Jeffrey Costa on November 26, 2013 at 11:45 am
  49. Great article :)

    by Rajesh on November 28, 2013 at 3:03 am
  50. I have been browsing the web to find some inspiration to build our next generation web architecture. This is by far the most useful post. Good job. Question 1: Are all your APIs stateless, or are some of them stateful and need to keep track of ‘user session’? Question 2: Is there any situation where some code you implemented in your Interaction Tier needs to be reused for your Mobile App (i.e. creating a link between the Mobile and your Interaction layer)?

    by hugo villeneuve on February 5, 2014 at 2:01 pm
  51. Have you tried the vert.x framework?

    by Y.H. on February 20, 2014 at 12:21 am
  52. [...] to the same topic. Incidentally, the Groupon team has decided to do something similar when they moved from a monolithic RoR app to a collection of smaller Node.js [...]

    pingback by The Queue Is the Message | Dejan Glozic on February 24, 2014 at 6:44 am
  53. [...] from a Ruby on Rails framework to Node.js for their back-end services. Engineer Adam Geitgey elaborated on why they went with [...]

    pingback by Node.js Popularity & Usage on High-Profile Sites | ROI DNA on June 26, 2014 at 2:06 pm
  54. How do you manage stylesheets and design across the site with multiple webapps?

    by AP Fritts on November 6, 2014 at 12:02 pm
  55. Hey, how did you guys manage the data updates and deletes between systems? Let's say you have an item in Groupon Goods which has a corresponding entry in Groupon Deals. Now the good has been deleted from that service, but a deal is still lurking around pointing to a non-existent good. How do you guys manage this?

    by Azhagu Selvan on December 18, 2014 at 1:25 am
  56. @Azhagu, Adam's post describes our front-end architecture which does not have any data storage. The data storage is abstracted behind our REST APIs. To answer your question though, we generally use asynchronous messaging described in a previous blog post: https://engineering.groupon.com/2013/hornetq/building-a-distributed-messaging-system/

    by Kyle O on January 6, 2015 at 11:58 am
