“Bob is an engineer. He gets his service tested for fault tolerance and resiliency. He feels confident. Be like Bob”
How confident do you feel that your service will not go down in the middle of the night, or during your favorite holiday? Having allocated new resources for the estimated increase in holiday traffic, would you still feel confident? We, the Screwdriver team, aim to build that confidence among engineers with our latest tool: Screwdriver, a fault tolerance testing framework. Fault tolerance is the property that enables a system to continue operating properly when some of its components fail. Our goal is to certify all of our services as Fault Tolerant.
At Groupon, thousands of nodes serve hundreds of inter-dependent micro-services. We face many commonly occurring failures, such as node failures, network failures, and increased network latency. Understanding how a given system and its dependent services fail, and being prepared for such events, is crucial in today’s world of micro-services. Testing for these faults and assessing the robustness and resiliency of the system is very important, as any system downtime can result in the loss of millions of dollars.
Our team’s objective is to simulate commonly occurring fault and failure scenarios, to understand system behavior during the simulation, and to take steps to prevent or mitigate such failures. We replicate these failure scenarios by injecting faults ourselves, in a controlled manner, using automated scripts. We then observe the monitors of the machine under test and of its dependent services to understand their behavior and ensure they continue to operate properly.
Topology Translation Service
Understanding the architecture of a service is very important before injecting a fault. It helps us answer questions like: What does the service stack consist of? Is caching handled by Varnish or Redis? Which set of machines should we inject a fault on to simulate a rack failure? The Topology Service not only identifies the machine characteristics and selects an apt fault for them, but also tracks the associated monitors used to observe the service and its dependents before, during, and after the fault injection.
To persist the topologies, we needed a database that could help us visualize a topology and understand the dependencies between services. Storing and querying topologies in a SQL datastore would involve multiple joins across several tables, especially when querying dependent services several levels deep. We also needed a more efficient and natural way to query for the partial set of machines on which to inject a fault. We took a deep dive into graph database solutions and found that they met all of the above requirements efficiently.
For example, one can add dependency relations such as ‘Service A depends on Service B’ using readable queries, yielding a graph like:
SERVICE_A ----DEPENDS_ON---> SERVICE_B
SERVICE_B ----DEPENDS_ON---> SERVICE_C
Now it becomes easy to query the dependency chain for a given service. In the above example, the query to find every service that directly or transitively depends on ‘Service C’ would look like:
MATCH (s:Service)-[:DEPENDS_ON*1..3]->(:Service {name: 'C'})
RETURN s.name
The above query would return the dependent services ‘A’ and ‘B’ as a result.
The Topology Translation Service requires a database that supports querying entities and their relationships in a natural way, so a graph database was the natural choice. It lets us visualize the data as a graph and supports querying objects through relations multiple levels deep. We use the Neo4j graph database for the Topology Service. Neo4j is one of the leading graph databases available and supports all of the Topology Service’s requirements. It comes with a built-in web app for querying objects with Cypher queries, and also helps us visualize the objects as a graph.
Capsule
One of the primary features of Screwdriver is injecting a fault on a given machine. Our requirement was that the fault injection be as lightweight as possible. It should also be self-deployable on any given machine, kill itself on completion, and be self-sustaining in case of a communication failure. On every fault injection request, a Capsule is built to satisfy all of these requirements. It exposes a secure REST API through which we can control the fault and stop it if necessary. Faults are configured as Java objects and run as bash scripts, providing a layer of abstraction. For additional security, we want to ensure that a Capsule cannot be replicated and run on an unintended machine. To address this, each Capsule is built with an expiration time and signed with machine-specific information, which it validates on startup before running.
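The startup validation could look something like the following sketch. The actual Capsule is implemented in Java and bash; this is a minimal Python illustration, and the signing key, field names, and hostname check are assumptions of the sketch rather than Screwdriver's real scheme.

```python
import hashlib
import hmac
import socket
import time

# Hypothetical shared secret held by the Tower at build time.
SIGNING_KEY = b"tower-signing-key"

def sign_capsule(hostname: str, expires_at: float) -> str:
    """Sign machine-specific metadata so the Capsule only runs where intended."""
    payload = f"{hostname}|{expires_at}".encode()
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()

def validate_capsule(hostname: str, expires_at: float, signature: str) -> bool:
    """On startup, refuse to run if the Capsule has expired or was
    copied to a machine it was not built for."""
    if time.time() > expires_at:
        return False  # expired: the Capsule kills itself
    expected = sign_capsule(hostname, expires_at)
    if not hmac.compare_digest(expected, signature):
        return False  # signature mismatch: wrong machine or tampering
    return socket.gethostname() == hostname
```

Because the signature covers both the hostname and the expiration time, neither can be altered without invalidating the Capsule.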
Metric Adapter
At Groupon, we have multiple metric and event pipelines. All machines are equipped with agents that monitor the host at both the system level and the application level. The monitors use the metrics published by each host and alert on any outliers based on custom thresholds. We built a loosely coupled plugin called the Metric Adapter that can be adapted to any given metrics system, such as RRDtool or Splunk. We leverage this tool to gather metrics for analyzing the machine, so that we can observe its behavior while injecting a fault with the Capsule.
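The loose coupling can be sketched as a small plugin interface; the class and method names here are hypothetical, and the in-memory adapter merely stands in for a real RRDtool or Splunk backend.

```python
from abc import ABC, abstractmethod

class MetricAdapter(ABC):
    """Plugin interface: each backing metrics system (RRDtool,
    Splunk, ...) supplies its own implementation."""

    @abstractmethod
    def fetch(self, host: str, metric: str, start: int, end: int) -> list:
        """Return the metric's samples for `host` over [start, end]."""

class InMemoryAdapter(MetricAdapter):
    """Toy adapter backed by a dict of (host, metric) -> [(ts, value)]."""

    def __init__(self, series):
        self.series = series

    def fetch(self, host, metric, start, end):
        return [v for t, v in self.series.get((host, metric), [])
                if start <= t <= end]
```

Consumers such as the Anomaly Detector depend only on `MetricAdapter`, so swapping metrics systems does not touch the rest of Screwdriver.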
Anomaly Detector
One of the challenges in injecting a fault is ensuring that we have control over the lifecycle of the fault injection. This is a critical step in fault tolerance testing, especially when we are dealing with production traffic. With the help of the monitor metadata provided by the Topology Service, we can watch the monitors for anomalies during any fault injection. An anomaly for a given system is any behavior that deviates from the intended or expected behavior. We built the Anomaly Detector to understand the behavior of the machine under fault testing and to help us detect such anomalies.
The Anomaly Detector is designed to trigger a kill command to the Capsule if the fault injection misbehaves. One of its challenges is figuring out whether an observed anomaly was caused by the fault execution or is a regular pattern on the Machine Under Test (MUT). To give the Anomaly Detector more context, we observe behavior patterns on the machine both before and after the fault execution.
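The before/after comparison can be sketched as a simple baseline check. The 3-sigma threshold below is an assumption for illustration, not Screwdriver's actual detection model, and `capsule_kill` stands in for the Capsule's REST stop call.

```python
from statistics import mean, stdev

def is_anomalous(baseline, during_fault, sigmas=3.0):
    """Flag an anomaly when the metric during the fault deviates more
    than `sigmas` standard deviations from its pre-fault baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    sigma = max(sigma, 1e-9)  # guard against a perfectly flat baseline
    return abs(mean(during_fault) - mu) > sigmas * sigma

def watch(capsule_kill, baseline, during_fault):
    """If the run is anomalous, tell the Capsule to abort the fault."""
    if is_anomalous(baseline, during_fault):
        capsule_kill()
        return True
    return False
```

Comparing against the machine's own recent history is what lets the detector distinguish fault-induced spikes from the machine's regular patterns.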
Tower
The central component of Screwdriver, known as the Tower, oversees all of the above components. The Tower is responsible for building and deploying the Capsule to the MUT, and for starting the Anomaly Detector after injecting the fault. The Tower also records the timeline of a given fault injection, known as a Fault Run, and can generate a report for the run.
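The Tower's orchestration of a Fault Run can be sketched as follows; the component interfaces (`builder`, `deployer`, `detector`) are assumptions of this sketch, injected as callables for illustration.

```python
import time

class Tower:
    """Oversees a Fault Run: build and deploy the Capsule, start the
    Anomaly Detector, and record a timeline for the final report."""

    def __init__(self, builder, deployer, detector):
        self.builder = builder
        self.deployer = deployer
        self.detector = detector
        self.timeline = []

    def _record(self, event):
        self.timeline.append((time.time(), event))

    def run_fault(self, mut, fault):
        """Execute one Fault Run against the Machine Under Test (MUT)."""
        capsule = self.builder(mut, fault)
        self._record("capsule_built")
        self.deployer(mut, capsule)
        self._record("capsule_deployed")
        self.detector(mut)  # watches monitors; may kill the Capsule
        self._record("detector_started")
        return self.timeline  # the run report is generated from this
```

Recording every step with a timestamp is what makes the per-run report possible.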
Playbook
Screwdriver supports 7 different faults out of the box. In our conversations with multiple teams, we noticed that different teams have different fault requirements and should not be restricted to only the faults defined above. To make Screwdriver more extensible, we built a Playbook service for defining custom faults that can be easily integrated into Capsules. Using the Playbook service, one can define the set of scripts and commands both to inject a fault and to abort it.
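A custom fault entry could look like the sketch below: a named pair of inject/abort commands the Capsule can execute. The entry format and fault names are hypothetical; the `tc netem` commands are standard Linux traffic-control invocations shown only as an example payload.

```python
# Hypothetical Playbook entries: each custom fault pairs an inject
# command with the abort command that undoes it.
PLAYBOOK = {
    "network_latency": {
        "inject": "tc qdisc add dev eth0 root netem delay 200ms",
        "abort":  "tc qdisc del dev eth0 root netem",
    },
    "disk_pressure": {
        "inject": "fallocate -l 10G /tmp/screwdriver.fill",
        "abort":  "rm -f /tmp/screwdriver.fill",
    },
}

def commands_for(fault_name):
    """Look up the inject/abort command pair for a custom fault."""
    entry = PLAYBOOK[fault_name]
    return entry["inject"], entry["abort"]
```

Requiring every fault to declare its abort command up front is what lets the Anomaly Detector safely terminate any custom fault mid-run.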
Results
We successfully tested fault injection on one of our staging clusters (ElasticSearch) at Groupon, and we are happy to share the results. We asked the Tower to build a Capsule to bring down one of the nodes. Once the fault started running, we observed a spike in the error rates on the monitors, and the Anomaly Detector triggered a kill of the fault due to the increased error rate. As the graph shows, the error rate then started decreasing. As a result of the fault’s termination, the Capsule brought the machine back up to its running state. Interestingly, the error rate peaked even higher when the node rejoined the cluster.
This information was invaluable in understanding the behavior of the system when subjected to rare, dangerous, yet controlled scenarios, and we are excited that we could help fellow engineers simulate such scenarios before they happen in production. Using Screwdriver, we were able to control the lifecycle of a fault from injection to recovery. As a result, we identified architectural issues such as single points of failure and missing caches.
We are currently focusing on more fault executions and on certifying our services at Groupon as Fault Tolerant. For upcoming releases, we aim to make Screwdriver more flexible, smarter, and more productive.
Any operations engineer would agree that an incident should be tracked down to the finest detail. Tracking a fault execution is necessary to understand the issues in a system and to make refinements. We are looking to store Fault Runs as events, which will help us analyze and understand the patterns among fault executions on any given machine. The fault runs would help us understand:
- How well the service performs with fewer worker machines during heavy production traffic, and whether it improves as the number of worker machines increases for the same traffic.
- How many misses are observed in a cache layer when a new cold cache host is introduced into a set of warm cache hosts.
We believe collecting such data will provide useful insights into any given machine, and into the Groupon-wide infrastructure as a whole. We would love to share those insights in future blog posts.
— From the Screwdriver team @ Groupon (Adithya Nagarajan, Ajay Vaddadi, Amruta Badami, Kavin Arasu)