Guaranteeing critical microservice actions at Grubhub

Sean Heller · Grubhub Bytes · May 14, 2019

with Samuel Raghunath

At Grubhub, we want you to get your food. We don’t care if a powerline falls on an AWS datacenter, a construction worker accidentally turns off all the running water at the office, or if everyone in the U.S. just finished watching Beyoncé’s Coachella documentary and now wants to order lemonade — through it all, we want you to get your food.

So how do those of us on the technology team try to make that happen? We use a framework library we built that we call the “supervisor framework.” Its job is to guarantee that our microservices finish their actions within a reasonable time.

A brief history of our supervision framework

Back when Seamless and Grubhub were two separate companies, we ran everything out of a single datacenter (DC). When our companies merged, we needed to plan for future growth, so we moved to the cloud and built out a service framework.

Along came more folks ordering more food from more places. With that came more load on our architecture. When it came time to send those orders to restaurants, they’d be sent over to the on-premises datacenter asynchronously. And so there came a time when the orders overwhelmed our system. If our VPN went down for a few minutes or our datacenter had issues, we’d lose orders, leave diners hungry, and hurt restaurants’ business.

These outages were usually fixed quickly by smart people, but we still had orders that didn’t reach restaurants during the outage. And we didn’t want to have to manually fix those orders after the fact.

First of all, just replaying the communication to restaurants can make the problem worse: dumping more traffic on an overwhelmed system can make those outages last longer. Second, it’s not actually what the business wants. Three retries spread over the course of five, ten, or thirty seconds don’t do anything for the business. So we started to rethink how we dealt with outages.

Instead of just thinking in terms of number of retries, we started thinking about the time it takes to get to fulfillment. If we give an order, say, seven minutes to get to fulfillment, we can be more clever in our solutions. It’s a much more tangible way to think about the problem that uses metrics that actually matter to our business and to our hungry diners.

So how do we supervise this fulfillment pipeline over a given period of time?

How we define “success”

We needed to come up with a way to shepherd an asynchronous process to completion over a time period — not a set number of retries — that could smartly back off the system when the pressure overwhelmed it. This “supervision framework” wasn’t just useful for getting orders to restaurants, it was going to be available to any service within the Grubhub ecosystem.

But, for this to do what Grubhub needed, the supervisor had to avoid anything that would make it too brittle:

  • It needed to be hot-hot — active in multiple datacenters in AWS. These datacenters would know that they weren’t the only ones trying to work this job, but they wouldn’t know what the other datacenters were or communicate with them directly.
  • It needed to handle dependencies elegantly, in the case where one datacenter could talk to an external dependency, but another could not.
  • A failure in any of these areas should not affect the overall success of the supervised job.

Our supervisor would be able to retry an older record, but only within a given time frame. And if the work still couldn’t be completed inside that window, we couldn’t just give up on the job, so the supervisor needed a contingency for when the process times out.

At the end of the day, the supervisor would need to ensure that the work got done without introducing additional errors into the process. We’d prefer to have the work attempted twice, thrice, or more, rather than not at all, so any job given to supervision needed to be idempotent. So we set out to build a library that anyone at Grubhub could bootstrap to guarantee their work was completed!
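
To make the idempotency requirement concrete, here’s a minimal, hypothetical sketch of a supervised job that is safe to attempt any number of times. None of these class or method names come from Grubhub’s codebase; the point is simply that a retried attempt must be a no-op rather than a duplicate order.

    // Hypothetical sketch of an idempotent supervised job: sending an order
    // to a restaurant. All type and method names here are illustrative.
    interface OrderStore {
        boolean isAlreadySent(String orderId);
        String loadPayload(String orderId);
        void markSent(String orderId);
    }

    interface RestaurantClient {
        void send(String orderPayload);
    }

    class OrderTransmitter {
        private final OrderStore store;
        private final RestaurantClient restaurant;

        OrderTransmitter(OrderStore store, RestaurantClient restaurant) {
            this.store = store;
            this.restaurant = restaurant;
        }

        // Safe to call any number of times: once the order is marked sent,
        // a retried attempt becomes a no-op instead of a duplicate order.
        void transmit(String orderId) {
            if (store.isAlreadySent(orderId)) {
                return;
            }
            restaurant.send(store.loadPayload(orderId));
            store.markSent(orderId);
        }
    }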

Hark! A wild supervisor appears!

The solution we came up with is a framework based on a time-bucketing strategy. Whenever a service needs to supervise something, it submits an ID. The framework calculates which time-bucket to drop this ID into (based on the current time) and then persists a “supervisor record” for that ID.

When the unit of work is complete, the framework expects the service to call a completion method with this ID. So if the unit of work completes with no problems on the first try, the service “completes” the record. No supervision necessary.
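
In code, that handshake might look something like the sketch below. The Supervisor interface, its method names, and the seven-minute window are assumptions for illustration, not the framework’s actual API; OrderTransmitter is the idempotent job from the earlier sketch.

    import java.time.Duration;

    // Hypothetical client-side view of the supervise/complete handshake.
    interface Supervisor {
        // Persist a supervisor record for this ID in the current time bucket,
        // giving the work the stated window to reach completion.
        void supervise(String workId, Duration timeToComplete);

        // Mark the record complete so it is never treated as stale.
        void complete(String workId);
    }

    class CheckoutService {
        private final Supervisor supervisor;
        private final OrderTransmitter transmitter; // from the earlier sketch

        CheckoutService(Supervisor supervisor, OrderTransmitter transmitter) {
            this.supervisor = supervisor;
            this.transmitter = transmitter;
        }

        void placeOrder(String orderId) {
            // 1. Register the work before attempting it.
            supervisor.supervise(orderId, Duration.ofMinutes(7));

            // 2. Attempt the work inline.
            transmitter.transmit(orderId);

            // 3. On success, complete the record. No supervision necessary.
            supervisor.complete(orderId);
        }
    }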

Every node of our service runs a background thread, and within each datacenter those nodes elect a leader to query the relevant time buckets. If a time bucket holds any record that has not been “completed,” that record is marked “stale.” Once a record goes stale, the framework calls a handler designated by the service. In the example of a failed checkout, a handler tries to process the order and send it to the restaurant again. Likewise, after some time, records time out, and a handler can be registered for that as well.

This gives services enough to build robust systems that can handle temporary outages: if a dependency is unresponsive, a service can retry when the record for that dependency becomes stale. If the work still can’t be completed by the end of the time range, the record is handed to a time-out handler, which can do things like record the work’s details for later manual processing or remediation.

Here’s the basic logic for processing a time-bucket’s records:
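
The sketch below is a rough reconstruction of that logic from the description above. The record store, handler interfaces, and state names are assumed for illustration, not the framework’s real types.

    import java.time.Instant;
    import java.util.List;

    // Rough reconstruction of the bucket-processing loop described above.
    // RecordStore, SupervisorRecord, and the handler types are assumed names.
    class TimeBucketProcessor {
        private final RecordStore recordStore;
        private final StaleHandler staleHandler;
        private final TimeoutHandler timeoutHandler;

        TimeBucketProcessor(RecordStore recordStore,
                            StaleHandler staleHandler,
                            TimeoutHandler timeoutHandler) {
            this.recordStore = recordStore;
            this.staleHandler = staleHandler;
            this.timeoutHandler = timeoutHandler;
        }

        // Run periodically by the elected leader node in each datacenter.
        void processBucket(long bucketId) {
            List<SupervisorRecord> records = recordStore.findIncomplete(bucketId);
            Instant now = Instant.now();

            for (SupervisorRecord record : records) {
                if (now.isAfter(record.deadline())) {
                    // The time window is exhausted: hand the record to the
                    // time-out handler (e.g. for manual remediation).
                    recordStore.markTimedOut(record);
                    timeoutHandler.onTimeout(record);
                } else {
                    // Still inside its window but not completed: mark it stale
                    // and let the service's handler retry the work.
                    recordStore.markStale(record);
                    staleHandler.onStale(record);
                }
            }
        }
    }

    interface RecordStore {
        List<SupervisorRecord> findIncomplete(long bucketId);
        void markStale(SupervisorRecord record);
        void markTimedOut(SupervisorRecord record);
    }

    interface SupervisorRecord {
        String workId();
        Instant deadline();
    }

    interface StaleHandler {
        void onStale(SupervisorRecord record);
    }

    interface TimeoutHandler {
        void onTimeout(SupervisorRecord record);
    }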

Supervising across datacenters

Sometimes, the issue that caused a failure is the datacenter. We use multiple datacenters to maintain high availability. Our supervision framework relies on our distributed data store — Cassandra — which means any supervision record replicates to all other datacenters. If that replication fails, then some datacenters may not be able to work records or may end up with a “split brain,” where a supervision record is in different states in different places. Again, ensuring that all supervised processes are idempotent helps with this.

The original datacenter gets the first crack at working the record, at least until a “quiet period” — again, configurable per service — ends. In the case where the original datacenter doesn’t complete the job, each datacenter’s processing uses a random offset to better distribute the workload.
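
A sketch of how that scheduling might work is below. The quiet period, field names, and the size of the random offset are all assumptions, not the framework’s actual behavior.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative sketch of cross-datacenter scheduling: the originating
    // datacenter works the record first; the others wait out a configurable
    // quiet period plus a random offset. Names and values are assumptions.
    class CrossDatacenterPolicy {
        private final String localDatacenter;
        private final Duration quietPeriod; // configurable per service

        CrossDatacenterPolicy(String localDatacenter, Duration quietPeriod) {
            this.localDatacenter = localDatacenter;
            this.quietPeriod = quietPeriod;
        }

        // How long this datacenter should wait before working a record.
        Duration delayBeforeWorking(String originDatacenter, Instant recordCreatedAt) {
            if (localDatacenter.equals(originDatacenter)) {
                // The originating datacenter gets the first crack immediately.
                return Duration.ZERO;
            }
            // Other datacenters wait out whatever is left of the quiet period,
            // then add a random offset so they don't all pile on at once.
            Duration elapsed = Duration.between(recordCreatedAt, Instant.now());
            Duration remainingQuiet = quietPeriod.minus(elapsed);
            Duration base = remainingQuiet.isNegative() ? Duration.ZERO : remainingQuiet;
            long jitterSeconds = ThreadLocalRandom.current().nextLong(0, 60);
            return base.plusSeconds(jitterSeconds);
        }
    }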

Best and worst practices for supervision

The supervisor framework isn’t free: it takes up additional resources. If there’s an outage, the supervisor can add load to the system once it comes back online, both in Cassandra and on the nodes that start working the time buckets. If we haven’t load tested the system, we could very well cause another outage with the supervised calls. For example, if an outage leaves an API unavailable for six minutes, and that API receives a thousand requests a minute, we have 6,000 additional requests hitting it alongside regular traffic once it recovers.

As with any complex system, we need to set up monitoring and logging. Our framework provides all this.

On the other side of the coin, engineers need to be mindful of how they use supervision. We don’t want to supervise every request from every service. Some requests can just fail gracefully. And we don’t want the supervisor to kick in outside of failure scenarios; it’s not a job scheduler. Most of the time, a service has a piece of work it needs to do, submits it to the supervisor, does the work successfully, and then completes the record with the supervisor. The supervisor doesn’t have to do anything. This is the ideal case.

To make logging easier, our supervisor takes an ID from the requesting service. This is an arbitrary string, so some folks think that means they can use it as a data dump, encoding metadata into a JSON-ified string. It’s pretty clever, actually, but it turns out to be really bad: it puts unnecessary strain on our backend, especially in the outage scenarios where supervision comes into play. The only reason we know about it is that we did it ourselves once, and boom, we killed Cassandra.

Finally, for anyone supervising work, make sure your handler functions don’t do any long-running or heavy work. It’ll back your queues up, so a process you expect to be completed in two minutes is actually completed in 20. Push those processes onto an asynchronous path, as in the sketch below, and let the framework do the work it’s designed to do.
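
For instance, a stale handler can hand the retry off to its own thread pool and return right away. This sketch reuses the hypothetical types from the earlier examples.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Keeping a stale handler lightweight by pushing heavy work onto a
    // separate executor. StaleHandler, SupervisorRecord, and OrderTransmitter
    // are the assumed types from the earlier sketches.
    class AsyncStaleHandler implements StaleHandler {
        private final ExecutorService retryPool = Executors.newFixedThreadPool(4);
        private final OrderTransmitter transmitter;

        AsyncStaleHandler(OrderTransmitter transmitter) {
            this.transmitter = transmitter;
        }

        @Override
        public void onStale(SupervisorRecord record) {
            // Return immediately so the framework's bucket-processing loop
            // isn't blocked; the actual retry runs on our own thread pool.
            retryPool.submit(() -> transmitter.transmit(record.workId()));
        }
    }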

How this can help you

At Grubhub, we’ve saved ourselves many times with our supervisor framework. As developers, we know there’s no better feeling than engineering something that doesn’t page us later. We hope this post sparks some ideas about how you consider production issues — maybe you can adopt a supervisor design pattern, too!

Do you want to learn more about opportunities with our team? Visit the Grubhub careers page.
