365 days of A/B testing at Grubhub

Posted by Sudev Balakrishnan on

Grubhub’s newly formed product team develops the features that help fulfill over 290,000 orders every day. With a laser focus on user experience, we made significant strides in 2016. Drawing on what we learned from our massive troves of customer data, we started testing changes to make sure they offer the best possible experience for our customers.

I am often asked about Grubhub’s position on testing. “Is it always better to test more?” “How do you pick tests?” “What is the difference between 10, 1,000 and 10,000 tests?” “Do you like multivariate tests?” So I decided to use this platform to answer everyone’s questions at once – we process over 290,000 orders every day, so efficiency is key, right?

Having an obsession with data is necessary to make product improvements. Product teams get a lot of “million dollar ideas” thrown at them. These ideas come from both inside and outside of the company. The fact of the matter is, anything that is measurable stands a much better chance of improvement, and things that improve are better for the company. Period.

Here’s a rundown of how we test new features at Grubhub:

We’re equal opportunity testers here. Every SINGLE feature that gets released on our site has to follow a process we call Daily Average Grubs (DAG) testing. DAGs are essentially the same as commerce orders, but we always think of our consumers as “Diners” and orders as “Grubs”. Our testing platforms span our web and mobile products, as well as our in-house restaurant technologies.

We pride ourselves on going beyond immediate conversion metrics and using Lifetime Value (LTV) as the bar that our tests have to pass. This sets the product testing hurdle a few notches higher, as you have to weigh immediate gratification against creating longer-term value. The framework for analyzing our test results is homegrown and provides for sophisticated analysis of tests by diner segments.

Testing is a science, but when done correctly, it can be a work of art. Three things dictate your testing strategy the most: the maturity of your industry, the competency of your technology and the amount of traffic you have.

Maturity: A new company’s products have likely had no A/B testing. If anything, they could have some qualitative tests and feature releases. The eventual destination is a mature product where the majority of return in innovation is in small increments.

These mature products must go for scale in testing as a badge of honor. 1,000 tests are better than 100, and if you had 10,000, well, I profess envy for your traffic profiles. A more open space for feature development can skew the ratio towards step function improvements versus incremental improvements. The tradeoff with step function improvements is that they typically are heavier and slower. There are mechanisms to decrease risk and create low-cost versions, like painted door tests, but the test is likely still heavy on product and technology groups. The high-volume strategy of incremental testing is akin to the Venture Capital mindset, where the sheer volume of creative innovation outweighs the risk on any single bet. The step function strategy is more like Warren Buffett, who has strict principles for value investing.

Remember, it’s not testing that makes you win, but the features themselves. If your industry has space for feature innovation (you’ll find many do!), spend time on strategies to pick where to test in addition to scale. Qualitative research is an excellent way to find hypotheses that increase the return on quantitative tests.

The competency of your technology: This is an overarching variable. Is your app implementation server side or client side? Is your testing framework client or server side? Do you have continuous deployment and a multiple build framework? How componentized and patternized is your site? All of these will affect your ability to deploy tests at scale.

Grubhub has a reasonably scaled framework and deploys 200-300 tests every year. Our amazing teams get faster every day.

Traffic profile: A test’s duration tends to be inversely related to your traffic volume, and it will get shorter or longer depending on the type of feedback you are looking for. You might want immediate feedback metrics like conversion or clicks, or you could be looking for delayed metrics like lifetime value. Perhaps you have tests that need traffic that is cordoned off (non-overlapping segments), which increases the amount of traffic you need.

There are no simple answers. You need to assess where your organization is in its development and put the right kind of testing structure in place. Qualitative testing is just as useful for developing hunches on where to test.

This year is every bit as packed as last year for testing. We are extending our testing frameworks to products beyond our diners and looking for new ways to create great features for our restaurant partners and delivery drivers.

We have a fantastic product team at GH that works on amazing products that our diners love to use, and we are always looking for more ways to fit into their everyday lives.

Creative teams thrive off of creative freedom, but this freedom comes with a price. For product managers, the price of freedom is usually called A/B testing.