Illustration by Rehana Khan & Dylan Wright

Comparison of Image Recognition APIs on food images

Published in Grubhub Bytes, Dec 28, 2017

The media service at Grubhub ingests and manages images for every menu item currently available on the Grubhub platform. These images need to be moderated for prohibited content and quality before they are presented to our diners. Manually moderating the millions of images already on the platform, along with the ones being added every day, is a tedious task. Automating this process saves the manual moderators time, allowing them to focus only on images that the automated process cannot approve.

Thanks to the increase in computational power brought by Graphics Processing Units (GPUs), using neural networks to identify objects in images has become feasible. But deploying and maintaining neural networks in applications is tedious and costly.

Companies such as Google, Amazon, and Microsoft, to name a few, have bundled their image recognition research into Application Programming Interfaces (APIs) so that any software developer can use this technology in applications.

This post compares three popular image recognition APIs: Google Vision API, Amazon Rekognition, and Microsoft Computer Vision API, based on their feature offerings, constraints, and pricing. We will also look at real-world results provided by these APIs for some food images so that we can understand the strengths of each API. Let’s begin!

Feature comparison

The basic features required for object recognition are present in all three APIs. Features like image tagging, explicit content detection, dominant colors detection, and optical character recognition are the most relevant when moderating food images. Additionally, Google Vision can detect similar images on the web, which we can use to detect copyright infringements and identify sources of food images found on the internet. Microsoft Computer Vision API is the only API among the three that provides the video tagging feature. So if your application requires object recognition and analysis of videos, Microsoft’s Computer Vision API would be an easy choice to make.

Pricing comparison

We will compare the prices of some important features at a scale that suits the needs of most businesses: processing 1,000 to 5,000,000 images per month. The prices in the table below are in US dollars per 1,000 images processed.

From a price perspective alone, Amazon Rekognition would prove to be the most cost-effective for mid-sized projects (up to five million images processed per month). At any volume in this range, though, Google Vision API's pricing is not competitive with that of the other two.
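To compare providers at your own volume, the per-1,000-image prices can be turned into a monthly estimate. The sketch below assumes graduated tiers, where each price applies only to the images that fall within that tier. The function name `monthly_cost` and the prices in `EXAMPLE_TIERS` are hypothetical placeholders, not any provider's actual pricing, so substitute the current numbers from each provider's pricing page.

```python
def monthly_cost(images, tiers):
    """Return the estimated cost in USD of processing `images` images in a month.

    `tiers` is a list of (upper_bound, price_per_1000) pairs sorted by
    upper_bound. Each price applies only to the images inside that tier.
    Use float("inf") as the upper bound of the last tier.
    """
    cost = 0.0
    lower = 0
    for upper, price_per_1000 in tiers:
        if images <= lower:
            break
        in_tier = min(images, upper) - lower
        cost += in_tier / 1000 * price_per_1000
        lower = upper
    return cost


# Hypothetical price list for illustration only, NOT real provider pricing:
# $1.00 per 1,000 images for the first million, $0.80 per 1,000 after that.
EXAMPLE_TIERS = [(1_000_000, 1.00), (float("inf"), 0.80)]

if __name__ == "__main__":
    for volume in (1_000, 1_000_000, 5_000_000):
        print(volume, round(monthly_cost(volume, EXAMPLE_TIERS), 2))
```

Running the same function against one tier list per provider gives a like-for-like comparison at the volume you actually expect.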

Image size limits comparison

If your application requires analysis of very large images, Amazon Rekognition should be your API of choice. However, if the images used by your application are hosted on a cloud service that provides resizing, you can work around the image size limits by downsizing the images to acceptable dimensions. Although downsizing reduces image quality, it should not noticeably affect the results returned by the image recognition APIs.
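As a minimal sketch of the downsizing workaround, the helper below computes target dimensions that fit within a maximum side length while preserving the aspect ratio. The name `fit_within` and the limit of 2,000 pixels used in the example are assumptions for illustration; check each API's documentation for its actual limits, which may be expressed in file size as well as pixels.

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) so that neither side exceeds max_side.

    The aspect ratio is preserved. Images that already fit are returned
    unchanged, so nothing is ever upscaled.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Guard against a side rounding down to zero on extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # 2000 is a hypothetical limit, not one taken from any provider.
    print(fit_within(4000, 3000, 2000))
```

The resulting dimensions can then be passed to whatever resizing facility your image host provides.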

Time for some food

Now let’s test the above APIs on some food images and analyse the tags that they provide. We will compare the APIs on image tagging only, because for food images this is the most important feature: it helps in auto-classifying such images.
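Each API returns its tags in a different JSON shape, so comparing them side by side is easier once they are converted to a common form. The sketch below is one way to do that, not the code used for this post. It assumes the label-detection response shapes of the three services: `labelAnnotations` with `description` and `score` for Google, `Labels` with `Name` and a percentage `Confidence` for Amazon, and `tags` with `name` and `confidence` for Microsoft. The function name `normalize_tags` and the sample response are hypothetical.

```python
def normalize_tags(provider, response):
    """Convert a provider's JSON response (already parsed into a dict) into
    a list of (lowercase_tag, confidence) pairs with confidence in [0, 1],
    sorted from most to least confident."""
    if provider == "google":
        # Response of an images:annotate request with LABEL_DETECTION.
        labels = response["responses"][0].get("labelAnnotations", [])
        tags = [(item["description"], item["score"]) for item in labels]
    elif provider == "amazon":
        # Rekognition reports Confidence as a percentage (0 to 100).
        labels = response.get("Labels", [])
        tags = [(item["Name"], item["Confidence"] / 100) for item in labels]
    elif provider == "microsoft":
        tags = [(item["name"], item["confidence"]) for item in response.get("tags", [])]
    else:
        raise ValueError(f"unknown provider: {provider}")
    normalized = [(name.lower(), confidence) for name, confidence in tags]
    return sorted(normalized, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hand-written sample with made-up confidences, not real API output.
    sample = {"Labels": [{"Name": "Food", "Confidence": 90.0},
                         {"Name": "Noodle", "Confidence": 95.0}]}
    print(normalize_tags("amazon", sample))
```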

1) Pad-thai

Google Vision API clearly takes the cake here as it tags the dish by its name, “pad-thai.” It is also able to identify an ingredient in the dish: “noodles.” Rekognition identifies that the image has “bean-sprouts,” which is impressive, but it misses the point because it doesn’t recognize “noodles,” the main ingredient of the dish.

2) Chicken wings

Google Vision API wins again, as it manages to tag the dish as “Buffalo wing.” It also falsely tags it as “General Tso’s chicken,” but the image does bear some resemblance to that dish, so the error is acceptable. Amazon Rekognition comes in a close second, tagging the image with “Barbecue” and “fried food,” while Microsoft Computer Vision API completely misses the mark, as it fails to detect the presence of chicken wings in the image.

3) Fruit salad

All APIs perform satisfactorily on this image. But Microsoft Computer Vision API does slightly better because it tags the fruit salad image as both “fruit” and “salad.” The other APIs tend to give generic labels like “food” and “fruit” for this image.

4) Samosas

The Vision API from Google performs the best here, as it determines the type of cuisine as “Indian.”

Conclusion

From the above analysis, we can conclude that while Amazon Rekognition would be the most cost-effective, Google Vision API gives more granular tags for food images and hence better classification of the images.

The media moderation project at Grubhub automates much of the image moderation process by approving images that meet our expectations and tagging the ones that do not, so that they can be manually inspected at a later date. The moderation process uses the above third-party image recognition APIs to obtain raw insights on the images, which are then validated against custom-built rules.
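To illustrate what validating tags against custom rules can look like, here is a minimal sketch. It is not Grubhub's actual rule set: the tag lists, the 0.8 confidence threshold, and the function name `moderate` are all hypothetical. It assumes the tags have already been reduced to (tag, confidence) pairs with confidence between 0 and 1.

```python
# Hypothetical rule data for illustration only.
BLOCKED_TAGS = {"text", "logo"}   # tags that should send an image to manual review
REQUIRED_TAGS = {"food", "dish"}  # at least one of these must be present


def moderate(tags, min_confidence=0.8):
    """Return ("approved", []) or ("needs_review", reasons).

    `tags` is a list of (tag, confidence) pairs. Tags below `min_confidence`
    are ignored when evaluating the rules.
    """
    confident = {name.lower() for name, confidence in tags
                 if confidence >= min_confidence}
    reasons = []
    blocked = sorted(confident & BLOCKED_TAGS)
    if blocked:
        reasons.append("blocked tags: " + ", ".join(blocked))
    if not confident & REQUIRED_TAGS:
        reasons.append("no food-related tag")
    if reasons:
        return "needs_review", reasons
    return "approved", []


if __name__ == "__main__":
    print(moderate([("food", 0.95), ("noodle", 0.9)]))
    print(moderate([("food", 0.95), ("logo", 0.85)]))
```

Returning the reasons along with the decision gives the manual moderators a starting point when they inspect a flagged image.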

The media moderation project has auto-moderated more than half a million images since its release. The APIs have also been fairly reliable, with an error rate of only around 1% of all the images processed. We continue to experiment with and tweak the custom moderation rules in order to reduce human intervention and reach our goal of fully automated moderation of images on the Grubhub platform.

Do you want to learn more about opportunities with our team? Visit the Grubhub careers page.