Even before there was such a thing as a digital computer, science fiction authors and other dreamers imagined sentient machines that could see and understand what they saw. Just the “seeing” part—identifying objects in an image—has proven to be one of the thorniest problems in computing. However, with recent advances in machine learning, we are now a step closer to realizing that goal: Google has released its Cloud Vision API for developers everywhere to use.
The Cloud Vision system uses deep-learning algorithms to perform several advanced image-processing tasks:
- Identify objects in an image: Google claims that it has taught its system to identify thousands of different objects, and that it can identify multiple objects in the same image. In particular, the system recognizes logos and famous landmarks.
- Recognize text in an image: The system’s optical character recognition algorithms can extract text in any of several major languages.
- Identify sentiment: Although the system cannot (yet) match a face with its owner’s name, it can make a pretty good guess at a person’s emotional state based on the facial features shown in the image.
- Identify inappropriate content: The system can classify adult and violent content in images.
The computational heavy lifting is performed on Google’s cloud service, so the system can be easily integrated into any web or mobile app.
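To make that concrete, here is a minimal sketch of how a client might assemble a request for the Vision API’s REST endpoint, covering the four capabilities listed above. The image bytes are a placeholder, and details such as `maxResults` values are illustrative assumptions, not prescribed settings:

```python
import base64
import json

# Placeholder image data; a real app would read a file or an upload.
image_bytes = b"<binary image data>"

def build_annotate_request(image_bytes, feature_types):
    """Build the JSON body for a Cloud Vision images:annotate call."""
    return {
        "requests": [{
            # The API expects the image content base64-encoded.
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": t, "maxResults": 10} for t in feature_types],
        }]
    }

body = build_annotate_request(image_bytes, [
    "LABEL_DETECTION",        # identify objects
    "TEXT_DETECTION",         # extract text (OCR)
    "FACE_DETECTION",         # facial features / sentiment
    "SAFE_SEARCH_DETECTION",  # adult and violent content
])

# This body would be POSTed to the Vision endpoint, e.g.:
#   https://vision.googleapis.com/v1/images:annotate?key=YOUR_API_KEY
print(json.dumps(body, indent=2))
```

Because the request is plain JSON over HTTPS, the same payload works from a web backend, a mobile app, or a command-line tool.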
The possibilities are practically without limit. You could, for example, create applications that:
- Automatically flag inappropriate visual content in your blog’s public comments (no more checking and releasing each post manually!)
- Identify and filter all-image spam email
- Gauge audience reactions to a performance
- Automatically tag collections of photos
You get the idea.
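As one illustration, the content-flagging idea above could be driven by the SafeSearch annotation in a Vision API response. The API reports likelihoods as graded strings rather than probabilities; the threshold and the helper function below are assumptions for the sketch, not part of the API itself:

```python
# Likelihood values used by the Vision API's SafeSearch annotation,
# ordered from least to most likely.
LIKELIHOODS = ["VERY_UNLIKELY", "UNLIKELY", "POSSIBLE", "LIKELY", "VERY_LIKELY"]

def should_flag(safe_search_annotation, threshold="LIKELY"):
    """Return True if adult or violent content meets the threshold."""
    limit = LIKELIHOODS.index(threshold)
    return any(
        LIKELIHOODS.index(safe_search_annotation.get(field, "VERY_UNLIKELY")) >= limit
        for field in ("adult", "violence")
    )

# A parsed response fragment (hypothetical values for illustration):
annotation = {"adult": "VERY_UNLIKELY", "violence": "LIKELY"}
print(should_flag(annotation))  # → True
```

A blog platform could run this check on every image in a submitted comment and hold flagged posts for human review instead of publishing them immediately.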
Google Cloud Vision API: Better with Time
One of the most intriguing features of the Cloud Vision API is that it improves with use. Because it is based on machine learning, its repertoire of recognizable objects will grow and its accuracy will improve as more applications feed images through the system.
What does the future hold? One obvious next step would be extending the system’s capabilities from still photos to full-motion video. Obviously, this involves quite a bit more data, so processing recorded video will probably come before real-time analysis of live video feeds. Once that happens, though, there will be essentially no limit to what machine vision systems can do.
Machines’ inability to understand and react to visual data has long held them back from the level of automation (and autonomy) that humans have long dreamed of. We now appear to be on the verge of turning that corner. Strap in—it’s going to be an exciting ride!