
Computer Vision: What Is It?


Artificial intelligence is all the buzz. A recent article by Jon Schuppe of NBC News asks if we are ready for police to use facial recognition software to track our every move. Well, ready or not, it's already starting to happen. The Maryland Image Repository System was used to identify the Annapolis Capital Gazette shooter. The shooter’s image was captured on a security camera, and software was used to match that image to the repository, which includes driver’s license photos as well as state and federal mug shots.

 

Facial recognition is just one of many uses of computer vision. Computer vision extracts meaningful information from video or still images. But how exactly does a computer see an image?

 

Below is a color photo from SAS Global Forum 2018. (Guess how many GEL team members are in the image below. Bonus points if you can name them!)

 

2cv.png


 

A computer “sees” an image as an array of pixel values. For simplicity’s sake, I am just showing 5 x 5 pixels below. Each pixel has three (color) intensity values, each of which corresponds to a channel (red, green, and blue). These values range from 0 to 255 (8-bit unsigned integers). These numbers are the inputs to computer vision algorithms.

 

3cv.png
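To make this concrete, here is a minimal sketch in Python (using the Pillow and NumPy libraries, not SAS) of how a photo becomes an array of numbers. The file name photo.jpg is just a placeholder.

```python
import numpy as np
from PIL import Image

# Load a color photo (the file name is a placeholder) as a NumPy array.
img = Image.open("photo.jpg").convert("RGB")
pixels = np.array(img)        # shape: (height, width, 3), dtype: uint8

print(pixels.shape)           # e.g., (1080, 1920, 3)
print(pixels.dtype)           # uint8 -- each value is 0 to 255

# A 5 x 5 patch from the upper-left corner: 25 pixels, each with an
# intensity value for the red, green, and blue channels.
print(pixels[:5, :5, :])
```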

 

In an image classification task, a neural network evaluates an input image and produces an output that indicates the relative likelihood that the image belongs to each of several pre-defined categories (a small sketch of this follows the list below). For example,

  • human with a probability of 0.85
  • life preserver with a probability of 0.12
  • water with a probability of 0.01
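
As a small illustration (made-up numbers, not SAS code), the network's final layer typically produces one raw score per category, and a softmax turns those scores into probabilities like the ones listed above.

```python
import numpy as np

categories = ["human", "life preserver", "water"]
raw_scores = np.array([4.2, 2.3, -0.2])   # made-up outputs of the network's last layer

# Softmax converts raw scores into probabilities that sum to 1.
probabilities = np.exp(raw_scores) / np.exp(raw_scores).sum()

for category, p in zip(categories, probabilities):
    print(f"{category}: {p:.2f}")         # roughly 0.86, 0.13, 0.01
```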

Grayscale vision works similarly. Grayscale images are created from color images by calculating a linear combination of the red, green, and blue intensity values. Here we see the grayscale version, where each pixel is a single intensity number from 0 to 255.

 

4cv.png
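As a sketch, one common weighting is the ITU-R BT.601 luma formula (other weightings exist); this assumes pixels is the RGB array from the earlier snippet.

```python
import numpy as np

# Linear combination of the red, green, and blue channels. These particular
# weights are one common convention (ITU-R BT.601), not the only choice.
weights = np.array([0.299, 0.587, 0.114])
gray = (pixels.astype(np.float32) @ weights).round().astype(np.uint8)

print(gray.shape)       # (height, width) -- one intensity value per pixel
print(gray[:5, :5])     # 5 x 5 patch of values from 0 to 255
```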

 

The 0 to 255 values can also be rescaled to numbers between 0 and 1, creating an input array of values between 0 and 1. Note, however, that scaling the pixels to a 0 to 1 range requires more storage (32 or 64 bits per pixel rather than the 8 bits needed for the 0 to 255 intensity scale), so it is more efficient to use the 0 to 255 intensity numbers as the inputs.

 

grayscale.png
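A quick sketch of that storage trade-off, again assuming the gray array from the previous snippet:

```python
import numpy as np

# Rescale the 0-255 intensities to the 0-1 range.
scaled = gray.astype(np.float32) / 255.0

print(gray.dtype, gray.nbytes)       # uint8   -> 1 byte per pixel
print(scaled.dtype, scaled.nbytes)   # float32 -> 4 bytes per pixel (float64 would be 8)
```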

 

Edges can be detected where there is an abrupt change in the values of neighboring pixels.

 

edge.png
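A very crude way to see this is to compare each pixel with its neighbor; real edge detectors (Sobel, Canny, etc.) are more sophisticated, but the idea is the same. This sketch again assumes the gray array from above, and the threshold of 50 is arbitrary.

```python
import numpy as np

g = gray.astype(np.int16)           # avoid uint8 wrap-around when subtracting

# Differences between horizontally and vertically neighboring pixels.
dx = np.abs(np.diff(g, axis=1))     # change from one column to the next
dy = np.abs(np.diff(g, axis=0))     # change from one row to the next

# Large differences mark abrupt changes in intensity, i.e., likely edges.
edges = (dx[:-1, :] > 50) | (dy[:, :-1] > 50)
```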

 

Computer vision tasks include semantic segmentation, classification and localization, object detection, and instance segmentation. SAS Viya 3.4 supports classification and object detection. Future releases will support semantic segmentation and instance segmentation.

 

The graphic below is based on a slide from Xindian Long, which in turn borrows from Fei-Fei Li, Serena Yeung, and Justin Johnson’s Deep Learning class.

 

BethRevisedPhoto1.png

 

Semantic Segmentation works pixel by pixel. It examines each pixel and determines which object category it belongs to. For example, a pixel may be given the object category of human, toy, life jacket, water, etc. Human 1 and Human 2 will both be labeled with the same object category, i.e., human.

 

BethRevisedPhoto2.png
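In other words, the result of semantic segmentation can be thought of as a second array with the same height and width as the image, holding one category ID per pixel. A tiny made-up example:

```python
import numpy as np

categories = {0: "water", 1: "human", 2: "life preserver"}

# Made-up 5 x 5 label map: one category ID per pixel.
label_map = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

# All "human" pixels share the same ID -- semantic segmentation does not
# distinguish Human 1 from Human 2.
print(categories[label_map[2, 2]])   # life preserver
```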

 

A highly accurate neural network might even be able to distinguish a toy from a life preserver. Can you?

 

BethRevisedPhoto3.png

 

9cv.png

 

PSA:  Pool toys, water wings, floating rings, etc. are not life preservers!  They require touch supervision for a child who cannot swim.  That means a responsible adult or older child who can swim should be touching or immediately ready to grab the child while they are using these toys.  Children can turn upside down, etc. with these toys, so do not let them lull you into complacency.  In contrast, a life preserver will turn the human onto their back and support their head, allowing them to breathe, even if unconscious.

 

Classification and Localization deals with a single object. Given an image, what is its main content? Under the assumption that there is only one main object in the image, the task is to determine the bounding box of that single object, i.e., where that object is in the image.

 

BethRevisedPhoto4.png

 

Object Detection handles multiple objects. It finds the bounding box of each object and the category of the object in that bounding box.  Below we see two foolish humans who are not wearing any personal flotation devices on their stand-up paddleboards.

 

11cv.png
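The result of an object detector is typically a list of bounding boxes, each with a category and a confidence score. The coordinates and scores below are made up for illustration.

```python
# Each detection: a category, a confidence score, and a bounding box given as
# (x_min, y_min, x_max, y_max) in pixel coordinates. All values are made up.
detections = [
    {"category": "human",        "confidence": 0.91, "box": (120,  60, 310, 480)},
    {"category": "human",        "confidence": 0.88, "box": (520,  80, 700, 470)},
    {"category": "paddle board", "confidence": 0.76, "box": ( 90, 420, 760, 560)},
]

for d in detections:
    print(f'{d["category"]} ({d["confidence"]:.2f}) at {d["box"]}')
```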

 

Instance Segmentation handles multiple objects AND works pixel by pixel. It combines semantic segmentation and object detection: it labels every pixel and also distinguishes which individual object each pixel belongs to.

 

12cv.png
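Compared with the semantic segmentation label map sketched earlier, an instance segmentation result also records which individual object each pixel belongs to. Another tiny made-up example:

```python
import numpy as np

# Same idea as the semantic label map, but with a second array of instance IDs,
# so the two humans get different IDs (0 = background; all values made up).
category_map = np.array([
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
instance_map = np.array([
    [1, 1, 0, 2, 2],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
# category_map says "these pixels are human"; instance_map says which of those
# pixels belong to Human 1 and which belong to Human 2.
```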

 

So how do we get from an array of numbers to labeled objects and pixels? This is where our deep learning algorithms come into play!

Neural Networks

Fb.png

Neural networks have become very popular in the field of computer vision ever since Alex Krizhevsky used them to win the 2012 ImageNet classification challenge.  Using a convolutional neural network, he reduced the classification error from about 26 percent to about 15 percent!  Now convolutional neural networks are used all over the place for computer vision tasks, for example, Google photo searches and Facebook tagging.

 

Following is a very simple outline of how a neural network works, using a grayscale example (a bare-bones sketch of the arithmetic appears after this list).

  • Each neuron in the first layer holds the grayscale number for a pixel (e.g., 155, 22, etc.), called the activation number.
  • Each neuron in the last layer of 10 neurons holds an activation number (between 0 and 1, such as 0.25, 0.97, etc.) that indicates the likelihood that a given image corresponds with a given category (human, toy, water).
  • Between the first and last layers are the hidden layers, where the magic happens.
    • A weight is assigned to each connection from the first layer to the second layer, and so on.
    • The weighted sum of all activations from the previous layer is transformed into a number between 0 and 1 using an activation function such as the sigmoid or the rectified linear unit (ReLU).
    • A bias can be set, such that a neuron activates only when the weighted sum is greater than a given number, or threshold.
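
Here is the bare-bones arithmetic sketch promised above (Python, one layer only, with random numbers standing in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# 784 activation numbers for a 28 x 28 grayscale image, rescaled to 0-1.
first_layer = rng.integers(0, 256, size=784) / 255.0

# Randomly initialized weights and biases for a 16-neuron second layer;
# training would adjust these values.
weights = rng.normal(size=(16, 784))
biases = rng.normal(size=16)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weighted sum plus bias, squashed to a value between 0 and 1.
second_layer = sigmoid(weights @ first_layer + biases)
print(second_layer.shape)   # (16,)
```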

As you can imagine, the number of parameters (weights and biases) starts adding up really quickly! Even if you started with only 784 pixels, and had two hidden layers of 16 neurons each, and 10 categories of outputs, you would end up with 13,002 parameters!
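
That 13,002 figure is just counting connections and biases: 784 × 16 + 16 × 16 + 16 × 10 weights plus 16 + 16 + 10 biases. A quick check:

```python
layer_sizes = [784, 16, 16, 10]

weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))   # 12,960
biases = sum(layer_sizes[1:])                                        # 42
print(weights + biases)                                              # 13,002
```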

 

13cv.png

 

The graphic above is from a very accessible 19-minute video by 3Blue1Brown that demonstrates in more detail how neural networks work. It is definitely worth your time if you are interested in neural networks and don’t quite understand them.

 

The neural network is then trained on as many labeled images as possible (supervised learning), improving its accuracy at correctly identifying images.

 

A specialized type of neural network called a Convolutional Neural Network (CNN) is commonly used for image processing.  Convolutional neural networks include convolution and pooling layers.  The simple graphic below from the Asimov Institute illustrates this.

 

CNN-1.png
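For illustration only (using the Keras API rather than SAS Viya, and with made-up layer sizes), a minimal network with the convolution and pooling layers shown in the graphic might look like this:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal CNN sketch: convolution and pooling layers extract spatial features,
# then dense layers turn those features into category probabilities.
model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),            # color image: height, width, RGB
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),       # e.g., human / toy / water
])
model.summary()
```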

 

References and More Information

Thank you to Xindian Long for her excellent presentation. Thank you to Ethem Can for technical edits.

