
Computer Vision: What Is It?


Artificial intelligence is all the buzz. A recent article by Jon Schuppe of NBC News asks if we are ready for police to use facial recognition software to track our every move. Well, ready or not, it's already starting to happen. The Maryland Image Repository System was used to identify the Annapolis Capital Gazette shooter. The shooter’s image was captured on a security camera, and software was used to match that image to the repository, which includes driver’s license photos as well as state and federal mug shots.

 

Facial recognition is just one of many uses of computer vision. Computer vision extracts meaningful information from video or still images. But how exactly does a computer see an image?

 

Below is a color photo from SAS Global Forum 2018. (Guess how many GEL team members are in the image below. Bonus points if you can name them!)

 

2cv.png


 

A computer “sees” an image as an array of pixel values. For simplicity’s sake, I am just showing 5 x 5 pixels below. Each pixel has three (color) intensity values, each of which corresponds to a channel (red, green, and blue). These values range from 0 to 255 (8-bit unsigned integers). These numbers are the inputs to computer vision algorithms.

 

3cv.png
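To make this concrete, here is a minimal sketch in Python (using the Pillow and NumPy libraries, not SAS) of how a photo becomes an array of numbers. The file name photo.jpg is just a placeholder.

```python
import numpy as np
from PIL import Image

# Load a color photo (the file name is a placeholder) as a NumPy array.
img = Image.open("photo.jpg").convert("RGB")
pixels = np.array(img)        # shape: (height, width, 3), dtype: uint8

print(pixels.shape)           # e.g., (1080, 1920, 3)
print(pixels.dtype)           # uint8 -- each value is 0 to 255

# A 5 x 5 patch from the upper-left corner: 25 pixels, each with an
# intensity value for the red, green, and blue channels.
print(pixels[:5, :5, :])
```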

 

In an image classification task, a neural network evaluates an input image and produces an output that indicates the relative likelihood that the image belongs to each of several pre-defined categories (a small sketch of this follows the list below). For example,

  • human with a probability of 0.85
  • life preserver with a probability of 0.12
  • water with a probability of 0.01
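
As a small illustration (made-up numbers, not SAS code), the network's final layer typically produces one raw score per category, and a softmax turns those scores into probabilities like the ones listed above.

```python
import numpy as np

categories = ["human", "life preserver", "water"]
raw_scores = np.array([4.2, 2.3, -0.2])   # made-up outputs of the network's last layer

# Softmax converts raw scores into probabilities that sum to 1.
probabilities = np.exp(raw_scores) / np.exp(raw_scores).sum()

for category, p in zip(categories, probabilities):
    print(f"{category}: {p:.2f}")         # roughly 0.86, 0.13, 0.01
```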

Grayscale vision works similarly. Grayscale images are created from color images by calculating a linear combination of the red, green, and blue intensity values. Here we see the grayscale version, where each pixel is a single intensity number from 0 to 255.

 

4cv.png
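As a sketch, one common weighting is the ITU-R BT.601 luma formula (other weightings exist); this assumes pixels is the RGB array from the earlier snippet.

```python
import numpy as np

# Linear combination of the red, green, and blue channels. These particular
# weights are one common convention (ITU-R BT.601), not the only choice.
weights = np.array([0.299, 0.587, 0.114])
gray = (pixels.astype(np.float32) @ weights).round().astype(np.uint8)

print(gray.shape)       # (height, width) -- one intensity value per pixel
print(gray[:5, :5])     # 5 x 5 patch of values from 0 to 255
```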

 

The 0 to 255 values can also be rescaled to numbers between 0 and 1, creating an input array of values between 0 and 1. Note, however, that scaling the pixels to a 0 to 1 range requires more storage (32 or 64 bits per pixel rather than the 8 bits needed for the 0 to 255 intensity scale), so it is more efficient to use the 0 to 255 intensity numbers as the inputs.

 

grayscale.png
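A quick sketch of that storage trade-off, again assuming the gray array from the previous snippet:

```python
import numpy as np

# Rescale the 0-255 intensities to the 0-1 range.
scaled = gray.astype(np.float32) / 255.0

print(gray.dtype, gray.nbytes)       # uint8   -> 1 byte per pixel
print(scaled.dtype, scaled.nbytes)   # float32 -> 4 bytes per pixel (float64 would be 8)
```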

 

Edges can be detected where there is an abrupt change in the values of neighboring pixels.

 

edge.png
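A very crude way to see this is to compare each pixel with its neighbor; real edge detectors (Sobel, Canny, etc.) are more sophisticated, but the idea is the same. This sketch again assumes the gray array from above, and the threshold of 50 is arbitrary.

```python
import numpy as np

g = gray.astype(np.int16)           # avoid uint8 wrap-around when subtracting

# Differences between horizontally and vertically neighboring pixels.
dx = np.abs(np.diff(g, axis=1))     # change from one column to the next
dy = np.abs(np.diff(g, axis=0))     # change from one row to the next

# Large differences mark abrupt changes in intensity, i.e., likely edges.
edges = (dx[:-1, :] > 50) | (dy[:, :-1] > 50)
```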

 

Computer vision tasks include semantic segmentation, classification and localization, object detection, and instance segmentation. SAS Viya 3.4 supports classification and object detection. Future releases will support semantic segmentation and instance segmentation.

 

The graphic below is based on a slide from Xindian Long, which in turn borrows from Fei-Fei Li, Serena Yeung, and Justin Johnson’s Deep Learning class.

 

BethRevisedPhoto1.png

 

Semantic Segmentation works pixel by pixel. It examines each pixel and determines which object category it belongs to. For example, a pixel may be given the object category of human, toy, life jacket, water, etc. Human 1 and Human 2 will both be labeled with the same object category, i.e., human.

 

BethRevisedPhoto2.png
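In other words, the result of semantic segmentation can be thought of as a second array with the same height and width as the image, holding one category ID per pixel. A tiny made-up example:

```python
import numpy as np

categories = {0: "water", 1: "human", 2: "life preserver"}

# Made-up 5 x 5 label map: one category ID per pixel.
label_map = np.array([
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 2, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
])

# All "human" pixels share the same ID -- semantic segmentation does not
# distinguish Human 1 from Human 2.
print(categories[label_map[2, 2]])   # life preserver
```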

 

A highly accurate neural network might even be able to distinguish a toy from a life preserver. Can you?

 

BethRevisedPhoto3.png

 

9cv.png

 

PSA:  Pool toys, water wings, floating rings, etc. are not life preservers!  They require touch supervision for a child who cannot swim.  That means a responsible adult or older child who can swim should be touching or immediately ready to grab the child while they are using these toys.  Children can turn upside down, etc. with these toys, so do not let them lull you into complacency.  In contrast, a life preserver will turn the human onto their back and support their head, allowing them to breathe, even if unconscious.

 

Classification and Localization deals with a single object. Given an image, what is its main content? Under the assumption that there is only one main object in the image, the task is to determine the bounding box of that single object, i.e., where that object is in the image.

 

BethRevisedPhoto4.png

 

Object Detection handles multiple objects. It finds the bounding box of each object and the category of the object in that bounding box.  Below we see two foolish humans who are not wearing any personal flotation devices on their stand-up paddleboards.

 

11cv.png
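The result of an object detector is typically a list of bounding boxes, each with a category and a confidence score. The coordinates and scores below are made up for illustration.

```python
# Each detection: a category, a confidence score, and a bounding box given as
# (x_min, y_min, x_max, y_max) in pixel coordinates. All values are made up.
detections = [
    {"category": "human",        "confidence": 0.91, "box": (120,  60, 310, 480)},
    {"category": "human",        "confidence": 0.88, "box": (520,  80, 700, 470)},
    {"category": "paddle board", "confidence": 0.76, "box": ( 90, 420, 760, 560)},
]

for d in detections:
    print(f'{d["category"]} ({d["confidence"]:.2f}) at {d["box"]}')
```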

 

Instance Segmentation handles multiple objects AND works pixel by pixel. It combines semantic segmentation and object detection: it labels every pixel and also distinguishes which individual object each pixel belongs to.

 

12cv.png
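Compared with the semantic segmentation label map sketched earlier, an instance segmentation result also records which individual object each pixel belongs to. Another tiny made-up example:

```python
import numpy as np

# Same idea as the semantic label map, but with a second array of instance IDs,
# so the two humans get different IDs (0 = background; all values made up).
category_map = np.array([
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
instance_map = np.array([
    [1, 1, 0, 2, 2],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
])
# category_map says "these pixels are human"; instance_map says which of those
# pixels belong to Human 1 and which belong to Human 2.
```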

 

So how do we get from an array of numbers to labeled objects and pixels? This is where our deep learning algorithms come into play!

Neural Networks

Fb.png

Neural networks have become very popular in the field of computer vision ever since Alex Krizhevsky used them to win the 2012 ImageNet classification challenge.  Using a convolutional neural network, he reduced the classification error from about 26 percent to about 15 percent!  Now convolutional neural networks are used all over the place for computer vision tasks, for example, Google photo searches and Facebook tagging.

 

Following is a very simple outline of how a neural network works, using a grayscale example (a bare-bones sketch of the arithmetic appears after this list).

  • Each neuron in the first layer holds the grayscale number for a pixel (e.g., 155, 22, etc.), called the activation number.
  • Each neuron in the last layer of 10 neurons holds an activation number (between 0 and 1, such as 0.25, 0.97, etc.) that indicates the likelihood that a given image corresponds with a given category (human, toy, water).
  • Between the first and last layers are the hidden layers, where the magic happens.
    • A weight is assigned to each connection from the first layer to the second layer, and so on.
    • The weighted sum of all activations from the previous layer is transformed into a number between 0 and 1 using an activation function such as the sigmoid or the rectified linear unit (ReLU).
    • A bias can be set, such that a neuron activates only when the weighted sum is greater than a given number, or threshold.
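
Here is the bare-bones arithmetic sketch promised above (Python, one layer only, with random numbers standing in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# 784 activation numbers for a 28 x 28 grayscale image, rescaled to 0-1.
first_layer = rng.integers(0, 256, size=784) / 255.0

# Randomly initialized weights and biases for a 16-neuron second layer;
# training would adjust these values.
weights = rng.normal(size=(16, 784))
biases = rng.normal(size=16)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weighted sum plus bias, squashed to a value between 0 and 1.
second_layer = sigmoid(weights @ first_layer + biases)
print(second_layer.shape)   # (16,)
```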

As you can imagine, the number of parameters (weights and biases) starts adding up really quickly! Even if you started with only 784 pixels, and had two hidden layers of 16 neurons each, and 10 categories of outputs, you would end up with 13,002 parameters!
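
That 13,002 figure is just counting connections and biases: 784 × 16 + 16 × 16 + 16 × 10 weights plus 16 + 16 + 10 biases. A quick check:

```python
layer_sizes = [784, 16, 16, 10]

weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))   # 12,960
biases = sum(layer_sizes[1:])                                        # 42
print(weights + biases)                                              # 13,002
```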

 

13cv.png

 

The graphic above is from a very accessible 19-minute video by 3Blue1Brown that demonstrates in more detail how neural networks work. It is definitely worth your time if you are interested in neural networks and don’t quite understand them.

 

The neural network is then trained on as many labeled images as possible (supervised learning), improving its accuracy at correctly identifying images.

 

A specialized type of neural network called a Convolutional Neural Network (CNN) is commonly used for image processing.  Convolutional neural networks include convolution and pooling layers.  The simple graphic below from the Asimov Institute illustrates this.

 

CNN-1.png
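For illustration only (using the Keras API rather than SAS Viya, and with made-up layer sizes), a minimal network with the convolution and pooling layers shown in the graphic might look like this:

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal CNN sketch: convolution and pooling layers extract spatial features,
# then dense layers turn those features into category probabilities.
model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),            # color image: height, width, RGB
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax"),       # e.g., human / toy / water
])
model.summary()
```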

 

References and More Information

Thank you to Xindian Long for her excellent presentation. Thank you to Ethem Can for technical edits.

