Spanning the reality gap between AI and Industry 4.0

A summary of industry-ready state of the art computer vision techniques (by Philip Montsho)

Artificial intelligence has become a booming trend in the industrial sector, as automation and optimization remain the primary focus of the digital revolution. In this article, we will take a look at state-of-the-art computer vision techniques that have generated a lot of excitement in the AI community in the last few years, are considered industry-ready, and are likely to have a significant practical impact on industrial use cases. Some of these techniques have surpassed human-level performance, and with it the precision and reliability standards expected by most industries. Advances in basic computer vision tasks, such as image classification, have made it feasible to reliably combine multiple techniques into new, compound techniques that enable use cases never before explored in industrial environments. These new techniques have demonstrated that it is possible to obtain precision and reliability comparable to what would otherwise require specialized, hardware-intensive systems. While such specialized systems are difficult and costly to install, cameras are readily available, which dramatically broadens the scope of use cases. Computer vision systems powered by AI make it possible to leapfrog into an entirely new domain, accelerating progress towards Industry 4.0 and the true digitization and augmentation of physical reality.

“In 2015 the state of the art deep learning algorithms surpassed human level performance for image classification”

Before we dive into the recent advances in the field of computer vision, let’s cover a few basic ideas and some of the historical events that brought deep learning and computer vision together.

Introduction to computer vision

Computer vision is the science that primarily aims to give computers the ability to understand and draw insights from images and videos. It makes it possible to automate visual tasks, such as extracting and analyzing useful information from images or videos. A secondary aim of computer vision systems is to improve the quality of images and videos.

Introduction to machine learning and deep learning

Machine learning is the scientific study of algorithms and statistical models that rely on a data-driven approach to decision-making instead of a logical rule-based approach derived from first principles. Given large amounts of high-quality data, machine learning systems can progressively improve their performance on a specific task. Deep learning is a subcategory of machine learning that focuses on a set of mathematical algorithms structured as networks. Their inception was inspired by the biological neural networks found in the human brain; similarly, artificial neural networks have millions of artificial synapses, represented mathematically by millions of simple linear algebraic equations.
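The idea that a network is built from millions of simple linear equations can be illustrated with its smallest unit, a single artificial neuron: a weighted sum followed by a non-linearity. A minimal sketch (all numbers here are made up for illustration):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum plus a sigmoid non-linearity."""
    z = np.dot(inputs, weights) + bias   # simple linear algebra: w . x + b
    return 1.0 / (1.0 + np.exp(-z))      # squash the result into (0, 1)

# Toy input and made-up weights; a real network stacks millions of these units.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
output = neuron(x, w, bias=0.05)
```

Training a network means nudging the weights and biases of all its neurons so the outputs progressively improve on the task at hand.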

Deep learning powered computer vision

Since 2012, deep learning neural networks have been the primary focus of computer vision research, and with good reason. Computer vision systems powered by deep learning are more accurate, more flexible, and more tolerant of variation in lighting conditions, viewpoint, scale, orientation, background clutter, intra-class variation, deformation, and occlusion. But most importantly, they have enabled new use cases.

Early computer vision models relied on raw pixel data as the input to the machine learning model. However, raw pixel data alone is not sufficient to encompass the countless variations of an object in an image.

Deep learning powered computer vision is based on the idea that deep neural networks extract and create task-specific features automatically during training, which are then used to perform the computer vision task.

The following graph highlights some of the most important events in the 6-year history of deep learning and computer vision.

  1. The breakthrough caused by the introduction of deep neural networks in 2012 led to a roughly 10 percentage point decrease in image classification error (from 25.8% in 2011 to 16.4% in 2012).
  2. In 2015, state-of-the-art deep learning algorithms surpassed human-level performance for image classification, reaching a 3.57% error rate compared to the 5.1% human error rate (Russakovsky et al.).
  3. Overall, the introduction of deep neural networks resulted in a more than 10× reduction in image classification error (from 25.8% in 2011 to 2.3% in 2017).

It is important to note that the results above were achieved on the ImageNet benchmark, which draws 1,000 categories from the roughly 20,000 in the full data set; a typical category, such as “balloon” or “strawberry”, consists of several hundred low-resolution images (around 469 × 387 pixels). A computer vision system applied to a specific task with fewer categories, less variation, and a larger number of higher-resolution images can achieve up to 99.9% accuracy. This makes it possible to run such a system autonomously with confidence.

Now that we have covered the fundamentals we can take a look at the techniques in more detail.


Image classification

In this section, we will introduce Image Classification, which is the task of assigning one label from a fixed set of categories to an image. This is one of the core problems in computer vision that, despite its simplicity, has a large variety of practical applications. Many other seemingly distinct computer vision tasks (such as image captioning, object detection, keypoint detection, and segmentation) can be reduced to image classification whilst others leverage entirely new neural network architectures. The following video clip illustrates a very simple classification example.

Simple Image Classification using Convolutional Neural Network (Venkatesh Tata, Dec 2017)
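Stripped to its essentials, classification reduces an image to one score per category and picks the highest. The sketch below illustrates that mechanic only; the label set is hypothetical and the weights are random stand-ins for a trained network, not a working classifier:

```python
import numpy as np

LABELS = ["balloon", "strawberry", "lamppost"]  # hypothetical fixed set of categories

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

def classify(image, weights, bias):
    """Assign one label from LABELS to an image (flattened to a pixel vector)."""
    scores = weights @ image.ravel() + bias  # one score per category
    probs = softmax(scores)
    return LABELS[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
image = rng.random((8, 8))                              # stand-in for a real photo
weights = rng.standard_normal((len(LABELS), image.size))  # untrained, random weights
label, probs = classify(image, weights, np.zeros(len(LABELS)))
```

A real convolutional network replaces the single linear layer with stacked convolutions, but the final step, scoring a fixed set of categories and taking the argmax, is the same.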

Image keywording and captioning

These techniques sit at the intersection of two of the most interesting fields of AI: computer vision and Natural Language Processing (NLP). Keywords are words that describe the elements of a photograph or image. Image captioning is the process of generating a textual description of an image or video based on the objects and actions it contains. An example of this can be seen in the following image:
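Mechanically, keywording is often framed as multi-label classification: instead of picking a single category, the system returns every label whose confidence clears a threshold. A minimal sketch with made-up labels and scores:

```python
def keywords(scores, labels, threshold=0.5):
    """Return all labels whose confidence exceeds the threshold.

    Unlike single-label classification, several keywords may apply at once;
    the scores are assumed to be independent confidences in [0, 1].
    """
    return [lab for lab, s in zip(labels, scores) if s >= threshold]

# Hypothetical confidences from a keywording model for one photo.
tags = keywords([0.9, 0.2, 0.7], ["dog", "car", "grass"], threshold=0.5)
```

Captioning goes one step further, feeding visual features like these into a language model that composes a full sentence.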

Object detection

Object detection is a computer vision technique that identifies and locates objects in images or videos, typically by enclosing them in labeled bounding boxes. It is a key technology behind self-driving cars, enabling them to recognize other cars or distinguish a pedestrian from a lamppost, and it is also useful in applications such as industrial inspection and robotic vision. Thanks to the ImageNet competition, there was a 1.7× reduction in localization error (from 42.5% to 25.3%) between 2010 and 2014 alone. The video clip below shows a real-time implementation of this technique detecting cars, people, and other common city objects relevant to the vision system of a self-driving car.

YOLOv3: An Incremental Improvement (Redmon et al., Apr 2018)
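Detectors like YOLO output many candidate boxes with confidence scores, and a standard post-processing step, non-maximum suppression, keeps the best box per object using intersection-over-union (IoU). A self-contained sketch with made-up detections:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes and drop overlapping duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two overlapping detections of the same car plus one pedestrian (made-up numbers).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 120, 140)]
scores = [0.9, 0.75, 0.8]
kept = nms(boxes, scores)
```

Here the second box overlaps the first heavily, so it is suppressed as a duplicate detection, while the distant third box survives.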

Keypoint detection and pose estimation

A keypoint is a feature considered an interesting or important part of an image: a spatial location that defines what stands out in the image. Keypoints are special because the same keypoints can be tracked in a modified image, even when the image, or the objects in it, are rotated, scaled, or deformed.

Pose estimation is a general computer vision problem where the aim is to detect the position and orientation of an object, which usually means detecting its keypoint locations. This technique can be used to build a very accurate 2D/3D model of the object's keypoint positions, which can then drive a digital twin updated in real time.
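Many pose estimators have the network predict one heatmap per keypoint and then read each keypoint's position off the peak of its map. A minimal sketch of that decoding step, using synthetic heatmaps with known peaks as stand-ins for network output:

```python
import numpy as np

def extract_keypoints(heatmaps):
    """Decode per-keypoint heatmaps of shape (K, H, W) into (x, y) peak positions."""
    points = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # location of the peak
        points.append((int(x), int(y)))
    return points

# Two synthetic 16x16 heatmaps, each with a single known peak.
hms = np.zeros((2, 16, 16))
hms[0, 3, 7] = 1.0    # keypoint 0 at x=7, y=3
hms[1, 10, 2] = 1.0   # keypoint 1 at x=2, y=10
kps = extract_keypoints(hms)
```

The decoded (x, y) pairs are what gets fed into a 2D/3D model or digital twin.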

For example, when estimating the pose of common boxy household objects, the corners can be detected to infer the 3D position of the objects in the environment.

Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects (Tremblay et al., Sep 2018)

The same can be done for detecting human poses, where the key points on a human body such as shoulders, elbows, hands, knees, and feet are detected.

OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields (Cao et al., Dec 2018)

Semantic segmentation (a.k.a. masking)

The next technique, semantic segmentation, addresses one of the key problems in computer vision: it intuitively separates the objects in an image by clearly defining their boundaries. Looking at the big picture, semantic segmentation paves the way towards complete scene understanding. This is incredibly useful because it gives a computer the ability to precisely identify the boundaries of different objects. The importance of scene understanding as a core computer vision problem is highlighted by the growing number of applications that build on the knowledge gained through semantic segmentation. In the self-driving car example shown below, it helps the car identify the exact position of the road and of other objects.

Semantic Segmentation with Deep Learning (George Seif Sep 2018)
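Under the hood, a segmentation network outputs a score per class for every pixel, and the mask is simply the per-pixel argmax over those scores. A sketch of that final step, with a hypothetical label set and random logits standing in for real network output:

```python
import numpy as np

CLASSES = ["road", "car", "background"]  # hypothetical label set

def segment(logits):
    """Per-pixel class assignment: logits of shape (C, H, W) -> label mask (H, W)."""
    return np.argmax(logits, axis=0)

rng = np.random.default_rng(1)
logits = rng.standard_normal((len(CLASSES), 4, 4))  # stand-in for network output
mask = segment(logits)
```

Each entry of the mask is an index into CLASSES, so the object boundaries fall exactly where the winning class changes from one pixel to the next.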

The techniques below fall into the category of image-to-image translation: here the networks augment images and videos by improving their quality, rather than extracting insights or drawing conclusions.

Super resolution

The goal of this task is to increase the resolution of images while simultaneously increasing the level of detail. Very deep neural networks have recently achieved great success at image super-resolution; a 2× magnification works well, as shown in the image below.
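To see what the learned approach improves on, here is the naive baseline: upscaling by pixel repetition, which enlarges the image but adds no detail. A super-resolution network starts from the same low-resolution input and predicts the plausible fine detail this baseline cannot recover:

```python
import numpy as np

def upscale_nearest(img, factor=2):
    """Naive upscale by repeating each pixel factor x factor times.

    This is the no-detail baseline; learned super-resolution replaces the
    repeated blocks with predicted high-frequency structure.
    """
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

# A tiny 2x2 grayscale "image" becomes 4x4, with every pixel duplicated.
img = np.array([[0.0, 1.0], [0.5, 0.25]])
big = upscale_nearest(img)
```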

Night vision

Imaging in low light is challenging. Short-exposure images suffer from noise, and long exposure times can result in motion blur; the latter is often also impractical, especially for hand-held photography. A variety of denoising, deblurring, and enhancement techniques have been proposed, but their effectiveness is limited in extreme conditions, such as high-speed photography at night. To improve on the current standards, researchers introduced a technique for processing low-light images based on end-to-end training of a deep network. The network operates directly on raw sensor data and replaces much of the traditional image processing pipeline. This can be seen clearly in the image below, where a dark, noisy image is significantly brightened and enhanced.
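One of the hand-designed steps that such an end-to-end network learns to replace is simple brightening, for example classic gamma correction, which lifts dark pixels but amplifies noise right along with them. A minimal sketch of that traditional step:

```python
import numpy as np

def brighten(img, gamma=0.4):
    """Classic gamma correction on pixels in [0, 1].

    With gamma < 1, dark values are lifted strongly; unlike a learned
    pipeline, this also amplifies the sensor noise hiding in the shadows.
    """
    return np.clip(img, 0.0, 1.0) ** gamma

# Three dark pixel values before and after brightening.
dark = np.array([0.01, 0.1, 0.5])
bright = brighten(dark)
```

The end-to-end network, by contrast, maps raw sensor data straight to a clean bright image, so brightening and denoising are learned jointly instead of applied as separate fixed steps.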

Super slow motion

Video interpolation aims to generate intermediate frames between two consecutive frames; the artificially generated frames should be indistinguishable in their visual characteristics from the original footage. This technique is ideal for amplifying the capabilities of camera systems. Experimental results on several data sets demonstrate that the deep learning approach performs consistently better than existing methods. The results of this technique can be seen in the video clip below, where a smooth slow-motion video was created by adding 7 intermediate frames between each pair of original frames.

Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation (Jiang et al., Jul 2018)
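The simplest possible interpolation, included here only to make the task concrete, is a linear blend of the two frames; it produces the right number of in-between frames but ghosts any motion, which is exactly what the learned, motion-aware approach avoids:

```python
import numpy as np

def interpolate_frames(f0, f1, n=7):
    """Generate n intermediate frames between f0 and f1 by linear blending.

    A crude stand-in for learned interpolation: real methods estimate motion
    so moving objects are shifted rather than cross-faded.
    """
    return [(1 - t) * f0 + t * f1
            for t in np.linspace(0, 1, n + 2)[1:-1]]  # drop the endpoints

# Blend from an all-black to an all-white 2x2 frame with 7 in-between frames.
a = np.zeros((2, 2))
b = np.ones((2, 2))
mids = interpolate_frames(a, b, n=7)
```

Inserting 7 frames between every original pair turns 30 fps footage into the equivalent of 240 fps slow motion.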

In this article, we took a look at computer vision techniques powered by deep learning that have been developed recently, have demonstrated incredible results, and are ready to be implemented in industry. These techniques are at the cutting edge of technology and have been shown to significantly outperform previous approaches in speed, accuracy, reliability, and flexibility.

A key driver of this innovation is that the number of AI research papers has skyrocketed in recent years, especially in the field of computer vision, making it all the more important to stay up to date with the latest trends in order to fully leverage these technological advances and improve industrial operations.


Thanks for reading! Hopefully, you learned something new and useful about the state of the art computer vision techniques that are ready for practical applications in industry.

If you want to discuss a use case that relates to your production environment feel free to reach out to me directly at