

# ACCELERATE SMART VISION DEPLOYMENTS WITH SCALABLE INDUSTRIAL READY KITS & EASY-TO-USE INFERENCE ENGINE SOFTWARE

Intel<sup>®</sup> Programmable Solutions Group

Jean-Michel Vuillamy – May 22, 2019

## **Title or Abstract**

### **Productize Your AI with the OpenVINO<sup>™</sup> Toolkit on FPGA**

Deep learning boom is now a few years old and while it remains an important research topic, the technology is now mature enough to arrive to production.

We present a toolkit, named OpenVINO<sup>™</sup>, which optimizes deep learning models for inference in order to execute them on a wide spectrum of hardware in terms of power consumption, cost, and inference performance. This leaves full flexibility for the end user to integrate AI into their products given their given hardware constraints. In this presentation, we will provide a general overview about the tool. We will then describe the general graph optimizations given by our tool and designed to create a smaller and leaner graph read to use for inference. Those optimizations include batch normalization fusion, patterns, and subgraphs replacement.

After that, we will dive deep into the deep learning accelerator (DLA), included in the OpenVINO<sup>™</sup> toolkit, and understand how it powers neural network acceleration in FPGAs. We explain how DLA is used to execute neural network graphs on programmable logic even without the need to have detailed FPGA knowledge in the first place. But we will also give insights about how the functionality is physically implemented in the device and how it can be customized by the developer as needed given the fluid nature of neural network development these days.

Finally, we will perform a face detection demo based on a SqueezeNet SSD topology running on multi-streaming videos where we demonstrate that FPGA technology enables the scaling of number of channels whilst maintaining frame rate, providing significant acceleration compared to CPU.





## Agenda

- OpenVINO<sup>™</sup> toolkit
  - Overview
  - Graph Optimizations
  - Inference Engine
- Deep learning acceleration
  - Machine Learning on FPGA
  - Intel<sup>®</sup> FPGA Deep Learning Acceleration Suite

(intel)

3

- Execution on the FPGA
- Demo

Programmable Solutions Group

## Deep Learning: Training vs. Inference



Programmable Solutions Group

## Intel<sup>®</sup> AI Portfolio



Programmable Solutions Group

\*Other names and brands may be claimed as the property of others

5

## OpenVINO<sup>™</sup> Toolkit

**Programmable Solutions Group** 

- A toolkit to accelerate the development of high-performance computer vision and deep learning into vision applications
- Increase application performance through Intel<sup>®</sup> accelerators and flexible heterogeneous architectures (CPU, CPU w/ integrated GPU, FPGA, and Intel Movidius<sup>™</sup> Vision Processing Unit)
- Easy deployment across multiple Intel platforms using **one single API**
- Accelerate workloads for a wide range of solutions and vertical use cases
- Drive power, cost, and development efficiencies to designs and applications



## What's Inside Intel<sup>®</sup> Distribution of OpenVINO<sup>™</sup> Toolkit



An open source version is available at <u>01.org/openvinotoolkit</u> (some deep learning functions support Intel CPU/GPU only).



### Intel<sup>®</sup> Deep Learning Deployment Toolkit For Deep Learning Inference

### **Model Optimizer**

- What it is: A python-based tool to import trained models and convert them to Intermediate representation.
- Why important: Optimizes for performance/space with conservative topology transformations; biggest boost is from conversion to data types matching hardware.

### **Inference Engine**

- What it is: High-level inference API
- Why important: Interface is implemented as dynamically loaded plugins for each hardware type. Delivers best performance for each type without requiring users to implement and maintain multiple code pathways.



GPU = Intel CPU with integrated graphics processing unit/Intel Processor Graphics

| Programmable Solutions Group | OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos | (intel) 8 |
|------------------------------|---------------------------------------------------------------------------------------|-----------|
|------------------------------|---------------------------------------------------------------------------------------|-----------|

## **Optimal Model Performance Using the Inference Engine**

- Simple and unified API for inference across all Intel<sup>®</sup> architecture
- Optimized inference on large IA hardware targets (CPU/GEN/FPGA)
- Heterogeneity support allows execution of layers across hardware types
- Asynchronous execution improves performance
- Future-proof or scale your development for future Intel processors



#### Transform Models & Data into Results & Intelligence

(intel)

9

OpenVX and the OpenVX logo are trademarks of the Khronos Group Inc. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

GPU = Intel CPU with integrated graphics/Intel® Processor Graphics/GEN GNA = Gaussian mixture model and Neural Network Accelerator

Programmable Solutions Group

### Speed Deployment with Intel Optimized Pre-trained Models

The OpenVINO<sup>™</sup> toolkit includes optimized pre-trained models to expedite development and improve deep learning inference on Intel<sup>®</sup> processors. Use these models for development and production deployment without the need to search for or to train your own models.

- Age & Gender
- Face Detection standard & enhanced
- Head Position
- Human Detection eye-level & high-angle detection
- Detect People, Vehicles & Bikes
- License Plate Detection: small & front facing
- Vehicle Metadata
- Human Pose Estimation
- Text Detection

### **Pre-Trained Models**

- Vehicle Detection
- Retail Environment
- Pedestrian Detection
- Pedestrian & Vehicle Detection
- Person Attributes Recognition Crossroad
- Emotion Recognition
- Identify Someone from Different Videos – standard & enhanced
- Facial Landmarks

- Identify Roadside objects
- Advanced Roadside Identification
- Person Detection & Action Recognition
- Person Re-identification ultra small/ultra fast
- Face Re-identification
- Landmarks Regression
- Smart Classroom Use Cases
- Single Image Super Resolution (3 models)

(intel)

10



## Save Time with Deep Learning Samples and Computer Vision Algorithms

#### **Samples**

Use Model Optimizer and Inference Engine for public models and Intel<sup>®</sup> pretrained models.

- Object Detection
- Standard & Pipelined Image Classification
- Security Barrier
- Object Detection for Single Shot Multibox Detector (SSD) using Asynch API
- Object Detection SSD
- Neural Style Transfer
- Hello Infer Classification
- Interactive Face Detection
- Image Segmentation
- Validation Application
- Multi-channel Face Detection

#### **Computer Vision Algorithms**

Start quickly with highly-optimized, ready-todeploy, custom-built algorithms using Intel pretrained models.

- Face Detector
- Age & Gender Recognizer
- Camera Tampering Detector
- Emotions Recognizer
- Person Re-identification
- Crossroad Object Detector
- License Plate Recognition
- Vehicle Attributes Classification
- Pedestrian Attributes Classification

## Agenda

- OpenVINO<sup>™</sup> toolkit
  - Overview
  - Graph Optimizations
  - Inference Engine
- Deep learning acceleration
  - Machine Learning on FPGA
  - Intel<sup>®</sup> FPGA Deep Learning Acceleration Suite

(intel)

12

- Execution on the FPGA
- Demo

## Solving Machine Learning Challenges with Intel® FPGA









### EASE OF USE Software Abstraction, Platforms & libraries

Intel<sup>®</sup> FPGA solutions enable software-defined programming of customized machine learning accelerator libraries.

### **REAL TIME** DETERMINISTIC LOW LATENCY

Intel FPGA hardware implements a deterministic low-latency datapath unlike any other competing compute device. **FLEXIBILITY** CUSTOMIZABLE HARDWARE FOR NEXT GEN DNN ARCHITECTURES

Intel FPGAs can be customized to enable advances in machine learning algorithms.



## Why Intel<sup>®</sup> FPGAs for Machine Learning?

#### **Convolutional Neural Networks are Compute Intensive**



| Feature                                                     | Benefit                                                                                                                |
|-------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| Highly parallel<br>architecture                             | Facilitates efficient low-batch video stream processing and reduces latency                                            |
| Configurable<br>Distributed<br>Floating-Point<br>DSP Blocks | FP32 10+ TFLOPS & FP16, FP11 support<br>Accelerates computation by tuning<br>compute performance                       |
| Tightly coupled<br>high-bandwidth<br>memory                 | >50 TBps on-chip SRAM bandwidth,<br>random access, reduces latency,<br>minimizes external memory access                |
| Programmable<br>Datapath                                    | Reduces unnecessary data movement, improving latency and efficiency                                                    |
| Configurability                                             | Support for variable precision (trade-off<br>throughput and accuracy). Future proof<br>designs and system connectivity |

## Why Intel<sup>®</sup> FPGAs for Machine Learning?

**Convolutional Neural Networks are Compute Intensive** 



| Feature                                                     | Benefit                                                                                                                 |
|-------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| Highly parallel<br>architecture                             | Facilitates efficient low-batch video stream processing and reduces latency                                             |
| Configurable<br>Distributed<br>Floating-Point<br>DSP Blocks | FP32 10+ TFLOPS & FP16, FP11 support<br>Accelerates computation by tuning<br>compute performance                        |
| Tightly coupled<br>high-bandwidth<br>memory                 | >50 TBps on-chip SRAM bandwidth,<br>random access, reduces latency,<br>minimizes external memory access                 |
| Programmable<br>Datapath                                    | Reduces unnecessary data movement, improving latency and efficiency                                                     |
| Configurability                                             | Support for variable precision (trade-off<br>throughput and accuracy). Future proof<br>designs, and system connectivity |

## Why Intel<sup>®</sup> FPGAs for Machine Learning?

#### **Convolutional Neural Networks are Compute Intensive**



| Feature                                                     | Benefit                                                                                                                 |
|-------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------|
| Highly parallel<br>architecture                             | Facilitates efficient low-batch video stream processing and reduces latency                                             |
| Configurable<br>Distributed<br>Floating-Point<br>DSP Blocks | FP32 10+ TFLOPS & FP16, FP11 support<br>Accelerates computation by tuning<br>compute performance                        |
| Tightly coupled<br>high-bandwidth<br>memory                 | >50 TBps on-chip SRAM bandwidth,<br>random access, reduces latency,<br>minimizes external memory access                 |
| Programmable<br>Datapath                                    | Reduces unnecessary data movement, improving latency and efficiency                                                     |
| Configurability                                             | Support for variable precision (trade-off<br>throughput and accuracy). Future proof<br>designs, and system connectivity |



#### Get an Even Bigger Performance Boost with Intel<sup>®</sup> FPGA

<sup>1</sup>Depending on workload, quality/resolution for FP16 may be marginally impacted. A performance/quality tradeoff from FP32 to FP16 can affect accuracy; customers are encouraged to experiment to find what works best for their situation. The benchmark results reported in this deck may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user's components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations. For more complete information about performance and benchmark results, visit <u>www.intel.com/benchmarks</u>. **Configuration:** Intel<sup>®</sup> Core<sup>™</sup> i7-6700K CPU @ 2.90GHz fixed, GPU GT2 @ 1.00GHz fixed Internal ONLY testing, performed 4/10/2018 Test v312.30 – Ubuntu\* 16.04, OpenVINO<sup>™</sup> 2018 RC4. Intel<sup>®</sup> Arria 10-1150GX FPGA. Tests were based on various parameters such as model used (these are public), batch size, and other factors. Different models can be accelerated with different Intel hardware solutions, yet use the same Intel software tools. Benchmark Source: Intel Corporation.

| Programmable Solutions Group | Intel's compilers may one optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,<br>SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by<br>Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information recarding the specific instruction sets covered by this notice. Notice revision #20110804 |  | 17 |  |
|------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|----|--|
|------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|----|--|

## Intel<sup>®</sup> FPGA Deep Learning Acceleration Suite Features

\*Other names and brands may be claimed as the property of others

- CNN acceleration engine for common topologies executed in a graph loop architecture
  - AlexNet, GoogleNet, LeNet, SqueezeNet\*, VGG16\*, ResNet, Yolo, SSD...
- Software Deployment

**Programmable Solutions Group** 

- No FPGA compile required
- Run-time reconfigurable
- Customized Hardware Development
  - Custom architecture creation with parameters
  - Custom primitives using OpenCL<sup>™</sup> flow \_



18

## FPGA Usage with OpenVINO<sup>™</sup> Toolkit



### Application Acceleration with Intel® FPGA Powered Platforms



\*Please contact Intel representative for complete list of ODM manufacturers. Other names and brands may be claimed as the property of others.

Programmable Solutions Group

20

## Agenda

- OpenVINO<sup>™</sup> toolkit
  - Overview
  - Graph Optimizations
  - Inference Engine
- Deep learning acceleration
  - Machine Learning on FPGA
  - Intel<sup>®</sup> FPGA Deep Learning Acceleration Suite

- Execution on the FPGA
- Demo

## MultiChannel Demo







FPGA acceleration enables scaling of number of channels whilst maintaining frame rate

### Summary

- FPGAs provide a flexible, deterministic low-latency, high-throughput, and energy-efficient solution for accelerating AI applications
- Intel<sup>®</sup> FPGA Deep Learning Acceleration Suite supports CNN inference on FPGAs
- Accessed through the OpenVINO<sup>™</sup> toolkit
- DLA architecture can be customized for best performance
- Available for Intel<sup>®</sup> Arria<sup>®</sup> 10 FPGAs today

## Call to Action, Resources

Download ► Free Intel<sup>®</sup> Distribution of OpenVINO<sup>™</sup> toolkit

### Get started quickly with

- <u>OpenVINO toolkit developer resources</u>
- Intel technology, decoded online webinars, tools, how-tos, and quick tips
- Hands-on developer workshops
- Intel<sup>®</sup> FPGA resources
- <u>AI powered by Intel FPGAs</u>

#### Support

Programmable Solutions Group

 Connect with Intel engineers and computer vision experts at the public <u>Community Forum</u>

Select Intel customers may contact their Intel representative for issues beyond forum support.



### Legal Disclaimers

© Intel Corporation. Arria, Celeron, Intel, Intel Atom, Intel Core, the Intel logo, Intel. Experience What's Inside, the Intel. Experience What's Inside logo, Iris, Movidius, Myriad, OpenVINO, the OpenVINO logo, and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

\*Other names and brands may be claimed as the property of others.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

intel

