OpenVINO™ Toolkit and FPGAs

A Look at the FPGA Targeting of this Versatile Toolkit
James Reinders, Editor Emeritus, The Parallel Universe
In this article, we’ll
take a firsthand look at how to use Intel® Arria® 10 FPGAs with the OpenVINO™
toolkit (which
stands for open visual inference and neural network optimization). The OpenVINO
toolkit has much to offer, so I’ll start with a high-level overview showing how
it helps develop applications and solutions that emulate human vision using a
common API. Intel supports targeting CPUs, GPUs, Intel® Movidius™ hardware (including the Neural Compute Stick), and FPGAs through this one API. I especially want to highlight
another way to use FPGAs that doesn’t require knowledge of OpenCL* or VHDL* to
get great performance. However, like any effort to get maximum performance, it
doesn’t hurt to have some understanding about what’s happening under the hood.
I’ll shed some light on that to satisfy your curiosity―and to help you survive
the buzzwords if you have to debug your setup to get things working.
We’ll start with a
brief introduction to the OpenVINO toolkit and its ability to support
vision-oriented applications across a variety of platforms using a common API.
Then we’ll take a look at the software stack needed to put the OpenVINO toolkit
to work on an FPGA. This will define key vocabulary terms we encounter in
documentation and help us debug the machine setup should the need arise. Next,
we’ll take the OpenVINO toolkit for a spin with a CPU and a CPU+FPGA. I’ll
discuss why “heterogeneous” is a key concept here (not everything runs on the
FPGA). Specifically, we’ll use a high-performance Intel® Programmable Acceleration Card with an Intel Arria® 10 GX FPGA. Finally, we’ll peek
under the hood. I’m not the type to just drive a car and never see what’s
making it run. Likewise, my curiosity about what’s inside the OpenVINO toolkit
when targeting an FPGA is partially addressed by a brief discussion of some of
the magic inside.
The Intel Arria 10 GX
FPGAs I used are not the sort of FPGAs that show up in $150 FPGA development
kits. (I have more than a few of those.) Instead, they’re PCIe cards costing
several thousand dollars each. To help me write this article, Intel graciously
gave me access for a few weeks to a Dell EMC PowerEdge* R740 system, featuring
an Intel Programmable Acceleration Card with an Arria 10 GX FPGA. This gave me
time to check out the installation and usage of the OpenVINO toolkit on FPGAs
instead of just CPUs.
The OpenVINO Toolkit
To set the stage,
let’s discuss the OpenVINO toolkit and its ability to support vision-oriented
applications across a variety of platforms using a common API. Intel recently
renamed the Intel® Computer Vision SDK as the OpenVINO toolkit. Looking at all
that’s been added, it’s not surprising Intel wanted a new name to go with all
the new functionality. The toolkit includes three new APIs: the Deep Learning
Deployment toolkit, a common deep learning inference toolkit, and optimized
functions for OpenCV* and OpenVX*, with support for the ONNX*, TensorFlow*,
MXNet*, and Caffe* frameworks.
The OpenVINO toolkit
offers software developers a single toolkit for applications that need
human-like vision capabilities. It does this by supporting deep learning,
computer vision, and hardware acceleration with heterogeneous support—all in a
single toolkit. The OpenVINO toolkit is aimed at data scientists and software
developers working on computer vision, neural network inference, and deep learning
deployments who want to accelerate their solutions across multiple hardware
platforms. This should help developers bring vision intelligence into their
applications from edge to cloud. Figure 1 shows potential performance improvements using the toolkit.

Accuracy changes can
occur with FP16. The benchmark results reported in this figure may need to be
revised as additional testing is conducted. The results depend on the specific
platform configurations and workloads utilized in the testing, and may not be
applicable to any particular user’s components, computer system, or workloads.
The results are not necessarily representative of other benchmarks and other
benchmark results may show greater or lesser impact from mitigations. For more
complete information about the performance and benchmark results, visit www.intel.com/benchmarks. Configuration:
Intel® Core™ i7 processor 6700 at 2.90 GHz fixed. GPU GT2 at 1.00 GHz fixed.
Internal ONLY testing performed 6/13/2018, test v3 15.21. Ubuntu* 16.04,
OpenVINO™ toolkit 2018 RC4, Intel® Arria 10 FPGA 1150GX. Tests were based on
various parameters such as model used (these are public), batch size, and other
factors. Different models can be accelerated with different Intel® hardware
solutions, yet use the same Intel® Software Tools. Benchmark source: Intel
Corporation.
Figure 1 – Performance
improvement using the OpenVINO toolkit
While it’s clear that
Intel has included optimized support for Intel® hardware, top-to-bottom support
for OpenVX APIs provides a strong non-Intel connection, too. The toolkit
supports both OpenCV and OpenVX. Wikipedia sums it up as follows: “OpenVX is complementary to the open source vision library OpenCV. OpenVX, in some applications, offers a better optimized graph
management than OpenCV.” The toolkit includes a library of functions, pre-optimized
kernels, and optimized calls for both OpenCV and OpenVX.
The OpenVINO toolkit
offers specific capabilities for CNN-based deep learning inference on the edge.
It also offers a common API that supports heterogeneous execution across CPUs
and computer vision accelerators including GPUs, Intel Movidius hardware, and
FPGAs.
Vision systems hold
incredible promise to change the world and help us solve problems. The OpenVINO
toolkit can help in the development of high-performance computer vision and
deep learning inference solutions—and, best of all, it’s a free download.
FPGA Software Stack, from the FPGA up to the OpenVINO Toolkit
Before we jump into
using the OpenVINO toolkit with an FPGA, let’s walk through what software had
to be installed and configured to make this work. I’ll lay a foundational
vocabulary and try not to dwell too much on the underpinnings. In the final
section of this article, we’ll revisit to ponder some of the under-the-hood
aspects of the stack. For now, it’s all about knowing what has to be installed
and working.
Fortunately, most of
what we need for the OpenVINO toolkit to connect to FPGAs is collected in a
single install called the Intel Acceleration Stack, which can be downloaded
from the Intel FPGA Acceleration Hub. All we need is the Runtime version (619 MB
in size). There’s also a larger development version (16.9 GB), which we could
also use because it includes the Runtime. This is much like the choice of
installing a runtime for Java* or a complete Java Development Kit. The choice is
ours. The Acceleration Stack for Runtime includes:
- The FPGA programmer (called Intel® Quartus® Prime Pro Edition Programmer Only)
- The OpenCL runtime (Intel® FPGA Runtime Environment for OpenCL)
- The Intel FPGA Acceleration Stack, which includes the Open Programmable Acceleration Engine (OPAE). OPAE is an open-source project that has created a software framework for managing and accessing programmable accelerators.
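Once the Runtime is installed, each new shell needs the stack’s environment variables set before the FPGA tools will resolve. Here’s a minimal sketch, assuming the default install location under the home directory (the script name and path vary a bit by release, so treat this as illustrative):

source ~/intelrtestack/init_env.sh   # sets OPAE_PLATFORM_ROOT and related paths for this shell
sudo fpgainfo fme                    # sanity check: OPAE can see and query the card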
I know from personal
experience that there are a couple of housekeeping details that are easy to
forget when setting up an FPGA environment: the firmware for the FPGA and the
OpenCL Board Support Package (BSP). Environment setup for an FPGA was a new
world for me, and reading through FPGA user forums confirmed that I’m not
alone. Hopefully, the summary I’m about to walk through, “up-to-date
acceleration stack, up-to-date firmware, up-to-date OpenCL with BSP,” can be a
checklist to help you know what to research and assure on your own system.
FPGA Board Firmware: Be Up to Date
My general advice
about firmware is to find the most up-to-date version and install it. I say the
same thing about BIOS updates, and firmware for any PCIe card. Firmware will
come from the board maker (for an FPGA board like I was using, the Intel®
Programmable Acceleration Card [PAC] with an Arria® 10 GX FPGA). Intel actually
has a nice chart showing which firmware is compatible with which release of the
Acceleration Stack. Updating to the most recent Acceleration Stack requires the
most recent firmware. That’s what I did. You can check the currently installed
firmware version with the command sudo fpgainfo fme.
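For illustration, the check-and-update cycle looks roughly like this. The .rpd image name below is hypothetical, and the exact fpgaflash arguments depend on your Acceleration Stack release, so this is a sketch rather than a recipe:

sudo fpgainfo fme                        # prints FME details, including firmware/bitstream IDs
sudo fpgaflash user pac_a10_update.rpd   # flash the user image supplied with the stack (hypothetical filename)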
OpenCL BSP: Be Up to Date
You can hardly use
OpenCL and not worry about having the right BSP. BSPs originally served in the
embedded world to connect boards and real-time operating systems―which
certainly predates OpenCL. However, today, for FPGAs, a BSP is generally a
topic of concern because it connects an FPGA in a system to OpenCL. Because
support for OpenCL can evolve with a platform, it’s essential to have the
latest version of a BSP for our particular FPGA card. Intel integrates the BSPs
with their Acceleration Stack distributions, which is fortunate because this
will keep the BSP and OpenCL in sync if we just keep the latest software
installed. I took advantage of this method, following the instructions to
select the BSP for my board. This process included installing OpenCL itself
with the BSP using the aocl install command (the name of which is an abbreviation of Altera
OpenCL*).
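For reference, the shape of that step on my machine was roughly the following. The BSP directory comes from the Acceleration Stack install, and the environment variable is the one the Intel FPGA SDK for OpenCL uses to locate a board package (paths are illustrative):

export AOCL_BOARD_PACKAGE_ROOT=$OPAE_PLATFORM_ROOT/opencl/opencl_bsp   # point OpenCL at the PAC's BSP
aocl install                                                           # install the driver and register the BSP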
Is the FPGA Ready?
When we can type aocl
list-devices and get a good
response, we’re ready. If not, then we need to pause and figure out how to get
our FPGA recognized and working. The three things to check:
- Install the latest Acceleration Stack software
- Verify firmware is up-to-date
- Verify OpenCL is installed with the right BSP
I goofed on the last
two, and lost some time until I corrected my error―so I was happy when I
finally saw:
_______________________________________________________________________
Device Name: acl0
Package Path: /home/james/tools/intelrtestack/a10_gx_pac_ias_1_1_pv/opencl/opencl_bsp
Vendor: Intel Corp

Physical Dev Name    Status    Information
pac_a10_eb00000      Passed    PAC Arria 10 Platform (pac_a10_eb00000)
                               PCIe 134:00.0
                               FPGA temperature = 57 degrees C.
DIAGNOSTIC_PASSED
_______________________________________________________________________
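If you want more than a device listing, the same utility offers a board-level self-test; the DIAGNOSTIC_PASSED line in the output above is the kind of confirmation it prints:

aocl diagnose                            # runs the BSP's built-in diagnostic on each visible device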

Figure 2 shows the PAC I
used.
Figure 2 – Intel®
Programmable Acceleration Card with an Intel Arria 10 GX FPGA
The OpenVINO Toolkit Targeting CPU+FPGA
After making sure that
we’ve installed the FPGA Acceleration Stack, updated our board firmware, and
activated OpenCL with the proper BSP, we’re ready to install the OpenVINO
toolkit. I visited the OpenVINO toolkit website to obtain a prebuilt toolkit by
registering and downloading “OpenVINO toolkit for Linux* with FPGA Support
v2018R3.” The complete offline download package was 2.3 GB. Installation was
simple. I tried both the command-line installer and the GUI installer (setup_GUI.sh). The GUI installer
uses X11 to pop up windows and was a nicer experience.
We’ll start by taking
OpenVINO toolkit for a spin on a CPU, and then add the performance of an Intel
Programmable Acceleration Card with an Arria 10 GX FPGA.
SqueezeNet
Intel has packaged a
few demos to showcase OpenVINO toolkit usage, including SqueezeNet. SqueezeNet
is a small CNN architecture that achieves AlexNet*-level accuracy on ImageNet*
with 50x fewer parameters. The creators said it well in their paper: “It’s no secret that much of deep learning is tied up in the
hell that is parameter tuning. [We make] a case for increased study into the
area of convolutional neural network design in order to drastically reduce the
number of parameters you have to deal with.” Intel’s demo uses a Caffe
SqueezeNet model―helping show how the OpenVINO toolkit connects with popular
platforms.
I was able to run
SqueezeNet on the CPU by typing:
cd /opt/intel/computer_vision_sdk_fpga_<version>/deployment_tools/demo
./demo_squeezenet_download_convert_run.sh
I was able to run
SqueezeNet on the FPGA by typing:
cd /opt/intel/computer_vision_sdk_fpga_<version>/deployment_tools/demo
./demo_squeezenet_download_convert_run.sh -d HETERO:FPGA,CPU
I said “FPGA,” but
you’ll note that I actually typed HETERO:FPGA,CPU. That’s because,
technically, the FPGA is asked to run the core of the neural network
(inferencing), but not our entire program. The inference engine has a very
nice error message to help us understand when something we’ve specified still
has to run on the CPU. If I type:
./demo_squeezenet_download_convert_run.sh -d FPGA
I’ll be told:
Graph is not supported on
FPGA plugin due to existence of layer (Name:prob, Type: SoftMax) in
topology. Most likely you need to use heterogeneous plugin instead of FPGA plugin
directly.
This simple demo
example will run slower on an FPGA because the demo is so brief that the
overhead of FPGA setup dominates the runtime. To overcome this, I did the
following:
export myDIR=/opt/intel/computer_vision_sdk_fpga_2018.3.343
cd $myDIR/deployment_tools/demo/
aocl program acl0 $myDIR/a10_dcp_bitstreams/2-0-1_RC_FP11_SqueezeNet.aocx
alias csa='~/inference_engine_samples/intel64/Release/classification_sample_async'
export myPIC=$IE_INSTALL/demo/car.png
csa -m squeezenet1.1.xml -i $myPIC -d HETERO:FPGA,CPU -ni 100 -nireq 3
csa -m squeezenet1.1.xml -i $myPIC -ni 100 -nireq 3
These commands let me
avoid the redundant commands in the script, since I know I’ll run twice. I
manually increased the iteration counts (the -ni parameter) to simulate a more realistic
workload that overcomes the FPGA setup costs of a single run. This simulates
what I’d expect in a long-running or continuous inferencing situation that
would be appropriate with an FPGA-equipped system in a data center.
On my system, the CPU
did an impressive 368 frames per second (FPS), but the version that used the
FPGA was even more impressive at 850 FPS. I’m told that the FPGA can outstrip
the CPU by even more than that for more substantial inferencing workloads, but
I’m impressed with this showing. By the way, the CPU that I used was a
dual-socket Intel® Xeon® Silver processor with eight cores per socket and
hyperthreading. Beating such CPU horsepower is fun.
What Runs on the FPGA? A Bitstream
What I would call a
“program” is usually called a “bitstream” when talking about an FPGA.
Therefore, FPGA people will ask, “What bitstream are you running?” The demo_squeezenet_download_convert_run.sh script hid the
magic of creating and loading a bitstream. Compiling a bitstream isn’t fast,
and loading is pretty fast, but neither needs to happen every time because,
once loaded on the FPGA, it remains available for future runs. The aocl program acl0… command that I
issued loads the bitstream, which was supplied by Intel for supported neural
networks. I didn’t technically need to reload it, but I chose to expose that
step to ensure the command would work even if I had run other programs on the FPGA
in between.
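As an aside, the FPGA plugin can also be pointed at an already-loaded bitstream through environment variables, which avoids any reprogramming at inference time. A sketch, assuming the variables documented for this generation of the FPGA plugin:

export DLA_AOCX=$myDIR/a10_dcp_bitstreams/2-0-1_RC_FP11_SqueezeNet.aocx   # tell the plugin which bitstream is loaded
export CL_CONTEXT_COMPILER_MODE_INTELFPGA=3                               # ask OpenCL not to reprogram the device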
Wait…Is that All?
The thing I liked
about using the OpenVINO toolkit with an FPGA was that I could easily say,
“Hey, when are you going to tell me more?” Let’s review what we’ve covered:
- If we have a computer vision application, and we can train it using any popular platform (like Caffe), then we can deploy the trained network with the OpenVINO toolkit on a wide variety of systems.
- Getting an FPGA working means installing the right Acceleration Stack, updating board firmware, getting OpenCL installed with the right BSP, and following the OpenVINO toolkit Inference Engine steps to generate and use the appropriate FPGA bitstream for our neural net.
- And then it just works.
Sorry, there’s no need
to discuss OpenCL or VHDL programming. (You can always read my article on
OpenCL programming in Issue 31 of The Parallel Universe.)
For computer vision,
the OpenVINO toolkit, with its Inference Engine, lets us leave the coding to
FPGA experts―so we can focus on our models.
Inside FPGA Support for the OpenVINO Toolkit
There are two very
different under-the-hood things that made the OpenVINO toolkit targeting an
FPGA very successful:
- An abstraction that spans devices but includes FPGA support
- Very cool FPGA support
The abstraction I
speak of is Intel’s Model Optimizer and its usage by the Intel Inference
Engine. The Model Optimizer is a cross-platform, command-line tool that:
- Facilitates the transition between the training and deployment environment
- Performs static model analysis
- Adjusts deep learning models for optimal execution on end-point target devices
Figure 3 shows the
process of using the Model Optimizer, which starts with a network model trained
using a supported framework, and the typical workflow for deploying a trained
deep learning model.

Figure 3 – Using the
Model Optimizer
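To make Figure 3 concrete: converting the Caffe SqueezeNet model into the intermediate representation is a single Model Optimizer command. A sketch, assuming the model files sit in the current directory (FP16 is a natural choice when the FPGA will do the inferencing):

cd /opt/intel/computer_vision_sdk_fpga_<version>/deployment_tools/model_optimizer
python3 mo.py --input_model squeezenet1.1.caffemodel --data_type FP16 --output_dir ./ir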
The inference engine
in our SqueezeNet example simply sends the work to the CPU or the FPGA based on
our command. The intermediate representation (IR) that came out of the Model
Optimizer can be used by the inferencing engine to process on a variety of devices
including CPUs, GPUs, Intel Movidius hardware, and FPGAs. Intel has also done
the coding work to create an optimized bitstream for the FPGA that uses the IR
to configure itself to handle our network, which brings us to my second
under-the-hood item.
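In practice, that portability shows up as nothing more than the -d flag: the same IR files are passed unchanged, and only the device name varies (reusing the csa alias from earlier, on a machine with the corresponding hardware and plugins installed):

csa -m squeezenet1.1.xml -i $myPIC -d CPU
csa -m squeezenet1.1.xml -i $myPIC -d GPU
csa -m squeezenet1.1.xml -i $myPIC -d HETERO:FPGA,CPU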
The very cool FPGA
support is a collection of carefully tuned codes written by FPGA experts.
They’re collectively called the Deep Learning Accelerator (DLA) for FPGAs, and
they form the heart of the FPGA acceleration for the OpenVINO toolkit. Using
the DLA gives us software programmability that’s close to the efficiency of
custom hardware designs, thanks to those expert FPGA programmers who worked
hard to handcraft it. (If you want to learn more about the DLA, I recommend the
team’s paper, “DLA: Compiler
and FPGA Overlay for Neural Network Inference Acceleration.” They describe their
work as “a methodology to achieve software ease-of-use with hardware efficiency
by implementing a domain-specific, customizable overlay architecture.”)
Wrapping Up and Where to Learn More
I want to thank the
folks at Intel for granting me access to systems with Arria 10 FPGA cards.
This enabled me to evaluate firsthand the ease with which I was able to exploit
heterogeneous parallelism and FPGA-based acceleration. I’m a need-for-speed
type of programmer―and the FPGA access satisfied my craving for speed without
making me use any knowledge of FPGA programming.
I hope you found this
walkthrough interesting and useful. And I hope sharing the journey as FPGA
capabilities get more and more software support is exciting to you, too.
Here are a few links
to help you continue learning and exploring these possibilities:

Software and workloads
used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are
measured using specific computer systems, components, software, operations and
functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in
fully evaluating your contemplated purchases, including the performance of that
product when combined with other products. For more complete information visit
http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Source: https://educationecosystem.com/blog/openvino-toolkit-and-fpgas/