Machine learning is migrating from the cloud to the network edge for real-time processing, lower latency, improved security, more efficient use of available bandwidth, and lower overall power consumption. As a result, developers of resource-limited Internet of Things (IoT) devices at these edge nodes need to figure out how to efficiently add this new level of intelligence.
Using machine learning at the edge and on a microcontroller-based system provides several new opportunities for developers to revolutionize the way that they design systems. There are several different architectures and techniques that developers can use to add intelligence to their edge nodes. In this article we will become more familiar with those architectures, along with some of the technologies that can be used to accelerate the process.
The role of machine learning at the edge
Machine learning at the edge can be useful for embedded system engineers for many reasons. First, an intelligent system can solve problems that are often difficult for a developer to code for. Take simple text recognition as an example. Recognizing text is a programming nightmare, but if machine learning is used, well, it’s nearly as simple as writing a “Hello World!” application in C.
Second, intelligent systems can be easily scaled for new data and situations. For example, if a system was trained for recognizing basic text and was suddenly provided with text in a new font, it’s not back to the drawing board for the coded algorithm. Instead, it’s just a matter of providing additional training images so that the network can learn to recognize the new font as well.
Finally, machine learning at the edge can help developers decrease costs for certain types of applications, such as:
- Image recognition
- Speech and audio processing
- Language processing
When first examining machine learning at the edge, using an application processor might seem like a good option. There are several open source tools designed for computer vision, including OpenCV, that can be leveraged to get started. However, using an application processor in many applications may not be sufficient since these processors do not have deterministic, real-time behaviors.
Machine learning architectures at the edge
When it comes to using machine learning at the edge, the three typical approaches are:
- Edge node acquires data and the machine learning is done in the cloud
- Edge node acquires data and the machine learning is done on chip
- Edge node acquires data, first-pass machine learning is done at the edge with more in-depth analysis done in the cloud
The first two solutions are the ones that are being explored the most by industry at the moment and where we will focus our attention for this article.
There are several advantages to using an architecture where the edge device acquires data and uses a cloud-based machine learning system to process it. First, the edge device does not need all the horsepower and resources that are necessary to run a machine learning algorithm. Second, the edge device can remain a low-cost, resource-constrained device just like the systems that many embedded systems developers are used to creating. The only difference is that the device needs to be able to connect to a cloud-based service provider through HTTPS in order to have its data analyzed. Third, cloud-based machine learning is advancing at an amazing pace and it would be very difficult, time consuming, and costly to transfer those capabilities to an on-chip solution.
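The HTTPS hand-off described above can be sketched in a few lines. This is a host-side illustration only, written in Python for brevity; on a microcontroller the same request would go through the TLS and HTTP stack supplied with the RTOS or connectivity middleware. The endpoint URL, the `X-Device-Id` header, and the JSON reply format are all assumptions made for the example, not part of any particular cloud provider's API.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the URL of your cloud inference service.
INFERENCE_URL = "https://ml.example.com/v1/classify"


def build_inference_request(image_bytes, device_id, url=INFERENCE_URL):
    """Package raw sensor data as an HTTPS POST for a cloud inference service."""
    if not url.startswith("https://"):
        raise ValueError("cloud endpoint must use HTTPS")
    headers = {
        "Content-Type": "application/octet-stream",
        "X-Device-Id": device_id,  # hypothetical header identifying the edge node
    }
    return urllib.request.Request(url, data=image_bytes, headers=headers, method="POST")


def classify_in_cloud(image_bytes, device_id, timeout_s=5.0):
    """Send the data and decode the reply (assumed here to be JSON)."""
    request = build_inference_request(image_bytes, device_id)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

The edge node only captures data and packages it; all of the machine learning work happens behind the endpoint, which is why the device itself can stay small and inexpensive.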
A developer looking to get started with cloud-based machine learning can use a development board like the STM32F779I-EVAL board from STMicroelectronics (Figure 1). This development board is based on the STMicroelectronics STM32F769NIH6 microcontroller with an Arm® Cortex®-M7 core and comes with an on-board camera, an Ethernet port for high-speed communication with the cloud, and an on-board display. The board can be used with software such as Express Logic’s X-Ware IoT platform to connect to a machine learning cloud provider such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud.
Figure 1: The STM32F779I-EVAL board is based on an Arm Cortex-M7 processor and includes everything necessary to perform deep learning either on-chip or up on the cloud. (Image source: STMicroelectronics)
Keeping machine learning in the cloud can make a lot of sense for a development team, but there are several reasons why machine learning is starting to move from the cloud to the edge. The reasons are very application specific, but they include important factors such as:
- Real-time processing requirements
- Bandwidth limits
- Security requirements
If an application has concerns in any of these areas, then it may make sense to bring the neural network from the cloud to the edge. In this situation, developers need to understand what they are looking for in an embedded processor so that the application executes as efficiently as possible.
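The bandwidth factor in particular is easy to check with some quick arithmetic. The sketch below compares the bit rate of an uncompressed camera stream with the available uplink. The frame size, pixel format, frame rate, uplink speed, and the 50% utilization margin are illustrative assumptions, not figures from any specific board or network.

```python
def raw_stream_bits_per_second(width, height, bytes_per_pixel, frames_per_second):
    """Bit rate needed to send every uncompressed frame to the cloud."""
    return width * height * bytes_per_pixel * 8 * frames_per_second


def fits_uplink(stream_bps, uplink_bps, utilization=0.5):
    """True if the stream fits within the share of the uplink we allow it to use."""
    return stream_bps <= uplink_bps * utilization


# Assumed example: 320 x 240 pixels, 2 bytes per pixel, 30 frames per second.
camera_bps = raw_stream_bits_per_second(320, 240, 2, 30)  # 36,864,000 bit/s
# Sending a one-byte classification result per frame instead of the frame itself.
label_bps = 8 * 30  # 240 bit/s

print(fits_uplink(camera_bps, 10_000_000))  # False on an assumed 10 Mbit/s uplink
print(fits_uplink(label_bps, 10_000_000))   # True
```

Under these assumptions the raw stream needs roughly 37 Mbit/s, while sending only the classification result needs a few hundred bits per second, which is the kind of gap that pushes the network onto the edge node.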
Selecting a processor for machine learning
There are several important factors to consider when running machine learning on an embedded processor. First, the processor must be able to execute DSP instructions efficiently, so a floating-point unit (FPU) is useful. Second, there need to be machine learning libraries that can run on the processor. The libraries need to include convolution, pooling, and activation. Without these libraries, a developer would basically have to write the deep learning algorithms from scratch, which is time consuming and costly.
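To make those three building blocks concrete, here is a minimal reference sketch of convolution, activation, and pooling on a small two-dimensional input. It is plain Python written for readability, not the API of any embedded library; an optimized library would provide the same operations in a form tuned for the processor. The function names and the choice of ReLU and 2x2 max pooling are assumptions made for the example.

```python
def conv2d_valid(image, kernel):
    """2-D 'valid' convolution as used in CNNs (no padding, no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Activation: clamp negative responses to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; an odd trailing row or column is dropped."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


image = [
    [1, 2, 3, 4, 5],
    [5, 4, 3, 2, 1],
    [0, 1, 0, 1, 0],
    [2, 2, 2, 2, 2],
    [9, 0, 9, 0, 9],
]
kernel = [[1, 0], [0, -1]]  # simple diagonal-difference filter

features = max_pool_2x2(relu(conv2d_valid(image, kernel)))
print(features)  # [[4, 3], [2, 2]]
```

A convolutional network is largely these three steps repeated layer after layer, which is why a library that implements them efficiently removes most of the from-scratch work.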
Finally, a developer needs to make sure that there are enough CPU cycles on the microcontroller to complete the neural network execution along with any additional tasks assigned to the processor.
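A rough way to check the cycle budget is to count the multiply-accumulate (MAC) operations in the network and compare them with the clock rate. The sketch below does this for a single convolution layer. The layer dimensions, the inference rate, and especially the cycles-per-MAC figure are assumptions for illustration; the real cost per MAC depends on the library, data types, and memory system, and should be measured on the target.

```python
def conv_layer_macs(out_h, out_w, out_channels, kernel_h, kernel_w, in_channels):
    """Multiply-accumulate operations needed for one convolution layer."""
    return out_h * out_w * out_channels * kernel_h * kernel_w * in_channels


def cpu_load(total_macs, inferences_per_second, clock_hz, cycles_per_mac):
    """Fraction of the CPU consumed by inference (1.0 means fully loaded)."""
    return total_macs * cycles_per_mac * inferences_per_second / clock_hz


# Assumed layer: 32x32 output, 32 filters, 5x5 kernel, 3 input channels.
macs = conv_layer_macs(32, 32, 32, 5, 5, 3)  # 2,457,600 MACs

# Assumed: 10 inferences per second, 2 cycles per MAC, 216 MHz clock.
load = cpu_load(macs, 10, 216_000_000, 2)
print(f"{load:.1%} of the CPU for this layer")  # 22.8% of the CPU for this layer
```

Summing the load over every layer, then adding the application's other tasks, gives a first estimate of whether a given clock rate leaves enough headroom.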
The Arm Cortex-M processors are now supported by CMSIS-NN, a neural network library designed to run machine learning efficiently on a microcontroller in a resource-constrained environment. This makes them a great choice for an intelligent edge-based system. The exact processor selected will depend on the application at hand, so it’s important to examine several different development boards and the applications for which they are best suited.
First, there is SparkFun Electronics’ OpenMV development board for machine vision (Figure 2). The module is based on the STM32F765VI Cortex-M7-based processor running at 216 MHz, supported by 512 Kbytes of RAM and 2 Mbytes of flash memory.
Figure 2: The OpenMV development board from SparkFun is a machine vision platform that uses the Arm CMSIS-NN framework to run the machine learning algorithms efficiently on a Cortex-M. (Image source: SparkFun Electronics)
The OpenMV module can be used for:
- Motion detection through frame differencing
- Color tracking
- Marker tracking
- Face detection
- Eye tracking
- Line and shape detection
- Template matching
The module’s software is based on the Arm CMSIS-NN library, so it’s running the machine learning networks as efficiently as possible on the processor.
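Frame differencing, the motion detection technique named in the list above, is simple enough to sketch directly. The example below treats each grayscale frame as a flat sequence of 0 to 255 pixel values and reports motion when enough pixels change by more than a threshold. It is a generic illustration, not the OpenMV API, and both threshold values are assumptions that would need tuning for a real camera and scene.

```python
def motion_detected(previous_frame, current_frame,
                    pixel_threshold=25, changed_fraction=0.02):
    """Return True when enough pixels changed between two grayscale frames."""
    if len(previous_frame) != len(current_frame):
        raise ValueError("frames must have the same number of pixels")
    # Count pixels whose brightness changed by more than the noise threshold.
    changed = sum(
        1 for before, after in zip(previous_frame, current_frame)
        if abs(after - before) > pixel_threshold
    )
    return changed > changed_fraction * len(current_frame)


background = [10] * 100                 # static scene
noisy = [12] * 100                      # small sensor noise only
intruder = [10] * 90 + [200] * 10       # a bright object enters the frame

print(motion_detected(background, noisy))     # False
print(motion_detected(background, intruder))  # True
```

The per-pixel threshold filters out sensor noise, while the changed-fraction threshold keeps a handful of flickering pixels from triggering a detection.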
Second, there is the STM32F746ZG Nucleo development board for STMicroelectronics’ Arm Cortex-M7-based STM32F746 processor running at 216 MHz (Figure 3). The processor used on the board has less RAM and flash than the processor on the OpenMV module, with 320 Kbytes and 1 Mbyte, respectively. This same processor was used by Arm in many of its machine learning white papers that cover topics such as keyword spotting.
Figure 3: The STM32F746ZG Nucleo board is a low-cost development board for developers looking to get started with machine learning without all the bells and whistles. (Image source: STMicroelectronics)
The development board is more of an open platform for prototyping systems that use extensive I/O and peripherals. It includes an Ethernet port, USB OTG, three LEDs, user and reset buttons, and expansion board connectors for ST Zio (including Arduino Uno V3) and ST Morpho.
Finally, there is the IMXRT1050-EVKB from NXP Semiconductors for its i.MX RT1050 processor, which runs at up to 600 MHz (Figure 4). This processor has a lot of horsepower for executing machine learning algorithms but is still based on the Cortex-M7 architecture. As such, it is a great general purpose platform that developers can use to experiment and tune their understanding of machine learning. The processor contains 512 Kbytes of tightly coupled memory (TCM) and can use external NOR, NAND, or eMMC flash.
Figure 4: The NXP i.MX RT1050 is based on the Arm Cortex-M7 architecture but also marries the best features from the NXP Cortex-A i.MX series of processors. The RT1050 is a high-end processor capable of providing a great machine learning experience. (Image source: NXP Semiconductors)
Understanding the role of CMSIS-NN
It’s important to realize that even if machine learning is moved from the cloud to the edge, it’s impractical to run the machine learning framework on the microcontroller. The microcontroller can run the trained network, which is the output of the framework, but nothing more. Arm-NN translates a model trained on a high-end machine into low-level code that can run on the microcontroller. The low-level library that provides the APIs for Arm-NN is CMSIS-NN.
As we discussed earlier, CMSIS-NN contains the APIs and library functions for common machine learning activities such as:
- Convolution
- Pooling
- Activation
Figure 5: Arm-NN translates the model trained on a high-end machine into low-level code that runs on a Cortex-M processor using the CMSIS-NN library. (Image source: Arm)
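One way to picture the hand-off from a trained model to low-level code is a small export step that turns trained weights into a constant C array that can be compiled into the firmware. The helper below is purely hypothetical and is not part of Arm-NN or CMSIS-NN; it uses floating-point values and an invented array name only to keep the idea visible.

```python
def weights_to_c_array(name, weights):
    """Render trained weights as a constant C array for inclusion in firmware."""
    values = ", ".join(f"{w:.6f}f" for w in weights)
    return f"static const float {name}[{len(weights)}] = {{ {values} }};"


# Hypothetical weights for one small layer, as produced by training on a PC.
print(weights_to_c_array("fc1_weights", [0.5, -0.25, 1.0]))
# static const float fc1_weights[3] = { 0.500000f, -0.250000f, 1.000000f };
```

The training framework stays on the high-end machine; only fixed data like this, plus the library routines that consume it, needs to live on the microcontroller.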
Tips and tricks for using machine learning at the edge
There are many techniques that can help improve machine learning at the edge. Below are several tips and tricks that will help developers interested in getting their own machine learning systems up and running:
- If latency is not an issue, use the edge to gather data and the cloud to process the data with the machine learning network
- When offloading the machine learning to the cloud, don’t over select the amount of processing power needed in the edge device unless you plan to move the machine learning to that device in the future
- When real-time performance is critical, implement the machine learning network at the edge on a high performance Arm Cortex-M7 processor
- Read “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville to understand the theory and math behind machine learning
- Start in the cloud or on a PC and then work your way to an embedded target
- Create a “Hello World” application that can recognize hand-written digits
- Review the Arm papers on keyword spotting and speech recognition
- Purchase a development kit and duplicate an example
Intelligence is quickly finding its way from the cloud to the edge. There are three different approaches that developers can choose, ranging from fully offloading machine learning to the cloud, to running the trained machine learning algorithm on the edge. Running machine learning at the edge requires a microcontroller with high performance and DSP capabilities. The Arm Cortex-M7 processors are a great match to get machine learning up and running at the edge.