Deep Learning Challenges in Embedded Platforms
Posted on: Oct 22, 2016

The successful spread of artificial intelligence (AI) into everyday applications will depend on how easy it is to deploy deep neural networks on small, low-power devices rather than large server networks.

In this post we look at ways to deal with those challenges.

In 2014, Google submitted an entry to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) called GoogLeNet. It is an interesting case study because it is a 22-layer deep convolutional network that includes nine inception modules, creating a very rich and complex topology.

In the GoogLeNet network, every connection in every layer can potentially go back and forth through DDR. Handling this in an embedded system poses a challenge: the complex topology of the network must be divided into batches of layers that run on a DSP or dedicated hardware. We call this subnetwork division.
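To make the idea concrete, here is a minimal C sketch (purely illustrative, with hypothetical names, and not the CEVA network generator) that greedily groups a linear chain of layers into subnetworks whose working set fits an internal-memory budget:

    /* Illustrative sketch only -- not the CEVA network generator.
     * Groups consecutive layers into subnetworks so that each batch
     * of layers fits in internal (non-DDR) memory.                   */
    #include <stddef.h>

    #define MAX_LAYERS_PER_SUBNET 8   /* hypothetical limit */

    typedef struct {
        size_t first_layer;        /* index of first layer in the batch */
        size_t num_layers;         /* how many consecutive layers       */
        size_t working_set_bytes;  /* memory needed to run the batch    */
    } Subnetwork;

    size_t divide_into_subnetworks(const size_t *layer_bytes, size_t n_layers,
                                   size_t internal_mem_bytes,
                                   Subnetwork *out, size_t max_out)
    {
        size_t n_sub = 0, i = 0;
        while (i < n_layers && n_sub < max_out) {
            Subnetwork s = { i, 0, 0 };
            /* pack layers until the internal-memory budget is reached */
            while (i < n_layers &&
                   s.num_layers < MAX_LAYERS_PER_SUBNET &&
                   s.working_set_bytes + layer_bytes[i] <= internal_mem_bytes) {
                s.working_set_bytes += layer_bytes[i];
                s.num_layers++;
                i++;
            }
            /* a single layer larger than the budget still gets its own
               subnetwork; its data will have to travel through DDR     */
            if (s.num_layers == 0) {
                s.num_layers = 1;
                s.working_set_bytes = layer_bytes[i];
                i++;
            }
            out[n_sub++] = s;
        }
        return n_sub;
    }

A real topology such as GoogLeNet is a branching graph rather than a chain, so the actual analysis also has to decide an execution order and which subnetworks can run in parallel.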

In our CEVA network generator tool, all analysis is done automatically without user intervention. The network is divided into subnetworks and each subnetwork runs on the DSP according to the execution order set by the network generator. For example, let’s take a look at the inception part of the GoogLeNet network after going through our network generator tool.

In this example, the network generator created four subnetworks. Three of them run at different execution times, while two can run in parallel on different cores. Additionally, the network generator is designed to create long layer sequences that can potentially pass through internal memory only.

Overcoming the Challenges

Next, let’s take a look at methods designed to overcome some of the most significant challenges of deep learning in embedded platforms.

Reducing bandwidth

Due to the tight bandwidth constraints of embedded platforms, implementing convolutional neural networks will undoubtedly generate some bandwidth issues. These are caused either by the network filter weights or by the data transferred from layer to layer.

Here are two rules that can help reduce the bandwidth significantly:

  • Each output map is created by running the same filter over different positions in the input map. Relying on this rule, we can avoid reloading the filter weights for every position, cutting unnecessary bandwidth usage.
  • Each output is calculated from the same input data. Applying this rule, the input can be loaded once and used for all the outputs without crossing the DDR boundary more than once. Both rules are illustrated in the sketch after this list.
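As a rough illustration of what these two rules save, the following sketch (with hypothetical layer dimensions, not taken from any specific network) compares the DDR traffic of a naive implementation against one that fetches each filter once and keeps the input maps resident for all output maps:

    /* Back-of-the-envelope DDR traffic for one convolutional layer.
     * All sizes are hypothetical and only serve to illustrate the
     * effect of the two reuse rules above.                            */
    #include <stdio.h>

    int main(void)
    {
        const long in_ch = 192, out_ch = 128;   /* input/output map counts */
        const long H = 28, W = 28, K = 3;       /* map and filter size      */
        const long out_h = H - K + 1, out_w = W - K + 1;
        const long bytes = sizeof(float);

        /* Naive: weights re-fetched for every output position.          */
        long naive_w  = out_ch * in_ch * out_h * out_w * K * K * bytes;
        /* Rule 1: fetch each K*K filter once and slide it over the map. */
        long reused_w = out_ch * in_ch * K * K * bytes;

        /* Naive: input maps re-fetched for every output map.            */
        long naive_in  = out_ch * in_ch * H * W * bytes;
        /* Rule 2: keep the input maps resident and reuse them for all
         * output maps, so they cross the DDR boundary only once.        */
        long reused_in = in_ch * H * W * bytes;

        printf("weights: %ld -> %ld bytes\n", naive_w, reused_w);
        printf("input:   %ld -> %ld bytes\n", naive_in, reused_in);
        return 0;
    }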

Multiply and Accumulate Utilization

A powerful feature of DSP architecture is the ability to perform single cycle multiply-accumulate (MAC) instructions for intense computations. In order to maximize efficiency, it is beneficial to have a continuous sequence of MAC instructions. This can be handled differently in two distinct cases:

  • A low number of large input maps
  • A high number of small input maps

In the first case, we prefer to complete the filter calculation for each input map before moving to the next map. This way we benefit from the overlapping filter windows, at the cost of some MAC utilization loss at the edges of the map. In this approach, the width and height dimensions are iterated over first. We call this local filter calculation, sketched below.
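A minimal sketch of this loop ordering, assuming a single output map, a zero-initialized output buffer, and a simple [channels][height][width] memory layout (illustrative code, not CEVA's implementation):

    /* Local filter calculation: few, large input maps.
     * The height/width loops run to completion on one input map
     * before moving to the next, so overlapping filter windows reuse
     * data that is already in local memory.                           */
    void conv_local(const float *in, const float *w, float *out,
                    int in_ch, int H, int W, int K)
    {
        const int out_h = H - K + 1, out_w = W - K + 1;
        for (int ic = 0; ic < in_ch; ic++)          /* one map at a time */
            for (int y = 0; y < out_h; y++)         /* height first ...  */
                for (int x = 0; x < out_w; x++) {   /* ... then width    */
                    float acc = 0.0f;
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += in[(ic * H + y + ky) * W + (x + kx)]
                                 * w[(ic * K + ky) * K + kx];
                    /* accumulate this map's contribution into the
                       (zero-initialized) output map                    */
                    out[y * out_w + x] += acc;
                }
    }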

In the second case, a large number of small input maps, the calculation should be performed across the maps: different input maps all contribute to one output map. Partial filter results are calculated, and at the end of the process all the partial results are summed into a single result, which the linearity of the convolutional filter makes possible. In this approach, the channel dimension is iterated over first. We call this cross map filter calculation, sketched below.
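The corresponding sketch, under the same assumptions, puts the channel loop innermost so that the partial results of all input maps are accumulated in one long MAC sequence per output position:

    /* Cross map filter calculation: many, small input maps.
     * The channel loop is innermost, so the partial filter results of
     * all input maps are summed into one output value before moving
     * to the next position, keeping the MAC sequence long even when
     * the maps themselves are small.                                  */
    void conv_cross_map(const float *in, const float *w, float *out,
                        int in_ch, int H, int W, int K)
    {
        const int out_h = H - K + 1, out_w = W - K + 1;
        for (int y = 0; y < out_h; y++)
            for (int x = 0; x < out_w; x++) {
                float acc = 0.0f;
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++)
                        for (int ic = 0; ic < in_ch; ic++)  /* channels first */
                            acc += in[(ic * H + y + ky) * W + (x + kx)]
                                 * w[(ic * K + ky) * K + kx];
                out[y * out_w + x] = acc;
            }
    }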

Utilizing internal memory

To use the embedded resources efficiently, we must keep all the input maps in internal memory and load them only once. But what if there isn't enough memory to preserve this rule? In that case we need to perform tile division of the input while still preserving the rule: after the division we have the same inputs, but split into tiles. The cost of this division is that the weights must be loaded once per tile, so weight traffic grows with the number of tiles.
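One way such a tile division could look, assuming row-wise tiles, a K x K filter, and a hypothetical limit on how many input rows fit in internal memory (the K-1 row overlap keeps every filter window inside some tile):

    /* Illustrative tile division of an input map into horizontal
     * strips. Each tile is processed with the same filter weights,
     * so the weights are (re)loaded once per tile.                    */
    typedef struct {
        int y0;     /* first input row of the tile  */
        int rows;   /* number of rows in the tile   */
    } Tile;

    int make_tiles(int H, int K, int max_rows_in_internal_mem,
                   Tile *tiles, int max_tiles)
    {
        int n = 0;
        /* consecutive tiles overlap by K-1 rows, assuming
           max_rows_in_internal_mem > K-1                              */
        for (int y = 0; y + K <= H && n < max_tiles;
             y += max_rows_in_internal_mem - (K - 1)) {
            int rows = max_rows_in_internal_mem;
            if (y + rows > H)
                rows = H - y;
            tiles[n].y0 = y;
            tiles[n].rows = rows;
            n++;
        }
        return n;   /* weight loads grow with this tile count */
    }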

All these problems and their solutions are clearly something the user would like to avoid dealing with when implementing deep learning on an embedded platform. At CEVA, we believe a real-time system should handle them without the user’s involvement, or even awareness. This is a core responsibility of the CEVA deep neural network framework and the CEVA network generator.

What else can be done?

We’ve covered a few embedded algorithmic solutions that change the convolution calculation to our benefit. In addition to these, more can be done at the algorithmic level by understanding how neural networks work. Here are a few examples that use compression and prior knowledge to reduce bandwidth and improve performance:

  • Using compression algorithms such as Huffman coding (see the sketch after this list)
  • Working in a pipeline to save bandwidth
  • Identifying when some of the calculations can be skipped
  • Sharing data between calculations
  • Recognizing when the focus should be on the weights and when on the map size (this is network dependent)
  • Compressing and decompressing better over time (learning from frame-by-frame execution)
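As a small, hypothetical illustration of the first item, the sketch below measures the entropy of 8-bit quantized weights; a result well below 8 bits per weight suggests that a Huffman-style code could meaningfully shrink the weight traffic:

    /* Shannon entropy of 8-bit quantized weights. A strongly skewed
     * histogram (many zeros and small values) is what makes entropy
     * coders such as Huffman coding effective on network weights.    */
    #include <stdint.h>
    #include <stddef.h>
    #include <math.h>

    double weight_entropy_bits(const uint8_t *weights, size_t n)
    {
        size_t hist[256] = {0};
        for (size_t i = 0; i < n; i++)
            hist[weights[i]]++;

        double bits = 0.0;              /* lower bound, bits per weight */
        for (int v = 0; v < 256; v++) {
            if (hist[v] == 0)
                continue;
            double p = (double)hist[v] / (double)n;
            bits += -p * log2(p);
        }
        return bits;   /* well below 8.0 means compression will pay off */
    }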

Conclusion

As you can see, there is a lot that can be done in the technical aspects of deep convolutional neural networks for embedded systems. Once the challenges of deep learning in embedded systems have been overcome, many opportunities open up.