Machine learning and deep learning have become an indispensable part of our lives. Artificial intelligence (AI) applications using natural language processing (NLP), image classification, and object detection have been deeply embedded in many of the devices we use. Most AI applications can meet their purposes well through cloud engines, such as vocabulary predictions when replying to e-mails in Gmail.
Although we enjoy the benefits of these AI applications, the cloud-based approach raises challenges in privacy, power consumption, latency, and cost. These problems could be solved by a local processing engine that performs some or all of the computation (inference) at the source of the data. However, traditional digital neural networks face a memory power-consumption bottleneck that makes this goal difficult to reach. To solve this problem, multi-level memory can be combined with analog in-memory computing, enabling the processing engine to meet power budgets from milliwatts (mW) down to microwatts (μW) and thereby perform AI inference at the edge of the network.
Challenges faced by AI applications in providing services through cloud engines
If a cloud engine is used to provide services for AI applications, users must upload some data to the cloud, either actively or passively. The computing engine processes the data in the cloud, generates predictions, and sends the results back to downstream users. The challenges in this process are outlined below:
Figure 1: Data transfer from the edge to the cloud
1. Privacy issues: For devices that are always online and always aware, personal data and/or confidential information are at risk of abuse during the upload period or during the retention period of the data center.
2. Unnecessary power consumption: If every data bit is transmitted to the cloud, hardware, radios, transmission devices, and unnecessary calculations in the cloud will consume power.
3. Latency of small-batch inference: when the data originates at the edge, it can take a second or more to receive a response from the cloud. Delays beyond 100 milliseconds are perceptible to humans and degrade the user experience.
4. The data economy needs to create value: sensors are everywhere and cheap, but they generate a great deal of data, and uploading every bit of it to the cloud for processing is not cost-effective.
To solve these challenges with a local processing engine, the neural network that will perform the inference must first be trained on a data set for the target use case. Training typically requires high-performance computing (and memory) resources and floating-point arithmetic, so it is still carried out on a public or private cloud (or a local GPU, CPU, or FPGA farm), using the data set to produce the best neural network model. Inference with the trained model does not require back-propagation, so once the model is ready it can run on a small computing engine deeply optimized for the local hardware. An inference engine typically needs a large number of multiply-accumulate (MAC) units, followed by activation functions (such as the rectified linear unit (ReLU), sigmoid, or hyperbolic tangent, depending on the model's complexity) and pooling layers between layers.
Most neural network models require a large number of MAC operations. For example, even the relatively small “1.0 MobileNet-224” model has 4.2 million parameters (weights) and requires 569 million MAC operations to perform a single inference. Because most of these models are dominated by MAC operations, the focus here is on the computational part of machine learning while looking for opportunities to create a better solution. Figure 2 below shows a simple fully connected two-layer network. The input neurons (data) are processed by the first layer of weights; the output neurons of the first layer are then processed by the second layer of weights, which produces the prediction (for example, whether the model finds a cat's face in a given image). These models compute each neuron in each layer as a “dot product”, as shown in the following formula:
yⱼ = Σᵢ wᵢⱼ · xᵢ

(For simplicity, the “bias” term is omitted from the formula.)
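The dot-product formula above, chained through two layers, is the whole forward pass. The following sketch illustrates it in plain Python; the layer sizes, weights, and activation choices are illustrative assumptions, not taken from any real model.

```python
import math

def dot(w_row, x):
    # Dot product of one weight row with the input vector (bias omitted,
    # matching the formula above).
    return sum(w * xi for w, xi in zip(w_row, x))

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(weights, x, activation):
    # One fully connected layer: each output neuron is the dot product
    # of its weight row with the input, passed through the activation.
    return [activation(dot(row, x)) for row in weights]

# Hypothetical 3-input -> 2-hidden -> 1-output network.
w1 = [[0.5, -0.2, 0.1],
      [0.3,  0.8, -0.5]]
w2 = [[1.0, -1.0]]

hidden = dense(w1, [1.0, 2.0, 3.0], relu)
output = dense(w2, hidden, sigmoid)
print(output)  # a single prediction score in (0, 1)
```

Every call to `dot` here is the MAC-heavy work that dominates models such as MobileNet.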
Figure 2: Fully connected two-layer neural network
In a digital neural network, the weights and input data are stored in DRAM/SRAM and must be moved to a MAC engine for inference. As the figure below shows, with this approach most of the power is consumed in fetching the model parameters and input data to the ALU where the MAC operation actually takes place. From an energy perspective, a typical MAC operation built from digital logic gates consumes about 250 fJ, but the energy spent moving the data exceeds the computation itself by two orders of magnitude, reaching 50 to 100 picojoules (pJ). To be fair, many design techniques can minimize data transfer from memory to the ALU, but the entire digital approach is still limited by the von Neumann architecture, which means there is ample opportunity to cut wasted power. What if the energy of a MAC operation could be reduced from about 100 pJ to a fraction of a pJ?
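A quick back-of-the-envelope calculation shows what these per-MAC figures mean at the scale of a whole inference. It uses only the numbers quoted above (569 million MACs for “1.0 MobileNet-224”, ~100 pJ per data-movement-dominated MAC) plus an assumed ~1 pJ per MAC for an in-memory design.

```python
# Energy per inference for the "1.0 MobileNet-224" model, using the
# figures quoted in the text. The 1 pJ in-memory figure is an assumption
# for illustration, not a measured value.
MACS_PER_INFERENCE = 569e6

def energy_mj(pj_per_mac):
    # Total energy per inference, converted from picojoules to millijoules.
    return MACS_PER_INFERENCE * pj_per_mac * 1e-12 * 1e3

digital = energy_mj(100.0)    # data-movement-dominated digital design
in_memory = energy_mj(1.0)    # hypothetical in-memory design
print(f"digital: {digital:.1f} mJ, in-memory: {in_memory:.3f} mJ")
```

At 100 pJ per MAC, a single inference costs tens of millijoules; cutting the per-MAC energy by two orders of magnitude shrinks it proportionally, which is what makes always-on edge inference plausible.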
Eliminate memory bottlenecks while reducing power consumption
If the memory itself can be used to eliminate this memory bottleneck, then performing inference at the edge becomes feasible. The in-memory computing approach minimizes the amount of data that must be moved, which in turn eliminates the energy wasted on data transfer. Energy consumption can be reduced further because the flash cells draw little active power during operation and consume almost no energy in standby.
Figure 3: Memory bottlenecks in machine learning calculations
Source: Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” International Symposium on Computer Architecture (ISCA), 2016.
An example of this approach is the memBrain™ technology of Silicon Storage Technology (SST), a subsidiary of Microchip. The solution relies on SST’s SuperFlash® memory technology, which has become a recognized standard for multi-level memory suitable for microcontroller and smart card applications. This solution has a built-in in-memory computing architecture that allows calculations to be completed where the weights are stored. There is no data movement for weights, only input data needs to be moved from input sensors (such as cameras and microphones) to the memory array, thus eliminating the memory bottleneck in MAC calculations.
This memory concept is based on two fundamentals: (a) the analog current response of a transistor depends on its threshold voltage (Vt) and the input data, and (b) Kirchhoff's current law, which states that the algebraic sum of the currents meeting at a node in a network of conductors is zero. It is also important to understand the basic non-volatile memory (NVM) bit cell in this multi-level memory architecture. The figure below (Figure 4) shows two ESF3 (3rd-generation embedded SuperFlash) bit cells with a shared erase gate (EG) and source line (SL). Each bit cell has five terminals: control gate (CG), word line (WL), erase gate (EG), source line (SL), and bit line (BL). The erase operation is performed by applying a high voltage to the EG; the program operation by applying high/low-voltage bias signals to WL, CG, BL, and SL; and the read operation by applying low-voltage bias signals to WL, CG, BL, and SL.
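The two principles combine into a dot product in a natural way: each cell contributes a current proportional to (stored conductance × applied voltage), and Kirchhoff's current law sums those currents on the shared bit line. A toy numeric model, assuming ideal cells operating in the linear region:

```python
# Toy model of an analog in-memory dot product. Each cell's current is
# its conductance times the input voltage (Ohm's law in the linear
# region); Kirchhoff's current law sums the currents of all cells that
# share a bit line. Ideal, noiseless cells are assumed.
def bitline_current(conductances, input_voltages):
    # Currents from every cell on the bit line sum at the node (KCL).
    return sum(g * v for g, v in zip(conductances, input_voltages))

# Weights stored as conductances (siemens), inputs applied as volts.
g = [1e-6, 2e-6, 0.5e-6]
v = [0.3, 0.1, 0.4]
i_out = bitline_current(g, v)   # amperes
print(i_out)
```

The sum over the column happens physically and in parallel; the array does not iterate over cells the way this loop does.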
Figure 4: SuperFlash ESF3 unit
Using this memory architecture, users can program the memory bit cells to different Vt voltages by fine-tuning the program operation. The memory technology uses a smart algorithm to adjust the floating-gate (FG) voltage of the cell so that it produces a specific current response for a given input voltage. Depending on the requirements of the final application, the cells can be programmed in the linear region or in the subthreshold region.
Figure 5 illustrates storing multiple levels in a memory cell. Suppose, for example, that we want to store a 2-bit integer value in a cell. Each cell in the memory array must then be programmed to one of four 2-bit values (00, 01, 10, 11), that is, to one of four possible Vt values with sufficient spacing between them. The four I-V curves below correspond to the four possible states, and the cell's current response depends on the voltage applied to the CG.
Figure 5: Programming Vt voltage in ESF3 cell
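The 2-bit-per-cell scheme of Figure 5 can be sketched as a simple program/read mapping. The specific Vt voltages below are illustrative assumptions, not values from the ESF3 documentation; the point is only that the levels are spaced widely enough to be distinguished on read-back.

```python
# Hypothetical mapping of 2-bit values to four well-separated Vt levels
# (volts). The voltages are made up for illustration.
VT_LEVELS = {0b00: 0.8, 0b01: 1.6, 0b10: 2.4, 0b11: 3.2}

def program(value):
    # "Program" a cell: select the target Vt for a 2-bit value.
    return VT_LEVELS[value & 0b11]

def read(vt):
    # Recover the stored value by finding the nearest Vt level.
    return min(VT_LEVELS, key=lambda v: abs(VT_LEVELS[v] - vt))

vt = program(0b10)
assert read(vt) == 0b10
# A small disturb, well inside the 0.8 V level spacing, still reads back
# correctly -- this is why the spacing between levels matters.
assert read(vt + 0.3) == 0b10
```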
The weights of the trained model are programmed as the floating-gate Vt of the memory cells. Thus, all the weights of each layer of the trained model (for example, a fully connected layer) can be programmed onto a matrix-like memory array, as shown in Figure 6. For inference, the digital input (for example, from a digital microphone) is first converted to an analog signal with a digital-to-analog converter (DAC) and applied to the memory array. The array then performs thousands of MAC operations in parallel on the given input vector, and the resulting outputs immediately enter the activation stage of the corresponding neurons before an analog-to-digital converter (ADC) converts them back to digital signals. These digital signals are then pooled before entering the next layer.
Figure 6: Weight matrix memory array used for inference
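The DAC → analog array → ADC path described above can be simulated end to end in a few lines. The converter resolutions and full-scale values below are illustrative assumptions; the structure (digital codes in, one bit-line current per weight row, digital codes out) follows the text.

```python
# End-to-end sketch of the inference path: digital codes pass through a
# DAC, the array performs the vector-matrix multiply as bit-line
# currents, and an ADC digitizes the result. Resolutions and scales are
# assumed for illustration.
def dac(codes, vref=1.0, bits=8):
    # Digital codes -> analog voltages.
    return [c / (2**bits - 1) * vref for c in codes]

def analog_vmm(g_matrix, voltages):
    # Each row of stored conductances produces one bit-line current (KCL).
    return [sum(g * v for g, v in zip(row, voltages)) for row in g_matrix]

def adc(currents, i_full_scale=1e-5, bits=8):
    # Analog currents -> digital codes, clipped to full scale.
    full = 2**bits - 1
    return [min(full, round(i / i_full_scale * full)) for i in currents]

g = [[2e-6, 1e-6],
     [1e-6, 3e-6]]            # weights stored as conductances
codes = adc(analog_vmm(g, dac([128, 64])))
print(codes)
```

In the real array the `analog_vmm` step is a single physical event across thousands of cells, not a loop; the sequential code only mirrors the data flow.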
This multi-level memory architecture is highly modular and flexible. Many memory tiles can be combined to build a large-scale model mixing weight matrices and neurons, as shown in Figure 7. In this example, an M×N tile configuration is connected through the analog and digital interfaces between the tiles.
Figure 7: The modular structure of memBrain™
So far we have mainly discussed the silicon implementation of this architecture. A software development kit (SDK) is also provided to help develop the inference engine alongside the chip. The SDK flow is agnostic to the training framework: users can create neural network models with floating-point computation in any framework they choose (such as TensorFlow, PyTorch, or others). Once the model has been created, the SDK helps quantize the trained neural network model and map it onto the memory array, where vector-matrix multiplication can be performed with input vectors coming from a sensor or a computer.
Figure 8: memBrain™ SDK process
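The quantize-and-map step can be illustrated with a minimal post-training quantization sketch. The symmetric, uniform 4-bit scheme below is a common textbook choice and an assumption on my part; it is not necessarily the scheme the memBrain SDK uses.

```python
# Minimal symmetric uniform quantization of trained float weights to the
# levels a 4-bit multi-level cell could hold. Illustrative only.
def quantize(weights, bits=4):
    levels = 2**(bits - 1) - 1           # e.g. codes -7..7 for 4 bits
    scale = max(abs(w) for w in weights) / levels
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # The values the analog array effectively represents after mapping.
    return [c * scale for c in q]

w = [0.82, -0.40, 0.05, -0.70]           # hypothetical trained weights
q, s = quantize(w)
print(q, [round(x, 3) for x in dequantize(q, s)])
```

The residual between `w` and `dequantize(q, s)` is the quantization error the SDK must keep small enough to preserve model accuracy.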
The advantages of the multi-level memory method combined with in-memory computing functions include:
1. Ultra-low power consumption: the technology is designed for low-power applications. The first power advantage is that with in-memory computing, no energy is wasted transferring data and weights from SRAM/DRAM during computation. The second is that the flash cells operate at very low currents in subthreshold mode, so active power consumption is very low. The third is that there is almost no energy consumption in standby, because the non-volatile memory cells need no power to retain the data of an always-on device. The approach is also well suited to exploiting sparsity in the weights and input data: if the input data or weight is zero, the memory bit cell is not activated.
2. Smaller footprint: the technology uses a split-gate (1.5T) cell architecture, whereas the SRAM cell in a digital implementation is based on a 6T architecture, so each cell is much smaller. In addition, one cell can store a complete 4-bit integer value, instead of the 4 × 6 = 24 transistors SRAM cells would need for the same purpose, substantially reducing the on-chip footprint.
3. Lower development cost: because of the memory performance bottleneck and the limitations of the von Neumann architecture, many dedicated devices (such as Nvidia's Jetson or Google's TPU) tend to improve performance per watt by shrinking process geometries, which is a costly way to solve the edge-computing puzzle. With the combination of analog in-memory computing and multi-level memory, the computation is done on-chip in the flash cells, so larger geometries can be used while reducing mask costs and shortening the development cycle.
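The footprint claim in point 2 above reduces to simple transistor counting, sketched here (cell areas also depend on layout rules, so the transistor ratio is only a first-order proxy for area):

```python
# Transistor count for storing one 4-bit weight: four 6T SRAM cells
# versus a single split-gate (1.5T) flash cell.
BITS_PER_WEIGHT = 4
sram_transistors = BITS_PER_WEIGHT * 6    # 4 x 6T cells = 24 transistors
flash_transistors = 1.5                   # one split-gate cell
print(sram_transistors / flash_transistors)  # ~16x fewer transistors
```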
The prospects for edge-computing applications are broad, but the power and cost challenges must be solved before edge computing can take off. A memory approach that performs the computation on-chip in flash cells removes the major obstacles. It builds on a production-proven, de facto standard multi-level memory technology solution that has been optimized for machine learning applications.