Abstract

We use an estimated 250 billion microcontrollers daily in electronic devices, and these microcontrollers are capable of running machine learning models. Unfortunately, most of them are highly constrained in terms of computational resources, such as memory and clock speed. These are exactly the resources that play a key role in training and running a machine learning model on an ordinary computer; in a microcontroller environment, however, their scarcity makes a critical difference. Therefore, a new paradigm known as tiny machine learning has emerged to meet the constraints of embedded devices. In this review, we discuss the resource optimization challenges of tiny machine learning and methods, such as quantization, pruning, and clustering, that can be used to overcome these resource difficulties. Furthermore, we summarize the present state of tiny machine learning frameworks, libraries, development environments, and tools. Benchmarking tiny machine learning devices is another concern: the same microcontroller constraints, together with the diversity of hardware and software, pose benchmarking challenges that must be resolved before performance differences between embedded devices can be measured reliably. We also discuss emerging techniques and approaches for boosting and expanding the tiny machine learning process and for improving data privacy and security. Finally, we draw conclusions about tiny machine learning and its future development.

1. Introduction

Globally, Internet of Things (IoT) devices are sending data to the cloud at an accelerating rate because the number of such devices is increasing and the capacity of network connections keeps improving. At the same time, the number of different cloud service platforms has grown, and these platforms have become more accessible to all users. International Data Corporation estimates that 79.4 zettabytes of data will be generated by 41.6 billion IoT devices in 2025 [1]. However, it is not necessary to transfer all these data to the cloud when we can take advantage of the edge computing capabilities of the IoT devices themselves instead of relying on cloud computing, which burdens networks and radio bands. Furthermore, applications may also require geographic dispersion, higher bandwidth, ultralow latency, and privacy-sensitive features [2]. Hence, a computing paradigm that operates closer to the edge is needed, and the use of IoT devices as comprehensive edge computers is an important issue to address. The authors of [2] presented a taxonomy of the different computing paradigms found in the literature, such as fog, edge, extreme edge, and mist computing. In this review, for clarity, we use the cloud–fog–edge taxonomy, with the common term ‘edge’ meaning edge computing that happens in the cloud- or fog-connected sensor nodes or IoT devices themselves. Edge computing—or, more specifically, edge artificial intelligence (Edge AI)—means the computation of a machine learning (ML) algorithm on an edge device or node [3], which can be as tiny as a single microcontroller with an integrated IoT radio. Today, the term TinyML [4] is widely used in the context of lightweight ML for embedded devices. In addition, these edge devices are increasingly located at the edge of the physical world, measuring some physical quantity. When defining TinyML hardware or devices, ultralow power consumption is the most defining characteristic, typically below 1 mW [5]. This places the processor range at 32-bit Arm Cortex-M7 and RISC-V PULP processors and below. TinyML is a vast research topic, and its main elements can be divided roughly into datasets, use cases, hardware, frameworks and algorithms, and models [6]. Ultimately, all these elements contribute, in one way or another, to the technological improvement of products.

Edge AI is also changing sensors, as sensor manufacturers have recently started implementing AI features in sensors’ application-specific integrated circuits (ASICs), such as STMicroelectronics’ Intelligent Sensor Processing Units (ISPUs) [7]. In addition, STMicroelectronics introduced a new generation of microelectromechanical system (MEMS) sensors in 2022, including ISPUs and support for on-device learning. Another self-learning AI sensor manufacturer is Bosch Sensortec, which promises always-on data processing algorithms at ultralow power consumption in its new MEMS-integrated 32-bit Fuser2 microcontrollers [8]. TinyML is a promising technology for soft sensors and sensor data fusion, and it has recently been studied in [9, 10, 11].

2. Challenges and Opportunities

TinyML has big challenges, such as the microcontroller’s limited computational resources, the lack of a unified framework [12], and the absence of open-source TinyML datasets [5]. TinyML also has huge potential because microcontrollers are cheap and widespread. TinyML can potentially decrease an IoT device’s energy consumption, lifetime costs, and inference latency. At the same time, it increases the IoT device’s data privacy and intelligence level. Table 1 summarizes the advantages and disadvantages of application features in IoT devices.

Edge computing requires a great deal of resources from an IoT device, as the computational resources of its microcontroller are typically limited. This computational challenge becomes overwhelming if standard neural network- (NN-) based ML algorithms are used. For this reason, lightweight software frameworks, tools, and libraries have recently been created, especially for use with microcontrollers, that can be used to build a TinyML model. After the model is created, it can be used within the source code to implement ML features in IoT devices. Nevertheless, one of the great challenges in TinyML is the lack of a unified framework that can be used across a wide range of hardware [12]. The fundamental difference between normal ML and TinyML procedures is that with the latter, the model is usually created on a more powerful computer and then ported to a microcontroller, which performs inference based on the model [13–15]. The edge inference throughput should be robust, without frame drops [16]. Moreover, the microcontroller running the ML algorithm in a loop is not the only process that consumes energy and computational resources on an edge IoT device. Raw data measurement and communication with cloud services and other IoT devices are also resource-expensive processes that the edge device must typically perform. Resource optimization becomes even more important for IoT devices that are battery-powered. Therefore, computational resource optimization is a rising topic that has been addressed in recent TinyML and edge computing articles using different techniques. Another challenge is the absence of TinyML-focused open-source datasets that are large enough for TinyML benchmarking and academic research [5]. Additionally, the data in such datasets should correspond to the data sent by external sensors in terms of temporal and spatial resolution [17].

TinyML has enormous potential because there are 250 billion microcontrollers in our printers, TVs, cars, and pacemakers that are capable of running an ML model at the edge [18]. It is also estimated that 2.5 billion IoT devices will be shipped with a TinyML chipset in 2030 [19]. These huge estimates rest on the facts that microcontrollers are astonishingly cheap and that the world has only just begun its digital transformation and will need far more data in the future. For this reason, TinyML has been described as a fast-growing field among machine learning technologies [4]. In addition, cloud-based ML inference can be costly over an AI application’s lifetime because some applications are very data-intensive. Therefore, it is much cheaper to perform ML inference and data pruning at the data origin [3]. This approach also saves energy because using the IoT radio is a more energy-expensive operation than edge computing. Moreover, when ML inference is performed on the IoT device itself, it reduces latency and increases data privacy and security [20].

3. Methods for Resource Optimization

Different approaches and methods can be used to save a microcontroller’s computational resources, such as memory and processor usage, in TinyML devices. One of the most commonly used methods to reduce the computational load on basic edge devices is simply to offload heavy computational tasks to edge gateways, as was done by the authors of [21]. Nevertheless, this approach can increase the power consumption of TinyML devices because sending and receiving data is a more energy-consuming process than powering an on-device neural network (NN) [22], and this can be critical, at least for battery-operated devices. Therefore, it is better to resolve a resource problem by computational means within the edge device itself. This can be done through quantization, pruning, and clustering methods that reduce the ML model size and processor usage. The downside of these tradeoff methods is that prediction or classification accuracy usually decreases.

3.1. Quantization

One of the vital computational capabilities of a microcontroller, usually needed for running NNs, is performing floating-point operations. Neural networks typically use high-precision 32-bit floating-point data during training and inference [23]. However, these floating-point NN operations require a great deal of memory, system throughput, and clock speed from a microcontroller [24]. In some cases, microcontrollers are not even capable of performing hardware floating-point operations; e.g., in the Arm Cortex-M processor series, the hardware floating-point unit (FPU) is included only from the M4F processors onward [25]. Still, this problem can be solved by computational means by using the Arm software floating-point C library, software floating-point emulation (FPE), or by converting floating-point data to a fixed-point data format [26, 27]. Quantization of 32-bit floating-point data to 8-bit fixed-point data lowers the model’s memory footprint by 75%, and integer operations make the microcontroller run much faster [28]. In [29], the authors tested fine-tuned convolutional neural network (CNN) quantization with the CIFAR-10 dataset, 30 epochs, and different weight and activation bit-width combinations. The results showed that by using 4-bit fixed-point weight and activation values, the classification error rate increased only from 6.98% to 8.30% compared to floating-point values. The authors of [30] also reported good results when testing 4-bit precision quantization with different datasets; they reported 50% memory and 75% computation savings with only a 5% accuracy drop. However, the results also showed that accuracy starts to fall more rapidly when using 3-bit or 2-bit ultralow precision, although this is partly task-dependent. Alternatively, mixed-precision quantization is a method that can be used to optimize a model’s weight and activation bit widths separately to target the microcontroller’s memory and CPU constraints [31]. Furthermore, this method can be used to quantize each layer separately to a different bit width to maximize accuracy and avoid data loss [32]. Nevertheless, searching for optimal bit widths for all layers is a major computational challenge.
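
To make the arithmetic concrete, the following minimal NumPy sketch (our own illustration, not code from the cited works) quantizes a float32 tensor to int8 using a scale and a zero point and measures the round-trip error:

```python
import numpy as np

# Minimal sketch of affine (asymmetric) quantization to int8. This is our
# own illustration of the arithmetic; real toolchains (e.g., TFLite) pick
# the scale and zero point per tensor or per channel during conversion.
def quantize_int8(x: np.ndarray):
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(64).astype(np.float32)   # stand-in for a weight tensor
q, s, zp = quantize_int8(weights)
error = np.abs(dequantize_int8(q, s, zp) - weights).max()
print(f"worst-case round-trip error: {error:.4f}")  # bounded by ~scale/2
```

Each real value is approximated as scale × (q − zero_point), which is why the memory footprint drops by 75% while the representable range of the original tensor is preserved.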

3.2. Binarization

Binarization is another form of quantization, whereby the bit-width compression level is maximized by reducing all operands, weights, and activations to a single bit [33, 34]. In binarized neural networks (BNNs), the arithmetic operations are replaced with bit-wise XNOR operations, and only the binarized values (+1 or -1) of the weights and activations are used in all calculations [34]. As a result, 1-bit operations reduce the memory requirement 32-fold and the number of memory accesses 32-fold, ultimately leading to increased power efficiency. In [35], the authors introduced embedded binarized neural networks (eBNNs), specially designed for constrained embedded devices. eBNNs and BNNs have the same network structure, and their model parameters are identical, but they differ in computation order. Computation reordering is needed because original BNNs need a large intermediate pool for storing temporary convolution results in floating-point format, and this pool consumes a large portion of the embedded device’s available memory. In eBNNs, this is solved with a pool block that can store only one convolution result at a time; a max-pooled result is then sent through batch normalization and a binary activation function to the result matrix.
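
The following toy sketch (our own illustration under the stated packing assumptions, not the cited eBNN code) shows why BNN arithmetic is so cheap: a dot product of ±1 vectors reduces to an XNOR followed by a popcount:

```python
import numpy as np

# Toy illustration of BNN arithmetic (our own sketch, not the cited eBNN
# code): with +1/-1 values packed as bits (1 for +1, 0 for -1), a dot
# product reduces to an XNOR followed by a popcount.
def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # bit set where signs match
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n                      # matches minus mismatches

a = np.where(np.random.randn(8) > 0, 1, -1)     # +1/-1 activations
w = np.where(np.random.randn(8) > 0, 1, -1)     # +1/-1 weights
pack = lambda v: int("".join("1" if x > 0 else "0" for x in v), 2)
assert binary_dot(pack(a), pack(w), 8) == int(a @ w)
```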

3.3. Pruning

An NN’s computational complexity can be lowered by pruning away its redundant parts. Pruning techniques can be divided into two main categories: structured and unstructured pruning [36]. Structured pruning means removing entire channels or filters, whereas unstructured pruning means removing individual weight connections by setting them to zero [37]. In addition, it is possible to combine different pruning approaches. For example, in [38], the authors presented a method whereby unstructured and structured pruning approaches were combined with a neural architecture search that automatically finds an accurate, lightweight, and sparse CNN architecture.

The process of zeroing out an NN model’s weights is also called magnitude-based pruning; it leads to a sparse model and can bring a sixfold improvement in model compression [39]. The downside of this method is that it also leads to sparse matrix multiplications, which need extra computation power and the use of sparse convolution libraries [40, 41]. Still, by weight pruning the deep neural network (DNN) model’s internal redundancy, the model can be downsized and its performance increased without a decrease in prediction accuracy [40]. The weight pruning method suits microcontrollers in particular because the benefits of model size reduction outweigh the extra computational load from sparse multiplications. A minimal example of magnitude-based pruning is sketched below.
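
The following hedged sketch uses the TensorFlow Model Optimization Toolkit (also discussed in Section 4.1) to ramp a model to 80% sparsity during fine-tuning; `model`, `x_train`, and `y_train` are assumed to exist:

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Hedged sketch of magnitude-based pruning with the TensorFlow Model
# Optimization Toolkit; `model`, `x_train`, and `y_train` are assumed to
# exist. Sparsity is ramped from 0% to 80% while fine-tuning.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=1000)
pruned = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)
pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy", metrics=["accuracy"])
pruned.fit(x_train, y_train, epochs=2,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])
# Strip the pruning wrappers before export; only the sparse weights remain.
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```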

Structured pruning changes the shapes of layers and weight matrices by removing groups of weight connections [37]. When whole channels or filters are removed, the network’s inference speed increases and the model size decreases. Channel-level pruning produces a lightweight network, but it can lower the model’s performance and accuracy because the width of the entire network is reduced. Hence, it is recommended that unstructured pruning methods be used whenever possible. In [40], the authors reported a 3.54-fold mean performance speedup and an 88% reduction in model size when they tested different weight and node pruning combinations on an Arm Cortex-M4 microcontroller with a two-way SIMD (single instruction, multiple data) unit for 16-bit fixed-point mathematics, 128 kB of SRAM, and 512 kB of flash storage. In addition, their proposed pruning technique, named Scalpel, a mixture of SIMD-aware weight pruning and node pruning, achieved better efficiency and a smaller memory footprint for the model than basic pruning techniques.

3.4. Clustering

The number of individual weight values can be reduced using a process known as clustering, whereby the model’s weight values are replaced with a smaller number of centroid weight values calculated from the original model’s grouped weights [39]. Weight clustering reduces memory usage via model compression, and the compressed CNN model can be five times smaller than the original. When weight clustering and quantization are compared, weight clustering brings higher accuracy and a higher compression ratio, but the two can still be used effectively together [42]. The weight clustering process is typically done with the k-means clustering algorithm [42, 43], as illustrated below.
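
The following toy NumPy/scikit-learn sketch (our own illustration of the mechanism, not code from the cited works) clusters a layer’s weights into 16 centroids with k-means, so each weight can be stored as a 4-bit index into a small centroid table:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy illustration of weight clustering (our own sketch): a layer's weights
# are grouped with k-means and each weight is replaced by its centroid, so
# only 16 distinct values plus 4-bit indices need to be stored.
weights = np.random.randn(512).astype(np.float32)  # stand-in for one layer
kmeans = KMeans(n_clusters=16, n_init=10).fit(weights.reshape(-1, 1))
clustered = kmeans.cluster_centers_[kmeans.labels_].ravel()
print("distinct values before:", np.unique(weights).size)    # ~512
print("distinct values after: ", np.unique(clustered).size)  # 16
```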

4. Frameworks and Libraries

In real TinyML applications, in addition to the ML model, there are system logic and sometimes a real-time operating system (RTOS) that consume the already limited memory resources [44]; however, in most use cases, the TinyML application does not require an RTOS. Nevertheless, RTOSs can sometimes be useful for TinyML applications too, as they are capable of running multithreaded and concurrent software [45]. In such multithreaded TinyML applications, RTOSs such as Miosix [46], Zephyr OS [47], Riot OS [48], and Arm Mbed OS can be used [45].

The lack of a unified TinyML framework has led to the use of custom frameworks. Furthermore, custom frameworks, which have limited availability, require complicated manual optimization when used with different hardware. Nevertheless, in the past few years, TinyML framework development has begun to progress. Among the first frameworks was Arm uTensor, an open-source ML framework for microcontrollers; then, in 2019, uTensor and Google’s TensorFlow teams began to build the TensorFlow Lite for Microcontrollers framework together [49]. In recent years, Arm has also released a comprehensive set of NN kernels in the software library known as Cortex Microcontroller Software Interface Standard-NN (CMSIS-NN) [50]. Apache, too, has extended its open-source ML framework TVM to cover microcontrollers in μTVM [51]. Another edge ML framework is PyTorch Mobile, which extends the PyTorch ecosystem [52]. In addition to these more versatile frameworks, there is the emlearn library, an open-source ML inference engine for microcontrollers starting from 8-bit architectures [53].

4.1. TensorFlow Lite

TensorFlow Lite (TFLite) is an open-source deep learning (DL) framework and set of tools for deploying and running ML models on Android, iOS, embedded Linux devices, and microcontrollers [54]. It supports multiple programming languages, such as Java, Swift, Objective-C, C++, and Python. Nevertheless, for highly constrained microcontrollers with only dozens or hundreds of kilobytes of RAM, TensorFlow Lite for Microcontrollers (TFLM) is an efficient tool to use together with TFLite. TFLM can be used for running ML inference on a device, but it does not yet support on-device training. Its core runtime requires only 16 kB of memory, and it can be used with many Arm Cortex-M architecture microcontrollers. It has also been tested with the Espressif ESP32 and different digital signal processors (DSPs) [12]. Furthermore, TFLM does not require an operating system, and it can be downloaded as an Arduino library.

The TensorFlow Model Optimization Toolkit can be used to minimize a model’s latency, memory utilization, and power consumption. Its tools include methods such as post-training quantization (PTQ), quantization-aware training (QAT), pruning, and clustering [39]. In addition, TFLite includes the TensorFlow Lite converter, which can be used to post-quantize an already trained model and convert it to the device-optimized TFLite format [55]. Post-training integer quantization best suits constrained microcontrollers; the method converts 32-bit floating-point weights and activations to 8-bit fixed-point numbers, as sketched below.
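
As a hedged, minimal sketch of this workflow (assuming a trained Keras `model` and a sample array `x_train`), full-integer post-training quantization with the TFLite converter looks roughly like this:

```python
import numpy as np
import tensorflow as tf

# Hedged sketch of full-integer post-training quantization; `model` is an
# assumed trained tf.keras model and `x_train` an assumed sample array.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

def representative_dataset():
    # A few representative inputs let the converter calibrate activation ranges.
    for sample in x_train[:100]:
        yield [np.expand_dims(sample, 0).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # integer-only in/out for MCUs
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
open("model_int8.tflite", "wb").write(tflite_model)
```

The representative dataset lets the converter calibrate activation ranges so that every tensor in the model can be mapped to int8.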

4.2. Cortex Microcontroller Software Interface Standard-NN

The CMSIS-NN library is built for NN development on Arm Cortex-M processors, and inference based on its functions achieves a 4.6-fold improvement in throughput and a 4.9-fold reduction in energy consumption [56]. The CMSIS-NN library contains specific categories of NN functions and support functions, such as convolution, activation, fully connected layer, pooling, softmax, basic math, activation table, and data-type conversion functions [50]. The functions use either 8-bit or 16-bit integers as parameters, but most of the functions still use 16-bit multiply-and-accumulate (MAC) instructions for operations such as matrix multiplications [56]. These 16-bit SIMD instructions require an Arm processor with a SIMD unit, but it is possible to use the CMSIS-NN library with older Arm processors, such as the Arm Cortex-M0, without the SIMD unit [57]. However, the Arm Cortex-M0’s performance lags behind that of the Arm Cortex-M4, M7, M33, and M35P when using CMSIS-NN.

4.3. Apache TVM

In recent years, the Apache TVM infrastructure has been extended with μTVM, software that can manage the host-driven execution of tensor programs on microcontrollers running without an OS [51]. The μTVM runtime offers a C-code generator, a cross-compiler interface, and a μDevice interface, as well as interoperability between the μTVM runtime and TVM’s AutoTVM, an automatic tensor program optimizer. μTVM uses a JTAG (Joint Test Action Group) connection and Open On-Chip Debugger (OpenOCD) control between the target device’s processor and the host, i.e., a desktop computer. This setup enables AutoTVM’s autotuning process, in which candidate kernels are generated round after round and executed on the target device; in the end, the timing results are used to autotune the model parameters. As the results in [51] show, AutoTVM tuning increases performance by lowering the program’s execution time from 294 ms to 157 ms, which is almost on par with the TFLite+CMSIS-NN model.

4.4. PyTorch Mobile

PyTorch Mobile provides a simplified end-to-end workflow and execution of ML models on edge devices [52]. It can be used with more powerful mobile operating systems such as iOS, Android, and Linux. PyTorch Mobile includes the XNNPACK floating-point and QNNPACK 8-bit quantized kernel libraries for mobile-optimized NN inference. PyTorch Mobile cannot be used with the most constrained microcontrollers at this point, but it is possible to use PyTorch models on microcontrollers through Open Neural Network Exchange (ONNX) format conversion with other software, including TensorFlow, STM32Cube.AI, and Cainvas [58–60].
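
As a minimal sketch of this hand-off (assuming a trained `torch.nn.Module` named `model` that takes a 1x3x32x32 input), the ONNX export step looks roughly like this:

```python
import torch

# Hedged sketch of the ONNX hand-off; `model` is an assumed trained
# torch.nn.Module that takes a 1x3x32x32 input.
model.eval()
dummy_input = torch.randn(1, 3, 32, 32)  # example-shaped tracing input
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])
# The resulting model.onnx can then be imported by, e.g., STM32Cube.AI.
```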

4.5. emlearn

The emlearn library contains a Python-to-C model converter and an inference engine for microcontrollers and other devices that run C code [53]. It can be used for converting classic ML and NN models, such as random forests (RF), decision trees (DT), naive Bayes (NB), multilayer perceptrons (MLP), and sequential models, built with the Keras and scikit-learn frameworks. It supports fixed-point math and does not use dynamic memory allocation. Most of the other discussed frameworks can mainly be used with 32-bit computer architectures, but emlearn can be used with 8-bit AVR processors, as was done by the authors of [61]. The emlearn library is similar to the MicroMLgen [62], FogML [63], and sklearn-porter [64] libraries.
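
A hedged sketch of the emlearn workflow is shown below: a scikit-learn model is trained in Python and converted to a C header for the microcontroller build. The convert/save calls follow emlearn’s documented usage, but exact signatures may vary between versions:

```python
import emlearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hedged sketch of the emlearn workflow: train in Python, emit C for the
# microcontroller build. The convert/save calls follow emlearn's documented
# usage, but exact signatures may vary between versions.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=10, max_depth=5).fit(X, y)
cmodel = emlearn.convert(model)
cmodel.save(file="rf_model.h", name="rf_model")  # include this header in C code
```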

5. Development Environments

Edge Impulse, Qeexo AutoML, and Imagimob provide TinyML as a service. Edge Impulse is an open-source software development kit (SDK) that enables ML on microcontrollers [13], and Qeexo AutoML is an automated ML platform [14]. Another lightweight toolkit for embedded systems, which we discuss later in this review, is STMicroelectronics’ STM32Cube.AI [15]. Cartesiam NanoEdge AI Studio includes lightweight ML libraries that can be used with all Arm Cortex-M family microcontrollers [65].

5.1. Edge Impulse Studio

The Edge Impulse (EI) SDK can be used for implementing neural networks on embedded devices and includes real sensor data collection and live signal processing, testing, and code deployment to the target device [13]. Furthermore, the actual data can be collected by sensors in IoT devices and mobile phones, and an existing dataset can be uploaded to the EI SDK with an uploader tool in JSON, CBOR, JPG, and WAV formats [66].

The authors of [67] tested the EI SDK in their research to develop a means of triaging suspected COVID-19 cases. They created a wrist-wearable IoT device based on a 32-bit Espressif ESP8266EX microcontroller. The device was used to measure and process raw photoplethysmogram (PPG) data. These data were reduced to vital components and eventually to 22 NN input features, which were wirelessly transferred to the EI SDK. Patient triage was performed by combining the vital PPG components and the EI SDK NN classification toolchain, whereby patients were classified into three classes: slow breathing (bradypnea), normal breathing, and heavy breathing. The selected densely connected pyramid NN architecture gave the model 95.1% accuracy and an estimated on-device inference time of 138 ms.

5.2. Qeexo AutoML

Qeexo AutoML provides an automated ML platform for Arm Cortex processors, including the highly constrained M0 and M0+ processors [14]. Deploying ML on the M0+ can be considerably more difficult than on an M4 because the M0+ can only perform 32-bit fixed-point mathematics and has less memory, a lower CPU speed, and no support for saturation arithmetic or DSP [68]. For these reasons, Qeexo AutoML has developed a highly optimized Arm Cortex-M0+ fixed-point ML pipeline, covering sensor data handling, feature computation, and inference, all with fixed-point data. On the M0+, the pipeline uses tree-based ML algorithms such as the gradient boosting machine (GBM), RF, and eXtreme Gradient Boosting (XGBoost). Qeexo AutoML’s comprehensive ML algorithm portfolio also includes NB, DT, Isolation Forest (IF), support vector machine (SVM), local outlier factor (LOF), logistic regression (LR), CNN, convolutional recurrent neural network (CRNN), recurrent neural network (RNN), and artificial neural network (ANN) [14]. Qeexo AutoML uses intelligent pruning and post-training quantization methods, resulting in 90% model size compression. Additional 8-bit quantization can shrink the model size by up to 75% compared to models using 32-bit precision [68].

5.3. STM32Cube.AI

STMicroelectronics’ STM32Cube.AI is an NN and ML toolkit for STM32 developers to run optimized inference on microcontrollers [15]. The STM32Cube.AI tools support the most common deep learning libraries as well as decision-making processes with more resource-optimized algorithms, such as a DT classifier. STM32Cube.AI can be expanded with the X-CUBE-AI package, which includes automatic conversion of pretrained NN and classic ML models. X-CUBE-AI supports all frameworks that use the ONNX format, including PyTorch, Microsoft Cognitive Toolkit, and MATLAB, and it has support for well-known DL and ML frameworks such as TFLite, Keras, Caffe, Lasagne, ConvnetJS, scikit-learn (IF, SVM, k-means clustering (kMC), etc.), and the XGBoost package [58, 69, 70]. In addition, X-CUBE-AI can optimize networks through 8-bit quantization and store weight and activation parameters in external flash and RAM when larger networks are used.

5.4. NanoEdge AI Studio

Cartesiam NanoEdge AI Studio comprises software and a collection of AI libraries for embedded developers that can be used as a search engine for choosing an optimal ML algorithm [65]. It includes signal preprocessing, hyperparameter optimization, anomaly detection, and classification models such as k-nearest neighbors (kNN), SVM, and NN [71]. NanoEdge AI Studio allows application-specific ML library development, and it enables unsupervised learning, inference, and prediction to be run inside a microcontroller [72]. The program automatically tests, optimizes, and compiles the best algorithmic combination into a C library. After NanoEdge AI Studio has chosen the best library for the project, the library is able to learn normal behaviors and figure out what an anomaly is [73]. It can perform an iterative learning step in 30 ms on an 80 MHz Arm Cortex-M4 and consumes only 4 kB of RAM in a typical configuration [74]. It is also worth mentioning that Cartesiam AI has been used in one of the first commercial TinyML products, a sensor called Bob Assistant, which uses automated on-device learning techniques for monitoring machines online [75]. This sensor prepares and sends predictive maintenance reports automatically once the period of learning the machine’s normal behavior ends.

5.5. Imagimob

Imagimob has two software products that can be used for building Edge AI applications. The Imagimob AI software suite is an end-to-end development solution for building Edge AI and TinyML applications [76]. It can be used with all types of time-series data, and it focuses on deep learning. Imagimob AI development follows five steps: (1) data capture and labeling, (2) data management in one place, (3) automatic model building with an AI training service, (4) model verification with visualization of all models and predictions, and (5) edge optimization and application packaging. Imagimob supports quantization of LSTM (long short-term memory) layers, which is challenging but essential when using time-series data [77]. Imagimob Edge is an easy-to-operate SaaS solution for simplifying complex Edge AI and TinyML development [78]. It can convert TensorFlow and Keras h5 file formats into the high-performance C code used in edge devices. This conversion might be a challenging task even for a proficient programmer, but with Imagimob Edge, it can be done automatically in seconds. The suite is suitable for running DL models on highly constrained embedded devices, such as Arm Cortex-M0 microcontrollers with as little as 10 kB of RAM [78, 79].

5.6. TinyML Development Tools Summary

This section summarizes TinyML development tools’ features in Table 2, showing the available ML algorithms, supported interoperable frameworks, and the minimum architecture and type of the target processor.

6. TinyML Benchmarking

When designing a TinyML performance benchmarking test, there are four primary challenges to overcome: (1) varying power consumption across the range of devices; (2) limited and varying memory resources across the range of devices; (3) hardware heterogeneity, which makes it hard to normalize performance results; and (4) software heterogeneity, because major vendors have their own proprietary tools and compilers [5]. In addition, the benchmark toolset should cover the various ways of deploying models. Today, benchmarking tests are designed to benchmark either ML inference or microcontroller performance rather than the intersection of these technologies. For example, the MLPerf Inference Benchmark [80] is unsuitable because it is targeted at more powerful computers. Recently, the authors of [81] introduced the MLPerf Tiny Benchmark Suite to meet the requirements of TinyML. This open-source suite [82] can be used to measure the accuracy, latency, and energy consumption of TinyML inference. MLPerf Tiny v0.5 provides visual wake words, keyword spotting, anomaly detection, and image classification tasks for benchmarking, including reference implementations provided using TFLite and TFLM [81]. The suite can be used for evaluating embedded devices that have clock speeds in the range of 10 MHz–250 MHz and that typically consume less than 50 mW during inference [82].

The authors of [61] tested the emlearn, sklearn-porter, and MicroMLgen classic ML libraries with an extremely constrained Arduino Uno microcontroller that had only an 8-bit processor, a clock speed of 16 MHz, 32 kB of flash memory, and 2 kB of SRAM. They selected the DT, RF, SVM, and MLP algorithms for the test, and the benchmark showed that DT and RF gave the best accuracy, the lowest memory footprint, and the fastest classification speed. The MLP algorithm showed a good accuracy of 0.97 with one hidden layer of four neurons, but its weights and biases no longer fit in the Arduino Uno’s SRAM as the network complexity grew. MicroMLgen’s SVM was the weakest-performing algorithm in the benchmark in terms of accuracy and memory footprint.

7. Emerging Techniques of TinyML

Among the latest emerging TinyML techniques is federated learning (FL), which was introduced and defined in [83]. FL is a large-scale machine learning technique whereby ML models are trained on remote devices while the training data remain local [84]. Therefore, FL improves data privacy and security, as the attack surface is limited to the IoT devices themselves [83]. FL architectures can be divided into centralized and decentralized ones [85]. In the centralized approach, there is a server between the end devices, whereas in the decentralized approach, the end devices can exchange data among themselves. For example, in the centralized approach, when edge devices collaboratively train a prediction model, each device first computes new parameter updates locally for the shared prediction model, then sends the updates to the server, and finally receives the aggregated model back from the server [86]. In the typical decentralized approach, each device can apply local updates to the ML model’s parameter gradients after it has received gradient updates directly from all other nodes [87]. A minimal sketch of the centralized aggregation step is shown below.
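
The following hedged NumPy sketch (our own illustration of federated averaging, not code from the cited works) shows the server-side aggregation step: client weight updates are averaged in proportion to each client’s sample count, and raw data never leave the devices:

```python
import numpy as np

# Minimal sketch of server-side federated averaging (FedAvg); our own
# illustration, not code from the cited works. `client_updates` is a
# hypothetical list of (layer_weights, num_samples) pairs; raw data never
# leave the client devices.
def federated_average(client_updates):
    total = sum(n for _, n in client_updates)
    n_layers = len(client_updates[0][0])
    return [sum(w[i] * (n / total) for w, n in client_updates)
            for i in range(n_layers)]

clients = [([np.ones((2, 2)), np.zeros(2)], 100),
           ([np.zeros((2, 2)), np.ones(2)], 300)]
global_weights = federated_average(clients)  # first layer -> all 0.25
```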

One key challenge in combining FL and TinyML is on-device model training, which is usually not supported by TinyML frameworks. Still, on-device training and evaluation can be implemented with programming languages such as Java, Swift, and C/C++ [86]. Furthermore, FL resource optimization can be achieved using a technique known as transfer learning (TL), which uses older models to generate a new one [88]. This procedure reduces the computational resources required to train a new model. In [89], the authors presented a method named federated transfer learning on tiny devices (TinyFedTL), whereby they implemented their own fully connected layer inference and backpropagation update between Arduino Nano 33 BLE Sense microcontrollers and a local server. As a result, they managed to train an ML model without sending raw data to the server; only the weight and bias data had to be sent between the client nodes and the server. Nevertheless, as in any other ML model training procedure, the training efficiency and model accuracy in FL depend on dataset quality and computing power [90]. The TL approach is also effective on its own: in [91], Tiny Transfer Learning (TinyTL) reduced the memory footprint by up to 6.5-fold. TinyTL uses pretrained models to save the microcontroller’s memory resources by not storing activations, learning only the biases, and freezing the weights; a sketch of this idea follows.
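
As a hedged sketch of the bias-only idea (not the authors’ TinyTL code; `model`, `x_batch`, and `y_batch` are assumed Keras objects), a custom training step can compute gradients only for the bias variables, so the frozen weights are never updated:

```python
import tensorflow as tf

# Hedged sketch of TinyTL-style bias-only fine-tuning (not the authors'
# code); `model`, `x_batch`, and `y_batch` are assumed Keras objects.
# Gradients are computed only for bias variables, so the frozen weights
# are never updated.
bias_vars = [v for v in model.trainable_variables if "bias" in v.name]
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

with tf.GradientTape() as tape:
    logits = model(x_batch, training=True)
    loss = loss_fn(y_batch, logits)
grads = tape.gradient(loss, bias_vars)       # bias gradients only
optimizer.apply_gradients(zip(grads, bias_vars))
```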

Another recent article proposed a method known as TinyML with Online-Learning (TinyOL), which can use streaming data for post-training and upgrading of an existing on-device NN model [22]. In this method, an extra TinyOL training layer is interleaved with the prediction phase. New data first flow through the existing TinyML model to the inference phase; once the result label is found, the evaluation metrics and weights are updated according to the new data. Because TinyOL uses an incremental learning process, it decreases the microcontroller’s memory and processor usage compared to batch learning: new data are handled one sample at a time and can be erased once the update is finished (see the sketch below). Besides modern NN models, traditional ML algorithms such as NB, SVM, LR, and DT are even better suited for resource-constrained on-device training because their resource demands are typically low [92]. In addition, as in TinyML overall, lowering model complexity with dimensionality reduction and pruning, and lowering computational load with quantization, help achieve better on-device training performance.
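
The following hedged NumPy sketch (our own illustration of the online-update idea, not the TinyOL implementation) updates a single extra output layer from one streaming sample at a time, after which the sample can be discarded:

```python
import numpy as np

# Hedged sketch of a TinyOL-style online update for one extra output layer
# (our own illustration, not the TinyOL implementation). Each streaming
# sample updates the layer immediately and can then be discarded.
def online_update(w, b, h, y_true, lr=0.01):
    y_pred = w @ h + b                  # inference through the extra layer
    err = y_pred - y_true               # error on the single new sample
    w -= lr * np.outer(err, h)          # SGD step for a squared-error loss
    b -= lr * err
    return w, b

w, b = np.zeros((3, 8)), np.zeros(3)    # 8 features in, 3 outputs
h = np.random.randn(8)                  # frozen model's feature vector
w, b = online_update(w, b, h, y_true=np.array([0.0, 1.0, 0.0]))
```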

One of the most attractive emerging technology combinations is the integration of low-power wide-area networks (LPWANs) with TinyML. Energy efficiency and large coverage are the foremost defining characteristics of LPWANs, although they have a low data rate [93]. Therefore, LPWAN radio technologies such as LoRaWAN, Sigfox, NB-IoT, and LTE-M are ideal technology partners for TinyML: in TinyML, inference is performed inside a constrained microcontroller, and in most cases, only a compact inference result needs to be sent to the server, as illustrated in the sketch below. Today, when national LPWAN networks and their coverage are becoming more widespread worldwide, it is even more tempting to avoid separate gateway devices and build stand-alone end-node applications. This makes sense because it is also easier to set up one device than a complex combination of separate end nodes and gateways. Electronics manufacturers have also discovered this and have started to integrate 32-bit Arm Cortex processors and sub-GHz radios into system-on-chip (SoC) and system-in-package (SiP) units [94, 95]. In [85], the author built a wearable TinyML device that integrated an MLP model with peripherals such as a LoRaWAN transceiver, a GPS module, and an inertial sensor. The results showed that the combined memory footprint of the peripheral libraries and the MLP model was below 2 kB of SRAM, which is small enough even for the tiniest microcontrollers. Table 3 summarizes the advantages of emerging techniques of TinyML.
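
As a hedged toy illustration (names and payload layout are our own, not from [85]), the inference result can be packed into just two bytes for a LoRaWAN uplink instead of streaming raw sensor data:

```python
import struct

# Toy illustration (names and payload layout are ours, not from [85]): a
# TinyML node transmits only the inference result, not raw sensor data.
# A class index and a confidence score fit in a 2-byte LoRaWAN payload.
def pack_result(class_id: int, confidence: float) -> bytes:
    return struct.pack("BB", class_id, int(confidence * 255))

payload = pack_result(3, 0.92)  # b'\x03\xea' -> only 2 bytes on air
```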

8. Conclusion

In recent years, TinyML has been studied intensively by different organizations, which have created various frameworks, tools, and methods for applying ML on microcontrollers. In these studies, overcoming the microcontrollers’ resource constraints has been the main research topic; as presented in many articles, this is typically done by computational means, lowering the memory footprint of the ML model, which also has a positive effect on the microprocessor’s CPU usage and power consumption. However, the downside is the tradeoff between model size and accuracy: compression lowers accuracy, although the loss remains at a reasonable level in most cases. Nevertheless, TinyML is still in its early stages, and commercial products, for example, are largely yet to be realized. Therefore, the future evolution of TinyML depends on how companies and the academic community focus their resources on testing and benchmarking various TinyML applications and algorithms. A comprehensive benchmark tool that can be used with a range of microcontrollers is a vital first step toward creating a continuum for research.

9. Future Application Areas

Overall, when considering the technical evolution of small IoT devices from a broader perspective, no megatrend product that everybody must own has yet emerged. The tiny IoT devices that are available are used mainly for control purposes and perhaps for sending data over the internet. Nevertheless, in the future, TinyML is likely to change the evolution of and demand for tiny IoT devices, and we will see many new must-have products in this category. The main reason for this is that new intelligent products are at the center of a digital, data-oriented, energy-efficient, and resource-optimized lifestyle. An excellent example of this product category is wearable technology, which combines health, personal safety, and communication technology. Hence, the future use of TinyML is not limited to areas where microelectronics are already present; it will also extend to new fields and inexpensive products. One example comprises condition monitoring solutions, which are currently used only with critical and expensive machinery. Low-cost TinyML sensors are likely to extend condition monitoring to less critical and mobile machinery that does not even need mains electricity, since the sensor can run on a battery. In addition, these kinds of machines are also good targets for FL applications because they are typically mass-produced, and so they could join together to produce an ML model that generalizes to different situations. Another thing worth considering is how techniques such as TinyML, FL, on-device learning, and LPWAN could influence research in different fields of natural science: the behaviors of geographically distributed study objects could be classified and observed on-device, with only the inference results, or even updated parameters of the ML model, sent to the server. This also improves data privacy and security because no sensitive raw data are sent to cloud servers. An inference engine at the edge also reduces inference time and network usage, which can be critical features for some applications. On the negative side, TinyML may be used for ethically controversial solutions such as military, surveillance, and hacking devices. Thus, it is essential to keep the ethical aspects in mind when building TinyML applications. Finally, TinyML is likely to cement its position among other ML techniques, and its maturity will grow rapidly over time.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was done under the eÄlytelli and coADDVA projects, funded by the European Regional Development Fund and the Regional Council of Central Finland.