Lifelong Intelligence Beyond the Edge using Hyperdimensional Computing

Xiaofan Yu (x1yu@ucsd.edu, ORCID 0000-0002-9638-6184), University of California San Diego, La Jolla, California, USA; Anthony Thomas (ahthomas@ucsd.edu), University of California San Diego, La Jolla, California, USA; Ivannia Gomez Moreno (ivannia.gomez@cetys.edu.mx), CETYS University, Campus Tijuana, Tijuana, Mexico; Louis Gutierrez (l8gutierrez@ucsd.edu), University of California San Diego, La Jolla, California, USA; and Tajana Šimunić Rosing (tajana@ucsd.edu, ORCID 0000-0002-6954-997X), University of California San Diego, La Jolla, USA

(2023)

Abstract.

On-device learning has emerged as a prevailing trend that avoids the slow response time and costly communication of cloud-based learning. The ability to learn continuously and indefinitely in a changing environment, and with resource constraints, is critical for real sensor deployments. However, existing designs are inadequate for practical scenarios with (i) streaming data input, (ii) lack of supervision and (iii) limited on-board resources. In this paper, we design and deploy the first on-device lifelong learning system called LifeHD for general IoT applications with limited supervision. LifeHD is designed based on a novel neurally-inspired and lightweight learning paradigm called Hyperdimensional Computing (HDC). We utilize a two-tier associative memory organization to intelligently store and manage high-dimensional, low-precision vectors, which represent the historical patterns as cluster centroids. We additionally propose two variants of LifeHD to cope with scarce labeled inputs and power constraints. We implement LifeHD on off-the-shelf edge platforms and perform extensive evaluations across three scenarios. Our measurements show that LifeHD improves the unsupervised clustering accuracy by up to 74.8% compared to the state-of-the-art NN-based unsupervised lifelong learning baselines with as much as 34.3x better energy efficiency. Our code is available at https://github.com/Orienfish/LifeHD.

Edge Computing, Lifelong Learning, Hyperdimensional Computing


1. Introduction

The fusion of artificial intelligence and Internet of Things (IoT) has become a prominent trend with numerous real-world applications, such as in smart cities (Chen et al., 2016), smart voice assistants (Sun et al., 2020), and smart activity recognition (Weiss et al., 2016). However, the predominant current approach is cloud-centric, where sensor devices send data to the cloud for offline training using extensive data sources. This approach faces challenges like slow updates and costly communication, involving the exchange of large sensor data and models between the edge and the cloud (Shunhou and Peng, 2022). Instead, recent research has shifted towards edge learning, where machine learning is performed on resource-constrained edge devices right next to the sensors. While most studies focused on inference-only tasks (Lin et al., 2020, 2021; Saha et al., 2023), some recent work has investigated the optimization of computational and memory resources for on-device training (Gim and Ko, 2022; Lin et al., 2022). Nevertheless, these efforts often rely on static models for inference or lack the adaptability to accommodate new environments.

To fundamentally address these issues, sensor devices should be capable of “lifelong learning” (Parisi et al., 2019): to learn and adapt with limited supervision after deployment. On-device lifelong learning reduces the need for expensive data collection (including labels) and offline model training, operating in a deploy-and-run manner. This approach enables autonomous learning solely from the incoming samples with minimal supervision, and is thus able to provide real-time decision-making even without a network connection. The lifelong aspect is essential for handling dynamic real-world environments, representing the future of IoT.

Although extensive research has investigated lifelong learning across various scenarios (Parisi et al., 2019), existing techniques face challenges that render them unsuitable for real-world deployments. These challenges include:

(C1)

Streaming data input. Edge devices collect streaming data from a dynamic environment. This online learning with non-iid data contrasts with the default offline and iid setting where multiple passes on the entire dataset are allowed (Grill et al., 2020).
(C2)

Lack of supervision. Obtaining ground-truth labels and expert guidance is often challenging and expensive. Most lifelong learning methods rely on some form of supervision, such as class labels (Kirkpatrick et al., 2017) or class shift boundaries (Rao et al., 2019), which are typically unavailable in real-world scenarios.
(C3)

Limited device resources. Neural networks (NN) are known for their high resource demands (Wang et al., 2019). Furthermore, the main techniques for lifelong learning based on NN, such as regularization (Kirkpatrick et al., 2017) and memory replay (Lopez-Paz and Ranzato, 2017), add extra computational and memory requirements beyond standard NNs, making them inadequate for edge devices.

Figure 1. Real-world example of on-device lifelong learning evaluated using the unsupervised clustering accuracy metric (Xie et al., 2016). The training latency is measured on two typical edge platforms.

Real-World Example. To illustrate these challenges, we present a real-world scenario in Fig. 1. Consider a camera deployed in the wild that continuously collects data from the surrounding environment. Our goal is to train an unsupervised object recognition algorithm on the edge device, purely from the data stream. We construct both iid and sequential (one class appears after the other) streams from CIFAR-100 (Banos and Saez, 2014), and adopt the smallest MobileNet V3 model (Howard et al., 2019) with the popular BYOL unsupervised learning pipeline (Grill et al., 2020). As seen in Fig. 1, while the model shows improved accuracy with iid streams, it suffers a significant performance loss under sequentially ordered data, highlighting the “forgetting” effect of NNs in a streaming and unsupervised setting. In terms of efficiency, we measure the training latency of MobileNet V3 (small) (Howard et al., 2019) on two typical edge platforms, Raspberry Pi (RPi) 4B (rpi, 2023a) and Jetson TX2 (jet, 2023), by running 10 gradient descent steps on a single batch of 32 samples. Even on these very capable edge platforms, training takes up to 17.4 seconds, clearly too slow for real-time processing at 30 FPS. Therefore, a novel approach capable of handling non-iid data and offering more efficient updates is necessary to accommodate the continual changes in data.

To address challenges (C1)-(C3), we draw inspiration from biology, where even tiny insects display remarkable lifelong learning abilities, and do so using “hardware” that requires very little energy (Avarguès-Weber et al., 2012). Hyperdimensional computing (HDC) is an emerging paradigm inspired by the information processing mechanisms found in biological brains (Kanerva, 2009). In HDC, all data is represented using high-dimensional, low-precision (often binary) vectors known as “hypervectors,” which can be manipulated through simple element-wise operations to perform tasks like memorization and learning. HDC is well-understood from a theoretical standpoint (Thomas et al., 2021) and shares intriguing connections with biological lifelong learning (Shen et al., 2021). Furthermore, its use of basic element-wise operators aligns with highly parallel and energy-efficient hardware, offering substantial energy savings in IoT applications (Kim et al., 2018; Imani et al., 2017; Dutta et al., 2022; Xu et al., 2023). While HDC is reported as a promising avenue, the literature to date has not explored weakly-supervised lifelong learning using HDC.

In this work, we design and deploy LifeHD, the first system for on-device lightweight lifelong learning in an unsupervised and dynamic environment. LifeHD leverages HDC’s efficient computation and advantages in lifelong learning, while effectively handling unlabeled streaming inputs. These capabilities extend beyond the scope of existing HDC designs, which have focused overwhelmingly on the supervised setting (Kim et al., 2018; Imani et al., 2017). Specifically, LifeHD represents the input as high-dimensional, low-precision vectors, and, drawing inspiration from work in cognitive science (Baddeley, 1992), organizes data into a two-tier memory hierarchy: a short-term “working memory” and a long-term memory. The working memory processes incoming data and summarizes it into a group of fine-grained clusters that are represented by hypervectors called cluster HVs. Long-term memory consolidates the cluster HVs that appear frequently in the working memory; these are occasionally retrieved for merging and inference. We emphasize that LifeHD is designed to suit a variety of edge devices with diverse resource levels. More efficiency gains can be achieved by employing optimizations such as pruning and quantization (Wang et al., 2022; Gim and Ko, 2022), but this is not the focus of our work.

Our basic approach in LifeHD is fully unsupervised. However, in reality, labels may be available (or could be acquired) for a small number of examples. We introduce LifeHD ${}_{\textrm{semi}}$ to exploit a limited number of labeled samples as an extension to the purely unsupervised LifeHD. Additionally, we propose LifeHD ${}_{\textrm{a}}$ , which uses an adaptive scheme inspired by model pruning to adjust the HD embedding dimension on the fly. LifeHD ${}_{\textrm{a}}$ allows us to further reduce resource usage (power in particular) where necessary.

In summary, the contributions of this paper are:

(1)

We design LifeHD, the first end-to-end system for on-device unsupervised lifelong intelligence using HDC. LifeHD builds upon HDC’s lightweight single-pass training capability and incorporates our novel clustering-based memory design to address challenges (C1)-(C3).
(2)

We further propose LifeHD ${}_{\textrm{semi}}$ as an extension to fully utilize the scarce labeled samples along with the stream. We devise LifeHD ${}_{\textrm{a}}$ that enables adaptive pruning in LifeHD to reduce real-time power consumption.
(3)

We implement LifeHD on off-the-shelf edge devices and conduct extensive experiments across three typical IoT scenarios. LifeHD improves the unsupervised clustering accuracy by up to 74.8% with up to 34.3x better energy efficiency compared to leading unsupervised NN lifelong learning methods (et al, 2022; Fini et al., 2022; Smith et al., 2021).
(4)

LifeHD ${}_{\textrm{semi}}$ improves the unsupervised clustering accuracy by up to 10.25% over the SemiHD (Imani et al., 2019b) baseline under limited label availability. LifeHD ${}_{\textrm{a}}$ limits the accuracy loss within 0.71% using only 20% of LifeHD’s full HD dimension.

The rest of the paper is organized as follows. We start with a comprehensive review of related work in Sec. 2. We then introduce the necessary background on HDC in Sec. 3. We formally define the unsupervised lifelong learning problem that we target in Sec. 4. Afterwards, Sec. 5 describes the details of our main design, LifeHD. Sec. 6 introduces LifeHD ${}_{\textrm{semi}}$ and LifeHD ${}_{\textrm{a}}$ . Sec. 7 presents the implementation and results of LifeHD, while the evaluations of LifeHD ${}_{\textrm{semi}}$ and LifeHD ${}_{\textrm{a}}$ are reported in Sec. 8. We provide discussion and future work in Sec. 9 and conclude the paper in Sec. 10.

2. Related Work

Lifelong and On-Device Learning. Lifelong learning (or continual learning) is a large and active area of research in the broader machine learning community. Catastrophic forgetting is a major challenge in lifelong learning, and refers to a commonly observed empirical phenomenon in which updating certain machine learning models with new data severely degrades their ability to perform previously learned tasks (McCloskey and Cohen, 1989). Previous works proposed techniques such as dynamic architecture (Rusu et al, 2016; Lee et al., 2020), regularization by penalizing important weights (Kirkpatrick et al., 2017; Zhang et al., 2020a), knowledge distillation from past models (Fini et al., 2022) and experience replay using a memory buffer (Lopez-Paz and Ranzato, 2017; Tiwari et al., 2022). The lifelong learning literature has examined a wide range of problem settings, ranging from the fully supervised case, in which tasks and class labels are provided, to the fully unsupervised case without any labels or prior knowledge (et al, 2022; Tiezzi et al., 2022). However, all of these works are based on deep NNs and require backpropagation, which is problematic for resource-constrained devices.

Neurally-inspired lightweight algorithms have recently been proposed for lifelong learning applications. FlyModel (Shen et al., 2021) and SDMLP (Bricken et al., 2023) use sparse coding and associative memory for lifelong learning. However, both approaches assume full supervision. STAM (Smith et al., 2021) is an expandable memory architecture for unsupervised lifelong learning, using layered receptive fields and a two-tier memory hierarchy. It learns via an online centroid-based clustering pipeline, novelty detection and memory updates. Nevertheless, the memory in STAM is solely dedicated to image storage, while our LifeHD additionally emphasizes merging past patterns into coarse groups and shows more effective learning performance.

Recent works optimize the resource usage of on-device training via pruning and quantization (Profentzas et al., 2022; Lin et al., 2022), tuning partial weights (Cai et al., 2020; Ren et al., 2021), memory profiling and optimization (Wang et al., 2022; Xu et al., 2022; Gim and Ko, 2022), as well as growing the NN on the fly (Zhang et al., 2020b). All these works optimize training given resource constraints and do not focus on lifelong learning. They are orthogonal to the contribution of LifeHD which focuses on adaptive and continual training. LifeHD can be further optimized by combining with such techniques.

Hyperdimensional Computing. HDC has garnered substantial interest from the computer hardware community as an energy-efficient and low-latency approach to learning, and has been successfully applied to problems such as human activity recognition (Kim et al., 2018), voice recognition (Imani et al., 2017), and image recognition (Dutta et al., 2022; Xu et al., 2023). The large majority of literature on HDC has focused on using the technique to perform supervised classification tasks. Among the limited literature for weakly-supervised learning with HDC, HDCluster (Imani et al., 2019a) enabled unsupervised clustering in HDC with a new algorithm that is similar to K-Means. SemiHD (Imani et al., 2019b) is a semi-supervised learning framework using HDC with iterative self-labeling. Hyperseed (Osipov et al., 2022), C-FSCIL (Hersche et al., 2022) and FSL-HD (Xu et al., 2023) adopted HDC or similar vector symbolic architectures (VSA) for unsupervised or few-shot learning. None of the above works considered the lifelong aspect; all used offline training on a static dataset. To the best of the authors’ knowledge, LifeHD is the first work that designs and deploys lifelong learning in edge IoT applications, especially with zero or a minimal number of labels.

3. Background on HDC

Hyperdimensional Computing (HDC) is an emerging paradigm for information processing from the cognitive-neuroscience literature (Kanerva, 2009). In HDC, all computation is performed on low-precision and distributed representations of data that accord naturally with highly parallel and low-energy hardware.

The first step in HDC is encoding, which maps an input $x\in\mathcal{X}$ to a distributed representation $\phi(x)$ living in some $D$ -dimensional inner-product space ${\mathcal{H}}$ , that we call the “HD-space.” For instance, one might take ${\mathcal{H}}\subset\{\pm 1\}^{D}$ , or ${\mathcal{H}}\subset{\mathbb{R}}^{D}$ . We refer to points in the HD-space as hypervectors. Encodings of data can be manipulated so as to build more complex composite representations using a set of operators defined as follows:

(1)

Bind: $\otimes:{\mathcal{H}}\times{\mathcal{H}}\to{\mathcal{H}}$ . Binding takes two hypervectors as inputs and returns a hypervector that is dissimilar to both inputs, and is intuitively used to represent tuples. For bipolar hypervectors (i.e., ${\mathcal{H}}\subset\{\pm 1\}^{D}$ ), the binding operator is typically element-wise multiplication.
(2)

Bundle: $\oplus:{\mathcal{H}}\times{\mathcal{H}}\to{\mathcal{H}}$ . Bundling takes two hypervectors as input and returns a hypervector similar to both operands, and is intuitively used to build sets. The bundling operation is implemented through addition.
(3)

Permute: $\rho:{\mathcal{H}}\to{\mathcal{H}}$ . Permutation can be used to encode sequential information and is typically implemented using a cyclic shift.
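As a minimal sketch of these three operators for bipolar hypervectors, the NumPy snippet below implements binding as element-wise multiplication, bundling as addition, and permutation as a cyclic shift, following the descriptions above. The helper names (`random_hv`, `bind`, `bundle`, `permute`, `cosine`) and the dimension `D = 10_000` are illustrative assumptions, not taken from the LifeHD code.

```python
import numpy as np

D = 10_000  # HD-space dimension (illustrative value)
rng = np.random.default_rng(0)


def random_hv(rng, dim=D):
    """Sample a bipolar hypervector uniformly from {-1, +1}^dim."""
    return rng.choice(np.array([-1, 1]), size=dim)


def bind(a, b):
    """Bind: element-wise multiplication; the result is dissimilar to both inputs."""
    return a * b


def bundle(*hvs):
    """Bundle: element-wise addition; the result stays similar to every operand."""
    return np.sum(np.stack(hvs), axis=0)


def permute(a, shift=1):
    """Permute: cyclic shift by `shift` positions."""
    return np.roll(a, shift)


def cosine(a, b):
    """Cosine similarity between two hypervectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a, b = random_hv(rng), random_hv(rng)
print(f"bind    vs a: {cosine(bind(a, b), a):+.3f}")    # close to 0
print(f"bundle  vs a: {cosine(bundle(a, b), a):+.3f}")  # close to 1/sqrt(2)
print(f"permute vs a: {cosine(permute(a), a):+.3f}")    # close to 0
```

For bipolar vectors, element-wise multiplication is its own inverse, so `bind(bind(a, b), b)` recovers `a` exactly.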

The encoding function $\phi:{\mathcal{X}}\to{\mathcal{H}}$ embeds data from its ambient representation into HD-space. In general, encoding should preserve some meaningful notion of similarity between input points in the sense that $\phi(x)\cdot\phi(x^{\prime})\approx k(x,x^{\prime})$ , where $k$ is some similarity function of interest on ${\mathcal{X}}$ . In this paper, we use spatiotemporal encoding for time series sensor data, and HDnn for more complex data, such as images, which we explain in the following.

Spatiotemporal Encoding. The spatiotemporal method (Moin et al., 2021) jointly encodes the analog information from each sensor (spatial) and at each time stamp (temporal) to a single hypervector. Suppose there are $d$ different sensors $s_{1},...,s_{d}$ , producing real-valued readings $x_{1},...,x_{d}$ respectively, so that we may model the input at a particular moment in time by a set of tuples $\{(s_{i},x_{i})\}_{i=1}^{d}$ . We pre-generate a set of base hypervectors to represent the values and sensors respectively. To represent a real-valued feature $x\in{\mathbb{R}}$ , we quantize the support of $x$ into a set of bins with centroids $a_{1},...,a_{Q}$ , and assign each bin an embedding $\varphi(a_{i})$ , which we call level hypervectors, such that $\varphi(a_{i})\cdot\varphi(a_{j})$ is monotonically decreasing in $|a_{i}-a_{j}|$ . As shown in Fig. 2 (left), we initially generate a random hypervector for the first level. To maintain similarity between adjacent level hypervectors, for each subsequent level, we randomly flip a fraction of bits from the previous level as described in (Thomas et al., 2021). The fraction of flipped bits is denoted as $P$ . This process is repeated until all $Q$ level hypervectors are generated. To represent the different sensors, we assign each sensor $s_{i}$ a random embedding $\psi(s_{i})$ , which we call ID hypervectors, by sampling $\psi(s_{i})\sim\text{Unif}(\{\pm 1\}^{D})$ .

The complete spatiotemporal encoding is visualized in Fig. 2 (right). We encode a pair $(s,x)$ via $\psi(s)\otimes\varphi(a(x))$ , where $a(x)$ is the centroid of the bin closest to $x$ . This preserves both the level and sensor ID information. To encode the readings for all sensors we bundle together their individual embeddings and round to bipolar (i.e., $\{\pm 1\}$ ) precision: $\phi(x)=\text{Sign}\left(\bigoplus_{i=1}^{d}\psi(s_{i})\otimes\varphi(a(x_{i}))\right)$ . Finally, to represent a sequence of $T$ readings $X=\{x_{1},...,x_{T}\}$ , we use permutation: $\phi(X)=\bigotimes_{\tau=1}^{T}\rho^{\tau}(\phi(x_{\tau}))$ .
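The spatiotemporal encoding can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions rather than the authors' implementation: the function names, the uniform bin centroids on $[0,1]$, and the tie-breaking rule in the sign function are assumptions, and flip positions are drawn independently at each level, which is one reading of "randomly flip a fraction of bits" (similarity to the first level then decreases in expectation).

```python
import numpy as np


def make_level_hvs(rng, Q, D, P):
    """Generate Q level hypervectors; each flips a fraction P of the previous level's bits."""
    levels = [rng.choice(np.array([-1, 1]), size=D)]
    n_flip = int(round(P * D))
    for _ in range(Q - 1):
        nxt = levels[-1].copy()
        flip_idx = rng.choice(D, size=n_flip, replace=False)
        nxt[flip_idx] *= -1
        levels.append(nxt)
    return np.stack(levels)  # shape (Q, D)


def encode_reading(x, id_hvs, level_hvs, centroids):
    """phi(x): bind each sensor ID with its level hypervector, bundle, then take the sign."""
    bins = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)  # nearest bin per sensor
    bundled = (id_hvs * level_hvs[bins]).sum(axis=0)
    return np.where(bundled >= 0, 1, -1)  # assumption: ties (sum == 0) map to +1


def encode_window(X, id_hvs, level_hvs, centroids):
    """phi(X): bind the permuted encodings rho^tau(phi(x_tau)) for tau = 1..T."""
    hv = np.ones(id_hvs.shape[1], dtype=np.int64)
    for tau, x in enumerate(X, start=1):
        hv *= np.roll(encode_reading(x, id_hvs, level_hvs, centroids), tau)
    return hv


rng = np.random.default_rng(0)
d, T, D, Q, P = 3, 4, 10_000, 10, 0.05   # illustrative sizes
centroids = np.linspace(0.0, 1.0, Q)      # assumption: readings lie in [0, 1]
level_hvs = make_level_hvs(rng, Q, D, P)
id_hvs = rng.choice(np.array([-1, 1]), size=(d, D))

X = rng.uniform(0.0, 1.0, size=(T, d))    # one window of T readings from d sensors
hv = encode_window(X, id_hvs, level_hvs, centroids)
print(hv.shape, np.unique(hv))
```

Because adjacent levels differ in exactly a fraction $P$ of positions, their normalized dot product is $1-2P$, which is how $P$ controls how quickly similarity decays across quantization levels.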

HDnn Encoding. In this work we use the recently proposed HDnn style encoding (Dutta et al., 2022; Xu et al., 2023) that combines a pretrained and frozen NN feature extractor with HDC’s spatiotemporal encoding to obtain state-of-the-art accuracy for sound and images. In HDnn the inputs to the spatiotemporal encoding, $s_{1},...,s_{d}$ , are intermediate feature outputs of the pretrained and frozen NN (Fig. 3). For example, a section of MobileNet pretrained on ImageNet creates features which are then encoded into HD hypervectors for object recognition tasks. This only marginally increases the computational costs because no training is performed on the NN; all the training happens in HD.

Supervised Training and Inference. A common use case of HDC, summarized in Fig. 3, is to fit classifiers. In particular, let us suppose that we see a set of $N$ labeled samples $\{(X_{i},y_{i})\}_{i=1}^{N}$ , where $X_{i}$ is an input, and $y_{i}\in\{c_{1},...,c_{J}\}$ is a class label. In the traditional approach to classification, one simply represents each class via the bundle of its training data. That is: $\phi(c_{j})=\bigoplus_{i:y_{i}=c_{j}}\phi(X_{i})$ . We store the trained class hypervectors in an associative memory. For example, in Fig. 3, we compute and store the class hypervectors of cats and dogs. During inference, we first encode the testing sample $X_{q}$ into a query hypervector $\phi(X_{q})$ using the same encoding procedure as for training. We then predict the label corresponding to the most similar class as measured by the cosine similarity, i.e., $\hat{y}=\operatorname*{argmax}_{j}\cos(\phi(X_{q}),\phi(c_{j}))=\operatorname*{argmax}_{j}(\phi(X_{q})\cdot\phi(c_{j}))/\|\phi(c_{j})\|$ , where the query norm $\|\phi(X_{q})\|$ is dropped because it does not depend on $j$ .
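The bundling-based training and cosine-based inference described above can be illustrated as follows. This is a sketch on synthetic data: the encoded samples are stand-ins produced by randomly flipping bits of two class prototypes (the `noisy_copy` helper is hypothetical), rather than outputs of a real encoder, and all names and sizes are assumptions.

```python
import numpy as np


def train_class_hvs(encoded, labels, num_classes):
    """Bundle (sum) the encoded training samples of each class into one class hypervector."""
    class_hvs = np.zeros((num_classes, encoded.shape[1]))
    np.add.at(class_hvs, labels, encoded)
    return class_hvs


def predict(query, class_hvs):
    """Return the index of the class hypervector most cosine-similar to the query."""
    # The query norm is omitted since it does not change the argmax.
    scores = class_hvs @ query / np.linalg.norm(class_hvs, axis=1)
    return int(np.argmax(scores))


def noisy_copy(rng, hv, flip_frac):
    """Stand-in for phi(X): flip a random fraction of a prototype's bits."""
    out = hv.copy()
    idx = rng.choice(hv.size, size=int(flip_frac * hv.size), replace=False)
    out[idx] *= -1
    return out


rng = np.random.default_rng(0)
D, J, n_per_class = 10_000, 2, 20                        # illustrative sizes
prototypes = rng.choice(np.array([-1, 1]), size=(J, D))  # synthetic class patterns
labels = np.repeat(np.arange(J), n_per_class)
encoded = np.stack([noisy_copy(rng, prototypes[y], 0.3) for y in labels])

class_hvs = train_class_hvs(encoded, labels, J)          # the associative memory
query = noisy_copy(rng, prototypes[1], 0.3)
print(predict(query, class_hvs))
```

Training touches each sample once and requires only additions, which is the single-pass property that LifeHD builds on.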

4. Problem Definition

Before diving into our method, we first rigorously formulate the unsupervised lifelong learning problem using streaming sources, driven by real-world IoT applications.

Streaming Data. To represent a continuously changing environment, we assume a well-known class-incremental model in lifelong learning, in which new classes emerge in a sequential manner (Rao et al., 2019). We also allow data distribution shift within one class. This setting models a scenario in which a device is continuously sampling data while the surrounding environment may change implicitly over time, e.g., the self-driving vehicle as shown in Fig. 1. We require that all samples appear only once (i.e., single-pass streams).

Formally, we consider a scenario involving $d$ sensors, each producing a real-valued reading. We group readings into sliding windows of length $T$ , and treat one such batch $X_{i}\in{\mathbb{R}}^{T\times d}$ as an input sample. Each input $X_{i}$ is associated with an unknown label $y_{i}$ . Importantly, neither the labels nor the boundaries of class shift are made available during training. Therefore the entire process is unsupervised. We represent the data stream associated with each class by ${\mathcal{D}}_{j}=\{X_{1},X_{2},...\}$ , and the set of streams for all classes by ${\mathcal{D}}=\{{\mathcal{D}}_{1},...,{\mathcal{D}}_{J}\}$ . Note that the class-incremental streams can have imbalanced classes, i.e., $|{\mathcal{D}}_{i}|\neq|{\mathcal{D}}_{j}|,i\neq j$ , and gradual distribution shift within each class.

Learning Protocol. Our goal is to build a classification algorithm that maps $\mathcal{X}\rightarrow\mathcal{Y}$ . For evaluation, we use the common evaluation protocol in state-of-the-art lifelong learning works (Fini et al., 2022; et al, 2022; Smith et al., 2021), in which we construct an iid dataset $\mathcal{E}=\{(X_{k},y_{k})\}$ for periodic testing, by sampling labeled examples from each class in a manner that preserves the overall (im)balance between the classes. Note that even when a class has not yet appeared in the training data stream, it is always included in $\mathcal{E}$ . Hence $\mathcal{E}$ is a global view of all classes that can potentially exist in the environment.

Unsupervised Clustering Accuracy. Since we do not give class labels or the total number of classes during training, the predicted label can be different from the ground-truth label. Therefore, as the evaluation metric, we cannot adopt the simple prediction accuracy that requires exact label matching. Instead, we employ a widely used clustering metric known as unsupervised clustering accuracy (ACC) (Xie et al., 2016), which mirrors the conventional accuracy evaluation but within an unsupervised context.

Suppose $\omega_{k}$ is the predicted cluster of testing sample $(X_{k},y_{k})$ in $\mathcal{E}$ . ACC is computed as: $ACC=\max_{m}\frac{1}{|\mathcal{E}|}\sum_{k=1}^{|\mathcal{E}|}\mathbf{1}\left\{y_{k}=m(\omega_{k})\right\}$ , where $m$ ranges over all possible one-to-one mappings between predicted clusters and ground-truth classes. Intuitively, this metric computes the accuracy under the “best” mapping between clusters and labels. The biggest advantage of ACC is that it does not require the number of clusters and classes to be equal. For instance, a cluster of pines and a cluster of redwoods both belong to the ground-truth label of trees. We treat such a clustering result as a valid learning outcome, with a concrete visualization shown in Sec. 7.4.
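To make the ACC formula concrete, the sketch below evaluates it by brute force under a strict one-to-one reading of the mapping $m$ as written in the formula. The function name `clustering_accuracy` and the padding rule (an unmatched cluster maps to no class and its samples count as errors) are assumptions for illustration. Enumerating all mappings is only feasible for a handful of clusters; a polynomial-time assignment solver such as the Hungarian algorithm is the usual choice at scale.

```python
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """ACC: best accuracy over one-to-one mappings from predicted clusters to classes."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    # Pad with None so that surplus clusters can be left unmatched (assumption).
    k = max(len(classes), len(clusters))
    padded = classes + [None] * (k - len(classes))
    best = 0
    for assigned in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        correct = sum(mapping[w] == y for y, w in zip(y_true, y_pred))
        best = max(best, correct)
    return best / len(y_true)


# Cluster ids are arbitrary: only the grouping matters, not the id values.
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))              # 1.0
print(clustering_accuracy([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 3, 7]))  # 5/6
```

In the second call, cluster 3 contains two samples of class 1 and one of class 2, so the best mapping misclassifies exactly one of the six samples.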

5. LifeHD

In this section, we present the design of LifeHD, the first unsupervised HDC framework for lifelong learning in general edge IoT applications. Compared to operating in the original data space, HDC improves pattern separability through sparsity and high dimensionality, making it more resilient against catastrophic forgetting (Shen et al., 2021). LifeHD preserves the advantages of HDC in computational efficiency and lifelong learning, while handling the input of unlabeled streaming data, which has not been achieved in previous work (Imani et al., 2019a, b; Osipov et al., 2022; Hersche et al., 2022; Xu et al., 2023).

5.1. LifeHD Overview

Fig. 4 gives an overview of how LifeHD works. The first step is HDC encoding of data into hypervectors as described in Sec. 3. Training samples $X$ are organized into batches of size $bSize$ and input into an optional fixed NN for feature extraction (e.g. for images and sound) and the encoding module. The encoded hypervectors $\phi(X)$ are input to LifeHD’s two-tier memory design inspired by cognitive science studies (Baddeley, 1992), consisting of working memory and long-term memory. This memory system intelligently and dynamically manages historical patterns, stored as hypervectors and referred to as cluster HVs. As shown in Fig. 4, the working memory is designed with three components: novelty detection, cluster HV update and cluster HV merge. $\phi(X)$ is first input into the novelty detection step (1⃝). If a novelty flag is raised, $\phi(X)$ is inserted as a new cluster HV; otherwise $\phi(X)$ updates the existing cluster HVs (2⃝). The third component, cluster HV merge (3⃝), retrieves the cluster HVs from long-term memory, and merges similar cluster HVs into a supercluster via a novel spectral clustering-based merging algorithm (Von Luxburg, 2007). The interaction between working and long-term memory happens as commonly encountered cluster HVs are copied to long-term memory, which we call consolidation (4⃝). Finally, when the size limit of either working or long-term memory is reached, the least recently used cluster HVs are forgotten (5⃝).

All modules in LifeHD work collaboratively, making it adaptive and robust to continuously changing streams without relying on any form of prior knowledge. For example, in scenarios of distribution drift, LifeHD may generate new cluster HVs upon encountering drifted samples initially, which can later be merged into coarse clusters. This approach ensures that LifeHD can efficiently capture and retain historical patterns.

In the following, we discuss more details about the major components of LifeHD: novelty detection (Sec. 5.2), cluster HV update (Sec. 5.3), and cluster HV merging (Sec. 5.4). We summarize the important notations used in this paper in Table 1.

Table 1. List of important notations.

Symbol	Meaning
$d$	Number of sensor sources
$T$	Time window length of one input sample $X$
$D$	Dimension of the HD-space
$Q$	Number of quantization levels for encoding
$P$	Fraction of bits randomly flipped to generate each successive level hypervector
$\phi$	HDC encoding function
$\varphi,\psi$	Level and ID hypervector encoding function
$bSize$	Batch size of input samples
$\mathcal{M},\mathcal{L}$	Set of cluster HVs stored in the working and long-term memory
$M,L$	Maximum number of cluster HVs in the working and long-term memory
$\mu,\hat{\sigma}$	Mean similarity and standard difference between each cluster HV and its assigned inputs in the working memory
$hit$	The number of times that each cluster HV is hit in the working memory
$hit_{th}$	The hit frequency threshold to consolidate cluster HV from working to long-term memory
$p,q$	The most recent batch index when each cluster HV is accessed, for the working and long-term memory cluster HVs
$\gamma$	Hyperparameter for novelty detection sensitivity
$\alpha$	Moving average update rate during cluster HV update
$g_{ub}$	Cluster HV merge sensitivity
$f_{merge}$	Cluster HV merge frequency
$r$	Average labeling ratio in LifeHD ${}_{\textrm{semi}}$
$D_{a}$	Dimension of the mask used in LifeHD ${}_{\textrm{a}}$

5.2. Novelty Detection

The initial novelty detection step (1⃝ in Fig. 4) is crucial for identifying emerging patterns in the environment. Suppose $\mathcal{M}=\{m_{1},...,m_{M}\}$ is the set of cluster HVs stored in the working memory. We gauge the “radius” of each cluster by tracking two scalars for each cluster HV $i$ : $\mu_{i}$ and $\hat{\sigma}_{i}$ , which represent the mean cosine similarity and standard difference between the cluster HV and its assigned inputs. Given $\phi(X)$ , we first identify the most similar cluster HV, denoted by $j$ . LifeHD marks $\phi(X)$ as “novel” if it substantially differs from its nearest cluster HV. Specifically, this dissimilarity is measured by comparing $\cos(\phi(X),m_{j})$ with a threshold based on the historical similarity distribution of cluster HV $j$ :

(1)	$\displaystyle\textrm{If }\cos(\phi(X),m_{j})<\mu_{j}-\gamma\hat{\sigma}_{j}\textrm{, then flag novel.}$

The hyperparameter $\gamma$ fine-tunes the sensitivity to novelties.

LifeHD recognizes new $\phi(X)$ as prototypes and inserts them into the working memory. When reaching its size limit $M$ , the working memory experiences forgetting (5⃝ in Fig. 4). The least recently used (LRU) cluster HV, represented by $LRU=\operatorname*{argmin}_{i=1}^{M}p_{i}$ , is replaced. Here $p$ corresponds to the latest batch index where the cluster HV was accessed. A similar forgetting mechanism is configured for the long-term memory, where the last batch accessed is marked with $q$ .
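The novelty test of Eq. (1) and the LRU replacement rule can be sketched as follows. The function names, the array-based memory layout, and the numeric values of $\mu$, $\hat{\sigma}$, $\gamma$ and $p$ are illustrative assumptions. How $\mu$ and $\hat{\sigma}$ are initialized for a newly inserted cluster HV is not specified in this section, so the sketch leaves those entries untouched.

```python
import numpy as np


def detect_novelty(hv, cluster_hvs, mu, sigma, gamma):
    """Return (j, novel): nearest cluster index and whether Eq. (1) flags hv as novel."""
    sims = cluster_hvs @ hv / (np.linalg.norm(cluster_hvs, axis=1) * np.linalg.norm(hv))
    j = int(np.argmax(sims))
    return j, bool(sims[j] < mu[j] - gamma * sigma[j])


def lru_slot(last_access):
    """Index of the least recently used cluster HV (smallest last-access batch index p)."""
    return int(np.argmin(last_access))


rng = np.random.default_rng(0)
D = 10_000
cluster_hvs = rng.choice(np.array([-1.0, 1.0]), size=(3, D))  # full working memory, M = 3
mu = np.array([0.8, 0.8, 0.8])         # illustrative per-cluster mean similarities
sigma = np.array([0.05, 0.05, 0.05])   # illustrative per-cluster spreads
p = np.array([12, 4, 9])               # last batch index at which each cluster was accessed
gamma = 1.0

familiar = cluster_hvs[0].copy()
familiar[: D // 20] *= -1              # flip 5% of bits: cosine with cluster 0 is 0.9
stranger = rng.choice(np.array([-1.0, 1.0]), size=D)

print(detect_novelty(familiar, cluster_hvs, mu, sigma, gamma))
j, novel = detect_novelty(stranger, cluster_hvs, mu, sigma, gamma)
if novel:
    slot = lru_slot(p)                 # memory is full, so the LRU cluster HV is replaced
    cluster_hvs[slot], p[slot] = stranger, 13
```

With these values the threshold for every cluster is $0.8-1.0\times 0.05=0.75$, so the lightly perturbed input is absorbed by cluster 0 while the unrelated input is flagged as novel and overwrites the cluster with the smallest $p$.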

5.3. Cluster HV Update

If novelty is not detected, indicating that $\phi(X)$ closely matches cluster HV $j$ , we proceed to update the cluster HV and its associated information (2⃝ in Fig. 4). This update process involves bundling $\phi(X)$ with cluster HV $m_{j}$ , akin to how class hypervectors are updated as described in Sec. 3, and updating $\mu_{j}$ and $\hat{\sigma}_{j}$ with their moving average:

(2a)	$\displaystyle m_{j}$	$\displaystyle\leftarrow m_{j}\oplus\phi(X)$
(2b)	$\displaystyle\mu_{j}$	$\displaystyle\leftarrow(1-\alpha)\mu_{j}+\alpha\cos(\phi(X),m_{j})$
(2c)	$\displaystyle\hat{\sigma}_{j}$	$\displaystyle\leftarrow(1-\alpha)\hat{\sigma}_{j}+\alpha\left|\cos(\phi(X),m_{j})-\mu_{j}\right|$
(2d)	$\displaystyle hit_{j}$	$\displaystyle\leftarrow hit_{j}+1,p_{j}\leftarrow idx$

The hyperparameter $\alpha$ adjusts the balance between historical and recent inputs, where a higher $\alpha$ gives more weight to recent samples. Properly maintaining $\mu_{j}$ and $\hat{\sigma}_{j}$ is vital for tracking the “radius” of each cluster HV, affecting future novelty detection. We also increase the hit frequency $hit_{j}$ and refresh $p_{j}$ with the current batch index $idx$ . $hit_{j}$ is further compared with a predetermined threshold $hit_{th}$ to decide when a working memory cluster HV appears sufficiently frequently to be consolidated to long-term memory (4⃝ in Fig. 4). $p_{j}$ determines forgetting as described in the previous section. With this lightweight approach, LifeHD continually records temporal cluster HVs from the environment, while the most prominent cluster HVs are transferred to long-term memory.
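A small worked instance of Eqs. (2a)-(2d) is given below. It is a sketch under assumptions that the text leaves open: the cosine similarity is measured against $m_{j}$ before bundling, the equations are applied in the listed order (so (2c) sees the already-updated $\mu_{j}$), and consolidation triggers when $hit_{j}\geq hit_{th}$. The dictionary layout and names are illustrative.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def update_cluster(mem, j, hv, alpha, batch_idx):
    """Apply Eqs. (2a)-(2d) to cluster j of the working memory `mem` (a dict of arrays)."""
    sim = cosine(hv, mem["hv"][j])  # assumption: measured before bundling
    mem["hv"][j] += hv                                        # (2a) bundle
    mem["mu"][j] = (1 - alpha) * mem["mu"][j] + alpha * sim   # (2b)
    dev = abs(sim - mem["mu"][j])   # assumption: uses the updated mu_j
    mem["sigma"][j] = (1 - alpha) * mem["sigma"][j] + alpha * dev  # (2c)
    mem["hit"][j] += 1                                        # (2d) hit count
    mem["p"][j] = batch_idx                                   # (2d) last-access index
    return int(mem["hit"][j])


mem = {
    "hv": np.array([[1.0, 1.0, 1.0, 1.0]]),  # toy 4-dimensional cluster HV
    "mu": np.array([0.8]),
    "sigma": np.array([0.1]),
    "hit": np.array([0]),
    "p": np.array([0]),
}
long_term = []
hit_th = 1  # illustrative consolidation threshold

hits = update_cluster(mem, 0, np.array([1.0, 1.0, 1.0, -1.0]), alpha=0.1, batch_idx=7)
if hits >= hit_th:  # assumption: ">=" comparison
    long_term.append(mem["hv"][0].copy())  # consolidation copies the cluster HV
print(mem["hv"][0], mem["mu"][0], mem["sigma"][0])
```

Here the input has cosine similarity $0.5$ with the cluster HV, so $\mu_{j}$ moves from $0.8$ to $0.9\times 0.8+0.1\times 0.5=0.77$ and $\hat{\sigma}_{j}$ from $0.1$ to $0.9\times 0.1+0.1\times|0.5-0.77|=0.117$.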

5.4. Cluster HV Merging

Cluster HV merge (3⃝ in Fig. 4) has the dual benefit of reducing memory use and of elucidating underlying similarity structure in the data. Intuitively, a group of cluster HVs can be merged if they are similar to each other and dissimilar from other cluster HVs. For instance, one might merge the cluster HVs for Bulldog and Chihuahua into a single “Dog” cluster HV, that remains distinct from the cluster HV for “Tabby Cat”.

To merge the cluster HVs, we first construct a similarity graph defined over the cluster HVs from the long-term memory. The cluster HVs correspond to nodes, and a pair of cluster HVs are connected by an edge if they are sufficiently similar. We then merge the cluster HVs by computing a particular type of cut in the graph in a manner similar to spectral clustering (Ng et al., 2001). This graph based formalism for clustering is able to capture complex types of cluster geometry and often substantially outperforms simpler approaches like K-Means (Von Luxburg, 2007). We detail the steps of cluster HV merging in LifeHD below, while Fig. 5 offers an illustrative overview.

Step 1: Preprocessing. Given the set of long-term memory cluster HVs $\mathcal{L}=\{l_{1},...,l_{L}\}$ , we construct a graph ${\mathcal{G}}$ using the adjacency matrix $A\in\{0,1\}^{L\times L}$ . Here, $A_{ij}=A_{ji}=\mathbbm{1}[\cos(l_{i},l_{j})\geq\beta]$ , with $\beta$ as an adaptive threshold. In other words, an edge connects cluster HVs $l_{i}$ and $l_{j}$ if their similarity in HD-space surpasses $\beta$ . A larger $\beta$ implies that cluster HVs must be more similar to be considered for merging. In practice, we set $\beta=\frac{1}{M}\sum_{i=1}^{M}\mu_{i}$ , representing the mean of the observed cluster HVs.

Step 2: Decomposition. We compute the Laplacian $W=D-A$ , where $D$ is the diagonal matrix in which $D_{ii}=\sum_{j}A_{ij}$ . We then compute the eigenvalues $\lambda_{1},...,\lambda_{L}$ , sorted in increasing order, and the corresponding eigenvectors $\nu_{1},...,\nu_{L}$ of $W$ .

Step 3: Grouping. We infer $k=\max\{i\in[L]:\lambda_{i}\leq g_{ub}\}$ , i.e., the number of eigenvalues that do not exceed $g_{ub}$ , and merge the cluster HVs by running K-Means on $\nu_{1},...,\nu_{k}$ . The upper bound $g_{ub}$ is a hyperparameter that adjusts the granularity of merging, with a smaller $g_{ub}$ leading to a smaller $k$ and thus more aggressive merging.

Our merging approach is formally grounded, as discussed in (Von Luxburg, 2007). It is a well-known fact that the eigenvectors of $W$ encode information about the connected components of ${\mathcal{G}}$ . When ${\mathcal{G}}$ has $k$ connected components, the eigenvalues $\lambda_{1}=\lambda_{2}=...=\lambda_{k}=0$ . To recover these components, K-Means clustering on $\nu_{1},...,\nu_{k}$ can be employed, as explained in (Von Luxburg, 2007). In practice, however, the graph may contain a few edges between components that should ideally be disconnected. For instance, when the similarity threshold is imprecisely set, erroneous edges may appear in the graph, causing $\lambda_{1},...,\lambda_{k}$ to be only approximately zero. Our merging approach is designed to handle this situation by introducing $g_{ub}$ . The cluster HV merging is evaluated every $f_{merge}$ batches, where $f_{merge}$ is a hyperparameter that controls the trade-off between merging performance and computational latency. Both $g_{ub}$ and $f_{merge}$ are analyzed in Sec. 7.8, along with other key hyperparameters in LifeHD.
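The three merging steps can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `merge_groups` is hypothetical, $k$ is taken as the number of eigenvalues not exceeding $g_{ub}$ (with a floor of 1), and K-Means is a plain Lloyd iteration with farthest-point initialization chosen only to keep the example deterministic.

```python
import numpy as np

def merge_groups(hvs, beta, g_ub, n_iter=20):
    """Return (group label per cluster HV, inferred k) for the rows of hvs."""
    unit = hvs / np.linalg.norm(hvs, axis=1, keepdims=True)
    # Step 1: adjacency matrix from thresholded cosine similarity.
    A = (unit @ unit.T >= beta).astype(float)
    # Step 2: unnormalized Laplacian W = D - A and its eigendecomposition.
    W = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(W)  # eigenvalues returned in ascending order
    # Step 3: k = number of eigenvalues <= g_ub, then K-Means on the first k eigenvectors.
    k = max(1, int(np.sum(eigvals <= g_ub)))
    emb = eigvecs[:, :k]
    centers = [emb[0]]
    for _ in range(1, k):  # farthest-point initialization (an assumption)
        d = np.min([np.sum((emb - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):  # Lloyd iterations
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(axis=0)
    return labels, k
```

Cluster HVs that receive the same label would then be bundled into a single HV. When the graph splits into exactly $k$ connected components, the rows of the embedding are identical within a component, so K-Means recovers the components exactly.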

Time Complexity of Merging. A potentially limiting issue with spectral clustering is its time complexity, which is $O(L^{3})$ in the worst case. However, this is not a concern in our setting. First, the number of cluster HVs in long-term memory ( $L$ ) is typically small, around $50$ in practice, resulting in modest worst-case complexity. Second, worst-case analysis is overly pessimistic, as it assumes a full eigendecomposition of the graph Laplacian ( $W$ ). In practice, $W$ is nearly always approximately low rank, meaning that only the first $k\ll L$ eigenvectors are needed. In such cases, fast randomized eigendecomposition algorithms can reduce the time complexity to linear in $L$ (Halko et al., 2011). Thus, while spectral clustering is sometimes colloquially thought of as an “expensive” procedure, this is true only in very unfavorable “worst-case” settings. In practice, its complexity is modest and acceptable for our situation, as shown in Sec. 7.5.

6. Variants of LifeHD

While LifeHD is designed to cater to general IoT applications with streaming input and without supervision, real-world scenarios may vary. Some scenarios might have a few labeled samples in addition to the unlabeled stream, while others may require operation within strict power constraints. LifeHD offers extensibility to address these diverse needs. In this section, we introduce two software-based extensions: LifeHD ${}_{\textrm{semi}}$ , which adds a separate processing path to manage labeled samples, and LifeHD ${}_{\textrm{a}}$ , which adaptively prunes the HDC model using masking to handle low-power scenarios.

6.1. LifeHD ${}_{\textrm{semi}}$

While LifeHD excels in unsupervised scenarios, it does not harness labeled data when available. To address this limitation, we introduce LifeHD ${}_{\textrm{semi}}$ as an extension that enhances accuracy by utilizing the limited labels. In Fig. 6, we provide an overview of LifeHD ${}_{\textrm{semi}}$ . For each input batch $idx$ , we consider two subsets: one labeled $(X_{l,idx},y_{l,idx})$ and one unlabeled $X_{ul,idx}$ . We denote the average labeling ratio throughout the data stream as $r=\frac{\sum_{idx}|X_{l,idx}|}{\sum_{idx}\left(|X_{l,idx}|+|X_{ul,idx}|\right)}$ . Since obtaining external supervision is often challenging in dynamic environments, we focus on cases where $r\leq 0.01$ .

LifeHD ${}_{\textrm{semi}}$ retains the two-tier memory structure of LifeHD but introduces modifications to the working memory components. In the LifeHD ${}_{\textrm{semi}}$ pipeline, the working memory undergoes three key steps. Firstly, labeled samples $(X_{l},y_{l})$ update labeled class hypervectors following the conventional HDC methods outlined in Sec. 3. Next, we process unlabeled samples $X_{ul}$ through novelty detection and HV update modules, mirroring LifeHD. Importantly, in LifeHD ${}_{\textrm{semi}}$ , these operations are applied to both labeled HVs and cluster HVs. Lastly, we introduce a merging step to group labeled HVs and cluster HVs that are closely related. To handle labeled HVs, we modify the adjacency matrix $A$ by making it diagonal for labeled entries. For example, if the first $J$ HVs correspond to labeled HVs, we ensure that $A_{1:J,1:J}=\text{diag}([1,...,1])$ , while calculating the remaining values following LifeHD procedures. This strategy prevents the merging of labeled HVs with each other. With these adjustments, LifeHD ${}_{\textrm{semi}}$ offers a solution that retains the core elements of LifeHD while handling scarce labeled inputs.
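The adjacency modification for labeled HVs can be illustrated as follows. This is a sketch under assumptions: the helper name `semi_adjacency` is hypothetical, and the labeled HVs are assumed to occupy the first $J$ rows, as in the example in the text.

```python
import numpy as np

def semi_adjacency(hvs, num_labeled, beta):
    """Adjacency over [labeled HVs; cluster HVs] with the labeled block set to identity."""
    unit = hvs / np.linalg.norm(hvs, axis=1, keepdims=True)
    A = (unit @ unit.T >= beta).astype(int)
    J = num_labeled
    # A[1:J, 1:J] = diag(1, ..., 1): remove every edge between two distinct labeled HVs.
    A[:J, :J] = np.eye(J, dtype=int)
    return A
```

Edges between a labeled HV and an unlabeled cluster HV are left untouched, so a cluster HV can still be grouped with the labeled HV it resembles.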

6.2. LifeHD ${}_{\textrm{a}}$

While HDC computation is typically lightweight, there may be instances of energy scarcity (e.g., when powered by a solar panel) that call for a balance between accuracy and power efficiency. Similar to neural networks, one approach is to prune the HDC model using a mask, retaining the most crucial HDC dimensions post-encoding (Khaleghi et al., 2020). Dimension importance can be determined by aggregating all class hypervectors into one and sorting the values across all dimensions. Notably, direct reduction of the encoding dimension should be avoided, as it can degrade HDC’s expressive capability and end up with corruption. Fig. 7 (left) visually demonstrates the impact of masking in supervised HDC tasks: retaining the top 6000 bits incurs only a 3% accuracy loss compared to using the full 10K-bit precision.

While masking has been employed in prior HDC studies (Khaleghi et al., 2020), these methods are not directly applicable to LifeHD because they assume an offline training setting with iid data. With streaming non-iid data in LifeHD, the set of observed cluster HVs may only represent a subset of the potential classes, and the less significant bits could become crucial as new classes are introduced.

We introduce LifeHD ${}_{\textrm{a}}$ , which enhances LifeHD through an adaptive masking approach applied to all cluster HVs in working and long-term memory. Let $D_{a}$ represent the target dimension for reduction. The rationale behind LifeHD ${}_{\textrm{a}}$ is depicted in Fig. 7 (right). Whenever LifeHD detects novelty, we temporarily revert to the full dimension $D$ for 2 batches, which is sufficient for LifeHD to consolidate new patterns in its memory. After these two batches, we assess the long-term memory cluster HVs by aggregating them and ranking the dimensions, and derive a mask retaining the $D_{a}$ dimensions with the largest absolute values. This mask is then applied to the following batches of $\phi(X)$ immediately after encoding, up until the next novelty is detected. Novelty detection is executed with the masked hypervectors. Importantly, LifeHD ${}_{\textrm{a}}$ can utilize the same novelty detection sensitivity as LifeHD since the most significant dimensions dominate the similarity check. In other words, the similarity results in LifeHD ${}_{\textrm{a}}$ using $D_{a}$ dimensions are close to those obtained with the full dimension. LifeHD ${}_{\textrm{a}}$ offers an adaptive HDC model pruning interface with minimal accuracy loss and overhead.
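The mask derivation can be sketched as below. The function names `derive_mask` and `apply_mask` are hypothetical, aggregation is assumed to be an element-wise sum over the long-term memory cluster HVs, and masked dimensions are simply zeroed here, whereas an efficient implementation would presumably skip them altogether.

```python
import numpy as np

def derive_mask(ltm_hvs, d_a):
    """Boolean mask keeping the d_a dimensions with the largest |aggregate| value."""
    aggregate = np.sum(ltm_hvs, axis=0)  # aggregate all long-term memory cluster HVs
    keep = np.argsort(-np.abs(aggregate), kind="stable")[:d_a]
    mask = np.zeros(aggregate.shape[0], dtype=bool)
    mask[keep] = True
    return mask

def apply_mask(phi_x, mask):
    """Zero out the pruned dimensions of an encoded sample."""
    return np.where(mask, phi_x, 0)
```

In the adaptive scheme, `derive_mask` would be re-run two batches after each detected novelty, and `apply_mask` would be used on every encoded batch until the next novelty.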

Table 2. Experimental setup of LifeHD across all datasets.

Dataset	Application Category	Classes (Balanced?)	Total Samples	Training Data Order	HDnn?	Pretrained Models	# of Params
MHEALTH (Banos and Saez, 2014)	Activity	12 (N)	9K	Temporal order during collection	N	-	-
ESC-50 (Piczak, 2015)	Sound	50 (Y)	2K	Class-incremental, random within class	Y	ACDNet (Mohaimenuzzaman et al., 2023)	4.7M
CIFAR-100 (Krizhevsky et al., 2009)	Image	20 (Y)	60K	Class-incremental, random within class or gradual rotation within class	Y	MobileNet V2 (Sandler et al., 2018) or MobileNet V3 small (Howard et al., 2019)	2.2M or 927K

7. Evaluation of LifeHD

7.1. System Implementation

We implement LifeHD with Python and PyTorch (Paszke et al., 2019) and deploy it on three standard edge platforms: Raspberry Pi (RPi) Zero 2 W (rpi, 2023b), Raspberry Pi 4B (rpi, 2023a), and Jetson TX2 module (jet, 2023). The selected edge platforms represent three tiers with small, medium and abundant resources.

RPi Zero 2 W has a 1GHz quad-core Cortex-A53 CPU and 512MB SDRAM. RPi 4B has a 1.8GHz quad-core Cortex-A72 CPU and 4GB SDRAM. The Jetson TX2 platform is equipped with a dual-core NVIDIA Denver 2 CPU, a quad-core ARM Cortex-A57 MPCore, an NVIDIA Pascal family GPU with 256 NVIDIA CUDA cores, and 8GB RAM. We measure the training latency per batch and the energy consumption using the Hioki 3334 power meter (Hioki, 2023).

We are aware that all NN-based feature extractors can be pruned and quantized to attain more efficient deployment on edge platforms (Dutta et al., 2022), as can the NN-based baselines we compare to (et al, 2022; Fini et al., 2022). However, NN model compression is not the primary focus of LifeHD. Existing compression techniques (Neill, 2020; Kim et al., 2019) can be applied directly to the feature extractor in LifeHD. We leave acceleration design and deployment on emerging hardware for future work.

7.2. Experimental Setup

We conduct comprehensive experiments to evaluate LifeHD on three typical edge scenarios. All three scenarios incorporate continuous data streams and expect lifelong learning over time. We summarize the experimental setup in Table 2.

Application #1: Personal Health Monitoring. Continuous health monitoring has emerged as a popular use case for IoT. We utilize the MHEALTH (Banos and Saez, 2014) dataset which includes measurements of acceleration, rate of turn, and magnetic field orientation on a smartwatch. MHEALTH differentiates 12 activities in daily lives and is collected from 10 subjects. Notably, MHEALTH employs raw time-series signals rather than processed frequency components as inputs. We use time windows of 2.56s ( $T=128$ ) with 75% overlap to generate the samples. In contrast to previous datasets, we strictly adhere to the temporal order during data collection.
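The windowing above can be sketched as follows; $T=128$ samples over 2.56s implies a 50 Hz sampling rate, and 75% overlap corresponds to a stride of 32 samples. The helper name `sliding_windows` is an assumption for illustration.

```python
def sliding_windows(signal, window=128, overlap=0.75):
    """Split a sequence of samples into fixed-length windows with fractional overlap."""
    step = int(window * (1 - overlap))  # 128 * 0.25 = 32 samples between window starts
    return [signal[s:s + window] for s in range(0, len(signal) - window + 1, step)]
```

Because the windows are generated in their original order, the temporal order of the recording is preserved in the resulting sample stream.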

Application #2: Sound Characterization. Continuous sound detection contributes to the characterization of urban environments. We choose the ESC-50 (Piczak, 2015) dataset to emulate this scenario. This dataset comprises 5-second-long recordings categorized into 50 semantically diverse classes, including animals, human sounds, and urban noises. We construct the class-incremental streams by arranging the data in random order within each class.

Application #3: Object Recognition. Object recognition is a common use case for camera-mounted mobile systems, e.g., self-driving vehicles. We set up a class-incremental stream from CIFAR-100 (Krizhevsky et al., 2009), consisting of 32 $\times$ 32 RGB images of 20 coarse classes. We further evaluate the case of data distribution drift by examining gradual rotations occurring within each CIFAR-100 class.
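A class-incremental stream of the kind used for ESC-50 and CIFAR-100 can be constructed as sketched below. The function name and the choice to visit classes in sorted label order are assumptions; the text only specifies that samples are randomly ordered within each class.

```python
import random

def class_incremental_stream(samples, labels, seed=0):
    """Order samples class by class, shuffling only within each class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    stream = []
    for y in sorted(by_class):  # class visiting order is an assumption
        items = by_class[y][:]
        rng.shuffle(items)      # random order within the class
        stream.extend((x, y) for x in items)
    return stream
```

In the unsupervised setting the labels in the returned pairs would be used only for evaluation, never for training.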

On MHEALTH, LifeHD relies entirely on the HDC spatiotemporal encoder to process the raw time-series signals. For ESC-50 and CIFAR-100, LifeHD utilizes the HDnn framework with a pretrained feature extractor before HDC encoding, as in the state-of-the-art HDC works (Shen et al., 2021; Hersche et al., 2022). Specifically, we adapt a pretrained ACDNet with quantized weights (Mohaimenuzzaman et al., 2023) for ESC-50. ACDNet is a compact convolutional neural network architecture designed for small embedded devices. For CIFAR-100, we use a MobileNet V2 (Sandler et al., 2018) for accuracy evaluation and MobileNet V3 small (Howard et al., 2019) for efficiency evaluation, both pretrained on ImageNet (Russakovsky et al., 2015). For all pretrained NNs, we remove the last fully connected layer used for classification and keep the remaining weights frozen.

Table 3 summarizes the key hyperparameters in LifeHD, which are selected based on a separate validation set. We set $\alpha=0.1$ for the moving-average update and $hit_{th}=10$ for long-term memory consolidation. The long-term memory size $L$ is set to 50 in all cases.

Table 3. Important hyperparameter configuration of LifeHD.

Dataset	HDC Encoding			LifeHD Design
	$D$	$Q$	$P$	$bSize$	$M$	$\gamma$	$g_{ub}$	$f_{merge}$
MHEALTH	1000	5	0.01	32	50	3.0	0.2	25
ESC-50	10000	100	0.02	32	100	1.0	0.1	5
CIFAR-100	10000	100	0.01	32	100	1.0	0.1	150

7.3. State-of-the-Art Baselines

We conduct a comprehensive comparison between LifeHD and state-of-the-art NN-based unsupervised lifelong learning baselines, which continuously train a NN for representation learning. The loss functions in these setups are defined in the feature space without relying on label supervision. During testing, we freeze the neural network and apply K-Means clustering on the testing feature embeddings to generate predicted labels. We set $k$ to 50, the same as the number of cluster HVs in LifeHD. Such a pipeline is widely used for lifelong learning evaluations (Rao et al., 2019; Smith et al., 2021).

Fig. 8 presents a comparison of the pipeline setup using both the baselines and LifeHD on HDnn and non-HDnn frameworks respectively. To ensure fair comparisons, in the HDnn framework on ESC-50 and CIFAR-100, we initialize the NN with the same pretrained weights for LifeHD and the NN baselines. For the NN baselines on MHEALTH, we randomly initialize a one-layer LSTM of 64 units followed by a fully connected layer of 512 units. This architecture has achieved accuracy competitive with Transformer-based designs on MHEALTH (Essa and Abdelmaksoud, 2023).

We compare LifeHD with the following baselines, which include all main lifelong learning techniques:


Fig. 9. (a) The ACC curve of LifeHD vs. state-of-the-art NN baselines on three scenarios. (b) The final results of LifeHD in CIFAR-100 with rotation.

• Finetune is a naïve baseline that optimizes the NN model using the current batch of data without any lifelong learning techniques.
• CaSSLe (Fini et al., 2022) is a distillation-based framework that utilizes self-supervised losses. It leverages distillation between the representations of the current model and a past model. In the original paper, the past model is captured at the end of the previous task and prior to the introduction of a new task. However, since we do not assume awareness of task shifts, we simply freeze the model from the previous batch.
• LUMP (et al, 2022) employs a memory buffer for replay and mitigates catastrophic forgetting by interpolating the current batch with previously stored samples in the memory.
• STAM (Smith et al., 2021) is a brain-inspired expandable memory architecture using online clustering and novelty detection. We exclusively apply STAM to CIFAR-100 due to its demand for intricate dataset-specific tuning (e.g., number of receptive fields), and because the authors only released the implementation for the CIFAR datasets.
• SupHDC (Kim et al., 2018; Hersche et al., 2022) is the fully supervised HDC pipeline.

All baselines are adapted from their original open-source code. For CaSSLe and LUMP, we employ BYOL (Grill et al., 2020) as the self-supervised loss function because it has shown strong empirical performance in lifelong learning tasks compared to other self-supervised learning backbones (Fini et al., 2022). We use a memory buffer size of 256 for LUMP, the same as in the original paper. We employ the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.03 across all methods, training each batch for 10 steps. All experiments are executed for 3 random trials.

7.4. LifeHD Accuracy

Results on Three Application Scenarios. Fig. 9 (a) details the ACC curve of all methods as streaming samples are received. All NN baselines start at higher accuracy, especially in ESC-50 (sounds) and CIFAR-100 (images), owing to the presence of a pretrained NN feature extractor within the HDnn framework. Meanwhile, LifeHD begins with lower accuracy because both the working and long-term memories are empty, and it must learn the cluster HVs and the optimal number of clusters. Notably, as streaming samples come in, all NN baselines experience a decline in ACC, underscoring the inherent challenges of unsupervised lifelong learning with streaming non-iid data and a lack of supervision. This is primarily because finetuning NNs demands extensive iid data and multi-epoch offline training, which is not feasible in our setting. CaSSLe (Fini et al., 2022) suffers from forgetting due to its inability to identify suitable past models from which to distill knowledge. Similarly, LUMP (et al, 2022) exhibits reduced ACC in ESC-50 and CIFAR-100, with only marginal ACC improvement in MHEALTH (time series), suggesting that its memory interpolation strategy may not be universally suitable for all applications. While the memory-based design of STAM (Smith et al., 2021) can mitigate forgetting, its efficacy in distinguishing patterns and acquiring new knowledge remains unsatisfactory. In contrast, LifeHD demonstrates steadily increasing accuracy across all three scenarios, achieving up to 9.4%, 74.8% and 11.8% accuracy increase on MHEALTH (time series), ESC-50 (sound) and CIFAR-100 (images), respectively, compared with the NN-based unsupervised lifelong learning baselines at the end of the stream. Such an outcome can be attributed to HDC’s lightweight but meaningful encoding and the effective memorization design of LifeHD.

Table 4. The gap of ACCs at the end of the stream between LifeHD and Supervised HDC (Kim et al., 2018; Hersche et al., 2022).

Method	MHEALTH	ESC-50	CIFAR-100
LifeHD	0.75	0.92	0.20
Supervised HDC (Kim et al., 2018; Hersche et al., 2022)	0.90	0.95	0.26
Gap	-0.15	-0.03	-0.06

Results under Data Distribution Drift. We further evaluate LifeHD’s performance under drifted data and present the final ACC along with the number of discarded cluster HVs in Fig. 9 (b). Specifically, we introduce gradual rotation to the CIFAR-100 samples within each class, ranging from no rotation to a substantial rotation angle of $80^{\circ}$ . The other parameter settings remain the same as in Table 3. The number of discarded cluster HVs accounts for those that are either forgotten or merged. From Fig. 9 (b), we can observe the remarkable resilience of LifeHD to drifted data, with an ACC loss of less than 2.3% even under a severe rotation of $80^{\circ}$ . This robustness stems from the general and uniform design of LifeHD to accommodate various types of continuously changing data streams. In cases of slight or minimal distribution drift, LifeHD updates existing cluster HVs; in instances of severe drift, new cluster HVs are created and subsequently merged if deemed appropriate. However, due to the finite memory capacities, more cluster HVs are subject to forgetting or merging under larger drifts, as shown in Fig. 9 (b) by the number of discarded cluster HVs, leading to ACC loss. In our experiments, LifeHD demonstrates minimal ACC loss even under substantial rotation shifts.

Comparison with Fully Supervised HDC. Table 4 compares the average ACCs of the supervised HDC method (Kim et al., 2018; Hersche et al., 2022) and LifeHD. Even without any supervision, LifeHD approaches the ACC of supervised HDC with a gap of 15%, 3% and 6% on MHEALTH, ESC-50 and CIFAR-100, respectively. This small ACC gap confirms the effectiveness of LifeHD in separating and memorizing key patterns. To help explain the small ACC loss even without supervision, we visualize the confusion matrix of LifeHD on MHEALTH in Fig. 10. MHEALTH has 12 true classes (y axis), whereas LifeHD maintains 23 cluster HVs in its long-term memory (x axis). ACC is evaluated by mapping the unsupervised cluster HVs to true labels. Although LifeHD cannot achieve precise label matching with true classes, it can preserve the essential patterns by using finer-grained clusters. For example, the green box in Fig. 10 highlights a valid learning outcome, where LifeHD uses predicted cluster HVs No. 0, 9 and 23 to represent a bigger true class of “Lying down”.
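One way to evaluate ACC when there are more clusters than true classes is a many-to-one majority-vote mapping, sketched below. The exact mapping procedure is not specified in this section, so the majority-vote rule and the function name `cluster_accuracy` are assumptions for illustration only.

```python
from collections import Counter

def cluster_accuracy(pred_clusters, true_labels):
    """Map each predicted cluster to its majority true label, then score accuracy."""
    votes = {}
    for c, y in zip(pred_clusters, true_labels):
        votes.setdefault(c, Counter())[y] += 1
    # Several clusters may map to the same true class (many-to-one).
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    correct = sum(mapping[c] == y for c, y in zip(pred_clusters, true_labels))
    return correct / len(true_labels), mapping
```

Under such a mapping, several fine-grained clusters that all correspond to one activity are counted as correct predictions of that activity.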

7.5. Training Latency and Energy

Fig. 11 provides comprehensive latency and energy consumption results to train one batch of samples on all three edge platforms. For CIFAR-100, we use the most lightweight MobileNet version, V3 small (Howard et al., 2019), as HDnn feature extractor and NN baseline, to assess LifeHD’s efficiency gain over the most competitive mobile computing setup. On RPi Zero, we report results for the relatively lightweight NN-based baselines, Finetune and LUMP (et al, 2022), using the smallest dataset, MHEALTH, while running CaSSLe (Fini et al., 2022) on MHEALTH would result in out-of-memory errors. As shown in Fig. 11, LifeHD is up to 23.7x, 36.5x and 22.1x faster to train on RPi Zero, RPi 4B and Jetson TX2, respectively, while being up to 22.5x, 34.3x and 20.8x more energy efficient on each, compared to the NN-based unsupervised lifelong learning baselines. In most settings, CaSSLe (Fini et al., 2022) is the most time-consuming because of the expensive distillation. LUMP (et al, 2022) is slightly more expensive than Finetune due to its replay mechanism. STAM (Smith et al., 2021), implemented only on CPU, incurs the longest training latency on Jetson TX2, as it does not use GPU’s acceleration capabilities. LifeHD is clearly faster and more efficient than all NN-based unsupervised lifelong learning baselines (Fini et al., 2022; et al, 2022; Smith et al., 2021) due to LifeHD’s lightweight nature. The overhead of LifeHD alongside fully supervised HDC, SupHDC (Kim et al., 2018; Hersche et al., 2022), is negligible on more powerful platforms like RPi 4B and Jetson TX2. Notably, in LifeHD, the cluster HV merging step for processing about 40 LTM elements takes 7.4, 0.86 and 0.66 seconds to run on RPi Zero, RPi 4B and Jetson TX2, respectively, which only executes once every $f_{merge}$ batches. Further enhancements can be achieved using the acceleration techniques mentioned in Sec. 5.4.

Fig. 11 indicates LifeHD improves latency and energy efficiency the most on RPi 4B, as compared to RPi Zero and Jetson TX2 that represent more limited or powerful devices. This is because the high-dimensional nature of LifeHD requires a fair amount of memory, thus it cannot run efficiently on the highly restricted RPi Zero. The GPU resources on Jetson TX2 boost the NN-based baselines, narrowing the gap between them and LifeHD. We expect much larger efficiency improvements when LifeHD is accelerated using emerging in-memory computing hardware (Dutta et al., 2022; Xu et al., 2023).

7.6. Memory Usage

Fig. 12 provides a comprehensive summary of peak memory footprint for all methods on MHEALTH and CIFAR-100. We categorize the methods into NN training (Finetune, LUMP (et al, 2022), CaSSLe (Fini et al., 2022), STAM (Smith et al., 2021)) and HDC training (Supervised HDC (Kim et al., 2018; Hersche et al., 2022) and our LifeHD). Following (Kwon et al, 2023), we calculate the peak memory of NN training as the sum of model, optimizer and activation memories, plus additional memory consumption for lifelong learning. Specifically, CaSSLe (Fini et al., 2022) requires additional memory for training a predictor and inference from a frozen model, LUMP (et al, 2022) needs extra memory for replay. For HDC-based methods, each dimension of the cluster HV is represented as a signed integer and stored in a byte. In addition to the working and long-term memories, we also consider the storage of bipolar level and ID hypervectors for encoding, and the frozen MobileNet for HDnn encoding in CIFAR-100. Notice that our focus here is on comparing full-precision memory usage, and optimization techniques like quantization can be applied to all methods in the future.
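As a rough illustration of the HDC storage accounting, the sketch below computes only the bytes needed for the cluster HVs in working and long-term memory at one byte per dimension. It deliberately ignores the per-cluster statistics, the level and ID hypervectors, and the frozen feature extractor, so it is a partial figure and not the peak memory reported in Fig. 12; the function name is hypothetical, and the values of $D$ , $M$ and $L$ are taken from Table 3 and Sec. 7.2.

```python
def cluster_hv_bytes(dim, working_size, long_term_size, bytes_per_dim=1):
    """Bytes for storing all cluster HVs in working (M) and long-term (L) memory."""
    return (working_size + long_term_size) * dim * bytes_per_dim

# MHEALTH: D=1000, M=50, L=50; CIFAR-100: D=10000, M=100, L=50
mhealth_bytes = cluster_hv_bytes(1000, 50, 50)
cifar_bytes = cluster_hv_bytes(10000, 100, 50)
```

This gives 100,000 bytes for MHEALTH and 1,500,000 bytes for CIFAR-100 for the cluster HVs alone.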

The results in Fig. 12 highlight LifeHD’s memory efficiency. LifeHD conserves 80.1%-86.2% and 84.1%-96.0% of memory compared to NN training baselines on MHEALTH (non-HDnn) and CIFAR-100 (HDnn), respectively. This remarkable efficiency stems from LifeHD’s HDC design, which dispenses with the memory-intensive gradient descents in NNs. STAM (Smith et al., 2021), with its hierarchical and expandable memory structure, consumes 6.3x the memory of LifeHD, as it stores raw image patches across all hierarchies. Compared to fully supervised HDC, SupHDC (Kim et al., 2018; Hersche et al., 2022), LifeHD introduces a modest memory increase to accomplish the challenging task of organizing label-free cluster HVs. LifeHD proves advantageous for edge applications with only 103 KB and 2.5 MB of peak memory required for MHEALTH and CIFAR-100.

7.7. Ablation Studies

The design of LifeHD consists of several key elements: the two-tier memory organization, novelty detection and online update, and cluster HV merging that manipulates past patterns. We conduct experiments to assess the contribution of each element. Using the configuration in Table 3, we evaluate the performance of (i) LifeHD without long-term memory, using only a single layer memory, (ii) LifeHD without merging, employing only novelty detection, online update and forgetting, and (iii) complete LifeHD. We present the ACC and the number of cluster HVs in LTM during MHEALTH training in Fig. 13, chosen as a representative scenario. LifeHD without LTM (green dashdot line) forces cluster HV merging to take place in working memory, where the large number of temporary cluster HVs creates less important nodes in the graph and corrupts the graph-based merging process, as shown in Fig. 13 (left). This necessitates the design of the two-tier memory architecture and merging with LTM elements. LifeHD without merging (blue dashed line) consumes 1x more memory in the LTM, making it unsuitable for resource-constrained edge devices. Our design of LifeHD (red solid line) strategically combines similar cluster HVs with minor loss on the clustering quality, achieving ACC similar to those without merging while conserving memory storage.

7.8. Sensitivity Analysis

Fig. 14 summarizes the sensitivity results of key parameters in LifeHD, while the less sensitive ones such as $\alpha$ and $hit_{th}$ are omitted due to space limitations. The default setting is the same as in Table 3.

Working Memory Size. Fig. 14 (a) shows ACCs using working memory sizes of 20, 50, 100 and 200. In general, a larger working memory allows more temporary cluster HVs at the cost of higher memory consumption. $M=100$ produces optimal results, while further increasing the memory size reduces clustering quality. This occurs because excessively large working memory retains outdated prototypes, degrading lifelong learning performance.

Novelty Threshold. In Fig. 14 (b), we present the final ACCs for different novelty detection thresholds ( $\gamma$ ). A lower $\gamma$ results in more frequent novelty detections and increased loads on the working memory, while a higher $\gamma$ may lead to overlooking significant changes. Remarkably, LifeHD demonstrates resilience to variations in $\gamma$ , a phenomenon that we attribute to the combined impact of novelty detection and merging processes.

Merging Sensitivity. Fig. 14 (c) shows ACC using various merging thresholds ( $g_{ub}$ ). $g_{ub}$ determines the number of clusters ( $k$ ) to merge in the cluster HV merging step (Sec. 5.4). A low value for $g_{ub}$ results in overly aggressive merging, leading to the fusion of dissimilar cluster HVs and a degraded ACC. A larger $g_{ub}$ adopts a conservative merging strategy and encourages finer-grained clusters, albeit at the expense of increased resource demands.

Merging Frequency. Fig. 14 (d) shows the final ACCs for different merging frequencies ( $f_{merge}$ ). LifeHD is robust across various $f_{merge}$ values, partly due to the presence of $g_{ub}$ to prevent aggressive merging. Less frequent merging (larger $f_{merge}$ ) raises the risk of forgetting important patterns because of memory constraints. More frequent merging (smaller $f_{merge}$ ) increases the computational burden due to the spectral clustering-based algorithm.

Encoding Level and Flipping Ratio for Spatiotemporal Encoding. Fig. 14 (e) and (f) show the ACCs for various quantization encoding levels ( $Q$ ) and flipping ratios ( $P$ ) during the spatiotemporal encoding. Both parameters are important for preserving the similarity in HD-space after encoding. Optimal $Q$ depends on the sensor sensitivity, with finer-grained sensors requiring more quantization levels. $P$ determines the similarity between adjacent levels of hypervectors. For personal health monitoring, such as MHEALTH, $Q=10,P=0.01$ usually gives the best results.


Fig. 14. Sensitivity analysis of LifeHD: (a) working memory size, (b) novelty threshold, (c) merging sensitivity, (d) merge frequency, (e) encoding level, (f) encoding flipping ratio.

8. Evaluation of LifeHD ${}_{\textrm{semi}}$ and LifeHD ${}_{\textrm{a}}$

In this section, we compare LifeHD ${}_{\textrm{semi}}$ and LifeHD ${}_{\textrm{a}}$ , our proposed extensions of LifeHD, with similar existing designs.

Performance of LifeHD ${}_{\textrm{semi}}$ . To evaluate LifeHD ${}_{\textrm{semi}}$ in a low-label scenario, we compare it with SemiHD (Imani et al., 2019b), which is the state-of-the-art HDC method for semi-supervised learning. We adapt SemiHD (Imani et al., 2019b) for single-pass settings, introducing a pseudolabel assignment threshold. When the cosine similarity of an unlabeled sample to the nearest class hypervector surpasses the threshold, we assign that class as its pseudolabel. The sample is then employed to update the class hypervector in SemiHD. We explore various threshold values and choose the optimal result for comparison. Fig. 15 (a) compares LifeHD ${}_{\textrm{semi}}$ and SemiHD (Imani et al., 2019b) on ESC-50 and CIFAR-100 across various labeling ratios $r<0.01$ . The advantages of LifeHD ${}_{\textrm{semi}}$ are most prominent when labels are limited, which is the weakly supervised scenario that LifeHD ${}_{\textrm{semi}}$ primarily targets. LifeHD ${}_{\textrm{semi}}$ improves ACC by up to 10.25% and 3.6% on ESC-50 and CIFAR-100 respectively. This outcome arises from the unsupervised nature of LifeHD, allowing it to autonomously organize prominent cluster HVs, especially when all samples from a class lack labels. As the labeling ratio increases, LifeHD ${}_{\textrm{semi}}$ ’s advantage over SemiHD diminishes, because more labels bolster SemiHD’s performance.
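The single-pass pseudolabeling used to adapt SemiHD can be sketched as follows. This is an illustrative reading of the description above, not the SemiHD reference code: the function name is hypothetical, and updating the class hypervector by element-wise addition is an assumption.

```python
import numpy as np

def pseudolabel_update(class_hvs, phi_x, threshold):
    """Bundle phi_x into the nearest class HV if its cosine similarity exceeds threshold.

    class_hvs : float array of shape (num_classes, D), updated in place
    Returns the assigned pseudolabel, or None if the sample is skipped.
    """
    norms = np.linalg.norm(class_hvs, axis=1) * np.linalg.norm(phi_x)
    sims = class_hvs @ phi_x / norms
    nearest = int(np.argmax(sims))
    if sims[nearest] > threshold:
        class_hvs[nearest] += phi_x  # assumed update rule: bundling by addition
        return nearest
    return None
```

A sample from a class that has no labeled examples never exceeds the threshold for the right class, which is consistent with the advantage of LifeHD ${}_{\textrm{semi}}$ at very low labeling ratios.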

Performance of LifeHD ${}_{\textrm{a}}$ . LifeHD ${}_{\textrm{a}}$ provides an interface to trade minimal performance loss for efficiency gains by adaptively pruning insignificant dimensions. We compare LifeHD ${}_{\textrm{a}}$ with previous HDC works that employ a fixed mask throughout training (Khaleghi et al., 2020). The results for CIFAR-100, including ACC and training latency per batch on RPi 4B, are presented in Fig. 15 (b). Fixed masks negatively impact HDC learning, especially with smaller dimensions. Such masks fail to adapt to new hypervectors in class-incremental streams, where less significant dimensions may become crucial later in training. LifeHD ${}_{\textrm{a}}$ addresses this issue by adjusting the mask upon novelty detection. Using only 20% of the full HD dimension, it incurs an ACC degradation of only 0.71% while achieving a 4.5x efficiency gain compared to the complete LifeHD. The overhead of adaptively adjusting the mask is negligible when novelty detection occurs infrequently.
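To make the difference between a fixed and an adaptive mask concrete, the sketch below recomputes the set of retained dimensions whenever a novel cluster HV is added. The significance score used here (the sum of absolute values of each dimension across all stored cluster HVs) is an assumed heuristic chosen for illustration, and all names are hypothetical; the text above does not specify how LifeHD ${}_{\textrm{a}}$ ranks dimensions.

```python
import math


def compute_mask(cluster_hvs, keep_ratio):
    """Return sorted indices of the retained dimensions.

    Assumed heuristic: a dimension's significance is the sum of its
    absolute values across all cluster hypervectors.
    """
    dim = len(cluster_hvs[0])
    k = max(1, int(round(keep_ratio * dim)))
    scores = [sum(abs(hv[i]) for hv in cluster_hvs) for i in range(dim)]
    top = sorted(range(dim), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def masked_cosine(a, b, mask):
    """Cosine similarity computed only over the retained dimensions."""
    dot = sum(a[i] * b[i] for i in mask)
    na = math.sqrt(sum(a[i] * a[i] for i in mask))
    nb = math.sqrt(sum(b[i] * b[i] for i in mask))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


# Two existing cluster HVs; keep 40% of the 5 dimensions (2 dimensions).
clusters = [[5, 0, 1, 0, -4], [4, 1, 0, 0, -6]]
fixed_mask = compute_mask(clusters, keep_ratio=0.4)

# A novel cluster arrives whose energy lies in a previously pruned dimension.
clusters.append([0, 0, 0, 12, 0])
adaptive_mask = compute_mask(clusters, keep_ratio=0.4)
```

In this toy example the fixed mask keeps dimensions 0 and 4 and never sees dimension 3, whereas recomputing the mask after the novel cluster arrives retains it. Because the mask is recomputed only on novelty detection, the extra cost is incurred rarely.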

9. Discussions and Future Works

Problem Scale. One limitation of LifeHD is the relatively small problem scale (e.g., the image size of CIFAR-100 is restricted to 32x32), which stems from the inherent difficulty of the unsupervised lifelong learning problem, including single-pass non-iid data and no supervision. For the same reason, there remains a disparity in accuracy between unsupervised lifelong learning and fully supervised NNs, as substantiated by prior research (Smith et al., 2021; Madaan et al., 2022). To scale LifeHD to more challenging applications such as self-driving vehicles, one possible direction is to leverage a pretrained foundation model as a frozen feature extractor in the HDnn framework, which we leave for future investigation.

Hyperparameter Tuning. While we recognize that hyperparameters can influence the performance of LifeHD, this issue is not exclusive to LifeHD and has persistently been a challenge in machine learning research (Bischl et al., 2023). In LifeHD, the impact of hyperparameters can be mitigated through pre-deployment evaluation and component co-design. For example, encoding parameters such as $Q$ and $P$ can be tuned on similar health monitoring data sources prior to deployment. Meanwhile, the cluster HV merging component can increase LifeHD’s resilience to the novelty detection threshold $\gamma$ , since a larger number of novel clusters can be merged in later stages of learning.

Limitations of HDC. HDC serves as the fundamental core of LifeHD. While HDC shows promise with its notable lightweight design, it is burdened by several limitations that remain active areas of research. First, for complex datasets like audio and images, HDC requires a pretrained feature extractor (the HDnn encoding) which may not exist for certain applications. Moreover, akin to any other architecture, HDC vectors face capacity limitations determined by the dimension of HD space, encoding method, and potential noise levels in the input data (Thomas et al., 2021). Due to these factors, careful evaluation and sometimes manual feature engineering are required to successfully deploy HDC for new applications.

Future Works. Although LifeHD focuses on single-device lifelong learning for classification tasks, the method can be extended for other types of tasks and learning settings, such as federated learning and reinforcement learning. We leave the investigation of these topics for future work.

10. Conclusion

The ability to learn continuously and indefinitely in the presence of change, and without access to supervision, on a resource-constrained device is a crucial trait for future sensor systems. In this work, we design and deploy the first end-to-end system named LifeHD to learn continuously from real-world data streams without labels. Our approach is based on Hyperdimensional Computing (HDC), an emerging neurally-inspired paradigm for lightweight edge computing. LifeHD is built on a two-tier memory hierarchy including a working and a long-term memory, with collaborative components of novelty detection, online cluster HV update and cluster HV merging for optimal lifelong learning performance. We further propose two extensions to LifeHD, LifeHD ${}_{\textrm{semi}}$ and LifeHD ${}_{\textrm{a}}$ , to handle scarce labeled samples and power constraints. Practical deployments on typical edge platforms and three IoT scenarios demonstrate LifeHD’s improvement of up to 74.8% on unsupervised clustering accuracy and up to 34.3x on energy efficiency compared to state-of-the-art NN-based unsupervised lifelong learning baselines (Fini et al., 2022; Madaan et al., 2022; Smith et al., 2021).

Acknowledgements.

The authors would like to thank the anonymous shepherd, reviewers, and our colleague Xiyuan Zhang for their valuable feedback. This work was supported in part by National Science Foundation under Grants #2003279, #1826967, #2100237, #2112167, #1911095, #2112665, and in part by PRISM and CoCoSys, centers in JUMP 2.0, an SRC program sponsored by DARPA.

References

jet (2023) 2023. Jetson TX2 Module. https://developer.nvidia.com/embedded/jetson-tx2. [Online].
rpi (2023a) 2023a. Raspberry Pi 4B. https://www.raspberrypi.com/products/raspberry-pi-4-model-b/. [Online].
rpi (2023b) 2023b. Raspberry Pi Zero 2 W. https://www.raspberrypi.com/products/raspberry-pi-zero-2-w/. [Online].
Avarguès-Weber et al. (2012) Aurore Avarguès-Weber et al. 2012. Simultaneous mastering of two abstract concepts by the miniature brain of bees. Proceedings of the National Academy of Sciences 109, 19 (2012), 7481–7486.
Baddeley (1992) Alan Baddeley. 1992. Working memory. Science 255, 5044 (1992), 556–559.
Banos and Saez (2014) Oresti Banos, Rafael Garcia, and Alejandro Saez. 2014. MHEALTH Dataset. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5TW22.
Bischl et al. (2023) Bernd Bischl et al. 2023. Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 13, 2 (2023), e1484.
Bricken et al. (2023) Trenton Bricken et al. 2023. Sparse Distributed Memory is a Continual Learner. In International Conference on Learning Representations.
Cai et al. (2020) Han Cai et al. 2020. Tinytl: Reduce memory, not parameters for efficient on-device learning. Advances in Neural Information Processing Systems 33 (2020), 11285–11297.
Chen et al. (2016) Ning Chen et al. 2016. Smart urban surveillance using fog computing. In 2016 IEEE/ACM Symposium on Edge Computing (SEC). IEEE, 95–96.
Dutta et al. (2022) Arpan Dutta et al. 2022. Hdnn-pim: Efficient in memory design of hyperdimensional computing with feature extraction. In Proceedings of the Great Lakes Symposium on VLSI 2022. 281–286.
Essa and Abdelmaksoud (2023) Ehab Essa and Islam R Abdelmaksoud. 2023. Temporal-channel convolution with self-attention network for human activity recognition using wearable sensors. Knowledge-Based Systems 278 (2023), 110867.
Madaan et al. (2022) Divyam Madaan et al. 2022. Representational Continuity for Unsupervised Continual Learning. In International Conference on Learning Representations.
Fini et al. (2022) Enrico Fini et al. 2022. Self-supervised models are continual learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Gim and Ko (2022) In Gim and JeongGil Ko. 2022. Memory-efficient DNN training on mobile devices. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services. 464–476.
Grill et al. (2020) Jean-Bastien Grill et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33 (2020), 21271–21284.
Halko et al. (2011) Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53, 2 (2011), 217–288.
Hersche et al. (2022) Michael Hersche et al. 2022. Constrained few-shot class-incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9057–9067.
Hioki (2023) Hioki. 2023. Hioki3334 Powermeter. https://www.hioki.com/en/products/detail/?product_key=5812.
Howard et al. (2019) Andrew Howard et al. 2019. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 1314–1324.
Imani et al. (2019a) Mohsen Imani et al. 2019a. Hdcluster: An accurate clustering using brain-inspired high-dimensional computing. In Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1591–1594.
Imani et al. (2019b) Mohsen Imani et al. 2019b. Semihd: Semi-supervised learning using hyperdimensional computing. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–8.
Imani et al. (2017) Mohsen Imani, Deqian Kong, Abbas Rahimi, and Tajana Rosing. 2017. Voicehd: Hyperdimensional computing for efficient speech recognition. In IEEE International Conference on Rebooting Computing (ICRC). IEEE, 1–8.
Kanerva (2009) Pentti Kanerva. 2009. Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation 1 (2009), 139–159.
Khaleghi et al. (2020) Behnam Khaleghi, Mohsen Imani, and Tajana Rosing. 2020. Prive-hd: Privacy-preserved hyperdimensional computing. In ACM/IEEE Design Automation Conference (DAC). IEEE, 1–6.
Kim et al. (2019) Hyeji Kim, Muhammad Umar Karim Khan, and Chong-Min Kyung. 2019. Efficient neural network compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12569–12577.
Kim et al. (2018) Yeseong Kim, Mohsen Imani, and Tajana S Rosing. 2018. Efficient human activity recognition using hyperdimensional computing. In Proceedings of the 8th International Conference on the Internet of Things. 1–6.
Kirkpatrick et al. (2017) James Kirkpatrick et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences (2017).
Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
Kwon et al. (2023) Young D Kwon et al. 2023. LifeLearner: Hardware-Aware Meta Continual Learning System for Embedded Computing Platforms. In Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems.
Lee et al. (2020) Soochan Lee et al. 2020. A Neural Dirichlet Process Mixture Model for Task-Free Continual Learning. In International Conference on Learning Representations.
Lin et al. (2020) Ji Lin et al. 2020. Mcunet: Tiny deep learning on iot devices. Advances in Neural Information Processing Systems 33 (2020), 11711–11722.
Lin et al. (2021) Ji Lin et al. 2021. Memory-efficient patch-based inference for tiny deep learning. Advances in Neural Information Processing Systems 34 (2021), 2346–2358.
Lin et al. (2022) Ji Lin et al. 2022. On-device training under 256kb memory. Advances in Neural Information Processing Systems 35 (2022), 22941–22954.
Lopez-Paz and Ranzato (2017) David Lopez-Paz and Marc’Aurelio Ranzato. 2017. Gradient episodic memory for continual learning. Advances in neural information processing systems 30 (2017).
McCloskey and Cohen (1989) Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. Vol. 24. Elsevier, 109–165.
Mohaimenuzzaman et al. (2023) Md Mohaimenuzzaman et al. 2023. Environmental Sound Classification on the Edge: A Pipeline for Deep Acoustic Networks on Extremely Resource-Constrained Devices. Pattern Recognition 133 (2023), 109025.
Moin et al. (2021) Ali Moin et al. 2021. A wearable biosensing system with in-sensor adaptive machine learning for hand gesture recognition. Nature Electronics 4, 1 (2021), 54–63.
Neill (2020) James O’Neill. 2020. An overview of neural network compression. arXiv preprint arXiv:2006.03669 (2020).
Ng et al. (2001) Andrew Ng, Michael Jordan, and Yair Weiss. 2001. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems 14 (2001).
Osipov et al. (2022) Evgeny Osipov et al. 2022. Hyperseed: Unsupervised learning with vector symbolic architectures. IEEE Transactions on Neural Networks and Learning Systems (2022).
Parisi et al. (2019) German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural networks 113 (2019), 54–71.
Paszke et al. (2019) Adam Paszke et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
Piczak (2015) Karol J Piczak. 2015. ESC: Dataset for environmental sound classification. In Proceedings of the 23rd ACM international conference on Multimedia. 1015–1018.
Profentzas et al. (2022) Christos Profentzas, Magnus Almgren, and Olaf Landsiedel. 2022. MiniLearn: On-Device Learning for Low-Power IoT Devices. In International Conference on Embedded Wireless Systems and Networks.
Rao et al. (2019) Dushyant Rao, Francesco Visin, Andrei Rusu, Razvan Pascanu, Yee Whye Teh, and Raia Hadsell. 2019. Continual unsupervised representation learning. Advances in neural information processing systems 32 (2019).
Ren et al. (2021) Haoyu Ren, Darko Anicic, and Thomas A Runkler. 2021. Tinyol: Tinyml with online-learning on microcontrollers. In 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
Russakovsky et al. (2015) Olga Russakovsky et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115 (2015), 211–252.
Rusu et al. (2016) Andrei A Rusu et al. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671 (2016).
Saha et al. (2023) Swapnil Sayan Saha et al. 2023. TinyNS: Platform-Aware Neurosymbolic Auto Tiny Machine Learning. ACM Transactions on Embedded Computing Systems (2023).
Sandler et al. (2018) Mark Sandler et al. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4510–4520.
Shen et al. (2021) Yang Shen, Sanjoy Dasgupta, and Saket Navlakha. 2021. Algorithmic insights on continual learning from fruit flies. arXiv preprint arXiv:2107.07617 (2021).
Shunhou and Peng (2022) Shun Shunhou and Yang Peng. 2022. AIoT on Cloud. In Digital Transformation in Cloud Computing. CRC Press, 629–732.
Smith et al. (2021) James Smith et al. 2021. Unsupervised Progressive Learning and the STAM Architecture. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21. 2979–2987.
Sun et al. (2020) Ke Sun, Chen Chen, and Xinyu Zhang. 2020. “Alexa, stop spying on me!”: Speech privacy protection against voice assistants. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems. 298–311.
Thomas et al. (2021) Anthony Thomas, Sanjoy Dasgupta, and Tajana Rosing. 2021. A theoretical perspective on hyperdimensional computing. Journal of Artificial Intelligence Research 72 (2021), 215–249.
Tiezzi et al. (2022) Matteo Tiezzi et al. 2022. Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22. 3480–3486.
Tiwari et al. (2022) Rishabh Tiwari et al. 2022. Gcr: Gradient coreset based replay buffer selection for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 99–108.
Von Luxburg (2007) Ulrike Von Luxburg. 2007. A tutorial on spectral clustering. Statistics and computing 17 (2007), 395–416.
Wang et al. (2019) Erwei Wang et al. 2019. Deep neural network approximation for custom hardware: Where we’ve been, where we’re going. ACM Computing Surveys (CSUR) 52, 2 (2019), 1–39.
Wang et al. (2022) Qipeng Wang et al. 2022. Melon: Breaking the memory wall for resource-efficient on-device machine learning. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services. 450–463.
Weiss et al. (2016) Gary M Weiss et al. 2016. Smartwatch-based activity recognition: A machine learning approach. In 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 426–429.
Xie et al. (2016) Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning. PMLR, 478–487.
Xu et al. (2022) Daliang Xu et al. 2022. Mandheling: Mixed-precision on-device dnn training with dsp offloading. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking. 214–227.
Xu et al. (2023) Weihong Xu, Jaeyoung Kang, and Tajana Rosing. 2023. FSL-HD: Accelerating Few-Shot Learning on ReRAM using Hyperdimensional Computing. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 1–6.
Zhang et al. (2020a) Junting Zhang et al. 2020a. Class-incremental learning via deep model consolidation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1131–1140.
Zhang et al. (2020b) Yu Zhang, Tao Gu, and Xi Zhang. 2020b. MDLdroidLite: A release-and-inhibit control approach to resource-efficient deep neural networks on mobile devices. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems. 463–475.



[Figure panels: (a) Latency on RPi Zero, (b) Latency on RPi 4B, (c) Latency on Jetson TX2, (d) Energy on RPi Zero, (e) Energy on RPi 4B, (f) Energy on Jetson TX2.]