Bayesian compression for dynamically expandable networks
Introduction
Continual learning, also called lifelong learning or incremental learning, is an important topic in transfer learning in which tasks arrive in sequence. Its primary goal is to perform well on the entire set of tasks incrementally, without revisiting all previous data at each stage. This matters because real-world tasks continually evolve over time, and the size of modern datasets often prohibits frequent batch retraining. To tackle this problem, we exploit the power of deep neural networks for continual learning. Fortunately, storing and transferring knowledge can be done in a straightforward manner through the learned network weights; by sharing these weights, a balance can be struck between adapting to new tasks and retaining knowledge from existing ones.
Viewing continual learning as a special case of online or incremental learning of deep neural networks, there are multiple ways to perform such incremental learning [1], [2]. The simplest is to fine-tune the network to each new task by continuing to train it on the new training data. However, such naive retraining can lead to the catastrophic forgetting problem [3], [4], [5] and an inability to adapt to new tasks. For example, if the previous tasks classify images of animals and the new task classifies images of cars, the features learned on the previous tasks may not be useful for the new one. Conversely, the representations retrained for the new task could adversely affect the old tasks, since they may have drifted from their original meanings and no longer suit them.
To ensure that the knowledge shared through the network benefits all tasks in the online or incremental learning of a deep neural network, the methods of [6], [7], [8] prevent drastic changes in parameters that have a large influence on prediction, while allowing the remaining parameters to change more freely. Nguyen et al. [9] merge online variational inference (VI) [10], [11], [12] with Monte Carlo VI for neural networks [13] to obtain variational continual learning (VCL), and extend VCL with a small episodic memory by combining VI with coreset data summarization [14], [15]. Yoon et al. [16] propose Dynamically Expandable Networks (DEN), which dynamically increase network capacity by adding or duplicating neurons when necessary; DEN can thus maximally reuse the network learned on all previous tasks to efficiently learn to predict for the new task in the continual learning scenario.
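The parameter-protection idea behind [6], [7], [8] can be illustrated with a quadratic penalty that makes important parameters expensive to move. This is a minimal sketch of the general importance-weighted regularizer (as in EWC-style methods), not the exact loss of any one cited paper; the importance weights here are invented for illustration.

```python
import numpy as np

def importance_penalty(theta, theta_old, importance, lam=1.0):
    """Quadratic penalty discouraging changes to parameters deemed
    important for previous tasks (e.g. via a Fisher-diagonal estimate).

    theta      : current parameters
    theta_old  : parameters learned on the previous tasks
    importance : per-parameter importance weights
    lam        : regularization strength
    """
    return 0.5 * lam * np.sum(importance * (theta - theta_old) ** 2)

theta_old = np.array([1.0, -2.0, 0.5])
theta     = np.array([1.5, -2.0, 0.0])
imp       = np.array([4.0, 1.0, 0.1])   # first parameter matters most
penalty = importance_penalty(theta, theta_old, imp)
```

Moving the heavily weighted first parameter dominates the penalty, while the loosely constrained third parameter can drift almost freely.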
While deep neural networks are a widely popular tool for continual learning, they often have more parameters than training instances. As a result, running them on hardware-limited devices remains difficult in many real-world scenarios, and compression and efficiency have become topics of interest in the deep learning community. A variety of approaches address this problem setting, but most share the strategy of reducing the network structure, justified by the finding that neural networks suffer from significant parameter redundancy [17]. Methods in this line of thought include network pruning, where unnecessary connections are removed [18], [19], [20], and student-teacher learning, where a large network is used to train a significantly smaller one [21], [22].
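As a point of reference for the pruning line of work, the simplest variant keeps only the largest-magnitude weights and zeroes the rest. This sketch is a generic magnitude-pruning baseline, not the method of any cited paper:

```python
import numpy as np

def magnitude_prune(w, keep_ratio=0.5):
    """Zero out all but the largest-magnitude fraction of weights,
    a minimal version of the connection-pruning idea."""
    k = int(np.ceil(keep_ratio * w.size))           # number of weights to keep
    thresh = np.sort(np.abs(w).ravel())[-k]         # k-th largest magnitude
    mask = np.abs(w) >= thresh                      # survivors
    return w * mask, mask

w = np.array([0.05, -0.8, 0.3, -0.02, 1.2, 0.1])
pruned, mask = magnitude_prune(w, keep_ratio=0.5)
```

Note that this prunes individual connections; the approach developed in this paper instead removes whole neurons, which yields structured sparsity that is easier to exploit on hardware.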
In this paper, we develop Bayesian Compression [23] for Dynamically Expandable Networks (BCDEN), which obtains a pruned network structure with preserved accuracy for each observed task in the continual learning scenario. We use a variational Bayesian approximation for the model parameters. By placing sparsity-inducing priors on hidden units rather than on individual weights, we can prune entire neurons together with all their ingoing and outgoing weights. This avoids the more complicated and inefficient coding schemes needed for pruning or vector-quantizing individual weights. We train the network on the first task with sparsity-inducing priors that promote group sparsity in the weights, so that fewer neurons remain connected at each hidden layer. When a new task arrives, BCDEN first performs selective retraining for it. If the new task is highly relevant to the old ones, selective retraining alone is sufficient. If, however, the old network cannot sufficiently explain the new task, BCDEN dynamically adds the necessary number of neurons, again under sparsity-inducing priors, to increase the network capacity. When the network shows degraded performance on earlier tasks, BCDEN performs network duplication to prevent such semantic drift or catastrophic forgetting.
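The per-task control flow just described (selective retraining, then expansion, then duplication) can be sketched as follows. The class, method names, and thresholds `tau` and `sigma` are all hypothetical stand-ins for illustration, not the paper's actual implementation:

```python
class ToyNet:
    """Minimal stand-in for the network; the methods only record
    which BCDEN-style step was taken."""
    def __init__(self):
        self.units = 10
        self.log = []
    def selective_retrain(self, task):
        self.log.append("retrain")
        return task["loss"]              # pretend this is the retraining loss
    def expand(self, n=2):
        self.units += n                  # add units under sparsity priors
        self.log.append("expand")
    def drift(self):
        return 0.0                       # no drift in this toy example
    def duplicate(self):
        self.log.append("duplicate")     # copy and timestamp drifted units

def continual_update(net, task, tau=0.1, sigma=0.05):
    """Hypothetical sketch of the per-task decision flow."""
    loss = net.selective_retrain(task)   # 1. selective retraining
    if loss > tau:                       # 2. old net can't explain new task
        net.expand()                     #    -> expand capacity
    if net.drift() > sigma:              # 3. earlier tasks degrade
        net.duplicate()                  #    -> duplicate to stop drift

net = ToyNet()
continual_update(net, {"loss": 0.5})     # hard task: triggers expansion
```

Because expansion happens under sparsity-inducing priors, only the units the new task actually needs survive pruning, keeping the model compact over a long task sequence.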
Our main contributions are: (1) a compact model structure that preserves accuracy via sparsity-inducing priors, yielding fewer neurons at each hidden layer and hence fewer parameters; (2) dynamic expansion of network capacity with only the necessary number of neurons, achieved by placing sparsity-inducing priors on the added neurons; (3) a variational Bayesian approximation of the model parameters that captures parameter uncertainty.
Section snippets
Related work
The problem of alleviating semantic drift or catastrophic forgetting has been addressed in many previous studies. Goodfellow et al. [5] simply regularize the model parameters at each step, using cross-validation to choose the regularization strength, but this approach is often coarse and still leads to catastrophic forgetting. In [6], the predictions of the previous task’s network and the current network are encouraged to be similar when applied to data from the new task by using a form of
Preliminaries
Below we introduce several preliminaries of our algorithm: Bayesian inference, the stochastic variational inference framework, and reparametrized variational dropout, a method for group sparsity.
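In (sparse) variational dropout, the learned per-group noise ratio log(α) = log(σ²/μ²) serves as a pruning signal: groups whose scale is dominated by noise have large log(α) and can be removed. This is a minimal numerical sketch of that criterion; the numbers and the cut-off of 3 are illustrative (the latter is a common heuristic in the sparse variational dropout literature):

```python
import numpy as np

def group_log_alpha(mu, log_sigma2, eps=1e-8):
    """Dropout-rate proxy log(alpha) = log(sigma^2 / mu^2) per group;
    large values mark groups dominated by noise, hence prunable."""
    return log_sigma2 - np.log(mu ** 2 + eps)

mu         = np.array([1.0, 0.01])      # posterior means per group
log_sigma2 = np.array([-2.0, -2.0])     # posterior log-variances per group
log_alpha  = group_log_alpha(mu, log_sigma2)
prunable   = log_alpha > 3.0            # heuristic pruning threshold
```

The second group has a mean that is tiny relative to its noise, so its log(α) is large and the whole group (neuron) can be pruned with all its weights.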
Bayesian compression for a continual learning network
In this section, we consider the compression and efficiency of a deep neural network in the continual learning scenario, where an unknown number of tasks, with unknown training-data distributions, arrive at the model in sequence. Specifically, our goal is to learn pruned models for an unbounded sequence of tasks, where the task at time point t arrives with its training data D_t. Note that each task can be either a single task or composed of a set of subtasks. Though
Experiments
The experiments evaluate the performance and compression capability of our BCDEN on four different datasets. The groups of parameters in our model were constructed by coupling the scale variables for each input neuron in the fully connected layers, or for each filter in the convolutional layers. The pruning threshold can easily be determined by manual inspection, as these neurons or filters usually separate into two well-distinguished components, signal and noise,
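When the scale variables do split into two well-separated clusters, the manual inspection described above can be mimicked by placing the threshold in the largest gap between sorted scales. This is an illustrative stand-in with invented numbers, not the paper's procedure:

```python
import numpy as np

def split_threshold(scales):
    """Place a pruning threshold in the widest gap between sorted scale
    values, assuming the scales form two well-separated clusters
    (noise near zero, signal well above it)."""
    s = np.sort(scales)
    gaps = np.diff(s)                    # gap between consecutive scales
    i = np.argmax(gaps)                  # widest gap separates the clusters
    return 0.5 * (s[i] + s[i + 1])       # midpoint of that gap

scales = np.array([1e-4, 2e-4, 5e-5, 0.8, 1.1, 0.9])  # 3 noise, 3 signal
t = split_threshold(scales)
keep = scales > t                         # neurons/filters that survive
```

With a clear bimodal separation the exact threshold barely matters, which is why manual inspection suffices in practice.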
Conclusion
We proposed Bayesian Compression for Dynamically Expandable Network (BCDEN), which can learn a pruned network structure with preserved accuracy in the continual learning scenarios. BCDEN performs selective retraining, dynamically expands network capacity with only the necessary number of units, and effectively prevents semantic drift by duplicating units and timestamping them from a Bayesian point of view. We validate our method on multiple classification datasets under continual learning
Declaration of Competing Interest
We have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
Bo Chen acknowledges the support of the National Natural Science Foundation of China under Grant 61771361, The Youth Innovation Team of Shaanxi Universities, 111 Project under Grant B18039, and the Program for Oversea Talent by Chinese Central Government. Hongwei Liu acknowledges the support of the NSFC for Distinguished Young Scholars under Grant 61525105 and Shaanxi Innovation Team Project.
Yang Yang received the BEng degree in electronic information engineering from Henan Polytechnic University in 2012. Currently, he is working towards his Ph.D. degree at National Lab of Radar Signal Processing, Xidian University. His research interests include machine learning, Bayesian statistical modeling, and statistical signal processing.
References (44)
- et al., Catastrophic interference in connectionist networks: the sequential learning problem, Psychology of Learning and Motivation (1989)
- et al., Group sparse regularization for deep neural networks, Neurocomputing (2017)
- et al., Online incremental feature learning with denoising autoencoders, Artificial Intelligence and Statistics (2012)
- A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, ...
- et al., Connectionist models of recognition memory: constraints imposed by learning and forgetting functions, Psychol. Rev. (1990)
- et al., An empirical investigation of catastrophic forgetting in gradient-based neural networks, Comput. Sci. (2013)
- et al., Learning without forgetting, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- et al., Overcoming catastrophic forgetting in neural networks, Proc. Natl. Acad. Sci. (2017)
- et al., Continual learning through synaptic intelligence, Proceedings of the 34th International Conference on Machine Learning (2017)
- et al., Variational continual learning, International Conference on Learning Representations (2018)
- Online variational Bayesian learning, NIPS Workshop on Online Learning
- Online model selection based on the variational Bayes, Neural Comput.
- Streaming variational Bayes, Advances in Neural Information Processing Systems
- Weight uncertainty in neural networks, International Conference on Machine Learning
- Coresets for nonparametric estimation: the case of DP-means, International Conference on Machine Learning
- Coresets for scalable Bayesian logistic regression, Advances in Neural Information Processing Systems
- Lifelong learning with dynamically expandable networks, International Conference on Learning Representations
- Predicting parameters in deep learning, Advances in Neural Information Processing Systems
- Optimal brain damage, Advances in Neural Information Processing Systems
- Learning both weights and connections for efficient neural networks, Advances in Neural Information Processing Systems
- Dynamic network surgery for efficient DNNs, Advances in Neural Information Processing Systems
- Do deep nets really need to be deep?, Advances in Neural Information Processing Systems
Bo Chen received the B.S., M.S., and Ph.D. degrees from Xidian University, Xi’an, China, in 2003, 2006, and 2008, respectively, all in electronic engineering. He was a Post-Doctoral Fellow, a Research Scientist, and a Senior Research Scientist with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA, from 2008 to 2012. Since 2013, he has been a Professor with the National Laboratory for Radar Signal Processing, Xidian University. He received an Honorable Mention for the 2010 National Excellent Doctoral Dissertation Award and was selected into the Thousand Youth Talents Program in 2014. His current research interests include statistical machine learning, statistical signal processing, and radar automatic target detection and recognition.
Hongwei Liu received the M.S. and Ph.D. degrees in electronic engineering from Xidian University in 1995 and 1999, respectively. He worked at the National Laboratory of Radar Signal Processing, Xidian University, Xi’an. From 2001 to 2002, he was a Visiting Scholar at the Department of Electrical and Computer Engineering, Duke University, Durham, NC. He is currently a Professor and the director of the National Laboratory of Radar Signal Processing, Xidian University, Xi’an. His research interests are radar automatic target recognition, radar signal processing, adaptive signal processing, and cognitive radar.