
Pattern Recognition

Volume 122, February 2022, 108260

Bayesian compression for dynamically expandable networks

https://doi.org/10.1016/j.patcog.2021.108260

Highlights

  • A compact model structure that preserves accuracy via sparsity-inducing priors, leading to fewer neurons at each hidden layer of the network and, equivalently, fewer parameters.

  • Dynamic expansion of network capacity with only the necessary number of neurons, achieved by placing sparsity-inducing priors on the added neurons so that capacity grows only when needed.

  • A variational Bayesian approximation of the model parameters that captures parameter uncertainty.

Abstract

This paper develops Bayesian Compression for Dynamically Expandable Network (BCDEN), which learns a compact model structure while preserving accuracy in continual learning scenarios. The Dynamically Expandable Network (DEN) is trained efficiently by performing selective retraining, dynamically expands network capacity with only the necessary number of units, and effectively prevents semantic drift by duplicating and timestamping units in an online manner. Whereas the conventional DEN provides only point estimates, we perform Bayesian inference within a principled framework. We validate BCDEN on multiple public datasets under the continual learning setting, where it outperforms existing continual learning methods on a variety of tasks and achieves state-of-the-art compression results while maintaining comparable performance.

Introduction

Continual learning, also called lifelong learning or incremental learning, is an important topic in transfer learning in which tasks arrive in sequence. The primary goal of continual learning is to perform well on the entire set of tasks in an incremental way that avoids revisiting all previous data at each stage. This matters because real-world tasks continually evolve over time and the size of datasets often prohibits frequent batch updating, both key problems in machine learning. To tackle this problem, we exploit the power of deep neural networks for continual learning. Fortunately, storing and transferring knowledge can be done in a straightforward manner through the learned deep network weights. By sharing these learned weights, a balance can be struck between adapting to new tasks and retaining knowledge from existing tasks.

Viewing continual learning as a special case of online or incremental learning for deep neural networks, such incremental learning can be performed in multiple ways [1], [2]. The simplest is to incrementally fine-tune the network to new tasks by continuing to train it with new training data. However, such simple retraining can lead to the catastrophic forgetting problem [3], [4], [5] and an inability to adapt to new tasks. For example, if the previous tasks are classifying images of animals and the new task is to classify images of cars, the features learned on the previous tasks may not be useful for the new one. Conversely, the representations retrained for the new task could adversely affect the old tasks, as they may have drifted from their original meanings and are no longer suitable for them.

To ensure that the knowledge shared through the network benefits all tasks in the online or incremental learning of a deep neural network, [6], [7], [8] prevent drastic changes in the parameters that have a large influence on prediction, while allowing other parameters to change more freely. Nguyen et al. [9] merge online variational inference (VI) [10], [11], [12] with Monte Carlo VI for neural networks [13] to yield variational continual learning (VCL), and extend VCL to include a small episodic memory by combining VI with coreset data summarization [14], [15]. Yoon et al. [16] propose Dynamically Expandable Networks (DEN), which dynamically increase network capacity by adding in or duplicating neurons when necessary; DEN can thus maximally utilize the network learned on all previous tasks to efficiently learn to predict for the new task in the continual learning scenario.

While deep neural networks are a widely popular tool for continual learning, they often have more parameters than training instances. As a result, running them on hardware-limited devices remains difficult in many real-world scenarios. Hence, compression and efficiency have become topics of interest in the deep learning community. There are a variety of approaches to this problem, but most share the strategy of reducing the neural network structure; a justification is the finding that neural networks suffer from significant parameter redundancy [17]. Methods in this line of work include network pruning, where unnecessary connections are removed [18], [19], [20], and student-teacher learning, where a large network is used to train a significantly smaller network [21], [22].

In this paper, we develop Bayesian Compression [23] for Dynamically Expandable Networks (BCDEN), which obtains a pruned network structure with preserved accuracy for each observed task in the continual learning scenario. We use a variational Bayesian approximation for the model parameters. By placing sparsity-inducing priors on hidden units rather than on individual weights, we can prune neurons together with all their incoming and outgoing weights; this avoids the more complicated and inefficient coding schemes needed for pruning or vector-quantizing individual weights. We train the network from task t=1 with sparsity-inducing priors that promote group sparsity in the weights, so that fewer neurons remain connected at each hidden layer. When a new task t arrives, BCDEN first performs selective retraining for the new task. If the new task is highly relevant to the old ones, selective retraining alone is sufficient. If, however, the old network cannot sufficiently explain the new task, BCDEN dynamically adds in the necessary number of neurons, again equipped with sparsity-inducing priors, to increase the network capacity. Finally, when the network shows degraded performance on earlier tasks, BCDEN performs network duplication to prevent such semantic drift or catastrophic forgetting.
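As a rough illustration of this per-task control flow (a sketch, not the authors' implementation), the Python fragment below outlines the three decision points; the callables retrain, expand, drift, and duplicate as well as the thresholds tau_loss and tau_drift are hypothetical placeholders supplied by the caller.

    def bcden_learn_task(net, task_data, retrain, expand, drift, duplicate,
                         tau_loss=0.1, tau_drift=0.02):
        """One step of the continual-learning loop sketched above.

        `retrain`, `expand`, `drift`, and `duplicate` are caller-supplied
        callables (hypothetical names); `tau_loss` and `tau_drift` are
        illustrative thresholds, not values from the paper.
        """
        # 1. Selective retraining of the sub-network relevant to the new task.
        loss = retrain(net, task_data)

        # 2. Dynamic expansion: if the old network cannot explain the new
        #    task, add neurons under sparsity-inducing priors so that only
        #    the necessary capacity survives training.
        if loss > tau_loss:
            expand(net, task_data)

        # 3. Duplication: if retraining has shifted units that earlier tasks
        #    rely on, copy and timestamp them to prevent semantic drift.
        if drift(net) > tau_drift:
            duplicate(net)

        return net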

Our main contributions in this paper are: (1) a compact model structure that preserves accuracy via sparsity-inducing priors, leading to fewer neurons at each hidden layer of the network and, equivalently, fewer parameters; (2) dynamic expansion of network capacity with only the necessary number of neurons, achieved by placing sparsity-inducing priors on the added neurons so that capacity grows only when needed; (3) a variational Bayesian approximation of the model parameters that captures parameter uncertainty.

Section snippets

Related work

The problem of alleviating semantic drift or catastrophic forgetting has been addressed in many previous studies. Goodfellow et al. [5] simply regularize the model parameters at each step with a regularization parameter chosen by cross-validation, but this approach is often coarse and still leads to catastrophic forgetting. In [6], the predictions of the previous task’s network and the current network are encouraged to be similar when applied to data from the new task by using a form of

Preliminaries

Below we introduce several preliminaries for our algorithm: Bayesian inference, the stochastic variational inference framework, and reparametrized variational dropout, a method for inducing group sparsity.
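To make the reparametrized variational dropout idea concrete, the PyTorch sketch below shows a linear layer with one multiplicative scale per input neuron, in the spirit of Bayesian compression [23]. The class name, initialization constants, and the Gaussian posterior q(z_i) = N(mu_i, alpha_i * mu_i^2) are illustrative assumptions, not necessarily the paper's exact parameterization.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GroupVariationalDropoutLinear(nn.Module):
        """Linear layer whose inputs are gated by per-neuron scales z_i.

        Sketch only: q(z_i) = N(mu_i, alpha_i * mu_i^2), sampled with the
        reparameterization trick so gradients flow into mu and log_alpha.
        A large alpha_i marks neuron i (and all its weights) as prunable.
        """

        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
            self.bias = nn.Parameter(torch.zeros(out_features))
            self.z_mu = nn.Parameter(torch.ones(in_features))                  # scale means
            self.z_log_alpha = nn.Parameter(torch.full((in_features,), -3.0))  # log variance ratios

        def forward(self, x):
            if self.training:
                eps = torch.randn_like(self.z_mu)
                z = self.z_mu * (1.0 + torch.exp(0.5 * self.z_log_alpha) * eps)
            else:
                z = self.z_mu  # posterior mean at test time
            return F.linear(x * z, self.weight, self.bias)

During training, the stochastic variational objective would add a KL term that drives many log_alpha values up, after which the corresponding neurons can be removed together with their incoming and outgoing weights.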

Bayesian compression for a continual learning network

In this section, we consider the compression and efficiency of a deep neural network under the continual learning scenario, where an unknown number of tasks, each with an unknown training-data distribution, arrive at the model in sequence. Specifically, our goal is to learn pruned models for a sequence of $T$ tasks, $t = 1, \ldots, t, \ldots, T$ with unbounded $T$, where the task at time point $t$ comes with training data $\mathcal{D}_t = \{x_n, y_n\}_{n=1}^{N_t}$. Note that each task $t$ can be either a single task or comprised of a set of subtasks. Though
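For reference, one standard form of the per-task variational objective consistent with this setup is sketched below; it assumes a factorized Gaussian approximation $q_\phi(W)$ fitted by stochastic variational inference and a group sparsity-inducing prior $p(W)$ with one group per hidden neuron, and the paper's exact objective may differ:

$$
\mathcal{L}_t(\phi) \;=\; \mathbb{E}_{q_\phi(W)}\!\Big[-\sum_{n=1}^{N_t} \log p\big(y_n \mid x_n, W\big)\Big] \;+\; \mathrm{KL}\big(q_\phi(W)\,\|\,p(W)\big),
$$

minimized over the variational parameters $\phi$ via the reparameterization trick.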

Experiments

The experiments evaluate the performance and compression capability of BCDEN on four different datasets. The groups of parameters in our model were constructed by coupling the scale variables for each input neuron in the fully connected layers and for each filter in the convolutional layers. Determining the threshold for pruning can easily be done by manual inspection, since these neurons or filters usually separate into two well-distinguished components, namely signal and noise,
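A minimal sketch of such a thresholding rule, assuming one variational dropout rate alpha_i per neuron or filter (the function name and the threshold value are illustrative, not taken from the paper):

    import numpy as np

    def keep_mask(log_alpha, threshold=3.0):
        """Keep unit i only if log(alpha_i) stays below the threshold.

        `log_alpha` holds one value per input neuron (fully connected layers)
        or per filter (convolutional layers); alpha_i = sigma_i^2 / mu_i^2.
        In practice the two modes ("signal" vs. "noise") are well separated,
        so the cut can also be placed by inspecting a histogram of log_alpha.
        """
        return np.asarray(log_alpha) < threshold

    # Example: mask = keep_mask(layer_log_alpha); rows/columns of the adjacent
    # weight matrices whose mask entry is False are removed to obtain the
    # compressed architecture.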

Conclusion

We proposed Bayesian Compression for Dynamically Expandable Network (BCDEN), which learns a pruned network structure with preserved accuracy in continual learning scenarios. BCDEN performs selective retraining, dynamically expands network capacity with only the necessary number of units, and effectively prevents semantic drift by duplicating and timestamping units from a Bayesian point of view. We validate our method on multiple classification datasets under continual learning

Declaration of Competing Interest

We have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Bo Chen acknowledges the support of the National Natural Science Foundation of China under Grant 61771361, The Youth Innovation Team of Shaanxi Universities, 111 Project under Grant B18039, and the Program for Oversea Talent by Chinese Central Government. Hongwei Liu acknowledges the support of the NSFC for Distinguished Young Scholars under Grant 61525105 and Shaanxi Innovation Team Project.

References (44)

  • M. McCloskey et al., Catastrophic interference in connectionist networks: the sequential learning problem, Psychology of Learning and Motivation (1989)
  • S. Scardapane et al., Group sparse regularization for deep neural networks, Neurocomputing (2017)
  • G. Zhou et al., Online incremental feature learning with denoising autoencoders, Artificial Intelligence and Statistics (2012)
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, ...
  • R. Ratcliff, Connectionist models of recognition memory: constraints imposed by learning and forgetting functions, Psychol. Rev. (1990)
  • I.J. Goodfellow et al., An empirical investigation of catastrophic forgetting in gradient-based neural networks, Comput. Sci. (2013)
  • Z. Li et al., Learning without forgetting, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
  • J. Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks, Proc. Natl. Acad. Sci. (2017)
  • F. Zenke et al., Continual learning through synaptic intelligence, Proceedings of the 34th International Conference on Machine Learning (2017)
  • C.V. Nguyen et al., Variational continual learning, International Conference on Learning Representations (2018)
  • Z. Ghahramani et al., Online variational Bayesian learning, NIPS Workshop on Online Learning (2000)
  • M.-A. Sato, Online model selection based on the variational Bayes, Neural Comput. (2001)
  • T. Broderick et al., Streaming variational Bayes, Advances in Neural Information Processing Systems (2013)
  • C. Blundell et al., Weight uncertainty in neural networks, International Conference on Machine Learning (2015)
  • O. Bachem et al., Coresets for nonparametric estimation - the case of DP-means, International Conference on Machine Learning (2015)
  • J. Huggins et al., Coresets for scalable Bayesian logistic regression, Advances in Neural Information Processing Systems (2016)
  • J. Yoon et al., Lifelong learning with dynamically expandable networks, International Conference on Learning Representations (2018)
  • M. Denil et al., Predicting parameters in deep learning, Advances in Neural Information Processing Systems (2013)
  • Y. LeCun et al., Optimal brain damage, Advances in Neural Information Processing Systems (1990)
  • S. Han et al., Learning both weights and connections for efficient neural network, Advances in Neural Information Processing Systems (2015)
  • Y. Guo et al., Dynamic network surgery for efficient DNNs, Advances in Neural Information Processing Systems (2016)
  • J. Ba et al., Do deep nets really need to be deep?, Advances in Neural Information Processing Systems (2014)

    Yang Yang received the BEng degree in electronic information engineering from Henan Polytechnic University in 2012. Currently, he is working towards his Ph.D. degree at National Lab of Radar Signal Processing, Xidian University. His research interests include machine learning, Bayesian statistical modeling, and statistical signal processing.

    Bo Chen received the B.S., M.S., and Ph.D. degrees from Xidian University, Xi’an, China, in 2003, 2006, and 2008, respectively, all in electronic engineering. He became a Post-Doctoral Fellow, a Research Scientist, and a Senior Research Scientist with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA, from 2008 to 2012. From 2013, he has been a Professor with the National Laboratory for Radar Signal Processing, Xidian University. He received the Honorable Mention for 2010 National Excellent Doctoral Dissertation Award and is selected into Thousand Youth Talents Program in 2014. His current research interests include statistical machine learning, statistical signal processing and radar automatic target detection and recognition.

    Hongwei Liu received the M.S. and Ph.D. degrees in electronic engineering from Xidian University in 1995 and 1999, respectively. He worked at the National Laboratory of Radar Signal Processing, Xidian University, Xi’an. From 2001 to 2002, he was a Visiting Scholar at the Department of Electrical and Computer Engineering, Duke University, Durham, NC. He is currently a Professor and the director of the National Laboratory of Radar Signal Processing, Xidian University, Xi’an. His research interests are radar automatic target recognition, radar signal processing, adaptive signal processing, and cognitive radar.
