Bayesian compression for dynamically expandable networks
Introduction
Continual learning, also called lifelong learning or incremental learning, is an important topic in transfer learning in which tasks arrive in sequence. Its primary goal is to perform well on the entire set of tasks incrementally, without revisiting all previous data at each stage. This matters because real-world tasks continually evolve over time, and the size of modern datasets often prohibits frequent batch retraining. To tackle this problem, we exploit the power of deep neural networks for continual learning. Fortunately, storing and transferring knowledge can be done in a straightforward manner through the learned network weights; by sharing these weights, a balance can be struck between adapting to new tasks and retaining knowledge from existing ones.
Viewing continual learning as a special case of online or incremental learning of deep neural networks, there are multiple ways to perform such incremental learning [1], [2]. The simplest is to fine-tune the network to each new task by continuing to train it on the new training data. However, such naive retraining can lead to the catastrophic forgetting problem [3], [4], [5] and an inability to adapt to new tasks. For example, if the previous tasks classify images of animals and the new task classifies images of cars, the features learned on the previous tasks may not be useful for the new one. Conversely, the representations retrained for the new task could adversely affect the old tasks, since they may have drifted from their original meanings and no longer suit them.
To ensure that the knowledge shared through the network benefits all tasks in the online or incremental learning of a deep neural network, the methods of [6], [7], [8] prevent drastic changes in parameters that have a large influence on prediction, while allowing the remaining parameters to change more freely. Nguyen et al. [9] merge online variational inference (VI) [10], [11], [12] with Monte Carlo VI for neural networks [13] to obtain variational continual learning (VCL), and extend VCL with a small episodic memory by combining VI with coreset data summarization [14], [15]. Yoon et al. [16] propose Dynamically Expandable Networks (DEN), which dynamically increase network capacity by adding or duplicating neurons when necessary; DEN can thus maximally reuse the network learned on all previous tasks to efficiently learn to predict for the new task in the continual learning scenario.
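The parameter-protection idea behind [6], [7], [8] can be illustrated with a quadratic penalty that makes important parameters expensive to move. This is a minimal sketch of the general importance-weighted regularizer (as in EWC-style methods), not the exact loss of any one cited paper; the importance weights here are invented for illustration.

```python
import numpy as np

def importance_penalty(theta, theta_old, importance, lam=1.0):
    """Quadratic penalty discouraging changes to parameters deemed
    important for previous tasks (e.g. via a Fisher-diagonal estimate).

    theta      : current parameters
    theta_old  : parameters learned on the previous tasks
    importance : per-parameter importance weights
    lam        : regularization strength
    """
    return 0.5 * lam * np.sum(importance * (theta - theta_old) ** 2)

theta_old = np.array([1.0, -2.0, 0.5])
theta     = np.array([1.5, -2.0, 0.0])
imp       = np.array([4.0, 1.0, 0.1])   # first parameter matters most
penalty = importance_penalty(theta, theta_old, imp)
```

Moving the heavily weighted first parameter dominates the penalty, while the loosely constrained third parameter can drift almost freely.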
While deep neural networks are a widely popular tool for continual learning, they often have more parameters than training instances. As a result, running them on hardware-limited devices remains difficult in many real-world scenarios, and compression and efficiency have become topics of interest in the deep learning community. A variety of approaches address this problem setting, but most share the strategy of reducing the network structure, justified by the finding that neural networks suffer from significant parameter redundancy [17]. Methods in this line of thought include network pruning, where unnecessary connections are removed [18], [19], [20], and student-teacher learning, where a large network is used to train a significantly smaller one [21], [22].
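As a point of reference for the pruning line of work, the simplest variant keeps only the largest-magnitude weights and zeroes the rest. This sketch is a generic magnitude-pruning baseline, not the method of any cited paper:

```python
import numpy as np

def magnitude_prune(w, keep_ratio=0.5):
    """Zero out all but the largest-magnitude fraction of weights,
    a minimal version of the connection-pruning idea."""
    k = int(np.ceil(keep_ratio * w.size))           # number of weights to keep
    thresh = np.sort(np.abs(w).ravel())[-k]         # k-th largest magnitude
    mask = np.abs(w) >= thresh                      # survivors
    return w * mask, mask

w = np.array([0.05, -0.8, 0.3, -0.02, 1.2, 0.1])
pruned, mask = magnitude_prune(w, keep_ratio=0.5)
```

Note that this prunes individual connections; the approach developed in this paper instead removes whole neurons, which yields structured sparsity that is easier to exploit on hardware.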
In this paper, we develop Bayesian Compression [23] for Dynamically Expandable Networks (BCDEN), which obtains a pruned network structure with preserved accuracy for each observed task in the continual learning scenario. We use a variational Bayesian approximation for the model parameters. By placing sparsity-inducing priors on hidden units rather than on individual weights, we can prune entire neurons together with all their ingoing and outgoing weights. This avoids the more complicated and inefficient coding schemes needed for pruning or vector-quantizing individual weights. We train the network on the first task with sparsity-inducing priors that promote group sparsity in the weights, so that fewer neurons remain connected at each hidden layer. When a new task arrives, BCDEN first performs selective retraining for it. If the new task is highly relevant to the old ones, selective retraining alone is sufficient. If, however, the old network cannot sufficiently explain the new task, BCDEN dynamically adds the necessary number of neurons, again under sparsity-inducing priors, to increase the network capacity. When the network shows degraded performance on earlier tasks, BCDEN performs network duplication to prevent such semantic drift or catastrophic forgetting.
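The per-task control flow just described (selective retraining, then expansion, then duplication) can be sketched as follows. The class, method names, and thresholds `tau` and `sigma` are all hypothetical stand-ins for illustration, not the paper's actual implementation:

```python
class ToyNet:
    """Minimal stand-in for the network; the methods only record
    which BCDEN-style step was taken."""
    def __init__(self):
        self.units = 10
        self.log = []
    def selective_retrain(self, task):
        self.log.append("retrain")
        return task["loss"]              # pretend this is the retraining loss
    def expand(self, n=2):
        self.units += n                  # add units under sparsity priors
        self.log.append("expand")
    def drift(self):
        return 0.0                       # no drift in this toy example
    def duplicate(self):
        self.log.append("duplicate")     # copy and timestamp drifted units

def continual_update(net, task, tau=0.1, sigma=0.05):
    """Hypothetical sketch of the per-task decision flow."""
    loss = net.selective_retrain(task)   # 1. selective retraining
    if loss > tau:                       # 2. old net can't explain new task
        net.expand()                     #    -> expand capacity
    if net.drift() > sigma:              # 3. earlier tasks degrade
        net.duplicate()                  #    -> duplicate to stop drift

net = ToyNet()
continual_update(net, {"loss": 0.5})     # hard task: triggers expansion
```

Because expansion happens under sparsity-inducing priors, only the units the new task actually needs survive pruning, keeping the model compact over a long task sequence.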
Our main contributions are: (1) a compact model structure that preserves accuracy via sparsity-inducing priors, yielding fewer neurons at each hidden layer and hence fewer parameters; (2) dynamic expansion of network capacity with only the necessary number of neurons, achieved by placing sparsity-inducing priors on the added neurons; (3) a variational Bayesian approximation of the model parameters that captures parameter uncertainty.
Section snippets
Related work
The problem of alleviating semantic drift or catastrophic forgetting has been addressed in many previous studies. Goodfellow et al. [5] simply regularize the model parameters at each step, using cross-validation to choose the regularization strength, but this approach is often coarse and still leads to catastrophic forgetting. In [6], the predictions of the previous task’s network and the current network are encouraged to be similar when applied to data from the new task by using a form of
Preliminaries
Below we introduce several preliminaries of our algorithm: Bayesian inference, the stochastic variational inference framework, and reparametrized variational dropout, a method for group sparsity.
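In (sparse) variational dropout, the learned per-group noise ratio log(α) = log(σ²/μ²) serves as a pruning signal: groups whose scale is dominated by noise have large log(α) and can be removed. This is a minimal numerical sketch of that criterion; the numbers and the cut-off of 3 are illustrative (the latter is a common heuristic in the sparse variational dropout literature):

```python
import numpy as np

def group_log_alpha(mu, log_sigma2, eps=1e-8):
    """Dropout-rate proxy log(alpha) = log(sigma^2 / mu^2) per group;
    large values mark groups dominated by noise, hence prunable."""
    return log_sigma2 - np.log(mu ** 2 + eps)

mu         = np.array([1.0, 0.01])      # posterior means per group
log_sigma2 = np.array([-2.0, -2.0])     # posterior log-variances per group
log_alpha  = group_log_alpha(mu, log_sigma2)
prunable   = log_alpha > 3.0            # heuristic pruning threshold
```

The second group has a mean that is tiny relative to its noise, so its log(α) is large and the whole group (neuron) can be pruned with all its weights.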
Bayesian compression for a continual learning network
In this section, we consider the compression and efficiency of a deep neural network in the continual learning scenario, where an unknown number of tasks, with unknown training-data distributions, arrive at the model in sequence. Specifically, our goal is to learn pruned models for an unbounded sequence of tasks, where the task at time point t arrives with its training data D_t. Note that each task can be either a single task or composed of a set of subtasks. Though
Experiments
The experiments evaluate the performance and compression capability of our BCDEN on four different datasets. The groups of parameters in our model were constructed by coupling the scale variables for each input neuron in the fully connected layers, or for each filter in the convolutional layers. The pruning threshold can easily be determined by manual inspection, as these neurons or filters usually separate into two well-distinguished components, signal and noise,
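When the scale variables do split into two well-separated clusters, the manual inspection described above can be mimicked by placing the threshold in the largest gap between sorted scales. This is an illustrative stand-in with invented numbers, not the paper's procedure:

```python
import numpy as np

def split_threshold(scales):
    """Place a pruning threshold in the widest gap between sorted scale
    values, assuming the scales form two well-separated clusters
    (noise near zero, signal well above it)."""
    s = np.sort(scales)
    gaps = np.diff(s)                    # gap between consecutive scales
    i = np.argmax(gaps)                  # widest gap separates the clusters
    return 0.5 * (s[i] + s[i + 1])       # midpoint of that gap

scales = np.array([1e-4, 2e-4, 5e-5, 0.8, 1.1, 0.9])  # 3 noise, 3 signal
t = split_threshold(scales)
keep = scales > t                         # neurons/filters that survive
```

With a clear bimodal separation the exact threshold barely matters, which is why manual inspection suffices in practice.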
Conclusion
We proposed Bayesian Compression for Dynamically Expandable Network (BCDEN), which can learn a pruned network structure with preserved accuracy in the continual learning scenarios. BCDEN performs selective retraining, dynamically expands network capacity with only the necessary number of units, and effectively prevents semantic drift by duplicating units and timestamping them from a Bayesian point of view. We validate our method on multiple classification datasets under continual learning
Declaration of Competing Interest
We have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
Bo Chen acknowledges the support of the National Natural Science Foundation of China under Grant 61771361, The Youth Innovation Team of Shaanxi Universities, 111 Project under Grant B18039, and the Program for Oversea Talent by Chinese Central Government. Hongwei Liu acknowledges the support of the NSFC for Distinguished Young Scholars under Grant 61525105 and Shaanxi Innovation Team Project.
Yang Yang received the BEng degree in electronic information engineering from Henan Polytechnic University in 2012. Currently, he is working towards his Ph.D. degree at National Lab of Radar Signal Processing, Xidian University. His research interests include machine learning, Bayesian statistical modeling, and statistical signal processing.
References (44)
- et al., Catastrophic interference in connectionist networks: the sequential learning problem, Psychology of Learning and Motivation (1989)
- et al., Group sparse regularization for deep neural networks, Neurocomputing (2017)
- et al., Online incremental feature learning with denoising autoencoders, Artificial Intelligence and Statistics (2012)
- A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, ...
- et al., Connectionist models of recognition memory: constraints imposed by learning and forgetting functions, Psychol. Rev. (1990)
- et al., An empirical investigation of catastrophic forgetting in gradient-based neural networks, Comput. Sci. (2013)
- et al., Learning without forgetting, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- et al., Overcoming catastrophic forgetting in neural networks, Proc. Natl. Acad. Sci. (2017)
- et al., Continual learning through synaptic intelligence, Proceedings of the 34th International Conference on Machine Learning (2017)
- et al., Variational continual learning, International Conference on Learning Representations (2018)
- Online variational Bayesian learning, NIPS Workshop on Online Learning
- Online model selection based on the variational Bayes, Neural Comput.
- Streaming variational Bayes, Advances in Neural Information Processing Systems
- Weight uncertainty in neural networks, International Conference on Machine Learning
- Coresets for nonparametric estimation: the case of DP-means, International Conference on Machine Learning
- Coresets for scalable Bayesian logistic regression, Advances in Neural Information Processing Systems
- Lifelong learning with dynamically expandable networks, International Conference on Learning Representations
- Predicting parameters in deep learning, Advances in Neural Information Processing Systems
- Optimal brain damage, Advances in Neural Information Processing Systems
- Learning both weights and connections for efficient neural networks, Advances in Neural Information Processing Systems
- Dynamic network surgery for efficient DNNs, Advances in Neural Information Processing Systems
- Do deep nets really need to be deep?, Advances in Neural Information Processing Systems
Bo Chen received the B.S., M.S., and Ph.D. degrees from Xidian University, Xi’an, China, in 2003, 2006, and 2008, respectively, all in electronic engineering. He was a Post-Doctoral Fellow, a Research Scientist, and a Senior Research Scientist with the Department of Electrical and Computer Engineering, Duke University, Durham, NC, USA, from 2008 to 2012. Since 2013, he has been a Professor with the National Laboratory for Radar Signal Processing, Xidian University. He received an Honorable Mention for the 2010 National Excellent Doctoral Dissertation Award and was selected into the Thousand Youth Talents Program in 2014. His current research interests include statistical machine learning, statistical signal processing, and radar automatic target detection and recognition.
Hongwei Liu received the M.S. and Ph.D. degrees in electronic engineering from Xidian University in 1995 and 1999, respectively. He worked at the National Laboratory of Radar Signal Processing, Xidian University, Xi’an. From 2001 to 2002, he was a Visiting Scholar at the Department of Electrical and Computer Engineering, Duke University, Durham, NC. He is currently a Professor and the director of the National Laboratory of Radar Signal Processing, Xidian University, Xi’an. His research interests are radar automatic target recognition, radar signal processing, adaptive signal processing, and cognitive radar.