Even Deep Neural Networks Forget Sometimes – Week 1

Thus far, it’s been a “hit-the-ground-running” kind of week to kick off my summer research experience, but the deeper I delve into the material, the more excited I become about continuing to do so, and the more I learn about… how to learn. This research is, perhaps more than anything else, an opportunity to become adept at absorbing, contemplating, applying and (hopefully) improving a preexisting body of knowledge in relatively short order. It’s a cyclical process by nature (learn, apply that knowledge to overcome an obstacle, run into another obstacle, and repeat), but not cyclical in the sense that no progress is made. It’s this repeated process that makes it gradually easier to fall into a rhythm and gain momentum. The more I learn, the more efficiently I feel I am able to learn more, and the more the learning curve starts to flatten into a less steep climb. Jumping in this week and devoting the time to breaking down that learning curve with continual effort, as several people experienced in research suggested to me, has paid off tenfold. The fulfillment I derived from actually tinkering with code and producing visible results in this first week exceeded even my particularly hopeful expectations of what I stand to get out of this experience.

Currently, Dr. Sprague and I are developing a project around the idea of improving existing implementations for preventing what is known as “blackout catastrophe” in neural networks. To understand the implications of this, a brief overview of some basic concepts will likely prove helpful. A neural network is a particular kind of machine learning model that can be visualized as a series of nodes arranged in layers. Deep neural networks are neural networks with multiple hidden (inner) layers.
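To make the "nodes arranged in layers" picture concrete, here is a minimal sketch of a forward pass through a tiny network with two inputs, one hidden layer of three nodes, and one output. The weights, the layer sizes, and the choice of ReLU as the hidden activation are all made up for illustration; they are not taken from the project described here.

```python
# Minimal forward pass through a small layered network (illustrative values only).

def dense(inputs, weights, biases):
    """One layer: weights[j][i] is the weight on the edge from input node i to node j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Hidden-layer activation (an assumption; other activations work too)."""
    return [max(0.0, v) for v in values]

def forward(x, layers):
    """Pass x through each (weights, biases) pair; ReLU on every layer but the last."""
    h = x
    for idx, (weights, biases) in enumerate(layers):
        h = dense(h, weights, biases)
        if idx < len(layers) - 1:
            h = relu(h)
    return h

# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]], [0.0, 0.0, 0.0]),  # hidden layer
    ([[1.0, 2.0, 3.0]], [0.5]),                                  # output layer
]

output = forward([2.0, 1.0], layers)
print(output)  # [4.5]
```

A deep network in this sketch is simply a longer `layers` list: each extra `(weights, biases)` pair adds one more hidden layer.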

nnet.png

The nodes are connected by edges (symbolized above by arrows), each of which carries a weight. The network model maps the input data to the output via a function, and the goal of learning is to optimize the weights along these edges to minimize the error, that is, the distance between the values at the output layer and their target values. This can be achieved via Stochastic Gradient Descent (SGD), a method that uses the training data to nudge each weight, over many iterations, in the direction that reduces the error. However, issues arise in the context of continual learning (that is, training a model on several tasks sequentially) when SGD alone is used to train the model. Essentially, when the model is trained on a new task, SGD moves the weights toward the optimal configuration for the new task alone, and often achieves good accuracy on that new task. However, because many weights have been changed (potentially substantially), the network “forgets” how to perform previous tasks if they are different in nature, a problem known as catastrophic forgetting. This follows from the observation that different tasks will likely have different optimal weight values, so the function from inputs to outputs changes from task to task.

A recently developed learning algorithm from DeepMind, Elastic Weight Consolidation (EWC), helps prevent this. EWC works by estimating which weights are most important to a particular task and “protecting” those weights when future tasks are learned by penalizing large changes to them. That way, multiple tasks can be learned: some weights can be adjusted more freely, while those most critical to previously learned tasks remain largely preserved.
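The forgetting effect and the EWC-style remedy can be sketched on a toy problem. The example below is a simplification under stated assumptions, not the DeepMind implementation: the model is a two-weight linear function, the two "tasks" are single hypothetical data points, training uses full-batch gradient descent as a deterministic stand-in for SGD, and weight importance is the diagonal Fisher information under an assumed unit-variance Gaussian output (which, for a linear model, is just the mean of each squared input).

```python
# Toy illustration: plain gradient descent forgets task A; an EWC-style penalty does not.
# All data, names, and hyperparameters here are hypothetical.

def predict(w, x):
    """Linear model: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    """Mean squared error over a list of (x, y) pairs."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def grad_mse(w, data):
    """Gradient of the mean squared error with respect to each weight."""
    g = [0.0] * len(w)
    for x, y in data:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi / len(data)
    return g

def fisher_diagonal(data, n_weights):
    """Per-weight importance. Assumption: unit-variance Gaussian output,
    for which the diagonal Fisher information of a linear model is mean(x_i ** 2)."""
    return [sum(x[i] ** 2 for x, _ in data) / len(data) for i in range(n_weights)]

def train(w, data, steps=2000, lr=0.05, anchor=None, fisher=None, lam=0.0):
    """Full-batch gradient descent. If anchor and fisher are given, add the gradient
    of the penalty (lam / 2) * sum_i fisher[i] * (w[i] - anchor[i]) ** 2."""
    w = list(w)
    for _ in range(steps):
        g = grad_mse(w, data)
        if anchor is not None:
            for i in range(len(w)):
                g[i] += lam * fisher[i] * (w[i] - anchor[i])
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

task_a = [((1.0, 0.0), 1.0)]  # only the first weight matters for task A
task_b = [((1.0, 1.0), 3.0)]  # task B can be solved with either weight

w_a = train([0.0, 0.0], task_a)                 # learn task A first
fisher_a = fisher_diagonal(task_a, 2)           # importance of each weight to task A
w_plain = train(w_a, task_b)                    # then task B, no protection
w_ewc = train(w_a, task_b, anchor=w_a, fisher=fisher_a, lam=10.0)  # with protection

print("task A error after plain training on B:", mse(w_plain, task_a))
print("task A error after EWC-style training on B:", mse(w_ewc, task_a))
print("task B error after EWC-style training on B:", mse(w_ewc, task_b))
```

Without the penalty, both weights move to fit task B and the error on task A climbs to about 1. With the penalty, the weight that task A depends on stays near its old value and the unimportant weight absorbs the change, so both tasks end with near-zero error.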

EWC

The resulting weights are optimized in a manner that maintains a low error rate on both new and old tasks, so continual learning can occur with less accuracy loss as tasks are added than with SGD alone.
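For reference, the objective that EWC minimizes when learning a new task B after an old task A, as given in the DeepMind paper linked at the end of this post, can be written as follows. Here θ are the weights, θ*_A are the weights found for task A, F_i is the diagonal Fisher information used as the importance of weight i, and λ sets how strongly the old task is protected.

```latex
\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \sum_i \frac{\lambda}{2} \, F_i \left( \theta_i - \theta^{*}_{A,i} \right)^2
```

The first term is the ordinary loss on the new task; the second grows quadratically as an important weight drifts away from its old value, which is what keeps the old task's error low.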

EWC_Graph

Don’t go thinking that robots will be conquering the world just yet, though. This is where blackout catastrophe comes in. A given network has only a finite number of weights. When it attempts to learn too many tasks sequentially, it reaches its capacity: so many weights are critical to previous tasks, and are therefore being “preserved,” that the network suddenly fails to perform any of the tasks accurately, old or new. To learn more about how the point at which blackout catastrophe occurs relates to the number of weights in the network, I spent today altering the number of weights and observing the effects. Interestingly, there appears to be a linear relationship between the number of weights in the network and the number of tasks that can be sequentially learned with EWC prior to blackout catastrophe. A simple two-layer network with 50 weights experienced blackout catastrophe when attempting to learn the fifth task; when I doubled the number of weights to 100, the network experienced blackout catastrophe when attempting to learn the tenth task (twice as many tasks from twice as many weights).
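The arithmetic behind that observation can be written out as a short sketch. The two data points are the ones reported above; the fitted line and the function names are illustrative, and any value the line gives for other network sizes is an untested extrapolation, not a measurement.

```python
# (number of weights, task at which blackout catastrophe occurred), from the text above
observations = [(50, 5), (100, 10)]

def tasks_per_weight(points):
    """Least-squares slope of a line through the origin: tasks = slope * weights."""
    numerator = sum(w * t for w, t in points)
    denominator = sum(w * w for w, _ in points)
    return numerator / denominator

def predicted_blackout_task(n_weights, slope):
    """Extrapolation from the fitted line only; a hypothesis to test, not a result."""
    return slope * n_weights

slope = tasks_per_weight(observations)
for n_weights, task in observations:
    print(n_weights, "weights -> blackout at task", task, "ratio", task / n_weights)
print("fitted tasks per weight:", slope)
```

Both observations give the same ratio of one task per ten weights, which is what a linear relationship would predict. Two points cannot rule out other curves, so measuring several more network sizes would be the natural next check.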

blackout_catastrophe

Moving forward, there are several interesting directions of real significance for improving continual learning: altering the actual network architecture (adding layers, changing the types of layers included), applying the EWC algorithm to a deep convolutional neural network to see whether it offers greater accuracy benefits over SGD alone, and investigating how to maximize the number of tasks a network can learn with EWC before blackout catastrophe. Tweaking the math in the current EWC implementation (perhaps by applying L1 regularization) could also result in improvements to the algorithm. Also, EWC is not the only proposed or demonstrated method for keeping a network from forgetting earlier tasks. As such, there are multiple other routes to explore in looking for a solution to some of the key obstacles faced by neural networks attempting continual learning.
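One possible reading of the L1 idea is to swap the squared distance in the EWC penalty for an absolute distance. The sketch below compares the two penalty gradients on made-up numbers; the L1 variant and every name here are hypothetical and are not part of the published EWC algorithm or the linked implementation.

```python
# Compare how hard each penalty "pulls" a weight back toward its old value.
# anchor = weights after the old task, importance = per-weight importance (e.g. Fisher).

def quadratic_penalty_grad(w, anchor, importance, lam):
    """Gradient of (lam / 2) * sum_i importance[i] * (w[i] - anchor[i]) ** 2."""
    return [lam * f * (wi - ai) for wi, ai, f in zip(w, anchor, importance)]

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def l1_penalty_grad(w, anchor, importance, lam):
    """Subgradient of lam * sum_i importance[i] * abs(w[i] - anchor[i])
    (hypothetical L1 variant)."""
    return [lam * f * sign(wi - ai) for wi, ai, f in zip(w, anchor, importance)]

anchor = [1.0, 0.0]
importance = [1.0, 0.0]   # only the first weight matters to the old task
lam = 10.0

small_drift = [1.01, 2.0]
large_drift = [3.0, 2.0]

for name, w in [("small drift", small_drift), ("large drift", large_drift)]:
    print(name,
          "quadratic:", quadratic_penalty_grad(w, anchor, importance, lam),
          "L1:", l1_penalty_grad(w, anchor, importance, lam))
```

The quadratic pull shrinks toward zero as a weight approaches its old value and grows without bound as it drifts away, while the L1 pull has the same size at any distance. Whether that trade-off helps or hurts the number of tasks learned before blackout catastrophe is an open question that would need to be tested.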

It’s a lot to take in, but as I said before, this learning process has a momentum to it that has drawn me in. It’s an addictive search for both knowledge and ways to creatively turn that knowledge into tangible improvements that contribute to a fascinating and deep (pun definitely intended) field.

For those interested, the DeepMind EWC paper can be found here:

http://www.pnas.org/content/114/13/3521.full

And an implementation of EWC can be found here:

https://github.com/ariseff/overcoming-catastrophic
