What if… it didn’t have to forget? – Week 3

Featured image by Terence Broad, from http://terencebroad.com/convnetvis/vis.html

This week we went from understanding existing implementations of continual learning with neural networks to identifying, and seeking a solution to, the problems that arose when we evaluated those implementations empirically. We have zeroed in on a particular limitation of elastic weight consolidation as a method for allowing neural networks to learn sequentially and have begun to pursue a potential solution to that limitation.

As mentioned in my previous posts, elastic weight consolidation (EWC) suffers from a fairly critical drawback. EWC lets a neural network learn multiple tasks sequentially by preserving the weights that are important for a given task: those weights are not allowed to change too much when a new task is learned. However, a neural network has a finite capacity (number of weights). Once all of those weights are being preserved, none can change enough to accommodate a new task. The network then rather suddenly goes from performing all of the sequentially learned tasks, with better average accuracy across tasks than stochastic gradient descent alone, to “blackout catastrophe” (the inability to perform any task successfully). Given that this critical weakness of EWC arises from a network’s finite capacity, a logical solution is to remove that limitation by making the network dynamically expandable.
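To make the "not allowing those weights to change too much" idea concrete, here is a minimal NumPy sketch of the quadratic EWC penalty as it is usually written with a diagonal Fisher approximation. It is an illustration, not the code from our project: the function name `ewc_penalty`, the parameter `lam`, and the toy numbers are all assumptions made for this example.

```python
import numpy as np

def ewc_penalty(weights, anchor_weights, fisher_diag, lam=1.0):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (w_i - w*_i)^2.

    anchor_weights are the values the weights had after the old task;
    fisher_diag holds one importance value per weight (diagonal approximation).
    """
    diff = weights - anchor_weights
    return 0.5 * lam * float(np.sum(fisher_diag * diff ** 2))

# Hypothetical toy values: two weights, only the first matters to the old task.
anchor = np.array([1.0, 1.0])
fisher = np.array([10.0, 0.0])

# Moving the important weight is penalized; moving the unimportant one is free.
moved_important = ewc_penalty(np.array([2.0, 1.0]), anchor, fisher)
moved_unimportant = ewc_penalty(np.array([1.0, 2.0]), anchor, fisher)
```

When every weight has a large importance value, every direction of change is expensive, which is one way to picture the capacity problem described above.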

Our approach is to double the number of weights in the EWC-enabled network right before the network experiences blackout catastrophe. As of this week, I *believe* we have a working TensorFlow implementation of a network that we can expand at will, doubling the number of weights without discarding any of the old saved weights. A next step to ensure that I have done this correctly may be to integrate TensorFlow’s graph visualization software (TensorBoard) into the code so that we can visualize the network and confirm that it is being transformed in the manner we desire. However, based on preliminary testing, I believe the network is, in fact, being doubled. I took a network using EWC with 20 nodes in a hidden layer and doubled its capacity right before the point at which I had empirically determined it would fail due to blackout catastrophe. It then failed at the same number of tasks as an unexpanded (undoubled) network that began with 40 nodes. This appears to indicate that when the expandable 20-node network failed, it did in fact have 40 hidden nodes (it had doubled in size as intended).
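The expansion step can be sketched outside of TensorFlow with plain NumPy arrays. This is a simplified stand-in for our implementation, not a copy of it: it assumes a single ReLU hidden layer, and the choice to give new units zero outgoing weights (so the outputs do not change at the moment of expansion) is an assumption made for this example. The names `forward` and `double_hidden_layer` are hypothetical.

```python
import numpy as np

def forward(x, w_in, b_hidden, w_out):
    """One-hidden-layer network with ReLU activations and no output bias."""
    return np.maximum(0.0, x @ w_in + b_hidden) @ w_out

def double_hidden_layer(w_in, b_hidden, w_out, rng, init_scale=0.01):
    """Double the hidden units while copying every old value unchanged.

    w_in: (n_inputs, n_hidden), b_hidden: (n_hidden,), w_out: (n_hidden, n_outputs).
    New units get small random incoming weights and zero outgoing weights
    (an assumption of this sketch), so existing outputs are preserved.
    """
    n_inputs, n_hidden = w_in.shape
    n_outputs = w_out.shape[1]
    extra_in = init_scale * rng.standard_normal((n_inputs, n_hidden))
    new_w_in = np.concatenate([w_in, extra_in], axis=1)
    new_b = np.concatenate([b_hidden, np.zeros(n_hidden)])
    new_w_out = np.concatenate([w_out, np.zeros((n_hidden, n_outputs))], axis=0)
    return new_w_in, new_b, new_w_out

# Hypothetical sizes: 3 inputs, 20 hidden nodes, 2 outputs.
rng = np.random.default_rng(0)
w_in = rng.standard_normal((3, 20))
b_hidden = rng.standard_normal(20)
w_out = rng.standard_normal((20, 2))
x = rng.standard_normal((5, 3))

new_w_in, new_b, new_w_out = double_hidden_layer(w_in, b_hidden, w_out, rng)
```

Comparing `forward(x, w_in, b_hidden, w_out)` with `forward(x, new_w_in, new_b, new_w_out)` is a quick check that the old weights survived the expansion, independent of any visualization tool.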

Confirming that we understand how the network is being doubled, and that already-optimized weights are preserved throughout this process, is critical but only a first step. As Dr. Sprague has made clear, we need to “mask” some weights so that prior tasks are evaluated using only the weights that existed in the network when it was originally trained to perform that task. The network can still improve these weights with experience or use them to learn new tasks efficiently and effectively. However, our accuracy output won’t be dependable until we successfully mask the appropriate weights in the testing step. For instance, if task A was trained in a 20-node network that later expanded to 40 nodes before task F was learned, task F should be evaluated for testing accuracy using all 40 nodes, but task A’s testing accuracy should only be evaluated using the 20 nodes that existed in the network when it was trained to perform task A.
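One way to picture the masking step is to zero out the hidden units that did not exist when a task was trained. The sketch below is an assumption about how this could look, not our working code: it uses a single ReLU hidden layer in NumPy, random placeholder weights and labels, and hypothetical names (`masked_forward`, `masked_accuracy`, `task_hidden_size`).

```python
import numpy as np

def masked_forward(x, w_in, b_hidden, w_out, n_active):
    """Forward pass that uses only the first n_active hidden units."""
    mask = np.zeros(w_in.shape[1])
    mask[:n_active] = 1.0
    hidden = np.maximum(0.0, x @ w_in + b_hidden) * mask
    return hidden @ w_out

def masked_accuracy(x, labels, params, n_active):
    """Classification accuracy when only n_active hidden units are used."""
    logits = masked_forward(x, *params, n_active)
    return float(np.mean(np.argmax(logits, axis=1) == labels))

# Placeholder 40-node network and data (random values, for illustration only).
rng = np.random.default_rng(0)
params = (rng.standard_normal((3, 40)),
          rng.standard_normal(40),
          rng.standard_normal((40, 2)))
x = rng.standard_normal((8, 3))
labels = rng.integers(0, 2, size=8)

# Hidden-layer size at the time each task was trained (example from the text).
task_hidden_size = {"A": 20, "F": 40}
accuracy = {task: masked_accuracy(x, labels, params, n)
            for task, n in task_hidden_size.items()}
logits_a = masked_forward(x, *params, task_hidden_size["A"])
```

With this mask, evaluating task A on the 40-node network gives the same outputs as running the original 20-node slice of the weights on its own.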

After that, we’ll need to use the Fisher Information Matrix to determine exactly when the network is about to experience blackout catastrophe with EWC. The Fisher Information Matrix scales the penalty term in the EWC error function, limiting how far weights that were important to previous tasks can move on future tasks. Blackout catastrophe approaches when all the weights are considered important for their respective learned tasks and the network capacity has thus been reached. By detecting that point, we hope to allow the network to double in weight capacity at will at the appropriate time (right before blackout catastrophe). This way, doubling of the network won’t need to be hard-coded to occur at a particular number of tasks but will instead happen when the capacity of the network in question has been reached. This is similar to Java’s ArrayList, which starts with a fixed capacity and grows as needed to accommodate its contents.
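We have not settled on how to detect that capacity has been reached, so the following is only one possible heuristic, sketched in NumPy. It sums diagonal Fisher values across the tasks learned so far and asks what fraction of weights count as "important". The threshold values and the names `important_fraction` and `should_expand` are hypothetical and would need to be tuned empirically.

```python
import numpy as np

def important_fraction(fisher_per_task, importance_threshold):
    """Fraction of weights whose summed Fisher value exceeds the threshold."""
    total = np.sum(fisher_per_task, axis=0)
    return float(np.mean(total > importance_threshold))

def should_expand(fisher_per_task, importance_threshold=1e-3, capacity_cutoff=0.9):
    """Signal an expansion once nearly all weights are claimed by some task."""
    fraction = important_fraction(fisher_per_task, importance_threshold)
    return fraction >= capacity_cutoff

# Toy example with four weights and made-up Fisher values.
fisher_after_two_tasks = [np.array([1.0, 0.0, 0.0, 0.0]),
                          np.array([0.0, 1.0, 0.0, 0.0])]
fisher_after_three_tasks = fisher_after_two_tasks + [np.array([0.0, 0.0, 1.0, 1.0])]

expand_after_two = should_expand(fisher_after_two_tasks)      # half the weights are free
expand_after_three = should_expand(fisher_after_three_tasks)  # every weight is claimed
```

The design mirrors the ArrayList analogy: the check runs after each task, and growth is triggered by how full the network is rather than by a fixed task count.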

The above points constitute some daunting (but exciting) hurdles that we have yet to overcome in our pursuit of a solution to EWC’s blackout catastrophe, but we have a direction in which to progress and a very particular focal point for our combined energies. With a specific goal in mind, I feel that we are making even more efficient progress toward what could be (and I hope will be) a very exciting and meaningful culmination to this research endeavor.