So much progress has been made since last week that, until I reviewed my last blog post to see where I had left off, I had no idea only seven days had passed since that post. This research has matured so greatly by a tangible metric (working code) since then that I now believe I truly comprehend what those wiser and more experienced than I meant when they advised me that I would never be able to see what was “waiting just around the corner.” That is to say, until one has achieved a goal, the exact distance left to travel will often prove elusive. And yet, thankfully, here we are, having been granted the invaluable gift of retrospect.
Having said this before, it is with a healthy amount of skepticism that I again announce that I believe I have a working implementation of the solution we originally set out to develop. Critical to understanding the significance of this progress is a grasp of what was wrong with the implementation I believed to be successful a week ago. Essentially, the TensorFlow methods I used return new tensors rather than modifying the existing ones in place (concatenating two 2×2 tensors vertically, for instance, produces a new 4×2 tensor). This means that my old implementation, which concatenated those tensors and then trained on them, was only modifying the concatenated tensors during training and not the original, smaller tensors of which the larger tensors were composed. Consequently, when I referenced those smaller sub-tensors after training in order to test older tasks on only the weights that were in the network when those tasks were trained (before network expansion), the sub-tensors were unchanged by training and therefore produced incorrect accuracy data. Concatenating more tensors as the network expands would therefore NOT, on its own, work as a way to mask later weights by looping through and referencing the individual “un-concatenated” tensors to separate them into subsets for masking during testing.
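To make the failure mode concrete, here is a minimal sketch. I use NumPy in place of TensorFlow purely for illustration, since `np.concatenate` shares the relevant behavior with `tf.concat`: both return a new array/tensor containing copies of the inputs, not views of them. The variable names are illustrative, not taken from my actual code.

```python
import numpy as np

# Two 2x2 weight blocks, standing in for the pre- and post-expansion layer weights.
old_weights = np.ones((2, 2))
new_weights = np.zeros((2, 2))

# Concatenating vertically yields a NEW 4x2 array; the originals are copied in, not referenced.
combined = np.concatenate([old_weights, new_weights], axis=0)

# A simulated training update modifies only the concatenated array.
combined += 0.5

# The original block is untouched -- this is the bug: referencing
# old_weights after training returns stale, pre-training values.
assert np.all(old_weights == 1.0)   # stale original
assert np.all(combined[:2, :] == 1.5)  # trained copy
```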
To resolve this issue, the new implementation adds a step to the process. The tensors are still concatenated where appropriate (e.g. the first-layer weights of a network are concatenated with a new set of first-layer weights of the same size when the network expands to double its original size), and training is still performed on those concatenated tensors (so that the entire network of weights is used), but what follows has been altered. Instead of referencing the old first layer of weights as a separate subset (and thereby referring to a tensor which has, in fact, NOT been modified by training), the old subset of weights is accessed post-training by using TensorFlow’s tf.slice method to cut the concatenated tensor into smaller pieces (corresponding to the pre-expansion network(s)); those updated slices (updated because they come from the updated concatenated tensor) are then used to evaluate testing accuracy for tasks that were trained before the network was expanded.
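The fix can be sketched the same way: after training, the updated sub-block is recovered by slicing the trained concatenated tensor (tf.slice in the actual implementation; plain NumPy indexing here, with the equivalent tf.slice call noted in a comment). All names are illustrative.

```python
import numpy as np

# Concatenated 4x2 layer after expansion: top 2x2 = old weights, bottom 2x2 = new.
combined = np.concatenate([np.ones((2, 2)), np.zeros((2, 2))], axis=0)

# Simulated training pass updates the full concatenated tensor.
combined += 0.5

# Post-training, slice the TRAINED tensor to recover the updated old block.
# The TensorFlow analogue would be: tf.slice(combined, begin=[0, 0], size=[2, 2])
old_slice = combined[0:2, 0:2]

# The slice reflects the training update, so pre-expansion tasks can be
# evaluated on exactly the weights that existed when they were trained.
assert np.all(old_slice == 1.5)
```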
With these modifications, the accuracy output appears to more squarely rest within the range of our original expectations:
Figure 1. Plot showing the average testing accuracy over all tasks after training on the latest task, as the number of tasks learned sequentially increases. The blue line represents accuracy data for Stochastic Gradient Descent alone, whereas the orange line represents the data for Elastic Weight Consolidation. The network here is being expanded just before training on the fourth task, as that is the point at which it tended to experience blackout catastrophe without expansion, based on empirical observations. Note that there is no sudden loss of accuracy when the network is expanded. Also note that there is a brief spike in accuracy for SGD when the network is expanded, likely because many more weights are made available to the network at that point. This data is also for a very simple network with only one hidden layer and relatively few weights.
Also output by the updated program is the sum of the means of the means of the diagonals of the Fisher Information Matrices (one matrix for each set of weights, including bias weights). I realize that’s a mouthful. Essentially, the graph below shows the data produced by taking the diagonal of each weight set’s Fisher Information Matrix, finding the mean value of that diagonal, and then finding the mean of those means across all weight sets. That “mean of means” is then summed with the “means of means” from all prior tasks as the network is sequentially trained on each new task.
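The quantity being plotted can be written down in a few lines. This is a hedged sketch with made-up Fisher diagonals (in the real program they come from the EWC Fisher computation); the function name and numbers are mine, not from the actual code.

```python
import numpy as np

def mean_of_means(fisher_diagonals):
    """Mean over weight sets of the mean of each set's Fisher diagonal."""
    return float(np.mean([np.mean(d) for d in fisher_diagonals]))

# Hypothetical Fisher diagonals for three weight sets (including biases),
# for two sequentially trained tasks.
task_a = [np.array([0.2, 0.4]), np.array([0.1, 0.3, 0.5]), np.array([0.6])]
task_b = [np.array([0.4, 0.6]), np.array([0.2, 0.2, 0.2]), np.array([0.8])]

# Running sum of the per-task "mean of means" as tasks are trained in sequence --
# this running sum is the value plotted against the task index.
running_sum = 0.0
history = []
for task in (task_a, task_b):
    running_sum += mean_of_means(task)
    history.append(running_sum)
```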
The information we gleaned about Fisher Information Matrices from the Kirkpatrick paper that inspired this research tells us that we should, in theory, be able to use this data as a measure of how tightly the weights in the network are being constrained to retain accuracy on older tasks via EWC. Because this weight constraint, or preservation, is the cause of blackout catastrophe, we aim to use this knowledge to predict when network capacity will be reached and blackout catastrophe will occur, and then allow the network to expand itself just before that happens. To accomplish this, it is first necessary to determine a mathematically defined relationship between the Fisher Information Matrix data and the point at which blackout catastrophe occurs, which is what I intend to work on next. Future work also includes making my code fully general, so that the network can loop through the expansion process and determine for itself when expansion needs to happen to prevent blackout catastrophe. Then we may have a completely automated process enabling continual learning with neural networks with little to no human interaction required. I’m hoping that’s on the horizon, but for now I’m caught up in the romance of seeing the beginnings of a hoped-for result coming to fruition. Perhaps the joy of seeing the code function is eclipsed only by the humbling knowledge of how far I’ve come from where I started, unaware of what lay ahead.