Before this summer's experiences, I would likely have greeted initial failure at a particular task as an indicator of incompetence. Now, with a renewed desire to perform genuinely progressive research, I can honestly say that such initial failure not only comes with the territory; it is a critical step in the formative stages of both a person and a product. I have come to greet obstacles as opportunities for learning and advancement rather than begrudging them their very existence. One such obstacle reared its head this week, and since attempting to rectify it has been the focus of most of the recent work on this project, I will try to describe it here as clearly as possible.
It was only when combing through the code to add comments and begin cleaning it up (a critical step in ensuring that the code is generalizable, maintainable, and as clearly comprehensible as possible) that I discovered this rather critical issue. As described in my last post, the flawed iteration of the code uses TensorFlow's concatenation method (tf.concat) to expand the network: smaller tensors are concatenated to form larger ones matching the size of the post-expansion network weights and biases to which they respectively correspond. However, tf.concat returns a new tensor when it joins two tensors into a single, larger one. The resulting issue is that any modification to that larger concatenated tensor does NOT modify the individual tensors from which it was built. An example of such a modification is exactly the kind that occurs when training a neural network on a given task. I thought I had accounted for this issue… until I dug further into the code of the existing implementation of DeepMind's EWC algorithm (see my Week 1 post for more details on that) that serves as the starting point upon which our project's code is built.
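The copy semantics at the heart of the bug can be seen in a minimal sketch. NumPy's np.concatenate copies its inputs in the same way that tf.concat returns a new tensor rather than a view of its inputs; all variable names and shapes here are invented for illustration:

```python
import numpy as np

# Two small "weight" tensors, analogous to pieces of a network layer.
w_original = np.ones((2, 3))
w_expand = np.zeros((2, 3))

# Concatenation allocates a brand-new array; it does NOT alias its inputs.
w_combined = np.concatenate([w_original, w_expand], axis=0)

# Simulate a training update applied to the combined tensor...
w_combined += 0.5

# ...and observe that the original pieces are untouched.
print(w_original[0, 0])  # still 1.0
print(w_combined[0, 0])  # 1.5
```

Any code that later reads w_original expecting it to reflect the update will silently get stale values, which is exactly the failure mode described above.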
In reality, the code trains on a list of all the tensor variables in the network, and calculates the input to and output from the hidden layer, as well as the output-layer results, from the variables in that list. Suppose, for example, that our network has weights from the input to hidden layer 1 (call this the W1 tensor), bias weights for hidden layer 1 (b1), weights from hidden layer 1 to the output (W2), and a bias weight for each output node (b2). We then have W1, b1, W2, and b2 as the tensor variables to update via training. But if I expand my network by concatenating more weights, for instance by doubling W1 in size by concatenating W1 with another tensor called W1_expand, I now have a NEW tensor equal to the concatenation of the two originals. Now say I train my expanded network. Training modifies the values in that new concatenated tensor. If I later reference W1 or W1_expand individually, as the code attempts to do in error calculations and in restoring optimal weights from previous tasks, I find they haven't changed with training.
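That scenario can be played out in a toy NumPy version (the real code uses TensorFlow variables; the shapes and the 0.01 "gradient step" here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8

# Original input-to-hidden weights and the block used to double the layer.
W1 = rng.normal(size=(n_in, n_hidden))
W1_expand = rng.normal(size=(n_in, n_hidden))
W1_before = W1.copy()

# The "expanded" weight matrix is a new array, not a view of W1 / W1_expand.
W1_big = np.concatenate([W1, W1_expand], axis=1)

# Pretend the optimizer applies a gradient step to the expanded network.
W1_big -= 0.01 * np.ones_like(W1_big)

# Later code that references W1 directly sees the stale, pre-training values.
print(np.array_equal(W1, W1_before))             # True: W1 never changed
print(np.array_equal(W1_big[:, :n_hidden], W1))  # False: the trained copy drifted
```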
To resolve this issue, the code has now been modified to "slice" the concatenated tensor after training, "reproducing" newly trained versions of tensors that are the same size as the original W1 and W1_expand, and then to use those sliced tensors in calculations. However, this hasn't yet rectified every issue. Dr. Sprague suggested a further change: even though the bias weights for the output never actually expand (we never add more output nodes; the MNIST handwritten digit data is modeled with 10 output nodes for the digits 0-9), we still need to separate out the output bias weights, because we are separating the output itself. I realize that's quite a bit to absorb and warrants a brief explanation. We "separate the output" by calculating testing accuracy differently for different tasks, using only the weights that were in the network when the given task was originally trained. In our sequential learning problem domain, the network is exposed to a particular task's training data once and only once, so we don't want to evaluate testing accuracy on the first task using weights that weren't introduced into the network until it was trained on, say, the ninth task. This had been accounted for in every other tensor variable EXCEPT the output-layer bias weights, and modifying this very small part of the code has a lot of potential for improving accuracy. It also makes logical sense: if I have output_1 and output_2, depending on which subsection of the network I am using to calculate them, I wouldn't want to let every value along the path to each output expand with the network and then force all of that updated data through the same output bias weights.
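Both fixes can be sketched together, again with NumPy standing in for the project's TensorFlow variables and with invented sizes and names: first, slicing the trained concatenated tensor back into pieces of the original sizes; second, keeping a separate output-bias snapshot per task even though the number of output nodes never grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_h1, n_h_now, n_out = 4, 8, 16, 10  # task-1 / current hidden sizes, outputs

# Fix 1: slice the trained concatenated tensor back into original-size pieces.
W1_big = np.concatenate([np.ones((n_in, n_h1)), np.zeros((n_in, n_h1))], axis=1)
W1_big -= 0.25                       # stand-in for a training update
W1_trained = W1_big[:, :n_h1]        # same size as the original W1
W1_expand_trained = W1_big[:, n_h1:] # same size as the original W1_expand

# Fix 2: separate output-bias snapshots per task.
h = rng.normal(size=(1, n_h_now))        # current hidden activations
W2 = rng.normal(size=(n_h_now, n_out))   # expanded hidden-to-output weights
b2_task1 = np.zeros(n_out)               # bias as it stood after training task 1
b2_current = rng.normal(size=n_out)      # bias after training on later tasks

# Task 1 is evaluated using only the units that existed when task 1 was
# trained, together with task 1's own output bias.
out_task1 = h[:, :n_h1] @ W2[:n_h1, :] + b2_task1
out_current = h @ W2 + b2_current

print(W1_trained[0, 0])  # 0.75: reflects the training update
print(out_task1.shape)   # (1, 10)
```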
Thankfully, this could drastically change the results we're currently getting, and it is the next step toward resolving the overarching remaining issue with the code (namely, that the accuracy output appears to be corrupted by some flaw in the current implementation).
Despite all of the apparent doom-and-gloom in the above descriptions, this process has actually been an unexpected opportunity to greatly simplify and sanity-check the code. The most recent implementation uses a much simpler expansion method that relies far less on concatenation, trimming down possible sources of programmer error and making the code much easier to understand and trace. Instead of concatenating the tensors, which could have been at the root of many of the issues that have recently come to light, the updated code now makes a new model twice the size of the original and simply restores the old weight values from the old optimized network into the first half of each tensor variable. This is a much more straightforward implementation, meaning we can raise our level of confidence that the code is actually doing what we picture it doing in our heads (which has proven to be a very difficult thing to confirm in and of itself). Besides, code that looks great on paper (as last week's implementation did) but doesn't actually work… isn't really progress. So what I originally believed to be a step in the wrong direction is, in fact, now appearing to be a step in the right direction. How far we are from reaching the goal that we hope lies in that direction… I can't say for sure. But I CAN say that I've probably learned more from troubleshooting this one error than from all the code that worked the first time I ran it.
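The new expansion method amounts to a single assignment rather than a concatenation. A minimal NumPy sketch of the idea, with hypothetical names and shapes (the real code restores TensorFlow variables from the old optimized network):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8

# Optimized weights from the old (smaller) network.
W1_old = rng.normal(size=(n_in, n_hidden))

# Build a fresh variable twice the size, then restore the old values
# into its first half; the second half keeps its fresh initialization.
W1_new = rng.normal(size=(n_in, 2 * n_hidden)) * 0.01
W1_new[:, :n_hidden] = W1_old

print(np.array_equal(W1_new[:, :n_hidden], W1_old))  # True
```

Because training then updates W1_new directly, there is no second, aliased copy of the weights to fall out of sync, which is what makes this version so much easier to reason about.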