Progress indeed. My mind has been shaped by experience at this point to work in small chunks, chipping away at an overarching goal in progressively cumulative sections. Each task builds upon those that came before it, and with each task my experience is confirming or correcting my previous notions of the concepts involved. I think I’m beginning to understand why research experience for undergraduates is so important: by actually creating or modifying code to perform tasks, one can grasp the nuances and full context of ideas that, in classes, we have perhaps only briefly touched on. Or, as I find often happens in my case, underlying misconceptions are rectified by tinkering with the code and analytically digging through the finer points of the output it produces. Even concepts I have never explicitly learned can be inferred through experience manipulating data and adjusting those manipulations appropriately to achieve a desired output. Which is… well… how neural networks learn. It’s possible that I’m slowly evolving into a computer. Or (far more likely) I’m understanding at this point how many concepts in artificial intelligence are not so distant from (on the contrary, potentially largely inspired by) the way in which humans learn. But now, on to the aforementioned progress!
This week, thus far, has primarily been focused on two goals: manipulating data in a way that gives us control over the “relevancy” between tasks, and outputting results in a manner that gives us the information we need to tune and improve our parameters for the best possible advantage of elastic weight consolidation (EWC) over stochastic gradient descent (SGD) alone, while extending the number of subsequent tasks that can be learned before catastrophic forgetting. To assess the ability of EWC to potentially improve learning rates when subsequent tasks are “relevant” or “similar” to one another (as a human would do when, say, learning a new video game if he or she already had some experience playing similar video games), we needed to devise a method of generating data with a controllable level of deviation from the data in the previous task. Given that we are working with images of handwritten digits from 0-9 from the MNIST dataset, this meant permuting the pixels in each image in a way that is not a complete random shuffling of every pixel, which is what the original implementation of the EWC code (referenced in my last post) uses. To achieve this new, controlled permutation, we decided to implement two schools of thought on how this should be done. The first, which was used by the researchers who published the paper introducing EWC (mentioned in my last blog post), is to take a particular percentage of the pixels in the image (drawn randomly from the entire image) and switch those with an equal number of other random pixels (also drawn randomly from the entire image). Example output of the code (30% of pixels being randomly reassigned) looks as follows:
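A minimal sketch of how this partial-swap permutation might look in numpy (our actual code differs; the function names and the 28×28 flattened-image layout here are my own assumptions, and I'm interpreting "switch" as pairwise swaps between two disjoint random sets of positions):

```python
import numpy as np

def make_swap_permutation(n_pixels, frac, rng):
    """Build a fixed permutation that swaps a random `frac` of pixel
    positions with an equal number of other random positions, leaving
    the remaining positions untouched."""
    perm = np.arange(n_pixels)
    n_swap = int(frac * n_pixels)
    # Draw 2 * n_swap distinct positions: sources and their swap partners.
    chosen = rng.choice(n_pixels, size=2 * n_swap, replace=False)
    src, dst = chosen[:n_swap], chosen[n_swap:]
    perm[src], perm[dst] = perm[dst], perm[src]
    return perm

def apply_permutation(images, perm):
    """Apply the same pixel permutation to every flattened image,
    preserving task relevancy across the whole dataset."""
    return images[:, perm]

rng = np.random.default_rng(0)
perm = make_swap_permutation(28 * 28, 0.30, rng)      # computed once per task
images = rng.random((5, 28 * 28))                     # stand-in for flattened MNIST images
permuted = apply_permutation(images, perm)
```

Because the permutation is built once and reused, every image in the new task is scrambled identically, which keeps the new input-to-output mapping related to the old one.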
The second method of altering tasks to be learned while maintaining similarity (within the context of the MNIST dataset) involves calculating an array of vectors with random direction and magnitude, one for each pixel, and specifying bounds on the magnitude so as to control the distance pixels in the new image can travel from their previous locations. These random vectors are then applied to each pixel, and the new image is composed of the pixels after each has been moved according to the direction and magnitude of its respective vector. The vector array is calculated once when an MNIST dataset is permuted, then stored and used to permute all images in the dataset in the same way. This consistency is critical to retaining the relevancy of the new task (and therefore of the new input-to-output function the network model must map) to the old task (and the old input-to-output function). Example output from this permutation style looks as follows:
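The displacement idea above could be sketched roughly as follows. This is only an illustration under my own simplifying assumptions, not our actual implementation: displaced coordinates are rounded and clipped to the image bounds, and when two pixels land on the same spot the later one simply overwrites the earlier one.

```python
import numpy as np

def make_displacement_map(height, width, max_dist, rng):
    """For each pixel, draw a random direction (angle) and a random
    magnitude bounded by `max_dist`, and return the (row, col) each
    pixel moves to. Computed once, then reused for the whole dataset."""
    rows, cols = np.mgrid[0:height, 0:width]
    angles = rng.uniform(0.0, 2.0 * np.pi, size=(height, width))
    mags = rng.uniform(0.0, max_dist, size=(height, width))
    new_rows = np.clip(np.rint(rows + mags * np.sin(angles)), 0, height - 1).astype(int)
    new_cols = np.clip(np.rint(cols + mags * np.cos(angles)), 0, width - 1).astype(int)
    return new_rows, new_cols

def displace(image, new_rows, new_cols):
    """Move each pixel to its displaced location; collisions are
    resolved by overwriting, and vacated spots are left at zero."""
    out = np.zeros_like(image)
    out[new_rows, new_cols] = image
    return out

rng = np.random.default_rng(0)
new_rows, new_cols = make_displacement_map(28, 28, 3.0, rng)  # one map per task
image = rng.random((28, 28))                                  # stand-in for one MNIST image
moved = displace(image, new_rows, new_cols)                   # same map reused for every image
```

Bounding `max_dist` is the knob that controls how far the new task drifts from the old one: a small bound keeps images nearly intact, while a large bound approaches a full random shuffle.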
Another achievement made this week was the production of graphs that give us a much clearer idea of what is going on. Output from the network is now being graphed in a manner that shows the difference between SGD alone and EWC in the average accuracy over all test sets (one for each task) after the final iteration of testing for each group of tasks. This allows us to better visualize how SGD alone and EWC differ in their ability to maintain accuracy over all tasks when new tasks are added, which is key because this is the previously touted advantage of EWC. Output graphs now appear cleaner and simpler, allowing us to immediately see the information we really need to analyze, including not only differences between SGD alone and EWC but also the task count at which catastrophic forgetting occurs for EWC (see orange line at task 7 below):
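For a sense of the plotting approach, here is a small matplotlib sketch of a graph in that style. The accuracy numbers are purely hypothetical placeholders I made up for illustration (real values would come from evaluating the network on every task's test set), and the labels and filename are my own choices, not those from our code:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

tasks = np.arange(1, 11)
# Hypothetical average accuracies over all test sets after each new task.
sgd_acc = np.array([0.97, 0.90, 0.82, 0.74, 0.66, 0.59, 0.53, 0.48, 0.44, 0.40])
ewc_acc = np.array([0.97, 0.95, 0.93, 0.91, 0.89, 0.87, 0.80, 0.70, 0.60, 0.52])

fig, ax = plt.subplots()
ax.plot(tasks, sgd_acc, marker="o", label="SGD alone")
ax.plot(tasks, ewc_acc, marker="o", label="EWC")
# Vertical marker for where EWC's forgetting sets in (illustrative).
ax.axvline(7, color="orange", linestyle="--", label="EWC forgetting onset")
ax.set_xlabel("Number of tasks learned")
ax.set_ylabel("Average accuracy over all test sets")
ax.legend()
fig.savefig("avg_accuracy.png")
```

Plotting the average over all test sets, rather than accuracy on the newest task alone, is what makes the retention difference between the two methods visible at a glance.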
Moving forward, we plan to lay the groundwork tomorrow for future research endeavors and set an ultimate goal for what it is we really would like to learn from this research. Learning and working in chunks, as I have been doing, has proven very beneficial, though it can be easy at times to lose sight of the endgame here. We need to devise a plan to reach a tangible, informative result that illustrates some key concepts and hypotheses about which we are curious, and which we can then investigate in smaller chunks over the remainder of our research. I find myself “pleasantly unsurprised” that I am looking forward to continuing to explore those concepts and hypotheses, and that my enthusiasm for acquiring and utilizing the necessary knowledge to do so grows by the day. I’m also learning a wealth of useful skills, such as Python-specific coding conventions, working with Python libraries like matplotlib and numpy, and image-related algorithm development. This is, I hope, going to culminate in one of the most beneficial investments I have made yet in my education.