For a good stretch of time earlier this week, a series of high-frequency chirps roamed the CS hallway. That was the sound of science.
It was also the sound of Dr. Sprague collecting audio samples for the echolocation project (thank you!). Once I had over 13,000 samples at my disposal (thank you thank you), it was time to build a new neural network, one that would be smarter since it gets to see more of the world. The samples were collected using stereo microphones, which provide more informative 3D depth cues than a single microphone. Thus, the neural network’s inputs come as sets of two spectrograms per sample.
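As a rough illustration of that input format, here is how a stereo recording could be turned into a pair of spectrograms. This is just a sketch: the sample rate and FFT parameters below are placeholders, not the project’s actual settings.

```python
import numpy as np
from scipy.signal import spectrogram

def stereo_to_spectrograms(audio, sample_rate=96000, nperseg=256, noverlap=128):
    """Turn a stereo recording (shape: [num_samples, 2]) into a pair of
    spectrograms, one per microphone channel. All parameters here are
    placeholder values chosen for illustration."""
    specs = []
    for channel in range(2):
        _, _, Sxx = spectrogram(audio[:, channel], fs=sample_rate,
                                nperseg=nperseg, noverlap=noverlap)
        # Log-scale the power so quiet echoes aren't drowned out by the chirp itself.
        specs.append(np.log1p(Sxx))
    # Stack the two channels so each training example is a single array.
    return np.stack(specs, axis=-1)   # shape: (freq_bins, time_frames, 2)
```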
Before building the neural network, there was a bit of data preprocessing to do. The microphones’ range of “hearing” is limited; moments where the microphones were in front of objects that were too close or too far away produced bad data (a depth of zero) that needed to be filtered out, because we don’t want the neural network to learn from these misleading values during training. I wrote a piece of code that removes any sample from the training set that contains zeros; approximately 22% of the samples were dropped this way. One challenge of building this neural network is having enough data to train on, so we want to keep as much of the data as possible. Dr. Sprague gave me a script that his students worked on last semester that calculates an adjusted mean squared error, which may reduce the number of samples we have to drop without misleading the neural network’s training. One thing I need to do next week is compare the network’s performance under the “zero trick” and the “mean squared error trick” to see which produces better accuracy overall.
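For reference, here is a sketch of the “zero trick” filtering, along with my guess at what an adjusted mean squared error might look like (a masked MSE that simply ignores zero-depth pixels). The array shapes and the masked-MSE form are my own assumptions, not the students’ actual script.

```python
import numpy as np

def drop_samples_with_zeros(spectrograms, depth_maps):
    """Keep only the samples whose depth maps contain no zeros (the "zero trick").
    `spectrograms` has shape (N, ...) and `depth_maps` has shape (N, H, W);
    the exact shapes are assumptions for illustration."""
    keep = ~np.any(depth_maps == 0, axis=(1, 2))
    return spectrograms[keep], depth_maps[keep]

def masked_mse(y_true, y_pred):
    """One plausible form of an "adjusted mean squared error": average the error
    only over pixels with a valid (nonzero) depth, so bad readings neither have
    to be dropped nor mislead training. This is a guess at the idea, not the
    actual script."""
    mask = (y_true != 0).astype(np.float32)
    squared_error = mask * (y_pred - y_true) ** 2
    return squared_error.sum() / np.maximum(mask.sum(), 1.0)
```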
The depth maps that the recording device produced are 480×640 resolution. Right now it is unrealistic to expect this neural network to make predictions at that resolution with high accuracy, so I have downsized the depth maps to 12×16 with the hope that the network will have an easier time learning. If the network proves very capable of making correct predictions at this resolution, we will keep increasing the resolution until we find its limit. I downsized the maps by taking the average depth value within non-overlapping 40×40 windows across the original maps, so each downsized map contains 192 depths that the neural network will attempt to learn. Below is an example of one sample. From left to right: the original depth map, a downsized version using 20×20 windows, a downsized version using 40×40 windows, and an interpolated version of the third image (which the network will try to predict).
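The downsizing itself is just block averaging; here is a minimal sketch, assuming the depth maps are NumPy arrays:

```python
import numpy as np

def downsize_depth_map(depth_map, window=40):
    """Average non-overlapping window×window blocks of a 480×640 depth map.
    With window=40 this yields a 12×16 map (192 values); window=20 gives 24×32."""
    h, w = depth_map.shape
    assert h % window == 0 and w % window == 0
    blocks = depth_map.reshape(h // window, window, w // window, window)
    return blocks.mean(axis=(1, 3))

# Example: a full-resolution map shrinks to the 12×16 target the network learns.
full_res = np.random.rand(480, 640)
small = downsize_depth_map(full_res, window=40)   # shape (12, 16)
```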
I also split the data set into training and test sets (80% and 20% of the data, respectively). The data set is composed of the samples we deem acceptable from the locations Dr. Sprague recorded in, and the test set consists of the middle portion of the samples from each location. Because Dr. Sprague moved the recorder around smoothly (rather than making sudden, random turns), samples recorded near each other are very similar, so shuffling the data randomly and then taking 20% for testing would most likely create test inputs that are too similar to the training inputs. By pulling out contiguous sections of samples instead, we hope to reduce that bias in the network evaluation.
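A sketch of that splitting scheme, assuming the samples are already grouped by recording location (the grouping and the exact placement of the test block are my own simplifications):

```python
import numpy as np

def split_by_location(samples_per_location, test_fraction=0.2):
    """For each location's consecutive run of samples, carve out the middle
    `test_fraction` as the test set and keep the rest for training, so that
    neighbouring (near-duplicate) samples don't straddle the split.
    `samples_per_location` is a list of arrays, one per recording location."""
    train, test = [], []
    for samples in samples_per_location:
        n = len(samples)
        n_test = int(round(n * test_fraction))
        start = (n - n_test) // 2            # centre the test block
        test.append(samples[start:start + n_test])
        train.append(np.concatenate([samples[:start], samples[start + n_test:]]))
    return np.concatenate(train), np.concatenate(test)
```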
We have a lot of big data and it is freaking the GPU out. That is, the program I’ve written chokes before the neural network can even begin training. Buried in many print statements and obscure error messages, I found evidence to support the hypothesis that the current architecture of the neural network is too computationally expensive…the GPU runs out of memory every time it tries to build the network.
The goal for tomorrow (Friday) is to have a working neural network with passable accuracy. The first step is to fine-tune this convolutional neural network’s layout so that the network can be trained and make predictions (a rough sketch of the kind of layout I have in mind follows the questions below). Then, there are a lot of questions to be answered, such as:
What form should the network’s input take (raw audio samples, spectrograms, something else)?
What is the best way to deal with “bad data”?
What are the best parameter values for the convolutional layers?
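For what it’s worth, here is a rough sketch of the kind of slimmed-down layout I have in mind, written with Keras purely for illustration; the input shape, filter counts, and strides are placeholders, not the network we will actually end up with.

```python
import tensorflow as tf

def build_small_cnn(input_shape=(128, 64, 2)):
    """The two spectrograms (one per microphone) are stacked like image
    channels; the 192 outputs correspond to the 12×16 downsized depth map.
    All shapes and layer sizes here are placeholders."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        # Strided convolutions shrink the feature maps quickly, which helps
        # keep the memory footprint on the GPU small.
        tf.keras.layers.Conv2D(16, 3, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(192),   # one depth value per 12×16 cell
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```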
Stay tuned.