The big question I tried to answer this week: Can raw audio samples compete with spectrograms as inputs to convolutional neural networks (CNNs)?
Although I don’t yet have the conclusive data needed to pick a team, I lean toward raw audio, which trains faster by nature of its structure. I’m also grateful for this project’s incremental design: I can shave off computation time by saving the important structures (downsized depth maps, input spectrograms, split training and test sets, model parameters) as h5 files and loading them each of the many times I train the NN. This way, I can rewrite any file and change any piece of my program without rerunning the entire data preprocessing stage every single time.
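A minimal sketch of this kind of caching, assuming the `h5py` package and NumPy arrays. The helper names (`save_arrays`, `load_arrays`, `load_or_build`) and the dataset names are hypothetical, not taken from the project's code:

```python
import os

import h5py
import numpy as np


def save_arrays(path, arrays):
    """Write each named array to its own dataset in a single HDF5 file."""
    with h5py.File(path, "w") as f:
        for name, data in arrays.items():
            f.create_dataset(name, data=np.asarray(data))


def load_arrays(path):
    """Read every top-level dataset back into memory as NumPy arrays."""
    with h5py.File(path, "r") as f:
        return {name: f[name][()] for name in f.keys()}


def load_or_build(path, build_fn):
    """Return cached arrays if the file exists; otherwise build and cache them.

    build_fn is a hypothetical stand-in for the expensive preprocessing step
    and must return a dict mapping names to arrays.
    """
    if os.path.exists(path):
        return load_arrays(path)
    arrays = build_fn()
    save_arrays(path, arrays)
    return arrays
```

With something like this, a call such as `load_or_build("depth_6x8.h5", make_depth_maps)` (both names are placeholders) only pays the preprocessing cost the first time; deleting the file forces a rebuild.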
This week, I fully realized how expansive the set of hyper-parameters for CNNs, and NNs in general, really is. After tweaking the program so that spectrograms and audio samples can both be fed in as inputs, I stared at my code for a bit and drew out some sub-par decision trees that highlight the parameters I care about. These include the size of the convolutional kernels, the number of nodes in the dense layers, the loss function, the activation functions, and the number of epochs for the training process. Needless to say, the decision trees were not pleasant to look at.
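One way to see why those trees get ugly is to write the parameters down as a grid and count the combinations. The sketch below uses only the standard library. The candidate values are made up for illustration, apart from the 50 and 300 epochs mentioned in this post:

```python
from itertools import product

# Hypothetical candidate values for each hyper-parameter listed above.
search_space = {
    "kernel_size": [3, 5, 9],
    "dense_units": [64, 256],
    "loss": ["mse", "mae"],
    "activation": ["relu", "tanh"],
    "epochs": [50, 300],
}


def iter_configs(space):
    """Yield one dict per combination of hyper-parameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 3 * 2 * 2 * 2 * 2 = 48 combinations
```

Even two or three options per parameter multiply out to dozens of training runs, which is a good argument for changing one parameter at a time.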
I focused mostly on getting the audio sample NN to make acceptable predictions this week. Here is an example of one test result:
It doesn’t look too bad, but don’t get too excited yet: the NN makes the same prediction for every test input. It also fares poorly when the training set is plugged in as the test set. The NN has already seen every training input and its respective “answer,” so it should recognize these inputs and make acceptable predictions. It seems that the predicted depth map has converged to the average depth map of all the training inputs. I hypothesized that the NN wasn’t given enough time to train, so I increased the number of epochs from 50 to 300. The mean squared error for this particular model still leveled off at 0.29.
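A quick way to test the "converged to the average" suspicion is to compare the model against a baseline that always predicts the mean training depth map, and to measure how much the predictions vary from input to input. This is a sketch assuming NumPy arrays of shape `(n_samples, height, width)`; the function names are hypothetical:

```python
import numpy as np


def mean_map_baseline_mse(train_targets, eval_targets):
    """MSE of always predicting the average training depth map."""
    mean_map = train_targets.mean(axis=0)
    return float(np.mean((eval_targets - mean_map) ** 2))


def prediction_spread(predictions):
    """Largest per-pixel standard deviation across predictions.

    A value near zero means the network outputs (almost) the same
    map regardless of its input.
    """
    return float(predictions.std(axis=0).max())
```

If the trained model's MSE sits close to `mean_map_baseline_mse(...)` and `prediction_spread(...)` is near zero, the network is effectively ignoring its input, and more epochs alone are unlikely to help.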
To look forward to: lots of hyper-parameter tuning! I’ve been working with two convolutional layers followed by two dense layers, but I should start varying that aspect of the architecture too. Tomorrow, I am going to start by examining the performance of various optimizers. I also created a new downsized depth set; instead of 12×16, this set is 6×8. I’m hoping to see better predictions from the NN given these lower-resolution targets.
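For reference, one simple way to get from 12×16 to 6×8 is to average each 2×2 block of pixels. This is only an assumption about the downsizing method; interpolation or max-pooling would be equally plausible. The function name is hypothetical:

```python
import numpy as np


def block_mean_downsample(depth_maps, factor=2):
    """Downsample (n, h, w) depth maps by averaging factor x factor blocks."""
    n, h, w = depth_maps.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    # Split each spatial axis into (blocks, within-block) and average the latter.
    blocks = depth_maps.reshape(n, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))


maps_12x16 = np.arange(12 * 16, dtype=float).reshape(1, 12, 16)
maps_6x8 = block_mean_downsample(maps_12x16)
print(maps_6x8.shape)  # (1, 6, 8)
```

Halving each dimension cuts the number of output values the network must predict from 192 to 48 per depth map.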