Transfer Learning for Sound Classification

Identifying sounds in the environment around us is something we as humans do quickly and easily every day, yet it is fairly difficult for computers. If computers could accurately identify sounds, there would be many applications in robotics, security, and other areas.

Recently there have been many developments related to computer vision, through advances in deep learning and the creation of large datasets such as ImageNet for training deep learning models.

The area of auditory perception, however, hasn’t quite caught up to computer vision. Google recently released AudioSet, a large-scale dataset of annotated sounds, so hopefully we’ll start to see major improvements in sound classification and similar areas.

In this post, however, we will be looking at how to leverage the recent advances in image classification to improve sound classification.

Classifying Sounds in an Urban Environment

Our goal is to classify different sounds in the environment using machine learning. For this task we will be using a dataset called UrbanSound8K. This dataset contains 8732 audio files. There are 10 different types of sounds:

  • Air Conditioner
  • Car Horn
  • Children Playing
  • Dog Bark
  • Drilling
  • Engine Idling
  • Gun Shot
  • Jackhammer
  • Siren
  • Street Music

Each sound recording is ~4s in length. The dataset is organized into 10 folds. We will train on all of them, since the script we are going to use will automatically generate a validation set. This dataset is a nice size to start experimenting with, but ultimately I am hoping to train a model on AudioSet.


There are many different features we can train our model on. In the related field of speech recognition, Mel-frequency cepstral coefficients (MFCCs) are commonly used. The nice thing about MFCCs is that they are a very compact representation of the original audio, which is sampled at 16 kHz in most research datasets.

Recently, however, there has been a shift towards training models directly on the raw data. For example, DeepMind designed a convolutional architecture called WaveNet to generate audio. WaveNets are trained on the raw audio, and not only can they be used for generation, they can also be used for speech recognition and other classification tasks.

It would be nice to be able to train a model on more information than the MFCC features, but WaveNets can be computationally expensive to both train and run. What if there were a feature that retained lots of information about the original signal, but was also computationally cheap to train on?

This is where spectrograms are useful. In auditory research, a spectrogram is a graphical representation of audio that has frequency on the vertical axis and time on the horizontal axis, with colour as a third dimension representing the intensity of the sound at each time and frequency location.

For example, here is a spectrogram of a violin playing:
Figure: violin spectrogram (CC BY-SA 3.0)

In this spectrogram, we can see many frequencies that are multiples of the fundamental frequency of the note being played. These are called harmonics in music. The vertical lines throughout the spectrogram are the brief pauses between strokes of the bow on the violin. So it appears the spectrogram contains lots of information about the nature of different sounds.
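To make the idea of harmonics concrete, here is a minimal sketch that computes a log-magnitude spectrogram with NumPy and applies it to a synthetic note. The `spectrogram` helper, the frame length, the hop size, and the 220 Hz test tone are all assumptions chosen for illustration; this is not the script used in this post.

```python
import numpy as np


def spectrogram(signal, sample_rate, frame_len=1024, hop=256):
    """Return (freqs, times, log-magnitude spectrogram in dB).

    Rows are frequency bins and columns are time frames.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # FFT magnitude of each frame; transpose so frequency is vertical.
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    log_spec = 20 * np.log10(magnitude + 1e-10)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return freqs, times, log_spec


# Synthetic "note": a 220 Hz fundamental plus two weaker harmonics.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate  # one second of audio
note = (1.0 * np.sin(2 * np.pi * 220 * t)
        + 0.5 * np.sin(2 * np.pi * 440 * t)
        + 0.25 * np.sin(2 * np.pi * 660 * t))

freqs, times, spec = spectrogram(note, sample_rate)
average_db = spec.mean(axis=1)            # average intensity per frequency bin
strongest_freq = freqs[np.argmax(average_db)]
```

In this toy signal the strongest bin sits next to the 220 Hz fundamental, and the bins near 440 Hz and 660 Hz stand out from their neighbours, which is the same striped pattern visible in the violin example.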

The other nice thing about using the spectrogram is that we have now changed the problem into one of image classification, which has seen lots of breakthroughs recently.

Here is a script that will convert each wav file into a spectrogram. Each spectrogram is stored in a folder corresponding to its category.
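As a rough sketch of the folder layout, the helper below maps a wav file to an output path under its category. It relies on the UrbanSound8K naming scheme, in which file names take the form fsID-classID-occurrenceID-sliceID.wav. The function name `spectrogram_path`, the `spectrograms` output folder, and the `.jpg` extension are assumptions for illustration only.

```python
from pathlib import Path

# Class IDs in the order used by the UrbanSound8K file names.
CLASS_NAMES = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark",
    "drilling", "engine_idling", "gun_shot", "jackhammer",
    "siren", "street_music",
]


def spectrogram_path(wav_path, out_root="spectrograms"):
    """Map a wav file to <out_root>/<category>/<stem>.jpg.

    Assumes names of the form fsID-classID-occurrenceID-sliceID.wav.
    The output folder and extension are hypothetical choices.
    """
    stem = Path(wav_path).stem
    class_id = int(stem.split("-")[1])  # second field is the class ID
    return Path(out_root) / CLASS_NAMES[class_id] / f"{stem}.jpg"


example = spectrogram_path("UrbanSound8K/audio/fold5/100032-3-0-0.wav")
print(example.as_posix())
```

One folder per category is the layout that image retraining scripts typically expect, since the folder name doubles as the label.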

Using Convolutional Neural Networks

Now that the sounds are represented as images, we can classify them using a neural network. The neural network of choice for most image processing tasks is a Convolutional Neural Network (CNN).

The problem with using the UrbanSound8K dataset, however, is that it is fairly small for deep learning applications. If we were to train a CNN from scratch, it would probably overfit to the data: for example, it would memorize all the sounds of dogs barking in UrbanSound8K but would be unable to generalize to other dog barks in the real world. There is an example of using a CNN for this dataset on Aaqib Saeed’s blog here. We are going to take a different approach, however, and use transfer learning.

Transfer learning is where we take a neural network that has been trained on a similar dataset, and retrain the last few layers of the network for new categories. The idea is that the early layers of the network solve problems like edge detection and basic shape detection, and that these features will generalize to other categories. Specifically, Google has released a pretrained model called Inception, which has been trained to classify images from the ImageNet dataset. In fact, TensorFlow already has an example script for retraining Inception on new categories.
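The core idea can be sketched in a few lines of NumPy: keep the early layers fixed, and train only a new final layer on top of their outputs. Everything below is a toy stand-in and an assumption for illustration: a fixed random projection plays the role of the pretrained network, three synthetic clusters play the role of the new categories, and the learning rate and step count are arbitrary. It is not the Inception retraining script.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_features, n_classes = 8, 16, 3

# Stand-in for the pretrained layers: a fixed random projection plus ReLU.
# These weights are never updated, mirroring the frozen part of the network.
frozen_weights = rng.normal(size=(n_inputs, n_features)) / np.sqrt(n_inputs)


def frozen_features(x):
    return np.maximum(x @ frozen_weights, 0.0)


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy data: three well-separated clusters standing in for new categories.
centers = 2.0 * rng.normal(size=(n_classes, n_inputs))
labels = rng.integers(0, n_classes, size=300)
inputs = centers[labels] + 0.5 * rng.normal(size=(300, n_inputs))
train_x, val_x = inputs[:240], inputs[240:]
train_y, val_y = labels[:240], labels[240:]

# Only the new final layer (weights and bias) is trained.
w = np.zeros((n_features, n_classes))
b = np.zeros(n_classes)
train_feats = frozen_features(train_x)  # computed once and reused every step
one_hot = np.eye(n_classes)[train_y]

for step in range(500):
    probs = softmax(train_feats @ w + b)
    grad = (probs - one_hot) / len(train_y)  # gradient of mean cross-entropy
    w -= 0.02 * train_feats.T @ grad
    b -= 0.02 * grad.sum(axis=0)

val_pred = np.argmax(frozen_features(val_x) @ w + b, axis=1)
val_accuracy = float((val_pred == val_y).mean())
```

Because the frozen features never change, they only need to be computed once per example, which is a large part of why retraining the final layer is so much cheaper than training the whole network.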

To get started, we will adapt the example from the TensorFlow for Poets Google Codelab.

First, run this command to download the retraining script.

curl -O

Now we can run the script to retrain on our spectrograms.

python \
  --bottleneck_dir=bottlenecks \
  --how_many_training_steps=8000 \
  --model_dir=inception \
  --summaries_dir=training_summaries/basic \
  --output_graph=retrained_graph.pb \
  --output_labels=retrained_labels.txt \

In another terminal tab, you can run

tensorboard --logdir training_summaries

to start TensorBoard, which will let us watch the training progress and accuracy in our browser. After around 16k iterations the accuracy levels off at ~86% on the validation set. Not bad for a fairly naive approach to sound classification.


Classifying Sounds from the Microphone

Now that we have a model for classifying sounds, let’s apply it to classify sounds from a microphone. The TensorFlow retraining example has a script for labelling images.

I modified this script to label sounds from the microphone. First, the script streams audio from the mic using pyaudio, and uses the webrtcvad package to detect whether sound is present at the microphone. If a sound is present, it is recorded for 3 seconds, converted into a spectrogram, and finally labelled.
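The control flow of that loop can be sketched without any audio hardware. In the sketch below, a simple RMS energy threshold stands in for webrtcvad, and a list of synthetic frames stands in for the pyaudio stream. The names `is_sound` and `record_when_triggered`, the 30 ms frame length, and the threshold value are all assumptions for illustration, not the actual script.

```python
import numpy as np

SAMPLE_RATE = 16000
FRAME_MS = 30                                # frame length assumed here
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # samples per frame
RECORD_SECONDS = 3


def is_sound(frame, threshold=0.01):
    """Crude stand-in for a voice activity detector: RMS energy threshold."""
    return float(np.sqrt(np.mean(frame ** 2))) > threshold


def record_when_triggered(frames):
    """Wait for a frame containing sound, then collect RECORD_SECONDS of audio.

    `frames` is any iterator of sample arrays; in a real script these
    would come from the microphone stream. Returns the recorded samples,
    or None if the stream ends before enough audio is collected.
    """
    n_needed = RECORD_SECONDS * SAMPLE_RATE
    recorded, n_collected = [], 0
    triggered = False
    for frame in frames:
        if not triggered and not is_sound(frame):
            continue  # still waiting for sound
        triggered = True
        recorded.append(frame)
        n_collected += len(frame)
        if n_collected >= n_needed:
            return np.concatenate(recorded)[:n_needed]
    return None


# Simulated stream: a short silence followed by noise.
rng = np.random.default_rng(0)
silence = [np.zeros(FRAME_LEN) for _ in range(10)]
noise = [0.1 * rng.normal(size=FRAME_LEN) for _ in range(200)]
clip = record_when_triggered(iter(silence + noise))
```

The returned clip would then be converted into a spectrogram image and passed to the retrained model for labelling.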

The script was adapted from this gist for recording from the mic, and this gist for generating spectrograms using librosa, as well as the script in TensorFlow.

Next Steps

In this post we saw how to classify sounds by applying transfer learning from the image classification domain. There is definitely room for improvement by tweaking the parameters of the retraining, or by training a model from scratch on the spectrograms. I’m also hoping to train a model to classify sounds using a WaveNet next.

You can view the code for this tutorial here.
