Music Genre Classification with Python

Farzana Anjum
Nov 10, 2019 · 9 min read


A Guide to Analyzing Music/Audio Signals in Python

Music is like a mirror, and it tells people a lot about who you are and what you care about, whether you like it or not. We love to say “you are what you stream,” and that is so true.

As you can imagine, we invest a lot in our machine learning capabilities to predict people’s preferences and context, and of course, all the data we use to train the model is anonymized. We take in large amounts of anonymized training data to develop these models, test them out with different use cases, analyze the results, and use the learnings to improve those models.

Spotify, with a net worth of $26 billion, reigns over the music streaming market today. It currently has millions of songs in its database and claims to have the right music for everyone. Spotify’s Discover Weekly service has become a hit with millennials. Needless to say, Spotify has invested heavily in research to improve the way users find and listen to music. Machine learning is at the core of that research. From NLP to collaborative filtering to deep learning, Spotify uses them all. Songs are analyzed based on their digital signatures for factors including tempo, acoustics, energy, and danceability to answer that impossible old first-date query: what kind of music are you into?


Companies nowadays use music classification either to recommend music to their customers (such as Spotify and SoundCloud) or simply as a product (for example, Shazam). Determining music genres is the first step in that direction. Machine learning techniques have proved quite successful in extracting trends and patterns from large pools of data, and the same principles apply to music analysis.

In this article, we shall study how to analyze an audio/music signal in Python. We shall then utilize the skills learned to classify music clips into different genres.

Audio Processing With Python

Sound is represented in the form of an audio signal having parameters such as frequency, bandwidth, decibel, etc. A typical audio signal can be expressed as a function of Amplitude and Time.


These sounds are available in many formats, which makes it possible for the computer to read and analyze them. Some examples are:

  1. mp3 format
  2. WMA (Windows Media Audio) format
  3. wav (Waveform Audio File) format

Audio Libraries

Python has some great libraries for audio processing, like Librosa and PyAudio. There are also built-in modules for some basic audio functionality.

We will mainly use two libraries for audio acquisition and playback:

1. Librosa

It is a Python module to analyze audio signals in general but geared more towards music. It includes the nuts and bolts to build a MIR(Music information retrieval) system. It has been very well documented along with a lot of examples and tutorials.

For a more advanced introduction which describes the package design principles, please refer to the librosa paper at SciPy 2015.


To fuel more audio-decoding power, you can install ffmpeg which ships with many audio decoders.

2. IPython.display.Audio

IPython.display.Audio lets you play audio directly in a jupyter notebook.

Loading an audio file

Loading a file with librosa.load returns an audio time series as a NumPy array, with a default sampling rate (sr) of 22050 Hz, mono. We can change this behaviour by passing sr=44100 to resample at 44.1 kHz, or sr=None to disable resampling.

The sample rate is the number of samples of audio carried per second, measured in Hz or kHz.

Playing Audio

Using IPython.display.Audio to play the audio:
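A sketch using a synthetic tone in place of a loaded clip (Audio also accepts a file path or URL directly):

```python
import numpy as np
from IPython.display import Audio

sr = 22050
t = np.linspace(0, 2, 2 * sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440 * t)  # a 2-second 440 Hz test tone

widget = Audio(data=y, rate=sr)  # rendered as a playback widget in a notebook
# Audio("my_clip.wav")           # or pass a file path / URL instead
widget
```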

This returns an audio widget in the jupyter notebook as follows:

This widget won’t work here, but it will work in your notebooks. I have uploaded the same to SoundCloud so that we can listen to it.

You can even use an mp3 or a WMA format for the audio example.

Visualizing Audio


We can plot the audio array using librosa.display.waveplot:

Here, we have the plot of the amplitude envelope of a waveform.


A spectrogram is a visual representation of the spectrum of frequencies of sound or other signals as they vary with time. Spectrograms are sometimes called sonographs, voiceprints, or voicegrams. When the data is represented in a 3D plot, they may be called waterfalls. In 2-dimensional arrays, the first axis is frequency while the second axis is time.

We can display a spectrogram using librosa.display.specshow:

The vertical axis shows frequencies (from 0 to 10kHz), and the horizontal axis shows the time of the clip. Since we see that all action is taking place at the bottom of the spectrum, we can convert the frequency axis to a logarithmic one.

Writing Audio

librosa.output.write_wav saves a NumPy array to a WAV file.

Creating an audio signal

Let us now create an audio signal at 220 Hz. An audio signal is a NumPy array, so we shall create one and pass it into the Audio function.
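A sketch of generating the tone with NumPy and handing it to IPython.display.Audio:

```python
import numpy as np
from IPython.display import Audio

sr = 22050           # sampling rate
T = 5.0              # duration in seconds
t = np.linspace(0, T, int(T * sr), endpoint=False)
x = 0.5 * np.sin(2 * np.pi * 220 * t)  # pure 220 Hz sine tone

Audio(x, rate=sr)    # play it in the notebook
```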

So here it is: the first sound signal created by you. 🙌

Feature extraction

Every audio signal consists of many features. However, we must extract the characteristics that are relevant to the problem we are trying to solve. The process of extracting features to use for analysis is called feature extraction. Let us study a few of the features in detail.

Zero Crossing Rate

The zero crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to negative or back. This feature has been used heavily in both speech recognition and music information retrieval. It usually has higher values for highly percussive sounds like those in metal and rock.

Let us calculate the zero crossing rate for our example audio clip.

There appear to be 6 zero crossings. Let’s verify with librosa.

Spectral Centroid

It indicates where the ”centre of mass” of a sound is located and is calculated as the weighted mean of the frequencies present in the sound. Consider two songs, one from the blues genre and the other from metal. Compared to the blues song, which sounds much the same throughout its length, the metal song has more high frequencies towards its end. So the spectral centroid for the blues song will lie somewhere near the middle of its spectrum, while that for the metal song will be towards its end.

librosa.feature.spectral_centroid computes the spectral centroid for each frame in a signal:

There is a rise in the spectral centroid towards the end.

Spectral Rolloff

It is a measure of the shape of the signal. It represents the frequency below which a specified percentage of the total spectral energy, e.g. 85%, lies.

librosa.feature.spectral_rolloff computes the rolloff frequency for each frame in a signal:

Mel-Frequency Cepstral Coefficients

The Mel frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of a spectral envelope. It models the characteristics of the human voice.

Let’s work with a simple loop wave this time.

librosa.feature.mfcc computes MFCCs across an audio signal:

Here, mfcc computed 20 MFCCs over 97 frames.

We can also perform feature scaling such that each coefficient dimension has zero mean and unit variance:

Chroma Frequencies

Chroma features are an interesting and powerful representation for music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) of the musical octave.

librosa.feature.chroma_stft is used for the computation:

Case Study: Classify songs into different genres.

Having had an overview of acoustic signals, their features, and the feature extraction process, it is time to use our newly developed skills on a machine learning problem.


In this section, we will try to model a classifier to classify songs into different genres. Let us assume a scenario in which, for some reason, we find a bunch of randomly named MP3 files on our hard disk which we assume contain music. Our task is to sort them by music genre into different folders such as jazz, classical, country, pop, rock, and metal.


We will be using the famous GTZAN dataset for our case study. This dataset was used for the well-known genre classification paper “Musical genre classification of audio signals” by G. Tzanetakis and P. Cook, IEEE Transactions on Speech and Audio Processing, 2002.

The dataset consists of 1000 audio tracks, each 30 seconds long. It contains 10 genres: blues, classical, country, disco, hiphop, jazz, reggae, rock, metal, and pop. Each genre consists of 100 sound clips.

Preprocessing the Data

Before training the classification model, we have to transform the raw audio samples into more meaningful representations. The audio clips need to be converted from .au format to .wav format to make them compatible with Python’s wave module for reading audio files. I used the open-source SoX module for the conversion. Here is a handy cheat sheet for SoX conversion.


  • Feature Extraction

We then need to extract meaningful features from the audio files. To classify our audio clips, we will choose five features: Mel-Frequency Cepstral Coefficients, Spectral Centroid, Zero Crossing Rate, Chroma Frequencies, and Spectral Rolloff. All the features are then appended into a .csv file so that classification algorithms can be used.

  • Classification

Once the features have been extracted, we can use existing classification algorithms to classify the songs into different genres. You can either use the spectrogram images directly for classification or can extract the features and use the classification models on them.

Either way, a lot of experimentation can be done in terms of models. You are free to experiment and improve your results. Using a CNN model (on the spectrogram images) gives better accuracy and it’s worth a try.
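As a sketch of the classification step, here a k-nearest-neighbours model from scikit-learn (an assumed dependency) is fit on a random stand-in feature matrix; in practice you would load the feature vectors from the .csv file built above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 100 "clips" x 25 features, two well-separated fake "genres"
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 25)),   # e.g. "blues"
               rng.normal(3, 1, (50, 25))])  # e.g. "metal"
labels = np.array(["blues"] * 50 + ["metal"] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(acc)
```

Any other classifier (SVM, random forest, or a CNN on the spectrogram images) slots into the same train/score pattern.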

Next Steps

Music Genre Classification is one of the many branches of Music Information Retrieval. From here you can perform other tasks on musical data, such as beat tracking, music generation, recommender systems, track separation, and instrument recognition. Music analysis is a diverse field and an interesting one. A music session somehow represents a moment for the user. Finding these moments and describing them is an interesting challenge in the field of data science.



Farzana Anjum

Machine Learning | Computer Vision | Deep Learning | AI | Blogger