Content-Based Unsupervised Music Recommendation Using a Convolutional Variational AutoEncoder

Overview

YouTube is the most popular music consumption service in the world, and until a few years ago it was my platform of choice too. Over time, though, I realized that YouTube's recommendation system was biased towards suggesting songs that already had a large number of views. Eventually, once I had enjoyed all these chart-topping singles, my playlist would become saturated. Every now and then, be it while watching the end credits of a movie or strolling through a shopping mall, I would come across an engrossing song from an unknown artist and wonder why my recommendations failed to account for the works of such talented yet obscure artists.

The heart of the problem lies in the fact that most major recommendation systems are based on collaborative filtering, which is inherently powered by the relationship between your listening/rating history and that of other users on the platform. This leads to the Cold Start problem: it becomes tough for newer artists to get their work featured to their intended audience, while for new users the system may struggle to generate suggestions given the lack of history.

One way to overcome this issue is to use Content Based Recommendations, where we extract features directly from the source (in our case, the music itself) and generate suggestions based on them. In this article I detail my approach to this problem: an unsupervised CVAE model that automatically extracts contextual features modelling the intrinsic properties of each song.

CURVE

I believed that the only thing my system needed to model was how similar a group of songs were to each other, with similarity defined by how alike they sounded (or were perceived). When I initially began, there were multiple ways to approach this problem, but I specifically wanted to develop a method that was inherently free from any bias. While content based recommendations removed the bias induced by other users, there was still another aspect which had to be accounted for.

It was tempting to develop a supervised method, given that most prior work had been in that direction and supervised approaches generally outperform unsupervised ones. But assigning genre/class labels to each song would have introduced a different form of bias into the system. Given the diversity of music being produced today, it is hard to objectively assign a categorical value to a song; my Rock may not be the same as your Rock. Songs often lie on a spectrum, where tunes conventionally placed in one genre may exhibit properties from a host of genres. For these reasons I decided in favor of an unsupervised method.

Dataset

I used Spotify’s public API to download songs from a range of playlists belonging to different genres, so that the dataset was well represented. I downloaded around 1100 songs in mp3 format, each a 30-second preview of the original track. Each song was tagged with the genre of the playlist from which it was extracted. Although these tags were not used while training the model, they aided a more intuitive understanding of the final visualisations.

The dataset was created from Spotify playlists like the ones above
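To make the collection step concrete, here is a minimal sketch of how such a downloader could look using the spotipy client. The playlist IDs, directory layout, and file-naming scheme are my own illustrative assumptions, not the exact script used for this project.

```python
import os
import requests
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Hypothetical genre -> playlist-ID mapping; substitute real Spotify playlist IDs.
PLAYLISTS = {
    "rock": "<rock-playlist-id>",
    "jazz": "<jazz-playlist-id>",
}

# Credentials are read from the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET env vars.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

os.makedirs("data", exist_ok=True)

for genre, playlist_id in PLAYLISTS.items():
    for item in sp.playlist_items(playlist_id)["items"]:
        track = item["track"]
        preview = track and track.get("preview_url")  # 30-second mp3; may be None
        if not preview:
            continue
        # Tag the file name with the playlist's genre; used only for visualisation.
        fname = os.path.join("data", f"{genre}_{track['id']}.mp3")
        with open(fname, "wb") as f:
            f.write(requests.get(preview).content)
```

Note that not every track exposes a preview_url, so a real run yields somewhat fewer files than tracks.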

To effectively convert this into the computer vision domain, I transformed all the audio files into their mel-spectrogram representations and used these images in training.
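As a rough sketch of this conversion step (assuming librosa for audio processing; the sample rate and mel-band count below are typical defaults, not necessarily the values used in the project):

```python
import numpy as np
import librosa

def audio_to_melspec(path, sr=22050, n_mels=128):
    """Load a 30-second preview and convert it to a log-scaled mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    # Power mel-spectrogram, then convert to decibels so the dynamic
    # range resembles that of a natural image.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)
    # Normalise to [0, 1] so the result can be treated as a grayscale image.
    return (S_db - S_db.min()) / (S_db.max() - S_db.min())
```

The normalised 2D array can then be fed to the convolutional model exactly as one would feed an image.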

VAE Model

The task of this model, in simple terms, was to accept the spectrograms as input, compress them into an N-dimensional embedding, and then recreate them as well as it could from this compressed representation. For the purposes of the recommendation model, we were not interested in the reconstructed output but rather in the N-dimensional embedding/latent vector representing each song. The interactive graph you see on the home page is these N-dim vectors scaled down to 2 dimensions using PCA (I tried t-SNE but good ol’ PCA worked the best). What these N-dim vectors model exactly is very hard for us to interpret, but they arrange the songs on the graph in such a way that semantic context is preserved, i.e. songs which are similar to each other based on their spectrograms are placed closer to each other.

I won’t go into the specifics of what a VAE model is exactly, but if you are interested you can read more about it here. What I will explain is why I decided to use the VAE architecture as opposed to other unsupervised techniques like a vanilla autoencoder. There are 2 clear advantages of using a VAE as opposed to its more conventional counterpart:

1. The KL-divergence term regularizes the latent space, making it continuous and smooth, so distances between embeddings are actually meaningful, which is exactly what a similarity-based recommender needs.

2. Because the encoder outputs a distribution rather than a fixed point, the model cannot simply memorize each input and is instead pushed to learn generalizable features.
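For concreteness, here is a minimal sketch of what such a convolutional VAE might look like in PyTorch. The layer sizes, latent dimension, and 128x128 input shape are illustrative assumptions, not the exact architecture used in this project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: assumes 1 x 128 x 128 mel-spectrogram "images".
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1),    # -> 32 x 64 x 64
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # -> 64 x 32 x 32
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # -> 128 x 16 x 16
            nn.ReLU(),
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128 * 16 * 16, latent_dim)
        self.fc_logvar = nn.Linear(128 * 16 * 16, latent_dim)
        # Decoder mirrors the encoder with transposed convolutions.
        self.fc_dec = nn.Linear(latent_dim, 128 * 16 * 16)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.enc(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_hat = self.dec(self.fc_dec(z).view(-1, 128, 16, 16))
        return x_hat, mu, logvar

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Reconstruction term + KL divergence to the unit-Gaussian prior.
    recon = F.binary_cross_entropy(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# At inference time, the song embedding is simply the mean vector:
# mu, _ = model.encode(spectrogram_batch)
```

Using the mean vector (rather than a sample) as the embedding at inference time gives a deterministic representation for each song.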

Results

Once trained, the model converged with a final total test loss of 0.7. We extracted the intermediate latent vectors, reduced them to two dimensions using PCA, and visualized them. Upon analyzing the graph by listening to the audio samples, it is evident that the arrangement of the songs captures their intrinsic similarities, a testament to the complex features encapsulated by these vectors.
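A sketch of this reduction-and-plot step, assuming scikit-learn and matplotlib; the latent_vectors and genre_tags arrays below are random stand-ins for the embeddings and playlist tags produced earlier.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-ins for the real data: replace with the model's latent vectors
# and the genre tags recorded at download time.
latent_vectors = np.random.randn(1100, 32)
genre_tags = np.random.choice(["rock", "jazz", "hiphop", "classical"], size=1100)

# Project the N-dim embeddings down to 2 dimensions.
coords = PCA(n_components=2).fit_transform(latent_vectors)

# Colour each point by its playlist genre (used only for visualisation).
for genre in sorted(set(genre_tags)):
    idx = np.where(genre_tags == genre)[0]
    plt.scatter(coords[idx, 0], coords[idx, 1], s=10, label=genre)
plt.legend()
plt.title("Song embeddings reduced to 2D with PCA")
plt.show()
```

The same latent vectors can also drive recommendations directly, e.g. by returning each song's nearest neighbours in the N-dimensional space.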

Analysis

Upon exploring the visualisation of the embeddings, there are quite a few interesting observations:

Notes and FAQs