The Playlist Alchemist: Using Spotify’s Million Playlist Dataset to Build Your Perfect Mixtape

Saurav Shrivastava
Apr 21, 2023

Image: Spotify R&D | Research

Spotify is one of the most popular digital music streaming platforms. It has over 489 million active users and is available in over 184 countries. With such a vast user base, Spotify generates an enormous amount of data every day. In 2018, Spotify released a dataset of one million user-created playlists, known as the “Million Playlist Dataset” (MPD). In this post, I’ll explain why the dataset matters for digital music streaming and demonstrate how to use it to build a playlist recommendation system.

Importance of the Million Playlist Dataset

The Million Playlist Dataset is an extensive collection of playlists created by Spotify users. For each playlist it records the title, some playlist-level metadata, and the list of tracks, with the track, artist, and album name for every entry. It does not contain individual listening histories; what it captures is which songs users choose to group together. That makes it valuable for researchers in digital music streaming, who can use it to study how people curate music, identify popular songs, artists, and albums, and analyse how playlists are put together.

The MPD is also useful for developers who are building music-related applications. For example, a developer can use the MPD to build a playlist recommendation system that suggests songs based on the tracks already in a playlist. Additionally, the MPD can be used to train machine learning models that predict which tracks are likely to belong together.

Building a Playlist Recommendation System

In this section, I’ll demonstrate how to use the Million Playlist Dataset to build a playlist recommendation system. I’ll use Python with the Pandas library to manipulate the data and the Scikit-learn library to build a machine learning model.

Step 1: Download the Million Playlist Dataset

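The Million Playlist Dataset is distributed by Spotify for research use (at the time of writing, through the dataset’s challenge page on AIcrowd). The dataset is in JSON format and is split into 1,000 slice files, each holding 1,000 playlists (for example, mpd.slice.0-999.json). Each file is a single JSON object with an “info” header and a “playlists” list; every playlist carries its own metadata and a list of tracks with track, artist, and album details.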
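As a rough picture of what one slice file looks like, here is a trimmed, made-up example parsed with the standard library. The values are invented, and the field names beyond those used later in this post (track_uri, artist_name, album_name) are my assumption about the layout, so check the README shipped with the dataset before relying on them.

```python
import json

# Hypothetical, heavily trimmed stand-in for one slice file; all values are made up.
raw = """
{
  "info": {"slice": "0-999"},
  "playlists": [
    {
      "name": "road trip",
      "pid": 0,
      "num_tracks": 2,
      "tracks": [
        {"pos": 0, "track_name": "Song A", "artist_name": "Artist X",
         "album_name": "Album P", "track_uri": "spotify:track:aaa"},
        {"pos": 1, "track_name": "Song B", "artist_name": "Artist Y",
         "album_name": "Album Q", "track_uri": "spotify:track:bbb"}
      ]
    }
  ]
}
"""

# The whole file is one JSON object, so it is parsed in a single call.
slice_data = json.loads(raw)
print(sorted(slice_data.keys()))

# Each playlist is a dict whose "tracks" entry is a list of track dicts.
first = slice_data["playlists"][0]
print(first["name"], first["num_tracks"])
print([t["track_name"] for t in first["tracks"]])
```

The point to take away is the nesting: the playlists sit under one top-level key, and the tracks sit inside each playlist.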

Step 2: Load the Dataset into a Pandas Dataframe

Once you have downloaded the dataset, you can load it into a Pandas dataframe using the following code:

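import json
import pandas as pd

# Each slice file is a single JSON object, so parse it in one call
with open('mpd.slice.0-999.json', 'r') as f:
    mpd_slice = json.load(f)

# One row per playlist; the 'tracks' column holds a list of track dicts
df = pd.DataFrame(mpd_slice['playlists'])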

This code will load the data from the first 1,000 playlists into a Pandas dataframe. You can change the filename to load a different subset of the data.

Step 3: Exploratory Data Analysis

Before building the playlist recommendation system, let’s take a look at the data. We can use Pandas to perform some basic exploratory data analysis:

# Number of playlists in the dataset
print('Number of playlists:', len(df))
# Flatten to one track dict per row (the index keeps the playlist's row label)
tracks = df['tracks'].explode()
# Most common tracks (dicts are unhashable, so count by track name)
track_names = tracks.apply(lambda x: x['track_name'])
most_common_tracks = track_names.value_counts().head(10)
print('Most common tracks:\n', most_common_tracks)
# Most common artists
artists = tracks.apply(lambda x: x['artist_name'])
most_common_artists = artists.value_counts().head(10)
print('Most common artists:\n', most_common_artists)
# Most common albums
albums = tracks.apply(lambda x: x['album_name'])
most_common_albums = albums.value_counts().head(10)
print('Most common albums:\n', most_common_albums)

This code will print out the number of playlists in the dataset and the most common tracks, artists, and albums. This gives a sense of which songs and artists are popular in the loaded playlists; with a single file that is only 1,000 playlists, so treat the rankings as a sample rather than a picture of all Spotify users.

Step 4: Feature Engineering

To build a playlist recommendation system, we need to extract features from the data that we can use to train a machine learning model. One common approach is to use the “bag of words” model, where each playlist is represented by a vector with one entry per song, indicating whether or not that song appears in the playlist. We can create this representation using the Pandas “get_dummies” function:

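# Convert tracks to bag of words representation (one row per playlist)
track_uris = tracks.apply(lambda x: x['track_uri'])
# Sum the one-hot rows per playlist, then cap at 1 so repeated tracks stay binary
tracks_dummies = pd.get_dummies(track_uris).groupby(level=0).sum().clip(upper=1)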

This code will create a matrix where each row represents a playlist, and each column represents a song. The values in the matrix indicate whether or not a song appears in a particular playlist.
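To make the shape of this matrix concrete, here is a toy version built with plain Python. The three playlists and their track URIs are hypothetical, and this is only an illustration of the representation, not a replacement for the Pandas code.

```python
# Toy playlists with hypothetical track URIs (not real MPD data).
playlists = {
    0: ["spotify:track:aaa", "spotify:track:bbb"],
    1: ["spotify:track:bbb", "spotify:track:ccc", "spotify:track:bbb"],
    2: ["spotify:track:aaa", "spotify:track:ccc"],
}

# Column order: every distinct track, sorted so the layout is reproducible.
vocabulary = sorted({uri for uris in playlists.values() for uri in uris})

# One binary row per playlist: 1 if the track appears at least once, else 0.
matrix = {}
for pid, uris in playlists.items():
    present = set(uris)
    matrix[pid] = [int(uri in present) for uri in vocabulary]

print(vocabulary)
for pid, row in matrix.items():
    print(pid, row)
```

Note that playlist 1 contains the same track twice, yet its row still holds a 1 for it, which is what "whether or not a song appears" means here. With many playlists the matrix becomes very wide and mostly zeros, so a sparse representation is worth considering beyond a single slice.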

Step 5: Train the Machine Learning Model

Now that we have our features, we can train a machine learning model to make playlist recommendations. I’ll use the K-nearest neighbors (KNN) approach, which is a simple but effective way to recommend items based on similarity. Scikit-learn provides it through the NearestNeighbors class, which does not need labels; it just indexes the playlist vectors so that the closest ones can be looked up later:

from sklearn.neighbors import NearestNeighbors
# Train KNN model
model = NearestNeighbors(metric='cosine')
model.fit(tracks_dummies)

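This code will fit a KNN model using the cosine metric, which Scikit-learn computes as a distance (1 minus the cosine similarity), so smaller values mean more similar playlists. The model will be able to find the playlists that are most similar to a given playlist based on the songs they contain.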
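For binary playlist vectors the cosine similarity has a simple form: the number of shared tracks divided by the square root of the product of the two playlist sizes. The sketch below works this out by hand for two hypothetical playlists represented as sets of track IDs; the function name is my own, not part of Scikit-learn.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two playlists given as sets of track IDs."""
    if not a or not b:
        return 0.0  # convention chosen here for empty playlists
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical playlists sharing two of their three tracks.
p1 = {"t1", "t2", "t3"}
p2 = {"t2", "t3", "t4"}

sim = cosine_similarity(p1, p2)
print(round(sim, 3))      # 0.667 -> similarity
print(round(1 - sim, 3))  # 0.333 -> the corresponding cosine distance
```

Two playlists with no tracks in common get a similarity of 0 (distance 1), and identical playlists get a similarity of 1 (distance 0).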

Step 6: Make Playlist Recommendations

Now that we have trained our model, we can use it to make playlist recommendations. To do this, I’ll select a random playlist from the dataset and find the playlists that are most similar to it using the KNN model:

import numpy as np
# Select a random playlist (by row position)
playlist_index = np.random.choice(len(df))
playlist_tracks = df.iloc[playlist_index]['tracks']
print('Playlist tracks:\n', playlist_tracks)
# Find similar playlists; ask for 11 because the closest match is normally the query itself
playlist_vector = tracks_dummies.iloc[[playlist_index]]
distances, indices = model.kneighbors(playlist_vector, n_neighbors=11)
# Print recommended playlists, skipping the first neighbor (the query playlist)
print('\nRecommended playlists:')
for i in range(1, len(indices[0])):
    recommended_tracks = df.iloc[indices[0][i]]['tracks']
    print('Playlist {}: {}'.format(i, recommended_tracks))

This code will select a random playlist from the dataset, print out the songs in the playlist, and then find the 10 most similar playlists using the KNN model. It will then print out the songs in each of the recommended playlists.
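One possible next step, going slightly beyond the code above, is to turn the similar playlists into song suggestions: count how often each track appears across the neighboring playlists, ignore tracks the chosen playlist already has, and keep the most frequent ones. The sketch below uses plain Python; recommend_tracks and the sample track IDs are hypothetical names of my own.

```python
from collections import Counter

def recommend_tracks(seed_tracks, neighbor_playlists, top_n=5):
    """Rank tracks by how many neighboring playlists contain them."""
    seed = set(seed_tracks)
    counts = Counter()
    for tracks in neighbor_playlists:
        # Count each track once per neighbor and skip tracks already in the seed.
        counts.update(set(tracks) - seed)
    return [track for track, _ in counts.most_common(top_n)]

# Hypothetical seed playlist and three neighbors, as lists of track IDs.
seed = ["t1", "t2"]
neighbors = [["t1", "t3", "t4"], ["t2", "t3"], ["t3", "t4", "t5"]]

print(recommend_tracks(seed, neighbors, top_n=2))  # ['t3', 't4']
```

In the pipeline from this post, the neighbor lists would be the track_uri values of the playlists returned by the KNN lookup. Weighting each neighbor by its similarity instead of counting all neighbors equally is a natural refinement.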

Conclusion

The Million Playlist Dataset is a valuable resource for researchers and developers in the field of digital music streaming. It allows us to study users’ listening habits and preferences and build music-related applications such as playlist recommendation systems. In this blog, I demonstrated how to use the dataset to build a playlist recommendation system using Python and Scikit-learn. With further refinement and optimization, this system can be a powerful tool for helping users discover new music and improving their overall listening experience.
