TAYLOR SWIFT’S ORACLE OF DELPHI

How can we predict the next big hit just like her?

NOÉL
8 min read · Apr 29, 2022

Old friends and family know that I am a rock-star wannabe. Sadly, my dreams of being famous by 27 didn’t materialise. So I did what every failed musician would do… become a data analyst in the music industry.

Beyoncé, Taylor Swift and Dua Lipa look good, sound great, and data excellent!

Modern music companies employ music analytics to predict trends, determine the best time to release singles, identify the profiles of their listeners, set concert dates and more.

But can data science really, really predict the next big hit? Let’s find out.

MUSIC TODAY

In 2020, the music industry in the US was worth over $12 billion.¹

Streaming services such as Spotify, Tidal and Apple Music accounted for 83% of this market, generating US$10.07 billion in 2020.²

In fact, every music company and artist is trying to grab a slice of the pie, from those making an average annual salary of US$36,000 to big-name artists such as Beyoncé making millions.³

The past decade has seen music companies embracing data science, with the rise of “music analytics” helping record companies analyse trends and predict what the next big hit might be.⁴

Sadly, producing the next big hit isn’t about raw talent anymore. It’s about employing big data, then choosing a song whose genre and lyrics are relevant and resonate with listeners.

Spotify, Tidal and Apple Music accounted for 83% of a $12-billion music industry.

THE MAESTROS VS ME

Big players such as Universal Music Group, Sony Music and EMI all have their own in-house music analytics departments.

And since I too have Python installed on my iMac, I reckoned I could predict what the next big hit will be, using machine learning.

MY INSTRUMENTS

Like many, I got my data from Kaggle, which, in turn, got the data from the Spotify Web API.⁵

The data has over a hundred thousand songs spanning the 100-year period from 1921 to 2020.⁶

For this case, it will be a regression problem (i.e. a prediction of quantity where the output variable is a real or continuous value, such as “salary” or “weight”). And I’ll be using the machine learning models of Linear Regression and KNN.

The tools used are the usual Python libraries: Pandas, NumPy, Matplotlib, Seaborn and scikit-learn.

And I’ll be doing all the Python on Jupyter.

From the Spotify Web API, to Pandas, NumPy, Matplotlib, Seaborn and scikit-learn, via Jupyter.

INTRO

When I imported the dataset, I found a whopping 174,000 records over 19 columns of attributes.

Some of the key attributes are ‘acousticness’, ‘energy’ and ‘tempo’. I’ll go through each attribute later.

The great news was that there were no null values. However, as I pored through the data, I realised that it was rather ‘dirty’, with plenty of duplicates and zero values.

Original dataset with 174,389 rows of records, 19 columns of attributes, and no null values.
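A first-look inspection like mine can be sketched as below. The inline CSV is a tiny illustrative stand-in for the real Kaggle file (which has 174,389 rows and 19 columns); the column names are assumptions based on the attributes mentioned in this post.

```python
import io
import pandas as pd

# Tiny stand-in for the Kaggle CSV (the real file has 174,389 rows, 19 columns)
csv = io.StringIO(
    "name,artists,year,acousticness,energy,tempo,popularity\n"
    "Hey Jude,The Beatles,1968,0.02,0.61,148.0,80\n"
    "Bohemian Rhapsody,Queen,1975,0.27,0.40,144.0,85\n"
)
df = pd.read_csv(csv)

print(df.shape)           # rows x columns
print(df.isnull().sum())  # confirm there are no null values
```

With the real dataset you’d swap the `StringIO` buffer for `pd.read_csv("your_file.csv")`.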

1ST VERSE

One of the first things I noticed was that there was both a ‘release_date’ and a ‘year’. Upon closer scrutiny, I decided that they conveyed the same information, so I removed ‘release_date’ as I did not require the exact date of the song release (I just needed the year).

Next, I found heaps of duplicates of artist and song name. Close to 15,000 duplicates!

So what I did was keep the duplicated song with the highest popularity rating, and remove its less popular duplicates.

Next, I realised that ‘tempo’ and ‘popularity’ had plenty of zero values. This doesn’t make sense, as a ‘tempo’ of zero means the song is dead.

Same for ‘popularity’… a zero value means the song shouldn’t even be on the list.

Remove ‘release_date’ as there is no difference between ‘release_date’ and ‘year’.
Remove duplicates as well as zero values.
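The three cleaning steps above can be sketched in Pandas like this. The toy data and column names (`artists`, `name`, `release_date`, etc.) are illustrative assumptions, not the author’s exact code.

```python
import pandas as pd

# Toy stand-in for the Spotify dataset (the real one has ~174k rows)
df = pd.DataFrame({
    "artists": ["The Beatles", "The Beatles", "Queen", "Queen"],
    "name":    ["Hey Jude", "Hey Jude", "Bohemian Rhapsody", "Unreleased"],
    "release_date": ["1968-08-26", "1968-08-26", "1975-10-31", "1970-01-01"],
    "year": [1968, 1968, 1975, 1970],
    "popularity": [80, 65, 85, 0],
    "tempo": [148.0, 148.0, 144.0, 0.0],
})

# 'release_date' adds nothing beyond 'year', so drop it
df = df.drop(columns="release_date")

# Keep only the most popular copy of each (artist, song) pair
df = (df.sort_values("popularity", ascending=False)
        .drop_duplicates(subset=["artists", "name"], keep="first"))

# Zero tempo or zero popularity is nonsensical, so drop those rows
df = df[(df["tempo"] > 0) & (df["popularity"] > 0)]
```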

2ND VERSE

There were two attributes that were peculiar: ‘energy’ and ‘loudness’. Both describe the same thing, which is intensity.

To confirm, I plotted a heat map, and the two attributes have a positive correlation of 0.78.

So, to avoid multicollinearity, I removed ‘loudness’ and kept ‘energy’.

All in all, after the data was cleaned, 122,000 rows of records remained, down from 174,000. And 16 columns, down from 19 previously.

The two attributes ‘energy’ and ‘loudness’ are the same damn thing!
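A minimal sketch of the correlation check and the drop, using a toy numeric frame in place of the real audio features (the numbers below are made up for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts

import pandas as pd
import seaborn as sns

# Toy stand-in for the numeric audio features
df = pd.DataFrame({
    "energy":   [0.2, 0.5, 0.7, 0.9],
    "loudness": [-20.0, -10.0, -7.0, -4.0],
    "tempo":    [90.0, 120.0, 128.0, 140.0],
})

# Correlation matrix, visualised as a heat map
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")

# 'energy' and 'loudness' are highly correlated, so keep only one
df = df.drop(columns="loudness")
```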

CHORUS

Alright, now the fun part… Exploratory Data Analysis!

The first thing to explore is ‘popularity’, as the objective is to find the most popular song. So I plotted the mean popularity over the entire 100 years.

From the graph below, you’ll see that after World War II, music became increasingly popular as the economy grew.

It peaked around 2000 and then started dipping. One plausible reason could be the advent of digital music, as people stopped buying CDs and started downloading music illegally (yay, Napster!).

And guess what? I found out that The Beatles was the most popular artist over the last 100 years (shocker!).

The Beatles is the numero uno artist in the last 100 years.
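The two aggregations behind these findings are simple group-bys. A sketch with illustrative toy data (not the real figures):

```python
import pandas as pd

# Toy stand-in rows; the real dataset covers 1921-2020
df = pd.DataFrame({
    "year": [1999, 1999, 2000, 2000, 2001],
    "artists": ["The Beatles", "Oasis", "The Beatles", "Coldplay", "Coldplay"],
    "popularity": [70, 60, 90, 50, 40],
})

# Mean popularity per year, ready for a line plot
mean_by_year = df.groupby("year")["popularity"].mean()

# Most popular artist overall, by mean popularity
top_artist = df.groupby("artists")["popularity"].mean().idxmax()
```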

BRIDGE (a very long one)

‘Acousticness’ measures if a song is natural-sounding or heavily produced with digital effects.

The scatter plot below tells us that people prefer songs which are heavily produced and studio-engineered, which is the majority of pop songs we hear on the radio.

The ‘danceability’ attribute follows a normal bell curve, with a zero value being least danceable and one being most danceable.

But when we plot it against ‘mean popularity’, we see a skew to the left, which suggests that sad, heartbreak songs are very popular.

People prefer studio-engineered songs rather than unplugged ones. Hence, the scatter plot skews to zero.
People are addicted to sad, sappy songs. Urgh, snap out of it!!

With regards to ‘duration’, even though the songs range from 5 seconds to 90 minutes, we find that most songs are between 2 and 5 minutes. In fact, the holy-grail duration for a pop song is 3½ minutes.

And finally, we have ‘instrumentalness’. The yellow scatter plot shows that songs with vocals are more popular than songs that are mainly instrumental. Which isn’t surprising, since instrumental music sounds like elevator music.

Give me vocals or hand me death.

CHORUS AGAIN

Moving on quickly, I then prepared the data by dividing the independent and dependent variables into two separate data groups.

I then split the data into training and testing sets.

Divide independent and dependent variables, and then split them into training and testing datasets.
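These two steps map directly onto scikit-learn’s `train_test_split`. A sketch with a toy frame standing in for the cleaned dataset (the 80/20 split ratio is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataset
df = pd.DataFrame({
    "energy":       [0.05 * i for i in range(20)],
    "danceability": [0.04 * i for i in range(20)],
    "popularity":   [3 * i for i in range(20)],
})

X = df.drop(columns="popularity")   # independent variables
y = df["popularity"]                # dependent variable

# 80/20 train/test split, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```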

BACK TO VERSE

Before training the model, I employed ‘ColumnTransformer’ and ‘One-Hot Encoding’ to convert categorical values into 0–1 indicator columns (this puts all the attributes on a common numerical footing for the model).

I also dropped the columns which cannot be meaningfully converted to numerical values, like the artist names and song titles (because we can’t quantify names and titles).
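The preprocessing step can be sketched like this. The ‘key’ column and the use of `MinMaxScaler` for the numeric side are illustrative assumptions about which columns get which treatment:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame: one categorical and one numeric attribute
df = pd.DataFrame({
    "key":   ["C", "G", "C", "A"],         # categorical attribute
    "tempo": [90.0, 120.0, 150.0, 100.0],  # numeric attribute
})

# One-hot encode the categorical column, rescale the numeric one to 0-1
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["key"]),
     ("scale", MinMaxScaler(), ["tempo"])],
    sparse_threshold=0,  # force a dense array output
)
X = ct.fit_transform(df)
```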

After which, I did a quick check on the Mean Squared Error using three regression models: Linear Regression, Lasso and KNN.

From the results, KNN had the lowest error of 17.8. So, I’ll proceed with using KNN as my regression model (oh, how convenient of me).

KNN had the lowest error of 17.8, so I use it as my regression model.
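The three-way comparison can be sketched as below. Since the real features aren’t reproduced here, `make_regression` generates synthetic stand-in data, so the error numbers will differ from the 17.8 reported above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the prepared Spotify features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each candidate model and record its test-set MSE
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "KNN": KNeighborsRegressor(),
}
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = mean_squared_error(y_test, model.predict(X_test))

print(errors)  # pick the model with the lowest MSE
```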

PRE-CHORUS

I then trained the model with a larger train set, and scored my model with 18 neighbours.

This gave me a Mean Squared Error of 24.4, which is significantly higher than previously.

Trained the model with a larger train set, and scored my model with 18 neighbours.
The result was a ‘Mean Squared Error’ of 24.4, which is significantly higher than previously (around 18).
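Fitting KNN with 18 neighbours looks like this. Again the data is a synthetic stand-in, so the MSE won’t match the 24.4 above; the 80/20 split is an assumption:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1)

# KNN regressor scored with 18 neighbours
knn = KNeighborsRegressor(n_neighbors=18)
knn.fit(X_train, y_train)
mse = mean_squared_error(y_test, knn.predict(X_test))
```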

FINAL CHORUS

Finally, I plotted the predicted popularity for the test set and compared it with the true popularity data.

As you can see, my prediction is systematically higher than the real value.

The model fails to adapt to a sharp drop in 2016 and a spike in 2017.

One possible reason for this discrepancy could be large outliers which weren’t transformed in the data.

And another possible reason for this poor regression model could be that important attributes like artist name and song title (remember, they couldn’t be quantified) were not included in the calculations.

Evaluation of the model, by plotting the ‘predicted popularity’ for the test set and comparing it to the ‘true popularity’ for the train and test sets, sorted by song.

APPLAUSE PLEASE

In conclusion, my model failed to predict song popularity against the true data.

That’s because my prediction was systematically higher than the real value. Plus, my model failed to adapt to a sharp drop in 2016 and a spike in 2017.

That’s it.

My prediction model failed to adapt to the sharp drop in 2016 and the spike in 2017. REJECTED!

I am now a failed rockstar-wannabe and a failed music analyst.

In my defense, if more time was given, like seven years, I would include the artist’s popularity. This is because top artists do influence a song’s popularity because of their star status.

Another thing I would try is to measure the song title and lyrics. There are NLP tools such as TF-IDF Vectorisation, which can rank the importance of words. However, sourcing lyrics at scale means premium data services, which I’m too cheap to pay for.

Here’s my python code which you’re free to use: <capstone04_SongPredictor_NoelMark.ipynb>

Thank you for the music and I’ll exit stage left.

Thank you for the music!

EDIT (2 MAY 2022):

Just chanced upon this comment by miles by music on an article by Ted Gioia <https://tedgioia.substack.com/p/spotify-shares-now-selling-at-less>.

Similarly, for all the supposed bells and whistles that recommendation algorithms offer from the form perspective (for instance, Spotify’s music is catalogued by BPM, instrumentation, vocal style, “danciness” and other such parameters that they aggregated via acquisition of Echo Nest’s data scientists), there’s little to no cultural analysis going on on their end. After all, genre is only a loose proxy for culture or politics — in the algorithm’s view of things, these are simply collapsed along aesthetic lines. Similarly, I can’t search music by thematic content, which as a curator, is another vital ask. The resulting question is “Who is this platform really for?” and inevitably the answer is not serious music listeners but instead people who require endless, indistinguishable, background music.

EDIT (27 NOVEMBER 2022):

Here’s every song with more than a billion streams on Spotify as of November 2022. To date, over 300 songs belong to the billionaire club.

Every song with over 1 billion Spotify streams. © Visual Capitalist
