Spotify Wrapped part 2: Data analysis and visualization

In the first part of this series, written by Iulia Feroli, we talked about how to get your Spotify Wrapped data and visualize it in Kibana. In part 2, we're diving deeper into the data to see what else we can find out. To do this, we're going to leverage a bit of a different approach and use Spotify to Elasticsearch to index the data into Elasticsearch. This tool is a bit more advanced and requires a bit more setup, but it is worth it. The data is more structured, and we can ask more complex questions.

Differences from the first Spotify Wrapped analysis

In the first blog, we used the Spotify export directly and didn't perform any normalisation tasks or any other data processing. This time, we will use the same data, but we will perform some data processing to make the data more usable. This will allow us to answer much more complex questions, such as:

What is the average duration of a song in my top 100?
What is the average popularity of a song in my top 100?
What is the median listening duration to a song?
What is my most skipped track?
When do I like to skip tracks?
Am I listening to a particular hour of the day more than others?
Am I listening to a particular day of the week more than others?
Is it a month of particular interest?
What is the artist with the longest listening time?

Spotify Wrapped is a fun experience every year, showing you what you listened to this year. It does not give you year-over-year changes, and thus you might miss some artists that were once in your top 10, but now have vanished.

Processing Spotify Wrapped data for analysis

There is a large difference in the way we process the data in the first and the second post. If you want to keep working with the data from the first post, you will need to account for some field name changes, as well as need to revert to ES|QL to do certain extractions like hour of day on the fly.

Nonetheless, you all should be able to follow this post. The data processing is done in the Spotify to Elasticsearch repository involves asking the Spotify API for the duration of the song, popularity and also renames and enhances some fields. For example the artist field in the Spotify export itself is just a String and does not represent features or multi-artist tracks

Visualizing Spotify Wrapped data with Dashboards

I created a dashboard in Kibana to visualize the data. The dashboard is available here and you can import it into your Kibana instance. The dashboard is quite extensive and answers many of the above questions.

Let's get into some of the questions and how to answer them together!

What is the average duration of a song in my top 100?

To answer this question, we can use Lens or ES|QL. Let's explore all three options. Let's phrase this question correctly in an Elasticsearch manner. We want to find the top 100 songs and then calculate the average duration of all of those songs combined. In Elasticsearch terms that would be two aggregations:

Figure out the top 100 songs
Calculate the average duration of those 100 songs.

Lens

In Lens, this is rather simple: create a new Lens, switch to a table, and drag and drop the title field into the table. Then click on the title field and set the size to 100, as well as set accuracy mode. Then drag and drop the duration field into the table and use last value, because we really only need the last value of each of the songs' duration. The same song will only have one duration. At the bottom of this last value aggregation is a dropdown for a summary row, select average and it will show it to you.

ES|QL

ES|QL is a pretty fresh language compared to DSL & aggregations, but it is very powerful and easy to use. To answer the same question in ES|QL, you would write the following query:

from spotify-history
| stats duration=max(duration), count=count() by title
| sort count desc
| limit 100
| stats `Average duration of the songs`=avg(duration)

Let me take you step-bystep through this ES|QL query:

from spotify-history - This is the index pattern we are using.
stats duration=max(duration), count=count() by title - This is the first aggregation, we are calculating the maximum duration of each song and the count of each song. We use max instead of last value as used in the Lens, that is because ES|QL right now does not have a first or last.
sort count desc - We sort the songs by the count of each song, so the most listened to song is on top.
limit 100 - We limit the result to the top 100 songs.
stats Average duration of the songs=avg(duration) - We calculate the average duration of the songs.

Is a month of particular interest to me?

To answer this question we can use Lens with the help of runtime field and ES|QL. What do we notice straight away, that there is no field in the data that denotes the month directly, instead we need to calculate it from the @timestamp field. There are multiple ways to do this:

Use a runtime field, to power the Lens
ES|QL

I personally think that ES|QL is the neater and quicker solution.

FROM spotify-history
| eval month=DATE_EXTRACT("MONTH_OF_YEAR", @timestamp)
| stats count=count() by month

That's it, nothing fancy needed to do, we can leverage the DATE_EXTRACT function to extract the month from the @timestamp field and then we can aggregate on it. Using the ES|QL visualisation we can drop that onto the dashboard.

Spotify Wrapped monthly breakdown visualization

What is my listening duration per artist per year?

The idea behind that is to see if an artist is just a one-time thing or if there is a reoccurrence. If I remember correctly, Spotify only shows you the top 5 artists in the yearly wrapped. Maybe your number 6 artist stays the same all the time, or they heavily change after the 10th position?

One of the simplest representation of this is a percentage bar chart. We can use Lens for this. Follow the steps along:

Drag and drop the listened_to_ms field. This field represents how long you listened to a song in milliseconds. Now per default Lens will create a median aggregation, we do not want that, alter that to a sum. In the top select percentage instead of stacked for the bar chart type. For the breakdown select artist and say top 10. In the Advanced dropdown don't forget to select accuracy mode. Now every color block represents how much you listened to this single artist. Depending on your timepicker the bars might represent values from days, to weeks, to months, to years. If you want a weekly breakdown, select the @timestamp and set the mininum interval to year. Now what we can tell in my case is that Fred Again.. is the artist I listened to most, nearly 12% of my total listening time was consumed by Fred Again... We also see that Fred Again.. dropped a bit in 2024, but Jamie XX grew largely. If we compare just the size of the bars. We can also tell that whilst Billie Eilish is constantly being played in 2024 the bar widthend. This means that I listened to Billie Eilish more in 2024 than in 2023.

Spotify Wrapped history visualization using Kibana

What about the top tracks per artist per listening time versus overall listening time?

That's a mouthfull of a question. Let me try to explain what I want to say with that. Spotify tells you about the top song from a single artist, or your overall 5 top songs. Well, that's definitely interesting, but what about the breakdown of an artist? Is all my time consumed just by a single song that I play over and over again, or is that evenly distributed?

Create a new lens and select Treemap as type. For the metric, same as before: select sum and use listened_to_ms as the field. For the group by we need two values. The first one is artist and then add a second one with title. The intermediate result looks like this:

Let's change that to top 100 artists and deselect the other in the advanced dropdown, as well as enable accuracy mode. For title change that to top 10 and enable accuracy mode. The final result looks like this:

Spotify Wrapped visualization using Kibana

What does this tell us now exactly? Without looking at any time component, we can tell that over all my listening history with Spotify, I spent 5.67% listening to Fred Again... In particularly I spent 1.21% of that time, listening to Delilah (pull me out of this). It is interesting to see, if there is a single song that occupies an artist, or if there are other songs as well. The treemap itself is a nice form to represent such data distributions.

Do I listen on a particular hour and day?

Well, that we can answer super simple with a Lens visualisation leveraging the Heat Map. Create a new Lens, select Heat Map. For the Horizontal Axis select dayOfWeek field and set it to Top 7 instead of Top 3. For the Vertical Axis select the hourOfDay and for Cell Value just a simple Count of records. Now this will produce this panel:

Spotify Wrapped listening habits visualization using Dashboards

There are a couple of annoying things around this Lens, that just disturb me when interpreting. Let's try and clean it up a bit. First of all, I don't care about the legend too much, use the symbol in the top with the triangle, square, circle and disable it.

Now the 2nd part that is annoying is the sorting of the days. It's Monday, Wednesday, Thursday, or anything else, depending on the values you have. The hourOfDay is correctly sorted. The way to sort the days is a funny hack and that is called to use Filters instead of Top Values. Click on dayOfWeek and select Filters, it should now look like this:

Spotify Wrapped history visualization with Kibana Dashboards

Now just start typing the days. One filter for each day. "dayOfWeek" : Monday and give it the label Monday and rinse and repeat.

One caveat in all of this though is, that Spotify provides the data in UTC+0 without any timezone information. Sure, they also provide the IP address and the country where you listened to and we could infere the timezone information from that, but that can be wonky and for countries like the U.S. that has multiple timezones, it can be too much of a hassle. This is important because Elasticsearch and Kibana have timezone support and by providing the correct timezone in the @timestamp field, Kibana would automatically adjust the time to your browser time.

It should look like this when finalized, and we can tell that I am a very active listener during working hours and less so on Saturdays and Sundays.

Spotify Wrapped visualization using Kibana Dashboards

Conclusion

In this blog, we dove a bit deeper into the intricacies that the Spotify data offers. We showed a few simple and quick ways to get some visualizations up and running. It is simply amazing to have this much control over your own listening history. Check out the other parts of the series:

Ready to try this out on your own? Start a free trial.

Want to get Elastic certified? Find out when the next Elasticsearch Engineer training is running!