Unveiling Listener Preferences through Podcast Reviews
This project involves the analysis of podcast review data using Python and SQLite. The dataset consists of 2 million reviews for 100,000 podcasts. We will conduct exploratory data analysis (EDA) to uncover insights into podcast popularity, review sentiments, category trends, and more. The analysis will utilize Python libraries such as Pandas, Matplotlib, and Plotly for data manipulation and visualization.
Summary / Findings
First, we cleaned the data by removing entries in languages other than English, converting all text to lowercase, and removing unnecessary punctuation and stop words. We also added a new column to the data recording the length of each review.
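The text-cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the stop-word list, the `clean_review` helper, and the field names `content`, `clean_content`, and `review_length` are all assumptions, and review length is assumed to mean the character count of the raw text. Language filtering is left out because it would need a language-detection library that the summary does not name.

```python
import string

# Hypothetical, deliberately tiny stop-word list; the project's real list is not given.
STOP_WORDS = {"a", "an", "the", "and", "is", "it", "this", "to", "of", "i"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(PUNCT_TABLE)
    kept = [word for word in no_punct.split() if word not in STOP_WORDS]
    return " ".join(kept)


# Made-up sample row, used only to show the shape of the output.
reviews = [{"content": "I LOVE this podcast, it is great!"}]

for review in reviews:
    # Assumption: length is measured in characters on the raw text.
    review["review_length"] = len(review["content"])
    review["clean_content"] = clean_review(review["content"])

print(reviews[0]["clean_content"])
```

Measuring length before cleaning keeps the new column independent of the stop-word list; counting words instead of characters would be an equally reasonable choice.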
We used the number of reviews a podcast received as one measure of its popularity. "Crime Junkie" had the most reviews, but we found no clear connection between a podcast's review count and its average rating. We also examined the relationship between the length of a review and its rating. The correlation was weakly negative: longer reviews tended to carry slightly lower ratings, but the effect was small.
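As a rough sketch of how these two comparisons could be computed with SQLite and plain Python: the table name `reviews`, its columns `podcast_id`, `rating`, and `content`, and the `pearson` helper are assumptions made for illustration, and the rows are made-up toy data rather than values from the dataset.

```python
import sqlite3
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (podcast_id TEXT, rating INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("a", 5, "great"),
        ("a", 5, "love it"),
        ("a", 4, "pretty good overall"),
        ("b", 5, "nice"),
        ("b", 1, "too many ads and the audio quality is poor"),
        ("c", 3, "it was fine"),
    ],
)

# Review count and average rating per podcast, most-reviewed first.
per_podcast = conn.execute(
    "SELECT podcast_id, COUNT(*) AS n_reviews, AVG(rating) AS avg_rating "
    "FROM reviews GROUP BY podcast_id ORDER BY n_reviews DESC, podcast_id"
).fetchall()

r_count_vs_rating = pearson(
    [row[1] for row in per_podcast], [row[2] for row in per_podcast]
)

# Review length (in characters) against rating, one pair per review.
pairs = conn.execute("SELECT LENGTH(content), rating FROM reviews").fetchall()
r_length_vs_rating = pearson([p[0] for p in pairs], [p[1] for p in pairs])

print(per_podcast[0], r_count_vs_rating, r_length_vs_rating)
```

With ratings this heavily skewed toward 5, a rank-based measure such as Spearman's correlation may be a more robust alternative to Pearson.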
The majority of reviews (86.66%) gave podcasts a perfect rating of 5, which suggests a high level of listener satisfaction. Ratings of 4, 3, 2, and 1 were far less common. We also performed sentiment analysis to identify the words and phrases most common in positive and negative reviews. Negative reviews often used words like "podcast," "like," and "listen" in contexts that expressed dissatisfaction, while positive reviews frequently included words like "love" and "great," suggesting enjoyment and appreciation.
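One simple way to produce a rating breakdown and per-sentiment word counts is sketched below. It assumes that reviews rated 4 or 5 count as positive and those rated 1 or 2 as negative; that threshold, the variable names, and the toy reviews are illustrative assumptions, and the project's own sentiment analysis may have worked differently.

```python
from collections import Counter

# Toy (rating, cleaned text) pairs, made up for illustration.
reviews = [
    (5, "love great show"),
    (5, "great hosts"),
    (5, "love it"),
    (4, "good podcast"),
    (1, "podcast like ads listen"),
]

# Share of each rating as a percentage of all reviews.
rating_counts = Counter(rating for rating, _ in reviews)
total = len(reviews)
rating_share = {rating: 100 * count / total for rating, count in rating_counts.items()}

# Word frequencies split by an assumed sentiment threshold.
positive_words = Counter()
negative_words = Counter()
for rating, text in reviews:
    if rating >= 4:
        positive_words.update(text.split())
    elif rating <= 2:
        negative_words.update(text.split())

print(rating_share)
print(positive_words.most_common(2))
print(negative_words.most_common(2))
```

Raw counts favor words that are frequent everywhere, which may explain why neutral terms like "podcast" surface in the negative group; comparing relative frequencies between the two groups would highlight more distinctive vocabulary.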
Looking at podcast categories, "Society & Culture" had the most reviews, and these were mostly positive. In contrast, categories like "TV & Film" and "Sports" had a wider range of ratings. We also investigated trends over time: the number of reviews grew over the period covered and peaked in June 2020. The average rating also rose at first, peaking in 2018 before declining slightly. Broken down by calendar month, January had the most reviews and December the fewest. Similarly, February had the highest average rating, while December had the lowest.
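The time-based breakdowns can be sketched with SQLite's `strftime` function. The column name `created_at`, its ISO-8601 date format, and the rows below are assumptions for illustration; they are not taken from the dataset.

```python
import sqlite3

# Hypothetical schema with an ISO-8601 date column; toy rows only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (rating INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [
        (5, "2019-01-10"),
        (4, "2020-01-05"),
        (5, "2020-06-01"),
        (5, "2020-06-20"),
        (3, "2020-12-31"),
    ],
)

# Trend over time: one row per year-month.
by_year_month = conn.execute(
    "SELECT strftime('%Y-%m', created_at) AS ym, COUNT(*), AVG(rating) "
    "FROM reviews GROUP BY ym ORDER BY ym"
).fetchall()

# Seasonality: one row per calendar month, pooled across years.
by_month = conn.execute(
    "SELECT strftime('%m', created_at) AS m, COUNT(*), AVG(rating) "
    "FROM reviews GROUP BY m ORDER BY m"
).fetchall()

# Year-month with the largest number of reviews.
peak = max(by_year_month, key=lambda row: row[1])

print(peak)
print(by_month)
```

Keeping the year-month series separate from the calendar-month series matters: the first shows growth over the whole period, while the second pools all years and can be dominated by whichever years contributed the most reviews.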
Finally, we analyzed the behavior of individual reviewers. Review activity varied widely from one author to another. Four of the five most active reviewers consistently gave podcasts perfect scores, while the fifth gave slightly lower ratings (averaging around 4.04). All five therefore had high average ratings, suggesting a generally positive view of the podcasts they reviewed. These reviewers also showed preferences for certain categories: "Business" was the category they reviewed most often, followed by "Comedy" and "Education." However, they still reviewed podcasts from a wide range of categories.
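A reviewer-level summary along these lines could be queried as shown below. The tables `reviews` and `categories`, the columns `author_id`, `podcast_id`, and `category`, and the sample rows are hypothetical; `TOP_N` is set to 2 only to keep the toy example small, where the analysis above used the top 5.

```python
import sqlite3

# Hypothetical schema and toy rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reviews (author_id TEXT, podcast_id TEXT, rating INTEGER);
    CREATE TABLE categories (podcast_id TEXT, category TEXT);
    """
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("u1", "p1", 5), ("u1", "p2", 5), ("u1", "p3", 5),
        ("u2", "p1", 4), ("u2", "p2", 5),
        ("u3", "p3", 2),
    ],
)
conn.executemany(
    "INSERT INTO categories VALUES (?, ?)",
    [("p1", "business"), ("p2", "business"), ("p3", "comedy")],
)

TOP_N = 2  # the analysis in the text used 5

# Most active reviewers with their review counts and average ratings.
top_reviewers = conn.execute(
    "SELECT author_id, COUNT(*) AS n_reviews, AVG(rating) AS avg_rating "
    "FROM reviews GROUP BY author_id "
    "ORDER BY n_reviews DESC, author_id LIMIT ?",
    (TOP_N,),
).fetchall()

# Categories reviewed by those authors, most frequent first.
top_ids = [row[0] for row in top_reviewers]
placeholders = ",".join("?" * len(top_ids))
category_counts = conn.execute(
    "SELECT c.category, COUNT(*) AS n "
    "FROM reviews r JOIN categories c ON c.podcast_id = r.podcast_id "
    f"WHERE r.author_id IN ({placeholders}) "
    "GROUP BY c.category ORDER BY n DESC, c.category",
    top_ids,
).fetchall()

print(top_reviewers)
print(category_counts)
```

If a podcast can belong to several categories, the join counts each review once per category, so the category totals can exceed the number of reviews.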
Data Exploration / Analysis