Continued: EDA on Netflix, Part 2
Task. 3)
Content Analysis and Popularity:
1. What is the distribution of ratings across movies and TV shows? Are there any trends between rating and content type? (e.g., Are documentaries skewed towards higher or lower ratings?)
2. How many unique directors are there? Can we identify any directors with a high concentration of movies/shows in specific genres?
3. For TV shows, is there a correlation between the number of seasons and the average rating?
4. Analyze the "listed_in" descriptions. Are there any combinations of genres that frequently appear together? (e.g., Do Romantic TV Comedies often have International themes?)
1: Distribution of Ratings Across Movies and TV Shows
python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('netflix_titles.csv')
# Create a count plot for ratings by type
plt.figure(figsize=(12, 6))
sns.countplot(data=df, x='rating', hue='type', palette='Set1')
plt.title('Distribution of Ratings Across Movies and TV Shows')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.legend(title='Type')
plt.xticks(rotation=45)
plt.show()
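Because movies outnumber TV shows in the catalog, raw counts can hide differences between the two types. A minimal sketch of one way to compare them on equal footing is a normalized cross-tabulation, which gives the share of each rating within each type. The `sample` rows below are made up for illustration; with the real data you would pass `df['type']` and `df['rating']` instead.

```python
import pandas as pd

# Hypothetical sample rows standing in for netflix_titles.csv
sample = pd.DataFrame({
    'type': ['Movie', 'Movie', 'Movie', 'Movie', 'TV Show', 'TV Show'],
    'rating': ['R', 'R', 'PG-13', 'TV-MA', 'TV-MA', 'TV-14'],
})

# Share of each rating within each content type (each row sums to 1)
rating_share = pd.crosstab(sample['type'], sample['rating'], normalize='index')
print(rating_share.round(2))
```

Reading across a row shows how a single content type is split over the ratings, which makes the two types directly comparable.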
2: Count of Unique Directors and High Concentration in Specific Genres
python
# Count unique directors
unique_directors_count = df['director'].nunique()
print(f"Number of unique directors: {unique_directors_count}")
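# Count titles per director and individual genre
# ('listed_in' holds comma-separated genres, so split it into one row per genre first)
director_genres = df.assign(genre=df['listed_in'].str.split(', ')).explode('genre')
directors_genre_counts = director_genres.groupby(['director', 'genre']).size().reset_index(name='count')
high_concentration_directors = directors_genre_counts[directors_genre_counts['count'] > 5] # Example threshold
print(high_concentration_directors.sort_values('count', ascending=False))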
3: Correlation Between Number of Seasons and Average Rating for TV Shows
python
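# Filter for TV shows; .copy() avoids pandas' SettingWithCopyWarning
tv_shows = df[df['type'] == 'TV Show'].copy()
# 'rating' is a maturity rating (TV-Y ... TV-MA), not a viewer score, so map it to an ordinal scale
rating_order = {'TV-Y': 0, 'TV-Y7': 1, 'TV-G': 2, 'TV-PG': 3, 'TV-14': 4, 'TV-MA': 5} # Example mapping
tv_shows['average_rating'] = tv_shows['rating'].map(rating_order)
# Extract the number of seasons from strings such as "2 Seasons"
tv_shows['seasons'] = tv_shows['duration'].str.extract(r'(\d+)')[0].astype(float)
# Calculate correlation (rows with unmapped ratings or missing durations are ignored)
correlation = tv_shows['average_rating'].corr(tv_shows['seasons'])
print(f"Correlation between number of seasons and average rating: {correlation}")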
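Since the rating is mapped to an ordinal scale, a rank-based (Spearman) correlation may be a better fit than the default Pearson correlation. The sketch below computes it as the Pearson correlation of the ranks, which needs only pandas. The `tv_sample` rows and the `maturity_level` column are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical TV-show rows; 'maturity_level' is an assumed ordinal encoding
tv_sample = pd.DataFrame({
    'duration': ['1 Season', '2 Seasons', '3 Seasons', '5 Seasons'],
    'maturity_level': [0, 1, 1, 3],
})

# Pull the season count out of strings such as "2 Seasons"
tv_sample['seasons'] = tv_sample['duration'].str.extract(r'(\d+)')[0].astype(int)

# Spearman correlation is the Pearson correlation of the ranks
spearman = tv_sample['seasons'].rank().corr(tv_sample['maturity_level'].rank())
print(f"Spearman correlation: {spearman:.3f}")
```

Ranking first means only the ordering of the levels matters, not the arbitrary numbers chosen in the mapping.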
4: Analyze "listed_in" Descriptions for Frequent Genre Combinations
python
from collections import Counter
# Split genres and count combinations
genre_combinations = Counter()
for genres in df['listed_in'].dropna():
    genre_list = genres.split(', ')
    for i in range(len(genre_list)):
        for j in range(i + 1, len(genre_list)):
            combination = tuple(sorted([genre_list[i], genre_list[j]]))
            genre_combinations[combination] += 1
# Display the most common combinations
common_combinations = genre_combinations.most_common(10)
print("Most common genre combinations:")
for combo, count in common_combinations:
    print(f"{combo}: {count}")
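The nested index loops can also be written with `itertools.combinations`, which yields every unordered pair directly. This is a sketch of the same counting idea on a few made-up `listed_in` strings (the `listed_in_sample` list is hypothetical).

```python
from collections import Counter
from itertools import combinations

# Hypothetical 'listed_in' values for illustration
listed_in_sample = [
    'Dramas, International Movies',
    'International Movies, Dramas, Comedies',
    'Comedies',
]

pair_counts = Counter()
for genres in listed_in_sample:
    # Sort so that (A, B) and (B, A) are counted as the same pair
    genre_list = sorted(g.strip() for g in genres.split(','))
    pair_counts.update(combinations(genre_list, 2))

print(pair_counts.most_common(3))
```

Titles with a single genre contribute no pairs, so they drop out of the count automatically.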
Explanation of Commands
1: This code creates a count plot to visualize the distribution of ratings across movies and TV shows.
2: The first part counts unique directors, while the second part groups the data to find directors with a high concentration of titles in specific genres.
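3: This code calculates the correlation between the number of seasons (parsed from the duration column) and the rating for TV shows. Note that rating in this dataset is a maturity rating such as TV-14, not a viewer score, so it is first mapped to a numeric scale.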
4: This code analyzes the listed_in descriptions to find frequently occurring genre combinations using a counter.
These commands will help you extract insights from the dataset effectively.
Temporal Analysis:
5. Over time (by release year), how has the average duration of movies on Netflix changed?
6. Is there a seasonal trend in terms of when new shows are added to Netflix? (e.g., Are more shows added in specific quarters?)
7. Can you identify any countries that have increased their content presence on Netflix over the years (based on release year and country data)?
5: Average Duration of Movies Over Time
python
import pandas as pd
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('netflix_titles.csv')
# Keep movies only; TV shows store duration as seasons (e.g., "2 Seasons")
movies = df[df['type'] == 'Movie'].copy()
# Convert duration to numeric values (in minutes)
movies['duration_minutes'] = movies['duration'].str.extract(r'(\d+)')[0].astype(float)
# Group by release year and calculate average duration
average_duration = movies.groupby('release_year')['duration_minutes'].mean().reset_index()
# Plotting the average duration over time
plt.figure(figsize=(12, 6))
plt.plot(average_duration['release_year'], average_duration['duration_minutes'], marker='o')
plt.title('Average Duration of Movies on Netflix Over Time')
plt.xlabel('Release Year')
plt.ylabel('Average Duration (minutes)')
plt.grid()
plt.show()
6: Seasonal Trend of New Shows Added to Netflix
python
# Convert 'date_added' to datetime format and extract the month
# (strip stray whitespace first; unparseable dates become NaT)
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')
df['month_added'] = df['date_added'].dt.month
# Count the number of shows added per month
monthly_counts = df.groupby('month_added')['type'].count().reset_index()
# Plotting the seasonal trend
plt.figure(figsize=(12, 6))
plt.bar(monthly_counts['month_added'], monthly_counts['type'], color='skyblue')
plt.title('Number of Shows Added to Netflix by Month')
plt.xlabel('Month')
plt.ylabel('Number of Shows Added')
plt.xticks(monthly_counts['month_added'], ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()
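The question also asks about quarters. A minimal sketch, assuming `date_added` uses the "Month D, YYYY" style, is to parse the dates and count TV shows per calendar quarter with `dt.quarter`. The `added` rows below are made up for illustration.

```python
import pandas as pd

# Hypothetical 'date_added' values; the "Month D, YYYY" format is an assumption
added = pd.DataFrame({
    'type': ['TV Show', 'TV Show', 'Movie', 'TV Show', 'TV Show'],
    'date_added': ['January 5, 2020', ' March 1, 2021', 'July 4, 2020',
                   'November 20, 2019', None],
})

# Strip stray whitespace, then parse; missing or malformed dates become NaT
added['date_added'] = pd.to_datetime(
    added['date_added'].str.strip(), format='%B %d, %Y', errors='coerce'
)

# Keep TV shows with a known date and count additions per calendar quarter
shows = added[(added['type'] == 'TV Show') & added['date_added'].notna()]
quarterly_counts = shows['date_added'].dt.quarter.value_counts().sort_index()
print(quarterly_counts)
```

Filtering on `type` here restricts the count to TV shows; drop that condition to count every title added.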
7: Countries Increasing Content Presence on Netflix Over the Years
python
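# 'country' can list several countries (e.g., "United States, India"), so split into one row per country
countries = df.assign(country=df['country'].str.split(', ')).explode('country')
# Count number of titles per country per year
country_year_counts = countries.groupby(['country', 'release_year']).size().reset_index(name='count')
# Keep the country-year rows where the count rose compared with that country's previous release year
increased_content_countries = country_year_counts[country_year_counts.groupby('country')['count'].diff().fillna(0) > 0]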
# Display countries that have increased their content presence
print("Countries with increased content presence over the years:")
print(increased_content_countries)
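Year-over-year differences flag individual years rather than an overall trend. One possible way to summarize the trend, sketched here under the assumption that a straight line is a reasonable first approximation, is to fit a least-squares slope to each country's yearly counts. The `counts` table and the `yearly_slope` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical title counts per country and release year
counts = pd.DataFrame({
    'country': ['India'] * 4 + ['France'] * 4,
    'release_year': [2016, 2017, 2018, 2019] * 2,
    'count': [10, 20, 35, 50, 12, 11, 13, 12],
})

def yearly_slope(group):
    # Least-squares slope: average change in titles per year
    years = group['release_year'] - group['release_year'].min()
    return np.polyfit(years, group['count'], 1)[0]

slopes = {country: yearly_slope(group) for country, group in counts.groupby('country')}
for country, slope in sorted(slopes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {slope:+.1f} titles per year")
```

A clearly positive slope suggests a growing presence, while a slope near zero suggests a flat one.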
Explanation of Commands
5: This code calculates and plots the average duration of movies released on Netflix over the years. It converts the duration column to numeric values and groups by release_year.
6: This code analyzes when new shows are added to Netflix by extracting the month from date_added, counting entries per month, and plotting a bar graph.
7: This code counts titles per country per release year and keeps the country-year rows where the count rose compared with that country's previous release year.
Data Cleaning and Feature Engineering:
8. Are there any missing values in the data? If so, how are they distributed across different features, and how might we handle them?
9. The "duration" feature is currently in text format (e.g., "90 min"). How can we convert this into a numerical format (minutes) for further analysis?
10. Can additional features be created from the existing data? For instance, could we create a new feature separating "listed_in" genres into individual categories?
8: Check for Missing Values and Their Distribution
python
import pandas as pd
# Load the dataset
df = pd.read_csv('netflix_titles.csv')
# Check for missing values
missing_values = df.isnull().sum()
# Display missing values distribution
print("Missing values in each feature:")
print(missing_values[missing_values > 0])
# Handling missing values (example strategies)
# Drop rows with missing values (this discards every title that lacks, for example, a director)
df_cleaned = df.dropna()
# Alternatively, fill missing values with a placeholder or mean/median/mode
# df['column_name'] = df['column_name'].fillna('placeholder')
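To see how the gaps are distributed, it can help to look at percentages rather than raw counts, and to fill columns selectively instead of dropping rows. The sketch below uses a made-up `raw` frame; the choice of an "Unknown" placeholder and of the most common value for `country` are illustrative assumptions, not the only reasonable strategies.

```python
import pandas as pd

# Hypothetical rows with gaps similar to those in 'director' and 'country'
raw = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'director': ['X', None, None, 'Y'],
    'country': ['India', 'France', None, 'India'],
})

# Percentage of missing values per column, largest first
missing_pct = (raw.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Keep every row: label unknown directors, fill country with its most common value
filled = raw.copy()
filled['director'] = filled['director'].fillna('Unknown')
filled['country'] = filled['country'].fillna(filled['country'].mode()[0])
print(filled)
```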
9: Convert Duration Feature to Numerical Format
python
# Convert duration to numeric values (in minutes); TV shows are measured in seasons, so they get NaN
df['duration_minutes'] = pd.to_numeric(df['duration'].str.replace(' min', ''), errors='coerce')
# Display the updated DataFrame to verify conversion
print(df[['title', 'duration', 'duration_minutes']].head())
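The duration column mixes two units: minutes for movies and seasons for TV shows. A minimal sketch of one way to handle both is to extract the leading number once and route it to a unit-specific column. The `titles` rows and the `seasons` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows mixing the two 'duration' formats
titles = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie', 'TV Show'],
    'duration': ['90 min', '2 Seasons', None, '1 Season'],
})

# Leading number of each duration string as a float (NaN when missing)
number = titles['duration'].str.extract(r'(\d+)')[0].astype(float)

# Route the number to the column that matches its unit
titles['duration_minutes'] = number.where(titles['type'] == 'Movie')
titles['seasons'] = number.where(titles['type'] == 'TV Show')
print(titles)
```

Keeping the two units in separate columns avoids accidentally averaging minutes together with season counts.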
10: Create Additional Features from Existing Data
python
# Split 'listed_in' genres into individual categories
df['genres'] = df['listed_in'].str.split(', ')
# Explode the genres into separate rows (if needed for analysis)
df_exploded = df.explode('genres')
# Display the updated DataFrame with new genre feature
print(df_exploded[['title', 'genres']].head())
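Exploding produces one row per title and genre, which suits counting. When each title should stay on a single row, for example as input to a model, indicator columns are an alternative. The sketch below uses `Series.str.get_dummies` on a made-up `catalog` frame.

```python
import pandas as pd

# Hypothetical titles with comma-separated genres
catalog = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'listed_in': ['Dramas, Comedies', 'Dramas', 'Documentaries, Dramas'],
})

# One 0/1 indicator column per genre; titles stay one row each
genre_flags = catalog['listed_in'].str.get_dummies(sep=', ')
catalog_wide = pd.concat([catalog[['title']], genre_flags], axis=1)
print(catalog_wide)
```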
Explanation of Commands
8: This code checks for missing values in the dataset and displays the count of missing entries for each feature. It also provides examples of how to handle these missing values by either dropping them or filling them with placeholders.
9: This code converts the duration feature from text format (e.g., "90 min") to a numerical format (in minutes) for easier analysis.
10: This code creates a new feature that splits the listed_in column into individual genre categories. It uses the explode function to create separate rows for each genre if needed.
These commands will help you effectively analyze and manipulate the dataset based on your queries.