Contd EDA on Netflix 2

December 23, 2024

Task 3

Content Analysis and Popularity:

1. What is the distribution of ratings across movies and TV shows? Are there any trends between rating and content type? (e.g., Are documentaries skewed towards higher or lower ratings?)

2. How many unique directors are there? Can we identify any directors with a high concentration of movies/shows in specific genres?

3. For TV shows, is there a correlation between the number of seasons and the average rating?

4. Analyze the "listed_in" descriptions. Are there any combinations of genres that frequently appear together? (e.g., Do Romantic TV Comedies often have International themes?)

1: Distribution of Ratings Across Movies and TV Shows

python

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

# Load the dataset

df = pd.read_csv('netflix_titles.csv')

# Create a count plot for ratings by type

plt.figure(figsize=(12, 6))

sns.countplot(data=df, x='rating', hue='type', palette='Set1')

plt.title('Distribution of Ratings Across Movies and TV Shows')

plt.xlabel('Rating')

plt.ylabel('Count')

plt.legend(title='Type')

plt.xticks(rotation=45)

plt.show()
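Raw counts can hide differences between types because the catalogue holds more movies than TV shows. One way to compare them is to look at the share of each rating within a type. The sketch below uses a small hypothetical `sample` frame in place of `netflix_titles.csv`; the rows are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for netflix_titles.csv (hypothetical rows, not real data)
sample = pd.DataFrame({
    'type':   ['Movie', 'Movie', 'Movie', 'TV Show', 'TV Show', 'Movie'],
    'rating': ['R', 'PG-13', 'TV-MA', 'TV-MA', 'TV-14', 'R'],
})

# Row-normalised crosstab: each row sums to 1, so types of different
# sizes can be compared directly
rating_share = pd.crosstab(sample['type'], sample['rating'], normalize='index')

print(rating_share.round(2))
```

On the real data, replacing `sample` with `df` should give the same kind of table, which can also be passed to `sns.heatmap` for a visual comparison.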

2: Count of Unique Directors and High Concentration in Specific Genres

python

# Count unique directors

unique_directors_count = df['director'].nunique()

print(f"Number of unique directors: {unique_directors_count}")

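# 'listed_in' holds several comma-separated genres, so split it into one row per genre

# (multi-director credits in 'director' are kept as a single string here)

director_genres = df.assign(genre=df['listed_in'].str.split(', ')).explode('genre')

# Count number of titles per director and genre (rows with no director are dropped by groupby)

directors_genre_counts = director_genres.groupby(['director', 'genre']).size().reset_index(name='count')

high_concentration_directors = directors_genre_counts[directors_genre_counts['count'] > 5] # Example threshold

print(high_concentration_directors.sort_values('count', ascending=False))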
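A fixed count threshold favours prolific directors. As a possible complement, the share of a director's titles that fall in each genre can be computed. This is a minimal sketch on hypothetical rows; the director names and the `share` column are made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
sample = pd.DataFrame({
    'director':  ['A. Director', 'A. Director', 'A. Director', 'B. Director'],
    'listed_in': ['Dramas, Thrillers', 'Dramas', 'Dramas, Comedies', 'Documentaries'],
})

# One row per (title, genre) pair
exploded = sample.assign(genre=sample['listed_in'].str.split(', ')).explode('genre')

# Titles per director and genre, and total titles per director
genre_titles = exploded.groupby(['director', 'genre']).size().reset_index(name='genre_titles')
total_titles = sample.groupby('director').size().reset_index(name='total_titles')

# Share of each director's titles that carry a given genre tag
concentration = genre_titles.merge(total_titles, on='director')
concentration['share'] = concentration['genre_titles'] / concentration['total_titles']

print(concentration.sort_values('share', ascending=False))
```

In practice it is probably worth combining both views, for example keeping only directors with a minimum number of titles before ranking by `share`.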

3: Correlation Between Number of Seasons and Average Rating for TV Shows

python

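# Filter for TV shows; copy so that adding columns does not raise SettingWithCopyWarning

tv_shows = df[df['type'] == 'TV Show'].copy()

# 'rating' holds maturity ratings, so map them to an ordinal scale (example mapping; unlisted ratings become NaN)

tv_shows['average_rating'] = tv_shows['rating'].map({'TV-Y': 0, 'TV-Y7': 1, 'TV-G': 2, 'TV-PG': 3, 'TV-14': 4, 'TV-MA': 5})

# Extract the number of seasons from strings such as "2 Seasons" (float keeps missing values as NaN)

tv_shows['seasons'] = tv_shows['duration'].str.extract(r'(\d+)')[0].astype(float)

# Calculate correlation (pairs with a NaN on either side are skipped)

correlation = tv_shows['average_rating'].corr(tv_shows['seasons'])

print(f"Correlation between number of seasons and average rating: {correlation}")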
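Because the numeric rating is only an ordinal encoding, a rank-based (Spearman) correlation may be a better fit than the default Pearson coefficient. The sketch below computes it as the Pearson correlation of ranks, which avoids an extra dependency. The `tv` frame and the `order` mapping are hypothetical examples.

```python
import pandas as pd

# Hypothetical TV show rows (not real data)
tv = pd.DataFrame({
    'rating':   ['TV-Y', 'TV-PG', 'TV-14', 'TV-MA', 'TV-MA', None],
    'duration': ['1 Season', '2 Seasons', '3 Seasons', '5 Seasons', '4 Seasons', '1 Season'],
})

# Example ordinal encoding of maturity ratings (an assumption, adjust as needed)
order = {'TV-Y': 0, 'TV-Y7': 1, 'TV-G': 2, 'TV-PG': 3, 'TV-14': 4, 'TV-MA': 5}
tv['rating_rank'] = tv['rating'].map(order)
tv['seasons'] = tv['duration'].str.extract(r'(\d+)')[0].astype(float)

# Drop incomplete pairs first so both columns are ranked over the same rows
pairs = tv[['rating_rank', 'seasons']].dropna()

# Spearman correlation = Pearson correlation of the (tie-averaged) ranks
spearman = pairs['rating_rank'].rank().corr(pairs['seasons'].rank())

print(f"Spearman correlation: {spearman:.3f}")
```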

4: Analyze "listed_in" Descriptions for Frequent Genre Combinations

python

from collections import Counter

# Split genres and count combinations

genre_combinations = Counter()

for genres in df['listed_in']:

    genre_list = genres.split(', ')

    for i in range(len(genre_list)):

        for j in range(i + 1, len(genre_list)):

            combination = tuple(sorted([genre_list[i], genre_list[j]]))

            genre_combinations[combination] += 1

# Display the most common combinations

common_combinations = genre_combinations.most_common(10)

print("Most common genre combinations:")

for combo, count in common_combinations:

    print(f"{combo}: {count}")
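The nested index loops can also be written with `itertools.combinations`, which yields every unordered pair once. The following is a small sketch on a hypothetical list of `listed_in` strings rather than the full column.

```python
from collections import Counter
from itertools import combinations

# Hypothetical 'listed_in' values (not real data)
listed_in = [
    'Dramas, International Movies',
    'Comedies, Dramas, International Movies',
    'Comedies',
]

pair_counts = Counter()
for genres in listed_in:
    # Sort first so that (A, B) and (B, A) are counted as the same pair
    genre_list = sorted(g.strip() for g in genres.split(','))
    pair_counts.update(combinations(genre_list, 2))

for combo, count in pair_counts.most_common(3):
    print(f"{combo}: {count}")
```

Titles with a single genre contribute no pairs, so they are skipped automatically.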

Explanation of Commands

1: This code creates a count plot to visualize the distribution of ratings across movies and TV shows.

2: The first part counts unique directors, while the second part groups the data to find directors with a high concentration of titles in specific genres.

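3: This code calculates the correlation between the number of seasons (parsed from the duration column) and a numeric encoding of the rating for TV shows. Note that rating here is a maturity rating such as TV-MA, not a viewer score, so the result describes audience maturity level rather than quality.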

4: This code analyzes the listed_in descriptions to find frequently occurring genre combinations using a counter.

These commands will help you extract insights from the dataset effectively.

Temporal Analysis:

5. Over time (by release year), how has the average duration of movies on Netflix changed?

6. Is there a seasonal trend in terms of when new shows are added to Netflix? (e.g., Are more shows added in specific quarters?)

7. Can you identify any countries that have increased their content presence on Netflix over the years (based on release year and country data)?

5: Average Duration of Movies Over Time

python

import pandas as pd

import matplotlib.pyplot as plt

# Load the dataset

df = pd.read_csv('netflix_titles.csv')

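# Keep movies only; TV show durations are season counts such as "2 Seasons"

movies = df[df['type'] == 'Movie'].copy()

# Convert duration to numeric values (in minutes); missing or unparseable entries become NaN

movies['duration_minutes'] = pd.to_numeric(movies['duration'].str.replace(' min', '', regex=False), errors='coerce')

# Group by release year and calculate average duration

average_duration = movies.groupby('release_year')['duration_minutes'].mean().reset_index()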

# Plotting the average duration over time

plt.figure(figsize=(12, 6))

plt.plot(average_duration['release_year'], average_duration['duration_minutes'], marker='o')

plt.title('Average Duration of Movies on Netflix Over Time')

plt.xlabel('Release Year')

plt.ylabel('Average Duration (minutes)')

plt.grid()

plt.show()

6: Seasonal Trend of New Shows Added to Netflix

python

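# Convert 'date_added' to datetime format and extract the month

# (strip stray whitespace first; missing or unparseable dates become NaT)

df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')

df['month_added'] = df['date_added'].dt.month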

# Count the number of shows added per month

monthly_counts = df.groupby('month_added')['type'].count().reset_index()

# Plotting the seasonal trend

plt.figure(figsize=(12, 6))

plt.bar(monthly_counts['month_added'], monthly_counts['type'], color='skyblue')

plt.title('Number of Shows Added to Netflix by Month')

plt.xlabel('Month')

plt.ylabel('Number of Shows Added')

plt.xticks(monthly_counts['month_added'], ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

plt.show()
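The question also asks about quarters and about shows specifically, so a quarterly count restricted to TV shows may be useful alongside the monthly plot. The sketch below uses hypothetical rows and assumes dates are written like "January 5, 2020".

```python
import pandas as pd

# Hypothetical rows; note the leading space and the missing date
sample = pd.DataFrame({
    'type': ['TV Show', 'TV Show', 'Movie', 'TV Show', 'TV Show'],
    'date_added': ['January 5, 2020', ' March 1, 2020', 'July 4, 2020',
                   'December 25, 2019', None],
})

# Parse dates; the format string is an assumption about how dates are written
added = pd.to_datetime(sample['date_added'].str.strip(),
                       format='%B %d, %Y', errors='coerce')

# Keep TV shows with a valid date and take the calendar quarter (1 to 4)
tv_quarters = added[sample['type'] == 'TV Show'].dropna().dt.quarter

# Count per quarter, keeping quarters with no additions as 0
quarter_counts = tv_quarters.value_counts().reindex([1, 2, 3, 4], fill_value=0)

print(quarter_counts)
```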

7: Countries Increasing Content Presence on Netflix Over the Years

python

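# Split co-productions such as "United States, India" so each country is counted separately

country_rows = df.dropna(subset=['country']).assign(country=lambda d: d['country'].str.split(', ')).explode('country')

# Count number of titles per country per year

country_year_counts = country_rows.groupby(['country', 'release_year']).size().reset_index(name='count')

# Keep the country-year rows where the count rose compared with that country's previous release year in the data

increased_content_countries = country_year_counts[country_year_counts.groupby('country')['count'].diff().fillna(0) > 0]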

# Display countries that have increased their content presence

print("Countries with increased content presence over the years:")

print(increased_content_countries)
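A year-over-year difference flags every single rise, even for countries whose output is flat overall. One possible summary of the longer trend is the least-squares slope of title count against release year for each country. The sketch below uses invented counts, and `yearly_slope` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical per-country yearly counts (not real data)
counts = pd.DataFrame({
    'country':      ['India'] * 4 + ['Japan'] * 4,
    'release_year': [2016, 2017, 2018, 2019] * 2,
    'count':        [10, 14, 19, 25, 8, 9, 7, 8],
})

def yearly_slope(group):
    # Slope of a degree-1 least-squares fit: extra titles per release year
    return np.polyfit(group['release_year'], group['count'], 1)[0]

# One slope per country, largest growth first
slopes = pd.Series(
    {country: yearly_slope(group) for country, group in counts.groupby('country')}
).sort_values(ascending=False)

print(slopes)
```

A positive slope suggests growing presence. Countries with only a handful of years in the data should probably be filtered out first, since a line through two or three points says little.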

Explanation of Commands

5: This code calculates and plots the average duration of movies released on Netflix over the years. It converts the duration column to numeric values and groups by release_year.

6: This code analyzes when new shows are added to Netflix by extracting the month from date_added, counting entries per month, and plotting a bar graph.

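7: This code counts titles per country per release year and keeps the country-year rows where the count rose relative to that country's previous release year in the data. A single rise does not establish a sustained trend, so the output is a starting point rather than a final answer.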

These commands will help you extract insights from the dataset effectively.

Data Cleaning and Feature Engineering:

8. Are there any missing values in the data? If so, how are they distributed across different features, and how might we handle them?

9. The "duration" feature is currently in text format (e.g., "90 min"). How can we convert this into a numerical format (minutes) for further analysis?

10. Can additional features be created from the existing data? For instance, could we create a new feature separating "listed_in" genres into individual categories?

8: Check for Missing Values and Their Distribution

python

import pandas as pd

# Load the dataset

df = pd.read_csv('netflix_titles.csv')

# Check for missing values

missing_values = df.isnull().sum()

# Display missing values distribution

print("Missing values in each feature:")

print(missing_values[missing_values > 0])

# Handling missing values (example strategies)

# Drop rows with missing values

df_cleaned = df.dropna()

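# Alternatively, fill missing values with a placeholder or mean/median/mode

# (assign the result back; calling fillna with inplace=True on a single column is unreliable in recent pandas)

# df['column_name'] = df['column_name'].fillna('placeholder')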
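Dropping every row with any missing value can discard a large part of the data, so it may help to look at the percentage missing per column and choose a strategy for each column. This is a minimal sketch on hypothetical rows; the choices shown (placeholder, mode, drop) are examples, not recommendations for the real dataset.

```python
import pandas as pd

# Hypothetical rows with gaps in several columns
sample = pd.DataFrame({
    'title':      ['A', 'B', 'C', 'D'],
    'director':   ['X', None, None, 'Y'],
    'country':    ['India', None, 'Japan', 'India'],
    'date_added': ['January 1, 2020', 'May 2, 2021', None, 'June 3, 2019'],
})

# Percentage of missing values per column
missing_pct = sample.isnull().mean().mul(100).round(1)
print(missing_pct)

cleaned = sample.copy()
# Free-text column with many gaps: use an explicit placeholder
cleaned['director'] = cleaned['director'].fillna('Unknown')
# Categorical column with few gaps: fill with the most frequent value
cleaned['country'] = cleaned['country'].fillna(cleaned['country'].mode()[0])
# Column needed for time analysis: drop the few rows that lack it
cleaned = cleaned.dropna(subset=['date_added'])

print(cleaned)
```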

9: Convert Duration Feature to Numerical Format

python

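# Convert duration to numeric values (in minutes); only movie durations look like "90 min"

# TV show durations such as "2 Seasons" and missing values become NaN

df['duration_minutes'] = pd.to_numeric(df['duration'].where(df['type'] == 'Movie').str.replace(' min', '', regex=False), errors='coerce')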

# Display the updated DataFrame to verify conversion

print(df[['title', 'duration', 'duration_minutes']].head())
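Since the duration column mixes minutes for movies with season counts for TV shows, one option is to parse the number and the unit together and route each into its own column. The sketch below uses hypothetical rows, and the `seasons` column name is an assumption.

```python
import pandas as pd

# Hypothetical rows covering both duration formats and a missing value
sample = pd.DataFrame({
    'type':     ['Movie', 'TV Show', 'Movie', 'TV Show'],
    'duration': ['90 min', '2 Seasons', None, '1 Season'],
})

# Capture the number and the unit; "Season" also matches "Seasons"
parts = sample['duration'].str.extract(r'(?P<value>\d+)\s*(?P<unit>min|Season)')
value = pd.to_numeric(parts['value'])

# Keep the number only where the unit matches; everything else stays NaN
sample['duration_minutes'] = value.where(parts['unit'] == 'min')
sample['seasons'] = value.where(parts['unit'] == 'Season')

print(sample)
```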

10: Create Additional Features from Existing Data

python

# Split 'listed_in' genres into individual categories

df['genres'] = df['listed_in'].str.split(', ')

# Explode the genres into separate rows (if needed for analysis)

df_exploded = df.explode('genres')

# Display the updated DataFrame with new genre feature

print(df_exploded[['title', 'genres']].head())
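Exploding gives one row per genre. When one row per title is preferred, the genres can instead be turned into indicator columns with `Series.str.get_dummies`. The sketch below shows this on hypothetical rows.

```python
import pandas as pd

# Hypothetical rows (not real data)
sample = pd.DataFrame({
    'title':     ['A', 'B', 'C'],
    'listed_in': ['Dramas, Thrillers', 'Comedies', 'Comedies, Dramas'],
})

# One 0/1 column per genre, split on the ", " separator
genre_flags = sample['listed_in'].str.get_dummies(sep=', ')

# Attach the indicator columns to the titles
wide = pd.concat([sample[['title']], genre_flags], axis=1)

print(wide)
```

The indicator form keeps the row count unchanged, which makes it easier to compute per-genre totals with `genre_flags.sum()` without counting a title more than once per genre.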

Explanation of Commands

8: This code checks for missing values in the dataset and displays the count of missing entries for each feature. It also provides examples of how to handle these missing values by either dropping them or filling them with placeholders.

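9: This code converts the duration feature from text format (e.g., "90 min") to a numerical format (in minutes) for easier analysis. The conversion applies to movies only; TV show durations are season counts (e.g., "2 Seasons") and need separate handling.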

10: This code creates a new feature that splits the listed_in column into individual genre categories. It uses the explode function to create separate rows for each genre if needed.

These commands will help you effectively analyze and manipulate the dataset based on your queries.
