Contd. for Netflix EDA
• Q. 1) For 'House of Cards', what is the Show ID and who is the Director of this show?
To retrieve the information for "House of Cards," specifically the Show ID and Director, you can use the following commands in Python with pandas:
Commands to Retrieve Information
Import the necessary libraries:
python
import pandas as pd
Load the dataset:
python
df = pd.read_csv('netflix_titles.csv')
Filter for "House of Cards":
python
house_of_cards_info = df[df['title'] == 'House of Cards']
Retrieve Show ID and Director:
python
show_id = house_of_cards_info['show_id'].values[0]
director = house_of_cards_info['director'].values[0]
Print the results:
python
print(f"Show ID: {show_id}, Director: {director}")
Explanation of Commands
The first command imports the pandas library, which is essential for data manipulation.
The second command loads the dataset from a CSV file into a DataFrame.
The third command filters the DataFrame to find the row corresponding to "House of Cards."
The fourth command extracts the Show ID and Director from the filtered DataFrame.
The final command prints out the retrieved information.
This sequence of commands will allow you to efficiently find and display the Show ID and Director for "House of Cards."
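One caveat: `.values[0]` raises an `IndexError` when the filter matches no rows, and it silently ignores any additional matches. Below is a minimal sketch of a more defensive lookup. It uses a tiny made-up DataFrame in place of `netflix_titles.csv`; the director names are placeholders and the helper name `find_show` is hypothetical.

```python
import pandas as pd

# Tiny stand-in for netflix_titles.csv; all values are placeholders, not real data.
df = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["House of Cards", "Example Movie", "Another Show"],
    "director": ["Director A", None, "Director B"],
})

def find_show(frame, title):
    """Return (show_id, director) pairs for every row whose title matches, ignoring case."""
    matches = frame[frame["title"].str.lower() == title.lower()]
    return list(zip(matches["show_id"], matches["director"]))

result = find_show(df, "house of cards")       # one match
missing = find_show(df, "Not In The Dataset")  # empty list instead of an IndexError

print(result)
print(missing)
```

Returning a list keeps the "no match" and "several matches" cases explicit, so the caller decides how to handle them.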
• Q. 2) In which year was the highest number of TV Shows & Movies released? Show with Bar Graph.
To determine the year with the highest number of TV shows and movies released from the dataset, you can follow these steps using Python with pandas and matplotlib for visualization. Here’s how you can do it:
Commands to Retrieve Information and Create a Bar Graph
Import the necessary libraries:
python
import pandas as pd
import matplotlib.pyplot as plt
Load the dataset:
python
df = pd.read_csv('netflix_titles.csv')
Group by release year and count the number of titles:
python
release_counts = df['release_year'].value_counts().sort_index()
Identify the year with the highest number of releases:
python
max_year = release_counts.idxmax()
max_count = release_counts.max()
print(f"The year with the highest number of releases is {max_year} with {max_count} titles.")
Create a bar graph to visualize the data:
python
plt.figure(figsize=(12, 6))
release_counts.plot(kind='bar', color='skyblue')
plt.title('Number of TV Shows & Movies Released by Year')
plt.xlabel('Year')
plt.ylabel('Number of Releases')
plt.xticks(rotation=45)
plt.axhline(y=max_count, color='r', linestyle='--', label=f'Max Releases: {max_count} in {max_year}')
plt.legend()
plt.tight_layout()
plt.show()
Explanation of Commands
The first command imports pandas for data manipulation and matplotlib for plotting.
The second command loads the dataset from a CSV file.
The third command counts how many titles were released each year and sorts them by year.
The fourth command identifies which year had the most releases and prints that information.
The final commands create a bar graph showing the number of releases per year, highlighting the maximum count with a dashed line.
This sequence of commands will help you find out which year had the highest number of TV shows and movies released, along with a visual representation of the data.
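Note that `idxmax()` returns only the first label that reaches the maximum, so a tie between two years would go unnoticed. The sketch below uses invented release years (not the real dataset) to show how the tied years could be listed as well.

```python
import pandas as pd

# Invented release years; 2018 and 2019 tie for the most titles.
df = pd.DataFrame({"release_year": [2017, 2018, 2018, 2019, 2019, 2020]})

release_counts = df["release_year"].value_counts().sort_index()

max_year = release_counts.idxmax()   # first year (in index order) that reaches the maximum
max_count = release_counts.max()

# Every year whose count equals the maximum, in case of a tie.
tied_years = release_counts[release_counts == max_count].index.tolist()

print(max_year, max_count, tied_years)
```

Because the counts are sorted by year first, a tie resolves to the earliest year; `tied_years` makes the full picture visible.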
• Q. 3) How many Movies & TV Shows are in the dataset? Show with Bar Graph.
To determine how many movies and TV shows are in the dataset and visualize this information with a bar graph, you can follow these steps using Python with pandas and matplotlib. Here’s how to do it:
Commands to Retrieve Information and Create a Bar Graph
Import the necessary libraries:
python
import pandas as pd
import matplotlib.pyplot as plt
Load the dataset:
python
df = pd.read_csv('netflix_titles.csv')
Count the number of Movies and TV Shows:
python
counts = df['type'].value_counts()
Create a bar graph to visualize the data:
python
plt.figure(figsize=(8, 5))
counts.plot(kind='bar', color=['blue', 'orange'])
plt.title('Number of Movies & TV Shows in the Dataset')
plt.xlabel('Type')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()
Explanation of Commands
The first command imports the pandas library for data manipulation and matplotlib for plotting.
The second command loads the dataset from a CSV file into a DataFrame.
The third command counts how many entries are categorized as 'Movie' and 'TV Show' using value_counts().
The final commands create a bar graph showing the count of movies and TV shows in the dataset.
This sequence of commands will help you find out how many movies and TV shows are present in the dataset and visualize that information effectively with a bar graph.
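If the share of each type is of interest alongside the raw counts, `value_counts(normalize=True)` returns proportions instead. A small sketch on invented rows (the data below is an assumption, not taken from the dataset):

```python
import pandas as pd

# Invented sample: three movies and one TV show.
df = pd.DataFrame({"type": ["Movie", "TV Show", "Movie", "Movie"]})

counts = df["type"].value_counts()                 # absolute counts, largest first
shares = df["type"].value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```

Since `value_counts()` orders the result from most to least frequent, the colours passed to a bar plot are assigned in that same order.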
• Q. 4) Show all the Movies that were released in the year 2000.
• Q. 5) Show only the Titles of all TV Shows that were released in India only.
• Q. 6) Show Top 10 Directors who gave the highest number of TV Shows & Movies to Netflix.
To address the queries regarding the Netflix dataset, you can use the following commands in Python with pandas to retrieve the required information.
4: Show all the Movies that were released in the year 2000
python
import pandas as pd
# Load the dataset
df = pd.read_csv('netflix_titles.csv')
# Filter for movies released in the year 2000
movies_2000 = df[(df['type'] == 'Movie') & (df['release_year'] == 2000)]
# Display the results
print(movies_2000[['title', 'release_year']])
5: Show only the Titles of all TV Shows that were released in India only
python
# Filter for TV shows released in India
tv_shows_india = df[(df['type'] == 'TV Show') & (df['country'] == 'India')]
# Display only the titles
print(tv_shows_india['title'])
6: Show Top 10 Directors who gave the highest number of TV Shows & Movies to Netflix
python
# Count the number of titles per director
top_directors = df['director'].value_counts().head(10)
# Display the top directors
print(top_directors)
Explanation of Commands
For 4, we filter the DataFrame for entries where the type is "Movie" and the release year is 2000, then display the relevant columns.
For 5, we filter for entries where the type is "TV Show" and the country is exactly "India" (titles that list India alongside other countries are excluded), displaying only the title column.
For 6, we use value_counts() on the director column to count how many titles each director has contributed, then retrieve the top 10.
These commands will help you extract and display the required information from the dataset effectively.
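The `country` and `director` columns can hold several comma-separated values in one cell, which affects both the "India only" filter and the per-director counts. The sketch below, on invented rows, contrasts an exact match with a substring match and shows one way to count co-directors individually. The comma separator is an assumption about how the real file is formatted.

```python
import pandas as pd

# Invented rows that mimic comma-separated 'country' and 'director' cells.
df = pd.DataFrame({
    "title": ["Show A", "Show B", "Movie C", "Movie D"],
    "type": ["TV Show", "TV Show", "Movie", "Movie"],
    "country": ["India", "India, United States", "India", None],
    "director": ["Dir X", "Dir X, Dir Y", "Dir Y", None],
})

is_tv = df["type"] == "TV Show"

# Exact match: TV shows produced in India alone.
india_only = df[is_tv & (df["country"] == "India")]["title"].tolist()

# Substring match: TV shows where India is at least one of the countries.
india_any = df[is_tv & df["country"].str.contains("India", na=False)]["title"].tolist()

# Split co-directed titles so that each director is counted once per title.
per_director = (
    df["director"].dropna().str.split(",").explode().str.strip().value_counts()
)

print(india_only, india_any)
print(per_director.head(10))
```

Whether a co-directed title should count for each director is a judgment call; the plain `value_counts()` treats each distinct combination of names as its own entry.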
• Q. 7) Show all the Records, where "Category is Movie and Type is Comedies" or "Country is United Kingdom".
• Q. 8) In how many movies/shows was Tom Cruise cast?
• Q. 9) What are the different Ratings defined by Netflix?
Here are the commands to retrieve the required information for each of your queries regarding the Netflix dataset:
7: Show all the Records where "Category is Movie and Type is Comedies" or "Country is United Kingdom"
python
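# Filter for (type is Movie AND genre includes Comedies) OR country is United Kingdom
# The extra parentheses make the grouping explicit; na=False treats missing genres as no match
filtered_records = df[((df['type'] == 'Movie') & (df['listed_in'].str.contains('Comedies', na=False))) | (df['country'] == 'United Kingdom')]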
# Display the results
print(filtered_records)
8: In how many movies/shows was Tom Cruise cast?
python
# Count the number of movies/shows where Tom Cruise is in the cast
tom_cruise_count = df[df['cast'].str.contains('Tom Cruise', na=False)].shape[0]
# Display the count
print(f"Tom Cruise was cast in {tom_cruise_count} movies/shows.")
9: What are the different Ratings defined by Netflix?
python
# Get unique ratings defined by Netflix
unique_ratings = df['rating'].unique()
# Display the unique ratings
print("Different Ratings defined by Netflix:")
print(unique_ratings)
Explanation of Commands
For 7, we filter the DataFrame for entries where the type is "Movie" and the listed_in column contains "Comedies," or where the country is "United Kingdom." Because & binds more tightly than | in Python, the Movie-and-Comedies condition is evaluated first.
For 8, we check for entries in the cast column that contain "Tom Cruise" and count them.
For 9, we retrieve unique values from the rating column to see all different ratings defined by Netflix.
These commands will help you extract and display the required information from the dataset effectively.
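Two details in these filters are easy to get wrong: how `&` and `|` group, and how missing values behave in `str.contains`. The following sketch uses invented rows and placeholder actor names to illustrate both, naming each condition before combining them.

```python
import pandas as pd

# Invented rows; actor names and ratings are placeholders.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "listed_in": ["Comedies, Dramas", "Comedies", "Dramas", "Docuseries"],
    "country": ["India", "United Kingdom", None, "Canada"],
    "cast": ["Actor One, Actor Two", None, "Actor Two", "Actor One"],
    "rating": ["PG", "TV-14", None, "PG"],
})

# Naming each condition avoids relying on operator precedence.
is_comedy_movie = (df["type"] == "Movie") & df["listed_in"].str.contains("Comedies", na=False)
is_uk = df["country"] == "United Kingdom"
selected = df[is_comedy_movie | is_uk]["title"].tolist()

# na=False treats a missing cast list as "no match" instead of producing NaN.
actor_two_count = int(df["cast"].str.contains("Actor Two", na=False).sum())

# dropna() keeps NaN out of the list of distinct ratings.
ratings = df["rating"].dropna().unique().tolist()

print(selected, actor_two_count, ratings)
```

Without `na=False`, a missing cell yields NaN in the mask, and indexing a DataFrame with such a mask raises an error.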
▪ Q. 9.1) How many Movies got the 'TV-14' rating in Canada?
▪ Q. 9.2) How many TV Shows got the 'R' rating after year 2018?
9.1: How many Movies got the 'TV-14' rating in Canada?
python
# Count the number of movies with 'TV-14' rating in Canada
tv14_movies_canada = df[(df['rating'] == 'TV-14') & (df['country'] == 'Canada') & (df['type'] == 'Movie')]
# Display the count
count_tv14_movies_canada = tv14_movies_canada.shape[0]
print(f"Number of Movies with 'TV-14' rating in Canada: {count_tv14_movies_canada}")
9.2: How many TV Shows got the 'R' rating after the year 2018?
python
# Count the number of TV shows with 'R' rating released after 2018
r_tv_shows_after_2018 = df[(df['rating'] == 'R') & (df['release_year'] > 2018) & (df['type'] == 'TV Show')]
# Display the count
count_r_tv_shows_after_2018 = r_tv_shows_after_2018.shape[0]
print(f"Number of TV Shows with 'R' rating after 2018: {count_r_tv_shows_after_2018}")
Explanation of Commands
For 9.1, we filter the DataFrame for entries where the rating is "TV-14," the country is "Canada," and the type is "Movie." We then count these entries.
For 9.2, we filter for entries where the rating is "R," the release year is greater than 2018, and the type is "TV Show." We then count these entries as well.
These commands will help you extract and display the required information from the dataset effectively.
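Since a boolean mask is a Series of True/False values, summing it gives the same count as filtering and reading `.shape[0]`. A sketch on invented rows (not the real dataset):

```python
import pandas as pd

# Invented rows covering both questions.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "rating": ["TV-14", "TV-14", "R", "R", "R"],
    "country": ["Canada", "Canada, United States", "Canada", "India", "Canada"],
    "release_year": [2015, 2019, 2019, 2018, 2020],
})

tv14_movies_canada = (
    (df["type"] == "Movie") & (df["rating"] == "TV-14") & (df["country"] == "Canada")
)
r_shows_after_2018 = (
    (df["type"] == "TV Show") & (df["rating"] == "R") & (df["release_year"] > 2018)
)

# True counts as 1 and False as 0, so sum() is the number of matching rows.
count_tv14 = int(tv14_movies_canada.sum())
count_r = int(r_shows_after_2018.sum())

print(count_tv14, count_r)
```

In this sample the row with "Canada, United States" is not counted, because `== 'Canada'` is an exact match; and a show from 2018 itself is excluded by the strict `>` comparison.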
• Q. 10) What is the maximum duration of a Movie/Show on Netflix?
• Q. 11) Which individual country has the Highest No. of TV Shows?
• Q. 12) How can we sort the dataset by Year?
• Q. 13) Find all the instances where: Category is 'Movie' and Type is 'Dramas' or Category is 'TV Show' & Type is 'Kids' TV'.
10: What is the maximum duration of a Movie/Show on Netflix?
python
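# Extract the number from duration; Movies use minutes ("90 min") and TV Shows use seasons ("2 Seasons"),
# so stripping ' min' and casting to int would fail on the TV Show rows
df['duration_value'] = pd.to_numeric(df['duration'].str.extract(r'(\d+)', expand=False), errors='coerce')
# Find the maximum duration separately for each type, since the units differ
max_duration = df.groupby('type')['duration_value'].max()
# Display the maximum duration per type
print("Maximum duration by type (minutes for Movies, seasons for TV Shows):")
print(max_duration)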
11: Which individual country has the highest number of TV Shows?
python
# Count the number of TV Shows per country
tv_show_counts = df[df['type'] == 'TV Show']['country'].value_counts()
# Get the country with the highest number of TV Shows
highest_tv_shows_country = tv_show_counts.idxmax()
highest_tv_shows_count = tv_show_counts.max()
# Display the results
print(f"The country with the highest number of TV Shows is {highest_tv_shows_country} with {highest_tv_shows_count} shows.")
12: How can we sort the dataset by Year?
python
# Sort the dataset by release year
sorted_df = df.sort_values(by='release_year')
# Display the sorted dataset (optional)
print(sorted_df.head()) # Show first few rows of sorted dataset
13: Find all instances where Category is 'Movie' and Type is 'Dramas', or Category is 'TV Show' and Type is 'Kids' TV'.
python
# Filter for Movies that are Dramas, or TV Shows that are Kids' TV
filtered_instances = df[((df['type'] == 'Movie') & (df['listed_in'].str.contains('Dramas', na=False))) |
                        ((df['type'] == 'TV Show') & (df['listed_in'].str.contains("Kids' TV", na=False)))]
# Display the filtered results
print(filtered_instances)
Explanation of Commands
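For 10, we convert the duration column to numeric values, then find and display the maximum duration. Note that Movies are measured in minutes and TV Shows in seasons, so the two should be compared separately.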
For 11, we count how many TV shows exist for each country and identify which country has the highest count.
For 12, we sort the DataFrame by release_year and optionally display the first few rows of the sorted data.
For 13, we filter for entries where either condition (Movie and Dramas, or TV Show and Kids' TV) is met, then display those records.
These commands will help you extract and display the required information from the dataset effectively.
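The duration column mixes two units, which is worth seeing on a small example. The sketch below uses invented rows to parse the leading number, take the maximum per type, and sort by year with the newest titles first. The column name `duration_value` is a hypothetical helper column, not part of the dataset.

```python
import pandas as pd

# Invented rows: movies in minutes, shows in seasons, one missing duration.
df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Show C", "Show D", "Movie E"],
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "142 min", "1 Season", "4 Seasons", None],
    "release_year": [2001, 2019, 2015, 2020, 2010],
})

# Pull out the leading number; rows without a duration become NaN.
df["duration_value"] = pd.to_numeric(
    df["duration"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

# Minutes and seasons are not comparable, so take the maximum per type.
max_by_type = df.groupby("type")["duration_value"].max()

# ascending=False puts the most recent release year first.
newest_first = df.sort_values(by="release_year", ascending=False)["title"].tolist()

print(max_by_type)
print(newest_first)
```

Grouping by type avoids reporting, say, "142" as a single maximum when one value means minutes and another means seasons.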