Netflix Exploratory Data Analysis

About the Questionnaire

• This notebook is for EDA practice on a given Netflix dataset.

• It has a good set of questions covering the essentials of data analysis.

• This notebook can be considered good practice/a challenge for beginner and intermediate learners.

• Some questions are provided below; the notebook also includes multiple solutions to them.

 Dataset Link: 

netflix.csv


1) Is there any duplicate record in this dataset? If yes, remove the duplicate records.


To determine whether there are any duplicate records in the dataset, we can check for rows that share key identifying columns such as title and release_year.

Steps to Identify and Remove Duplicates

1. Load the Dataset: Import the dataset into a data manipulation tool or programming environment (e.g., Python with pandas).

2. Check for Duplicates: Use functions to identify duplicate rows based on key columns.

3. Remove Duplicates: Keep the first occurrence and remove subsequent duplicates.


Code in Python


Here’s a sample code snippet using Python's pandas library to achieve this:


python

import pandas as pd

# Load the dataset
df = pd.read_csv('netflix_titles.csv')

# Check for duplicates based on 'title' and 'release_year'
duplicates = df[df.duplicated(subset=['title', 'release_year'], keep=False)]

# Print duplicate records
print("Duplicate Records:")
print(duplicates)

# Remove duplicates, keeping the first occurrence
df_cleaned = df.drop_duplicates(subset=['title', 'release_year'], keep='first')

# Save the cleaned dataset
df_cleaned.to_csv('netflix_titles_cleaned.csv', index=False)


Explanation of the Code

- Loading the Dataset: The dataset is loaded into a DataFrame.

- Identifying Duplicates: The `duplicated()` function checks for duplicates in specified columns.

- Removing Duplicates: The `drop_duplicates()` function removes duplicate entries while retaining the first occurrence.

- Saving Cleaned Data: The cleaned DataFrame is saved back to a CSV file.
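
As a quick sanity check (not part of the walkthrough above), you can count how many duplicated rows exist before and after cleaning. This is a minimal sketch, assuming the same df and df_cleaned DataFrames created in the snippet above:

python

# Count rows flagged as duplicates on 'title' and 'release_year'
# (sketch; assumes df and df_cleaned from the snippet above)
before = df.duplicated(subset=['title', 'release_year']).sum()
after = df_cleaned.duplicated(subset=['title', 'release_year']).sum()

print(f"Duplicate rows before cleaning: {before}")
print(f"Duplicate rows after cleaning: {after}")  # expected to be 0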

Conclusion

By running this code, you will be able to identify and remove any duplicate records from your dataset effectively. If you need further assistance with specific entries or additional operations, feel free to ask!

2) Is there any null value present in any column? Show with a heat-map.

To check for null values in the dataset and visualize them with a heatmap, you can use Python with libraries like pandas and seaborn. Below are the steps and code to accomplish this.

Steps to Identify Null Values and Create a Heatmap

1. Load the Dataset: Import the dataset into a pandas DataFrame.

2. Check for Null Values: Use pandas to identify null values in each column.

3. Visualize with a Heatmap: Use seaborn to create a heatmap showing the presence of null values.

Example Code

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('netflix_titles.csv')

# Check for null values
null_values = df.isnull()

# Create a heatmap to visualize null values
plt.figure(figsize=(10, 6))
sns.heatmap(null_values, cmap='viridis', cbar=False, yticklabels=False)
plt.title('Heatmap of Null Values in Dataset')
plt.xlabel('Columns')
plt.ylabel('Rows')
plt.show()

Explanation of the Code

- Load the Dataset: The CSV file is read into a DataFrame using `pd.read_csv()`.

- Check for Null Values: The `isnull()` function creates a DataFrame of the same shape as df, where each entry is True if the corresponding value is null and False otherwise.

- Create a Heatmap: The heatmap is generated using seaborn's `heatmap()` function, which visualizes the null values. The 'viridis' color map is used for better visibility, and y-tick labels are turned off for clarity.

This code will help you identify any null values present in the dataset and visualize them effectively using a heatmap.
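
Alongside the heatmap, a per-column count of missing values gives exact numbers rather than a visual impression. A minimal sketch, assuming the same df loaded above:

python

# Count null values in each column of the DataFrame
null_counts = df.isnull().sum()

# Show only the columns that actually contain nulls, largest first
print(null_counts[null_counts > 0].sort_values(ascending=False))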


 
