IMDB - Dataset Analysis - Basic
Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)
Project Brief for Self-Coders
Here you´ll have the opportunity to code major parts of Project 1 on your own. If you need any help or inspiration, have a look at the Videos or the Jupyter Notebook with the full code.
Keep in mind that it´s all about getting the right results/conclusions. It´s not about finding the identical code. Things can be coded in many different ways. Even if you come to the same conclusions, it´s very unlikely that we have the very same code.
Data Import and first Inspection
Import the movies dataset from the CSV file "movies_complete.csv". Inspect the data.
Some additional information on Features/Columns:
id: The ID of the movie (clear/unique identifier).
title: The Official Title of the movie.
tagline: The tagline of the movie.
release_date: Theatrical Release Date of the movie.
genres: Genres associated with the movie.
belongs_to_collection: Gives information on the movie series/franchise the particular film belongs to.
original_language: The language in which the movie was originally shot in.
budget_musd: The budget of the movie in million dollars.
revenue_musd: The total revenue of the movie in million dollars.
production_companies: Production companies involved with the making of the movie.
production_countries: Countries where the movie was shot/produced in.
vote_count: The number of votes by users, as counted by TMDB.
vote_average: The average rating of the movie.
popularity: The Popularity Score assigned by TMDB.
runtime: The runtime of the movie in minutes.
overview: A brief blurb of the movie.
spoken_languages: Spoken languages in the film.
poster_path: The URL of the poster image.
cast: (Main) Actors appearing in the movie.
cast_size: number of Actors appearing in the movie.
director: Director of the movie.
crew_size: Size of the film crew (incl. director, excl. actors).
Import Necessary Libraries for this Task
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns=None # To show all columns
pd.options.display.float_format = '{:.2f}'.format
Read the Movie Data
df = pd.read_csv('movies_complete.csv',parse_dates=['release_date']) #parse date will convert release_date column into Datetime.
df

44691 rows × 22 columns
Getting Info About Data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44691 entries, 0 to 44690
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 44691 non-null int64
1 title 44691 non-null object
2 tagline 20284 non-null object
3 release_date 44657 non-null datetime64[ns]
4 genres 42586 non-null object
5 belongs_to_collection 4463 non-null object
6 original_language 44681 non-null object
7 budget_musd 8854 non-null float64
8 revenue_musd 7385 non-null float64
9 production_companies 33356 non-null object
10 production_countries 38835 non-null object
11 vote_count 44691 non-null float64
12 vote_average 42077 non-null float64
13 popularity 44691 non-null float64
14 runtime 43179 non-null float64
15 overview 43740 non-null object
16 spoken_languages 41094 non-null object
17 poster_path 44467 non-null object
18 cast 42502 non-null object
19 cast_size 44691 non-null int64
20 crew_size 44691 non-null int64
21 director 43960 non-null object
dtypes: datetime64[ns](1), float64(6), int64(3), object(12)
memory usage: 7.5+ MB
Statistical Summary
df.describe()
count
44691.00
8854.00
7385.00
44691.00
42077.00
44691.00
43179.00
44691.00
44691.00
mean
107186.24
21.67
68.97
111.65
6.00
2.96
97.57
12.48
10.31
std
111806.36
34.36
146.61
495.32
1.28
6.04
34.65
12.12
15.89
min
2.00
0.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
25%
26033.50
2.00
2.41
3.00
5.30
0.40
86.00
6.00
2.00
50%
59110.00
8.20
16.87
10.00
6.10
1.15
95.00
10.00
6.00
75%
154251.00
25.00
67.64
35.00
6.80
3.77
107.00
15.00
12.00
max
469172.00
380.00
2787.97
14075.00
10.00
547.49
1256.00
313.00
435.00
df.hist(figsize=(20,12),bins=100)

The best and the worst movies...
Filter the Dataset and find the best/worst n Movies with the
Highest Revenue
Highest Budget
Highest Profit (=Revenue - Budget)
Lowest Profit (=Revenue - Budget)
Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10)
Highest number of Votes
Highest Rating (only movies with 10 or more Ratings)
Lowest Rating (only movies with 10 or more Ratings)
Highest Popularity
The Best and Worst Movies ever
We will try to filter our data based on criteria , that is responsible to determine the best and worst movies ever. We are also going to import HTML , as we will convert our analysis to a beautiful web page. To do this, all you need to do is to import HTML.
from IPython.display import HTML # we are using this to try to present our data in good looking website format
Filtering Columns responsible to determine best and worst movies
edf = df[
['poster_path','title','budget_musd','revenue_musd','vote_count','vote_average','popularity']
].copy()
Create a column 'profit_musd' (revenue - budget)
edf['profit_musd'] = edf['revenue_musd'] - edf['budget_musd']
edf.head()

Create a column 'return_musd' (revenue/budget)
edf['return_musd'] = edf['revenue_musd']/edf['budget_musd']
edf.head()

Rename Columns in Something Meaningful to present it later in Graphs
edf.columns = ['','Title','Budget','Revenue','Votes','Average Rating','Popularity','Profit','Return']
edf.head()

Set Title as Index
edf.set_index('Title',inplace=True)
Convert Our DataFrame into HTML (Poster , Title , Popularity')
subset = df[['poster_path','title','popularity']].head()
HTML(subset.to_html(escape=False)) # convert dataframe into HTML
Highest Rated Movies
HTML(edf.sort_values(by='Average Rating',ascending=False).head().to_html(escape=False))
Now this approach does not make sense as you can see there is only one vote and it is not sufficient enough to judge on rating. So let us find median of Votes and consider it to be the minimum number of votes to be given to any movie.
minimum_votes = edf['Votes'].median()
minimum_votes
edf.loc[edf['Votes']>=minimum_votes].sort_values(by='Average Rating',ascending=False).head()

Movies With Highest ROI
Here also we will keep above approach , as there are few movies with close to zero budget , we must exclude them and so let us find the median of budget.
minimum_budget = edf['Budget'].median()
minimum_budget
edf.loc[edf['Budget']>=minimum_budget].sort_values(by='Return',ascending=False).head()

Before moving ahead , let us fill all na values of Budget and Votes with 0.
edf['Budget'].fillna(0,inplace=True)
edf['Votes'].fillna(0,inplace=True)
Create a Function to find Best and Worst Movies
def find_movies(n,by,ascending=False,min_bud=minimum_budget,min_votes=minimum_votes):
return edf.loc[(edf['Budget']>=min_bud) & (edf['Votes']>=min_votes)].sort_values(by=by,ascending=ascending).head(n)
Top 5 - Highest Revenue
HTML(find_movies(5,by='Revenue').to_html(escape=False))

Top 5 - Highest Budget
HTML(find_movies(n=5,by='Budget').to_html(escape=False))

Top 5 - Highest Profit
HTML(find_movies(5,'Profit').to_html(escape=False))

Top 5 - Highest ROI
HTML(find_movies(5,by='Return').to_html(escape=False))

Top 5 - Lowest Profit
HTML(find_movies(5,by='Profit',ascending=True).to_html(escape=False))

Top 5 - Most Popular
HTML(find_movies(5,'Popularity').to_html(escape=False))

Find Your Next Movie
Science Fiction Action Movie With Bruce Willis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns=None # To show all columns
pd.options.display.float_format = '{:.2f}'.format
df = pd.read_csv('movies_complete.csv',parse_dates=['release_date']) #parse date will convert release_date column into Datetime.
df.head()

Filtering Genres (Science Fiction and Action)
filter_genres = df['genres'].str.contains('Science Fiction') & df['genres'].str.contains('Action')
Filtering Bruce Willis Movies
# Filtering Bruce Willis Movies
filter_cast = df['cast'].str.contains('Bruce Willis')
Filtering
edf = df.loc[filter_genres & filter_cast].sort_values(by='vote_average',ascending=False).head(5)
edf = edf[['title','poster_path','vote_average']]
HTML(edf.to_html(escape=False))
Movies With Uma Thurman and Quentin Tarantino
filter_cast = df['cast'].str.contains('Uma Thurman') # Actor/ Cast
filter_director = df['director'] == 'Quentin Tarantino' # Director
fdf = df.loc[(filter_cast) & (filter_director)].sort_values(by='runtime')
HTML(fdf[['title','poster_path','vote_average','runtime']].to_html(escape=False))
Most Successful Pixar Movies from 2010 to 2015 (Highest Revenue)
Filtering Pixar Movies
filter_production = df.loc[df['production_companies'].str.contains('Pixar').fillna(False)]
filter_production.head()
Filtering Release Date
successful = filter_production.loc[filter_production['release_date'].between('2010-01-01','2015-12-31')]
successful.head()
Result
successful = successful.sort_values(by='revenue_musd')
HTML(successful[['title','poster_path','revenue_musd','release_date']].head().to_html(escape=False))
Action Or Thriller Movie with Original Language English with minimum rating of 7.5(Most Recent)
Filtering Genre (Action Or Thriller)
mask_genre = df['genres'].str.contains('Action') | df['genres'].str.contains('Thriller')
Filtering Language
mask_language = df['original_language']=='en'
Filtering Vote (greater than 10)
mask_vote_count = df['vote_count']>=10
Filter Average Rating
mask_rating = df['vote_average']>=7.5
Filter:
next_mov = df.loc[mask_genre & mask_language & mask_rating & mask_vote_count ].sort_values(by='release_date',ascending=False).head()
HTML(next_mov[['title','poster_path','release_date']].to_html(escape=False))
Most Common Words in Titles and Taglines
from wordcloud import WordCloud
titles =df['title'].dropna()
taglines = df['tagline'].dropna()
titles = '.'.join(titles)
taglines = '.'.join(taglines)
title_wordcloud = WordCloud(background_color='white', height=2000, width=4000,max_words=200).generate(titles)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud, interpolation= "bilinear")
plt.axis('off')y
plt.show()

tagline_wordcloud = WordCloud(background_color='white', height=2000, width=4000,max_words=200).generate(taglines)
plt.figure(figsize=(16,8))
plt.imshow(tagline_wordcloud, interpolation= "bilinear")
plt.axis('off')
plt.show()

Are Franchises More Successful ?
All Franchises
df['franchise'] = df['belongs_to_collection'].notna()
Count Franchise/Standalone Movies
df['franchise'].value_counts()
False 40228
True 4463
Name: franchise, dtype: int64
Revenue (Franchise Vs Standalone Movies)
df.groupby('franchise')['revenue_musd'].mean()
franchise
False 44.74
True 165.71
Name: revenue_musd, dtype: float64
Budget (Franchise Vs Standalone Movies)
df.groupby('franchise')['budget_musd'].mean()
franchise
False 18.05
True 38.32
Name: budget_musd, dtype: float64
Average Rating (Franchise Vs Standalone Movies)
df.groupby('franchise')['vote_average'].mean()
franchise
False 6.01
True 5.96
Name: vote_average, dtype: float64
Popularity (Franchise Vs Standalone Movies)
df.groupby('franchise')['popularity'].mean()
franchise
False 2.59
True 6.25
Name: popularity, dtype: float64
Return Of Investments (Franchise Vs Standalone Movies)
df['roi'] = df['revenue_musd']/df['budget_musd']
df.groupby('franchise')['roi'].median()
franchise
False 1.62
True 3.71
Name: roi, dtype: float64
Aggregate Functions
We will use aggregate functions to calculate all necessary info about Franchise.
franchises = df.groupby('franchise').agg({
'title':'count',
'revenue_musd':['sum','mean'],
'budget_musd':['sum','mean'],
'roi':'median',
'vote_average':'mean',
'popularity':'mean',
'vote_count':['sum','mean']
})
franchises

Most Successful Franchise ?
df['belongs_to_collection'].dropna()
0 Toy Story Collection
2 Grumpy Old Men Collection
4 Father of the Bride Collection
9 James Bond Collection
12 Balto Collection
...
44582 The Carry On Collection
44585 The Carry On Collection
44596 The Carry On Collection
44598 DC Super Hero Girls Collection
44609 Red Lotus Collection
Name: belongs_to_collection, Length: 4463, dtype: object
Largest Franchise
So we can use sort_values to get the maximum number of count of a movie.
franchises.sort_values(by=('title','count'),ascending=False)

We can also use nlargest to get the n numbers of big franchises.
franchises.nlargest(5,('title','count'))

Highest Revenue
franchises.sort_values(by=('revenue_musd','sum'),ascending=False).head(20)
count
sum
mean
sum
mean
median
mean
mean
sum
mean
belongs_to_collection
Harry Potter Collection
8
7707.37
963.42
1280.00
160.00
6.17
7.54
26.25
47866.00
5983.25
Star Wars Collection
8
7434.49
929.31
854.35
106.79
8.24
7.38
23.41
43443.00
5430.38
James Bond Collection
26
7106.97
273.35
1539.65
59.22
6.13
6.34
13.45
33392.00
1284.31
The Fast and the Furious Collection
8
5125.10
640.64
1009.00
126.12
4.94
6.66
10.80
25576.00
3197.00
Pirates of the Caribbean Collection
5
4521.58
904.32
1250.00
250.00
3.45
6.88
53.97
25080.00
5016.00
Transformers Collection
5
4366.10
873.22
965.00
193.00
5.20
6.14
14.43
15232.00
3046.40
Despicable Me Collection
6
3691.07
922.77
299.00
74.75
12.76
6.78
106.72
18248.00
3041.33
The Twilight Collection
5
3342.11
668.42
385.00
77.00
10.27
5.84
29.50
13851.00
2770.20
Ice Age Collection
5
3216.71
643.34
429.00
85.80
8.26
6.38
16.08
13219.00
2643.80
Jurassic Park Collection
4
3031.48
757.87
379.00
94.75
7.03
6.50
10.77
18435.00
4608.75
Shrek Collection
5
2955.81
738.95
535.00
133.75
5.56
6.46
12.97
11721.00
2344.20
The Hunger Games Collection
4
2944.16
736.04
490.00
122.50
6.27
6.88
54.77
26174.00
6543.50
The Hobbit Collection
3
2935.52
978.51
750.00
250.00
3.83
7.23
25.21
17944.00
5981.33
The Avengers Collection
2
2924.96
1462.48
500.00
250.00
5.96
7.35
63.63
18908.00
9454.00
The Lord of the Rings Collection
3
2916.54
972.18
266.00
88.67
11.73
8.03
30.27
24759.00
8253.00
X-Men Collection
6
2808.83
468.14
983.00
163.83
3.02
6.82
9.71
27563.00
4593.83
Avatar Collection
1
2787.97
2787.97
237.00
237.00
11.76
7.20
185.07
12114.00
12114.00
Mission: Impossible Collection
5
2778.98
555.80
650.00
130.00
4.55
6.60
16.51
14005.00
2801.00
Spider-Man Collection
3
2496.35
832.12
597.00
199.00
3.92
6.47
22.62
13517.00
4505.67
The Dark Knight Collection
3
2463.72
821.24
585.00
195.00
4.34
7.80
57.42
29043.00
9681.00
Can you do it with nlargest ???
Highest Average Revenue
franchises.sort_values(by=('revenue_musd','mean'),ascending=False).head(20)

Most Expensive Franchises (Budget)
franchises.nlargest(20,('budget_musd','sum')).head(20)

Highest Rated Franchises
franchises.loc[franchises[('vote_count','mean')]>=1000].nlargest(10,('vote_average','mean'))

Most Successful Directors
Most Number Of Movies (top 5)
df['director'].value_counts().head()
John Ford 66
Michael Curtiz 65
Werner Herzog 54
Alfred Hitchcock 53
Georges Méliès 49
Name: director, dtype: int64
Highest Revenues By Directors
df.groupby('director')['revenue_musd'].sum().sort_values(ascending=False).head()
director
Steven Spielberg 9256.62
Peter Jackson 6528.24
Michael Bay 6437.47
James Cameron 5900.61
David Yates 5334.56
Name: revenue_musd, dtype: float64
Highest Number of Franchises directed by Directors
df.loc[df['belongs_to_collection'].notna()]['director'].value_counts().head()
Gerald Thomas 25
William Beaudine 19
Ere Kokkonen 17
Kunihiko Yuyama 15
Robert Rodriguez 13
Name: director, dtype: int64
Aggregate Functions
directors = df.groupby('director').agg({
'title' : 'count',
'vote_count':'sum',
'vote_average' : 'mean'
})
directors.head()
Highest Rated Movies
directors.loc[(directors['vote_count']>10000) & (directors['title']>10)].nlargest(20,'vote_average')
director
Hayao Miyazaki
14
14700.00
7.70
Christopher Nolan
11
67344.00
7.62
Martin Scorsese
39
35541.00
7.22
Peter Jackson
13
47571.00
7.14
Joel Coen
17
18139.00
7.02
James Cameron
11
33736.00
6.93
Stanley Kubrick
16
18214.00
6.91
Steven Spielberg
33
62266.00
6.89
Danny Boyle
14
16504.00
6.87
Robert Zemeckis
19
37666.00
6.79
Terry Gilliam
14
10049.00
6.76
Tim Burton
21
36922.00
6.73
Ang Lee
14
11164.00
6.71
Antoine Fuqua
12
15519.00
6.71
Woody Allen
49
15512.00
6.69
Clint Eastwood
35
24001.00
6.69
Alfred Hitchcock
53
12772.00
6.64
Ridley Scott
24
43083.00
6.60
Kenneth Branagh
14
11275.00
6.59
Luc Besson
16
19627.00
6.53
To find succesful director in any specific genre i.e. Action
edf = df[df['genres'].notna()]
edf[edf['genres'].str.contains('Horror')].groupby('director')['revenue_musd'].sum().sort_values(ascending=False).head(10)
director
Paul W.S. Anderson 982.29
James Wan 861.31
Wes Craven 834.93
Francis Lawrence 816.23
Ridley Scott 689.00
Marc Forster 531.87
Steven Spielberg 500.10
William Friedkin 466.40
Darren Lynn Bousman 456.34
M. Night Shyamalan 375.37
Name: revenue_musd, dtype: float64
To Find Successful Actors
df['cast']
0 Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...
1 Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...
2 Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...
3 Whitney Houston|Angela Bassett|Loretta Devine|...
4 Steve Martin|Diane Keaton|Martin Short|Kimberl...
...
44686 Leila Hatami|Kourosh Tahami|Elham Korda
44687 Angel Aquino|Perry Dizon|Hazel Orencio|Joel To...
44688 Erika Eleniak|Adam Baldwin|Julie du Page|James...
44689 Iwan Mosschuchin|Nathalie Lissenko|Pavel Pavlo...
44690 NaN
Name: cast, Length: 44691, dtype: object
Set id as index
df.set_index('id',inplace=True)
Split Actor Names to a DataFrame
actors = df['cast'].str.split('|',expand=True)
actors.head()

actors.stack().reset_index()
actors = actors.stack().reset_index(drop=True,level=1) #only remove the given level by default remove all levels
actors
Convert Series to DataFrame
actors = actors.to_frame()
actors

Rename column label from 0 to 'Actor'
actors.columns=['Actor']
actors.head()
Merge Dataframe with Actors DataFrame
actors = actors.merge(df[['title','revenue_musd','vote_average','popularity']],on='id')
actors

Number of Unique Actors
actors['Actor'].nunique()
Actors with highest number of movies
actors['Actor'].value_counts().head(5)
Bess Flowers 240
Christopher Lee 148
John Wayne 125
Samuel L. Jackson 122
Michael Caine 110
Name: Actor, dtype: int64
# This is known as label aggregation
data = actors.groupby('Actor').agg(
total_revenue = ('revenue_musd','sum'),
average_rating = ('vote_average','mean'),
average_popularity = ('popularity','mean'),
movies = ('title','count'),
average_revenue=('revenue_musd','mean')
)
data

Actors who have acted in more than 10 films
data = data.loc[data['movies']>=10]
data

Highest Revenue
actors_with_largest_revenue = data.nlargest(20,'total_revenue')
Highest Number of Films
actors_with_highest_movies = data.nlargest(20,'movies')
Highest Rating
actors_highest_rating = data.nlargest(20,'average_rating')
Popularity
actor_with_popularity = data.nlargest(20,'average_popularity')
Find Common Actors in the top lists
Concat all the dataframes
top_list = pd.concat([actors_with_largest_revenue,actors_with_highest_movies,actors_highest_rating,actor_with_popularity])
top_list

Find Duplicate Records of Actors
top_list = top_list.reset_index() # Resets Index
top_list.loc[top_list.duplicated(subset='Actor')]

What are the most successful/popular genres? Has this changed over time (e.g. 80ths vs. 90ths)?
gen = df['genres'].str.split('|',expand=True)
gen = gen.stack().reset_index(drop=True,level=1).to_frame()
gen.columns = ['gen']
gen
Merge gen and original dataframe
genre = gen.merge(df[['title','revenue_musd','vote_average','popularity','release_date']],on='id')
genre
Aggregate Functions
data = genre.groupby('gen').agg(
{
'revenue_musd' : ['sum','mean'],
'vote_average' : 'mean',
'popularity' : 'mean'
}
)
Genre With Highest Revenue
data.nlargest(5,('revenue_musd','sum'))
sum
mean
mean
mean
gen
Action
201388.05
116.07
5.75
4.78
Adventure
199978.67
179.19
5.88
6.00
Comedy
166845.05
64.10
5.97
3.25
Drama
160754.36
43.83
6.17
3.03
Thriller
129724.55
69.52
5.74
4.51
Genre With Highest Rating
data.nlargest(5,('vote_average','mean'))
sum
mean
mean
mean
gen
Documentary
1449.11
6.65
6.66
0.96
Animation
67432.97
176.99
6.45
4.75
History
14902.20
50.52
6.41
3.48
Music
13370.29
50.08
6.33
2.56
War
15910.46
65.48
6.29
3.35
Genre With Highest Popularity
data.nlargest(5,('popularity','mean'))
sum
mean
mean
mean
gen
Adventure
199978.67
179.19
5.88
6.00
Fantasy
103920.15
166.01
5.93
5.36
Science Fiction
97847.96
131.52
5.48
5.00
Action
201388.05
116.07
5.75
4.78
Family
107076.78
159.10
5.93
4.77
Highest revenue generated by Genre in 90's
nineties = genre.loc[genre['release_date'].between('1900-01-01','1999-12-31')]
nineties

Popularity of Genres in Nineties
nineties.groupby('gen')['popularity'].mean().head(5)
gen
Action 3.32
Adventure 3.80
Animation 2.91
Comedy 2.63
Crime 3.22
Name: popularity, dtype: float64
Try to find Total revenue and average rating too.
Highest revenue generated by Genre in 20's
twenties = genre.loc[genre['release_date'].between('2000-01-01','2022-12-31')]
twenties

Popularity of Genres in Twenties
twenties.groupby('gen')['popularity'].mean().head(5)
gen
Action 6.23
Adventure 8.60
Animation 5.92
Comedy 3.81
Crime 5.30
Name: popularity, dtype: float64
Find Most Successful Production Companies On your own ?
Last updated