Project 1: Explanatory Data Analysis & Data Presentation (Movies Dataset)
Project Brief for Self-Coders
Here you'll have the opportunity to code major parts of Project 1 on your own. If you need help or inspiration, have a look at the videos or the Jupyter Notebook with the full code.
Keep in mind that what matters is reaching the right results and conclusions, not writing identical code. Things can be coded in many different ways. Even if you reach the same conclusions, it is very unlikely that your code will match ours exactly.
Data Import and first Inspection
Import the movies dataset from the CSV file "movies_complete.csv". Inspect the data.
Some additional information on Features/Columns:
id: The ID of the movie (clear/unique identifier).
title: The Official Title of the movie.
tagline: The tagline of the movie.
release_date: Theatrical Release Date of the movie.
genres: Genres associated with the movie.
belongs_to_collection: Gives information on the movie series/franchise the particular film belongs to.
original_language: The language in which the movie was originally shot.
budget_musd: The budget of the movie in million dollars.
revenue_musd: The total revenue of the movie in million dollars.
production_companies: Production companies involved with the making of the movie.
production_countries: Countries in which the movie was shot/produced.
vote_count: The number of votes by users, as counted by TMDB.
vote_average: The average rating of the movie.
popularity: The Popularity Score assigned by TMDB.
runtime: The runtime of the movie in minutes.
overview: A brief blurb of the movie.
spoken_languages: Spoken languages in the film.
poster_path: The URL of the poster image.
cast: (Main) Actors appearing in the movie.
cast_size: Number of actors appearing in the movie.
director: Director of the movie.
crew_size: Size of the film crew (incl. director, excl. actors).
Import Necessary Libraries for this Task
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_columns=None # To show all columns
pd.options.display.float_format = '{:.2f}'.format
Read the Movie Data
df = pd.read_csv('movies_complete.csv', parse_dates=['release_date'])  # parse_dates converts the release_date column to datetime
df
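For the inspection step, one possible starting point is a per-column summary of data types and missing values. The sketch below is only an illustration: the helper name quick_overview and the toy data are assumptions, not part of the project code. On the real dataset you would pass df instead of toy.

```python
import pandas as pd


def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, non-null count, share of missing values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "missing_pct": frame.isna().mean() * 100,
    })


# Toy data with made-up values, for illustration only.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "budget_musd": [10.0, None, 30.0],
    "release_date": pd.to_datetime(["2001-01-01", "2002-02-02", None]),
})

overview = quick_overview(toy)
print(overview)
```

The built-in df.info() and df.describe() give similar information and are usually worth running as well.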
Filter the dataset and find the best/worst n movies with the:
Highest Revenue
Highest Budget
Highest Profit (=Revenue - Budget)
Lowest Profit (=Revenue - Budget)
Highest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10 million USD)
Lowest Return on Investment (=Revenue / Budget) (only movies with Budget >= 10 million USD)
Highest number of Votes
Highest Rating (only movies with 10 or more Ratings)
Lowest Rating (only movies with 10 or more Ratings)
Highest Popularity
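The rankings listed above can be sketched with two derived columns and one small helper. This is a minimal sketch on toy data, not the reference solution: the helper name best_worst, its parameters, and the column names profit_musd and roi are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data with made-up titles and numbers, for illustration only.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "budget_musd": [100.0, 20.0, 1.0, 50.0],
    "revenue_musd": [500.0, 10.0, 50.0, 300.0],
    "vote_count": [900, 15, 3, 400],
    "vote_average": [7.9, 4.2, 9.5, 7.1],
})

# Derived columns, following the definitions in the list above.
toy["profit_musd"] = toy["revenue_musd"] - toy["budget_musd"]
toy["roi"] = toy["revenue_musd"] / toy["budget_musd"]


def best_worst(frame, n, by, ascending=False, min_budget=0, min_votes=0):
    """Return the first n rows ranked by `by` (hypothetical helper)."""
    # Missing budgets/votes are treated as 0, so they fail any positive threshold.
    mask = (frame["budget_musd"].fillna(0) >= min_budget) & (
        frame["vote_count"].fillna(0) >= min_votes
    )
    ranked = frame.loc[mask, ["title", by]].sort_values(by=by, ascending=ascending)
    return ranked.head(n)


top_roi = best_worst(toy, n=2, by="roi", min_budget=10)
worst_profit = best_worst(toy, n=1, by="profit_musd", ascending=True)
top_rated = best_worst(toy, n=1, by="vote_average", min_votes=10)
print(top_roi)
```

Setting ascending=True returns the worst movies instead of the best, and the two thresholds keep tiny-budget or barely-rated movies out of the ROI and rating rankings.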
The Best and Worst Movies ever
We will filter the data by the criteria that determine the best and worst movies ever. We will also import HTML from IPython.display, because we want to present the analysis as a well-formatted web page. A single import is all that is needed.
from IPython.display import HTML  # used to present the results in a good-looking web page format
Selecting the Columns That Determine the Best and Worst Movies
Ranking by rating alone does not make sense: the top-rated movie has only one vote, which is not enough to judge its rating. So let us compute the median of vote_count and use it as the minimum number of votes a movie must have to be ranked.
The same applies to budget: a few movies have a budget close to zero and must be excluded, so we use the median of budget_musd as the minimum budget.
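The median threshold can be sketched as follows. The toy data and the variable names (min_votes, rated, top_rated, html_table) are assumptions for illustration; on the real data the same steps would apply to df, and the budget threshold works the same way with budget_musd.

```python
import pandas as pd

# Toy data with made-up values, for illustration only.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "vote_count": [1, 40, 250, 900, 3000],
    "vote_average": [10.0, 6.5, 8.1, 7.4, 8.8],
})

# Use the median number of votes as the minimum required to be ranked.
min_votes = toy["vote_count"].median()
rated = toy[toy["vote_count"] >= min_votes]

# Rank only the movies that pass the threshold.
top_rated = rated.sort_values("vote_average", ascending=False).head(3)

# to_html() returns the table as an HTML string; in a notebook it can be
# displayed by wrapping it in IPython.display.HTML(html_table).
html_table = top_rated.to_html(index=False)
print(top_rated)
```

Movie A has the highest rating but only one vote, so it no longer appears once the threshold is applied.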
df.head()
0 Toy Story Collection
2 Grumpy Old Men Collection
4 Father of the Bride Collection
9 James Bond Collection
12 Balto Collection
...
44582 The Carry On Collection
44585 The Carry On Collection
44596 The Carry On Collection
44598 DC Super Hero Girls Collection
44609 Red Lotus Collection
Name: belongs_to_collection, Length: 4463, dtype: object
Largest Franchise
We can count the movies in each franchise and use sort_values to find the franchise with the most movies.
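A minimal sketch of this count on toy data follows. The collection names are placeholders, and the two variants (value_counts, or groupby followed by sort_values) are only one way to do it.

```python
import pandas as pd

# Toy data with placeholder names, for illustration only.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "belongs_to_collection": [
        "Collection X", "Collection Y", "Collection X",
        None, "Collection X", "Collection Y",
    ],
})

# Movies that belong to a franchise (rows without a collection are dropped).
in_franchise = toy["belongs_to_collection"].dropna()

# Variant 1: value_counts() already sorts from most to fewest movies.
counts = in_franchise.value_counts()

# Variant 2: explicit groupby + sort_values, as described in the text.
by_group = (
    toy.groupby("belongs_to_collection")["title"]
    .count()
    .sort_values(ascending=False)
)

print(counts.head(1))
```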
director
Steven Spielberg 9256.62
Peter Jackson 6528.24
Michael Bay 6437.47
James Cameron 5900.61
David Yates 5334.56
Name: revenue_musd, dtype: float64
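The code that produced the output above is not shown here. A ranking of this shape could be produced by grouping on director and summing revenue_musd, as in the following sketch on toy data. The director names and numbers are placeholders, and the use of nlargest is an assumption.

```python
import pandas as pd

# Toy data with placeholder names, for illustration only.
toy = pd.DataFrame({
    "director": ["Director A", "Director B", "Director A", "Director C", None],
    "revenue_musd": [100.0, 250.0, 300.0, None, 50.0],
})

# Total revenue per director; rows without a director are ignored by groupby,
# and missing revenues are skipped by sum().
total_revenue = toy.groupby("director")["revenue_musd"].sum()

# Keep the n directors with the highest total revenue.
top_directors = total_revenue.nlargest(2)
print(top_directors)
```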
Highest Number of Franchises directed by Directors
director
Paul W.S. Anderson 982.29
James Wan 861.31
Wes Craven 834.93
Francis Lawrence 816.23
Ridley Scott 689.00
Marc Forster 531.87
Steven Spielberg 500.10
William Friedkin 466.40
Darren Lynn Bousman 456.34
M. Night Shyamalan 375.37
Name: revenue_musd, dtype: float64
To Find Successful Actors
df['cast']
0 Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...
1 Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...
2 Walter Matthau|Jack Lemmon|Ann-Margret|Sophia ...
3 Whitney Houston|Angela Bassett|Loretta Devine|...
4 Steve Martin|Diane Keaton|Martin Short|Kimberl...
...
44686 Leila Hatami|Kourosh Tahami|Elham Korda
44687 Angel Aquino|Perry Dizon|Hazel Orencio|Joel To...
44688 Erika Eleniak|Adam Baldwin|Julie du Page|James...
44689 Iwan Mosschuchin|Nathalie Lissenko|Pavel Pavlo...
44690 NaN
Name: cast, Length: 44691, dtype: object
Bess Flowers 240
Christopher Lee 148
John Wayne 125
Samuel L. Jackson 122
Michael Caine 110
Name: Actor, dtype: int64
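Since the cast column stores all actors of a movie in one string separated by "|", counting appearances per actor requires one row per (movie, actor) pair. The following sketch on toy data shows one way to build such a table. The names actors and Actor match the output and code in this section, but the exact steps used to create them are an assumption.

```python
import pandas as pd

# Toy data with placeholder names, for illustration only.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "cast": ["Actor X|Actor Y", "Actor X|Actor Z", None],
    "revenue_musd": [100.0, 40.0, 5.0],
})

actors = (
    toy.assign(Actor=toy["cast"].str.split("|"))  # string -> list of names
    .explode("Actor")                             # one row per list element
    .dropna(subset=["Actor"])                     # drop movies without cast info
    .drop(columns="cast")
)

# Number of movies each actor appears in, most frequent first.
appearances = actors["Actor"].value_counts()
print(appearances)
```

Each movie's other columns (title, revenue_musd, and so on) are repeated for every actor, which is what makes a later groupby on Actor possible.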
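# This is known as named aggregation: each keyword becomes an output column,
# built from a (source column, aggregation function) pair.
# `actors` is assumed to hold one row per (movie, actor) pair.
data = actors.groupby('Actor').agg(
    total_revenue=('revenue_musd', 'sum'),
    average_rating=('vote_average', 'mean'),
    average_popularity=('popularity', 'mean'),
    movies=('title', 'count'),
    average_revenue=('revenue_musd', 'mean'),
)
data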
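To see what this aggregation returns, here is a self-contained run on toy data. The rows are made up, and the final sort by total_revenue is just one possible way to rank the actors.

```python
import pandas as pd

# Toy table with one row per (movie, actor) pair; values are made up.
actors = pd.DataFrame({
    "Actor": ["Actor X", "Actor Y", "Actor X", "Actor Z"],
    "title": ["Movie A", "Movie A", "Movie B", "Movie B"],
    "revenue_musd": [100.0, 100.0, 40.0, 40.0],
    "vote_average": [7.0, 7.0, 5.0, 5.0],
    "popularity": [10.0, 10.0, 4.0, 4.0],
})

data = actors.groupby("Actor").agg(
    total_revenue=("revenue_musd", "sum"),
    average_rating=("vote_average", "mean"),
    average_popularity=("popularity", "mean"),
    movies=("title", "count"),
    average_revenue=("revenue_musd", "mean"),
)

# Rank actors by the total revenue of the movies they appeared in.
ranked = data.sort_values("total_revenue", ascending=False)
print(ranked)
```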