DS-240 Final Project – Movie Budgets vs. Success
2018-12-08
Introduction
Hypothesis / Problem Statement
The question proposed: is there any correlation between the budget spent to produce a movie and the movie’s
success? The first step, other than finding a solid data set, is determining how to measure success. Several
variables are strong candidates. The first is straight Revenue, the simplest metric, which is widely reported
and common in most data sets. The second is the Popularity of a movie, which is available through several of
the movie database sites. The third is Vote_average, the average viewer rating, also available from several of
the movie database sites. The fourth and fifth are calculated variables: a profit (Revenue - Budget) and an
ROI-type ratio (Revenue / Budget). All five are explored in the following analysis.
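As a rough illustration of the five measures above, the sketch below computes them for a handful of movie records and correlates each one with budget using a plain Pearson coefficient. The function names (pearson, success_measures, budget_correlations), the lowercase dictionary keys, and the toy records are assumptions made for this example; they are not taken from the project's own analysis, and the choice of Pearson correlation is likewise an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def success_measures(movie):
    """Return the five candidate success measures for one movie record.

    Assumes the record has a nonzero budget; a zero budget would make
    the Revenue / Budget ratio undefined.
    """
    return {
        "revenue": movie["revenue"],
        "popularity": movie["popularity"],
        "vote_average": movie["vote_average"],
        "profit": movie["revenue"] - movie["budget"],  # Revenue - Budget
        "roi": movie["revenue"] / movie["budget"],     # Revenue / Budget
    }


def budget_correlations(movies):
    """Correlate budget with each of the five success measures."""
    budgets = [m["budget"] for m in movies]
    measures = [success_measures(m) for m in movies]
    return {
        name: pearson(budgets, [row[name] for row in measures])
        for name in measures[0]
    }


# Made-up records for illustration only; not taken from the data set.
toy_movies = [
    {"budget": 10_000_000, "revenue": 30_000_000, "popularity": 1.2, "vote_average": 6.1},
    {"budget": 50_000_000, "revenue": 90_000_000, "popularity": 2.5, "vote_average": 6.8},
    {"budget": 2_000_000, "revenue": 12_000_000, "popularity": 0.7, "vote_average": 7.4},
    {"budget": 120_000_000, "revenue": 100_000_000, "popularity": 3.1, "vote_average": 5.9},
]

if __name__ == "__main__":
    for name, r in budget_correlations(toy_movies).items():
        print(f"budget vs {name}: r = {r:.3f}")
```

Note that the ratio here follows the report's definition (Revenue / Budget), so a value of 1.0 means the movie broke even rather than earned a 100% return.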
Data Set Selection & Cleansing
A review of multiple movie data sets turned up several potential candidates in a wide range of sizes. The
selected data set is a mid-sized one posted on www.kaggle.com and drawn from the tmdb movie database (link:
https://www.kaggle.com/kevinmariogerard/tmdbmovies). It has 10.9k rows and 21 columns, listed below:
- Id (tmdb_id) (Qual.)
- Imdb_id (Qual.)
- Popularity (tmdb site data) (Quant.)
- Budget (Quant.)
- Revenue (Quant.)
- Original_title (Qual.)
- Cast (Qual.)
- Homepage (Qual.)
- Director (Qual.)
- Tagline (Qual.)
- Keywords (Qual.)
- Overview (Qual.)
- Runtime (Quant.)
- Genres (Qual.)
- Production_companies (Qual.)
- Release_date (Quant.)
- Vote_count (tmdb site data) (Quant.)
- Vote_average (tmdb site data) (Quant.)
- Release_year (Qual.)
- Budget_adj (Quant.)
- Revenue_adj (Quant.)
As part of data cleansing, several columns that were not needed for this analysis were dropped. Some columns
had holes (NULL values or zeros), but mostly at the low tail end of the data, which was going to be dropped
anyway. The threshold settled on was that any movie with a recorded budget or revenue lower than $1000 was
removed. As a net result, only two values remained NULL or zero. Those values were looked up manually and
updated. The cleaned data set landed at about 3.8k rows with 13 columns remaining. The selected columns are
shown below, with two added derived values (*):