Ifigeneia Tsiflidou

Predictive Modeling of Movie Revenue

UOC – Applied Statistics Project

Description

As part of an Applied Statistics course, this project aimed to predict the global box office revenue of 3,000 movies from The Movie Database (TMDb). The workflow included exploratory analysis, feature engineering, and multiple regression modeling in Python.

Tools

NumPy
Matplotlib
pandas
scikit-learn

Phase 1: Exploratory Data Analysis

Assessed relationships between revenue and features such as budget, runtime, popularity, and language. Budget had the highest correlation with revenue (~0.75), making it the strongest individual predictor. Visualized the results using scatter plots and correlation matrices.
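The correlation step can be sketched as follows. This is a minimal illustration on synthetic data, not the TMDb dataset: the column names, sample size, and generating coefficients are all assumptions chosen so that budget dominates, and the numbers it produces are not the project's results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the TMDb dataset); all values are assumptions.
rng = np.random.default_rng(0)
n = 500
budget = rng.uniform(1e6, 2e8, n)
runtime = rng.normal(105, 15, n)
popularity = rng.gamma(2.0, 5.0, n)
# Revenue is built to depend mostly on budget, plus noise.
revenue = 2.5 * budget + 1e6 * popularity + rng.normal(0, 8e7, n)

movies = pd.DataFrame(
    {
        "budget": budget,
        "runtime": runtime,
        "popularity": popularity,
        "revenue": revenue,
    }
)

# Pearson correlation matrix of all numeric columns.
corr = movies.corr()

# Correlation of each feature with revenue, strongest first.
revenue_corr = corr["revenue"].drop("revenue").sort_values(ascending=False)
print(revenue_corr)
```

The `corr` matrix is what a correlation heatmap would display, and `revenue_corr` gives a quick ranking of candidate predictors.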

Phase 2: Feature Engineering & Regression Modeling

Engineered numerical features from JSON fields, such as the number of male cast members and the number of production companies. Built a multiple linear regression model using scikit-learn. Budget and popularity were the most significant predictors. The final model achieved an R² score of ~0.38 and was validated through residual analysis.
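A minimal sketch of the JSON-based feature engineering, under several assumptions: the `cast` and `production_companies` columns hold valid JSON strings, each cast entry carries a `gender` key, and the code 2 denotes male. The helper `count_items` and the sample rows are hypothetical and only illustrate the idea.

```python
import json

import pandas as pd

MALE = 2  # assumed gender code for male cast members


def count_items(raw, predicate=lambda item: True):
    """Count entries in a JSON-encoded list that satisfy `predicate`.

    Missing or empty values count as zero.
    """
    if not isinstance(raw, str) or not raw.strip():
        return 0
    items = json.loads(raw)
    return sum(1 for item in items if predicate(item))


# Hypothetical sample rows, not taken from the real dataset.
movies = pd.DataFrame(
    {
        "cast": [
            '[{"name": "A", "gender": 2}, {"name": "B", "gender": 1}, '
            '{"name": "C", "gender": 2}]',
            "[]",
            None,
        ],
        "production_companies": [
            '[{"name": "Studio X"}, {"name": "Studio Y"}]',
            '[{"name": "Studio Z"}]',
            None,
        ],
    }
)

# Number of male cast members per movie.
movies["n_male_cast"] = movies["cast"].apply(
    lambda raw: count_items(raw, lambda member: member.get("gender") == MALE)
)
# Number of production companies per movie.
movies["n_production_companies"] = movies["production_companies"].apply(count_items)

print(movies[["n_male_cast", "n_production_companies"]])
```

The resulting integer columns can then be passed, together with budget and popularity, to a regression model.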

Phase 3: Model Validation & Residual Analysis

Evaluated the model’s assumptions by analyzing the distribution of residuals. A histogram revealed a roughly symmetric shape centered around zero, consistent with the linear model's assumption of zero-mean, approximately normally distributed errors for this dataset.
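The residual check can be illustrated with the sketch below. It rests on assumptions: the data is synthetic (not the TMDb dataset), and the fit uses NumPy's least-squares solver as a self-contained stand-in for the scikit-learn model used in the project. It computes the residuals, R², and two simple symmetry diagnostics.

```python
import numpy as np

# Synthetic stand-in data (NOT the TMDb dataset); units and coefficients are assumed.
rng = np.random.default_rng(42)
n = 1000
budget = rng.uniform(1, 200, n)        # hypothetical budget, in millions
popularity = rng.gamma(2.0, 5.0, n)    # hypothetical popularity score
revenue = 10 + 2.0 * budget + 1.5 * popularity + rng.normal(0, 60, n)

# Ordinary least squares with an explicit intercept column.
X = np.column_stack([np.ones(n), budget, popularity])
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)

fitted = X @ coef
residuals = revenue - fitted

# Coefficient of determination.
ss_res = np.sum(residuals**2)
ss_tot = np.sum((revenue - revenue.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Symmetry diagnostics: mean near zero and sample skewness near zero.
mean_resid = residuals.mean()
skew = np.mean((residuals - mean_resid) ** 3) / residuals.std() ** 3

# Bin counts of the residuals, i.e. the data behind a histogram.
counts, edges = np.histogram(residuals, bins=20)

print(f"R^2 = {r2:.3f}, mean residual = {mean_resid:.2e}, skewness = {skew:.3f}")
```

A mean close to zero follows automatically from including an intercept, so the skewness and the shape of the binned counts are the more informative checks. Plotting `residuals` with Matplotlib's `plt.hist` would give the visual version of the same diagnostic.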
