Table of contents

Executive Summary

Dataset

Data Observations

Statistical Analysis

Conclusion

Dataset for the 2004 season

Regression Analysis taking LOG (Y)

Regression Analysis

Executive Summary

This report examines whether total team payroll for major league baseball teams varies directly with each team's home attendance. This is an important statistical analysis because if we can prove that there is a relationship between

The dataset could grow in the future through baseball expansion. For example, in 1998 two teams, the Arizona Diamondbacks and the Tampa Bay Devil Rays, were added to the 28 existing major league baseball teams. No variables were eliminated, because all were statistically valid. We considered removing the New York Yankees from our dataset, since their total salary is exceptionally high compared with the other teams. However, because the Yankees are one of the top teams and have a huge following, we kept them in our dataset.

We also took the logarithm of the Y values (in our case, the salaries) to see whether it would give a better regression analysis. The output of that analysis is also shown in our study. Although the coefficient of determination increases, the increase is not significant. In addition, the residual plot for the log function is much more scattered than the regular one, and the resulting regression equation, y = 4E-05x + 9.8235 (where y is now the log of salary), is not easy to interpret. We therefore stayed with the regular analysis and dropped the idea of taking the log of the Y values.

Other independent variables were considered for analysis, such as team winning percentage, team batting average, and city population. However, home attendance was the most cogent variable to examine from a financial perspective. After performing a simple regression, we conclude that there is a linear relationship.
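
The comparison between the regular fit and the log(Y) fit described above can be sketched as follows. This is only an illustration under stated assumptions: the attendance and salary values are hypothetical (they are not the 2004 dataset), the helper name fit_line is ours, and the natural logarithm is assumed. The base of the logarithm does not affect r², because changing the base only rescales the transformed values.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical values for illustration only; NOT the 2004 dataset.
attendance = [15000, 20000, 25000, 30000, 35000, 40000, 45000]
salary = [30000, 42000, 50000, 71000, 80000, 98000, 180000]

# Regular analysis: salary regressed on attendance.
linear_fit = fit_line(attendance, salary)

# Log analysis: log(salary) regressed on attendance (natural log assumed).
log_fit = fit_line(attendance, [math.log(y) for y in salary])

print("r^2, salary on attendance:     ", round(linear_fit[2], 4))
print("r^2, log(salary) on attendance:", round(log_fit[2], 4))
```

A higher r² for the log model is not by itself a reason to prefer it; as noted above, the residual plot and the ease of interpreting the equation also matter.
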
In our analysis, the coefficient of determination, r², which measures how much of the variation in team salary is explained by home attendance, is .6240, viz. 62.40%. A value of 62.40% is a satisfactory result for our application needs. In addition, we find that our regression equation is y = 3.2707x – 28084. This means that, above an attendance of 8,587 (the point where the fitted line crosses zero salary), each additional fan translates into a $327 increase in team salary. The 95% confidence interval for the slope is 3.27068 ± .9829. Finally, the standard error of estimate is 21372.9. The p-value is 0.00 and the F value is significant. We can say with confidence that the data for our analysis have a linear relationship and that we can draw inferences from them to a fair degree of accuracy.
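
The arithmetic behind the figures above can be checked with a short sketch. The slope, intercept, and confidence-interval half-width are taken from the text; the variable and function names, and the attendance of 30,000 used in the sample prediction, are hypothetical choices made for illustration. Salary is left in whatever units the dataset uses.

```python
# Figures reported in the regression output above.
slope = 3.2707
intercept = -28084.0
slope_estimate = 3.27068
slope_margin = 0.9829  # half-width of the 95% confidence interval

def predicted_salary(attendance):
    """Salary predicted by the fitted line, in the dataset's salary units."""
    return slope * attendance + intercept

# Attendance at which the fitted line crosses zero salary.
zero_crossing = -intercept / slope

# Lower and upper limits of the 95% confidence interval for the slope.
ci_low = slope_estimate - slope_margin
ci_high = slope_estimate + slope_margin

print("zero crossing:", round(zero_crossing))
print("95% CI for slope:", round(ci_low, 4), "to", round(ci_high, 4))
# 30000 is a hypothetical attendance, chosen only to show a prediction.
print("predicted salary at 30000:", round(predicted_salary(30000), 1))
```

The zero crossing rounds to the 8,587 quoted above. Because the whole confidence interval lies above zero, it agrees with the significant F value: the slope is distinguishable from zero at the 95% level.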

Finally, there is one outlier point. We realize that outliers have a profound influence on the slope of the regression line and consequently on the value of the coefficient of correlation. We also realize that in some cases an outlier can be the result of a recording error, such as a faulty reading from a scientific instrument. However, none of the data in our study is subject to recording error. We therefore decided to keep the outlier in our dataset and pay careful attention to our scatter plot.
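
One way to see how much a single outlier can pull the regression line is to fit the line with and without that point. The sketch below does this on hypothetical values (not the 2004 dataset), in which the last team plays the role of the high-payroll outlier; the helper name slope_and_r2 is ours.

```python
def slope_and_r2(xs, ys):
    """Least-squares slope and coefficient of determination for y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)

# Hypothetical values for illustration only; NOT the 2004 dataset.
# The last team has an exceptionally high salary for its attendance.
attendance = [15000, 20000, 25000, 30000, 35000, 40000, 45000]
salary = [30000, 42000, 50000, 71000, 80000, 98000, 180000]

# Fit once with every team, and once with the outlier left out.
slope_all, r2_all = slope_and_r2(attendance, salary)
slope_trimmed, r2_trimmed = slope_and_r2(attendance[:-1], salary[:-1])

print("with outlier:    slope", round(slope_all, 4), "r^2", round(r2_all, 4))
print("without outlier: slope", round(slope_trimmed, 4), "r^2", round(r2_trimmed, 4))
```

In this made-up example the outlier steepens the slope and lowers r². The size and direction of the effect in the real dataset would have to be checked against the actual figures.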


We determined that the relationship in the sample data is linear. We also determined the coefficient of determination to be satisfactory for the purposes of this analysis.
In the residual analysis, the residuals are evenly distributed on the positive and negative sides, with some lying exactly on the middle line....
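
The residual check described above can be sketched as follows. A residual is the observed salary minus the salary predicted by the regression line, so points above the line have positive residuals and points below have negative ones. The slope and intercept come from the regression equation in this report; the four observations are hypothetical and serve only to show the calculation.

```python
# Regression equation from this report: y = 3.2707x - 28084.
SLOPE = 3.2707
INTERCEPT = -28084.0

def residual(attendance, salary):
    """Observed salary minus the salary predicted by the fitted line."""
    return salary - (SLOPE * attendance + INTERCEPT)

# Hypothetical (attendance, salary) pairs; NOT taken from the 2004 dataset.
observations = [
    (20000, 40000.0),
    (30000, 65000.0),
    (35000, 90000.0),
    (40000, 100000.0),
]

residuals = [residual(x, y) for x, y in observations]

# Count how many points fall above and below the fitted line.
above = sum(1 for r in residuals if r > 0)
below = sum(1 for r in residuals if r < 0)

print("residuals:", [round(r, 1) for r in residuals])
print("above the line:", above, "below the line:", below)
```

A roughly even split with no visible pattern in the residual plot is consistent with a linear fit; a curve or funnel shape would suggest otherwise.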

