Part 1: Wrangling

Problem 1

I used the read_sql() function in Pandas to extract data from the SQLite database and put it into a Pandas database. In my SQL query, I used the GROUP BY statement to sum all of the salaries for each team/year combination in the Salaries table, and then used the INNER JOIN statement to combine the resulting table with selected data from the Teams table.

As a result of using INNER JOIN, team/year combinations that existed in one table but not the other (missing data) were not included in the final table. Therefore the final table does not have any data from before 1985, which is the earliest year of recorded data in the Salaries table.

For both of these tables, when the row exists, all variables in the row seem to exist.

Part 2: Exploratory Data Analysis

Payroll Distribution

Problem 2

To show the distribution of payrolls across teams conditioned on time I chose to plot lines showing each team's payroll and how it's changed over time from 1990-2014. I did this by first "unmelting" the table using .pivot() to give each team it's own column showing payroll for each year. From that I could easily plot the data using .plot().

Question 1

Looking at the graph, it's clear that the spread of payrolls increases a lot between 1990 and 2014—the lines are very close together (from the perspective the graph provides) in 1990, and very far apart in 2014. From this we can deduce that the variance/standiard deviation of the distribution of payrolls across teams increases over this time span.

The graph also shows a general upward trend in payrolls over time; if you imagine drawing a best fit line on the graph, you can see it would slope upward. Therefore, the graph shows that the average payroll of teams in the MLB increases from 1990-2014.

Problem 3

To get these two graphs, I used groupby() to get a groupby object that I could call .mean() and .std() on to get the mean payroll and standard deviation at each year. I could then plot the data easily using .plot().

Correlation between payroll and winning percentage

Problem 4

To create the 5 plots, I first used the groupby() function in Pandas to group the data by chunks of time and team, and take the mean payroll and winning percentage for teams in each. I chose the 5 groups that I did (between 1990-2014) because each time chunk is equal, and we can compare results with results in problems 2 and 3. I then plotted the values for teams within each time chunk on 5 different plots. I chose to label two teams on the plots to make analysis easier. The Oakland A's are labeled "OAK", and the New York Yankees are labeled "NYY".

I chose these teams becase the A's are known having used analytics to win while spending less than other teams, while the Yankees are known for spending a lot of money (also becase I am a Yankees fan). Both of these teams don't rename or move in 1990-2014, so they are in all 5 plots.

Question 2

Before analyzing what these plots show, it's important to notice that the x-axis is not consistent for all of them. There seems to a general trend of teams spending more money over time, which is consistent with what I found in part 2 of this project. In 1990-1994 the highest spending teams were around a average annual payroll of \$35 million—in 1995-1999 there were many teams above \$40 million, and the highest spending teams were around \$6 million.

In general, it's easy to see a slight trend upward in most of these graphs, meaning teams that spend more money tend to have a higher winning percentage. It is not until 2010-2014 that you begin to see many teams be near the top in winning percentage that are on the lower end of the league in terms of average annual payroll.

The first thing I noticed when looking at these graphs is the Yankees' increase in spending between 1995-1999 and 2005-2009. In the former their average payroll appears to be around \$65 million a year, whereas in the latter it is around \$200 million, and by then the Yankees have really become an outlier in spending. In the case of the Yankees, this does seem to translate into winning games, because their winning percentage is at or is very near the highest in the league for all three time chunks following their initial spending increase. You could say that the Yankees were particularly good at "paying for wins" from 1995 to 2014.

The first graph shows that the A's in 1990-1994 had one of the highest average annual payrolls in the MLB. However, in 1995-1999 their spending had not increased by nearly as much as other teams in the league, and they had one of the lowest average payrolls; their success seemed to reflect that, their winning percentage was below 50\% and was pretty low compared to most other teams over that span. In 2000-2004 their average payroll was still one of the lowest in the league but their winning percentage was one of the best—almost tied with the Yankees! Something must have happened around the turn of the millennium that helped the A's win games with much less money than other teams.

They remained low spenders relative to other teams in the next two time chunks, but always had a winning record on average (winning percent over 50\%). In 2005 to 2009 their winning percentage was only mediocre, but in 2010-2014 they were back to being near the top of the league.

Part 3: Data transformations

Standardizing Across Years

Problem 5

To transform the data using .transform(), I wanted to input the payroll, the average payroll, and the standard deviation as a tuple. To get the data in that format, I first made two new columns in a temporary dataframe for the average and standard deviation. Then I used zip() to combine them into a new column of tuples.

Using a function I defined that takes in a tuple and gives the standardized payroll, I transformed the column of tulples and added the result to my dataset.

Problem 6

I used the same code to get the same graphs from problem 4, only this time I used standardized payroll instead of payroll for the x-axis.

Question 3

The plots in problem 4 and problem 6 look very similar, except in problem 6 instead of plotting winning percentage against payroll, we're plotting it against the standardized payroll: the number of standard deviations that the given team is from the mean payroll of that year. They look similar because a team with a high payroll is also going to have a high standardized payroll, and a team with a low payroll will also have a low standardized payroll.

We can see from these graphs that the general slight trend of upward still exists after the transform, implying that higher standardized payroll is correlated with higher winning percentage similarly to how actual payroll is.

The significance of this transform is that now payroll is relative to the other payrolls in its year, so we can compare data across years without worrying about the the overall increase in payrolls over time found in problem 2. The x-axis labels between some of the graphs differ in the plots with actual payroll by an order of magnitude due to the overall increase in payrolls. In the plots with standardized payroll, the x values stay much more consistent throughout the 5 graphs—there is a slight change due to the New York Yankees becoming a spending outlier, but in general the x values stay within the range of -2 to 2.

Expected Wins

Problem 7

The code in this section is pretty straightforward. The x and y coordinates for each team was already in my dataset under the standardized payroll and winning percentage columns. I used that to plot the data, and then used that data along with numpy.polyfit() to get the best fit line.

Spending Efficiency

Problem 8

To get efficiency for each team, I made a series of tuples that contained win percent and expected win percent, and then used .transform() to subtract expected win percent from win percent. I then used .pivot() to get a dataframe with columns for the efficiency of each team where rows are each year; this way the data would be easier to plot.

I decided to plot the Oakland A's (OAK) and the New York Yankees (NYA) to be able to compare with the graphs in problems 4 and 6. I also wanted to plot other teams for context; I decided on the the Boston Red Sox (BOS), a team whose spending efficiency is pretty middle-of-the-road and stays around the 0 mark, and the Detroit Tigers (DET), who seem to have poor spending efficiency in general.

After seeing my initial plot, I decided to also look at the 5 year rolling average to make the trends easier to see.

Question 4

This plot shows teams' performance compared to how they were expected to perform based on their payroll, as defined by spedning efficiency. Teams with low standardized payroll and high winning percent on a graph in problem 6 should be shown to have a high spending efficiency here over the years of the specific graph in problem 6.

In question 2 I said I thought the New York Yankees were particularly good at "paying for wins", especially after 1995, because their spending and winning percentage was high. I did not know how efficient that spending was, however. They could have reached diminishing returns. From this graph you can see that the Yankees had a spending efficiency averaging above zero for a while from the late '90s to the late 2000s, so they actually were winning more than they would be expected to based on their payroll.

The Oakland A's, we know from question 2, had one of the lowest average annual payrolls in the league but one of the highest winning percentages from 2000-2004. From this we would expect that they were overacheaving based on expectations and should have a high spending efficiency over that span—and they do! In what looks like 2001 and 2002, they had a winning percentage around 15 points higher than would be predicted based on the regression line in problem 7; and on the rolling average graph you can clearly see they are very efficient on average in that time period.