Regression analysis of Gapminder data

Exercise 1

Question 1

The plot shows a general trend of increasing life expectancy over time. The trend looks linear; however, it is difficult to know for sure because the change appears only slight over this time interval.

Question 2

The graph for individual years seems to be bimodal, with one mode above the mean and one below. In 1950-1960 it appears that most countries are below the mean life expectancy, and the graph is skewed toward the higher life expectancies. The skew decreases with time, however, and flips around 1970, where the two modes seem similar in size. By the 2000s the graph is skewed in the other direction in a significant way.

Question 3

I would reject the null hypothesis of no relationship between life expectancy and time because mean life expectancy increases for every year measured in this dataset.

Question 4

I think that a violin plot of the residuals from a linear regression model would look similar to the one I already plotted, except the means would be around zero for each year, so it would be without the general upward trend. That's because the residuals are the values relative to the linear model at that year, which would be close to the mean life expectancy for that year (I don't think the difference between the model's prediction and the mean would be significant).

Question 5

The simple linear model of life expectancy vs. year inherently assumes that there are no other variables affecting life expectancy, and that the distribution of life expectancies is centered around the mean for each year. This does not seem to be true, however, because just by looking at the violin plot in question 1 you can see the distributions are not centered around the mean.

Exercise 2

I decided to print the fitted model as a list of values (above) because I thought it would be an easy way to see the kind of predictions the model was making.

The values above give the y-intercept and the slope, so the model in "y = mx + b" format would be:

$\text{predicted life expectancy} = 0.32590383 \times \text{year} - 585.6521874415448$
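As a quick sanity check, the equation can be evaluated directly. This is only a sketch: the two coefficients are copied from the fitted values above, while the function name and the example years are my own illustrative choices.

```python
# Coefficients copied from the fitted model reported above.
SLOPE = 0.32590383
INTERCEPT = -585.6521874415448


def predict_life_expectancy(year):
    """Evaluate the fitted line at a given calendar year."""
    return SLOPE * year + INTERCEPT


# Example years are arbitrary, picked only to illustrate the calculation.
for year in (1950, 1975, 2000):
    print(year, round(predict_life_expectancy(year), 2))

# The difference between consecutive years is the slope itself.
print(predict_life_expectancy(2001) - predict_life_expectancy(2000))
```

The large negative intercept is the prediction at year 0, far outside the data, so it has no practical interpretation on its own.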

Question 6

According to my linear regression model, life expectancy increases by about 0.326 years on average every year around the world. This can be seen from the value of reg.coef_ in exercise 2.

Question 7

I reject the null hypothesis of no relationship between year and life expectancy because the slope of the linear regression seems to be significantly greater than zero, which matches what I predicted the relationship was by just looking at the violin plot.
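One way to make "significantly greater than zero" concrete is a t-statistic for the slope. The sketch below uses synthetic data (not the Gapminder values) and plain NumPy least squares rather than the scikit-learn model from exercise 2, so every number in it is an assumption for illustration only.

```python
import numpy as np

# Synthetic stand-in data (NOT the Gapminder values): a noisy upward trend.
rng = np.random.default_rng(0)
year = np.repeat(np.arange(1950, 2010, 5), 30).astype(float)
life_exp = 0.3 * year - 540.0 + rng.normal(0.0, 10.0, size=year.size)

# Ordinary least squares fit of life_exp = intercept + slope * year.
X = np.column_stack([np.ones_like(year), year])
beta, *_ = np.linalg.lstsq(X, life_exp, rcond=None)
intercept, slope = beta

# Standard error of the slope, from the residual variance.
residuals = life_exp - X @ beta
dof = year.size - 2  # two estimated parameters
sigma2 = residuals @ residuals / dof
slope_se = np.sqrt(sigma2 / np.sum((year - year.mean()) ** 2))

# t-statistic for the null hypothesis "slope = 0".
t_stat = slope / slope_se
print(f"slope={slope:.3f}, se={slope_se:.4f}, t={t_stat:.1f}")
```

With this many observations, a t-statistic well above roughly 2 in absolute value corresponds to rejecting the null hypothesis at the usual 5% level.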

Exercise 3

Question 8

This does match my expectations from question 4. The plot looks similar to the plot of life expectancy vs. year, except it's centered near zero because the linear regression model's prediction is near the mean life expectancy for each year.
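The same check can be done numerically by averaging the residuals within each year. This is a sketch on synthetic data (not the Gapminder values), fitted with `np.polyfit` instead of the scikit-learn model, so the numbers are illustrative only.

```python
import numpy as np

# Synthetic stand-in data (NOT the Gapminder values).
rng = np.random.default_rng(1)
years = np.arange(1950, 2010, 5)
year = np.repeat(years, 40).astype(float)
life_exp = 0.3 * year - 540.0 + rng.normal(0.0, 10.0, size=year.size)

# Fit a straight line and compute residuals (observed minus predicted).
slope, intercept = np.polyfit(year, life_exp, deg=1)
residuals = life_exp - (slope * year + intercept)

# With an intercept in the model, the residuals average to zero overall.
print(f"overall mean residual: {residuals.mean():.2e}")

# Per-year means are near zero only if the yearly means follow the line.
year_means = {int(y): residuals[year == y].mean() for y in years}
for y, m in year_means.items():
    print(y, round(m, 2))
```

The overall mean is zero by construction of least squares; the per-year means are the informative part, since a systematic pattern in them would suggest the trend is not really linear.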

Exercise 4

Question 9

There clearly seems to be a dependence between model residuals and continent, which suggests that when performing a regression analysis of life expectancy across time, using dummy variables to account for the categorical effect could result in a more accurate model.
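A minimal sketch of the dummy-variable idea, assuming three hypothetical continent labels ("A", "B", "C") with made-up level shifts. The data are synthetic, and the fit uses NumPy least squares rather than the model from exercise 2.

```python
import numpy as np

rng = np.random.default_rng(2)
labels = np.array(["A", "B", "C"])        # hypothetical continent labels
offset = {"A": 0.0, "B": 8.0, "C": 15.0}  # made-up level shifts (years)

# Synthetic data: shared yearly trend plus a continent-specific shift.
t = np.tile(np.arange(0, 60, 5), 60).astype(float)  # years since first survey
continent = rng.choice(labels, size=t.size)
shift = np.array([offset[c] for c in continent])
life_exp = 45.0 + 0.3 * t + shift + rng.normal(0.0, 3.0, size=t.size)

# One 0/1 indicator per continent except the baseline "A".
dummies = np.column_stack([(continent == c).astype(float) for c in labels[1:]])
X_year = np.column_stack([np.ones_like(t), t])
X_full = np.column_stack([X_year, dummies])


def fit(X, y):
    """Return (residual sum of squares, coefficients) of a least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r), beta


rss_year, _ = fit(X_year, life_exp)
rss_full, beta_full = fit(X_full, life_exp)
print(f"RSS, year only:         {rss_year:.0f}")
print(f"RSS, year + continent:  {rss_full:.0f}")
print("estimated shifts for B and C:", np.round(beta_full[2:], 2))
```

Each dummy coefficient estimates how far that continent sits above or below the baseline at any given year, which is the dependence the residual plot hints at.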

Exercise 5

Question 10

This plot shows that my regression model should have an interaction term for continent and year because the effect of the year on a country's life expectancy differs between continents; i.e., the slopes and heights of the continent-specific linear regressions above are different from each other.
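To illustrate what the interaction term buys, here is a sketch with two hypothetical groups whose true slopes differ. The data are synthetic and the group indicator `is_b` is my own naming; the point is only that the interaction coefficient captures the difference in slopes.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: two hypothetical continents with different trends.
t = np.tile(np.arange(0, 60, 5), 40).astype(float)    # years since first survey
is_b = rng.integers(0, 2, size=t.size).astype(float)  # 1 if continent "B"
true_slope_a, true_slope_b = 0.20, 0.45
life_exp = np.where(is_b == 1,
                    50.0 + true_slope_b * t,
                    40.0 + true_slope_a * t)
life_exp = life_exp + rng.normal(0.0, 2.0, size=t.size)

# Design matrix: intercept, year, continent dummy, year-by-continent product.
X = np.column_stack([np.ones_like(t), t, is_b, t * is_b])
beta, *_ = np.linalg.lstsq(X, life_exp, rcond=None)
b_intercept, b_year, b_continent, b_interaction = beta

# The baseline slope is the year coefficient; the other group adds the
# interaction coefficient on top of it.
slope_a = b_year
slope_b = b_year + b_interaction
print(f"slope A = {slope_a:.3f}, slope B = {slope_b:.3f}")
```

Without the product column, both groups would be forced to share one slope and could differ only in height.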

Exercise 6

Question 11

All coefficients seem to be significantly different from zero. This makes sense, because every one of these variables should have a significant impact on life expectancy.

Question 12

I found the average increase in life expectancy each year for each continent by predicting life expectancy at year 1 and subtracting from it the life expectancy prediction at year 0. Here are the results:

Each year, life expectancy increases by...
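The year 1 minus year 0 trick works because everything in the prediction that does not multiply the year cancels out. The sketch below shows this with made-up coefficients for three hypothetical continents; none of these numbers come from my fitted model.

```python
# Made-up coefficients for illustration only (NOT the fitted values).
BASE_INTERCEPT = -520.0
BASE_SLOPE = 0.29                                      # baseline continent
CONTINENT_SHIFT = {"A": 0.0, "B": 30.0, "C": -100.0}   # dummy coefficients
INTERACTION = {"A": 0.0, "B": -0.01, "C": 0.08}        # year-by-continent terms


def predict(continent, year):
    """Prediction from a model with continent dummies and interactions."""
    return (BASE_INTERCEPT + CONTINENT_SHIFT[continent]
            + (BASE_SLOPE + INTERACTION[continent]) * year)


# Intercept and continent shift cancel, leaving the continent's own slope.
yearly_gain = {c: predict(c, 1) - predict(c, 0) for c in CONTINENT_SHIFT}
for c, gain in yearly_gain.items():
    print(c, round(gain, 3))
```

Any pair of consecutive years gives the same difference, since the model is linear in year within each continent.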

Exercise 7

This plot shows that the assumption that life expectancy depends on year and continent seems to hold up very well; the distributions of residuals for each year appear to be centered about the mean.