Greenhouse Gas Emissions and Temperature from 1990-2014

Data Science Pipeline Tutorial

By Justin DeVito


Introduction

This notebook will walk through the process of using Python for analyzing data through the "data science pipeline".

The data science pipeline has the following steps:

  1. Data collection/curation and parsing
  2. Data management/representation
  3. Exploratory data analysis
  4. Hypothesis testing and machine learning
  5. Communication of insights attained

First, let's begin with data collection.

Getting Data

Data can be found anywhere online. If you're lucky there's an API that makes it easy to request data from a database, but sometimes it'll take a bit more elbow grease. If data is shown on a website you could read the HTML into Python, and parse through it to find the relevant data you want (libraries such as "Beautiful Soup" would help with this). The data you find does not initially need to be perfectly formatted for you to be able to perform data analysis on it.

In this notebook I am going to look at how greenhouse gas emissions have changed over time from 1990-2014, and at global temperatures over that same span. I initially found greenhouse gas data for UN member nations (I couldn't find a database with global emissions) on the United Nations' website, but each variable (carbon dioxide, methane, etc.) was stored in a different dataset. I was able to find the same data on Kaggle, a website where you can find many datasets and other data science resources, which allowed me to download everything I wanted directly as a CSV file.

UN website link: http://data.un.org/Explorer.aspx

Kaggle link: https://www.kaggle.com/unitednations/international-greenhouse-gas-emissions

I can easily read that CSV file into Python using a library called Pandas, which helps with storing the data in memory as a "dataframe" and provides many functions that help with manipulating the data.
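As a minimal sketch of that step, the snippet below reads CSV text into a dataframe. The column names (country_or_area, year, value, category) are my assumption about the Kaggle file's layout, and the rows are made-up placeholder values, not real emissions data.

```python
import io

import pandas as pd

# Placeholder CSV text standing in for the downloaded file.
# Column names are assumed; the numbers are invented for illustration.
csv_text = """country_or_area,year,value,category
A,1990,10.0,co2
A,1991,11.0,co2
B,1990,5.0,co2
"""

# read_csv accepts a file path or any file-like object.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.head())
```

With the real download, the io.StringIO(...) argument would be replaced by the path to the CSV file on disk.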

I found the global temperature data on NASA's website as a .txt file; it has the recorded "temperature anomaly" for each year from 1880 to 2020, where "temperature anomaly" is the temperature in Celsius relative to the average temperature from 1951-1980. It contains the exact values as well as "smoothed" values, but I will only deal with the exact data here. To import this data I'll use the requests library to get the data as a string, and then manipulate it using Python until I can put it into a dataframe.

Link to data: https://data.giss.nasa.gov/gistemp/graphs/graph_data/Global_Mean_Estimates_based_on_Land_and_Ocean_Data/graph.txt

Description of data: https://climate.nasa.gov/vital-signs/global-temperature/
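Here is a rough sketch of how the text-to-dataframe step could look. The function name parse_temperature_text, the column names, and the sample text are all hypothetical; the sample uses placeholder numbers and only assumes that each data row is "year, exact value, smoothed value" separated by whitespace.

```python
import pandas as pd


def parse_temperature_text(text):
    """Collect rows shaped like '<year> <value> <smoothed value>' into a dataframe."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Header and separator lines do not start with a number, so they are skipped.
        if len(parts) == 3 and parts[0].isdigit():
            rows.append((int(parts[0]), float(parts[1]), float(parts[2])))
    frame = pd.DataFrame(rows, columns=["year", "temp", "temp_smoothed"])
    return frame.set_index("year")


# With the real data, the text would come from something like:
#   text = requests.get(url).text
# The text below is a stand-in with an assumed layout and made-up values.
sample_text = """Temperature anomaly (C)
-----------------------
Year No_Smoothing Smoothed
--------------------------
1990     0.10     0.20
1991     0.30     0.25
"""

temps = parse_temperature_text(sample_text)
print(temps)
```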

Data Management & Exploring the Data

Now that the data is scraped and loaded into memory, we can begin exploring the data to get a better understanding of what we're working with. To do this, I'll first organize the dataframes, and then make some preliminary plots to see how the data changes with time.

To get a dataframe I could easily use to make plots, I first pivoted the greenhouse gas data so each category has its own column, and summed across all countries for each year. I also got rid of the data for total greenhouse gas emissions without indirect CO2, since I will only use the full total for my analysis in this tutorial. I also noticed some countries logged emissions for HFCs and PFCs separately, but others didn't. To simplify the analysis I combined all of those emissions into one column called HFCs_PFCs_mix. Lastly, I added a column with the temperature data.
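The reshaping described above can be sketched roughly as follows. This is an illustration on a toy table with made-up numbers, not the exact code used for the real data; the input column names and the category labels (co2, hfcs, pfcs) are assumptions.

```python
import pandas as pd

# Toy long-format table (invented values); one row per country, year and category.
raw = pd.DataFrame({
    "country_or_area": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "year": [1990, 1990, 1990, 1990, 1991, 1991, 1991, 1991],
    "category": ["co2", "co2", "hfcs", "pfcs", "co2", "co2", "hfcs", "pfcs"],
    "value": [10.0, 5.0, 1.0, 2.0, 11.0, 6.0, 1.5, 2.5],
})

# One row per year, one column per category, summed across all countries.
ghg = raw.pivot_table(index="year", columns="category", values="value", aggfunc="sum")

# Fold the separately reported HFC and PFC columns into one combined column.
ghg["HFCs_PFCs_mix"] = ghg[["hfcs", "pfcs"]].sum(axis=1)
ghg = ghg.drop(columns=["hfcs", "pfcs"])

# Attach the temperature column by matching on the year index.
temp = pd.Series({1990: 0.1, 1991: 0.2}, name="temp")
df = ghg.join(temp)
print(df)
```

Using pivot_table with aggfunc="sum" does the pivot and the sum across countries in one step, which is why no separate groupby is needed in this sketch.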

The column names represent the categories as follows:

Now, with the dataframe organized, we can make some preliminary plots.

The plot above has a line for each column in the dataframe (except temp, which has different units). This plot shows that the total greenhouse gas emissions (GHGs) are dominated by carbon dioxide (CO2); it has by far the highest emissions by mass out of all the greenhouse gasses shown here. It's hard to see if there is a trend in the other greenhouse gas emissions with time because the y-axis here is only really suited for the scale of CO2 emissions. To fix this, I'll make a separate plot for each of the greenhouse gasses.

I predict that when I make those plots, the total greenhouse gas plot and CO2 plot will look very similar, like they do here, because most of the total greenhouse gas emissions are CO2 emissions.
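One way to give every column its own y-axis is a grid of subplots. The sketch below uses matplotlib with a small made-up dataframe; the column names and numbers are placeholders, and the layout (one row of panels) is just one reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values on very different scales, standing in for the real columns.
df = pd.DataFrame(
    {
        "CO2": [100.0, 98.0, 97.0],
        "CH4": [20.0, 19.0, 17.5],
        "GHGs": [125.0, 121.0, 118.0],
    },
    index=pd.Index([1990, 1991, 1992], name="year"),
)

# One panel per column; each panel scales its own y-axis by default.
fig, axes = plt.subplots(
    nrows=1, ncols=len(df.columns), figsize=(4 * len(df.columns), 3), squeeze=False
)
for ax, column in zip(axes.ravel(), df.columns):
    ax.plot(df.index, df[column])
    ax.set_title(column)
    ax.set_xlabel("year")
fig.tight_layout()
# In a notebook the figure renders automatically; in a script, call plt.show().
```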

I created seven plots: one for each of the six greenhouse gas categories, plus one for the total greenhouse gas emissions. By looking at the y-axis we can see the scale of each type of emissions. The largest, as I noted earlier, is carbon dioxide (CO2), followed by methane (CH4), nitrous oxide (N2O), hydrofluorocarbons and (per)fluorocarbons (HFCs/PFCs), sulfur hexafluoride (SF6), and finally, nitrogen trifluoride (NF3).

We can also see from the plots a general trend in most of the categories. For example, methane, nitrous oxide, and sulfur hexafluoride all appear to have clear downward trends with time, although I can't tell if the relationships are linear or not. My prediction that the greenhouse gas and carbon dioxide plots would look similar holds mostly true, although the right-hand side appears to be a little lower in the GHGs plot than in the CO2 one. This, I believe, is due to the fact that many of the greenhouse gasses other than CO2 have downward trends with time, including the second largest contributor: methane.

Now I'll plot the temperature data to see if there are any obvious trends there. My prediction here is that the temperature will increase with time, based on what I've heard about global warming.

This scatter plot has a point for each year from 1990-2014 representing the global temperature anomaly in that year. It appears to have a linear upward trend with time, strong enough that I am comfortable informally rejecting the null hypothesis of no correlation between temperature and year. To make this trend even clearer, I'll add a linear regression line to the plot using a library called scikit-learn.
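A minimal sketch of fitting such a line with scikit-learn is shown below. The years and temperature values are synthetic (an exactly linear series I made up), so the fitted slope simply recovers the number used to build the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, perfectly linear series for illustration only.
years = np.arange(1990, 2000)
temps = 0.02 * (years - 1990) + 0.3

# scikit-learn expects a 2-D feature array: one row per sample, one column per feature.
X = years.reshape(-1, 1)
model = LinearRegression().fit(X, temps)

slope = model.coef_[0]        # change in temperature per year
intercept = model.intercept_  # value of the line at year 0
fitted = model.predict(X)     # points on the regression line, one per year
print(slope)
```

Plotting fitted against years on the same axes as the scatter plot draws the regression line.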

In my opinion, this linear regression matches the data pretty well. The scikit-learn model used here fits the line to the x and y data by least squares.

Hypothesis Testing & Machine Learning

After performing exploratory data analysis and getting a solid understanding of the data we're looking at, we can move into deeper analysis.

I want to see if there is a correlation between greenhouse gas emissions and global temperature, but I don't know what form that relationship would take. Do greater emissions in a year go with higher temperatures in that year? It might make more sense for emissions to relate to the derivative of temperature, because the amount of greenhouse gasses in the atmosphere drives temperatures on Earth, and emissions represent a change in that amount. For all I know, it might even be a relationship with the second derivative of temperature, because of how the atmosphere changes surface temperatures.

To start, I'll just try plotting temperature vs. greenhouse gas emissions for each year.

This data does not appear to have any meaningful trend, but that's okay.

Now let's try the first derivative. I will approximate the "derivative" with the diff function of a Pandas dataframe, which replaces each entry with the difference between it and the entry before it (by default, the entry in the previous row). Since the rows are one year apart, this gives the change per year.

Because the first row (1990) doesn't have a row before it, the change in temperature is NaN, so let's just remove that row.

Again, there appears to be no meaningful trend in the data. Now let's try the second derivative of temperature.
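The differencing steps described above can be sketched as follows. The temperature values are made up, and the column names d_temp and d2_temp are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Made-up temperature values, one per year.
df = pd.DataFrame({"temp": [0.5, 0.75, 0.625, 1.0]}, index=[1990, 1991, 1992, 1993])

# First difference: each year's value minus the previous year's value.
df["d_temp"] = df["temp"].diff()

# Applying diff again approximates the second derivative.
df["d2_temp"] = df["d_temp"].diff()

# The first row has no predecessor, so its difference is NaN; drop it.
first_diff = df.dropna(subset=["d_temp"])

# The second difference loses one more row for the same reason.
second_diff = df.dropna(subset=["d2_temp"])
print(df)
```

Passing subset to dropna removes only the rows that are missing the column being plotted, rather than every row with any NaN.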

Okay, it seems like we are not going to find the relationship I was expecting. The best-fit lines make it seem like these three plots have downward trends, but given the spread of the data points I don't think the trends are strong enough to reject the null hypothesis of no relationship. My prediction was based on what I know in general about greenhouse gas emissions leading to increased temperature in the long term. The fact that this relationship cannot be replicated here means I assumed something wrong: either there is no relationship between greenhouse gas emissions and temperature or, more likely, something else is going on. Perhaps this window of time (1990-2014) is too small a sample, or maybe the greenhouse gas emission data that I used, which only has data from UN member nations, is not as representative of global greenhouse gas emissions as I thought.
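Judging a trend by eye is subjective, so one option is to back it up with a correlation test. The sketch below uses scipy.stats.pearsonr, which is not something used elsewhere in this notebook; the numbers are made up and the 0.05 threshold is just a common convention.

```python
from scipy.stats import pearsonr

# Invented values standing in for emissions and the yearly change in temperature.
emissions = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
temp_change = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

# r is the Pearson correlation coefficient; p_value tests the null
# hypothesis that the two variables are not linearly correlated.
r, p_value = pearsonr(emissions, temp_change)

alpha = 0.05  # conventional significance threshold (an assumption, not a rule)
reject_null = p_value < alpha
print(r, p_value, reject_null)
```

A small p-value would support rejecting the null hypothesis of no linear relationship; a large one means the data are consistent with no relationship.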

What I'm going to try to do now is throw all the data from each greenhouse gas category at a scikit-learn linear regression model with temperature as the target and see if I get any meaningful result. Hopefully the model will work, and by looking at it I'll be able to get an understanding of why it works.

To test how well it works, I'll calculate the residuals, which are the differences between each actual data point and what the model predicts it would be. So in this case, for each year, I'll subtract the actual temperature anomaly from the predicted temperature anomaly.
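Here is a small sketch of fitting a multi-feature model and computing residuals. The dataframe holds made-up numbers with only two gas columns to keep it short. The residuals follow the description above (predicted minus actual); note that the more common convention is actual minus predicted, which only flips the sign.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented values for illustration; the real dataframe has more columns and years.
df = pd.DataFrame(
    {
        "CO2": [10.0, 11.0, 12.0, 13.0, 14.0],
        "CH4": [5.0, 4.0, 6.0, 3.0, 5.5],
        "temp": [0.30, 0.35, 0.33, 0.41, 0.40],
    },
    index=[1990, 1991, 1992, 1993, 1994],
)

features = ["CO2", "CH4"]
model = LinearRegression().fit(df[features], df["temp"])

# Predicted temperature for each year, then predicted minus actual.
predicted = model.predict(df[features])
residuals = predicted - df["temp"]

# One coefficient per feature, labelled by column name.
coefs = pd.Series(model.coef_, index=features)
print(residuals)
print(coefs)
```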

Now I'll make a boxplot to get some info about the distribution of residuals.

This boxplot shows that the model is a pretty good predictor of temperature, but it's not perfect. There are no outliers, and there are no residuals much greater than 0.1 degrees C. However, the median is a little bit off from zero.
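For reference, the numbers a boxplot summarizes can also be computed directly. This sketch uses made-up residuals and the common 1.5 x IQR rule for flagging outliers, which I am assuming matches the plotting defaults.

```python
import numpy as np

# Invented residuals in degrees C.
residuals = np.array([-0.08, -0.03, 0.0, 0.02, 0.05, 0.09, 0.04])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(residuals, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are usually drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = residuals[(residuals < lower_fence) | (residuals > upper_fence)]
print(median, iqr, outliers)
```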

The model's coefficients show the change in temperature in degrees Celsius for each kiloton of CO2 equivalent of emissions of that greenhouse gas. It makes sense that they are all very small, because we are dealing with large values of greenhouse gasses (total emissions are on the scale of 10^5 kt CO2 equivalent), and temperature is only changing by a fraction of a degree Celsius.

However, I don't think the rest of this model makes much sense. It is saying that some greenhouse gasses, such as CO2 and CH4, have an inverse relationship with temperature, while others, like NF3 and N2O, have a direct relationship. This doesn't make sense to me, logically, because I have been led to believe all greenhouse gasses in the atmosphere should have similar effects on temperature, which is why they are all known as "greenhouse gasses".

So, unfortunately, this model has turned out to be a bit of a nothing burger in terms of learning more about the data. Before we move on, though, I'll plot the model and the actual data on the same axes just to see how similar its predictions are visually.

Judging by this plot, the model does seem to function as a pretty good predictor of temperature over this time span. However, there does appear to be a bit of "overfitting" between 1990 and 1995, where the predicted temperature is decreasing every year. Overfitting means that the model is fitting to the noise in the data, so it may produce results very similar to the observed data but does not represent the actual underlying trends.

The linear regression of temperature vs. year from the exploratory data analysis stage is in my opinion a better model of temperature over time, so let's explore that more. Is a line the best model of that data, or is there some other function that could better represent it? Let's try a second degree polynomial.
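One quick way to compare a line with a second-degree polynomial is numpy's polyfit, used here in place of scikit-learn to keep the sketch short. The data are synthetic and exactly linear, so the quadratic term comes out essentially zero; the years are shifted to start at 0 to keep the fit numerically well behaved.

```python
import numpy as np

# Synthetic, exactly linear series for illustration only.
years = np.arange(1990, 2000)
x = years - years[0]  # years since 1990
temp = 0.02 * x + 0.3

# polyfit returns coefficients from the highest power down to the constant.
quad_coeffs = np.polyfit(x, temp, deg=2)  # [a, b, c] for a*x**2 + b*x + c
line_coeffs = np.polyfit(x, temp, deg=1)  # [slope, intercept]

# A quadratic coefficient close to zero means the curve is nearly a straight line.
curvature = quad_coeffs[0]
print(quad_coeffs, line_coeffs)
```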

The second-degree polynomial regression of temperature data with time actually appears almost linear. This indicates to me that the trend in the data is pretty close to linear, and using anything other than a first-order linear regression might just lead to overfitting.

Conclusions

After all of the analysis, unfortunately I could not find a clear relationship between greenhouse gas emissions and temperature within this data. Total greenhouse gas emissions from UN member nations for the most part decreased from 1990-2014, whereas global temperatures increased. That gives the impression that there is an inverse correlation between emissions and temperatures, which doesn't make sense given what I know about global warming. I thought the underlying relationship could actually be between greenhouse gas emissions and the change in global temperatures, but I could not find this trend in the data either. I concluded this dataset may be too small a sample, or not representative enough of global emissions.

What I did find, however, is a clear, direct linear relationship between temperature and time from 1990-2014. Even when using a second-degree polynomial regression, the relationship seemed close to linear, so I decided the linear model was best.

Thanks for taking the time to read this tutorial. Hopefully, after following me through the data science pipeline, you better understand how to glean insight from data and can use this process to discover interesting things on your own in the future!