Part 1: Data scraping and preparation

Step 1: Scraping competitor's data

First, I need to pull the data from the competitor's website and put it into a Python data structure that I can work with. I do this by using the requests library to get the site's HTML as a string and BeautifulSoup to find the relevant table. I then pass that table to Pandas' read_html function, which imports the data into a Pandas data frame.

Step 2: Tidying the top 50 solar flare data

To tidy the data, I first got rid of the movies column using the drop method. Then, I used iterrows to loop through the data frame and combine each row's date and time into a string that could be parsed by to_datetime, which I used to convert each time entry into a datetime entry. I then dropped the date column and renamed the columns. To deal with missing data indicated by a "-", I used the replace method from Pandas to replace each "-" with NaN.
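The tidying steps above can be sketched roughly as follows. This is a minimal illustration, not the original code: the column names and the two sample rows are made up, and it swaps the row-by-row loop for vectorized string concatenation, which gives the same result on this input. It also replaces "-" before parsing rather than at the end, so that missing times become NaT.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; column names are assumptions, not the scraped headers.
raw = pd.DataFrame({
    "rank": [1, 2],
    "x_class": ["X9.9", "X8.8"],
    "date": ["2003/11/04", "2001/04/02"],
    "region": ["486", "-"],
    "start": ["19:29", "21:32"],
    "max": ["19:53", "21:51"],
    "end": ["20:06", "22:03"],
    "movie": ["link", "link"],
})

def tidy_top50(df):
    # Drop the movie column and turn the "-" placeholders into NaN.
    out = df.drop(columns=["movie"]).replace("-", np.nan)
    # Combine the date with each time column and parse the result.
    for col in ["start", "max", "end"]:
        out[col] = pd.to_datetime(out["date"] + " " + out[col],
                                  format="%Y/%m/%d %H:%M")
    # The date now lives inside each datetime, so the column is redundant.
    out = out.drop(columns=["date"])
    return out.rename(columns={"start": "start_datetime",
                               "max": "max_datetime",
                               "end": "end_datetime"})

tidy = tidy_top50(raw)
```

On a 50-row table, the speed difference between this and an iterrows loop is negligible; the vectorized form is just shorter.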

Step 3: Scraping the NASA data

The NASA table was more difficult to convert to a data frame because it's not inside an HTML table; it's just written out as lines of text. So I used BeautifulSoup to get to the <pre> tag that the "table" is in, and then deleted the lines that weren't data. I then put those lines into rows of a data frame and expanded them so that each data entry was in its own column. Finally, I got rid of the excess columns caused by the way the site is formatted, and gave appropriate names to the remaining columns.
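A minimal sketch of the line-to-column expansion, assuming the text inside the <pre> tag has already been extracted as a string. The sample text, the rule for spotting data lines (a leading four-digit year), and the column names are all assumptions for illustration; the real page layout is not reproduced here.

```python
import pandas as pd

# Made-up text in the general shape of a whitespace-separated listing.
pre_text = """Header line to skip
----------------------
1997/04/01 14:00 04/01 14:15 8000 4000 S25E16 8026 M1.3 extra junk
1997/04/07 14:30 04/07 17:30 11000 1000 S28E19 8027 C6.8 extra junk
"""

# Hypothetical column names for the fields kept in this sketch.
names = ["start_date", "start_time", "end_date", "end_time",
         "start_frequency", "end_frequency",
         "flare_location", "flare_region", "flare_classification"]

def lines_to_frame(text, names):
    # Keep only lines that begin with a four-digit year, i.e. data lines.
    lines = [ln for ln in text.splitlines() if ln[:4].isdigit()]
    frame = pd.DataFrame({"raw": lines})
    # Split on whitespace so that each field lands in its own column.
    wide = frame["raw"].str.split(expand=True)
    # Keep the first len(names) fields and discard the trailing junk columns.
    wide = wide.iloc[:, :len(names)]
    wide.columns = names
    return wide

nasa = lines_to_frame(pre_text, names)
```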

Step 4: Tidying the NASA table

To tidy the NASA table, I first replaced the many different symbols that represent missing data with NaN. I then added a new column to indicate whether a row corresponds to a halo flare, and another to indicate whether the width given is a lower bound, and changed the cpa and width columns so that their values are homogeneous. After that, I converted the date and time columns into datetime columns. I had to account for the cases where the year changes between the start and the end or CME, and fix some values in the data that didn't fit the datetime format (24:00 -> 00:00 the next day).
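The two trickier parts of this step, the flag columns and the datetime edge cases, might look something like the sketch below. The rows are made up, and the column names, the "Halo" marker, and the ">" prefix for lower-bound widths are assumptions about how the source encodes these cases.

```python
import pandas as pd

# Made-up rows; names and markers are assumptions for this sketch.
df = pd.DataFrame({
    "start_date": ["2001/12/31", "2002/05/20"],
    "start_time": ["23:30", "15:00"],
    "end_date": ["01/01", "05/20"],      # end dates carry no year
    "end_time": ["02:00", "24:00"],
    "cme_angle": ["Halo", "120"],
    "cme_width": ["360", ">150"],
})

# Flag halo rows, then make the angle column purely numeric.
df["is_halo"] = df["cme_angle"] == "Halo"
df["cme_angle"] = pd.to_numeric(df["cme_angle"].where(~df["is_halo"]))

# Flag lower-bound widths, then strip the marker and convert to numbers.
df["width_lower_bound"] = df["cme_width"].str.startswith(">")
df["cme_width"] = pd.to_numeric(df["cme_width"].str.lstrip(">"))

def to_dt(date_str, time_str):
    # "24:00" is not a valid clock time: parse it as 00:00 and add one day.
    rolled = time_str == "24:00"
    stamp = pd.to_datetime(date_str + " " + time_str.where(~rolled, "00:00"),
                           format="%Y/%m/%d %H:%M")
    return stamp + pd.to_timedelta(rolled.astype(int), unit="D")

start = to_dt(df["start_date"], df["start_time"])
# Borrow the start year for the end date first...
end = to_dt(start.dt.year.astype(str) + "/" + df["end_date"], df["end_time"])
# ...then move the end one year forward if it would precede the start.
end = end.where(end >= start, end + pd.DateOffset(years=1))
```

The year fix relies on the assumption that no event lasts longer than a year, so an end that falls before its start can only mean the year rolled over.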

Part 2: Analysis

Question 1: Replication

Top 50 from NASA table:

SpaceWeatherLive top 50:

I think I've replicated the top 50 somewhat well. The first three clearly match, based on classification and time. The third flare and some others have rounded-down classifications in the NASA table (X17.2 -> X17). The fourth flare, rated as X17 on SWL, is missing from my NASA table. I checked the NASA website to see if I lost the data somewhere, but it is not there either. However, there is a flare that occurred on the same date around the same time that is classified as X1.7 instead of X17, which could be an error in the NASA table. The next 14 flares seem to match, but after that the NASA table is missing a couple of X5.4s.

I've also noticed that the NASA table seems to round the time data, while SWL has it to the minute.

Question 2: Integration

I defined "best matching" as having the closest start datetime. In my implementation, the higher ranks get to find a match first, so if a lower rank is closest to a flare that's already been matched, it has to match with the flare with the second closest start datetime.

This solution works well for most of the flares in the top 50, as you can see by looking at the ranked flares in order and their classifications. The classifications match what's on SpaceWeatherLive, and descend as rank descends. The exception is the 42nd ranked flare, which is classified as M1.8 (all top 50 should be X-class) but has a starting time not too far off from the 42nd ranked flare on SWL. No other flare in the NASA table occurs on that day. At the 4th ranked flare you can see the possible error I mentioned in the last question: what should be an X17 flare according to SWL is listed as an X1.7 flare.

I made a cutoff at 6 hours; if the closest "match" for a flare in SWL is 6 hours or more apart from its start time, that match is not made. Due to this there is not a match for every flare in the top 50—in fact there are only 38 matches. This makes sense, because as you can see in the previous question as well, the NASA table seems to be missing some flares that are in the SpaceWeatherLive table.

Question 3: Analysis

My intention for this section is to make a plot that shows whether strong solar flares cluster in time. I'll do this by graphing the number of flares in the top 50 per month. If there is a large variance in the number of top 50 flares among different months (in other words, if a few months have a lot of flares in the top 50 while other months have few or none), that would indicate that strong flares do indeed cluster in time.

This is a stacked bar graph showing the number of solar flares per month from the NASA table for flares inside and outside of the top 50, where being in the top 50 is defined as being matched with a flare from SpaceWeatherLive's table. Flares in the top 50 are shown in purple, while the rest are shown in salmon.
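The per-month counts behind a stacked bar graph like this one could be built as in the sketch below. The rows are made up, and the `top50_rank` column (NaN when a flare has no SpaceWeatherLive match) is an assumed way of recording the matches.

```python
import pandas as pd

# Made-up rows; top50_rank is NaN for flares without a match.
flares = pd.DataFrame({
    "start_datetime": pd.to_datetime(["2003-10-28 11:10", "2003-10-29 20:55",
                                      "2003-11-04 19:29", "2005-01-15 06:00"]),
    "top50_rank": [3.0, float("nan"), 1.0, float("nan")],
})

# Label each flare by whether it was matched to the top 50.
flares["group"] = flares["top50_rank"].notna().map({True: "top50",
                                                    False: "other"})
month = flares["start_datetime"].dt.to_period("M")

# Rows are months, columns are the two groups, values are flare counts.
counts = flares.groupby([month, "group"]).size().unstack(fill_value=0)

# Add the months with no flares at all so that gaps stay visible.
all_months = pd.period_range(month.min(), month.max(), freq="M")
counts = counts.reindex(all_months, fill_value=0)

# Spread of the monthly top-50 counts, as one rough clustering measure.
top50_variance = counts["top50"].var()
```

With matplotlib installed, `counts[["top50", "other"]].plot(kind="bar", stacked=True, color=["purple", "salmon"])` would draw the stacked bars from this table.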

From the graph, I don't think there is enough evidence to conclude that strong solar flares cluster in time. There are many months with only one solar flare in the top 50, and the month with the most has only four. The graph does, however, bring to light a strange lack of solar flares from 2007 to 2011. This could be due to some astronomical event that I don't know about, or it could have something to do with the way these solar flares were detected and logged.