Initial Investigations

Now that we have data, the natural next step is to perform some initial investigations. For example, how much data do we have?

data.size
// res0: Int = 2082

We have 2082 records, one for each month from January 1850 to June 2023. This isn't a huge amount, but it's certainly too much for us to analyze just by looking at it. This is where exploratory data analysis, the focus of this part of the book, comes into play. We'll see many techniques over the next chapters, but we're starting with the most basic.
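As a quick check that 2082 is the count we'd expect, we can redo the arithmetic. This is only a sketch: the names below are illustrative, and it assumes the series has exactly one record per month and ends in month 6 of 2023.

```scala
// Illustrative names, not part of the book's code.
val fullYears = 2022 - 1850 + 1 // complete years, 1850 through 2022
val monthsIn2023 = 6            // assumed: 2023 is covered up to month 6

// Twelve records per complete year, plus the partial final year.
val expectedRecords = fullYears * 12 + monthsIn2023
```

If `expectedRecords` matches `data.size`, there are at least no obviously missing or duplicated months.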

Perhaps the most basic technique is to just look at some of the data. Here's the first element.

data.head
// res1: Record = Record(
//   year = 1850,
//   month = 1,
//   anomaly = -0.67456436,
//   lower = -0.98177195,
//   upper = -0.3673568
// )

This tells us that this element refers to January of 1850, when the average temperature was about 0.67C below the 1961-1990 baseline, with lower and upper error bounds of approximately -0.98C and -0.37C. (How do we know the meaning of these fields? By reading the documentation, in particular the linked paper.)
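To make the relationship between these fields concrete, here is a minimal sketch. The `Record` definition is an assumption that mirrors the printed field names (the real type may differ, for example in its numeric types), and `first` simply copies the values shown above.

```scala
// Assumed shape of Record, mirroring the fields printed above.
final case class Record(
    year: Int,
    month: Int,
    anomaly: Double,
    lower: Double,
    upper: Double
)

// The first element, copied from the output above.
val first = Record(1850, 1, -0.67456436, -0.98177195, -0.3673568)

// The anomaly should sit inside the interval given by lower and upper.
val withinBounds = first.lower <= first.anomaly && first.anomaly <= first.upper

// The width of the interval gives a rough sense of the uncertainty.
val intervalWidth = first.upper - first.lower
```

Here `withinBounds` is true and the interval is a little over 0.6C wide, which is large compared to the size of the anomaly itself.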

We can also look at the last element.

data.last
// res2: Record = Record(
//   year = 2023,
//   month = 6,
//   anomaly = 1.0509841,
//   lower = 0.9951409,
//   upper = 1.1068273
// )

Here we have information from June of 2023, and the temperature is now above the baseline. This seems like it might support global warming, but what about the data in between? Looking at the same month from every year is likely to still be too much to read, but we could look at the same month from each decade.

val decades = data.filter(r => r.year % 10 == 0 && r.month == 6)
// decades: ArraySeq[Record] = ArraySeq(
//   Record(
//     year = 1850,
//     month = 6,
//     anomaly = -0.34424013,
//     lower = -0.60947233,
//     upper = -0.079007894
//   ),
//   Record(
//     year = 1860,
//     month = 6,
//     anomaly = -0.21145956,
//     lower = -0.45290247,
//     upper = 0.029983336
//   ),
//   Record(
//     year = 1870,
//     month = 6,
//     anomaly = -0.23123583,
//     lower = -0.4569157,
//     upper = -0.005555956
//   ),
//   Record(
//     year = 1880,
//     month = 6,
//     anomaly = -0.38053036,
//     lower = -0.57394856,
//     upper = -0.18711212
//   ),
//   Record(
//     year = 1890,
//     month = 6,
//     anomaly = -0.49480304,
//     lower = -0.6899875,
//     upper = -0.2996186
//   ),
//   Record(
//     year = 1900,
//     month = 6,
//     anomaly = -0.21710625,
//     lower = -0.4151495,
//     upper = -0.019062985
//   ),
//   Record(
//     year = 1910,
//     month = 6,
//     anomaly = -0.5593434,
//     lower = -0.7473471,
//     upper = -0.37133965
// ...

With only 18 measurements, this is more manageable. Overall, the data does seem to show increasing temperatures, but it would be much easier to see a trend on a graph than in printed numbers, so in the next section we'll turn to visualizing data. Before we get there, however, it's time for you to do a bit of analysis on your own.

Exercise: Shall I compare thee to a summer's day?

In this chapter we're learning about data analysis, but we're also learning how to work with collections of data such as List.

When we selected data by decades, we rather arbitrarily chose June as our month of interest. Write code that instead selects data from January. Do you still see a similar trend?

This is a small modification of the original code. Instead of looking for r.month == 6 we look for r.month == 1, which is the numeric code corresponding to January.

val januaryByDecades = data.filter(r => r.year % 10 == 0 && r.month == 1)
// januaryByDecades: ArraySeq[Record] = ArraySeq(
//   Record(
//     year = 1850,
//     month = 1,
//     anomaly = -0.67456436,
//     lower = -0.98177195,
//     upper = -0.3673568
//   ),
//   Record(
//     year = 1860,
//     month = 1,
//     anomaly = -0.39058298,
//     lower = -0.6584608,
//     upper = -0.12270518
//   ),
//   Record(
//     year = 1870,
//     month = 1,
//     anomaly = -0.21106681,
//     lower = -0.4358849,
//     upper = 0.013751276
//   ),
//   Record(
//     year = 1880,
//     month = 1,
//     anomaly = -0.39386344,
//     lower = -0.60645384,
//     upper = -0.181273
//   ),
//   Record(
//     year = 1890,
//     month = 1,
//     anomaly = -0.53390056,
//     lower = -0.68983096,
//     upper = -0.37797013
//   ),
//   Record(
//     year = 1900,
//     month = 1,
//     anomaly = -0.5065199,
//     lower = -0.6711187,
//     upper = -0.34192112
//   ),
//   Record(
//     year = 1910,
//     month = 1,
//     anomaly = -0.4327842,
//     lower = -0.5747264,
//     upper = -0.29084206
// ...

The trend is not exactly the same as before, but it is similar enough.
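One way to compare the two selections more directly is to pair them up decade by decade. The following is a sketch rather than a definitive method: it uses plain (year, anomaly) pairs copied from the first three records of each output above, and the names `juneSample`, `januarySample`, and `differences` are made up for illustration.

```scala
// (year, anomaly) pairs copied from the June and January outputs above.
val juneSample = Seq((1850, -0.34424013), (1860, -0.21145956), (1870, -0.23123583))
val januarySample = Seq((1850, -0.67456436), (1860, -0.39058298), (1870, -0.21106681))

// Pair up matching decades and subtract the June anomaly from the January one.
// A negative result means January's anomaly was lower than June's that year.
val differences = juneSample.zip(januarySample).map {
  case ((year, june), (_, january)) => (year, january - june)
}
```

With the real data, the same `zip` and `map` could be applied to `decades` and `januaryByDecades`, reading the year and anomaly from each `Record`. This relies on both collections listing the same years in the same order.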

Exercise: Statistics is the Grammar of Science

Can you think of other ways we could analyse the data to see if there is a difference in temperature over time, apart from extracting data by decade and visualizing it? This is an open-ended question; any answers are good answers!

There are no right or wrong answers to this, but the more you've studied statistics, the more answers (and the more complex answers) you are likely to come up with.

Here are just a few ideas:

  • The main data point is the divergence from the baseline of 1961-1990. So perhaps we could sum the divergences before that period and compare them to the sum of divergences after that baseline? We could also compute sums by month, and compare those, if we suspect the divergence changes by month.

  • As a variation on the above idea