Thursday, April 15, 2010

TempLS - what's been happening, and where it's going

The development of the least squares calculation of a regional or global temperature is now spread over a series of posts, so I'd like to gather together what has been done, and where it might be going. The list of posts is here:

Least squares

The first post, GHCN processing algorithm, introduced the basic idea, variants of which Tamino and Jeff/Romanm had been using to combine stations in grid cells. The model is that the array of station temperature observations, over month and year, can be described as the sum of a local function L, depending on station and month (not year), and a global temperature G, depending only on year. The idea is that L contains the unwanted effects of site variability and seasonal effect, while G is the item of interest.

A weight function is used for the fitting. This has a value for each month-year station combination, and is the product of two components:
1. It is zero if there is no observation
2. It varies inversely with area density of stations.
The density is estimated with a 5deg x 5 grid. the stations falling in each cell are counted, and the cell area divided by that number is the cell estimate.

So a major difference with other methods ia that no grid averages are calculated. A grid is invoked, but only to count the stations contained.

The LS leads to a big linear system (t(B) is transpose):
A1*L + B*G = R1
t(B)*L + A2*G = R2
In principle, there could be about 90,000 equations and unknowns. But A1 and A2 are diagonal matrices, so L can be eliminated (the time consuming step), giving about 100 (number of years) equations for G. This takes about a minute of computing time for a global set. Total time for big global runs is generally less than two minutes.

That first post then pointed to the R code, and gave some basic results for land-only GHCN global data.

Comparisons

The next post, Comparison of GHCN results, was meant to compare the results with other efforts and the major indices. However, Zeke had done the hard work there - mine just slotted in, and seemed to give results very much inl line with the others.

More results

A side-effect of using R is that it is easy to create all sorts of masks for selecting different kinds of sites. The GHCN file v2.temperature.inv is read in as a dataframe, and each of its components can be used in logic tests. This includes latitude and longitude, country code, altitude, pop, urban status, etc. So the next post, More GHCN results., announced Ver 1.1, and showed some results for much-discussed classifications like urban/rural, airports etc. The main V1.1 change was to separate out the process of removing duplicates into a preprocessor program, since it only needs to be done once for any version of v2.mean.

Mechanics

The most commented post was the next, Tips on File Download. The lead-up to V2 was the effort to make the process more user-friendly - the adoption of a text repository, and structuring the code so the parts needed by users were visible at the top.
The next post, Version 1.2 of GHCN processor, described these improvements in detail.

Using the capability

The stimulus prompting much of this effort was Steve Hempell's determined efforts to get the code to work for him. The code was much improved by this (and helped by Carrick, Mosh and others). I posted another set of tests of some past issues - whether "dropped" GHCN stations had been trending down relative to "kept" stations, for example. This used an extension to the inventory categories whereby stations could be selected also by reporting history.

Then Steve got to work on the questions that were interesting him. Two posts were based on his thorough examination:
Latitudinal temperature histories and trends and Continents and trends.

The Big City Trends post related to an interest that Joseph had been exploring with his code.

Incorporating Ocean data

The least squares approach makes it relatively easy to combine data from multiple sources. The next challenge was a full global index, and this was done in V 1.3. It involved more preprocessing, so the inventory file was augmented by a series of synthetic stations based on the HadSST2 ocean grid cells. These could then be included in the fitting in the normal way. The cell density weighting took care of scaling issues. The results were described in Incorporating SST and Land/Ocean models. The fit to SST is excellent - the fit to combined land/sea is still good, but with more disparity. This could probably be improved by better balancing at the shore. V1.3 is described in TempLS Version 1.3 - SST, Land/Sea index, distance utility.

What's next?

Probably more applications posts. On the technical side, using the R global mapping capability. I expect V1.4 will routinely give a map showing the distribution of stations in each run. But beyond that, LS makes it possible to use other parametrisations, particularly spatial. We could show, say, trends expressed in terms of spherical harmonics. R does not seem to do shade plots, but its filled contours aren't bad.

In another direction, finer temperature plots will be possible. Monthly fitting needs little more calculation than annual. The months can be done separately, but for the big elimination step, that only reduces the computing by 1/12, which then has to be done 12 times.

1. Very Cool.

I'm just begining to grasp the power of the dataframe. After digesting your masking code ( for masking off stations that dont fit the logical criteria) I discover the subset() function.

So now, given a dataframe of inventory ( Inv)

I can select the stations that fit a criteria with one simple line.
airports <-subset(inv, inv$airport[]==TRUE); and I have an inventory file with just airports. and if I want Urban airports... UrbanAirport<-subset(airports,airports$Rural[]=="U");

getting to like R

2. Steve,
Thanks, I didn't know about the subset function. But all I need is a set of integer pointers, rather than a full reorganised dataframe.

I like R too, generally, though I wish they would make it easier to shed the dataframe attributes when you take data out.

3. Are you working on a write up for the system of equations for your solution?

4. Carrick,
Yes, I'd promised that. I wanted to link it to the code, with detailed explanation, but the code keeps changing. Real soon now.

5. Yes, one doesnt need a dataframe, but after you read inv in with a read.fwf you effectively have a dataframe. you've added names to inv so you are accessing as a dataframe so subset() should work.

Shedding attributes:

There are a couple ways to do it.

a<-c(1,2,3)
b<-c(2,3,4)
d<-c(5,6,7)
e<-c("U","R","U")
test<-data.frame(a,b,d)
test2<-data.frame(test$a,test$e);
# now your column names maybe funny so assign them the
# way you want

Should work. or in my code I'll do things like this:

####################################
Get_Places<-function(ghcnid,inv){

# ghcnid is the 11 digit identifier
# inv is a data.frame of the *inv file.

# we are passed a list of stations.
# now we look up those stations in the data frame
# match returns a vector of indices that match the large
# Inv file.

Locations=match(ghcnid,inv$Id) Id=inv$Id[Locations];
lat=inv$lat[Locations]; lon=inv$lon[Locations];

# now we have created 3 vectors Id, lat, lon
# they contain the lat, lon for the ghcnid, that we passed in.

nt=data.frame(Id,lat,lon,stringsAsFactors=FALSE);
# and our names are the same as the original frame

# now we have created a dataframe that just contains the lat/lon

return(nt);

}

So for documentation or to creat a KML file I do something like this:

inventory<-Get_Ghcn_Inv(filename); # gets a dataframe version
airports<-subset(inventory,inventory$Aiport[]==TRUE);; kml_list<-Get_Places(inventory,airports$ID);

# and you can just call get_places with airports and it matches all
of them.

and now I have a dataframe named kml_file that has ghcnid,lat,lon