Wednesday, June 24, 2015

TempLS: new V3

TempLS is my R program that I use for calculating global average temperatures. A recent review of the project is here. A few weeks ago I posted an iterative calculation sequence which is the basis of the program. I noted that the preliminaries for this were simple and mechanical. It requires the data just organised in a large array, and a set of weights, derived from station locations.

I've never actually released version 2.2 of TempLS, which is what I have been using. I did a documented release of v2.1 here. Meanwhile, the program continued to accumulate options, and got entwined with my automatic data acquisition system. So I thought it would be useful to create a simplified version with the basic functionality. I did post a simplified version once before, but that was land only, and with more primitive weighting.

So I'm planning to post a description with three parts. The first will be data gathering, the second weighting, and third is the calculation, which has already been covered. But there may be results to discuss. This is the first, which is about the data gathering. There is no fancy maths, and it is pretty boring, but useful for those who may use it.

The program is in four stages - GHCN, SST, weights and calc. The first two make up the data organising part, and are described here. Each stage stores its output, which the next reads, so can be run independently. I have a simple job control mechanism. You set a variable with four characters, the default is
job="UEGO"

There is one character for each stage; current provisions, according to location, are:
U,AUnadjusted or Adjusted GHCN
D,EERSST Ver 3b or V4
G,MGrid or mesh-based weighting
O,POutput, without or with graphics

Each character can be lower case, which means skip that stage, but use the appropriate stored result. The point of this brevity is that the letters are used to name stored intermediate files, so the process can be re-started at any intermediate stage.

The program should be run in a dedicated directory - it will create subdirectories as needed. It starts with a section that defines some convenience functions, interprets the job code to make the names of the intermediate files (I use .sav for binary saved files), and defined the year range yr. This is hard-coded, but you can alter it.

# Function TempLS3.r. Written in four stages, it collects data, makes weights 
#and solves for global temperature anomaly. 
#Written by Nick Stokes, June 24 2015. 
#See  https://moyhu.blogspot.com.au/2015/06/templs-and-new-ersst-v4-noaa-is-now.html
#Functions
isin=function(v,lim){v>=lim[1] & v<=lim[2]}
mth=function(yr){round(seq(yr[1]+0.04,yr[2]+1,by=1/12),3)}
sp=function(s)(unlist(strsplit(s,"")))
pc=function(...,s=""){paste(...,sep=s,collapse=s)}
# Start program
if(!exists("job"))job="UEGO"
i=sp(job)
kind=tolower(i)
do=i!=kind
yr=c(1900,2015); yrs=yr[1]:yr[2]; # you can change these
ny=diff(yr)+1
j=c(1,1,2,2,1,3,1,4); R=1:4; #R will be the names of storage files
for(i in 1:4){ # make .sav file names
 R[i]=pc(c(kind[j[i*2-1]:j[i*2]],".sav"))
 if(!file.exists(R[i]))do[i]=TRUE; # if filename will be needed, calc
}


Stage 1 begins with some function definitions, which I'll describe
if(do[1]){  # Start stage 1 - read land (GHCN)
 getghcn=function(){ # download and untar GHCN
  fi="ghcnm.tavg.latest.qcu.tar.gz"
  fi=sub("u",ki,fi)  # switch to adjusted
  download.file(pc("https://www1.ncdc.noaa.gov/pub/data/ghcn/v3/",fi),fi);
  gunzip(fi,"ghcnv3.tar",overwrite=T);
  untar("ghcnv3.tar",exdir=".");
  di=dir(".",recursive=T)
  h=grep(pc("qc",ki),di,val=T) # find the qcu files un subdir
  h=grep("v3",h,val=T) # find the qcu files un subdir
  file.rename(h,Rr)
  unlink("ghcnm.*")
 }


It uses R facilities to download, unzip and untar. Then it finds the resulting files, data and inventory, which have a date-dependent name, and copies them into the current directory called ghcnu.dat etc. Then it removes the empty directory (else they accumulate). Download takes me about 30 secs. If you are happy with the data from before, you can comment out the call to this.

The next routine reads the data file into a big matrix with 14 rows. These are the station number, year, and then 12 monthly averages. It also reads the flags, and removes all data with a dubious quality flag. It returns the data matrix, converted to °C.
readqc=function(f){  # read the data file
 b=readLines(f)
 d=matrix(NA,14,length(b));
 d[1,]=as.numeric(substr(b,1,11));
 d[2,]=as.numeric(substr(b,12,15)); # year
 iu=0:13*8+4
 for(i in 3:14){
  di=as.numeric(substr(b,iu[i],iu[i]+4)); # monthly aves
  ds=substr(b,iu[i]+6,iu[i]+6); # quality flags
  di[ds!=" " | di== -9999]=NA  # remove flagged data
  d[i,]=di/100  # convert to C
 }
d
}

The rest is fairly simple:
 require("R.utils") # for gunzip
 ki=kind[1]  # u or a
 Rr=paste("ghcn",ki,c(".dat",".inv"),sep="")
 getghcn()
 d=readqc(Rr[1])
 b=readLines(Rr[2])
 ns=length(b)
 wd=c(1,11,12,20,21,30,1,3,31,37,88,88,107,107)
 iv=data.frame(matrix(NA,ns,6))
 for(i in 1:7){
  k=substr(b,wd[i*2-1],wd[i*2]); 
  if(i<6)k=as.numeric(k)
  if(i==1)K=k else iv[,i-1]=k;
 }
 colnames(iv)=c("lat","lon","country","alt","airprt","urban")
 d=d[,isin(d[2,],yr)]  # restrict to years in yr
 ib=match(d[1,],K)-1 # convert station numbers to place in order (1:7280
 x=matrix(NA,12,ny*ns)
 ic=ib*ny+d[2,]-yr[1]+1 # locations for each row of data
 x[,ic]=d[3:14,] # fill x
 dim(x)=c(12*ny,ns);   x=t(x)  # rearrange
 save(x,iv,file=R[1])
} # end stage 1

The inventory is read, and the station numbers in d are replaced by integers 1:7280, being the corresponding rows of the inventory. iv is a reduced inventory dataframe; lat and lon are the first two columns, but other data is there for post-processing.

Then it is just a matter of making a blank x matrix and entering the data. The matrix emerges ns × nm, where ns is number of stations, nm months. Later the array will be redimensioned (ns,12,ny). Finally d and iv are saved to a file which will be called u.sav or a.sav.

So now we come to part 2 - SST. Here is how it starts:
if(do[2]){ # Stage 2 - read SST
# ERSST has convention lon 0:358:2, lat -88:88:2
 ki=kind[2]; 
 s=c("ersst4","ersst.%s.%s.asc","ftp://ftp.ncdc.noaa.gov/pub/data/cmb/ersst/v4/ascii")
 iy=1;  if(ki=="d"){s=sub("4","3b",s);s[1]="../../data/ersst";iy=10;yrs=seq(yr[1],yr[2],10);}
 if(!file.exists(s[1]))dir.create(s[1])
 x=NULL; n0=180*89;

ki is d,e for 3b,4; this is just making filenames for the cases. ERSST comes as a series of 89 x 180 matrices (2°), 10 years per file for 3b, and 1 year for v4. This is good for downloading, as you only have to download a small file for recent data. That is a grid of 16020 (some land), which is more than we need for averaging. SST does not change so rapidly. So I select just every second lat and lon. In v2.2 I included the missing data in a local average, but I'm not now sure this is a good idea. It's messy at boundaries, and the smoothness of SST makes it redundant. So here is the loop over input files
    for(ii in yrs){
    # read and arrange the data
  f=sprintf(s[2],ii,ii+9)
  if(ki=="e")f=sprintf(s[2],"v4",ii)
  g=file.path(s[1],f)
  if(!file.exists(g))download.file(file.path(s[3],f),g);
        v=matrix(scan(g),nrow=n0);  
  v=v[,(mth(c(ii,ii+iy-1)) %in% mth(yr))[1:ncol(v)]]/100  # Select years and convert to C
  n=length(v);
  o=v== -99.99 | v< -1  # eliminate ice -1  
  v[o]=NA
  n=n/n0
  dim(v)=c(89,180,n)
  # reduce the grid
  i1=1:45*2-1;i2=1:90*2 # Take every second one
  v=v[i1,i2,]
  # normalise the result; w is the set of wts that were added
  x=c(x,v) 
  print(c(ii,length(x)/90/45))
    }
 

The files, when downloaded, are stored in a subdirectory, and won't be downloaded again while they are there. So to force a download, remove the file. I use cURL to download when there is a new version, but that requires installation. The first part of this code gathers each file contents into an array v. The -9999 values are converted to NA, and also anything less than -1°C. ERSST enters frozen regions as -1.8°C, but this doesn't make much sense as a proxy for air temperature. I left a small margin, since those are probably months with part freezing (there are few readings below -1 that are not -1.8).

Now to wrap up. A lat/lon inventory is created. Locations with less than 12 months data in total are removed, which includes land places. Then finally the data array x and inventory are stored.
 iv=cbind(rep(i1*2-90,90),rep((i2*2+178)%%360-180,each=45)) # Make lat/lon -88:88,2:358
 m=length(x)/90/45
 dim(x)=c(90*45,m)
 rs=rowSums(!is.na(x))
 o=rs>12  # to select only locations with more than 12 mths data in total
 iv=iv[o,] # make the lat/lon
 x=x[o,] #discard land and unmeasured sea
 str(x)
 ns=sum(o)  # number stations in this month
 rem=ncol(x)%%12
 if(rem>0)x=cbind(x,matrix(NA,ns,12-rem)) # pad to full year
 str(x)
 dim(x)=c(ns,12*((m-1)%/%12+1))
 save(x,iv,file=R[2])
} # End of stage 2


So I'll leave it there for now. The program has the potential to create a.sav, u.sav, d.sav and e.sav, each with a data array x and an iv. A simple task, but in lines it is more than half the program. Stage 3, next, will combine two of these and make a weight array.

Here is the code presented here in one piece. It is stand-alone.

# Function TempLS3.r. Written in four stages, it collects data, makes weights 
#and solves for global temperature anomaly. 
#Written by Nick Stokes, June 24 2015. 
#See  https://moyhu.blogspot.com.au/2015/06/templs-and-new-ersst-v4-noaa-is-now.html
isin=function(v,lim){v>=lim[1] & v<=lim[2]}
mth=function(yr){round(seq(yr[1]+0.04,yr[2]+1,by=1/12),3)}
sp=function(s)(unlist(strsplit(s,"")))
pc=function(...,s=""){paste(...,sep=s,collapse=s)}
if(!exists("job"))job="UEGO"
i=sp(job)
kind=tolower(i)
do=i!=kind
yr=c(1900,2015); yrs=yr[1]:yr[2]; # you can change these
ny=diff(yr)+1
j=c(1,1,2,2,1,3,1,4); R=1:4; #R will be the names of storage files
for(i in 1:4){ # make .sav file names
 R[i]=pc(c(kind[j[i*2-1]:j[i*2]],".sav"))
 if(!file.exists(R[i]))do[i]=TRUE; # if filename will be needed, calc
}
if(do[1]){  # Stage 1 - read land (GHCN)
 getghcn=function(){ # download and untar GHCN
  fi="ghcnm.tavg.latest.qcu.tar.gz"
  fi=sub("u",ki,fi)  # switch to adjusted
  download.file(pc("https://www1.ncdc.noaa.gov/pub/data/ghcn/v3/",fi),fi);
  gunzip(fi,"ghcnv3.tar",overwrite=T);
  untar("ghcnv3.tar",exdir=".");
  di=dir(".",recursive=T)
  h=grep(pc("qc",ki),di,val=T) # find the qcu files un subdir
  h=grep("v3",h,val=T) # find the qcu files un subdir
  file.rename(h,Rr)
  unlink("ghcnm.*")
 }

readqc=function(f){  # read the data file
 b=readLines(f)
 d=matrix(NA,14,length(b));
 d[1,]=as.numeric(substr(b,1,11));
 d[2,]=as.numeric(substr(b,12,15)); # year
 iu=0:13*8+4
 for(i in 3:14){
  di=as.numeric(substr(b,iu[i],iu[i]+4)); # monthly aves
  ds=substr(b,iu[i]+6,iu[i]+6); # quality flags
  di[ds!=" " | di== -9999]=NA  # remove flagged data
  d[i,]=di/100  # convert to C
 }
d
}

 require("R.utils") # for gunzip
 ki=kind[1]
 Rr=paste("ghcn",ki,c(".dat",".inv"),sep="")
 getghcn()
 d=readqc(Rr[1])
 b=readLines(Rr[2])
 ns=length(b)
 wd=c(1,11,12,20,21,30,1,3,31,37,88,88,107,107)
 iv=data.frame(matrix(NA,ns,6))
 for(i in 1:7){
  k=substr(b,wd[i*2-1],wd[i*2]); 
  if(i<6)k=as.numeric(k)
  if(i==1)K=k else iv[,i-1]=k;
 }
 colnames(iv)=c("lat","lon","country","alt","airprt","urban")
 d=d[,isin(d[2,],yr)]  # restrict to years in yr
 ib=match(d[1,],K)-1 # convert station numbers to place in order (1:7280
 x=matrix(NA,12,ny*ns)
 ic=ib*ny+d[2,]-yr[1]+1 # locations for each row of data
 x[,ic]=d[3:14,] # fill x
 dim(x)=c(12*ny,ns);   x=t(x)  # rearrange
 save(x,iv,file=R[1])
}
if(do[2]){ # Stage 2 - read SST
# ERSST has lon 0:358:2, lat -88:88:2
 ki=kind[2]; 
 s=c("ersst4","ersst.%s.%s.asc","ftp://ftp.ncdc.noaa.gov/pub/data/cmb/ersst/v4/ascii")
 iy=1; 
 if(ki=="d"){s=sub("4","3b",s);s[1]="../../data/ersst";iy=10;yrs=seq(yr[1],yr[2],10);}
 if(!file.exists(s[1]))dir.create(s[1])
 x=NULL; n0=180*89;
    for(ii in yrs){
    # read and arrange the data
  f=sprintf(s[2],ii,ii+9)
  if(ki=="e")f=sprintf(s[2],"v4",ii)
  g=file.path(s[1],f)
  if(!file.exists(g))download.file(file.path(s[3],f),g);
        v=matrix(scan(g),nrow=n0);  
  v=v[,(mth(c(ii,ii+iy-1)) %in% mth(yr))[1:ncol(v)]]/100  # Select years and convert to C
  n=length(v);
  o=v== -99.99 | v< -1  # eliminate ice -1  
  v[o]=NA
  n=n/n0
  dim(v)=c(89,180,n)
  # reduce the grid
  i1=1:45*2-1;i2=1:90*2 # Take every second one
  v=v[i1,i2,]
  # normalise the result; w is the set of wts that were added
  x=c(x,v) 
  print(c(ii,length(x)/90/45))
    }
 str(x)
 iv=cbind(rep(i1*2-90,90),rep((i2*2+178)%%360-180,each=45)) # Make lat/lon -88:88,2:358
 m=length(x)/90/45
 dim(x)=c(90*45,m)
 rs=rowSums(!is.na(x))
 o=rs>12  # to select only locations with more than 12 mths data in total
 iv=iv[o,] # make the lat/lon
 x=x[o,] #discard land and unmeasured sea
 str(x)
 ns=sum(o)  # number stations in this month
 rem=ncol(x)%%12
 if(rem>0)x=cbind(x,matrix(NA,ns,12-rem)) # pad to full year
 str(x)
 dim(x)=c(ns,12*((m-1)%/%12+1))
 save(x,iv,file=R[2])
}






1 comment:

  1. Peter,
    The least squares method I use does not need anything special for missing records. It fits a linear model to the monthly means provided by GHCN/ERSST by least squares. I don't handle the process of deriving monthly from daily. There should in principle be a test of whether a station has sufficient data, but GHCN has handled that.

    There is one small exception - SST near ice. ERSST is optimal interpolation, so it provides data without gaps. But some of them are ice, which I treat as missing. That leaves a fringe area which is sometimes ice, sometimes not, and a bias because the missing months are predominantly winter. I haven't found a good way of dealing with that yet.

    ReplyDelete