Numerical integration

Integration is usually defined as a limiting process, whereby the region is subdivided into finer and finer pieces, each of which can be evaluated with some local estimate of the integrand. There is theory about whether that converges. With a finite amount of numerical data, you can't go to a limit, but the same idea applies. You can subdivide until you get a result that depends little on further changes to the subdivision. Sometimes that won't happen before you run out of data to meaningfully estimate the many subdivisions. That's one reason why temperature anomalies are important. With absolute temperature, you would have to divide very finely indeed to be independent of topographic variation, and there just aren't enough reading locations to do that. But anomalies take out a lot of that variation, making practical convergence possible.
You might ask - why bother with different methods, rather than finding just one good one? The answer lies in this idea of reaching an invariant zone. If we can find an integral estimate that agrees across several different methods, that gives the greatest confidence in the result.
Grid considerations

With gridding, you can choose a coarse grid so that every cell has data. But then the data may not be a good estimate for the whole cell area; you lose resolution. A finer grid will start to have cells with no data. Traditionally, these are just omitted, which in effect assumes they have the global average value. This was improved by Cowtan and Way 2013, using kriging. I proposed a simpler approach using latitude band averaging, which gave some of the same benefit. In this post I'll look at upgrading the infill process, using numerics similar to those for solving the diffusion equation. The idea is to find a local average to use for each missing cell.
Improving on lat/lon grids

To do this, I need a better grid system than lat/lon, which creates a big problem at the poles, where cells become very small and skewed. The essential requirements of a grid are that you can quickly allocate a set of scattered data to the right cells, and that you know the area of the cells. There are many other ways of doing this. Lat/lon is based on gridding the sphere as if it were a flat surface, which it very much isn't. You can do much better by gridding a somewhat similar 3D object. The regular (Platonic) polyhedra are attractive, and an icosahedron would be the best of these. But a cube is more familiar, and good enough. That is what I'll use here. The cube is gridded in the normal way, with a square grid on each face, and the sphere surface is radially projected onto the cube.
Below the fold, I'll give details of the infill process, and tests of the improvement in the results using spherical harmonics. And of course there will be the usual WebGL active picture. It will show the cube grid projected on the sphere, and the infill for a typical month, with lines showing the infill dependency.
Cubic grid

Think of a Rubik-style cube, but with finer gridding. I number each face by rows, with the numbering accumulating from face to face. I use a unit sphere, and a cube with each edge running from -1 to 1. So a point (x,y,z) on the sphere projects to (x,y,z)/max(|x|,|y|,|z|) on the cube. And to map a cube point back to the sphere, just divide it by its magnitude.
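A minimal sketch of the two mappings (in Python, purely for illustration; the function names are mine, not from TempLS):

```python
import numpy as np

def sphere_to_cube(p):
    """Radially project a point on the unit sphere onto the surface of
    the cube whose edges run from -1 to 1 in each coordinate."""
    p = np.asarray(p, dtype=float)
    return p / np.max(np.abs(p))

def cube_to_sphere(q):
    """Map a point on the cube surface back to the unit sphere by
    dividing it by its magnitude."""
    q = np.asarray(q, dtype=float)
    return q / np.linalg.norm(q)
```

For example, the sphere point (1,1,1)/√3 projects to the cube corner (1,1,1), and normalizing that corner recovers the sphere point.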
The cell area on the sphere, which is what is needed for area weighting, is simple. The area of the cube face cell centered at r is reduced by the cosine of the view angle, which is 1/|r|, and by the inverse square of the projection distance, which is |r|^-2, for a total factor of |r|^-3. And all those face cells have equal area on the cube.
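A sketch of that weight, with a sanity check that the weights summed over all six faces approach the sphere area 4π. The midpoint cell centres are my own illustrative choice; n = 24 matches the per-face grid used later in the post:

```python
import numpy as np

def sphere_cell_weight(r, flat_cell_area):
    """Sphere area of a cube-face cell centred at r on the cube surface:
    the flat cell area times the view-angle cosine 1/|r| times the
    inverse-square projection factor 1/|r|^2, i.e. |r|^-3 in total."""
    return flat_cell_area * np.linalg.norm(np.asarray(r, dtype=float), axis=-1) ** -3

# Sanity check on one face (the face at z = 1; by symmetry all six
# faces contribute equally): total weight should be close to 4*pi.
n = 24
c = -1 + (2 * np.arange(n) + 1.0) / n          # cell-centre coordinates in (-1, 1)
u, v = np.meshgrid(c, c)
face = np.stack([u, v, np.ones_like(u)], axis=-1)
total = 6 * sphere_cell_weight(face, (2.0 / n) ** 2).sum()
```

With the 24x24 grid the midpoint sum lands within a fraction of a percent of 4π, which is one quick check that the |r|^-3 factor is right.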
Diffusion

After allocating the data points to the cells by projection, I then have populated and unpopulated cells. Populated cells I estimate by the simple average of the data in the cell. Then I find all the unpopulated cells that are next to one or more populated cells. They get the average of the adjacent populated cell values (cells that share an edge), and are marked populated. Then I repeat, until all cells are "populated".
If I were actually doing the integration, I would then multiply those cell average values by the cell areas and add them up. That would end up being a weighted sum of the data. For TempLS, I just want those weights. That is just accountancy: every time one of the original cell values is used in another cell, I increase its weight.
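The diffusion and the weight accounting can be sketched together. This toy version works on a single n x n face and ignores the edge connections between cube faces; the function name and data layout are my own, not from TempLS:

```python
def diffuse_weights(populated_cells, n):
    """For every cell of an n x n face, express its value as a weighted
    average of the originally populated cells.  Each entry of the result
    maps a cell (i, j) to a dict {source cell: weight}."""
    weights = {c: {c: 1.0} for c in populated_cells}
    while len(weights) < n * n:
        newly = {}
        for i in range(n):
            for j in range(n):
                if (i, j) in weights:
                    continue
                # edge-sharing neighbours assigned a value at an earlier stage
                nbrs = [(i + di, j + dj)
                        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                        if (i + di, j + dj) in weights]
                if not nbrs:
                    continue
                # average the neighbours, accumulating per original source
                w = {}
                for c in nbrs:
                    for src, f in weights[c].items():
                        w[src] = w.get(src, 0.0) + f / len(nbrs)
                newly[(i, j)] = w
        if not newly:
            break       # a region with no data at all to diffuse from
        weights.update(newly)
    return weights
```

Multiplying each cell's weights by its sphere area and accumulating per source cell would then give the final data weights for the integral.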
As a matter of practicality, I limit this, so cells are deemed to inherit values from the four closest actual readings. "Closest" is determined by the stage at which they were associated in the diffusion process. For the cells that were not reached until the third or later stages, there are some arbitrary cutoffs, but the cells chosen should still be close.
In the plot below, for the TempLS data (GHCN+ERSST) for April 2015, you first see the populated cells as a colored checkerboard, with different colors for each cube face. Note how the cells do not differ hugely in size, but shrink near the 8 cube corners. Stations reporting in April are marked with black dots. Then the infilled cells are marked in drab colors - grey for the first level, then brown, bluish, etc. I'm using here a 24x24 grid on each face. That is finer in parts than the usual 5° lat/lon grid, and since the SST in TempLS is on a 4° lat/lon grid, there are gaps in sea temperature as well. I've used this finer mesh to show multiple layers in the diffusion.
In the unpopulated (drab) cells, you see little white lines from the centres. These show the connections to the actual populated cells from which they are averaged.
Test of integration

I've tested using spherical harmonics of order up to L=5 (36 functions). Products of these should integrate to produce an identity matrix (they are orthonormal). So I integrate, first using the grid as if all cells were present. Then I integrate using only the cells "populated" in that month of April. Then I integrate a third time, infilling the unpopulated cells. For these integrations I use the weighted formulation.
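A sketch of such an orthonormality test, using scipy's sph_harm on a plain lat/lon midpoint grid for illustration (the post's weights come from the cube grid instead; the grid sizes here are my own choice):

```python
import numpy as np
from scipy.special import sph_harm

# All 36 harmonics up to L = 5, as in the post.
L = 5
nlat, nlon = 90, 180
theta = (np.arange(nlat) + 0.5) * np.pi / nlat        # colatitude
phi = (np.arange(nlon) + 0.5) * 2 * np.pi / nlon      # longitude
th, ph = np.meshgrid(theta, phi, indexing="ij")
w = np.sin(th) * (np.pi / nlat) * (2 * np.pi / nlon)  # midpoint cell areas

# scipy's sph_harm takes the azimuthal angle first, then the polar angle
funcs = [sph_harm(m, l, ph, th) for l in range(L + 1)
         for m in range(-l, l + 1)]

# Gram matrix of weighted products; should be close to the identity
G = np.array([[np.sum(w * f * np.conj(g)).real for g in funcs]
              for f in funcs])
err = np.sum((G - np.eye(len(funcs))) ** 2)
```

The sum of squares `err` is the kind of number tabulated below: near zero when the integration is good, growing as cells are dropped or the weights degrade.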
I tabulate the results below, as sums of squares of differences between the matrix of results and the unit matrix. I have broken the results into columns according to the L value. Each number is actually an average over the products in that section, so it is in fact a power spectrum (more on this in future posts). You can think of L as rising (spatial) frequency. In any case, the test of good integration is that the numbers should be small. I have left out L=0, because that is zero by normalization.
The all-grid integration is accurate to six figures. That reflects that the mesh is fine enough (24x24) to resolve these functions well. For the rest, infilling makes a dramatic improvement of nearly two orders of magnitude, though arguably the square root should be the criterion. Both deteriorate at higher frequency, the infilled case at a faster rate.