Wednesday, March 13, 2013

Covariance of Samples and Populations

Recently I was asked by a customer why the covariance values obtained by the function CalcCovar of the SDL Suite deviate from the covariance values calculated by the function COVAR of Microsofts Excel(tm).

The explanation for the difference is quite simple: Excel assumes that you are dealing with populations, while the SDL Suite assumes that you are working with samples (which is much more realistic).

In order to convert the covariance of a population to the covariance of a sample you have to multiply the covariance by a factor of n/(n-1), with n being the number of values used for the calculation.

Let me add a personal remark: I never use the statistics package of Excel as it has many built-in flaws and peculiarities which guide the inexperienced or unwary user into the wrong direction.

To be specific: in many cases Excel assumes that a set of, for example, five measurements describes a population and not a sample - an assumption which is rather daring and will mislead millions of world-wide users of this package. However, Microsoft is working to improve on these issues (see the Office blog)....

Saturday, March 2, 2013

Fast Calculation of Quantiles

Did you ever face the problem to calculate a particular quantile of a large collection of numbers (about 1 million values) in a minimum amount of time? I came across this problem when adjusting an image processing system to use a better color assignment in false color images. These images were hampered by a few extreme values which caused an unwanted shift of the color scale.

So the idea to improve this situation was quite simple: change the color assignment from a min/max-based algorithm to a quantile-based algorithm, using the 0.01-quantile and the 0.99 quantile as the reference points for the color scale. This would allow for a maximum number of 1% of outliers on either side of the data distribution.

The only problem was that the calculation of the quantiles from a sorted list of values was too slow. Although the sorting was done by a fast sorting algorithm it took about 200 milli-seconds (ms) to obtain a quantile (the images contained about 800000 data points).

So I looked for a faster algorithm, and decided to implement the QuickSelect algorithm in Delphi. In order to get an idea about the speed improvement I performed a few measurements which are summarized in the table below. The calculation times for 0.01-, 0.5-, and 0.99-quantiles have been measured for various sizes of data sets (between 10000 and 5 million elements) and compared to the calculation time based on fully sorting the data sets (using CompSort, a modified bubble sort which is comparable in speed to the well-known quick sort algorithm). Further, the increase of time was measured when passing the data on the stack and not by reference (in order to prevent changing the data array):

                                      Full    0.01 Qu.
         0.01 Qu.  0.5 Qu. 0.99 Qu.   Sort    (Stack)
  NData     [ms]     [ms]     [ms]    [ms]      [ms]
-------------------------------------------------------
  10000     0.12     0.20     0.14    1.78      0.16
  20000     0.28     0.42     0.26    3.47      0.30
  50000     0.65     1.03     0.64    10.5      0.77
 100000     1.39     2.04     1.25    20.8      1.55
 200000     2.51     4.08     2.57    43.0      3.01
 500000     6.3     10.7      6.41    110       8.00
1000000    12.0     22.9     13.3     226      16.2
2000000    25.4     40.9     26.1     470      33.0
5000000    63.0    107.0     68.9    1167      82.0
So, all in all, the QuickSelect algorithm dramatically speeds up the calculation time, its properties can be summarized as follows:
  • The QuickSelect algorithm is about a factor of 18.5(!) faster than the calculation based on sorting the data
  • The calculation time scales linearly with the size of the data set
  • The calculation time depends on the quantile: 0.01-quantiles and 0.99-quantiles are equally fast, 0.50-quantiles (= medians) take about 70% longer to calculate
  • Changing the passing of the data array from call by reference to copying it on the stack increases calculation time by 30%. However, while this approach has the advantage of not changing the data array, it eventually may (and will) lead to a stack overflow when the size of the data set grows and cannot be restricted.
.... needless to say that the QuickSelect algorithm will become part of the next release of MathPack (which is a package of the SDL Component Suite). Following is the Delphi code of the QuickSelect algorithm:
(********************************************************)
function QuickSelect (var data: array of double; 
                      k, First, Last: integer): double;
(********************************************************
 AUTHOR: Hans Lohninger (www.lohninger.com)
 
 ENTRY:  data .......... data array
         k ............. k-th order statistic 
         First, Last ... range of data to be processed

 EXIT:   function returns the value of the k-th smallest
         element of the data array

 REMARK: you are allowed to use this code for free in your 
         own programs provided that do not change the 
         reference to the author (including the Web URL)
 ********************************************************)


function Partition (First, Last, ixPiv: integer): integer;
{--------------------------------------------------------}

var
  i      : integer;
  iy     : integer;
  pivot  : double;
  adouble: double;

begin
pivot := data[ixPiv];
data[ixPiv] := data[First];
data[First] := pivot;
iy := First + 1;
while (iy < Last) and (data[iy] < pivot) do
  inc (iy);
for i:=iy+1 to Last do
  if (data[i] <= pivot) then
    begin
    adouble := data[i];
    data[i] := data[iy];
    data[iy] := adouble;
    inc(iy);
    end;
result := iy - 1;
data[First] := data[result];
data[result] := pivot;
result := result;
end;


var
  stop   : boolean;
  ixSplit: integer;
  ixPiv  : integer;

begin
stop := false;
while not stop do
  begin
  ixPiv := (First+Last) div 2;
  ixSplit := partition(First, Last, ixPiv);
  if k < ixSplit
    then Last := ixSplit
    else
  if k > ixSplit
    then First := ixSplit + 1
    else begin
         stop := true;
         result := data[k];
         end;
  end;
end;