Monday, October 21, 2013

Broken help file

A few weeks ago I received a mail from a customer who complained that the help file of the SDL Suite did not display correctly on his new computer (while the same help file worked as expected on his old computer). Several mails and a few Google searches later it turned out that this is due to a "security feature" of newer versions of Windows.

Around 2005 Microsoft discovered a vulnerability in the help engine which allowed remote code execution via a corrupted CHM file (compiled help file). The security hole was "fixed" by an update which simply blocks the display of HTML pages in the help viewer and shows the misleading message "Navigation to the Website was canceled".

Well, thank you Microsoft, this message is really a big help. Thousands (if not millions) of Windows users will eventually bump into this problem without having a clue about its true nature....

Those who face the same problem may want to have a look at the SDL TechNotes, where I published a simple solution to it.

Tuesday, September 3, 2013

Smoothing Data

Yesterday someone contacted me and asked how to quickly smooth a bunch of data measured over time. While I have done this many times in various projects, I only now discovered that I never published a sample program showing how to smooth data using the SDL Component Suite. So here we are...

This small program allows you to create a time series simulating measured data. The data may be "poisoned" by arbitrary levels of (Gaussian) noise and up to 10 spikes. The resulting series is typical of measured data, which usually contains some amount of noise and occasionally one or two spikes (e.g. when a refrigerator in the vicinity of the measurement device switches off).

Experimenting with this little program quickly shows that normally distributed noise can easily be reduced by most smoothing algorithms, provided that the window width is large enough to cover a sufficient number of random fluctuations and, of course, small enough not to interfere with low-frequency signal components.

A big problem for all integrating filters is spikes. We know from system theory that the response of a filter to a Dirac impulse is the filter's own impulse response. Thus a moving average filter will generate rectangular artefacts, and a polynomial filter will generate parabolic ones. A very good way to get rid of spikes (at least if they are only a few data points wide) is to use a moving median filter.

The images above show the effect of smoothing a noisy signal containing a spike.
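
By the way, if you just need a quick moving median without any library, a minimal sketch in plain Delphi could look like the following. This is only an illustration of the idea (not the SDL Suite implementation); the type and function names are my own, and the window width is assumed to be odd:

uses
  System.Generics.Collections;            // for TArray.Sort

...

type
  TDataArray = array of double;

// minimal sketch of a moving median filter (odd window width assumed);
// border points which are not covered by a full window are copied unchanged
function MovingMedian (const Data: TDataArray; Width: integer): TDataArray;

var
  i, half : integer;
  win     : TDataArray;

begin
SetLength (Result, Length(Data));
half := Width div 2;
for i := 0 to High(Data) do
  if (i < half) or (i > High(Data)-half) then
    Result[i] := Data[i]                  // keep the border points
  else
    begin
    win := Copy (Data, i-half, Width);    // extract the current window ...
    TArray.Sort<double> (win);            // ... sort it ...
    Result[i] := win[half];               // ... and take its middle element
    end;
end;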

If you want to experiment with these kinds of filters yourself, you can download the program here. The archive contains the executable as well as the Delphi XE4 sources, so you can immediately try out the various options. To adjust the code to your own needs you have to install the SDL Component Suite (the Light Edition is free of charge).

Last but not least, two remarks: you may be interested in experimenting with penalized splines as well. They are not primarily intended to filter data automatically; however, the resulting smoothed curves are free of noise and thus nice to look at. A sample program for experimenting with penalized splines may be downloaded from the VIAS Web site. Another (and quite efficient) way to perform smoothing is to use the fast Fourier transform to remove unwanted parts of the signal.

Monday, June 10, 2013

Mahalanobis Distance

Many of us (especially those who do a lot of calculations involving statistical data) have to calculate distances in arbitrary spaces. While this is quite common in everyday life (think, for example, of the calculation of a room diagonal) it may become quite complicated when doing data analysis.

In data analysis we have to consider two special aspects: (1) the space in which a distance is to be calculated is normally not two- or three-dimensional but in many cases of much higher dimensionality; (2) the Euclidean distance we are used to in everyday life is highly inappropriate when the variables of the space are correlated (which is almost always the case in practical applications).

In short, the higher the correlation of the variables describing a p-dimensional space, the more misleading a calculated Euclidean distance will be. As a consequence, using Euclidean distances in data analysis may (and will) lead to wrong results if the variables are correlated.

Fortunately, there is a simple solution to this problem: the "Mahalanobis Distance" (MD). The MD takes the correlation among the variables into account and returns a distance which is undistorted even for strongly correlated variables. While the mathematics behind the calculation of the MD is quite demanding, the application is simple if you use the newly implemented function MahalanobisDistance of MathPack (a package which is part of the SDL Component Suite).

Example

Let's assume we want to calculate the MD between two points of a q-dimensional data space. The positions of the two points are defined by two q-dimensional vectors p1 and p2. In order to specify the dependencies between the variables, the user has to supply the inverse of the covariance matrix of the data set (the inverse can easily be obtained by calling the function Invert).
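
For reference, the textbook definition of the MD is (with C being the covariance matrix of the data set):

    MD(p1,p2) = sqrt( (p1-p2)^T * C^(-1) * (p1-p2) )

In other words, the coordinate differences are weighted by the inverse covariance matrix instead of simply being squared and summed up as in the Euclidean case.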

So, all in all, the MD can be calculated by the following statements (assuming that the data are stored in the data table DataMat):


uses
  SDL_Vector, SDL_Matrix, SDL_math2;

...

var
  p1, p2 : TVector;
  CovMat : TMatrix;
  d      : double;

...

// calculate the covariance matrix of the data table DataMat
DataMat.CalcCovar (CovMat,1,DataMat.NrOfColumns, 
                   1, DataMat.NrOfRows, 1);
// invert the covariance matrix; if the inversion fails
// (e.g. singular matrix) fall back to a trivial substitute
if not CovMat.Invert then
  begin
  CovMat.Fill(0);
  CovMat[1,1] := 1;
  end;
// p1 and p2 contain the coordinates of the two points (not shown here)
d := MahalanobisDistance (p1,p2,CovMat);

Wednesday, March 13, 2013

Covariance of Samples and Populations

Recently I was asked by a customer why the covariance values obtained by the function CalcCovar of the SDL Suite deviate from the covariance values calculated by the function COVAR of Microsoft's Excel(tm).

The explanation for the difference is quite simple: Excel assumes that you are dealing with populations, while the SDL Suite assumes that you are working with samples (which is much more realistic).

In order to convert the covariance of a population to the covariance of a sample you have to multiply the covariance by a factor of n/(n-1), with n being the number of values used for the calculation.
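
Expressed in Delphi, the conversion is a one-liner; the variable names and values below are just an illustration:

var
  PopCovar, SampleCovar : double;
  n                     : integer;

...

n := 25;                                   // number of measurements
PopCovar := 3.84;                          // covariance as delivered by Excel's COVAR
SampleCovar := PopCovar * n / (n - 1);     // sample covariance, here: 4.0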

Let me add a personal remark: I never use the statistics package of Excel, as it has many built-in flaws and peculiarities which lead the inexperienced or unwary user in the wrong direction.

To be specific: in many cases Excel assumes that a set of, for example, five measurements describes a population and not a sample - a rather daring assumption which will mislead millions of users of this package worldwide. However, Microsoft is working to improve on these issues (see the Office blog)....

Saturday, March 2, 2013

Fast Calculation of Quantiles

Did you ever face the problem of calculating a particular quantile of a large collection of numbers (about 1 million values) in a minimum amount of time? I came across this problem when adjusting an image processing system to use a better color assignment in false-color images. These images were hampered by a few extreme values which caused an unwanted shift of the color scale.

So the idea to improve this situation was quite simple: change the color assignment from a min/max-based algorithm to a quantile-based one, using the 0.01-quantile and the 0.99-quantile as the reference points for the color scale. This allows for at most 1% of outliers on either side of the data distribution.

The only problem was that the calculation of the quantiles from a sorted list of values was too slow. Although the sorting was done by a fast sorting algorithm, it took about 200 milliseconds (ms) to obtain a quantile (the images contained about 800000 data points).

So I looked for a faster algorithm and decided to implement the QuickSelect algorithm in Delphi. To get an idea of the speed improvement, I performed a few measurements, which are summarized in the table below. The calculation times for the 0.01-, 0.5-, and 0.99-quantiles were measured for various sizes of data sets (between 10000 and 5 million elements) and compared to the calculation time based on fully sorting the data sets (using CompSort, a modified bubble sort which is comparable in speed to the well-known quicksort algorithm). In addition, the increase in time was measured when passing the data on the stack instead of by reference (in order to prevent changing the data array):

                                      Full    0.01 Qu.
         0.01 Qu.  0.5 Qu. 0.99 Qu.   Sort    (Stack)
  NData     [ms]     [ms]     [ms]    [ms]      [ms]
-------------------------------------------------------
  10000     0.12     0.20     0.14    1.78      0.16
  20000     0.28     0.42     0.26    3.47      0.30
  50000     0.65     1.03     0.64    10.5      0.77
 100000     1.39     2.04     1.25    20.8      1.55
 200000     2.51     4.08     2.57    43.0      3.01
 500000     6.3     10.7      6.41    110       8.00
1000000    12.0     22.9     13.3     226      16.2
2000000    25.4     40.9     26.1     470      33.0
5000000    63.0    107.0     68.9    1167      82.0

So, all in all, the QuickSelect algorithm dramatically speeds up the calculation; its properties can be summarized as follows:
  • The QuickSelect algorithm is about a factor of 18.5(!) faster than the calculation based on sorting the data
  • The calculation time scales linearly with the size of the data set
  • The calculation time depends on the quantile: 0.01-quantiles and 0.99-quantiles are equally fast, 0.50-quantiles (= medians) take about 70% longer to calculate
  • Changing the passing of the data array from call by reference to copying it onto the stack increases the calculation time by about 30%. While this approach has the advantage of not changing the data array, it may (and eventually will) lead to a stack overflow when the size of the data set grows and cannot be restricted.
.... Needless to say, the QuickSelect algorithm will become part of the next release of MathPack (which is a package of the SDL Component Suite). Following is the Delphi code of the QuickSelect algorithm:
(********************************************************)
function QuickSelect (var data: array of double; 
                      k, First, Last: integer): double;
(********************************************************
 AUTHOR: Hans Lohninger (www.lohninger.com)
 
 ENTRY:  data .......... data array
         k ............. k-th order statistic 
         First, Last ... range of data to be processed

 EXIT:   function returns the value of the k-th smallest
         element of the data array

 REMARK: you are allowed to use this code for free in your
         own programs provided that you do not change the
         reference to the author (including the Web URL)
 ********************************************************)


function Partition (First, Last, ixPiv: integer): integer;
{--------------------------------------------------------}

var
  i      : integer;
  iy     : integer;
  pivot  : double;
  adouble: double;

begin
pivot := data[ixPiv];
data[ixPiv] := data[First];
data[First] := pivot;
iy := First + 1;
while (iy <= Last) and (data[iy] < pivot) do
  inc (iy);
for i:=iy+1 to Last do
  if (data[i] <= pivot) then
    begin
    adouble := data[i];
    data[i] := data[iy];
    data[iy] := adouble;
    inc(iy);
    end;
result := iy - 1;
data[First] := data[result];
data[result] := pivot;
end;


var
  stop   : boolean;
  ixSplit: integer;
  ixPiv  : integer;

begin
stop := false;
while not stop do
  begin
  ixPiv := (First+Last) div 2;
  ixSplit := partition(First, Last, ixPiv);
  if k < ixSplit
    then Last := ixSplit
    else
  if k > ixSplit
    then First := ixSplit + 1
    else begin
         stop := true;
         result := data[k];
         end;
  end;
end;
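
For the quantile calculations mentioned above, QuickSelect only has to be fed with the appropriate order statistic. There are several conventions for mapping a quantile to an index; the fragment below uses the simplest one (rounding to the nearest rank) and is meant as an illustration only - note that QuickSelect rearranges the data array:

var
  Values   : array of double;   // filled with the measured data
  n, k     : integer;
  q01, q99 : double;

...

n := Length(Values);
k := Round (0.01*(n-1));                    // index of the 0.01-quantile
q01 := QuickSelect (Values, k, 0, n-1);
k := Round (0.99*(n-1));                    // index of the 0.99-quantile
q99 := QuickSelect (Values, k, 0, n-1);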

Thursday, February 21, 2013

Rotatable 3D surface plot

Yesterday a customer asked me how to create a quick and dirty solution for displaying a rotatable three-dimensional surface (using ChartPack). Here's the solution (for the sake of clarity I use predefined data stored in the 12 by 14 array "Data"):
  1. Put the TPlot3D component on a form
  2. In the OnShow event add the following code:
        // suppress any redraw during the
        // construction of the plot
    P3d.SuppressPaint := true; 
        // adjust the data grid of the component 
    P3d.GridMat.Resize(14,12);  
        // fill the data grid
    for i:=1 to 14 do
      for j:=1 to 12 do
        P3d.GridMat[i,j] := Data[i,j];  
        // set the scales of the axes
    P3D.SetRange(1,14,1,12,0,35);  
        // this triggers a repaint
    P3d.SuppressPaint := false;    
    
  3. In the Object Inspector set the color model ("ColorCodingMode") to "cmThreeColors", the color range to 0 to 35 (properties "ColorScaleLow" and "ColorScaleHigh"), and, of course, the pivot colors ("ColorLow", "ColorMid", and "ColorHigh") according to your requirements.
As a small refinement I added a scrollbar which allows you to adjust the property "ColorScaleHigh". Following is a screenshot of the mini demo:

You can download the sample program written for Delphi XE3 from here. The ZIP file contains the sources and the corresponding executable for Windows. In order to rotate the 3D plot, simply click into the chart and drag the mouse while holding the left mouse button down.

Monday, January 14, 2013

Reentrancy of events

Not very long ago, I encountered a weird situation which took lots of time to resolve (yes, call me stupid): a measurement device delivered data to my program at irregular intervals by triggering an event which took the new value, added it to a FIFO structure and redrew the corresponding graph.

This worked nicely; however, every now and then parts of the drawing disappeared (I used RChart for displaying the graph). The problem was, of course, that the glitch occurred only infrequently, so debugging was in fact almost impossible (how can you detect the condition that parts of a graph are not drawn?).

In such situations, I did what I call "debugging by meditation" (;-): I slept on it, hoping that my brain would find a solution while I was asleep.....

And the solution was simple enough: the event which redraws the graph was called a second time before the previous call had finished (a re-entrant call), clearing parts of the graph. So I simply created the following program structure to block re-entrant calls of the drawing routine:


var
  IsProcessing: boolean; // (private variable of the form)

...
...
...

procedure OnDataTrigger (Sender: TObject; newdat: double);

begin
.... // collect the data (not shown here)

if not IsProcessing then
  begin
  IsProcessing := true;

  .... // display the data

  IsProcessing := false;
  end;
end;
... and the glitches were gone!

A final remark: in my case it was a little bit more complicated, because of additional restrictions of the measurement device. So I used the event to collect the data in an array and called the drawing routine only when enough data were available...

Yes, and the very last remark: do not forget to initialize the "IsProcessing" variable to FALSE.
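
One more thought, not part of the original program but just a sketch: if the drawing code can raise an exception, it is safer to reset the flag in a try..finally block, otherwise a single exception would block all further redraws:

procedure OnDataTrigger (Sender: TObject; newdat: double);

begin
.... // collect the data (not shown here)

if not IsProcessing then
  begin
  IsProcessing := true;
  try
    .... // display the data
  finally
    IsProcessing := false;   // reset the flag even if the drawing raises an exception
  end;
  end;
end;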

Thursday, January 10, 2013

Lifetime of Older Delphi Versions

Did you ever wonder how long it takes until most users of an older Delphi version have switched to the newer one?

Well, this question could be answered simply by looking at the sales figures of Embarcadero; however, I doubt that Embarcadero will make its sales figures available to the public. One idea to get some insight into this question is to exploit Google's search statistics. These numbers are publicly available (at least as relative numbers) and certainly provide an indication of how many people are (still) working with a particular version of Delphi.

Here are the results of this approach (the data points are weekly averages scaled to a maximum of 100%):

So what I did was to visit Google Trends, enter the following keywords, and download the data provided by Google. Search keywords: "Delphi 2009", "Delphi 2010", "Delphi XE", "Delphi XE2", and "Delphi XE3".

The results clearly show that the decay of a particular search keyword starts immediately after the launch of a new release. The interest in the old release drops to half of its initial level within a few months. Only Delphi 2010 needed almost a year to drop to 50% - which may be an indication that users were particularly cautious about moving to XE.

Wednesday, January 9, 2013

Regular Expressions in Delphi

If you ever need regular expressions in your Delphi programs, here's a Web site (not to say the Web site) which provides all the necessary explanations (including code to download):

regular-expressions.info

Delphi/RAD Studio XE or higher supports PCRE. PCRE was developed by Philip Hazel at the University of Cambridge and is a library of regex functions whose syntax and semantics are close to those of the Perl 5 language.
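
Just as a quick appetizer, the TRegEx wrapper shipped with these Delphi versions makes basic usage a one-liner (the pattern and the test string below are, of course, arbitrary examples):

uses
  System.RegularExpressions;    // "RegularExpressions" in Delphi XE

...

var
  IsIsoDate : boolean;

...

IsIsoDate := TRegEx.IsMatch ('2013-01-09', '^\d{4}-\d{2}-\d{2}$');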

Tuesday, January 8, 2013

When was Easter Sunday in 1953?

Did you ever try to implement a calendar? If so, you already know that this is one of the most intricate tasks a programmer can come upon (;-))). More on the background of that can be found, for example, in Wikipedia.

To make things short, here's the Delphi code of the formula developed by Carl Gauss, including the correction of Lichtenberg (the code snippet is taken from the calendar component of the SDL Component Suite):
procedure CalcEasterSunday (Year: integer; var Day, Mon: integer);
(****************************************************
 ENTRY: Year ....... year to be calculated
 EXIT:  Day, Mon ... day and month of Easter Sunday
 ****************************************************)

var
  kx, mk, sk : integer;
  ax         : integer;
  dam        : integer;
  rda        : integer;
  sz         : integer;
  og, oe     : integer;

begin
kx := Year div 100;
mk := 15 + (3*kx + 3) div 4 - (8*kx + 13) div 25;
sk := 2 - (3*kx + 3) div 4;
ax := Year mod 19;
dam := (19*ax + mk) mod 30;
rda := (dam + ax div 11) div 29;
og := 21 + dam - rda;
sz := 7 - (Year + Year div 4 + sk) mod 7;
oe := 7 - (og - sz) mod 7;
Mon := 3;
Day := og + oe;
if Day > 31 then
  begin
  Mon := 4;
  Day := Day-31;
  end;
end;
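
Calling the routine is straightforward, e.g. for the year 1953:

var
  Day, Mon : integer;

...

CalcEasterSunday (1953, Day, Mon);    // yields Day = 5 and Mon = 4
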
The date is calculated for the Gregorian calendar (which is the most widely used one in the Western hemisphere). As you can easily calculate, Easter Sunday in 1953 was the 5th of April....

Sunday, January 6, 2013

Selected Items of a TTreeView Component

Did you ever try to programmatically select a particular node in a TTreeView component? I thought this should be simple enough. Assuming that the particular node to be selected is given by Items[i] we could simply write the following statement:
TreeView1.Items[i].Selected := true;
Well, that works; if we check whether the node is selected we get back a TRUE value. However, the selection is not indicated visually: the TTreeView display clears all previous selections, and that's all.... no indication of the new selection.

After hours of tedious experiments I finally found the solution: in order to make the component indicate the selection, you have to set the focus to the component (even if it already had the focus before selecting the node). So the following two-liner will do the job:

TreeView1.Items[i].Selected := true;
TreeView1.SetFocus;