Monday, October 21, 2013

Broken help file

A few weeks ago I received an email from a customer who complained that the help file of the SDL Suite did not display correctly on his new computer (while the same help file worked as expected on his old computer). Several emails and a few Google searches later it turned out that this is due to a "security feature" of newer versions of Windows.

Around 2005 Microsoft discovered a vulnerability in the help engine which allowed remote code to be executed from within a corrupted CHM file (compiled help file). The security hole was "fixed" by an update which simply blocks the display of HTML pages in the help viewer and shows the misleading message "Navigation to the Website was canceled".

Well, thank you Microsoft, this message is really a big help. Thousands (if not millions) of Windows users will eventually bump into this problem without having a clue about the true nature of the issue...

Those who face the same problem may want to have a look at the SDL TechNotes, where I published a simple solution to it.
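For the record, one workaround that is commonly cited for this symptom (not necessarily identical to the TechNotes solution) is to "unblock" the downloaded CHM file, i.e. to remove the NTFS Zone.Identifier stream which Windows attaches to files coming from the Internet. The little Delphi sketch below (the procedure name UnblockChmFile is just an illustrative choice) does the same thing as the "Unblock" button in the file's Properties dialog:

uses
  SysUtils;

// Remove the Zone.Identifier alternate data stream which marks a file
// downloaded from the Internet as "blocked" and prevents the help
// viewer from rendering its HTML pages.
procedure UnblockChmFile (const FileName: string);
begin
  if FileExists(FileName) then
    DeleteFile(FileName + ':Zone.Identifier');
end;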

Tuesday, September 3, 2013

Smoothing Data

Yesterday someone contacted me and asked how to quickly smooth a bunch of data measured over time. While I have done this many times in various projects, I realized only now that I have never published a sample program which shows how to perform smoothing of data using the SDL Component Suite. So here we are...

This small program allows you to create a time series simulating measured data. The data may be "poisoned" by arbitrary levels of (Gaussian) noise and up to 10 spikes. The resulting data series is typical of measured data, which usually contains some amount of noise and occasionally one or two spikes (e.g. when a refrigerator in the vicinity of the measurement device switches off).

Experimenting with this little program quickly shows that normally distributed noise can be easily reduced by most of the smoothing algorithms, provided that the window width of the smoothing algorithm is large enough to cover a sufficient number of random fluctuations, and, of course, small enough not to interfere with low-frequency signal components.
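To make the window-width trade-off concrete, here is a minimal Delphi sketch of a moving average filter (a plain illustration, not the routine used by the SDL Component Suite):

uses
  Math;

// Each output value is the mean of the (at most) Width input values
// centered on the current position. A wider window suppresses more
// noise but also flattens genuine low-frequency signal features.
procedure MovingAverage (const Source: array of double;
                         var Dest: array of double; Width: integer);
var
  i, j, lo, hi : integer;
  sum          : double;
begin
  for i := 0 to High(Source) do
    begin
    lo := Max(0, i - Width div 2);               // clip the window at the edges
    hi := Min(High(Source), i + Width div 2);
    sum := 0;
    for j := lo to hi do
      sum := sum + Source[j];
    Dest[i] := sum / (hi - lo + 1);
    end;
end;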

A big problem for all integrating filters is spikes. We know from system theory that the response of a filter to a Dirac impulse resembles its own transfer function. Thus a moving average filter will generate rectangular shapes, while a polynomial filter will generate parabolic artefacts. A very good way of getting rid of the spikes (at least if they are only a few data points wide) is to use a moving median filter.

The images above show the effect of smoothing a noisy signal containing a spike.
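If you want to see the moving median idea in code, the following Delphi sketch (again just an illustration, not the SDL Component Suite implementation) replaces each value by the median of its neighbourhood, which removes narrow spikes instead of smearing them out:

uses
  Math;

// Each output value is the median of the (at most) Width input values
// around the current position; spikes narrower than half the window
// simply vanish instead of producing rectangular or parabolic artefacts.
procedure MovingMedian (const Source: array of double;
                        var Dest: array of double; Width: integer);
var
  i, j, k, lo, hi : integer;
  win             : array of double;
  tmp             : double;
begin
  for i := 0 to High(Source) do
    begin
    lo := Max(0, i - Width div 2);
    hi := Min(High(Source), i + Width div 2);
    SetLength(win, hi - lo + 1);
    for j := lo to hi do
      win[j-lo] := Source[j];
    for j := 0 to High(win)-1 do                 // sort the window values
      for k := j+1 to High(win) do
        if win[k] < win[j] then
          begin
          tmp := win[j];  win[j] := win[k];  win[k] := tmp;
          end;
    Dest[i] := win[Length(win) div 2];           // middle element = median
    end;
end;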

If you want to experiment with these kinds of filters yourself, you can download the program here. The archive contains the executable as well as the Delphi XE4 sources, so you can immediately try out the various options. To adapt the code to your own needs you have to install the SDL Component Suite (the Light Edition is free of charge).

Last but not least, two remarks: you may be interested in experimenting with penalized splines as well. They are not primarily intended for automatic data filtering; however, the resulting smoothed curves are free of noise and thus nice to look at. A sample program to experiment with penalized splines may be downloaded from the VIAS Web site. Another (quite efficient) way to perform smoothing is to use the fast Fourier transform to remove unwanted parts of the signal.

Monday, June 10, 2013

Mahalanobis Distance

Many of us (especially those who do a lot of calculations involving statistical data) have to calculate distances in arbitrary spaces. While this is quite common in everyday life (think, for example, of the calculation of a room diagonal) it may become quite complicated when doing data analysis.

In data analysis we have to consider two special aspects: (1) the space in which a distance is to be calculated is normally not two- or three-dimensional, but in many cases of much higher dimensionality; (2) the Euclidean distance which we are used to using in everyday life becomes exceedingly inappropriate when the variables of the space are correlated (which is almost always the case in practical applications).

In short, the higher the correlation of the variables describing a p-dimensional space, the more misleading a calculated Euclidean distance will be. As a consequence, using Euclidean distances in data analysis may (and will) lead to wrong results if the variables are correlated.

Fortunately, there is a simple solution to this problem: the "Mahalanobis Distance" (MD). The MD takes the correlation among variables into account and returns a distance which is undistorted even for strongly correlated variables. While the mathematics behind the calculation of the MD is quite demanding, applying it is simple if you use the newly implemented function MahalanobisDistance of MathPack (a package which is part of the SDL Component Suite).
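For reference, the underlying formula is short even if its derivation is not. For two points p1 and p2 and the covariance matrix C of the data set, the MD is

d(p1, p2) = sqrt( (p1 - p2)^T * C^-1 * (p1 - p2) )

With C equal to the identity matrix this reduces to the ordinary Euclidean distance.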

Example

Let's assume we want to calculate the MD between two points of a q-dimensional data space. The positions of the two points for which the MD has to be calculated are defined by two q-dimensional vectors p1 and p2. In order to specify the dependencies between the variables the user has to supply the inverse of the covariance matrix of the data set (the inverse can be easily obtained by calling the function Invert).

So, all in all, the MD can be calculated by the following statements (assuming that the data are stored in the data table DataMat):


uses
  SDL_Vector, SDL_Matrix, SDL_math2;

...

var
  p1, p2 : TVector;    // positions of the two points (q-dimensional)
  CovMat : TMatrix;    // covariance matrix of the data, inverted below
  d      : double;     // resulting Mahalanobis distance

...

// calculate the covariance matrix of the data stored in DataMat
DataMat.CalcCovar (CovMat, 1, DataMat.NrOfColumns,
                   1, DataMat.NrOfRows, 1);
// invert the covariance matrix (MahalanobisDistance expects the inverse);
// if the matrix cannot be inverted, use a fallback matrix instead
if not CovMat.Invert then
  begin
  CovMat.Fill(0);
  CovMat[1,1] := 1;
  end;
d := MahalanobisDistance (p1, p2, CovMat);