In data analysis we have to consider two special aspects: (1) the space in which a distance is to be calculated is normally not two- or three-dimensional, but often of much higher dimensionality; (2) the Euclidean distance we are used to from everyday life is highly inappropriate when the variables of the space are correlated (which is almost always the case in practical applications).
In short, the higher the correlation of the variables describing a q-dimensional space, the more misleading a calculated Euclidean distance will be. As a consequence, using Euclidean distances in data analysis can lead to wrong results if the variables are correlated.
Fortunately, there is a simple solution to this problem: the "Mahalanobis Distance" (MD). The MD accounts for the correlation among variables and returns a distance which is undistorted even for strongly correlated variables. While the mathematics behind the calculation of the MD is quite demanding, the application is simple if you use the newly implemented function MahalanobisDistance of MathPack (a package which is part of the SDL Component Suite).
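For reference, the usual textbook definition of the MD between two points is given here. The symbols d_M and S (the covariance matrix of the data set) are notation chosen for this note and are not taken from the MathPack documentation:

```latex
d_M(p_1, p_2) = \sqrt{(p_1 - p_2)^{\mathsf{T}} \, S^{-1} \, (p_1 - p_2)}
```

If S is the identity matrix (uncorrelated variables with unit variance), the expression reduces to the ordinary Euclidean distance.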
Example
Let's assume we want to calculate the MD between two points of a q-dimensional data space. The positions of the two points are defined by two q-dimensional vectors p1 and p2. To specify the dependencies between the variables, the user has to supply the inverse of the covariance matrix of the data set (the inverse is easily obtained by calling Invert). All in all, the MD can be calculated by the following statements (assuming that the data are stored in the data table DataMat):
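uses
  SDL_Vector, SDL_Matrix, SDL_math2;
...
var
  p1, p2 : TVector;
  CovMat : TMatrix;
  d      : double;
...
  // covariance matrix of the data table (all columns, all rows)
  DataMat.CalcCovar (CovMat, 1, DataMat.NrOfColumns, 1, DataMat.NrOfRows, 1);
  // invert in place; if the inversion fails, fall back to a
  // zero matrix with only element [1,1] set to 1
  if not CovMat.Invert then
    begin
    CovMat.Fill(0);
    CovMat[1,1] := 1;
    end;
  // CovMat now holds the inverse covariance matrix
  d := MahalanobisDistance (p1, p2, CovMat);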
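To see the effect of correlation in numbers, here is a minimal sketch in plain Python. It does not use the SDL Component Suite; the data set and all function names (covariance_2d, invert_2x2, mahalanobis) are made up for illustration, and the code is restricted to two variables to keep the matrix inversion trivial. Two points are placed at the same Euclidean distance from the centre of a strongly correlated data set, one along the direction of the correlation and one across it.

```python
import math

def covariance_2d(points):
    """Sample covariance matrix (2x2) of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def invert_2x2(m):
    """Inverse of a 2x2 matrix; raises ValueError if it is singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ValueError("matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mahalanobis(p1, p2, inv_cov):
    """sqrt((p1 - p2)^T * inv_cov * (p1 - p2))"""
    d = [a - b for a, b in zip(p1, p2)]
    k = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(k) for j in range(k))
    return math.sqrt(q)

# Hypothetical data set: y is almost equal to x (strong positive correlation)
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
inv_cov = invert_2x2(covariance_2d(points))

center = (3.0, 3.02)   # mean of the data set
along = (4.0, 4.02)    # shifted by (1, 1): follows the correlation
across = (4.0, 2.02)   # shifted by (1, -1): goes against the correlation

print("Euclidean   along/across:",
      math.dist(center, along), math.dist(center, across))
print("Mahalanobis along/across:",
      mahalanobis(center, along, inv_cov),
      mahalanobis(center, across, inv_cov))
```

Both points have the same Euclidean distance from the centre, but the MD of the point lying across the correlation direction is many times larger, because that point is far less typical of the data set.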