Friday, January 29, 2016

Distance Metrics - Part 1



In the world of data science, one important metric that comes up is the distance metric.  How do you measure the difference between two data objects?  What if the data objects are multi-dimensional?  The choice of distance metric is an important one that can change the outcome of your data analyses.  So it's a good idea to review some of the popular and often-used metrics for an understanding of which one applies to your data.


Define a "data object" as a numerically valued profile: each attribute within the profile is a real number.  If a data object consists of N dimensions, then it has N real numbers that define its profile.

Manhattan Distance & Euclidean Distance


Often called city-block distance, the Manhattan distance is a very simple metric which measures the sum of the (absolute) differences in each dimension.  Consider an example in measuring the distance between two data objects: [21, 15, 19] and [32, 47, 10].  The first dimension in each object is 21 and 32, respectively.  The second dimension in each object is 15 and 47, respectively, and so on.  The Manhattan distance measures the difference between these two objects as abs(21-32) + abs(15-47) + abs(19-10) = 11 + 32 + 9 = 52.  That is, the Manhattan distance between these two example data objects is 52.  The units are unimportant in this example, but when considering your data, the units depend on what you're working with.
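The calculation above can be sketched in a few lines of Python (the function name `manhattan` is just an illustrative choice):

```python
def manhattan(a, b):
    """City-block distance: sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan([21, 15, 19], [32, 47, 10]))  # 11 + 32 + 9 = 52
```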

It's easy to see when the Manhattan distance makes sense to use and when it does not.  In a two-dimensional plane, the city-block distance sums the differences in both the x-axis and y-axis directions.  It may make better sense to instead use a metric which takes the bird's-eye distance in a straight line, i.e. the Euclidean distance.  In the image below, two data objects, labeled A and B, are being compared.  In this case, each data object is two-dimensional.  The Manhattan distance between A and B is taken by measuring the length of the purple lines, and the Euclidean distance is taken as the yellow line between points A and B.  As you can see, the Euclidean distance in this case is smaller than the Manhattan distance, as the city-block path goes out of its way in travelling from point A to B.  In a city setting, the city-block distance makes sense, because you can only go where the roads take you, and the roads are typically laid out in a fashion such that you can't drive through the blocks themselves, because you know, ... buildings and such.
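A quick sketch of the contrast, using two hypothetical 2-D points in place of the A and B from the figure:

```python
import math

def euclidean(a, b):
    """Straight-line (bird's-eye) distance between two N-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

A, B = [0, 0], [3, 4]
print(euclidean(A, B))  # 5.0 -- the straight "yellow line"
print(manhattan(A, B))  # 7   -- the longer city-block path
```

As in the figure, the straight-line distance is never longer than the city-block one.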

Manhattan vs. Euclidean

In analyzing the general effectiveness of Manhattan vs. Euclidean, a good rule of thumb is that Manhattan may be best for single dimensions, and Euclidean is best for two dimensions or more.  Perhaps we should leave that at two dimensions for Euclidean, however.  As the dimensional space becomes more complex, we will see that because the Euclidean distance squares the per-dimension differences, some aspects of the data may become over-accentuated when they need not be, or should not be.  In addition, the Euclidean distance gives equal precedence to each dimension, which may or may not be what your data needs.

In the area of wavelets, a single data object may consist of say, 32 dimensions. Each dimension is a real-valued number.  The question of comparing two wavelets comes down to comparing two 32-dimensional data objects.

A typical wavelet

In power & energy research, consider a particular kind of data in which wavelets from different data objects are about 80% similar across most of their dimensional attributes.  Only a handful of the 32 values along the wavelets are significantly different enough to care about.  In this case, the Euclidean distance metric may not be the best tool at hand for measuring the difference between two wavelets, because of 1) the over-accentuation of the few dimensional attributes which are very different from the rest, and 2) the fact that the metric gives equal attention to all attributes, even when some of them aren't significant.
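A small synthetic sketch of point 1 (the values here are made up purely for illustration, not real wavelet data): two 32-dimensional objects differ slightly in every dimension and sharply in just one, and we look at how much of the total distance that one spiky dimension accounts for under each metric.

```python
# Two synthetic 32-dimensional "wavelets": every dimension differs by 0.1,
# except dimension 5, which differs by 10.0.
a = [0.5] * 32
b = [0.6] * 32
b[5] = 10.5

manhattan = sum(abs(x - y) for x, y in zip(a, b))
euclidean_sq = sum((x - y) ** 2 for x, y in zip(a, b))

# Share of the total distance contributed by the single spiky dimension:
print(10.0 / manhattan)          # ~0.76 under Manhattan
print(10.0 ** 2 / euclidean_sq)  # ~0.997 under (squared) Euclidean
```

Because Euclidean squares each difference, the one outlier dimension dominates the comparison almost entirely, while the other 31 dimensions barely register.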

There are a handful of other distance metrics which may be better; for example: Mahalanobis, Minkowski, Canberra, and more.  Stay tuned for more about these metrics in my next blog post.

Further Reading


There are several callouts to research papers in the following community posts:

http://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions

http://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data