Four Years Remaining

On Supervised Visualization of Multivariate Data

Posted by Konstantin 26.02.2009 No Comments
The first step in the analysis of multivariate data is visualization. Histograms of attribute distributions, scatterplots, box-and-whiskers diagrams, parallel coordinate plots, self-organizing maps, and even plots of happy faces are all means of helping a human to visually comprehend multidimensional data and exploit the enormous power of the human brain to detect patterns. Of all these techniques, two-dimensional scatterplots are perhaps the most popular, as they tend to provide an especially "realistic" feel for the data. But when your data has more than two attributes (perhaps hundreds or thousands), how do you choose the two projection coordinates that would provide you with the "best angle" on the data?

The easiest answer to that question is, of course, to pick a pair of attributes A_i and A_j, and simply plot one versus the other. Unfortunately, this doesn't usually work well, especially when the dataset does have hundreds of attributes. Therefore, the most popular approach in practice is to use PCA and project the data onto the two largest principal components, which mostly results in a rather insightful image.

The PCA projection is, however, completely unsupervised. If your data has class labels assigned to points, PCA does not take them into account. No matter what the labeling is, PCA will always produce the same projection onto the coordinates with the highest variation. This might leave the wrong impression that the two classes overlap a lot when in fact they do not, which is not what you need. Usually, in the case of labeled data, you expect a scatterplot to indicate how well separated the two classes are from each other, and how difficult it could be to discriminate between them. It turns out that it is very easy to construct a linear projection with such properties.

The Linear classifier-based Scatterplot

Assume there are two classes in the data and we are interested in a linear projection that demonstrates how separated the classes are. Let us train a linear classifier to discriminate the two classes. It does not matter which algorithm you use, as long as it results in a separating hyperplane. Now naturally, the normal to this hyperplane is the main coordinate of interest to you: it is the direction along which the data will be classified linearly by your algorithm. If there is a coordinate for demonstrating separation, this must be it. The choice of the second projection coordinate does not matter much, so I would propose picking any direction orthogonal to the first.

When you have three classes, you could select the first projection coordinate as the normal of a hyperplane separating the first class from the second, and the second coordinate as the normal of a hyperplane separating the first class from the third.

Finally, note that in general you need not limit yourself to linear classifiers only. Any classifier of the form $y_i = \mathrm{sign}(f(x_i))$ will provide you with an informative coordinate projection function $f(x)$ . This is a natural "supervised" alternative to kernel-PCA or SOM.
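As a rough illustration of the two-class recipe, the sketch below trains a plain perceptron (chosen here only for brevity; any linear classifier would do), takes the normal of its hyperplane as the first coordinate and a random direction, orthogonalized against it, as the second. It is written in dependency-free Python rather than Scilab, and the function names and the toy data are invented for the example. The perceptron is only guaranteed to converge on linearly separable data, hence the cap on the number of epochs.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def train_perceptron(X, y, epochs=100):
    """Classic perceptron; returns (w, b) of the hyperplane w.x + b = 0."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) <= 0:  # misclassified (or on the plane)
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:  # every point is on the correct side
            break
    return w, b


def classifier_projection(X, y, seed=0):
    """Project X onto the hyperplane normal and one orthogonal direction."""
    w, _ = train_perceptron(X, y)
    v1 = normalize(w)                      # first coordinate: the normal
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in v1]  # random candidate direction
    c = dot(r, v1)
    v2 = normalize([rj - c * vj for rj, vj in zip(r, v1)])  # remove v1 part
    return [(dot(x, v1), dot(x, v2)) for x in X], v1, v2


# Toy data (hypothetical): two classes labeled 1 and -1, three attributes.
X = [[2.0, 1.0, 0.5], [3.0, 0.0, 1.0], [2.5, 2.0, -1.0],
     [-2.0, 1.0, 0.0], [-3.0, -1.0, 1.0], [-2.5, 0.5, -0.5]]
y = [1, 1, 1, -1, -1, -1]
points, v1, v2 = classifier_projection(X, y)
```

Plotting `points` with one color per class then shows the separation along the horizontal axis; the vertical axis only spreads the points out.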

Naive supervised linear scatterplot (NS-plot)

To be somewhat more specific, here's a suggestion for a very simple implementation of the idea above. To avoid the use of a potentially complicated linear classifier training algorithm, let us just pick the vector connecting the means of the two classes as the first projection coordinate. The second coordinate is chosen at random and then orthogonalized against the first one. The Scilab code of the whole algorithm is therefore the following:
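```
  // Input:
  //   X - the data matrix (instances in rows, attributes in columns)
  //   C - class assignments (C(i) is the class of instance X(i,:)), coded as 1 and -1
  // Class means as column vectors
  mean_1 = mean(X(C ==  1, :), 'r')';
  mean_2 = mean(X(C == -1, :), 'r')';
  // First coordinate: unit vector connecting the two class means
  v1 = (mean_2 - mean_1)/norm(mean_2 - mean_1);
  // Second coordinate: random vector, orthogonalized against v1 and normalized
  v2 = rand(v1);
  v2 = v2 - v2'*v1*v1;
  v2 = v2/norm(v2);
  // Project every instance onto the two coordinates
  X_proj = X*[v1 v2];
  // Output:
  //   X_proj - the projected coordinates
```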
Notice how much simpler and more efficient it is than PCA. Despite the simplicity, I haven't seen the use of such a plot anywhere else, so let me coin the boring name NS-plot for it. Personal experience shows that the resulting plot is visually never much worse than a PCA plot, and most often the two plots complement each other. Let me illustrate that on two simple examples.

The IRIS dataset. The plots below show the PCA and the NS plots of the famous iris dataset (where I removed the first class). There is clearly no strong advantage of one plot over the other except that PCA is more difficult to compute.

The ARCENE dataset. The following plots depict the 1000-attribute ARCENE sample dataset. We can see how PCA prefers to stress the unsupervised clustering present in the dataset, thus potentially deemphasizing the specifics of class labeling. In this case, I would say, the PCA and the NS plots complement each other.

Bonus

Noticed the circled points on the plots above? This is one other small trick that I find quite useful, and that does not seem to be widely known. The circled points denote the "boundary": these are the points whose nearest neighbor is of a different class than their own. The more boundary points there are, the more difficult the classification problem is. The boundary is not an absolute notion, because there are various ways to define the distance between points. My suggestion would be to standardize all attributes and use the Euclidean norm, unless you have good reasons to do something else (e.g. you know good weights for the attributes a priori).
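The boundary as described above can be computed with a brute-force nearest-neighbor search. The following is a minimal Python sketch under the suggested defaults (standardized attributes, Euclidean distance); the function names and the toy data are made up for illustration, and constant attributes are simply left at zero rather than divided by a zero standard deviation.

```python
import math


def standardize(X):
    """Scale every attribute to zero mean and unit standard deviation."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    # A constant attribute has zero deviation; divide by 1.0 instead.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def boundary_points(X, C):
    """Indices of the points whose nearest neighbor has a different class."""
    Z = standardize(X)
    boundary = []
    for i, zi in enumerate(Z):
        nearest = min((j for j in range(len(Z)) if j != i),
                      key=lambda j: math.dist(zi, Z[j]))
        if C[nearest] != C[i]:
            boundary.append(i)
    return boundary


# Toy data (hypothetical): two tight clusters, plus one point of class -1
# sitting in the middle of the class 1 cluster.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [5.0, 5.0], [5.0, 6.0], [6.0, 5.0],
     [0.5, 0.5]]
C = [1, 1, 1, -1, -1, -1, -1]
boundary = boundary_points(X, C)
```

Here the intruder and the three points around it end up on the boundary, while the far cluster does not. The search is quadratic in the number of points, which is fine for plotting purposes but would need a spatial index for large datasets.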
Tags: Data analysis, Visualization
Inflight Beverages

Posted by Konstantin 19.02.2009 6 Comments
Everyone who flies reasonably often has a chance to observe a remarkable effect: no matter how steep the roll angle of an airliner is during a turn, the tea in your cup will always stay parallel to the floor, and not to the ground. In the course of one recent unexpectedly long discussion on this topic, it turned out that there is not much easily googleable material out there to refer to. I thought I'd create some, hence a somewhat longer post on an otherwise simple matter of minor relevance to my main affairs.

So, firstly, why doesn't tea orient itself parallel to the ground? To understand that, consider a pair of insightful examples. They should rid you of the incorrect intuitive assumption that it is gravity that plays a defining role in the orientation of your tea.

Example 1: Water in the bucket

Consider the experiment described in the previous post about a water bucket on a string. No matter what you do with the bucket, as long as the string is taut, the water will stay parallel to the bottom of the bucket, and not to the ground. The explanation for that is the following. Water is affected by gravity G. The bucket is affected by gravity G and the tension of the string S. Therefore, from the frame of reference of the bucket, water experiences force G-(G+S) = -S, which pulls it towards the bottom of the bucket.

Example 2: Paragliders

You have probably seen paragliders in the air. If not, just search for "extreme paragliding" on YouTube. Note that whatever the angle of the wing, the person is always hanging perpendicularly to it. Now if you handed him a bottle of water, the water would, naturally, also orient itself parallel to the wing. The reason for that behaviour is the same as in the water bucket experiment, with the lift of the wing playing the same role as the tension of the string.

The main idea of both examples is that as long as the container and the object inside it are equally affected by gravity, it is not gravity that orients the object with respect to the container. More precisely, for the water to incline to the side, there must be some force acting on the container from that side.

The Airplane: A Simple Model
Let us now consider an idealized airplane. According to a popular model, an airplane in flight is affected by four forces: the thrust of the engines T, the drag due to air resistance D, the lift of the wings L and the gravitational weight of the airplane G. The glass of tea inside the airplane is, of course, only affected by gravity G. By applying the same logic as before, we can easily compute that, in the frame of reference of the airplane, the tea experiences the force F = G-(T+D+L+G) = -(T+D+L).

However, once the airplane has attained constant speed (which is true for most of the duration of the flight), its thrust is completely cancelled by the air resistance, i.e. T+D = 0, in which case F = -L. It now remains to note that the lift force of the wings is always approximately perpendicular to the wings and thus to the floor of the airliner. The tea in your cup must therefore indeed be parallel to the floor.
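To make the bookkeeping explicit, here is the same argument written per unit mass, followed by a small worked example. The airplane mass $M$, the gravitational acceleration $g$ (so that $G = Mg$) and the bank angle $\varphi$ are symbols introduced here for illustration, and the example additionally assumes a level turn, which the simple model itself does not require.

$$
a_{\text{plane}} = \frac{T + D + L + G}{M} = g + \frac{T + D + L}{M},
\qquad
g - a_{\text{plane}} = -\frac{T + D + L}{M}
\;\overset{T + D = 0}{=}\; -\frac{L}{M}.
$$

In a level turn the vertical component of the lift balances the weight, so

$$
|L| \cos\varphi = M |g|
\quad\Longrightarrow\quad
\frac{|L|}{M} = \frac{|g|}{\cos\varphi},
\qquad
\varphi = 30^\circ:\ \frac{|L|}{M} \approx 1.15\,|g|.
$$

Under these assumptions the tea at a 30 degree bank is pressed into the cup about 15% harder than on the ground, but still straight towards the floor.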

The Airplane: A More Realistic Model
The model considered in the previous section is somewhat too simplistic. According to it there are no forces acting on the sides of the airplane whatsoever, and it should therefore be absolutely impossible to incline the tea in the cup to the side, which is, of course, not true for a real airplane. So, assuming a pilot would want to incline the tea and spill it, what would be his options?
1. Thrust. As noted above, tea must only be parallel to the floor when thrust is perfectly cancelled by drag and the plane is moving with constant speed. A rapid increase or decrease in thrust could therefore incline the tea towards or against the direction of flight. But of course, there is usually no need for such a maneuver during a passenger flight except for takeoff and landing.
2. Rotation. The model above does not consider the fact that the pilot may use the ailerons and rudder to turn the airplane, rotating it around its axes. And of course, when the resulting rotation is abrupt enough, the tea could incline somewhat. However, during passenger flights the turns are always performed very smoothly, with rotation speeds around 1 degree per second. In fact it is risky, if not impossible, to perform any hasty rotational maneuvers on an airliner travelling at about 800 km/h.
3. Slip. The simple model above considered the case when the air is flowing directly along the main axis of the airplane, which need not necessarily always be the case. The condition may be violated either due to a strong sidewind, or during a peculiar kind of a turn, where the airplane "slips" or "skids" on the side. In both cases, the airflow is exerting pressure on the side of the airplane's hull, which generates the so-called body lift. It is usually incomparably smaller than the lift of the wings, but nonetheless, it can incline the water to the side.
  [Figure: Turn with a skid]
  It is interesting to understand why you should almost never experience slip in an airliner. There are two reasons for that. Firstly, most airplanes have a degree of weathercock stability. Like a weathercock, an airplane with a vertical tail stabilizer tends to automatically orient itself into the direction of airflow and thus avoid slip. This effect is especially strong at the speed of a commercial airliner.
  Secondly, if the weathercock effect is not enough to prevent slip, the pilot himself will always ensure that the slip is never too large by watching the slip-indicator (aka inclinometer) and coordinating the turn.
  Why is that? Because the airplane is not constructed, aerodynamically, to fly sideways. When the plane is moving sideways, the body of the plane blocks airflow over the trailing wing. So the wing loses lift and begins to drop. This naturally will put the airplane into a bank in the direction of the turn, but it does so at great cost in drag. When the slip is too large, the lifting properties of the wings change so drastically that this might put the airplane at risk of a crash.
Summary
Q: Why is the tea in an airliner parallel to the floor even when the airplane is turning?
A: It follows from the following three conditions that must all be satisfied for a safe flight:
1. The forces of thrust and drag cancel each other and the airplane moves at a constant speed.
2. The turns are performed smoothly.
3. There is no slip: the air flows directly along the main axis of the airplane.
Tags: Airplanes, Fun, Physics
The Great Swinging Bucket Conspiracy

Posted by Konstantin 12.02.2009 3 Comments

Most of us probably remember this experiment from high school physics lessons: you take a bucket on a string filled with water, spin it around your head and the water does not spill. "But how?" - you would ask in amazement. And the teacher would explain then:
"You see, the bucket is spinning, and this creates the so-called centrifugal force acting on the water, which cancels out gravity and thus keeps the water in the bucket". And you will have a hard time finding any other explanation. At least I failed no matter how hard I googled it.

Unfortunately, this explanation loses an essential point of the experiment, and I have seen people irreparably brain-damaged by the blind belief that it is only due to rotation and the resulting virtual centrifugal force that the water does not spill.

However, it is not quite the case. Let us imagine that the bucket has accidentally stopped right over your head and as a result, all centrifugal force has been immediately lost. Would the water spill? It will certainly fall down on your head, but it will do so together with the bucket. Thus, technically, the water will stay inside the bucket.

In fact, the proper way to enjoy the true magic of the experiment is not to swing the bucket in full circles, but rather let it swing back and forth as a pendulum (if you have a string and a beverage bottle nearby, you can do an experiment right now). One will then observe that even at the highest points of the swing, where the bottom of the bucket is at its steepest angle and the centrifugal force is nonexistent, the water stays strictly parallel to the bottom of the bucket, as if no gravity acted upon it. Why doesn't it spill? Clearly, the argument of centrifugal force cancelling gravity is inappropriate.

The proper explanation is actually quite simple and much more generic. We have two objects here: the bucket and the water in it. There is one (real) force acting on the water: gravity G. There are two (real) forces acting on the bucket: gravity G and the tension S of the string pulling the bucket perpendicularly to its bottom. (Note that the centrifugal force is not "real" and I do not consider it here, but if you wish, you may. Just remember then that it acts both on the water and the bucket.)
Now the question of interest is, how does the water behave with respect to the bucket? That is, what force "pulls" the water towards the bucket and vice versa? This can be easily computed by subtracting all forces acting on the bucket from all forces acting on the water. And the result is, of course, G - (S+G) = -S, i.e. a force pulling the water directly towards the bottom of the bucket.
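The argument above can also be checked numerically for the pendulum version of the experiment. The Python sketch below, a toy model rather than a careful simulation, integrates an idealized pendulum (point-mass bucket, massless taut string, no friction, all of which are assumptions), estimates the bucket's acceleration from its simulated positions by finite differences, and splits the per-unit-mass force felt by the water in the bucket's frame into a part along the bottom of the bucket and a part pushing towards it. All names and parameter values are illustrative.

```python
import math


def simulate_pendulum(theta0=1.2, length=1.0, g=9.81, dt=1e-4, steps=30000):
    """Semi-implicit Euler integration of a pendulum released from rest.

    Returns the angles (radians from the downward vertical) at every step.
    """
    theta, omega = theta0, 0.0
    thetas = [theta]
    for _ in range(steps):
        omega += -(g / length) * math.sin(theta) * dt
        theta += omega * dt
        thetas.append(theta)
    return thetas


def apparent_force(thetas, k, length=1.0, g=9.81, dt=1e-4):
    """Force per unit mass on the water in the bucket frame at step k.

    Returns (sideways, toward_bottom): the components along the bottom of
    the bucket and perpendicular to it (positive means into the bucket).
    """
    def pos(th):
        return length * math.sin(th), -length * math.cos(th)

    (x0, y0), (x1, y1), (x2, y2) = (pos(thetas[k - 1]), pos(thetas[k]),
                                    pos(thetas[k + 1]))
    # Bucket acceleration from a central finite difference of its position.
    ax = (x2 - 2 * x1 + x0) / dt ** 2
    ay = (y2 - 2 * y1 + y0) / dt ** 2
    # Gravity minus the acceleration of the bucket's frame.
    fx, fy = -ax, -g - ay
    th = thetas[k]
    sideways = fx * math.cos(th) + fy * math.sin(th)
    toward_bottom = fx * math.sin(th) - fy * math.cos(th)
    return sideways, toward_bottom


thetas = simulate_pendulum()
forces = [apparent_force(thetas, k) for k in range(1, len(thetas) - 1)]
max_sideways = max(abs(s) for s, _ in forces)
min_toward_bottom = min(b for _, b in forces)
```

In this model `max_sideways` stays at the level of numerical noise while `min_toward_bottom` remains positive throughout the swing, including at the turning points where the bucket is momentarily at rest. That is the numerical counterpart of the water staying parallel to the bottom of the bucket.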

A magical consequence of this argument is that gravity does not matter inside the bucket, as long as it can act on the bucket freely in the same way as on anything inside it. Nothing special about rotation here, really. It takes a while to realize.

Tags: Conspiracy, Experiment, Fun, Physics