• Bayesian Statistics in Layman’s Terms

Posted by Konstantin 09.01.2016 No Comments

This is a (slightly updated) repost of my quora answer to the corresponding question.

There are many ways in which smart people tend to explain Bayesian statistics and contrast it with a "non-Bayesian" one. One usually highlights that the primary concept of a Bayesian approach is the the desire to model everything as a probability distribution. Once this is fact is clear, many smart people would proceed to claim that this is, in fact, what fundamentally sets Bayesian statistics aside from the "classical" one. However, I feel that this kind of explanation is somewhat incomplete. It is not like classical statisticians do not use complete probability distributions. The difference is in general somewhat more subtle and philosophical.

Consider the question "what is your height?". For a classical statistician there exists some abstract "true answer", say "180cm", which is a fixed number - your one and only height. The problem is, of course, you do not know this number because every measurement is slightly different, so the classical statistician will add that "there is a normally-distributed measurement error". In the world of a pure Bayesian there are almost no "fixed numbers" - everything is a probability distribution, and so is your height! That is, a Bayesian should say that "your height is a Normal distribution centered around 180cm".

Note that from the mathematical perspective there is no difference between the two representations - in both cases the number 180cm is mentioned, and the normal distribution. However, from a philosophical, syntactical, methodological and "mental" perspectives this tends to have serious implications, and there has been historically a kind of an ongoing intellectual feud between the statisticians who lend more towards the first or the second approach (it is somewhat resemblant of how there is a divide among the physicists with regard to their support of the Copenhagen interpretation of quantum mechanics).

One of the implications of denying the fact that things in the world are mostly fixed (and are all pure distributions instead) is that you may not use many of the common sense inference methods directly. What is my height if I stand on a chair? "Well, it is your height plus the height of a chair", a classical statistician would say. He can keep in mind the measurement errors, if necessary, but those could be dealt with later. In the Bayesian world heights are not numbers, so the procedure of adding heights implies convoluting two distributions to get the resulting distribution. If both distributions are Gaussian, the result will match that of the "common sense", but note that now the common sense somehow became "just one special case". Moreover, a Bayesian might even keep the possibility that "your height and the height of the chair are dependent" in the back of his mind, just in case. Because when you speak about two numbers in the Bayesian world, you must immediately start thinking about their joint distribution.

On the other hand, modeling everything in probabilities lets you use probability theory inference methods (Bayes rule, convolutions, marginalizations, etc) everywhere, without the need to differentiate between "fixed numbers" and "random measurement errors" and this adds peace of mind as well as tends to make your explanations clearer. A Bayesian confidence interval, for example, is a "fixed interval such that 95% of height measurements fall into it". A classical confidence interval, on the other hand, is "a random interval such that the true height may fall into it with 95% probability". Again, mathematically and numerically those may often be the same, but think how different the two explanations are.

Bayesian "thinking" tends to be more flexible for complex models. Many classical statistics models would stick to fixed parameters, point or "interval" inferences, and try to "hide" the complexity of probability distributions as much as possible. As a result, reasoning about a system with many highly interconnected concepts becomes flawed. Consider a sequence of three questions:

• What the height of this truck?
• Will it fit under this 3m bridge?
• Do we need pick another route?

In the "classical" mindset you would tend to give fixed answers to the questions.

• "Height of the truck is 297".
• "Yes, 297<300, hence it will fit".
• "No, we do not need".

Sometimes you may be more careful and work with confidence intervals, but it still feels unwieldy:

• "The confidence interval on the height of the truck is 290..310"
• ".. aahm, it might not fit..."
• "let's pick another route, just in case"

Note, if a followup question appears that depends on the previous inferences (e.g. "do we need to remodel the truck") answering it becomes even harder because the true uncertainty is "lost" in the intermediate steps. Such problems are never present if you are disciplined as a Bayesian. Note the answers:

• "The height of the truck is a normal distribution N(297, 10)"
• "It will fit under the bridge with probability 60%"
• "We need another route with probability 40%"

At any point is information about the uncertainty is preserved in the distributions and you are free to combine it further, or apply a decision-theoretic utility model. This makes Bayesian networks possible, for example.

It is interesting to see how this largely philosophical preference leads to two completely different (albeit complementary) sets of techniques. Indeed, if you are a true classical statistician, your work revolves around parameterized probability distributions. You write them down like $P_\alpha(x)$, where $x$ is the "truly random" value from some probability space, and $\alpha$ is the "fixed but unknown" parameter. Your whole "school of thought" is now focused on clever ad-hoc techniques for computing estimates of this fixed parameter from the provided distribution.

For a pure Bayesian, however, there is no "fixed" $\alpha$ that has to be treated somehow separately. Instead, $\alpha$ is also a part of some probability space, and instead of writing $P_\alpha(x)$ he would safely write $P(x| \alpha)$, $P(\alpha | x)$, or $P(x, \alpha)$. As a result, the probability distribution he works with are not parameterized any more, and all of the clever techniques that the classical statisticians have invented over the centuries for estimating parameters become seemingly useless. At this point a classical statistician puts his hands down and goes home, as there is nothing to do for him - there are no "unknowns". The Bayesian is, however, left to struggle with mathematically trivial, yet computationally incredibly heavy methods for extracting essentially the same values that the classical statistician could have obtained using his "parameter estimation" approaches. That's why the Bayesian "school of thought" is mostly focused on computationally-efficient methods for marginalization and sampling.

In reality, of course, a Bayesian would quite often give up and "cheat", at least partially parameterizing his models and making use of the classical estimation methods, while a "classical" statistician might happen to write $P(x|\alpha)$ and apply the Bayes rule here and there, whenever it seems appropriate. A number of computations derived from the two theoretical backgrounds end up exactly the same.

Thus, in practice, labeling things as "Bayesian" or "non-Bayesian" is still largely a philosophical choice. For example, there are methods in machine learning, ensemble learners, that are somewhy never labeled/marketed as being "Bayesian" nor were they probably invented by someone "Bayesian", although at their core those would be among the best examples of where a Bayesian approach is different from a classical one. Those are also among the best performant models quite often, by the way.

• When the Best is not the Best

Posted by Konstantin 04.01.2016 5 Comments

Collecting large amounts of data and then using it to "teach" computers to automatically recognize patterns is pretty much standard practice nowadays. It seems that, given enough data and the right methods, computers can get quite precise at detecting or predicting nearly anything, whether it is face recognition, fraud detection or movie recommendations.

Whenever a new classification system is created, it is taken for granted that the system should be as precise as possible. Of course, classifiers that never make mistakes are rare, but if it possible, we should strive to have them make as few mistakes as possible, right? Here is a fun example, where things are not as obvious.

Consider a bank, which, as is normal for a bank, makes money by giving loans to its customers. Of course, there is always a risk that a customer will default (i.e. not repay the loan). To account for that, the bank has a risk scoring system which, for a given loan application, assesses the probability that the corresponding customer may default. This probability is later used to compute the interest rate offered for the customer. To simplify a bit, the issued interest on a loan might be computed as the sum of customer's predicted default risk probability and a fixed profit margin. For example, if a customer is expected to default with probability 10% and the bank wants 5% profit on its loans on average, the loan might be issued at slightly above 15% interest. This would cover both the expected losses due to non-repayments as well as the profit margin.

Now, suppose the bank managed to develop a perfect scoring algorithm. That is, each application gets a rating of either having 0% or 100% risk. Suppose as well that within a month the bank processes 1000 applications, half of which are predicted to be perfectly good, and half - perfectly bad. This means that 500 loans get issued with a 5% interest rate, while 500 do not get issued at all.

Think what would happen, if the system would not do such a great job and confused 50 of the bad applications with the good ones? In this case 450 applications would be classified as "100%" risk, while 550 would be assigned a risk score of "9.1%" (we still require the system to provide valid risk probability estimates). In this case the bank would issue a total of 550 loans at 15%. Of course, 50 of those would not get repaid, yet this loss would be covered from the increased interest paid by the honest lenders. The financial returns are thus exactly the same as with the perfect classifier. However, the bank now has more clients. More applications were signed, and more contract fees were received.

True, the clients might be a bit less happy for getting a higher interest rate, but assuming they were ready to pay it anyway, the bank does not care. In fact, the bank would be more than happy to segment its customers by offering higher interest rates to low-risk customers anyway. It cannot do it openly, though. The established practices usually constrain banks to make use of "reasonable" scorecards and offer better interest rates to low-risk customers.

Hence, at least in this particular example, a "worse" classifier is in fact better for business. Perfect precision is not really the ultimately desired feature. Instead, the system is much more useful when it provides a relevant and "smooth" distribution of predicted risk scores, making sure the scores themselves are decently precise estimates for the probability of a default.

• What is Steganography

Posted by Konstantin 18.09.2015 No Comments

Steganography is a marvelous subject, which could be considered a subfield of cryptography. Unfortunately, it is some-why absolutely underrated nowadays. It is not covered by the standard computer science curricula (although it would fit perfectly both within lectures on Cryptography as well as Signal Processing), and way too many people believe steganography is something akin to adding grey stamps to images in Photoshop. Even the almighty Wikipedia, in its corresponding article, does not provide a decently clear view of the field, in my opinion. To fix this I thought I'd give a try at making up a short explanation myself.

As mentioned, steganography is somewhat of a fellow discipline to cryptography. While the goal of cryptography is to protect the communication channel between Alice and Bob from some hypothetical evil Eve, the goal of steganography is to conceal the fact of communication itself. The classical setting is described as follows: Alice and Bob are kept imprisoned by the evil warden Walter. They are allowed to talk to each other, but Walter is listening to everything. How could Alice and Bob arrange a prison break without raising suspicion?

Steganographic problems

More specifically, steganographic problems can be divided into steganography against a passive warden, and steganography against an active warden.

Passive warden steganography

In this case Walter can only observe Alice and Bob's communication, but may not interfere. Walter's task is to be able to detect which of the messages between Alice and Bob contain secret meanings (if they do). If Walter comes up with a method by which he can distinguish innocent messages from the ones with a secret meaning with nontrivial probability, we shall say that Walter has succeeded in a "passive attack" on Alice and Bob's steganographic system.

Besides helping imaginary prisoners hide from imaginary wardens, passive steganography has many practical applications. For example, some printers add a barely visible watermark to each printed page, which identifies the printer model and the date. Radio and TV advertisers sometimes include in their clips a hidden identifier of the product being advertised (which simplifies automated airtime monitoring and validation). Storage of image metainfo within the image itself (e.g. storing the patient id within its X-ray picture) may prevent confusion when data is transferred or converted, etc.

It must be noted, that from the perspective of classical cryptography the methods mentioned above are not very interesting. Indeed, it is customary to assume in cryptography that all the details of the protocol used between Alice and Bob are open to all parties, including Walter. The only thing that Walter should not know is the "key". Without knowing this key Walter should be incapable of attacking the system. This is not the case in the examples with the printers and advertisements. No keys are involved and anyone who knows the algorithm can immediately discover the hidden message (i.e. attack the system).

The methods for designing such systems focus primarily on techniques for adding information to a known audio/video/textual signal without distorting it in a perceivable manner and, possibly, making the presence of this information resilient to transformations that the signal may pass (video streaming, file format conversion, text normalization, etc). Consequently, the complexity of many such methods lies mostly in the area of signal processing. In the case of audio, for example, it could mean finding sound frequency components that, on one hand, will be left unchanged by most audio codecs, yet on the other hand could be modified without the listener perceiving the change. In some sense it resembles a fight between the codec (which attempts to remove any redundant channels) and the steganographic system (which aims to make use of any redundancy remaining).

Active warden steganography

A more interesting situation is where Walter is not only listening in to Alice and Bob's communications, but may also actively interfere, removing any potential hidden meanings. The best example here is copyright protection. They say that movies, when shown in cinemas, would usually contain a steganographic watermark in their video signal, which identifies the particular movie theatre. The "warden" here is a movie pirate, who might know about the existence of the watermark, and would be interested in removing it (without corrupting the movie itself, of course). Consequently, the watermark must be included in a way, where Walter the pirate is incapable of removing it, even if he knew all the details of the watermarking algorithm (except for a secret key, of course). Besides movie watermarking, similar techniques could be used by other authors to "sign" their creations so that their authorship could later be established, when necessary.

Methods

Multiple methods exist for solving both the active and passive warden steganography. Not being able to list all of them, I will mention just the two core concepts, which I think are enough for an introductory overview.

LSB-steganography

The most well-known example of steganography - least-significant-bit (LSB) steganography - is applicable to analog data, such as music and images. The idea is that small changes in some values (pixels, frequency components) of an analog signal would not affect the perception of the signal in a notable way. For example, by tuning some pixels in an image one could transmit several bits of additional information. Of course, the choice of those pixels should be determined using a shared secret key, otherwise Walter could easily detect or remove the covert message.

The problem in the most naive version of this approach is that not every pixel in an image can be safely modified. For example, in regions filled with a single color any such change will be easy to spot. This can be overcome by only picking the pixels with enough variability in their neighborhood. Another weakness of the method is that the steganogram may be easy to remove by resizing or cropping the image. This can be prevented by hiding information in the spectral components rather than pixels - those are more resistant to the usual transformations. In fact, if we are dealing with a photograph or anything else that will be saved as a JPEG file, this approach would be quite nontrivial to break. The same ideas are used for steganographic processing of audio and video signals as well.

Equivalence-class steganography

Quite often Alice has a choice for multiple options for sending the same message. For example, when writing a text message to Bob, she can choose the phrasings. She can write "Hi" instead of "Hello", "Robert" instead of "Bob", punctuation can be used in multiple ways, and so on. When Alice is writing a novel, she is free to choose the names of the characters and places, chapter titles and even the order of events. Knowing the possible equivalence classes for the messages to be sent (e.g. words or sentences) as well as a common key, Alice may easily choose her words so that only Bob would be able to grasp whether there is a secret meaning and, if so, what it is.

A simple practical example of a coding algorithm is the following: Alice will code a single bit by each sentence of her text. For obtaining this bit, Bob will have to apply a hash function to the sentence plus the secret key and use the last bit of that hash. During coding Alice will simply have to choose words so that each sentence would correspond to a correct bit.

In its bare form the method is easy to attack - Walter may try to rephrase the text himself on transmission. This, in turn, can be overcome if Alice would hide the bits only into particularly chosen words or word sets (which can only be found by knowing the key), and do it using redundant coding (thus Walter would need to significantly alter the text if he wanted to be sure the message is removed). If Alice hides the message into the names of the characters of her novel, Walter may have no chances at all.

Methods like that are claimed to be used by large software companies to watermark their code (here, instead of choosing synonymous words a specific algorithm will be choosing synonymous bytecode instructions). This makes it possible to track the particular license which was used to leak a pirated version. Another example is geographic mapping software, which may use various equivalent ways of displaying a fractal shore in order to watermark the image.

As noted, other approaches and techniques exist (I did not find the space to properly mention public key steganography here, for example), yet I belive that LSB and EC-steganography largely illustrate the core ideas behind modern steganography in general as well as its possibilities and limitations.

• Data Engineering

Posted by Konstantin 07.09.2015 2 Comments

Research and engineering always go hand in hand. Their relationship is so tight that often people cease to see the difference. Indeed, both are development activities that bring along technical advances. Many researchers work in engineering companies, many engineers do what is essentially research, is there even a need to see the difference? Quite often there is. Let me highlight this difference.

Research (sometimes synonymous with "science") is, ideally, an experimental activity, which aims to explore some space of possibilities in order to hopefully come up with a certain "understanding" of what those possibilities are or how they should be used. Researchers primarily deliver written reports, summarizing their findings. There is absolutely no guarantee that those findings would be "interesting" or "useful" to any degree - indeed, a research project by definition starts off in a position of uncertainty. As a result, for practical, commercial or investment purposes, research is always a risky activity. It may take a long time, it will produce mostly text, and there is no guarantee that the results will be of use. The upside of this risk is, of course, the chance to stumble upon unique innovation, which may eventually bring considerable benefits. Later.

Engineering is a construction activity, where existing knowledge and tools are put together to deliver actual products. In order for this process to be successful, experimentation and uncertainty must be, ideally, brought down to a minimum. Thus, when compared to pure research, engineering projects are low risk endeavors, where the expected outputs are known in advance.

In simple terms - researchers answer questions whilst engineers build things, and by this definition those occupations are, obviously, very different. This difference is often apparent in material fields, such as construction or electronics, where the engineers can be distinguished from the researchers by the kind of tools they would mostly hold in their hands and the places they would spend their time the most. With computational fields this visual difference does not exist - both engineers and researchers spend most of their time behind a computer screen. The tools and approaches are still different, but you won't spot this unless you are in the field. I did not know the difference between Software Engineering and Computer Science until I had the chance to try both.

Things become even more confusing when data mining gets involved. The popularity of projects, which focus on building data-driven intelligent systems, is ever growing. As a result, more and more companies seem to be eager to embrace this magical world by hiring "data scientists" to do "data science" for them. The irony of the situation is that most of those companies are engineering businesses (e.g. software developer firms) and, as such, they would not (or at least should not) normally consider hiring anyone with the word "scientist" in the job title. Because scientists are not too famous for bringing stable income, engineers are.

The term "data science" is a vague one, but I think it is quite succinct in that it captures the exploratory aspect that is inherent in general-purpose data analysis, as well as the current state of the art in the field. Although there are some good high level tools for a wide range of "simple" machine learning tasks nowadays, as soon as you want to try something more exotic, you are often on your own, faced with uncertainty and the need to experiment before you can even try to "build" anything. For a typical engineering company such uncertainty is not a good thing to rely upon.

It does not mean that one cannot engineer data-driven systems nowadays, it means that in reality most of the companies, whether they know it or not, need a very particular kind of "data scientists". They need specialists with a good knowledge of simple reliable tools and are capable of applying them to various data formats. Those, who would perhaps avoid excessive experimentation in favor of simple, scalable, working solutions, even if those are somehow simplistic, suboptimal and do not employ custom-designed forty-layer convolutional networks with inception blocks, which require several months to debug and train. Those, who might not know much about concentration inequalities but would be fluent in data warehousing and streaming. There's actually a name for such people: "Data engineers".

There is nothing novel about the use of such terminology, yet I still regularly encounter way too much misunderstanding around this topic over and over again.

In particular, I would expect way more of the "Data engineering" curricula to appear at universities alongside the more or less standard "Data science" ones. The difference between the two would be slight, but notable - pretty much of the same order as the difference between our "Computer science" and "Software engineering" master's programmes, for example.

• Holography in Simple Terms

Posted by Konstantin 24.07.2015 No Comments

Holography is a fascinating way of storing and displaying visual information - one which will, hopefully, one day become as commonplace as photography and video is today. Unfortunately, at the current moment it is still a largely exotic trade, limited to artistic exhibitions or science museum displays. While the explanation of how photography works is usually part of the standard school curriculum, holography, at the same time, is often regarded as some mysterious sci-fi technology. The problem is made worse by the fact that there do not seem to be too many explanations around in the Internet, which would aim to properly dispel the aura of mystery. Here is my take at fixing that problem.

Let us start by considering the following illustration, where two hypothetical eyes are observing two hypothetical light sources.

The light from the two lamps reaches the eyes at different angles and intensities, which depend on the positions of the eyes relative to the lights. Because of that difference in angles and intensities, the brain, attached to the eyes, is capable of "triangulating" the location of the lamps, thus forming an enjoyable "three-dimensional" perception of the whole scene for itself.

Let us position a piece of glass in front of the light sources. We can regard this glass as a set of "pixels", through which light rays fly from the lamps towards the eyes. The figure below illustrates three such randomly selected "pixels" on the glass along with the directions of rays that leave them.

What if we could design a piece of glass which would fully consist of such hypothetical "pixels", with each pixel emitting two light rays in the appropriate directions? The eyes, observing such a piece of glass, would not know that the rays are really produced "in the pixels". Indeed, they perceive exactly the same light rays as if the light was coming from two lamps, positioned somewhere behind the glass. Hence, the brain, attached to the eyes, has to perform the same triangulation as the one it did when the lamps were real, and is not be able to distinguish the "fake glass" from a real scene with the lamps.

Although this idea of faking reality might look nice, the task of manufacturing small enough "pixels", which would be able to emit light in a number of given directions with given intensities and colors, is, unfortunately, technologically too complex to be practical.

This is where Physics comes to help us out. "Hey, guys," - she says, - "you are forgetting that light is not really a set of rays flying from from point A to point B along straight lines. Light is a wave. Hence, assuming that your light sources are emitting coherent light of a given frequency, a correct illustration of what happens in your scene with two lamps would be the following:"

If we now consider an arbitrary point-"pixel" on our glass, the electromagnetic waves which pass through it are perceived there in the form of some oscillation. The whole situation above could be modeled by considering waves on a water surface. To simulate the "glass" (the blue line) let us have a horizontal row of buoys float on the surface. Each of the buoys will be moving up and down in its place due to the waves, and we can record this movement.

Here comes a question: what if we remove the original two oscillation sources and instead will start "replaying" the recorded movements of the buoys buy forcing them to move up and down and thus form their own waves? The example below simulates this situation using 19 "buoys":

Does not look nice at all, does it? Now Mathematics comes along to help: "Try harder!" - she says. "If your buoys are dense enough, everything will work out!". Hence, we increase the number of buoys and try again:

Indeed, now we observe how a row of oscillating buoys (an analogue of a "glass" in our model) is generating a wave in front of it, which is completely identical to the one that was produced when two sources were oscillating behind it. Now matter where we position ourselves in front of the buoy row, our perception will happily fool us into thinking that the wave we see is really a result of two point sources oscillating somewhere in a distance. Our brain will even help us to estimate the exact position of those sources somewhere behind the row. He does not know he is being fooled. Try closing the upper half of the animation above to get a better understanding of this illusion.

Hence, if we can make a glass, which consists of "pixels" each of which could "record" a profile of a light wave and then be able to reproduce it, the "image" observed by looking at such glass would be perfectly identical to an image that would be observed if the glass would simply pass through it the light from some sources located behind it. But is it possible? Isn't asking each "pixel" to record an arbitrary oscillation too much?

Here is when Mathematics comes along to help again. "Do you want to see a magic trick?" - she asks. "Yes!" - we answer.

Remember that you have two light sources emitting waves behind your piece of glass. As we agreed, they are emitting coherent light, i.e. light of the same frequency (for example, you could shine a laser light on the two points and have them reflect this light). Now let us see what happens at one of the pixels-"buoys" of your glass.

This point receives light waves from both sources. As the distances to the two sources are different, the two light waves reach the pixel with different intensities and phases. Formally, suppose the wave from the first source reaches the pixel with amplitude $a_1$ and phase $b_1$ and the wave from the second source reaches the pixel with amplitude $a_2$ and phase $b_2$. The two waves add up, so the pixel "feels" the overall oscillation of the form

$a_1 \sin(ft + b_1) + a_2 \sin(ft + b_2).$

Now if you have not two, but three, four, ten, or a million of point sources behind the glass, the arriving signal will be, correspondingly, a sum of three, four, ten, or even a million sinewaves with different amplitudes and phases.

Now here comes the trick, hold your breath: Sum of any number of such sinewaves is itself just a single sinewave with some amplitude and phase.

In other words, even if there is a million point sources emitting coherent light from behind the glass, each "pixel-buoy" of our glass will still perform a simple oscillation of the form $A \sin(ft + B)$. Hence, to "record" the oscillation of each "pixel-buoy" we must store only two parameters: its intensity and phase.

Now everything starts to look much simpler. Storing light intensity of a pixel is simple - any photograph does that. Dark spots on a photo or a screen emit (or reflect) low intensity light, white spots emit high intensity. It only remains to store the phase. Several methods have been proposed for that. Conceptually the simplest one (but technologically the most advanced) makes use of particular crystals which can shift the phase of light that passes through them. Each "pixel" could then be made out of such a crystal with an appropriate phase shift coded in it. When a "glass" made of such crystals is lit from the back by a coherent light source, each pixel would introduce the necessary phase shift into the light which passes through it, and thus reproduce the necessary wave front. I presume that this HoloVisio project is relying on this idea to design a proper holographic monitor.

The conventional holograms, the ones you typically see in science museums, are made using a simpler principle, though. Omitting some technical details, the idea behind it is the following. Let us add to the scene a reference ray of coherent light.

Now at some pixels on our "glass" the phase of the reference ray will coincide with the phase of the light coming from the scene. The overall wave oscillation at those points will be increased. At other points the phases of the reference and the scene light will mismatch and the signal will be weakened. We can record the resulting interference pattern by storing light spots on the glass at the pixels where the phases matched and dark spots where the phases did not match.

Now, whenever we shine the reference ray again through the resulting plate, only the parts of the ray with the "correct" phase will be let through, as those are exactly the places where the glass is transparent to the reference ray. As a result, you'll obtain a decent holographic reproduction. Here's a figure from Wikipedia, illustrating this principle of recording and reproducing holograms:

Considering the speed at which our photocameras have been transitioning over the recent decades from megapixel resolutions into gigapixels and observing the fast spread of stereoscopic "3D" from movie theaters into TV-sets and portable devices, it is both surprising and sad to see no notable signs of proper holographic display technology being developed in parallel. I do hope to see mass-produced holographic displays along with appropriate software within my own lifetime, though.

• Face Recognition

Posted by Konstantin 18.06.2015 No Comments

The developments of proper GPU-based implementations of neural network training methods in the recent years have lead to a steady growth of exciting practical examples of their potential. Among others, the topic of face recognition (not to be confused with face detection) is on the steady rise. Some 5 years ago or so, decent face recognition tools were limited to Google Picasa and Facebook, some research labs and a few commercial products, often branded with the word "Biometrics" (that somehow seems to grow out of fashion nowadays).

Today things are different. Given a decently large labeled dataset, some knowledge of machine learning, familiarity with the GPU-based "deep learning" tools, and sufficient perseverence, one can design a reasonably impressive face recognizer system in a matter of several weeks or months at most. Consequently, now is the time when we can see recognition-based startups receive millions in funding. Just some months ago Microsoft came out with their own public face detection and recognition API (which got its fair share of publicity in the form of a funny age-guessing tool).

The Hype Cycle

Hence, the growth in popularity and use of face recognition is apparent. Given that the initially overinflated hype around the whole deep learning buzzword seems to have more-or-less settled down to reality, this time the growth is realistic. We are probably on the "enlightenment" segment of the hype cycle here, very close to reaching actual productivity (which is not without issues, though).

Tambet and Ardi of our institute's neuroscience research group seemed to have caught the noospheric trend and are working on a neural-network based face recognizer with the initial aim of using it to sort and organize historical photos from the Estonian National Archives.

During a Skype University Hackathon, which happened in April I had the chance to join forces with them to present their efforts in the form of a fun public web app. The idea was to let people search for similar faces in the Estonian archive photos. The resulting site was called "teisik.ee" ("doppelganger" in Estonian). Although it does not seem to exactly fulfil the "doppelganger finding" purpose, it does manage to identify persons known to the system from the database surprisingly well.

Output from teisik.ee

Having observed that finding matches to celebrities, even if they are not perfect matches, is entertaining in its own way, we also managed to put up a second version of the service (all within that same weekend!), codenamed celebritymatch.me. This app lets you search the dataset of celebrity faces for those which are apparently similar (according to the opinion of our neural network, at least). Try it, it is not perfect, but rather fun.

The implementation of the recognition system is rather straightforward for anyone who knows what a convolutional network is and otherwise pretty impossible to grasp in full, hence I won't go into much technical detail. It is implemented using Caffe and consists of three consequtive convolutional layers (with ReLU and max-pooling), followed by a fully-connected hidden layer, which is then fully-connected to a softmax classification output layer. The outputs of the penultimate layer (of size 64) are used as the feature representation of the face. Those feature representations are then used to find Euclidean-distance-wise nearest neighbors in the database of faces. The future plans are to later apply the probably smarter FaceNet approach to network training for the same use case. The webapp is done using Flask.

Right after the hackathon Tambet was invited to give an interview about the project on Estonian television. If you understand Estonian, check it out, it is very good.

• Overwhelming Statistical Evidence

Posted by Konstantin 21.05.2015 No Comments

Here's a curious quote from Alan Turing's famous paper from 1950:

Makes you appreciate how seriously one person's wishful thinking, coupled with dedication and publicity skills, may sometimes affect the scientific world.

Posted by Konstantin 21.04.2015 No Comments

Over the recent years I happen to have made several small personal projects using Python's Flask web framework. The framework is designed to provide a very minimalistic "bottom-up" approach. It feels slightly less cluttered and imposing than some of the popular alternatives, thus fitting nicely for the projects a single person might typically want to hack up in a spare weekend. Minimalism of Flask does not mean it is somehow limited or unsuitable for larger projects - perhaps on the contrary, small size of the framework means there are fewer restrictions on what and how you can do with it.

What a small framework needs to be applied comfortably beyond its 6-line "Hello-World" use case, however, is a decent starter project template that would include some of the most common bells and whistles. And indeed, there happen to be several such templates available. I used parts of them over time in my own projects, however I always ended redoing/rewriting/renaming bits and pieces to fit my personal aesthetic needs. Eventually I got tired of renaming and packaged a Flask application template in the way I consistently prefer to use it. I am not sure whether it is objectively better or worse than the alternatives. None the less, if at some point you decide to give Flask a try, let me suggest you try this template of mine as your point of origin.

• Two Bananas Make a Human

Posted by Konstantin 12.04.2015 No Comments

There is a popular claim that "human DNA is 50% similar to the DNA of a banana", which is often cited in the context of "science popularization" as well as in the various "OMG did you know that" articles. It sounds funny, scientific and "plausible", hence I've seen many smart people repeat it, perhaps as a joke, without giving it too much thought. It is worth giving a thought, though.

Note that depending on how you phrase the statement, it may imply different things, some of them could be more, and some less exciting. The following examples are completely different in their meaning:

1. If you change 50% of human DNA you can obtain the DNA of a banana.
2. 50% of human DNA nucleotides are present in the DNA of a banana.
3. 50% of human's DNA regions have approximate matches in the banana DNA (or vice versa, which would be a different statement)
4. 50% of human genes have orthologs among banana genes (or vice versa).

The first one is obviously false, due to the fact that the total length of the human DNA is about 10 times that of the banana. You could include the whole banana sequence verbatim into a human genome, and it would only make 10% of it. The second one is also false, because, strictly speaking, not 50% but all 100% of human DNA nucleotides are also present in the DNA of a banana. Indeed, any two organisms on Earth have their DNA composed as a sequence of exactly the same four nucleotides. Moreover, if you start comparing random positions of two random DNA's, you will get a match once every four attempts by pure chance. There's a 25% basepair-wise similarity of any DNA to a random sequence!

The last two (or four, if you include the "reversed" versions) claims are more informative. In fact, claim #4 is probably the most interesting and is what could be meant if the presumed "50% similarity" was indeed ever found. Given the wide availability of genomic data, this claim be checked to some extent by anyone with a computer and a desire to finally make sure, how much of a banana he is, after all. Let us do it.

What proportion of human genes could be very similar to banana genes?

Although there is a lot of data about gene orthology among various organisms, as far as humans and bananas are concerned, I could not find any proper precomputed alignments. Creating a full-genome alignment for two organisms from scratch is a procedure way too tedious for a single Sunday's evening, so let us try a simplified measurement approach instead. Consider all possible 20-nucleotide reads taken from the gene-associated regions in the reference human genome. We shall say that a human gene "is somewhat bananas" if at least 5% of its 20-bp reads match somewhere on the banana genome. Given this definition, we shall ask: what proportion of the known human genes "are somewhat bananas"?

At this point some passionate readers could argue that, for example, 20-nucleotide reads are not long/short enough for the purposes of this question, or that the cutoff of 5% is too low or too high, or that approximate matching should be used instead along with some proper string alignment techniques, etc. As noted, we shall leave those aspects to dedicated researchers in the field of human bananology.

The computation took about an hour to run and came back with the following conclusion:

Only 33 out of the 24624 human genes (of at least 1000bp in length) are "somewhat bananas".

In other words, no, you are not "50% similar" to a banana according to any simple definition of such similarity. Not even 1% similar! Of course, there could still be means to torture the data and squeeze the "50%" value out, but those must be some quite nontrivial means and the interpretation of the resulting similarity would be far from straightforward.

• My Favourite Statistical Method

Posted by Konstantin 05.04.2015 No Comments

When it comes to data analysis, there are hundreds of exciting approaches: simple summary statistics and hypothesis tests, various clustering methods, linear and nonlinear regression or classification techniques, neural networks of various types and depths, decision rules and frequent itemsets, feature extractors and dimension reductors, ensemble methods, bayesian approaches and graphical models, logic-based approaches and fuzzy stuff, ant colonies, genetic algorithms and other optimization methods, monte-carlo algorithms, sampling and density estimation, logic-based and graph methods. Don't even get me started on the numerous visualization techniques.

This sheer number of options is, however, both a blessing and a curse at the same time. In many practical situations just having those methods at your disposal may pose more problems than solutions. First you need to pick one of the approaches that might possibly fit your purpose. Then you will try to adapt it appropriately, spend several iterations torturing the data only to obtain very dubious first results, come to the conclusion that most probably you are doing something wrong, reconvince yourself that you need to try harder in that direction, spend some more iterations testing various parameter settings. Nothing works as you want it to, so you start everything from scratch with another method to find yourself obtaining new, even more dubious results, torturing the data even further, getting tired of that and finally settling on something "intermediately decent", which "probably makes sense", although you are not so sure any more and feel frustrated.

I guess life of a statistician was probably way simpler back in the days when you could run a couple of t-tests, or an F-test from a linear regression and call it a day. In fact, it seems that many experimental (e.g. wetlab) scientists still live in that kind of world, when it comes to analyzing their experimental results. The world of T-tests is cozy and safe. They don't get you frustrated. Unfortunately, t-tests can feel ad-hockish, because they force you to believe that something "is normally distributed". Also, in practice, they are mainly used to confirm the obvious rather than discover something new from the data. A simple scatterplot will most often be better than a t-test as an analysis method. Hence, I am not a big fan of T-tests. However, I do have my own favourite statistical method, which always feels cozy and safe, and never gets me frustrated. I tend to apply it whenever I see a chance. It is the Fisher exact test in the particular context of feature selection.

My appreciation of it stems from my background in bioinformatics and some experience with motif detection in particular. Suppose you have measured the DNA sequences for a bunch of genes. What can you do to learn something new about the sequence structure from that data? One of your best bets is to first group your sequences according to some known criteria. Suppose you know from previous experiments that some of the genes are cancer-related whereas others are not. As soon as you have specified those groups, you can start making observations like the following: "It seems that 10 out of my 20 cancer-related genes have the subsequence GATGAG in their DNA code. The same sequence is present in only 5 out of 100 non-cancer-related ones. How probable would it be to obtain similar counts of GATGAG, if the two groups were picked randomly?" If the probability to get those counts at random is very low, then obviously there is something fishy about GATGAG and cancer - perhaps they are related. To compute this probability you will need to use the hypergeometric distribution, and the resulting test (i.e. the question "how probable is this situation in a random split?") is known as the Fishers' exact test.

This simple logic (with a small addition of a multiple testing correction on top) has worked wonders for finding actually important short sequences on the DNA. Of course it is not limited to sequence search. One of our research group's most popular web tools uses the same approach to discover functional annotations, that are "significantly overrepresented" in a given group of genes. The same approach can be used to construct decision trees, and in pretty much any other "supervised learning" situation, where you have groups of objects and want to find binary features of those objects, associated with the groups.

Although in general the Fisher test is just one particular measure of association, it is, as I noted above, rather "cozy and comfortable". It does not force me to make any weird assumptions, there is no "ad-hoc" aspect to it, it is simple to compute and, most importantly, in my experience it nearly always produces "relevant" results.

Words overrepresented in the speeches of Greece MPs

A week ago me, Ilya and Alex happened to take part in a small data analysis hackathon, dedicated to the analysis of speech transcripts from the European Parliament. Somewhat analogously to DNA sequences, speeches can be grouped in various ways: you can group them by the speaker who gave them, by country, gender or political party of that speaker, by the month or year when the speech was given or by any combination of such groupings. The obvious "features" of a speech are words, which can be either present or not present in it. Once you view the problem this way the task of finding group-specific words becomes self-evident and the Fisher test is the natural solution to it. We implemented this idea and extracted "country-specific" and "time-specific" words from the speeches (other options were left out due to time constraints). As is usual the case with my favourite method, the obtained results look relevant, informative and, when shown in the form of a word cloud, fun. Check them out.

The complete source code of the analysis scripts and the visualization application is available on Github.

February 2016
M T W T F S S
« Jan
1234567
891011121314
15161718192021
22232425262728
29