• Posted by Konstantin 11.11.2016 2 Comments

    A question on Quora reminded me that I wanted to post this explanation here every time I got a chance to teach SVMs and Kernel methods, but I never found the time. The post expects basic knowledge of those topics from the reader.

    Introductory Background

    The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, they have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

    The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, x^T y, and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

        \begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}

    Analogously, if you raise x^Ty to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.

    If you now take any linear model (e.g. linear regression, linear classification, PCA, etc) it turns out you can replace the "real" dot product in its formulation model with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem if there were millions or billions of them.

    Consider, for example, plain old linear regression: f(x) = w^Tx + b. We can "kernelize" it by first representing w as a linear combination of the data points (this is called a dual representation):

        \[f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,\]

    and then swapping all the dot products x_i^T x with a custom kernel function:

        \[f(x) = \sum_i \alpha_i k(x_i,x) + b.\]

    If we now substitute k(x,y) = (x^T y)^2 here, our model becomes a second degree polynomial regression. If k(x,y) = (x^T y)^5 it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.

    It turns out that there are lots of valid choices for the kernel function k, and, of course, the Gaussian function is one of these choices:

        \[k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).\]

    It is not too surprising - the Gaussian function tends to pop up everywhere, after all, but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, usually, so let me do it here.

    The Gaussian Kernel

    To see the meaning of the Gaussian kernel we need to understand the couple of ways in which any kernel functions can be combined. We saw before that raising a linear kernel to the power i makes a kernel with a feature space, which includes all i-wise products. Now let us examine what happens if we add two or more kernel functions. Consider k(x,y) = x^Ty + (x^Ty)^2, for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

        \[(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),\]

    i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

    Multiplying a kernel function with a constant c is also meaningful. It corresponds to scaling the corresponding features by \sqrt{c}. For example, k(x,y) = 4x^Ty = (2x)^T(2y).

    Still with me? Great, now let us combine the tricks above and consider the following kernel:

        \[k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.\]

    Apparently, it is a kernel which corresponds to a feature mapping, which concatenates a constant feature, all original features, all pairwise products scaled down by \sqrt{2} and all triple products scaled down by \sqrt{6}.

    Looks impressive, right? Let us continue and add more members to this kernel, so that it would contain all four-wise, five-wise, and so on up to infinity-wise products of input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum would resemble a familiar expression:

        \[\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).\]

    We can conclude here that k(x,y) = \exp(x^Ty) is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

    But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector x to \frac{x}{|x|} before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

    Consider again the linear kernel k(x,y) = x^Ty. If we normalize the vectors before taking their inner product, we get

        \[\left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.\]

    With some reflection you will see that the latter expression would normalize the features for any kernel.

    Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

        \begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)}  \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}

    Voilà, the Gaussian kernel. Well, it still lacks \sigma^2 in the denominator but by now you hopefully see that adding it is equivalent to scaling the inputs by 1/\sigma

    To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products if i-th degree are scaled down by \sqrt{i!} before normalization). Simple, right?

    An Example

    The derivations above may look somewhat theoretic if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let x = 1, y = 2. The value of the Gaussian kernel k(x, y) for these inputs is:

        \[k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...\]

    Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

        \[\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).\]

    As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is how the feature sequences look like:

    ----\phi'(1) = (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

    ----\phi'(2) = (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

    Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations \phi we just need to normalize:

    ----\phi(1) =\frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} = (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

    ----\phi(2) =\frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} = (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

    Finally, we compute the simple dot product of these two vectors:

        \[\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....\]

    In boldface are the decimal digits, which match the value of \exp(-0.5|1-2|^2). The discrepancy is probably more due to lack of floating-point precision rather than to our approximation.

    A 2D Example

    The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

        \begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}},  \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}

    This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

    Note that many products are repeated, however (e.g. x_1x_2 = x_2x_1), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

    Basic combinatorics will tell us, that each feature of the form \frac{x_1^a x_2^b}{\sqrt{n!}} must be repeated exactly \frac{n!}{a!b!} times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by \sqrt{\frac{n!}{a!b!}}. "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector (v, v, v), for example. Its norm is \sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|, exactly the same as the norm of the single-element vector (\sqrt{3}v).

    As we do this scaling, each feature gets converted to a nice symmetric form:

        \[\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x^a}{\sqrt{a!}}\frac{x^b}{\sqrt{b!}}.\]

    This means that we can compute the 2-dimensional feature vector by first expanding each parameter into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves with maximum degree 20, we would only have to deal with 21^2 = 231 features instead of 2097151. Nice!

    Here is a new view of the unnormalized feature vector up to degree 3:

        \[\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).\]

    Let us limit ourselves to this degree-3 example and let x = (0.7, 0.2), y = (0.1, 0.4) (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

    ----\phi'_3(0.7, 0.2) = (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

    ----\phi'_3(0.1, 0.4) = (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

    After normalization:

    ----\phi_3(0.7, 0.2) = (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

    ----\phi_3(0.1, 0.4) = (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

    The dot product of these vectors is 0.8196\dots, what about the exact Gaussian kernel value?

        \[\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.\]

    Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

    The features of the unnormalized d-dimensional Gaussian kernel are:

        \[\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

    where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.

    The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is \sqrt{\exp({\bf x}^T{\bf x})}. Thus we may also state the following:

    The features of the d-dimensional Gaussian kernel are:

        \[\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

    where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.

    That's it, now you have seen the soul of the Gaussian kernel.

    Tags: , , , , , ,

  • Posted by Konstantin 18.09.2015 No Comments

    Steganography is a marvelous subject, which could be considered a subfield of cryptography. Unfortunately, it is some-why absolutely underrated nowadays. It is not covered by the standard computer science curricula (although it would fit perfectly both within lectures on Cryptography as well as Signal Processing), and way too many people believe steganography is something akin to adding grey stamps to images in Photoshop. Even the almighty Wikipedia, in its corresponding article, does not provide a decently clear view of the field, in my opinion. To fix this I thought I'd give a try at making up a short explanation myself.

    As mentioned, steganography is somewhat of a fellow discipline to cryptography. While the goal of cryptography is to protect the communication channel between Alice and Bob from some hypothetical evil Eve, the goal of steganography is to conceal the fact of communication itself. The classical setting is described as follows: Alice and Bob are kept imprisoned by the evil warden Walter. They are allowed to talk to each other, but Walter is listening to everything. How could Alice and Bob arrange a prison break without raising suspicion?

    Steganographic problems

    More specifically, steganographic problems can be divided into steganography against a passive warden, and steganography against an active warden.

    Passive warden steganography

    In this case Walter can only observe Alice and Bob's communication, but may not interfere. Walter's task is to be able to detect which of the messages between Alice and Bob contain secret meanings (if they do). If Walter comes up with a method by which he can distinguish innocent messages from the ones with a secret meaning with nontrivial probability, we shall say that Walter has succeeded in a "passive attack" on Alice and Bob's steganographic system.

    Besides helping imaginary prisoners hide from imaginary wardens, passive steganography has many practical applications. For example, some printers add a barely visible watermark to each printed page, which identifies the printer model and the date. Radio and TV advertisers sometimes include in their clips a hidden identifier of the product being advertised (which simplifies automated airtime monitoring and validation). Storage of image metainfo within the image itself (e.g. storing the patient id within its X-ray picture) may prevent confusion when data is transferred or converted, etc.

    It must be noted, that from the perspective of classical cryptography the methods mentioned above are not very interesting. Indeed, it is customary to assume in cryptography that all the details of the protocol used between Alice and Bob are open to all parties, including Walter. The only thing that Walter should not know is the "key". Without knowing this key Walter should be incapable of attacking the system. This is not the case in the examples with the printers and advertisements. No keys are involved and anyone who knows the algorithm can immediately discover the hidden message (i.e. attack the system).

    The methods for designing such systems focus primarily on techniques for adding information to a known audio/video/textual signal without distorting it in a perceivable manner and, possibly, making the presence of this information resilient to transformations that the signal may pass (video streaming, file format conversion, text normalization, etc). Consequently, the complexity of many such methods lies mostly in the area of signal processing. In the case of audio, for example, it could mean finding sound frequency components that, on one hand, will be left unchanged by most audio codecs, yet on the other hand could be modified without the listener perceiving the change. In some sense it resembles a fight between the codec (which attempts to remove any redundant channels) and the steganographic system (which aims to make use of any redundancy remaining).

    Active warden steganography

    A more interesting situation is where Walter is not only listening in to Alice and Bob's communications, but may also actively interfere, removing any potential hidden meanings. The best example here is copyright protection. They say that movies, when shown in cinemas, would usually contain a steganographic watermark in their video signal, which identifies the particular movie theatre. The "warden" here is a movie pirate, who might know about the existence of the watermark, and would be interested in removing it (without corrupting the movie itself, of course). Consequently, the watermark must be included in a way, where Walter the pirate is incapable of removing it, even if he knew all the details of the watermarking algorithm (except for a secret key, of course). Besides movie watermarking, similar techniques could be used by other authors to "sign" their creations so that their authorship could later be established, when necessary.

    Methods

    Multiple methods exist for solving both the active and passive warden steganography. Not being able to list all of them, I will mention just the two core concepts, which I think are enough for an introductory overview.

    LSB-steganography

    The most well-known example of steganography - least-significant-bit (LSB) steganography - is applicable to analog data, such as music and images. The idea is that small changes in some values (pixels, frequency components) of an analog signal would not affect the perception of the signal in a notable way. For example, by tuning some pixels in an image one could transmit several bits of additional information. Of course, the choice of those pixels should be determined using a shared secret key, otherwise Walter could easily detect or remove the covert message.

    The problem in the most naive version of this approach is that not every pixel in an image can be safely modified. For example, in regions filled with a single color any such change will be easy to spot. This can be overcome by only picking the pixels with enough variability in their neighborhood. Another weakness of the method is that the steganogram may be easy to remove by resizing or cropping the image. This can be prevented by hiding information in the spectral components rather than pixels - those are more resistant to the usual transformations. In fact, if we are dealing with a photograph or anything else that will be saved as a JPEG file, this approach would be quite nontrivial to break. The same ideas are used for steganographic processing of audio and video signals as well.

    Equivalence-class steganography

    Quite often Alice has a choice for multiple options for sending the same message. For example, when writing a text message to Bob, she can choose the phrasings. She can write "Hi" instead of "Hello", "Robert" instead of "Bob", punctuation can be used in multiple ways, and so on. When Alice is writing a novel, she is free to choose the names of the characters and places, chapter titles and even the order of events. Knowing the possible equivalence classes for the messages to be sent (e.g. words or sentences) as well as a common key, Alice may easily choose her words so that only Bob would be able to grasp whether there is a secret meaning and, if so, what it is.

    A simple practical example of a coding algorithm is the following: Alice will code a single bit by each sentence of her text. For obtaining this bit, Bob will have to apply a hash function to the sentence plus the secret key and use the last bit of that hash. During coding Alice will simply have to choose words so that each sentence would correspond to a correct bit.

    In its bare form the method is easy to attack - Walter may try to rephrase the text himself on transmission. This, in turn, can be overcome if Alice would hide the bits only into particularly chosen words or word sets (which can only be found by knowing the key), and do it using redundant coding (thus Walter would need to significantly alter the text if he wanted to be sure the message is removed). If Alice hides the message into the names of the characters of her novel, Walter may have no chances at all.

    Methods like that are claimed to be used by large software companies to watermark their code (here, instead of choosing synonymous words a specific algorithm will be choosing synonymous bytecode instructions). This makes it possible to track the particular license which was used to leak a pirated version. Another example is geographic mapping software, which may use various equivalent ways of displaying a fractal shore in order to watermark the image.

    As noted, other approaches and techniques exist (I did not find the space to properly mention public key steganography here, for example), yet I belive that LSB and EC-steganography largely illustrate the core ideas behind modern steganography in general as well as its possibilities and limitations. For those interested in the security proofs behind such techniques, this paper is a must read.

     

    Tags: , , ,

  • Posted by Konstantin 22.03.2015 4 Comments

    This is a repost of my quora answer to the question: In layman's terms, how does Naive Bayes work?

    Suppose that you are a working as a security guard at the airport. Your task is to look at people who pass the security line and pick some of them as being worthy of a more detailed screening. Now, of course, telling whether a person is a potential criminal or not by just looking at him/her is hard, if at all possible, but you need to do something. You have been put there for some reason, after all.

    One of the simplest ways to approach the problem, mentally, is the following. You assign a "risk value" for each person. At the beginning (when you don't have any information about the person at all) you set this value to zero.

    Now you start studying various features of the person in front of you: is it a male or a female? Is it a kid? Is he behaving nervously? Is he carrying a big bag? Is he alone? Did the metal detector beep? Is he a foreigner? etc. For each of those features you know (subconsciously due to your presuppositions, or from actual statistics) the average increase or decrease in risk of the person being a criminal that it entails. For example, if you know that the proportion of males among criminals is the same as the proportion of males among non-criminals, observing that a person is male will not affect his risk value at all. If, however, there are more males among criminals (suppose the percentage is, say, 70%) than among decent people (where the proportion is around 50%), observing that a person in front of you is a male will increase the "risk level" by some amount (the value is log(70%/50%) ~ 0.3, to be precise). Then you see that a person is nervous. OK, you think, 90% of criminals are nervous, but only 50% of normal people are. This means that nervousness should entail a further risk increase (of log(0.9/0.5) ~ 0.6, to be technical again, so by now you have counted a total risk value of 0.9). Then you notice it is a kid. Wow, there is only 1% of kids among criminals, but around 10% among normal people. Therefore, the risk value change due to this observation will be negative (log(0.01/0.10) ~ -2.3, so your totals are around -1.4 by now).

    You can continue this as long as you want, including more and more features, each of which will modify your total risk value by either increasing it (if you know this particular feature is more representative of a criminal) or decreasing (if the features is more representative of a decent person). When you are done collecting the features, all is left for you is to compare the result with some threshold level. Say, if the total risk value exceeds 10, you declare the person in front of you to be potentially dangerous and take it into a detailed screening.

    The benefit of such an approach is that it is rather intuitive and simple to compute. The drawback is that it does not take the cross-play of features into account. It may very well be the case that while the feature "the person is a kid" on its own greatly reduces the risk value, and the feature "has a moustache" on its own has close to no effect, a combination of the two ("a kid with a moustache") would actually have to increase the risk by a lot. This would not happen when you simply add the separate feature contributions, as described above.

    Tags: , , , , , ,

  • Posted by Konstantin 09.03.2015 No Comments

    Playing cards are a great tool for modeling, popularizing and explaining various mathematical and algorithmic notions. Indeed, a deck of cards is a straightforward example for a finite set, a discrete distribution or a character string. Shuffling and dealing cards represents random sampling operations. A card hand denotes information possessed by a party. Turning card face down or face up looks a lot like bit flipping. Finally, card game rules describe an instance of a particular algorithm or a protocol. I am pretty sure that for most concepts in maths or computer science you can find some card game or a card trick that is directly related to it. Here are some examples of how you can simulate a simple computing machineillustrate inductive reasoning or explain map-reduce, in particular.

    Cryptographers seem to be especially keen towards playing cards. Some cryptographic primitives could have been inspired by them. There are decently secure ciphers built upon a deck of cards. Finally, there are a couple of very enlightening card-based illustrations of such nontrivial cryptographic concepts as zero-knowledge proofs and voting protocols. The recent course on secure two-party computation given by abhi shelat at the last week's EWSCS extended my personal collection of such illustrations with another awesome example — a secure two-party protocol for computing the AND function. As I could not find a description of this protocol anywhere in the internet (and abhi did not know who is the author), I thought it was worth being written up in a blog post here (along with a small modification of my own).

    The Tinder Game

    Consider the situation, where Alice and Bob want to find out, whether they are both interested in each other (just as if they were both users of the Tinder app). More formally, suppose that Alice has her private opinion about Bob represented as a single bit (where "0" means "not interested" and "1" means "interested"). Equivalently, Bob has his private opinion about Alice represented in the same way. They need to find out whether both bits are "1". However, if it is not the case, they would like to keep their opinions private. For example, if Alice is not interested in Bob, Bob would prefer Alice not to know that he is all over her. Because, you know, opinion asymmetry may lead to awkward social dynamics when disclosed, at least among college students.

    The following simple card game happens to solve their problem. Take five cards, so that three of them are red and two are black. The cards of one color must be indistinguishable from each other (e.g. you can't simply take three different diamonds for the reds). Deal one black and one red card to Alice, one black and one red card to Bob. Put the remaining red card on the table face up.

    Initial table configuration

    Initial table configuration

    Now Alice will put her two cards face down to the left of the open card, and Bob will put his two cards to the right of the open card:

    Alice and Bob played

    Alice and Bob played

    The rules for Alice and Bob are simple: if you like the other person, put your red card next to the central red card. Otherwise, put your black card next to the central one. For example, if both Alice and Bob like each other, they would play their cards as follows (albeit still face down):

    What Alice and Bob would play if they both liked each other

    What Alice and Bob would play if they both liked each other

    Now the middle card is also turned face down, and all five cards are collected, preserving their order:

    Five cards collected, preserving order

    Five cards collected, preserving order

    Next, Alice cuts this five-card deck, moving some number of cards from the top to the bottom (Bob should not know exactly how many). Then Bob cuts the deck (also making sure that Alice does not know how many cards he cuts). The result is equivalent to applying some cyclic rotation to the five card sequence yet neither Bob nor Alice knows how many cards were shifted in total.

    The five cards can now be opened by dealing them in order onto the table to find out the result. If there happen to be three red cards one after another in a row or two black cards one after another, Alice and Bob both voted "yes". Here is one example of such a situation. It is easy to see that it is a cyclic shift of the configuration with three red aces in the middle shown above.

    Both voted "yes"

    Both voted "yes"

    Otherwise, if neither three red aces nor two black aces are side by side, we may conclude that one or both of the players voted "no". However, there is absolutely no way to find out anything more specific than that. Here is an example:

    No mutual affection (or no affection at all)

    No mutual affection (or no affection at all)

    Obviously, this result could not be obtained as a cyclic shift of the configuration with three aces clumped together. On the other hand, it could have been obtained as a cyclic shift from any of the three other alternatives ("yes-no", "no-no" and "no-yes"), hence if Alice voted "no" she will have no way of figuring out what was Bob's vote. Thus, playing cards along with a cryptographic mindset helped Alice and Bob discover their mutual affection or the lack of it without the risk of awkward situations. Isn't it great?

    Throwing in a Zero-Knowledge Proof

    There are various ways Alice or Bob can try to "break" this protocol while still trying to claim honesty. One possible example is shown below:

    Alice is trying to cheat

    Alice is trying to cheat

    Do you see what happened? Alice has put her ace upside down, which will later allow her to understand what was Bob's move (yet she can easily pretend that turning a card upside down was an honest mistake). Although the problem is easily solved by picking a deck with asymmetric backs, for the sake of example, let us assume that such a solution is somewhy unsuitable. Perhaps there are no requisite decks at Alice and Bob's disposal, or they need to have symmetric backs for some other reasons. This offers a nice possibility for us to practice even more playing card cryptography and try to secure the original algorithm from such "attacks" using a small imitation of a zero-knowledge proof for turn correctness.

    Try solving it yourself before proceeding.

    Read more...

    Tags: , , , , , ,

  • Posted by Konstantin 03.09.2012 2 Comments

    For many people, the ability for learning and adaptation seems like something unique, extremely complicated and mysterious. Indeed, those are the abilities we almost exclusively associate with high levels of intelligence and knowledge. This is, however, an illusion. Although adaptive behaviour might indeed look complex, it is not necessarily driven by "intelligent" mechanisms. One of the best illustrations of this is a fully-fledged self-learning machine made from plain matchboxes.

    A Tic-Tac-Toe machine made by James Bridle

    A Tic-Tac-Toe machine by James Bridle

    The idea for such a machine was first introduced in 1960 by Donald Michie, who devised a simple self-learning algorithm for Tic-Tac-Toe (reminiscent of what is now known to be Reinforcement Learning). Due to lack of appropriate computing power, he implemented it "in hardware" using 300 or so matchboxes.

    The idea of the machine is simple. There is a matchbox corresponding to each game position, where the "computer" has to make a move. The matchbox contains colored beads, each color corresponding to a particular move. The decision is made by picking a random bead from the matchbox. Initially (when the machine is "untrained"), there is an equal number of beads of each color, and the machine thus makes equiprobably random turns. After each game, however, the machine is "punished" by removing beads, corresponding to losing turns, or "rewarded" by adding beads, corresponding to winning turns. Thus, after several games, the machine will adapt its strategy towards a winning one.

    The idea was popularized by Martin Gardner in one of his Scientific American articles (later published in the book "The Unexpected Hanging and Other Mathematical Diversions"). Gardner invented a simple game of  "Hexapawn", and derived a matchbox machine for it, which only required as little as 19 matchboxes. He also suggests in his article, however, to create a matchbox machine for "Mini-checkers" - checkers played on a 4x4 board. Ever since I saw this article some 20 or so years ago I was thinking of making one. This summer, while teaching a machine learning course in a summer school in Kiev, I actually made one. I could use it to both fulfil my ages-old desire as well as a teaching aid. You can make one too now, if you are interested.

    The Mini-checkers Machine

    The rules of mini-checkers are exactly like those of usual checkers, with three modifications:

    • The game is played on a 4x4 field. White is the first one to move. Machine plays for black.
    • Whenever both players get a King, the game immediately ends in a draw.
    • The King must always move to the furthest possible position in the chosen direction.

    To make the machine, you first have to buy and empty 24 matchboxes. Next, print out and stick the 24 game positions onto the boxes. Draw on each box all the possible black's moves as arrows using colored markers. Finally, for  each colored arrow, add 2 beads of the same color into the matchbox. That's it, your machine is ready to play.

    The Mini-checkers machine

    The Mini-checkers machine

    The game proceeds as already described: whenever the machine (the black player) has to make a decision (i.e. whenever it has to make a move and there is more than one possibility), find the matchbox with the current position depicted on it, shake it, and pick a random bead. This will tell you the decision of the machine. If the corresponding matchbox is empty, the machine forfeits. You should keep the matchboxes, corresponding to the moves that were made, open until the end of the game.

    Once the game is over, the machine is "taught":

    • If the machine won, do nothing.
    • If the game was a draw, remove the bead corresponding to the machine's last move from the matchbox, unless it was the last bead of that color in the box.
    • If the machine lost, remove all the beads, corresponding to the machine's last move, from the last matchbox.

    It takes about 30 games or so for the machine to actually learn to play well enough. Of course, a human would understand the strategy much earlier, but it's fun none the less.

    Playing with the machine will immediately lead you towards two important questions:

    • How efficient is the suggested learning procedure? Can it be improved and generalized?
    • How do you make a matchbox machine for a more complex game without having to manage thousands of matchboxes.

    As far as I know, contemporary machine learning has only partial answers to both of them.

    Tags: , , , ,

  • Posted by Konstantin 21.11.2009 No Comments

    In the recent weeks I had to give a few introductory lectures on supervised machine learning within Jaak's data mining course. I also provided the students some home assignments, and there it was especially tricky to find a simple, fun and discussable exercise, which might help to form some intuition regarding the inherent difficulties of learning from data such as overfitting, multiple testing, and the like. The following is what I came up with and I find it a rather successful creation. It is remarkable that of the 80 students participating in the course only 4 or so came up with the correct answer 🙂

    The Riddle

    After the lecture on supervised machine learning at the University of Mordor, the teacher, Sauron, gave the students a dataset of the following form:

                1) ABACDAFXYZ    -> good
                2) CEGXEAELWMN   -> good
                3) NUWAB         -> bad
                        ...
               20) QRELAZMNPCXA  -> good

    The inputs were seemingly arbitrary strings of latin characters: these were the terrible spells known only to Sauron and chosen by Sauron at random from his Great Book of Terrible Spells. The output was a binary variable, classifying each spell as good or bad. There were 20 observations in total in the dataset.

    The students were given a task: on the basis of this data, come up with a generic spell classifier. Sauron promised to test their result on the next lecture as follows: he will pick another random terrible spell from the book and require each student to make a prediction. The student whose prediction is wrong will have to jump down from the tower as a punishment.

    The first student, Aghargh, who happened to be slacking on the lecture, couldn't come up with any smart ways to solve the task, so he ended up just counting the proportion of "good" and "bad" spells in the dataset. Having observed that 16 of the 20 spells were "good", he decided to predict "good" when asked.

    The second student, Bughrorc, chose a smarter way. Firstly, he converted each string into a vector of binary attributes — one attribute for each letter, having value "1" if that letter was present in the corresponding spell and "0" otherwise. He then split the data randomly into a training set (15 instances) and a test set (5 instances), and attempted training various classifiers using the MordorMiner software. After some experimenting he finally found five classifiers that could predict all of the training examples correctly. One of these also predicted all of the testing examples correctly. He decided to use this classifier on the forthcoming test.

    On the day of testing, Sauron asked the students to classify the spell YOZAZA. Aghargh, as he decided, provided the answer "good". Bughrorc's classifier analyzed the string and claimed that the answer should be "bad"

    Which of the students, do you think, might have a better chance of not jumping from the tower? Why? Can you quantify your answer?

    Tags: , , ,

Calendar

October 2017
M T W T F S S
« Jul    
 1
2345678
9101112131415
16171819202122
23242526272829
3031