Four Years Remaining

BreadboardBot

Posted by Konstantin 02.01.2024 No Comments
This article has been cross-posted to Hackster.

What is the best way to learn (or teach) introductory robotics? What could be the "hello world" project in this field, that would be simple for a beginner to get started with, complex enough to be exciting and qualify as a "robot", and deep enough to allow further creative extensions and experiments? Let us try solving this problem logically, step by step!

First things first: what is a "robot"?

Axiom number one: for anything to feel like a robot, it must have some moving parts. Ideally, it could have the freedom move around, and probably the easiest way to achieve that would be to equip it with two wheels. Could concepts other than a two-wheeler qualify as the simplest robot? Perhaps, but let us proceed with this one as a reasonably justified assumption.

The wheels

To drive the two upcoming wheels we will need two motors along with all that driver circuitry, ugh! But wait, what about those small continuous rotation servos! They are simple, cheap, and you just need to wire them to a single microcontroller pin. Some of them are even conveniently sold along with with a LEGO-style wheel!

The microcontroller

Rule number two: it is not a real robot, unless it has a programmable brain, like a microcontroller. But which microcontroller is the best for a beginner in year 2024? Some fifteen years ago the answer would certainly have the "Arduino" keyword in it. Nowadays, I'd say, the magic keyword is "CircuitPython". Programming just cannot get any simpler than plugging it into your computer and editing code.py on the newly appeared external drive. No need for "compiling", "flashing" or any of the "install these serial port drivers to fix obscure error messages" user experiences (especially valuable if you need to give a workshop at a school computer class). Moreover, the code to control a wheel looks as straightforward as motor.throttle = 0.5.

OK, CircuitPython it is, but what specific CircuitPython-compatible microcontroller should we choose? We need something cheap and well-supported by the manufacturer and the community. The Raspberry Pi Pico would be a great match due to its popularity, price and feature set, except it is a bit too bulky (as will become clear below). Check out the Seeedstudio Xiao Series instead. The Xiao RP2040 is built upon the same chip as the Raspberry Pi Pico, is just as cheap, but has a more compact size. Interestingly, as there is a whole series of pin-compatible boards with the same shape, we can later try other models if we do not like this one. Moreover, we are not even locked into a single vendor, as the QT Py series from Adafruit is also sharing the same form factor. How cool is that? (I secretly hope this particular footprint would get more adoption by other manufacturers).

The body

OK, cool. We have our wheels and a microcontroller to steer them. We now need to build the body for our robot and wire things together somehow. What would be the simplest way to achieve that?

Let us start with the wiring, as here we know an obviously correct answer - a breadboard. No simpler manual wiring technology has been invented so far, right? Which breadboard model and shape should we pick? This will depend on how we decide to build the frame for our robot, so let us put this question aside for a moment.

How do we build the frame? The options seem infinite: we can design it in a CAD tool and laser cut it from acrylic sheets, print on a 3D printer, use some profile rails, wooden blocks, LEGO bricks,... just ask Google for some inspiration:

Ugh, this starts to feel too complicated and possibly inaccessible to a normal schoolkid.... But wait, what is it there on the back of our breadboard? Is it sticky tape? Could we just... stick... our motors to it directly and use the breadboard itself as the frame?

If we now try out a few breadboard models, we will quickly discover that the widespread, "mini-sized", 170-pin breadboard has the perfect size to host our two servos:

By an extra lucky coincindence, the breadboard even has screw holes in the right places to properly attach the motors! If we now hot glue the servo connectors to the side as shown above, we can wire them comfortably, just as any other components on the breadboard itself. Finally, there is just enough space in the back to pack all the servo wires and stick a 3xAAA battery box, which happens to be enough to power both our servos and the Xiao RP2040:

This picture shows the FS90R servos. These work just as well as the GeekServo ones shown in in the picture above.

... and thus, the BreadboardBot is born! Could a breadboard with two motors and a battery stuck to its back qualify as the simplest, lowest-tech ever robot platform? I think it could. Note that the total cost of this whole build, including the microcontroller is under $20 (under $15 if you order at the right discounts from the right places).

But what can it do?

Line following

Rule number 3: it is not a real robot if it is not able to sense the environment and react to it somehow. Hence, we need some sensors, preferably ones that would help us steer the robot (wheels are our only "actuators" so far). What are the simplest steering techniques in introductory robotics? I'd say line following and obstacle avoidance!

Let us do line following first then. Oh, but how do we attach the line tracking sensors, we should have built some complicated frame after all, right? Nope, by a yet another happy design coincidence, we can just plug them right here into our breadboard:

Thanks to the magic of CircuitPython, all it takes now to teach our robot to (even if crudely) follow a line is just three lines of code (boilerplate definitions excluded):
```
while True:
  servo_left.throttle = 0.1 * line_left.value
  servo_right.throttle = -0.1 * line_right.value
```
<span data-mce-type="bookmark" style="display: inline-block; width: 0px; overflow: hidden; line-height: 0;" class="mce_SELRES_start"></span>

Obstacle avoidance

OK, a line follower implemented with a three-line algorithm is cool, can we do anything else? Sure, look how conveniently the ultrasonic distance sensor fits onto our breadboard:

Add a few more lines of uncomplicated Python code and you get a shy robot that can follow a line but will also happily go roaming around your room, steering away from any walls or obstacles:

Note how you can hear the sonar working in the video (not normally audible).

... and more

There is still enough space on the breadboard to add a button and a buzzer:

.. or splash some color with a Xiao RGB matrix:

.. or plug in an OLED screen to give your BreadboardBot a funny animated face:
The one on the image above also had the DHT11 sensor plugged into it, so it shows humidity and temperature while it is doing its line following. Why? Because it can!

Tired of programming and just want a remote-controllable toy? Sure, plug in a HC06 bluetooth serial module, program one last time, and go steer the robot yourself with a smartphone:

Other microcontroller boards, BLE, Wifi and FPV

Xiao RP2040 is great, but it has no built-in wireless connectivity! Plug in its ESP32-based cousin Xiao ESP32S3 and you can steer your robot with a Bluetooth gamepad, or via Wifi, by opening a webpage served by the robot itself!
If you have the Xiao ESP32S3 Sense (the version with a microphone and camera), that webpage can even show you the live first-person view feed of the robot's camera (conveniently positioned to look forward):

In case you were wondering, the resulting FPV bot is not a very practical solution for spying on your friends, as the microcontroller heats up a lot and consumes the battery way too fast, but it is still a fun toy and an educational project.

Finally, although the Xiao boards fit quite well with all other gadgets on the mini breadboard you are not limited by that choice either and can try plugging in any other appropriately-sized microcontroller board. Here is, for example, a BreadboardBot with an M5Stack Atom Lite running a line-follower program implemented as an ESPHome config (which you can flash over the air!).

Conclusion

The examples above are just the tip of the iceberg. Have any other sensors lying around from those "getting started with Arduino" kits that you did not find sufficiently exciting uses for so far? Maybe plugging them into the BreadboardBot is the chance to shine they were waiting for all this time! Teaching the robot to recognize familiar faces and follow them? Solving a maze? Following voice commands? Self-balancing? Programming coordinated movement of multiple robots? Interfacing the robot with your smart home or one of these fancy AI chatbots? Organizing competitions among multiple BreadboardBots? Vacuuming your desk? The sky (and your free time) is the limit here!

Consider making one and spreading the joy of breadboard robotics!

Detailed build instructions along with wiring diagrams and example code are available on the project's Github page. Contributions are warmly welcome!
Tags: Fun, How to, Procrastination, Programming, Project, Python, Robotics, Teaching
How to Prepare A Good Presentation

Posted by Konstantin 11.02.2019 2 Comments
Presenting is hard. Although I have had the opportunity to give hundreds of talks and lectures on various topics and occasions by now, preparing every new presentation still takes me a considerable amount of effort. I have had a fair share of positive feedback, though, and have developed a small set of principles which, I believe, are key to preparing (or at least learning to prepare) good presentations. Let me share them with you.

1. Start off without slides

Plan your talk as if you had to give it using only a blackboard. Think about what to say and how to say it so that the listener would be interested to look at you and listen to you. You will then discover moments in your speech, where you need to write or draw something on the blackboard. These will be your slides. Remember, that your slides are only there to enhance and illustrate your presentation. Your speech is its main component.

Of course, preparing slides can be a very convenient way to plan your talk, but try not to fall into the popular trap, where your slides end up being the talk, whilst your own role boils down to basically "playing them back" in front of the audience, so that one may start wondering why are you even there.

When you are done with your slides, a useful practice is to go over them and remove all of the text, besides the parts which you would really have to scribble on the blackboard, if the slides were not available. For any piece of text you currently have on the slide, ask yourself, what is its purpose. Usually there are just two possible answers:
1. It is there to help the listener. In this case, what you usually need instead is an illustration. An actual chart, photo or a schematic cartoon, providing the listener with a useful visual aid. Yes, plain text can also be a visual aid sometimes, but it should be your last resort.
2. Alternatively, you put it there only to help yourself - as a reminder of what to speak about next or what specific wording to use. Delete such texts completely and put some effort into memorizing the talk instead. You can make a paper cheatsheet to peek at during the trickier spots. In general, however, if you find it hard to navigate or memorize your talk, perhaps you should have organized it better in the first place. Which leads us to the second rule:
2. Structure around questions

Imagine that you are constrained to explain your point by only asking questions to the audience. The audience would effectively have to present the topic "by themselves" by answering these questions in order. Surprisingly many seemingly complicated topics may be explained in that manner. Of course, you do not need to literally perform the talk by only querying the audience (although I would suggest you try this format at least once - it can be rather fun). The idea is that the imaginary order of questions that need to be asked, is the best order of exposition for your topic. It is this precise order, in which most new statements tend to follow logically from the previous ones not just for you, but also for the audience, who will thus find it easier to track your talk.

3. Presenting is show business

Remember that every presentation is a show. Your goal is not to explain something in the most precise manner, but to do it in the most interesting manner. There is a caveat, of course: the notion of "interestingness" depends on your audience. Presentations made for schoolkids, students, professors and grandmothers differ not because they need to be told different things, but because you need to tell them the same things in different ways. Thus, always tailor your talk to the audience.

One basic rule which fits most audiences, though, is the following: if you may include interactivity in your talk (i.e. if the format and formalities allow for that), do so. The most basic element of interactivity in a typical presentation would be a question to the audience. There are two main kinds of questions:
1. "Leading questions", where you ask the audience to explain the concepts on their own (see part 2 above).
2. "Quiz questions", for checking whether the listener is still on board. This can provide you with real-time feedback and let you know when you need to slow down or speed up.
Smart combination of these two kinds of interactions would nearly always bump the interestingness of your talk up a notch. It is not enough to know what and when to ask, however. The choice of who to direct your questions to is just as important. Try to avoid the widely spread practice of simply throwing questions into the audience "in general" and giving word to whoever jumps up first. This would often engage just the few most extraverted or excited listeners, leaving the rest feeling a bit like outsiders (although semi-anonymous surveys can be fun sometimes). Unfortunately, communicating with the audience is an art which requires a lot of practice and failed attempts to master. While you are not there yet, let me suggest one simple recipe, which I found to be unexpectedly useful for student audiences - the talking pillow game. Pick an object (not necessarily a pillow) and give it to a random listener. That listener now has to answer your next question and pass the object on to their neighbor.

Of course, asking questions is just the most basic way of introducing useful interaction. Some concepts can be explained in the format of an interactive demo. Try devising one, whenever you can. Keep your listeners awake at all costs.

4. Experiment a lot

Remember that every presentation is a chance to try something new. Feel free to experiment with the various styles and formats every time (this is especially true while you are a student and "failing" a yet-another seminar talk is not really a big deal). Try practicing blackboard-only talks. Do a talk where you only ask questions. Try making a talk where you have no text on the slides at all. Make your talk into an interactive workshop. Experiment with the slide types and formats. Try sitting/standing/walking during the talk. Dance and sing (if you're up to). Trying is the only way to figure out what works and what does not work for you. In any case, carefully rehearse all the elements of your talk which do not yet come naturally. For the first 100 times this would mean rehearsing all of the talk.

5. Film yourself

Always ask someone to film your presentation. Watching yourself is the only way to discover that you are holding your hands strangely, not looking into the audience, swinging back and forth, closing your eyes, and so on. Watching yourself is often a painfully embarrassing, but a pretty much mandatory procedure, if you want to get better.
Tags: How to, Teaching
The Meaning of the Gaussian Kernel

Posted by Konstantin 11.11.2016 4 Comments

A question on Quora reminded me that I wanted to post this explanation here every time I got a chance to teach SVMs and Kernel methods, but I never found the time. The post expects basic knowledge of those topics from the reader.

Introductory Background

The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, they have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, $x^T y$ , and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

$\begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}$

Analogously, if you raise $x^Ty$ to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.

If you now take any linear model (e.g. linear regression, linear classification, PCA, etc) it turns out you can replace the "real" dot product in its formulation model with such a kernel function, and this will magically convert your model into a linear model with nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem if there were millions or billions of them.

Consider, for example, plain old linear regression: $f(x) = w^Tx + b$ . We can "kernelize" it by first representing $w$ as a linear combination of the data points (this is called a dual representation):

$f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,$

and then swapping all the dot products $x_i^T x$ with a custom kernel function:

$f(x) = \sum_i \alpha_i k(x_i,x) + b.$

If we now substitute $k(x,y) = (x^T y)^2$ here, our model becomes a second degree polynomial regression. If $k(x,y) = (x^T y)^5$ it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.

It turns out that there are lots of valid choices for the kernel function $k$ , and, of course, the Gaussian function is one of these choices:

$k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).$

It is not too surprising - the Gaussian function tends to pop up everywhere, after all, but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, usually, so let me do it here.

The Gaussian Kernel

To see the meaning of the Gaussian kernel we need to understand the couple of ways in which any kernel functions can be combined. We saw before that raising a linear kernel to the power $i$ makes a kernel with a feature space, which includes all $i$ -wise products. Now let us examine what happens if we add two or more kernel functions. Consider $k(x,y) = x^Ty + (x^Ty)^2$ , for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

$(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),$

i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

Multiplying a kernel function with a constant $c$ is also meaningful. It corresponds to scaling the corresponding features by $\sqrt{c}$ . For example, $k(x,y) = 4x^Ty = (2x)^T(2y)$ .

Still with me? Great, now let us combine the tricks above and consider the following kernel:

$k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.$

Apparently, it is a kernel which corresponds to a feature mapping, which concatenates a constant feature, all original features, all pairwise products scaled down by $\sqrt{2}$ and all triple products scaled down by $\sqrt{6}$ .

Looks impressive, right? Let us continue and add more members to this kernel, so that it would contain all four-wise, five-wise, and so on up to infinity-wise products of input features. We shall choose the scaling coefficients for each term carefully, so that the resulting infinite sum would resemble a familiar expression:

$\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).$

We can conclude here that $k(x,y) = \exp(x^Ty)$ is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector $x$ to $\frac{x}{|x|}$ before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

Consider again the linear kernel $k(x,y) = x^Ty$ . If we normalize the vectors before taking their inner product, we get

$\left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.$

With some reflection you will see that the latter expression would normalize the features for any kernel.

Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

$\begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)} \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}$

Voilà, the Gaussian kernel. Well, it still lacks $\sigma^2$ in the denominator but by now you hopefully see that adding it is equivalent to scaling the inputs by $1/\sigma$

To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where feature products if $i$ -th degree are scaled down by $\sqrt{i!}$ before normalization). Simple, right?

An Example

The derivations above may look somewhat theoretic if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let $x = 1$ , $y = 2$ . The value of the Gaussian kernel $k(x, y)$ for these inputs is:

$k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...$

Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

$\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).$

As our inputs are rather small in magnitude, we can hope that the feature sequence quickly approaches zero, so we don't really have to work with infinite vectors. Indeed, here is how the feature sequences look like:

---- $\phi'(1) =$ (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

---- $\phi'(2) =$ (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations $\phi$ we just need to normalize:

---- $\phi(1) =\frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} =$ (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

---- $\phi(2) =\frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} =$ (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

Finally, we compute the simple dot product of these two vectors:

$\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....$

In boldface are the decimal digits, which match the value of $\exp(-0.5|1-2|^2)$ . The discrepancy is probably more due to lack of floating-point precision rather than to our approximation.

A 2D Example

The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

$\begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}}, \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}$

This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

Note that many products are repeated, however (e.g. $x_1x_2 = x_2x_1$ ), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

Basic combinatorics will tell us, that each feature of the form $\frac{x_1^a x_2^b}{\sqrt{n!}}$ must be repeated exactly $\frac{n!}{a!b!}$ times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by $\sqrt{\frac{n!}{a!b!}}$ . "Why the square root?" you might ask here. Because when combining a repeated feature we must preserve the overall vector norm. Consider a vector $(v, v, v)$ , for example. Its norm is $\sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|$ , exactly the same as the norm of the single-element vector $(\sqrt{3}v)$ .

As we do this scaling, each feature gets converted to a nice symmetric form:

$\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x^a}{\sqrt{a!}}\frac{x^b}{\sqrt{b!}}.$

This means that we can compute the 2-dimensional feature vector by first expanding each parameter into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves with maximum degree 20, we would only have to deal with $21^2$ = 231 features instead of 2097151. Nice!

Here is a new view of the unnormalized feature vector up to degree 3:

$\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).$

Let us limit ourselves to this degree-3 example and let $x = (0.7, 0.2)$ , $y = (0.1, 0.4)$ (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

---- $\phi'_3(0.7, 0.2) =$ (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

---- $\phi'_3(0.1, 0.4) =$ (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

After normalization:

---- $\phi_3(0.7, 0.2) =$ (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

---- $\phi_3(0.1, 0.4) =$ (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

The dot product of these vectors is $0.8196\dots$ , what about the exact Gaussian kernel value?

$\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.$

Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

The features of the unnormalized $d$ -dimensional Gaussian kernel are:

$\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},$

where ${\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d$ .

The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is $\sqrt{\exp({\bf x}^T{\bf x})}$ . Thus we may also state the following:

The features of the $d$ -dimensional Gaussian kernel are:

$\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},$

where ${\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d$ .

That's it, now you have seen the soul of the Gaussian kernel.

Tags: Definition, Explanation, Machine learning, Mathematics, SVM, Teaching, Theory
What is Steganography

Posted by Konstantin 18.09.2015 No Comments

Steganography is a marvelous subject, which could be considered a subfield of cryptography. Unfortunately, it is some-why absolutely underrated nowadays. It is not covered by the standard computer science curricula (although it would fit perfectly both within lectures on Cryptography as well as Signal Processing), and way too many people believe steganography is something akin to adding grey stamps to images in Photoshop. Even the almighty Wikipedia, in its corresponding article, does not provide a decently clear view of the field, in my opinion. To fix this I thought I'd give a try at making up a short explanation myself.

As mentioned, steganography is somewhat of a fellow discipline to cryptography. While the goal of cryptography is to protect the communication channel between Alice and Bob from some hypothetical evil Eve, the goal of steganography is to conceal the fact of communication itself. The classical setting is described as follows: Alice and Bob are kept imprisoned by the evil warden Walter. They are allowed to talk to each other, but Walter is listening to everything. How could Alice and Bob arrange a prison break without raising suspicion?

Steganographic problems

More specifically, steganographic problems can be divided into steganography against a passive warden, and steganography against an active warden.

Passive warden steganography

In this case Walter can only observe Alice and Bob's communication, but may not interfere. Walter's task is to be able to detect which of the messages between Alice and Bob contain secret meanings (if they do). If Walter comes up with a method by which he can distinguish innocent messages from the ones with a secret meaning with nontrivial probability, we shall say that Walter has succeeded in a "passive attack" on Alice and Bob's steganographic system.

Besides helping imaginary prisoners hide from imaginary wardens, passive steganography has many practical applications. For example, some printers add a barely visible watermark to each printed page, which identifies the printer model and the date. Radio and TV advertisers sometimes include in their clips a hidden identifier of the product being advertised (which simplifies automated airtime monitoring and validation). Storage of image metainfo within the image itself (e.g. storing the patient id within its X-ray picture) may prevent confusion when data is transferred or converted, etc.

It must be noted, that from the perspective of classical cryptography the methods mentioned above are not very interesting. Indeed, it is customary to assume in cryptography that all the details of the protocol used between Alice and Bob are open to all parties, including Walter. The only thing that Walter should not know is the "key". Without knowing this key Walter should be incapable of attacking the system. This is not the case in the examples with the printers and advertisements. No keys are involved and anyone who knows the algorithm can immediately discover the hidden message (i.e. attack the system).

The methods for designing such systems focus primarily on techniques for adding information to a known audio/video/textual signal without distorting it in a perceivable manner and, possibly, making the presence of this information resilient to transformations that the signal may pass (video streaming, file format conversion, text normalization, etc). Consequently, the complexity of many such methods lies mostly in the area of signal processing. In the case of audio, for example, it could mean finding sound frequency components that, on one hand, will be left unchanged by most audio codecs, yet on the other hand could be modified without the listener perceiving the change. In some sense it resembles a fight between the codec (which attempts to remove any redundant channels) and the steganographic system (which aims to make use of any redundancy remaining).

Active warden steganography

A more interesting situation is where Walter is not only listening in to Alice and Bob's communications, but may also actively interfere, removing any potential hidden meanings. The best example here is copyright protection. They say that movies, when shown in cinemas, would usually contain a steganographic watermark in their video signal, which identifies the particular movie theatre. The "warden" here is a movie pirate, who might know about the existence of the watermark, and would be interested in removing it (without corrupting the movie itself, of course). Consequently, the watermark must be included in a way, where Walter the pirate is incapable of removing it, even if he knew all the details of the watermarking algorithm (except for a secret key, of course). Besides movie watermarking, similar techniques could be used by other authors to "sign" their creations so that their authorship could later be established, when necessary.

Methods

Multiple methods exist for solving both the active and passive warden steganography. Not being able to list all of them, I will mention just the two core concepts, which I think are enough for an introductory overview.

LSB-steganography

The most well-known example of steganography - least-significant-bit (LSB) steganography - is applicable to analog data, such as music and images. The idea is that small changes in some values (pixels, frequency components) of an analog signal would not affect the perception of the signal in a notable way. For example, by tuning some pixels in an image one could transmit several bits of additional information. Of course, the choice of those pixels should be determined using a shared secret key, otherwise Walter could easily detect or remove the covert message.

The problem in the most naive version of this approach is that not every pixel in an image can be safely modified. For example, in regions filled with a single color any such change will be easy to spot. This can be overcome by only picking the pixels with enough variability in their neighborhood. Another weakness of the method is that the steganogram may be easy to remove by resizing or cropping the image. This can be prevented by hiding information in the spectral components rather than pixels - those are more resistant to the usual transformations. In fact, if we are dealing with a photograph or anything else that will be saved as a JPEG file, this approach would be quite nontrivial to break. The same ideas are used for steganographic processing of audio and video signals as well.

Equivalence-class steganography

Quite often Alice has a choice for multiple options for sending the same message. For example, when writing a text message to Bob, she can choose the phrasings. She can write "Hi" instead of "Hello", "Robert" instead of "Bob", punctuation can be used in multiple ways, and so on. When Alice is writing a novel, she is free to choose the names of the characters and places, chapter titles and even the order of events. Knowing the possible equivalence classes for the messages to be sent (e.g. words or sentences) as well as a common key, Alice may easily choose her words so that only Bob would be able to grasp whether there is a secret meaning and, if so, what it is.

A simple practical example of a coding algorithm is the following: Alice will code a single bit by each sentence of her text. For obtaining this bit, Bob will have to apply a hash function to the sentence plus the secret key and use the last bit of that hash. During coding Alice will simply have to choose words so that each sentence would correspond to a correct bit.

In its bare form the method is easy to attack - Walter may try to rephrase the text himself on transmission. This, in turn, can be overcome if Alice would hide the bits only into particularly chosen words or word sets (which can only be found by knowing the key), and do it using redundant coding (thus Walter would need to significantly alter the text if he wanted to be sure the message is removed). If Alice hides the message into the names of the characters of her novel, Walter may have no chances at all.

Methods like that are claimed to be used by large software companies to watermark their code (here, instead of choosing synonymous words a specific algorithm will be choosing synonymous bytecode instructions). This makes it possible to track the particular license which was used to leak a pirated version. Another example is geographic mapping software, which may use various equivalent ways of displaying a fractal shore in order to watermark the image.

As noted, other approaches and techniques exist (I did not find the space to properly mention public key steganography here, for example), yet I belive that LSB and EC-steganography largely illustrate the core ideas behind modern steganography in general as well as its possibilities and limitations. For those interested in the security proofs behind such techniques, this paper is a must read.

Tags: Computer science, Cryptography, Definition, Teaching
Naive Bayes in Layman's Terms

Posted by Konstantin 22.03.2015 4 Comments

This is a repost of my quora answer to the question: In layman's terms, how does Naive Bayes work?

Suppose that you are a working as a security guard at the airport. Your task is to look at people who pass the security line and pick some of them as being worthy of a more detailed screening. Now, of course, telling whether a person is a potential criminal or not by just looking at him/her is hard, if at all possible, but you need to do something. You have been put there for some reason, after all.

One of the simplest ways to approach the problem, mentally, is the following. You assign a "risk value" for each person. At the beginning (when you don't have any information about the person at all) you set this value to zero.

Now you start studying various features of the person in front of you: is it a male or a female? Is it a kid? Is he behaving nervously? Is he carrying a big bag? Is he alone? Did the metal detector beep? Is he a foreigner? etc. For each of those features you know (subconsciously due to your presuppositions, or from actual statistics) the average increase or decrease in risk of the person being a criminal that it entails. For example, if you know that the proportion of males among criminals is the same as the proportion of males among non-criminals, observing that a person is male will not affect his risk value at all. If, however, there are more males among criminals (suppose the percentage is, say, 70%) than among decent people (where the proportion is around 50%), observing that a person in front of you is a male will increase the "risk level" by some amount (the value is log(70%/50%) ~ 0.3, to be precise). Then you see that a person is nervous. OK, you think, 90% of criminals are nervous, but only 50% of normal people are. This means that nervousness should entail a further risk increase (of log(0.9/0.5) ~ 0.6, to be technical again, so by now you have counted a total risk value of 0.9). Then you notice it is a kid. Wow, there is only 1% of kids among criminals, but around 10% among normal people. Therefore, the risk value change due to this observation will be negative (log(0.01/0.10) ~ -2.3, so your totals are around -1.4 by now).

You can continue this as long as you want, including more and more features, each of which will modify your total risk value by either increasing it (if you know this particular feature is more representative of a criminal) or decreasing (if the features is more representative of a decent person). When you are done collecting the features, all is left for you is to compare the result with some threshold level. Say, if the total risk value exceeds 10, you declare the person in front of you to be potentially dangerous and take it into a detailed screening.

The benefit of such an approach is that it is rather intuitive and simple to compute. The drawback is that it does not take the cross-play of features into account. It may very well be the case that while the feature "the person is a kid" on its own greatly reduces the risk value, and the feature "has a moustache" on its own has close to no effect, a combination of the two ("a kid with a moustache") would actually have to increase the risk by a lot. This would not happen when you simply add the separate feature contributions, as described above.

Tags: Bayes, Computer science, Data analysis, Definition, Machine learning, Probability theory, Teaching
Playing Card Cryptography

Posted by Konstantin 09.03.2015 No Comments

Playing cards are a great tool for modeling, popularizing and explaining various mathematical and algorithmic notions. Indeed, a deck of cards is a straightforward example for a finite set, a discrete distribution or a character string. Shuffling and dealing cards represents random sampling operations. A card hand denotes information possessed by a party. Turning card face down or face up looks a lot like bit flipping. Finally, card game rules describe an instance of a particular algorithm or a protocol. I am pretty sure that for most concepts in maths or computer science you can find some card game or a card trick that is directly related to it. Here are some examples of how you can simulate a simple computing machine, illustrate inductive reasoning or explain map-reduce, in particular.

Cryptographers seem to be especially keen towards playing cards. Some cryptographic primitives could have been inspired by them. There are decently secure ciphers built upon a deck of cards. Finally, there are a couple of very enlightening card-based illustrations of such nontrivial cryptographic concepts as zero-knowledge proofs and voting protocols. The recent course on secure two-party computation given by abhi shelat at the last week's EWSCS extended my personal collection of such illustrations with another awesome example — a secure two-party protocol for computing the AND function. As I could not find a description of this protocol anywhere in the internet (and abhi did not know who is the author), I thought it was worth being written up in a blog post here (along with a small modification of my own).

The Tinder Game

Consider the situation, where Alice and Bob want to find out, whether they are both interested in each other (just as if they were both users of the Tinder app). More formally, suppose that Alice has her private opinion about Bob represented as a single bit (where "0" means "not interested" and "1" means "interested"). Equivalently, Bob has his private opinion about Alice represented in the same way. They need to find out whether both bits are "1". However, if it is not the case, they would like to keep their opinions private. For example, if Alice is not interested in Bob, Bob would prefer Alice not to know that he is all over her. Because, you know, opinion asymmetry may lead to awkward social dynamics when disclosed, at least among college students.

The following simple card game happens to solve their problem. Take five cards, so that three of them are red and two are black. The cards of one color must be indistinguishable from each other (e.g. you can't simply take three different diamonds for the reds). Deal one black and one red card to Alice, one black and one red card to Bob. Put the remaining red card on the table face up.

Initial table configuration

Now Alice will put her two cards face down to the left of the open card, and Bob will put his two cards to the right of the open card:

Alice and Bob played

The rules for Alice and Bob are simple: if you like the other person, put your red card next to the central red card. Otherwise, put your black card next to the central one. For example, if both Alice and Bob like each other, they would play their cards as follows (albeit still face down):

What Alice and Bob would play if they both liked each other

Now the middle card is also turned face down, and all five cards are collected, preserving their order:

Five cards collected, preserving order

Next, Alice cuts this five-card deck, moving some number of cards from the top to the bottom (Bob should not know exactly how many). Then Bob cuts the deck (also making sure that Alice does not know how many cards he cuts). The result is equivalent to applying some cyclic rotation to the five card sequence yet neither Bob nor Alice knows how many cards were shifted in total.

The five cards can now be opened by dealing them in order onto the table to find out the result. If there happen to be three red cards one after another in a row or two black cards one after another, Alice and Bob both voted "yes". Here is one example of such a situation. It is easy to see that it is a cyclic shift of the configuration with three red aces in the middle shown above.

Both voted "yes"

Otherwise, if neither three red aces nor two black aces are side by side, we may conclude that one or both of the players voted "no". However, there is absolutely no way to find out anything more specific than that. Here is an example:

No mutual affection (or no affection at all)

Obviously, this result could not be obtained as a cyclic shift of the configuration with three aces clumped together. On the other hand, it could have been obtained as a cyclic shift from any of the three other alternatives ("yes-no", "no-no" and "no-yes"), hence if Alice voted "no" she will have no way of figuring out what was Bob's vote. Thus, playing cards along with a cryptographic mindset helped Alice and Bob discover their mutual affection or the lack of it without the risk of awkward situations. Isn't it great?

Throwing in a Zero-Knowledge Proof

There are various ways Alice or Bob can try to "break" this protocol while still trying to claim honesty. One possible example is shown below:

Alice is trying to cheat

Do you see what happened? Alice has put her ace upside down, which will later allow her to understand what was Bob's move (yet she can easily pretend that turning a card upside down was an honest mistake). Although the problem is easily solved by picking a deck with asymmetric backs, for the sake of example, let us assume that such a solution is somewhy unsuitable. Perhaps there are no requisite decks at Alice and Bob's disposal, or they need to have symmetric backs for some other reasons. This offers a nice possibility for us to practice even more playing card cryptography and try to secure the original algorithm from such "attacks" using a small imitation of a zero-knowledge proof for turn correctness.

Try solving it yourself before proceeding.

Read more...

Tags: Computer science, Cryptography, Fun, Games, Mathematics, Teaching, Theory
Self-learning Mini-checkers Machine

Posted by Konstantin 03.09.2012 2 Comments
For many people, the ability for learning and adaptation seems like something unique, extremely complicated and mysterious. Indeed, those are the abilities we almost exclusively associate with high levels of intelligence and knowledge. This is, however, an illusion. Although adaptive behaviour might indeed look complex, it is not necessarily driven by "intelligent" mechanisms. One of the best illustrations of this is a fully-fledged self-learning machine made from plain matchboxes.

A Tic-Tac-Toe machine by James Bridle

The idea for such a machine was first introduced in 1960 by Donald Michie, who devised a simple self-learning algorithm for Tic-Tac-Toe (reminiscent of what is now known to be Reinforcement Learning). Due to lack of appropriate computing power, he implemented it "in hardware" using 300 or so matchboxes.

The idea of the machine is simple. There is a matchbox corresponding to each game position, where the "computer" has to make a move. The matchbox contains colored beads, each color corresponding to a particular move. The decision is made by picking a random bead from the matchbox. Initially (when the machine is "untrained"), there is an equal number of beads of each color, and the machine thus makes equiprobably random turns. After each game, however, the machine is "punished" by removing beads, corresponding to losing turns, or "rewarded" by adding beads, corresponding to winning turns. Thus, after several games, the machine will adapt its strategy towards a winning one.

The idea was popularized by Martin Gardner in one of his Scientific American articles (later published in the book "The Unexpected Hanging and Other Mathematical Diversions"). Gardner invented a simple game of "Hexapawn", and derived a matchbox machine for it, which only required as little as 19 matchboxes. He also suggests in his article, however, to create a matchbox machine for "Mini-checkers" - checkers played on a 4x4 board. Ever since I saw this article some 20 or so years ago I was thinking of making one. This summer, while teaching a machine learning course in a summer school in Kiev, I actually made one. I could use it to both fulfil my ages-old desire as well as a teaching aid. You can make one too now, if you are interested.

The Mini-checkers Machine

The rules of mini-checkers are exactly like those of usual checkers, with three modifications:
- The game is played on a 4x4 field. White is the first one to move. Machine plays for black.
- Whenever both players get a King, the game immediately ends in a draw.
- The King must always move to the furthest possible position in the chosen direction.
To make the machine, you first have to buy and empty 24 matchboxes. Next, print out and stick the 24 game positions onto the boxes. Draw on each box all the possible black's moves as arrows using colored markers. Finally, for each colored arrow, add 2 beads of the same color into the matchbox. That's it, your machine is ready to play.

The Mini-checkers machine

The game proceeds as already described: whenever the machine (the black player) has to make a decision (i.e. whenever it has to make a move and there is more than one possibility), find the matchbox with the current position depicted on it, shake it, and pick a random bead. This will tell you the decision of the machine. If the corresponding matchbox is empty, the machine forfeits. You should keep the matchboxes, corresponding to the moves that were made, open until the end of the game.

Once the game is over, the machine is "taught":
- If the machine won, do nothing.
- If the game was a draw, remove the bead corresponding to the machine's last move from the matchbox, unless it was the last bead of that color in the box.
- If the machine lost, remove all the beads, corresponding to the machine's last move, from the last matchbox.
It takes about 30 games or so for the machine to actually learn to play well enough. Of course, a human would understand the strategy much earlier, but it's fun none the less.

Playing with the machine will immediately lead you towards two important questions:
- How efficient is the suggested learning procedure? Can it be improved and generalized?
- How do you make a matchbox machine for a more complex game without having to manage thousands of matchboxes.
As far as I know, contemporary machine learning has only partial answers to both of them.
Tags: Fun, Machine learning, Procrastination, Reinforcement learning, Teaching
Machine Learning in Mordor

Posted by Konstantin 21.11.2009 No Comments
In the recent weeks I had to give a few introductory lectures on supervised machine learning within Jaak's data mining course. I also provided the students some home assignments, and there it was especially tricky to find a simple, fun and discussable exercise, which might help to form some intuition regarding the inherent difficulties of learning from data such as overfitting, multiple testing, and the like. The following is what I came up with and I find it a rather successful creation. It is remarkable that of the 80 students participating in the course only 4 or so came up with the correct answer 🙂

The Riddle

After the lecture on supervised machine learning at the University of Mordor, the teacher, Sauron, gave the students a dataset of the following form:
```
            1) ABACDAFXYZ    -> good
            2) CEGXEAELWMN   -> good
            3) NUWAB         -> bad
                    ...
           20) QRELAZMNPCXA  -> good
```
The inputs were seemingly arbitrary strings of latin characters: these were the terrible spells known only to Sauron and chosen by Sauron at random from his Great Book of Terrible Spells. The output was a binary variable, classifying each spell as good or bad. There were 20 observations in total in the dataset.

The students were given a task: on the basis of this data, come up with a generic spell classifier. Sauron promised to test their result on the next lecture as follows: he will pick another random terrible spell from the book and require each student to make a prediction. The student whose prediction is wrong will have to jump down from the tower as a punishment.

The first student, Aghargh, who happened to be slacking on the lecture, couldn't come up with any smart ways to solve the task, so he ended up just counting the proportion of "good" and "bad" spells in the dataset. Having observed that 16 of the 20 spells were "good", he decided to predict "good" when asked.

The second student, Bughrorc, chose a smarter way. Firstly, he converted each string into a vector of binary attributes — one attribute for each letter, having value "1" if that letter was present in the corresponding spell and "0" otherwise. He then split the data randomly into a training set (15 instances) and a test set (5 instances), and attempted training various classifiers using the MordorMiner software. After some experimenting he finally found five classifiers that could predict all of the training examples correctly. One of these also predicted all of the testing examples correctly. He decided to use this classifier on the forthcoming test.

On the day of testing, Sauron asked the students to classify the spell YOZAZA. Aghargh, as he decided, provided the answer "good". Bughrorc's classifier analyzed the string and claimed that the answer should be "bad"

Which of the students, do you think, might have a better chance of not jumping from the tower? Why? Can you quantify your answer?
Tags: Fun, Machine learning, Puzzle, Teaching