For many people, the ability to learn and adapt seems like something unique, extremely complicated and mysterious. Indeed, these are the abilities we almost exclusively associate with high levels of intelligence and knowledge. This is, however, an illusion. Although adaptive behaviour might indeed *look* complex, it is not necessarily *driven by* "intelligent" mechanisms. One of the best illustrations of this is a fully fledged self-learning machine made from plain matchboxes.

The idea for such a machine was first introduced in 1960 by Donald Michie, who devised a simple self-learning algorithm for Tic-Tac-Toe (reminiscent of what is now known as reinforcement learning). Lacking access to appropriate computing power, he implemented it "in hardware" using 300 or so matchboxes.

The idea of the machine is simple. There is a matchbox corresponding to each game position in which the "computer" has to make a move. The matchbox contains colored beads, each color corresponding to a particular move. The decision is made by picking a random bead from the matchbox. Initially (when the machine is "untrained"), there is an equal number of beads of each color, so the machine chooses its moves uniformly at random. After each game, however, the machine is "punished" by removing beads corresponding to losing moves, or "rewarded" by adding beads corresponding to winning moves. Thus, after several games, the machine adapts its strategy towards a winning one.
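The bead-and-box mechanism can be sketched in a few lines of code (a sketch only: the class and method names here are my own, not part of the original machine):

```python
import random

class Matchbox:
    """One game position: a bag of beads, one color per legal move."""

    def __init__(self, moves, beads_per_move=2):
        # Untrained state: the same number of beads for every legal move,
        # so the first decisions are uniformly random.
        self.beads = [m for m in moves for _ in range(beads_per_move)]

    def pick(self):
        # Shake the box and draw a random bead; an empty box means forfeit.
        return random.choice(self.beads) if self.beads else None

    def punish(self, move):
        # Remove one bead of the losing move's color, if any remain.
        if move in self.beads:
            self.beads.remove(move)

    def reward(self, move):
        # Add one bead of the winning move's color.
        self.beads.append(move)
```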

The idea was popularized by Martin Gardner in one of his Scientific American columns (later collected in the book "The Unexpected Hanging and Other Mathematical Diversions"). Gardner invented a simple game of "Hexapawn" and derived a matchbox machine for it, which requires as few as 19 matchboxes. In the same article, however, he also suggested creating a matchbox machine for "Mini-checkers" - checkers played on a 4x4 board. Ever since I saw that article, some 20 years ago, I have been thinking of making one. This summer, while teaching a machine learning course at a summer school in Kiev, I finally did - it let me both fulfil my ages-old desire and serve as a teaching aid. You can now make one too, if you are interested.

### The Mini-checkers Machine

The rules of mini-checkers are exactly like those of usual checkers, with three modifications:

- The game is played on a 4x4 board. White moves first; the machine plays Black.
- Whenever both players get a King, the game immediately ends in a draw.
- The King must always move to the furthest possible position in the chosen direction.

To make the machine, you first have to buy and empty 24 matchboxes. Next, print out the 24 game positions and stick them onto the boxes. On each box, draw all of Black's possible moves as arrows using colored markers. Finally, for each colored arrow, put 2 beads of the same color into the matchbox. That's it - your machine is ready to play.

The game proceeds as already described: whenever the machine (the Black player) has to make a decision (i.e. whenever it has to move and there is more than one possibility), find the matchbox with the current position depicted on it, shake it, and pick a random bead. This is the machine's decision. If the corresponding matchbox is empty, the machine forfeits. Keep the matchboxes corresponding to the moves that were made open until the end of the game.

Once the game is over, the machine is "taught":

- If the machine won, do nothing.
- If the game was a draw, remove the bead corresponding to the machine's last move from the matchbox, unless it was the last bead of that color in the box.
- If the machine lost, remove all the beads corresponding to the machine's last move from the last matchbox.
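Assuming each opened box and the bead drawn from it are recorded as (box, move) pairs, the three teaching rules above might look as follows (a sketch, with a minimal `Box` stand-in of my own invention):

```python
class Box:
    """Minimal stand-in for a matchbox: just a list of colored beads."""
    def __init__(self, beads):
        self.beads = list(beads)

def teach(history, result):
    """Apply the post-game rules; `history` holds (box, move) pairs."""
    if result == "win":
        return                        # the machine won: do nothing
    box, move = history[-1]           # only the last move is judged
    if result == "draw":
        # Remove one bead, but never the last bead of that color.
        if box.beads.count(move) > 1:
            box.beads.remove(move)
    elif result == "loss":
        # Remove every bead of the losing move's color from the last box.
        box.beads = [b for b in box.beads if b != move]
```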

It takes about 30 games for the machine to learn to play reasonably well. Of course, a human would grasp the strategy much earlier, but it's fun nonetheless.

Playing with the machine will immediately lead you towards two important questions:

- How efficient is the suggested learning procedure? Can it be improved and generalized?
- How do you make a matchbox machine for a more complex game without having to manage thousands of matchboxes?

As far as I know, contemporary machine learning has only partial answers to both of them.

"Reminiscent of what is now known as reinforcement learning" - I guess I'd classify it as a temporal difference learning algorithm. (As I understand it, that's a special case of reinforcement learning?) The machine learns not only how to move in each position of the game, but also to estimate the quality of a given position - namely: empty box = bad position. And after each move, the quality of the next position is used as feedback for the current move.

But this view also reveals a weakness of your machine: positions where the machine can win and positions where it can only force a draw are assigned the same quality (namely: non-empty box). This means the machine will happily move to draw-in-one positions. I'd love to try playing against the machine; I guess it should not be too hard to use this insight to reach draws from lost positions.

How do we make the machine more clever without sacrificing the simplicity of the matchbox computation? The simplest rule I could come up with is the following:

* If a matchbox is reached that has only beads of different colors, remove the bead corresponding to the machine's last move from the matchbox, unless it was the last bead of that color in the box.

Now draws propagate back, and eventually moves that lead to draws will have only one bead, while boxes corresponding to draws will contain no two beads of the same color.

The machine does not play optimally yet, but at least a winning move will have twice the probability of a draw move. We could change the move selection as follows: "Before shaking the box, remove all beads that have no duplicate, unless this would make the box empty." That would lead to optimal play, but by now the machine is getting a bit complicated. 🙁
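For what it's worth, the modified selection rule in quotes could be sketched like this (the function name is mine):

```python
import random
from collections import Counter

def select(beads):
    # Discard beads whose color appears only once, unless that would
    # empty the box; then draw uniformly from what remains.
    counts = Counter(beads)
    pool = [b for b in beads if counts[b] > 1] or beads
    return random.choice(pool) if pool else None
```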

Any ideas how to achieve optimal play without sacrificing simplicity?

As I see it, the matchbox algorithm is a version of Q-learning rather than TD learning (both of which are regarded as techniques within the vaguely defined field of reinforcement learning). Indeed, each matchbox is essentially an entry in the Q-function table, which is updated by back-propagating feedback once the result is known. In addition, the matchbox machine uses a kind of custom "explore-exploit" technique. In "traditional" Q-learning (or policy iteration), the next step would usually be chosen by picking the best known option in the current state. Here, instead, we use the known Q function as a probability distribution to randomize, i.e. to explore.

Note that if you followed the "traditional" way (look at the beads, find the color with the maximum number of beads, make that move), the problem of not "learning" to distinguish draws from wins would not arise.
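Viewed this way, the bead counts form a small Q-table, and the two selection strategies - sampling in proportion to the counts versus the "traditional" greedy pick - can be contrasted as follows (the toy position and counts here are hypothetical):

```python
import random

# Hypothetical bead counts for one position: 3 red beads, 1 blue bead.
q_table = {"position_1": {"red": 3, "blue": 1}}

def sample_move(position):
    # The matchbox way: bead counts act as a sampling distribution (explore).
    moves = list(q_table[position])
    weights = [q_table[position][m] for m in moves]
    return random.choices(moves, weights=weights)[0]

def greedy_move(position):
    # The "traditional" way: pick the color with the most beads (exploit).
    counts = q_table[position]
    return max(counts, key=counts.get)
```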

Your observation that, besides simply learning Q-scores, the machine essentially learns to classify moves into "draw"/"non-draw" and states into "win"/"loss" is cool - I hadn't even thought about it.

One mechanical way to facilitate the approach you suggest could be the following: have two compartments in the matchbox. Initially each compartment contains one bead of each color. Moves are made by picking from compartment "A" unless it is empty, in which case compartment "B" is used. Punishment for a loss always removes the bead; punishment for a draw only removes beads from "A". If I'm not mistaken, this is equivalent to your idea, but easier to operate.
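A sketch of that two-compartment box, under the rules just described (the names are mine):

```python
import random

class TwoCompartmentBox:
    def __init__(self, moves):
        # Each compartment starts with one bead of each color.
        self.a = list(moves)
        self.b = list(moves)

    def pick(self):
        # Draw from compartment A unless it is empty; then fall back to B.
        pool = self.a if self.a else self.b
        return random.choice(pool) if pool else None

    def punish_draw(self, move):
        # A draw only removes the bead from compartment A.
        if move in self.a:
            self.a.remove(move)

    def punish_loss(self, move):
        # A loss removes the bead from wherever it remains.
        if move in self.a:
            self.a.remove(move)
        if move in self.b:
            self.b.remove(move)
```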