• Posted by Konstantin 29.11.2018

    By popular suggestion this text was cross-posted to Medium.

    Suppose you are starting a new data science project (which could either be a short analysis of one dataset, or a complex multi-year collaboration). How should you organize your workflow? Where do you put your data and code? What tools do you use and why? In general, what should you think about before diving head first into your data? In the software engineering industry such questions have some commonly known answers. Although every software company might have its unique traits and quirks, the core processes in most of them are based on the same established principles, practices and tools. These principles are described in textbooks and taught in universities.

    Data science is a less mature industry, and things are different. Although you can find a variety of template projects, articles, blog posts, discussions, or specialized platforms (open-source [1,2,3,4,5,6,7,8,9,10], commercial [11,12,13,14,15,16,17] and in-house [18,19,20]) to help you organize various parts of your workflow, there is no textbook yet to provide universally accepted answers. Every data scientist eventually develops their personal preferences, mostly learned from experience and mistakes. I am no exception. Over time I have developed my understanding of what a typical "data science project" is, how it should be structured, what tools to use, and what should be taken into account. I would like to share my vision in this post.

    The workflow

    Although data science projects can range widely in terms of their aims, scale, and technologies used, at a certain level of abstraction most of them could be implemented as the following workflow:

    Data science project workflow

    Colored boxes denote the key processes while icons are the respective inputs and outputs. Depending on the project, the focus may be on one process or another. Some of them may be rather complex, while others are trivial or missing. For example, scientific data analysis projects would often lack the "Deployment" and "Monitoring" components. Let us now consider each step one by one.

    Source data access

    Whether you are working on the human genome or playing with iris.csv, you typically have some notion of "raw source data" you start your project with. It may be a directory of *.csv files, a table in an SQL server or an HDFS cluster. The data may be fixed, constantly changing, automatically generated or streamed. It could be stored locally or in the cloud. In any case, your first step is to define access to the source data. Here are some examples of how this may look:

    • Your source data is provided as a set of *.csv files. You follow the cookiecutter-data-science approach, make a data/raw subdirectory in your project's root folder, and put all the files there. You create the docs/data.rst file, where you describe the meaning of your source data. (Note: the Cookiecutter-DataScience template actually recommends references/ as the place for data dictionaries, while I personally prefer docs. Not that it matters much).
    • Your source data is provided as a set of *.csv files. You set up an SQL server, create a schema named raw and import all your CSV files as separate tables. You create the docs/data.rst file, where you describe the meaning of your source data as well as the location and access procedures for the SQL server.
    • Your source data is a messy collection of genome sequence files, patient records, Excel files and Word documents, which may later grow in unpredicted ways. In addition, you know that you will need to query several external websites to receive extra information. You create an SQL database server in the cloud and import most of the tables from Excel files there. You create the data/raw directory in your project, put all the huge genome sequence files into the dna subdirectory. Some of the Excel files were too dirty to be imported into a database table, so you store them in data/raw/unprocessed directory along with the Word files. You create an Amazon S3 bucket and push your whole data/raw directory there using DVC. You create a Python package for querying the external websites. You create the docs/data.rst file, where you specify the location of the SQL server, the S3 bucket, the external websites, describe how to use DVC to download the data from S3 and the Python package to query the websites. You also describe, to the best of your understanding, the meaning and contents of all the Excel and Word files as well as the procedures to be taken when new data is added.
    • Your source data consists of constantly updated website logs. You set up the ELK stack and configure the website to stream all the new logs there. You create docs/data.rst, where you describe the contents of the log records as well as the information needed to access and configure the ELK stack.
    • Your source data consists of 100'000 colored images of size 128x128. You put all the images together into a single tensor of size 100'000 x 128 x 128 x 3 and save it in an HDF5 file images.h5. You create a Quilt data package and push it to your private Quilt repository. You create the docs/data.rst file, where you describe that in order to use the data it must first be pulled into the workspace via quilt install mypkg/images and then imported in code via from quilt.data.mypkg import images. (A short sketch of the HDF5 packing step follows this list.)
    • Your source data is a simulated dataset. You implement the dataset generation as a Python class and document its use in a README.txt file.
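
    For instance, here is a minimal sketch of how the image-tensor example above could be packed into an HDF5 file using h5py. The array and file names are illustrative placeholders, not a prescription:

      # Pack an (N, 128, 128, 3) uint8 image tensor into images.h5.
      import h5py
      import numpy as np

      # Stand-in data; in the real project this would be the full 100'000-image tensor.
      images = np.zeros((100, 128, 128, 3), dtype=np.uint8)

      with h5py.File("images.h5", "w") as f:
          f.create_dataset("images", data=images, compression="gzip")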

    In general, remember the following rules of thumb when setting up the source data:

    1. Whenever you can meaningfully store your source data in a conveniently queryable/indexable form (an SQL database, the ELK stack, an HDF5 file or a raster database), you should do it. Even if your source data is a single csv and you are reluctant to set up a server, do yourself a favor and import it into an SQLite file, for example. If your data is nice and clean, it can be as simple as:
      import sqlalchemy as sa
      import pandas as pd
      e = sa.create_engine("sqlite:///raw_data.sqlite")
      pd.read_csv("raw_data.csv").to_sql("raw_data", e)
    2. If you work in a team, make sure the data is easy to share. Use an NFS partition, an S3 bucket, a Git-LFS repository, a Quilt package, etc.
    3. Make sure your source data is always read-only and you have a backup copy.
    4. Take your time to document the meaning of all of your data as well as its location and access procedures.
    5. In general, take this step very seriously. Any mistake you make here, be it an invalid source file, a misunderstood feature name, or a misconfigured server, may cost you a lot of time and effort down the road.

    Data processing

    The aim of the data processing step is to turn the source data into a "clean" form, suitable for use in the following modeling stage. This "clean" form is, in most cases, a table of features, hence the gist of "data processing" often boils down to various forms of feature engineering. The core requirements here are to ensure that the feature engineering logic is maintainable, the target datasets are reproducible and, sometimes, that the whole pipeline is traceable to the source representations (otherwise you would not be able to deploy the model). All these requirements can be satisfied if the processing is organized as an explicitly described computation graph. There are different possibilities for implementing this graph, however. Here are some examples:

    • You follow the cookiecutter-data-science route and use Makefiles to describe the computation graph. Each step is implemented in a script, which takes some data file as input and produces a new data file as output, which you store in the data/interim or data/processed subdirectories of your project. You enjoy easy parallel computation via make -j <njobs>.
    • You rely on DVC rather than Makefiles to describe and execute the computation graph. The overall procedure is largely similar to the solution above, but you get some extra convenience features, such as easy sharing of the resulting files.
    • You use Luigi, Airflow or any other dedicated workflow management system instead of Makefiles to describe and execute the computation graph. Among other things this would typically let you observe the progress of your computations on a fancy web-based dashboard, integrate with a computing cluster's job queue, or provide some other tool-specific benefits (a minimal sketch of such a task graph follows these examples).
    • All of your source data is stored in an SQL database as a set of tables. You implement all of the feature extraction logic in terms of SQL views. In addition, you use SQL views to describe the samples of objects. You can then use these feature- and sample-views to create the final modeling datasets using auto-generated queries like
      select
         s.Key,
         v1.AverageTimeSpent,
         v1.NumberOfClicks,
         v2.Country,
         v3.Purchase as Target
      from vw_TrainSample s
      left join vw_BehaviourFeatures v1 on v1.Key = s.Key
      left join vw_ProfileFeatures v2 on v2.Key = s.Key
      left join vw_TargetFeatures v3 on v3.Key = s.Key

      This particular approach is extremely versatile, so let me expand on it a bit. Firstly, it lets you keep track of all the currently defined features easily without having to store them in huge data tables - the feature definitions are only kept as code until you actually query them. Secondly, the deployment of models to production becomes rather straightforward - assuming the live database uses the same schema, you only need to copy the respective views. Moreover, you may even compile all the feature definitions into a single query along with the final model prediction computation using a sequence of CTE statements:

      with _BehaviourFeatures as (
       ... inline the view definition ...
      ),
      _ProfileFeatures as (
       ... inline the view definition ...
      ),
      _ModelInputs as (
       ... concatenate the feature columns ...
      )
      select
           Key,
           1/(1.0 + exp(-1.2 + 2.1*Feature1 - 0.2*Feature3)) as Prob
      from _ModelInputs

      This technique has been implemented in one in-house data science workbench tool of my design (not publicly available so far, unfortunately) and provides a very streamlined workflow.

      Example of an SQL-based feature engineering pipeline
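
    To make the workflow-manager option above more concrete, here is a minimal sketch of what a two-step Luigi task graph might look like. The task names, file paths and the "amount" feature are made up for illustration, and the data/ subdirectories are assumed to exist:

      # A hypothetical two-step Luigi graph: clean the raw data, then build features.
      import luigi
      import numpy as np
      import pandas as pd

      class CleanData(luigi.Task):
          def output(self):
              return luigi.LocalTarget("data/interim/clean.csv")

          def run(self):
              df = pd.read_csv("data/raw/raw_data.csv")
              df.dropna().to_csv(self.output().path, index=False)

      class BuildFeatures(luigi.Task):
          def requires(self):
              return CleanData()

          def output(self):
              return luigi.LocalTarget("data/processed/features.csv")

          def run(self):
              df = pd.read_csv(self.input().path)
              df["log_amount"] = np.log1p(df["amount"])  # an illustrative feature
              df.to_csv(self.output().path, index=False)

      if __name__ == "__main__":
          # Running this builds the whole graph, re-executing only missing outputs.
          luigi.build([BuildFeatures()], local_scheduler=True)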

    No matter which way you choose, keep these points in mind:

    1. Always organize the processing in the form of a computation graph and keep reproducibility in mind.
    2. This is the place where you have to think about the compute infrastructure you may need. Do you plan to run long computations? Do you need to parallelize or rent a cluster? Would you benefit from a job queue with a management UI for tracking task execution?
    3. If you plan to deploy the models into production later on, make sure your system will support this use case out of the box. For example, if you are developing a model to be included in a Java Android app, yet you prefer to do your data science in Python, one possibility for avoiding a lot of hassle down the road would be to express all of your data processing in a specially designed DSL rather than free-form Python. This DSL may then be translated into Java or an intermediate format like PMML.
    4. Consider storing some metadata about your designed features or interim computations. This does not have to be complicated - you can save each feature column to a separate file, for example, or use Python function annotations to annotate each feature-generating function with a list of its outputs. If your project is long and involves several people designing features, having such a registry may end up quite useful.

    Modeling

    Once you are done cleaning your data, selecting appropriate samples and engineering useful features, you enter the realm of modeling. In some projects all of the modeling boils down to a single m.fit(X,y) command or a click of a button. In others it may involve weeks of iterations and experiments. Often you would start with modeling way back in the "feature engineering" stage, when you decide that the outputs of one model make for great features themselves, so the actual boundary between this step and the previous one is vague. Both steps should be reproducible and must be part of your computation graph. Both revolve around computing, sometimes involving job queues or clusters. Nonetheless, it still makes sense to consider the modeling step separately, because it tends to have a special need: experiment management. As before, let me explain what I mean by example.

    • You are training models for classifying Irises in the iris.csv dataset. You need to try out ten or so standard sklearn models, applying each with a number of different parameter values and testing different subsets of your handcrafted features. You do not have a proper computation graph or computing infrastructure set up - you just work in a single Jupyter notebook. You make sure, however, that the results of all training runs are saved in separate pickle files, which you can later analyze to select the best model.
    • You are designing a neural-network-based model for image classification. You use ModelDB (or an alternative experiment management tool, such as TensorBoard, Sacred, FGLab, Hyperdash, FloydHub, Comet.ML, DatMo, MLFlow, ...) to record the learning curves and the results of all the experiments in order to choose the best one later on.
    • You implement your whole pipeline using Makefiles (or DVC, or a workflow engine). Model training is just one of the steps in the computation graph, which outputs a model-<id>.pkl file, appends the model's final AUC score to a CSV file and creates a model-<id>.html report with a bunch of useful model performance charts for later evaluation.
    • This is how experiment management / model versioning looks in the UI of the custom workbench mentioned above:

      Experiment management

    The takeaway message: decide on how you plan to manage fitting multiple models with different hyperparameters and then selecting the best result. You do not have to rely on complex tools - sometimes even a manually updated Excel sheet works well, when used consistently. If you plan lengthy neural network training runs, however, do consider using a web-based dashboard. All the cool kids do it.
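
    As an illustration of the simplest end of that spectrum, here is a rough sketch of how each training run of a scikit-learn model could be logged to disk for later comparison. The directory layout and field names are just one possible choice, not a standard:

      # "Poor man's experiment tracking": one folder per run, holding the fitted
      # model and a small JSON summary of its parameters and cross-validation score.
      import json
      import time
      from pathlib import Path

      import joblib
      from sklearn.model_selection import cross_val_score

      def log_experiment(model, X, y, results_dir="experiments"):
          run_dir = Path(results_dir) / time.strftime("%Y%m%d-%H%M%S")
          run_dir.mkdir(parents=True, exist_ok=True)
          score = cross_val_score(model, X, y, cv=5).mean()
          model.fit(X, y)
          joblib.dump(model, run_dir / "model.pkl")
          (run_dir / "run.json").write_text(json.dumps({
              "model": type(model).__name__,
              "params": model.get_params(),
              "cv_score": score,
          }, default=str, indent=2))
          return score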

    Model deployment

    Unless your project is purely exploratory, chances are you will need to deploy your final model to production. Depending on the circumstances this can turn out to be a rather painful step, but careful planning will alleviate the pain. Here are some examples:

    • Your modeling pipeline spits out a pickle file with the trained model. All of your data access and feature engineering code was implemented as a set of Python functions. You need to deploy your model into a Python application. You create a Python package which includes the necessary function and the model pickle file as a file resource inside. You remember to test your code. The deployment procedure is a simple package installation followed by a run of integration tests.
    • Your pipeline spits out a pickle file with the trained model. To deploy the model you create a REST service using Flask, package it as a Docker container and serve it via your company's Kubernetes cloud. Alternatively, you upload the saved model to an S3 bucket and serve it via AWS Lambda. You make sure your deployment is tested. (A minimal sketch of such a Flask service follows this list.)
    • Your training pipeline produces a TensorFlow model. You use Tensorflow Serving (or any of the alternatives) to serve it as a REST service. You do not forget to create tests and run them every time you update the model.
    • Your pipeline produces a PMML file. Your Java application can read it using the JPMML library. You make sure that your PMML exporter includes model validation tests in the PMML file.
    • Your pipeline saves the model and the description of the preprocessing steps in a custom JSON format. To deploy it into your C# application you have developed a dedicated component which knows how to load and execute these JSON-encoded models. You make sure you have 100% test coverage of your model export code in Python, the model import code in C# and predictions of each new model you deploy.
    • Your pipeline compiles the model into an SQL query using SKompiler. You hard-code this query into your application. You remember about testing.
    • You train your models via a paid service, which also offers a way to publish them as REST (e.g. Azure ML Studio, YHat ScienceOps). You pay a lot of money, but you still test the deployment.
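
    For the REST-service option above, a bare-bones Flask wrapper might look roughly like this. The endpoint name, input format and model path are assumptions made purely for illustration:

      # A minimal scoring service: load a pickled model once, expose one endpoint.
      import pickle

      from flask import Flask, jsonify, request

      app = Flask(__name__)
      with open("model.pkl", "rb") as f:
          model = pickle.load(f)

      @app.route("/predict", methods=["POST"])
      def predict():
          # Expects a JSON payload like {"features": [[1.0, 2.0, ...], ...]}
          features = request.get_json()["features"]
          predictions = model.predict(features).tolist()
          return jsonify({"predictions": predictions})

      if __name__ == "__main__":
          app.run(host="0.0.0.0", port=5000)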

    Summarizing this:

    1. There are many ways a model can be deployed. Make sure you understand your circumstances and plan ahead. Will you need to deploy the model into a codebase written in a different language than the one you use to train it? If you decide to serve it via REST, what load should the service expect, and should it be capable of predicting in batches? If you plan to buy a service, estimate how much it will cost you. If you decide to use PMML, make sure it can support your expected preprocessing logic and that fancy Random Forest implementation you plan to use. If you used third-party data sources during training, think about whether you will need to integrate with them in production and how you will encode this access information in the model exported from your pipeline.
    2. As soon as you deploy your model to production, it turns from an artefact of data science into actual code, and should therefore be subject to all the requirements of application code. This means testing. Ideally, your deployment pipeline should produce both the model package for deployment and everything needed to test this model (e.g. sample data). It is not uncommon for the model to stop working correctly after being transferred from its birthplace to a production environment. It may be a bug in the export code, a mismatch in the version of pickle, or a wrong input conversion in the REST call. Unless you explicitly test the predictions of the deployed model for correctness, you risk running an invalid model without even knowing it. Everything would look fine, as it will keep predicting some values, just the wrong ones.

    Model monitoring

    Your data science project does not end when you deploy the model to production. The heat is still on. Maybe the distribution of inputs in your training set differs from real life. Maybe this distribution drifts slowly and the model needs to be retrained or recalibrated. Maybe the system does not work as you expected it to. Maybe you are into A/B testing. In any case you should set up the infrastructure to continuously collect data about model performance and monitor it. This typically means setting up a visualization dashboard, hence the primary example would be the following:

    • For every request to your model you save the inputs and the model's outputs to logstash or a database table (making sure you stay GDPR-compliant somehow). You set up Metabase (or Tableau, MyDBR, Grafana, etc.) and create reports which visualize the performance and calibration metrics of your model. (A tiny sketch of the logging half is given below.)
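
    The logging half of that setup can start out very simply, for example along the following lines. The record fields shown here are an illustrative minimum, not a standard:

      # Append one JSON line per scoring request; a dashboard tool can ingest this later.
      import datetime
      import json

      def predict_and_log(model, features, log_path="predictions.jsonl"):
          prediction = float(model.predict([features])[0])
          record = {
              "timestamp": datetime.datetime.utcnow().isoformat(),
              "inputs": features,
              "prediction": prediction,
          }
          with open(log_path, "a") as f:
              f.write(json.dumps(record) + "\n")
          return prediction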

    Exploration and reporting

    Throughout the life of a data science project you will constantly have to step away from the main modeling pipeline in order to explore the data, try out various hypotheses, or produce charts and reports. These tasks differ from the main pipeline in two important aspects.

    Firstly, most of them do not have to be reproducible. That is, you do not absolutely need to include them in the computation graph as you would with your data preprocessing and model fitting logic. You should always try to stick to reproducibility, of course - it is great when you have all the code to regenerate a given report from raw data, but there are still many cases where this hassle is unnecessary. Sometimes making some plots manually in Jupyter and pasting them into a PowerPoint presentation serves the purpose just fine - no need to overengineer.

    The second, genuinely problematic peculiarity of these "exploration" tasks is that they tend to be somewhat disorganized and unpredictable. One day you might need to analyze a curious outlier in the performance monitoring logs, the next day you want to test a new algorithm, and so on. If you do not decide on a suitable folder structure, your project directory will soon be filled with notebooks bearing weird names, and no one in the team will understand what is what. Over the years I have only found one more or less working solution to this problem: ordering subprojects by date. Namely:

    • You create a projects directory in your project folder. You agree that each "exploratory" project must create a folder named projects/YYYY-MM-DD - Subproject title, where YYYY-MM-DD is the date when the subproject was initiated. After a year of work your projects folder looks as follows:
      ./2017-01-19 - Training prototype/
                      (README, unsorted files)
      ./2017-01-25 - Planning slides/
                      (README, slides, images, notebook)
      ./2017-02-03 - LTV estimates/
                       README
                       tasks/
                         (another set of 
                          date-ordered subfolders)
      ./2017-02-10 - Cleanup script/
                       README
                       script.py
      ./... 50 folders more ...
      

      Note that you are free to organize the internals of each subproject as you deem necessary. In particular, it may even be a "data science project" in itself, with its own raw/processed data subfolders, its own Makefile-based computation graph, as well as its own subprojects (which I would tend to name tasks in this case). In any case, always document each subproject (have a README file at the very least). Sometimes it helps to also have a root projects/README.txt file, which briefly lists the meaning of each subproject directory.

      Eventually you may discover that the project list becomes too long, and decide to reorganize the projects directory. You compress some of them and move them to an archive folder. You regroup some related projects and move them to the tasks subdirectory of some parent project.

    Exploration tasks come in two flavors. Some tasks are truly one-shot analyses, which can be solved using a Jupyter notebook that will never be executed again. Others aim to produce reusable code (not to be confused with reproducible outputs). I find it important to establish some conventions for how the reusable code should be kept. For example, the convention may be to have a file named script.py in the subproject's root which outputs an argparse-based help message when executed. Or you may decide to require providing a run function, configured as a Celery task, so it can easily be submitted to the job queue. Or it could be something else - anything is fine, as long as it is consistent.
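
    As an example of the first convention, a subproject's script.py could look roughly like this. The task body is a made-up placeholder; only the argparse-based entry point is the point of the sketch:

      # Convention sketch: every reusable subproject exposes an argparse-based CLI.
      import argparse

      def run(input_path, output_path):
          # Placeholder task body: copy the input to the output location.
          with open(input_path) as src, open(output_path, "w") as dst:
              dst.write(src.read())

      if __name__ == "__main__":
          parser = argparse.ArgumentParser(description="Example subproject entry point")
          parser.add_argument("--input", required=True, help="path to the input data")
          parser.add_argument("--output", required=True, help="where to write the result")
          args = parser.parse_args()
          run(args.input, args.output)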

    The service checklist

    There is another, orthogonal perspective on the data science workflow, which I find useful. Namely, rather than speaking about it in terms of a pipeline of processes, we may instead discuss the key services that data science projects typically rely upon. This way you may describe your particular (or desired) setup by specifying how exactly each of the following nine key services should be provided:

    Data science services

    1. File storage. Your project must have a home. Often this home must be shared by the team. Is it a folder on a network drive? Is it a working folder of a Git repository? How do you organize its contents?
    2. Data services. How do you store and access your data? "Data" here includes your source data, intermediate results, access to third-party datasets, metadata, models and reports - essentially everything which is read by or written by a computer. Yes, keeping a bunch of HDF5 files is also an example of a "data service".
    3. Versioning. Code, data, models, reports and documentation - everything should be kept under some form of version control. Git for code? Quilt for data? DVC for models? Dropbox for reports? Wiki for documentation? While we are at it, do not forget to set up regular backups of everything.
    4. Metadata and documentation. How do you document your project or subprojects? Do you maintain any machine-readable metadata about your features, scripts, datasets or models?
    5. Interactive computing. Interactive computing is how most of the hard work is done in data science. Do you use JupyterLab, RStudio, ROOT, Octave or Matlab? Do you set up a cluster for interactive parallel computing (e.g. ipyparallel or dask)?
    6. Job queue & scheduler. How do you run your code? Do you use a job processing queue? Do you have the capability (or the need) to schedule regular maintenance tasks?
    7. Computation graph. How do you describe the computation graph and establish reproducibility? Makefiles? DVC? Airflow?
    8. Experiment manager. How do you collect, view and analyze model training progress and results? ModelDB? Hyperdash? FloydHub?
    9. Monitoring dashboard. How do you collect and track the performance of the model in production? Metabase? Tableau? PowerBI? Grafana?

    The tools

    To conclude and summarize the exposition, here is a small spreadsheet listing the tools mentioned in this post (as well as some extra ones I added or will add later on), categorized according to which stages of the data science workflow (in the terms defined in this post) they aim to support. Disclaimer: I did try out most, but not all of them. In particular, my understanding of the capabilities of the non-free solutions in the list is so far based only on their online demos or the descriptions on their sites.


  • Posted by Konstantin 28.06.2018

    I had to replace my SIM card today. The old one was mini-sized and I needed a micro-sized one. My previous SIM card stayed with me for about 20 years or so. I have gone through high school, university, and traveled around to countless places with it. I changed my phone several times, starting from an older, bulkier Nokia through several candy-bar Nokias and a couple of smartphones. My personal computer evolved from a 200MHz Pentium with 16MB RAM to a 2.8GHz quad-core with 32GB RAM. But my SIM card always stayed the same. Let me dedicate this post to the memory of this glorious tiny computer - the most long-lasting and reliable piece of computing equipment I have ever used.

    My retiring SIM card. The mobile operator changed its name from Radiolinja to Elisa in 2000, but the SIM card did not care.

    What is a SIM card?

    Your SIM card is just another smart card, technologically similar to any of the "chip-and-pin" bank cards in your wallet, your ID card and even your contactless public transport tickets (which communicate using the same set of protocols, just wirelessly). Note, however, that banks started to switch from magnetic to "chip-and-pin" credit cards and countries began issuing smart-card-based identity documents only about 15 years ago or so. SIM cards, on the other hand, had been in wide use much earlier, long before the term "smart card" became a popularized buzzword.

    All smart cards are built to serve one primary purpose - identification. The card stores a secret key (which is impossible to retrieve without disassembling the card using ultra-high-tech lab equipment), and provides a way to execute the following challenge-response protocol: you send a random string (a challenge) to the card. The card encrypts the string using the stored key and returns the response. The correctness of this response can now be verified by a second party (e.g. the mobile operator, the bank or a website using ID-card authentication). The actual details of the computation differ among the various smart cards. For example, some use symmetric cryptography while others use asymmetric. Some cards provide additional services, such as creating digital signatures or storing information on the card. Nonetheless, identification is always the core function. This explains the name of the SIM card: Subscriber Identity Module. It is a module which identifies you (the subscriber) to the network provider.

    What is inside a SIM card?

    The best way to understand what is inside a SIM (or any other type of smart-) card is to connect it to your computer and "talk" to it. Many laptops have integrated smart-card readers nowadays, so if you find a suitable frame for your nano/micro/mini-SIM, you may simply put it into the reader just as you would do with an ID or a bank card.

    Old mini-SIM in a frame from a newer card

    Note, though, that if your frame is flimsy and does not fit perfectly (as is mine in the photo above), you run the risk of losing the chip somewhere in the depths of the card reader slot before you even manage to slide it in completely. Hence, in my case, a tiny cross-style USB card reader was a more convenient and reliable option:

    SIM in a frame in a reader

    Plug it in, wait until the system recognizes the device, and we are ready to talk. There are many tools for "talking" to the card on a low level, but one of the best for the purposes of educational exploration is, in my opinion, the program called CardPeek. Let us fire it up. At startup it asks for the card reader to use (note that you can use it to explore both contact and contactless cards, if you have the necessary reader):

    Select reader screen

    We can now click "Analyzer → GSM SIM", provide the PIN, wait a bit, and have the program extract a wealth of information stored on the card:

    SIM card analyzed

    Fun, right? Let us now see where all this data came from and what it actually means.

    How does it work?

    At the hardware level, a smart card has a very simple connector with six active pins, which are designed for sending and receiving data to and from the card:

    Smart card connectors (source)

    Four pins (VCC, GND, CLK, I/O) are used for the basic half-duplex synchronous serial communication. RST is the reset pin, and VPP is used for supplying the higher voltage when (re)programming the card. When you first connect to the card, the protocol requires you to zero the "reset" pin, to which the card replies by sending a fixed sequence of bytes identifying the card and its capabilities. This sequence is known as the card's ATR ("Answer To Reset") string. You can see this string listed as the "cold ATR" entry in the screenshot named "SIM card analyzed" above.

    Besides providing pre-packaged analysis scripts for various kinds of smart cards, CardPeek allows you to send custom commands to the card using its Lua scripting interface directly. This is what the "Command" text input at the bottom of the screen is for. Let us now switch to the "logs" tab and try sending some commands. First of all, we need to establish the connection to the card. This is done via the card.connect() call:

    card.connect() example

    The ATR string is received by the reader when the connection is first established. We can obtain it via card.last_atr() and print it out to the log window in hex-encoded form using log.print and bytes:format (the documentation for all these APIs is available here):

    ATR example

    As we see, the ATR for my card happens to be 3BBA9400401447473352533731365320 (in hex). If you search the web, you will find that this particular ATR is known to be a signature of the Elisa SIM cards. It is not a random string, though, and every byte has a meaning (a small parsing sketch follows the list below). In particular:

    • 3B is the "initial byte", and this particular value identifies the smart card as a SIM card.
    • BA is the "format" byte. Its first four bits (1011) tell us that we have to expect fields TA1, TB1 and TD1 to follow. The last four bits denote the number 10 - the number of "historical bytes" at the end of the ATR.
    • 94 is the field TA1, specifying the clock rate of the serial protocol
    • 00 is the field TB1, specifying the programming voltage (apparently, the card is not re-programmable)
    • 40 tells us that we have to read out another byte field TC2 (this is in the left-side part of the byte, 4 ) and that the card uses T=0 protocol (this is in the right-side part, 0).
    • 14 is the promised TC2 field (not sure what it is meant for),
    • the last 10 bytes are the "historical bytes", providing card manufacturer-specific information.
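
    For the curious, the fields above can be teased out of the ATR with a few lines of plain Python (this is not part of CardPeek, just an illustration of the byte layout; it decodes only the parts discussed in the list):

      # Decode the leading bytes of the ATR discussed above.
      atr = bytes.fromhex("3BBA9400401447473352533731365320")

      ts, t0 = atr[0], atr[1]
      interface_bytes = {"TA1": 0x10, "TB1": 0x20, "TC1": 0x40, "TD1": 0x80}
      present = [name for name, mask in interface_bytes.items() if t0 & mask]
      n_historical = t0 & 0x0F

      print("Initial byte:", hex(ts))                      # 0x3b
      print("Fields present after T0:", present)           # ['TA1', 'TB1', 'TD1']
      print("Number of historical bytes:", n_historical)   # 10
      print("Historical bytes:", atr[-n_historical:].hex())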

    Greeting your Card

    Now that we are connected, we can send various commands to the card. Let us proceed by example. The first command we might want to send is "VERIFY CHV", which is essentially greeting the card by providing our PIN1 code ("CHV" stands for "Card Holder Verification").

    VERIFY CHV (source)

    Every smart card command starts with a two-byte identifier (for example, A0 20 is the (hex) identifier of the VERIFY CHV command). It is followed by two parameter bytes P1 and P2. For example, the parameter P1 for VERIFY CHV is always 0, and P2 must indicate the number of the PIN we are submitting (i.e. 1 for PIN1, 2 for PIN2). Next comes P3 - a byte, specifying the length of the data which follows. For VERIFY CHV the data is the provided PIN itself, and it is always 8 bytes long. If the PIN is shorter than 8 bytes, it must be padded by bytes FF. The PIN itself is encoded in plain ASCII (i.e. 1234 would be 31 32 33 34).

    Now, supposing my PIN1 is, in fact "1234", I can authenticate myself to the card via CardPeek as follows:

    sw, resp = card.send(bytes.new(8, "A0 20 00 01 08 31 32 33 34 FF FF FF FF"))

    Here, card.send is the sending command and bytes.new(8, ...) constructs an array of 8-bit bytes from a hex string (see CardPeek reference).

    The sw and resp are the two components of the T=0 protocol response. For VERIFY CHV we only care that sw is equal to 9000, which means "OK". Note how this is printed in the log.

    VERIFY CHV example

    Beware that if you do not receive a 9000 response, it means that the card denied your PIN for some reason. Trying to submit a wrong PIN three times in a row will block the card.
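
    If you prefer to assemble such commands programmatically rather than typing hex by hand, the encoding described above is easy to reproduce in a few lines of Python (outside CardPeek, purely as an illustration of the byte layout):

      # Build the VERIFY CHV APDU: A0 20, P1=00, P2=PIN number, P3=08, followed by
      # the ASCII-encoded PIN padded to 8 bytes with FF.
      def verify_chv_apdu(pin, chv_number=1):
          if not 4 <= len(pin) <= 8:
              raise ValueError("a PIN is 4 to 8 characters long")
          data = pin.encode("ascii").ljust(8, b"\xFF")
          apdu = bytes([0xA0, 0x20, 0x00, chv_number, 0x08]) + data
          return " ".join(f"{b:02X}" for b in apdu)

      print(verify_chv_apdu("1234"))  # A0 20 00 01 08 31 32 33 34 FF FF FF FF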

    Reading the Data

    Now that we have identified ourselves to the card, let us try to read the data stored on it. The data on the card is organized in a hierarchy of files. It is this exact hierarchy that you can observe as the output of the "Analyzer" script. The root file is called "MF"; it has the ICCID, TELECOM and GSM sub-files, which, in turn, have a number of predefined sub-files themselves, etc. The names are just conventions - the card itself uses two-byte identifiers for each file. For example, "MF" is actually 3F 00, "TELECOM" is 7F 10, etc. While the card is connected you can navigate around the file structure just like you would do in a normal operating system using the cd command, except that in smart-card lingo the corresponding command is called SELECT.

    The binary form of the SELECT command is A0 A4 00 00 02 {x} {y}, where {x} {y} is the file identifier. Just like before, A0 A4 is the command code, 00 00 are the ignored P1 and P2 parameters, and 02 tells us that exactly two bytes must follow.

    Consequently, if we wanted to select the file "MF (3f 00)/TELECOM (7F 10)/ADN (6F 3A)", which contains the address book records, we could achieve it by sending the following sequence of commands via CardPeek:

    card.send(bytes.new(8, "A0 A4 00 00 02 3F 00"))
    card.send(bytes.new(8, "A0 A4 00 00 02 7F 10"))
    card.send(bytes.new(8, "A0 A4 00 00 02 6F 3A"))

    Selecting files is a common task, and CardPeek provides a convenient shorthand: card.select("#7f10"). For some cards (not mine, unfortunately) it should also be possible to do things like card.select("/7f10/6f3a").

    Once we have selected the ADN ("Abbreviated Dialing Numbers") file, we may read out the individual phone numbers from it using the READ RECORD command. The procedure is complicated by the fact that READ RECORD needs to be provided the "record size" as one of its parameters, which, in turn, must be taken from the response data of the last SELECT command, and this must be obtained via the GET RESPONSE command. The complete example would therefore be:

    -- Select /TELECOM/ADN
    card.select("#7F10")
    sw, resp = card.select("#6F3A")
    
    -- Read file metadata
    GET_RESPONSE = "A0 C0 00 00"
    sw, resp = card.send(bytes.new(8, GET_RESPONSE, bit.AND(sw, 0xff)))
    
    -- Read out first record in the file
    sw, resp = card.read_record('.', 1, resp:get(14))
    
    -- Print the record to the log
    log.print(log.INFO, resp:format("%P"))

    Reading out a phone record

    Note that instead of printing the output to the log via log.print you could also show a message box:

    ui.readline(resp:format("%P"))

    or append a new node to the tree in the "card view" tab:

    nodes.append(nodes.root(), {classname="block", 
                                label="Name", 
                                val=resp:format("%P"), 
                                size=#resp})
    

    In fact, at this point you should go and read the script $HOME/.cardpeek/scripts/gsm (beta).lua. It contains the code we ran at the beginning of this post to analyze the card. The script simply sends the relevant commands to the card and appends all responses as nodes to the tree.

    Authentication

    While data storage is a useful capability of a SIM card, its main purpose is subscriber authentication. Thus, our acquaintance with the SIM card would be incomplete without checking the corresponding function out as well. It is quite simple:

    RUN GSM ALGORITHM (source)

    That is, the process is the following: we send the byte sequence A0 88 00 00 10, followed by a 16 byte-long challenge string (which is normally given by the mobile operator when the phone joins the network). The SIM card responds with 12 bytes, of which the first 4 we should send back to the mobile operator for verification, and use the remaining 8 as a cipher key.
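
    In plain Python terms, just to make the byte layout explicit (this is not part of the CardPeek script):

      # The 12-byte reply splits into the 4-byte signed response (sent back to the
      # operator for verification) and the 8-byte cipher key.
      def split_gsm_response(resp):
          assert len(resp) == 12
          sres, kc = resp[:4], resp[4:]
          return sres, kc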

    Before we can use the RUN GSM ALGORITHM command we need to verify our PIN1 (already done) and SELECT the GSM (7F 20) file:

    RUN_GSM_ALGO = "A0 88 00 00 10"
    GET_RESPONSE = "A0 C0 00 00"
    DF_GSM = "#7F20"
    
    CHALLENGE_STRING = "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00"
    
    -- Select GSM file
    card.select(DF_GSM)
    
    -- Send the challenge data
    sw, resp = card.send(bytes.new(8, RUN_GSM_ALGO, CHALLENGE_STRING))
    
    -- Read 12-byte-long response
    sw, RESPONSE_STRING = card.send(bytes.new(8, GET_RESPONSE, 12))
    log.print(log.INFO, RESPONSE_STRING:format("%D"))

    RUN GSM ALGORITHM example

    And that is it. I hope you learned to understand and appreciate your SIM card a bit more today.


  • Posted by Konstantin 06.12.2017

    Early stopping is a technique that is very often used when training neural networks, as well as with some other iterative machine learning algorithms. The idea is quite intuitive - let us measure the performance of our model on a separate validation dataset during the training iterations. We may then observe that, despite constant score improvements on the training data, the model's performance on the validation dataset would only improve during the first stage of training, reach an optimum at some point and then start getting worse with further iterations.

    The early stopping principle

    It thus seems reasonable to stop training at the point when the minimal validation error is achieved. Training the model any further only leads to overfitting. Right? The reasoning sounds solid and, indeed, early stopping is often claimed to improve generalization in practice. Most people seem to take the benefit of the technique for granted. In this post I would like to introduce some skepticism into this view or at least illustrate that things are not necessarily as obvious as they may seem from the diagram with the two lines above.

    How does Early Stopping Work?

    To get a better feeling of what early stopping actually does, let us examine its application to a very simple "machine learning model" - the estimation of the mean. Namely, suppose we are given a sample of 50 points \mathbf{x}_i from a normal distribution with unit covariance and we need to estimate the mean \mathbf{w} of this distribution.

    Sample

    The maximum likelihood estimate of \mathbf{w} can be found as the point which has the smallest sum of squared distances to all the points in the sample. In other words, "model fitting" boils down to finding the minimum of the following objective function:

        \[f_\mathrm{train}(\mathbf{w}) := \sum_{i=1}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2\]

    As our estimate is based on a finite sample, it, of course, won't necessarily be exactly equal to the true mean of the distribution, which I chose in this particular example to be exactly (0,0):

    Sample mean as a minimum of the objective function

    The circles in the illustration above are the contours of the objective function, which, as you might guess, is a paraboloid bowl. The red dot marks its bottom and is thus the solution to our optimization problem, i.e. the estimate of the mean we are looking for. We may find this solution in various ways. For example, a natural closed-form analytical solution is simply the mean of the training set. For our purposes, however, we will be using the gradient descent iterative optimization algorithm. It is also quite straightforward: start with any point (we'll pick (-0.5, 0) for concreteness' sake) and descend in small steps downwards until we reach the bottom of the bowl:

    Gradient descent

    Let us now introduce early stopping into the fitting process. We will split our 50 points randomly into two separate sets: 40 points will be used to fit the model and 10 will form the early stopping validation set. Thus, technically, we now have two different objective functions to deal with:

        \[f_\mathrm{fit}(\mathbf{w}) := \sum_{i=1}^{40} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2\]

    and

        \[f_\mathrm{stop}(\mathbf{w}) := \sum_{i=41}^{50} \Vert \mathbf{x}_i - \mathbf{w}\Vert^2.\]

    Each of those defines its own "paraboloid bowl", both slightly different from the original one (because those are different subsets of data):

    Fitting and early stopping objectives

    As our algorithm descends towards the red point, we will be tracking the value of f_\mathrm{stop} at each step along the way:

    Gradient descent with validation

    With a bit of imagination you should see in the image above how the validation error decreases as the yellow trajectory approaches the purple dot, and then starts to increase after some point midway. The spot where the validation error achieves the minimum (and thus the result of the early stopping algorithm) is shown by the green dot on the figure below:

    Early stopping

    In a sense, the validation function now acts as a kind of a "guardian", preventing the optimization from converging towards the bottom of our main objective. The algorithm is forced to settle on a model, which is neither an optimum of f_\mathrm{fit} nor of f_\mathrm{stop}. Moreover, both f_\mathrm{fit} and f_\mathrm{stop} use less data than f_\mathrm{train}, and are thus inherently a worse representation of the problem altogether.

    So, by applying early stopping we effectively reduced our training set size, used an even less reliable dataset to abort training, and settled on a solution which is not an optimum of anything at all. Sounds rather stupid, doesn't it?

    Indeed, observe the distribution of the estimates found with (blue) and without (red) early stopping in repeated experiments (each time with a new random dataset):

    Solutions found with and without early stopping

    As we see, early stopping greatly increases the variance of the estimate and adds a small bias towards our optimization starting point.
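
    If you would like to reproduce this at home, the whole experiment fits into a short numpy script along the following lines. The step size, iteration count and number of repetitions are arbitrary choices, not the exact settings used for the plots:

      # Estimate a 2-D mean by gradient descent, with and without early stopping,
      # and compare the squared errors of the two estimates over repeated samples.
      import numpy as np

      rng = np.random.default_rng(0)

      def fit(points, start=(-0.5, 0.0), lr=0.02, steps=300, val=None):
          w = np.array(start)
          best_w, best_val_err = w.copy(), np.inf
          for _ in range(steps):
              w = w - lr * 2 * (w - points.mean(axis=0))   # gradient step on the mean squared distance
              if val is not None:                          # early stopping bookkeeping
                  val_err = ((val - w) ** 2).sum()
                  if val_err < best_val_err:
                      best_val_err, best_w = val_err, w.copy()
          return best_w if val is not None else w

      errors = {"plain": [], "early stopping": []}
      for _ in range(1000):
          sample = rng.normal(size=(50, 2))                # true mean is (0, 0)
          errors["plain"].append((fit(sample) ** 2).sum())
          errors["early stopping"].append((fit(sample[:40], val=sample[40:]) ** 2).sum())

      for name, errs in errors.items():
          print(name, np.mean(errs))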

    Finally, let us see how the quality of the fit depends on the size of the validation set:

    Fit quality vs validation set size

    Here the y axis shows the squared distance of the estimated point from the true value (0,0); smaller is better (the dashed line is the expected distance of a randomly picked point from the data). The x axis shows all possible sizes of the validation set. We see that using no early stopping at all (x=0) results in the best expected fit. If we do decide to use early stopping, then for best results we should split the data approximately equally into training and validation sets. Interestingly, there does not seem to be much difference in whether we pick 30%, 50% or 70% of the data for the validation set - the validation set seems to play just as much of a role in the final estimate as the training data.

    Early Stopping with Non-convex Objectives

    The experiment above seems to demonstrate that early stopping should be almost certainly useless (if not harmful) for fitting simple convex models. However, it is never used with such models in practice. Instead, it is most often applied to the training of multilayer neural networks. Could it be the case that the method somehow becomes useful when the objective is highly non-convex? Let us run a small experiment, measuring the benefits of early stopping for fitting a convolutional neural network on the MNIST dataset. For simplicity, I took the standard example from the Keras codebase and modified it slightly. Here is the result we get when training the most basic model:

    MNIST - Basic

    The y axis depicts log-loss on the 10k MNIST test set, the x axis shows the proportion of the 60k MNIST training set set aside for early stopping. Ignoring small random measurement noise, we may observe that using early stopping with about 10% of the training data does seem to convey a benefit. Thus, contrary to our previous primitive example, when the objective is complex, early stopping does work as a regularization method. Why and how does it work here? Here's one intuition I find believable (there are alternative possible explanations and measurements, none of which I find too convincing or clear, though): stopping the training early prevents the algorithm from walking too far away from the initial parameter values. This limits the overall space of models and is vaguely analogous to suppressing the norm of the parameter vector. In other words, early stopping resembles an ad-hoc version of \ell_p regularization.

    Indeed, observe how the use of early stopping affects the results of fitting the same model with a small \ell_2-penalty added to the objective:

    MNIST - L2

    All of the benefits of early stopping are gone now, and the baseline (non-early-stopped, \ell_2-regularized) model is actually better overall than it was before. Let us now try an even more heavily regularized model by adding dropout (instead of the \ell_2 penalty), as is customary for deep neural networks. We can observe an even cleaner result:

    MNIST - Dropout

    Early stopping is again not useful at all, and the overall model is better than all of our previous attempts.

    Conclusion: Do We Need Early Stopping?

    Given the reasoning and the anecdotal experimental evidence above, I personally tend to think that beliefs in the usefulness of early stopping (in the context of neural network training) may well be overrated. Even if it may improve generalization for some nonlinear models, you would most probably achieve the same effect more reliably using other regularization techniques, such as dropout or a simple \ell_2 penalty.

    Note, though, that there is a difference between early stopping in the context of neural networks and, say, boosting models. In the latter case early stopping is actually more explicitly limiting the complexity of the final model and, I suspect, might have a much more meaningful effect. At least we can't directly carry over the experimental examples and results in this blog post to that case.

    Also note that, no matter whether early stopping helps or harms the generalization of the trained model, it is still a useful heuristic for automatically stopping a lengthy training process when we simply need results that are good enough.

     


  • Posted by Konstantin 09.07.2017

    The Dark Side of the Bitcoin

    Recall that Bitcoin is a currency, i.e. a technology which aims to provide a store of value along with a payment medium. With all due respect to its steadily growing adoption, it would be fair to note that it is currently not very good at fulfilling either of these two functions. Firstly, it is not a very reliable store of value due to the extreme volatility of its price. Secondly, and most importantly, it is a mediocre payment medium because it is slow and expensive.

    A typical transfer costs around $2 nowadays and takes about an hour for a full confirmation (or longer, if you pay a smaller fee). When you need to transfer a million dollars, this looks like a reasonable deal. When you buy a chocolate bar at a grocery store (something one probably does more often than transferring a million), it is unacceptable. Any plain old bank's payment card would offer a faster and cheaper solution, which is ironic, given that Bitcoin was meant to be all friendly, distributed and free (as in freedom) while banks are, as we all know, evil empires hungry for our money, flesh and souls.

    The irony does not end here. The evil banks typically provide some useful services in exchange for the fees they collect, such as an online self-service portal, 24h support personnel, cash handling and ATMs, some security guarantees, interest on deposits, etc. The friendly Bitcoin offers nothing of this kind. What is Bitcoin wasting our money on then? Electricity, mainly! The Proof of Work (PoW) algorithm employed in Bitcoin's blockchain requires the computation of quintillions of random, meaningless hashes to "confirm" payments. The "miner" nodes running Bitcoin's network are collectively performing more than 5 000 000 000 000 000 000 (five quintillion or five exa-) hash computations every second, continuously consuming as much electricity as the whole country of Turkmenistan. The situation is even worse if you consider that Bitcoin is just one of many "coins" built upon the PoW algorithm (Ethereum and Litecoin being the two other prominent examples), and their overall power consumption is only growing by the day.

    Just think of it: most of the $2 fee a Bitcoin user needs to pay for a transaction will neither end up as someone's wage nor make a return on investment in someone's pocket. Instead, it will burn up in fossil fuels which generate power for the "miners", wasting precious resources of our planet, contributing to global warming and pushing poor polar bears faster towards extinction. Is all this mayhem at least a "necessary evil"? Sadly, it is not.

    The Unnecessary Evil

    Formally speaking, Proof of Work is an algorithm for achieving consensus among a distributed set of nodes which collectively maintain a common blockchain. Is it the only such algorithm? Of course not! Many alternative methods exist, most of them (if not all) are both faster and less energy-hungry. In fact, the only valuable property of PoW is its ingenious simplicity. In terms of implementation it may very well be among the simplest distributed blockchain consensus algorithms ever to be invented.

    It is natural that a successful pioneering technology (such as Bitcoin) is originally built from simple blocks. Progress comes in small steps and you cannot innovate on all fronts at once, after all. There must come a time, however, when the limitations of the initially chosen basic blocks become apparent and the technology gets upgraded to something more efficient. With more than a billion dollars in electricity bills paid by Bitcoin users last year for the inefficiency of PoW, Bitcoin has long passed this turning point, in my opinion.

    Unfortunately, due to its pioneering status, enormous inertia, ongoing hype and the high stakes involved, Bitcoin continues to roll on its old wooden proof-of-work wheels with no improvement in sight, somehow still being perceived as the leader in the brave new world of cryptocurrencies.

    Are nearly-instant and nearly-free payments, along with energy efficiency, too much to ask of a real "currency of the future"? I do not think so. In fact, Bitcoin could be such a currency, if only it could switch from the evil Proof of Work to a different, fast and eco-friendly consensus algorithm.

    Which algorithm could it be? Let me offer you an overview of some of the current options I am personally aware of, so you could decide for yourself.

    The Eco-Friendly Blockchain Consensus

    Consider a network of many nodes, which needs to maintain a common state for a chain of blocks. There seem to be roughly three general categories of algorithms which the nodes could employ for their purpose: Proof of Authority (PoA), Nakamoto Consensus, and Byzantine Fault Tolerance (BFT). Let us consider them in order.

    Proof of Authority

    Perhaps the most straightforward solution would be to nominate a fixed subset of nodes as "authoritative", and let any of them append new blocks by signing them cryptographically. To avoid conflicting updates, nodes may agree on a predefined round-robin signing order, honestly randomize their waiting intervals, or use some kind of a deterministic lottery for selecting the signer of the next block, etc.

    As this approach relies on a fixed subset of (reasonably) trusted nodes, it does not look robust and secure enough for a proper worldwide distributed blockchain. For example, in the limit case of a single trusted party it is equivalent to using a single service provider such as a bank. Nonetheless, it is a convenient baseline and an important primitive, applicable to a wide range of real-life blockchain deployments. By relying on a set of well-behaving parties, a PoA blockchain sidesteps most of the complexities of a real distributed algorithm, and can thus be made to perform much faster than any of the "truly distributed" algorithms.

    The Ethereum software provides an implementation of this approach for those who want to run private chains. PeerCoin relies on the PoA principle by having "checkpoint blocks" signed regularly by a trusted authority. Finally, the Delegated Proof of Stake algorithm makes PoA work on a larger scale by relying on voting. It is probably one of the most interesting practical implementations of the idea.

    Delegated Proof of Stake

    Delegated Proof of Stake (DPoS) is a consensus algorithm implemented in Graphene-based blockchains (BitShares, Steem, EOS). It is a variant of Proof of Authority, where the small set of authoritative delegate nodes is elected by voting. When electing the delegates, each node can cast a number of votes proportional to its account value (or "stakeholder share"), thus "delegating its stake in the network". The elected authorities then participate in a simple and fast round-robin block confirmation, with each node given a two-second window for confirming the next block.

    The security of DPoS hinges on the assumption that the nodes with the most stake in the system should generally manage to elect a set of reasonable authorities, and in case of errors, the misbehaving authorities will not cause too much trouble and will be quickly voted out. At the same time, being internally a PoA implementation, DPoS-based blockchains are an order of magnitude faster in terms of transaction throughput than any other currently running public blockchains. Notably, they can also naturally support fee-less transactions.

    Nakamoto Consensus

    Consider a variation of PoA where there are no pre-selected trusted nodes (i.e. all nodes may participate in the algorithm). Each time a new block needs to be added to the chain, let us pick the node that will gain the right to add it according to some deterministic "lottery" system. Consensus can then be achieved by simply verifying that the resulting blockchain conforms to the lottery rules at all times, and conflicting chains are resolved by always preferring the "harder" chain (according to some notion of "hardness").

    The infamous Proof of Work is one example of such a method. The "lottery" here is based on the ability of a node to find a suitable nonce value, and the "hardness" is simply the length of the chain. Such "lottery" methods are sometimes referred to as "Nakamoto consensus algorithms". In terms of efficiency, Nakamoto consensus algorithms are among the slowest consensus algorithms.

    Several alternatives to the "PoW lottery" have been proposed. Let us review some of them.

    Proof of Stake

    Proof of Stake (PoS), first implemented in the Nxt cryptocurrency, is a Nakamoto consensus technique, where the nodes with a greater balance on their account are given a higher chance to "win the lottery" and sign the next block. The actual technique used in Nxt is the following: before signing a block every node obtains a pseudo-random "lottery ticket number" x by hashing the last block data with its own identifier. If this number is smaller than

        \[\alpha \cdot \text{(account balance)}\cdot \text{(time since last block)},\]

    (where \alpha is a block-specific constant), the node gets the right to sign the next block. The higher the node's balance, the higher the probability it will get a chance to sign. The rationale is that nodes with larger balances have more at stake, are more motivated to behave honestly, and thus need to be given more opportunities to participate in generating the blockchain.
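
    To make the lottery concrete, here is a minimal sketch of such an eligibility check, roughly in the spirit of the Nxt rule above (this is not Nxt's actual implementation; the hash function, field names and the value of \alpha are arbitrary):

        import hashlib

        def lottery_ticket(last_block_hash, node_id):
            # A pseudo-random ticket: hash of the last block's data together with the node's identifier.
            digest = hashlib.sha256(last_block_hash + node_id.encode()).hexdigest()
            return int(digest, 16)

        def may_sign(last_block_hash, node_id, balance, seconds_since_last_block, alpha):
            # The node wins the right to sign if its ticket falls below a target that grows
            # with the account balance and with the time elapsed since the last block.
            target = alpha * balance * seconds_since_last_block
            return lottery_ticket(last_block_hash, node_id) < target

        # Example call; all numbers are arbitrary. A larger balance makes the check easier to pass.
        print(may_sign(b"last block data", "node-42", balance=10_000, seconds_since_last_block=30, alpha=10**70))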

    Proof of Stake is typically considered the primary alternative to Proof of Work without all the wasteful computation, and it should, in principle, be possible to transition a whole blockchain from the latter to the former. In fact, this is what may well happen to Ethereum eventually.

    Proof of Space

    In Proof of Space (PoSpace), a consensus mechanism implemented in Burstcoin, the "miners" must first pre-generate a set of "lottery ticket numbers" in a particular manner for themselves, save these numbers on a hard drive and commit the hash (the Merkle tree root) of this complete ticket set to the blockchain. Then, similarly to Proof of Stake, by hashing the last block's data, a miner deterministically picks one of his own "lottery tickets" for the next block. If the value of this ticket, discounted by the number of tickets in possession, is small enough, the miner gets the right to sign the block. The more tickets a miner generates and stores, the better his chances. When signing the block, the miner must present a couple of special hashes which he can only know if he constantly stores his complete set of tickets (or fully recomputes a large part of it every time, which is impractical). Consequently, instead of spending energy on the "mining" process, the nodes must constantly dedicate a certain amount of disk space to the algorithm.
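
    A heavily simplified sketch of this lottery might look as follows (this is not Burstcoin's actual plotting algorithm; the commitment is a plain hash instead of a Merkle root, and all parameters are made up):

        import hashlib

        def make_tickets(miner_secret, n_tickets):
            # Pre-generate the "plot": a deterministic set of lottery tickets kept on disk.
            return [hashlib.sha256(miner_secret + i.to_bytes(8, "big")).digest()
                    for i in range(n_tickets)]

        def commitment(tickets):
            # Stand-in for the Merkle-tree root that gets committed to the blockchain.
            return hashlib.sha256(b"".join(tickets)).hexdigest()

        def may_sign(last_block_hash, tickets, difficulty):
            # The last block deterministically picks which stored ticket is used this round;
            # its value, discounted by the number of tickets stored, must be small enough.
            index = int.from_bytes(hashlib.sha256(last_block_hash).digest(), "big") % len(tickets)
            value = int.from_bytes(tickets[index], "big") // len(tickets)
            return value < difficulty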

    Although it is probably among the less widely known methods, from both a technical and a practical standpoint it is, in my opinion, one of the most interesting techniques. Note how it combines the properties of PoS (speed and energy efficiency) with those of PoW (ownership of a real-world resource as a proxy for decentralization).

    Proof of Burn

    The idea behind Proof of Burn is to allow the nodes to generate their "lottery ticket numbers" by irretrievably transferring some coins to a nonexistent address and taking the hash of the resulting transaction. This hash, scaled by the amount of coins burned, can then be used to gain the right to sign blocks just like in other Nakamoto lottery systems. The act of wasting coins is meant to be a virtual analogue of spending electricity on PoW mining, without actually spending it. Blockchains based purely on Proof of Burn do not seem to exist at the moment; however, the technique can be used alongside PoW, PoS or other approaches.

    Proof of Elapsed Time

    Presumably, some Intel processors have specialized instructions for emitting signed tokens, which prove that a given process called a particular function a certain period of time ago. The Hyperledger project proposes to build a consensus algorithm around those. Each "miner" will gain the right to sign a block after it waits for a certain period of time. The token which proves that the miner did in fact wait the allotted time would act as a winning lottery ticket. I do not see how this method could work outside of a trusted Intel-only environment or how it is better than a trivialized Proof of Stake (I am not sure I even understood the idea correctly), but I could not help mentioning it here for completeness' sake.

    Hybrid Nakamoto Consensus Systems

    Some systems interleave PoW and PoS confirmations, or add PoA signatures from time to time to lock the chain or speed up block confirmations. In fact, it is not too hard to invent nearly arbitrary combinations of delegation, voting, payments, authorities and lotteries.

    Byzantine Fault Tolerance

    The Practical Byzantine Fault Tolerance (PBFT) algorithm offers an alternative solution to the consensus problem. Here the blockchain state is tracked by a set of "bookkeeping" nodes, which constantly broadcast all changes among themselves and consider a change reliably replicated when it is signed and confirmed by a given quorum (e.g. 2/3) of the bookkeepers. Algorithms of this type can be shown to be reliable if no more than a third of the nodes are dishonest. Ripple, Stellar and Antshares are examples of blockchains based on such techniques. This approach allows much higher transaction throughput than Nakamoto consensus (PoW, PoS, PoSpace), yet it still lags behind the speed of PoA or DPoS.
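
    Stripped of all the hard parts (networking, view changes, signature verification), the quorum rule itself is tiny; a toy sketch:

        import math

        def is_committed(confirmations, bookkeepers):
            # A change counts as reliably replicated once it has been signed
            # by a quorum of at least 2/3 of the bookkeeping nodes.
            quorum = math.ceil(2 * len(bookkeepers) / 3)
            return len(set(confirmations) & set(bookkeepers)) >= quorum

        print(is_committed({"n1", "n2", "n3"}, {"n1", "n2", "n3", "n4"}))   # True: 3 >= ceil(8/3) = 3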


  • Posted by Konstantin 29.03.2017 1 Comment

    The following is an expanded version of an explanatory comment I posted here.

    Alice's Diary

    Alice decided to keep a diary. For that she bought a notebook, and started filling it with lines like:

    1. Bought 5 apples.
    2. Called mom.
      ...
    132. Gave Bob $250.
    133. Kissed Carl.
    134. Ate a banana.
      ...

    Alice did her best to keep a meticulous account of events, and whenever she had a discussion with friends about something that happened earlier, she would quickly resolve all arguments by taking out the notebook and demonstrating her records. One day she had a dispute with Bob about whether she lent him $250 earlier or not. Unfortunately, Alice did not have her notebook at hand at the time of the dispute, but she promised to bring it tomorrow to prove Bob owed her money.

    Bob really did not want to return the money, so that night he got into Alice's house, found the notebook, found line 132 and carefully replaced it with "132. Kissed Dave". The next day, when Alice opened the notebook, she did not find any records about money being given to Bob, and had to apologize for making a mistake.

    Alice's Blockchain

    A year later Bob's conscience got to him and he confessed his crime to Alice. Alice forgave him, but decided to improve the way she kept the diary, to avoid the risk of forging records in the future. Here's what she came up with. The operating system Linups that she was using had a program named md5sum, which could convert any text to its hash - a strange sequence of 32 characters. Alice did not really understand what the program did with the text, it just seemed to produce a sufficiently random sequence. For example, if you entered "hello" into the program, it would output "b1946ac92492d2347c6235b4d2611184", and if you entered "hello " with a space at the end, the output would be "1a77a8341bddc4b45418f9c30e7102b4".

    Alice scratched her head a bit and invented the following way of making record forging more complicated for people like Bob in the future: after each record she would insert the hash obtained by feeding the md5sum program with the text of the record and the previous hash. The new diary now looked as follows:

      0000 (the initial hash; let us limit ourselves to just four digits for brevity)
    1. Bought 5 apples.
      4178 (the hash of "0000" and "Bought 5 apples")
    2. Called mom.
      2314 (the hash of "4178" and "Called mom")
      ...
      4492
    132. Gave Bob $250.
      1010 (the hash of "4492" and "Gave Bob $250")
    133. Kissed Carl.
      8204 (the hash of "1010" and "Kissed Carl")
      ...

    Now each record was "confirmed" by a hash. If someone wanted to change line 132 to something else, they would have to change the corresponding hash (it would not be 1010 anymore). This, in turn, would affect the hash of line 133 (which would not be 8204 anymore), and so on all the way until the end of the diary. In order to change one record Bob would have to rewrite the confirmation hashes for all the following diary records, which is fairly time-consuming. This way, hashes "chain" all records together, and what was before a simple journal has now become a chain of records, or "blocks" - a blockchain.
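
    For the programmatically inclined, Alice's scheme fits in a few lines of Python (using md5 and truncating the hashes to four characters, just like in the diary; the actual hash values will differ from the toy ones above):

        import hashlib

        def confirm(prev_hash, record):
            # Hash of the previous hash together with the record text,
            # truncated to four characters to mimic Alice's diary.
            return hashlib.md5((prev_hash + record).encode()).hexdigest()[:4]

        diary = ["Bought 5 apples.", "Called mom.", "Gave Bob $250.", "Kissed Carl."]
        h = "0000"                       # the initial hash
        for record in diary:
            h = confirm(h, record)
            print(record, "->", h)

        # Changing any earlier record changes its hash and, through the chaining,
        # every hash that follows it.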

    Proof-of-Work Blockchain

    Time passed, Alice opened a bank. She still kept her diary, which now included serious banking records like "Gave out a loan" or "Accepted a deposit". Every record was accompanied by a hash to make forging harder. Everything was fine, until one day a guy named Carl took a loan of $1000000. The next night a team of twelve elite Chinese diary hackers (hired by Carl, of course) got into Alice's room, found the journal and replaced the line "143313. Gave out a $1000000 loan to Carl" with a new version: "143313. Gave out a $10 loan to Carl". They then quickly recomputed all the necessary hashes for the following records. For a dozen hackers armed with calculators this did not take too long.

    Fortunately, Alice saw one of the hackers retreating and understood what happened. She needed a more secure system. Her new idea was the following: let us append a number (called "nonce") in brackets to each record, and choose this number so that the confirmation hash for the record would always start with two zeroes. Because hashes are rather unpredictable, the only way to do it is to simply try out different nonce values until one of them results in a proper hash:

      0000
    1. Bought 5 apples (22).
      0042 (the hash of "0000" and "Bought 5 apples (22)")
    2. Called mom (14).
      0089 (the hash of "0042" and "Called mom (14)")
      ...
      0057
    132. Gave Bob $250 (33).
      0001
    133. Kissed Carl (67).
      0093 (the hash of "0001" and "Kissed Carl (67)")
      ...

    To confirm each record one now needs to try, on average, about 50 different hashing operations for different nonce values, which makes it 50 times harder to add new records or forge them than previously. Hopefully even a team of hackers wouldn't manage in time. Because each confirmation now requires hard (and somewhat senseless) work, the resulting method is called a proof-of-work system.
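
    The nonce search is equally short to sketch; with the truncated hexadecimal md5 used here, a suitable nonce turns up after a few hundred attempts on average:

        import hashlib
        from itertools import count

        def confirm(prev_hash, record):
            return hashlib.md5((prev_hash + record).encode()).hexdigest()[:4]

        def find_nonce(prev_hash, record):
            # Try nonce values one by one until the resulting hash starts with "00".
            for nonce in count(1):
                attempt = "%s (%d)." % (record, nonce)
                h = confirm(prev_hash, attempt)
                if h.startswith("00"):
                    return nonce, h

        print(find_nonce("0000", "Bought 5 apples"))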

    Distributed Blockchain

    Tired of having to search for matching nonces for every record, Alice hired five assistants to help her maintain the journal. Whenever a new record needed to be confirmed, the assistants would start seeking a suitable nonce in parallel, until one of them completed the job. To motivate the assistants to work faster she allowed them to append the name of the person who found a valid nonce, and promised to give promotions to those who confirmed the most records within a year. The journal now looked as follows:

      0000
    1. Bought 5 apples (29, nonce found by Mary).
      0013 (the hash of "0000" and "Bought 5 apples (29, nonce found by Mary)")
    2. Called mom (45, nonce found by Jack).
      0089 (the hash of "0013" and "Called mom (45, nonce found by Jack)")
      ...
      0068
    132. Gave Bob $250 (08, nonce found by Jack).
      0028
    133. Kissed Carl (11, nonce found by Mary).
      0041
      ...

    A week before Christmas, two assistants came to Alice seeking a Christmas bonus. Assistant Jack showed a diary where he had confirmed 140 records and Mary 130, while Mary showed a diary where she, reportedly, had confirmed more records than Jack. Each of them was showing Alice a journal with all the valid hashes, but with different entries! It turns out that ever since they found out about the promotion, the two assistants had been working hard to keep their own journals, so that all the nonces would carry their names. Since they had to maintain their journals individually, each had to do all the work of confirming records alone rather than splitting it among the other assistants. This of course made them so busy that they eventually had to miss some important entries about Alice's bank loans.

    Consequently, Jack's and Mary's "own journals" ended up being shorter than the "real journal", which was, luckily, correctly maintained by the three other assistants. Alice was disappointed and, of course, gave neither Jack nor Mary a promotion. "I will only give promotions to assistants who confirm the most records in the valid journal", she said. And the valid journal is the one with the most entries, of course, because the most work has been put into it!

    After this rule had been established, the assistants had no more motivation to cheat by working on their own journals alone - a collective honest effort always produced a longer journal in the end. This rule allowed the assistants to work from home and completely without supervision. Alice only needed to check that the journal had the correct hashes in the end when distributing promotions. This way, Alice's blockchain became a distributed blockchain.
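
    Alice's final rule - accept the longest journal whose hashes check out - could be sketched like this:

        import hashlib

        def confirm(prev_hash, record):
            return hashlib.md5((prev_hash + record).encode()).hexdigest()[:4]

        def is_valid(journal):
            # A journal is a list of (record, confirmation_hash) pairs, chained from "0000".
            h = "0000"
            for record, recorded_hash in journal:
                h = confirm(h, record)
                if h != recorded_hash:
                    return False
            return True

        def pick_valid_journal(journals):
            # Among the internally consistent journals, the longest one wins.
            return max((j for j in journals if is_valid(j)), key=len, default=[])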

    Bitcoin

    Jack happened to be much more effective at finding nonces than Mary and eventually became a Senior Assistant to Alice. He did not need any more promotions. "Could you transfer some of the promotion credits you got from confirming records to me?", Mary asked him one day. "I will pay you $100 for each!" "Wow", Jack thought, "apparently all the confirmations I did still have some value for me now!" They spoke with Alice and invented the following way to make "record confirmation achievements" transferable between parties.

    Whenever an assistant found a matching nonce, they would not simply write their own name to indicate who did it. Instead, they would write their public key. The agreement with Alice was that the corresponding confirmation bonus would belong to whoever owned the matching private key:

      0000
    1. Bought 5 apples (92, confirmation bonus to PubKey61739).
      0032 (the hash of "0000" and "Bought 5 apples (92, confirmation bonus to PubKey61739)")
    2. Called mom (52, confirmation bonus to PubKey55512).
      0056 (the hash of "0032" and "Called mom (52, confirmation bonus to PubKey55512)")
      ...
      0071
    132. Gave Bob $250 (22, confirmation bonus to PubKey61739).
      0088
    133. Kissed Carl (40, confirmation bonus to PubKey55512).
      0012
      ...

    To transfer confirmation bonuses between parties a special type of record would be added to the same diary. The record would state which confirmation bonus had to be transferred to which new public key owner, and would be signed using the private key of the original confirmation owner to prove it was really his decision:

      0071
    132. Gave Bob $250 (22, confirmation bonus to PubKey6669).
      0088
    133. Kissed Carl (40, confirmation bonus to PubKey5551).
      0012
      ...
      0099
    284. TRANSFER BONUS IN RECORD 132 TO OWNER OF PubKey1111, SIGNED BY PrivKey6669. (83, confirmation bonus to PubKey4442).
      0071

    In this example, record 284 transfers the bonus for confirming record 132 from whoever it belonged to before (the owner of private key 6669, presumably Jack in our example) to a new party - the owner of private key 1111 (who could be Mary, for example). As it is still a record, there is also the usual bonus for having confirmed it, which went to the owner of private key 4442 (who could be John, Carl, Jack, Mary or whoever else - it does not matter here). In effect, record 284 now describes two different bonuses - one due to the transfer, and another for the confirmation. These, if necessary, can be further transferred to different parties later using the same procedure.

    Once this system was implemented, it turned out that Alice's assistants and all their friends started actively using the "confirmation bonuses" as a kind of an internal currency, transferring them between each other's public keys, even exchanging for goods and actual money. Note that to buy a "confirmation bonus" one does not need to be Alice's assistant nor register anywhere. One just needs to provide a public key.

    This confirmation bonus trading activity became so prominent that Alice stopped using the diary for her own purposes, and eventually all the records in the diary would only be about "who transferred which confirmation bonus to whom". This idea of a distributed proof-of-work-based blockchain with transferable confirmation bonuses is known as Bitcoin.

    Smart Contracts

    But wait, we are not done yet. Note how Bitcoin is born from the idea of recording "transfer claims", cryptographically signed by the corresponding private key, into a blockchain-based journal. There is no reason we have to limit ourselves to this particular cryptographic protocol. For example, we could just as well make the following records:

    1. Transfer bonus in record 132 to whoever can provide signatures, corresponding to PubKey1111 AND PubKey3123.

    This would be an example of a collective deposit, which may only be extracted by a pair of collaborating parties. We could generalize further and consider conditions of the form:

    1. Transfer bonus in record 132 to whoever first provides x, such that f(x) = \text{true}.

    Here f(x) could be any predicate describing a "contract". For example, in Bitcoin the contract requires x to be a valid signature, corresponding to a given public key (or several keys). It is thus a "contract", verifying the knowledge of a certain secret (the private key). However, f(x) could just as well be something like:

        \[f(x) = \text{true, if }x = \text{number of bytes in record #42000},\]

    which would be a kind of a "future prediction" contract - it can only be evaluated in the future, once record 42000 becomes available. Alternatively, consider a "puzzle solving contract":

        \[f(x) = \text{true, if }x = \text{a valid, machine-verifiable proof of a complex theorem}.\]

    Finally, the first part of the contract, namely the phrase "Transfer bonus in record ..." could also be fairly arbitrary. Instead of transferring "bonuses" around we could just as well transfer arbitrary tokens of value:

    284. Whoever first provides x, such that f(x) = \text{true}, will be DA BOSS.
      ...
      x=42 satisfies the condition in record 284.
      Now and forever, John is DA BOSS!

    The value and importance of such arbitrary tokens will, of course, be determined by how they are perceived by the community using the corresponding blockchain. It is not unreasonable to envision situations where being DA BOSS gives certain rights in the society, and having this fact recorded in an automatically-verifiable public record ledger makes it possible to include this knowledge in various automated systems (e.g. consider a door lock which would only open to whoever is currently known as DA BOSS in the blockchain).
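
    A toy sketch of such a predicate-based ledger, with made-up predicates and token names, could look like this:

        # A toy ledger of predicate-based claims (names and predicates are hypothetical).
        contracts = {
            "bonus in record 132": lambda x: x == 42,                  # a trivial "puzzle" contract
            "DA BOSS":             lambda x: str(x) == str(x)[::-1],   # e.g. "provide any palindrome"
        }
        claimed = {}

        def claim(token, x, claimant):
            # Whoever first provides an x satisfying the contract predicate gets the token.
            if token not in claimed and contracts[token](x):
                claimed[token] = claimant
                return True
            return False

        print(claim("bonus in record 132", 42, "PubKey1111"))   # True: first valid claim
        print(claim("bonus in record 132", 42, "PubKey9999"))   # False: already claimed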

    Honest Computing

    As you see, we can use a distributed blockchain to keep journals, transfer "coins" and implement "smart contracts". These three applications are, however, all consequences of one general, core property. The participants of a distributed blockchain ("assistants" in the Alice example above, or "miners" in Bitcoin-speak) are motivated to precisely follow all rules necessary for confirming the blocks. If the rules say that a valid block is the one where all signatures and hashes are correct, the miners will make sure these indeed are. If the rules say that a valid block is the one where a contract function needs to be executed exactly as specified, the miners will make sure it is the case, etc. They all seek to get their confirmation bonuses, and they will only get them if they participate in building the longest honestly computed chain of blocks.

    Because of that, we can envision blockchain designs where a "block confirmation" requires running arbitrary computational algorithms, provided by the users, and the greedy miners will still execute them exactly as stated. This general idea lies behind the Ethereum blockchain project.

    There is just one place in the description provided above where miners have some motivational freedom to not be perfectly honest. It is the decision about which records to include in the next block to be confirmed (or which algorithms to execute, if we consider the Ethereum blockchain). Nothing really prevents a miner from refusing to ever confirm a record "John is DA BOSS", ignoring it as if it never existed at all. This problem is overcome in modern blockchains by having users offer an additional "tip money" reward for each record included in the confirmed block (or for every algorithmic step executed on the Ethereum blockchain). This aligns the motivation of the network towards maximizing the number of records included, making sure none is lost or ignored. Even if some miners had something against John being DA BOSS, there would probably be enough other participants who would not turn down the opportunity of getting an additional tip.

    Consequently, the whole system is economically incentivised to follow the protocol, and the term "honest computing" seems appropriate to me.


  • Posted by Konstantin 28.03.2017 No Comments

    Consider the following question:

    Which of the following two statements is logically true?

    1. All planets of the Solar System orbit the Sun. The Earth orbits the Sun. Consequently, the Earth is a planet of the Solar System.
    2. God is the creator of all things which exist. The Earth exists. Consequently, God created the Earth.

    I've seen this question or variations of it pop up as "provocative" posts in social networks several times. At times they might invite lengthy discussions, where the participants would split into camps - some claim that the first statement is true, because the Earth is indeed a planet of the Solar System and God did not create the Earth. Others would laugh at the stupidity of their opponents and argue that, obviously, only the second statement is correct, because it makes a valid logical implication, while the first one does not.

    Not once, however, have I seen a proper formal explanation of what is happening here. And although it is fairly trivial (once you know it), I guess it is worth writing up. The root of the problem here is the difference between implication and provability - something I myself remember struggling a bit to understand when I first encountered these notions in a course on mathematical logic years ago.

    Indeed, any textbook on propositional logic will tell you in one of the first chapters that you may write

        \[A \Rightarrow B\]

    to express the statement "A implies B". A chapter or so later you will learn that there is also a possibility to write

        \[A \vdash B\]

    to express a confusingly similar statement, that "B is provable from A". To confirm your confusion, another chapter down the road you should discover that A \Rightarrow B is the same as \vdash A \Rightarrow B, which, in turn, is logically equivalent to A \vdash B. Therefore, indeed, whenever A \Rightarrow B is true, A \vdash B is true, and vice versa. Is there a difference between \vdash and \Rightarrow then, and why do we need the two different symbols at all? The "provocative" question above provides an opportunity to illustrate this.

    Spoken language is rather informal, and there can be several ways of formally interpreting the same statement. Both statements in the puzzle are given in the form "A, B, consequently C". Here are at least four different ways to put them formally, each of which makes the two statements true or false in a different way.

    The Pure Logic Interpretation

    Anyone who has enough experience solving logic puzzles would know that both statements should be interpreted as abstract claims about provability (i.e. deducibility):

        \[A, B \vdash C.\]

    As mentioned above, this is equivalent to

        \[(A\,\&\, B) \Rightarrow C.\]

    or

        \[\vdash (A\,\&\, B) \Rightarrow C.\]

    In this interpretation the first statement is wrong and the second is a correct implication.
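
    To spell this out, one possible first-order formalization (with P for "is a planet of the Solar System", O for "orbits the Sun", E for "exists" and C for "was created by God") is:

        \[\forall x\,(P(x) \Rightarrow O(x)),\ O(\mathrm{Earth}) \not\vdash P(\mathrm{Earth}),\]

        \[\forall x\,(E(x) \Rightarrow C(x)),\ E(\mathrm{Earth}) \vdash C(\mathrm{Earth}).\]

    The first argument affirms the consequent, which is not a valid inference; the second is a plain universal instantiation followed by modus ponens.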

    The Pragmatic Interpretation

    People who have less experience with math puzzles would often assume that they should not exclude their common sense knowledge from the task. The corresponding formal statement of the problem then becomes the following:

        \[[\text{common knowledge}] \vdash (A\,\&\, B) \Rightarrow C.\]

    In this case both statements become true. The first one is true simply because the consequent C is true on its own, given common knowledge (the Earth is indeed a planet) - the antecedents and provability do not play any role at all. The second is true because it is a valid piece of reasoning, independently of the common knowledge.

    This type of interpretation is used in rhetorical phrases like "If this is true, I am a Dutchman".

    The Overly Strict Interpretation

    Some people may prefer to believe that a logical statement should only be deemed correct if every single part of it is true and logically valid. The two claims must then be interpreted as follows:

        \[([\text{common}] \vdash A)\,\&\, ([\text{common}] \vdash B)\,\&\, (A, B\vdash C).\]

    Here the issue of provability is combined with the question about the truthfulness of the facts used. Both statements are false - the first fails on logic, and the second on facts (assuming that God creating the Earth is not part of common knowledge).

    The Oversimplified Interpretation

    Finally, people very unfamiliar with strict logic would sometimes tend to ignore the words "consequently", "therefore" or "then", interpreting them as a kind of an extended synonym for "and". In their minds the two statements could be regarded as follows:

        \[[\text{common}] \vdash A\,\&\, B\,\&\, C.\]

    From this perspective, the first statement becomes true and the second (again, assuming the aspects of creation are not commonly known) is false.

    Although the author of the original question most probably assumed the "pure logic" interpretation, as is customary for such puzzles, note how much leeway there can be when converting a seemingly simple English phrase into a formal statement. In particular, observe that questions about provability, where you deliberately have to abstain from relying on common knowledge, may differ from questions about facts and implications, where common sense may (or must) be assumed and you can sometimes skip the whole "reasoning" part if you know the consequent is true anyway.

    Here is a quiz question to check whether you understood what I meant to explain.

    "The sky is blue, and therefore the Earth is round." True or false?


  • Posted by Konstantin 21.03.2017 No Comments

    Ever since Erwin Schrödinger described a thought experiment, in which a cat in a sealed box happened to be "both dead and alive at the same time", popular science writers have been relying on it heavily to convey the mysteries of quantum physics to the layman. Unfortunately, instead of providing any useful intuition, this example has laid a solid base for a whole bunch of misconceptions. Having read or heard something about the strange cat, people tend to jump to profound conclusions, such as "according to quantum physics, cats can be both dead and alive at the same time" or "the notion of a conscious observer is important in quantum physics". All of these are wrong, as is the image of a cat which is "both dead and alive at the same time". The corresponding Wikipedia page does not stress this fact well enough, hence I thought the Internet might benefit from yet another explanatory post.

    The Story of the Cat

    The basic notion in quantum mechanics is a quantum system. Pretty much anything could be modeled as a quantum system, but the most common examples are elementary particles, such as electrons or photons. A quantum system is described by its state. For example, a photon has polarization, which could be vertical or horizontal. Another prominent example of a particle's state is its wave function, which represents its position in space.

    There is nothing special about saying that things have state. For example, we may say that any cat has a "liveness state", because it can be either "dead" or "alive". In quantum mechanics we would denote these basic states using the bra-ket notation as |\mathrm{dead}\rangle and |\mathrm{alive}\rangle. The strange thing about quantum mechanical systems, though, is the fact that quantum states can be combined together to form superpositions. Not only could a photon have a purely vertical polarization \left|\updownarrow\right\rangle or a purely horizontal polarization \left|\leftrightarrow\right\rangle, but it could also be in a superposition of both vertical and horizontal states:

        \[\left|\updownarrow\right\rangle + \left|\leftrightarrow\right\rangle.\]

    This means that if you asked the question "is this photon polarized vertically?", you would get a positive answer with 50% probability - in another 50% of cases the measurement would report the photon as horizontally-polarized. This is not, however, the same kind of uncertainty that you get from flipping a coin. The photon is not either horizontally or vertically polarized. It is both at the same time.

    Amazed by this property of quantum systems, Schrödinger attempted to construct an example, where a domestic cat could be considered to be in the state

        \[|\mathrm{dead}\rangle + |\mathrm{alive}\rangle,\]

    which means being both dead and alive at the same time. The example he came up with, in his own words (citing from Wikipedia), is the following:

    A cat is penned up in a steel chamber, along with the following device (which must be secured against direct interference by the cat): in a Geiger counter, there is a tiny bit of radioactive substance, so small, that perhaps in the course of the hour one of the atoms decays, but also, with equal probability, perhaps none; if it happens, the counter tube discharges and through a relay releases a hammer that shatters a small flask of hydrocyanic acid. If one has left this entire system to itself for an hour, one would say that the cat still lives if meanwhile no atom has decayed. The first atomic decay would have poisoned it.

    The idea is that after an hour of waiting, the radioactive substance must be in the state

        \[|\mathrm{decayed}\rangle + |\text{not decayed}\rangle,\]

    the poison flask should thus be in the state

        \[|\mathrm{broken}\rangle + |\text{not broken}\rangle,\]

    and the cat, consequently, should be

        \[|\mathrm{dead}\rangle + |\mathrm{alive}\rangle.\]

    Correct, right? No.

    The Cat Ensemble

    Superposition, which is being "in both states at once", is not the only type of uncertainty possible in quantum mechanics. There is also the "usual" kind of uncertainty, where a particle is in either of two states, we just do not know exactly which one. For example, if we measure the polarization of a photon, which was originally in the superposition \left|\updownarrow\right\rangle + \left|\leftrightarrow\right\rangle, there is a 50% chance the photon will end up in the state \left|\updownarrow\right\rangle after the measurement, and a 50% chance the resulting state will be \left|\leftrightarrow\right\rangle. If we do the measurement, but do not look at the outcome, we know that the resulting state of the photon must be either of the two options. It is not a superposition anymore. Instead, the corresponding situation is described by a statistical ensemble:

        \[\{\left|\updownarrow\right\rangle: 50\%, \quad\left|\leftrightarrow\right\rangle: 50\%\}.\]

    Although it may seem that the difference between a superposition and a statistical ensemble is a matter of terminology, it is not. The two situations are truly different and can be distinguished experimentally. Essentially, every time a quantum system is measured (which happens, among other things, every time it interacts with a non-quantum system) all the quantum superpositions are "converted" to ensembles - concepts native to the non-quantum world. This process is sometimes referred to as decoherence.
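
    If you like to see such claims in numbers, the difference is visible already in the density matrices of the two situations (a small numpy check; the choice of basis vectors is arbitrary):

        import numpy as np

        up = np.array([1.0, 0.0])          # |vertical>
        right = np.array([0.0, 1.0])       # |horizontal>

        # Superposition (|v> + |h>)/sqrt(2): a single pure state.
        psi = (up + right) / np.sqrt(2)
        rho_superposition = np.outer(psi, psi)

        # Ensemble: |v> with probability 1/2, |h> with probability 1/2.
        rho_ensemble = 0.5 * np.outer(up, up) + 0.5 * np.outer(right, right)

        print(rho_superposition)   # off-diagonal entries 0.5
        print(rho_ensemble)        # diagonal matrix

    The off-diagonal entries of the first matrix are what makes interference effects possible; the second matrix describes a plain classical mixture.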

    Now recall the Schrödinger's cat. For the cat to die, a Geiger counter must register a decay event, triggering a killing procedure. The registration within the Geiger counter is effectively an act of measurement, which will, of course, "convert" the superposition state into a statistical ensemble, just like in the case of a photon which we just measured without looking at the outcome. Consequently, the poison flask will never be in a superposition of being "both broken and not". It will be either, just like any non-quantum object should. Similarly, the cat will also end up being either dead or alive - you just cannot know exactly which option it is before you peek into the box. Nothing special or quantum'y about this.

    The Quantum Cat

    "But what gives us the right to claim that the Geiger counter, the flask and the cat in the box are "non-quantum" objects?", an attentive reader might ask here. Could we imagine that everything, including the cat, is a quantum system, so that no actual measurement or decoherence would happen inside the box? Could the cat be "both dead and alive" then?

    Indeed, we could try to model the cat as a quantum system with |\mathrm{dead}\rangle and |\mathrm{alive}\rangle being its basis states. In this case the cat indeed could end up in the state of being both dead and alive. However, this would not be its most exciting capability. Way more surprisingly, we could then kill and revive our cat at will, back and forth, by simply measuring its liveness state appropriately. It is easy to see how this model is unrepresentative of real cats in general, and the worry about them being able to be in superposition is just one of the many inconsistencies. The same goes for the flask and the Geiger counter, which, if considered to be quantum systems, would get the magical abilities to "break" and "un-break", "measure" and "un-measure" particles at will. Those would certainly not be a real-world flask or counter anymore.

    The Cat Multiverse

    There is one way to bring quantum superposition back into the picture, although it requires some rather abstract thinking. There is a theorem in quantum mechanics, which states that any statistical ensemble can be regarded as a partial view of a higher-dimensional superposition. Let us see what this means. Consider a (non-quantum) Schrödinger's cat. As it might be hopefully clear from the explanations above, the cat must be either dead or alive (not both), and we may formally represent this as a statistical ensemble:

        \[\{\left|\text{dead}\right\rangle: 50\%, \quad\left|\text{alive}\right\rangle: 50\%\}.\]

    It turns out that this ensemble is mathematically equivalent in all respects to a superposition state of a higher order:

        \[\left|\text{Universe A}, \text{dead}\right\rangle + \left|\text{Universe B}, \text{alive}\right\rangle,\]

    where "Universe A" and "Universe B" are some abstract, unobservable "states of the world". The situation can be interpreted by imagining two parallel universes: one where the cat is dead and one where it is alive. These universes exist simultaneously in a superposition, and we are present in both of them at the same time, until we open the box. When we do, the universe superposition collapses to a single choice of the two options and we are presented with either a dead, or a live cat.

    Yet, although the universes happen to be in a superposition here, existing both at the same time, the cat itself remains completely ordinary, being either totally dead or fully alive, depending on the chosen universe. The Schrödinger's cat is just a cat, after all.


  • Posted by Konstantin 23.11.2016 16 Comments

    Basic linear algebra, introductory statistics and some familiarity with core machine learning concepts (such as PCA and linear models) are the prerequisites of this post. Otherwise it will probably make no sense. An abridged version of this text is also posted on Quora.

    Most textbooks on statistics cover covariance right in their first chapters. It is defined as a useful "measure of dependency" between two random variables:

        \[\mathrm{cov}(X,Y) = E[(X - E[X])(Y - E[Y])].\]

    The textbook would usually provide some intuition on why it is defined as it is, prove a couple of properties, such as bilinearity, define the covariance matrix for multiple variables as {\bf\Sigma}_{i,j} = \mathrm{cov}(X_i, X_j), and stop there. Later on the covariance matrix would pop up here and there in seemingly random ways. In one place you would have to take its inverse, in another - compute the eigenvectors, or multiply a vector by it, or do something else for no apparent reason apart from "that's the solution we came up with by solving an optimization task".

    In reality, though, there are some very good and quite intuitive reasons for why the covariance matrix appears in various techniques in one or another way. This post aims to show that, illustrating some curious corners of linear algebra in the process.

    Meet the Normal Distribution

    The best way to truly understand the covariance matrix is to forget the textbook definitions completely and start from a different point instead. Namely, from the definition of the multivariate Gaussian distribution:

    We say that the vector \bf x has a normal (or Gaussian) distribution with mean \bf \mu and covariance \bf \Sigma if:

        \[\Pr({\bf x}) =|2\pi{\bf\Sigma}|^{-1/2} \exp\left(-\frac{1}{2}({\bf x} - {\bf\mu})^T{\bf\Sigma}^{-1}({\bf x} - {\bf \mu})\right).\]

    To simplify the math a bit, we will limit ourselves to the centered distribution (i.e. {\bf\mu} = {\bf 0}) and refrain from writing out the normalizing constant |2\pi{\bf\Sigma}|^{-1/2}. Now, the definition of the (centered) multivariate Gaussian looks as follows:

        \[\Pr({\bf x}) \propto \exp\left(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}\right).\]

    Much simpler, isn't it? Finally, let us define the covariance matrix as nothing else but the parameter of the Gaussian distribution. That's it. You will see where it will lead us in a moment.

    Transforming the Symmetric Gaussian

    Consider a symmetric Gaussian distribution, i.e. the one with {\bf \Sigma = \bf I} (the identity matrix). Let us take a sample from it, which will of course be a symmetric, round cloud of points:

    We know from above that the likelihood of each point in this sample is

    (1)   \[P({\bf x}) \propto \exp(-0.5 {\bf x}^T {\bf x}).\]

    Now let us apply a linear transformation {\bf A} to the points, i.e. let {\bf y} ={\bf Ax}. Suppose that, for the sake of this example, {\bf A} scales the vertical axis by 0.5 and then rotates everything by 30 degrees. We will get the following new cloud of points {\bf y}:

    What is the distribution of {\bf y}? Just substitute {\bf x}={\bf A}^{-1}{\bf y} into (1), to get:

    (2)   \begin{align*} P({\bf y}) &\propto \exp(-0.5 ({\bf A}^{-1}{\bf y})^T({\bf A}^{-1}{\bf y}))\\ &=\exp(-0.5{\bf y}^T({\bf AA}^T)^{-1}{\bf y}) \end{align*}

    This is exactly the Gaussian distribution with covariance {\bf \Sigma} = {\bf AA}^T. The logic works both ways: if we have a Gaussian distribution with covariance \bf \Sigma, we can regard it as a distribution which was obtained by transforming the symmetric Gaussian by some {\bf A}, and we are given {\bf AA}^T.

    More generally, if we have any data, then, when we compute its covariance to be \bf\Sigma, we can say that if our data were Gaussian, then it could have been obtained from a symmetric cloud using some transformation \bf A, and we just estimated the matrix {\bf AA}^T, corresponding to this transformation.

    Note that we do not know the actual \bf A, and it is mathematically totally fair. There can be many different transformations of the symmetric Gaussian which result in the same distribution shape. For example, if \bf A is just a rotation by some angle, the transformation does not affect the shape of the distribution at all. Correspondingly, {\bf AA}^T = {\bf I} for all rotation matrices. When we see a unit covariance matrix we really do not know, whether it is the “originally symmetric” distribution, or a “rotated symmetric distribution”. And we should not really care - those two are identical.
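
    This reasoning is easy to check numerically; here is a small sketch using the same transformation as in the example above (sample size and seed are arbitrary):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((100_000, 2))            # a sample from the symmetric Gaussian

        angle = np.deg2rad(30)
        R = np.array([[np.cos(angle), -np.sin(angle)],
                      [np.sin(angle),  np.cos(angle)]])
        A = R @ np.diag([1.0, 0.5])                      # scale the vertical axis by 0.5, then rotate by 30 degrees

        Y = X @ A.T                                      # y = Ax, with points stored as rows
        print(np.cov(Y, rowvar=False))                   # empirical covariance of y ...
        print(A @ A.T)                                   # ... matches AA^T up to sampling noise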

    There is a theorem in linear algebra, which says that any symmetric matrix \bf \Sigma can be represented as:

    (3)   \[{\bf \Sigma} = {\bf VDV}^T,\]

    where {\bf V} is orthogonal (i.e. a rotation) and {\bf D} is diagonal (i.e. a coordinate-wise scaling). If we rewrite it slightly, we will get:

    (4)   \[{\bf \Sigma} = ({\bf VD}^{1/2})({\bf VD}^{1/2})^T = {\bf AA}^T,\]

    where {\bf A} = {\bf VD}^{1/2} (note that a covariance matrix \bf \Sigma is positive semi-definite, so the diagonal entries of \bf D are non-negative and {\bf D}^{1/2} is well-defined). This, in simple words, means that any covariance matrix \bf \Sigma could have been the result of transforming the data using a coordinate-wise scaling {\bf D}^{1/2} followed by a rotation \bf V. Just like in our example with \bf x and \bf y above.

    Principal Component Analysis

    Given the above intuition, PCA already becomes a very obvious technique. Suppose we are given some data. Let us assume (or “pretend”) it came from a normal distribution, and let us ask the following questions:

    1. What could have been the rotation \bf V and scaling {\bf D}^{1/2}, which produced our data from a symmetric cloud?
    2. What were the original, “symmetric-cloud” coordinates \bf x before this transformation was applied?
    3. Which original coordinates were scaled the most by \bf D and thus contribute most to the spread of the data now? Can we keep only those and throw the rest out?

    All of those questions can be answered in a straightforward manner if we just decompose \bf \Sigma into \bf V and \bf D according to (3). But (3) is exactly the eigenvalue decomposition of \bf\Sigma. I’ll leave you to think for just a bit and you’ll see how this observation lets you derive everything there is about PCA and more.
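
    In numpy terms the whole recipe collapses into a few lines; a minimal sketch, assuming a centered data matrix Y with observations in rows:

        import numpy as np

        def pca(Y, n_components):
            sigma = np.cov(Y, rowvar=False)              # estimate the covariance matrix
            eigenvalues, V = np.linalg.eigh(sigma)       # decompose it as V D V^T (eigh is for symmetric matrices)
            order = np.argsort(eigenvalues)[::-1]        # largest "scalings" first
            components = V[:, order[:n_components]]
            return Y @ components                        # coordinates along the top principal directions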

    The Metric Tensor

    Bear with me for just a bit more. One way to summarize the observations above is to say that we can (and should) regard {\bf\Sigma}^{-1} as a metric tensor. A metric tensor is just a fancy formal name for a matrix, which summarizes the deformation of space. However, rather than claiming that it in some sense determines a particular transformation \bf A (which it does not, as we saw), we shall say that it affects the way we compute angles and distances in our transformed space.

    Namely, let us redefine, for any two vectors \bf v and \bf w, their inner product as:

    (5)   \[\langle {\bf v}, {\bf w}\rangle_{\Sigma^{-1}} = {\bf v}^T{\bf \Sigma}^{-1}{\bf w}.\]

    To stay consistent we will also need to redefine the norm of any vector as

    (6)   \[|{\bf v}|_{\Sigma^{-1}} = \sqrt{{\bf v}^T{\bf \Sigma}^{-1}{\bf v}},\]

    and the distance between any two vectors as

    (7)   \[|{\bf v}-{\bf w}|_{\Sigma^{-1}} = \sqrt{({\bf v}-{\bf w})^T{\bf \Sigma}^{-1}({\bf v}-{\bf w})}.\]

    Those definitions now describe a kind of a “skewed world” of points. For example, a unit circle (a set of points with “skewed distance” 1 to the center) in this world might look as follows:

    And here is an example of two vectors, which are considered “orthogonal”, a.k.a. “perpendicular” in this strange world:

    Although it may look weird at first, note that the new inner product we defined is actually just the dot product of the “untransformed” originals of the vectors:

    (8)   \[{\bf v}^T{\bf \Sigma}^{-1}{\bf w} = {\bf v}^T({\bf AA}^T)^{-1}{\bf w}=({\bf A}^{-1}{\bf v})^T({\bf A}^{-1}{\bf w}).\]

    The following illustration might shed light on what is actually happening in this \Sigma-“skewed” world. Somehow “deep down inside”, the ellipse thinks of itself as a circle and the two vectors behave as if they were (2,2) and (-2,2).

    Getting back to our example with the transformed points, we could now say that the point-cloud \bf y is actually a perfectly round and symmetric cloud “deep down inside”, it just happens to live in a skewed space. The deformation of this space is described by the tensor {\bf\Sigma}^{-1} (which is, as we know, equal to ({\bf AA}^T)^{-1}). PCA now becomes a method for analyzing the deformation of space, how cool is that.
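
    Identity (8) is also easy to verify numerically; a small sketch with an arbitrary invertible A:

        import numpy as np

        A = np.array([[2.0, 0.3],
                      [0.1, 0.7]])                       # any invertible transformation
        sigma_inv = np.linalg.inv(A @ A.T)

        v = np.array([1.0, 2.0])
        w = np.array([-0.5, 1.5])

        skewed = v @ sigma_inv @ w                                   # inner product in the "skewed" space
        plain = np.linalg.solve(A, v) @ np.linalg.solve(A, w)        # dot product of the untransformed originals
        print(np.isclose(skewed, plain))                             # True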

    The Dual Space

    We are not done yet. There’s one interesting property of “skewed” spaces worth knowing about. Namely, the elements of their dual space have a particular form. No worries, I’ll explain in a second.

    Let us forget the whole skewed space story for a moment, and get back to the usual inner product {\bf w}^T{\bf v}. Think of this inner product as a function f_{\bf w}({\bf v}), which takes a vector \bf v and maps it to a real number, the dot product of \bf v and \bf w. Regard the \bf w here as the parameter (“weight vector”) of the function. If you have done any machine learning at all, you have certainly come across such linear functionals over and over, sometimes in disguise. Now, the set of all possible linear functionals f_{\bf w} is known as the dual space to your “data space”.

    Note that each linear functional is determined uniquely by the parameter vector \bf w, which has the same dimensionality as \bf v, so apparently the dual space is in some sense equivalent to your data space - just the interpretation is different. An element \bf v of your “data space” denotes, well, a data point. An element \bf w of the dual space denotes a function f_{\bf w}, which projects your data points on the direction \bf w (recall that if \bf w is unit-length, {\bf w}^T{\bf v} is exactly the length of the perpendicular projection of \bf v upon the direction \bf w). So, in some sense, if \bf v-s are “vectors”, \bf w-s are “directions, perpendicular to these vectors”. Another way to understand the difference is to note that if, say, the elements of your data points numerically correspond to amounts in kilograms, the elements of \bf w would have to correspond to “units per kilogram”. Still with me?

    Let us now get back to the skewed space. If \bf v are elements of a skewed Euclidean space with the metric tensor {\bf\Sigma}^{-1}, is a function f_{\bf w}({\bf v}) = {\bf w}^T{\bf v} an element of a dual space? Yes, it is, because, after all, it is a linear functional. However, the parameterization of this function is inconvenient, because, due to the skewed tensor, we cannot interpret it as projecting vectors upon \bf w nor can we say that \bf w is an “orthogonal direction” (to a separating hyperplane of a classifier, for example). Because, remember, in the skewed space it is not true that orthogonal vectors satisfy {\bf w}^T{\bf v}=0. Instead, they satisfy {\bf w}^T{\bf \Sigma}^{-1}{\bf v} = 0. Things would therefore look much better if we parameterized our dual space differently. Namely, by considering linear functionals of the form f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf \Sigma}^{-1}{\bf v}. The new parameters \bf z could now indeed be interpreted as an “orthogonal direction” and things overall would make more sense.

    However when we work with actual machine learning models, we still prefer to have our functions in the simple form of a dot product, i.e. f_{\bf w}, without any ugly \bf\Sigma-s inside. What happens if we turn a “skewed space” linear functional from its natural representation into a simple inner product?

    (9)   \[f^{\Sigma^{-1}}_{\bf z}({\bf v}) = {\bf z}^T{\bf\Sigma}^{-1}{\bf v} = ({\bf \Sigma}^{-1}{\bf z})^T{\bf v} = f_{\bf w}({\bf v}),\]

    where {\bf w} = {\bf \Sigma}^{-1}{\bf z}. (Note that we can lose the transpose because \bf \Sigma is symmetric).

    What it means, in simple terms, is that when you fit linear models in a skewed space, your resulting weight vectors will always be of the form {\bf \Sigma}^{-1}{\bf z}. Or, in other words, {\bf\Sigma}^{-1} is a transformation, which maps from “skewed perpendiculars” to “true perpendiculars”. Let me show you what this means visually.

    Consider again the two “orthogonal” vectors from the skewed world example above:

    Let us interpret the blue vector as an element of the dual space. That is, it is the \bf z vector in a linear functional {\bf z}^T{\bf\Sigma}^{-1}{\bf v}. The red vector is an element of the “data space”, which would be mapped to 0 by this functional (because the two vectors are “orthogonal”, remember).

    For example, if the blue vector was meant to be a linear classifier, it would have its separating line along the red vector, just like that:

    If we now wanted to use this classifier, we could, of course, work in the “skewed space” and use the expression {\bf z}^T{\bf\Sigma}^{-1}{\bf v} to evaluate the functional. However, why don’t we find the actual normal \bf w to that red separating line so that we wouldn’t need to do an extra matrix multiplication every time we use the function?

    It is not too hard to see that {\bf w}={\bf\Sigma}^{-1}{\bf z} will give us that normal. Here it is, the black arrow:

    Therefore, next time, whenever you see expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} or ({\bf v}-{\bf w})^T{\bf\Sigma}^{-1}({\bf v}-{\bf w}), remember that those are simply inner products and (squared) distances in a skewed space, while {\bf \Sigma}^{-1}{\bf z} is a conversion from a skewed normal to a true normal. Also remember that the “skew” was estimated by pretending that the data were normally-distributed.

    Once you see it, the role of the covariance matrix in some methods like Fisher's discriminant or Canonical Correlation Analysis might become much more obvious.

    The Dual Space Metric Tensor

    “But wait”, you should say here. “You have been talking about expressions like {\bf w}^T{\bf\Sigma}^{-1}{\bf v} all the time, while things like {\bf w}^T{\bf\Sigma}{\bf v} are also quite common in practice. What about those?”

    Hopefully you know enough now to suspect that {\bf w}^T{\bf\Sigma}{\bf v} is again an inner product or a squared norm in some deformed space, just not the “internal data metric space”, that we considered so far. Which space is it? It turns out it is the “internal dual metric space”. That is, whilst the expression {\bf w}^T{\bf\Sigma}^{-1}{\bf v} denoted the “new inner product” between the points, the expression {\bf w}^T{\bf\Sigma}{\bf v} denotes the “new inner product” between the parameter vectors. Let us see why it is so.

    Consider an example again. Suppose that our space transformation \bf A scaled all points by 2 along the x axis. The point (1,0) became (2,0), the point (3, 1) became (6, 1), etc. Think of it as changing the units of measurement - before we measured the x axis in kilograms, and now we measure it in pounds. Consequently, the norm of the point (2,0) according to the new metric, |(2,0)|_{\Sigma^{-1}} will be 1, because 2 pounds is still just 1 kilogram “deep down inside”.

    What should happen to the parameter ("direction") vectors due to this transformation? Can we say that the parameter vector (1,0) also got scaled to (2,0) and that the norm of the parameter vector (2,0) is now therefore also 1? No! Recall that if our initial data denoted kilograms, our dual vectors must have denoted “units per kilogram”. After the transformation they will be denoting “units per pound”, correspondingly. To stay consistent we must therefore convert the parameter vector (”1 unit per kilogram”, 0) to its equivalent (“0.5 units per pound”,0). Consequently, the norm of the parameter vector (0.5,0) in the new metric will be 1 and, by the same logic, the norm of the dual vector (2,0) in the new metric must be 4. You see, the “importance of a parameter/direction” gets scaled inversely to the “importance of data” along that parameter or direction.

    More formally, if {\bf x}'={\bf Ax}, then

    (10)   \begin{align*} f_{\bf w}({\bf x}) &= {\bf w}^T{\bf x} = {\bf w}^T{\bf A}^{-1}{\bf x}'\\ & =(({\bf A}^{-1})^T{\bf w})^T{\bf x}'=f_{({\bf A}^{-1})^T{\bf w}}({\bf x}'). \end{align*}

    This means, that the transformation \bf A of the data points implies the transformation {\bf B}:=({\bf A}^{-1})^T of the dual vectors. The metric tensor for the dual space must thus be:

    (11)   \[({\bf BB}^T)^{-1}=(({\bf A}^{-1})^T{\bf A}^{-1})^{-1}={\bf AA}^T={\bf \Sigma}.\]

    Remember the illustration of the “unit circle” in the {\bf \Sigma}^{-1} metric? This is how the unit circle looks in the corresponding \bf\Sigma metric. It is rotated by the same angle, but it is stretched in the direction where it was squished before.

    Intuitively, the norm (“importance”) of the dual vectors along the directions in which the data was stretched by \bf A becomes proportionally larger (note that the “unit circle” is, on the contrary, “squished” along those directions).

    But the “stretch” of the space deformation in any direction can be measured by the variance of the data. It is therefore not a coincidence that {\bf w}^T{\bf \Sigma}{\bf w} is exactly the variance of the data along the direction \bf w (assuming |{\bf w}|=1).

    The Covariance Estimate

    Once we start viewing the covariance matrix as a transformation-driven metric tensor, many things become clearer, but one thing becomes extremely puzzling: why is the inverse covariance of the data a good estimate for that metric tensor? After all, it is not obvious that {\bf X}^T{\bf X}/n (where \bf X is the data matrix) must be related to the \bf\Sigma in the distribution equation \exp(-0.5{\bf x}^T{\bf\Sigma}^{-1}{\bf x}).

    Here is one possible way to see the connection. Firstly, let us take it for granted that if \bf X is sampled from a symmetric Gaussian, then {\bf X}^T{\bf X}/n is, on average, a unit matrix. This has nothing to do with transformations, but is just a consequence of the pairwise independence of variables in the symmetric Gaussian.

    Now, consider the transformed data, {\bf Y}={\bf XA}^T (vectors in the data matrix are row-wise, hence the multiplication on the right with a transpose). What is the covariance estimate of \bf Y?

    (12)   \[{\bf Y}^T{\bf Y}/n=({\bf XA}^T)^T{\bf XA}^T/n={\bf A}({\bf X}^T{\bf X}){\bf A}^T/n\approx {\bf AA}^T,\]

    the familiar tensor.

    This is a place where one could see that a covariance matrix may make sense outside the context of a Gaussian distribution, after all. Indeed, if you assume that your data was generated from any distribution P with uncorrelated variables of unit variance and then transformed using some matrix \bf A, the expression {\bf X}^T{\bf X}/n will still be an estimate of {\bf AA}^T, the metric tensor for the corresponding (dual) space deformation.

    However, note that out of all possible initial distributions P, the normal distribution is exactly the one with the maximum entropy, i.e. the “most generic”. Thus, if you base your analysis on the mean and the covariance matrix (which is what you do with PCA, for example), you could just as well assume your data to be normally distributed. In fact, a good rule of thumb is to remember, that whenever you even mention the word "covariance matrix", you are implicitly fitting a Gaussian distribution to your data.


  • Posted by Konstantin 17.11.2016 3 Comments

    Mass on a spring

    Imagine a weight hanging on a spring. Let us pull the weight a bit and release it into motion. What will its motion look like? If you remember some of your high-school physics, you should probably answer that the resulting motion is a simple harmonic oscillation, best described by a sinewave. Although this is a fair answer, it actually misses an interesting property of real-life springs. A property most people don't think much about, because it goes a bit beyond the high school curriculum. This property is best illustrated by

    The Slinky Drop

    The "slinky drop" is a fun little experiment which has got its share of internet fame.


    When the top end of a suspended slinky is released, the bottom seems to patiently wait for the top to arrive before starting to fall as well. This looks rather unexpected. After all, we know that things fall down along a parabola, and we know that springs collapse according to a sinewave, yet neither of the two rules seems to apply here. If you browse around, you will find lots of awesome videos demonstrating or explaining this effect. There are news articles, forum discussions, blog posts and even research papers dedicated to the magical slinky. However, most of them are either too sketchy or too complex, and none seem to mention the important general implications, so let me take a shot at yet another explanation here.

    The Slinky Drop Explained Once More

    Let us start with the classical, "high school" model of a spring. The spring has some length L in the relaxed state, and if we stretch it, making it longer by \Delta L, the two ends of the spring exert a contracting force of k\Delta L. Assume we hold the top of the spring at the vertical coordinate y_{\mathrm{top}}=0 and have it balance out. The lower end will then position at the coordinate y_{\mathrm{bot}} = -(L+mg/k), where the gravity force mg is balanced out exactly by the spring force.

    How would the two ends of the spring behave if we let go of the top now? Here's how:

    The falling spring, version 1

    The horizontal axis here denotes time, and the vertical axis the vertical position. The blue curve is the trajectory of the top end of the spring, the green curve that of the bottom end. The dotted blue line is offset from the blue line by exactly L, the length of the spring in its relaxed state.

    Observe that the lower end (the green curve), similarly to the slinky, "waits" for quite a long time for the top to approach before starting to move with discernible velocity. Why is that the case? The trajectory of the lower point can be decomposed into two separate movements. Firstly, the point is trying to fall down due to gravity, following a parabola. Secondly, the point is being pulled by the spring tension and thus follows a cosine trajectory. Here is what the two trajectories look like separately:

    They are surprisingly similar at the start, aren't they? And indeed, the cosine function does resemble a parabola up to o(x^3). Recall the corresponding Taylor expansion:

        \[\cos(x) = 1 - \frac{x^2}{2} + \frac{x^4}{24} - \dots \approx 1 - \frac{x^2}{2}.\]

    If we align the two curves above, we can see how well they match up at the beginning:

    Consequently, the two forces happen to "cancel" each other long enough to leave an impression that the lower end "waits" for the upper for some time. This effect is, however, much more pronounced in the slinky. Why so?

    Because, of course, a single spring is not a good model for the slinky. It is more correct to regard a slinky as a chain of springs. Observe what happens if we model the slinky as a chain of just three simple springs:

    Each curve here is the trajectory of one of the nodes in between the three individual springs. We can see that the top two curves behave just like a single spring did: the green node waits a bit for the blue one and then starts moving. The red one, however, has to wait longer, until the green node moves sufficiently far away. The bottom end, in turn, only starts moving observably once the red node comes close enough, which means it has to wait even longer; by that time the top has already arrived. If we consider a more detailed model, say the movement of a slinky composed of 9 basic springs, the effect of the intermediate nodes "waiting" becomes even more pronounced:
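    If you would like to play with such simulations yourself, here is a minimal sketch of the chain-of-springs model (the constants, the number of springs and the assumption that the coils can pull but not push are my own illustration choices, not something taken from the animations above):

        import numpy as np
        from scipy.integrate import solve_ivp

        # Toy model: the slinky as a chain of n springs with a point mass at each node.
        n, m, k, g, L = 3, 0.1, 30.0, 9.81, 1.0
        ks = k * n                                 # each short spring is n times stiffer

        def accelerations(y):
            stretch = np.maximum(y[:-1] - y[1:] - L / n, 0.0)   # the coils cannot push
            tension = ks * stretch
            acc = np.full(n + 1, -g)
            acc[1:] += tension / m                 # each node is pulled up by the spring above it
            acc[:-1] -= tension / m                # ... and down by the spring below it
            return acc

        def rhs(t, state):
            y, v = np.split(state, 2)
            return np.concatenate([v, accelerations(y)])

        # Initial condition: hanging at rest (spring j is stretched by the weight of the
        # nodes below it); the top node is held until t = 0 and then released.
        y0 = np.zeros(n + 1)
        for j in range(1, n + 1):
            y0[j] = y0[j - 1] - L / n - g * m * (n + 1 - j) / ks

        sol = solve_ivp(rhs, (0, 0.6), np.concatenate([y0, np.zeros(n + 1)]), max_step=1e-3)
        # sol.y[:n + 1] holds the node trajectories; plotting them against sol.t reproduces
        # pictures similar to the ones above.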

    To make a "mathematically perfect" model of a slinky we have to go to the limit of infinitely many infinitely small springs. Let us briefly take a look at what that solution looks like.

    The Continuous Slinky

    Let x denote the coordinate of a point on a "relaxed" slinky. For example, in the two discrete models above the slinky had 4 and 10 points, numbered 1,\dots, 4 and 1,\dots, 10 respectively. The continuous slinky will have infinitely many points, indexed by the real values in [0,1].

    Let h(x,t) denote the vertical coordinate of a point x at time t. The acceleration of point x at time t is then, by definition, \frac{\partial^2 h(x,t)}{\partial t^2}, and there are two components affecting it: the gravitational pull -g and the force of the spring.

    The spring force acting on a point x is proportional to the stretch of the spring at that point, \frac{\partial h(x,t)}{\partial x}. As each point is affected by the stretch from above and from below, we have to consider the difference of the "top" and "bottom" stretches, i.e. the derivative of the stretch, \frac{\partial^2 h(x,t)}{\partial x^2}. Consequently, the dynamics of the slinky can be described by the equation:

        \[\frac{\partial^2 h(x,t)}{\partial t^2} = a\frac{\partial^2 h(x,t)}{\partial x^2} - g,\]

    where a is some positive constant. Let us denote the second derivatives by h_{tt} and h_{xx}, replace a with v^2 and rearrange to get:

    (1)   \[h_{tt} - v^2 h_{xx} = -g,\]

    which is known as the wave equation. The name stems from the fact that solutions to this equation always resemble "waves" propagating at a constant speed v through some medium. In our case the medium will be the slinky itself. Now it becomes apparent that, indeed, the lower end of the slinky should not move before the wave of disturbance, unleashed by releasing the top end, reaches it. Most explanations of the slinky drop refer to this fact. However, when it is stated alone, without the context of the wave-equation model, it is at best a rather incomplete explanation.

    Given how famous the equation is, it is not too hard to solve it. We'll need to do it twice - first to find the initial configuration of a suspended slinky, then to compute its dynamics when the top is released.

    In the beginning the slinky must satisfy h_t(x, t) = 0 (because it is not moving at all), h(0, t) = 0 (because the top end is located at coordinate 0), and h_x(1, t) = 0 (because there is no stretch at the bottom). Combining this with (1) and searching for a polynomial solution, we get:

        \[h(x, t) = h_0(x) = \frac{g}{2v^2}x(x-2).\]
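    As a quick sanity check, a few lines of sympy confirm that this h_0 satisfies the static version of (1) together with both boundary conditions:

        import sympy as sp

        x, g, v = sp.symbols('x g v', positive=True)
        h0 = g / (2 * v**2) * x * (x - 2)

        print(sp.simplify(-v**2 * sp.diff(h0, x, 2) + g))   # 0: the static wave equation holds
        print(h0.subs(x, 0))                                # 0: the top end is at the origin
        print(sp.diff(h0, x).subs(x, 1))                    # 0: no stretch at the bottom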

    Next, we release the slinky, hence the conditions h_t(x,t)=0 and h(0,t)=0 disappear, and we may use d'Alembert's formula with reflected boundaries to get the solution:

        \[h(x,t) = \frac{1}{2}(\phi(x-vt) + \phi(x+vt)) - \frac{gt^2}{2},\]

        \[\text{ where }\phi(x) = h_0(\mathrm{mod}(x, 2)).\]
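    Here is a small sketch that evaluates this closed-form solution on a grid (the values of g and v are arbitrary illustration constants); in particular, it lets you verify that the bottom point x = 1 does not move at all until t = 1/v:

        import numpy as np

        g, v = 9.81, 1.0

        def h0(x):
            return g / (2 * v**2) * x * (x - 2)

        def phi(x):
            return h0(np.mod(x, 2))        # the reflected periodic extension of h_0

        def h(x, t):
            return 0.5 * (phi(x - v * t) + phi(x + v * t)) - g * t**2 / 2

        xs = np.linspace(0, 1, 5)          # five points along the slinky
        for t in np.linspace(0, 1 / v, 6):
            print(f"t = {t:.1f}:", np.round(h(xs, t), 3))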

    Here is what the solution looks like visually:

    Notice how the part of the slinky that the wave has not yet reached stays completely fixed in place. Here are the trajectories of 4 equally-spaced points on the slinky:

    Note how, quite surprisingly, all points of the slinky are actually moving with a constant speed, changing it abruptly at certain moments. Somewhat magically, the mean of all these piecewise-linear trajectories (i.e. the trajectory of the center of mass of the slinky) is still a smooth parabola, just as it should be:

    The Secret of Spring Motion

    Now let us come back to where we started. Imagine a weight on a spring. What will its motion be like? Obviously, any real-life spring is, just like the slinky, best modeled not as a simple Hookean spring, but via the wave equation. This means that when you let go of the weight, it will send a deformation wave along the spring, which will travel back and forth, perturbing the pure sinewave movement you might be expecting from simple Hooke's law. Watch closely:

    Here is what the movement of the individual nodes looks like:

    The fat red line is the trajectory of the weight, and it is certainly not a sinewave. It is a curve in between the piecewise-linear "sawtooth" (which is the limiting case when the weight is zero) and the true sinusoid (which is the limiting case when the mass of the spring is zero). Here is what the zero-weight case looks like:

    And this is the other extreme - the massless spring:

    These observations can be summarized into the following obvious-sounding conclusion: the basic Hooke's law applies exactly only to the massless spring. Any real spring has mass and thus forms an oscillation wave traveling back and forth along its length, which interferes with the weight's simple harmonic oscillation, making it "less simple and harmonic". Luckily, if the mass of the weight is large enough compared to that of the spring, this interference is negligible.

    And that is, in my opinion, one of the interesting, yet often overlooked aspects of spring motion.


  • Posted by Konstantin 11.11.2016 4 Comments

    A question on Quora reminded me of an explanation that I have been meaning to post here ever since I started getting the chance to teach SVMs and kernel methods, but never found the time for. The post expects basic knowledge of those topics from the reader.

    Introductory Background

    The concept of kernel methods is probably one of the coolest tricks in machine learning. With most machine learning research nowadays being centered around neural networks, kernel methods have gone somewhat out of fashion recently, but I suspect they will strike back one day in some way or another.

    The idea of a kernel method starts with the curious observation that if you take a dot product of two vectors, x^T y, and square it, the result can be regarded as a dot product of two "feature vectors", where the features are all pairwise products of the original inputs:

        \begin{align*} &k(x,y) = (x^Ty)^2 = (\sum_i x_iy_i)^2 = \sum_{i,j} x_ix_jy_iy_j \\ & = \langle (x_1x_1, x_1x_2,\dots,x_ix_j,\dots,x_nx_n), (y_1y_1,\dots,y_iy_j,\dots,y_ny_n)\rangle \end{align*}

    Analogously, if you raise x^Ty to the third power, you are essentially computing a dot product within a space of all possible three-way products of your inputs, and so on, without ever actually having to see those features explicitly.
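    Here is a tiny numerical illustration of this identity for the squared kernel (the example vectors are, of course, arbitrary):

        import numpy as np
        from itertools import product

        x = np.array([1.0, 2.0, 3.0])
        y = np.array([0.5, -1.0, 2.0])

        # the explicit feature map: all pairwise products v_i * v_j
        pairwise = lambda v: np.array([v[i] * v[j] for i, j in product(range(len(v)), repeat=2)])

        print((x @ y) ** 2)                # the kernel value, 20.25
        print(pairwise(x) @ pairwise(y))   # the same value via explicit features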

    If you now take any linear model (e.g. linear regression, linear classification, PCA, etc.), it turns out you can replace the "real" dot product in its formulation with such a kernel function, and this will magically convert your model into one that is linear in nonlinear features (e.g. pairwise or triple products). As those features are never explicitly computed, there is no problem even if there are millions or billions of them.

    Consider, for example, plain old linear regression: f(x) = w^Tx + b. We can "kernelize" it by first representing w as a linear combination of the data points (this is called a dual representation):

        \[f(x) = \left(\sum_i \alpha_i x_i\right)^T x + b = \sum_i \alpha_i (x_i^T x) + b,\]

    and then swapping all the dot products x_i^T x with a custom kernel function:

        \[f(x) = \sum_i \alpha_i k(x_i,x) + b.\]

    If we now substitute k(x,y) = (x^T y)^2 here, our model becomes a second degree polynomial regression. If k(x,y) = (x^T y)^5 it is the fifth degree polynomial regression, etc. It's like magic, you plug in different functions and things just work.
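    To make the recipe concrete, here is a minimal sketch of such a kernelized regression. For numerical stability I fit the dual coefficients with a small ridge penalty lam (plain least squares corresponds to lam tending to zero) and drop the bias term b; both are my own simplifications rather than part of the derivation above:

        import numpy as np

        def kernel_regression(X_train, y_train, X_test, kernel, lam=1e-6):
            # Fit the dual coefficients alpha so that f(x) = sum_i alpha_i k(x_i, x).
            K = kernel(X_train, X_train)
            alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
            return kernel(X_test, X_train) @ alpha

        linear = lambda A, B: A @ B.T               # k(x, y) = x^T y
        poly2 = lambda A, B: (A @ B.T) ** 2         # k(x, y) = (x^T y)^2

        rng = np.random.default_rng(0)
        X = rng.standard_normal((200, 2))
        y = X[:, 0] * X[:, 1]                       # a target made of pairwise products

        for name, k in [("linear", linear), ("poly2", poly2)]:
            mse = np.mean((kernel_regression(X, y, X, k) - y) ** 2)
            print(name, np.round(mse, 4))           # poly2 fits this target (almost) perfectly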

    It turns out that there are lots of valid choices for the kernel function k, and, of course, the Gaussian function is one of these choices:

        \[k(x, y) = \exp\left(-\frac{|x - y|^2}{2\sigma^2}\right).\]

    It is not too surprising - the Gaussian function tends to pop up everywhere, after all - but it is not obvious what "implicit features" it should represent when viewed as a kernel function. Most textbooks do not seem to cover this question in sufficient detail, so let me do it here.

    The Gaussian Kernel

    To see the meaning of the Gaussian kernel we need to understand a couple of ways in which kernel functions can be combined. We saw before that raising the linear kernel to the power i produces a kernel whose feature space includes all i-wise products. Now let us examine what happens if we add two or more kernel functions. Consider k(x,y) = x^Ty + (x^Ty)^2, for example. It is not hard to see that it corresponds to an inner product of feature vectors of the form

        \[(x_1, x_2, \dots, x_n, \quad x_1x_1, x_1x_2,\dots,x_ix_j,\dots, x_n x_n),\]

    i.e. the concatenation of degree-1 (untransformed) features, and degree-2 (pairwise product) features.

    Multiplying a kernel function by a constant c is also meaningful. It corresponds to scaling the corresponding features by \sqrt{c}. For example, k(x,y) = 4x^Ty = (2x)^T(2y).

    Still with me? Great, now let us combine the tricks above and consider the following kernel:

        \[k(x,y) = 1 + x^Ty + \frac{(x^Ty)^2}{2} + \frac{(x^Ty)^3}{6}.\]

    Evidently, it is a kernel corresponding to a feature mapping which concatenates a constant feature, all original features, all pairwise products scaled down by \sqrt{2}, and all triple products scaled down by \sqrt{6}.

    Looks impressive, right? Let us continue and add more terms to this kernel, so that it contains all four-wise, five-wise, and so on, up to infinity-wise products of the input features. We shall choose the scaling coefficients of each term carefully, so that the resulting infinite sum resembles a familiar expression:

        \[\sum_{i=0}^\infty \frac{(x^Ty)^i}{i!} = \exp(x^Ty).\]

    We can conclude here that k(x,y) = \exp(x^Ty) is a valid kernel function, which corresponds to a feature space, which includes products of input features of any degree, up to infinity.

    But we are not done yet. Suppose that we decide to normalize the inputs before applying our linear model. That is, we want to convert each vector x to \frac{x}{|x|} before feeding it to the model. This is quite often a smart idea, which improves generalization. It turns out we can do this “data normalization” without really touching the data points themselves, but by only tuning the kernel instead.

    Consider again the linear kernel k(x,y) = x^Ty. If we normalize the vectors before taking their inner product, we get

        \[\left(\frac{x}{|x|}\right)^T\left(\frac{y}{|y|}\right) = \frac{x^Ty}{|x||y|} = \frac{k(x,y)}{\sqrt{k(x,x)k(y,y)}}.\]

    With some reflection you will see that the latter expression would normalize the features for any kernel.

    Let us see what happens if we apply this kernel normalization to the “infinite polynomial” (i.e. exponential) kernel we just derived:

        \begin{align*} \frac{\exp(x^Ty)}{\sqrt{\exp(x^Tx)\exp(y^Ty)}} &= \frac{\exp(x^Ty)}{\exp(|x|^2/2)\exp(|y|^2/2)}  \\ &= \exp\left(-\frac{1}{2}(|x|^2+|y|^2-2x^Ty)\right) \\ &= \exp\left(-\frac{|x-y|^2}{2}\right). \end{align*}

    Voilà, the Gaussian kernel. Well, it still lacks the \sigma^2 in the denominator, but by now you hopefully see that adding it is equivalent to scaling the inputs by 1/\sigma.

    To conclude: the Gaussian kernel is a normalized polynomial kernel of infinite degree (where the feature products of i-th degree are scaled down by \sqrt{i!} before normalization). Simple, right?

    An Example

    The derivations above may look somewhat theoretical, if not "magical", so let us work through a couple of numeric examples. Suppose our original vectors are one-dimensional (that is, real numbers), and let x = 1, y = 2. The value of the Gaussian kernel k(x, y) for these inputs is:

        \[k(x, y) = \exp(-0.5|1-2|^2) \approx 0.6065306597...\]

    Let us see whether we can obtain the same value as a simple dot product of normalized polynomial feature vectors of a high degree. For that, we first need to compute the corresponding unnormalized feature representation:

        \[\phi'(x) = \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \frac{x^4}{\sqrt{4!}}, \frac{x^5}{\sqrt{5!}}\dots\right).\]

    As our inputs are rather small in magnitude, we can hope that the feature sequence approaches zero quickly, so we don't really have to work with infinite vectors. Indeed, here is what the feature sequences look like:

        \phi'(1) = (1, 1, 0.707, 0.408, 0.204, 0.091, 0.037, 0.014, 0.005, 0.002, 0.001, 0.000, 0.000, ...)

        \phi'(2) = (1, 2, 2.828, 3.266, 3.266, 2.921, 2.385, 1.803, 1.275, 0.850, 0.538, 0.324, 0.187, 0.104, 0.055, 0.029, 0.014, 0.007, 0.003, 0.002, 0.001, ...)

    Let us limit ourselves to just these first 21 features. To obtain the final Gaussian kernel feature representations \phi we just need to normalize:

        \phi(1) = \frac{\phi'(1)}{|\phi'(1)|} = \frac{\phi'(1)}{1.649} = (0.607, 0.607, 0.429, 0.248, 0.124, 0.055, 0.023, 0.009, 0.003, 0.001, 0.000, ...)

        \phi(2) = \frac{\phi'(2)}{|\phi'(2)|} = \frac{\phi'(2)}{7.389} = (0.135, 0.271, 0.383, 0.442, 0.442, 0.395, 0.323, 0.244, 0.173, 0.115, 0.073, 0.044, 0.025, 0.014, 0.008, 0.004, 0.002, 0.001, 0.000, ...)

    Finally, we compute the simple dot product of these two vectors:

        \[\scriptstyle\phi(1)^T\phi(2) = 0.607\cdot 0.135 + 0.607\cdot 0.271 + \dots = {\bf 0.6065306}602....\]

    In boldface are the decimal digits which match the value of \exp(-0.5|1-2|^2). The tiny discrepancy comes mostly from truncating the feature vectors at 21 components; double-precision arithmetic is accurate to many more digits than that.
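    If you would like to double-check these numbers, here is a short script that redoes the computation at full floating-point precision rather than with the three decimals shown above:

        import numpy as np
        from math import factorial

        def phi_unnorm(x, degree=20):
            # unnormalized features x^i / sqrt(i!), for i = 0..degree
            return np.array([x**i / np.sqrt(factorial(i)) for i in range(degree + 1)])

        p1 = phi_unnorm(1.0); p1 /= np.linalg.norm(p1)
        p2 = phi_unnorm(2.0); p2 /= np.linalg.norm(p2)

        print(p1 @ p2)                    # ~0.60653066, slightly above exp(-0.5) due to truncation
        print(np.exp(-0.5 * (1 - 2)**2))  # 0.6065306597...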

    A 2D Example

    The one-dimensional example might have seemed somewhat too simplistic, so let us also go through a two-dimensional case. Here our unnormalized feature representation is the following:

        \begin{align*} \scriptscriptstyle\phi'(&\scriptscriptstyle x_1, x_2) = \left(1, x_1, \frac{x_1x_1}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{2!}}, \frac{x_2x_1}{\sqrt{2!}}, \frac{x_2x_2}{\sqrt{2!}}, \right. \\ &\scriptscriptstyle \left.\frac{x_1x_1x_1}{\sqrt{3!}}, \frac{x_1x_1x_2}{\sqrt{3!}}, \frac{x_1x_2x_1}{\sqrt{3!}}, \frac{x_1x_2x_2}{\sqrt{3!}}, \frac{x_2x_1x_2}{\sqrt{3!}},  \frac{x_2x_2x_1}{\sqrt{3!}}, \dots\right).\\ \end{align*}

    This looks pretty heavy, and we didn't even finish writing out the third degree products. If we wanted to continue all the way up to degree 20, we would end up with a vector with 2097151 elements!

    Note that many products are repeated, however (e.g. x_1x_2 = x_2x_1), hence these are not really all different features. Let us try to pack them more efficiently. As you'll see in a moment, this will open up a much nicer perspective on the feature vector in general.

    Basic combinatorics tells us that each feature of the form \frac{x_1^a x_2^b}{\sqrt{n!}} must be repeated exactly \frac{n!}{a!b!} times in our current feature vector. Thus, instead of repeating it, we could replace it with a single feature, scaled by \sqrt{\frac{n!}{a!b!}}. "Why the square root?" you might ask here. Because when collapsing a repeated feature we must preserve the overall vector norm. Consider a vector (v, v, v), for example. Its norm is \sqrt{v^2 + v^2 + v^2} = \sqrt{3}|v|, exactly the same as the norm of the single-element vector (\sqrt{3}v).

    As we do this scaling, each feature gets converted to a nice symmetric form:

        \[\sqrt{\frac{n!}{a!b!}}\frac{x_1^a x_2^b}{\sqrt{n!}} = \frac{x_1^a x_2^b}{\sqrt{a!b!}} = \frac{x_1^a}{\sqrt{a!}}\frac{x_2^b}{\sqrt{b!}}.\]

    This means that we can compute the 2-dimensional feature vector by first expanding each coordinate into a vector of powers, like we did in the previous example, and then taking all their pairwise products. This way, if we wanted to limit ourselves to a maximum total degree of 20, we would only have to deal with at most 21^2 = 441 pairwise products (just 231 of them once we drop those of total degree above 20) instead of 2097151. Nice!

    Here is a new view of the unnormalized feature vector up to degree 3:

        \[\phi'_3(x_1, x_2) = \scriptstyle\left(1, x_1, x_2, \frac{x_1^2}{\sqrt{2!}}, \frac{x_1x_2}{\sqrt{1!1!}}, \frac{x_2^2}{\sqrt{2!}}, \frac{x_1^3}{\sqrt{3!}}, \frac{x_1^2x_2}{\sqrt{2!1!}}, \frac{x_1x_2^2}{\sqrt{1!2!}}, \frac{x_2^3}{\sqrt{3!}}\right).\]

    Let us limit ourselves to this degree-3 example and let x = (0.7, 0.2), y = (0.1, 0.4) (if we picked larger values, we would need to expand our feature vectors to a higher degree to get a reasonable approximation of the Gaussian kernel). Now:

        \phi'_3(0.7, 0.2) = (1, 0.7, 0.2, 0.346, 0.140, 0.028, 0.140, 0.069, 0.020, 0.003),

        \phi'_3(0.1, 0.4) = (1, 0.1, 0.4, 0.007, 0.040, 0.113, 0.000, 0.003, 0.011, 0.026).

    After normalization:

        \phi_3(0.7, 0.2) = (0.768, 0.538, 0.154, 0.266, 0.108, 0.022, 0.108, 0.053, 0.015, 0.003),

        \phi_3(0.1, 0.4) = (0.919, 0.092, 0.367, 0.006, 0.037, 0.104, 0.000, 0.003, 0.010, 0.024).

    The dot product of these vectors is 0.8196\dots. What about the exact Gaussian kernel value?

        \[\exp(-0.5|x-y|^2) = 0.{\bf 81}87\dots.\]

    Close enough. Of course, this could be done for any number of dimensions, so let us conclude the post with the new observation we made:

    The features of the unnormalized d-dimensional Gaussian kernel are:

        \[\phi({\bf x})_{\bf a} = \prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

    where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.

    The Gaussian kernel is just the normalized version of that, and we know that the norm to divide by is \sqrt{\exp({\bf x}^T{\bf x})}. Thus we may also state the following:

    The features of the d-dimensional Gaussian kernel are:

        \[\phi({\bf x})_{\bf a} = \exp(-0.5|{\bf x}|^2)\prod_{i = 1}^d \frac{x_i^{a_i}}{\sqrt{a_i!}},\]

    where {\bf a} = (a_1, \dots, a_d) \in \mathbb{N}_0^d.
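    As a final sanity check, here is a sketch that builds these features directly (truncated at a maximum total degree that I pick arbitrarily) and recovers the kernel value from the 2D example above:

        import numpy as np
        from itertools import product
        from math import factorial

        def gaussian_features(x, max_degree=10):
            # phi(x)_a = exp(-|x|^2 / 2) * prod_i x_i^a_i / sqrt(a_i!), truncated by total degree
            d = len(x)
            feats = [np.prod([x[i]**a[i] / np.sqrt(factorial(a[i])) for i in range(d)])
                     for a in product(range(max_degree + 1), repeat=d) if sum(a) <= max_degree]
            return np.exp(-0.5 * np.dot(x, x)) * np.array(feats)

        x, y = np.array([0.7, 0.2]), np.array([0.1, 0.4])
        print(gaussian_features(x) @ gaussian_features(y))   # ~0.8187
        print(np.exp(-0.5 * np.sum((x - y) ** 2)))           # 0.8187...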

    That's it, now you have seen the soul of the Gaussian kernel.
