• Posted by Konstantin 07.09.2015 2 Comments

    Research and engineering always go hand in hand. Their relationship is so tight that people often cease to see the difference. Indeed, both are development activities that bring along technical advances. Many researchers work in engineering companies, and many engineers do what is essentially research - so is there even a need to see the difference? Quite often there is. Let me try to highlight it.

    Research (sometimes synonymous with "science") is, ideally, an experimental activity which aims to explore some space of possibilities in order to hopefully arrive at some "understanding" of what those possibilities are or how they should be used. Researchers primarily deliver written reports summarizing their findings. There is absolutely no guarantee that those findings will be "interesting" or "useful" to any degree - indeed, a research project by definition starts off in a position of uncertainty. As a result, for practical, commercial or investment purposes, research is always a risky activity. It may take a long time, it will produce mostly text, and there is no guarantee that the results will be of use. The upside of this risk is, of course, the chance to stumble upon a unique innovation, which may eventually bring considerable benefits. Later.

    Engineering is a construction activity, where existing knowledge and tools are put together to deliver actual products. In order for this process to be successful, experimentation and uncertainty must be, ideally, brought down to a minimum. Thus, when compared to pure research, engineering projects are low risk endeavors, where the expected outputs are known in advance.

    In simple terms, researchers answer questions whilst engineers build things, and by this definition those occupations are, obviously, very different. The difference is often apparent in material fields, such as construction or electronics, where engineers can be distinguished from researchers by the kind of tools they mostly hold in their hands and the places where they spend most of their time. In computational fields this visual difference does not exist - both engineers and researchers spend most of their time behind a computer screen. The tools and approaches are still different, but you won't spot this unless you are in the field. I did not know the difference between Software Engineering and Computer Science until I had the chance to try both.

    Things become even more confusing when data mining gets involved. The popularity of projects that focus on building data-driven intelligent systems is ever growing. As a result, more and more companies seem eager to embrace this magical world by hiring "data scientists" to do "data science" for them. The irony of the situation is that most of those companies are engineering businesses (e.g. software development firms) and, as such, they would not (or at least should not) normally consider hiring anyone with the word "scientist" in the job title. Scientists are not famous for bringing in stable income; engineers are.

    The term "data science" is a vague one, but I think it is quite succinct in that it captures the exploratory aspect that is inherent in general-purpose data analysis, as well as the current state of the art in the field. Although there are some good high level tools for a wide range of "simple" machine learning tasks nowadays, as soon as you want to try something more exotic, you are often on your own, faced with uncertainty and the need to experiment before you can even try to "build" anything. For a typical engineering company such uncertainty is not a good thing to rely upon.

    This does not mean that one cannot engineer data-driven systems nowadays. It means that, in reality, most companies, whether they know it or not, need a very particular kind of "data scientist". They need specialists with a good knowledge of simple, reliable tools who are capable of applying them to various data formats. Those who would perhaps avoid excessive experimentation in favor of simple, scalable, working solutions, even if those are somewhat simplistic and suboptimal and do not employ custom-designed forty-layer convolutional networks with inception blocks that require several months to debug and train. Those who might not know much about concentration inequalities but are fluent in data warehousing and streaming. There is actually a name for such people: "data engineers".

    There is nothing novel about this terminology, yet I still regularly encounter far too much misunderstanding around the topic.

    In particular, I would expect many more "Data engineering" curricula to appear at universities alongside the more or less standard "Data science" ones. The difference between the two would be slight but notable - roughly of the same order as the difference between our "Computer science" and "Software engineering" master's programmes, for example.

  • Posted by Konstantin 27.01.2013 7 Comments

    Most results published in mathematics and the natural sciences are highly technical and obscure to anyone outside the particular area of research. Consequently, most scientific publications and discoveries do not get any media attention at all. This is a problem for the public (which would benefit from knowing a bit more about our world), for science in general (which would benefit from wider dissemination) and for the scientists themselves (who would certainly appreciate it if their work reached more than a couple of other people). The issue is usually addressed by researchers trying to popularize their findings - providing simple interpretations that can be grasped by the layman without the need to understand the foundations. Unfortunately, if the researcher was lucky enough to grab media attention, chances are high that the process of "popularization" will get completely out of hand, and the message received by the public will have nothing to do with the original finding.

    Some theorems in pure mathematics have suffered especially badly from this phenomenon. My two favourite examples are Heisenberg's uncertainty principle and Gödel's incompleteness theorems. Every once in a while I stumble upon pieces written by people who have zero knowledge of basic physics (not to mention quantum mechanics or Fourier theory), but who claim to have clearly understood the meaning and the deep implications of the uncertainty principle for our daily lives. Even more popular are Gödel's theorems. "Science proves that some things in this world can never be understood or predicted by us, mere humans!!!" is a typical take on Gödel by the press. Oh, and it is loved by religion enthusiasts!

    This post is my desperate attempt to bring some sanity back into this world by providing a (yet another, but maybe slightly different) popular explanation for Gödel's ideas.

    Gödel's theorems explained (with enough context)

    Kurt Gödel

    In our daily lives we primarily use symbols to communicate and describe the world. Those may be formulas in mathematics or simply sentences in the English language. Symbol-based systems can be used in different ways. From the perspective of arts and humanities, one is welcome to construct arbitrary sequences of words and sentences, as long as the result is "beautiful", "inspiring" or "seems smart and convincing". From a more technically-minded point of view, however, one is only welcome to operate with "true and logical" statements. Yet it is not at all obvious which statements should be universally considered "true and logical". There are two ways this problem can be resolved.

    Firstly, we can determine the truth by consulting "reality". For example, the phrase "All people are mortal, hence Socrates is mortal" is true because, indeed (skipping a lecture's worth of formalities), in all possible universes where there are people and they are mortal, Socrates, being a person, will also be mortal.

    Secondly, we can simply postulate a set of statements, which we shall universally regard as "the true ones". To distinguish those postulated truths from the "actual truths", mentioned in the paragraph above, let us call them "logically correct statements". That is, we can hypothetically write down an (infinite) list of all the "logically correct" sentences in some hypothetical "Book of All Logically Correct Statements". From that time on, every claim not in the book will be deemed "illogical" and hence "false" (the question of whether it is false in reality would, of course, depend on the goodness and relevance of the book we created).

    This is really convenient, because now in order to check the correctness of the claim about Socrates we do not have to actually visit all the possible universes. It suffices to look it up in our book! If the book is any good, it will provide good answers, so the concept of "logical correctness" may be close or even equivalent to "actual truth". And in principle, there must exist "The Actual Book of Truth", which would list precisely the "actually true" statements.

    Unfortunately, infinite books are somewhat inconvenient to print and carry around. Luckily, we can try to represent an infinite book in a finite way using algorithms. We have all seen how a short algorithm is capable of producing an infinite amount of nonsense, haven't we? Thus, instead of looking for an (infinitely-sized) "book of truth", we should look for a (finitely-sized) "algorithm of truth", which would either be capable of generating the whole book, or of checking, for any given statement, whether it belongs to the book. A toy sketch of this idea follows below.
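
    To make this concrete, here is a minimal, purely illustrative sketch in Python (my own toy example, nothing from the literature): a few lines of code stand for an infinite "book" of statements, both by generating its entries one by one and by checking whether a given statement would ever appear in it. The statements themselves are deliberately trivial; the only point is that a finite program can describe an infinite book.

        # A finite description of an infinite "book": every statement of the
        # form "n + 1 = n+1" for n = 0, 1, 2, ...

        def book_of_statements():
            """Yield the entries of the (infinite) book, one by one."""
            n = 0
            while True:
                yield f"{n} + 1 = {n + 1}"
                n += 1

        def belongs_to_book(statement: str) -> bool:
            """Decide membership by the statement's form, not by storing the book."""
            left, _, right = statement.partition(" = ")
            if not left.endswith(" + 1"):
                return False
            try:
                n = int(left.removesuffix(" + 1"))
                return int(right) == n + 1
            except ValueError:
                return False

        book = book_of_statements()
        print([next(book) for _ in range(3)])   # ['0 + 1 = 1', '1 + 1 = 2', '2 + 1 = 3']
        print(belongs_to_book("41 + 1 = 42"))   # True
        print(belongs_to_book("2 + 2 = 5"))     # False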

    Frege's propositional calculus

    For clarity and historical reasons, the algorithms for describing such "books of truths" are usually developed and written down using a particular "programming language", consisting of "axioms" and "inference rules". The figure above shows one of the simplest examples of what such a "program" might look like.
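
    In the same spirit, here is a small, hedged Python rendering of such a "program": a toy Hilbert-style system with three classic axiom schemas and modus ponens as its only inference rule. This is an illustration of the general shape, not a faithful transcription of Frege's original calculus.

        # Formulas are nested tuples: ("->", A, B) is "A implies B",
        # ("not", A) is "not A"; plain strings are propositional atoms.

        def implies(a, b):
            return ("->", a, b)

        def axioms(formulas):
            """Instantiate three classic axiom schemas over the given formulas."""
            for a in formulas:
                for b in formulas:
                    yield implies(a, implies(b, a))                                # A -> (B -> A)
                    yield implies(implies(("not", a), ("not", b)), implies(b, a))  # contraposition
                    for c in formulas:
                        yield implies(implies(a, implies(b, c)),
                                      implies(implies(a, b), implies(a, c)))       # distribution

        def generate_book(base_formulas, steps=2):
            """Start from axiom instances and close under modus ponens a few times."""
            book = set(axioms(base_formulas))
            for _ in range(steps):
                derived = set()
                for p in book:                    # modus ponens: from A and (A -> B), infer B
                    for q in book:
                        if isinstance(q, tuple) and q[0] == "->" and q[1] == p:
                            derived.add(q[2])
                book |= derived
            return book

        # Deriving the classic theorem "A -> A" takes two modus ponens steps.
        book = generate_book(["A", implies("A", "A")], steps=2)
        print(implies("A", "A") in book)   # True

    Whatever the concrete axioms are, the important thing is the shape of the construction: a finite program that churns out an ever-growing list of "logically correct" statements.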

    This is where a natural question arises: even if we assume that we have access to "the actual book of truth", will it really be possible to compress it down to a finite algorithm? Isn't it too much to ask? It turns out it is indeed too much to ask: there is really no way to create such a compact representation for any but the most trivial "books of truths".

    Besides the "asking too much" intuition, there are at least two other simple reasons why this is impossible. Firstly, most of the readers here know that there exist problems, that are in principle uncomputable. Those are usually about statements, the correctness of which cannot be checked in finite time by a finite algorithm, with the most famous example being the halting problem, as well as everything related to uncomputable numbers.

    Naturally, if we cannot even use a finite algorithm to list "all truly halting programs", our hopes of getting a finite description for any more complicated "book of truths" are gone. But there is another, funnier reason. Once we have decided to define the book of all true statements and devise an algorithm for generating it, we must not forget to include in the book some statements about the book and the algorithm themselves. In particular, we must consider the following phrase: "This sentence will not be included in the book generated by our algorithm."

    If our algorithm does include this sentence in the generated book, the claim turns out to be false. As a result, our precious book would have false statements in it: it would be inconsistent. On the other hand, if our algorithm produces a book which does not include the above sentence, the sentence turns out to be true and the book is incomplete. It is just the liar's paradox.
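
    The same trick can be spelled out as a toy piece of code (again, my own illustration rather than a formal proof - the real theorem uses Gödel numbering to make the self-reference precise). Here `decides` stands in for the hypothetical finite algorithm that answers "will this statement end up in the generated book?", and the troublesome statement talks about that very answer:

        def decides(statement: str) -> bool:
            """Hypothetical oracle: True means 'this statement goes into the book'."""
            return True   # any fixed rule runs into the same kind of trouble

        statement = "decides() answers False when asked about this very statement"

        verdict = decides(statement)              # what the algorithm claims
        statement_is_true = (verdict is False)    # what the statement actually asserts

        # verdict True  -> the statement is false, yet it lands in the book: inconsistent.
        # verdict False -> the statement is true, yet it is left out: incomplete.
        print(f"verdict: {verdict}, statement actually true: {statement_is_true}")
        assert verdict != statement_is_true       # the oracle can never agree with reality here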

    And so it is: we have to put up with a situation where it is impossible to produce a finite representation of all true statements without stumbling into logical paradoxes. Or, alternatively, accept that there are mathematical statements which cannot be checked or proven within finite time.

    This describes what is known as "Gödel's first incompleteness theorem" (namely, the claim that in most cases our algorithm is only capable of producing either an incomplete or an inconsistent "book"). A curious corollary may be derived from it, known as "Gödel's second incompleteness theorem". It says that if our "book-generating algorithm" happens to include in its book exactly the phrase "This book is consistent", that claim must necessarily be false, and the book must actually be inconsistent. This is often interpreted as saying that no "honest" theory is capable of claiming its own correctness (there are caveats, though).

    Gödel's theorems have several interesting implications for the foundations of mathematics. Nonetheless, they are still theorems about the fundamental impossibility of using a finite algorithm to generate a book of logically correct statements without stumbling into logical paradoxes and infinite loops. Nothing more and nothing less. They have nothing to do with the experimental sciences: note how we had to cut away all connections to reality in the very first paragraphs of the explanation. And thus, Gödel's theorems are not related to the causes of "human inability to grasp the universe", "failure or success of startup business plans", or anything else of that kind.

  • Posted by Konstantin 25.12.2008 No Comments

    A long time ago, information was stored and transmitted by people who passed it in the form of poems from mouth to mouth, generation to generation, until, at some moment, writing was invented. Some poems were lucky enough to be carefully written down in letters on scrolls of papyrus or parchment, yet a considerable number of them were left unwritten and thus lost. Because in the age of writing, an unwritten poem is a non-existent poem. Later on came the printing press and brought a similar revolution: some books from the past were diligently reprinted in thousands of copies and thus preserved for the future. The remaining ones were effectively lost, because in the age of the printing press, an unpublished book is a non-existent book. And then came the Internet. Once again, although a lot of the past knowledge has migrated here, a large amount has not, which means that it has been lost for most practical purposes. Because in the age of the Internet, if it is not on the Internet, it does not exist. The tendency is especially notable in science, because science is essentially about accumulating knowledge.

    The effect of such regular "cleanups" (and I am sure these will continue regularly for as long as humankind exists) is twofold. On one hand, the existing knowledge is reviewed and only the worthy pieces get a chance to be converted into the new media format. As a result, a lot of useless crap is thrown away in an act of natural selection. On the other hand, a considerable amount of valuable information gets lost too, simply because it seemed useless at that specific moment. Of course, it will be reinvented sooner or later anyway, but the fact that it was right here and we just lost it seems disturbing, doesn't it.

    I'm still browsing through that old textbook from the previous post and enjoying the somewhat unfamiliar way the material is presented. Bayesian learning, Boolean logic and context-free grammars are brought together and related to decision theory. If I did not know the publication date of the book, I could easily mistake this "old" way of presenting the topic for something new. Moreover, I suspect that, with the addition of a minor twist, some ideas from the book could probably be republished in a low-impact journal and thus be recognized as "novel". It would be close to impossible to expose the copy, because a pre-Internet-era non-English text simply does not exist.

  • Posted by Konstantin 19.12.2008 3 Comments

    The other day I accidentally stumbled upon an old textbook on pattern analysis written in Russian (the second edition of a book originally published in 1977, which is more or less the time of the classics). A brief review of its contents was enormously enlightening.

    It was both fun and sad to see how small and incremental the progress in pattern analysis has been over the last 30 years. If I hadn't been told that the book was first published in 1977, I wouldn't have been able to tell it apart from any contemporary textbook. I'm not complaining that the general approaches and techniques haven't changed much; they shouldn't have. What disturbs me is that the vision of the future 30 years ago was not significantly different from what we have today.

    Pattern recognition systems are nowadays becoming more and more widespread, and it is difficult to name a scientific field or an area of industry where they are not used or will not be used in the near future...

    Further on, the text briefly describes the application areas for pattern analysis, ranging from medicine to agriculture to "intellectual fifth-generation computing machines" and robots that were supposed to be here somewhere around the nineties already. And although machines did get somewhat more intelligent, we have clearly fallen short of our past expectations. Our current vision of the future is not significantly different from the one we had 30 years ago. It has probably become somewhat more modest, in fact.

    Interesting - is this situation specific to pattern analysis, or is it like that in most areas of computer science?

  • Posted by Konstantin 16.10.2008 2 Comments

    Dan once said that I should produce more Weltschmerz in this blog. However, I find this area extremely hard to write about, because it is really difficult to be even marginally constructive, or at least analytical, in this field. Anyway, here's an attempt to touch on the classics.

    Today we had a "CS institute day" - a local-scale PR-event aiming at introducing the inner workings of the computer science institute to the students and answering whatever questions they might have about their present or future studies. As it often is the case with such events, the majority of the attendees were not the students but rather the members of the faculty who were either willing to help answer potential questions or just curious about the event. Not more than a dozen students attended. This is unfortunate, as it displays a significant lack of interest of the students towards their studies. There are several reasons for that, the most prominent being perhaps the following two issues:

    • The first "problem" is that studies are free (for most people) here and many students regard them as a nuisance rather than as a way to learn necessary skills. It's often about "I go to university because of some stupid tradition" rather than "I study because I really need knowledge and skills". I heard that the attitude is different in those universities where students pay for their studies. Of course, I'm not promoting the idea of paid studies, but I think that the students would benefit if they could at least mentally put themselves in a situation where the university is something expensive and optional (rather than free and necessary). I'm not sure how to do that, though: whenever the faculty attempts to arrange a motivational event, noone attends.
    • The second "problem" lies in the bloated market demand for IT specialists of any level. Nowadays one can easily get a disproportionately well-paid code-monkey position without any education. This will change, however. As the price level rises, Estonia quickly loses its appeal as an IT outsource country. The internal market for IT solutions does exist, but the pool of available developers grows faster than demand. Three years ago the local large IT companies were literally fighting to recruit as many IT students of any age as possible. Today they are probably still glad to get new people, but it's not as critical. Tomorrow they won't have free projects to assign to arbitrary new people. Finally, the day after tomorrow we are hopefully going to see real competition in IT-skills, which will be the normal situation for an industry. It is high time for a lazy student to think about that seriously now.

  • Posted by Konstantin 01.09.2008 10 Comments

    There is no easy way to explain why a person who has spent 11 years at school, 4 years on a bachelor's and 3 years on a master's degree would then decide to enter PhD studies, risking yet another 4 years of his life, if not more. Unlike the case with the bachelor's or the master's, there does not seem to be any social pressure encouraging one to get a PhD. Neither is there much economic motivation, because getting a PhD does not guarantee higher pay. Finally, although the PhD degree is indeed a damn cool thing to have, it is doubtful whether people possessing it enjoy life more than all the rest do.

    Nonetheless, a PhD is a required attribute for anyone aspiring to an academic career, and it is regarded as a qualitatively higher step on the education ladder. So the question is: what makes this qualitative difference, and which issues should one focus on most during those 4 years? Here's my initial guess:

    Your doctorate studies did not go in vain, if:

    1. You can generate a publishable paper in a month, a really good paper in 2-3 months,
    2. You can write a convincing grant/project proposal and you know when and where to submit it,
    3. You have realistic but useful ideas for future work and research,
    4. You know how to supervise/direct/collaborate with others and be actually useful at it,
    5. You know the most important people in your field, and they know you,
    6. You are good at lecturing or other kinds of oral presentation,
    7. You know what to do after you defend.

    I'm sure there's something missing, but this list has enough complicated aims already. Hopefully, this blog will help me with points 3, 5 and 7 of the above agenda, as well as keep reminding me that I don't want to lose my 4 years for nothing. It's day 1 today. 1460 days left. Yay!
