Four Years Remaining

Data Engineering

Posted by Konstantin 07.09.2015

Research and engineering always go hand in hand. Their relationship is so tight that people often cease to see the difference. Indeed, both are development activities that bring about technical advances. Many researchers work in engineering companies, and many engineers do what is essentially research. Is there even a need to see the difference? Quite often there is. Let me highlight this difference.

Research (sometimes synonymous with "science") is, ideally, an experimental activity, which aims to explore some space of possibilities in order to hopefully come up with a certain "understanding" of what those possibilities are or how they should be used. Researchers primarily deliver written reports summarizing their findings. There is absolutely no guarantee that those findings will be "interesting" or "useful" to any degree - indeed, a research project by definition starts off in a position of uncertainty. As a result, for practical, commercial or investment purposes, research is always a risky activity. It may take a long time, it will produce mostly text, and there is no guarantee that the results will be of use. The upside of this risk is, of course, the chance to stumble upon a unique innovation, which may eventually bring considerable benefits. Later.

Engineering is a construction activity, where existing knowledge and tools are put together to deliver actual products. For this process to be successful, experimentation and uncertainty must, ideally, be brought down to a minimum. Thus, when compared to pure research, engineering projects are low-risk endeavors, where the expected outputs are known in advance.

In simple terms - researchers answer questions whilst engineers build things, and by this definition those occupations are, obviously, very different. This difference is often apparent in material fields, such as construction or electronics, where the engineers can be distinguished from the researchers by the kind of tools they mostly hold in their hands and the places where they spend most of their time. In computational fields this visual difference does not exist - both engineers and researchers spend most of their time behind a computer screen. The tools and approaches are still different, but you won't spot this unless you are in the field. I did not know the difference between Software Engineering and Computer Science until I had the chance to try both.

Things become even more confusing when data mining gets involved. Projects that focus on building data-driven intelligent systems are becoming ever more popular. As a result, more and more companies seem eager to embrace this magical world by hiring "data scientists" to do "data science" for them. The irony of the situation is that most of those companies are engineering businesses (e.g. software development firms) and, as such, they would not (or at least should not) normally consider hiring anyone with the word "scientist" in the job title, because scientists are not exactly famous for bringing in stable income. Engineers are.

The term "data science" is a vague one, but I think it is quite apt in that it captures the exploratory aspect that is inherent in general-purpose data analysis, as well as the current state of the art in the field. Although there are some good high-level tools for a wide range of "simple" machine learning tasks nowadays, as soon as you want to try something more exotic, you are often on your own, faced with uncertainty and the need to experiment before you can even try to "build" anything. For a typical engineering company such uncertainty is not a good thing to rely upon.

This does not mean that one cannot engineer data-driven systems nowadays. It means that, in reality, most companies, whether they know it or not, need a very particular kind of "data scientist". They need specialists who have a good knowledge of simple, reliable tools and are capable of applying them to various data formats. Those who would perhaps avoid excessive experimentation in favor of simple, scalable, working solutions, even if those are somewhat simplistic, suboptimal and do not employ custom-designed forty-layer convolutional networks with inception blocks, which require several months to debug and train. Those who might not know much about concentration inequalities but would be fluent in data warehousing and streaming. There's actually a name for such people: "Data engineers".

There is nothing novel about this terminology, yet I still regularly encounter far too much misunderstanding around the topic.

In particular, I would expect far more "Data engineering" curricula to appear at universities alongside the more or less standard "Data science" ones. The difference between the two would be slight but notable - of pretty much the same order as the difference between our "Computer science" and "Software engineering" master's programmes, for example.

Posted by Konstantin @ 1:16 am

Tags: Computer science, Data science, Definition, Philosophy, Weltschmertz
2 Comments
1. Martin Vahi on 07.09.2015 at 08:43
  
  I think that there is a fundamental flaw in this contemplation. Namely, it is stated here that engineers stick to well-known and tested tools and only combine well-known and tested components, and that this is a low-risk activity that avoids experimentation and walking in the unknown. In reality, at least the way I see it at the time of writing this comment, the moment well-tested and well-known components are combined, it is already experimentation and walking in the unknown. I can illustrate that with the following examples:
  
  x) The parameters of an ordinary wall brick can be really well known, and the bricks can stay within the range of those parameters very reliably, but all the calculations about the houses that can be built from those bricks are clearly a whole different field, with pure experimentation at every new project;
  
  x) The basic chemical elements, the periodic table, may be very well known and "tested", but the moment the very well-known things are assembled into something bigger, for example organic chemical compounds, bacteria or viruses, it's a whole new game that is equivalent to a walk in the unknown.
  
  The same goes for software. As of 2015 I, as a software developer, always consider the combining of known components a walk in the unknown. It holds for houses, it holds for bacteria, and I do not see a reason, especially given my personal experience, why it should not hold in software development. To put it directly, people who think that they can take well-known and "tested" chess pieces, know all the basic chess rules and think that this is enough to be able to play chess properly, ARE EVEN STUPIDER THAN I AM. I find it very sad that so many human life hours are wasted because many people believe that combining well-known components is not a walk in the unknown.
  1. Konstantin on 07.09.2015 at 11:45
    
    You are reading my post slightly wrong. Of course, things are not black and white: every research activity requires some engineering from time to time, and every engineering activity needs some research. That's why the word "ideally" is used in a couple of places in the definitions. Moreover, there is always uncertainty out there. Strictly speaking, one cannot be perfectly certain about what happens in the next minute.
    
    However, there is still a fundamental difference in the attitude towards "walking in the unknown" in research and engineering projects. For engineering, "walking in the unknown" is an annoyance that should be minimized. You should stop walking as soon as you find a decent exit. If you do not find any exits, your project fails because you did not know enough about the place where you went for a walk, and you should have done your research first.
    
    In research, "walking" is a purpose in its own right. The ultimate goal is to walk through all the corners and "build a map" of the place, so that next time people would not need to wander around as much and could go straight to the nearest exit. As long as you end up with the map, you succeed as far as research is concerned, even if you do not discover any exits.
    
    As I said, this distinction becomes especially confusing in data-centric projects. I've seen a number of projects where the goal was to "build a system for predicting X", yet there was absolutely no clarity at the planning stage as to whether predicting X was even possible given the available data and resources. This uncertainty was considered unimportant and somehow "usual for engineering", which it is not. This is research creeping into engineering departments, where it may, unexpectedly for the management, make projects fail, budgets balloon, and, indeed, many human life hours go to waste.