• Posted by Konstantin 26.09.2012

    The issues related to scientific publishing, peer review, and funding always make for popular discussion topics at conferences. In fact, the ongoing ECML PKDD 2012 had a whole workshop where researchers could discuss some of their otherwise interesting results that were hard or impossible to publish. The rejection reasons ranged from "a negative result" or "too small to be worthy of publication" to "lack of theoretical justification". The overall consensus seemed to be that this is indeed a problem, at least in the field of machine learning.

    The gist of the problem is the following. Machine learning relies heavily on computational experiments: empirically measuring the performance of methods in various contexts. The current mainstream methodology suggests that such experiments should primarily play a supportive role, either demonstrating a general theoretical statement or simply measuring the exact magnitude of an otherwise obvious benefit. This, unfortunately, leaves no room for "unexpected" experimental results, where the measured behaviour of a method either contradicts the available theory or is at least not explained by it. Including such results in papers is very difficult, if not impossible, as they get criticised heavily by the reviewers. A reviewer expects all results in the paper to make sense. If anything is strange, it should either be explained or be discarded as a mistake. This is a natural part of the quality assurance process in science as a whole.

    Quite often, though, unexpected results in computational experiments do happen. They typically have little relevance to the main topic of the paper, and the burden of explaining them can be just too large for a researcher to pursue. It is much easier to either drop the corresponding measurement or find a dataset that behaves "nicely". As a result, a lot of relevant information about such cases never sees the light of day. Thus, other researchers keep stumbling upon similar unexpected results, again and again, only to shelve them away in turn.

    The problem would not exist if researchers cared to, say, write up such results as blog posts or tech reports on ArXiv, thus making the knowledge available. However, even formulating the unexpected discoveries in writing, let alone going any deeper, is often regarded as a waste of time that won't get the researcher much (if any) credit. Indeed, due to how scientific funding works nowadays, the only kind of credit that counts for a scientist is (co-)authoring a publication in a "good" journal or conference.

    I believe that with time, science will evolve to naturally accommodate such smaller pieces of research into its process (mini-, micro-, nano-publications?), providing the necessary incentives for researchers to expose, rather than shelve, their "unexpected" results. Meanwhile, though, other methods could be employed, and one of the ideas I find interesting is a concept I'd call "co-authorship licensing".

    Instead of ignoring a "small", "insignificant", or "unexpected" result, the researcher should consider publishing it as either a blog post or a short (yet properly written) tech report. He should then add an explicit notice that the material may be referred to, cited, or used as-is in a "proper" publication (a journal or conference paper), on the condition that the author of the post be included in the paper's author list.

    I feel there could be multiple benefits to such an approach. Firstly, it non-invasively addresses the drawbacks of the current science funding model. If being listed as a co-author is the only real credit that counts in the scientific world, why not use it explicitly and thus make it possible to effectively "trade" smaller pieces of research? Secondly, it enables a meaningful separation of work. "Doing research" and "publishing papers" are two very different types of activities. Some scientists, who are good at producing interesting experimental results or observations, can be completely helpless when it comes to the task of getting their results published. On the other hand, those who are extremely talented at presenting and organizing results into high-quality papers may often prefer the actual experimentation to be done by someone else. Currently, the two activities have to be performed by the same person or, at best, by people working in the same lab. Otherwise, if the obtained results are not immediately "properly" published, there is no incentive for researchers to expose them. "Co-authorship licensing" could provide this incentive, acting as an open call for collaboration at the same time. (In fact, the somewhat ugly "licensing" term could be replaced with a friendlier equivalent, such as "open collaboration invitation". I do feel, though, that it is more important to stress that others are allowed to collaborate than that someone is invited to.)

    I'll conclude with three hypothetical examples.

    • A Bachelor's student makes a nice empirical study of System X in his thesis, but has no idea how to turn it into a journal article. He publishes his work on ArXiv under a "co-authorship license", where it is found by a PhD student working in this area, who was lacking exactly those results for his next paper.
    • A data miner at company X, as a side effect of his work, ends up with a large-scale evaluation of learning algorithm Y on an interesting dataset. He puts those results up as a "co-authorship licensed" report. It is discovered by a researcher who is preparing a review paper about algorithm Y and is happy to include such results.
    • A bioinformatician discovers unexpected behaviour of algorithm X on a particular dataset. He writes his findings up as a blog post with a "co-authorship license", where those are discovered by a machine learning researcher, who is capable of explaining the results, putting them in context, and turning them into an interesting paper.

    It seems to me that without "co-authorship licensing" the situations above would lead to nothing productive, as they do nowadays.

    Of course, this all will only make sense once many people give it a thought. Unfortunately, no one reads this blog 🙂



  • 4 Comments

    1. Albrecht Zimmermann on 02.10.2012 at 20:34 (Reply)

      You know that I agree with much of what you said, but I still believe that the main problem remains: under the current financing setup there is no motivation for anyone to actually credit the "co-author".

      To make this simple: if I am someone angling for a permanent position who aims to pad his publication record, then crediting someone else at the same stage of their career gives a competitor a boost even as it gives a boost to me.

      In the case of a proper tech report, this cannot really be avoided (which is also why I feel that equating blog posts and tech reports is misleading), since it would be a registered publication, in the same way that it cannot be avoided if it were a "real" collaborator.

      In the case of a non-traditional publication (like the aforementioned blog post), there is a strong incentive to actually NOT credit the original writer and put the burden of proof on them. We see in the current climate how "properly published" work in conferences gets ignored for the sake of improving one's publication chances. How much more likely is it that this would happen to a blog post one could disavow?

      Your model will work great under a collaborative approach to science. Unfortunately, we currently have a competitive one.

      1. Konstantin on 02.10.2012 at 21:19 (Reply)

        Well, firstly, if a piece of work is explicitly published under a certain "license", then the motivation to credit should stem from plain honesty. Yes, this is an assumption, but observing that there are similar honesty-based systems (GPL, copyright, patenting) that actually do work to a fairly large extent, it does not seem too improbable to me.

        Secondly, I find it especially hard to imagine that someone striving for a competitive academic career would risk staining his reputation, instead of simply adding a co-author (especially if such acknowledgement is duly deserved).

        Thirdly, I tend to disagree with the (otherwise popular) notion of a "registered publication". In practice (at least from the perspective of a reviewer), whatever is "googleable" should be considered far more "registered" than an obscure tech report (or even, excuse me, an obscure Springer publication) without direct access to it.

        It might be the case that, indeed, a report formatted as a PDF and uploaded to ArXiv "feels" more authoritative and thus better "motivates honesty", but this is a fairly tiny (albeit curious) psychological detail, isn't it?

    2. Dominique Unruh on 02.10.2012 at 22:37 (Reply)

      I am not sure whether I understand the approach correctly, but if I do, then there seems to be the following problem:

      We (scientists) have the principle that all research is "free" in the sense that you can use a published result in your work. If I write a paper, I am not allowed to say "you may use this theorem / refer to this result only if you pay me". And it does not matter what kind of payment I require, be it co-authorship or money.

      Of course, if I include some result as a whole in my paper (in a way similar to having some co-author write a section of the paper), then I would include the author of that result as a co-author of my paper.

      But in many cases, I would naturally just "use" a result. In the same way I would use it if it were already properly published. I'd write something like "in [1], the following theorem was shown" or "in [1], it was experimentally shown that ... We summarize their findings."

      I don't think we should disallow citing blogs. But if we don't disallow that, then that's how I would use the results published on blogs.

      (Also: if I don't know the author of the blog article, I might not want to work with him on my paper. With some people, coordinating the writing of a paper can be troublesome. Nor would I wish to polish someone else's write-up to make it suitable for inclusion in my paper. Yet I'd probably have to do that: a blog article might not be written with enough care to pass the reviewers, because some people practice fast-typing in their blogs.)

      1. Konstantin on 03.10.2012 at 02:52 (Reply)

        I agree with the principle that research should be "free"; at the same time, however, I do not think that "free" means "not giving due credit".

        My suggestion here is not so much about prohibiting other people from citing blogs, but rather about giving them explicit rights to take charge of writing up the results we are unwilling or unable to publish into something that "counts" scientifically. Consider again any of the examples I provided.

        As for referencing blog entries, you noted yourself that most blog posts would never pass a proper review and as such are not considered a "part of science". Thus, you can't usually cite them. However, what if someone's fast-typed blog posts are in fact worthy of becoming a part of science, given some additional work (say, polishing the text and adding some more theory)? Obviously, the blog post author can only benefit from his work making its way into Nature. Also, the person who decided to turn the blog post into a Nature paper is incentivised, as he has some of the work done for him. And no, the parties are not forced to actively coordinate in collaboration. The "user" is solely responsible for the publication.

        I wonder what would happen if such a practice were adopted. What could be done to enable it? Is it the only meaningful way to enable and incentivise "micropublications" in the current scientific world?
