• Posted by Konstantin 05.04.2015 4 Comments

    When it comes to data analysis, there are hundreds of exciting approaches: simple summary statistics and hypothesis tests, various clustering methods, linear and nonlinear regression or classification techniques, neural networks of various types and depths, decision rules and frequent itemsets, feature extractors and dimension reductors, ensemble methods, bayesian approaches and graphical models, logic-based approaches and fuzzy stuff, ant colonies, genetic algorithms and other optimization methods, monte-carlo algorithms, sampling and density estimation, logic-based and graph methods. Don't even get me started on the numerous visualization techniques.

    This sheer number of options is, however, both a blessing and a curse at the same time. In many practical situations just having those methods at your disposal may pose more problems than solutions. First you need to pick one of the approaches that might possibly fit your purpose. Then you will try to adapt it appropriately, spend several iterations torturing the data only to obtain very dubious first results, come to the conclusion that most probably you are doing something wrong, reconvince yourself that you need to try harder in that direction, spend some more iterations testing various parameter settings. Nothing works as you want it to, so you start everything from scratch with another method to find yourself obtaining new, even more dubious results, torturing the data even further, getting tired of that and finally settling on something "intermediately decent", which "probably makes sense", although you are not so sure any more and feel frustrated.

    I guess life of a statistician was probably way simpler back in the days when you could run a couple of t-tests, or an F-test from a linear regression and call it a day. In fact, it seems that many experimental (e.g. wetlab) scientists still live in that kind of world, when it comes to analyzing their experimental results. The world of T-tests is cozy and safe. They don't get you frustrated. Unfortunately, t-tests can feel ad-hockish, because they force you to believe that something "is normally distributed". Also, in practice, they are mainly used to confirm the obvious rather than discover something new from the data. A simple scatterplot will most often be better than a t-test as an analysis method. Hence, I am not a big fan of T-tests. However, I do have my own favourite statistical method, which always feels cozy and safe, and never gets me frustrated. I tend to apply it whenever I see a chance. It is the Fisher exact test in the particular context of feature selection.

    My appreciation of it stems from my background in bioinformatics and some experience with motif detection in particular. Suppose you have measured the DNA sequences for a bunch of genes. What can you do to learn something new about the sequence structure from that data? One of your best bets is to first group your sequences according to some known criteria. Suppose you know from previous experiments that some of the genes are cancer-related whereas others are not. As soon as you have specified those groups, you can start making observations like the following: "It seems that 10 out of my 20 cancer-related genes have the subsequence GATGAG in their DNA code. The same sequence is present in only 5 out of 100 non-cancer-related ones. How probable would it be to obtain similar counts of GATGAG, if the two groups were picked randomly?" If the probability to get those counts at random is very low, then obviously there is something fishy about GATGAG and cancer - perhaps they are related. To compute this probability you will need to use the hypergeometric distribution, and the resulting test (i.e. the question "how probable is this situation in a random split?") is known as the Fishers' exact test.

    This simple logic (with a small addition of a multiple testing correction on top) has worked wonders for finding actually important short sequences on the DNA. Of course it is not limited to sequence search. One of our research group's most popular web tools uses the same approach to discover functional annotations, that are "significantly overrepresented" in a given group of genes. The same approach can be used to construct decision trees, and in pretty much any other "supervised learning" situation, where you have groups of objects and want to find binary features of those objects, associated with the groups.

    Although in general the Fisher test is just one particular measure of association, it is, as I noted above, rather "cozy and comfortable". It does not force me to make any weird assumptions, there is no "ad-hoc" aspect to it, it is simple to compute and, most importantly, in my experience it nearly always produces "relevant" results.

    Words overrepresented in the speeches of Greece MPs

    Words overrepresented in the speeches of Greece MPs

    A week ago me, Ilya and Alex happened to take part in a small data analysis hackathon, dedicated to the analysis of speech transcripts from the European Parliament. Somewhat analogously to DNA sequences, speeches can be grouped in various ways: you can group them by the speaker who gave them, by country, gender or political party of that speaker, by the month or year when the speech was given or by any combination of such groupings. The obvious "features" of a speech are words, which can be either present or not present in it. Once you view the problem this way the task of finding group-specific words becomes self-evident and the Fisher test is the natural solution to it. We implemented this idea and extracted "country-specific" and "time-specific" words from the speeches (other options were left out due to time constraints). As is usual the case with my favourite method, the obtained results look relevant, informative and, when shown in the form of a word cloud, fun. Check them out.

    The complete source code of the analysis scripts and the visualization application is available on Github.


    Tags: , , , , , , ,

  • Posted by Konstantin 22.10.2008 No Comments

    A beloved child has many names, and so does the field of pattern analysis. Some people call it soft computing, others refer to it as machine learning or data-driven statistics. Although in practice all those terms denote the same approach to data analysis, there is always a certain bias hidden behind each name that relates it to its history and origins. I'm not sure whether I understand these biases correctly, but here's how I would define the terms:

    Pattern Analysis
    This is probably the most general term that refers to a kind of data analysis, where the goal is to find "something interesting" in a given dataset. We typically know what kind of things (patterns) we consider interesting (this might be an association rule, a frequent subsequence, a classifier, etc), and the task is to detect the instances of this kind of patterns. Conceptually somewhat opposite to pattern analysis would be statistical hypothesis testing, where the task is to test a given pattern for interestingness. In this sense pattern analysis is very close to the multiple hypothesis testing statistical problem.
    The simplest way to formalize this generic notion I've seen in this exposition by Tilj de Bie. Let D denote the dataset, let \Pi be the pattern space (e.g. "the space of all sequences", or "the space of all classifiers", etc), and let for each \pi\in\Pi there be a pattern function p_\pi(\cdot), so that p_\pi(D) measures the "interestingness" of this pattern with respect to the data D. Then the general task of pattern analysis is just the following maximization problem:

    \max_{\pi\in\Pi} p_\pi(D)

    Machine Learning
    Machine learning can be regarded as a specific type of pattern analysis, where the central interest lies in finding classifiers or regression functions. The dataset in this case consists of a number of "instances" x_i \in X, and the task is to find a function f: X \to Y that maps these instances to corresponding "outputs" \hat{y}_i = f(x_i) in various ways.
    In the case of supervised machine learning, a "true" output y_i is provided with each instance x_i, and the resulting function should aim to match this output for given instances (i.e. f(x_i) \approx y_i) as well as generalize to the unseen ones. Statisticians call this task regression analysis. A classical example is the problem of detecting incoming spam mail by learning the classification rule from previously labeled messages.
    The task of unsupervised machine learning is to find a classifier f without any prespecified labeling of data. Typically one searches for a function that maps similar inputs to similar outputs. If the set of outputs Y is small and discrete, the task is referred to as clustering and somewhat resembles the problem of quantization. If the outputs are continuous, the task can often be related to either factor analysis or noise reduction.
    Finally, the problem of semi-supervised machine learning is to find a classifier for a dataset where some instances have labels and some don't. This can sometimes be regarded as task of missing value imputation.
    The typical approach to all types of machine learning task usually employs the definition of a loss measure L(f), that estimates the goodness of fit for a given classifier f, and then searching for a classifier that optimizes this measure, i.e.:

    \max_f L(f)

    This defines machine learning as a certain kind of pattern analysis.

    Soft Computing
    If the previous terms denote purely mathematical approaches, for which the exact computational method of obtaining a solution is somewhat irrelevant, then this term, on the contrary, denotes a set of certain metaheuristical computational techniques for obtaining these solutions. In the narrowest sense, by soft computing one might collectively denote the areas of evolutionary algorithms, fuzzy logic and neural networks. This gives soft computing a somewhat ad-hockish flavour: "If you don't know how to solve it, simply put it into a neural network, neural networks just work".

    Data Mining
    This might be just my opinion, but I define data mining as any kind of pattern analysis, that is applied in an offline setting. Moreover, I'd say that, ideally, true data mining happens when you attempt to search for patterns in data that was not initially meant for that specific analysis, thus contradicting the classical statistical ideology "specify your hypothesis first, collect the data later".

    Knowledge Discovery from Databases
    As far as I understand it, KDD is just a YABA and marketingspeak for "data mining", somewhat in the same manner as OLAP and Business Intelligence are marketingspeak for "descriptive statistics".

    Data-Driven Statistics
    In the narrowest sense, data-driven statistics denotes all kinds of nonparametric statistical approaches. These are those methods, where one manages to perform data analysis without specifying any parameterized distributions for the inputs or the like. Typical examples would be the various resampling and randomization techniques. In the broad sense, again, this might very much be the same as pattern analysis in general.

    Any other popular terms for the same thing I've forgotten here?

    Tags: , , , , ,