• Posted by Konstantin 25.02.2013

    Most of bioinformatics revolves, in one way or another, around the genome. Even in the largely "non-genomic" areas, such as systems biologyproteomics, or metabolomics, the key players are still genes, proteins, and their regulatory regions, which can be associated with particular points on the genome. Consequently, no matter what kind of data you work with, if you do bioinformatics, you will sooner or later have to deal with genomic coordinates.

    To interpret genomic coordinates you need to know the reference genome version and coordinate conventions used. Does the data file mention those?

    To interpret genomic coordinates you need to know the reference genome version and coordinate conventions used. Does the data file mention those?

    Surprisingly, despite being of such central importance to bioinformatics, the whole genomic coordinate business seems to be in a rather unfriendly state nowadays. Firstly, there are several ways to refer to genomic positions (e.g. 0-based vs 1-based indexing), and for every possible combination of conventions, there is at least one file format that will be using it. Then, of course, you have to deal with several versions of the reference genome, and, to make your life harder yet, most data files will not tell you what genome version should be used to interpret the coordinates stored there. Finally, if you decide that you need to unify the coordinates among your different data files by converting them to the same reference genome version, you will find out that your only tools here are a couple of heavyweight web apps and a command-line UCSC liftOver utility. Should you look for R or Python libraries to script your task, you will discover that those do no smarter than forward all the conversion tasks to that same liftOver.

    I found this is all to be extremely wrong and hacked up a tiny Python tool recently, which will happily convert your coordinates without the need to invoke an external liftOver process. It does not fully replace liftOver (yet?), as it does not convert regions - a task just a bit more tricky than lifting over single points. However it lets you do single-point conversion in the simplest way possible:

    1. from pyliftover import LiftOver
    2. lo = LiftOver('hg17', 'hg18')
    3. lo.convert_coordinate('chr1', 1000000, '-') # 0-based indexing

    As usual, install via: easy_install pyliftover (or pip install pyliftover)

    For more information consult the PyPI page.

    Posted by Konstantin @ 1:28 am

    Tags: , , , ,

  • 2 Comments

    1. Veera on 02.03.2015 at 12:37 (Reply)

      So, This tool converts just a single point in the genome at once. If at all I want to convert all points in my bed file, I need to write a loop function in python.I found this of no use at all. If I am interested to look at a single point I will easily look at the dbSNP website. Also, I found that this works for only 'hg17' to 'hg18' conversion. It is not working correctly for other conversions, for example 'hg18' to 'hg38'.

      1. Konstantin on 02.03.2015 at 13:55 (Reply)

        Well, this is meant for scripting workflows, where writing loops is the normal approach, and "looking at the dbSNP website" is not a solution (moreover, dbSNP would not convert arbitrary coordinates anyway).

        I would be happy if you could bring an example of an invalid hg18-to-hg38 conversion, as it would be a bug worth fixing.

    Leave a comment

    Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.