Four Years Remaining

The Curse of Genomic Coordinates

Posted by Konstantin 25.02.2013
Most of bioinformatics revolves, in one way or another, around the genome. Even in the largely "non-genomic" areas, such as systems biology, proteomics, or metabolomics, the key players are still genes, proteins, and their regulatory regions, which can be associated with particular points on the genome. Consequently, no matter what kind of data you work with, if you do bioinformatics, you will sooner or later have to deal with genomic coordinates.

To interpret genomic coordinates you need to know the reference genome version and coordinate conventions used. Does the data file mention those?

Surprisingly, despite being of such central importance to bioinformatics, the whole genomic coordinate business seems to be in a rather unfriendly state nowadays. Firstly, there are several ways to refer to genomic positions (e.g. 0-based vs 1-based indexing), and for every possible combination of conventions, there is at least one file format that will be using it. Then, of course, you have to deal with several versions of the reference genome, and, to make your life harder yet, most data files will not tell you what genome version should be used to interpret the coordinates stored there. Finally, if you decide that you need to unify the coordinates among your different data files by converting them to the same reference genome version, you will find out that your only tools here are a couple of heavyweight web apps and a command-line UCSC liftOver utility. Should you look for R or Python libraries to script your task, you will discover that those do no smarter than forward all the conversion tasks to that same liftOver.

I found this is all to be extremely wrong and hacked up a tiny Python tool recently, which will happily convert your coordinates without the need to invoke an external liftOver process. It does not fully replace liftOver (yet?), as it does not convert regions - a task just a bit more tricky than lifting over single points. However it lets you do single-point conversion in the simplest way possible:
```
from pyliftover import LiftOver
lo = LiftOver('hg17', 'hg18')
lo.convert_coordinate('chr1', 1000000, '-') # 0-based indexing
```
As usual, install via: easy_install pyliftover (or pip install pyliftover)

For more information consult the PyPI page.
Posted by Konstantin @ 1:28 am

Tags: Bioinformatics, Programming, Project, Python, Tool
9 Comments
1. Veera on 02.03.2015 at 12:37 (Reply)
  
  So, This tool converts just a single point in the genome at once. If at all I want to convert all points in my bed file, I need to write a loop function in python.I found this of no use at all. If I am interested to look at a single point I will easily look at the dbSNP website. Also, I found that this works for only 'hg17' to 'hg18' conversion. It is not working correctly for other conversions, for example 'hg18' to 'hg38'.
  1. Konstantin on 02.03.2015 at 13:55 (Reply)
    
    Well, this is meant for scripting workflows, where writing loops is the normal approach, and "looking at the dbSNP website" is not a solution (moreover, dbSNP would not convert arbitrary coordinates anyway).
    
    I would be happy if you could bring an example of an invalid hg18-to-hg38 conversion, as it would be a bug worth fixing.
2. jdblischak on 27.03.2017 at 19:04 (Reply)
  
  I know this post is years old, but I think it's a shame that the only comment is a negative one. I found PyLiftover to be very useful. I was writing a script to convert a custom genotype file format into the standard VCF. Being able to convert line-by-line in my Python script allowed me to avoid the cumbersome process of first converting the file to BED, then running liftOver, and then trying to combine this back with the information in the original file format. It also allowed me to easily log if the conversion was successful or not.
  
  In short, thanks for writing, sharing, and advertising this useful software.
  1. Konstantin on 27.03.2017 at 21:59 (Reply)
    
    Hey, thanks for the positive feedback!
3. Harlan on 29.03.2017 at 18:41 (Reply)
  
  Oddly enough, I am just starting to use this tool also and have to say my thanks. I have previously been using Ensembl REST to convert from hg19 to GRCh38 and while that is fine for a small number of queries, its not good enough for the large numbers of variant positions I am converting from the GTEx project. Comparatively, this is insanely fast and is saving me a lot of time. Seems that unfortunately your first comment was from someone who didn't know what they were doing. However, from me, great thanks!
4. Leon on 15.10.2017 at 12:49 (Reply)
  
  Thanks for the tool.
  Could you tell me what does the second number mean (2290061362) ?
  
  >>> lo.convert_coordinate('chrY', 6848769)
  [('chrY', 6832130, '+', 2290061362)]
  1. Konstantin on 15.10.2017 at 14:53 (Reply)
    
    Citing the doc for convert_coordinate (accessible as help(lo.convert_coordinate) or lo.convert_coordinate? if you use ipython):
    
    Returns a list of possible conversions for a given chromosome position.
    The list may be empty (no conversion), have a single element (unique conversion), or several elements (position mapped to several chains).
    The list contains tuples (target_chromosome, target_position, target_strand, conversion_chain_score),
    where conversion_chain_score is the "alignment score" field specified at the chain used to perform conversion. If there
    are several possible conversions, they are sorted by decreasing conversion_chain_score. ...
5. Isaac on 16.12.2017 at 01:28 (Reply)
  
  I'm also finding this useful at this late juncture.
  
  It was ahead of its time, I guess, in the sense that lots of folks weren't working between hg38 and hg19 on the regular like we are now.
6. Adam on 21.10.2020 at 08:01 (Reply)
  
  Just another testimony that this was a useful script for me. I couldn't figure out how to make any vcf file remapping program to work so I used this library and strung together a simple python script and finally had success. Thanks!
Leave a comment

Name (required)

E-Mail:(not displayed)(required)

Website:

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.

Reply to:

February 2013
M	T	W	T	F	S	S
« Jan		Apr »
	1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28

Oli on The Data Science Workflow
Adam on The Curse of Genomic Coordinates
second on How to Send an SMS
6 Regularization Techniques for Deep Learning | Python | Keras - AI ASPIRANT on The Mystery of Early Stopping
Aldo D'Ottavio on What is the Covariance Matrix?

The Curse of Genomic Coordinates

9 Comments

Leave a comment

Calendar

Recent Comments

Archives