Most of bioinformatics revolves, in one way or another, around the genome. Even in the largely "non-genomic" areas, such as systems biology, proteomics, or metabolomics, the key players are still genes, proteins, and their regulatory regions, which can be associated with particular points on the genome. Consequently, no matter what kind of data you work with, if you do bioinformatics, you will sooner or later have to deal with genomic coordinates.
Surprisingly, despite being of such central importance to bioinformatics, the whole genomic coordinate business seems to be in a rather unfriendly state nowadays. Firstly, there are several ways to refer to genomic positions (e.g. 0-based vs 1-based indexing), and for every possible combination of conventions, there is at least one file format that will be using it. Then, of course, you have to deal with several versions of the reference genome, and, to make your life harder yet, most data files will not tell you what genome version should be used to interpret the coordinates stored there. Finally, if you decide that you need to unify the coordinates among your different data files by converting them to the same reference genome version, you will find out that your only tools here are a couple of heavyweight web apps and a command-line UCSC liftOver utility. Should you look for R or Python libraries to script your task, you will discover that those do no smarter than forward all the conversion tasks to that same liftOver.
I found this is all to be extremely wrong and hacked up a tiny Python tool recently, which will happily convert your coordinates without the need to invoke an external liftOver process. It does not fully replace liftOver (yet?), as it does not convert regions - a task just a bit more tricky than lifting over single points. However it lets you do single-point conversion in the simplest way possible:
from pyliftover import LiftOver
lo = LiftOver('hg17', 'hg18')
lo.convert_coordinate('chr1', 1000000, '-') # 0-based indexing
As usual, install via: easy_install pyliftover
(or pip install pyliftover
)
For more information consult the PyPI page.
So, This tool converts just a single point in the genome at once. If at all I want to convert all points in my bed file, I need to write a loop function in python.I found this of no use at all. If I am interested to look at a single point I will easily look at the dbSNP website. Also, I found that this works for only 'hg17' to 'hg18' conversion. It is not working correctly for other conversions, for example 'hg18' to 'hg38'.
Well, this is meant for scripting workflows, where writing loops is the normal approach, and "looking at the dbSNP website" is not a solution (moreover, dbSNP would not convert arbitrary coordinates anyway).
I would be happy if you could bring an example of an invalid hg18-to-hg38 conversion, as it would be a bug worth fixing.
I know this post is years old, but I think it's a shame that the only comment is a negative one. I found PyLiftover to be very useful. I was writing a script to convert a custom genotype file format into the standard VCF. Being able to convert line-by-line in my Python script allowed me to avoid the cumbersome process of first converting the file to BED, then running liftOver, and then trying to combine this back with the information in the original file format. It also allowed me to easily log if the conversion was successful or not.
In short, thanks for writing, sharing, and advertising this useful software.
Hey, thanks for the positive feedback!
Oddly enough, I am just starting to use this tool also and have to say my thanks. I have previously been using Ensembl REST to convert from hg19 to GRCh38 and while that is fine for a small number of queries, its not good enough for the large numbers of variant positions I am converting from the GTEx project. Comparatively, this is insanely fast and is saving me a lot of time. Seems that unfortunately your first comment was from someone who didn't know what they were doing. However, from me, great thanks!
Thanks for the tool.
Could you tell me what does the second number mean (2290061362) ?
>>> lo.convert_coordinate('chrY', 6848769)
[('chrY', 6832130, '+', 2290061362)]
Citing the doc for
convert_coordinate
(accessible ashelp(lo.convert_coordinate)
orlo.convert_coordinate?
if you useipython
):I'm also finding this useful at this late juncture.
It was ahead of its time, I guess, in the sense that lots of folks weren't working between hg38 and hg19 on the regular like we are now.
Just another testimony that this was a useful script for me. I couldn't figure out how to make any vcf file remapping program to work so I used this library and strung together a simple python script and finally had success. Thanks!