HLA-VBSeq is software to estimate the most likely HLA types from high-throughput sequencing data.
Read data (FASTQ), or read data aligned to GRCh37/hg19 in BAM format. (e.g. NA12878.sorted.bam)
HLA types estimated from sequencing data
If the JAR files do not work, try the following:
HLA v2 database based on IMGT/HLA database Release 3.31.0 and Japanese HLA reference dataset (Mimori et al Pharmacogenomics J. 2018) New 2018/10/12
HLA v1 database based on IMGT/HLA database, Release 3.15.0
Please align read data (FASTQ file) to GRCh37/hg19 Please convert sam file into bam file, and then sort the bam file by genomic coordinate
Extract a list of read name that were aligned to HLA loci (HLA-A, B, C, DM, DO, DP, DQ, DR, E, F, G, H, J, K, L, P, V, MIC, and TAP) samtools view NA12878.sorted.bam chr6:29907037-29915661 chr6:31319649-31326989 chr6:31234526-31241863 chr6:32914391-32922899 chr6:32900406-32910847 chr6:32969960-32979389 chr6:32778540-32786825 chr6:33030346-33050555 chr6:33041703-33059473 chr6:32603183-32613429 chr6:32707163-32716664 chr6:32625241-32636466 chr6:32721875-32733330 chr6:32405619-32414826 chr6:32544547-32559613 chr6:32518778-32554154 chr6:32483154-32559613 chr6:30455183-30463982 chr6:29689117-29699106 chr6:29792756-29800899 chr6:29793613-29978954 chr6:29855105-29979733 chr6:29892236-29899009 chr6:30225339-30236728 chr6:31369356-31385092 chr6:31460658-31480901 chr6:29766192-29772202 chr6:32810986-32823755 chr6:32779544-32808599 chr6:29756731-29767588 | awk '{print $$1}' | sort | uniq > NA12878_partial_reads.txt
Build read name index and search read pairs and their sequences on HLA loci java -jar -Xmx32g -Xms32g bamNameIndex.jar index NA12878.sorted.bam --indexFile NA12878.sorted.bam.idx java -jar bamNameIndex.jar search NA12878.sorted.bam --name NA12878_partial_reads.txt --output NA12878_partial.sam java -jar SamToFastq.jar I=NA12878_partial.sam F=NA12878_partial_1.fastq F2=NA12878_partial_2.fastq If for some reason bamNameIndex.jar doesn't work, please use bedtools to extract reads from bam files: http://bedtools.readthedocs.org/en/latest/content/tools/bamtofastq.html Or, alternatively, below is a python script that uses pysam to extract reads by read name from a bam file: http://timoast.github.io/2015/10/12/ExtractReads/ Or, please try samgrep: https://github.com/lindenb/jvarkit/wiki/SamGrep
Extract unmapped reads samtools view -bh -f 12 NA12878.sorted.bam > NA12878.sorted_unmapped.bam java -jar SamToFastq.jar I=NA12878.sorted_unmapped.bam F=NA12878_unmapped_1.fastq F2=NA12878_unmapped_2.fastq
Combine reads in FASTQ format cat NA12878_partial_1.fastq NA12878_unmapped_1.fastq > NA12878_part_1.fastq cat NA12878_partial_2.fastq NA12878_unmapped_2.fastq > NA12878_part_2.fastq
For analysis with v2 HLA database:
Alignment by BWA-MEM allowing multiple alignments for each read bwa index hla_all_v2.fasta bwa mem -t 8 -P -L 10000 -a hla_all_v2.fasta NA12878_part_1.fastq NA12878_part_2.fastq > NA12878_part.sam
Estimation of HLA types by HLA-VBSeq For paired-end read data: java -jar HLAVBSeq.jar hla_all_v2.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01 --is_paired For single-end read data: java -jar HLAVBSeq.jar hla_all_v2.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01 Here, alpha_zero is a hyperparameter as described in the paper and we recommend to use 0.01.
For analysis with v1 HLA database (old database):
Alignment by BWA-MEM allowing multiple alignments for each read bwa index hla_all.fasta bwa mem -t 8 -P -L 10000 -a hla_all.fasta NA12878_part_1.fastq NA12878_part_2.fastq > NA12878_part.sam
Estimation of HLA types by HLA-VBSeq For paired-end read data: java -jar HLAVBSeq.jar hla_all.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01 --is_paired For single-end read data: java -jar HLAVBSeq.jar hla_all.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01 Here, alpha_zero is a hyperparameter as described in the paper and we recommend to use 0.01.
Output file format 1st column: HLA allele ID (e.g. HLA00001) 2nd column: genomic locus length (in bp) in fasta format (e.g. 3503 bp) 3rd column: Z, the number of reads that the algorithm assigned to the HLA allele (e.g. 300 reads) 4th column: Normalized number of reads (Fragments Per Kilobase per Million mapped fragments) 5th column: Relative abundance, theta, which is Z divided by the number of total mapped reads.
How HLA typing result looks like (e.g. HLA-A) For v2 HLA database:
./parse_result.pl Allelelist_v2.txt NA12878_result.txt | grep "^A\*" | sort -k2 -n -r > HLA_A.txt less HLA_A.txtFor v1 HLA database (old database):
./parse_result.pl Allelelist.txt NA12878_result.txt | grep "^A\*" | sort -k2 -n -r > HLA_A.txt less HLA_A.txt A*01:01:01:01 17.4022266628604 A*11:01:01 12.0376819868684 1st column: HLA allele name 2nd column: Average depth of coverage Here, the perl script, parse_result.pl, calculates the average depth of coverage for each HLA allele. Please modify "parse_result.pl" according to your data. This perl script assumes 100bp x 2 data as in the paper.
$(PYTHON2.7) call_hla_digits.py -v NA12878_result.txt -a Allelelist.txt -r 90 -d 4 --ispaired > NA12878_report.d4.txt -v xxxxx_result.txt : Need to set the output file from the HLA-VBSeq -a Allelelist.txt : IMGT HLA Allelelist -r 90 : mean single read length (mean_rlen) -d 4 : HLA call resolutioni4 or 6 or 8j --ispaired : if set, twice the mean rlen for depth calculation (need to specify when the sequenced data is paired-end protocol)
Any published works where HLA-VBSeq has been used in data analysis should include a citation:
HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data
Naoki Nariai, Kaname Kojima, Sakae Saito, Takahiro Mimori, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Jun Yasuda and Masao Nagasaki
BMC Genomics 2015, 16(Suppl 2):S7
To use v2 please cite the following manuscript as a reference:
Construction of full-length Japanese reference panel of class I HLA genes with single-molecule, real-time sequencing, Mimori T, Yasuda J, Kuroki Y, Shibata TF, Katsuoka F, Saito S, Nariai N, Ono A, Nakai-Inagaki N, Misawa K, Tateno K, Kawai Y, Fuse N, Hozawa A, Kuriyama S, Sugawara J, Minegishi N, Suzuki K, Kinoshita K, Nagasaki M and Yamamoto M Pharmacogenomics J. 2018
HLA-VBSeq, (c) 2015, Tohoku University (the "Software")
The Software remains the property of Tohoku University.
The Software is distributed "AS IS" under this License solely for non-commercial use in the hope that it will be useful, but in order that Tohoku University as the intellectual property right holder of the software protects its assets for the benefit of its educational and research purposes, Tohoku University makes clear that no condition is made or to be implied, nor is any warranty given or to be implied, as to the accuracy of the Software, or that it will be suitable for any particular purpose or for use under any specific conditions. Furthermore, Tohoku University disclaims all responsibility for the use which is made of the Software. It further disclaims any liability for the outcomes arising from using the Software.
The Licensee agrees to indemnify Tohoku University and hold Tohoku University harmless from and against any and all claims, damages and liabilities asserted by third parties (including claims for negligence) which arise directly or indirectly from the use of the Software or the sale of any products based on the Software.
No part of the Software may be reproduced, modified, reverse engineered, transmitted or transferred in any form or by any means, electronic or mechanical, without the express permission of Tohoku University.
You are not permitted under this License to use the Software commercially. Use for which any financial return is received shall be defined as commercial use, and includes (1) integration of all or part of the source code or the Software into a product for sale or license by or on behalf of Licensee to third parties or (2) use of the Software or any derivative of it for research with the final aim of developing software products for sale or license to a third party or (3) use of the Software or any derivative of it for research with the final aim of developing non-software products for sale or license to a third party, or (4) use of the Software to provide any service to an external organization for which payment is received. If you are interested in using the Software commercially, please contact Tohoku University ( nariai@megabank.tohoku.ac.jp and nagasaki@megabank.tohoku.ac.jp ), to negotiate a license.