HLA-VBSeq

HLA-VBSeq is software to estimate the most likely HLA types from high-throughput sequencing data.

Input:

Read data (FASTQ), or read data aligned to GRCh37/hg19 in BAM format. (e.g. NA12878.sorted.bam)

Output:

HLA types estimated from sequencing data

Download:

If the JAR files do not work, try the following:

SamToFastq.jar (please download from the Picard website)
parse_result.pl
call_hla_digits.py

HLA v2 database based on IMGT/HLA database Release 3.31.0 and Japanese HLA reference dataset (Mimori et al Pharmacogenomics J. 2018) New 2018/10/12

HLA v1 database based on IMGT/HLA database, Release 3.15.0

hla_all.fasta (constructed from the IMGT/HLA database, Release 3.15.0, 2014-01-17. It is recommended to use the latest version.)
Allelelist.txt (obtained from the IMGT/HLA database, Release 3.15.0, 2014-01-17. It is recommended to use the latest version.)

Usage (from FASTQ file):

Please align read data (FASTQ file) to GRCh37/hg19

Please convert sam file into bam file, and then sort the bam file by genomic coordinate

Usage (from BAM file):

Extract a list of read name that were aligned to HLA loci (HLA-A, B, C, DM, DO, DP, DQ, DR, E, F, G, H, J, K, L, P, V, MIC, and TAP)

    samtools view NA12878.sorted.bam chr6:29907037-29915661 chr6:31319649-31326989 chr6:31234526-31241863
        chr6:32914391-32922899 chr6:32900406-32910847 chr6:32969960-32979389 chr6:32778540-32786825
        chr6:33030346-33050555 chr6:33041703-33059473 chr6:32603183-32613429 chr6:32707163-32716664
        chr6:32625241-32636466 chr6:32721875-32733330 chr6:32405619-32414826 chr6:32544547-32559613
        chr6:32518778-32554154 chr6:32483154-32559613 chr6:30455183-30463982 chr6:29689117-29699106
        chr6:29792756-29800899 chr6:29793613-29978954 chr6:29855105-29979733 chr6:29892236-29899009
        chr6:30225339-30236728 chr6:31369356-31385092 chr6:31460658-31480901 chr6:29766192-29772202
        chr6:32810986-32823755 chr6:32779544-32808599 chr6:29756731-29767588
        | awk '{print $$1}' | sort | uniq > NA12878_partial_reads.txt

Build read name index and search read pairs and their sequences on HLA loci

    java -jar -Xmx32g -Xms32g bamNameIndex.jar index NA12878.sorted.bam --indexFile NA12878.sorted.bam.idx
    java -jar bamNameIndex.jar search NA12878.sorted.bam --name NA12878_partial_reads.txt --output NA12878_partial.sam
    java -jar SamToFastq.jar I=NA12878_partial.sam F=NA12878_partial_1.fastq F2=NA12878_partial_2.fastq
   
    If for some reason bamNameIndex.jar doesn't work, please use bedtools to extract reads from bam files:
    http://bedtools.readthedocs.org/en/latest/content/tools/bamtofastq.html

    Or, alternatively, below is a python script that uses pysam to extract reads by read name from a bam file:
    http://timoast.github.io/2015/10/12/ExtractReads/

    Or, please try samgrep:
    https://github.com/lindenb/jvarkit/wiki/SamGrep

Extract unmapped reads

    samtools view -bh -f 12 NA12878.sorted.bam > NA12878.sorted_unmapped.bam
    java -jar SamToFastq.jar I=NA12878.sorted_unmapped.bam F=NA12878_unmapped_1.fastq F2=NA12878_unmapped_2.fastq

Combine reads in FASTQ format

    cat NA12878_partial_1.fastq NA12878_unmapped_1.fastq > NA12878_part_1.fastq
    cat NA12878_partial_2.fastq NA12878_unmapped_2.fastq > NA12878_part_2.fastq

For analysis with v2 HLA database:

Alignment by BWA-MEM allowing multiple alignments for each read

    bwa index hla_all_v2.fasta
    bwa mem -t 8 -P -L 10000 -a hla_all_v2.fasta NA12878_part_1.fastq NA12878_part_2.fastq > NA12878_part.sam

Estimation of HLA types by HLA-VBSeq

    For paired-end read data:
    java -jar HLAVBSeq.jar hla_all_v2.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01 --is_paired

    For single-end read data:
    java -jar HLAVBSeq.jar hla_all_v2.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01

    Here, alpha_zero is a hyperparameter as described in the paper and we recommend to use 0.01.

For analysis with v1 HLA database (old database):

Alignment by BWA-MEM allowing multiple alignments for each read
    bwa index hla_all.fasta
    bwa mem -t 8 -P -L 10000 -a hla_all.fasta NA12878_part_1.fastq NA12878_part_2.fastq > NA12878_part.sam

Estimation of HLA types by HLA-VBSeq

    For paired-end read data:
    java -jar HLAVBSeq.jar hla_all.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01 --is_paired

    For single-end read data:
    java -jar HLAVBSeq.jar hla_all.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01

    Here, alpha_zero is a hyperparameter as described in the paper and we recommend to use 0.01.

Prediction Results:

Output file format

    1st column:
    HLA allele ID (e.g. HLA00001)

    2nd column:
    genomic locus length (in bp) in fasta format (e.g. 3503 bp)

    3rd column:
    Z, the number of reads that the algorithm assigned to the HLA allele (e.g. 300 reads)

    4th column:
    Normalized number of reads (Fragments Per Kilobase per Million mapped fragments)

    5th column:
    Relative abundance, theta, which is Z divided by the number of total mapped reads.

How HLA typing result looks like (e.g. HLA-A)

For v2 HLA database:

./parse_result.pl Allelelist_v2.txt NA12878_result.txt | grep "^A\*" | sort -k2 -n -r > HLA_A.txt less HLA_A.txt

For v1 HLA database (old database):

./parse_result.pl Allelelist.txt NA12878_result.txt | grep "^A\*" | sort -k2 -n -r > HLA_A.txt less HLA_A.txt A*01:01:01:01 17.4022266628604 A*11:01:01 12.0376819868684 1st column: HLA allele name 2nd column: Average depth of coverage Here, the perl script, parse_result.pl, calculates the average depth of coverage for each HLA allele. Please modify "parse_result.pl" according to your data. This perl script assumes 100bp x 2 data as in the paper.

HLA Call:

    $(PYTHON2.7) call_hla_digits.py -v NA12878_result.txt -a Allelelist.txt -r 90 -d 4 --ispaired > NA12878_report.d4.txt

    -v xxxxx_result.txt : Need to set the output file from the HLA-VBSeq
    -a Allelelist.txt : IMGT HLA Allelelist
    -r 90 : mean single read length (mean_rlen)
    -d 4 : HLA call resolution（4 or 6 or 8）
    --ispaired : if set, twice the mean rlen for depth calculation (need to specify when the sequenced data is paired-end protocol)

Reference:

Any published works where HLA-VBSeq has been used in data analysis should include a citation:

HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data
Naoki Nariai, Kaname Kojima, Sakae Saito, Takahiro Mimori, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Jun Yasuda and Masao Nagasaki
BMC Genomics 2015, 16(Suppl 2):S7

To use v2 please cite the following manuscript as a reference:

Construction of full-length Japanese reference panel of class I HLA genes with single-molecule, real-time sequencing, Mimori T, Yasuda J, Kuroki Y, Shibata TF, Katsuoka F, Saito S, Nariai N, Ono A, Nakai-Inagaki N, Misawa K, Tateno K, Kawai Y, Fuse N, Hozawa A, Kuriyama S, Sugawara J, Minegishi N, Suzuki K, Kinoshita K, Nagasaki M and Yamamoto M Pharmacogenomics J. 2018

License:

HLA-VBSeq, (c) 2015, Tohoku University (the "Software")

The Software remains the property of Tohoku University.

The Software is distributed "AS IS" under this License solely for non-commercial use in the hope that it will be useful, but in order that Tohoku University as the intellectual property right holder of the software protects its assets for the benefit of its educational and research purposes, Tohoku University makes clear that no condition is made or to be implied, nor is any warranty given or to be implied, as to the accuracy of the Software, or that it will be suitable for any particular purpose or for use under any specific conditions. Furthermore, Tohoku University disclaims all responsibility for the use which is made of the Software. It further disclaims any liability for the outcomes arising from using the Software.

The Licensee agrees to indemnify Tohoku University and hold Tohoku University harmless from and against any and all claims, damages and liabilities asserted by third parties (including claims for negligence) which arise directly or indirectly from the use of the Software or the sale of any products based on the Software.

No part of the Software may be reproduced, modified, reverse engineered, transmitted or transferred in any form or by any means, electronic or mechanical, without the express permission of Tohoku University.

You are not permitted under this License to use the Software commercially. Use for which any financial return is received shall be defined as commercial use, and includes (1) integration of all or part of the source code or the Software into a product for sale or license by or on behalf of Licensee to third parties or (2) use of the Software or any derivative of it for research with the final aim of developing software products for sale or license to a third party or (3) use of the Software or any derivative of it for research with the final aim of developing non-software products for sale or license to a third party, or (4) use of the Software to provide any service to an external organization for which payment is received. If you are interested in using the Software commercially, please contact Tohoku University ( nariai@megabank.tohoku.ac.jp and nagasaki@megabank.tohoku.ac.jp ), to negotiate a license.

Contact:

Please send e-mail to nariai@megabank.tohoku.ac.jp and nagasaki@megabank.tohoku.ac.jp

Last updated: 10/28/2024