HLA-VBSeq is software to estimate the most likely HLA types from high-throughput sequencing data.


Input:

Read data (FASTQ), or read data aligned to GRCh37/hg19 in BAM format (e.g. NA12878.sorted.bam).


Output:

HLA types estimated from the sequencing data.


Usage (from FASTQ file):

  • Align the read data (FASTQ file) to GRCh37/hg19
  • Convert the SAM file to a BAM file, then sort the BAM file by genomic coordinate
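
The two preparation steps above can be sketched as a small Python driver. This is a minimal sketch, not part of HLA-VBSeq: it assumes `bwa` and `samtools` (1.x) are on the PATH, that the reference has already been indexed with `bwa index`, and that its chromosome names carry the `chr` prefix, because the region queries in the next section use `chr6:...`. The file names (`hg19.fa`, the FASTQ names) and the helper names `alignment_commands` and `run_alignment` are hypothetical.

```python
import subprocess


def alignment_commands(reference, fastq1, fastq2, out_bam, threads=8):
    """Build the argv lists for alignment, coordinate sorting and indexing."""
    # bwa mem writes SAM to stdout.
    bwa_cmd = ["bwa", "mem", "-t", str(threads), reference, fastq1, fastq2]
    # samtools sort reads that SAM stream from stdin ("-") and writes a sorted BAM.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    # Region queries with "samtools view" need a BAM index.
    index_cmd = ["samtools", "index", out_bam]
    return bwa_cmd, sort_cmd, index_cmd


def run_alignment(reference, fastq1, fastq2, out_bam, threads=8):
    """Run bwa mem piped into samtools sort, then index the result."""
    bwa_cmd, sort_cmd, index_cmd = alignment_commands(
        reference, fastq1, fastq2, out_bam, threads
    )
    bwa = subprocess.Popen(bwa_cmd, stdout=subprocess.PIPE)
    sort = subprocess.Popen(sort_cmd, stdin=bwa.stdout)
    bwa.stdout.close()  # let bwa receive SIGPIPE if samtools exits early
    sort_rc = sort.wait()
    bwa_rc = bwa.wait()
    if bwa_rc != 0 or sort_rc != 0:
        raise RuntimeError("alignment or sorting failed")
    subprocess.run(index_cmd, check=True)


if __name__ == "__main__":
    # Hypothetical input names; replace them with your own files.
    run_alignment("hg19.fa", "NA12878_1.fastq", "NA12878_2.fastq",
                  "NA12878.sorted.bam")
```

Piping the aligner straight into the sorter avoids writing an intermediate SAM file, which can be very large for whole-genome data.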

Usage (from BAM file):

  • Extract the names of the reads that were aligned to the HLA loci (HLA-A, B, C, DM, DO, DP, DQ, DR, E, F, G, H, J, K, L, P, V, MIC, and TAP)
        samtools view NA12878.sorted.bam chr6:29907037-29915661 chr6:31319649-31326989 chr6:31234526-31241863 \
            chr6:32914391-32922899 chr6:32900406-32910847 chr6:32969960-32979389 chr6:32778540-32786825 \
            chr6:33030346-33050555 chr6:33041703-33059473 chr6:32603183-32613429 chr6:32707163-32716664 \
            chr6:32625241-32636466 chr6:32721875-32733330 chr6:32405619-32414826 chr6:32544547-32559613 \
            chr6:32518778-32554154 chr6:32483154-32559613 chr6:30455183-30463982 chr6:29689117-29699106 \
            chr6:29792756-29800899 chr6:29793613-29978954 chr6:29855105-29979733 chr6:29892236-29899009 \
            chr6:30225339-30236728 chr6:31369356-31385092 chr6:31460658-31480901 chr6:29766192-29772202 \
            chr6:32810986-32823755 chr6:32779544-32808599 chr6:29756731-29767588 \
            | awk '{print $1}' | sort | uniq > NA12878_partial_reads.txt
  • Build a read-name index, then retrieve the read pairs and their sequences for the HLA loci
        java -jar -Xmx32g -Xms32g bamNameIndex.jar index NA12878.sorted.bam --indexFile NA12878.sorted.bam.idx
        java -jar bamNameIndex.jar search NA12878.sorted.bam --name NA12878_partial_reads.txt --output NA12878_partial.sam
        java -jar SamToFastq.jar I=NA12878_partial.sam F=NA12878_partial_1.fastq F2=NA12878_partial_2.fastq
        If for some reason bamNameIndex.jar doesn't work, please use bedtools to extract reads from bam files:
        Or, alternatively, below is a python script that uses pysam to extract reads by read name from a bam file:
        Or, please try samgrep:
  • Extract unmapped reads
        samtools view -bh -f 12 NA12878.sorted.bam > NA12878.sorted_unmapped.bam
        java -jar SamToFastq.jar I=NA12878.sorted_unmapped.bam F=NA12878_unmapped_1.fastq F2=NA12878_unmapped_2.fastq
  • Combine reads in FASTQ format
        cat NA12878_partial_1.fastq NA12878_unmapped_1.fastq > NA12878_part_1.fastq
        cat NA12878_partial_2.fastq NA12878_unmapped_2.fastq > NA12878_part_2.fastq
  • Alignment by BWA-MEM allowing multiple alignments for each read
        bwa index hla_all.fasta
        bwa mem -t 8 -P -L 10000 -a hla_all.fasta NA12878_part_1.fastq NA12878_part_2.fastq > NA12878_part.sam
  • Estimation of HLA types by HLA-VBSeq
        For paired-end read data:
        java -jar HLAVBSeq.jar hla_all.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01 --is_paired
        For single-end read data:
        java -jar HLAVBSeq.jar hla_all.fasta NA12878_part.sam NA12878_result.txt --alpha_zero 0.01
        Here, alpha_zero is a hyperparameter described in the paper; we recommend using 0.01.
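
For the read-retrieval step, the following is a minimal pure-Python sketch of a fallback that keeps every SAM record whose read name appears in NA12878_partial_reads.txt. It is not the bamNameIndex.jar algorithm: it builds no index and simply streams the whole file once, so it is slower but needs nothing beyond the standard library. The script name `filter_sam_by_name.py` and the function names are hypothetical.

```python
import sys


def load_read_names(lines):
    """Collect read names, one per line, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def filter_sam_by_name(sam_lines, names):
    """Yield header lines and every alignment whose QNAME is in `names`.

    Scanning the whole file by name also returns mates that were mapped
    outside the HLA regions, so read pairs stay together.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # SAM header line; a QNAME cannot start with '@'.
            yield line
        elif line.split("\t", 1)[0] in names:
            yield line


if __name__ == "__main__":
    # Usage (assumed):
    #   samtools view -h NA12878.sorted.bam \
    #     | python filter_sam_by_name.py NA12878_partial_reads.txt > NA12878_partial.sam
    with open(sys.argv[1]) as handle:
        wanted = load_read_names(handle)
    for record in filter_sam_by_name(sys.stdin, wanted):
        sys.stdout.write(record)
```

Holding the names in a set keeps each lookup constant-time, so the run time is dominated by decompressing and streaming the BAM file.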

Prediction Results:

  • Output file format
        1st column: HLA allele ID (e.g. HLA00001)
        2nd column: length of the allele's genomic sequence in the FASTA file, in bp (e.g. 3503)
        3rd column: Z, the number of reads that the algorithm assigned to the HLA allele (e.g. 300)
        4th column: normalized number of reads (Fragments Per Kilobase per Million mapped fragments)
        5th column: relative abundance, theta, which is Z divided by the total number of mapped reads
  • What the HLA typing result looks like (e.g. HLA-A)
        ./parse_result.pl Allelelist.txt NA12878_result.txt | grep "^A\*" | sort -k2 -n -r > HLA_A.txt
        less HLA_A.txt
          A*01:01:01:01   17.4022266628604
          A*11:01:01      12.0376819868684
        1st column: HLA allele name
        2nd column: average depth of coverage
        The Perl script parse_result.pl calculates the average depth of coverage for each HLA allele.
        It assumes 100 bp x 2 data, as in the paper; please modify parse_result.pl to match your data.
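
To make the read-length dependence explicit, here is a minimal Python sketch of the same kind of summary. It is not a port of parse_result.pl. It rests on three assumptions: the result file is whitespace-separated with the five columns listed above; Allelelist.txt holds one "ID name" pair per line; and average depth can be approximated as Z x (bases per fragment) / (allele length), with 200 bases per fragment for 100 bp x 2 data. All function names are hypothetical.

```python
import sys


def load_allele_names(lines):
    """Map allele ID -> allele name, assuming 'ID name' on each line."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def parse_result_line(line):
    """Split one result line into the five documented columns."""
    fields = line.split()
    return {
        "allele_id": fields[0],
        "length": int(fields[1]),
        "z": float(fields[2]),
        "fpkm": float(fields[3]),
        "theta": float(fields[4]),
    }


def average_depth(z, length_bp, bases_per_fragment=200):
    """Assumed approximation: sequenced bases assigned / allele length."""
    return z * bases_per_fragment / length_bp


def depth_by_allele(result_lines, id_to_name, prefix, bases_per_fragment=200):
    """Return (allele name, depth) pairs for one gene, highest depth first."""
    rows = []
    for line in result_lines:
        if not line.strip():
            continue
        rec = parse_result_line(line)
        name = id_to_name.get(rec["allele_id"])
        if name is None or not name.startswith(prefix):
            continue
        depth = average_depth(rec["z"], rec["length"], bases_per_fragment)
        rows.append((name, depth))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


if __name__ == "__main__":
    # Usage (assumed): python depth_summary.py Allelelist.txt NA12878_result.txt "A*"
    with open(sys.argv[1]) as handle:
        names = load_allele_names(handle)
    with open(sys.argv[2]) as handle:
        for name, depth in depth_by_allele(handle, names, sys.argv[3]):
            print(f"{name}\t{depth}")
```

For data with a different read length, pass the matching `bases_per_fragment` (for example 300 for 150 bp x 2) instead of editing a constant.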


    Any published works where HLA-VBSeq has been used in data analysis should include a citation:

    HLA-VBSeq: accurate HLA typing at full resolution from whole-genome sequencing data
    Naoki Nariai, Kaname Kojima, Sakae Saito, Takahiro Mimori, Yukuto Sato, Yosuke Kawai, Yumi Yamaguchi-Kabata, Jun Yasuda and Masao Nagasaki
    BMC Genomics 2015, 16(Suppl 2):S7


    HLA-VBSeq, (c) 2015, Tohoku University (the "Software")

    The Software remains the property of Tohoku University.

    The Software is distributed "AS IS" under this License solely for non-commercial use in the hope that it will be useful, but in order that Tohoku University as the intellectual property right holder of the software protects its assets for the benefit of its educational and research purposes, Tohoku University makes clear that no condition is made or to be implied, nor is any warranty given or to be implied, as to the accuracy of the Software, or that it will be suitable for any particular purpose or for use under any specific conditions. Furthermore, Tohoku University disclaims all responsibility for the use which is made of the Software. It further disclaims any liability for the outcomes arising from using the Software.

    The Licensee agrees to indemnify Tohoku University and hold Tohoku University harmless from and against any and all claims, damages and liabilities asserted by third parties (including claims for negligence) which arise directly or indirectly from the use of the Software or the sale of any products based on the Software.

    No part of the Software may be reproduced, modified, reverse engineered, transmitted or transferred in any form or by any means, electronic or mechanical, without the express permission of Tohoku University.

    You are not permitted under this License to use the Software commercially. Use for which any financial return is received shall be defined as commercial use, and includes (1) integration of all or part of the source code or the Software into a product for sale or license by or on behalf of Licensee to third parties or (2) use of the Software or any derivative of it for research with the final aim of developing software products for sale or license to a third party or (3) use of the Software or any derivative of it for research with the final aim of developing non-software products for sale or license to a third party, or (4) use of the Software to provide any service to an external organization for which payment is received. If you are interested in using the Software commercially, please contact Tohoku University ( nariai@megabank.tohoku.ac.jp and nagasaki@megabank.tohoku.ac.jp ), to negotiate a license.


    Last updated: 03/30/2017