JoGo-LILR Caller v1

JoGo-LILR Caller is a software to call the haplotype pattern of complex LILR region especially LILRB3-LILRA6 region with various CN patterns. JoGo-LILR takes the short read sequencing data and outputs diploid CNs and probable haplotype from short read sequencing data.

Download

The package of JoGo-LILR Caller can be downloaded from here (package of JoGo-LILR_v1.tgz)

Required Python3 Packages

numpy pandas scipy plotly.express kaleido

Supported Reference Assembly

First you need to align short read sequencing data to GRCh38 based-reference assembly. JoGo-LILR Caller was developed by using the same reference assembly that was used in the 1000 Genomes project (https://github.com/igsr/1000Genomes_data_indexes/blob/master/data_collections/1000_genomes_project/README.1000genomes.GRCh38DH.alignment.), named hs38DH.

If the major reference assembly is GRCh38 coordinate, JoGo-LILR Caller will work but our team does not test intensively.

The hs38DH.fa can be downloaded from

FTP ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

HTTPS

The same reference assembly is included in the download package (package of JoGo-LILR_v1.tgz).

Step1 Alignment

If you already have the aligned result to GRCh38 coordinate bam or cram file you can skip this step.

The usual protocol of whole-genome short read sequencing data is paired-end mode and the data can be aligned by using alignment tools, e.g. BWA and bowtie2. In JoGo-LILR Caller, the aligned result using bwa was used but can be used for other alignment tool.

For the paired-end fastq file, test_R1.fa and test_R2.fa.

bwa mem

bwa mem hs38DH.fa test_R1.fq test_R2.fq | samtools view -bS > test.tmp.bam

samtools sort test.tmp.bam -o test.bam

samtools index test.bam

Then you will obtain test.bam and test.bam.bai

The JoGo-LILR Caller also accepts cram format instead of bam.

Step2 JoGo-LILR Preprocessing

For the aligned cram or bam file, JoGo-LILR calls the coverages for four specific regions in LILRB3 and LILRA6. To normalize GC bias and other factors (e.g., global depth of coverage), JoGo-LILR caller uses CNVNator.

JoGo-LILR package includes the singularity image for customized CNVNator for JoGo-LILR and you don’t need to install CNVNator independently.

python3 ./jogo-lilr-preprocess.py –id HG007 –bam input/HG007.bam

For a cram file you might need to specify the cache directory of reference assembly. For hs38DH.fa case, the cache file is bundled and would work with the following environment setting. The setting is not required for bam file.

export REF_CACHE=../input/hs38DH/cache/%2s/%2s/%s

export REF_PATH=../input/hs38DH/cache/%2s/%2s/%s

Note: The message ‘Can’t find directory ‘bin_1000’…’ is shown when you process the step2 command, but you don’t need to worry about this output.

Step3 JoGo-LILR Diploid Copy Number Calling

After calling the coverages for four specific regions in LILRB3 and LILRA6 in Step2, the diploid copy numbers of LILRB3 and LILRA6 can be jointly called with the prior information from Hapmap 3,204 samples (with default parameter).

After executing the example in Step2, output/result.HG007.for_caller.txt will be created. By taking the preprocessing file as input diploid copy number for LILRB3 and LILRA6 can be called with the following command.

python3 ./jogo-lilr-caller.py –region_cnv output/result.HG007.for_caller.txt –targetsamples HG007

After executing the command, the following files will be created.

test.diplod_stable.target.tsv

sampleid selected_cnv_type selected_cnv_type_distance X(LILRB3+LILRA6) Y(LILRB3core/LILRA6core) cluster_id group popname gpopname

HG007 CN5_B2A3 0.3639 5.3328 0.5194 5 test global global

The selected_cnv_type is the called copy number of LILRB3 and LILRA6. CN5_B2A3 means, the total copies of LILRB3 and LILRA6 are 5. The total copies of LILRB3 is 2 and LILRA6 is 3.

test.diplod_stable.all.{pdf/png/html} is the clustering plot with prior dataset (default is 3204 samples from Hapmap data).

test.diplod_stable.target.{pdf/png/html} is the selected dataset specified in –targetsamples from test.diplod_stable.all.{pdf/png/html}.

Step4 JoGo-LILR Haploid Copy Number Calling

After calling the coverages for four specific regions in LILRB3 and LILRA6 in Step2, the haploid copy numbers of LILRB3 and LILRA6 can be jointly called with the prior information from Hapmap 3,204 samples (with default parameter).

python3 ./jogo-lilr-caller.py –region_cnv output/result.HG007.for_caller.txt –targetsamples HG007 –distance_type haploid

After executing the command, the following files will be created.

output/test.haploid.target.tsv

sampleid selected_cnv_type selected_cnv_type_distance X(LILRB3+LILRA6) Y(LILRB3core/LILRA6core) cluster_id group popname gpopname

HG007_0.1 CN5_B1A1_B1A2 0.3639 5.3328 0.5194 5 test global global

The selected_cnv_type is the called copy number of LILRB3 and LILRA6. CN5_B1A1_B1A2 means, the total copies of LILRB3 and LILRA6 are 5. The combination of haploid is B1A1 (LILRB3 is 1 and LILRA6 is 1)and B1A2 (LILRB3 is 1 and LILRA6 is 2).

test.haploid.all.{pdf/png/html} is the clustering plot with prior dataset (default is 3204 samples from Hapmap data).

test.haploid.target.{pdf/png/html} is the selected dataset specified in –targetsamples from test.haploid.all.{pdf/png/html}.

Step5 Multiple Sample Joint-calling

In Step3 and Step4, only one sample is called in jogo-lilr-caller.py. If you merge each preprocessing result in Step2 with the header, jogo-lilr-caller can call multiple samples at once.

python3 ./jogo-lilr-caller.py –region_cnv output/result.HG007_and_HG008.for_caller.txt –targetsamples HG007,HG008 –distance_type haploid

Step6 Calculate probable haplotype pair of LILRB3 and LILRA6

By executing Step4, all possible patterns of haplotype pairs are listed in output/test.haploid.target.tsv (in the example of Step4).

If multiple candidates exist, the script informs the most probable pair of haplotype. If multiple candidates exist, all possible pairs of haplotype with probabilities are informed (with –all options).

python3 calculate_allelic_probability.py –input output/test.haploid.target.tsv

sampleid selected_cnv_type selected_cnv_type_distance X(LILRB3+LILRA6) Y(LILRB3core/LILRA6core) cluster_id group popname gpopname SampleCNType_bestguess SampleCNType_bestguess_probability note

HG007_0.1 CN5_B1A1_B1A2 0.3639 5.3328 0.5194 5 test global global B1A1/B1A2 1.0

For multiple candidate case, the output is like

NA21102 CN4_B1A1_B1A1_CN4_B1A0_B1A2 0.092 3.9552 0.9362 3 test global global B1A1/B1A1 0.9868 B1A0/B1A2 0.0132

In this case, the most probable pair if B1A1 and B1A1 as 98.68% and the second candidate is B1A0 and B1A2 as 1.32%

Step7 Calculate probable haplotype pair of LILRB3 and LILRA6 for trio dataset

For the trio dataset, the most probable pair can be calculated to keep the Mendelian rule from the haploid estimated result pairs for Step4.

In the following example, pack the father, mother, child to one line by pack_pedigree_estimated_result.py with –ped information (the format is the same as PLINK).

>python3 pack_pedigree_estimated_result.py –input ../test/paper.hapmap3202.allele_stable.tsv –ped ../test/1kGP.3202_samples.pedigree_info.txt –output output/hapmap3202_finalreport.estimated.by_pedigree.all.tsv

The pedigree_performance_check.py command checks the consistency with Mendelian rule among father, mother and child.

>python3 pedigree_performance_check.py –input output/hapmap3202_finalreport.estimated.by_pedigree.all.tsv –output output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.tsv

>cat output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.tsv | grep -v NON_PEDIGREE > output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.onlyped.tsv

The calculate_trio_probability.py calculates the most probable pair from multiple candidate pairs with Mendelian rule.

>python3 calculate_trio_probability.py –input output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.onlyped.tsv –output output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.best_guess.onlyped.tsv

Nagasaki Lab | Division of Biomedical Information Analysis | Medical Research Center for High Depth Omics | Medical Institute of Bioregulation | Kyushu University

Nagasaki Lab | Division of Biomedical Information Analysis | Medical Research Center for High Depth Omics | Medical Institute of Bioregulation | Kyushu University