JoGo-LILR Caller is a software to call the haplotype pattern of complex LILR region especially LILRB3-LILRA6 region with various CN patterns. JoGo-LILR takes the short read sequencing data and outputs diploid CNs and probable haplotype from short read sequencing data.
Download
The package of JoGo-LILR Caller can be downloaded from here (package of JoGo-LILR_v1.tgz)
Required Python3 Packages
numpy pandas scipy plotly.express kaleido
Supported Reference Assembly
First you need to align short read sequencing data to GRCh38 based-reference assembly. JoGo-LILR Caller was developed by using the same reference assembly that was used in the 1000 Genomes project (https://github.com/igsr/1000Genomes_data_indexes/blob/master/data_collections/1000_genomes_project/README.1000genomes.GRCh38DH.alignment.), named hs38DH.
If the major reference assembly is GRCh38 coordinate, JoGo-LILR Caller will work but our team does not test intensively.
The hs38DH.fa can be downloaded from
or
HTTPS
The same reference assembly is included in the download package (package of JoGo-LILR_v1.tgz).
Step1 Alignment
If you already have the aligned result to GRCh38 coordinate bam or cram file you can skip this step.
The usual protocol of whole-genome short read sequencing data is paired-end mode and the data can be aligned by using alignment tools, e.g. BWA and bowtie2. In JoGo-LILR Caller, the aligned result using bwa was used but can be used for other alignment tool.
For the paired-end fastq file, test_R1.fa and test_R2.fa.
bwa mem
bwa mem hs38DH.fa test_R1.fq test_R2.fq | samtools view -bS > test.tmp.bam
samtools sort test.tmp.bam -o test.bam
samtools index test.bam
Then you will obtain test.bam and test.bam.bai
The JoGo-LILR Caller also accepts cram format instead of bam.
Step2 JoGo-LILR Preprocessing
For the aligned cram or bam file, JoGo-LILR calls the coverages for four specific regions in LILRB3 and LILRA6. To normalize GC bias and other factors (e.g., global depth of coverage), JoGo-LILR caller uses CNVNator.
JoGo-LILR package includes the singularity image for customized CNVNator for JoGo-LILR and you don’t need to install CNVNator independently.
python3 ./jogo-lilr-preprocess.py –id HG007 –bam input/HG007.bam
For a cram file you might need to specify the cache directory of reference assembly. For hs38DH.fa case, the cache file is bundled and would work with the following environment setting. The setting is not required for bam file.
export REF_CACHE=../input/hs38DH/cache/%2s/%2s/%s
export REF_PATH=../input/hs38DH/cache/%2s/%2s/%s
Note: The message ‘Can’t find directory ‘bin_1000’…’ is shown when you process the step2 command, but you don’t need to worry about this output.
Step3 JoGo-LILR Diploid Copy Number Calling
After calling the coverages for four specific regions in LILRB3 and LILRA6 in Step2, the diploid copy numbers of LILRB3 and LILRA6 can be jointly called with the prior information from Hapmap 3,204 samples (with default parameter).
After executing the example in Step2, output/result.HG007.for_caller.txt will be created. By taking the preprocessing file as input diploid copy number for LILRB3 and LILRA6 can be called with the following command.
python3 ./jogo-lilr-caller.py –region_cnv output/result.HG007.for_caller.txt –targetsamples HG007
After executing the command, the following files will be created.
test.diplod_stable.target.tsv
sampleid selected_cnv_type selected_cnv_type_distance X(LILRB3+LILRA6) Y(LILRB3core/LILRA6core) cluster_id group popname gpopname
HG007 CN5_B2A3 0.3639 5.3328 0.5194 5 test global global
The selected_cnv_type is the called copy number of LILRB3 and LILRA6. CN5_B2A3 means, the total copies of LILRB3 and LILRA6 are 5. The total copies of LILRB3 is 2 and LILRA6 is 3.
test.diplod_stable.all.{pdf/png/html} is the clustering plot with prior dataset (default is 3204 samples from Hapmap data).
test.diplod_stable.target.{pdf/png/html} is the selected dataset specified in –targetsamples from test.diplod_stable.all.{pdf/png/html}.
Step4 JoGo-LILR Haploid Copy Number Calling
After calling the coverages for four specific regions in LILRB3 and LILRA6 in Step2, the haploid copy numbers of LILRB3 and LILRA6 can be jointly called with the prior information from Hapmap 3,204 samples (with default parameter).
After executing the example in Step2, output/result.HG007.for_caller.txt will be created. By taking the preprocessing file as input diploid copy number for LILRB3 and LILRA6 can be called with the following command.
python3 ./jogo-lilr-caller.py –region_cnv output/result.HG007.for_caller.txt –targetsamples HG007 –distance_type haploid
After executing the command, the following files will be created.
output/test.haploid.target.tsv
sampleid selected_cnv_type selected_cnv_type_distance X(LILRB3+LILRA6) Y(LILRB3core/LILRA6core) cluster_id group popname gpopname
HG007_0.1 CN5_B1A1_B1A2 0.3639 5.3328 0.5194 5 test global global
The selected_cnv_type is the called copy number of LILRB3 and LILRA6. CN5_B1A1_B1A2 means, the total copies of LILRB3 and LILRA6 are 5. The combination of haploid is B1A1 (LILRB3 is 1 and LILRA6 is 1)and B1A2 (LILRB3 is 1 and LILRA6 is 2).
test.haploid.all.{pdf/png/html} is the clustering plot with prior dataset (default is 3204 samples from Hapmap data).
test.haploid.target.{pdf/png/html} is the selected dataset specified in –targetsamples from test.haploid.all.{pdf/png/html}.
Step5 Multiple Sample Joint-calling
In Step3 and Step4, only one sample is called in jogo-lilr-caller.py. If you merge each preprocessing result in Step2 with the header, jogo-lilr-caller can call multiple samples at once.
python3 ./jogo-lilr-caller.py –region_cnv output/result.HG007_and_HG008.for_caller.txt –targetsamples HG007,HG008 –distance_type haploid
Step6 Calculate probable haplotype pair of LILRB3 and LILRA6
By executing Step4, all possible patterns of haplotype pairs are listed in output/test.haploid.target.tsv (in the example of Step4).
If multiple candidates exist, the script informs the most probable pair of haplotype. If multiple candidates exist, all possible pairs of haplotype with probabilities are informed (with –all options).
python3 calculate_allelic_probability.py –input output/test.haploid.target.tsv
sampleid selected_cnv_type selected_cnv_type_distance X(LILRB3+LILRA6) Y(LILRB3core/LILRA6core) cluster_id group popname gpopname SampleCNType_bestguess SampleCNType_bestguess_probability note
HG007_0.1 CN5_B1A1_B1A2 0.3639 5.3328 0.5194 5 test global global B1A1/B1A2 1.0
For multiple candidate case, the output is like
NA21102 CN4_B1A1_B1A1_CN4_B1A0_B1A2 0.092 3.9552 0.9362 3 test global global B1A1/B1A1 0.9868 B1A0/B1A2 0.0132
In this case, the most probable pair if B1A1 and B1A1 as 98.68% and the second candidate is B1A0 and B1A2 as 1.32%
Step7 Calculate probable haplotype pair of LILRB3 and LILRA6 for trio dataset
For the trio dataset, the most probable pair can be calculated to keep the Mendelian rule from the haploid estimated result pairs for Step4.
In the following example, pack the father, mother, child to one line by pack_pedigree_estimated_result.py with –ped information (the format is the same as PLINK).
>python3 pack_pedigree_estimated_result.py –input ../test/paper.hapmap3202.allele_stable.tsv –ped ../test/1kGP.3202_samples.pedigree_info.txt –output output/hapmap3202_finalreport.estimated.by_pedigree.all.tsv
The pedigree_performance_check.py command checks the consistency with Mendelian rule among father, mother and child.
>python3 pedigree_performance_check.py –input output/hapmap3202_finalreport.estimated.by_pedigree.all.tsv –output output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.tsv
>cat output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.tsv | grep -v NON_PEDIGREE > output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.onlyped.tsv
The calculate_trio_probability.py calculates the most probable pair from multiple candidate pairs with Mendelian rule.
>python3 calculate_trio_probability.py –input output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.onlyped.tsv –output output/hapmap3202_finalreport.estimated.by_pedigree.all.withQC.best_guess.onlyped.tsv