Newer
Older
# Standard operating procedure - Setup for variant prioritization in trio whole exome samples at the Edinburgh Parallel Computing Centre
This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE). It assumes that data has been successfully aligned, variant called and annotated (see SOP_trio_whole_exome_EPCC_pipeline). Scripts and resource datasets are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@ed.ac.uk.
This setup must not be altered in any way unless sanctioned by NHSS. Contact Morad Ansari morad.ansari@nhs.net, morad.ansari@nhslothian.scot.nhs.uk.
## Definitions
Command lines starting with E> are to be executed on Eddie; U> refers to Ultra2 (EPCC).
Text in angle brackets, e.g. <chr> indicates variable parameters.
## The Variant Effect Predictor (VEP)
Version: ensembl-vep-100.4-0
Location at EPCC: /home/u035/u035/shared/software/bcbio/anaconda/bin/vep
Note: make sure the VEP cache files are tabix converted (significant speedup)
### Check the downloaded VEP cache files were tabix converted
```
U> ls -l /home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/vep/homo_sapiens_merged/100_GRCh38/<chr>
```
where <chr> is the chromosome name without the ‘chr’ prefix (e.g., 1,2, ..., 22).
A presence of a file called all_vars.gz.csi indicates that the VEP cache files were tabix converted.
## The G2P plugin for VEP
Version: 100
Source: https://github.com/Ensembl/VEP_plugins/blob/release/100/G2P.pm
Date obtained: 21/04/2021
Location at EPCC: /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0
File name: G2P.pm
### Add splice region variants (1-3 bases of the exon or 3-8 bases of the intron) for consideration by G2P
Edit the G2P.pm file by adding the type `splice_region_variant` at the end of:
```
types : SO consequence types to include ... (line 84)
types => {map {$_ => 1} qw(splice_donor_variant ... (line 145)
```
### Setting G2P in completely offline mode
All external datasets listed for the `af_from_vcf_keys flag` (gnomADe_r2.1.1_GRCh38|gnomADg_r3.1.1_GRCh38) in:
* `/home/u035/u035/shared/scripts/process_NHS_WES_trio.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh`
must be available locally (see below for downloading gnomADe and gnomADg datasets)
* gnomADg dataset (r3.1.1, downloaded 23/08/2021): `/home/u035/u035/shared/resources/gnomad/r3.1.1/genomes`
* gnomADe dataset (r2.1.1, downloaded 23/08/2021): `/home/u035/u035/shared/resources/gnomad/r2.1.1/exomes`
To re-fetch the gnomADg dataset:
```
U> cd /home/u035/u035/shared/resources/gnomad/r3.1.1/genomes
U> for i in {1..22} X Y
U> do
U> wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr${i}.vcf.bgz
U> wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr${i}.vcf.bgz.tbi
U> done
```
To re-fetch the gnomADe dataset:
```
U> cd /home/u035/u035/shared/resources/gnomad/r2.1.1/exomes
U> for i in {1..22} X Y
U> do
U> wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.${i}.liftover_grch38.vcf.bgz
U> wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.${i}.liftover_grch38.vcf.bgz.tbi
U> done
Edit the `/home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json` file to update type (remote -> local) and filename_template (local path to datasets) variables for these local datasets.
gnomADg (lines 138-143)
```
"id": "gnomADg_r3.1.1_GRCh38",
"description": "Genome Aggregation Database genomes r3.1.1",
"species": "homo_sapiens",
"assembly": "GRCh38",
"type": "local",
"filename_template": "/home/u035/u035/shared/resources/gnomad/r3.1.1/genomes/gnomad.genomes.v3.1.1.sites.chr###CHR###.vcf.bgz",
```
gnomADe (lines 199-204)
```
"id": "gnomADe_r2.1.1_GRCh38",
"description": "Genome Aggregation Database exomes r2.1.1 liftover to GRCh38",
"species": "homo_sapiens",
"assembly": "GRCh38",
"type": "local",
"filename_template": "/home/u035/u035/shared/resources/gnomad/r2.1.1/exomes/gnomad.exomes.r2.1.1.sites.###CHR###.liftover_grch38.vcf.bgz",
```
## Parameter Values for G2P call
Files
* `/home/u035/u035/shared/scripts/process_NHS_WES_trio.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh`
VEP="/home/u035/u035/shared/software/bcbio/anaconda/bin/perl /home/u035/u035/shared/software/bcbio/anaconda/bin/vep"
# this points to ../share/ensembl-vep-100.4-0/vep
REFERENCE_GENOME=/home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
IN_FILE=${VCF_DIR}/${PLATE_ID}_${FAMILY_ID}.clean.vcf
G2P_LOG_DIR=${G2P_DIR}/${PLATE_ID}_${FAMILY_ID}_LOG_DIR
mkdir ${G2P_LOG_DIR}
TXT_OUT=${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}.report.txt
HTML_OUT=${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}.report.html
VCF_KEYS='gnomADe_GRCh38|gnomADg_r3.0_GRCh38'
time ${VEP} \
-i ${IN_FILE} \
--output_file ${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}_inter_out.txt \
--force_overwrite \
--assembly GRCh38 \
--fasta ${REFERENCE_GENOME} \
--offline \
--merged \
--use_given_ref \
--cache --cache_version 100 \
--dir_cache /home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/vep \
--transcript_filter "gene_symbol in /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20201208.txt" \
--dir_plugins /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0 \
--plugin G2P,file='/home/u035/u035/shared/resources/G2P/DDG2P.20201208.csv',af_from_vcf=1,confidence_levels='confirmed&probable&both RD and IF',af_from_vcf_keys=${VCF_KEYS},\
log_dir=${G2P_LOG_DIR},txt_report=${TXT_OUT},html_report=${HTML_OUT}
```
## Developmental Disorders (DD) gene panel and list of unique gene names (with synonyms)
Source: https://www.ebi.ac.uk/gene2phenotype/downloads
Date obtained: 06/07/2021
Location at EPCC: /home/u035/u035/shared/resources/G2P
File names: DDG2P.20210706.csv and genes_in_DDG2P.20210706.txt
Example of <eddie_work_folder>: `/exports/igmm/eddie/IGMM-VariantAnalysis/mike/work_folder`
Example of <datastore_work_folder>: `I:\IGMM-VariantAnalysis\documentation\trio_whole_exome\work_folder`
### Obtaining and pre-processing the gene panel file
```
E> cd <eddie_work_folder>
E> wget https://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
E> mv DDG2P.csv.gz DDG2P.orig.<date_downloaded>.csv.gz
E> gunzip -c DDG2P.orig.<date_downloaded>.csv.gz > DDG2P.orig.<date_downloaded>.csv
```
* Take a note of the date tag of the outdated DDG2P gene panel; call it `<date_downloaded_old>`
* Copy `DDG2P.orig.<date_downloaded>.csv` from `<eddie_work_folder>` to `<datastore_work_folder>`
* From `<datastore_work_folder>` open the file with Excel and sort by allelic requirement
* Remove entries with no allelic requirement listed (if any)
* Split records (rows) with multiple (comma separated) allelic requirements; sort again
* Save as `DDG2P.<date_downloaded>.csv`
* Copy `DDG2P.<date_downloaded>.csv` from `<datastore_work_folder>` to `<eddie_work_folder>` and to ultra at `/home/u035/u035/shared/resources/G2P`
`DDG2P.20210706.csv` stats
```
records with biallelic requirement: 1339 # 1310 in 20201208
records with monoallelic requirement: 873 # 911 in 20201208
records with hemizygous requirement: 159 # 179 in 20201208
records with x-linked dominant requirement: 46 # 44 in 20201208
records with x-linked over-dominance requirement: 2 # 2 in 20201208
records with digenic requirement: 1 (G2P ignores them)
records with imprinted requirement: 10 (G2P ignores them)
records with mitochondrial requirement: 2 (G2P ignores them)
records with uncertain requirement: 0 (G2P ignores them)
records with mosaic requirement: 12 (G2P ignores them)
records with no allelic requirement: 3 (excluded)
```
### Obtaining the list of all unique gene names in the DD gene panel (current gene symbol and previous gene symbols)
```
E> cd <eddie_work_folder>
E> time python /exports/igmm/eddie/IGMM-VariantAnalysis/mike/scripts/extract_unique_genes.py DDG2P.<date_downloaded>.csv genes_in_DDG2P.<date_downloaded>.txt
Found 3553 unique gene names (incl.synonyms) in DDG2P.20210706.csv
recorded 3553 unique gene names (incl. synonyms); outfile = genes_in_DDG2P.20210706.txt
```
* Copy `genes_in_DDG2P.<date_downloaded>.txt` from `<eddie_work_folder>` to `<datastore_work_folder>` and to ultra at /home/u035/u035/shared/resources/G2P
* After updating the DDG2P list, update the resources for coverage analysis, see below:
* CCDS Dataset (check if updated CCDS is available)
* DD genes Dataset
* ClinVar Dataset
* ClinVar annotated DD exons from CCDS
After the updates of the resources for the coverage analysis are completed, update the driver scripts for all implemented family structures:
* `/home/u035/u035/shared/scripts/process_NHS_WES_trio.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_quad.sh`
to point to the updated files:
```
TARGETS=/home/u035/u035/shared/resources/G2P/DDG2P.20210706.plus15bp.merged.bed
CLINVAR=/home/u035/u035/shared/resources/G2P/DDG2P.20210706.clinvar.20210626.plus15bp.txt
echo "Performing G2P analysis (DD genes)for FAMILY_ID = ${PLATE_ID}_${FAMILY_ID}..."
echo "Using ${TARGETS}"
--transcript_filter "gene_symbol in /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20210706.txt"
--plugin G2P,file='/home/u035/u035/shared/resources/G2P/DDG2P.20210706.csv',af_from_vcf...
```
## Resources for coverage analysis
### CCDS Dataset
Source: https://www.ncbi.nlm.nih.gov/projects/CCDS
Date obtained: 28/02/2019
Location at EPCC: /home/u035/u035/shared/resources/exome_targets
File name: CCDS.20180614.plus15bp.merged.bed
To check if updates are available, go to https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi?REQUEST=SHOW_STATISTICS
Downloaded CCDS.20180614.txt (hg38) from the CCDS site, converted to BED format (exon number appended to CCDS id in order), added 15bp each side, sorted, and merged.
```
E> perl ../scripts/ccds_to_bed.pl
E> perl ../scripts/ccds_to_bed.pl -i CCDS.20180614.txt -o CCDS.20180614.bed
E> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg38.chromInfo" > hg38.genome
E> bedtools slop -i CCDS.20180614.bed -b 15 -g hg38.genome > CCDS.20180614.plus15bp.bed
E> bedtools sort -i CCDS.20180614.plus15bp.bed -faidx /exports/igmm/eddie/bioinfsvice/ameynert/software/bcbio-1.0.7/genomes/Hsapiens/hg38/seq/hg38.fa.fai > CCDS.20180614.plus15bp.sorted.bed
E> bedtools merge -i CCDS.20180614.plus15bp.sorted.bed -c 4 -o distinct > CCDS.20180614.plus15bp.merged.bed
```
### DD genes Dataset
Source: /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20210706.txt
Date obtained: 08/12/2020
Location at EPCC: /home/u035/u035/shared/resources/G2P
File name: DDG2P.20210706.plus15bp.merged.bed
From the CCDS BED file (above), extract a BED file for the DD genes
```
U> cd /home/u035/u035/shared/resources/G2P
U> PYTHON=/home/u035/u035/shared/software/bcbio/anaconda/envs/python2/bin/python2.7
U> time $PYTHON /home/u035/u035/shared/scripts/extract_BED_CCDS_DDG2P.py CCDS.20180614.plus15bp.merged.bed genes_in_DDG2P.20210706.txt DDG2P.20210706.plus15bp.merged.bed
Found 3553 unique gene names in genes_in_DDG2P.20210706.txt
Read 193346 records from the input BED file = CCDS.20180614.plus15bp.merged.bed
Wrote 33275 record for the DDG2P genes in the output BED file = DDG2P.20210706.plus15bp.merged.bed
Found intervals for 2156 uniq DDG2P genes
```
### ClinVar Dataset
Source: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/
Date obtained: 06/07/2021
Location at EPCC: /home/u035/u035/shared/resources/clinvar
File name: clinvar_20210626.P_LP.ACP.vcf
Description of ClinVar VCF @ `https://www.ncbi.nlm.nih.gov/variation/docs/ClinVar_vcf_files/`
```
U> cd /home/u035/u035/shared/resources/clinvar
U> wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20210626.vcf.gz
U> wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20210626.vcf.gz.tbi
# grab the header and only variants annotated as Pathogenic or Likely Pathogenic # and not with conflicting interpretation
U> zgrep '^#' clinvar_20210626.vcf.gz > clinvar_20210626.P_LP.vcf && zgrep -E 'CLNSIG=Likely_pathogenic;|CLNSIG=Pathogenic;' clinvar_20210626.vcf.gz >> clinvar_20210626.P_LP.vcf
# need to add chr prefix in the clinvar_20201128.P_LP.vcf file
U> awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' clinvar_20210626.P_LP.vcf > clinvar_20210626.P_LP.chr.vcf
# Exclude variants with “no assertion criteria provided”
U> grep '^#' clinvar_20210626.P_LP.chr.vcf > clinvar_20210626.P_LP.ACP.vcf && grep -v 'CLNREVSTAT=no_assertion_criteria_provided' clinvar_20210626.P_LP.chr.vcf >> clinvar_20210626.P_LP.ACP.vcf
```
### ClinVar annotated DD exons from CCDS
Source: DDG2P.20210706.plus15bp.merged.bed & clinvar_20210626.P_LP.ACP.vcf (see above)
Location at EPCC: /home/u035/u035/shared/resources/G2P
File name: DDG2P.20210706.clinvar.20210626.plus15bp.txt
A BED file for all CCDS exons (15bp padded) found in the DD genes, annotated with the number of “relevant” variants reported in ClinVar (pathogenic or likely pathogenic with provided assertion criteria) for which the proband coverage is to be computed.
Use bedtools to count and record the number of P/LP variants per each interval
```
U> cd /home/u035/u035/shared/resources/G2P
U> BEDTOOLS=/home/u035/u035/shared/software/bcbio/anaconda/bin/bedtools
U> $BEDTOOLS intersect -wa -c -a DDG2P.20210706.plus15bp.merged.bed -b ../clinvar/clinvar_20210626.P_LP.ACP.vcf > DDG2P.20210706.clinvar.20210626.plus15bp.txt
# proportion of all P/LP variants with assertion criteria provided in DD genes
U> grep -v '^#' ../clinvar/clinvar_20210626.P_LP.ACP.vcf | wc -l
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
Total of 102341 ClinVar vars
U> cat DDG2P.20210706.clinvar.20210626.plus15bp.txt | awk '{sum += $5} END {print sum}'
67623 of the ClinVar vars are in DD genes; 67623/102341=66% of all are in DD gene
```
### Mapping the Family_ID (aka DECIPHER_ID) to Proband_ID (aka DECIPHER internal ID)
Source: https://git.ecdf.ed.ac.uk/igmmbioinformatics/decipher-id-mapping/tree/master/bin (in-house script)
Date obtained: 19/09/2019
Location: a suitable and secure folder on your local laptop/PC
File name: decipher-id-mapping.jar
To get access to the source, email alison.meynert@ed.ac.uk
* Make sure your java version is reasonably recent (e.g. java version "13" 2019-09-17)
* Download the decipher-id-mapping.jar file (downloads as jar.zip, delete .zip)
* Get the Selenium Chrome driver for the laptop’s Chrome browser version from https://chromedriver.chromium.org/downloads, store and unzip in the same folder as the decipher-id-mapping.jar file
* Chrome v77 can be downloaded from: https://www.neowin.net/news/google-chrome-770386575-offline-installer/ (64bit)
### Variant Blacklist
Source: NHSS
Date obtained: 25/09/2019
Location at EPCC: /home/u035/u035/shared/resources/blacklist
File name: current_blacklist.txt
This is a file which contains variant which were assessed by NHSS as safe to be excluded from analysis and reports (e.g., cannot be lifted over from GRCh38 to DECIPHER’s v37 and/or seen too frequently in previously analyzed batches, etc.). It should be provided by NHSS before the variants in a new batch are to be prioritized (contact Morad Ansari morad.ansari@nhs.net).
Open the Excel file provided by NHSS and store the information in a tab-separated file named `blacklist.<date_received>.txt` with the format chr pos ref alt, adding the ‘chr’ prefix if necessary. Create a copy of the file named `current_blacklist.txt` which is looked for and used by `NHS_WES_filter_LQ_GT.py`.
```
U> cd /home/u035/u035/shared/resources/blacklist
U> nano blacklist.2019-11-27.txt
U> cp blacklist.2019-11-27.txt current_blacklist.txt
```
### Transcript Replacement
Source: NHSS
Date obtained: 25/09/2019
Location at EPCC: /home/u035/u035/shared/resources/trans_map
File name: current_trans_map.txt
Some of the VEP (v97) GRCh38 transcripts are not currently recognized by DECIPHER during bulk upload. NHSS will generate a transcript replacement file (usually included in the same excel file as the variant blacklist) for the preferred transcript ID (available in DECIPHER). It should be provided by NHSS before the variants in a new batch are to be prioritized (contact Morad Ansari morad.ansari@nhs.net).
Open the Excel file provided by NHSS and store the information in a tab-separated file named `trans_map.<date_received>.txt` with the format `Unrecognized_transcript Replacement_transcript`. Create a copy of the file named `current_trans_map.txt` which is looked for and used by `NHS_WES_filter_LQ_GT.py`.
```
U> cd /home/u035/u035/shared/resources/trans_map
U> nano trans_map.2019-11-27.txt
U> cp trans_map.2019-11-27.txt current_trans_map.txt
```
### VASE setup in STRICT mode
Location at EPCC: /home/u035/u035/shared/software/bcbio/anaconda/bin/vase
Parameter Values for VASE STRICT
```
VASE=/home/u035/u035/shared/software/bcbio/anaconda/bin/vase
IN_FILE=${VCF_DIR}/${PLATE_ID}_${FAMILY_ID}.ready.vcf.gz
OUT_FILE=${VASE_DIR}/${PLATE_ID}_${FAMILY_ID}.strict.denovo.vcf
PED_FILE=${PED_DIR}/${BATCH_ID}_${PLATE_ID}_${FAMILY_ID}.ped
time ${VASE} \
-i ${IN_FILE} \
-o ${OUT_FILE} \
--log_progress \
--prog_interval 100000 \
--freq 0.0001 \
--gq 30 --dp 10 \
--het_ab 0.3 \
--max_alt_alleles 1 \
--csq all \
--biotypes all \
--control_gq 15 --control_dp 5 \
--control_het_ab 0.01 \
--control_max_ref_ab 0.05 \
--de_novo \
--ped ${PED_FILE}
```