Skip to content
Snippets Groups Projects

Standard operating procedure - Setup for variant prioritization in trio whole exome samples at the Edinburgh Parallel Computing Centre

This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE). It assumes that data has been successfully aligned, variant called and annotated (see SOP_trio_whole_exome_EPCC_pipeline). Scripts and resource datasets are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@ed.ac.uk.

This setup must not be altered in any way unless sanctioned by NHSS. Contact Morad Ansari morad.ansari@nhs.net, morad.ansari@nhslothian.scot.nhs.uk.

Definitions

Command lines starting with E> are to be executed on Eddie; U> refers to Ultra2 (EPCC).

Text in angle brackets, e.g. indicates variable parameters.

The Variant Effect Predictor (VEP)

Version: ensembl-vep-100.4-0

Location at EPCC: /home/u035/u035/shared/software/bcbio/anaconda/bin/vep

Note: make sure the VEP cache files are tabix converted (significant speedup)

Check the downloaded VEP cache files were tabix converted

U> ls -l /home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/vep/homo_sapiens_merged/100_GRCh38/<chr>

where is the chromosome name without the ‘chr’ prefix (e.g., 1,2, ..., 22). A presence of a file called all_vars.gz.csi indicates that the VEP cache files were tabix converted.

The G2P plugin for VEP

Version: 100

Source: https://github.com/Ensembl/VEP_plugins/blob/release/100/G2P.pm

Date obtained: 21/04/2021

Location at EPCC: /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0

File name: G2P.pm

Add splice region variants (1-3 bases of the exon or 3-8 bases of the intron) for consideration by G2P

Edit the G2P.pm file by adding the type splice_region_variant at the end of:

types        : SO consequence types to include ...		(line 84)
types => {map {$_ => 1} qw(splice_donor_variant ...		(line 145)

Setting G2P in completely offline mode

All external datasets listed for the af_from_vcf_keys flag (gnomADe_r2.1.1_GRCh38|gnomADg_r3.1.1_GRCh38) in:

  • /home/u035/u035/shared/scripts/process_NHS_WES_trio.sh
  • /home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh

must be available locally (see below for downloading gnomADe and gnomADg datasets)

  • gnomADg dataset (r3.1.1, downloaded 23/08/2021): /home/u035/u035/shared/resources/gnomad/r3.1.1/genomes
  • gnomADe dataset (r2.1.1, downloaded 23/08/2021): /home/u035/u035/shared/resources/gnomad/r2.1.1/exomes

To re-fetch the gnomADg dataset:

U> cd /home/u035/u035/shared/resources/gnomad/r3.1.1/genomes
U> for i in {1..22} X Y
U> do
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr${i}.vcf.bgz
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr${i}.vcf.bgz.tbi
U> done

To re-fetch the gnomADe dataset:

U> cd /home/u035/u035/shared/resources/gnomad/r2.1.1/exomes
U> for i in {1..22} X Y
U> do
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.${i}.liftover_grch38.vcf.bgz
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.${i}.liftover_grch38.vcf.bgz.tbi
U> done

Edit the /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json file to update type (remote -> local) and filename_template (local path to datasets) variables for these local datasets.

gnomADg (lines 138-143)

"id": "gnomADg_r3.1.1_GRCh38",
"description": "Genome Aggregation Database genomes r3.1.1",
"species": "homo_sapiens",
"assembly": "GRCh38",
"type": "local",
"filename_template": "/home/u035/u035/shared/resources/gnomad/r3.1.1/genomes/gnomad.genomes.v3.1.1.sites.chr###CHR###.vcf.bgz",

gnomADe (lines 199-204)

"id": "gnomADe_r2.1.1_GRCh38",
"description": "Genome Aggregation Database exomes r2.1.1 liftover to GRCh38",
"species": "homo_sapiens",
"assembly": "GRCh38",
"type": "local",
"filename_template": "/home/u035/u035/shared/resources/gnomad/r2.1.1/exomes/gnomad.exomes.r2.1.1.sites.###CHR###.liftover_grch38.vcf.bgz",

Parameter Values for G2P call

Files

  • /home/u035/u035/shared/scripts/process_NHS_WES_trio.sh
  • /home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh
VEP="/home/u035/u035/shared/software/bcbio/anaconda/bin/perl /home/u035/u035/shared/software/bcbio/anaconda/bin/vep"    
# this points to ../share/ensembl-vep-100.4-0/vep

REFERENCE_GENOME=/home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa

IN_FILE=${VCF_DIR}/${PLATE_ID}_${FAMILY_ID}.clean.vcf
G2P_LOG_DIR=${G2P_DIR}/${PLATE_ID}_${FAMILY_ID}_LOG_DIR
mkdir ${G2P_LOG_DIR}
TXT_OUT=${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}.report.txt
HTML_OUT=${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}.report.html
VCF_KEYS='gnomADe_GRCh38|gnomADg_r3.0_GRCh38'

time ${VEP} \
    -i ${IN_FILE} \
    --output_file ${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}_inter_out.txt \
    --force_overwrite \
    --assembly GRCh38 \
    --fasta ${REFERENCE_GENOME} \
    --offline \
    --merged \
    --use_given_ref \
    --cache --cache_version 100 \
    --dir_cache /home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/vep \
    --individual all \
    --transcript_filter "gene_symbol in /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20201208.txt" \
    --dir_plugins /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0 \
    --plugin G2P,file='/home/u035/u035/shared/resources/G2P/DDG2P.20201208.csv',af_from_vcf=1,confidence_levels='confirmed&probable&both RD and IF',af_from_vcf_keys=${VCF_KEYS},\
log_dir=${G2P_LOG_DIR},txt_report=${TXT_OUT},html_report=${HTML_OUT}

Developmental Disorders (DD) gene panel and list of unique gene names (with synonyms)

Source: https://www.ebi.ac.uk/gene2phenotype/downloads

Date obtained: 06/07/2021

Location at EPCC: /home/u035/u035/shared/resources/G2P

File names: DDG2P.20210706.csv and genes_in_DDG2P.20210706.txt

Example of <eddie_work_folder>: /exports/igmm/eddie/IGMM-VariantAnalysis/mike/work_folder Example of <datastore_work_folder>: I:\IGMM-VariantAnalysis\documentation\trio_whole_exome\work_folder

Obtaining and pre-processing the gene panel file

E> cd <eddie_work_folder>
E> wget https://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
E> mv DDG2P.csv.gz DDG2P.orig.<date_downloaded>.csv.gz
E> gunzip -c DDG2P.orig.<date_downloaded>.csv.gz > DDG2P.orig.<date_downloaded>.csv
  • Take a note of the date tag of the outdated DDG2P gene panel; call it <date_downloaded_old>
  • Copy DDG2P.orig.<date_downloaded>.csv from <eddie_work_folder> to <datastore_work_folder>
  • From <datastore_work_folder> open the file with Excel and sort by allelic requirement
  • Remove entries with no allelic requirement listed (if any)
  • Split records (rows) with multiple (comma separated) allelic requirements; sort again
  • Save as DDG2P.<date_downloaded>.csv
  • Copy DDG2P.<date_downloaded>.csv from <datastore_work_folder> to <eddie_work_folder> and to ultra at /home/u035/u035/shared/resources/G2P

DDG2P.20210706.csv stats

	records with biallelic requirement:				1339	# 1310 in 20201208
	records with monoallelic requirement:			873	# 911 in 20201208
	records with hemizygous requirement:			159	# 179	in 20201208
	records with x-linked dominant requirement:		46	# 44 in 20201208
	records with x-linked over-dominance requirement:	2	# 2 in 20201208
	records with digenic requirement:				1	(G2P ignores them)
	records with imprinted requirement:			10	(G2P ignores them)
	records with mitochondrial requirement:			2	(G2P ignores them)
	records with uncertain requirement:			0	(G2P ignores them)
	records with mosaic requirement:				12	(G2P ignores them)
	records with no allelic requirement:			3	(excluded)

Obtaining the list of all unique gene names in the DD gene panel (current gene symbol and previous gene symbols)

E> cd <eddie_work_folder>
E> time python /exports/igmm/eddie/IGMM-VariantAnalysis/mike/scripts/extract_unique_genes.py DDG2P.<date_downloaded>.csv genes_in_DDG2P.<date_downloaded>.txt
Found 3553 unique gene names (incl.synonyms) in DDG2P.20210706.csv
recorded 3553 unique gene names (incl. synonyms); outfile = genes_in_DDG2P.20210706.txt
  • Copy genes_in_DDG2P.<date_downloaded>.txt from <eddie_work_folder> to <datastore_work_folder> and to ultra at /home/u035/u035/shared/resources/G2P
  • After updating the DDG2P list, update the resources for coverage analysis, see below:
    • CCDS Dataset (check if updated CCDS is available)
    • DD genes Dataset
    • ClinVar Dataset
    • ClinVar annotated DD exons from CCDS

After the updates of the resources for the coverage analysis are completed, update the driver scripts for all implemented family structures:

  • /home/u035/u035/shared/scripts/process_NHS_WES_trio.sh
  • /home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh
  • /home/u035/u035/shared/scripts/process_NHS_WES_quad.sh

to point to the updated files:

TARGETS=/home/u035/u035/shared/resources/G2P/DDG2P.20210706.plus15bp.merged.bed
CLINVAR=/home/u035/u035/shared/resources/G2P/DDG2P.20210706.clinvar.20210626.plus15bp.txt
echo "Performing G2P analysis (DD genes)for FAMILY_ID = ${PLATE_ID}_${FAMILY_ID}..."
echo "Using ${TARGETS}"
--transcript_filter "gene_symbol in /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20210706.txt"
--plugin G2P,file='/home/u035/u035/shared/resources/G2P/DDG2P.20210706.csv',af_from_vcf...

Resources for coverage analysis

CCDS Dataset

Source: https://www.ncbi.nlm.nih.gov/projects/CCDS

Date obtained: 28/02/2019

Location at EPCC: /home/u035/u035/shared/resources/exome_targets

File name: CCDS.20180614.plus15bp.merged.bed

To check if updates are available, go to https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi?REQUEST=SHOW_STATISTICS

Downloaded CCDS.20180614.txt (hg38) from the CCDS site, converted to BED format (exon number appended to CCDS id in order), added 15bp each side, sorted, and merged.

E> perl ../scripts/ccds_to_bed.pl
E> perl ../scripts/ccds_to_bed.pl -i CCDS.20180614.txt -o CCDS.20180614.bed
E> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg38.chromInfo" > hg38.genome
E> bedtools slop -i CCDS.20180614.bed -b 15 -g hg38.genome > CCDS.20180614.plus15bp.bed
E> bedtools sort -i CCDS.20180614.plus15bp.bed -faidx /exports/igmm/eddie/bioinfsvice/ameynert/software/bcbio-1.0.7/genomes/Hsapiens/hg38/seq/hg38.fa.fai > CCDS.20180614.plus15bp.sorted.bed
E> bedtools merge -i CCDS.20180614.plus15bp.sorted.bed -c 4 -o distinct > CCDS.20180614.plus15bp.merged.bed

DD genes Dataset

Source: /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20210706.txt

Date obtained: 08/12/2020

Location at EPCC: /home/u035/u035/shared/resources/G2P

File name: DDG2P.20210706.plus15bp.merged.bed

From the CCDS BED file (above), extract a BED file for the DD genes

U> cd /home/u035/u035/shared/resources/G2P
U> PYTHON=/home/u035/u035/shared/software/bcbio/anaconda/envs/python2/bin/python2.7
U> time $PYTHON /home/u035/u035/shared/scripts/extract_BED_CCDS_DDG2P.py CCDS.20180614.plus15bp.merged.bed genes_in_DDG2P.20210706.txt DDG2P.20210706.plus15bp.merged.bed
Found 3553 unique gene names in genes_in_DDG2P.20210706.txt
Read 193346 records from the input BED file = CCDS.20180614.plus15bp.merged.bed
Wrote 33275 record for the DDG2P genes in the output BED file = DDG2P.20210706.plus15bp.merged.bed
Found intervals for 2156 uniq DDG2P genes

ClinVar Dataset

Source: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

Date obtained: 06/07/2021

Location at EPCC: /home/u035/u035/shared/resources/clinvar

File name: clinvar_20210626.P_LP.ACP.vcf

Description of ClinVar VCF @ https://www.ncbi.nlm.nih.gov/variation/docs/ClinVar_vcf_files/

U> cd /home/u035/u035/shared/resources/clinvar
U> wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20210626.vcf.gz
U> wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20210626.vcf.gz.tbi

# grab the header and only variants annotated as Pathogenic or Likely Pathogenic # and not with conflicting interpretation
U> zgrep '^#' clinvar_20210626.vcf.gz > clinvar_20210626.P_LP.vcf && zgrep -E 'CLNSIG=Likely_pathogenic;|CLNSIG=Pathogenic;' clinvar_20210626.vcf.gz >>  clinvar_20210626.P_LP.vcf

# need to add chr prefix in the clinvar_20201128.P_LP.vcf file
U> awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' clinvar_20210626.P_LP.vcf > clinvar_20210626.P_LP.chr.vcf

# Exclude variants with “no assertion criteria provided”
U> grep '^#' clinvar_20210626.P_LP.chr.vcf > clinvar_20210626.P_LP.ACP.vcf && grep -v 'CLNREVSTAT=no_assertion_criteria_provided' clinvar_20210626.P_LP.chr.vcf >> clinvar_20210626.P_LP.ACP.vcf 

ClinVar annotated DD exons from CCDS

Source: DDG2P.20210706.plus15bp.merged.bed & clinvar_20210626.P_LP.ACP.vcf (see above)

Location at EPCC: /home/u035/u035/shared/resources/G2P

File name: DDG2P.20210706.clinvar.20210626.plus15bp.txt

A BED file for all CCDS exons (15bp padded) found in the DD genes, annotated with the number of “relevant” variants reported in ClinVar (pathogenic or likely pathogenic with provided assertion criteria) for which the proband coverage is to be computed.

Use bedtools to count and record the number of P/LP variants per each interval

U> cd /home/u035/u035/shared/resources/G2P
U> BEDTOOLS=/home/u035/u035/shared/software/bcbio/anaconda/bin/bedtools
U> $BEDTOOLS intersect -wa -c -a DDG2P.20210706.plus15bp.merged.bed -b ../clinvar/clinvar_20210626.P_LP.ACP.vcf > DDG2P.20210706.clinvar.20210626.plus15bp.txt

# proportion of all P/LP variants with assertion criteria provided in DD genes
U> grep -v '^#' ../clinvar/clinvar_20210626.P_LP.ACP.vcf | wc -l		
Total of 102341 ClinVar vars

U> cat DDG2P.20210706.clinvar.20210626.plus15bp.txt | awk '{sum += $5} END {print sum}'
67623 of the ClinVar vars are in DD genes; 67623/102341=66% of all are in DD gene

Mapping the Family_ID (aka DECIPHER_ID) to Proband_ID (aka DECIPHER internal ID)

Source: https://git.ecdf.ed.ac.uk/igmmbioinformatics/decipher-id-mapping/tree/master/bin (in-house script)

Date obtained: 19/09/2019

Location: a suitable and secure folder on your local laptop/PC

File name: decipher-id-mapping.jar

To get access to the source, email alison.meynert@ed.ac.uk

Variant Blacklist

Source: NHSS

Date obtained: 25/09/2019

Location at EPCC: /home/u035/u035/shared/resources/blacklist

File name: current_blacklist.txt

This is a file which contains variant which were assessed by NHSS as safe to be excluded from analysis and reports (e.g., cannot be lifted over from GRCh38 to DECIPHER’s v37 and/or seen too frequently in previously analyzed batches, etc.). It should be provided by NHSS before the variants in a new batch are to be prioritized (contact Morad Ansari morad.ansari@nhs.net).

Open the Excel file provided by NHSS and store the information in a tab-separated file named blacklist.<date_received>.txt with the format chr pos ref alt, adding the ‘chr’ prefix if necessary. Create a copy of the file named current_blacklist.txt which is looked for and used by NHS_WES_filter_LQ_GT.py.

U> cd /home/u035/u035/shared/resources/blacklist
U> nano blacklist.2019-11-27.txt
U> cp blacklist.2019-11-27.txt current_blacklist.txt

Transcript Replacement

Source: NHSS

Date obtained: 25/09/2019

Location at EPCC: /home/u035/u035/shared/resources/trans_map

File name: current_trans_map.txt

Some of the VEP (v97) GRCh38 transcripts are not currently recognized by DECIPHER during bulk upload. NHSS will generate a transcript replacement file (usually included in the same excel file as the variant blacklist) for the preferred transcript ID (available in DECIPHER). It should be provided by NHSS before the variants in a new batch are to be prioritized (contact Morad Ansari morad.ansari@nhs.net).

Open the Excel file provided by NHSS and store the information in a tab-separated file named trans_map.<date_received>.txt with the format Unrecognized_transcript Replacement_transcript. Create a copy of the file named current_trans_map.txt which is looked for and used by NHS_WES_filter_LQ_GT.py.

U> cd /home/u035/u035/shared/resources/trans_map
U> nano trans_map.2019-11-27.txt
U> cp trans_map.2019-11-27.txt current_trans_map.txt

VASE setup in STRICT mode

Location at EPCC: /home/u035/u035/shared/software/bcbio/anaconda/bin/vase

Parameter Values for VASE STRICT

VASE=/home/u035/u035/shared/software/bcbio/anaconda/bin/vase

IN_FILE=${VCF_DIR}/${PLATE_ID}_${FAMILY_ID}.ready.vcf.gz
OUT_FILE=${VASE_DIR}/${PLATE_ID}_${FAMILY_ID}.strict.denovo.vcf
PED_FILE=${PED_DIR}/${BATCH_ID}_${PLATE_ID}_${FAMILY_ID}.ped

time ${VASE} \
    -i ${IN_FILE} \
    -o ${OUT_FILE} \
    --log_progress \
    --prog_interval 100000 \
    --freq 0.0001 \
    --gq 30 --dp 10 \
    --het_ab 0.3 \
    --max_alt_alleles 1 \
    --csq all \
    --biotypes all \
    --control_gq 15 --control_dp 5 \
    --control_het_ab 0.01 \
    --control_max_ref_ab 0.05 \
    --de_novo \
    --ped ${PED_FILE}