Setup_variant_prioritization.md

# Standard operating procedure - Setup for variant prioritization in trio whole exome samples at the Edinburgh Parallel Computing Centre

This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE). It assumes that data has been successfully aligned, variant called and annotated (see SOP_trio_whole_exome_EPCC_pipeline). Scripts and resource datasets are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@ed.ac.uk.

This setup must not be altered in any way unless sanctioned by NHSS. Contact Morad Ansari morad.ansari@nhs.net, morad.ansari@nhslothian.scot.nhs.uk.

## Definitions

Command lines starting with E> are to be executed on Eddie; U> refers to Ultra2 (EPCC).

Text in angle brackets, e.g. <chr> indicates variable parameters. 


## The Variant Effect Predictor (VEP)
Version:		ensembl-vep-100.4-0

Location at EPCC:	/home/u035/u035/shared/software/bcbio/anaconda/bin/vep

Note:			make sure the VEP cache files are tabix converted (significant speedup)

### Check the downloaded VEP cache files were tabix converted

```
U> ls -l /home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/vep/homo_sapiens_merged/100_GRCh38/<chr>
```

where <chr> is the chromosome name without the ‘chr’ prefix (e.g., 1,2, ..., 22).
A presence of a file called all_vars.gz.csi indicates that the VEP cache files were tabix converted.

## The G2P plugin for VEP

Version:		100

Source:		https://github.com/Ensembl/VEP_plugins/blob/release/100/G2P.pm 

Date obtained:	21/04/2021

Location at EPCC:	/home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0

File name:		G2P.pm 

### Add splice region variants (1-3 bases of the exon or 3-8 bases of the intron) for consideration by G2P

Edit the G2P.pm file by adding the type `splice_region_variant` at the end of:

```
types        : SO consequence types to include ...		(line 84)
types => {map {$_ => 1} qw(splice_donor_variant ...		(line 145)
```

### Setting G2P in completely offline mode

All external datasets listed for the `af_from_vcf_keys flag` (gnomADe_r2.1.1_GRCh38|gnomADg_r3.1.1_GRCh38) in:
* `/home/u035/u035/shared/scripts/process_NHS_WES_trio.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh`

must be available locally  (see below for downloading gnomADe and gnomADg datasets)

* gnomADg dataset (r3.1.1, downloaded 23/08/2021): `/home/u035/u035/shared/resources/gnomad/r3.1.1/genomes`
* gnomADe dataset (r2.1.1, downloaded 23/08/2021): `/home/u035/u035/shared/resources/gnomad/r2.1.1/exomes`

To re-fetch the gnomADg dataset:
```
U> cd /home/u035/u035/shared/resources/gnomad/r3.1.1/genomes
U> for i in {1..22} X Y
U> do
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr${i}.vcf.bgz
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.1/vcf/genomes/gnomad.genomes.v3.1.1.sites.chr${i}.vcf.bgz.tbi
U> done
```

To re-fetch the gnomADe dataset:
```
U> cd /home/u035/u035/shared/resources/gnomad/r2.1.1/exomes
U> for i in {1..22} X Y
U> do
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.${i}.liftover_grch38.vcf.bgz
U>   wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/liftover_grch38/vcf/exomes/gnomad.exomes.r2.1.1.sites.${i}.liftover_grch38.vcf.bgz.tbi
U> done
```

Edit the `/home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/DBSQL/vcf_config.json` file to update type (remote -> local) and filename_template (local path to datasets) variables for these local datasets.

gnomADg (lines 138-143)
```
"id": "gnomADg_r3.1.1_GRCh38",
"description": "Genome Aggregation Database genomes r3.1.1",
"species": "homo_sapiens",
"assembly": "GRCh38",
"type": "local",
"filename_template": "/home/u035/u035/shared/resources/gnomad/r3.1.1/genomes/gnomad.genomes.v3.1.1.sites.chr###CHR###.vcf.bgz",
```

gnomADe (lines 199-204)
```
"id": "gnomADe_r2.1.1_GRCh38",
"description": "Genome Aggregation Database exomes r2.1.1 liftover to GRCh38",
"species": "homo_sapiens",
"assembly": "GRCh38",
"type": "local",
"filename_template": "/home/u035/u035/shared/resources/gnomad/r2.1.1/exomes/gnomad.exomes.r2.1.1.sites.###CHR###.liftover_grch38.vcf.bgz",
```

## Parameter Values for G2P call

Files
* `/home/u035/u035/shared/scripts/process_NHS_WES_trio.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh` 

```
VEP="/home/u035/u035/shared/software/bcbio/anaconda/bin/perl /home/u035/u035/shared/software/bcbio/anaconda/bin/vep"    
# this points to ../share/ensembl-vep-100.4-0/vep

REFERENCE_GENOME=/home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa

IN_FILE=${VCF_DIR}/${PLATE_ID}_${FAMILY_ID}.clean.vcf
G2P_LOG_DIR=${G2P_DIR}/${PLATE_ID}_${FAMILY_ID}_LOG_DIR
mkdir ${G2P_LOG_DIR}
TXT_OUT=${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}.report.txt
HTML_OUT=${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}.report.html
VCF_KEYS='gnomADe_GRCh38|gnomADg_r3.0_GRCh38'

time ${VEP} \
    -i ${IN_FILE} \
    --output_file ${G2P_LOG_DIR}/${PLATE_ID}_${FAMILY_ID}_inter_out.txt \
    --force_overwrite \
    --assembly GRCh38 \
    --fasta ${REFERENCE_GENOME} \
    --offline \
    --merged \
    --use_given_ref \
    --cache --cache_version 100 \
    --dir_cache /home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/vep \
    --individual all \
    --transcript_filter "gene_symbol in /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20201208.txt" \
    --dir_plugins /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0 \
    --plugin G2P,file='/home/u035/u035/shared/resources/G2P/DDG2P.20201208.csv',af_from_vcf=1,confidence_levels='confirmed&probable&both RD and IF',af_from_vcf_keys=${VCF_KEYS},\
log_dir=${G2P_LOG_DIR},txt_report=${TXT_OUT},html_report=${HTML_OUT}
```

## Developmental Disorders (DD) gene panel and list of unique gene names (with synonyms)

Source:		https://www.ebi.ac.uk/gene2phenotype/downloads

Date obtained:	06/07/2021

Location at EPCC:	/home/u035/u035/shared/resources/G2P

File names:		DDG2P.20210706.csv and genes_in_DDG2P.20210706.txt 

Example of <eddie_work_folder>: `/exports/igmm/eddie/IGMM-VariantAnalysis/mike/work_folder`
Example of <datastore_work_folder>: `I:\IGMM-VariantAnalysis\documentation\trio_whole_exome\work_folder`

### Obtaining and pre-processing the gene panel file

```
E> cd <eddie_work_folder>
E> wget https://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
E> mv DDG2P.csv.gz DDG2P.orig.<date_downloaded>.csv.gz
E> gunzip -c DDG2P.orig.<date_downloaded>.csv.gz > DDG2P.orig.<date_downloaded>.csv
```

* Take a note of the date tag of the outdated DDG2P gene panel; call it `<date_downloaded_old>`
* Copy `DDG2P.orig.<date_downloaded>.csv` from `<eddie_work_folder>` to `<datastore_work_folder>` 
* From `<datastore_work_folder>` open the file with Excel and sort by allelic requirement
* Remove entries with no allelic requirement listed (if any)
* Split records (rows) with multiple (comma separated) allelic requirements; sort again
* Save as `DDG2P.<date_downloaded>.csv`
* Copy `DDG2P.<date_downloaded>.csv` from `<datastore_work_folder>` to `<eddie_work_folder>` and to ultra at `/home/u035/u035/shared/resources/G2P`

`DDG2P.20210706.csv` stats
```
	records with biallelic requirement:				1339	# 1310 in 20201208
	records with monoallelic requirement:			873	# 911 in 20201208
	records with hemizygous requirement:			159	# 179	in 20201208
	records with x-linked dominant requirement:		46	# 44 in 20201208
	records with x-linked over-dominance requirement:	2	# 2 in 20201208
	records with digenic requirement:				1	(G2P ignores them)
	records with imprinted requirement:			10	(G2P ignores them)
	records with mitochondrial requirement:			2	(G2P ignores them)
	records with uncertain requirement:			0	(G2P ignores them)
	records with mosaic requirement:				12	(G2P ignores them)
	records with no allelic requirement:			3	(excluded)
```

### Obtaining the list of all unique gene names in the DD gene panel (current gene symbol and previous gene symbols)

```
E> cd <eddie_work_folder>
E> time python /exports/igmm/eddie/IGMM-VariantAnalysis/mike/scripts/extract_unique_genes.py DDG2P.<date_downloaded>.csv genes_in_DDG2P.<date_downloaded>.txt
Found 3553 unique gene names (incl.synonyms) in DDG2P.20210706.csv
recorded 3553 unique gene names (incl. synonyms); outfile = genes_in_DDG2P.20210706.txt
```

* Copy `genes_in_DDG2P.<date_downloaded>.txt` from `<eddie_work_folder>` to `<datastore_work_folder>` and to ultra at /home/u035/u035/shared/resources/G2P
* After updating the DDG2P list, update the resources for coverage analysis, see below:
  * CCDS Dataset (check if updated CCDS  is available)
  * DD genes Dataset
  * ClinVar Dataset
  * ClinVar annotated DD exons from CCDS

After the updates of the resources for the coverage analysis are completed, update the driver scripts for all implemented family structures:
* `/home/u035/u035/shared/scripts/process_NHS_WES_trio.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_aff_probands.sh`
* `/home/u035/u035/shared/scripts/process_NHS_WES_quad.sh`

to point to the updated files:
```
TARGETS=/home/u035/u035/shared/resources/G2P/DDG2P.20210706.plus15bp.merged.bed
CLINVAR=/home/u035/u035/shared/resources/G2P/DDG2P.20210706.clinvar.20210626.plus15bp.txt
echo "Performing G2P analysis (DD genes)for FAMILY_ID = ${PLATE_ID}_${FAMILY_ID}..."
echo "Using ${TARGETS}"
--transcript_filter "gene_symbol in /home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20210706.txt"
--plugin G2P,file='/home/u035/u035/shared/resources/G2P/DDG2P.20210706.csv',af_from_vcf...
```

## Resources for coverage analysis

### CCDS Dataset

Source:		https://www.ncbi.nlm.nih.gov/projects/CCDS

Date obtained:	28/02/2019

Location at EPCC:	/home/u035/u035/shared/resources/exome_targets

File name:		CCDS.20180614.plus15bp.merged.bed 

To check if updates are available, go to https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi?REQUEST=SHOW_STATISTICS

Downloaded CCDS.20180614.txt (hg38) from the CCDS site, converted to BED format (exon number appended to CCDS id in order), added 15bp each side, sorted, and merged.

```
E> perl ../scripts/ccds_to_bed.pl
E> perl ../scripts/ccds_to_bed.pl -i CCDS.20180614.txt -o CCDS.20180614.bed
E> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e "select chrom, size from hg38.chromInfo" > hg38.genome
E> bedtools slop -i CCDS.20180614.bed -b 15 -g hg38.genome > CCDS.20180614.plus15bp.bed
E> bedtools sort -i CCDS.20180614.plus15bp.bed -faidx /exports/igmm/eddie/bioinfsvice/ameynert/software/bcbio-1.0.7/genomes/Hsapiens/hg38/seq/hg38.fa.fai > CCDS.20180614.plus15bp.sorted.bed
E> bedtools merge -i CCDS.20180614.plus15bp.sorted.bed -c 4 -o distinct > CCDS.20180614.plus15bp.merged.bed
```

### DD genes Dataset

Source:		/home/u035/u035/shared/resources/G2P/genes_in_DDG2P.20210706.txt  

Date obtained:	08/12/2020

Location at EPCC:	/home/u035/u035/shared/resources/G2P

File name:		DDG2P.20210706.plus15bp.merged.bed 

From the CCDS BED file (above), extract a BED file for the DD genes

```
U> cd /home/u035/u035/shared/resources/G2P
U> PYTHON=/home/u035/u035/shared/software/bcbio/anaconda/envs/python2/bin/python2.7
U> time $PYTHON /home/u035/u035/shared/scripts/extract_BED_CCDS_DDG2P.py CCDS.20180614.plus15bp.merged.bed genes_in_DDG2P.20210706.txt DDG2P.20210706.plus15bp.merged.bed
Found 3553 unique gene names in genes_in_DDG2P.20210706.txt
Read 193346 records from the input BED file = CCDS.20180614.plus15bp.merged.bed
Wrote 33275 record for the DDG2P genes in the output BED file = DDG2P.20210706.plus15bp.merged.bed
Found intervals for 2156 uniq DDG2P genes
```

### ClinVar Dataset

Source:		ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

Date obtained:	06/07/2021

Location at EPCC:	/home/u035/u035/shared/resources/clinvar

File name:		clinvar_20210626.P_LP.ACP.vcf  

Description of ClinVar VCF @ `https://www.ncbi.nlm.nih.gov/variation/docs/ClinVar_vcf_files/`

```
U> cd /home/u035/u035/shared/resources/clinvar
U> wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20210626.vcf.gz
U> wget ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar_20210626.vcf.gz.tbi

# grab the header and only variants annotated as Pathogenic or Likely Pathogenic # and not with conflicting interpretation
U> zgrep '^#' clinvar_20210626.vcf.gz > clinvar_20210626.P_LP.vcf && zgrep -E 'CLNSIG=Likely_pathogenic;|CLNSIG=Pathogenic;' clinvar_20210626.vcf.gz >>  clinvar_20210626.P_LP.vcf

# need to add chr prefix in the clinvar_20201128.P_LP.vcf file
U> awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' clinvar_20210626.P_LP.vcf > clinvar_20210626.P_LP.chr.vcf

# Exclude variants with “no assertion criteria provided”
U> grep '^#' clinvar_20210626.P_LP.chr.vcf > clinvar_20210626.P_LP.ACP.vcf && grep -v 'CLNREVSTAT=no_assertion_criteria_provided' clinvar_20210626.P_LP.chr.vcf >> clinvar_20210626.P_LP.ACP.vcf 
```

### ClinVar annotated DD exons from CCDS

Source:	DDG2P.20210706.plus15bp.merged.bed & clinvar_20210626.P_LP.ACP.vcf (see above)

Location at EPCC:	/home/u035/u035/shared/resources/G2P

File name:		DDG2P.20210706.clinvar.20210626.plus15bp.txt  


A BED file for all CCDS exons (15bp padded) found in the DD genes, annotated with the number of “relevant” variants reported in ClinVar (pathogenic or likely pathogenic with provided assertion criteria) for which the proband coverage is to be computed. 

Use bedtools to count and record the number of P/LP variants per each interval

```
U> cd /home/u035/u035/shared/resources/G2P
U> BEDTOOLS=/home/u035/u035/shared/software/bcbio/anaconda/bin/bedtools
U> $BEDTOOLS intersect -wa -c -a DDG2P.20210706.plus15bp.merged.bed -b ../clinvar/clinvar_20210626.P_LP.ACP.vcf > DDG2P.20210706.clinvar.20210626.plus15bp.txt

# proportion of all P/LP variants with assertion criteria provided in DD genes
U> grep -v '^#' ../clinvar/clinvar_20210626.P_LP.ACP.vcf | wc -l		
Total of 102341 ClinVar vars

U> cat DDG2P.20210706.clinvar.20210626.plus15bp.txt | awk '{sum += $5} END {print sum}'
67623 of the ClinVar vars are in DD genes; 67623/102341=66% of all are in DD gene
```

### Mapping the Family_ID (aka DECIPHER_ID) to Proband_ID (aka DECIPHER internal ID)

Source:	https://git.ecdf.ed.ac.uk/igmmbioinformatics/decipher-id-mapping/tree/master/bin (in-house script)

Date obtained:	19/09/2019

Location:		a suitable and secure folder on your local laptop/PC

File name:		decipher-id-mapping.jar

To get access to the source, email alison.meynert@ed.ac.uk 

* Make sure your java version is reasonably recent (e.g. java version "13" 2019-09-17)
* Download the decipher-id-mapping.jar file (downloads as jar.zip, delete .zip)
* Get the Selenium Chrome driver for the laptop’s Chrome browser version from https://chromedriver.chromium.org/downloads, store and unzip in the same folder as the  decipher-id-mapping.jar file
* Chrome v77 can be downloaded from: https://www.neowin.net/news/google-chrome-770386575-offline-installer/ (64bit)

### Variant Blacklist

Source:		NHSS

Date obtained:	25/09/2019

Location at EPCC:	/home/u035/u035/shared/resources/blacklist

File name:		current_blacklist.txt

This is a file which contains variant which were assessed by NHSS as safe to be excluded from analysis and reports (e.g., cannot be lifted over from GRCh38 to DECIPHER’s v37 and/or seen too frequently in previously analyzed batches, etc.). It should be provided by NHSS before the variants in a new batch are to be prioritized (contact Morad Ansari morad.ansari@nhs.net).


Open the Excel file provided by NHSS and store the information in a tab-separated file named `blacklist.<date_received>.txt` with the format chr pos ref alt, adding the ‘chr’ prefix if necessary. Create a copy of the file named `current_blacklist.txt` which is looked for and used by `NHS_WES_filter_LQ_GT.py`.

```
U> cd /home/u035/u035/shared/resources/blacklist
U> nano blacklist.2019-11-27.txt
U> cp blacklist.2019-11-27.txt current_blacklist.txt
```

### Transcript Replacement

Source:		NHSS

Date obtained:	25/09/2019

Location at EPCC:	/home/u035/u035/shared/resources/trans_map

File name:		current_trans_map.txt

Some of the VEP (v97) GRCh38 transcripts are not currently recognized by DECIPHER during bulk upload. NHSS will generate a transcript replacement file (usually included in the same excel file as the variant blacklist) for the preferred transcript ID (available in DECIPHER). It should be provided by NHSS before the variants in a new batch are to be prioritized (contact Morad Ansari morad.ansari@nhs.net).

Open the Excel file provided by NHSS and store the information in a tab-separated file named `trans_map.<date_received>.txt` with the format `Unrecognized_transcript Replacement_transcript`. Create a copy of the file named `current_trans_map.txt` which is looked for and used by `NHS_WES_filter_LQ_GT.py`.

```
U> cd /home/u035/u035/shared/resources/trans_map
U> nano trans_map.2019-11-27.txt
U> cp trans_map.2019-11-27.txt current_trans_map.txt
```

### VASE setup in STRICT mode

Location at EPCC:	/home/u035/u035/shared/software/bcbio/anaconda/bin/vase

Parameter Values for VASE STRICT

```
VASE=/home/u035/u035/shared/software/bcbio/anaconda/bin/vase

IN_FILE=${VCF_DIR}/${PLATE_ID}_${FAMILY_ID}.ready.vcf.gz
OUT_FILE=${VASE_DIR}/${PLATE_ID}_${FAMILY_ID}.strict.denovo.vcf
PED_FILE=${PED_DIR}/${BATCH_ID}_${PLATE_ID}_${FAMILY_ID}.ped

time ${VASE} \
    -i ${IN_FILE} \
    -o ${OUT_FILE} \
    --log_progress \
    --prog_interval 100000 \
    --freq 0.0001 \
    --gq 30 --dp 10 \
    --het_ab 0.3 \
    --max_alt_alleles 1 \
    --csq all \
    --biotypes all \
    --control_gq 15 --control_dp 5 \
    --control_het_ab 0.01 \
    --control_max_ref_ab 0.05 \
    --de_novo \
    --ped ${PED_FILE}
```