Newer
Older
# Standard operating procedure - Alignment, variant calling, and annotation of trio whole exome samples at the Edinburgh Parallel Computing Centre
This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE) or the Edinburgh Clinical Research Facility (ECRF). It assumes that data has been successfully transferred to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.
## Definitions
In this document, N is the total number of samples in the project, and X is the number of families.
Text in angle brackets, e.g. <project> indicates variable parameters. A variable parameter such as <family1-X> indicates that there are X instances of the parameter, each with their own unique value.
## Software and data requirements
The analysis is run with the bcbio pipeline (version 1.2.3) located at /home/u035/project/software/bcbio. All genome reference and annotation data resources are contained within the genomes/Hsapiens/hg38 subfolder.
The TWIST target BED file is at: /home/u035/project/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
To generate the target BED file, first copy the file Twist_Exome_RefSeq_targets_hg38.bed from NHS Clinical Genetics Services to /home/u035/project/resources on ultra, then pad it by 15bp each side.
```
cd /home/u035/project/resources
bedtools slop -g $REFERENCE_GENOME.fai -i Twist_Exome_RefSeq_targets_hg38.bed -b 15 | \
bedtools merge > Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
ameyner2
committed
## Input
A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.
### Sample id format
The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:
```
<pcr_plate_id>_<indiv_id>_<family_id><suffix>
```
The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.
### Reads - Edinburgh Genomics
A set of paired end FASTQ files (designated by R1 or R2 suffixes), possibly more than one pair per sample. Each sample's files are in its own folder. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.
ameyner2
committed
```
<EdGE_project_id>/
+---md5_check.txt
+---raw_data/
| +---<dated_batch>/
| | +---<EdGE_sample_id>/
| | | +---<fastq_id>_R1.fastq.count
| | | +---<fastq_id>_R1.fastq.gz
| | | +---<fastq_id>_R2.fastq.count
| | | +---<fastq_id>_R2.fastq.gz
| | +---file_list.tsv
| | +---md5sums.txt
| +---<dated_batch>_tree.txt
| +---Information.txt
```
*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – in general we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
ameyner2
committed
### Reads - Edinburgh Clinical Research Facility
ameyner2
committed
A set of paired end FASTQ files (designated by R1 or R2 suffixes), generally one pair per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 2*.
ameyner2
committed
```
<EdGE_project_id>/
+---<internal_id_-md5.txt
+---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R1_001.fastq.gz
+---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R2_001.fastq.gz
+...
ameyner2
committed
```
*Figure 2.* File name and directory structure for a batch of sequencing from the ECRF.
ameyner2
committed
## Working directories
The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in *Figure 3*.
ameyner2
committed
```
config – bcbio configuration files in YAML format
logs – PBS job submission log files
output – output to be passed to variant prioritization and archiving
params – parameters for PBS job submission
reads – symlinks to input FASTQ files
work – bcbio working folder
```
ameyner2
committed
## Project configuration
A configuration script sets environment variables common to scripts used in this SOP. This is stored at /home/u035/project/scripts/trio_whole_exome_config.sh.
```
#!/usr/bin/bash
#
# Basic configuration options for trio WES pipeline
#
SCRIPTS=/home/u035/project/scripts
BCBIO_TEMPLATE=$SCRIPTS/trio_whole_exome_bcbio_template.yaml
TARGET=/home/u035/project/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
DOWNLOAD_DIR=/scratch/u035/project/trio_whole_exome/data
REFERENCE_GENOME=/home/u035/project/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
BASE=/scratch/u035/project/trio_whole_exome/analysis
PARAMS_DIR=$BASE/params
READS_DIR=$BASE/reads
CONFIG_DIR=$BASE/config
WORK_DIR=$BASE/work
OUTPUT_DIR=$BASE/output
ARCHIVE_DIR=/archive/u035/trio_whole_exome
ameyner2
committed
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
````
## Template for bcbio configuration
Bcbio requires a template file in YAML format to define the procedures run in the pipeline. The template for this project is stored at /home/u035/project/scripts/trio_whole_exome_bcbio_template.yaml.
```
details:
- algorithm:
platform: illumina
quality_format: standard
aligner: bwa
mark_duplicates: true
realign: false
recalibrate: true
effects: vep
effects_transcripts: all
variantcaller: gatk-haplotype
indelcaller: false
remove_lcr: true
tools_on:
- vep_splicesite_annotations
analysis: variant2
genome_build: hg38
upload:
dir: /scratch/u035/project/trio_whole_exome/analysis/output
```
## Output
Per sample: BAM file of aligned reads against the hg38 genome assembly
Per family: Annotated VCF file and QC report
Output will be in the folder /scratch/u035/project/trio_whole_exome/analysis/output and follow the structure in *Figure 4* (with multiple instances of the indiv_id sub directories, one per sequenced family member.). The qc sub-directories are not enumerated, and automatically generated index files are not listed for brevity. An additional directory at the root of the output folder called “qc” will contain the MultiQC reports generated for an entire batch.
ameyner2
committed
```
<analysis_date>_<EdGE_project_id>_<pcr_plate_id>_<family_id>/
+---<indiv_id>_<family_id>/
| +---<indiv_id>_<family_id>-callable.bed
| +---<indiv_id>_<family_id>-ready.bam
| +---qc/
+---<pcr_plate>_<family_id>-gatk-haplotype-annotated.vcf.gz
+---bcbio-nextgen-commands.log
+---bcbio-nextgen.log
+---data_versions.csv
+---metadata.csv
+---multiqc/
| +---list_files_final.txt
| +---multiqc_config.yaml
| +---multiqc_data/
| +---multiqc_report.html
| +---report/
+---programs.txt
+---project-summary.yaml
```
*Figure 4.* File name and output directory structure for each family in a batch of sequencing.
ameyner2
committed
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
## Procedure
1. Set environment variable project_id and general configuration variables.
```
project_id=<EdGE_project_id>
source /home/u035/project/scripts/trio_whole_exome_config.sh
```
2. Copy the PED file for the batch to the params folder in the working area. It should be named <EdGE_project_id>.ped, relating it to the input directory for the FASTQ files. If the PED file given was not named in this way, don’t rename it, create a symlink with the correct name.
```
cd $PARAMS_DIR
ped_file=<input_ped_file>
ln -s $ped_file $project_id.ped
```
3. In the params folder, create the symlinks to the reads and the bcbio configuration files. If specifying a common sample suffix, ensure it includes any joining characters, e.g. “-“ or “_”, so that the family identifier can be cleanly separated from the suffix. Get the number of families from the batch.
```
cd $PARAMS_DIR
sample_suffix=<sample_suffix>
/home/u035/project/scripts/prepare_bcbio_config.sh \
/home/u035/project/scripts/trio_whole_exome_config.sh \
$project_id $sample_suffix &> $project_id.log
X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
```
4. Submit the bcbio jobs from the logs folder.
```
cd /home/u035/project/trio_whole_exome/analysis/logs
ameyner2
committed
qsub -v PROJECT_ID=$project_id,\
CONFIG_SH=/home/u035/project/scripts/trio_whole_exome_config.sh \
-J 1-$X -N trio_whole_exome_bcbio.$project_id \
/home/u035/project/scripts/submit_bcbio_trio_wes.sh
```
If all log files end in ‘Finished’ or ‘Storing in local filesystem’ for a metadata file (occasionally the job completes without quite outputting all of the ‘Storing’ messages), the batch is complete. If this is not the case, resubmit the incomplete jobs – they will resume where they left off.
5. Generate a MultiQC report for all files in the batch.
```
source /home/u035/project/scripts/trio_whole_exome_config.sh
cd /scratch/u035/project/trio_whole_exome/analysis/output
/home/u035/project/software/bcbio/anaconda/bin/multiqc --title "Trio whole exome QC report: $project_id" \
ameyner2
committed
--outdir qc \
--filename ${project_id}_qc_report.html \
*$project_id*
```
6. Check the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the <EdGE_project_id>.ped_check.txt file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The batch id is the 5 digit number that prefixes all the family ids in the output.
```
cd /scratch/u035/project/trio_whole_exome/analysis/output
perl /home/u035/project/scripts/trio_whole_exome_parse_peddy_ped_csv.pl \
--output /scratch/u035/project/trio_whole_exome/analysis/output \
--project $project_id \
--batch $batch_id \
--ped /scratch/u035/project/trio_whole_exome/analysis/params/$project_id.ped
grep -v False$ qc/$project_id.ped_check.txt
ameyner2
committed
```
7. Clear the work directory and move the log files to the complete sub-directory.
```
cd /scratch/u035/project/trio_whole_exome/work
rm -r *
cd /home/u035/project/trio_whole_exome/logs
mv trio_whole_exome_bcbio.$project_id* complete/
```
8. Copy the MultiQC report to the IGMM-VariantAnalysis area on the IGMM datastore.
```
ssh eddie3.ecdf.ed.ac.uk
qlogin -q staging
cd /exports/igmm/datastore/IGMM-VariantAnalysis/documentation/trio_whole_exome/qc
user=<ultra_user_id>
project_id=<EdGE_project_id>
scp $user@ultra.epcc.ed.ac.uk:/scratch/u035/project/trio_whole_exome/analysis/output/qc/${project_id}_qc_report.html ./
```