Standard operating procedure - Alignment, variant calling, and annotation of trio whole exome samples at the Edinburgh Parallel Computing Centre
This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE). It assumes that the data have been successfully transferred from EdGE to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh GitLab server at gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.
Definitions
In this document, N is the total number of samples in the project, and X is the number of families.
Text in angle brackets, e.g. <EdGE_project_id>, indicates a variable parameter. A variable parameter that is defined per family has X instances, each with its own unique value.
Software and data requirements
The analysis is run with the bcbio pipeline (version 1.1.15) located at /home/u035/project/software/bcbio. All genome reference and annotation data resources are contained within the genomes/Hsapiens/hg38 subfolder.
The TWIST target BED file is at: /home/u035/project/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
To generate the target BED file, first copy the file Twist_Exome_RefSeq_targets_hg38.bed from NHS Clinical Genetics Services to /home/u035/project/resources on ultra, then pad each interval by 15 bp on each side and merge any intervals that overlap as a result.
cd /home/u035/project/resources
../software/bcbio/bin/bedtools slop -g \
../software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa.fai \
-i Twist_Exome_RefSeq_targets_hg38.bed -b 15 | \
../software/bcbio/bin/bedtools merge > \
Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
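The padding step above can be sanity-checked by comparing the original and padded BED files. The following is a minimal sketch, not part of the versioned pipeline scripts: the function names bed_total_bases and bed_interval_count are hypothetical, and the sketch assumes tab-delimited BED files with the start in column 2 and the end in column 3.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for sanity-checking a BED file; not part of the SOP scripts.

# Sum the bases covered by a BED file as (end - start) over all intervals.
# Assumes tab-delimited input; overlapping intervals are counted more than once.
bed_total_bases() {
    awk -F'\t' '!/^(#|track|browser)/ && NF >= 3 { total += $3 - $2 }
                END { print total + 0 }' "$1"
}

# Count the intervals in a BED file, skipping comment, track and browser lines.
bed_interval_count() {
    awk -F'\t' '!/^(#|track|browser)/ && NF >= 3 { n++ }
                END { print n + 0 }' "$1"
}
```

For example, running bed_interval_count on Twist_Exome_RefSeq_targets_hg38.bed and on Twist_Exome_RefSeq_targets_hg38.plus15bp.bed should show no more intervals in the padded file than in the original, because merging can only reduce the count. If the original intervals do not overlap, bed_total_bases should report at least as many bases for the padded file.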
Input
A set of 2N FASTQ files, one paired-end pair (R1 and R2) per sample. A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in Figure 1.
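A basic structural check of the PED/FAM file can catch formatting problems before the pipeline is run. The following is a minimal sketch under stated assumptions: the function name validate_ped is hypothetical, the file is taken to be strictly tab-delimited as described above, and the sex column (column 5) is checked against the codes defined in the PLINK .fam format (1 = male, 2 = female, 0 = unknown). It does not check family relationships or phenotype values.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the SOP scripts.

# Check that every non-empty line of a PED/FAM file has exactly 6
# tab-delimited columns and a sex code of 0, 1 or 2 in column 5.
# Problems are reported on stderr; the return status is non-zero if any are found.
validate_ped() {
    awk -F'\t' '
        NF == 0 { next }   # skip empty lines
        NF != 6 {
            printf("line %d: expected 6 columns, found %d\n", NR, NF) > "/dev/stderr"
            bad = 1
            next
        }
        $5 !~ /^[012]$/ {
            printf("line %d: unexpected sex code \"%s\"\n", NR, $5) > "/dev/stderr"
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A call such as validate_ped <ped_file> returns silently with status 0 when the file passes both checks.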
<EdGE_project_id>/
+---md5_check.txt
+---raw_data/
| +---<dated_batch>/
| | +---<EdGE_sample_id>/
| | | +---<fastq_id>_R1.fastq.count
| | | +---<fastq_id>_R1.fastq.gz
| | | +---<fastq_id>_R2.fastq.count
| | | +---<fastq_id>_R2.fastq.gz
| | +---file_list.tsv
| | +---md5sums.txt
| +---<dated_batch>_tree.txt
| +---Information.txt
Figure 1. File name and directory structure for a batch of sequencing. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
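The layout in Figure 1 implies that every sample folder holds matching R1 and R2 FASTQ files. The following is a minimal sketch of a check for that, assuming the file naming shown in Figure 1; the function name check_fastq_pairs is hypothetical. It looks only for R1 files that lack an R2 partner and for sample folders with no R1 file at all, and it does not verify checksums or read counts.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of the SOP scripts.

# Report sample folders under a dated batch directory that lack a matching
# R1/R2 FASTQ pair. Prints one problem per line and returns non-zero if any
# problem is found.
check_fastq_pairs() {
    local batch_dir=$1 status=0 sample_dir r1 r2 found
    for sample_dir in "$batch_dir"/*/; do
        [ -d "$sample_dir" ] || continue          # no sample folders at all
        found=0
        for r1 in "$sample_dir"*_R1.fastq.gz; do
            [ -e "$r1" ] || continue              # glob did not match
            found=1
            r2="${r1%_R1.fastq.gz}_R2.fastq.gz"   # expected partner file
            if [ ! -e "$r2" ]; then
                echo "missing R2 for $r1"
                status=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "no R1 FASTQ in $sample_dir"
            status=1
        fi
    done
    return "$status"
}
```

The function would be called with the path to a dated batch folder, e.g. check_fastq_pairs <EdGE_project_id>/raw_data/<dated_batch>.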
Sample id format
The sequencing reads for the samples delivered from EdGE are identified by the sample folder name and by the 8th column of the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:
<pcr_plate_id>_<indiv_id>_<family_id><suffix>
The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.
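The identifier format above can be split into its components as follows. This is a minimal sketch, not the method used by the pipeline scripts: the function names parse_sample_id and list_sample_ids are hypothetical, and the parsing assumes that the plate, individual and family ids contain no underscores and that the suffix, when present, begins with an underscore (as in "_IDT-A"). If file_list.tsv has a header row, list_sample_ids will include its 8th field in the output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of the SOP scripts.

# Split <pcr_plate_id>_<indiv_id>_<family_id><suffix> into four tab-separated
# fields: plate id, individual id, family id, suffix.
# Assumes the first three ids contain no underscores.
parse_sample_id() {
    local id=$1 plate indiv family suffix rest
    plate=${id%%_*}           # text before the first underscore
    rest=${id#*_}             # drop the plate id
    indiv=${rest%%_*}
    rest=${rest#*_}           # drop the individual id
    family=${rest%%_*}
    suffix=${rest#"$family"}  # whatever follows the family id, e.g. _IDT-A
    printf '%s\t%s\t%s\t%s\n' "$plate" "$indiv" "$family" "$suffix"
}

# Print the unique sample ids from the 8th column of file_list.tsv.
list_sample_ids() {
    cut -f8 "$1" | sort -u
}
```

The two can be combined, e.g. list_sample_ids file_list.tsv | while read -r id; do parse_sample_id "$id"; done, to produce one row of components per sample.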
Working directories
The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in Figure 2.