- Standard operating procedure - Alignment, variant calling, annotation, compression, and storage of trio whole exome samples at the Edinburgh Parallel Computing Centre
- Definitions
- Software and data requirements
- Input
- PED file
- Sample id format
- Reads - Edinburgh Genomics
- Reads - Edinburgh Clinical Research Facility
- Working directories
- Project configuration
- Template for bcbio configuration
- Output
- Procedure
Standard operating procedure - Alignment, variant calling, annotation, compression, and storage of trio whole exome samples at the Edinburgh Parallel Computing Centre
This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE) or the Edinburgh Clinical Research Facility (ECRF). It assumes that data has been successfully transferred to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome
. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.
Definitions
In this document, N is the total number of samples in the project, and X is the number of families.
Text in angle brackets, e.g. <project>
indicates variable parameters. A variable parameter such as <family1-X>
indicates that there are X instances of the parameter, each with their own unique value.
Software and data requirements
The analysis is run with the bcbio pipeline (version 1.2.8) located at /home/u035/u035/shared/software/bcbio
. All genome reference and annotation data resources are contained within the genomes/Hsapiens/hg38
subfolder.
The TWIST target BED file is at: /home/u035/u035/shared/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
. See resources.
The tracking file is maintained on the IGC datastore at /exports/igmm/datastore/IGMM-VariantAnalysis/trio_whole_exome/Batch_status.xlsx.
Input
PED file
A 6-column tab-delimited PED/FAM format file is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.
Sample id format
The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv
inside the dated batch directory. The identifiers are in the format:
<pcr_plate_id>_<indiv_id>_<family_id><suffix>
The suffix identifies the exome kit, e.g. _WESTwist_IDT-A
. These identifiers are referenced below in the output file structure.
Reads - Edinburgh Genomics
A set of paired end FASTQ files (designated by R1 or R2 suffixes), possibly more than one pair per sample. Each sample's files are in its own folder. The input files will be in the folder /home/u035/u035/shared/data
and follow the structure in Figure 1. Older deliveries contained the <dated_batch>
folder within a raw_data
folder.
<EdGE_project_id>/
+---<dated_batch>/
| +---<sample_id>/
| | +---*.fastq.count
| | +---*.fastq.gz
| +---file_list.tsv
| +---md5sums.txt
+---<dated_batch>_tree.txt
+---Information.txt
+---md5_check.txt
Figure 1. File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname
, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd
– in general we expect there to be only one of these per EdGE project id. The FASTQ file names relate to the sequencing run information and do not contain any information about the sample itself.
Reads - Edinburgh Clinical Research Facility
A set of paired end FASTQ files (designated by R1 or R2 suffixes), generally one pair per sample. The input files will be in the folder /home/u035/u035/shared/data
and follow the structure in Figure 2.
<ECRF_project_id>/
+---<internal_id_-md5.txt
+---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R1_001.fastq.gz
+---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R2_001.fastq.gz
+...
Figure 2. File name and directory structure for a batch of sequencing from the ECRF.
Working directories
The project working directories will be in the folder /home/u035/u035/shared/analysis
and follow the structure in Figure 3.
config – bcbio configuration files in YAML format
logs – PBS job submission log files
params – parameters for PBS job submission
reads – symlinks/merged versions of input FASTQ files
work – bcbio working folder
Figure 3. Project working directories.
Project configuration
A configuration script sets environment variables common to scripts used in this SOP.
Template for bcbio configuration
Bcbio requires a template file in YAML format to define the procedures run in the pipeline.
Output
Per sample: BAM file of aligned reads against the hg38 genome assembly Per family: Annotated VCF file and QC report
Output will be in the folder /home/u035/u035/shared/results/<short_project_id>_<version>
where <short_project_id>
is the numeric prefix of <project_id>
and follow the structure in Figure 4 and Figure 5 (with multiple instances of the indiv_id sub directories, one per sequenced family member.). The qc sub-directories are not enumerated, and automatically generated index files are not listed for brevity.
<short_project_id>_<version>/
+---config
| +---<short_project_id>_<version>_<pcr_plate_id>_<family_id>.yaml
+---families
| +---<analysis_date>_<short_project_id>_<pcr_plate_id>_<family_id>
+---params
| +---<ped_file>
| +---<project_id>_<pcr_plate_id>_<family_id>.ped
| +---<project_id>.ped
| +---<project_id>_<version>_<date>.log
| +---<short_project_id>_<pcr_plate_id>_<family_id>.csv
+---prioritization
| +---<priority_dirs>
+---qc
| +---<short_project_id>_<version>.ped_check.txt
| +---<short_project_id>_<version>_qc_report_data
| +---<short_project_id>_<version>_qc_report.html
Figure 4. File name and output directory structure for a batch of sequencing.
<analysis_date>_<short_project_id>_<pcr_plate_id>_<family_id>/
+---<indiv_id>_<family_id>/
| +---<indiv_id>_<family_id>-callable.bed
| +---<indiv_id>_<family_id>-ready.bam
| +---qc/
+---<pcr_plate>_<family_id>-gatk-haplotype-annotated.vcf.gz
+---bcbio-nextgen-commands.log
+---bcbio-nextgen.log
+---data_versions.csv
+---metadata.csv
+---multiqc/
| +---list_files_final.txt
| +---multiqc_config.yaml
| +---multiqc_data/
| +---multiqc_report.html
| +---report/
+---programs.txt
+---project-summary.yaml
Figure 5. File name and output directory structure for each family in a batch of sequencing.
Procedure
- Set environment variable project_id and general configuration variables. All steps below can assume that these have been set. Version should be "v1" by default for
the first analysis run of a batch, "v2" etc for subsequent runs.
project_id=<project_id>
short_project_id=`echo $project_id | cut -f 1 -d '_'`
version=<version>.
source /home/u035/u035/shared/scripts/trio_whole_exome_config.sh
- Copy the PED file for the batch to the params folder in the working area. It should be named
<project_id>.ped
, relating it to the input directory for the FASTQ files. If the PED file given was not named in this way, don’t rename it, copy it instead.
cd $PARAMS_DIR
ped_file=<input_ped_file>
cp $ped_file $project_id.ped
- In the params folder, create the symlinks to the reads and the bcbio configuration files. If specifying a common sample suffix, ensure it includes any joining characters, e.g. “-“ or “_”, so that the family identifier can be cleanly separated from the suffix. Get the number of families from the batch.
Edinburgh Genomics data
cd $PARAMS_DIR
sample_suffix=<sample_suffix>
$SCRIPTS/trio_wes_prepare_bcbio_config.sh \
$SCRIPTS/trio_whole_exome_config.sh \
$project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
Edinburgh Clinical Research Facility data
cd $PARAMS_DIR
sample_suffix=<sample_suffix>
$SCRIPTS/scripts/trio_wes_prepare_bcbio_config_crf.sh \
$SCRIPTS/trio_whole_exome_config.sh \
$project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
- Submit the bcbio jobs from the logs folder. See above for version.
cd $LOGS_DIR
sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=$SCRIPTS/trio_whole_exome_config.sh \
--array=1-$X $SCRIPTS/submit_trio_wes_bcbio.sh
If all log files end in ‘Finished’ or ‘Storing in local filesystem’ for a metadata file (occasionally the job completes without quite outputting all of the ‘Storing’ messages), the batch is complete. If this is not the case, resubmit the incomplete jobs – they will resume where they left off.
- Check the output directory to make sure all family output folders were moved into the
families
subdirectory. This should happen automatically at the end of thesubmit_bcbio_trio_wes.sh
script but occasionally fails.
cd $OUTPUT_DIR/${short_project_id}_${version}
mv *${short_project_id}* families/
- Generate a MultiQC report for all files in the batch.
cd $OUTPUT_DIR/${short_project_id}_${version}/families
mkdir -p ../qc
multiqc --title "Trio whole exome QC report: $short_project_id $version" \
--outdir ../qc \
--filename ${short_project_id}_${version}_qc_report.html .
- Check the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the
<EdGE_project_id>.ped_check.txt
file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The<batch_id>
is the 5 digit number that prefixes all the family ids in the output. Move to SOP prioritization.
cd $OUTPUT_DIR/${short_project_id}_${version}/families
batch_id=<batch_id>
perl $SCRIPTS/trio_whole_exome_parse_peddy_ped_csv.pl \
--output ../qc/${short_project_id}_${version}.ped_check.txt \
--project $short_project_id \
--batch $batch_id \
--version $version \
--ped ../params/$project_id.ped \
--families .
grep -v False$ ../qc/${short_project_id}_${version}.ped_check.txt
- Compress BAM files to CRAM and compare the two files. The output log files should be empty and the files .cram, .cram.crai, and .cram.flagstat.txt should be present for each sample.
cd $LOGS_DIR
sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=$SCRIPTS/trio_whole_exome_config.sh \
--array=1-$X $SCRIPTS/submit_trio_wes_cram_compression.sh
- Calculate md5 checksums on the per-family files, excluding the BAM files. Creates the file
md5sum.txt
at the root of each family’s output directory. Check the files with the calculated md5sums. They should total 30 lines per sample plus 26 lines per family. The log files should be empty. When complete, move the family ids text file into the results folder for the project.
cd $LOGS_DIR
sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=$SCRIPTS/trio_whole_exome_config.sh \
--array=1-$X $SCRIPTS/submit_trio_wes_family_checksums.sh
cd $OUTPUT_DIR/${short_project_id}_${version}/families
wc -l */md5sum.txt
cd $PARAMS_DIR
mv $project_id.family_ids.txt $OUTPUT_DIR/${short_project_id}_${version}/params/
- Wait for prioritization to be completed. Calculate md5 checksums on the remaining project files, excluding the
families
sub-directory. Creates the filemd5sum.txt
at the root of the project output directory.
sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=$SCRIPTS/trio_whole_exome_config.sh \
$SCRIPTS/submit_trio_wes_project_checksums.sh
- Remove the BAM files from the results.
cd $OUTPUT_DIR/${short_project_id}_${version}
rm families/*/*/*.bam*
- Clean up. Clear the work and logs directories. Move the bcbio YAML configuration files into the results folder for the project. Retain reads for samples in families where one sample has failed QC, using a list
retain\_for\_rerun.txt
. These will likely be required for later runs, and it is simpler to regenerate config YAML files if it is not necessary to re-do symlinks/read merging.
cd $WORK_DIR
rm -r *
cd $LOGS_DIR
rm -r *
cd $PARAMS_DIR
rm -r *
mkdir -p $OUTPUT_DIR/${short_project_id}_${version}/config/
mv $CONFIG_DIR/${short_project_id}_${version}*.yaml $OUTPUT_DIR/${short_project_id}_${version}/config/
cd /home/u035/u035/shared/analysis/reads/${project_id}
rm `ls | grep -v -f retain_for_rerun.txt`
- Update the batch status spreadsheet.