diff --git a/docs/SOP_alignment_variant_annotation.md b/docs/SOP_alignment_variant_annotation.md index 350993af93e364285d3c9d3f8978d170ca1b8da5..e625d101c84041786197a1b40503061015f8019c 100644 --- a/docs/SOP_alignment_variant_annotation.md +++ b/docs/SOP_alignment_variant_annotation.md @@ -25,3 +25,216 @@ cd /home/u035/project/resources ../software/bcbio/tools/bin/bedtools merge > \ Twist_Exome_RefSeq_targets_hg38.plus15bp.bed ``` + +## Input + +A set of Nx2 FASTQ files, one per sample. A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*. + +``` +<EdGE_project_id>/ + +---md5_check.txt + +---raw_data/ + | +---<dated_batch>/ + | | +---<EdGE_sample_id>/ + | | | +---<fastq_id>_R1.fastq.count + | | | +---<fastq_id>_R1.fastq.gz + | | | +---<fastq_id>_R2.fastq.count + | | | +---<fastq_id>_R2.fastq.gz + | | +---file_list.tsv + | | +---md5sums.txt + | +---<dated_batch>_tree.txt + | +---Information.txt +``` +*Figure 1.* File name and directory structure for a batch of sequencing. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself. + +## Sample id format + +The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format: + +``` +<pcr_plate_id>_<indiv_id>_<family_id><suffix> +``` + +The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure. + +## Working directories + +The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in *Figure 2*. + +``` + config – bcbio configuration files in YAML format + logs – PBS job submission log files + output – output to be passed to variant prioritization and archiving + params – parameters for PBS job submission + reads – symlinks to input FASTQ files + work – bcbio working folder +``` +*Figure 2.* Project working directories. + +## Project configuration + +A configuration script sets environment variables common to scripts used in this SOP. This is stored at /home/u035/project/scripts/trio_whole_exome_config.sh. + +``` +#!/usr/bin/bash +# +# Basic configuration options for trio WES pipeline +# + +SCRIPTS=/home/u035/project/scripts +BCBIO_TEMPLATE=$SCRIPTS/trio_whole_exome_bcbio_template.yaml +TARGET=/home/u035/project/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed +DOWNLOAD_DIR=/scratch/u035/project/trio_whole_exome/data +REFERENCE_GENOME=/home/u035/project/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa + +BASE=/scratch/u035/project/trio_whole_exome/analysis +PARAMS_DIR=$BASE/params +READS_DIR=$BASE/reads +CONFIG_DIR=$BASE/config +WORK_DIR=$BASE/work +OUTPUT_DIR=$BASE/output + +ARCHIVE_DIR=/archive/u035/trio_whole_exome + +export PATH=/home/u035/project/software/bcbio/tools/bin:$PATH +```` + +## Template for bcbio configuration + +Bcbio requires a template file in YAML format to define the procedures run in the pipeline. The template for this project is stored at /home/u035/project/scripts/trio_whole_exome_bcbio_template.yaml. + +``` +details: +- algorithm: + platform: illumina + quality_format: standard + aligner: bwa + mark_duplicates: true + realign: false + recalibrate: true + effects: vep + effects_transcripts: all + variantcaller: gatk-haplotype + indelcaller: false + remove_lcr: true + tools_on: + - vep_splicesite_annotations + analysis: variant2 + genome_build: hg38 +upload: + dir: /scratch/u035/project/trio_whole_exome/analysis/output +``` + +## Output + +Per sample: BAM file of aligned reads against the hg38 genome assembly +Per family: Annotated VCF file and QC report + +Output will be in the folder /scratch/u035/project/trio_whole_exome/analysis/output and follow the structure in *Figure 3* (with multiple instances of the indiv_id sub directories, one per sequenced family member.). The qc sub-directories are not enumerated, and automatically generated index files are not listed for brevity. An additional directory at the root of the output folder called “qc†will contain the MultiQC reports generated for an entire batch. + +``` +<analysis_date>_<EdGE_project_id>_<pcr_plate_id>_<family_id>/ + +---<indiv_id>_<family_id>/ + | +---<indiv_id>_<family_id>-callable.bed + | +---<indiv_id>_<family_id>-ready.bam + | +---qc/ + +---<pcr_plate>_<family_id>-gatk-haplotype-annotated.vcf.gz + +---bcbio-nextgen-commands.log + +---bcbio-nextgen.log + +---data_versions.csv + +---metadata.csv + +---multiqc/ + | +---list_files_final.txt + | +---multiqc_config.yaml + | +---multiqc_data/ + | +---multiqc_report.html + | +---report/ + +---programs.txt + +---project-summary.yaml +``` +*Figure 3.* File name and output directory structure for each family in a batch of sequencing. + +## Procedure + +1. Set environment variable project_id and general configuration variables. + +``` +project_id=<EdGE_project_id> +source /home/u035/project/scripts/trio_whole_exome_config.sh +``` + +2. Copy the PED file for the batch to the params folder in the working area. It should be named <EdGE_project_id>.ped, relating it to the input directory for the FASTQ files. If the PED file given was not named in this way, don’t rename it, create a symlink with the correct name. + +``` +cd $PARAMS_DIR +ped_file=<input_ped_file> +ln -s $ped_file $project_id.ped +``` + +3. In the params folder, create the symlinks to the reads and the bcbio configuration files. If specifying a common sample suffix, ensure it includes any joining characters, e.g. “-“ or “_â€, so that the family identifier can be cleanly separated from the suffix. Get the number of families from the batch. + +``` +cd $PARAMS_DIR +sample_suffix=<sample_suffix> +/home/u035/project/scripts/prepare_bcbio_config.sh \ + /home/u035/project/scripts/trio_whole_exome_config.sh \ + $project_id $sample_suffix &> $project_id.log +X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'` +``` + +4. Submit the bcbio jobs from the logs folder. + +``` +cd /home/u035/project/trio_whole_exome/logs +qsub -v PROJECT_ID=$project_id,\ + CONFIG_SH=/home/u035/project/scripts/trio_whole_exome_config.sh \ + -J 1-$X -N trio_whole_exome_bcbio.$project_id \ + /home/u035/project/scripts/submit_bcbio_trio_wes.sh +``` + +If all log files end in ‘Finished’ or ‘Storing in local filesystem’ for a metadata file (occasionally the job completes without quite outputting all of the ‘Storing’ messages), the batch is complete. If this is not the case, resubmit the incomplete jobs – they will resume where they left off. + +5. Generate a MultiQC report for all files in the batch. + +``` +source /home/u035/project/scripts/trio_whole_exome_config.sh +cd /scratch/u035/project/trio_whole_exome/output +multiqc --title "Trio whole exome QC report: $project_id" \ + --outdir qc \ + --filename ${project_id}_qc_report.html \ + *$project_id* +``` + +6. Check the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the <EdGE_project_id>.ped_check.txt file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The batch id is the 5 digit number that prefixes all the family ids in the output. + +``` +cd /scratch/u035/project/trio_whole_exome/analysis/output +perl /home/u035/project/scripts/trio_whole_exome_parse_peddy_ped_csv.pl \ + --output /scratch/u035/project/trio_whole_exome/analysis/output \ + --project $project_id \ + --batch $batch_id \ + --ped /scratch/u035/project/trio_whole_exome/analysis/params/$project_id.ped +grep -v False$ output/qc/$project_id.ped_check.txt +``` + +7. Clear the work directory and move the log files to the complete sub-directory. + +``` +cd /scratch/u035/project/trio_whole_exome/work +rm -r * +cd /home/u035/project/trio_whole_exome/logs +mv trio_whole_exome_bcbio.$project_id* complete/ +``` + +8. Copy the MultiQC report to the IGMM-VariantAnalysis area on the IGMM datastore. + +``` +ssh eddie3.ecdf.ed.ac.uk +qlogin -q staging +cd /exports/igmm/datastore/IGMM-VariantAnalysis/documentation/trio_whole_exome/qc + +user=<ultra_user_id> +project_id=<EdGE_project_id> + +scp $user@ultra.epcc.ed.ac.uk:/scratch/u035/project/trio_whole_exome/analysis/output/qc/${project_id}_qc_report.html ./ +```