Trio-Whole-Exome Pipeline
This is an automated version of the scripts currently run manually according to SOP as part of the whole exome trios project with David Fitzpatrick's group. This pipeline is controlled by NextFlow.
Setup
This pipeline requires:
- NextFlow
- An install of BCBio v1.2.8
A Conda environment containing NextFlow is available in environment.yml. This can be created with the command:
$ conda env create -n <environment_name> -f environment.yml
Running the pipeline
The pipeline requires two main input files, a samplesheet and a Ped file, plus the configuration described here.
Configuration
This pipeline uses a config at trio-whole-exome/nextflow.config, containing profiles for different sizes of process. NextFlow picks this up automatically.
A second config is necessary for providing executor and param information. This can be supplied via the -c argument.
Parameters:
- bcbio - path to a BCBio install, containing 'anaconda', 'galaxy', 'genomes', etc.
- bcbio_template - path to a template config for BCBio variant calling. Should set upload.dir: ./results so that BCBio will output results to the working dir.
- output_dir - where the results get written to on the system. The variant calling creates initial results here, and variant prioritisation adds to them.
- target_bed - BED file of Twist exome targets
- reference_genome - hg38 reference genome in FASTA format
- parse_peddy_output - path to the parse_peddy_output Perl script. Todo: remove once scripts are in bin/
Samplesheet
This is a tab-separated file mapping individuals to fastq pairs. The columns are individual_id, read_1 and read_2. If a sample has been sequenced over multiple lanes, then include a line for each fastq pair:
individual_id read_1 read_2
000001 path/to/lane_1_r1.fastq.gz path/to/lane_1_r2.fastq.gz
000001 path/to/lane_2_r1.fastq.gz path/to/lane_2_r2.fastq.gz
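The grouping of multi-lane fastq pairs under one individual can be illustrated with a short Python sketch. This is not the pipeline's own parsing code; the function name group_fastqs is hypothetical, and the sketch assumes the sample sheet starts with the header line shown above.

```python
import csv
import io
from collections import defaultdict


def group_fastqs(handle):
    """Group (read_1, read_2) pairs by individual_id.

    Hypothetical helper for illustration only; assumes a tab-separated
    sample sheet with the header individual_id, read_1, read_2.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    expected = ["individual_id", "read_1", "read_2"]
    if reader.fieldnames != expected:
        raise ValueError(f"expected columns {expected}, got {reader.fieldnames}")
    pairs = defaultdict(list)
    for row in reader:
        # One line per fastq pair, so a multi-lane sample appears several times.
        pairs[row["individual_id"]].append((row["read_1"], row["read_2"]))
    return dict(pairs)


# The example sample sheet from above, with explicit tab separators.
example = (
    "individual_id\tread_1\tread_2\n"
    "000001\tpath/to/lane_1_r1.fastq.gz\tpath/to/lane_1_r2.fastq.gz\n"
    "000001\tpath/to/lane_2_r1.fastq.gz\tpath/to/lane_2_r2.fastq.gz\n"
)
grouped = group_fastqs(io.StringIO(example))
for individual, fastq_pairs in grouped.items():
    print(individual, len(fastq_pairs), "fastq pair(s)")
```

Running a check like this before launching the pipeline can catch a wrong header or a space-separated file early.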
Ped file
Tab-separated Ped file mapping individuals to each other, to family IDs and to affected status. Per the specification, the columns are family ID, individual ID, father ID, mother ID, sex (1=male, 2=female, other=unknown), affected status (-9 or 0=missing, 1=unaffected, 2=affected):
000001 000001 000002 000003 2 2
000001 000002 0 0 1 1
000001 000003 0 0 2 1
The pipeline also supports non-trios, e.g. singletons, duos and quads.
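As a rough illustration of how the six Ped columns relate individuals to each other, the following Python sketch parses the example above and lists the complete trios it contains. The names PedRecord, parse_ped and complete_trios are hypothetical and are not part of the pipeline. The sketch also assumes that any affected status other than 1 or 2 can be treated as missing.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    affected: str


SEX = {"1": "male", "2": "female"}  # any other code is unknown
AFFECTED = {"1": "unaffected", "2": "affected"}  # -9 or 0 is missing


def parse_ped(text):
    """Parse tab-separated Ped text into records (illustrative helper)."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 columns, got {len(fields)}")
        records.append(PedRecord(*fields))
    return records


def complete_trios(records):
    """Return (child, father, mother) IDs where both parents are in the file."""
    present = {(r.family_id, r.individual_id) for r in records}
    trios = []
    for r in records:
        has_father = (r.family_id, r.father_id) in present
        has_mother = (r.family_id, r.mother_id) in present
        if has_father and has_mother:
            trios.append((r.individual_id, r.father_id, r.mother_id))
    return trios


# The example Ped file from above, with explicit tab separators.
ped_text = (
    "000001\t000001\t000002\t000003\t2\t2\n"
    "000001\t000002\t0\t0\t1\t1\n"
    "000001\t000003\t0\t0\t2\t1\n"
)
records = parse_ped(ped_text)
trios = complete_trios(records)
for r in records:
    # Other affected codes are treated as missing here (an assumption).
    print(r.individual_id, SEX.get(r.sex, "unknown"), AFFECTED.get(r.affected, "missing"))
print(trios)
```

Founders carry 0 in both parent columns, so they never match a record in the file and only the child is reported as the head of a trio. A singleton or duo would simply yield no complete trio.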
Usage
The pipeline can now be run. First, run the initial variant calling:
$ nextflow path/to/trio-whole-exome/main.nf \
-c path/to/nextflow.config \
--workflow 'variant-calling' \
--pipeline_project_id projname --pipeline_project_version v1 \
--ped_file path/to/batch.ped \
--sample_sheet path/to/samplesheet.tsv
Todo: variant prioritisation workflow
Tests
This pipeline has automated tests contained in the folder tests/. To run the tests locally, cd to this folder with your Conda environment active and run the test scripts:
- run_tests.sh
- run_giab_tests.sh
These tests use the environment variable NEXTFLOW_CONFIG, pointing to a platform-specific config file.
Trio whole exome service scripts and documentation
Resources and set up
SOPs
Current script list by category
Resource generation & acquisition
bcbio_gnomad_install.sh
Sample acquisition
submit_trio_wes_aspera_download.sh
submit_trio_wes_lftp_download.sh
Alignment & variant calling
Preparation & config file generation
trio_wes_prepare_bcbio_config_crf.sh
trio_wes_prepare_bcbio_config.sh
trio_whole_exome_create_parameter_files.pl
Alignment & variant calling
submit_trio_wes_bcbio.sh
Quality control
trio_whole_exome_parse_peddy_ped_csv.pl
Prioritization
compare_indi_vars_by_version.py
convert_DEC_to_v10.py
decipher_NHS_WES_trio.sh
downstream_setup.sh
extract_BED_CCDS_DDG2P.py
extract_trio_FAM_PRO_ID.py
filter_LQ_GT.py
full_process_NHS_WES_trio.sh
gather_NHS_WES_aff_probands_results.sh
gather_NHS_WES_quad_results.sh
gather_NHS_WES_trio_results.sh
generate_coverage_result_file.py
generate_DEC_IGV.py
generate_G2P_out_VCF.py
NHS_WES_check_PED_aff_probands.py
NHS_WES_check_PED_quad.py
NHS_WES_extract_shared_vars.py
NHS_WES_extract_trio_FAM_PRO_ID.py
NHS_WES_filter_LQ_GT.py
NHS_WES_generate_aff_sib_ped.py
NHS_WES_generate_coverage_result_file.py
NHS_WES_generate_DEC_IGV_aff_probands.py
NHS_WES_generate_DEC_IGV.py
NHS_WES_generate_DEC_IGV.py.v1
NHS_WES_generate_DEC_IGV.py_wrong_gene_trans
NHS_WES_generate_DEC_IGV_sib_from_quad.py
NHS_WES_generate_DEC_IGV_trio_from_quad.py
NHS_WES_generate_trio_ped.py
NHS_WES_generate_trio_VCF.py
NHS_WES_trio_cram_setup.sh
NHS_WES_trio_delete_BAM.sh
NHS_WES_trio_setup.sh
old_downstream_setup.sh
old_submit_downstream.sh
old_submit_trio_wes_aspera_download.sh
processing_setup.sh
process_NHS_WES_aff_probands.sh
process_NHS_WES_quad_full.sh
process_NHS_WES_quad.sh
process_NHS_WES_trio_before_BAMOUT.sh
process_NHS_WES_trio.sh
run_processing.sh
submit_depth_of_coverage_MQ20_BQ20.sh
submit_downstream.sh
test_process_NHS_WES_trio.sh
test_run_processing.sh
Archiving & cleanup
submit_trio_wes_cram_compression.sh
submit_trio_wes_family_checksums.sh
submit_trio_wes_project_checksums.sh
Configuration files
trio_whole_exome_bcbio_template.yaml
trio_whole_exome_config.sh
vcf_config.json.backup
Terminology
- 'Batch'
- Slightly ambiguous term - can be a pipeline batch, a sequencing batch or a BCBio batch. To avoid this ambiguity, a single run of this pipeline is known as a pipeline project.
- 'Pipeline project'
- A single run of this pipeline, potentially mixing samples and families from multiple sequencing batches. There's always one Ped file and sample sheet per pipeline project.
- 'Sequencing batch'
- A group of samples that were prepared and sequenced together.
- 'BCBio batch'
- Used internally by BCBio to identify a family.
- 'Sample ID'
- Specific to a sequencing batch, family ID, individual ID and extraction kit type.