Trio-Whole-Exome Pipeline

This is an automated version of the scripts currently run manually according to SOP as part of the whole exome trios project with David Fitzpatrick's group. This pipeline is controlled by NextFlow.

Setup

This pipeline requires:

  • NextFlow
  • An install of BCBio v1.2.8

A Conda environment containing NextFlow is available in environment.yml. This can be created with the command:

$ conda env create -n <environment_name> -f environment.yml

Running the pipeline

The pipeline requires a configuration and two main input files, a samplesheet and a Ped file. Each is described below.

Configuration

This pipeline uses a config at trio-whole-exome/nextflow.config, containing profiles for different sizes of process. NextFlow picks this up automatically.

A second config is necessary for providing executor and param information. This can be supplied via the -c argument. Parameters:

  • bcbio - path to a BCBio install, containing 'anaconda', 'galaxy', 'genomes', etc
  • bcbio_template - path to a template config for BCBio variant calling. Should set upload.dir: ./results so that BCBio will output results to the working dir.
  • output_dir - where the results get written to on the system. The variant calling creates initial results here, and variant prioritisation adds to them
  • target_bed - bed file of Twist exome targets
  • reference_genome - hg38 reference genome in fasta format
  • parse_peddy_output - path to the parse_peddy_output Perl script. Todo: remove once scripts are in bin/
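
As an illustration of the parameter list above, here is a minimal sketch that renders a second config from Python. It assumes the parameters live in a standard Nextflow `params { }` scope and that the executor is set through `process.executor`. Every path, the `local` executor value and the helper name `render_config` are hypothetical placeholders, not values taken from this pipeline.

```python
# Hypothetical placeholder values: replace every path with one valid on your system.
PARAMS = {
    "bcbio": "/path/to/bcbio",
    "bcbio_template": "/path/to/bcbio_template.yaml",
    "output_dir": "/path/to/outputs",
    "target_bed": "/path/to/twist_exome_targets.bed",
    "reference_genome": "/path/to/hg38.fa",
    "parse_peddy_output": "/path/to/parse_peddy_output.pl",
}


def render_config(params, executor="local"):
    """Return Nextflow config text with a process executor and a params scope."""
    lines = [f"process.executor = '{executor}'", "", "params {"]
    lines += [f"    {key} = '{value}'" for key, value in params.items()]
    lines.append("}")
    return "\n".join(lines) + "\n"


# Print the config; redirect the output to a file and pass that file via -c.
print(render_config(PARAMS))
```

The printed text can be saved to a file of your choosing and supplied with the -c argument. Platform-specific executor settings (queues, memory limits and so on) are not covered by this sketch.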

Samplesheet

This is a tab-separated file mapping individuals to fastq pairs. The columns are individual_id, read_1 and read_2. If a sample has been sequenced over multiple lanes, then include a line for each fastq pair:

individual_id   read_1                      read_2
000001          path/to/lane_1_r1.fastq.gz  path/to/lane_1_r2.fastq.gz
000001          path/to/lane_2_r1.fastq.gz  path/to/lane_2_r2.fastq.gz
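
Before launching a run, it can help to check that the samplesheet parses as expected. The sketch below is not part of the pipeline. It assumes the file has the header row shown above, and the function name `read_samplesheet` is hypothetical. It groups the fastq pairs by individual, so multi-lane samples end up with several pairs.

```python
import csv
import io
from collections import defaultdict

EXPECTED_COLUMNS = ["individual_id", "read_1", "read_2"]


def read_samplesheet(handle):
    """Group (read_1, read_2) fastq pairs by individual_id."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"expected header {EXPECTED_COLUMNS}, got {reader.fieldnames}")
    pairs = defaultdict(list)
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if not all(row[column] for column in EXPECTED_COLUMNS):
            raise ValueError(f"line {line_number}: empty or missing field")
        pairs[row["individual_id"]].append((row["read_1"], row["read_2"]))
    return dict(pairs)


example = io.StringIO(
    "individual_id\tread_1\tread_2\n"
    "000001\tpath/to/lane_1_r1.fastq.gz\tpath/to/lane_1_r2.fastq.gz\n"
    "000001\tpath/to/lane_2_r1.fastq.gz\tpath/to/lane_2_r2.fastq.gz\n"
)
sheet = read_samplesheet(example)
print(len(sheet["000001"]))  # prints 2: one fastq pair per lane
```

For a real file, pass an open file handle instead of the in-memory example, e.g. `read_samplesheet(open("path/to/samplesheet.tsv", newline=""))`.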

Ped file

This is a tab-separated Ped file mapping individuals to each other, to family IDs and to affected status. Per the specification, the columns are family ID, individual ID, father ID, mother ID, sex (1=male, 2=female, other=unknown) and affected status (-9 or 0=missing, 1=unaffected, 2=affected):

000001  000001  000002  000003  2  2
000001  000002  0       0       1  1
000001  000003  0       0       2  1

The pipeline also supports non-trios, e.g. singletons, duos and quads.
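
The column codes described above can be decoded with a few lines of Python. This is a minimal sketch rather than pipeline code: the names `parse_ped_line` and `read_ped` are hypothetical, and rejecting unrecognised affected-status codes is a choice made for this example, not documented pipeline behaviour.

```python
from collections import Counter

SEX = {"1": "male", "2": "female"}  # any other code is treated as unknown
AFFECTED = {"-9": "missing", "0": "missing", "1": "unaffected", "2": "affected"}
COLUMNS = ("family_id", "individual_id", "father_id", "mother_id", "sex", "affected")


def parse_ped_line(line):
    """Decode one tab-separated Ped line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} tab-separated columns, got {len(fields)}")
    record = dict(zip(COLUMNS, fields))
    if record["affected"] not in AFFECTED:
        # Strictness here is an assumption of this sketch.
        raise ValueError(f"unrecognised affected status {record['affected']!r}")
    record["sex"] = SEX.get(record["sex"], "unknown")
    record["affected"] = AFFECTED[record["affected"]]
    return record


def read_ped(lines):
    """Parse every non-blank line and count the rows in each family."""
    records = [parse_ped_line(line) for line in lines if line.strip()]
    sizes = Counter(record["family_id"] for record in records)
    return records, sizes


example = [
    "000001\t000001\t000002\t000003\t2\t2",
    "000001\t000002\t0\t0\t1\t1",
    "000001\t000003\t0\t0\t2\t1",
]
records, sizes = read_ped(example)
print(sizes["000001"])  # prints 3: the family in the example is a trio
print([r["individual_id"] for r in records if r["affected"] == "affected"])  # prints ['000001']
```

The per-family counts give a quick way to see whether a batch contains singletons, duos, trios or quads before the run starts.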

Usage

The pipeline can now be run. First, run the initial variant calling:

$ nextflow path/to/trio-whole-exome/main.nf \
    -c path/to/nextflow.config \
    --workflow 'variant-calling' \
    --pipeline_project_id projname --pipeline_project_version v1 \
    --ped_file path/to/batch.ped \
    --sample_sheet path/to/samplesheet.tsv

Todo: variant prioritisation workflow

Tests

This pipeline has automated tests contained in the folder tests/. To run the tests locally, cd to this folder with your Conda environment active and run the test scripts:

  • run_tests.sh
  • run_giab_tests.sh

These tests use the environment variable NEXTFLOW_CONFIG, pointing to a platform-specific config file.

Trio whole exome service scripts and documentation

Resources and set up

SOPs

Current script list by category

Resource generation & acquisition

bcbio_gnomad_install.sh

Sample acquisition

submit_trio_wes_aspera_download.sh
submit_trio_wes_lftp_download.sh

Alignment & variant calling

Preparation & config file generation

trio_wes_prepare_bcbio_config_crf.sh
trio_wes_prepare_bcbio_config.sh
trio_whole_exome_create_parameter_files.pl

Alignment & variant calling

submit_trio_wes_bcbio.sh

Quality control

trio_whole_exome_parse_peddy_ped_csv.pl

Prioritization

compare_indi_vars_by_version.py
convert_DEC_to_v10.py
decipher_NHS_WES_trio.sh
downstream_setup.sh
extract_BED_CCDS_DDG2P.py
extract_trio_FAM_PRO_ID.py
filter_LQ_GT.py
full_process_NHS_WES_trio.sh
gather_NHS_WES_aff_probands_results.sh
gather_NHS_WES_quad_results.sh
gather_NHS_WES_trio_results.sh
generate_coverage_result_file.py
generate_DEC_IGV.py
generate_G2P_out_VCF.py
NHS_WES_check_PED_aff_probands.py
NHS_WES_check_PED_quad.py
NHS_WES_extract_shared_vars.py
NHS_WES_extract_trio_FAM_PRO_ID.py
NHS_WES_filter_LQ_GT.py
NHS_WES_generate_aff_sib_ped.py
NHS_WES_generate_coverage_result_file.py
NHS_WES_generate_DEC_IGV_aff_probands.py
NHS_WES_generate_DEC_IGV.py
NHS_WES_generate_DEC_IGV.py.v1
NHS_WES_generate_DEC_IGV.py_wrong_gene_trans
NHS_WES_generate_DEC_IGV_sib_from_quad.py
NHS_WES_generate_DEC_IGV_trio_from_quad.py
NHS_WES_generate_trio_ped.py
NHS_WES_generate_trio_VCF.py
NHS_WES_trio_cram_setup.sh
NHS_WES_trio_delete_BAM.sh
NHS_WES_trio_setup.sh
old_downstream_setup.sh
old_submit_downstream.sh
old_submit_trio_wes_aspera_download.sh
processing_setup.sh
process_NHS_WES_aff_probands.sh
process_NHS_WES_quad_full.sh
process_NHS_WES_quad.sh
process_NHS_WES_trio_before_BAMOUT.sh
process_NHS_WES_trio.sh
run_processing.sh
submit_depth_of_coverage_MQ20_BQ20.sh
submit_downstream.sh
test_process_NHS_WES_trio.sh
test_run_processing.sh

Archiving & cleanup

submit_trio_wes_cram_compression.sh
submit_trio_wes_family_checksums.sh
submit_trio_wes_project_checksums.sh

Configuration files

trio_whole_exome_bcbio_template.yaml
trio_whole_exome_config.sh
vcf_config.json.backup

Terminology

  • 'Batch'
    • Slightly ambiguous term: it can mean a pipeline batch, a sequencing batch or a BCBio batch. To avoid this ambiguity, a single run of this pipeline is known as a project.
  • 'Pipeline project'
    • A single run of this pipeline, potentially mixing samples and families from multiple sequencing batches. There's always one Ped file and sample sheet per pipeline project.
  • 'Sequencing batch'
    • A group of samples that were prepared and sequenced together.
  • 'BCBio batch'
    • Used internally by BCBio to identify a family.
  • 'Sample ID'
    • Specific to a sequencing batch, family ID, individual ID and extraction kit type