Trio-Whole-Exome pipeline

This is an automated version of the scripts that are currently run manually, following the SOP, as part of the whole-exome trios project with David Fitzpatrick's group. The pipeline is controlled by Nextflow.

Setup

A Conda environment containing Nextflow is defined in environment.yml. Once you have Conda installed, you can create the environment by cd-ing into this project and running the command:

$ conda env create -n <environment_name>

Running

The pipeline requires two main input files:

Samplesheet

This is a tab-separated file mapping fastq pairs to metadata. The columns are individual ID, family ID, fastq sample ID, r1 fastq and r2 fastq. If a sample has been sequenced over multiple lanes, then include a line for each fastq pair:

individual_id  family_id  sample_id                           read_1                      read_2
000001         000001     12345_000001_000001_WESTwist_IDT-B  path/to/lane_1_r1.fastq.gz  path/to/lane_1_r2.fastq.gz
000001         000001     12345_000001_000001_WESTwist_IDT-B  path/to/lane_2_r1.fastq.gz  path/to/lane_2_r2.fastq.gz
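To illustrate the layout described above, here is a minimal Python sketch that reads a samplesheet and groups the fastq pairs of a multi-lane sample together. It is not part of the pipeline: the function name `read_samplesheet` is hypothetical, and the sketch assumes the file starts with the header line shown in the example.

```python
import csv
import io
from collections import defaultdict

# Column order as described above; the presence of a header line is an assumption.
COLUMNS = ["individual_id", "family_id", "sample_id", "read_1", "read_2"]


def read_samplesheet(handle):
    """Group fastq pairs by (family_id, individual_id, sample_id).

    Hypothetical helper for illustration only, not the pipeline's own parser.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    lanes = defaultdict(list)
    for row in reader:
        key = (row["family_id"], row["individual_id"], row["sample_id"])
        # One line per fastq pair, so a multi-lane sample collects several pairs.
        lanes[key].append((row["read_1"], row["read_2"]))
    return dict(lanes)


example = (
    "individual_id\tfamily_id\tsample_id\tread_1\tread_2\n"
    "000001\t000001\t12345_000001_000001_WESTwist_IDT-B\t"
    "path/to/lane_1_r1.fastq.gz\tpath/to/lane_1_r2.fastq.gz\n"
    "000001\t000001\t12345_000001_000001_WESTwist_IDT-B\t"
    "path/to/lane_2_r1.fastq.gz\tpath/to/lane_2_r2.fastq.gz\n"
)
lanes = read_samplesheet(io.StringIO(example))
for key, pairs in lanes.items():
    print(key, len(pairs), "fastq pair(s)")
```

Here the two lines for the same sample collapse into one entry holding two fastq pairs.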

Ped file

This is a tab-separated Ped file that records how individuals are related to each other and their affected status. Per the specification, the columns are family ID, individual ID, father ID, mother ID, sex (1=male, 2=female, other=unknown) and affected status (-9 or 0=missing, 1=unaffected, 2=affected):

000001  000001  000002  000003  2  2
000001  000002  0       0       1  1
000001  000003  0       0       2  1
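The column coding above can be sketched in Python as follows. This is an illustrative decoder only, not the pipeline's validation logic; `PedRecord` and `parse_ped_line` are hypothetical names. It splits on any whitespace so that it accepts both real tab-separated files and the space-aligned example shown here.

```python
from typing import NamedTuple

# Codes as listed above: any sex value other than 1 or 2 is treated as unknown.
SEX = {"1": "male", "2": "female"}
AFFECTED = {"-9": "missing", "0": "missing", "1": "unaffected", "2": "affected"}


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    affected: str


def parse_ped_line(line):
    """Decode one six-column Ped line (hypothetical helper, for illustration)."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    family_id, individual_id, father_id, mother_id, sex, affected = fields
    if affected not in AFFECTED:
        raise ValueError(f"unknown affected status {affected!r}")
    return PedRecord(
        family_id,
        individual_id,
        father_id,
        mother_id,
        SEX.get(sex, "unknown"),
        AFFECTED[affected],
    )


ped_text = """\
000001  000001  000002  000003  2  2
000001  000002  0       0       1  1
000001  000003  0       0       2  1
"""
records = [parse_ped_line(line) for line in ped_text.splitlines()]
for record in records:
    print(record)
```

In the example trio, the first record is the affected female proband, and the two parents have `0` in the father and mother columns because their own parents are not in the file.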

The pipeline can now be run. First, check for errors:

$ nextflow run pipeline/validation.nf --ped_file path/to/batch.ped --sample_sheet path/to/batch_samplesheet.tsv

Todo: run the main processing

Tests

This pipeline has automated tests contained in the folder tests/. To run the tests locally, cd to this folder with your Conda environment active and run ./run_tests.sh.

Terminology

  • Batch: slightly ambiguous term - could be a pipeline batch, a sequencing batch or a BCBio batch
    • Pipeline batch: a single run of this pipeline, potentially mixing samples and families from multiple sequencing batches
    • Sequencing batch: a group of samples that were prepared and sequenced together
    • BCBio batch: used internally by BCBio to identify a family
  • Sample ID: specific to a sequencing batch, family ID, individual ID and extraction kit type
  • file_list.tsv: there's one of these files per sequencing batch, summarising all fastqs in the batch. A pipeline batch may need to refer to multiple different individuals across different file lists.
  • Ped file: defines family relationships between individuals. There's always one Ped file per pipeline batch.
  • Sample sheet: links the Ped file and file list(s) by defining what raw fastqs belong to each individual.
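Since the sample sheet links individuals in the Ped file to their raw fastqs, one natural consistency check is that both files name the same individuals. The following Python sketch shows one way to express that. It is an assumption about what such a check could look like, not a description of what pipeline/validation.nf does, and `cross_check` is a hypothetical name.

```python
def cross_check(ped_rows, samplesheet_rows):
    """Compare the individuals named in a Ped file and a sample sheet.

    ped_rows: iterables whose first two items are family ID and individual ID.
    samplesheet_rows: dicts with 'family_id' and 'individual_id' keys.
    Hypothetical helper for illustration only.
    """
    ped_ids = {(row[0], row[1]) for row in ped_rows}
    sheet_ids = {(row["family_id"], row["individual_id"]) for row in samplesheet_rows}
    return {
        # Individuals in the Ped file with no fastqs in the sample sheet.
        "no_fastqs": sorted(ped_ids - sheet_ids),
        # Individuals with fastqs that the Ped file does not mention.
        "not_in_ped": sorted(sheet_ids - ped_ids),
    }


ped_rows = [
    ("000001", "000001", "000002", "000003", "2", "2"),
    ("000001", "000002", "0", "0", "1", "1"),
    ("000001", "000003", "0", "0", "2", "1"),
]
samplesheet_rows = [
    {"family_id": "000001", "individual_id": "000001"},
]
result = cross_check(ped_rows, samplesheet_rows)
print(result)
```

With the example inputs, the two parents are reported under `no_fastqs` because only the proband appears in the sample sheet.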