# Trio-Whole-Exome pipeline
This is an automated version of the scripts currently run manually, according to the SOP, as part of the whole exome trios project with David Fitzpatrick's group. The pipeline is controlled by NextFlow.
## Setup
A Conda environment containing NextFlow is available in `environment.yaml`. Once you have Conda installed, you can create an environment by `cd`-ing into this project and running the command:
$ conda env create -n <environment_name>
## Running
The pipeline requires two main input files:
### Samplesheet
This is a tab-separated file mapping fastq pairs to metadata. The columns are individual ID, family ID, fastq sample ID, r1 fastq and r2 fastq. If a sample has been sequenced over multiple lanes, then include a line for each fastq pair:
individual_id family_id sample_id read_1 read_2
000001 000001 12345_000001_000001_WESTwist_IDT-B path/to/lane_1_r1.fastq.gz path/to/lane_1_r2.fastq.gz
000001 000001 12345_000001_000001_WESTwist_IDT-B path/to/lane_2_r1.fastq.gz path/to/lane_2_r2.fastq.gz
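
To illustrate the layout described above, here is a minimal Python sketch (not part of the pipeline) that reads a samplesheet and groups fastq pairs by sample, so a sample sequenced over several lanes ends up with several pairs. It assumes the file starts with the header row shown in the example; `read_samplesheet` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Column order as described above; names taken from the example header row.
COLUMNS = ["individual_id", "family_id", "sample_id", "read_1", "read_2"]


def read_samplesheet(handle):
    """Group (read_1, read_2) pairs by (family_id, individual_id, sample_id)."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")

    lanes = defaultdict(list)
    for row in reader:
        # DictReader marks extra columns with a None key and missing ones with None values.
        if None in row or None in row.values():
            raise ValueError(f"line {reader.line_num}: expected {len(COLUMNS)} columns")
        key = (row["family_id"], row["individual_id"], row["sample_id"])
        lanes[key].append((row["read_1"], row["read_2"]))
    return dict(lanes)


# The two example rows above: one sample sequenced over two lanes.
sample_id = "12345_000001_000001_WESTwist_IDT-B"
rows = [
    ["000001", "000001", sample_id, "path/to/lane_1_r1.fastq.gz", "path/to/lane_1_r2.fastq.gz"],
    ["000001", "000001", sample_id, "path/to/lane_2_r1.fastq.gz", "path/to/lane_2_r2.fastq.gz"],
]
example = "\n".join(["\t".join(COLUMNS)] + ["\t".join(r) for r in rows]) + "\n"

samples = read_samplesheet(io.StringIO(example))
for key, pairs in samples.items():
    print(key, "has", len(pairs), "fastq pair(s)")
```

Keying on all three IDs keeps lanes from the same individual separate if they were sequenced under different sample IDs.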
### Ped file
Tab-separated Ped file mapping individuals to each other and affected status. Per the specification, the columns are family ID, individual ID, father ID, mother ID, sex (1=male, 2=female, other=unknown), affected status (-9 or 0=missing, 1=unaffected, 2=affected):
000001 000001 000002 000003 2 2
000001 000002 0 0 1 1
000001 000003 0 0 2 1
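
The Ped columns above can be checked with a short Python sketch like the one below. It is only an illustration and may not match what `pipeline/validation.nf` actually checks. The names `PedRecord`, `read_ped` and `find_problems` are hypothetical, and the sketch assumes that a parent ID of `0` means "no parent recorded", as in the example rows.

```python
import io
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str        # 1=male, 2=female, anything else=unknown
    affected: str   # -9 or 0=missing, 1=unaffected, 2=affected


def read_ped(handle):
    """Parse a tab-separated Ped file into PedRecord tuples, skipping blank lines."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"line {line_number}: expected 6 columns, got {len(fields)}")
        records.append(PedRecord(*fields))
    return records


def find_problems(records):
    """Return human-readable descriptions of inconsistencies in the records."""
    known = {(r.family_id, r.individual_id) for r in records}
    problems = []
    for r in records:
        for role, parent in (("father", r.father_id), ("mother", r.mother_id)):
            # Assumption: "0" marks a founder with no parent in the file.
            if parent != "0" and (r.family_id, parent) not in known:
                problems.append(f"{r.individual_id}: {role} {parent} not in family {r.family_id}")
        if r.affected not in {"-9", "0", "1", "2"}:
            problems.append(f"{r.individual_id}: unknown affected status {r.affected!r}")
    return problems


# The example trio above: an affected daughter and two unaffected parents.
ped_text = (
    "000001\t000001\t000002\t000003\t2\t2\n"
    "000001\t000002\t0\t0\t1\t1\n"
    "000001\t000003\t0\t0\t2\t1\n"
)
trio = read_ped(io.StringIO(ped_text))
print(find_problems(trio))
```

Sex is deliberately not validated, since any value other than 1 or 2 is read as unknown.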
The pipeline can now be run. First, check for errors:
$ nextflow run pipeline/validation.nf --ped_file path/to/batch.ped --sample_sheet path/to/batch_samplesheet.tsv
Todo: run the main processing
## Tests
This pipeline has automated tests contained in the folder `tests/`. To run the tests locally, `cd` to this folder with your Conda environment active and run `./run_tests.sh`.
## Terminology
- Batch: slightly ambiguous term - could be a pipeline batch, a sequencing batch or a BCBio batch
- Pipeline batch: a single run of this pipeline, potentially mixing samples and families from multiple sequencing batches
- Sequencing batch: a group of samples that were prepared and sequenced together
- BCBio batch: used internally by BCBio to identify a family
- Sample ID: specific to a sequencing batch, family ID, individual ID and extraction kit type
- file_list.tsv: there's one of these files per sequencing batch, summarising all fastqs in the batch. A pipeline batch may need to refer to multiple different individuals across different file lists.
- Ped file: defines family relationships between individuals. There's always one Ped file per pipeline batch.
- Sample sheet: links the Ped file and file list(s) by defining what raw fastqs belong to each individual.
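
Since the sample sheet links the Ped file to the fastqs, one useful sanity check is that both files refer to the same individuals. The Python sketch below shows one way to do that. `cross_check` is a hypothetical helper, and treating every individual in the Ped file as needing at least one fastq pair is an assumption rather than a documented rule of this pipeline. Note that the two files list their first two columns in opposite orders.

```python
def cross_check(ped_rows, samplesheet_rows):
    """Compare (family_id, individual_id) pairs between the two inputs."""
    # Ped rows start with family ID, then individual ID.
    ped_ids = {(row[0], row[1]) for row in ped_rows}
    # Samplesheet rows start with individual ID, then family ID, so swap them.
    sheet_ids = {(row[1], row[0]) for row in samplesheet_rows}
    return {
        # Individuals with fastqs but no Ped entry.
        "not_in_ped": sorted(sheet_ids - ped_ids),
        # Individuals in the Ped file with no fastqs (assumed to be an error).
        "no_fastqs": sorted(ped_ids - sheet_ids),
    }


# Rows taken from the examples in the Running section.
ped_rows = [
    ("000001", "000001", "000002", "000003", "2", "2"),
    ("000001", "000002", "0", "0", "1", "1"),
    ("000001", "000003", "0", "0", "2", "1"),
]
samplesheet_rows = [
    ("000001", "000001", "12345_000001_000001_WESTwist_IDT-B",
     "path/to/lane_1_r1.fastq.gz", "path/to/lane_1_r2.fastq.gz"),
    ("000001", "000001", "12345_000001_000001_WESTwist_IDT-B",
     "path/to/lane_2_r1.fastq.gz", "path/to/lane_2_r2.fastq.gz"),
]

result = cross_check(ped_rows, samplesheet_rows)
print(result)
```

With only the example samplesheet rows, both parents are reported under `no_fastqs`, because the example lists fastqs for individual `000001` alone.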