Commit 61c7c113 authored by mwham


Doc updates, updating format of sample sheets. Adding nextflow.config. Adding main.nf entrypoint, moving param declarations there. Making most input parameters their own Channels.
parent 95f1c2d0
1 merge request: !4 NextFlow variant calling
Pipeline #14066 failed
Showing with 288 additions and 78 deletions
# Trio-Whole-Exome Pipeline

This is an automated version of the scripts currently run manually according to SOP as part of the whole exome trios
project with David Fitzpatrick's group. This pipeline is controlled by [NextFlow](https://www.nextflow.io).

## Setup

This pipeline requires:

- NextFlow
- An install of BCBio v1.2.8

A [Conda](https://docs.conda.io) environment containing NextFlow is available in `environment.yml`. This can be created
with the command:

    $ conda env create -n <environment_name> -f environment.yml

## Running the pipeline

The pipeline requires two main input files (a samplesheet and a Ped file), plus some configuration:
### Configuration

This pipeline uses a config at `trio-whole-exome/nextflow.config`, containing profiles for different sizes of process.
NextFlow picks this up automatically.

A second config is necessary for providing executor and param information. This can be supplied via the `-c` argument;
an example is given after the parameter list below.
Parameters:
- `bcbio` - path to a BCBio install, containing 'anaconda', 'galaxy', 'genomes', etc
- `bcbio_template` - path to a template config for BCBio variant calling. Should set `upload.dir: ./results` so that
BCBio will output results to the working dir.
- `output_dir` - where the results get written to on the system. The variant calling creates initial results here,
and variant prioritisation adds to them
- `target_bed` - BED file of Twist exome targets
- `reference_genome` - hg38 reference genome in FASTA format
- `parse_peddy_output` - path to the parse_peddy_output Perl script. Todo: remove once scripts are in bin/
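
As an illustration, a minimal second config might look like the following (the executor choice and all paths here are
illustrative, not prescriptive):

    process.executor = 'slurm'

    params {
        bcbio = '/path/to/bcbio-1.2.8'
        bcbio_template = '/path/to/bcbio_template.yaml'
        output_dir = '/path/to/outputs'
        target_bed = '/path/to/twist_exome_targets.bed'
        reference_genome = '/path/to/hg38.fa'
        parse_peddy_output = '/path/to/parse_peddy_output.pl'
    }

Per the `bcbio_template` note above, the template YAML should contain at least:

    upload:
        dir: ./results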
### Samplesheet

This is a tab-separated file mapping individuals to fastq pairs. The columns are individual_id, read_1 and read_2. If a
sample has been sequenced over multiple lanes, then include a line for each fastq pair:

    individual_id  read_1                      read_2
    000001         path/to/lane_1_r1.fastq.gz  path/to/lane_1_r2.fastq.gz
    000001         path/to/lane_2_r1.fastq.gz  path/to/lane_2_r2.fastq.gz
### Ped file

Tab-separated Ped file mapping individuals to each other, with family IDs and affected status. Per the
[specification](https://gatk.broadinstitute.org/hc/en-us/articles/360035531972-PED-Pedigree-format), the columns are
family ID, individual ID, father ID, mother ID, sex (1=male, 2=female, other=unknown), affected status (-9 or 0=missing,
1=unaffected, 2=affected):
@@ -36,27 +59,45 @@ family ID, individual ID, father ID, mother ID, sex (1=male, 2=female, other=unk

    000001  000002  0  0  1  1
    000001  000003  0  0  2  1
The pipeline supports non-trios, e.g. singletons, duos and quads.
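
For example, a duo and a singleton would just be additional families in the same Ped file (all IDs below are
illustrative):

    000002  000004  0  000005  1  2
    000002  000005  0  0       2  1
    000003  000006  0  0       2  2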
### Usage
The pipeline can now be run. First, run the initial variant calling:
    $ nextflow path/to/trio-whole-exome/main.nf \
        -c path/to/nextflow.config \
        --workflow 'variant-calling' \
        --pipeline_project_id projname --pipeline_project_version v1 \
        --ped_file path/to/batch.ped \
        --sample_sheet path/to/samplesheet.tsv
Todo: variant prioritisation workflow
## Tests

This pipeline has automated tests contained in the folder `tests/`. To run the tests locally, `cd` to this folder
with your Conda environment active and run the test scripts:
- `run_tests.sh`
- `run_giab_tests.sh`
These tests use the environment variable `NEXTFLOW_CONFIG`, pointing to a platform-specific config file.
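
For example (the config path here is illustrative):

    $ export NEXTFLOW_CONFIG=/path/to/platform.config
    $ ./run_tests.sh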
## Terminology FAQ

- 'Batch'
  - Slightly ambiguous term - can be a pipeline batch, a sequencing batch or a BCBio batch. To this end, a single
    run of this pipeline is known as a project.
- 'Pipeline project'
  - A single run of this pipeline, potentially mixing samples and families from multiple sequencing batches. There's
    always one Ped file and sample sheet per pipeline project.
- 'Sequencing batch'
  - A group of samples that were prepared and sequenced together.
- 'BCBio batch'
  - Used internally by BCBio to identify a family.
- 'Sample ID'
  - Specific to a sequencing batch, family ID, individual ID and extraction kit type.
main.nf 0 → 100644
nextflow.enable.dsl = 2
include {var_calling} from './pipeline/var_calling.nf'
// which part of the pipeline to run - either 'variant-calling' or 'variant-prioritisation'
params.workflow = null
// path to a bcbio install, containing 'anaconda', 'galaxy', 'genomes', etc
params.bcbio = null
// path to a template config for bcbio variant calling
params.bcbio_template = null
// where the results get written to on the system. The variant calling creates initial
// results here, and variant prioritisation adds to them
params.output_dir = null
// name of the pipeline batch, e.g. '21900', '20220427'
params.pipeline_project_id = null
// version of the pipeline batch, e.g. 'v1'
params.pipeline_project_version = null
// bed file of Twist exome targets
params.target_bed = null
// hg38 reference genome in fasta format
params.reference_genome = null
// path to the parse_peddy_output Perl script. Todo: remove once scripts are in bin/
params.parse_peddy_output = null
// path to a Ped file describing all the families in the pipeline batch
params.ped_file = null
// path to a samplesheet mapping individual IDs to fastq pairs
params.sample_sheet = null
workflow {
    if (params.workflow == 'variant-calling') {
        var_calling()
    } else if (params.workflow == 'variant-prioritisation') {
        println "Variant prioritisation coming soon"
    } else {
        exit 1, 'params.workflow required - variant-calling or variant-prioritisation'
    }
}
nextflow.config 0 → 100644

process {
    executor = 'slurm'
    cpus = 4
    memory = 8.GB
    time = '6h'

    withLabel: small {
        executor = 'local'
        cpus = 2
        memory = 2.GB
    }

    withLabel: medium {
        cpus = 4
        memory = 8.GB
    }

    withLabel: large {
        cpus = 16
        memory = 32.GB
    }
}

profiles {
    debug {
        process.echo = true
    }
}
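
The `withLabel` blocks above apply to any process that declares the matching label; unlabelled processes get the
top-level defaults. A minimal sketch (the process name and script here are hypothetical):

    process example_heavy_step {
        label 'large'

        script:
        """
        echo "runs with 16 CPUs and 32 GB via the 'large' label"
        """
    }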
pipeline/inputs.nf

nextflow.enable.dsl = 2
workflow read_inputs {
    /*
@@ -24,9 +22,8 @@ workflow read_inputs {
        ...
    ]
    */
    ch_ped_file = Channel.fromPath(params.ped_file, checkIfExists: true)
    ch_ped_file_info = ch_ped_file.splitCsv(sep: '\t')
        .map(
            { line ->
                [
@@ -55,9 +52,8 @@ workflow read_inputs {
            ]
        ]
    */
    ch_samplesheet = Channel.fromPath(params.sample_sheet, checkIfExists: true)
    ch_samplesheet_info = ch_samplesheet.splitCsv(sep: '\t', header: true)
        .map(
            { line -> [line.individual_id, file(line.read_1), file(line.read_2)] }
        )
@@ -103,6 +99,10 @@ workflow read_inputs {
    ch_individuals_by_family = ch_individuals.map({[it[1], it]})

    emit:
        ch_ped_file = ch_ped_file
        ch_ped_file_info = ch_ped_file_info
        ch_samplesheet = ch_samplesheet
        ch_samplesheet_info = ch_samplesheet_info
        ch_individuals = ch_individuals
        ch_individuals_by_family = ch_individuals_by_family
}
pipeline/var_calling.nf

@@ -3,14 +3,6 @@ nextflow.enable.dsl = 2

include {read_inputs} from './inputs.nf'
include {validation} from './validation.nf'
process merge_fastqs {
    label 'medium'

@@ -38,6 +30,7 @@ process write_bcbio_csv {
    input:
    tuple(val(family_id), val(individual_info))
    path(target_bed)

    output:
    tuple(val(family_id), path("${family_id}.csv"))
@@ -45,14 +38,16 @@ process write_bcbio_csv {
    script:
    """
    #!/usr/bin/env python
    import os

    target_bed = os.path.realpath('${target_bed}')
    individual_info = '$individual_info'
    lines = individual_info.lstrip('[').rstrip(']').split('], [')
    with open('${family_id}.csv', 'w') as f:
        f.write('samplename,description,batch,sex,phenotype,variant_regions\\n')
        for l in lines:
            f.write(l.replace(', ', ',') + ',' + target_bed + '\\n')
    """
}
@@ -62,17 +57,19 @@ process bcbio_family_processing {
    input:
    tuple(val(family_id), val(individuals), path(family_csv))
    path(bcbio)
    path(bcbio_template)

    output:
    tuple(val(family_id), val(individuals), path("${family_id}-merged"))

    script:
    """
    ${bcbio}/anaconda/bin/bcbio_prepare_samples.py --out . --csv $family_csv &&
    ${bcbio}/anaconda/bin/bcbio_nextgen.py -w template ${bcbio_template} ${family_csv.getBaseName()}-merged.csv ${individuals.collect({"${it}.fastq.gz"}).join(' ')} &&
    cd ${family_id}-merged &&
    ../${bcbio}/anaconda/bin/bcbio_nextgen.py config/${family_id}-merged.yaml -n 16 -t local
    """
}
@@ -80,13 +77,15 @@ process bcbio_family_processing {
process format_bcbio_individual_outputs {
    input:
    tuple(val(family_id), val(individuals), path(bcbio_output_dir))
    path(bcbio)
    path(reference_genome)

    output:
    tuple(val(family_id), path('individual_outputs'))

    script:
    """
    samtools=${bcbio}/anaconda/bin/samtools &&
    mkdir individual_outputs
    for i in ${individuals.join(' ')}
    do
@@ -99,7 +98,7 @@ process format_bcbio_individual_outputs {
        bam=\$indv_input/\$i-ready.bam
        cram="\$indv_output/\$i-ready.cram" &&
        \$samtools view -@ ${task.cpus} -T ${reference_genome} -C -o \$cram \$bam &&
        \$samtools index \$cram &&
        bam_flagstat=./\$i-ready.bam.flagstat.txt &&
        cram_flagstat=\$cram.flagstat.txt &&
@@ -165,15 +164,19 @@ process format_bcbio_family_outputs {
}

process collate_pipeline_outputs {
    label 'small'

    publishDir "${params.output_dir}", mode: 'move', pattern: "${params.pipeline_project_id}_${params.pipeline_project_version}"

    input:
    val(family_ids)
    val(bcbio_family_output_dirs)
    val(raw_bcbio_output_dirs)
    path(ped_file)
    path(samplesheet)
    path(bcbio)
    path(parse_peddy_output)

    output:
    path("${params.pipeline_project_id}_${params.pipeline_project_version}")
@@ -189,20 +192,25 @@ process collate_pipeline_outputs {
        cp -rL \$d \$outputs/families/\$(basename \$d)
    done &&

    for f in ${family_ids.join(' ')}
    do
        grep \$f ${ped_file} > \$outputs/params/\$f.ped
    done &&

    cd \$outputs/families &&
    ../../${bcbio}/anaconda/bin/multiqc \
        --title "Trio whole exome QC report: ${params.pipeline_project_id}_${params.pipeline_project_version}" \
        --outdir ../qc \
        --filename ${params.pipeline_project_id}_${params.pipeline_project_version}_qc_report.html \
        . &&

    peddy_output=../qc/${params.pipeline_project_id}_${params.pipeline_project_version}.ped_check.txt &&
    perl ../../${parse_peddy_output} \
        --output \$peddy_output \
        --project ${params.pipeline_project_id} \
        --batch ${bcbio_family_output_dirs[0].getName().split('_')[1]} \
        --version ${params.pipeline_project_version} \
        --ped ../../${ped_file} \
        --families . &&

    # no && here - exit status checked below
@@ -222,8 +230,8 @@ process collate_pipeline_outputs {
        dest_basename=${params.pipeline_project_id}_${params.pipeline_project_version}_\$family_id &&
        cp -L \$d/config/\${family_id}-merged.csv \$outputs/params/\$dest_basename.csv &&
        cp -L \$d/config/\${family_id}-merged.yaml \$outputs/config/\$dest_basename.yaml &&
        cp -L ${ped_file} \$outputs/params/ &&
        cp -L ${samplesheet} \$outputs/params/
    done
    """
}
@@ -242,8 +250,16 @@ workflow process_families {
    take:
        ch_individuals
        ch_ped_file
        ch_samplesheet

    main:
        ch_bcbio = file(params.bcbio, checkIfExists: true)
        ch_bcbio_template = file(params.bcbio_template, checkIfExists: true)
        ch_target_bed = file(params.target_bed, checkIfExists: true)
        ch_parse_peddy_output = file(params.parse_peddy_output, checkIfExists: true)
        ch_reference_genome = file(params.reference_genome, checkIfExists: true)

        ch_merged_fastqs = merge_fastqs(
            ch_individuals.map(
                { indv, family, father, mother, sex, affected, r1, r2 ->
@@ -267,9 +283,10 @@ workflow process_families {
        ch_bcbio_csvs = write_bcbio_csv(
            ch_read1_meta.mix(ch_read2_meta).map(
                { family_id, sample_id, father, mother, sex, phenotype, merged_fastq ->
                    [family_id, [merged_fastq, sample_id, family_id, sex, phenotype]]
                }
            ).groupTuple(),
            ch_target_bed
        )

        ch_bcbio_inputs = ch_joined_indv_info.map(
@@ -277,25 +294,42 @@ workflow process_families {
            [family_id, sample_id]
        }).groupTuple().join(ch_bcbio_csvs)

        ch_bcbio_family_outputs = bcbio_family_processing(
            ch_bcbio_inputs,
            ch_bcbio,
            ch_bcbio_template
        )
        ch_individual_folders = format_bcbio_individual_outputs(
            ch_bcbio_family_outputs,
            ch_bcbio,
            ch_reference_genome
        )
        ch_formatted_bcbio_outputs = format_bcbio_family_outputs(
            ch_bcbio_family_outputs.join(ch_individual_folders)
        )
        collate_pipeline_outputs(
            ch_formatted_bcbio_outputs.map({it[0]}).collect(),
            ch_formatted_bcbio_outputs.map({it[1]}).collect(),
            ch_formatted_bcbio_outputs.map({it[2]}).collect(),
            ch_ped_file,
            ch_samplesheet,
            ch_bcbio,
            ch_parse_peddy_output
        )
}
workflow var_calling {
    read_inputs()

    validation(read_inputs.out.ch_individuals)
    process_families(
        read_inputs.out.ch_individuals,
        read_inputs.out.ch_ped_file,
        read_inputs.out.ch_samplesheet
    )
}
assets/input_data/ped_files/giab_test_non_trios.ped

00001_000001 000001_000001 0 000003_000002 1 2
00001_000001 000003_000001 0 0 2 1
00001_000002 000004_000002 0 0 1 2
00001_000003 000005_000003 000007_000003 000008_000003 1 2
00001_000003 000006_000003 000007_000003 000008_000003 2 2
00001_000003 000007_000003 0 0 1 1
00001_000003 000008_000003 0 0 2 1
assets/input_data/ped_files/giab_test_trios.ped

00001_000001 000001_000001 000002_000002 000003_000002 1 2
00001_000001 000002_000001 0 0 1 1
00001_000001 000003_000001 0 0 2 1
00001_000002 000004_000002 000005_000002 000006_000002 1 2
00001_000002 000005_000002 0 0 1 1
00001_000002 000006_000002 0 0 2 1
assets/input_data/sample_sheets/batch_1.tsv

individual_id read_1 read_2
000001 assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000001_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_1_00001AM0001L01_1.fastq.gz assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000001_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_1_00001AM0001L01_2.fastq.gz
000001 assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000001_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_2_00001AM0001L01_1.fastq.gz assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000001_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_2_00001AM0001L01_2.fastq.gz
000002 assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000002_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_3_00002AM0001L01_1.fastq.gz assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000002_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_3_00002AM0001L01_2.fastq.gz
000002 assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000002_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_4_00002AM0001L01_1.fastq.gz assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000002_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_4_00002AM0001L01_2.fastq.gz
000003 assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000003_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_5_00003AM0001L01_1.fastq.gz assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000003_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_5_00003AM0001L01_2.fastq.gz
000003 assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000003_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_6_00003AM0001L01_1.fastq.gz assets/input_data/edinburgh_genomics/X12345_A_Researcher/20210922/12345_000003_000001_WESTwist_IDT-B/200922_A00001_0001_BHNTGMDMXX_6_00003AM0001L01_2.fastq.gz
individual_id read_1 read_2
000006 assets/input_data/edinburgh_genomics/X12346_MD5_Errors/20211005/12346_000006_000003_WESTwist_IDT-B/211005_A00002_0002_AJTHSNRLXX_1_00002AM0002L01_1.fastq.gz assets/input_data/edinburgh_genomics/X12346_MD5_Errors/20211005/12346_000006_000003_WESTwist_IDT-B/211005_A00002_0002_AJTHSNRLXX_1_00002AM0002L01_2.fastq.gz
assets/input_data/sample_sheets/giab_test_non_trios.tsv

individual_id read_1 read_2
000001_000001 assets/input_data/giab/AshkenazimTrio/HG002_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG002_R2.fastq.gz
000003_000001 assets/input_data/giab/AshkenazimTrio/HG004_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG004_R2.fastq.gz
000004_000002 assets/input_data/giab/ChineseTrio/HG005_R1.fastq.gz assets/input_data/giab/ChineseTrio/HG005_R2.fastq.gz
000005_000003 assets/input_data/giab/AshkenazimTrio/HG002_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG002_R2.fastq.gz
000006_000003 assets/input_data/giab/ChineseTrio/HG005_R1.fastq.gz assets/input_data/giab/ChineseTrio/HG005_R2.fastq.gz
000007_000003 assets/input_data/giab/AshkenazimTrio/HG003_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG003_R2.fastq.gz
000008_000003 assets/input_data/giab/AshkenazimTrio/HG004_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG004_R2.fastq.gz
assets/input_data/sample_sheets/giab_test_trios.tsv

individual_id read_1 read_2
000001_000001 assets/input_data/giab/AshkenazimTrio/HG002_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG002_R2.fastq.gz
000002_000001 assets/input_data/giab/AshkenazimTrio/HG003_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG003_R2.fastq.gz
000003_000001 assets/input_data/giab/AshkenazimTrio/HG004_R1.fastq.gz assets/input_data/giab/AshkenazimTrio/HG004_R2.fastq.gz
000004_000002 assets/input_data/giab/ChineseTrio/HG005_R1.fastq.gz assets/input_data/giab/ChineseTrio/HG005_R2.fastq.gz
000005_000002 assets/input_data/giab/ChineseTrio/HG006_R1.fastq.gz assets/input_data/giab/ChineseTrio/HG006_R2.fastq.gz
000006_000002 assets/input_data/giab/ChineseTrio/HG007_R1.fastq.gz assets/input_data/giab/ChineseTrio/HG007_R2.fastq.gz
tests/run_giab_tests.sh

#!/bin/bash
source scripts/nextflow_detached.sh
test_exit_status=0
nextflow -c "$NEXTFLOW_CONFIG" clean -f
echo "Reduced GiaB data - trios"
run_nextflow ../main.nf \
    -c "$NEXTFLOW_CONFIG" \
    --workflow 'variant-calling' \
    --pipeline_project_id giab_test_trios \
    --pipeline_project_version v1 \
    --ped_file $PWD/assets/input_data/ped_files/giab_test_trios.ped \
    --sample_sheet $PWD/assets/input_data/sample_sheets/giab_test_trios.tsv
test_exit_status=$(( $test_exit_status + $? ))
echo "Reduced GiaB data - non-trios"
run_nextflow ../main.nf \
    -c "$NEXTFLOW_CONFIG" \
    --workflow 'variant-calling' \
    --pipeline_project_id giab_test_non_trios \
    --pipeline_project_version v1 \
    --ped_file $PWD/assets/input_data/ped_files/giab_test_non_trios.ped \
    --sample_sheet $PWD/assets/input_data/sample_sheets/giab_test_non_trios.tsv
test_exit_status=$(( $test_exit_status + $? ))
echo "Tests finished with exit status $test_exit_status"
tests/run_tests.sh

#!/bin/bash

source scripts/nextflow_detached.sh
bcbio=$PWD/scripts/bcbio_nextgen.py
bcbio_prepare_samples=$PWD/scripts/bcbio_prepare_samples.py
@@ -9,7 +9,6 @@ common_args="--bcbio $bcbio --bcbio_prepare_samples $bcbio_prepare_samples --bcb
test_exit_status=0

nextflow clean -f

echo "Test case 1: simple trio"
run_nextflow ../pipeline/main.nf --ped_file assets/input_data/ped_files/batch_1.ped --sample_sheet assets/input_data/sample_sheets/batch_1.tsv $common_args
...
File moved