Started documenting ECRF process.

Commit 01106da9 authored 4 years ago by ameyner2
--- a/docs/SOP_alignment_variant_annotation.md
+++ b/docs/SOP_alignment_variant_annotation.md
 # Standard operating procedure - Alignment, variant calling, and annotation of trio whole exome samples at the Edinburgh Parallel Computing Centre

-This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE). It assumes that data has been successfully transferred from EdGE to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.
+This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE) or the Edinburgh Clinical Research Facility (ECRF). It assumes that data has been successfully transferred to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh GitLab server at gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.

 ## Definitions

@@ -10,7 +10,7 @@ Text in angle brackets, e.g. <project> indicates variable parameters. A variable

 ## Software and data requirements

-The analysis is run with the bcbio pipeline (version 1.1.15) located at /home/u035/project/software/bcbio. All genome reference and annotation data resources are contained within the genomes/Hsapiens/hg38 subfolder.
+The analysis is run with the bcbio pipeline (version 1.2.3) located at /home/u035/project/software/bcbio. All genome reference and annotation data resources are contained within the genomes/Hsapiens/hg38 subfolder.

 The TWIST target BED file is at: /home/u035/project/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed

@@ -26,7 +26,11 @@ bedtools slop -g $REFERENCE_GENOME.fai -i Twist_Exome_RefSeq_targets_hg38.bed -b

 ## Input

-A set of Nx2 FASTQ files, one per sample. A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.
+A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.
+
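Review note: the PED/FAM requirement above could be checked before a batch is configured. The following is a minimal sketch, not part of the repository scripts. It assumes the six columns follow the PLINK .fam order (family id, individual id, father id, mother id, sex, phenotype) and that `0` marks a parent who is not in the file. The names `PedRecord`, `parse_ped`, and `missing_parents`, and the sample ids, are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class PedRecord(NamedTuple):
    """One row of a 6-column PED/FAM file (PLINK .fam column order)."""
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str
    phenotype: str


def parse_ped(text: str) -> List[PedRecord]:
    """Parse tab-delimited PED/FAM text, rejecting rows without 6 columns."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(
                f"line {line_no}: expected 6 tab-delimited columns, "
                f"found {len(fields)}"
            )
        records.append(PedRecord(*fields))
    return records


def missing_parents(records: List[PedRecord]) -> List[Tuple[str, str]]:
    """Return (individual, parent) pairs where a named parent has no row."""
    known = {(r.family_id, r.individual_id) for r in records}
    missing = []
    for r in records:
        for parent in (r.father_id, r.mother_id):
            # "0" means the parent is not in the dataset (assumed convention)
            if parent != "0" and (r.family_id, parent) not in known:
                missing.append((r.individual_id, parent))
    return missing


# Hypothetical trio: one affected child and two unaffected parents
example_ped = (
    "fam1\tchild1\tdad1\tmum1\t2\t2\n"
    "fam1\tdad1\t0\t0\t1\t1\n"
    "fam1\tmum1\t0\t0\t2\t1\n"
)
records = parse_ped(example_ped)
```

Calling `missing_parents(records)` on a complete trio returns an empty list; a non-empty result would suggest that a parent named in the file was not sequenced or was mislabelled.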
+### Edinburgh Genomics
+
+A set of Nx2 FASTQ files, one per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.

 ```
 <EdGE_project_id>/
@@ -43,9 +47,9 @@ A set of Nx2 FASTQ files, one per sample. A 6-column tab-delimited PED/FAM forma
  |   +---<dated_batch>_tree.txt
  |   +---Information.txt
 ```
-*Figure 1.* File name and directory structure for a batch of sequencing. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
+*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
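
Review note: the naming conventions in the Figure 1 caption lend themselves to a quick sanity check on incoming folders. This is a rough sketch under stated assumptions: the caption gives the project id only as XXXXX_Lastname_Firstname, so reading "XXXXX" as five digits and the names as letters or hyphens is a guess, and the function names and example values are hypothetical.

```python
import re
from datetime import datetime

# Assumption: "XXXXX" is five digits; names contain only letters or hyphens.
PROJECT_ID_PATTERN = re.compile(
    r"^(?P<number>\d{5})_(?P<lastname>[A-Za-z-]+)_(?P<firstname>[A-Za-z-]+)$"
)


def parse_project_id(project_id):
    """Split an EdGE project id into its number, last name, and first name."""
    match = PROJECT_ID_PATTERN.match(project_id)
    if match is None:
        raise ValueError(f"unexpected EdGE project id: {project_id!r}")
    return match.groupdict()


def parse_dated_batch(batch):
    """Check that a dated batch name is a real calendar date in yyyymmdd form."""
    if not re.fullmatch(r"\d{8}", batch):
        raise ValueError(f"dated batch is not 8 digits: {batch!r}")
    # strptime raises ValueError for impossible dates such as 20210230
    return datetime.strptime(batch, "%Y%m%d").date()


# Hypothetical values for illustration only
project = parse_project_id("12345_Doe_Jane")
batch_date = parse_dated_batch("20210315")
```

Parsing the date rather than only matching eight digits catches typos such as a month of 13, which a pattern check alone would let through.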

-## Sample id format
+#### Sample id format

 The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:

@@ -55,9 +59,20 @@ The sequencing reads for the samples delivered from EdGE are identified by folde

 The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.

+### Edinburgh Clinical Research Facility
+
+A set of Nx2 FASTQ files, one per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 2*.
+
+TODO
+*Figure 2.* File name and directory structure for a batch of sequencing from the ECRF.
+
+#### Sample id format
+
+TODO
+
 ## Working directories

-The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in *Figure 2*.
+The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in *Figure 3*.

 ```
    config – bcbio configuration files in YAML format
@@ -67,7 +82,7 @@ The project working directories will be in the folder /scratch/u035/project/trio
    reads – symlinks to input FASTQ files
    work – bcbio working folder
 ```
-*Figure 2.* Project working directories.
+*Figure 3.* Project working directories.

 ## Project configuration

@@ -128,7 +143,7 @@ upload:
 Per sample: BAM file of aligned reads against the hg38 genome assembly
 Per family: Annotated VCF file and QC report

-Output will be in the folder /scratch/u035/project/trio_whole_exome/analysis/output and follow the structure in *Figure 3* (with multiple instances of the indiv_id sub directories, one per sequenced family member.). The qc sub-directories are not enumerated, and automatically generated index files are not listed for brevity. An additional directory at the root of the output folder called “qc” will contain the MultiQC reports generated for an entire batch.
+Output will be in the folder /scratch/u035/project/trio_whole_exome/analysis/output and follow the structure in *Figure 4* (with multiple instances of the indiv_id sub-directories, one per sequenced family member). The qc sub-directories are not enumerated, and automatically generated index files are omitted for brevity. An additional directory called “qc” at the root of the output folder will contain the MultiQC reports generated for an entire batch.

 ```
 <analysis_date>_<EdGE_project_id>_<pcr_plate_id>_<family_id>/
@@ -150,7 +165,7 @@ Output will be in the folder /scratch/u035/project/trio_whole_exome/analysis/out
  +---programs.txt
  +---project-summary.yaml
 ```
-*Figure 3.* File name and output directory structure for each family in a batch of sequencing.
+*Figure 4.* File name and output directory structure for each family in a batch of sequencing.

 ## Procedure