Added info about input reads from ECRF

Commit 2b045e62 authored 4 years ago by ameyner2
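
Note on the identifier format documented in this change: the sketch below shows one way the `<pcr_plate_id>_<indiv_id>_<family_id><suffix>` sample id and the ECRF FASTQ file names could be split into their parts. It is not part of the commit or of the pipeline scripts. It assumes that the plate, individual and family ids contain no underscores and that the kit suffix starts with an underscore (as in "_IDT-A"). The function names and the example ids are hypothetical.

```python
import re
from typing import NamedTuple, Tuple

# Assumption: plate, individual and family ids contain no underscores,
# and the optional exome kit suffix (e.g. "_IDT-A") starts with one.
SAMPLE_ID_RE = re.compile(
    r"^(?P<pcr_plate_id>[^_]+)_(?P<indiv_id>[^_]+)_(?P<family_id>[^_]+)"
    r"(?P<suffix>_.+)?$"
)

# ECRF FASTQ name as shown in Figure 2:
# <sample_id>_S<i>_L001_R1_001.fastq.gz or ..._R2_001.fastq.gz
FASTQ_RE = re.compile(
    r"^(?P<sample_id>.+)_S(?P<index>\d+)_L001_R(?P<read>[12])_001\.fastq\.gz$"
)


class SampleId(NamedTuple):
    pcr_plate_id: str
    indiv_id: str
    family_id: str
    suffix: str  # empty string when no kit suffix is present


def parse_sample_id(sample_id: str) -> SampleId:
    """Split a sample id into plate, individual, family and kit suffix."""
    match = SAMPLE_ID_RE.match(sample_id)
    if match is None:
        raise ValueError(f"unrecognised sample id: {sample_id!r}")
    return SampleId(
        match["pcr_plate_id"],
        match["indiv_id"],
        match["family_id"],
        match["suffix"] or "",
    )


def parse_ecrf_fastq(filename: str) -> Tuple[SampleId, int]:
    """Return the parsed sample id and the read number (1 or 2)."""
    match = FASTQ_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised FASTQ name: {filename!r}")
    return parse_sample_id(match["sample_id"]), int(match["read"])
```

Both functions raise `ValueError` on names that do not fit the pattern, so unexpected deliveries are flagged rather than silently mis-assigned to a family.
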
--- a/docs/SOP_alignment_variant_annotation.md
+++ b/docs/SOP_alignment_variant_annotation.md
@@ -26,11 +26,24 @@ bedtools slop -g $REFERENCE_GENOME.fai -i Twist_Exome_RefSeq_targets_hg38.bed -b

 ## Input

+### PED file
+
 A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.

-### Edinburgh Genomics

-A set of Nx2 FASTQ files, one per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.
+### Sample id format
+
+The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:
+
+```
+<pcr_plate_id>_<indiv_id>_<family_id><suffix>
+```
+
+The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.
+
+### Reads - Edinburgh Genomics
+
+A set of paired-end FASTQ files (designated by R1 or R2 suffixes), possibly more than one pair per sample. Each sample's files are in its own folder. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.

 ```
 <EdGE_project_id>/
@@ -47,29 +60,22 @@ A set of Nx2 FASTQ files, one per sample. The input files will be in the folder
  |   +---<dated_batch>_tree.txt
  |   +---Information.txt
 ```
-*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
+*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – in general we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.

-#### Sample id format
+### Reads - Edinburgh Clinical Research Facility

-The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:
+A set of paired-end FASTQ files (designated by R1 or R2 suffixes), generally one pair per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 2*.

 ```
-<pcr_plate_id>_<indiv_id>_<family_id><suffix>
+<EdGE_project_id>/
+  +---<internal_id>_-md5.txt
+  +---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R1_001.fastq.gz
+  +---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R2_001.fastq.gz
+  +...
 ```

-The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.
-
-### Edinburgh Clinical Research Facility
-
-A set of Nx2 FASTQ files, one per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 2*.
-
-TODO
 *Figure 2.* File name and directory structure for a batch of sequencing from the ECRF.

-#### Sample id format
-
-TODO
-
 ## Working directories

 The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in *Figure 3*.
@@ -227,7 +233,7 @@ perl /home/u035/project/scripts/trio_whole_exome_parse_peddy_ped_csv.pl \
  --project $project_id \
  --batch $batch_id \
  --ped /scratch/u035/project/trio_whole_exome/analysis/params/$project_id.ped
-grep -v False$ output/qc/$project_id.ped_check.txt
+grep -v False$ qc/$project_id.ped_check.txt
 ```

 7. Clear the work directory and move the log files to the complete sub-directory.