Commit 2b045e62 authored by ameyner2

Added info about input reads from ECRF

parent a37dad4d
@@ -26,11 +26,24 @@ bedtools slop -g $REFERENCE_GENOME.fai -i Twist_Exome_RefSeq_targets_hg38.bed -b
## Input
### PED file
A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.
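As an illustration of the format only, the sketch below reads a 6-column tab-delimited PED/FAM stream and rejects rows with the wrong number of columns. The column order (family id, individual id, father id, mother id, sex, phenotype) follows the PLINK .fam description linked above; the function name `read_ped` and the example ids are hypothetical and not part of the pipeline.

```python
import csv
import io

# Column order as described for the PLINK .fam format.
PED_COLUMNS = ["family_id", "individual_id", "father_id", "mother_id", "sex", "phenotype"]


def read_ped(handle):
    """Parse a 6-column tab-delimited PED/FAM stream into a list of dicts."""
    records = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != len(PED_COLUMNS):
            raise ValueError(f"line {line_no}: expected 6 columns, found {len(row)}")
        records.append(dict(zip(PED_COLUMNS, row)))
    return records


# Hypothetical trio: sex 1 = male, 2 = female; phenotype 1 = unaffected, 2 = affected.
example = io.StringIO(
    "fam1\tchild1\tdad1\tmum1\t1\t2\n"
    "fam1\tdad1\t0\t0\t1\t1\n"
    "fam1\tmum1\t0\t0\t2\t1\n"
)
trio = read_ped(example)
affected = [r["individual_id"] for r in trio if r["phenotype"] == "2"]
```

A parent id of `0` marks a founder whose parent is not in the file, so the two parents above carry `0` in both parent columns.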
### Edinburgh Genomics
A set of Nx2 FASTQ files, one per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.
### Sample id format
The sequencing reads for the samples delivered from EdGE are identified by folder name and by the 8th column of the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:
```
<pcr_plate_id>_<indiv_id>_<family_id><suffix>
```
The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.
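A minimal sketch of how such an identifier could be split into its parts. It assumes that the plate, individual and family ids contain no underscores and that the suffix, when present, begins with an underscore as in "_IDT-A"; neither assumption is stated by the source. The function name `parse_sample_id` and the example id are made up for illustration.

```python
import re

# Assumption: the three ids contain no underscores; the optional suffix starts with one.
SAMPLE_ID_RE = re.compile(
    r"^(?P<pcr_plate_id>[^_]+)_(?P<indiv_id>[^_]+)_(?P<family_id>[^_]+)(?P<suffix>_.+)?$"
)


def parse_sample_id(sample_id):
    """Split <pcr_plate_id>_<indiv_id>_<family_id><suffix> into a dict of its parts."""
    match = SAMPLE_ID_RE.match(sample_id)
    if match is None:
        raise ValueError(f"unrecognised sample id: {sample_id!r}")
    # A missing suffix is reported as an empty string.
    return match.groupdict(default="")


# Hypothetical identifier, not taken from a real batch.
parts = parse_sample_id("12345_678901_234567_IDT-A")
```

If real ids can contain underscores, the pattern would need to be anchored on known id lengths or on the list of valid suffixes instead.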
### Reads - Edinburgh Genomics
A set of paired-end FASTQ files (designated by R1 or R2 suffixes), possibly more than one pair per sample. Each sample's files are in its own folder. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.
```
<EdGE_project_id>/
@@ -47,29 +60,22 @@ A set of Nx2 FASTQ files, one per sample. The input files will be in the folder
| +---<dated_batch>_tree.txt
| +---Information.txt
```
*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – in general we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
#### Sample id format
### Reads - Edinburgh Clinical Research Facility
The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:
A set of paired-end FASTQ files (designated by R1 or R2 suffixes), generally one pair per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 2*.
```
<pcr_plate_id>_<indiv_id>_<family_id><suffix>
<EdGE_project_id>/
+---<internal_id_-md5.txt
+---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R1_001.fastq.gz
+---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R2_001.fastq.gz
+...
```
The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.
### Edinburgh Clinical Research Facility
A set of Nx2 FASTQ files, one per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 2*.
TODO
*Figure 2.* File name and directory structure for a batch of sequencing from the ECRF.
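The sketch below shows one way the ECRF file names could be grouped into R1/R2 pairs per sample. It assumes `<i>` is a number and that the lane field is numeric (as in `L001`); the function name `pair_fastqs` and the example file names are hypothetical, and files that do not match the pattern (such as the md5 listing) are simply ignored.

```python
import re
from collections import defaultdict

# Assumption: <i> and the lane field are numeric, e.g. _S1_L001_.
FASTQ_RE = re.compile(
    r"^(?P<sample_id>.+)_S(?P<index>\d+)_L(?P<lane>\d+)_R(?P<read>[12])_001\.fastq\.gz$"
)


def pair_fastqs(filenames):
    """Group FASTQ file names into (R1, R2) pairs keyed by sample id."""
    by_key = defaultdict(dict)
    for name in filenames:
        match = FASTQ_RE.match(name)
        if match is None:
            continue  # not a FASTQ file in the expected naming scheme
        key = (match["sample_id"], match["index"], match["lane"])
        by_key[key]["R" + match["read"]] = name

    pairs = defaultdict(list)
    for (sample_id, _index, _lane), reads in sorted(by_key.items()):
        if set(reads) != {"R1", "R2"}:
            raise ValueError(f"incomplete read pair for sample {sample_id}")
        pairs[sample_id].append((reads["R1"], reads["R2"]))
    return dict(pairs)


# Hypothetical directory listing, not taken from a real batch.
files = [
    "checksums-md5.txt",
    "12345_678901_234567_IDT-A_S1_L001_R1_001.fastq.gz",
    "12345_678901_234567_IDT-A_S1_L001_R2_001.fastq.gz",
]
pairs = pair_fastqs(files)
```

Raising on an incomplete pair is a deliberate choice in this sketch, so that a missing R2 file is noticed before alignment rather than after.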
#### Sample id format
TODO
## Working directories
The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in *Figure 3*.
@@ -227,7 +233,7 @@ perl /home/u035/project/scripts/trio_whole_exome_parse_peddy_ped_csv.pl \
--project $project_id \
--batch $batch_id \
--ped /scratch/u035/project/trio_whole_exome/analysis/params/$project_id.ped
grep -v False$ output/qc/$project_id.ped_check.txt
grep -v False$ qc/$project_id.ped_check.txt
```
7. Clear the work directory and move the log files to the complete sub-directory.