Documentation of running script for generating singletons from duos and prioritization order of families

c1cfc357 · ameyner2 · 55ef7859 · c1cfc357
Commit c1cfc357 authored 1 year ago by ameyner2
--- a/docs/SOP_alignment_variant_annotation.md
+++ b/docs/SOP_alignment_variant_annotation.md
@@ -198,7 +198,6 @@ sample_suffix=<sample_suffix>
 $SCRIPTS/trio_wes_prepare_bcbio_config.sh \
  $SCRIPTS/trio_whole_exome_config.sh \
  $project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
-X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
 ```

 *Edinburgh Clinical Research Facility data*
@@ -209,10 +208,25 @@ sample_suffix=<sample_suffix>
 $SCRIPTS/trio_wes_prepare_bcbio_config_crf.sh \
  $SCRIPTS/trio_whole_exome_config.sh \
  $project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
+```
+
+5. Manually sort the family ids text file using this priority order (duos with unaffected parent are added in the next step):
+
+  a. Urgent families
+  b. Singletons
+  c. Quads
+  d. Trios
+  e. Shared affected
+
+6. For any duos where the proband is to be analyzed as a singleton, move the current PED and config YAML files to have a 'duo' suffix, and extract the relevant info for singleton PED and config YAML files using the original file names. Singleton PED file should have '0' for both parent ids. Put the family ids (including batch id, format `<batch_id>_<family_id>`) for the duos into a file `$project_id.singleton_from_duo.txt`. Add the duo suffix family ids to the end of the family ids text file and set `X` to the number of families in that file.
+
+```
+$SCRIPTS/trio_wes_prepare_bcbio_config_singleton_from_duo.sh $SCRIPTS/trio_whole_exome_config.sh $project_id $version
+awk '{ print $0 "duo" }' $project_id.singleton_from_duo.txt >> $project_id.family_ids.txt
 X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
 ```

-5. Submit the bcbio jobs from the logs folder. See above for version.
+7. Submit the bcbio jobs from the logs folder. See above for version.

 ```
 cd $LOGS_DIR
@@ -222,14 +236,14 @@ sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=$SCRIPTS/trio_

 If all log files end in ‘Finished’ or ‘Storing in local filesystem’ for a metadata file (occasionally the job completes without quite outputting all of the ‘Storing’ messages), the batch is complete. If this is not the case, resubmit the incomplete jobs – they will resume where they left off.

-6. Check the output directory to make sure all family output folders were moved into the `families` subdirectory. This should happen automatically at the end of the `submit_bcbio_trio_wes.sh` script but occasionally fails.
+8. Check the output directory to make sure all family output folders were moved into the `families` subdirectory. This should happen automatically at the end of the `submit_bcbio_trio_wes.sh` script but occasionally fails.

 ```
 cd $OUTPUT_DIR/${short_project_id}_${version}
 mv *${short_project_id}* families/
 ```

-7. Generate a MultiQC report for all files in the batch.
+9. Generate a MultiQC report for all files in the batch.

 ```
 cd $OUTPUT_DIR/${short_project_id}_${version}/families
@@ -239,7 +253,7 @@ multiqc --title "Trio whole exome QC report: $short_project_id $version" \
  --filename ${short_project_id}_${version}_qc_report.html .
 ```

-8. Check the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the `<EdGE_project_id>.ped_check.txt` file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The `<batch_id>` is the 5 digit number that prefixes all the family ids in the output. Move to [SOP prioritization](SOP_prioritization.md).
+10. Check the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the `<EdGE_project_id>.ped_check.txt` file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The `<batch_id>` is the 5 digit number that prefixes all the family ids in the output. Move to [SOP prioritization](SOP_prioritization.md).

 ```
 cd $OUTPUT_DIR/${short_project_id}_${version}/families
@@ -252,7 +266,14 @@ perl $SCRIPTS/peddy_validation.pl \
 grep -v False$ ../qc/${short_project_id}_${version}.ped_check.txt
 ```

-9. Compress BAM files to CRAM and compare the two files. The output log files should be empty and the files <sample>.cram, <sample>.cram.crai, and <sample>.cram.flagstat.txt should be present for each sample.
+11. Get the QC summary metrics from the MultiQC report and add these to the QC Word document.
+
+```
+cd $OUTPUT_DIR/${short_project_id}_${version}/qc/${short_project_id}_${version}data
+Rscript $SCRIPTS/qc_metrics.R | less
+```
+
+12. Compress BAM files to CRAM and compare the two files. The output log files should be empty and the files <sample>.cram, <sample>.cram.crai, and <sample>.cram.flagstat.txt should be present for each sample.

 ```
 cd $LOGS_DIR
@@ -260,7 +281,7 @@ sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=$SCRIPTS/trio_
  --array=1-$X $SCRIPTS/submit_trio_wes_cram_compression.sh
 ```

-10. Calculate md5 checksums on the per-family files, excluding the BAM files. Creates the file `md5sum.txt` at the root of each family’s output directory. Check the files with the calculated md5sums. They should total 30 lines per sample plus 26 lines per family. The log files should be empty. When complete, move the family ids text file into the results folder for the project.
+13. Calculate md5 checksums on the per-family files, excluding the BAM files. Creates the file `md5sum.txt` at the root of each family’s output directory. Check the files with the calculated md5sums. They should total 30 lines per sample plus 26 lines per family. The log files should be empty. When complete, move the family ids text file into the results folder for the project.

 ```
 cd $LOGS_DIR
@@ -273,21 +294,21 @@ cd $PARAMS_DIR
 mv $project_id.family_ids.txt $OUTPUT_DIR/${short_project_id}_${version}/params/
 ```

-11. Wait for prioritization to be completed. Calculate md5 checksums on the remaining project files, excluding the `families` sub-directory. Creates the file `md5sum.txt` at the root of the project output directory.
+14. Wait for prioritization to be completed. Calculate md5 checksums on the remaining project files, excluding the `families` sub-directory. Creates the file `md5sum.txt` at the root of the project output directory.

 ```
 sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=$SCRIPTS/trio_whole_exome_config.sh \
  $SCRIPTS/submit_trio_wes_project_checksums.sh
 ```

-12. Remove the BAM files from the results.
+15. Remove the BAM files from the results.

 ```
 cd $OUTPUT_DIR/${short_project_id}_${version}
 rm families/*/*/*.bam*
 ```

-13. Clean up. Clear the work and logs directories. Move the bcbio YAML configuration files into the results folder for the project. Retain reads for samples in families where one sample has failed QC, using a list `retain\_for\_rerun.txt`. These will likely be required for later runs, and it is simpler to regenerate config YAML files if it is not necessary to re-do symlinks/read merging.
+16. Clean up. Clear the work and logs directories. Move the bcbio YAML configuration files into the results folder for the project. Retain reads for samples in families where one sample has failed QC, using a list `retain\_for\_rerun.txt`. These will likely be required for later runs, and it is simpler to regenerate config YAML files if it is not necessary to re-do symlinks/read merging.

 ```
 cd $WORK_DIR
@@ -303,4 +324,4 @@ mkdir -p $OUTPUT_DIR/${short_project_id}_${version}/config/
 mv $CONFIG_DIR/${short_project_id}_${version}*.yaml $OUTPUT_DIR/${short_project_id}_${version}/config/
 ```

-14. Update the batch status spreadsheet. 
+17. Update the batch status spreadsheet.
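
Step 6 in this change asks for a singleton PED file with '0' for both parent ids. The sketch below shows one way that extraction could look. It is not the contents of `trio_wes_prepare_bcbio_config_singleton_from_duo.sh`. It assumes a tab-delimited, six-column PED file (family id, individual id, paternal id, maternal id, sex, phenotype) in which the proband is the only row with phenotype `2`. The file names and ids are hypothetical.

```shell
# Work in a scratch directory so no real PED files are touched
workdir=$(mktemp -d)

# Hypothetical duo PED: an affected proband and an unaffected mother
printf '12345_000001\tproband01\t0\tmother01\t1\t2\n'  > "$workdir/example_duo.ped"
printf '12345_000001\tmother01\t0\t0\t2\t1\n'         >> "$workdir/example_duo.ped"

# Keep only the affected individual and set both parent ids to 0
awk 'BEGIN { FS = OFS = "\t" } $6 == 2 { $3 = 0; $4 = 0; print }' \
  "$workdir/example_duo.ped" > "$workdir/example_singleton.ped"

cat "$workdir/example_singleton.ped"
```

In the SOP itself the original duo files are first renamed with the 'duo' suffix, so a singleton file produced this way would take over the original file name.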