diff --git a/docs/SOP_reanalysis_preparation.md b/docs/SOP_reanalysis_preparation.md
new file mode 100644
index 0000000000000000000000000000000000000000..b0f0b78b63867067657afd82c73a5e40eead0c31
--- /dev/null
+++ b/docs/SOP_reanalysis_preparation.md
@@ -0,0 +1,136 @@
+# Standard operating procedure - Preparation for re-analysis of previously analyzed trio whole exome samples at the Edinburgh Parallel Computing Centre
+
+This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE) or the Edinburgh Clinical Research Facility (ECRF). It assumes that these families have previously been successfully analyzed.
+
+The Definitions, Software and data requirements, PED file, and Working directories sections are the same as in [SOP alignment and variant calling](SOP_alignment_variant_annotation.md).
+
+## Project configuration
+
+A [configuration script](../trio_whole_exome_variant_calling_config.sh) sets environment variables common to scripts used in this SOP.
+
+## Template for bcbio configuration
+
+Bcbio requires a [template file in YAML format](../trio_whole_exome_bcbio_variant_calling_template.yaml) to define the procedures run in the pipeline.
+
+## Output
+
+Per family: VCF file, CRAM files symlinked from previous analysis. Structure is otherwise the same as in [SOP alignment and variant calling](SOP_alignment_variant_annotation.md).
+
+## Procedure - re-running variant calling for projects where variant calling configuration has changed since the previous analysis
+
+1. Set the environment variable `project_id` and the general configuration variables. All steps below assume that these have been set. `old_version` is the version of the previous analysis; `new_version` is `old_version` incremented by 1.
+
+```
+project_id=<project_id>
+short_project_id=`echo $project_id | cut -f 1 -d '_'`
+old_version=<old_version>
+new_version=<new_version>
+source /home/u035/u035/shared/scripts/bin/trio_whole_exome_variant_calling_config.sh
+```
+
+2. Configure bcbio to run variant calling from the existing CRAM files.
+
+```
+cd $PARAMS_DIR
+$SCRIPTS/trio_wes_prepare_variant_calling_bcbio_config.sh \
+  $SCRIPTS/trio_whole_exome_variant_calling_config.sh $project_id $old_version $new_version \
+  &> ${project_id}_${new_version}_`date +%Y%m%d%H%M`.log
+# X is the number of families; it sets the size of the job array in step 3
+X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
+```
+
+3. Run variant calling, submitting the bcbio jobs from the logs folder.
+
+```
+cd $LOGS_DIR
+sbatch --export=PROJECT_ID=$project_id,VERSION=$new_version,CONFIG_SH=$SCRIPTS/trio_whole_exome_config.sh \
+  --array=1-$X $SCRIPTS/submit_trio_wes_bcbio.sh
+```
+
+4. On completion, symlink the CRAM files from the old version folder into the new version folder.
+
+```
+TODO
+```
+
+5. Clean up.
+
+```
+# ${VAR:?} aborts the command if a variable is unset or empty, so the globs can never expand from /
+rm -r "${WORK_DIR:?}"/* "${LOGS_DIR:?}"/* "${PARAMS_DIR:?}"/* "${CONFIG_DIR:?}"/*
+```
+
+## Procedure - preparation of the reanalysis file structure
+
+1. Create the basic file structure.
+
+```
+cd $OUTPUT_DIR
+mkdir `date +%Y%m%d`_reanalysis
+cd `date +%Y%m%d`_reanalysis
+mkdir families params prioritization
+chmod -R g+w *
+```
+
+2. List the project results folders to be used in `params/projects.txt`, e.g.
+
+```
+26960_v2
+27430_v1
+```
+
+3. Link the family folders from previous runs.
+
+```
+cd families
+for project in `cat ../params/projects.txt`
+do
+  for family in `ls ../../$project/families`
+  do
+    ln -s ../../$project/families/$family $family
+  done
+done
+```
+
+4. Create a tab-delimited text file of the family ids and identify families with multiple entries in the list. For each of these families, remove all but the most recent run.
+
+```
+# One row per family folder, with the underscore-separated name fields split into tab-separated columns
+ls | sed -e 's/\_/\t/g' > ../params/families.txt
+
+cd ../params
+# Family ids (column 5) that occur more than once
+cut -f 5 families.txt | sort | uniq -d > families_multiple_entries.txt
+
+for family in `cat families_multiple_entries.txt`
+do
+  grep $family families.txt | sort | sed -e 's/\t/\_/g' > $family.txt
+  count=`grep -c $family families.txt`
+  count=$((count-1))
+  # Record every entry for removal except the last in sort order, which is kept as the most recent run
+  for ((i = 1; i <= $count; i = i + 1))
+  do
+    family_entry=`head -n $i $family.txt | tail -n 1`
+    echo $family_entry >> family_entries_removed_from_reanalysis.txt
+  done
+  rm $family.txt
+done
+
+cd ../families
+rm `cat ../params/family_entries_removed_from_reanalysis.txt`
+ls | grep -v trio | grep -v duo | sed -e 's/\_/\t/g' | cut -f 2-5 | sed -e 's/\tv/_v/' > ../params/all.txt
+```
+
+5. Copy all the relevant PED files into the params folder.
+
+```
+cd ../params
+count=`wc -l all.txt | awk '{ print $1 }'`
+for ((i = 1; i <= $count; i = i + 1))
+do
+  project=`head -n $i all.txt | tail -n 1 | cut -f 1`
+  family=`head -n $i all.txt | tail -n 1 | cut -f 3`
+
+  cp ../../$project/params/*$family*.ped ./
+done
+```
+
+6. Make folders for the relevant groups and sort the families based on their PED files.
+
+```
+mkdir trio singleton quad shared_affected trio_with_affected_parent
+TODO
+```
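
Note on step 6 of the reanalysis file structure procedure, whose sorting logic is still marked TODO: the SOP does not define how a PED file maps to each group, so the sketch below does not attempt the sorting itself. It only summarizes a PED file as a possible starting point. It assumes the standard six-column PED layout (family, individual, father, mother, sex, phenotype, with phenotype `2` meaning affected). The function name `summarize_ped` and the output columns are hypothetical choices, not part of the pipeline.

```shell
# Print one tab-separated line for a PED file:
#   family id, number of individuals, number affected, number of affected parents.
# Assumes standard PED columns: family, individual, father, mother, sex, phenotype.
# summarize_ped is a hypothetical helper, not an existing pipeline script.
summarize_ped() {
  awk '
    # Skip comment lines and rows with fewer than six columns
    !/^#/ && NF >= 6 {
      n++
      fam = $1
      # Phenotype 2 marks an affected individual
      if ($6 == 2) { affected[$2] = 1; n_aff++ }
      # A father or mother id other than 0 names a parent
      if ($3 != "0") parent[$3] = 1
      if ($4 != "0") parent[$4] = 1
    }
    END {
      # Count parents that also have an affected row of their own
      for (id in parent) if (id in affected) n_aff_par++
      printf "%s\t%d\t%d\t%d\n", fam, n, n_aff, n_aff_par
    }
  ' "$1"
}
```

Running `for ped in *.ped; do summarize_ped "$ped"; done` in the params folder would give one line per family. Which combination of counts belongs in `trio`, `singleton`, `quad`, `shared_affected` or `trio_with_affected_parent` remains a decision for whoever completes the TODO.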