_ultra2 docs moved to main copy

Commit 7fe258a1 authored 3 years ago by not populated not populated
--- a/docs/SOP_alignment_variant_annotation.md
+++ b/docs/SOP_alignment_variant_annotation.md
 # Standard operating procedure - Alignment, variant calling, and annotation of trio whole exome samples at the Edinburgh Parallel Computing Centre

-This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE) or the Edinburgh Clinical Research Facility (ECRF). It assumes that data has been successfully transferred to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.
+This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE) or the Edinburgh Clinical Research Facility (ECRF). It assumes that data has been successfully transferred to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh gitlab server `gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome`. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.

 ## Definitions

 In this document, N is the total number of samples in the project, and X is the number of families.

-Text in angle brackets, e.g. <project> indicates variable parameters. A variable parameter such as <family1-X> indicates that there are X instances of the parameter, each with their own unique value.
+Text in angle brackets, e.g. `<project>` indicates variable parameters. A variable parameter such as `<family1-X>` indicates that there are X instances of the parameter, each with their own unique value.

 ## Software and data requirements

-The analysis is run with the bcbio pipeline (version 1.2.3) located at /home/u035/project/software/bcbio. All genome reference and annotation data resources are contained within the genomes/Hsapiens/hg38 subfolder.
+The analysis is run with the bcbio pipeline (version 1.2.8) located at `/home/u035/u035/shared/software/bcbio`. All genome reference and annotation data resources are contained within the `genomes/Hsapiens/hg38` subfolder.

-The TWIST target BED file is at: /home/u035/project/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
-
-To generate the target BED file, first copy the file Twist_Exome_RefSeq_targets_hg38.bed from NHS Clinical Genetics Services to /home/u035/project/resources on ultra, then pad it by 15bp each side.
-
-```
-cd /home/u035/project/resources
-source ../scripts/trio_whole_exome_config.sh
-
-bedtools slop -g $REFERENCE_GENOME.fai -i Twist_Exome_RefSeq_targets_hg38.bed -b 15 | \
-  bedtools merge > Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
-```
+The TWIST target BED file is at: `/home/u035/u035/shared/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed`. See [resources](https://git.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome/blob/master/docs/Resources_ultra2.md).

 ## Input

 ### PED file

-A 6-column tab-delimited PED/FAM format file (https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.
+A 6-column tab-delimited [PED/FAM format file](https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.


 ### Sample id format
@@ -39,35 +29,32 @@ The sequencing reads for the samples delivered from EdGE are identified by folde
 <pcr_plate_id>_<indiv_id>_<family_id><suffix>
 ```

-The suffix identifies the exome kit, e.g. "_IDT-A". These identifiers are referenced below in the output file structure.
+The suffix identifies the exome kit, e.g. `_WESTwist_IDT-A`. These identifiers are referenced below in the output file structure.

 ### Reads - Edinburgh Genomics

-A set of paired end FASTQ files (designated by R1 or R2 suffixes), possibly more than one pair per sample. Each sample's files are in its own folder. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 1*.
+A set of paired end FASTQ files (designated by R1 or R2 suffixes), possibly more than one pair per sample. Each sample's files are in its own folder. The input files will be in the folder `/home/u035/u035/shared/data` and follow the structure in *Figure 1*. Older deliveries contained the `<dated_batch>` folder within a `raw_data` folder.

 ```
 <EdGE_project_id>/
+  +---<dated_batch>/
+  |   +---<sample_id>/
+  |   |   +---*.fastq.count
+  |   |   +---*.fastq.gz
+  |   +---file_list.tsv
+  |   +---md5sums.txt
+  +---<dated_batch>_tree.txt
+  +---Information.txt
  +---md5_check.txt
-  +---raw_data/
-  |   +---<dated_batch>/
-  |   |   +---<EdGE_sample_id>/
-  |   |   |   +---<fastq_id>_R1.fastq.count
-  |   |   |   +---<fastq_id>_R1.fastq.gz
-  |   |   |   +---<fastq_id>_R2.fastq.count
-  |   |   |   +---<fastq_id>_R2.fastq.gz
-  |   |   +---file_list.tsv
-  |   |   +---md5sums.txt
-  |   +---<dated_batch>_tree.txt
-  |   +---Information.txt
-```
-*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format XXXXX_Lastname_Firstname, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format yyyymmdd – in general we expect there to be only one of these per EdGE project id. The FASTQ file id relates to the sequencing run information and does not contain any information about the sample itself.
+```
+*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format `XXXXX_Lastname_Firstname`, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format `yyyymmdd` – in general we expect there to be only one of these per EdGE project id. The FASTQ file names relate to the sequencing run information and do not contain any information about the sample itself.

 ### Reads - Edinburgh Clinical Research Facility

-A set of paired end FASTQ files (designated by R1 or R2 suffixes), generally one pair per sample. The input files will be in the folder /scratch/u035/project/trio_whole_exome/data and follow the structure in *Figure 2*.
+A set of paired end FASTQ files (designated by R1 or R2 suffixes), generally one pair per sample. The input files will be in the folder `/home/u035/u035/shared/data` and follow the structure in *Figure 2*.

 ```
-<EdGE_project_id>/
+<ECRF_project_id>/
  +---<internal_id_-md5.txt
  +---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R1_001.fastq.gz
  +---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R2_001.fastq.gz
@@ -78,81 +65,34 @@ A set of paired end FASTQ files (designated by R1 or R2 suffixes), generally one

 ## Working directories

-The project working directories will be in the folder /scratch/u035/project/trio_whole_exome/analysis and follow the structure in *Figure 3*.
+The project working directories will be in the folder `/home/u035/u035/shared/analysis` and follow the structure in *Figure 3*.

 ```
    config – bcbio configuration files in YAML format
    logs – PBS job submission log files
-    output – output to be passed to variant prioritization and archiving
    params – parameters for PBS job submission
-    reads – symlinks to input FASTQ files
+    reads – symlinks/merged versions of input FASTQ files
    work – bcbio working folder
 ```
 *Figure 3.* Project working directories.

 ## Project configuration

-A configuration script sets environment variables common to scripts used in this SOP. This is stored at /home/u035/project/scripts/trio_whole_exome_config.sh.
-
-```
-#!/usr/bin/bash
-#
-# Basic configuration options for trio WES pipeline
-#
-
-SCRIPTS=/home/u035/project/scripts
-BCBIO_TEMPLATE=$SCRIPTS/trio_whole_exome_bcbio_template.yaml
-TARGET=/home/u035/project/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed
-DOWNLOAD_DIR=/scratch/u035/project/trio_whole_exome/data
-REFERENCE_GENOME=/home/u035/project/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
-
-BASE=/scratch/u035/project/trio_whole_exome/analysis
-PARAMS_DIR=$BASE/params
-READS_DIR=$BASE/reads
-CONFIG_DIR=$BASE/config
-WORK_DIR=$BASE/work
-OUTPUT_DIR=$BASE/output
-
-ARCHIVE_DIR=/archive/u035/trio_whole_exome
-
-export PATH=/home/u035/project/software/bcbio/tools/bin:$PATH
-````
+A [configuration script](../trio_whole_exome_config.sh) sets environment variables common to scripts used in this SOP.

 ## Template for bcbio configuration

-Bcbio requires a template file in YAML format to define the procedures run in the pipeline. The template for this project is stored at /home/u035/project/scripts/trio_whole_exome_bcbio_template.yaml.
-
-```
-details:
- algorithm:
-    platform: illumina
-    quality_format: standard
-    aligner: bwa
-    mark_duplicates: true
-    realign: false
-    recalibrate: true
-    effects: vep
-    effects_transcripts: all
-    variantcaller: gatk-haplotype
-    indelcaller: false
-    remove_lcr: true
-    tools_on:
-    - vep_splicesite_annotations
-  analysis: variant2
-  genome_build: hg38
-upload:
-  dir: /scratch/u035/project/trio_whole_exome/analysis/output
-```
+Bcbio requires a [template file in YAML format](../trio_whole_exome_bcbio_template.yaml) to define the procedures run in the pipeline.

 ## Output

 Per sample: BAM file of aligned reads against the hg38 genome assembly
 Per family: Annotated VCF file and QC report

-Output will be in the folder /scratch/u035/project/trio_whole_exome/analysis/output and follow the structure in *Figure 4* (with multiple instances of the indiv_id sub directories, one per sequenced family member.). The qc sub-directories are not enumerated, and automatically generated index files are not listed for brevity. An additional directory at the root of the output folder called “qc” will contain the MultiQC reports generated for an entire batch.
+Output will be in the folder `/home/u035/u035/shared/results/<version>_<short_project_id>`, where `<short_project_id>` is the numeric prefix of `<project_id>`, and will follow the structure in *Figure 4* (with multiple instances of the `<indiv_id>` sub-directories, one per sequenced family member). The qc sub-directories are not enumerated, and automatically generated index files are not listed for brevity. An additional directory at the root of each project/version output folder called “qc” will contain the MultiQC report generated for an entire batch.

 ```
-<analysis_date>_<EdGE_project_id>_<pcr_plate_id>_<family_id>/
+<analysis_date>_<project_id>_<pcr_plate_id>_<family_id>/
  +---<indiv_id>_<family_id>/
  |   +---<indiv_id>_<family_id>-callable.bed
  |   +---<indiv_id>_<family_id>-ready.bam
@@ -178,16 +118,16 @@ Output will be in the folder /scratch/u035/project/trio_whole_exome/analysis/out
 1. Set environment variable project_id and general configuration variables.

 ```
-project_id=<EdGE_project_id>
-source /home/u035/project/scripts/trio_whole_exome_config.sh
+project_id=<project_id>
+source /home/u035/u035/shared/scripts/trio_whole_exome_config.sh
 ```

-2. Copy the PED file for the batch to the params folder in the working area. It should be named <EdGE_project_id>.ped, relating it to the input directory for the FASTQ files. If the PED file given was not named in this way, don’t rename it, create a symlink with the correct name.
+2. Copy the PED file for the batch to the params folder in the working area. It should be named <project_id>.ped, relating it to the input directory for the FASTQ files. If the PED file given was not named in this way, don’t rename it, copy it instead.

 ```
 cd $PARAMS_DIR
 ped_file=<input_ped_file>
-ln -s $ped_file $project_id.ped
+cp $ped_file $project_id.ped
 ```

 3. In the params folder, create the symlinks to the reads and the bcbio configuration files. If specifying a common sample suffix, ensure it includes any joining characters, e.g. “-“ or “_”, so that the family identifier can be cleanly separated from the suffix. Get the number of families from the batch. Version should be "v1" by default for the first analysis run of a batch, "v2" etc for subsequent runs.
@@ -199,9 +139,9 @@ ln -s $ped_file $project_id.ped
 cd $PARAMS_DIR
 version=<version>
 sample_suffix=<sample_suffix>
-/home/u035/project/scripts/prepare_bcbio_config.sh \
-  /home/u035/project/scripts/trio_whole_exome_config.sh \
-  $project_id $version $sample_suffix &> ${version}_${project_id}.log
+/home/u035/u035/shared/scripts/prepare_bcbio_config.sh \
+  /home/u035/u035/shared/scripts/trio_whole_exome_config.sh \
+  $project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
 X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
 ```

@@ -211,65 +151,77 @@ X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
 cd $PARAMS_DIR
 version=<version>
 sample_suffix=<sample_suffix>
-/home/u035/project/scripts/prepare_bcbio_config_crf.sh \
-  /home/u035/project/scripts/trio_whole_exome_crf_config.sh \
-  $project_id $version $sample_suffix &> ${version}_${project_id}.log
+/home/u035/u035/shared/scripts/prepare_bcbio_config_crf.sh \
+  /home/u035/u035/shared/scripts/trio_whole_exome_crf_config.sh \
+  $project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
 X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
 ```

 4. Submit the bcbio jobs from the logs folder. See above for version.

 ```
-cd /home/u035/project/trio_whole_exome/analysis/logs
-qsub -v PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=/home/u035/project/scripts/trio_whole_exome_config.sh \
-  -J 1-$X -N trio_whole_exome_bcbio.$project_id \
-  /home/u035/project/scripts/submit_bcbio_trio_wes.sh
+cd /home/u035/u035/shared/analysis/logs
+sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=/home/u035/u035/shared/scripts/trio_whole_exome_config.sh \
+  --array=1-$X --job-name=trio_whole_exome_bcbio.$project_id \
+  /home/u035/u035/shared/scripts/submit_bcbio_trio_wes.sh
 ```

 If all log files end in ‘Finished’ or ‘Storing in local filesystem’ for a metadata file (occasionally the job completes without quite outputting all of the ‘Storing’ messages), the batch is complete. If this is not the case, resubmit the incomplete jobs – they will resume where they left off.

-5. Generate a MultiQC report for all files in the batch.
+5. Clean up the output directory.

 ```
-source /home/u035/project/scripts/trio_whole_exome_config.sh
-cd /scratch/u035/project/trio_whole_exome/analysis/output
-/home/u035/project/software/bcbio/anaconda/bin/multiqc --title "Trio whole exome QC report: $project_id" \
-  --outdir qc \
+cd /home/u035/u035/shared/results
+short_project_id=`echo $project_id | cut -f 1 -d '_'`
+mkdir ${version}_${short_project_id}
+mv *${version}_${project_id}* ${version}_${short_project_id}/
+```
+
+6. Generate a MultiQC report for all files in the batch.
+
+```
+source /home/u035/u035/shared/scripts/trio_whole_exome_config.sh
+short_project_id=`echo $project_id | cut -f 1 -d '_'`
+
+cd /home/u035/u035/shared/results
+/home/u035/u035/shared/software/bcbio/anaconda/bin/multiqc --title "Trio whole exome QC report: $project_id" \
+  --outdir ${version}_${short_project_id}/qc \
  --filename ${version}_${project_id}_qc_report.html \
-  *$version*$project_id*
+  ${version}_${short_project_id}
 ```

-6. Check the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the <EdGE_project_id>.ped_check.txt file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The batch id is the 5 digit number that prefixes all the family ids in the output.
+7. Check that the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the `<version>_<project_id>.ped_check.txt` file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The batch id (`$batch_id` in the command below) is the 5 digit number that prefixes all the family ids in the output.

 ```
-cd /scratch/u035/project/trio_whole_exome/analysis/output
-perl /home/u035/project/scripts/trio_whole_exome_parse_peddy_ped_csv.pl \
-  --output /scratch/u035/project/trio_whole_exome/analysis/output \
+cd /home/u035/u035/shared/results
+short_project_id=`echo $project_id | cut -f 1 -d '_'`
+
+perl /home/u035/u035/shared/scripts/trio_whole_exome_parse_peddy_ped_csv.pl \
+  --output /home/u035/u035/shared/results/${version}_${short_project_id}/qc \
  --project $project_id \
  --batch $batch_id \
  --version $version \
-  --ped /scratch/u035/project/trio_whole_exome/analysis/params/$project_id.ped
-grep -v False$ qc/${version}_$project_id.ped_check.txt
+  --ped /home/u035/u035/shared/analysis/params/$project_id.ped
+grep -v 'True$' ${version}_${short_project_id}/qc/${version}_${project_id}.ped_check.txt
 ```

-7. Clean up the output directory.
+8. Clear the work directory and move the log files to the complete sub-directory.

 ```
-cd /home/u035/project/trio_whole_exome/
-mkdir ${version}_${project_id}
-mv *${version}_${project_id}* ${version}_${project_id}/
+cd /home/u035/u035/shared/analysis/work
+rm -r *
+cd /home/u035/u035/shared/analysis/logs
+mv trio_whole_exome_bcbio.$project_id* complete/
 ```

-8. Clear the work directory and move the log files to the complete sub-directory.
+9. Clean up the reads directory. Retain reads for samples in families where one sample has failed QC, using a list `retain_for_rerun.txt`. These will likely be required for later runs, and it is simpler to regenerate config YAML files if it is not necessary to re-do symlinks/read merging.

 ```
-cd /scratch/u035/project/trio_whole_exome/work
-rm -r *
-cd /home/u035/project/trio_whole_exome/logs
-mv trio_whole_exome_bcbio.$project_id* complete/
+cd /home/u035/u035/shared/analysis/reads/${project_id}
+rm `ls | grep -v -f retain_for_rerun.txt`
 ```

-9. Copy the MultiQC report to the IGMM-VariantAnalysis area on the IGMM datastore.
+10. Copy the MultiQC report to the IGMM-VariantAnalysis area on the IGMM datastore.

 ```
 ssh eddie3.ecdf.ed.ac.uk
@@ -279,5 +231,5 @@ cd /exports/igmm/datastore/IGMM-VariantAnalysis/documentation/trio_whole_exome/q
 user=<ultra_user_id>
 project_id=<EdGE_project_id>

-scp $user@ultra.epcc.ed.ac.uk:/scratch/u035/project/trio_whole_exome/analysis/output/qc/${version}_${project_id}_qc_report.html ./
+version=<version>
+short_project_id=`echo $project_id | cut -f 1 -d '_'`
+
+scp $user@sdf-cs1.epcc.ed.ac.uk:/home/u035/u035/shared/results/${version}_${short_project_id}/qc/${version}_${project_id}_qc_report.html ./
 ```
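
The completion check described in step 4 of the SOP (every log should end in ‘Finished’ or ‘Storing in local filesystem’) can be scripted rather than done by eye. The following is a minimal sketch, not part of the repository scripts: `incomplete_logs` is a hypothetical helper, and it assumes that the marker appears on the last line of each log and that the log file names start with the job name used in step 8.

```shell
# Sketch: list bcbio job logs whose last line shows neither completion marker.
# The two markers come from the SOP text; everything else is an assumption.

# Print the names of the given log files that do not look complete.
incomplete_logs() {
  local log
  for log in "$@"; do
    # Skip anything that is not a regular file (e.g. an unmatched glob).
    [ -f "$log" ] || continue
    # A log counts as complete if its last line contains either marker.
    if ! tail -n 1 "$log" | grep -q -e 'Finished' -e 'Storing in local filesystem'; then
      echo "$log"
    fi
  done
}

# Hypothetical usage from the logs folder (file name pattern is assumed):
#   incomplete_logs trio_whole_exome_bcbio."$project_id"*
```

Any file names printed correspond to jobs that may need to be resubmitted. An empty result suggests the batch is complete, subject to the assumption about where the marker appears.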
--- a/docs/SOP_alignment_variant_annotation_ultra2.md
+++ b/docs/SOP_alignment_variant_annotation_ultra2.md
-# Standard operating procedure - Alignment, variant calling, and annotation of trio whole exome samples at the Edinburgh Parallel Computing Centre
-
-This SOP applies to batches of family/trio samples where trio whole exome sequencing has been performed by Edinburgh Genomics (EdGE) or the Edinburgh Clinical Research Facility (ECRF). It assumes that data has been successfully transferred to the Edinburgh Parallel Computing Centre (EPCC) (see SOP: Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre). Scripts are version controlled on the University of Edinburgh gitlab server `gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome`. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.
-
-## Definitions
-
-In this document, N is the total number of samples in the project, and X is the number of families.
-
-Text in angle brackets, e.g. `<project>` indicates variable parameters. A variable parameter such as `<family1-X>` indicates that there are X instances of the parameter, each with their own unique value.
-
-## Software and data requirements
-
-The analysis is run with the bcbio pipeline (version 1.2.8) located at `/home/u035/u035/shared/software/bcbio`. All genome reference and annotation data resources are contained within the `genomes/Hsapiens/hg38` subfolder.
-
-The TWIST target BED file is at: `/home/u035/u035/shared/resources/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed`. See [resources](https://git.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome/blob/master/docs/Resources_ultra2.md).
-
-## Input
-
-### PED file
-
-A 6-column tab-delimited [PED/FAM format file](https://www.cog-genomics.org/plink2/formats#fam) is required for each batch, describing the relationships between the sampled individuals, their sex, and their affected/unaffected status.
-
-
-### Sample id format
-
-The sequencing reads for the samples delivered from EdGE are identified by folder name and as the 8th column in the tab-delimited text file file_list.tsv inside the dated batch folder. The identifiers are in the format:
-
-```
-<pcr_plate_id>_<indiv_id>_<family_id><suffix>
-```
-
-The suffix identifies the exome kit, e.g. `_WESTwist_IDT-A`. These identifiers are referenced below in the output file structure.
-
-### Reads - Edinburgh Genomics
-
-A set of paired end FASTQ files (designated by R1 or R2 suffixes), possibly more than one pair per sample. Each sample's files are in its own folder. The input files will be in the folder `/home/u035/u035/shared/data` and follow the structure in *Figure 1*. Older deliveries contained the `<dated_batch>` folder within a `raw_data` folder.
-
-```
-<EdGE_project_id>/
-  +---<dated_batch>/
-  |   +---<sample_id>/
-  |   |   +---*.fastq.count
-  |   |   +---*.fastq.gz
-  |   +---file_list.tsv
-  |   +---md5sums.txt
-  +---<dated_batch>_tree.txt
-  +---Information.txt
-  +---md5_check.txt
-```
-*Figure 1.* File name and directory structure for a batch of sequencing from Edinburgh Genomics. The EdGE project id takes the format `XXXXX_Lastname_Firstname`, identifying the NHS staff member who submitted the samples for sequencing. The dated batch is in the format `yyyymmdd` – in general we expect there to be only one of these per EdGE project id. The FASTQ file names relate to the sequencing run information and do not contain any information about the sample itself.
-
-### Reads - Edinburgh Clinical Research Facility
-
-A set of paired end FASTQ files (designated by R1 or R2 suffixes), generally one pair per sample. The input files will be in the folder `/home/u035/u035/shared/data` and follow the structure in *Figure 2*.
-
-```
-<ECRF_project_id>/
-  +---<internal_id_-md5.txt
-  +---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R1_001.fastq.gz
-  +---<pcr_plate_id>_<indiv_id>_<family_id><suffix>_S<i>_L001_R2_001.fastq.gz
-  +...
-```
-
-*Figure 2.* File name and directory structure for a batch of sequencing from the ECRF.
-
-## Working directories
-
-The project working directories will be in the folder `/home/u035/u035/shared/analysis` and follow the structure in *Figure 3*.
-
-```
-    config – bcbio configuration files in YAML format
-    logs – PBS job submission log files
-    params – parameters for PBS job submission
-    reads – symlinks/merged versions of input FASTQ files
-    work – bcbio working folder
-```
-*Figure 3.* Project working directories.
-
-## Project configuration
-
-A [configuration script](../trio_whole_exome_config.sh) sets environment variables common to scripts used in this SOP.
-
-## Template for bcbio configuration
-
-Bcbio requires a [template file in YAML format](../trio_whole_exome_bcbio_template.yaml) to define the procedures run in the pipeline.
-
-## Output
-
-Per sample: BAM file of aligned reads against the hg38 genome assembly
-Per family: Annotated VCF file and QC report
-
-Output will be in the folder `/home/u035/u035/shared/results/<short_project_id>_<version>` where `<short_project_id>` is the numeric prefix of `<project_id>` and follow the structure in *Figure 4* (with multiple instances of the indiv_id sub directories, one per sequenced family member.). The qc sub-directories are not enumerated, and automatically generated index files are not listed for brevity. An additional directory at the root of each project/version output folder called “qc” will contain the MultiQC report generated for an entire batch.
-
-```
-<analysis_date>_<project_id>_<pcr_plate_id>_<family_id>/
-  +---<indiv_id>_<family_id>/
-  |   +---<indiv_id>_<family_id>-callable.bed
-  |   +---<indiv_id>_<family_id>-ready.bam
-  |   +---qc/
-  +---<pcr_plate>_<family_id>-gatk-haplotype-annotated.vcf.gz
-  +---bcbio-nextgen-commands.log
-  +---bcbio-nextgen.log
-  +---data_versions.csv
-  +---metadata.csv
-  +---multiqc/
-  |   +---list_files_final.txt
-  |   +---multiqc_config.yaml
-  |   +---multiqc_data/
-  |   +---multiqc_report.html
-  |   +---report/
-  +---programs.txt
-  +---project-summary.yaml
-```
-*Figure 4.* File name and output directory structure for each family in a batch of sequencing.
-
-## Procedure
-
-1. Set environment variable project_id and general configuration variables.
-
-```
-project_id=<project_id>
-source /home/u035/u035/shared/scripts/trio_whole_exome_config.sh
-```
-
-2. Copy the PED file for the batch to the params folder in the working area. It should be named <project_id>.ped, relating it to the input directory for the FASTQ files. If the PED file given was not named in this way, don’t rename it, copy it instead.
-
-```
-cd $PARAMS_DIR
-ped_file=<input_ped_file>
-cp $ped_file $project_id.ped
-```
-
-3. In the params folder, create the symlinks to the reads and the bcbio configuration files. If specifying a common sample suffix, ensure it includes any joining characters, e.g. “-“ or “_”, so that the family identifier can be cleanly separated from the suffix. Get the number of families from the batch. Version should be "v1" by default for the first analysis run of a batch, "v2" etc for subsequent runs.
-
-
-### Edinburgh Genomics data
-
-```
-cd $PARAMS_DIR
-version=<version>
-sample_suffix=<sample_suffix>
-/home/u035/u035/shared/scripts/prepare_bcbio_config.sh \
-  /home/u035/u035/shared/scripts/trio_whole_exome_config.sh \
-  $project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
-X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
-```
-
-### Edinburgh Clinical Research Facility data
-
-```
-cd $PARAMS_DIR
-version=<version>
-sample_suffix=<sample_suffix>
-/home/u035/u035/shared/scripts/prepare_bcbio_config_crf.sh \
-  /home/u035/u035/shared/scripts/trio_whole_exome_crf_config.sh \
-  $project_id $version $sample_suffix &> ${project_id}_${version}_`date +%Y%m%d%H%M`.log
-X=`wc -l $PARAMS_DIR/$project_id.family_ids.txt | awk '{print $1}'`
-```
-
-4. Submit the bcbio jobs from the logs folder. See above for version.
-
-```
-cd /home/u035/u035/shared/trio_whole_exome/analysis/logs
-sbatch --export=PROJECT_ID=$project_id,VERSION=$version,CONFIG_SH=/home/u035/u035/shared/scripts/trio_whole_exome_config.sh \
-  --array=1-$X --job-name=trio_whole_exome_bcbio.$project_id \
-  /home/u035/u035/shared/scripts/submit_bcbio_trio_wes.sh
-```
-
-If all log files end in ‘Finished’ or ‘Storing in local filesystem’ for a metadata file (occasionally the job completes without quite outputting all of the ‘Storing’ messages), the batch is complete. If this is not the case, resubmit the incomplete jobs – they will resume where they left off.
-
-5. Clean up the output directory.
-
-```
-cd /home/u035/u035/shared/results
-short_project_id=`echo $project_id | cut -f 1 -d '_'`
-mkdir ${version}_${short_project_id}
-mv *${version}_${project_id}* ${version}_${short_project_id}/
-```
-
-6. Generate a MultiQC report for all files in the batch.
-
-```
-source /home/u035/u035/shared/scripts/trio_whole_exome_config.sh
-short_project_id=`echo $project_id | cut -f 1 -d '_'`
-
-cd /home/u035/u035/shared/results
-/home/u035/u035/shared/software/bcbio/anaconda/bin/multiqc --title "Trio whole exome QC report: $project_id" \
-  --outdir ${short_version}_${project_id}/qc \
-  --filename ${version}_${project_id}_qc_report.html \
-  ${version}_${short_project_id}
-```
-
-7. Check the parent-child relationships predicted by peddy match the pedigree information. There should be no entries in the <EdGE_project_id>.ped_check.txt file that do not end in ‘True’. If there are, report these back to the NHS Clinical Scientist who generated the PED file for this batch. The batch id is the 5 digit number that prefixes all the family ids in the output.
-
-```
-cd /home/u035/u035/shared/results
-short_project_id=`echo $project_id | cut -f 1 -d '_'`
-
-perl /home/u035/u035/shared/scripts/trio_whole_exome_parse_peddy_ped_csv.pl \
-  --output /home/u035/u035/shared/results/${version}_${short_project_id}/qc \
-  --project $project_id \
-  --batch $batch_id \
-  --version $version \
-  --ped /home/u035/u035/shared/analysis/params/$project_id.ped
-grep -v False$ ${version}_${short_project_id}/qc/${version}_${project_id}.ped_check.txt
-```
-
-8. Clear the work directory and move the log files to the complete sub-directory.
-
-```
-cd /home/u035/u035/shared/analysis/work
-rm -r *
-cd /home/u035/u035/shared/analysis/logs
-mv trio_whole_exome_bcbio.$project_id* complete/
-```
-
-9. Clean up the reads directory. Retain reads for samples in families where one sample has failed QC, using a list `retain\_for\_rerun.txt`. These will likely be required for later runs, and it is simpler to regenerate config YAML files if it is not necessary to re-do symlinks/read merging.
-
-```
-cd /home/u035/u035/shared/analysis/reads/${project_id}
-rm `ls | grep -v -f retain_for_rerun.txt`
-```
-
-10. Copy the MultiQC report to the IGMM-VariantAnalysis area on the IGMM datastore.
-
-```
-ssh eddie3.ecdf.ed.ac.uk
-qlogin -q staging
-cd /exports/igmm/datastore/IGMM-VariantAnalysis/documentation/trio_whole_exome/qc
-
-user=<ultra_user_id>
-project_id=<EdGE_project_id>
-
-scp $user@sdf-cs1.epcc.ed.ac.uk:/home/u035/u035/shared/results/${version}_${project_id}/qc/${version}_${project_id}_qc_report.html ./
-```
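
The PED file requirement in the SOP (6 tab-delimited columns) lends itself to a quick pre-flight check before the file is copied into the params folder in step 2. The following is a minimal sketch under that single assumption. `check_ped_columns` is a hypothetical helper, not one of the repository scripts, and it checks only the column count, not the pedigree content.

```shell
# Sketch: report lines of a PED/FAM file that do not have exactly
# 6 tab-delimited fields. Exits non-zero if any such line is found.
check_ped_columns() {
  awk -F '\t' '
    NF != 6 { printf "line %d: %d fields\n", NR, NF; bad = 1 }
    END     { exit (bad ? 1 : 0) }
  ' "$1"
}

# Hypothetical usage:
#   check_ped_columns "$ped_file" && cp "$ped_file" "$project_id.ped"
```

A space-delimited file would be reported as having one field per line, which is the intended behaviour here because the SOP asks for tabs.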
--- a/docs/Software_installation.md
+++ b/docs/Software_installation.md
-# Installation of software for trio whole exome project
+# Ultra2 - Installation of software for trio whole exome project

 ## Aspera

-Downloaded Aspera Connect version 3.7.4.147727 from https://downloads.asperasoft.com and extracted to /home/u035/project/software.
+Downloaded the Aspera Connect version 3.9.6.1467 installer script from https://downloads.asperasoft.com to /home/u035/project/software/install and ran it. This installs the software in ~/.aspera, so it needs to be moved to the shared folder.
+
+```
+bash ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
+mv ~/.aspera ../aspera
+```

 ## bcbio

-Version 1.2.3 with some bugfixes from the dev branch as of 26 August 2020.
+Version 1.2.8 (14 April 2021).

 Start with installing the base software, and add datatargets.

-This will take a long time, and may require multiple runs if it fails on a step. It will resume if needed. Run on a screen session and log each attempt. It's important to set the limit on the number of concurrently open files to as high as possible (4096 on ultra).
+This will take a long time, and may require multiple runs if it fails on a step. It will resume if needed. Run it in a screen session and log each attempt. It's important to set the limit on the number of concurrently open files as high as possible (4096).

 ```
-cd /home/u035/project/software/install
+cd /home/u035/u035/shared/software/install
+mkdir bcbio_install_logs
+
 wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py

 ulimit -n 4096

 DATE=`date +%Y%m%d%H%M`
-python bcbio_nextgen_install.py /home/u035/project/software/bcbio \
-  --tooldir /home/u035/project/software/bcbio/tools \
+python3 bcbio_nextgen_install.py /home/u035/u035/shared/software/bcbio \
+  --tooldir /home/u035/u035/shared/software/bcbio/tools \
  --genomes hg38 --aligners bwa \
-  --cores 32 &> bcbio_install_base_${DATE}.log
-```
-
-Fix an issue with bcbio & vt/samtools/htslib. See https://github.com/bcbio/bcbio-nextgen/issues/3327 and https://github.com/bcbio/bcbio-nextgen/issues/3328.
-
-```
-DATE=`date +%Y%m%d%H%M`
-/home/u035/project/software/bcbio/tools/bin/bcbio_nextgen.py upgrade -u development --tools &> bcbio_install_upgrade_tools_${DATE}.log
+  --cores 128 &> bcbio_install_logs/bcbio_install_base_${DATE}.log
 ```

 Install datatarget vep

 ```
 DATE=`date +%Y%m%d%H%M`
-/home/u035/project/software/bcbio/tools/bin/bcbio_nextgen.py upgrade -u skip --datatarget vep &> bcbio_install_datatarget_vep_${DATE}.log
-```
-
-We already had gnomAD 3.0 compiled and downloaded from another bcbio installation, so this gets copied to /home/u035/project/software/bcbio/genomes/Hsapiens/hg38/variation. However, if needed, re-generate it like this. It will take about 6 days.
-
-```
-DATE=`date +%Y%m%d%H%M`
-/home/u035/project/software/bcbio/tools/bin/bcbio_nextgen.py upgrade -u skip --datatarget gnomad &> bcbio_install_datatarget_gnomad_${DATE}.log
+/home/u035/u035/shared/software/bcbio/tools/bin/bcbio_nextgen.py upgrade -u skip --datatarget vep &> bcbio_install_logs/bcbio_install_datatarget_vep_${DATE}.log
 ```

 Increase JVM memory for GATK in galaxy/bcbio_system.yaml

 ```
-  gatk:
-    jvm_opts: ["-Xms500m", "-Xmx5g"]
+  gatk:
+    jvm_opts: ["-Xms500m", "-Xmx5g"]
 ```

 ### Patch Ensembl VEP 100.4

 See https://github.com/Ensembl/ensembl-variation/pull/621/files

-Edit /home/u035/project/software/bcbio/anaconda/share/ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/BaseAnnotation.pm accordingly.
+Edit /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/BaseAnnotation.pm accordingly.

 ### Verifybamid custom panel for exomes

 ```
-source /home/u035/project/scripts/trio_whole_exome_config.sh
-
-mkdir /home/u035/project/software/install/1000G_phase3_hg38
-cd /home/u035/project/software/install/1000G_phase3_hg38
+mkdir /home/u035/u035/shared/software/install/1000G_phase3_hg38
+cd /home/u035/u035/shared/software/install/1000G_phase3_hg38

 # download the 1000 Genomes autosomes + X site VCFs
 for ((i = 1; i <= 22; i = i + 1))
@@ -75,83 +66,90 @@ do
 done
 wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz
 wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz.tbi
-cd ..

 # create bare to prefixed chromosome map
 for ((i = 1; i <= 22; i = i + 1))
 do
  echo $i "chr"$i >> chr_prefix_map.txt
 done
-echo chrX >> chr_prefix_map.txt
+echo X chrX >> chr_prefix_map.txt
+
+# add bcbio tools to path
+PATH=/home/u035/u035/shared/software/bcbio/tools/bin:/home/u035/u035/shared/software/bcbio/anaconda/share/verifybamid2-1.0.6-0:$PATH

 # use the TWIST kit to subset the variants and add the chr prefix at the same time
-for file in 1000G_phase3_hg38/*vcf.gz
+sed -e 's/chr//' ../../../resources/Twist_Exome_Target_hg38.bed > targets.bed
+for file in *phased.vcf.gz
 do
  bname=`basename $file`
-  bcftools view -R /home/u035/project/resources/Twist_Exome_Target_hg38.bed -m2 -M2 -v snps -i 'AF >= 0.01' $file | bcftools annotate --rename-chrs chr_prefix_map.txt | bgzip -c > ${bname%.vcf.gz}.biallelic.snps.m\
-inAF0.01.vcf.gz
+  bcftools view -R targets.bed -m2 -M2 -v snps -i 'AF >= 0.01' $file | bcftools annotate --rename-chrs chr_prefix_map.txt | bgzip -c > ${bname%.vcf.gz}.biallelic.snps.minAF0.01.vcf.gz
  tabix ${bname%.vcf.gz}.biallelic.snps.minAF0.01.vcf.gz
 done

 # concatenate all the files in the correct order
-bcftools concat -o ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz -O z \
-  ALL.chr[1-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz \
-  ALL.chr[12][0-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz \
-  ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz
-tabix ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz
+bcftools concat -o ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz -O z \
+  ALL.chr[1-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz \
+  ALL.chr[12][0-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz \
+  ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz
+tabix ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz

 # use VerifyBamID to create the new panel
-/home/u035/project/software/bcbio/anaconda/share/verifybamid2-1.0.6-0/VerifyBamID \
-  --RefVCF ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz
-  --Reference bcbio-1.1.5/genomes/Hsapiens/hg38/seq/hg38.fa
+VerifyBamID \
+  --RefVCF ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz \
+  --Reference ../../bcbio/genomes/Hsapiens/hg38/seq/hg38.fa

 # rename the files to the correct format
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.bed 1000g.phase3.100k.b38.vcf.gz.dat.bed
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.mu 1000g.phase3.100k.b38.vcf.gz.dat.mu
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.PC 1000g.phase3.100k.b38.vcf.gz.dat.V
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.chr.biallelic.snps.minAF0.01.vcf.gz.UD 1000g.phase3.100k.b38.vcf.gz.dat.UD
+mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.bed 1000g.phase3.100k.b38.vcf.gz.dat.bed
+mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.mu 1000g.phase3.100k.b38.vcf.gz.dat.mu
+mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.PC 1000g.phase3.100k.b38.vcf.gz.dat.V
+mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.UD 1000g.phase3.100k.b38.vcf.gz.dat.UD

 # move them into the correct location, backing up the original resource folder
-cd /home/u035/project/software/bcbio/anaconda/share/verifybamid2-1.0.6-0
+cd /home/u035/u035/shared/software/bcbio/anaconda/share/verifybamid2-1.0.6-0
 mv resource resource.bak
 mkdir resource
-mv /home/u035/project/software/install/1000G_phase3_hg38/1000g.phase3.100k.b38* resource/
+mv /home/u035/u035/shared/software/install/1000G_phase3_hg38/1000g.phase3.100k.b38* resource/
+
+# clean up intermediate files
+cd /home/u035/u035/shared/software/install
+rm -r 1000G_phase3_hg38
 ```

 ## Python modules

 ### VASE

-VASE v0.4 was installed 28 August 2020.
+VASE v0.4.2 was installed 18 August 2021.

 ```
-cd /home/u035/project/software
-./bcbio/anaconda/bin/pip3 install git+git://github.com/david-a-parry/vase.git#egg=project[BGZIP,REPORTER,MYGENE]
+cd /home/u035/u035/shared/software
+./bcbio/anaconda/bin/pip3 install git+git://github.com/david-a-parry/vase.git#egg=vase[BGZIP,REPORTER,MYGENE]
 ```

 ### XlsxWriter

-XlsxWriter 1.3.3 was installed 28 August 2020.
+XlsxWriter 3.0.1 was installed 18 August 2021.

 ```
-cd /home/u035/project/software
+cd /home/u035/u035/shared/software
 ./bcbio/anaconda/bin/pip3 install XlsxWriter
 ```

 ## GATK 3.8

 ```
-cd /home/u035/project/software/install
+cd /home/u035/u035/shared/software/install
 wget https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2
 bzip2 -d GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2 
 tar -xf GenomeAnalysisTK-3.8-0-ge9d806836.tar
 mv GenomeAnalysisTK-3.8-0-ge9d806836 ../GenomeAnalysisTK-3.8
+rm GenomeAnalysisTK-3.8-0-ge9d806836.tar
 ```

 ## RTG tools

 ```
-cd /home/u035/project/software
+cd /home/u035/u035/shared/software
 wget https://github.com/RealTimeGenomics/rtg-tools/releases/download/3.11/rtg-tools-3.11-linux-x64.zip
 unzip rtg-tools-3.11-linux-x64.zip
 rm rtg-tools-3.11-linux-x64.zip
@@ -160,9 +158,16 @@ rm rtg-tools-3.11-linux-x64.zip
 ## IGV

 ```
-cd /home/u035/project/software
+cd /home/u035/u035/shared/software
 wget https://data.broadinstitute.org/igv/projects/downloads/2.8/IGV_Linux_2.8.9.zip
 unzip IGV_Linux_2.8.9.zip
 rm IGV_Linux_2.8.9.zip
 ```

+## Emacs
+
+```
+cd /home/u035/u035/shared/software
+./bcbio/anaconda/bin/conda install emacs
+```
+
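Note on the chromosome map change in the hunk above: `bcftools annotate --rename-chrs` reads whitespace-separated "old_name new_name" pairs, which is why the X row becomes `X chrX`. The following is a minimal sketch, not part of the committed document, of how the map could be rebuilt and checked before use. The scratch directory and the variable names `workdir`, `map_file`, `n_rows` and `bad_rows` are assumptions made for illustration.

```shell
# Sketch only: rebuild the bare-to-prefixed chromosome map and check its shape.
# workdir is a hypothetical scratch location, not a path from the SOP.
workdir=$(mktemp -d)
map_file="$workdir/chr_prefix_map.txt"

# Truncate first so that a re-run does not append duplicate rows.
: > "$map_file"

# Autosomes 1-22: bare name in column 1, chr-prefixed name in column 2.
i=1
while [ "$i" -le 22 ]
do
  echo "$i chr$i" >> "$map_file"
  i=$((i + 1))
done

# The X row also needs both columns.
echo "X chrX" >> "$map_file"

# Count the rows and any row that does not have exactly two fields.
n_rows=$(wc -l < "$map_file" | tr -d ' ')
bad_rows=$(awk 'NF != 2' "$map_file" | wc -l | tr -d ' ')
echo "rows: $n_rows, malformed rows: $bad_rows"
```

Truncating the file before writing is a deliberate difference from the `>>` appends in the documented commands, which would duplicate rows if the block were run twice in the same directory.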
--- a/docs/Software_installation_ultra2.md
+++ b/docs/Software_installation_ultra2.md
-# Ultra2 - Installation of software for trio whole exome project
-
-## Aspera
-
-Downloaded Aspera Connect version 3.9.6.1467 installer script from https://downloads.asperasoft.com to /home/u035/project/software/install and run it. This installs the software in ~/.aspera, so it needs to be moved to the shared folder.
-
-```
-bash ibm-aspera-cli-3.9.6.1467.159c5b1-linux-64-release.sh
-mv ~/.aspera ../aspera
-```
-
-## bcbio
-
-Version 1.2.8 (14 April 2021).
-
-Start with installing the base software, and add datatargets.
-
-This will take a long time, and may require multiple runs if it fails on a step. It will resume if needed. Run on a screen session and log each attempt. It's important to set the limit on the number of concurrently open files to as high as possible (4096).
-
-```
-cd /home/u035/u035/shared/software/install
-mkdir bcbio_install_logs
-
-wget https://raw.github.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
-
-ulimit -n 4096
-
-DATE=`date +%Y%m%d%H%M`
-python3 bcbio_nextgen_install.py /home/u035/u035/shared/software/bcbio \
-  --tooldir /home/u035/u035/shared/software/bcbio/tools \
-  --genomes hg38 --aligners bwa \
-  --cores 128 &> bcbio_install_logs/bcbio_install_base_${DATE}.log
-```
-
-Install datatarget vep
-
-```
-DATE=`date +%Y%m%d%H%M`
-/home/u035/u035/shared/software/bcbio/tools/bin/bcbio_nextgen.py upgrade -u skip --datatarget vep &> bcbio_install_logs/bcbio_install_datatarget_vep_${DATE}.log
-```
-
-Increase JVM memory for GATK in galaxy/bcbio_system.yaml
-
-```
-  gatk:
-    jvm_opts: ["-Xms500m", "-Xmx5g"]
-```
-
-### Patch Ensembl VEP 100.4
-
-See https://github.com/Ensembl/ensembl-variation/pull/621/files
-
-Edit /home/u035/u035/shared/software/bcbio/anaconda/share/ensembl-vep-100.4-0/Bio/EnsEMBL/Variation/BaseAnnotation.pm accordingly.
-
-### Verifybamid custom panel for exomes
-
-```
-mkdir /home/u035/u035/shared/software/install/1000G_phase3_hg38
-cd /home/u035/u035/shared/software/install/1000G_phase3_hg38
-
-# download the 1000 Genomes autosomes + X site VCFs
-for ((i = 1; i <= 22; i = i + 1))
-do
-  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr${i}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz;
-  wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr${i}.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz.tbi
-done
-wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz
-wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz.tbi
-
-# create bare to prefixed chromosome map
-for ((i = 1; i <= 22; i = i + 1))
-do
-  echo $i "chr"$i >> chr_prefix_map.txt
-done
-echo X chrX >> chr_prefix_map.txt
-
-# add bcbio tools to path
-PATH=/home/u035/u035/shared/software/bcbio/tools/bin:/home/u035/u035/shared/software/bcbio/anaconda/share/verifybamid2-1.0.6-0:$PATH
-
-# use the TWIST kit to subset the variants and add the chr prefix at the same time
-sed -e 's/chr//' ../../../resources/Twist_Exome_Target_hg38.bed > targets.bed
-for file in *phased.vcf.gz
-do
-  bname=`basename $file`
-  bcftools view -R targets.bed -m2 -M2 -v snps -i 'AF >= 0.01' $file | bcftools annotate --rename-chrs chr_prefix_map.txt | bgzip -c > ${bname%.vcf.gz}.biallelic.snps.minAF0.01.vcf.gz
-  tabix ${bname%.vcf.gz}.biallelic.snps.minAF0.01.vcf.gz
-done
-
-# concatenate all the files in the correct order
-bcftools concat -o ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz -O z \
-  ALL.chr[1-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz \
-  ALL.chr[12][0-9].shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz \
-  ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz
-tabix ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz
-
-# use VerifyBamID to create the new panel
-VerifyBamID \
-  --RefVCF ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz \
-  --Reference ../../bcbio/genomes/Hsapiens/hg38/seq/hg38.fa
-
-# rename the files to the correct format
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.bed 1000g.phase3.100k.b38.vcf.gz.dat.bed
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.mu 1000g.phase3.100k.b38.vcf.gz.dat.mu
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.PC 1000g.phase3.100k.b38.vcf.gz.dat.V
-mv ALL.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.biallelic.snps.minAF0.01.vcf.gz.UD 1000g.phase3.100k.b38.vcf.gz.dat.UD
-
-# move them into the correct location, backing up the original resource folder
-cd /home/u035/u035/shared/software/bcbio/anaconda/share/verifybamid2-1.0.6-0
-mv resource resource.bak
-mkdir resource
-mv /home/u035/u035/shared/software/install/1000G_phase3_hg38/1000g.phase3.100k.b38* resource/
-
-# clean up intermediate files
-cd /home/u035/u035/shared/software/install
-rm -r 1000G_phase3_hg38
-```
-
-## Python modules
-
-### VASE
-
-VASE v0.4.2 was installed 18 August 2021.
-
-```
-cd /home/u035/u035/shared/software
-./bcbio/anaconda/bin/pip3 install git+git://github.com/david-a-parry/vase.git#egg=vase[BGZIP,REPORTER,MYGENE]
-```
-
-### XlsxWriter
-
-XlsxWriter 3.0.1 was installed 18 August 2021.
-
-```
-cd /home/u035/u035/shared/software
-./bcbio/anaconda/bin/pip3 install XlsxWriter
-```
-
-## GATK 3.8
-
-```
-cd /home/u035/u035/shared/software/install
-wget https://storage.googleapis.com/gatk-software/package-archive/gatk/GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2
-bzip2 -d GenomeAnalysisTK-3.8-0-ge9d806836.tar.bz2 
-tar -xf GenomeAnalysisTK-3.8-0-ge9d806836.tar
-mv GenomeAnalysisTK-3.8-0-ge9d806836 ../GenomeAnalysisTK-3.8
-rm GenomeAnalysisTK-3.8-0-ge9d806836.tar
-```
-
-## RTG tools
-
-```
-cd /home/u035/u035/shared/software
-wget https://github.com/RealTimeGenomics/rtg-tools/releases/download/3.11/rtg-tools-3.11-linux-x64.zip
-unzip rtg-tools-3.11-linux-x64.zip
-rm rtg-tools-3.11-linux-x64.zip
-```
-
-## IGV
-
-```
-cd /home/u035/u035/shared/software
-wget https://data.broadinstitute.org/igv/projects/downloads/2.8/IGV_Linux_2.8.9.zip
-unzip IGV_Linux_2.8.9.zip
-rm IGV_Linux_2.8.9.zip
-```
-
-## Emacs
-
-```
-cd /home/u035/u035/shared/software
-./bcbio/anaconda/bin/conda install emacs
-```
-