Test trio whole exome samples on eddie

assigned to @ameyner2

I am currently trying to run run_tests.sh on eddie, on a branch named emma-cnv that was created from master at commit 6d2af834. I have been documenting the steps taken and the problems encountered so far, which may be useful when it comes to testing real data.

Login and create environment

Steps to run on eddie
Log in to eddie: ssh [user]@eddie.ecdf.ed.ac.uk
Go to the shared directory: cd /exports/igmm/eddie/IGMM-VariantAnalysis/
Create a folder to work in
Clone the repository: git clone https://git.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome.git
Create a new branch: git branch [new branch]
Move to the new branch: git checkout [new branch]
In this case the new branch is called emma-cnv
Log in to an interactive session: qlogin -l h_vmem=32G. Loading modules requires an interactive session with at least 4G of memory; this command requests 32G.
Update: different sections of the pipeline require different resources, up to 16 CPUs. Request these for the interactive session: qlogin -l h_vmem=8G -pe interactivemem 16

Create the conda environment. You may need to configure or create .condarc, depending on whether this is your first time using conda. See https://www.wiki.ed.ac.uk/display/ResearchServices/Anaconda
This could temporarily be done in the scratch directory.
Mine is currently set up to point at my scratch space; this may need to change.
Load anaconda, which is needed to access conda: module load anaconda
Enter the conda environment: source activate trio-pipe-env

Create nextflow.config

Take the outline provided by Murray (see below) and hunt through eddie to find the equivalent files

params {
  // path to BCBio - should contain anaconda/bin/bcbio_nextgen.py
  bcbio = '/home/u035/u035/shared/software/bcbio'

  // this will just be `<pipeline_repo>/trio_whole_exome_parse_peddy_ped_csv.pl`. Won't need this once
  // Alison's merged some stuff
  parse_peddy_output = '/home/u035/u035/shared/testing/trio-whole-exome/trio_whole_exome_parse_peddy_ped_csv.pl'

  // base BCBio variant calling template
  bcbio_template = '/home/u035/u035/shared/testing/inputs/test_bcbio_template.yaml'

  // exome target BED file
  target_bed = '/home/u035/u035/shared/resources/exome_targets/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed'

  // HG38 reference genome
  reference_genome = '/home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa'

  // pipeline outputs
  output_dir = '/home/u035/u035/shared/testing/outputs'
}

executor {
  name = 'slurm'
  queue = 'standard'
}

The final eddie config looked as follows:


params {
  // path to BCBio - should contain anaconda/bin/bcbio_nextgen.py
  bcbio = '/exports/igmm/eddie/IGMM-VariantAnalysis/software/bcbio-1.0.9'

  // this will just be `<pipeline_repo>/trio_whole_exome_parse_peddy_ped_csv.pl`. Won't need this once
  // Alison's merged some stuff
  parse_peddy_output = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/trio_whole_exome_parse_peddy_ped_csv.pl'

  // base BCBio variant calling template
  bcbio_template = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/scripts/trio_whole_exome_bcbio_template.yaml'

  // exome target BED file
  target_bed = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/assets/input_data/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed'

  // HG38 reference genome
  reference_genome = '/exports/igmm/eddie/IGMM-VariantAnalysis/software/bcbio-1.0.9/genomes/Hsapiens/hg38/seq/hg38.fa'

  // pipeline outputs
  output_dir = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/outputs'
}

executor {
  // trying with sge as this is recommended by the research services
  name = 'sge'
  queue = 'standard'
}

Set $NEXTFLOW_CONFIG using export NEXTFLOW_CONFIG=[path to config file] and test with echo $NEXTFLOW_CONFIG
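
Since the config was built by hunting for equivalent files on eddie, it may help to confirm that every path in it resolves before launching the tests. The sketch below is not part of the pipeline: check_config_paths is a hypothetical helper, and it assumes each path is a single-quoted absolute path on a non-comment line, as in the configs above. It prints each path that does not exist and returns non-zero if any are missing.

```shell
# Hypothetical helper (not part of the pipeline): report paths in a
# Nextflow config that do not exist on this system.
# Assumes paths are single-quoted, absolute, and not on // comment lines.
check_config_paths() {
    config=$1
    if [ ! -f "$config" ]; then
        echo "config not found: $config" >&2
        return 2
    fi
    # Drop comment lines, pull out every '/...' value, strip the quotes,
    # then collect the ones that are missing on disk.
    missing=$(grep -v '^[[:space:]]*//' "$config" \
        | grep -o "'/[^']*'" \
        | tr -d "'" \
        | while IFS= read -r path; do
              [ -e "$path" ] || printf 'MISSING: %s\n' "$path"
          done)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}
```

For example, check_config_paths "$NEXTFLOW_CONFIG" could be run in the interactive session before ./run_tests.sh. Note that output_dir may legitimately not exist before the first run, so a MISSING line for it is not necessarily a problem.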

Run the tests with ./run_tests.sh

Changes made

PROBLEM: Can't find main.nf
SOLUTION: In run_tests.sh changed ../pipeline/main.nf to ../main.nf
PROBLEM: Starts running but params.workflow is required (variant-calling or variant-prioritisation)
SOLUTION: Added --workflow 'variant-calling' to the nextflow run command in run_tests.sh
PROBLEM: Unable to find bcbio config
SOLUTION: Added assets to the path passed to --bcbio_template in run_tests.sh
Also edit the file paths in the bcbio template to reflect the eddie environment
PROBLEM: The test file does not seem to be loading NEXTFLOW_CONFIG
SOLUTION: Added -c [path to NEXTFLOW_CONFIG] to the nextflow run command in run_tests.sh
PROBLEM: New error: process requirement exceed available CPUs -- req: 2; avail: 1
SOLUTION: Restarted the interactive session, requesting -pe interactivemem 4, which seems to have worked. Later 'large' processes may need more: either request more (16) from the start or potentially use a wild west node
PROBLEM: Error in var_calling:process_families:merge_fastqs. Running bash .command.run reveals:

gzip: 200922_A00001_0001_BHNTGMDMXX_3_00002AM0001L01_1.fastq.gz: unexpected end of file

gzip: 200922_A00001_0001_BHNTGMDMXX_3_00002AM0001L01_2.fastq.gz: unexpected end of file

SOLUTION: Asked Murray, who said these test files are empty, which explains why the pipeline fails. I am now moving on to run_giab_tests.sh, which uses a subsample of real data from the Genome in a Bottle reference data.

The data for run_giab_tests.sh can be downloaded with the giab.sh script in tests/assets/input_data/scripts/

Login process
Connect to eddie, set up interactive session, load anaconda module, activate conda environment

See the documentation on running Nextflow on eddie for fleshing out the configuration file: https://www.wiki.ed.ac.uk/pages/viewpage.action?spaceKey=ResearchServices&title=Bioinformatics

I have fleshed out the eddie.config file with the processes section of the example eddie config for running nextflow

I have also found that, even after running giab.sh with my job submission script, the pipeline still failed when trying to merge the fastq.gz files due to an unexpected end of file. On investigation, the raw data was being downloaded correctly but the resulting fastq.gz files were empty. Two issues contributed to this: the path to bazam.jar was incorrect, and the seqtk command was not found when run. To correct this I have fixed the paths so they work with my environment, and I now load the igmm/apps/seqtk module as part of my job submission script. This will hopefully now download the data correctly.
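
One way to catch empty or truncated fastq.gz files before the pipeline reaches merge_fastqs is to test them directly. The following is a minimal sketch, not something taken from the repository: check_fastq_gz is a hypothetical helper that flags zero-byte files and uses gzip -t to detect archives that end early, which is the condition behind the "unexpected end of file" message.

```shell
# Hypothetical helper (not part of the pipeline): classify each given
# file as OK, EMPTY (zero bytes) or CORRUPT (fails gzip's integrity test).
# Returns non-zero if any file is EMPTY or CORRUPT.
check_fastq_gz() {
    bad=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            # Missing or zero-byte file
            echo "EMPTY: $f"
            bad=1
        elif ! gzip -t "$f" 2>/dev/null; then
            # gzip -t reads the whole archive and fails on truncation
            echo "CORRUPT: $f"
            bad=1
        else
            echo "OK: $f"
        fi
    done
    return "$bad"
}
```

It could be called on the downloaded data once giab.sh finishes, for example check_fastq_gz *.fastq.gz in whichever directory the files are written to. A file that passes can still hold a valid but empty archive, so this complements rather than replaces looking at the read counts.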

The job submission script currently looks like this:

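#!/bin/sh
#$ -cwd
#$ -o /exports/igmm/eddie/IGMM-VariantAnalysis/emma/downloading_giab_data.o
#$ -e /exports/igmm/eddie/IGMM-VariantAnalysis/emma/downloading_giab_data.e
# The #$ lines above are scheduler options: run from the current directory
# and write stdout and stderr to the given files

cd /exports/igmm/eddie/IGMM-VariantAnalysis/emma

# Make the module command available inside the batch job
. /etc/profile.d/modules.sh
module load anaconda
# giab.sh calls seqtk, which is not found unless this module is loaded
module load igmm/apps/seqtk/1.2
source activate trio-pipe-env

cd trio-whole-exome/tests/assets/input_data/scripts

./giab.sh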
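
Both failures described above (seqtk not found, wrong path to bazam.jar) only showed up as empty output files after the download had run. A pre-flight check near the top of the job script could make them fail fast instead. This is a sketch under assumptions: preflight is a hypothetical function, and the location of bazam.jar is not given in these notes, so it is passed in as an argument rather than hard-coded.

```shell
# Hypothetical pre-flight check (not part of giab.sh): confirm that the
# required commands are on PATH and the required files exist.
# Usage: preflight "cmd1 cmd2 ..." file1 [file2 ...]
preflight() {
    commands=$1
    shift
    failed=0
    # Command names contain no spaces, so word splitting is intended here
    for cmd in $commands; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "command not found: $cmd" >&2
            failed=1
        fi
    done
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "file not found: $f" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

In the job script this might be used as preflight "seqtk java" "$BAZAM_JAR" || exit 1 just before ./giab.sh, where BAZAM_JAR is an assumed variable holding whatever path giab.sh has been edited to use, and java is listed on the assumption that the jar is run with it.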

On branch emma-cnv I have created a document called steps_to_run_on_eddie.md describing how I have been getting the tests to run on eddie, including the errors and solutions so far.

It is still a work in progress, but I thought I'd add a link to it here

The tests have run successfully on eddie. I am creating a document with detailed instructions on how to run them there.

closed
