I am currently trying to run run_tests.sh on eddie, on a branch I have named emma-cnv, which is branched from master commit 6d2af834. I have been documenting the steps I have taken and the problems I have run into so far, which may be useful when it comes to testing real data.
Login and create environment
Steps to run on eddie
Log in to eddie: ssh [user]@eddie.ecdf.ed.ac.uk
Go to the shared directory: cd /exports/igmm/eddie/IGMM-VariantAnalysis/
Create a folder to work in
Clone the repository: git clone https://git.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome.git
Create a new branch: git branch [new branch]
Move to the new branch: git checkout [new branch]
In this case the new branch is called emma-cnv
Log in to an interactive session: qlogin -l h_vmem=32G. To load modules you need to be in an interactive session with at least 4G; this command gives 32G. Update: different sections of the pipeline require different resources, up to 16 CPUs, so request these for the interactive session with qlogin -l h_vmem=8G -pe interactivemem 16
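The steps above could be gathered into a short shell sketch. The helper names qlogin_cmd and clone_and_branch are hypothetical and the working folder is whatever you choose; the commands themselves are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; the commands are those listed above.

# Print the qlogin command for a memory request and an optional core count.
# With no core count, only the memory request is included.
qlogin_cmd() {
    local mem="$1" cores="${2:-}"
    if [ -n "$cores" ]; then
        echo "qlogin -l h_vmem=${mem} -pe interactivemem ${cores}"
    else
        echo "qlogin -l h_vmem=${mem}"
    fi
}

# Clone the pipeline into a working folder and switch to a new branch.
# Run this on eddie after: ssh [user]@eddie.ecdf.ed.ac.uk
clone_and_branch() {
    local workdir="$1" branch="$2"
    mkdir -p "$workdir" && cd "$workdir" || return 1
    git clone https://git.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome.git || return 1
    cd trio-whole-exome || return 1
    git branch "$branch" && git checkout "$branch"
}

qlogin_cmd 8G 16   # prints: qlogin -l h_vmem=8G -pe interactivemem 16
```

clone_and_branch is defined but not called here, since it needs network access to the repository.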
Create conda environment
You may need to configure or create .condarc, depending on whether this is your first time using conda. See https://www.wiki.ed.ac.uk/display/ResearchServices/Anaconda
Could temporarily do this in the scratch directory.
Currently mine is set up to point at my scratch environment; this may need to change.
Load anaconda, which is needed to access conda: module load anaconda
Enter conda environment with source activate trio-pipe-env
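A minimal sketch of the conda setup, assuming the environments and package cache are kept under a scratch directory. The envs_dirs and pkgs_dirs keys are standard .condarc settings, but the function name write_condarc and the directory layout are assumptions for illustration; check the wiki page above for the recommended values.

```shell
# Write a .condarc that keeps conda environments and packages under base_dir.
# base_dir (e.g. a scratch directory) and the anaconda/envs, anaconda/pkgs
# layout are assumptions, not values taken from the eddie documentation.
write_condarc() {
    local base_dir="$1" target="$2"
    cat > "$target" <<EOF
envs_dirs:
  - ${base_dir}/anaconda/envs
pkgs_dirs:
  - ${base_dir}/anaconda/pkgs
EOF
}

# On eddie, once the file is in place as ~/.condarc:
#   module load anaconda
#   source activate trio-pipe-env
```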
Create nextflow.config
Take the outline provided by Murray (see below) and hunt through eddie to find the equivalent files
params {
    // path to BCBio - should contain anaconda/bin/bcbio_nextgen.py
    bcbio = '/home/u035/u035/shared/software/bcbio'

    // this will just be `<pipeline_repo>/trio_whole_exome_parse_peddy_ped_csv.pl`. Won't need this once
    // Alison's merged some stuff
    parse_peddy_output = '/home/u035/u035/shared/testing/trio-whole-exome/trio_whole_exome_parse_peddy_ped_csv.pl'

    // base BCBio variant calling template
    bcbio_template = '/home/u035/u035/shared/testing/inputs/test_bcbio_template.yaml'

    // exome target BED file
    target_bed = '/home/u035/u035/shared/resources/exome_targets/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed'

    // HG38 reference genome
    reference_genome = '/home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa'

    // pipeline outputs
    output_dir = '/home/u035/u035/shared/testing/outputs'
}

executor {
    name = 'slurm'
    queue = 'standard'
}
The final eddie config looked as follows:
params {
    // path to BCBio - should contain anaconda/bin/bcbio_nextgen.py
    bcbio = '/exports/igmm/eddie/IGMM-VariantAnalysis/software/bcbio-1.0.9'

    // this will just be `<pipeline_repo>/trio_whole_exome_parse_peddy_ped_csv.pl`. Won't need this once
    // Alison's merged some stuff
    parse_peddy_output = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/trio_whole_exome_parse_peddy_ped_csv.pl'

    // base BCBio variant calling template
    bcbio_template = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/scripts/trio_whole_exome_bcbio_template.yaml'

    // exome target BED file
    target_bed = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/assets/input_data/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed'

    // HG38 reference genome
    reference_genome = '/exports/igmm/eddie/IGMM-VariantAnalysis/software/bcbio-1.0.9/genomes/Hsapiens/hg38/seq/hg38.fa'

    // pipeline outputs
    output_dir = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/outputs'
}

executor {
    // trying with sge as this is recommended by the research services
    name = 'sge'
    queue = 'standard'
}
Set $NEXTFLOW_CONFIG using export NEXTFLOW_CONFIG=[path to config file] and test with echo $NEXTFLOW_CONFIG
Run the tests with ./run_tests.sh
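Because the config paths were found by hunting through eddie, it may help to confirm they all exist before running the tests. The following is a rough sketch, not part of the pipeline: both function names are hypothetical, and it assumes paths in the config are single-quoted, absolute, and free of spaces.

```shell
# List every single-quoted absolute path in a Nextflow config that does not exist.
# Prints nothing and returns 0 when all paths are present.
missing_config_paths() {
    local config="$1" p status=0
    for p in $(grep -o "'/[^']*'" "$config" | tr -d "'"); do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            status=1
        fi
    done
    return "$status"
}

# Confirm NEXTFLOW_CONFIG is set and readable, then check the paths it names.
check_nextflow_config() {
    if [ -z "${NEXTFLOW_CONFIG:-}" ] || [ ! -r "$NEXTFLOW_CONFIG" ]; then
        echo "NEXTFLOW_CONFIG is unset or unreadable"
        return 1
    fi
    missing_config_paths "$NEXTFLOW_CONFIG"
}

# Usage, before ./run_tests.sh:
#   check_nextflow_config
```

An output_dir that has not been created yet will be reported as missing, which may be harmless.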
Changes made
PROBLEM: Can’t find main.nf
SOLUTION: In run_tests.sh changed ../pipeline/main.nf to ../main.nf
PROBLEM: Starts running but params.workflow required - variant-calling or variant-prioritisation
SOLUTION: Added --workflow 'variant-calling' to the nextflow run command in run_tests.sh
PROBLEM: Unable to find bcbio config
SOLUTION: Added assets to the path given to --bcbio_template in run_tests.sh, and edited the file paths in the bcbio template to reflect the eddie environment
PROBLEM: The test file does not seem to be loading NEXTFLOW_CONFIG
SOLUTION: Added -c [path to NEXTFLOW_CONFIG] to the nextflow run command in run_tests.sh
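Taken together, the fixes so far change the nextflow run line in run_tests.sh. A rough sketch of the adjusted command follows; the wrapper function nextflow_test_cmd is hypothetical, any other arguments in the real script are not reproduced, and the template path is passed in rather than guessed.

```shell
# Print the adjusted nextflow command (hypothetical wrapper, for illustration).
#   $1: path to main.nf (now ../main.nf rather than ../pipeline/main.nf)
#   $2: path to the bcbio template, including the assets directory
# Reads the config path from the NEXTFLOW_CONFIG environment variable.
nextflow_test_cmd() {
    local main_nf="$1" template="$2"
    echo "nextflow run ${main_nf} -c ${NEXTFLOW_CONFIG}" \
         "--workflow variant-calling --bcbio_template ${template}"
}

# Usage:
#   export NEXTFLOW_CONFIG=[path to config file]
#   nextflow_test_cmd ../main.nf [path to bcbio template]
```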
PROBLEM: New error: process requirement exceeds available CPUs – req: 2; avail: 1
SOLUTION: Restarted the interactive session, requesting -pe interactivemem 4, which seems to have worked. Later 'large' processes may need more, so either request more (16) from the start or potentially use a wild west node
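A quick check along these lines could catch an undersized session before Nextflow does. It is a sketch under one assumption: that Grid Engine exposes the granted slot count as NSLOTS inside the interactive session. If that variable is absent it falls back to nproc. The function name has_enough_cpus is hypothetical.

```shell
# Succeed only if the session offers at least the requested number of CPUs.
# NSLOTS is assumed to hold the slots granted by -pe; otherwise fall back to nproc.
has_enough_cpus() {
    local required="$1" available="${NSLOTS:-$(nproc)}"
    if [ "$available" -lt "$required" ]; then
        echo "req: ${required}; avail: ${available}"
        return 1
    fi
}

# Usage, e.g. before starting a run whose largest process needs 16 CPUs:
#   has_enough_cpus 16 || echo "restart qlogin with more slots"
```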
PROBLEM: Now getting an error in var_calling:process_families:merge_fastqs. Running bash .command.run reveals:
gzip: 200922_A00001_0001_BHNTGMDMXX_3_00002AM0001L01_1.fastq.gz: unexpected end of file
gzip: 200922_A00001_0001_BHNTGMDMXX_3_00002AM0001L01_2.fastq.gz: unexpected end of file
SOLUTION: Asked Murray, who said these test files are empty, which explains why the pipeline fails. I am now moving on to run_giab_tests.sh, which uses a subsample of real data from the Genome in a Bottle reference data.
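Empty or truncated inputs like these can be spotted up front with gzip -t, which tests a compressed file's integrity without extracting it. A minimal sketch, with the hypothetical function name find_bad_fastqs:

```shell
# Print each .fastq.gz file directly inside a directory that fails gzip's
# integrity test. Empty or truncated files are reported; returns non-zero
# if any were found.
find_bad_fastqs() {
    local dir="$1" f status=0
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue          # no matches: the glob stays literal
        if ! gzip -t "$f" 2>/dev/null; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   find_bad_fastqs [directory containing the fastq.gz files]
```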
The data for this can be downloaded with the giab.sh script in tests/assets/input_data/scripts/
Login process
Connect to eddie, set up interactive session, load anaconda module, activate conda environment
I have fleshed out the eddie.config file with the processes section of the example eddie config for running nextflow
I have also found that even after running giab.sh with my job submission script, the pipeline would still fail when trying to merge the fastq.gz files due to an unexpected end of file. Upon investigation, it turns out the raw data was being downloaded correctly but the resulting fastq.gz files were empty. Two issues contribute to this: the path to bazam.jar was incorrect, and the seqtk command was not found when run.
To correct this I have fixed the paths so they work with my environment and now load the igmm/apps/seqtk module as part of my job submission script. This will hopefully now download the data correctly.
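Both failure modes could be caught before any data is downloaded with a small preflight check. This is a sketch rather than part of giab.sh or my job script: the function name preflight is hypothetical, and the location of bazam.jar is left for the caller to supply.

```shell
# Check that a command is on PATH and a jar file exists before starting work.
# Both arguments are supplied by the caller; no real paths are assumed here.
preflight() {
    local tool="$1" jar="$2" status=0
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "command not found: $tool"
        status=1
    fi
    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar"
        status=1
    fi
    return "$status"
}

# In a job script, after `module load igmm/apps/seqtk`:
#   preflight seqtk "[path to bazam.jar]" || exit 1
```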
The job submission script currently looks like this:
I have created a document called steps_to_run_on_eddie.md, on branch emma-cnv, describing how I have been getting the tests to run on eddie, including the errors and solutions so far.
It is still a work in progress, but I thought I'd add a link to it here.