Add documentation of the steps, errors and solutions when getting the giab test to run on eddie.

b422289a · s1734289 · 607d6f79 · b422289a
Commit b422289a authored 2 years ago by s1734289
--- a/docs/steps_to_run_tests_on_eddie.md
+++ b/docs/steps_to_run_tests_on_eddie.md
+Login and create environment 
+
+__Steps to run on eddie__  
+Login to eddie `ssh [user]@eddie.ecdf.ed.ac.uk  
+Go to shared directory `cd /exports/igmm/eddie/IGMM-VariantAnalysis/`  
+Create folder  
+Clone repository `git clone https://git.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome.git`  
+Create new branch `git branch [new branch]`  
+Move to new branch `git checkout [new branch]`  
+In this case the new branch is called emma-cnv  
+Login to interactive session `qlogin -l h_vmem=32G` _to load modules need to be in an interactice session with at least 4G – this command gives 32._
+
+__Create conda environment__  
+May need to configure or create `.condarc` depending on whether this is first time or not. see https://www.wiki.ed.ac.uk/display/ResearchServices/Anaconda  
+Could temporarily do this in the scratch directory. Currently for me set up to point at my scratch environment – will need to change   
+`Module load anaconda` loads anaconda – which is needed to access conda   
+Enter conda environment? `source activate trio-pipe-env`  
+
+__Create nextflow.config__  
+Take the outline provided by Murray (insert below) and hunt through eddie to find the right equivalent folders
+
+```
+params {
+  // path to BCBio - should contain anaconda/bin/bcbio_nextgen.py
+  bcbio = '/home/u035/u035/shared/software/bcbio'
+
+  // this will just be `<pipeline_repo>/trio_whole_exome_parse_peddy_ped_csv.pl`. Won't need this once
+  // Alison's merged some stuff
+  parse_peddy_output = '/home/u035/u035/shared/testing/trio-whole-exome/trio_whole_exome_parse_peddy_ped_csv.pl'
+
+  // base BCBio variant calling template
+  bcbio_template = '/home/u035/u035/shared/testing/inputs/test_bcbio_template.yaml'
+
+  // exome target BED file
+  target_bed = '/home/u035/u035/shared/resources/exome_targets/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed'
+
+  // HG38 reference genome
+  reference_genome = '/home/u035/u035/shared/software/bcbio/genomes/Hsapiens/hg38/seq/hg38.fa'
+
+  // pipeline outputs
+  output_dir = '/home/u035/u035/shared/testing/outputs'
+}
+
+executor {
+  name = 'slurm'
+  queue = 'standard'
+}
+```
+The final config looked as such
+
+```
+params {
+  // path to BCBio - should contain anaconda/bin/bcbio_nextgen.py
+  bcbio = '/exports/igmm/eddie/IGMM-VariantAnalysis/software/bcbio-1.0.9'
+
+  // this will just be `<pipeline_repo>/trio_whole_exome_parse_peddy_ped_csv.pl`. Won't need this once
+  // Alison's merged some stuff
+  parse_peddy_output = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/trio_whole_exome_parse_peddy_ped_csv.pl'
+
+  // base BCBio variant calling template
+  bcbio_template = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/scripts/trio_whole_exome_bcbio_template.yaml'
+
+  // exome target BED file
+  target_bed = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/assets/input_data/Twist_Exome_RefSeq_targets_hg38.plus15bp.bed'
+
+  // HG38 reference genome
+  reference_genome = '/exports/igmm/eddie/IGMM-VariantAnalysis/software/bcbio-1.0.9/genomes/Hsapiens/hg38/seq/hg38.fa'
+
+  // pipeline outputs
+  output_dir = '/exports/igmm/eddie/IGMM-VariantAnalysis/emma/trio-whole-exome/tests/outputs'
+}
+
+executor {
+  
+  // trying with sge as this is recommended by the research services  
+  name = 'sge'
+  queue = 'standard'
+}
+
+```
+
+Set NEXTFLOW_CONFIG using ‘export NEXTFLOW_CONFIG=”[path to config file]” 
+Test with ‘echo $NEXTFLOW_CONFIG’
+./run_tests.sh
+
+__Changes made__
+- **PROBLEM**: Can’t find `main.nf`
+- SOLUTION: In run_tests.sh changed `../pipeline/main.nf` to `../main.nf`
+- **PROBLEM**: Starts running but `params.workflow required - variant-calling or variant-prioritisation`
+- SOLUTION: Added `–workflow ‘variant-calling’` to run nextflow command in run_tests.sh 
+- **PROBLEM**: Unable to find bcbio config 
+- SOLUTION added ‘assets’ to path in --bcbio_template in run_tests.sh
+- edit bcbio template path
+- **PROBLEM**: What should I set as NEXTFLOW_config – what is the platform specific config – is there an example – is it the trio_whole_exom_config.sh
+- SOLUTION: Ask Murray, and edit paths in example config (now eddie.config stored in emma directory, or see above) to reflect paths in eddie. this may need some editing 
+- **PROBLEM**: The test file does not seem to be loading NEXTFLOW_CONFIG
+- SOLUTION: Adding ` -c [path to NEXTFLOW_CONFIG]` to nextflow run command in run_test.sh
+- **PROBLEM**: new error – `process requirement exceed available CPUs – req: 2; avail; 1`
+-  SOLUTION: restarting interactive session but requesting `-pe interactivemem 4` – seems to have worked. Realised that later 'large' processes may need more. either specify more to start or potentially use wild west node
+- **PROBLEM**: Now having error of unable to find a file in var_calling:process_families:merge_fastqs, seems a path is set in a weird way somewhere. during step `var_calling:process_families:merge_fastqs`, looking for the fastq files in the work folder of that nextflow step, ie `work/3a/c52e9a602870d4672087f6c38f710e/`.    
+
+```
+gzip: HG002_R1.fastq.gz: No such file or directory  
+gzip: HG002_R2.fastq.gz: No such file or directory  
+```
+These files do not reside there. potentially a problem with path to fastqs 
+
+**Running giab tests**
+
+I have downloaded the giab test data using a custom script and am attempting to run run_giab_tests.sh. 
+
+*Edits were needed to to the download script for the data to be successfully downloaded, see below*
+
+It still fails during `var_calling:process_families:merge_fastqs`. In the logs it seems to be putting a link to fastq.gz in a locations where the files do not reside (input_data/giab instead of input_data/scripts/giab) as a quick fix because the link is generated as part of the script I have moved the files to this location.
+This has not stopped the test from failing at this stage, however some merged files have appeared in the tests directory. I think this is because I specify in the trio_whole_exome_bcbio_template.yaml that something should be uploaded to the test directory. The appearance of some merged files indicates that at least one of the merge files processes is completing. Which is confusing. It may be something to do with the eddie config and the data usage specification in that. I will look into the example config on the research services page and see if I can find a solution
+
+have updated config file to include the processes section from the example of the eddie_conda_config file
+now getting unexpected end of file error with gzip, may use zcat?
+It appears the fastq files are empty. The raw data is there though 
+Something must have happened while downloading the data, will check what packages are needed
+
+**Solution**: checked giab.sh, the path to bazam.jar was wrong, therefor the data was not processed properly following the download, corrected the path and starting the download again. Path to reference bed - TWIST TARGETS was wrong. corrected in giab.sh
+
+Now have all of the correct modules loaded, and data is downloaded. The functional download script looks as follows:
+
+```
+#!/bin/sh
+#$ -cwd
+#$ -o /exports/igmm/eddie/IGMM-VariantAnalysis/emma/downloading_giab_data.o
+#$ -e /exports/igmm/eddie/IGMM-VariantAnalysis/emma/downloading_giab_data.e
+
+cd /exports/igmm/eddie/IGMM-VariantAnalysis/emma
+
+. /etc/profile.d/modules.sh
+
+module load anaconda
+module load igmm/apps/BEDTools
+module load igmm/apps/samtools/1.6
+module load igmm/apps/seqtk/1.2
+source activate trio-pipe-env
+
+cd  trio-whole-exome/tests/assets/input_data/scripts
+
+./giab.sh
+```
+
+**Problem**: It seems that the fastq.gz files have not infact been zipped properly, simply named gzip, which causes problems with the gzip in the test run.
+**Solution** going to rename the files to fastq then gzip them in the input data. This is a pretty crude approach, and meaning giab.sh should be investigated further but it works for now
+
+**Problem**: files are being proessed up to merge fastqs , and are being merged, then the pipeline fails. it is still saying that a file is missing. using `ls -a` it appears that .command.log is missing 
+
+the pipeline seems to fail at various stages of merge.fastqs, sometimes files are generated othertimes they are not. Upon inspection, it appears that the pipeline is still trying to use SBATCH, meaning it is still trying to use slurm. found that in nextflow.config the executor was defined in the processes section, meaning it was overwriting the executor defined in the eddie.config 
+
+**Solution** in nextflow.config, move executor definition outside of the processes section 
+
+**Problem** the md5.txt files were not found. Turns out giab.sh did not create them following downsampling 
+
+**Solution** create manually. saved the count number (10000) for each fastq.gz in a corresponding fastq.count file, then md5sum on each fastq* file, saving the outputs in a txt file 
+
+**Problem**  The pipeline fails at step `var_calling:process_families:bcbio_family_processing`
+
+````
+Error occurred during initialization of VM
+Could not reserve enough space for 46964736KB object heap
+
+````
+
+This section of the pipeline is trying to run a java script. 
+
+Does it need more space? The process that failed was a large task that requires 32G memory and 16 cpus - request this much in interactive session or use wild west node? This still isnt working. 
+
+Java required memory is a known problem when running on eddie. 'Java processes on eddie require a certain amount of memory overhead on top of this' https://www.wiki.ed.ac.uk/pages/viewpage.action?spaceKey=ResearchServices&title=Bioinformatics
+research services suggest adding line `clusterOptions = {"-l h_vmem=${(task.memory + 4.GB).bytes/task.cpus}"}` to eddie.config, so that more data is available. 
+it is looking for 47 GB, given 32 GB
+so add 16 to task memory. also going to clean work folder after this run
+
+error has now changed slightly to 
+
+```
+[2022-05-23T11:55Z] Error occurred during initialization of VM
+[2022-05-23T11:55Z] Unable to allocate 1467648KB bitmaps for parallel garbage collection for the requested 46964736KB heap.
+[2022-05-23T11:55Z] Error: Could not create the Java Virtual Machine.
+[2022-05-23T11:55Z] Error: A fatal exception has occurred. Program will exit.
+```
+
+A google suggests that the error `could not creare Java Virtual Machine` pops up when the java command is wrong
+
+The final lines in the .command.log file before the error are 
+
+```
+using: /gpfs/igmmfs01/eddie/bioinfsvice/ameynert/software/bcbio-1.0.9/anaconda/share/gatk4-4.0.3.0-0/gatk-package-4.0.3.0-local.jar
+
+java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_l$
+```
+
+Pasting this into the terminal results in permssion denied. Trying to find where the program is located, and why it is pointing to a directory I do not have access to. found program, is there as expected, check install notes for BCbio in docs/Software_installation.md, suggests a change to the memory fiven to GATK.
+
+changed galaxy/bcbio_system.yaml from: 
+
+```
+["-Xms500m", "-Xmx3500m"]
+```
+to: 
+
+```
+Increase JVM memory for GATK in galaxy/bcbio_system.yaml
+  gatk:
+    jvm_opts: ["-Xms500m", "-Xmx5g"]
+```
+
+Hopefully this will give it enough memory to run.
+Failed at same place again
+
+When run with bash .command.run, the individual process ran. failed once then got further. maybe problem due to two large processes trying to run at the same time but have only requested enough for 1? 
+
+when bash .command.run was used, the step completed. does this mean that the given memory caps must be doubled so both can run at the same time?
+
+__Login process__   
+Connect to eddie, set up interactive session, activate conda environment