Doc updates, putting conda in the default profile

Commit 12e6a972 authored 2 months ago by mwham
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ The metadata creator is a Python script for writing XML metadata files based on

 ## Resources

-The encryption uses EGA-Cryptor, a JAR file from ega-archive.org. This is stored at `/modules/local/ega/encrypt/resources`:
+The encryption uses EGA-Cryptor, a JAR file from ega-archive.org. This is stored at `/modules/local/ega/encrypt/resources`. EGA-Cryptor can also be downloaded from EGA's website if needed:

 ```
 wget https://ega-archive.org/files/EgaCryptor.zip
@@ -20,45 +20,93 @@ unzip EgaCryptor.zip
 rm EgaCryptor.zip
 ```

-The metadata creator uses several Python packages from PyPI, described in metadata/requirements.txt. It is recommended to set up a virtual environment and install the dependencies into this:
+## Setup
+
+Prerequisites:
+- Conda
+
+Clone this Git repo and `cd` into it:

 ```
-python -m venv ./python_env
-source python_env/bin/activate
-pip install -r ega-submission-via-portal/metadata/requirements.txt
+git clone https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal.git
+cd ega-submission-via-portal
+```
+
+The encryption pipeline requires Nextflow, and the metadata creator uses several Python packages from PyPI. Both can be set up with Conda:
+
+```
+# installation via Conda
+conda env create -p path/to/conda-env -f environment.yml
+conda activate path/to/conda-env
+```
+
+If you already have Nextflow and Aspera installed somewhere else, then the dependencies for the metadata creator can also be set up with a Python virtual environment:
+
+```
+# installation via Python virtualenv (metadata creation only)
+python -m venv ./path/to/python_env
+source path/to/python_env/bin/activate
+pip install -r metadata/requirements.txt
 ```

 ## Running encryption

-This pipeline assumes that the FASTQ files for upload are named in the format `sample_R1.fastq.gz` and `sample_R2.fastq.gz`, where `sample` will become the sample ID for this fastq pair.
+To view help text on what the command line arguments do:
+
+```
+nextflow run main.nf --help
+```
+
+### Inputs
+
+This pipeline assumes that the FASTQ files for upload are named in the format `sample_R1.fastq.gz` and `sample_R2.fastq.gz`, where `sample` will become the sample ID for this fastq pair. These are passed in as `--reads`. Note the single quotes, which stop the shell from expanding the wildcard before it reaches Nextflow, and the combination of `*` and `{1,2}`, which picks up the sample IDs and the R1/R2 files respectively:
+
+```
+--reads 'path/to/fastqs/*_R{1,2}.fastq.gz'
+```
+
+To run the encryption:

-To run and upload automatically:
+```
+nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
+  --reads '*_R{1,2}.fastq.gz' \
+  --outdir output
+```
+
+To run the upload as well, add the arguments `--ega_user` and `--egapass`:

 ```
-nextflow https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
-  -profile conda \
+nextflow run main.nf \
  --reads '*_R{1,2}.fastq.gz' \
-  --samples /absolute/path/to/samples.csv \
  --outdir output \
  --ega_user ega-box-1234 \
  --egapass /absolute/path/to/egapass
 ```

-To encrypt and produce a `runs.csv` file without uploading, leave out the `--ega...` arguments:
+`--egapass` should be the path to a file containing a single line, which is the password for the ega-box account being used:

 ```
-nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
-  -profile conda \
-  --reads '*_R{1,2}.fastq.gz' \
-  --samples /absolute/path/to/samples.csv \
-  --outdir output
+somePassword012!
+```
+
+If you don't need the metadata creation and already have Nextflow set up, then the encryption pipeline can also be run directly from Git without cloning the repo:
+
+```
+nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal ...
 ```

-The CSV file for connecting uploaded paired-end FASTQ files to their sample aliases and checksums in the EGA Submitter Portal will be in the specified output folder as runs.csv.
+### Outputs
+
+The pipeline will create the following files. The first three are created for each fastq, and `runs.csv` is created once per run:
+
+- an encrypted .gpg file
+- a .md5 file, containing the MD5 checksum of the unencrypted data
+- a .gpg.md5 file, containing the MD5 checksum of the encrypted data
+- a `runs.csv` file, summarising each fastq pair with sample ID, filenames and MD5s

 ## Running metadata creation

-The script will be run like:
+The script is run like:

 ```
 python ega_metadata.py [universal arguments] <stage> [step-specific arguments]
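
Note on the `--reads` naming convention described in the README changes above: the Python sketch below illustrates how file names of the form `sample_R1.fastq.gz` / `sample_R2.fastq.gz` map to sample IDs and R1/R2 pairs. It is not taken from the pipeline. The function name `pair_fastqs` and the regular expression are hypothetical, and dropping files without a mate is an assumption, since the README does not say how the pipeline treats unpaired files.

```python
import re
from collections import defaultdict
from pathlib import PurePath

# Naming convention from the README: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastqs(paths):
    """Group FASTQ paths by sample ID, keeping only complete R1/R2 pairs."""
    by_sample = defaultdict(dict)
    for path in paths:
        # Match on the file name only, so directories never leak into the sample ID
        match = FASTQ_NAME.match(PurePath(path).name)
        if match is None:
            continue  # file name does not follow the convention
        by_sample[match["sample"]][match["read"]] = path

    # Assumption: samples missing either R1 or R2 are dropped
    return {
        sample: (reads["1"], reads["2"])
        for sample, reads in sorted(by_sample.items())
        if set(reads) == {"1", "2"}
    }


if __name__ == "__main__":
    example = [
        "fastqs/s1_R1.fastq.gz",
        "fastqs/s1_R2.fastq.gz",
        "fastqs/s2_R1.fastq.gz",  # no R2 mate, so it is left out
        "fastqs/notes.txt",
    ]
    for sample, (r1, r2) in pair_fastqs(example).items():
        print(sample, r1, r2)
```

Running a check like this over the input directory before launching the pipeline is one way to spot files that the `*_R{1,2}.fastq.gz` pattern would not pick up.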

--- a/metadata/requirements.txt
+++ b/metadata/requirements.txt
 pyYAML>=6.0.2
-pandas>=2.2.2
-jinja2>=3.1.4
+pandas>=2.2.3
+jinja2>=3.1.5
 openpyxl>=3.1.5
--- a/nextflow.config
+++ b/nextflow.config
@@ -63,7 +63,7 @@ process {
 }

 profiles {
-    conda { conda.enabled = true }
+    standard { conda.enabled = true }
    stubs { conda.enabled = false }
 }
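
Note on the `nextflow.config` change: Nextflow applies the profile named `standard` when no `-profile` option is given, so renaming `conda` to `standard` enables Conda by default. This matches the removal of `-profile conda` from the README commands. The toy Python model below sketches that selection rule for the two profiles shown in the diff. It is not Nextflow code: `PROFILES` and `resolve_profile` are hypothetical names, and the `KeyError` only represents the fact that `conda` is no longer a defined profile name, not Nextflow's actual error handling.

```python
# Toy model of the profiles block after this commit (illustration only)
PROFILES = {
    "standard": {"conda.enabled": True},
    "stubs": {"conda.enabled": False},
}


def resolve_profile(requested=None):
    """Return the settings for the requested profile, or 'standard' if none is given."""
    name = requested if requested is not None else "standard"
    if name not in PROFILES:
        # 'conda' falls into this branch now that the profile has been renamed
        raise KeyError(f"unknown profile: {name}")
    return PROFILES[name]


if __name__ == "__main__":
    print(resolve_profile())         # no -profile given: Conda enabled
    print(resolve_profile("stubs"))  # -profile stubs: Conda disabled
```

Under this model, a command that still passes `-profile conda` no longer matches any profile defined in the block above, so existing scripts may need that option removed.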