Commit 12e6a972 authored by mwham

Doc updates, putting conda in the default profile

parent 186aa990
......@@ -12,7 +12,7 @@ The metadata creator is a Python script for writing XML metadata files based on
## Resources
The encryption uses EGA-Cryptor, a JAR file from ega-archive.org. This is stored at `/modules/local/ega/encrypt/resources`:
The encryption uses EGA-Cryptor, a JAR file from ega-archive.org. This is stored at `/modules/local/ega/encrypt/resources`. EgaCryptor can also be downloaded from EGA's website if needed:
```
wget https://ega-archive.org/files/EgaCryptor.zip
......@@ -20,45 +20,93 @@ unzip EgaCryptor.zip
rm EgaCryptor.zip
```
The metadata creator uses several Python packages from PyPI, described in metadata/requirements.txt. It is recommended to set up a virtual environment and install the dependencies into this:
## Setup
Prerequisites:
- Conda
Clone this Git repo and `cd` into it:
```
python -m venv ./python_env
source python_env/bin/activate
pip install -r ega-submission-via-portal/metadata/requirements.txt
git clone https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal.git
cd ega-submission-via-portal
```
The encryption pipeline requires Nextflow, and the metadata creator uses several Python packages from PyPI. Both can be set up with Conda:
```
# installation via Conda
conda env create -p path/to/conda-env -f environment.yml
conda activate path/to/conda-env
```
If you already have Nextflow and Aspera installed somewhere else, then the dependencies for the metadata creator can also be set up with a Python virtual environment:
```
# installation via Python virtualenv (metadata creation only)
python -m venv ./path/to/python_env
source path/to/python_env/bin/activate
pip install -r metadata/requirements.txt
```
## Running encryption
This pipeline assumes that the FASTQ files for upload are named in the format `sample_R1.fastq.gz` and `sample_R2.fastq.gz`, where `sample` will become the sample ID for this fastq pair.
To view help text on what the command line arguments do:
```
nextflow run main.nf --help
```
### Inputs
This pipeline assumes that the FASTQ files for upload are named in the format `sample_R1.fastq.gz` and `sample_R2.fastq.gz`, where `sample` becomes the sample ID for that FASTQ pair. These are passed in as `--reads`. Note the single quotes, which stop the shell from expanding the wildcard before it reaches Nextflow, and the combination of `*` and `{1,2}`, which pick up the sample IDs and the R1/R2 files respectively:
```
--reads 'path/to/fastqs/*_R{1,2}.fastq.gz'
```
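To check in advance which files the naming convention would pick up, the pairing can be sketched in Python. This is only an illustration of the documented `sample_R1.fastq.gz` / `sample_R2.fastq.gz` convention, not the pipeline's own matching code. The function name `pair_fastqs` and the choice to drop incomplete pairs are assumptions made for this example.
```python
import re
from collections import defaultdict

# Mirrors the documented convention: <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastqs(filenames):
    """Group FASTQ file names into {sample_id: {"1": r1_name, "2": r2_name}}.

    Hypothetical helper: names that do not follow the convention are ignored,
    and samples with only one read file are dropped (an assumption of this sketch).
    """
    pairs = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # does not look like <sample>_R1/_R2.fastq.gz
        pairs[match["sample"]][match["read"]] = name
    # Keep only samples where both R1 and R2 were found
    return {s: reads for s, reads in pairs.items() if set(reads) == {"1", "2"}}


example = [
    "sampleA_R1.fastq.gz",
    "sampleA_R2.fastq.gz",
    "sampleB_R1.fastq.gz",  # no R2, so not a complete pair
    "notes.txt",            # not a FASTQ name
]
print(pair_fastqs(example))
# {'sampleA': {'1': 'sampleA_R1.fastq.gz', '2': 'sampleA_R2.fastq.gz'}}
```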
To run the encryption:
To run and upload automatically:
```
nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
--reads '*_R{1,2}.fastq.gz' \
--outdir output
```
To run the upload as well, add the arguments `--ega_user` and `--egapass`:
```
nextflow https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
-profile conda \
nextflow run main.nf \
--reads '*_R{1,2}.fastq.gz' \
--samples /absolute/path/to/samples.csv \
--outdir output \
--ega_user ega-box-1234 \
--egapass /absolute/path/to/egapass
```
To encrypt and produce a `runs.csv` file without uploading, leave out the `--ega...` arguments:
`--egapass` should be a file containing a single line, which is the password for the ega-box account being used:
```
nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
-profile conda \
--reads '*_R{1,2}.fastq.gz' \
--samples /absolute/path/to/samples.csv \
--outdir output
somePassword012!
```
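A quick way to confirm that a password file has the expected single-line shape before starting a run is sketched below. The helper name `read_egapass` is hypothetical, and how the pipeline itself reads the file is not shown here. The sketch is strict: it rejects empty files and files with extra lines, including trailing blank lines.
```python
from pathlib import Path


def read_egapass(path):
    """Return the password from an egapass-style file, or raise ValueError.

    Hypothetical helper: expects exactly one non-empty line
    (a single trailing newline is fine).
    """
    lines = Path(path).read_text().splitlines()
    if len(lines) != 1 or not lines[0].strip():
        raise ValueError(f"{path} should contain exactly one non-empty line")
    return lines[0]
```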
If you don't need the metadata creation and already have Nextflow set up, then the encryption pipeline can also be run directly from Git without cloning the repo:
```
nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal ...
```
The CSV file for connecting uploaded paired-end FASTQ files to their sample aliases and checksums in the EGA Submitter Portal will be in the specified output folder as runs.csv.
### Outputs
The pipeline will create, for each fastq:
- an encrypted .gpg file
- a .md5 file, containing the MD5 checksum of the unencrypted data
- a .gpg.md5 file, containing the MD5 of the encrypted data
- a 'runs.csv' file, summarising each fastq pair with sample ID, filenames and MD5s
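The checksum files can be used to confirm that a file has not changed since the pipeline ran. The sketch below recomputes an MD5 in Python and compares it with a stored value. It assumes that the first whitespace-separated token in the `.md5` file is the hex digest, which covers both a bare digest and `md5sum`-style output; the exact layout the pipeline writes is not confirmed here, and the function names are hypothetical.
```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's MD5 with the digest stored in a checksum file.

    Assumption: the first whitespace-separated token in md5_path is the digest.
    """
    tokens = Path(md5_path).read_text().split()
    return bool(tokens) and tokens[0].lower() == md5_of_file(data_path)
```
For example, `matches_md5_file("sample_R1.fastq.gz.gpg", "sample_R1.fastq.gz.gpg.md5")` would check an encrypted file against its checksum. These file names are illustrative guesses based on the extensions listed above.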
## Running metadata creation
The script will be run like:
The script is run like:
```
python ega_metadata.py [universal arguments] <stage> [step-specific arguments]
......
pyYAML>=6.0.2
pandas>=2.2.2
jinja2>=3.1.4
pandas>=2.2.3
jinja2>=3.1.5
openpyxl>=3.1.5
......@@ -63,7 +63,7 @@ process {
}
profiles {
conda { conda.enabled = true }
standard { conda.enabled = true }
stubs { conda.enabled = false }
}
......