@@ -12,7 +12,7 @@ The metadata creator is a Python script for writing XML metadata files based on
## Resources
The encryption uses EGA-Cryptor, a JAR file from ega-archive.org. This is stored at `/modules/local/ega/encrypt/resources`:
The encryption uses EGA-Cryptor, a JAR file from ega-archive.org. This is stored at `/modules/local/ega/encrypt/resources`. EGA-Cryptor can also be downloaded from EGA's website if needed:
```
wget https://ega-archive.org/files/EgaCryptor.zip
...
...
@@ -20,45 +20,93 @@ unzip EgaCryptor.zip
rm EgaCryptor.zip
```
The metadata creator uses several Python packages from PyPI, listed in `metadata/requirements.txt`. It is recommended to set up a virtual environment and install the dependencies into it:
If you already have Nextflow and Aspera installed elsewhere, the dependencies for the metadata creator can also be set up in a Python virtual environment:
```
# installation via Python virtualenv (metadata creation only)
python -m venv ./path/to/python_env
source path/to/python_env/bin/activate
pip install -r metadata/requirements.txt
```
## Running encryption
This pipeline assumes that the FASTQ files for upload are named in the format `sample_R1.fastq.gz` and `sample_R2.fastq.gz`, where `sample` becomes the sample ID for that FASTQ pair.
To view help text on what the command line arguments do:
```
nextflow run main.nf --help
```
### Inputs
This pipeline assumes that the FASTQ files for upload are named in the format `sample_R1.fastq.gz` and `sample_R2.fastq.gz`, where `sample` becomes the sample ID for that FASTQ pair. Pass them in with `--reads`. Note the single quotes, which stop the shell from expanding the wildcard before it reaches Nextflow, and the combination of `*` and `{1,2}`, which picks up the sample IDs and the R1/R2 files respectively:
```
--reads 'path/to/fastqs/*_R{1,2}.fastq.gz'
```
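To check which sample IDs a set of file names would produce under this naming convention, a rough Python sketch is shown here. It is not part of the pipeline: `pair_fastqs` and `FASTQ_PATTERN` are hypothetical names, and dropping samples with only one read file is an assumption of this sketch rather than documented pipeline behaviour.

```python
import re
from collections import defaultdict

# Assumed naming convention from this README:
#   <sample>_R1.fastq.gz and <sample>_R2.fastq.gz
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastqs(filenames):
    """Group FASTQ file names by sample ID (hypothetical helper)."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            # Name does not follow the convention, so it is ignored here.
            continue
        pairs[match.group("sample")]["R" + match.group("read")] = name
    # Assumption: only samples with both R1 and R2 are of interest.
    return {
        sample: reads
        for sample, reads in pairs.items()
        if set(reads) == {"R1", "R2"}
    }


if __name__ == "__main__":
    import os

    # List the current directory and print the pairs that were found.
    for sample, reads in sorted(pair_fastqs(os.listdir(".")).items()):
        print(sample, reads["R1"], reads["R2"])
```

Running this in the FASTQ directory before launching the pipeline gives a quick view of which files follow the expected naming.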
To run the encryption:
To run and upload automatically:
```
nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
--reads '*_R{1,2}.fastq.gz' \
--outdir output
```
To run the upload as well, add the arguments `--ega_user` and `--egapass`:
To encrypt and produce a `runs.csv` file without uploading, leave out the `--ega...` arguments:
`--egapass` should be a file containing a single line, which is the password for the ega-box account being used:
```
nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal \
-profile conda \
--reads '*_R{1,2}.fastq.gz' \
--samples /absolute/path/to/samples.csv \
--outdir output
somePassword012!
```
If you don't need the metadata creation and already have Nextflow set up, then the encryption pipeline can also be run directly from Git without cloning the repo:
```
nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal ...
```
The CSV file for connecting uploaded paired-end FASTQ files to their sample aliases and checksums in the EGA Submitter Portal will be in the specified output folder as `runs.csv`.
### Outputs
The pipeline will create, for each FASTQ file:
- an encrypted `.gpg` file
- a `.md5` file, containing the MD5 checksum of the unencrypted data
- a `.gpg.md5` file, containing the MD5 checksum of the encrypted data
- a `runs.csv` file (one for the whole run, not one per FASTQ), summarising each FASTQ pair with sample ID, filenames and MD5s
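The checksum files listed above can be compared against the data files independently of the pipeline. The following is a minimal sketch, assuming each checksum file stores the hex digest as its first whitespace-separated token. That layout is an assumption, not something this README states, and the helper names `md5_of_file` and `matches_md5_file` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's MD5 with the digest recorded in a checksum file.

    Assumption: the recorded digest is the first whitespace-separated
    token in the checksum file.
    """
    recorded = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(data_path) == recorded
```

For example, `matches_md5_file("x.gpg", "x.gpg.md5")` would return `True` when the encrypted file is intact, where `x.gpg` and `x.gpg.md5` stand in for whatever names the pipeline wrote to the output folder.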