EGA Submission via Portal
Supports encryption and upload of paired-end sequencing data to the European Genome-phenome Archive (EGA), using the Submitter Portal CSV format for Run objects.
Introduction
This project is split into two components: encryption and metadata.
The encryption pipeline is built using Nextflow, a workflow tool to run parallel tasks across multiple compute infrastructures in a very portable manner. It is based on the nf-core template and currently only supports Conda. Source files for this can be found in main.nf, modules and nextflow.config.
The metadata creator is a Python script for writing XML metadata files based on the outputs of the encryption pipeline. Source files can be found in metadata.
Resources
The encryption uses EGA-Cryptor, a JAR file from ega-archive.org. This is stored at /modules/local/ega/encrypt/resources. EGA-Cryptor can also be downloaded from EGA's website if needed:
wget https://ega-archive.org/files/EgaCryptor.zip
unzip EgaCryptor.zip
rm EgaCryptor.zip
Setup
Prerequisites:
- Conda
Clone this Git repo and cd into it:
git clone https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal.git
cd ega-submission-via-portal
The encryption pipeline requires Nextflow and the metadata creator uses several Python packages from PyPI. This can be set up with Conda:
# installation via Conda
conda env create -p path/to/conda-env -f environment.yml
conda activate path/to/conda-env
If you already have Nextflow and Aspera installed somewhere else, then the dependencies for the metadata creator can also be set up with a Python virtual environment:
# installation via Python virtualenv (metadata creation only)
python -m venv path/to/python_env
source path/to/python_env/bin/activate
pip install -r metadata/requirements.txt
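Either way, a quick sanity check that the expected tools are on your PATH (with the virtualenv route, Nextflow comes from your existing installation rather than the environment):
# sanity check after activating either environment
nextflow -version
python --version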
Running encryption
To view help text on what the command line arguments do:
nextflow run main.nf --help
Inputs
This pipeline assumes that the FASTQ files for upload are named in the format sample_R1.fastq.gz and sample_R2.fastq.gz, where sample will become the sample ID for the fastq pair. These are passed in as --reads. Note the single quotes, which prevent the shell from interpolating the wildcard before it gets to Nextflow, and the combination of * and {1,2}, which pick up the sample IDs and the R1/R2 respectively:
--reads 'path/to/fastqs/*_R{1,2}.fastq.gz'
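As an illustration (these filenames are hypothetical), a directory containing four files like the following would be grouped into two pairs, with sample IDs sampleA and sampleB:
# illustrative listing of files matched by the pattern above
ls path/to/fastqs/
# sampleA_R1.fastq.gz  sampleA_R2.fastq.gz
# sampleB_R1.fastq.gz  sampleB_R2.fastq.gz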
To run the encryption:
nextflow run main.nf \
--reads '*_R{1,2}.fastq.gz' \
--outdir output
To run the upload as well, add the arguments --ega_user and --egapass:
nextflow run main.nf \
--reads '*_R{1,2}.fastq.gz' \
--outdir output \
--ega_user ega-box-1234 \
--egapass /absolute/path/to/egapass
--egapass should be a file containing a single line, which is the password for the ega-box account being used:
somePassword012!
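One way to create this file while keeping the password out of your shell history, sketched here with an example path and owner-only permissions (neither is a requirement of the pipeline):
# read the password silently from the terminal and write it to the file
read -rs EGA_PASSWORD && printf '%s\n' "$EGA_PASSWORD" > /absolute/path/to/egapass
chmod 600 /absolute/path/to/egapass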
If you don't need the metadata creation and already have Nextflow set up, then the encryption pipeline can also be run directly from Git without cloning the repo:
nextflow run https://git.ecdf.ed.ac.uk/igmmbioinformatics/ega-submission-via-portal ...
Outputs
The pipeline will create, for each fastq:
- an encrypted .gpg file
- a .md5 file, containing the MD5 checksum of the unencrypted data
- a .gpg.md5 file, containing the MD5 of the encrypted data
- a 'runs.csv' file, summarising each fastq pair with sample ID, filenames and MD5s
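As a quick sanity check before upload, the recorded checksums can be compared against freshly recomputed ones. A minimal sketch, assuming each .md5 file contains just the hex digest (the filename here is illustrative):
# recompute the MD5 of an encrypted file and compare it to the recorded one
diff <(cat output/sample_R1.fastq.gz.gpg.md5) \
     <(md5sum output/sample_R1.fastq.gz.gpg | cut -d' ' -f1)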
Running metadata creation
The script is run like:
python ega_metadata.py [universal arguments] <stage> [step-specific arguments]
Arguments can be specified on the command line or in a config file. The script will look for a config file specified in the environment variable 'EGA_UPLOAD_CONFIG', or 'ega_upload.yaml' in the current folder. Universal arguments include the XML output folders; other available arguments depend on which stage you're running.
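For example (the path here is hypothetical), pointing the script at a config file in a non-default location:
# use a config file from a custom location
export EGA_UPLOAD_CONFIG=/path/to/my_ega_upload.yaml
python ega_metadata.py createstudy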
Each '--' argument on the command line has an identically-named option in the config file. For example, a study title can be specified as --study_title on the command line, or study_title in the config file. For more information on what arguments can be specified, check the test config file in metadata/tests/ega_upload.yaml or use the '--help' option on the command line:
python ega_metadata.py createstudy --help
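To illustrate the equivalence (the title value here is made up), the same study title can be given either way:
# on the command line:
python ega_metadata.py createstudy --study_title 'Example WGS study'
# or in ega_upload.yaml (same key name, without the leading dashes):
cat > ega_upload.yaml <<'EOF'
study_title: 'Example WGS study'
EOF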
The metadata creation is split into multiple stages:
- createdac: Creates a data access committee. If you're using a pre-existing data access policy, you do not need to run this step.
- createpolicy: Creates a policy object. Again, if you're using a pre-existing one, you don't need to run this step.
- createstudy: Creates a study object. This will hold your abstract and study title.
- createsamples: Creates sample objects. These will hold your sample IDs and sample-based metadata such as sex and species.
- createrunsandexperiments: Creates interlinked run and experiment objects. Runs link your samples to raw sequencing files/checksums, and experiments hold metadata about sample preparation and sequencing. If you're splitting your upload into batches, each run-experiment pair represents a batch.
- createdataset: The final stage, creating the dataset object. This holds your dataset title, your data access policy and all of your runs in one object, linking everything together.
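Putting the stages together, a typical end-to-end order when reusing an existing DAC and policy (so the first two stages are skipped) might look like this; step-specific arguments are omitted and will depend on your submission:
# sketch of one possible stage order
python ega_metadata.py createstudy
python ega_metadata.py createsamples
python ega_metadata.py createrunsandexperiments
python ega_metadata.py createdataset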
Each step produces an XML file that can be submitted to the EGA portal via your ega-box:
curl -u ega-box-001:egaboxpassword -F "SUBMISSION=@submission.xml" -F "STUDY=@study.xml" https://www.ebi.ac.uk/ena/submit/drop-box/submit
There is also a dev server at https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit.
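It can be worth testing a submission against the dev server first; the same curl call, pointed at the dev endpoint:
curl -u ega-box-001:egaboxpassword -F "SUBMISSION=@submission.xml" -F "STUDY=@study.xml" https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit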
Each submission produces an XML receipt, and these should always be kept. The receipts contain the accession IDs that EGA assigns to each object you submit, and it is not possible to recover these later. The final createdataset stage requires EGA accession IDs for your run and policy objects in order to complete.
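One hedged convenience for pulling accessions back out of a saved receipt, assuming the receipt records them as accession="..." attributes:
# list accession attributes recorded in a saved receipt
grep -o 'accession="[^"]*"' receipt.xml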
Uploading in batches
The ega-box FTP servers are soft-limited to 8TB and hard-limited to 10TB. If your dataset is bigger than this, it will be necessary to upload in batches. To do this:
- Check how much data you have to submit in total, e.g. du -Lhs results.
- If this is larger than 10TB, use --nbatches at createrunsandexperiments to split your files into multiple runs and experiments. For example, 15TB can be split into two batches with createrunsandexperiments --nbatches 2 (see the sketch after this list).
- For each run object generated, upload the .gpg and .md5 files that it describes.
- Submit the run and its associated experiment object. This will start EGA's archiving process and clear the ega-box when complete.
- Repeat for each batch.
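A sketch of the batch workflow for a two-batch split; the upload step itself uses whatever transfer method your ega-box supports (e.g. Aspera or FTP) and is not shown:
# outline of a two-batch submission
du -Lhs results                                              # total data size
python ega_metadata.py createrunsandexperiments --nbatches 2 # split into two run/experiment pairs
# per batch: upload that run's .gpg and .md5 files, submit the run and
# experiment XMLs, wait for archiving to clear the ega-box, then repeat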
Credits
Alison Meynert (alison.meynert@ed.ac.uk)
Murray Wham (murray.wham@ed.ac.uk)