# Standard operating procedure - Transfer of whole exome sequencing samples from Edinburgh Genomics to Edinburgh Parallel Computing Centre

This SOP applies to the transfer of files generated by whole exome sequencing at Edinburgh Genomics (EdGE) into storage at the Edinburgh Parallel Computing Centre (EPCC). Files are downloaded to /scratch/u035/project/trio_whole_exome/data. Scripts are version controlled on the University of Edinburgh gitlab server gitlab.ecdf.ed.ac.uk/igmmbioinformatics/trio-whole-exome. Request access by e-mail: alison.meynert@igmm.ed.ac.uk.

## User requirements

The user will need an account on the EPCC Ultra system. Contact Donald Scobbie (d.scobbie@epcc.ed.ac.uk) for any issues.

The user will also need to have an account on the EdGE transfer server and be authorized for the relevant project. Contact Javier Santoyo-Lopez (javier.santoyo@ed.ac.uk) for any issues.

Each user should create a BASH script file at /home/u035/project/scripts that exports the environment variables ASPERA_SCP_USER, ASPERA_SCP_PASS and ASPERA_SCP_SERVER, using their own EdGE username and password, e.g.

```
#!/usr/bin/bash

export ASPERA_SCP_USER=ameynert
export ASPERA_SCP_PASS=ExamplePassword
export ASPERA_SCP_SERVER=transfer.genomics.ed.ac.uk
```

This script will be referred to as <user_transfer_info_file>.
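Because this file stores a password in plain text, it is worth restricting its permissions so that only its owner can read it. The following is a minimal sketch, not part of the project scripts: the function name `secure_transfer_info` is hypothetical, and it assumes the download job runs under the same user account that owns the file.

```shell
#!/usr/bin/bash

# Hypothetical helper (not part of the project scripts): make a
# credentials file readable and writable by its owner only.
secure_transfer_info() {
  local info_file="$1"
  if [ ! -f "$info_file" ]; then
    echo "No such file: $info_file" >&2
    return 1
  fi
  chmod 600 "$info_file"
}

# Example (substitute the real path to your own file):
#   secure_transfer_info /home/u035/project/scripts/my_transfer_info.sh
```

The file name `my_transfer_info.sh` in the example comment is illustrative only.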

## Software requirements

The Aspera command line interface software (version 3.7.4.147727) is installed in the project area at /home/u035/project/software/aspera/connect/bin. MD5 checksums are calculated with the system utility md5sum.

## Notes

1. In all steps below, when log files are complete and no longer in use, move them into the sub-directory ‘trio_whole_exome/logs/complete’. This “clean desk” policy makes it easier to see which jobs are currently running and which log files require examination.

2. The tracking file is maintained on the IGMM datastore at /exports/igmm/datastore/IGMM-VariantAnalysis/documents/trio_whole_exome/Batch_status.xlsx.

## Procedure

1. When a batch of sample files is ready for transfer, EdGE will send an e-mail to all users authorized to transfer data from the EdGE server. Add the project name (e.g. 12345_Lastname_Firstname) and any dated subfolders indicating batch (format YYYYMMDD) to the Project and Batch fields of the tracking file. The e-mail will contain a link to the EdGE Aspera server of the form:

https://transfer.genomics.ed.ac.uk:/12345_Lastname_Firstname/raw_data

2. Log in to the EPCC Ultra system.

```
ssh <user>@ultra.epcc.ed.ac.uk
```

3. Download the files using the provided job submission script. The <user_transfer_info_file> contains the user’s EdGE login credentials as described above. <EdGE_project_id> is the project id e.g. 12345_Lastname_Firstname.

Notes:
a. The script is set to request 24 hours of wall time. Transfers take roughly 5 hours per TB. Adjust the parameter -l walltime as necessary.
b. If a download does not complete, this step may be run again and the download will resume from where it left off. If there are any files with the extension .partial in the target directory, the download did not complete.

```
cd /scratch/u035/project/trio_whole_exome/logs
project_id=<EdGE_project_id>
transfer_sh=<full_path_to_user_transfer_info_file>
qsub -v TRANSFER_INFO_FILE=$transfer_sh,PROJECT=$project_id \
  -N aspera.$project_id \
  /home/u035/project/scripts/submit_trio_wes_aspera_download.sh
```
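The two notes above can be sketched as small helper functions. This is an illustrative sketch only: the function names `estimate_walltime` and `list_partial_files` are hypothetical and are not part of the project scripts. The estimate uses the rough figure of 5 hours per TB given above and rounds up to whole hours, without any safety margin.

```shell
#!/usr/bin/bash

# Hypothetical helper: estimate a walltime string (HH:MM:SS) from the
# batch size in TB, using the rough figure of 5 hours per TB and
# rounding up to the next whole hour.
estimate_walltime() {
  local size_tb="$1"
  awk -v tb="$size_tb" -v rate=5 'BEGIN {
    hours = tb * rate
    whole = int(hours)
    if (whole < hours) whole++
    printf "%02d:00:00\n", whole
  }'
}

# Hypothetical helper: list any .partial files left in a download
# directory. Any output means the download did not complete.
list_partial_files() {
  local target_dir="$1"
  find "$target_dir" -type f -name '*.partial'
}

# Example (assumes project_id is set as in step 3):
#   estimate_walltime 2.5
#   list_partial_files /scratch/u035/project/trio_whole_exome/data/$project_id
```

If `list_partial_files` prints nothing, no incomplete files remain in the target directory.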

4. Confirm the md5 checksums. The output file has one line per downloaded file, and each line should end with the text ‘OK’. A line not ending in ‘OK’ indicates that the transfer of that file has failed. The command below prints only the lines that do not end in ‘OK’, so no output means that every file passed.

```
grep -v 'OK$' /scratch/u035/project/trio_whole_exome/data/$project_id/md5_check.txt
```

If any files in a download do not have the correct md5 checksum, manually delete the affected files and return to step 3, which will re-download the missing files and re-run the md5 checking. When all md5 checksums are correct, enter today’s date in the ‘Downloaded’ field for the project in the tracking file.
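To see which files need to be deleted, the failing lines can be reduced to bare file names. This is a sketch under an assumption: it presumes that md5_check.txt follows the output format of `md5sum -c`, i.e. `<file>: OK` or `<file>: FAILED`, which this SOP does not state explicitly. The function name `failed_files` is hypothetical. It only prints names, so that deletion remains a deliberate manual step.

```shell
#!/usr/bin/bash

# Hypothetical helper: print the names of files whose checksum line
# does not end in OK. Assumes "md5sum -c" style lines: "<file>: <status>".
failed_files() {
  local check_file="$1"
  # Keep lines not ending in OK, then strip the trailing ": <status>".
  grep -v 'OK$' "$check_file" | sed 's/: [^:]*$//'
}

# Example (assumes project_id is set as in step 3):
#   failed_files /scratch/u035/project/trio_whole_exome/data/$project_id/md5_check.txt
```

Review the printed list before deleting anything, since the paths are relative to wherever the checksums were verified.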

5. Clean up the logs directory.

```
cd /scratch/u035/project/trio_whole_exome/logs
mv aspera.$project_id.* complete/
mv project/$project_id.log complete/
```