S3
S3 is an object store that first became available in Amazon AWS, but is now available on-premise through services providing S3-like interfaces.
By storing the input and output data for your commands in S3, you can build a more “cloud-like” workflow on-premise, making any subsequent shift to the cloud easier, or potentially making better use of an on-premise OpenStack cluster.
wr add has a --mounts option that lets you mount S3 buckets prior to executing your commands. --mounts is a convenience option for the common case where you wish the contents of one or more remote directories to be accessible from the command's working directory. For anything more complicated you'll need to use --mount_json. You can't use both --mounts and --mount_json at once.
Note
For mounting to work, you must be able to carry out FUSE mounts, which means fuse-utils must be installed and /etc/fuse.conf should have 'user_allow_other' set. An easy way to enable it is to run:
sudo perl -i -pne 's/#user_allow_other/user_allow_other/;' /etc/fuse.conf
wr will attempt to do this for you on any OpenStack instances it creates.
You will also need a working ~/.s3cfg or ~/.aws/credentials file: if a tool like s3cmd works for you, so will wr's mounting. wr cloud deploy will by default copy your credential files over to created instances, so they will have S3 access.
The format of --mounts is a comma-separated list of [c|u][r|w]:[profile@]bucket[/path] strings. A first character of 'c' turns on caching, while 'u' means uncached. A second character of 'r' means read-only, while 'w' means writeable (only one of the mounts may be writeable). After the colon you can optionally specify a profile name followed by the @ symbol, then the required remote bucket name and, ideally, the path to the deepest subdirectory that contains the data you wish to access.
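For instance, a pair of mounts could be composed like this (the 'backup' profile name and all bucket paths here are made up for illustration):

```shell
# 'cr' = cached, read-only, using the hypothetical 'backup' profile from ~/.s3cfg;
# 'uw' = uncached, writeable (at most one mount may be writeable).
MOUNTS='cr:backup@mybucket/deepest/subdir,uw:outbucket/results'
# You would then pass this as: wr add --mounts "$MOUNTS"
echo "$MOUNTS"
```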
The format of --mount_json is a JSON string for an array of Config objects describing all your mount parameters. A JSON array begins and ends with square brackets, with items separated by commas. A JSON object is written with curly braces; parameter names and their values are put in double quotes (except for numbers, which are left bare; booleans, which are written unquoted as true or false; and arrays, as described previously), with a colon between each name and its value and commas between the pairs. For example (all on one line):
--mount_json '[{"Mount":"/tmp/wr_mnt","Targets":[{"Profile":"default","Path":"mybucket/subdir","Write":true}]}]'
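To see the structure of that one-line string more clearly, you can pretty-print it (python3 is used here purely for display; it is not needed by wr):

```shell
echo '[{"Mount":"/tmp/wr_mnt","Targets":[{"Profile":"default","Path":"mybucket/subdir","Write":true}]}]' \
  | python3 -m json.tool
```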
Tip
You can use wr mount to manually test your mount options before using them with wr add. After mounting, test the mount with e.g. ls mnt if using the default mount location. wr mount runs in the background, and it's important to cleanly terminate it by sending it SIGTERM with e.g. kill after your test.
wr mount -h goes into detail about all the options possible in the JSON.
Mounts can be configured in a number of different ways, and they make the commands you add cleaner and simpler.
For example, instead of doing something like (on an image where s3cmd has been installed and configured):
echo 's3cmd get s3://inbucket/path/input.file && myexe -i input.file > output.file && s3cmd put output.file s3://outbucket/path/output.file' | wr add
You could instead (without requiring s3cmd to be installed):
echo 'myexe -i inputs/input.file > outputs/output.file' | wr add --mount_json '[{"Mount":"inputs","Targets":[{"Path":"inbucket/path"}]},{"Mount":"outputs","Targets":[{"Path":"outbucket/path","Write":true}]}]'
Or even nicer:
echo 'myexe -i input.file > output.file' | wr add --mounts 'ur:inbucket/path,cw:outbucket/path'
(Note that for direct use as a working directory like this, we ought to enable caching on the writable target. Without caching we can only do serial writes and for more complicated commands things may not work as expected.)
You could have a text file with many of these short and sweet command lines, and specify the --mounts just once as an option to wr add. The performance will also be higher than when using s3cmd or s3fs et al.
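As a sketch, such a batch might look like this (the file and executable names are made up, and the mounts option is shared across every job):

```shell
# A hypothetical batch of jobs, one command per line, all using the same mounts:
printf '%s\n' \
  'myexe -i input1.file > output1.file' \
  'myexe -i input2.file > output2.file' \
  > commands.txt
# Add every line as a separate job, specifying --mounts just once:
# wr add -f commands.txt --mounts 'ur:inbucket/path,cw:outbucket/path'
wc -l < commands.txt
```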
If an individual command will read the same data multiple times, enable per-command caching (which gets deleted once the cmd completes):
--mounts 'cr:inbucket/path'
If multiple different commands could run on the same machine and access the same data, put the cache in a fixed location (where it won't ever get deleted by wr; be careful about doing this for writable mounts! This is also slower than the previous scenario if you don't read whole files):
--mount_json '[{"Targets":[{"Path":"inbucket/path","CacheDir":"/tmp/mounts_cache"}]}]'
Note
Do not try to mount them at the same location: it won't work! Give them unique mount points, but the same cache location.
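For example (bucket names hypothetical), two mounts at unique mount points sharing one fixed cache directory could be configured like this:

```shell
MOUNT_JSON='[{"Mount":"mnt_a","Targets":[{"Path":"bucketA/data","CacheDir":"/tmp/mounts_cache"}]},{"Mount":"mnt_b","Targets":[{"Path":"bucketB/data","CacheDir":"/tmp/mounts_cache"}]}]'
# You would then pass this as: wr add --mount_json "$MOUNT_JSON"
# Sanity-check that the string is valid JSON (python3 used purely for checking):
echo "$MOUNT_JSON" | python3 -m json.tool > /dev/null && echo "valid JSON"
```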
Unlike s3cmd, wr's mount options support “profiles”, useful if you need to mount multiple buckets that have different configurations. In your ~/.s3cfg file, after the [default] section, add more named sections with the necessary settings, then select the section (or “profile”) to use by saying profile@bucket when specifying your bucket, where 'profile' is the name of the desired section.
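As a sketch, a ~/.s3cfg with a second profile might look like this (the 'archive' profile name, keys and hosts are all made up; the heredoc just writes an example file to inspect):

```shell
cat > s3cfg.example <<'EOF'
[default]
access_key = MYACCESSKEY
secret_key = mysecret
host_base = cog.example.com

[archive]
access_key = OTHERACCESSKEY
secret_key = othersecret
host_base = archive.example.com
EOF
# The second section would then be selected per-mount with e.g.:
# wr add --mounts 'cr:archive@somebucket/path'
grep -c '^\[' s3cfg.example
```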
Note
If you turn on caching and find that commands that read lots of files fail because they're unable to open or read a file, it could be that you are exceeding your open file limit. Increase your limit to some very high value like 131072 by e.g. doing echo 131072 > /proc/sys/fs/file-max. If using OpenStack, this may mean having to create a new image, or using a --cloud_script that sets the limit high.
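A minimal sketch of such a script (the limits.sh name is made up; it would need to run as root on the instance):

```shell
cat > limits.sh <<'EOF'
#!/bin/sh
# Raise the system-wide open file limit:
echo 131072 > /proc/sys/fs/file-max
EOF
# You would then pass it with e.g.: wr cloud deploy --cloud_script limits.sh
cat limits.sh
```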
Sanger Institute users
If you work at the Sanger Institute, here are some tips.
Your ~/.s3cfg file should look something like:
[default]
access_key = MYACCESSKEY
secret_key = mysecret
encrypt = False
host_base = cog.sanger.ac.uk
host_bucket = %(bucket)s.cog.sanger.ac.uk
progress_meter = True
use_https = True
NPG have put together a public bucket containing lots of reference-related files that you might want to use. E.g. if you will run samtools to do something with CRAM files you might:
--mount_json '[{"Targets":[{"Path":"inbucket/path"},{"Path":"npg-repository","CacheDir":"/tmp/mounts_cache"}]}]'
And then in the JSON you supply to wr add -f, say something like:
{"cmd":"samtools view ...","env":["REF_PATH=cram_cache/%2s/%2s/%s"]}
Inside the npg-repository bucket you'll also find reference indexes for use by bwa, samtools and other software. For tools like samtools that need the index file and the original fasta file in the same directory, you can take advantage of the multiplexing possible in --mounts:
--mounts 'ur:inbucket/path,cr:npg-repository/references/Homo_sapiens/GRCh38_15_noEBV/all/fasta,cr:npg-repository/references/Homo_sapiens/GRCh38_15_noEBV/all/samtools'
(Now your cmd will see Homo_sapiens.GRCh38_15_noEBV.fa and Homo_sapiens.GRCh38_15_noEBV.fa.fai in the current directory, along with your input files.)
iRODS @ Sanger
For Sanger Institute users who need to process data in OpenStack that is stored in iRODS, your best bet is probably to copy the data to S3 first, and then use S3 mounts as described above.
Because putting files in to S3 (ceph) happens at about 40MB/s from an OpenStack node but only about 20MB/s from a farm node (while reading from iRODS is a similar speed from both), you may prefer to do these copy jobs in OpenStack. That means bringing up instances with the iRODS clients installed and authentication sorted out.
The following guide assumes you have non-interactive (non-Kerberos) authentication already configured and working on the farm.
First, find a recent image that has the iRODS client installed, such as bionic-WTSI-irodsclient_e49001.
Now create an OpenStack-specific version of the environment file that excludes any local paths:
grep -Ev "plugins|certificate" ~/.irods/irods_environment.json > ~/.irods/irods_environment.json.openstack
One time only, we need to create an OpenStack-specific iRODS authentication file:
1. wr cloud deploy --os "bionic-WTSI-irodsclient" --config_files '~/.irods/irods_environment.json.openstack:~/.irods/irods_environment.json'
2. ssh -i ~/.wr_production/cloud_resources.openstack.key ubuntu@[ip address from step 1]
3. iinit [enter your password and then, as quickly as possible - time is important - carry out steps 5-7]
4. exit
5. sftp -i ~/.wr_production/cloud_resources.openstack.key ubuntu@[ip address from step 1]
6. get .irods/.irodsA
7. exit
8. mv .irodsA ~/.irods/.irodsA.openstack
9. wr cloud teardown
From now on, when we wish to do iRODS -> S3 copy jobs, we just have to be sure to copy over these irods files to the servers we create, and use the right image, eg.:
wr cloud deploy --config_files '~/.irods/irods_environment.json.openstack:~/.irods/irods_environment.json,~/.irods/.irodsA.openstack:~/.irods/.irodsA,~/.s3cfg'
echo "iget /seq/123/123.bam" | wr add --mounts 'cw:s3seq/123' --cloud_os "bionic-WTSI-irodsclient"
(Note that this doesn’t work without caching turned on because random writes are not supported without caching.)