Unimelb command-line clients
These are a collection of command-line clients for Mediaflux developed at the University of Melbourne. They are mainly focused on uploading and downloading data to and from a Mediaflux server, and on verifying that an upload or download completed successfully. They are written in Java and communicate with Mediaflux over HTTPS, which is secure, provides excellent data integrity guarantees, and allows uploads and downloads to be efficient and restartable. Checksum checking can optionally be enabled; this verifies successful transfers by comparing a CRC32 checksum computed on the local and remote sides.
The main clients are:
unimelb-mf-download provides efficient, restartable download to the local host from Mediaflux
unimelb-mf-upload provides efficient, restartable upload from the local host to Mediaflux
unimelb-mf-check provides efficient directory comparison/verification between the local host file system and Mediaflux (i.e. check that the source and destination are the same).
Additional items:
mexplorer - a shell wrapper for Mediaflux Explorer to make it easy to launch from the command line
aterm - a shell wrapper for Mediaflux aterm to make it easy to script from the command line
aterm-gui - a shell wrapper for Mediaflux aterm to make it easy to launch from the command line
aterm-import - a shell wrapper for the aterm import command (run aterm help import to see all options)
aterm-download - a shell wrapper for the aterm download command (run aterm help download to see all options)
other wrapper scripts for each platform. See the ./bin/unix directory for macOS and Linux, and the ./bin/windows directory for Windows
University of Melbourne Spartan Users
Please note that these clients are pre-configured and available on the University's Spartan service.
Obtaining the unimelb clients
Download from the GitLab page by selecting the release for your platform (Windows/macOS/Linux) which includes an embedded Java distribution (recommended), or get the pure Java release which requires a local Java distribution.
Install
Unpack the downloaded zip file by double clicking on it via a GUI (or using a CLI unzip tool such as unzip on Unix systems)
Move the unpacked directory (named like unimelb-mf-clients-0.3.8) to wherever you'd like to keep it long term. E.g. the Desktop for Windows, or a ~/bin folder (on Unix systems like Linux and macOS)
Configure
Access to the Mediaflux server is managed via a configuration file which you must create and populate appropriately.
Add to your PATH
Optionally, you can add the commands to your PATH so that you can run them from any directory.
Using The Clients
Execute the client of interest on the command line and supply it the arguments that you need.
On Windows, the clients can be run from Windows PowerShell or the Command Prompt. You can start these by pressing the Start button and typing powershell or cmd, respectively.
On macOS, the clients can be run from the Terminal (Applications -> Utilities -> Terminal).
On Linux, you can execute the clients from any terminal or shell prompt. Linux commonly allows you to launch a terminal with ctrl-alt-t.
You can find some examples of usage here:
unimelb-mf-download provides efficient, restartable download to the local host from Mediaflux
unimelb-mf-upload provides efficient, restartable upload from the local host to Mediaflux
unimelb-mf-check provides efficient directory comparison/verification between the local host file system and Mediaflux (i.e. check that the source and destination are the same).
and also see the Documentation in the source repository for examples.
Add to PATH
Optionally, you can add the commands to your PATH so that you can run them from any directory.
Windows
Add the location of the command line clients to your Path, using the System Properties control panel. For example, if you extracted the clients to your Desktop folder, you might add %USERPROFILE%\Desktop\unimelb-mf-clients-0.3.8\bin\windows to your Path. This will allow you to run the commands from any folder without specifying the path to the binary.
Click the start button, type env and run Edit the environment variables for your account.
Under User variables for <username>, click the Path entry and click Edit.... On Windows 10, you can add an additional line; on Windows 7, append the new path using a semicolon (;) as a separator.
macOS
To add the location of the command line clients to your $PATH, edit the .bash_profile file in your home directory with TextEdit:
Run TextEdit
File -> Open
Press the keyboard shortcut Command-Shift-H to go to your Home directory
Press the keyboard shortcut Command-Shift-Period to show hidden files and directories
Open the .bash_profile file, or create one with the New Document button (see Configuration File for instructions on making it plain text, etc.)
Add a line (making sure it matches the location you extracted the zip file):
export PATH=~/bin/unimelb-mf-clients-0.3.8/bin/unix:$PATH
Save the file
Close and re-open the terminal to re-load the changes to your PATH
Linux
Add the location of the command line clients to your $PATH. This will allow you to run the commands from any folder without specifying the path to the binary.
Edit your .bashrc file:
nano ~/.bashrc
Add a line to the bottom of the file:
export PATH=$PATH:~/bin/unimelb-mf-clients-0.3.8/bin/unix
You will need to log out and log back in to pick up the changes to your PATH
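After editing your shell startup file, you can confirm the clients directory is actually on your PATH. This is a minimal sketch assuming the 0.3.8 release extracted under ~/bin; adjust the path to match your installation:

```shell
# Append the clients directory to PATH for the current shell session,
# then count how many PATH entries mention the clients directory
export PATH="$PATH:$HOME/bin/unimelb-mf-clients-0.3.8/bin/unix"
echo "$PATH" | tr ':' '\n' | grep -c 'unimelb-mf-clients'
```

If the count is at least 1, the directory is on your PATH for this session.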
unimelb-mf-upload
This is a command-line Java application that you can use to efficiently upload your data into Mediaflux and perform integrity checks. Installation instructions are available in the parent page.
This client can
upload files in parallel (--nb-workers). There is no magic in this; it will only go faster if there is sufficient network capacity, so please don't use more than 4 upload threads. You may even find that if the network is heavily congested, 4 threads is no faster than 1. You may have to experiment a little to find the optimum.
compute checksums for additional validation (see below)
write a log file of the upload
generate and email a summary of the upload (including successful and failed uploads, and the number of zero-sized files it encountered)
run in daemon mode (in the background) so it keeps uploading new data to Mediaflux as it arrives in your local file system
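Since the upload summary reports the number of zero-sized files encountered, it can be useful to list any empty files locally before uploading. A small sketch using find's -empty test (available in GNU and BSD find):

```shell
# Demo: create one empty and one non-empty file, then list only the empty one
mkdir -p /tmp/upload-demo
: > /tmp/upload-demo/empty.dat           # zero bytes
printf 'x' > /tmp/upload-demo/full.dat   # one byte
find /tmp/upload-demo -type f -empty     # prints only empty.dat
rm -rf /tmp/upload-demo
```

Run the find command against your own source directory (without the demo setup) to see what the summary will count.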
Please see all command-line arguments with the --help switch
Here are all the details for the command-line arguments to this client.
Pre-existing files
The client checks whether files already exist in Mediaflux. If they do, it skips the upload. The checks it uses are:
File path/name exists and is the same
File size is the same
If checksums are enabled, the checksum is the same
If any of these checks fail, the file is not considered to pre-exist and will be uploaded. If the path/name matches but the content of the source file has changed, it will be uploaded to the pre-existing asset in Mediaflux as a new version.
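The decision above can be pictured with a small sketch. This is purely illustrative — the real comparison happens inside the client against Mediaflux asset metadata — but it shows the shape of the logic: matching path plus matching size means skip, anything else means upload.

```shell
# Illustrative only: compare a local file's size against the size we pretend
# Mediaflux reports for the same path; a mismatch means (re-)upload.
local_file=/tmp/preexist-demo.txt
printf 'same content\n' > "$local_file"
remote_size=13                        # hypothetical value from the Mediaflux asset
local_size=$(wc -c < "$local_file")
if [ "$local_size" -eq "$remote_size" ]; then
  echo "skip: file pre-exists"
else
  echo "upload: new version"
fi
rm -f "$local_file"
```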
Checksums
Checksums (a unique number computed from the contents of a file) are an important data integrity mechanism. The Mediaflux server computes a checksum for each file it receives. The upload client can compute checksums from the source data on the client side and compare with the checksum computed by the server when it receives the file. If the checksums match, we can be very confident that the file uploaded correctly. Many other clients for other protocols (e.g. sFTP and SMB) do not do this.
By default, checksums are not enabled (because computing checksums slows down the upload process). However, it is strongly recommended that you enable these during the upload or run the checker client unimelb-mf-check with checksums to check the upload afterwards.
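Since Mediaflux uses CRC32, you can also compute a CRC32 locally yourself if you ever want to compare a file against an asset's stored checksum by hand. This sketch assumes python3 is available on your PATH; the uppercase-hex formatting is a choice here, so check how your Mediaflux interface displays checksums before comparing:

```shell
# Compute a zlib CRC32 of a local file from the shell (assumes python3)
printf 'hello world\n' > /tmp/crc-demo.txt
python3 -c 'import sys, zlib
data = open(sys.argv[1], "rb").read()
print("%08X" % (zlib.crc32(data) & 0xFFFFFFFF))' /tmp/crc-demo.txt
rm -f /tmp/crc-demo.txt
```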
Case 1 - Files DO NOT pre-exist on Mediaflux
When you enable checksums and the data DO NOT already exist on the server, the client will compute the checksum as part of the upload process. When Mediaflux creates the asset, it will also compute the checksum, and the two checksums will be compared.
Case 2 - Files DO pre-exist on Mediaflux
When you enable checksums and the data DO already exist on the server (by path/name and size), the client will first compute the checksum of the local file and compare it with the checksum already stored in Mediaflux.
If the checksums differ, the local file has changed, so the client re-uploads it (following the process in Case 1 above) and creates a new asset version. In total, two checksums are computed by the client and one by the server.
Examples
You will need to know where (the path) to locate your data in Mediaflux (the --dest argument of the command) and where to upload from (the last positional argument)
Example 1 - parallel upload with checksum check
Upload data with four worker threads and turn on checksums for upload integrity checking (recommended). As the location of the config file is not specified, the client will look for it in the .Arcitecta directory of your home directory.
unimelb-mf-upload --csum-check --nb-workers 4 --dest /projects/proj-myproject-1128.1.59/12Jan2018 /data/projects/punim0058
Example 2 - using a configuration file
Upload data with one worker thread and specify explicitly where the configuration file is.
unimelb-mf-upload --mf.config /Users/nebk/.Arcitecta/mflux.cfg --dest /projects/proj-myproject-1128.1.59/12Jan2018 /data/projects/punim0058
The Configuration File might look like this:
host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O
Scheduled uploads
If you have a location that should be uploaded on a regular schedule such as an instrument PC that saves data to a given directory on the local computer, you can schedule uploads with unimelb-mf-upload. It is best to request an upload token if you want to do this as the credential will be stored on the computer that is doing the uploads. Contact Research Computing Services to request a token.
Windows
In this example:
we will put the unimelb-mf-client files in the %HOMEPATH%\Documents directory
we will save logs to the %HOMEPATH%\Documents\logs directory
we will put the configuration file in the %HOMEPATH%\Documents directory
Download from the GitLab page, selecting the Windows 64bit release. Extract the zip file to %HOMEPATH%\Documents.
Create a Configuration File. In this case we are going to use a secure token. In our example, it will be stored in %HOMEPATH%\Documents\mflux.cfg.
host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O
Create a batch file to perform the upload using Notepad. In our example, it will be stored in %HOMEPATH%\Documents\upload.bat:
%HOMEPATH%\Documents\unimelb-mf-clients-0.7.7\bin\windows\unimelb-mf-upload --mf.config %HOMEPATH%\Documents\mflux.cfg --log-dir %HOMEPATH%\Documents\logs --dest /projects/proj-demonstration-1128.4.15 %HOMEPATH%\Documents\data-to-upload
Schedule the upload using Windows Task Scheduler.
Click the start button and start typing Task Scheduler and select it from the Start Menu when it appears.
Click on the Task Scheduler Library, then right click on the space and choose Create Basic Task... from the menu.
Give your task a name and description, then click Next >
choose a start date and time and click Next >
choose Start a program and click Next >
click the Browse button and find the script you created above.
Click Next > and then check the Open the Properties dialog for this task when I click Finish box, then click Finish.
Under Security options, choose which user you would like the task to run under. You may wish to make it so the scheduled job will run even if the user is not logged in.
Linux
In this example:
we will put the unimelb-mf-client files in the ~/bin directory
we will save logs to the ~/logs directory
we will put the configuration file in the ~/.Arcitecta directory
Download from the GitLab page, selecting the Linux 64bit release. Extract the zip file to ~/bin.
Create a Configuration File. In this case we are going to use a secure token. In our example, it will be stored in ~/.Arcitecta/mflux.cfg.
host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O
Create a shell script to perform the upload using the text editor of your choice. In our example, it will be stored in ~/bin/upload.sh:
#!/bin/bash
~/bin/unimelb-mf-clients-0.7.4/bin/unix/unimelb-mf-upload --mf.config ~/.Arcitecta/mflux.cfg --log-dir ~/logs --dest /projects/proj-demonstration-1128.4.15 ~/data-to-upload
On Linux there's typically two options for scheduling tasks: cron and systemd timers. In this example, we will use a cron job.
Edit the crontab file with the following command:
crontab -e
Create a new scheduled task at the end of the crontab file. To see documentation on the format, try the man 5 crontab command. In our example, we will run the command once per day at 1 am local time.
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
0 1 * * * $HOME/bin/upload.sh
Save the file, and your job will be scheduled.
crontab: installing new crontab
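One caveat with cron: if an upload takes longer than the schedule interval, a second run can start while the first is still going. A hedged sketch of one way to prevent this, using flock from util-linux (not part of unimelb-mf-clients) inside the upload script:

```shell
#!/bin/bash
# Sketch: take an exclusive non-blocking lock so overlapping cron invocations
# exit early instead of uploading concurrently.
LOCKFILE=/tmp/unimelb-upload.lock
(
  flock -n 9 || { echo "previous run still active; skipping"; exit 0; }
  echo "upload would run here"   # replace with your unimelb-mf-upload command
) 9>"$LOCKFILE"
```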
macOS
In this example:
we will put the unimelb-mf-clients in the ~/Applications folder
we will save logs to the ~/Documents/logs folder
we will put the configuration file in the ~/.Arcitecta folder
Download from the GitLab page, selecting the Mac 64bit release. Extract the tar.gz file by double-clicking on it. It will be extracted to a folder in your Downloads folder, so move it to the ~/Applications folder.
Create a Configuration File. In this case we are going to use a secure token. In our example, it will be stored in ~/.Arcitecta/mflux.cfg.
host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O
Create a shell script to perform the upload using the text editor of your choice. In our example, it will be stored in ~/bin/upload.sh:
#!/bin/bash
~/Applications/unimelb-mf-clients-0.7.4/bin/unix/unimelb-mf-upload --mf.config ~/.Arcitecta/mflux.cfg --log-dir ~/Documents/logs --dest /projects/proj-demonstration-1128.4.15 ~/data-to-upload
Edit the crontab file with the following command. By default the vim text editor will be used.
crontab -e # this will use the default text editor, usually vim
# if you would prefer to use the pico text editor, use the following command instead:
EDITOR=/usr/bin/pico crontab -e
Create a new scheduled task at the end of the crontab file. To see documentation on the format, try the man 5 crontab command. In our example, we will run the command once per day at 1 am local time.
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
0 1 * * * $HOME/bin/upload.sh
Save the file, and your job will be scheduled.
crontab: installing new crontab
Troubleshooting upload issues caused by special files
Sparse files
Sparse files are files that have large sections of unallocated data. They are commonly used in Linux/Unix systems. Sparse files use storage efficiently when the files have a lot of holes (contiguous ranges of bytes having the value of zero) by storing only metadata for the holes instead of using real disk blocks.
Sparse files should either be excluded or compressed before uploading to Mediaflux, as the Mediaflux backend does not support sparse files and treats them as regular files. Uploading uncompressed sparse files is therefore a waste of storage space, and we have seen issues caused by very large sparse files.
Find sparse files
To find the sparse files in your file system, you can use the find command below:
find ./ -type f -printf "%S\t%p\n" | awk '$1 < 1.0 {print $2}'
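To see the check in action, you can create a deliberately sparse file and run the same pipeline over it. This is a sketch: %S (the sparseness ratio) is a GNU find feature, and the result depends on your filesystem actually leaving the truncated range unallocated:

```shell
# Create a 100 MB sparse file (apparent size 100 MB, no data blocks written),
# then detect it: %S is roughly (allocated blocks * 512) / apparent size,
# so a value below 1.0 indicates a sparse file.
truncate -s 100M /tmp/sparse-demo.bin
find /tmp -maxdepth 1 -type f -name 'sparse-demo.bin' -printf "%S\t%p\n" \
  | awk '$1 < 1.0 {print $2}'
rm -f /tmp/sparse-demo.bin
```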
Compress sparse files
If you are aware of sparse files in your local file system, you can run the following command to compress them before uploading to Mediaflux:
find ./ -type f -printf "%S\t%p\n" | awk '$1 < 1.0 {print $2}' | xargs -I {} sh -c "tar -Sczvf {}.tar.gz {}; rm -f {}"
Warning
The above command compresses the sparse files to *.tar.gz files, preserving their holes (the -S option to tar), and the original sparse files will be replaced.
DO NOT try it if you don't know what you are doing.
FIFO (Named Pipe)
A FIFO (First In First Out) is similar to a pipe. The principal difference is that a FIFO has a name within the file system and is opened in the same way as a regular file. A FIFO has a write end and a read end, and data is read from the pipe in the same order as it is written. A FIFO is also known as a named pipe on Linux.
FIFOs should not be uploaded to Mediaflux.
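To see why a FIFO is problematic for a file-transfer client, note that it holds no stored data: reading it blocks until some process writes into it. A small demo:

```shell
# Demo: a FIFO passes data from writer to reader in order; it stores nothing
# itself, which is why an upload client cannot treat it like a regular file.
mkfifo /tmp/demo.fifo
printf 'first\nsecond\n' > /tmp/demo.fifo &   # writer blocks until a reader opens
cat /tmp/demo.fifo                            # reads in the order written
wait
rm -f /tmp/demo.fifo
```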
Mediaflux Explorer
Uploading a FIFO causes Mediaflux Explorer (current version: v1.5.6) to crash.
unimelb-mf-upload (in unimelb-mf-clients)
Early versions (prior to v0.7.4) of unimelb-mf-upload also hang when uploading a FIFO.
From version v0.7.4 onwards, unimelb-mf-upload excludes FIFO files.
Find FIFO (Named Pipes)
The following command can be used to list the FIFO files in your file system:
find ./ -type p
unimelb-mf-download
This is a command-line Java application that you can use to efficiently download your data (folders recursively or individual files) from Mediaflux. Installation instructions are available in the Mediaflux Unimelb Command-Line Clients page.
You will need to know where your data are in Mediaflux (the last argument of the command) and where you want to locate (--out) the data on the destination computer.
This client can download files in parallel (--nb-workers). There is no magic in this; it will only go faster if there is sufficient network capacity, so please don't use more than 4 download threads. You may even find that if the network is heavily congested, 4 threads is no faster than 1. You may have to experiment a little to find the optimum.
The client can run in daemon (background) mode, whereby it will keep downloading data as it arrives in Mediaflux.
The client can synchronise from Mediaflux to the local file system and delete files that no longer exist in Mediaflux from the local file system.
Please see all command-line arguments with the --help switch
Examples
Example 1
Download data with one worker thread and skip pre-existing files, checking for files that pre-exist by their name and size only.
unimelb-mf-download --mf.config ~/.Arcitecta/mflux.cfg --out /data/projects/punim0058 /projects/proj-myproject-1128.1.59/12Jan2018
Example 2
Download data with four worker threads, overwrite pre-existing files, and check whether files pre-exist by their name, size and checksum (slower but safer). We don't need to specify the path to the config file as the client will look for it in the standard places.
unimelb-mf-download --overwrite --csum-check --nb-workers 4 --out /data/projects/punim0058 /projects/proj-myproject-1128.1.59/12Jan2018
unimelb-mf-check
This is a command-line Java application that you can use to check and compare assets in Mediaflux against files on the local file system. Installation instructions are available in the Mediaflux Unimelb Command-Line Clients page.
See all the details for the arguments to this client.
The client checks the equality of files (you can say which direction you are checking) by existence, name, size and optionally checksum. The client can produce a report for you in CSV format.
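Once you have the CSV report, you can filter it for problem rows with standard tools. The column names below are made up purely for illustration — inspect the header of your actual report first and adjust the column positions:

```shell
# Hypothetical report layout; the awk filtering technique is the point here
cat > /tmp/check-report.csv <<'EOF'
path,exists,size_match,checksum_match
/foo/a.txt,true,true,true
/foo/b.txt,true,false,false
EOF
# Print data rows where any check column is not "true"
awk -F, 'NR > 1 && ($2 != "true" || $3 != "true" || $4 != "true")' /tmp/check-report.csv
rm -f /tmp/check-report.csv
```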
Examples
Example 1
Compare in the downward direction (i.e. Mediaflux is the master)
unimelb-mf-check --mf.config ~/.Arcitecta/mflux.cfg --direction down --output ~/Documents/foo-download-check.csv ~/Documents/foo /projects/proj-myproj-1.2.3/foo
aterm-download
aterm-download is a wrapper script based on Arcitecta's aterm.jar that downloads data from Mediaflux. It is included in the unimelb-mf-clients software package.
Synopsis
aterm-download:
synopsis:
Exports one or more assets using a specified profile.
usage:
aterm-download [<args>] <file> [<create-args>]
arguments:
-lp <local profile>
[optional] A local profile (ecp) containing a specification for the export.
-mode [test|live]
[optional] Is this a test or a live export? Test export can be used to check whether a profile is correct. Defaults to 'live'.
-ncsr <nb>
[optional] The number of concurrent server requests. A number in the range [1,infinity].
Defaults to 1. Concurrent requests can increase performance as data is downloaded parallel to request processing.
-where <query>
[optional] Query that will return the assets for export/download. Any query conforming to AQL is valid. Must be specified if 'namespace' argument is omitted.
-namespace <namespace>
[optional] The asset namespace to export. Must be specified if 'where' argument is omitted
-onerror [abort|continue]
[optional] If there is an export error, what should happen? Defaults to 'abort'.
-onlocalerror [abort|continue]
[optional] If there is an error accessing or opening a local file (e.g. permissions, etc), what should happen? Defaults to 'abort'.
-task-name <task name>
[optional] Specifies the custom name for the task that monitors the progress of the export. User may track the progress of the task by using server.task.named.describe :name <task name>.
-task-remove-after <hours>
[optional] Used to specify how many hours after the export is complete do we want the monitoring task to be removed from the system. Defaults to '0' hours, i.e. now.
-task-batch-size <batch size>
[optional] When used task that monitors the progress of the export will update the progress after 'task-batch-size' of work units were completed. Defaults to '100' work units.
-task-count-assets <true|false>
[optional] Specifies if the assets should be counted before the export begins. This is used by task that tracks the progress of the export so that it can know total number of work units (file transfers). Defaults to 'false'.
-task-report-bytes <true|false>
[optional] Specifies if the task should include bytes transferred as well when updating progress, not just assets transferred. If set to true, bytes transferred will be reported once every second. Defaults to 'false'.
-verbose [true|false]
[optional] If set to true, will display those files being consumed. Defaults to false.
-export-empty-namespaces [true|false]
[optional] Specifies whether or not to export empty namespaces. If set to true, folders will be created for empty namespaces. This only works in conjunction with -namespace argument. It will be ignored if either of -lp or -where arguments are provided. Defaults to false.
-folder-layout [none|collection]
[optional] Specifies the folder layout for exported files. Ignored if '-lp' provided. Defaults to 'collection'.
-filename-collisions [skip|rename|overwrite]
[optional] Specifies how to handle filename collisions. Ignored if '-lp' provided. Defaults to 'rename'.
-ns-parent-number
[optional] When folder layout is set to 'collection' this argument specifies the number of collection parents to include. Defaults to infinity, i.e. all parents.
Configuration
To get the aterm-download script, unimelb-mf-clients must be installed. For Spartan HPC users, it is already installed and you just need to load the module:
module load unimelb-mf-clients
You also need to create a configuration file in $HOME/.Arcitecta/mflux.cfg
mkdir -p ~/.Arcitecta/
touch ~/.Arcitecta/mflux.cfg
The mflux.cfg file should contain the server details and user domain (unimelb for staff, student for students) and user name, see example below:
host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
domain=unimelb
user=UNI_USERNAME
Once the configuration file is created correctly, you can start using the aterm-download command in terminal.
Examples
Download a directory (asset namespace)
The command below downloads the directory (asset namespace) /projects/proj-demonstration-1128.4.15/test-data from Mediaflux to the current local directory:
aterm-download -verbose true -ns-parent-number 1 -namespace "/projects/proj-demonstration-1128.4.15/test-data" ./
Download individual files (assets)
To download an individual file (asset) /projects/proj-demonstration-1128.4.15/test-data/sample-file1.tar.gz from Mediaflux to the current local directory:
aterm-download -ns-parent-number 0 -where "namespace='/projects/proj-demonstration-1128.4.15/test-data' and name='sample-file1.tar.gz'" ./
Utilities to check instrument uploads by Data Mover
Introduction
Two command line utilities have been developed to check the instrument uploads done by Mediaflux Data Mover. They are:
instrument-upload-list
A tool to list or search instrument data uploads in Mediaflux.
instrument-upload-missing-find
A tool to find local directories that have not been uploaded to Mediaflux, or whose total file count or size does not match the uploads in Mediaflux.
Installation on Windows 10
Download the latest unimelb-mf-clients release for Windows from the UoM GitLab site.