Unimelb command-line clients

These are a collection of command line interface clients for Mediaflux developed at the University of Melbourne. They are mainly focused on uploading and downloading data to and from a Mediaflux server as well as verifying that an upload or download completed successfully.  They are written in Java and interact with Mediaflux using the HTTPS protocol which is secure and provides excellent data integrity guarantees, as well as allowing uploads and downloads to be efficient and restartable.  It is possible to enable checksum checking, which verifies successful transfers by comparing a CRC32 checksum on the local and remote sides.

The main clients are called

  • unimelb-mf-download provides efficient, restartable download to the local host from Mediaflux

  • unimelb-mf-upload provides efficient, restartable upload from the local host to Mediaflux

  • unimelb-mf-check provides efficient directory comparison/verification between the local host file system and Mediaflux (i.e. check that the source and destination are the same).

Additional items:

  • mexplorer - a shell wrapper for Mediaflux Explorer to make it easy to launch from the command line

  • aterm - a shell wrapper for Mediaflux aterm to make it easy to script from the command line

  • aterm-gui - a shell wrapper for Mediaflux aterm to make it easy to launch from the command line

  • aterm-import - a shell wrapper for the aterm import command (run aterm help import to see all options)

  • aterm-download - a shell wrapper for the aterm download command (run aterm help download to see all options)

  • other wrapper scripts for each platform.  See the ./bin/unix directory for macOS and Linux, and the ./bin/windows directory for Windows

University of Melbourne Spartan Users

Please note that these clients are pre-configured and available on the University's Spartan service.

Obtaining the unimelb clients

Download from the GitLab page by selecting the release for your platform (Windows/macOS/Linux) which includes an embedded Java distribution (recommended), or get the pure Java release which requires a local Java distribution.

Install

  • Unpack the downloaded zip file by double clicking on it via a GUI (or using a CLI unzip tool such as unzip on Unix systems)

  • Move the unpacked directory (named like unimelb-mf-clients-0.3.8) to wherever you'd like to keep it long term. E.g. the Desktop for Windows, or a ~/bin folder (on Unix systems like Linux and macOS)

Configure

Access to the Mediaflux server is managed via a configuration file which you must create and populate appropriately.

Add to your PATH

Optionally, you can add the commands to your PATH so that you can run them from any directory.

Using The Clients

Execute the client of interest on the command line and supply it the arguments that you need.

  • On Windows, the clients can be run from Windows PowerShell or the Command Prompt. You can start these by pressing the Start button and typing powershell or cmd, respectively.

  • On macOS, the clients can be run from the Terminal (Applications -> Utilities -> Terminal).

  • On Linux, you can execute the clients from any terminal or shell prompt. Linux commonly allows you to launch a terminal with ctrl-alt-t.

 

You can find some examples of usage here:

  • unimelb-mf-download provides efficient, restartable download to the local host from Mediaflux

  • unimelb-mf-upload provides efficient, restartable upload from the local host to Mediaflux

  • unimelb-mf-check provides efficient directory comparison/verification between the local host file system and Mediaflux (i.e. check that the source and destination are the same).

and also see the Documentation in the source repository for examples.


Add to PATH

 

Optionally, you can add the commands to your PATH so that you can run them from any directory.

Windows

Add the location of the command line clients to your Path, using the System Properties control panel. For example, if you extracted the clients to your Desktop folder, you might add %USERPROFILE%\Desktop\unimelb-mf-clients-0.3.8\bin\windows to your Path. This will allow you to run the commands from any folder without specifying the path to the binary.

  • Click the start button, type env and run Edit the environment variables for your account.

 

  • Under User variables for <username>, click the Path entry and click Edit.... On Windows 10, add the location as a new line. On Windows 7, append it to the existing string, using a semicolon (;) as a separator.

 

macOS

To add the location of the command line clients to your $PATH, edit the .bash_profile file in your home directory with TextEdit:

  • Run TextEdit

  • File -> Open

  • Press the keyboard shortcut Command-Shift-H to go to your Home directory

  • Press the keyboard shortcut Command-Shift-Period to show hidden files and directories

  • Open the .bash_profile file, or create one with the New Document button (see Configuration File for instructions on making it plain text, etc.)

 

  • Add a line (making sure it matches the location you extracted the zip file):

    export PATH=~/bin/unimelb-mf-clients-0.3.8/bin/unix:$PATH
  • Save the file

  • Close and re-open the terminal to re-load the changes to your PATH

Linux

Add the location of the command line clients to your $PATH. This will allow you to run the commands from any folder without specifying the path to the binary.

  • Edit your .bashrc file:

    nano ~/.bashrc
  • Add a line to the bottom of the file:

    export PATH=$PATH:~/bin/unimelb-mf-clients-0.3.8/bin/unix
  • You will need to log out and log back in to pick up the changes to your PATH
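
If the export line is added more than once, or the file is sourced repeatedly, PATH accumulates duplicate entries. A minimal sketch of a guard against this, assuming a Bourne-compatible shell. The function name add_to_path is hypothetical and not part of unimelb-mf-clients:

```shell
# add_to_path DIR: append DIR to PATH unless it is already listed.
# Hypothetical helper intended for ~/.bashrc or ~/.bash_profile.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: leave PATH unchanged
        *) PATH="$PATH:$1" ;;        # otherwise append to the end
    esac
}

add_to_path "$HOME/bin/unimelb-mf-clients-0.3.8/bin/unix"
export PATH
```

After opening a new terminal, running command -v unimelb-mf-upload should print the full path of the script if the directory was added correctly.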


unimelb-mf-upload

 

This is a command-line Java application that you can use to efficiently upload your data into Mediaflux and perform integrity checks. Installation instructions are available in the parent page.

This client can

  • upload files in parallel (--nb-workers). There is no magic in this: it will only go faster if there is sufficient network capacity, so please don't use more than 4 upload threads. If the network is heavily congested, you may even find that 4 threads are no faster than 1. You may have to experiment a little to find the optimum.

  • compute checksums for additional validation (see below)

  • write a log file of the upload

  • generate and email a summary of the upload (including successful and failed uploads, and the number of zero-sized files it encountered)

  • run in daemon mode (in the background) so it keeps uploading new data to Mediaflux as it arrives in your local file system

  • Please see all command-line arguments with the --help switch

Here are all the details for the command-line arguments to this client.

Pre-existing files

The client checks whether files already exist in Mediaflux or not. If they do exist it will skip the upload. The checks it uses are:

  • File path/name exists and is the same

  • File size is the same

  • If checksums are enabled, the checksum is the same

If any of these fail, the file does not pre-exist and will be re-uploaded. In the case that the path/name is the same, but the source file has changed content, it will be uploaded to the pre-existing asset in Mediaflux as a new version.
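
The decision described above can be sketched as a small shell function. This is only an illustration of the stated rules, not the client's actual code. The function name and its arguments are hypothetical, and in practice the remote values would come from Mediaflux:

```shell
# upload_action REMOTE_EXISTS LOCAL_SIZE REMOTE_SIZE [LOCAL_CSUM REMOTE_CSUM]
# Prints "skip" when the file counts as pre-existing, otherwise "upload".
# REMOTE_EXISTS is "yes" or "no"; checksums are compared only if both are given.
upload_action() {
    remote_exists=$1 local_size=$2 remote_size=$3
    local_csum=${4:-} remote_csum=${5:-}

    [ "$remote_exists" = "yes" ] || { echo upload; return; }
    [ "$local_size" = "$remote_size" ] || { echo upload; return; }
    if [ -n "$local_csum" ] && [ -n "$remote_csum" ] &&
       [ "$local_csum" != "$remote_csum" ]; then
        echo upload
        return
    fi
    echo skip
}

upload_action yes 1024 1024                      # same path and size: skip
upload_action yes 1024 1024 cbf43926 0a1b2c3d    # checksums differ: upload
upload_action no 1024 0                          # not on the server yet: upload
```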

Checksums

Checksums (a number computed from the contents of a file, which changes if the contents change) are an important data integrity mechanism. The Mediaflux server computes a checksum for each file it receives. The upload client can compute checksums from the source data on the client side and compare them with the checksum computed by the server when it receives the file. If the checksums match, we can be very confident that the file uploaded correctly. Many other clients for other protocols (e.g. SFTP and SMB) do not do this.

By default, checksums are not enabled (because computing checksums slows down the upload process). However, it is strongly recommended that you enable these during the upload or run the checker client unimelb-mf-check with checksums to check the upload afterwards.
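
To make the idea concrete, the sketch below computes a CRC32 checksum (the checksum type named in the introduction) for a local file using only standard Unix tools. It relies on the fact that gzip stores the CRC32 of the uncompressed data in the last eight bytes of its output. The function name crc32_hex is hypothetical, and this is not necessarily how the clients compute their checksums:

```shell
# crc32_hex FILE: print the CRC32 of FILE as 8 lowercase hex digits.
# The gzip trailer is 8 bytes: CRC32, then uncompressed size, both little-endian.
crc32_hex() {
    gzip -c < "$1" | tail -c 8 | head -c 4 | od -An -tx1 |
        awk '{ print $4 $3 $2 $1 }'    # reverse the byte order
}

tmp=$(mktemp)
printf '123456789' > "$tmp"
crc32_hex "$tmp"      # prints cbf43926, the standard CRC32 check value
rm -f "$tmp"
```

Running the same function on a copy of the file should print the same value. Any difference means the contents differ.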

Case 1 - Files DO NOT pre-exist on Mediaflux

When you enable checksums, and the data DO NOT already exist on the server, the client will compute the checksum as part of the upload process. When Mediaflux creates the asset, it will also compute the checksum. These checksums will be compared.

Case 2 - Files DO pre-exist on Mediaflux

When you enable checksums, and the data DO already exist on the server (by path/name and size), then the client will first compute the checksum on the local file and compare it with the checksum already stored in Mediaflux.

If the checksums differ, the local file has changed, so the client re-uploads it (following the process in Case 1 above) and creates a new asset version. Thus, overall, two checksums are computed by the client and one by the server.

Examples

You will need to know the path where your data should be located in Mediaflux (the --dest argument of the command) and where to upload from (the last positional argument).

Example 1 - parallel upload with checksum check

Upload data with four worker threads and turn on checksums for upload integrity checking (recommended). As the location of the configuration file is not specified, the client will look for it in the .Arcitecta directory of your home directory.

unimelb-mf-upload --csum-check --nb-workers 4 --dest /projects/proj-myproject-1128.1.59/12Jan2018 /data/projects/punim0058

Example 2 - using a configuration file

Upload data with one worker thread and specify explicitly where the configuration file is.

unimelb-mf-upload --mf.config /Users/nebk/.Arcitecta/mflux.cfg  --dest /projects/proj-myproject-1128.1.59/12Jan2018 /data/projects/punim0058

The Configuration File might look like this:

host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O

 

Scheduled uploads

If you have a location that should be uploaded on a regular schedule such as an instrument PC that saves data to a given directory on the local computer, you can schedule uploads with unimelb-mf-upload.  It is best to request an upload token if you want to do this as the credential will be stored on the computer that is doing the uploads.  Contact Research Computing Services to request a token.

Windows

In this example:

  • we will put the unimelb-mf-client files in the %HOMEPATH%\Documents directory

  • we will save logs to the %HOMEPATH%\Documents\logs directory

  • we will put the configuration file in the %HOMEPATH%\Documents directory

Download from the GitLab page, selecting the Windows 64bit release.  Extract the zip file to %HOMEPATH%\Documents.

Create a Configuration File.  In this case we are going to use a secure token.  In our example, it will be stored in %HOMEPATH%\Documents\mflux.cfg.

host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O

Create a batch file to perform the upload using Notepad.  In our example, it will be stored in %HOMEPATH%\Documents\upload.bat:

%HOMEPATH%\Documents\unimelb-mf-clients-0.7.7\bin\windows\unimelb-mf-upload --mf.config %HOMEPATH%\Documents\mflux.cfg --log-dir %HOMEPATH%\Documents\logs --dest /projects/proj-demonstration-1128.4.15 %HOMEPATH%\Documents\data-to-upload

Schedule the upload using Windows Task Scheduler.

  • Click the start button and start typing Task Scheduler and select it from the Start Menu when it appears.

  • Click on the Task Scheduler Library, then right click on the space and choose Create Basic Task... from the menu.

  • Give your task a name and description, then click Next >

  • choose a start date and time and click Next >

  • choose Start a program and click Next >

  • click the Browse button and find the script you created above.

  • Click Next > and then check the Open the Properties dialog for this task when I click Finish box, then click Finish.

  • Under Security options, choose which user you would like the task to run under.  You may wish to make it so the scheduled job will run even if the user is not logged in.

Linux

In this example:

  • we will put the unimelb-mf-client files in the ~/bin directory

  • we will save logs to the ~/logs directory

  • we will put the configuration file in the ~/.Arcitecta directory

Download from the GitLab page, selecting the Linux 64bit release.  Extract the zip file to ~/bin.

Create a Configuration File.  In this case we are going to use a secure token.  In our example, it will be stored in ~/.Arcitecta/mflux.cfg.

host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O
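
Because the token in this file grants access to your project, it is prudent to make the file readable only by your own account. A minimal sketch, assuming a POSIX shell. The function name write_mflux_cfg and the placeholder token are hypothetical, and the demonstration writes to a temporary directory rather than to ~/.Arcitecta:

```shell
# write_mflux_cfg FILE TOKEN: create FILE with owner-only permissions.
# Hypothetical helper; not part of unimelb-mf-clients.
write_mflux_cfg() {
    mkdir -p "$(dirname "$1")"
    (
        umask 077                      # files created here get mode 600
        cat > "$1" <<EOF
host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=$2
EOF
    )
    chmod 600 "$1"                     # also tighten a pre-existing file
}

demo_dir=$(mktemp -d)
write_mflux_cfg "$demo_dir/mflux.cfg" "EXAMPLE_TOKEN"
ls -l "$demo_dir/mflux.cfg"            # permissions should read -rw-------
```

In real use the first argument would be ~/.Arcitecta/mflux.cfg and the second the token you were issued.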

Create a shell script to perform the upload using the text editor of your choice.  In our example, it will be stored in ~/bin/upload.sh:

#!/bin/bash
~/bin/unimelb-mf-clients-0.7.4/bin/unix/unimelb-mf-upload --mf.config ~/.Arcitecta/mflux.cfg --log-dir ~/logs --dest /projects/proj-demonstration-1128.4.15 ~/data-to-upload

On Linux there are typically two options for scheduling tasks: cron and systemd timers.  In this example, we will use a cron job.

Edit the crontab file with the following command:

crontab -e

Create a new scheduled task at the end of the crontab file.  To see documentation on the format, try the man 5 crontab command.  In our example, we will run the command once per day at 1 am local time.

# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
0 1 * * * $HOME/bin/upload.sh

Save the file, and your job will be scheduled.

crontab: installing new crontab
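
If an upload can take longer than the interval between scheduled runs, two copies of the script could run at once. One common way to avoid this is a lock directory, since mkdir either creates the directory or fails in a single step. A minimal sketch; the function name run_locked and the lock location are hypothetical and not part of unimelb-mf-clients:

```shell
# run_locked LOCKDIR COMMAND [ARGS...]
# Runs COMMAND only if LOCKDIR can be created.
# Returns 75 if another run already holds the lock.
run_locked() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "previous run still in progress, skipping" >&2
        return 75
    fi
    "$@"                     # run the wrapped command
    rc=$?
    rmdir "$lockdir"         # release the lock
    return $rc
}

# Demonstration with a harmless command in a temporary directory.
# In upload.sh, the unimelb-mf-upload command line would take the place of echo.
demo=$(mktemp -d)
run_locked "$demo/lock" echo "upload would run here"
```

Note that if the script is killed before it reaches rmdir, the lock directory is left behind and must be removed by hand before the next run can proceed.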

macOS

In this example:

  • we will put the unimelb-mf-clients in the ~/Applications folder

  • we will save logs to the ~/Documents/logs folder

  • we will put the configuration file in the ~/.Arcitecta folder

Download from the GitLab page, selecting the Mac 64bit release.  Extract the tar.gz file by clicking on it.  It will be extracted to a folder in your Downloads folder, so move it to the Applications folder.

Create a Configuration File.  In this case we are going to use a secure token.  In our example, it will be stored in ~/.Arcitecta/mflux.cfg.

host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
token=phooP1Angohb2ooyahbiLiuwa6ahjuoKooViedaifooPhiqu1ookahXae7keichael4Shae2ael8ietit2phawucai0Aighifu6olah9OquahDei2aevae3keich8ain1OoLa4O

Create a shell script to perform the upload using the text editor of your choice.  In our example, it will be stored in ~/bin/upload.sh:

#!/bin/bash
~/Applications/unimelb-mf-clients-0.7.4/bin/unix/unimelb-mf-upload --mf.config ~/.Arcitecta/mflux.cfg --log-dir ~/logs --dest /projects/proj-demonstration-1128.4.15 ~/data-to-upload

Edit the crontab file with the following command.  By default the vim text editor will be used.

crontab -e # this will use the default text editor, usually vim

# if you would prefer to use the pico text editor, use the following command instead:
EDITOR=/usr/bin/pico crontab -e

Create a new scheduled task at the end of the crontab file.  To see documentation on the format, try the man 5 crontab command.  In our example, we will run the command once per day at 1 am local time.

# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
0 1 * * * $HOME/bin/upload.sh

Save the file, and your job will be scheduled.

crontab: installing new crontab

Troubleshooting upload issues caused by special files

 

Sparse files

Sparse files are files that have large sections of unallocated data. They are commonly used in Linux/Unix systems. Sparse files use storage efficiently when the files have a lot of holes (contiguous ranges of bytes having the value of zero) by storing only metadata for the holes instead of using real disk blocks.

 

Sparse files should be either excluded or compressed before uploading to Mediaflux. The Mediaflux backend does not support sparse files and treats them as regular files, so uploading uncompressed sparse files wastes storage space. We have seen issues caused by very large sparse files.

Find sparse files

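To find the sparse files in your file system, you can use the find command below (it requires GNU find; the tab field separator keeps file names that contain spaces intact):

find ./ -type f -printf "%S\t%p\n" | awk -F'\t' '$1 < 1.0 {print $2}'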
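
In GNU find, the %S directive prints a file's sparseness, roughly the allocated disk space divided by the file size, so a value below 1.0 suggests the file has holes. The sketch below shows the filtering step on made-up sample input, using a tab as the field separator so that file names containing spaces are kept whole. The function name filter_sparse and the sample paths are hypothetical:

```shell
# filter_sparse: read "sparseness<TAB>path" lines on standard input and
# print the paths whose sparseness is below 1.0.
filter_sparse() {
    awk -F'\t' '$1 < 1.0 { print $2 }'
}

# Sample input in the format produced by: find ./ -type f -printf "%S\t%p\n"
printf '0.25\t./images/vm disk.img\n1.00\t./notes.txt\n' | filter_sparse
# prints: ./images/vm disk.img
```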

Compress sparse files

If you are aware of sparse files in your local file system, you can run the following command to compress them before uploading to Mediaflux:

find ./ -type f -printf "%S\t%p\n" | awk -F'\t' '$1 < 1.0 {print $2}' | xargs -I {} sh -c 'tar -Sczvf "$1.tar.gz" "$1" && rm -f "$1"' _ {}

Warning

The above command compresses the sparse files to *.tar.gz files, preserving their holes (the -S option for tar), and then deletes the original sparse files.

DO NOT try it if you don't know what you are doing.

 

FIFO (Named Pipe)

A FIFO (First In First Out) is similar to a pipe. The principal difference is that a FIFO has a name within the file system and is opened in the same way as a regular file. A FIFO has a write end and a read end, and data is read from the pipe in the same order as it is written. A FIFO is also known as a named pipe in Linux.

FIFOs should not be uploaded to Mediaflux.

 

Mediaflux Explorer

Uploading a FIFO causes Mediaflux Explorer (current version: v1.5.6) to crash.

 

unimelb-mf-upload (in unimelb-mf-clients)

Early versions (prior to v0.7.4) of unimelb-mf-upload also hang when uploading a FIFO.

From version v0.7.4 onwards, unimelb-mf-upload excludes FIFO files.

Find FIFO (Named Pipes)

The following command can be used to list the FIFO files in your file system:

find ./ -type p
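
To see what such a file looks like, the following sketch creates a FIFO in a temporary directory with mkfifo and then locates it with the same find test. The function name list_fifos and the file names are arbitrary examples:

```shell
# list_fifos DIR: print every FIFO (named pipe) under DIR
list_fifos() {
    find "$1" -type p
}

demo=$(mktemp -d)
mkfifo "$demo/instrument.pipe"      # a named pipe
touch "$demo/data.txt"              # a regular file, for contrast
list_fifos "$demo"                  # lists only instrument.pipe
[ -p "$demo/instrument.pipe" ] && echo "is a FIFO"
rm -rf "$demo"
```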

unimelb-mf-download

This is a command-line Java application that you can use to efficiently download your data (folders recursively or individual files) from Mediaflux. Installation instructions are available in the Mediaflux Unimelb Command-Line Clients page.

You will need to know where your data are in Mediaflux (the last argument of the command) and where you want to locate (--out) the data on the destination computer.

  • This client can download files in parallel (--nb-workers). There is no magic in this: it will only go faster if there is sufficient network capacity, so please don't use more than 4 download threads. If the network is heavily congested, you may even find that 4 threads are no faster than 1. You may have to experiment a little to find the optimum.

  • The client can run in daemon (background) mode whereby it will keep on downloading data as it arrives in Mediaflux.

  • The client can synchronise from Mediaflux to the local file system and delete files that no longer exist in Mediaflux from the local file system.

  • Please see all command-line arguments with the --help switch

Examples

Example 1

Download data with one worker thread and skip pre-existing files, checking for files that pre-exist by their name and size only.

unimelb-mf-download --mf.config ~/.Arcitecta/mflux.cfg --out /data/projects/punim0058 /projects/proj-myproject-1128.1.59/12Jan2018

Example 2

Download data with four worker threads and overwrite pre-existing files, checking whether files pre-exist by their name, size and checksum (slower but safer). We don't need to specify the path to the configuration file as the client will look for it in the standard places.

unimelb-mf-download --overwrite --csum-check --nb-workers 4 --out /data/projects/punim0058 /projects/proj-myproject-1128.1.59/12Jan2018

unimelb-mf-check

This is a command-line Java application that you can use to check and compare assets in Mediaflux against files on the local file system. Installation instructions are available in the Mediaflux Unimelb Command-Line Clients page.

See All the details for the arguments to this client.

The client checks the equality of files (you can say which direction you are checking) by existence, name, size and optionally checksum. The client can produce a report for you in CSV format.

Examples

Example 1

Compare in the downward direction (i.e. Mediaflux is the master)

unimelb-mf-check --mf.config ~/.Arcitecta/mflux.cfg --direction down --output ~/Documents/foo-download-check.csv ~/Documents/foo /projects/proj-myproj-1.2.3/foo

aterm-download

 

aterm-download is a wrapper script based on Arcitecta aterm.jar for downloading data from Mediaflux. It is included in the unimelb-mf-clients software package.

 

Synopsis

aterm-download:
    synopsis:
        Exports one or more assets using a specified profile.
    usage:
        aterm-download [<args>] <file> [<create-args>]
    arguments:
        -lp <local profile>
            [optional] A local profile (ecp) containing a specification for the export.
        -mode [test|live]
            [optional] Is this a test or a live export? Test export can be used to check whether a profile is correct. Defaults to 'live'.
        -ncsr <nb>
            [optional] The number of concurrent server requests. A number in the range [1,infinity]. Defaults to 1. Concurrent requests can increase performance as data is downloaded parallel to request processing.
        -where <query>
            [optional] Query that will return the assets for export/download. Any query conforming to AQL is valid. Must be specified if 'namespace' argument is omitted.
        -namespace <namespace>
            [optional] The asset namespace to export. Must be specified if 'where' argument is omitted
        -onerror [abort|continue]
            [optional] If there is an export error, what should happen? Defaults to 'abort'.
        -onlocalerror [abort|continue]
            [optional] If there is an error accessing or opening a local file (e.g. permissions, etc), what should happen? Defaults to 'abort'.
        -task-name <task name>
            [optional] Specifies the custom name for the task that monitors the progress of the export. User may track the progress of the task by using server.task.named.describe :name <task name>.
        -task-remove-after <hours>
            [optional] Used to specify how many hours after the export is complete do we want the monitoring task to be removed from the system. Defaults to '0' hours, i.e. now.
        -task-batch-size <batch size>
            [optional] When used task that monitors the progress of the export will update the progress after 'task-batch-size' of work units were completed. Defaults to '100' work units.
        -task-count-assets <true|false>
            [optional] Specifies if the assets should be counted before the export begins. This is used by task that tracks the progress of the export so that it can know total number of work units (file transfers). Defaults to 'false'.
        -task-report-bytes <true|false>
            [optional] Specifies if the task should include bytes transferred as well when updating progress, not just assets transferred. If set to true, bytes transferred will be reported once every second. Defaults to 'false'.
        -verbose [true|false]
            [optional] If set to true, will display those files being consumed. Defaults to false.
        -export-empty-namespaces [true|false]
            [optional] Specifies whether or not to export empty namespaces. If set to true, folders will be created for empty namespaces. This only works in conjunction with -namespace argument. It will be ignored if either of -lp or -where arguments are provided. Defaults to false.
        -folder-layout [none|collection]
            [optional] Specifies the folder layout for exported files. Ignored if '-lp' provided. Defaults to 'collection'.
        -filename-collisions [skip|rename|overwrite]
            [optional] Specifies how to handle filename collisions. Ignored if '-lp' provided. Defaults to 'rename'.
        -ns-parent-number
            [optional] When folder layout is set to 'collection' this argument specifies the number of collection parents to include. Defaults to infinity, i.e. all parents.

 

Configuration

To get the aterm-download script, unimelb-mf-clients must be installed. For Spartan HPC users, it is already installed and you just need to load the module:

module load unimelb-mf-clients

 

You also need to create a configuration file in $HOME/.Arcitecta/mflux.cfg

mkdir -p ~/.Arcitecta/
touch ~/.Arcitecta/mflux.cfg

The mflux.cfg file should contain the server details, the user domain (unimelb for staff, student for students) and the user name. For example:

host=mediaflux.researchsoftware.unimelb.edu.au
port=443
transport=https
domain=unimelb
user=UNI_USERNAME

Once the configuration file is created correctly, you can start using the aterm-download command in terminal.
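
Before running the command, it can help to confirm that the file contains the expected keys. A minimal sketch, assuming the key=value format shown above and the five keys from that example. The function name check_mflux_cfg is hypothetical:

```shell
# check_mflux_cfg FILE: report any of the expected keys missing from FILE.
# Returns 0 if all keys are present, 1 otherwise.
# The key list matches the domain/user example; a token-based file differs.
check_mflux_cfg() {
    missing=0
    for key in host port transport domain user; do
        if ! grep -q "^${key}=" "$1"; then
            echo "missing key: $key" >&2
            missing=1
        fi
    done
    return $missing
}

cfg="$HOME/.Arcitecta/mflux.cfg"
if [ -f "$cfg" ] && check_mflux_cfg "$cfg"; then
    echo "configuration looks complete"
fi
```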

Examples

Download a directory (asset namespace)

The command below downloads the directory (asset namespace) /projects/proj-demonstration-1128.4.15/test-data from Mediaflux to the current local directory:

aterm-download -verbose true -ns-parent-number 1 -namespace "/projects/proj-demonstration-1128.4.15/test-data" ./

 

Download individual files (assets)

To download an individual file (asset) /projects/proj-demonstration-1128.4.15/test-data/sample-file1.tar.gz from Mediaflux to the current local directory:

aterm-download -ns-parent-number 0 -where "namespace='/projects/proj-demonstration-1128.4.15/test-data' and name='sample-file1.tar.gz'" ./
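
The -where query in this example is built from the asset's namespace (its parent directory) and its name. A small helper can derive both from a full asset path using dirname and basename. This is a sketch; the function name asset_query is hypothetical, and it assumes the path contains no single quotes:

```shell
# asset_query PATH: print a query selecting the single asset at PATH
asset_query() {
    printf "namespace='%s' and name='%s'\n" "$(dirname "$1")" "$(basename "$1")"
}

asset_query /projects/proj-demonstration-1128.4.15/test-data/sample-file1.tar.gz
# prints: namespace='/projects/proj-demonstration-1128.4.15/test-data' and name='sample-file1.tar.gz'
```

The output can then be passed to the -where argument using command substitution, in place of the hand-written query string.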

Utilities to check instrument uploads by Data Mover

 

Introduction

Two command line utilities have been developed to check the instrument uploads done by Mediaflux Data Mover. They are:

  • instrument-upload-list

    • A tool to list or search instrument data uploads in Mediaflux.

  • instrument-upload-missing-find

    • A tool to search for local directories that have not been uploaded to Mediaflux, or whose total file count or size does not match that of the uploads in Mediaflux.

Installation on Windows 10

  1. Download latest unimelb-mf-clients for Windows from UoM GitLab site