3. Command Reference
In this section, we describe the full command line interface for Data Workspaces. This interface is built around a script, dws, which is installed into your path when you install the dataworkspaces package. The overall interface for dws is:
dws [--batch] [--verbose] [--help] COMMAND [--help] [OPTIONS] [ARGS]...
dws has three options common to all commands:
--batch, which runs the command in a mode that never asks for user confirmation and will error out if it absolutely requires an input (useful for automation),
--verbose, which will print a lot of detail about what will be and has been done for a command (useful for debugging), and
--help, which prints these common options and a list of available commands.
Next on your command line comes the command name (e.g. init, clone, snapshot). Each command has its own arguments and options, as documented below. All commands take a --help argument, which will print the specific options and arguments for the command. Finally, the add subcommand has further subcommands, representing the individual resource types (e.g. git, local-files, rclone).
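The ordering described above (common options before the command name, command-specific options after it) can be sketched in Python. This is only an illustration for scripts that drive dws; the helper name build_dws_argv is hypothetical and is not part of the dataworkspaces package.

```python
import shlex


def build_dws_argv(command, args=(), batch=False, verbose=False):
    """Return an argv list with the common options placed before COMMAND.

    Hypothetical helper for illustration; it only assembles the command line.
    """
    argv = ["dws"]
    if batch:
        argv.append("--batch")
    if verbose:
        argv.append("--verbose")
    argv.append(command)
    argv.extend(args)
    return argv


# Per-command help: --help goes after the command name.
help_argv = build_dws_argv("init", ["--help"])

# A batch-mode call suitable for automation.
init_argv = build_dws_argv("init", ["--create-resources", "all"], batch=True)

print(shlex.join(help_argv))
print(shlex.join(init_argv))
```

The resulting list can be passed to subprocess.run() on a machine where dws is installed.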
dws
dws [OPTIONS] COMMAND [ARGS]...
Options
-b, --batch
    Run in batch mode, never ask for user inputs.
--verbose
    Print extra debugging information and ask for confirmation before running actions.
add
Add a data collection to the workspace as a resource. Possible types of resources are api-resource, git, local-files, or rclone; these are subcommands of add.
dws add [OPTIONS] COMMAND [ARGS]...
Options
--workspace-dir <workspace_dir>
api-resource
Resource to represent data obtained via an API. Use this when there is no file-based representation of your data that can be versioned and captured more directly. Subcommand of add.
dws add api-resource [OPTIONS]
Options
--role <role>
--name <name>
    Short name for this resource
git
Add a local git repository as a resource. Subcommand of add.
dws add git [OPTIONS] PATH
Options
--role <role>
--name <name>
    Short name for this resource
--branch <branch>
    Branch of the repo to use, defaults to master.
-r, --read-only
    If specified, treat the origin repository as read-only and never push to it.
Arguments
PATH
    Required argument
local-files
Add a local file directory (not managed by git) to the workspace. Subcommand of add.
dws add local-files [OPTIONS] PATH
Options
--role <role>
--name <name>
    Short name for this resource
--compute-hash
    Compute hashes for all files. If this option is not set, we use a lightweight comparison of file sizes only.
Arguments
PATH
    Required argument
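The trade-off behind --compute-hash can be illustrated with a small Python sketch: a size-only comparison is cheap but misses edits that keep the file length unchanged, while content hashes catch them. This is not the dws implementation; the fingerprint function is hypothetical, and SHA-256 is an assumption made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(root, compute_hash=False):
    """Map each file under root to its size, or to a content hash if requested.

    Hypothetical helper; SHA-256 is chosen for illustration only.
    """
    root = Path(root)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            if compute_hash:
                result[key] = hashlib.sha256(path.read_bytes()).hexdigest()
            else:
                result[key] = path.stat().st_size
    return result


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data.csv"
    data.write_text("a,b\n1,2\n")
    sizes_before = fingerprint(tmp)
    hashes_before = fingerprint(tmp, compute_hash=True)

    # Overwrite with different content of exactly the same length.
    data.write_text("a,b\n3,4\n")
    size_changed = fingerprint(tmp) != sizes_before
    hash_changed = fingerprint(tmp, compute_hash=True) != hashes_before

print("size-only comparison sees a change:", size_changed)
print("hash comparison sees a change:", hash_changed)
```

In this sketch the size-only comparison reports no change while the hash comparison does, which is the case where hashing earns its extra cost.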
rclone
Add an rclone-d repository as a resource to the workspace. Subcommand of add.
dws add rclone [OPTIONS] SOURCE DEST
Options
--role <role>
--name <name>
    Short name for this resource
--config <config>
    Configuration file for rclone
--compute-hash
    Compute hashes for all files
Arguments
SOURCE
    Required argument
DEST
    Required argument
clone
Clone the specified data workspace.
dws clone [OPTIONS] REPOSITORY [DIRECTORY]
Options
--hostname <hostname>
    Hostname to identify this machine in snapshot directory paths, defaults to the current machine's hostname.
Arguments
REPOSITORY
    Required argument
DIRECTORY
    Optional argument
config
Get or set configuration parameters. Local parameters are only for this copy of the workspace, while global parameters are stored centrally and affect all copies.
If neither PARAMETER_NAME nor PARAMETER_VALUE is specified, this command prints a table of all parameters and their information (scope, value, default or not, and help text). If just PARAMETER_NAME is specified, it prints the specified parameter’s information. Finally, if both the parameter name and value are specified, the parameter is set to the specified value.
dws config [OPTIONS] [PARAMETER_NAME] [PARAMETER_VALUE]
Options
--workspace-dir <workspace_dir>
Arguments
[PARAMETER_NAME]
    Optional argument
[PARAMETER_VALUE]
    Optional argument
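The three behaviors of dws config depend only on how many positional arguments are given. A minimal Python sketch of that dispatch follows; the function names and the parameter name my.parameter are hypothetical and used only for illustration.

```python
def config_mode(parameter_name=None, parameter_value=None):
    """Return which of the three documented behaviors the arguments select."""
    if parameter_name is None and parameter_value is None:
        return "list-all"   # prints the table of all parameters
    if parameter_value is None:
        return "show-one"   # prints one parameter's information
    if parameter_name is None:
        # A value without a name cannot be expressed positionally.
        raise ValueError("PARAMETER_VALUE requires PARAMETER_NAME")
    return "set"            # sets the parameter to the given value


def config_argv(parameter_name=None, parameter_value=None):
    """Build the matching command line as an argv list."""
    config_mode(parameter_name, parameter_value)  # validate the combination
    argv = ["dws", "config"]
    if parameter_name is not None:
        argv.append(parameter_name)
    if parameter_value is not None:
        argv.append(parameter_value)
    return argv


# "my.parameter" is a made-up name, not a real Data Workspaces parameter.
print(config_mode(), config_argv())
print(config_mode("my.parameter"), config_argv("my.parameter"))
print(config_mode("my.parameter", "42"), config_argv("my.parameter", "42"))
```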
delete-snapshot
Delete the specified snapshot. This includes the metadata and lineage data for the snapshot. Unless --no-include-resources is specified, this also deletes any results data saved for the snapshot (under the snapshots subdirectory of a results resource).
dws delete-snapshot [OPTIONS] TAG_OR_HASH
Options
--workspace-dir <workspace_dir>
--no-include-resources
    If specified, do NOT delete any snapshot-specific content from resources.
Arguments
TAG_OR_HASH
    Required argument
deploy
Deployment-related commands
dws deploy [OPTIONS] COMMAND [ARGS]...
Options
--workspace-dir <workspace_dir>
build
Build a docker image containing this workspace. This command is EXPERIMENTAL and subject to change. Subcommand of deploy.
dws deploy build [OPTIONS]
Options
--image-name <image_name>
    Name of docker image, defaults to name of workspace
-f, --force-rebuild
    If specified, always rebuild image (force deletes the image from docker)
--git-user-email <git_user_email>
    Email address used by git inside the container. Defaults to value of user.email for this workspace.
--git-user-name <git_user_name>
    Username used by git inside the container. Defaults to value of user.name for this workspace.
run
Run a docker image containing this workspace. This command is EXPERIMENTAL and subject to change. Subcommand of deploy.
dws deploy run [OPTIONS]
Options
--image-name <image_name>
    Name of docker image, defaults to name of workspace
--no-mount-ssh-keys
    If specified, do not mount the host’s ~/.ssh directory into the container. This directory is needed for git authentication.
diff
List differences between two snapshots.
dws diff [OPTIONS] SNAPSHOT_OR_TAG1 SNAPSHOT_OR_TAG2
Options
--workspace-dir <workspace_dir>
Arguments
SNAPSHOT_OR_TAG1
    Required argument
SNAPSHOT_OR_TAG2
    Required argument
init
Initialize a new workspace.
dws init [OPTIONS] [NAME]
Options
--hostname <hostname>
    Hostname to identify this machine in snapshot directory paths, defaults to the current machine's hostname.
--create-resources <create_resources>
    Initialize the workspace with subdirectories for the specified resource roles. Choices are 'all' or any comma-separated combination of source-data, intermediate-data, code, results.
--scratch-directory <scratch_directory>
    Local scratch directory (defaults to WORKSPACE_DIR/scratch)
--git-fat-remote <git_fat_remote>
    Initialize the workspace with the git-fat large file extension and use the specified URL for the remote datastore
--git-fat-user <git_fat_user>
    Username for git-fat remote (if not root)
--git-fat-port <git_fat_port>
    Port number for git-fat remote (defaults to 22)
--git-fat-attributes <git_fat_attributes>
    Comma-separated list of file patterns to manage under git-fat. For example --git-fat-attributes='.gz,.zip'. If you do not specify them here, you can always add the .gitattributes file later.
Arguments
NAME
    Optional argument
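The accepted values for --create-resources follow a simple rule: either the word all, or a comma-separated list drawn from the four roles. A minimal Python sketch of that rule is shown below, for scripts that want to check a value before calling dws init. The function name is hypothetical, and the real command may validate differently (for example, in how it treats whitespace or duplicates).

```python
RESOURCE_ROLES = ("source-data", "intermediate-data", "code", "results")


def parse_create_resources(value):
    """Expand a --create-resources value into a list of resource roles.

    Hypothetical helper mirroring the documented choices only.
    """
    if value == "all":
        return list(RESOURCE_ROLES)
    roles = value.split(",")
    unknown = [role for role in roles if role not in RESOURCE_ROLES]
    if unknown:
        raise ValueError("unknown resource role(s): " + ", ".join(unknown))
    return roles


print(parse_create_resources("all"))
print(parse_create_resources("code,results"))
```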
lineage
Lineage-related commands
dws lineage [OPTIONS] COMMAND [ARGS]...
Options
--workspace-dir <workspace_dir>
graph
Graph the lineage of a resource, writing the graph to an HTML file. Subcommand of lineage.
dws lineage graph [OPTIONS] OUTPUT_FILE
Options
--resource <resource>
    Name of the resource to graph the lineage for (defaults to the first results resource)
--snapshot <snapshot>
    Snapshot hash or tag to use for lineage. If not specified, use current lineage.
--format <format>
    Format of the output graph (defaults to html). Options: html|dot
--width <width>
    Width of graph in pixels (defaults to 1024)
--height <height>
    Height of graph in pixels (defaults to 800)
Arguments
OUTPUT_FILE
    Required argument
publish
Add a remote Git repository as the origin for the workspace and do the initial push of the workspace and any other resources.
dws publish [OPTIONS] REMOTE_REPOSITORY
Options
--workspace-dir <workspace_dir>
--skip <skip>
    Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.
Arguments
REMOTE_REPOSITORY
    Required argument
pull
Pull the latest state of the workspace and its resources from their origins.
dws pull [OPTIONS]
Options
--workspace-dir <workspace_dir>
--only <only>
    Comma-separated list of resource names that you wish to pull from the origin, if applicable. The rest will be skipped.
--skip <skip>
    Comma-separated list of resource names that you wish to skip when pulling. The rest will be pulled from their remote origins, if applicable.
--only-workspace
    Only pull the workspace’s metadata, skipping the individual resources
push
Push the state of the workspace and its resources to their origins.
dws push [OPTIONS]
Options
--workspace-dir <workspace_dir>
--only <only>
    Comma-separated list of resource names that you wish to push to the origin, if applicable. The rest will be skipped.
--skip <skip>
    Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.
--only-workspace
    Only push the workspace’s metadata, skipping the individual resources
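The --only and --skip options of pull and push are complementary filters over the workspace's resource names. The Python sketch below models that selection. It is an illustration rather than the dws implementation: the function and the resource names are hypothetical, and it assumes the two options are not combined, which this reference does not state either way.

```python
def select_resources(resource_names, only=None, skip=None):
    """Return the resources a pull or push would act on.

    only and skip are comma-separated strings, as on the command line.
    Assumption: at most one of the two is given.
    """
    if only is not None and skip is not None:
        raise ValueError("this sketch accepts --only or --skip, not both")
    if only is not None:
        wanted = set(only.split(","))
        return [name for name in resource_names if name in wanted]
    if skip is not None:
        unwanted = set(skip.split(","))
        return [name for name in resource_names if name not in unwanted]
    return list(resource_names)


# Hypothetical resource names for a workspace.
resources = ["source-data", "code", "results"]

print(select_resources(resources, only="code,results"))
print(select_resources(resources, skip="source-data"))
print(select_resources(resources))
```

With these inputs, --only code,results and --skip source-data select the same two resources.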
report
Report generation commands
dws report [OPTIONS] COMMAND [ARGS]...
Options
--workspace-dir <workspace_dir>
history
Show the history of snapshots. Subcommand of report.
dws report history [OPTIONS]
Options
--limit <limit>
    Number of previous snapshots to show (most recent first)
lineage
Show a lineage table for either the current workspace or a specific snapshot. Subcommand of report.
dws report lineage [OPTIONS]
Options
--snapshot <snapshot>
    Optional tag or hash for a snapshot. Otherwise, shows the current status.
results
Show the contents of a results file. Subcommand of report.
dws report results [OPTIONS]
Options
--snapshot <snapshot>
    Optional tag or hash for a snapshot. Otherwise, shows the current status.
--resource <resource>
    Optional resource name. Otherwise, will look for the first resource with a results file.
status
Show the status of resources in this workspace. Subcommand of report.
dws report status [OPTIONS]
restore
Restore the workspace to a prior state.
dws restore [OPTIONS] TAG_OR_HASH
Options
--workspace-dir <workspace_dir>
--only <only>
    Comma-separated list of resource names that you wish to revert to the specified snapshot. The rest will be left as-is.
--leave <leave>
    Comma-separated list of resource names that you wish to leave in their current state. The rest will be restored to the specified snapshot.
--strict
    If specified, error out if unable to restore any of the requested resources (due to lack of a restore hash or removing the resource from workspace).
Arguments
TAG_OR_HASH
    Required argument
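The effect of --strict can be sketched in Python as a planning step that separates restorable resources from those that cannot be restored. This is a model of the description above, not the dws code: the function, the data layout, and the hash values are hypothetical, and the non-strict behavior shown here (skipping unrestorable resources) is an assumption.

```python
def plan_restore(requested, restore_hashes, workspace_resources, strict=False):
    """Split requested resources into restorable and unrestorable ones.

    restore_hashes maps resource name -> restore hash recorded in the
    snapshot (None if there is none); workspace_resources is the set of
    resources currently in the workspace.
    """
    restorable = []
    unrestorable = []
    for name in requested:
        has_hash = restore_hashes.get(name) is not None
        if has_hash and name in workspace_resources:
            restorable.append(name)
        else:
            unrestorable.append(name)
    if strict and unrestorable:
        raise RuntimeError("cannot restore: " + ", ".join(unrestorable))
    # Assumption: without --strict, unrestorable resources are skipped.
    return restorable, unrestorable


# Hypothetical snapshot contents and workspace state.
requested = ["code", "results", "source-data"]
restore_hashes = {"code": "abc123", "results": None, "source-data": "def456"}
workspace_resources = {"code", "results"}  # source-data was removed

restorable, skipped = plan_restore(requested, restore_hashes, workspace_resources)
print("restore:", restorable)
print("skip:", skipped)
```

Calling plan_restore with strict=True on the same inputs raises an error instead of returning, because two of the requested resources cannot be restored.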