3. Command Reference¶
In this section, we describe the full command line interface for Data Workspaces.
This interface is built around a script, dws, which is installed into your
path when you install the dataworkspaces package. The overall interface
for dws is:
dws [--batch] [--verbose] [--help] COMMAND [--help] [OPTIONS] [ARGS]...
dws has three options common to all commands:
- --batch, which runs the command in a mode that never asks for user confirmation and errors out if it absolutely requires an input (useful for automation),
- --verbose, which prints a lot of detail about what will be and has been done for a command (useful for debugging), and
- --help, which prints these common options and a list of available commands.
Next on your command line comes the command name (e.g. init, clone, snapshot).
Each command has its own arguments and options, as documented below.
All commands take a --help argument, which prints the specific options and
arguments for the command. Finally, the add subcommand has further subcommands,
representing the individual resource types (e.g. git, local-files, rclone).
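When driving dws from a script, the ordering in the synopsis above matters: the common options precede the command name, and command-specific options and arguments follow it. The following is a minimal Python sketch of assembling such a command line. The helper name dws_argv is hypothetical and is not part of the dataworkspaces package.

```python
from typing import List


def dws_argv(command: str, *args: str, batch: bool = False, verbose: bool = False) -> List[str]:
    """Build an argument list of the form: dws [--batch] [--verbose] COMMAND [ARGS]..."""
    argv = ["dws"]
    if batch:
        argv.append("--batch")    # never prompt for input; suited to automation
    if verbose:
        argv.append("--verbose")  # print extra detail for debugging
    argv.append(command)
    argv.extend(args)             # command-specific options and arguments
    return argv


# Example: a snapshot with a message and a tag, run without prompts.
print(dws_argv("snapshot", "-m", "first results", "v1", batch=True))
```

The resulting list can be handed to subprocess.run as-is, which avoids shell quoting problems for values such as the snapshot message.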
dws¶
dws [OPTIONS] COMMAND [ARGS]...
Options
- -b, --batch¶
Run in batch mode, never ask for user inputs.
- --verbose¶
Print extra debugging information and ask for confirmation before running actions.
add¶
Add a data collection to the workspace as a resource.
Possible types of resources are api-resource, git, local-files, rclone, and s3; these are subcommands of add.
dws add [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>¶
api-resource¶
Resource to represent data obtained via an API. Use this when there is
no file-based representation of your data that can be versioned and captured
more directly. Subcommand of add.
dws add api-resource [OPTIONS]
Options
- --role <role>¶
- --name <name>¶
Short name for this resource
git¶
Add a local git repository as a resource. Subcommand of add.
dws add git [OPTIONS] PATH
Options
- --role <role>¶
- --name <name>¶
Short name for this resource
- --branch <branch>¶
Branch of the repo to use. If not specified, defaults to the current branch.
- -r, --read-only¶
If specified, treat the origin repository as read-only and never push to it.
- -e, --export¶
On snapshots, export lineage data for import into other workspaces
- --imported¶
This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.
Arguments
- PATH¶
Required argument
local-files¶
Add a local file directory (not managed by git) to the workspace. Subcommand of add.
dws add local-files [OPTIONS] PATH
Options
- --role <role>¶
- --name <name>¶
Short name for this resource
- --compute-hash¶
Compute hashes for all files. If this option is not set, we use a lightweight comparison of file sizes only.
- -e, --export¶
On snapshots, export lineage data for import into other workspaces
- --imported¶
This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.
Arguments
- PATH¶
Required argument
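The trade-off behind --compute-hash can be illustrated with a short Python sketch: a size-only comparison is cheap but cannot see an edit that leaves the file length unchanged, whereas a content hash can. SHA-256 is chosen here purely for illustration; the section above does not say which hash function dws uses, and the function names are hypothetical.

```python
import hashlib


def size_fingerprint(data: bytes) -> int:
    """Lightweight comparison: only the length of the content."""
    return len(data)


def hash_fingerprint(data: bytes) -> str:
    """Content comparison: a digest over every byte (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


old = b"label,score\ncat,0.91\n"
new = b"label,score\ncat,0.19\n"  # same length, different content

# The size-only check treats the two versions as identical.
print(size_fingerprint(old) == size_fingerprint(new))
# The hash-based check tells them apart.
print(hash_fingerprint(old) == hash_fingerprint(new))
```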
rclone¶
Add an rclone-d repository as a resource to the workspace. Subcommand of add.
This is designed for uni-directional synchronization between a remote and a local_path.
The remote has the form remote_name:remote_path, where remote_name is an entry in your
rclone config file.
dws add rclone [OPTIONS] REMOTE LOCAL_PATH
Options
- --role <role>¶
- --name <name>¶
Short name for this resource
- --config <config>¶
Configuration file for rclone
- --compute-hash¶
Compute hashes for all files
- -e, --export¶
On snapshots, export lineage data for import into other workspaces
- --imported¶
This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.
- --master <master>¶
Determines which system is the master. If ‘remote’, then pulls will be done, but not pushes. If ‘local’, then pushes will be done, but not pulls. If ‘none’ (the default), no action will be taken for pushes and pulls (you need to synchronize manually using rclone). When first adding the resource or cloning to a new machine, if the local directory does not exist, and ‘remote’ or ‘none’ were specified, the contents of the remote will be copied down to the local directory.
- Options
none | remote | local
- --sync-mode <sync_mode>¶
When copying between local and master, which rclone command to use. If you specify ‘copy’, files are added or overwritten without deleting any files present at the target. If you specify ‘sync’, files at the target are removed if they are not present at the source. The default is ‘copy’. If master is ‘none’, this option has no effect.
- Options
copy | sync
- --size-only¶
If specified, use only the file size (rather than also modification time and checksum) to determine if a file has been changed. If your resource has a lot of files and access to the remote is over a WAN, you probably want to set this. Otherwise, syncs/copies can be VERY slow.
Arguments
- REMOTE¶
Required argument
- LOCAL_PATH¶
Required argument
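The interaction between --master and --sync-mode described above can be summarized in a small Python sketch. It models only the documented rules; the function names are hypothetical, and this is not how dws itself is implemented.

```python
from typing import Dict, Optional, Tuple


def split_remote(remote: str) -> Tuple[str, str]:
    """Split 'remote_name:remote_path' into its two parts."""
    name, sep, path = remote.partition(":")
    if not sep or not name:
        raise ValueError("expected remote_name:remote_path, got %r" % remote)
    return name, path


def transfer_plan(master: str = "none", sync_mode: str = "copy") -> Dict[str, Optional[str]]:
    """Return the rclone command used on pull and on push (None means no action)."""
    if master not in ("none", "remote", "local"):
        raise ValueError("master must be none, remote, or local")
    if sync_mode not in ("copy", "sync"):
        raise ValueError("sync_mode must be copy or sync")
    return {
        "pull": sync_mode if master == "remote" else None,  # remote is master: pull only
        "push": sync_mode if master == "local" else None,   # local is master: push only
    }


def copies_down_on_first_use(master: str, local_exists: bool) -> bool:
    """True if the remote contents are copied down when the resource is added or cloned."""
    return (not local_exists) and master in ("remote", "none")


print(split_remote("backup:projects/data"))
print(transfer_plan(master="remote", sync_mode="sync"))
```

With master set to ‘none’, both entries of the plan are None, which matches the note that --sync-mode has no effect in that case.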
s3¶
Add an S3 resource to the workspace. Subcommand of add.
dws add s3 [OPTIONS] BUCKET_NAME
Options
- --role <role>¶
- --name <name>¶
Short name for this resource
Arguments
- BUCKET_NAME¶
Required argument
clone¶
Clone the specified data workspace.
dws clone [OPTIONS] REPOSITORY [DIRECTORY]
Options
- --hostname <hostname>¶
Hostname to identify this machine in snapshot directory paths, defaults to the result of the ‘hostname’ command.
Arguments
- REPOSITORY¶
Required argument
- DIRECTORY¶
Optional argument
config¶
Get or set configuration parameters. Local parameters are only for this copy of the workspace, while global parameters are stored centrally and affect all copies.
If neither PARAMETER_NAME nor PARAMETER_VALUE is specified, this command prints a table of all parameters and their information (scope, value, default or not, and help text). If just PARAMETER_NAME is specified, it prints the specified parameter’s information. Finally, if both the parameter name and value are specified, the parameter is set to the specified value.
dws config [OPTIONS] [PARAMETER_NAME] [PARAMETER_VALUE]
Options
- --workspace-dir <workspace_dir>¶
- --resource <resource>¶
If specified, get/set parameters for the specified resource, rather than the workspace.
Arguments
- [PARAMETER_NAME]¶
Optional argument
- [PARAMETER_VALUE]¶
Optional argument
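The three behaviors of dws config depend only on which positional arguments are present. A minimal Python sketch of that dispatch, and of building the matching command line, is shown here. The function names are hypothetical, and "PARAM", "VALUE", and "my-resource" are placeholders rather than real parameter or resource names.

```python
from typing import List, Optional


def config_mode(name: Optional[str] = None, value: Optional[str] = None) -> str:
    """Classify a dws config invocation as 'list', 'show', or 'set'."""
    if name is None and value is None:
        return "list"  # prints the table of all parameters
    if value is None:
        return "show"  # prints one parameter's information
    if name is None:
        raise ValueError("PARAMETER_VALUE cannot be given without PARAMETER_NAME")
    return "set"       # sets the parameter to the given value


def config_argv(name: Optional[str] = None, value: Optional[str] = None,
                resource: Optional[str] = None) -> List[str]:
    """Build the argument list for dws config, optionally scoped to a resource."""
    config_mode(name, value)  # rejects a value without a name
    argv = ["dws", "config"]
    if resource is not None:
        argv += ["--resource", resource]
    argv += [arg for arg in (name, value) if arg is not None]
    return argv


print(config_mode(), config_argv())
print(config_mode("PARAM", "VALUE"), config_argv("PARAM", "VALUE", resource="my-resource"))
```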
delete-snapshot¶
Delete the specified snapshot. This includes the metadata and lineage data for the snapshot. Unless --no-include-resources is specified, this also deletes any results data saved for the snapshot (under the snapshots subdirectory of a results resource).
dws delete-snapshot [OPTIONS] TAG_OR_HASH
Options
- --workspace-dir <workspace_dir>¶
- --no-include-resources¶
If specified, do NOT delete any snapshot-specific content from resources.
Arguments
- TAG_OR_HASH¶
Required argument
deploy¶
Deployment-related commands
dws deploy [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>¶
build¶
Build a docker image containing this workspace. This command is EXPERIMENTAL and subject to change.
dws deploy build [OPTIONS]
Options
- --image-name <image_name>¶
Name of docker image, defaults to name of workspace
- -f, --force-rebuild¶
If specified, always rebuild image (force deletes the image from docker)
- --git-user-email <git_user_email>¶
Email address used by git inside the container. Defaults to value of user.email for this workspace.
- --git-user-name <git_user_name>¶
Username used by git inside the container. Defaults to value of user.name for this workspace.
run¶
Run a docker image containing this workspace. This command is EXPERIMENTAL and subject to change.
dws deploy run [OPTIONS]
Options
- --image-name <image_name>¶
Name of docker image, defaults to name of workspace
- --no-mount-ssh-keys¶
If specified, do not mount the host’s ~/.ssh directory into the container. This directory is needed for git authentication.
diff¶
List differences between two snapshots
dws diff [OPTIONS] SNAPSHOT_OR_TAG1 SNAPSHOT_OR_TAG2
Options
- --workspace-dir <workspace_dir>¶
Arguments
- SNAPSHOT_OR_TAG1¶
Required argument
- SNAPSHOT_OR_TAG2¶
Required argument
init¶
Initialize a new workspace
dws init [OPTIONS] [NAME]
Options
- --hostname <hostname>¶
Hostname to identify this machine in snapshot directory paths, defaults to the result of the ‘hostname’ command.
- --create-resources <create_resources>¶
Initialize the workspace with subdirectories for the specified resource roles. Choices are ‘all’ or any comma-separated combination of source-data, intermediate-data, code, results.
- --scratch-directory <scratch_directory>¶
Local scratch directory (defaults to WORKSPACE_DIR/scratch)
- --git-fat-remote <git_fat_remote>¶
Initialize the workspace with the git-fat large file extension and use the specified URL for the remote datastore
- --git-fat-user <git_fat_user>¶
Username for git fat remote (if not root)
- --git-fat-port <git_fat_port>¶
Port number for git-fat remote (defaults to 22)
- --git-fat-attributes <git_fat_attributes>¶
Comma-separated list of file patterns to manage under git-fat. For example --git-fat-attributes='*.gz,*.zip'. If you do not specify them here, you can always add the .gitattributes file later.
- --git-lfs-attributes <git_lfs_attributes>¶
Comma-separated list of file patterns to manage under git-lfs. For example --git-lfs-attributes='*.gz,*.zip'. If you do not specify them here, you can always add the .gitattributes file later.
Arguments
- NAME¶
Optional argument
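Because --create-resources accepts either ‘all’ or a comma-separated combination of the four roles, a script that generates the value may want to validate it first. The following is a minimal Python sketch under that reading of the option. The helper names are hypothetical, and "my-workspace" is a placeholder name.

```python
from typing import Iterable, List, Optional

# Roles listed for --create-resources in the option description above.
VALID_ROLES = ("source-data", "intermediate-data", "code", "results")


def create_resources_value(roles: Iterable[str]) -> str:
    """Return a value for --create-resources: 'all' or a comma-separated role list."""
    roles = list(roles)
    if roles == ["all"]:
        return "all"
    unknown = [role for role in roles if role not in VALID_ROLES]
    if not roles or unknown:
        raise ValueError("invalid roles: %r" % (unknown or roles))
    return ",".join(roles)


def init_argv(name: Optional[str] = None, roles: Optional[Iterable[str]] = None) -> List[str]:
    """Build the argument list for dws init [OPTIONS] [NAME]."""
    argv = ["dws", "init"]
    if roles is not None:
        argv += ["--create-resources", create_resources_value(roles)]
    if name is not None:
        argv.append(name)
    return argv


print(init_argv("my-workspace", roles=["code", "results"]))
```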
lineage¶
Lineage-related commands
dws lineage [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>¶
graph¶
Graph the lineage of a resource, writing the graph to an HTML file. Subcommand of lineage.
dws lineage graph [OPTIONS] OUTPUT_FILE
Options
- --resource <resource>¶
Name of the resource to graph the lineage for (defaults to the first results resource)
- --snapshot <snapshot>¶
Snapshot hash or tag to use for lineage. If not specified, use current lineage.
- --format <format>¶
Format of the output graph (defaults to html)
- Options
html | dot
- --width <width>¶
Width of graph in pixels (defaults to 1024)
- --height <height>¶
Height of graph in pixels (defaults to 800)
Arguments
- OUTPUT_FILE¶
Required argument
publish¶
Add a remote Git repository as the origin for the workspace and do the initial push of the workspace and any other resources.
dws publish [OPTIONS] REMOTE_REPOSITORY
Options
- --workspace-dir <workspace_dir>¶
- --skip <RESOURCE1[,RESOURCE2,...]>¶
Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.
Arguments
- REMOTE_REPOSITORY¶
Required argument
pull¶
Pull the latest state of the workspace and its resources from their origins.
dws pull [OPTIONS]
Options
- --workspace-dir <workspace_dir>¶
- --only <RESOURCE1[,RESOURCE2,...]>¶
Comma-separated list of resource names that you wish to pull from the origin, if applicable. The rest will be skipped.
- --skip <RESOURCE1[,RESOURCE2,...]>¶
Comma-separated list of resource names that you wish to skip when pulling. The rest will be pulled from their remote origins, if applicable.
- --only-workspace¶
Only pull the workspace’s metadata, skipping the individual resources
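The effect of --only and --skip on which resources are processed can be sketched in Python as a simple filter over the workspace's resource names. Treating the two options as mutually exclusive is an assumption of this sketch, not something stated above; the function name and the resource names in the example are hypothetical.

```python
from typing import List, Optional, Sequence


def select_resources(all_names: Sequence[str],
                     only: Optional[str] = None,
                     skip: Optional[str] = None) -> List[str]:
    """Return the resources to process, given comma-separated --only or --skip values."""
    if only is not None and skip is not None:
        # Assumption: this sketch does not try to combine the two options.
        raise ValueError("specify --only or --skip, not both")
    if only is not None:
        wanted = only.split(",")
        return [name for name in all_names if name in wanted]
    if skip is not None:
        skipped = skip.split(",")
        return [name for name in all_names if name not in skipped]
    return list(all_names)  # neither option: every resource is processed


resources = ["raw-data", "code", "results"]
print(select_resources(resources, only="code,results"))
print(select_resources(resources, skip="raw-data"))
```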
push¶
Push the state of the workspace and its resources to their origins.
dws push [OPTIONS]
Options
- --workspace-dir <workspace_dir>¶
- --only <RESOURCE1[,RESOURCE2,...]>¶
Comma-separated list of resource names that you wish to push to the origin, if applicable. The rest will be skipped.
- --skip <RESOURCE1[,RESOURCE2,...]>¶
Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.
- --only-workspace¶
Only push the workspace’s metadata, skipping the individual resources
report¶
Report generation commands
dws report [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>¶
history¶
Show the history of snapshots. Subcommand of report.
dws report history [OPTIONS]
Options
- --limit <limit>¶
Number of previous snapshots to show (most recent first)
lineage¶
Show a lineage table for either the current workspace or a specific snapshot.
Subcommand of report.
dws report lineage [OPTIONS]
Options
- --snapshot <snapshot>¶
Optional tag or hash for a snapshot. Otherwise, shows the current status.
results¶
Show the contents of a results file. Subcommand of report.
dws report results [OPTIONS]
Options
- --snapshot <snapshot>¶
Optional tag or hash for a snapshot. Otherwise, shows the current status.
- --resource <resource>¶
Optional resource name. Otherwise, will look for first resource with a results file.
status¶
Show the status of resources in this workspace. Subcommand of report.
dws report status [OPTIONS]
restore¶
Restore the workspace to a prior state
dws restore [OPTIONS] TAG_OR_HASH
Options
- --workspace-dir <workspace_dir>¶
- --only <RESOURCE1[,RESOURCE2,...]>¶
Comma-separated list of resource names that you wish to revert to the specified snapshot. The rest will be left as-is.
- --leave <RESOURCE1[,RESOURCE2,...]>¶
Comma-separated list of resource names that you wish to leave in their current state. The rest will be restored to the specified snapshot.
- --strict¶
If specified, error out if unable to restore any of the requested resources (due to lack of a restore hash or removing the resource from workspace).
Arguments
- TAG_OR_HASH¶
Required argument
snapshot¶
Take a snapshot of the current workspace’s state
dws snapshot [OPTIONS] [TAG]
Options
- --workspace-dir <workspace_dir>¶
- -m, --message <message>¶
Message describing the snapshot
Arguments
- TAG¶
Optional argument
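To show how the commands documented so far fit together, here is a minimal Python sketch that scripts a small end-to-end sequence: initialize a workspace, add a code repository and a data directory, and take a tagged snapshot. It uses only commands and options from this reference. The function names, directory paths, workspace name, and tag are hypothetical, and nothing is executed until run_all is called on a machine where dws is installed.

```python
import subprocess
from typing import Callable, List


def workflow(name: str, code_dir: str, data_dir: str, tag: str, message: str) -> List[List[str]]:
    """Return the dws command lines for a small init/add/snapshot sequence."""
    return [
        ["dws", "--batch", "init", name],
        # code_dir is assumed to already be a local git repository.
        ["dws", "--batch", "add", "git", "--role", "code", code_dir],
        ["dws", "--batch", "add", "local-files", "--role", "source-data", data_dir],
        ["dws", "--batch", "snapshot", "-m", message, tag],
    ]


def run_all(commands: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        runner(argv, check=True)  # check=True raises CalledProcessError on a non-zero exit


# Print the plan without running anything.
for argv in workflow("my-workspace", "./code", "./data", "v1", "first snapshot"):
    print(" ".join(argv))
```

Passing a different runner (for example, one that only logs its arguments) gives a dry-run mode without changing the workflow definition.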
status¶
NOTE: this command is DEPRECATED. Please use dws report status and dws report history instead.
dws status [OPTIONS]
Options
- --workspace-dir <workspace_dir>¶
- --history¶
Show previous snapshots
- --limit <limit>¶
Number of previous snapshots to show (most recent first)
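For scripts that still call the deprecated command, the replacement can be sketched in Python as a small mapping. It assumes that dws status --history corresponds to dws report history and that plain dws status corresponds to dws report status; that correspondence is inferred from the option names above rather than stated explicitly. The function name is hypothetical. Note that --workspace-dir is documented as an option of the report group, so it is placed before the subcommand.

```python
from typing import List, Optional


def status_replacement(history: bool = False, limit: Optional[int] = None,
                       workspace_dir: Optional[str] = None) -> List[str]:
    """Build the dws report command that stands in for a deprecated dws status call."""
    argv = ["dws", "report"]
    if workspace_dir is not None:
        argv += ["--workspace-dir", workspace_dir]  # option of the report group
    if history:
        argv.append("history")
        if limit is not None:
            argv += ["--limit", str(limit)]         # option of report history
    else:
        argv.append("status")                       # report status lists no options
    return argv


print(status_replacement())
print(status_replacement(history=True, limit=5))
```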
version¶
Print the version of Data Workspaces and exit.
dws version [OPTIONS]