3. Command Reference

In this section, we describe the full command-line interface for Data Workspaces. This interface is built around a script, dws, which is installed on your path when you install the dataworkspaces package. The overall interface for dws is:

dws [--batch] [--verbose] [--help] COMMAND [--help] [OPTIONS] [ARGS]...

dws has three options common to all commands:

  • --batch, which runs the command in a mode that never asks for user confirmation and will error out if it absolutely requires an input (useful for automation),
  • --verbose, which will print a lot of detail about what will be and has been done for a command (useful for debugging), and
  • --help, which prints these common options and a list of available commands.

Next on the command line comes the command name (e.g. init, clone, snapshot). Each command has its own arguments and options, as documented below. All commands take a --help option, which prints the specific options and arguments for the command. Finally, the add command has further subcommands, representing the individual resource types (e.g. git, local-files, rclone).
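
The ordering described above (common options first, then the command and any subcommand, then that command's own options and arguments) can be sketched as a small Python helper for scripting. This is only an illustration: dws_argv is a hypothetical name and is not part of the dataworkspaces package, and the path ./my-repo is a placeholder.

```python
from typing import List, Sequence


def dws_argv(command: Sequence[str], args: Sequence[str] = (),
             batch: bool = False, verbose: bool = False) -> List[str]:
    """Build an argument list for dws (hypothetical helper for illustration)."""
    argv = ["dws"]
    # Common options go before the command name.
    if batch:
        argv.append("--batch")
    if verbose:
        argv.append("--verbose")
    # The command (plus any subcommand, e.g. ["add", "git"]) comes next,
    # followed by that command's own options and arguments.
    argv.extend(command)
    argv.extend(args)
    return argv


# Example: add a git resource in batch mode (the path is a placeholder).
print(dws_argv(["add", "git"], ["--role", "code", "./my-repo"], batch=True))
```

The resulting list can be passed to subprocess.run on a machine where dws is installed.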

dws

dws [OPTIONS] COMMAND [ARGS]...

Options

-b, --batch

Run in batch mode, never ask for user inputs.

--verbose

Print extra debugging information and ask for confirmation before running actions.

add

Add a data collection to the workspace as a resource. Possible types of resources are api-resource, git, local-files, or rclone; these are subcommands of add.

dws add [OPTIONS] COMMAND [ARGS]...

Options

--workspace-dir <workspace_dir>

api-resource

Resource to represent data obtained via an API. Use this when there is no file-based representation of your data that can be versioned and captured more directly. Subcommand of add.

dws add api-resource [OPTIONS]

Options

--role <role>
--name <name>

Short name for this resource

git

Add a local git repository as a resource. Subcommand of add.

dws add git [OPTIONS] PATH

Options

--role <role>
--name <name>

Short name for this resource

--branch <branch>

Branch of the repo to use, defaults to master.

-r, --read-only

If specified, treat the origin repository as read-only and never push to it.

-e, --export

On snapshots, export lineage data for import into other workspaces

--imported

This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.

Arguments

PATH

Required argument
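
The interaction between --imported, --read-only, and --role described above can be modeled as follows. This is a minimal sketch of the documented implication only; the names GitResourceSettings and effective_settings are hypothetical, and how dws itself reacts to an explicitly conflicting --role is not stated here, so the sketch simply lets --imported take precedence as an assumption.

```python
from typing import NamedTuple, Optional


class GitResourceSettings(NamedTuple):
    role: Optional[str]
    read_only: bool


def effective_settings(role: Optional[str], read_only: bool = False,
                       imported: bool = False) -> GitResourceSettings:
    """Return the settings implied by the given options (illustrative only)."""
    if imported:
        # Documented behavior: an imported resource implies
        # --read-only and --role=source-data.
        return GitResourceSettings(role="source-data", read_only=True)
    return GitResourceSettings(role=role, read_only=read_only)


print(effective_settings("code", imported=True))
```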

local-files

Add a local file directory (not managed by git) to the workspace. Subcommand of add.

dws add local-files [OPTIONS] PATH

Options

--role <role>
--name <name>

Short name for this resource

--compute-hash

Compute hashes for all files. If this option is not set, we use a lightweight comparison of file sizes only.

-e, --export

On snapshots, export lineage data for import into other workspaces

--imported

This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.

Arguments

PATH

Required argument
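
To illustrate the trade-off behind --compute-hash, the sketch below fingerprints a directory either by content hash or by file size alone. It is not the implementation used by dws: the function name fingerprint_tree is hypothetical, and SHA-256 is an assumption chosen for the example.

```python
import hashlib
import os
from typing import Dict


def fingerprint_tree(root: str, compute_hash: bool = False) -> Dict[str, str]:
    """Map each file's path (relative to root) to a fingerprint string."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            if compute_hash:
                # Thorough: reads every byte (SHA-256 is an assumption here).
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        digest.update(chunk)
                result[rel] = digest.hexdigest()
            else:
                # Lightweight: only the file size is consulted.
                result[rel] = str(os.path.getsize(path))
    return result
```

A size-only comparison is cheap but cannot detect an edit that leaves the file length unchanged, whereas hashing detects it at the cost of reading every file.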

rclone

Add an rclone-managed repository as a resource to the workspace. Subcommand of add. This is designed for unidirectional synchronization between a remote (REMOTE) and a local path (LOCAL_PATH). The remote has the form remote_name:remote_path, where remote_name is an entry in your rclone config file.

dws add rclone [OPTIONS] REMOTE LOCAL_PATH

Options

--role <role>
--name <name>

Short name for this resource

--config <config>

Configuration file for rclone

--compute-hash

Compute hashes for all files

-e, --export

On snapshots, export lineage data for import into other workspaces

--imported

This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.

--master <master>

Determines which system is the master. If ‘remote’, then pulls will be done, but not pushes. If ‘local’, then pushes will be done, but not pulls. If ‘none’ (the default), no action will be taken for pushes and pulls (you need to synchronize manually using rclone). When first adding the resource or cloning to a new machine, if the local directory does not exist, and ‘remote’ or ‘none’ were specified, the contents of the remote will be copied down to the local directory.

Options: none|remote|local

--sync-mode <sync_mode>

When copying between local and master, which rclone command to use. If you specify ‘copy’, files are added or overwritten without deleting any files present at the target. If you specify ‘sync’, files at the target are removed if they are not present at the source. The default is ‘copy’. If master is ‘none’, this option has no effect.

Options: copy|sync

--size-only

If specified, use only the file size (rather than also modification time and checksum) to determine if a file has been changed. If your resource has a lot of files and access to the remote is over a WAN, you probably want to set this. Otherwise, syncs/copies can be VERY slow.

Arguments

REMOTE

Required argument

LOCAL_PATH

Required argument
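
The --master rules for the rclone resource can be summarized as a small decision table. The sketch below only restates the documented behavior; the function names rclone_transfers and initial_copy_down are hypothetical and not part of dws.

```python
from typing import Dict

MASTER_CHOICES = ("none", "remote", "local")


def rclone_transfers(master: str = "none") -> Dict[str, bool]:
    """Say whether dws pull / dws push transfer data for a given --master."""
    if master not in MASTER_CHOICES:
        raise ValueError(f"master must be one of {MASTER_CHOICES}")
    return {
        "pull": master == "remote",  # remote is master: pulls only
        "push": master == "local",   # local is master: pushes only
    }


def initial_copy_down(master: str, local_dir_exists: bool) -> bool:
    """Say whether the remote is copied down when the resource is first
    added or the workspace is cloned to a new machine."""
    if master not in MASTER_CHOICES:
        raise ValueError(f"master must be one of {MASTER_CHOICES}")
    return (not local_dir_exists) and master in ("remote", "none")


for choice in MASTER_CHOICES:
    print(choice, rclone_transfers(choice))
```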

clone

Clone the specified data workspace.

dws clone [OPTIONS] REPOSITORY [DIRECTORY]

Options

--hostname <hostname>

Hostname to identify this machine in snapshot directory paths, defaults to the result of the ‘hostname’ command.

Arguments

REPOSITORY

Required argument

DIRECTORY

Optional argument

config

Get or set configuration parameters. Local parameters are only for this copy of the workspace, while global parameters are stored centrally and affect all copies.

If neither PARAMETER_NAME nor PARAMETER_VALUE is specified, this command prints a table of all parameters and their information (scope, value, default or not, and help text). If just PARAMETER_NAME is specified, it prints the specified parameter’s information. Finally, if both the parameter name and value are specified, the parameter is set to the specified value.

dws config [OPTIONS] [PARAMETER_NAME] [PARAMETER_VALUE]

Options

--workspace-dir <workspace_dir>
--resource <resource>

If specified, get/set parameters for the specified resource, rather than the workspace.

Arguments

[PARAMETER_NAME]

Optional argument

[PARAMETER_VALUE]

Optional argument
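
The three modes of dws config (list all, show one, set one) differ only in how many positional arguments are given. A minimal sketch of building each invocation from Python follows; config_argv is a hypothetical helper, and the parameter and resource names used with it are placeholders rather than real configuration parameters.

```python
from typing import List, Optional


def config_argv(name: Optional[str] = None, value: Optional[str] = None,
                resource: Optional[str] = None) -> List[str]:
    """Build a dws config command line (illustrative helper)."""
    if value is not None and name is None:
        raise ValueError("a value requires a parameter name")
    argv = ["dws", "config"]
    if resource is not None:
        # Get/set parameters of one resource rather than the workspace.
        argv += ["--resource", resource]
    if name is not None:
        argv.append(name)    # name only: print that parameter's information
    if value is not None:
        argv.append(value)   # name and value: set the parameter
    return argv


print(config_argv())                          # table of all parameters
print(config_argv("some.parameter"))          # placeholder parameter name
print(config_argv("some.parameter", "42", resource="my-resource"))
```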

delete-snapshot

Delete the specified snapshot. This includes the metadata and lineage data for the snapshot. Unless --no-include-resources is specified, this also deletes any results data saved for the snapshot (under the snapshots subdirectory of a results resource).

dws delete-snapshot [OPTIONS] TAG_OR_HASH

Options

--workspace-dir <workspace_dir>
--no-include-resources

If specified, do NOT delete any snapshot-specific content from resources.

Arguments

TAG_OR_HASH

Required argument

deploy

Deployment-related commands

dws deploy [OPTIONS] COMMAND [ARGS]...

Options

--workspace-dir <workspace_dir>

build

Build a docker image containing this workspace. This command is EXPERIMENTAL and subject to change.

dws deploy build [OPTIONS]

Options

--image-name <image_name>

Name of docker image, defaults to name of workspace

-f, --force-rebuild

If specified, always rebuild image (force deletes the image from docker)

--git-user-email <git_user_email>

Email address used by git inside the container. Defaults to value of user.email for this workspace.

--git-user-name <git_user_name>

Username used by git inside the container. Defaults to value of user.name for this workspace.

run

Run a docker image containing this workspace. This command is EXPERIMENTAL and subject to change.

dws deploy run [OPTIONS]

Options

--image-name <image_name>

Name of docker image, defaults to name of workspace

--no-mount-ssh-keys

If specified, do not mount the host’s ~/.ssh directory into the container. This directory is needed for git authentication.

diff

List differences between two snapshots

dws diff [OPTIONS] SNAPSHOT_OR_TAG1 SNAPSHOT_OR_TAG2

Options

--workspace-dir <workspace_dir>

Arguments

SNAPSHOT_OR_TAG1

Required argument

SNAPSHOT_OR_TAG2

Required argument

init

Initialize a new workspace

dws init [OPTIONS] [NAME]

Options

--hostname <hostname>

Hostname to identify this machine in snapshot directory paths, defaults to the result of the ‘hostname’ command.

--create-resources <create_resources>

Initialize the workspace with subdirectories for the specified resource roles. Choices are ‘all’ or any comma-separated combination of source-data, intermediate-data, code, results.

--scratch-directory <scratch_directory>

Local scratch directory (defaults to WORKSPACE_DIR/scratch)

--git-fat-remote <git_fat_remote>

Initialize the workspace with the git-fat large file extension and use the specified URL for the remote datastore

--git-fat-user <git_fat_user>

Username for git fat remote (if not root)

--git-fat-port <git_fat_port>

Port number for git-fat remote (defaults to 22)

--git-fat-attributes <git_fat_attributes>

Comma-separated list of file patterns to manage under git-fat. For example, --git-fat-attributes='.gz,.zip'. If you do not specify them here, you can always add the .gitattributes file later.

--git-lfs-attributes <git_lfs_attributes>

Comma-separated list of file patterns to manage under git-lfs. For example, --git-lfs-attributes='.gz,.zip'. If you do not specify them here, you can always add the .gitattributes file later.

Arguments

NAME

Optional argument
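
The value accepted by --create-resources (either ‘all’ or a comma-separated combination of roles) can be checked before calling dws init. The sketch below shows one way to do so; parse_create_resources is a hypothetical helper, and how dws itself parses or validates the value is not described here.

```python
from typing import List

# The four resource roles listed for --create-resources.
ROLES = ("source-data", "intermediate-data", "code", "results")


def parse_create_resources(value: str) -> List[str]:
    """Expand a --create-resources value into a list of roles."""
    if value == "all":
        return list(ROLES)
    roles = [part.strip() for part in value.split(",")]
    unknown = [role for role in roles if role not in ROLES]
    if unknown:
        raise ValueError(f"unknown resource role(s): {', '.join(unknown)}")
    return roles


print(parse_create_resources("code,results"))
```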

lineage

Lineage-related commands

dws lineage [OPTIONS] COMMAND [ARGS]...

Options

--workspace-dir <workspace_dir>

graph

Graph the lineage of a resource, writing the graph to an output file (HTML by default). Subcommand of lineage.

dws lineage graph [OPTIONS] OUTPUT_FILE

Options

--resource <resource>

Name of the resource whose lineage should be graphed (defaults to the first results resource)

--snapshot <snapshot>

Snapshot hash or tag to use for lineage. If not specified, use current lineage.

--format <format>

Format of the output graph (defaults to html)

Options: html|dot

--width <width>

Width of graph in pixels (defaults to 1024)

--height <height>

Height of graph in pixels (defaults to 800)

Arguments

OUTPUT_FILE

Required argument
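
As an illustration of how the lineage graph options combine, the following sketch builds the command line in Python and validates --format against the documented choices. The helper name lineage_graph_argv and the file name in the example are hypothetical; the defaults mirror those listed above.

```python
from typing import List, Optional


def lineage_graph_argv(output_file: str, resource: Optional[str] = None,
                       snapshot: Optional[str] = None, fmt: str = "html",
                       width: int = 1024, height: int = 800) -> List[str]:
    """Build a dws lineage graph command line (illustrative helper)."""
    if fmt not in ("html", "dot"):
        raise ValueError("format must be 'html' or 'dot'")
    argv = ["dws", "lineage", "graph",
            "--format", fmt, "--width", str(width), "--height", str(height)]
    if resource is not None:
        argv += ["--resource", resource]
    if snapshot is not None:
        # Without --snapshot, the current lineage is used.
        argv += ["--snapshot", snapshot]
    argv.append(output_file)
    return argv


print(lineage_graph_argv("lineage.dot", fmt="dot"))
```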

publish

Add a remote Git repository as the origin for the workspace and do the initial push of the workspace and any other resources.

dws publish [OPTIONS] REMOTE_REPOSITORY

Options

--workspace-dir <workspace_dir>
--skip <RESOURCE1[,RESOURCE2,...]>

Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.

Arguments

REMOTE_REPOSITORY

Required argument

pull

Pull the latest state of the workspace and its resources from their origins.

dws pull [OPTIONS]

Options

--workspace-dir <workspace_dir>
--only <RESOURCE1[,RESOURCE2,...]>

Comma-separated list of resource names that you wish to pull from the origin, if applicable. The rest will be skipped.

--skip <RESOURCE1[,RESOURCE2,...]>

Comma-separated list of resource names that you wish to skip when pulling. The rest will be pulled from their remote origins, if applicable.

--only-workspace

Only pull the workspace’s metadata, skipping the individual resources

push

Push the state of the workspace and its resources to their origins.

dws push [OPTIONS]

Options

--workspace-dir <workspace_dir>
--only <RESOURCE1[,RESOURCE2,...]>

Comma-separated list of resource names that you wish to push to the origin, if applicable. The rest will be skipped.

--skip <RESOURCE1[,RESOURCE2,...]>

Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.

--only-workspace

Only push the workspace’s metadata, skipping the individual resources
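
The --only and --skip options of pull and push select complementary subsets of the workspace's resources. The sketch below models that selection; select_resources is a hypothetical helper, the resource names are placeholders, and treating --only and --skip as mutually exclusive is an assumption made for this example rather than documented behavior.

```python
from typing import List, Optional, Sequence


def select_resources(all_names: Sequence[str], only: Optional[str] = None,
                     skip: Optional[str] = None) -> List[str]:
    """Return the resources a pull or push would act on (illustrative)."""
    if only is not None and skip is not None:
        # Assumption for this sketch: the two options are not combined.
        raise ValueError("specify at most one of only and skip")
    if only is not None:
        wanted = [name for name in only.split(",") if name]
        return [name for name in all_names if name in wanted]
    if skip is not None:
        skipped = [name for name in skip.split(",") if name]
        return [name for name in all_names if name not in skipped]
    return list(all_names)


resources = ["source-data", "code", "results"]  # placeholder resource names
print(select_resources(resources, only="code,results"))
print(select_resources(resources, skip="results"))
```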

report

Report generation commands

dws report [OPTIONS] COMMAND [ARGS]...

Options

--workspace-dir <workspace_dir>

history

Show the history of snapshots. Subcommand of report.

dws report history [OPTIONS]

Options

--limit <limit>

Number of previous snapshots to show (most recent first)

lineage

Show a lineage table for either the current workspace or a specific snapshot. Subcommand of report.

dws report lineage [OPTIONS]

Options

--snapshot <snapshot>

Optional tag or hash for a snapshot. Otherwise, shows the current status.

results

Show the contents of a results file. Subcommand of report.

dws report results [OPTIONS]

Options

--snapshot <snapshot>

Optional tag or hash for a snapshot. Otherwise, shows the current status.

--resource <resource>

Optional resource name. Otherwise, will look for first resource with a results file.

status

Show the status of resources in this workspace. Subcommand of report.

dws report status [OPTIONS]

restore

Restore the workspace to a prior state

dws restore [OPTIONS] TAG_OR_HASH

Options

--workspace-dir <workspace_dir>
--only <RESOURCE1[,RESOURCE2,...]>

Comma-separated list of resource names that you wish to revert to the specified snapshot. The rest will be left as-is.

--leave <RESOURCE1[,RESOURCE2,...]>

Comma-separated list of resource names that you wish to leave in their current state. The rest will be restored to the specified snapshot.

--strict

If specified, error out if unable to restore any of the requested resources (because a restore hash is missing or the resource has been removed from the workspace).

Arguments

TAG_OR_HASH

Required argument

snapshot

Take a snapshot of the current workspace’s state

dws snapshot [OPTIONS] [TAG]

Options

--workspace-dir <workspace_dir>
-m, --message <message>

Message describing the snapshot

Arguments

TAG

Optional argument
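
A common pattern is to take a tagged snapshot and later restore the workspace to it. The sketch below builds both command lines using only the options documented for snapshot and restore; the helper names, the tag v1, the message, and the resource name results are placeholders chosen for illustration.

```python
import shlex
from typing import List, Optional, Sequence


def snapshot_argv(tag: Optional[str] = None,
                  message: Optional[str] = None) -> List[str]:
    """Build a dws snapshot command line (illustrative helper)."""
    argv = ["dws", "snapshot"]
    if message is not None:
        argv += ["-m", message]
    if tag is not None:
        argv.append(tag)
    return argv


def restore_argv(tag_or_hash: str, only: Sequence[str] = (),
                 leave: Sequence[str] = (), strict: bool = False) -> List[str]:
    """Build a dws restore command line (illustrative helper)."""
    argv = ["dws", "restore"]
    if only:
        argv += ["--only", ",".join(only)]    # comma-separated resource names
    if leave:
        argv += ["--leave", ",".join(leave)]
    if strict:
        argv.append("--strict")
    argv.append(tag_or_hash)
    return argv


# Print shell-quoted command lines; run them only where dws is installed.
print(shlex.join(snapshot_argv("v1", "baseline model")))
print(shlex.join(restore_argv("v1", leave=["results"], strict=True)))
```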

status

NOTE: this command is DEPRECATED. Please use dws report status and dws report history instead.

dws status [OPTIONS]

Options

--workspace-dir <workspace_dir>
--history

Show previous snapshots

--limit <limit>

Number of previous snapshots to show (most recent first)

version

Print the version of Data Workspaces and exit.

dws version [OPTIONS]