Data Workspaces¶
Data management for reproducibility and collaboration¶
Data Workspaces is an open source framework for maintaining the state of a data science project, including data sets, intermediate data, results, and code. It supports reproducibility through snapshotting and lineage models, and collaboration through a push/pull model inspired by source control systems like Git.
Data Workspaces is installed as a Python 3 package and provides a Git-like command line interface and programming APIs. Specific data science tools and workflows are supported through extensions called kits. The goal is to provide the reproducibility and collaboration benefits with minimal changes to your current projects and processes.
Data Workspaces runs on Unix-like systems, including Linux, MacOS, and on Windows via the Windows Subsystem for Linux.
Data Workspaces lets you:
- Track and version all the different resources for your data science project from one place.
- Automatically track the full history of your experimental results. Scripts can easily be developed to build reports on these results.
- Reproduce any prior experiment, including the source data, code, and configuration parameters used.
- Go back to a prior experiment as a “branching-off” point to explore additional permutations.
- Collaborate with others on the same project, sharing data, code, and results.
- Easily reproduce your environment on a new machine to parallelize work.
- Publish your environment on a site like GitHub or GitLab for others to download and explore.
1. Introduction¶
Quick Start¶
Here is a quick example to give you a flavor of the project, using scikit-learn and the famous digits dataset running in a Jupyter Notebook.
First, install [1] the library:
pip install dataworkspaces
Now, we will create a workspace:
mkdir quickstart
cd ./quickstart
dws init --create-resources code,results
This created our workspace (which is a git repository under the covers) and initialized it with two subdirectories, one for the source code, and one for the results. These are special subdirectories, in that they are resources which can be tracked and versioned independently.
[1] See the Installation section for more options and details.
Now, we are going to add our source data to the workspace. This resides in an external, third-party git repository. It is simple to add:
git clone https://github.com/jfischer/sklearn-digits-dataset.git
dws add git --role=source-data --read-only ./sklearn-digits-dataset
The first line (git clone ...) makes a local copy of the Git repository for the Digits dataset. The second line (dws add git ...) adds the repository to the workspace as a resource to be tracked as part of our project. The --role option tells Data Workspaces how we will use the resource (as source data), and the --read-only option indicates that we should treat the repository as read-only and never try to push it to its origin [2] (as you do not have write permissions to the origin copy of this repository).
We can see the list of resources in our workspace via the command dws report status:
$ dws report status
Status for workspace: quickstart
Resources for workspace: quickstart
| Resource | Role | Type | Parameters |
|________________________|_____________|__________________|___________________________________________________________________________|
| sklearn-digits-dataset | source-data | git | remote_origin_url=https://github.com/jfischer/sklearn-digits-dataset.git, |
| | | | relative_local_path=sklearn-digits-dataset, |
| | | | branch=master, |
| | | | read_only=True |
| code | code | git-subdirectory | relative_path=code |
| results | results | git-subdirectory | relative_path=results |
No resources for the following roles: intermediate-data.
[2] In Git, each remote copy of a repository is assigned a name. By convention, the origin is the copy from which the local copy was cloned.
Now, we can create a Jupyter notebook for running our experiments:
cd ./code
jupyter notebook
This will bring up the Jupyter app in your browser. Click on the New dropdown (on the right side) and select “Python 3”. Once in the notebook, click on the current title (“Untitled”, at the top, next to “Jupyter”) and change the title to digits-svc.
Now, type the following Python code in the first cell:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from dataworkspaces.kits.scikit_learn import LineagePredictor, load_dataset_from_resource
# load the data from filesystem into a "Bunch"
dataset = load_dataset_from_resource('sklearn-digits-dataset')
# Instantiate a support vector classifier and wrap it for dws
classifier = LineagePredictor(SVC(gamma=0.001),
                              'multiclass_classification',
                              input_resource=dataset.resource,
                              model_save_file='digits.joblib')
# split the training and test data
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5, shuffle=False)
# train and score the classifier
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)
This code is the same as you would write for scikit-learn without dws, except that:
- we load the dataset from a resource rather than call the lower-level NumPy functions (although you can call those if you prefer), and
- we wrap the support vector classifier instance with a LineagePredictor.
It will take a second to train and run the classifier. In the output of the cell, you should then see:
Wrote results to results:results.json
0.9688542825361512
Now, you can save and shut down your notebook. If you look at the directory quickstart/results, you should see a saved model file, digits.joblib, and a results file, results.json, with information about your run. We can format and view the results file with the command dws report results:
$ dws report results
Results file at results:/results.json
General Properties
| Key | Value |
|________________________|____________________________|
| step | digits-svc |
| start_time | 2020-01-14T12:54:00.473892 |
| execution_time_seconds | 0.13 |
| run_description | None |
Parameters
| Key | Value |
|_________________________|_______|
| C | 1.0 |
| cache_size | 200 |
| class_weight | None |
| coef0 | 0.0 |
| decision_function_shape | ovr |
| degree | 3 |
| gamma | 0.001 |
| kernel | rbf |
| max_iter | -1 |
| probability | False |
| random_state | None |
| shrinking | True |
| tol | 0.001 |
| verbose | False |
Metrics
| Key | Value |
|__________|_______|
| accuracy | 0.969 |
Metrics: classification_report
| Key | Value |
|______________|_______________________________________________________________________________________________________|
| 0.0 | precision: 1.0, recall: 0.9886363636363636, f1-score: 0.9942857142857142, support: 88 |
| 1.0 | precision: 0.9887640449438202, recall: 0.967032967032967, f1-score: 0.9777777777777779, support: 91 |
| 2.0 | precision: 0.9883720930232558, recall: 0.9883720930232558, f1-score: 0.9883720930232558, support: 86 |
| 3.0 | precision: 0.9753086419753086, recall: 0.8681318681318682, f1-score: 0.9186046511627908, support: 91 |
| 4.0 | precision: 0.9887640449438202, recall: 0.9565217391304348, f1-score: 0.9723756906077348, support: 92 |
| 5.0 | precision: 0.946236559139785, recall: 0.967032967032967, f1-score: 0.9565217391304348, support: 91 |
| 6.0 | precision: 0.989010989010989, recall: 0.989010989010989, f1-score: 0.989010989010989, support: 91 |
| 7.0 | precision: 0.9565217391304348, recall: 0.9887640449438202, f1-score: 0.9723756906077348, support: 89 |
| 8.0 | precision: 0.9361702127659575, recall: 1.0, f1-score: 0.967032967032967, support: 88 |
| 9.0 | precision: 0.9278350515463918, recall: 0.9782608695652174, f1-score: 0.9523809523809524, support: 92 |
| micro avg | precision: 0.9688542825361512, recall: 0.9688542825361512, f1-score: 0.9688542825361512, support: 899 |
| macro avg | precision: 0.9696983376479764, recall: 0.9691763901507882, f1-score: 0.9688738265020351, support: 899 |
| weighted avg | precision: 0.9696092010839529, recall: 0.9688542825361512, f1-score: 0.9686644837258652, support: 899 |
Next, let us take a snapshot, which will record the state of the workspace and save the data lineage along with our results:
dws snapshot -m "first run with SVC" SVC-1
SVC-1 is the tag of our snapshot. If you look in quickstart/results, you will see that the results (currently just results.json) have been moved to the subdirectory snapshots/HOSTNAME-SVC-1, where HOSTNAME is the hostname of your local machine. A file, lineage.json, containing a full data lineage graph for our experiment has also been created in that directory.
We can see the history of snapshots with the command dws report history:
$ dws report history
History of snapshots
| Hash | Tags | Created | accuracy | classification_report | Message |
|_________|_______|_____________________|__________|___________________________|____________________|
| f1401a8 | SVC-1 | 2020-01-14T13:00:39 | 0.969 | {'0.0': {'precision': 1.. | first run with SVC |
1 snapshots total
We can also see the lineage for this snapshot with the command dws report lineage --snapshot SVC-1:
$ dws report lineage --snapshot SVC-1
Lineage for SVC-1
| Resource | Type | Details | Inputs |
|________________________|_____________|__________________________________________|________________________________________|
| results | Step | digits-svc at 2020-01-14 12:54:00.473892 | sklearn-digits-dataset (Hash:635b7182) |
| sklearn-digits-dataset | Source Data | Hash:635b7182 | None |
This report shows us that the results resource was written by the digits-svc step, which had as its input the resource sklearn-digits-dataset. We also know the specific version of this resource (hash 635b7182) and that it is source data, not written by another step.
Some things you can do from here:
- Run more experiments and save their results by snapshotting the workspace. If, at some point, we want to go back to our first experiment, we can run dws restore SVC-1. This will restore the state of the source data and code subdirectories, but leave the full history of the results.
- Upload your workspace to GitHub or any other Git hosting application. This can be to have a backup copy or to share with others. Others can download it via dws clone.
- More complex scenarios involving multi-step data pipelines can easily be automated. See the documentation for details.
See the Tutorial Section for a continuation of this example.
Installation¶
Now, let us look at the installation options in more detail.
Prerequisites¶
This software runs directly on Linux and macOS. Windows is supported via the Windows Subsystem for Linux. The following software should be pre-installed:
- git
- Python 3.5 or later
- Optionally, the rclone utility, if you are going to be using it to sync with a remote copy of your data.
Installation from the Python Package Index (PyPi)¶
The easiest way to install Data Workspaces is via the Python Package Index at http://pypi.org.
We recommend first creating a virtual environment to contain the Data Workspaces software and any other software needed for your project. Using the standard Python 3 distribution, you can create and activate a virtual environment via:
python3 -m venv VIRTUAL_ENVIRONMENT_PATH
source VIRTUAL_ENVIRONMENT_PATH/bin/activate
If you are using the Anaconda distribution of Python 3, you can create and activate a virtual environment via:
conda create --name VIRTUAL_ENVIRONMENT_NAME
conda activate VIRTUAL_ENVIRONMENT_NAME
Now that you have your virtual environment set up, we can install the actual library:
pip install dataworkspaces
To verify that it was installed correctly, run:
dws --help
Installation via the source tree¶
You can clone the source tree and install it as follows:
git clone git@github.com:data-workspaces/data-workspaces-core.git
cd data-workspaces-core
pip install `pwd`
dws --help # just a sanity check that it was installed correctly
Concepts¶
Data Workspaces provides a thin layer on top of the Git version control system for easy management of source data, intermediate data, and results for data science projects. A workspace is a Git repository with some added metadata to track external resources and experiment history. You can create and manipulate workspaces via dws, a command line tool. There is also a programmatic API for integrating more tightly with your data pipeline.
A workspace contains one or more resources. Each resource represents a collection of data that has a particular role in the project – source data, intermediate data (generated by processing the original source data), code, and results. Resources can be subdirectories in the workspace’s Git repository, separate git repositories, local directories, or remote systems (e.g. an S3 bucket or a remote server’s files accessed via ssh).
Once the assets of a data science project have been organized into resources, one can do the work of developing the associated software and running experiments. At any point in time, you can take a snapshot, which captures the current state of all the resources referenced by the workspace. If you want to go back to a prior state of the workspace or even an individual resource, you can restore back to any prior snapshot.
Results resources are handled a little differently than other types: they are always additive. Each snapshot of a results resource takes the current files in the resource and moves them to a snapshot-specific subdirectory. This lets you view and compare the results of all your prior experiments.
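For example, after a couple of snapshots, a results resource on disk might look roughly like this (the hostname is hypothetical; the tags follow the Quick Start example):
results/
  results.json                <- results of the current, not-yet-snapshotted run
  snapshots/
    myhost-SVC-1/             <- files captured when snapshot SVC-1 was taken
      results.json
      lineage.json
    myhost-LR-1/
      results.json
      lineage.json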
You interact with your data workspace through the dws command line tool, which, like Git, has various subcommands for the actions you might take (e.g. creating a new snapshot, syncing with a remote repository, etc.).
Beyond the basic versioning of your project through snapshots, you can use the Lineage API to track each step of your workflow, including inputs/outputs, parameters, and metrics (accuracy, loss, precision, recall, roc, etc.). This lineage data is saved with your snapshots so you can understand how you arrived at each of your results.
Command Line Interface¶
To run the command line interface, you use the dws command, which should have been installed into your environment by pip install.
dws operations have the form:
dws [GLOBAL_OPTIONS] COMMAND [COMMAND_OPTIONS] [COMMAND_ARGS]
Just run dws --help for a list of global options and commands.
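For example, a global option can be combined with a command, its options, and its arguments (the message and tag here are hypothetical):
dws --verbose snapshot -m "nightly training run" NIGHTLY-1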
Commands¶
Here is a summary of the key commands:
- init - initialize a new workspace in the current directory
- add - add a resource (a git repo, a directory, an s3 bucket, etc.) to the current workspace
- snapshot - take a snapshot of the current state of the workspace
- restore - restore the state to a prior snapshot
- publish - associate a workspace with a remote git repository (e.g. on GitHub)
- push - push a workspace and all resources to their (remote) origins
- pull - pull the workspace and all resources from their (remote) origins
- clone - clone a workspace and all the associated resources to the local machine
- report - various reports about the workspace
- run - run a command and capture the lineage. This information is saved in a file for future calls to the same command. (not yet implemented)
See the Command Reference section for a full description of all commands and their options.
Workflow¶
To put these commands in context, here is a typical workflow for the initial data scientist on a project:

The person starting the project creates a new workspace on their local machine using the init command. Next, they need to tell the data workspace about their code, data sets, and places where they will store intermediate data and results. If subdirectories of the main workspace are sufficient, they can do this as a part of the init command, using the --create-resources option. Otherwise, they use the add command to define each resource associated with their project.
The data scientist can now run their experiments. This is typically an iterative process, represented in the picture by the dashed box labeled “Experiment Workflow”. Once they have finished a complete experiment, they can use the snapshot command to capture the state of their workspace. They can go back and run further experiments, taking a snapshot each time they have something interesting. They can also go back to a prior state using the restore command.
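As a sketch, the shell session for this initial workflow might look like the following (the project name, resource layout, and tag are hypothetical):
mkdir my-project
cd ./my-project
dws init --create-resources code,intermediate-data,results
dws add local-files --role=source-data ./raw-data   # a directory of input files
# ... run an experiment ...
dws snapshot -m "baseline experiment" BASELINE-1
# later, return the source data and code to that state
dws restore BASELINE-1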
Collaboration¶
At some point, the data scientist will want to copy their project to a remote service for sharing (and backup). Data Workspaces can use any Git hosting service for this (e.g. GitHub, GitLab, or BitBucket) and does not need any special setup. Here is an overview of collaborations facilitated by Data Workspaces:

First, the data scientist creates an empty git repository on the remote origin (e.g. GitHub, GitLab, or BitBucket) and then runs the publish command to associate the origin with the workspace and update the origin with the full history of the workspace.
A new collaborator can use the clone command to copy the workspace down to their local machine. They can then run experiments and take snapshots, just like the original data scientist. When ready, they can upload their changes to the origin via the push command. Others can then use the pull command to download these changes to their workspace. This process can be repeated as many times as necessary, and multiple collaborators can overlap their work.
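As a sketch, the collaboration commands in the picture map onto a session like this (the repository URL is a placeholder):
# original data scientist, after creating an empty repository on GitHub
dws publish git@github.com:YOUR_USERNAME/my-project.git

# a collaborator, on another machine
dws clone git@github.com:YOUR_USERNAME/my-project.git
# ... run experiments, take snapshots ...
dws push

# original data scientist picks up the collaborator's changes
dws pull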
2. Tutorial¶
Let’s build on the Quick Start. If you haven’t already, run through it so that you have a quickstart workspace with one tag (SVC-1).
Further Experiments¶
Now, let’s try to use Logistic Regression. Create a new Jupyter notebook called digits-lr in the code subdirectory of quickstart. Enter the following code into a notebook cell:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from dataworkspaces.kits.scikit_learn import LineagePredictor, load_dataset_from_resource
# load the data from filesystem into a "Bunch"
dataset = load_dataset_from_resource('sklearn-digits-dataset')
# split the training and test data
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5, shuffle=False)
# Run a grid search to find the best parameters
gs_params={'C':[1e-3, 1e-2, 1e-1, 1, 1e2], 'solver':['lbfgs'],
           'multi_class':['multinomial']}
cv = GridSearchCV(LogisticRegression(), gs_params, cv=5, scoring='accuracy')
cv.fit(X_train, y_train)
# Instantiate a LogisticRegression classifier with the best parameters
# and wrap it for dws
classifier = LineagePredictor(LogisticRegression(**cv.best_params_),
                              'multiclass_classification',
                              input_resource=dataset.resource,
                              model_save_file='digits.joblib')
# train and score the classifier
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)
There are two differences from our previous notebook:
- we use a LogisticRegression classifier rather than a Support Vector classifier, and
- before calling our wrapped classifier, we run a grid search to find the best combination of model parameters.
If you run this cell, you should see several no-convergence warnings (some of the values for C must be bad for this data set) and then a final accuracy result, around 94%.
Ok, so our Logistic Regression accuracy of 0.94 is not as good as we obtained from the Support Vector Classifier (0.97). Let’s take a snapshot anyway, so we have this experiment for future reference. Maybe someone will ask us, “Did you try Logistic Regression?”, and we can show them the full results and even use a restore command to re-run the experiment for them. Here’s how to take the snapshot:
dws snapshot -m "Logistic Regression experiment" LR-1
We can see both snapshots with the command dws report history:
$ dws report history
History of snapshots
| Hash | Tags | Created | accuracy | classification_report | Message |
|_________|_______|_____________________|__________|___________________________|_______________________________|
| bf9fb37 | LR-1 | 2020-01-14T14:27:37 | 0.94 | {'0.0': {'precision': 0.. | Logistic Regression experiment |
| f1401a8 | SVC-1 | 2020-01-14T13:00:39 | 0.969 | {'0.0': {'precision': 1.. | first run with SVC |
2 snapshots total
Publishing a workspace¶
Now, we will publish our workspace on GitHub. A similar approach can be taken for other code hosting services like BitBucket or GitLab.
The first few steps are GitHub-specific, but the dws commands will work across all hosting services.
First, create an account on GitHub if you do not already have one. Next, go to your front page on GitHub and click on the green new repository button on the left side of the page:

You should now get a dialog like this:

Fill in a name for your repository (in this case, dws-tutorial) and select whether you want it to be public (visible to the world) or private (only visible to those you explicitly grant access). You won’t need a README file, .gitignore, or license file, as we will be initializing the repository from your local copy. Go ahead and click on the “Create Repository” button.
Now, back on the command line, go to the directory containing the quickstart workspace on your local machine. Run the following command, replacing YOUR_USERNAME with your GitHub username:
dws publish git@github.com:YOUR_USERNAME/dws-tutorial.git
This command pushes your workspace and its history to the GitHub repository.
At this point, if you refresh the page for this repository on GitHub, you should see something like this:

You have successfully published your workspace!
Cloning a workspace¶
Now, we want to use this workspace on a new machine (perhaps your own or perhaps belonging to a collaborator). First, make certain that the account on the second machine has at least read access to the repository. If you will be pushing updates from this account, it will also need write access to the repo. Next, make sure that your software dependencies are installed (e.g. Jupyter, NumPy, and Scikit-learn) and then install the Data Workspaces library into your local environment:
pip install dataworkspaces
From a browser on your second machine, go back to the GitHub page for your repository and click on the “Clone or download” button. It should show you a URL for cloning via SSH. Click on the clipboard icon to the right of the URL to copy the URL to your machine’s clipboard:

Then, on your second machine, go to the directory you intend to be the parent of the workspace (in this case ~/workspaces) and run the following:
dws clone GITHUB_CLONE_URL
where GITHUB_CLONE_URL is the URL you copied to your clipboard.
It should ask you for the hostname you want to use to identify this machine. It defaults to the system hostname.
By default, the clone will be in the directory ./quickstart, since “quickstart” was the name of the original repo. You can change this by adding the desired local directory name to the command line.
We can now change to the workspace’s directory and run the history command:
$ cd ./quickstart
$ dws report history
History of snapshots
| Hash | Tags | Created | accuracy | classification_report | Message |
|_________|_______|_____________________|__________|___________________________|_______________________________|
| bf9fb37 | LR-1 | 2020-01-14T14:27:37 | 0.94 | {'0.0': {'precision': 0.. | Logistic Regression experiment |
| f1401a8 | SVC-1 | 2020-01-14T13:00:39 | 0.969 | {'0.0': {'precision': 1.. | first run with SVC |
2 snapshots total
We see the full history from the original workspace!
Sharing updates¶
Let’s re-run the Support Vector classifier evaluation on the second
machine and see if we reproduce our results. First, go to the code
subdirectory in your workspace. Start the Jupyter notebook as follows:
jupyter notebook digits-svc.ipynb
This should bring up a browser with the notebook. You should see the code from our first experiment. Run the cell. You should get close to the same results as on the first machine (0.97 accuracy). Save and shut down the notebook.
Now, take a snapshot:
dws snapshot -m "reproduce on second machine" SVC-2
We have tagged this snapshot with the tag SVC-2. We want to push the entire workspace to GitHub. This can be done as follows:
dws push
After the push, the origin repository on GitHub has been updated with the latest snapshot and results. We can now go back to the machine where we created the workspace and download the changes. To do so, start up a command line window, go into the workspace’s directory on the first machine, and run:
dws pull
After the pull, we should see the experiment we ran on the second machine:
$ dws report history
History of snapshots
| Hash | Tags | Created | accuracy | classification_report | Message |
|_________|_______|_____________________|__________|___________________________|_______________________________|
| 2c195ba | SVC-2 | 2020-01-14T15:20:23 | 0.969 | {'0.0': {'precision': 1.. | reproduce on second machine |
| bf9fb37 | LR-1 | 2020-01-14T14:27:37 | 0.94 | {'0.0': {'precision': 0.. | Logistic Regression experiment |
| f1401a8 | SVC-1 | 2020-01-14T13:00:39 | 0.969 | {'0.0': {'precision': 1.. | first run with SVC |
3 snapshots total
3. Command Reference¶
In this section, we describe the full command line interface for Data Workspaces.
This interface is built around a script dws, which is installed into your path when you install the dataworkspaces package. The overall interface for dws is:
dws [--batch] [--verbose] [--help] COMMAND [--help] [OPTIONS] [ARGS]...
dws has three options common to all commands:
- --batch, which runs the command in a mode that never asks for user confirmation and will error out if it absolutely requires an input (useful for automation),
- --verbose, which will print a lot of detail about what will be and has been done for a command (useful for debugging), and
- --help, which prints these common options and a list of available commands.
Next on your command line comes the command name (e.g. init, clone, snapshot). Each command has its own arguments and options, as documented below. All commands take a --help argument, which will print the specific options and arguments for the command. Finally, the add subcommand has further subcommands, representing the individual resource types (e.g. git, local-files, rclone).
dws¶
dws [OPTIONS] COMMAND [ARGS]...
Options
- -b, --batch: Run in batch mode, never ask for user inputs.
- --verbose: Print extra debugging information and ask for confirmation before running actions.
add¶
Add a data collection to the workspace as a resource.
Possible types of resources are git, local-files, or rclone; these are subcommands of add.
dws add [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>
api-resource¶
Resource to represent data obtained via an API. Use this when there is no file-based representation of your data that can be versioned and captured more directly. Subcommand of add.
dws add api-resource [OPTIONS]
Options
- --role <role>
- --name <name>: Short name for this resource
git¶
Add a local git repository as a resource. Subcommand of add.
dws add git [OPTIONS] PATH
Options
- --role <role>
- --name <name>: Short name for this resource
- --branch <branch>: Branch of the repo to use, defaults to master.
- -r, --read-only: If specified, treat the origin repository as read-only and never push to it.
- -e, --export: On snapshots, export lineage data for import into other workspaces
- --imported: This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.
Arguments
- PATH: Required argument
local-files¶
Add a local file directory (not managed by git) to the workspace. Subcommand of add.
dws add local-files [OPTIONS] PATH
Options
- --role <role>
- --name <name>: Short name for this resource
- --compute-hash: Compute hashes for all files. If this option is not set, we use a lightweight comparison of file sizes only.
- -e, --export: On snapshots, export lineage data for import into other workspaces
- --imported: This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.
Arguments
- PATH: Required argument
rclone¶
Add an rclone-d repository as a resource to the workspace. Subcommand of add.
dws add rclone [OPTIONS] SOURCE DEST
Options
- --role <role>
- --name <name>: Short name for this resource
- --config <config>: Configuration file for rclone
- --compute-hash: Compute hashes for all files
- -e, --export: On snapshots, export lineage data for import into other workspaces
- --imported: This resource was exported from another workspace. Import the lineage data. An imported resource implies --read-only and --role=source-data.
Arguments
- SOURCE: Required argument
- DEST: Required argument
clone¶
Clone the specified data workspace.
dws clone [OPTIONS] REPOSITORY [DIRECTORY]
Options
- --hostname <hostname>: Hostname to identify this machine in snapshot directory paths, defaults to the local machine's hostname
Arguments
- REPOSITORY: Required argument
- DIRECTORY: Optional argument
config¶
Get or set configuration parameters. Local parameters are only for this copy of the workspace, while global parameters are stored centrally and affect all copies.
If neither PARAMETER_NAME nor PARAMETER_VALUE are specified, this command prints a table of all parameters and their information (scope, value, default or not, and help text). If just PARAMETER_NAME is specified, it prints the specified parameter’s information. Finally, if both the parameter name and value are specified, the parameter is set to the specified value.
dws config [OPTIONS] [PARAMETER_NAME] [PARAMETER_VALUE]
Options
- --workspace-dir <workspace_dir>
- --resource <resource>: If specified, get/set parameters for the specified resource, rather than the workspace.
Arguments
- [PARAMETER_NAME]: Optional argument
- [PARAMETER_VALUE]: Optional argument
delete-snapshot¶
Delete the specified snapshot. This includes the metadata and lineage data for the snapshot. Unless --no-include-resources is specified, this also deletes any results data saved for the snapshot (under the snapshots subdirectory of a results resource).
dws delete-snapshot [OPTIONS] TAG_OR_HASH
Options
- --workspace-dir <workspace_dir>
- --no-include-resources: If specified, do NOT delete snapshot-specific content from resources.
Arguments
- TAG_OR_HASH: Required argument
deploy¶
Deployment-related commands
dws deploy [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>
build¶
Build a docker image containing this workspace. This command is EXPERIMENTAL and subject to change.
dws deploy build [OPTIONS]
Options
- --image-name <image_name>: Name of docker image, defaults to name of workspace
- -f, --force-rebuild: If specified, always rebuild image (force deletes the image from docker)
- --git-user-email <git_user_email>: Email address used by git inside the container. Defaults to value of user.email for this workspace.
- --git-user-name <git_user_name>: Username used by git inside the container. Defaults to value of user.name for this workspace.
run¶
Run a Docker container for this workspace. This command is EXPERIMENTAL and subject to change.
dws deploy run [OPTIONS]
Options
- --image-name <image_name>: Name of docker image, defaults to name of workspace
- --no-mount-ssh-keys: If specified, do not mount the host’s ~/.ssh directory into the container. This directory is needed for git authentication.
diff¶
List differences between two snapshots
dws diff [OPTIONS] SNAPSHOT_OR_TAG1 SNAPSHOT_OR_TAG2
Options
- --workspace-dir <workspace_dir>
Arguments
- SNAPSHOT_OR_TAG1: Required argument
- SNAPSHOT_OR_TAG2: Required argument
init¶
Initialize a new workspace
dws init [OPTIONS] [NAME]
Options
- --hostname <hostname>: Hostname to identify this machine in snapshot directory paths, defaults to the local machine's hostname
- --create-resources <create_resources>: Initialize the workspace with subdirectories for the specified resource roles. Choices are ‘all’ or any comma-separated combination of source-data, intermediate-data, code, results.
- --scratch-directory <scratch_directory>: Local scratch directory (defaults to WORKSPACE_DIR/scratch)
- --git-fat-remote <git_fat_remote>: Initialize the workspace with the git-fat large file extension and use the specified URL for the remote datastore
- --git-fat-user <git_fat_user>: Username for git fat remote (if not root)
- --git-fat-port <git_fat_port>: Port number for git-fat remote (defaults to 22)
- --git-fat-attributes <git_fat_attributes>: Comma-separated list of file patterns to manage under git-fat, for example --git-fat-attributes='*.gz,*.zip'. If you do not specify this here, you can always add the .gitattributes file later.
- --git-lfs-attributes <git_lfs_attributes>: Comma-separated list of file patterns to manage under git-lfs, for example --git-lfs-attributes='*.gz,*.zip'. If you do not specify this here, you can always add the .gitattributes file later.
Arguments
- NAME: Optional argument
lineage¶
Lineage-related commands
dws lineage [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>
graph¶
Graph the lineage of a resource, writing the graph to an HTML file. Subcommand of lineage
dws lineage graph [OPTIONS] OUTPUT_FILE
Options
- --resource <resource>: Name of the resource to graph the lineage for (defaults to the first results resource)
- --snapshot <snapshot>: Snapshot hash or tag to use for lineage. If not specified, use current lineage.
- --format <format>: Format of the output graph (defaults to html; options: html|dot)
- --width <width>: Width of graph in pixels (defaults to 1024)
- --height <height>: Height of graph in pixels (defaults to 800)
Arguments
- OUTPUT_FILE: Required argument
publish¶
Add a remote Git repository as the origin for the workspace and do the initial push of the workspace and any other resources.
dws publish [OPTIONS] REMOTE_REPOSITORY
Options
- --workspace-dir <workspace_dir>
- --skip <skip>: Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.
Arguments
- REMOTE_REPOSITORY: Required argument
pull¶
Pull the latest state of the workspace and its resources from their origins.
dws pull [OPTIONS]
Options
- --workspace-dir <workspace_dir>
- --only <only>: Comma-separated list of resource names that you wish to pull from the origin, if applicable. The rest will be skipped.
- --skip <skip>: Comma-separated list of resource names that you wish to skip when pulling. The rest will be pulled from their remote origins, if applicable.
- --only-workspace: Only pull the workspace’s metadata, skipping the individual resources
push¶
Push the state of the workspace and its resources to their origins.
dws push [OPTIONS]
Options
- --workspace-dir <workspace_dir>
- --only <only>: Comma-separated list of resource names that you wish to push to the origin, if applicable. The rest will be skipped.
- --skip <skip>: Comma-separated list of resource names that you wish to skip when pushing. The rest will be pushed to their remote origins, if applicable.
- --only-workspace: Only push the workspace’s metadata, skipping the individual resources
report¶
Report generation commands
dws report [OPTIONS] COMMAND [ARGS]...
Options
- --workspace-dir <workspace_dir>
history¶
Show the history of snapshots. Subcommand of report.
dws report history [OPTIONS]
Options
- --limit <limit>: Number of previous snapshots to show (most recent first)
lineage¶
Show a lineage table for either the current workspace or a specific snapshot. Subcommand of report.
dws report lineage [OPTIONS]
Options
- --snapshot <snapshot>: Optional tag or hash for a snapshot. Otherwise, shows the current status.
results¶
Show the contents of a results file. Subcommand of report.
dws report results [OPTIONS]
Options
- --snapshot <snapshot>: Optional tag or hash for a snapshot. Otherwise, shows the current status.
- --resource <resource>: Optional resource name. Otherwise, will look for first resource with a results file.
status¶
Show the status of resources in this workspace. Subcommand of report.
dws report status [OPTIONS]
restore¶
Restore the workspace to a prior state
dws restore [OPTIONS] TAG_OR_HASH
Options
- --workspace-dir <workspace_dir>
- --only <only>: Comma-separated list of resource names that you wish to revert to the specified snapshot. The rest will be left as-is.
- --leave <leave>: Comma-separated list of resource names that you wish to leave in their current state. The rest will be restored to the specified snapshot.
- --strict: If specified, error out if unable to restore any of the requested resources (due to lack of a restore hash or removing the resource from the workspace).
Arguments
- TAG_OR_HASH: Required argument
snapshot¶
Take a snapshot of the current workspace’s state
dws snapshot [OPTIONS] [TAG]
Options
- --workspace-dir <workspace_dir>
- -m, --message <message>: Message describing the snapshot
Arguments
- TAG: Optional argument
4. Lineage API¶
The Lineage API is provided by the module dataworkspaces.lineage.
This module provides an API for tracking data lineage – the history of how a given result was created, including the versions of original source data and the various steps run in the data pipeline to produce the final result.
The basic idea is that your workflow is a sequence of pipeline steps:
---------- ---------- ---------- ----------
| | | | | | | |
| Step 1 |---->| Step 2 |---->| Step 3 |---->| Step 4 |
| | | | | | | |
---------- ---------- ---------- ----------
A step could be a command line script, a Jupyter notebook or perhaps a step in an automated workflow tool (e.g. Apache Airflow). Each step takes a number of inputs and parameters and generates outputs. The inputs are resources in your workspace (or subpaths within a resource) from which the step will read to perform its task. The parameters are configuration values passed to the step (e.g. the command line arguments of a script). The outputs are the resources (or subpaths within a resource), which are written to by the step. The outputs may represent results or intermediate data to be consumed by downstream steps.
The lineage API captures this data for each step. Here is a view of the data captured:
Parameters
|| || ||
\/ \/ \/
------------
=>| |=>
Input resources =>| Step i |=> Output resources
=>| |=>
------------
/\
||
Code
Dependencies
To do this, we use the following classes:
- ResourceRef - A reference to a resource for use as a step input or output. A ResourceRef contains a resource name and an optional path within that resource. This lets you manage lineage down to the directory or even file level. The APIs also support specifying a path on the local filesystem instead of a ResourceRef. This path is automatically resolved to a ResourceRef (it must map to a location under the local path of a resource). By storing ResourceRefs instead of hard-coded filesystem paths, we can include non-local resources (like an S3 bucket) and ensure that the workspace is easily deployed on a new machine.
- Lineage - The main lineage object, instantiated at the start of your step. At the beginning of your step, you specify the inputs, parameters, and outputs. At the end of the step, the data is saved, along with any results you might have from that step. Lineage instances are context managers, which means you can use a with statement to manage their lifecycle.
- LineageBuilder - This is a helper class to guide the creation of your lineage object.
Example
Here is an example usage of the lineage API in a command line script:
import sys
import argparse
from dataworkspaces.lineage import LineageBuilder

def main():
    ...
    parser = argparse.ArgumentParser()
    parser.add_argument('--gamma', type=float, default=0.01,
                        help="Regularization parameter")
    parser.add_argument('input_data_dir', metavar='INPUT_DATA_DIR', type=str,
                        help='Path to input data')
    parser.add_argument('results_dir', metavar='RESULTS_DIR', type=str,
                        help='Path to where results should be stored')
    args = parser.parse_args()
    ...
    # Create a LineageBuilder instance to specify the details of the step
    # to the lineage API.
    builder = LineageBuilder()\
              .as_script_step()\
              .with_parameters({'gamma':args.gamma})\
              .with_input_path(args.input_data_dir)\
              .as_results_step(args.results_dir)

    # builder.eval() will construct the lineage object. We call it within a
    # with statement to get automatic save/cleanup when we leave the
    # with block.
    with builder.eval() as lineage:
        ...  # do your work here

        # all done, write the results
        lineage.write_results({'accuracy':accuracy,
                               'precision':precision,
                               'recall':recall,
                               'roc_auc':roc_auc})

    # When leaving the with block, the lineage is automatically saved to the
    # workspace. If an exception is thrown, the lineage is not saved, but the
    # outputs are marked as being in an unknown state.
    return 0

# boilerplate to call our main function if this is called as a script.
if __name__ == '__main__':
    sys.exit(main())
Classes¶
class dataworkspaces.lineage.ResourceRef¶
A namedtuple that is used to identify an input or output of a step. The name parameter is the name of a resource. The optional subpath parameter is a relative path within that resource. The subpath lets you store inputs/outputs from multiple steps within the same resource and track them independently.
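For example (the resource names below are hypothetical):
from dataworkspaces.lineage import ResourceRef

raw = ResourceRef('source-data')                          # refer to an entire resource
cleaned = ResourceRef('intermediate-data', 'cleaned/v1')  # refer to a subpath within a resource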
class dataworkspaces.lineage.Lineage¶
This is the main object for tracking the execution of a step. Rather than instantiating it directly, use the LineageBuilder class to construct your Lineage instance.
abort()¶
The step has failed, so we mark its outputs in an unknown state. If you create the lineage via a “with” statement, then this will be called for you automatically.
add_output_path(path: str) → None¶
Resolve the path to a resource name and subpath. Add that to the lineage as an output of the step. From this point on, if the step fails (abort() is called), the associated resource and subpath will be marked as being in an “unknown” state.
class dataworkspaces.lineage.ResultsLineage¶
Lineage for a results step. This subclass is returned by the LineageBuilder when as_results_step() is called. This marks the Lineage object as generating results. It adds the write_results() method for writing a JSON summary of the final results.
Results resources will also have a lineage.json file added when the next snapshot is taken. This file contains the full lineage graph collected for the resource.
write_results(metrics: Dict[str, Any])¶
Write a results.json file to the results directory specified when creating the lineage object (e.g. via as_results_step()). This json file contains information about the step execution (e.g. start time), parameters, and the provided metrics.
class dataworkspaces.lineage.LineageBuilder¶
Use this class to declaratively build Lineage objects. Instantiate a LineageBuilder instance, and call a sequence of configuration methods to specify your inputs, parameters, your workspace (if the script is not already inside the workspace), and whether this is a results step. Each configuration method returns the builder, so you can chain them together. Finally, call eval() to instantiate the Lineage object.
Configuration Methods
To specify the workflow step’s name, call one of:
- as_script_step() - the script’s name will be used to infer the step and the associated code resource
- with_step_name - explicitly specify the step name
To specify the parameters of the step (e.g. command line arguments), use the with_parameters() method.
To specify the inputs of the step, call one or more of:
- with_input_path() - resolve the local filesystem path to a resource and subpath and add it to the lineage as inputs. May be called more than once.
- with_input_paths() - resolve a list of local filesystem paths to resources and subpaths and add them to the lineage as inputs. May be called more than once.
- with_input_ref() - add the resource and subpath to the lineage as an input. May be called more than once.
- with_no_inputs() - mutually exclusive with the other input methods. This signals that there are no inputs to this step.
To specify code resource dependencies for the step, you can call with_code_ref(). For command-line Python scripts, the main code resource is handled automatically in as_script_step(). Other subclasses of the LineageBuilder may provide similar functionality (e.g. the LineageBuilder for Jupyter notebooks will try to figure out the resource containing your notebook and set it in the lineage).
If you need to specify the workspace’s root directory, use the with_workspace_directory() method. Otherwise, the lineage API will attempt to infer the workspace directory by looking at the path of the script.
Call as_results_step() to indicate that this step is producing results. This will add a method write_results() to the Lineage object returned by eval(). The method as_results_step() takes two parameters: results_dir and, optionally, run_description. The results directory should correspond to either the root directory of a results resource or a subdirectory within the resource. If you have multiple steps of your workflow that produce results, you can create separate subdirectories for each results-producing step.
Example
Here is an example where we build a Lineage object for a script that has one input and produces results:
lineage = LineageBuilder()\
          .as_script_step()\
          .with_parameters({'gamma':0.001})\
          .with_input_path(args.intermediate_data)\
          .as_results_step('../results').eval()
Methods
as_results_step(results_dir: str, run_description: Optional[str] = None) → dataworkspaces.lineage.LineageBuilder¶
eval() → dataworkspaces.lineage.Lineage¶
Validate the current configuration, making sure all required properties have been specified, and return a Lineage object with the requested configuration.
with_code_ref(ref: dataworkspaces.utils.lineage_utils.ResourceRef) → dataworkspaces.lineage.LineageBuilder¶
Using Lineage¶
Once you have instrumented the individual steps of your workflow, you can run the steps as normal. Lineage data is stored in the directory .dataworkspace/current_lineage, but not checked into the associated Git repository.
When you take a snapshot, this lineage data is copied to .dataworkspace/snapshot_lineage/HASH, where HASH is the hashcode associated with the snapshot, and checked into git. This data is available as a record of how you obtained the results associated with the snapshot. In the future, more tools will be provided to analyze and operate on this lineage (e.g. replaying workflows).
When you restore a snapshot, the lineage data associated with the snapshot is restored to .dataworkspace/current_lineage.
Consistency¶
In order to fully track the status of your workflow, we make a few restrictions:
- Independent steps should not overwrite the same ResourceRef, or ResourceRefs where one refers to a subdirectory of the other.
- A step’s execution should not transitively depend on two different versions of the same ResourceRef. If you try to run a step in this situation, an exception will be thrown.
These restrictions should not impact reasonable workflows in practice. Furthermore, they help to catch some common classes of errors (e.g. not rerunning all the dependent steps when a change is made to an input).
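For example, a two-step pipeline that respects these restrictions might declare its lineage as in the sketch below (resource names, paths, and parameters are hypothetical):
from dataworkspaces.lineage import LineageBuilder, ResourceRef

# Step 1: reads the source data, writes intermediate data
step1 = LineageBuilder()\
        .with_step_name('prepare')\
        .with_parameters({'min_count': 5})\
        .with_input_ref(ResourceRef('source-data'))
with step1.eval() as lineage:
    # declare where this step writes, so the output can be tracked
    lineage.add_output_path('../intermediate-data/prepared')
    ...  # do the preparation work here

# Step 2: reads only the intermediate data written by step 1, writes results
step2 = LineageBuilder()\
        .with_step_name('train')\
        .with_parameters({'gamma': 0.001})\
        .with_input_ref(ResourceRef('intermediate-data', 'prepared'))\
        .as_results_step('../results')
with step2.eval() as lineage:
    accuracy = 0.0  # placeholder: compute your real metric here
    lineage.write_results({'accuracy': accuracy})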
5. Kits Reference¶
In this section, we cover kits, integrations with various data science libraries and infrastructure provided by Data Workspaces.
Jupyter¶
Integration with Jupyter notebooks. This module provides a LineageBuilder subclass to simplify Lineage for notebooks.
It also provides a collection of IPython magics (macros) for working in Jupyter notebooks.
class dataworkspaces.kits.jupyter.NotebookLineageBuilder(results_dir: str, step_name: Optional[str] = None, run_description: Optional[str] = None)¶
Notebooks are the final step in a pipeline (and potentially the only step). We customize the standard lineage builder to get the step name from the notebook’s name and to always have a results directory.
If you are not running this notebook in a server context (e.g. via nbconvert), the step name won’t be available. In that case, you can explicitly pass in the step name to the constructor.
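A minimal sketch of its use in a notebook cell, assuming a results resource at ../results and a source-data resource named 'sklearn-digits-dataset':
from dataworkspaces.lineage import ResourceRef
from dataworkspaces.kits.jupyter import NotebookLineageBuilder

builder = NotebookLineageBuilder('../results', step_name='digits-analysis')\
          .with_parameters({'gamma': 0.001})\
          .with_input_ref(ResourceRef('sklearn-digits-dataset'))
with builder.eval() as lineage:
    accuracy = 0.0  # placeholder: run the notebook's analysis and compute the metric
    lineage.write_results({'accuracy': accuracy})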
dataworkspaces.kits.jupyter.is_notebook() → bool¶
Return true if this code is running in a notebook.
dataworkspaces.kits.jupyter.get_step_name_for_notebook() → Optional[str]¶
Get the step name for a notebook by getting the path and then extracting the base name. In some situations (e.g. running on the command line via nbconvert), the notebook name is not available. We return None in those cases.
Magics¶
This module also provides a collection of IPython magics (macros) to simplify interactions with your data workspace when developing in a Jupyter Notebook.
Limitations¶
Currently these magics are only supported in interactive Jupyter Notebooks. They do not run properly within JupyterLab (we are currently working on an extension specific to JupyterLab), the nbconvert command, or if you run the entire notebook with “Run All Cells”.
To develop a notebook interactively using the DWS magic commands and then run the same notebook in batch mode, you can set the variable DWS_MAGIC_DISABLE in your notebook, ahead of the call to load the magics (%load_ext). If you set it to True, the commands will be loaded in a disabled state and will run with no effect. Setting DWS_MAGIC_DISABLE to False will load the magics in the enabled state and run all commands normally.
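For example, a notebook intended to run both interactively and in batch might start with a cell like this, flipping the flag for batch runs:
# first cell of the notebook: control whether the DWS magics are active
DWS_MAGIC_DISABLE = False  # set to True for nbconvert / "Run All Cells" runs
%load_ext dataworkspaces.kits.jupyter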
Loading the magics¶
To load the magics, run the following in an interactive cell of your Jupyter Notebook:
import dataworkspaces.kits.jupyter
%load_ext dataworkspaces.kits.jupyter
If the load runs correctly, you should see output like this in your cell:
Ran DWS initialization. The following magic commands have been added to your notebook:
- %dws_info - print information about your dws environment
- %dws_history - print a history of snapshots in this workspace
- %dws_snapshot - save and create a new snapshot
- %dws_lineage_table - show a table of lineage for the workspace resources
- %dws_lineage_graph - show a graph of lineage for a resource
- %dws_results - show results from a run (results.json file)
Run any command with the --help option to see a list of options for that command. The variable DWS_JUPYTER_NOTEBOOK has been added to your variables, for use in future DWS calls.
If you want to disable the DWS magic commands (e.g. when running in a batch context), set the variable DWS_MAGIC_DISABLE to True ahead of the %load_ext call.
Magic Command reference¶
We now describe the command options for the individual magics.
%dws_info
usage: dws_info [-h]
Print some information about this workspace
optional arguments:
- -h, --help: show this help message and exit
%dws_history
Print a history of snapshots in this workspace
optional arguments:
- -h, --help: show this help message and exit
- --max-count MAX_COUNT: Maximum number of snapshots to show
- --tail: Just show the last 10 entries in reverse order
- --baseline TAG_OR_HASH: Snapshot tag or hash to use as a basis for metrics comparison.
- --heatmap: Show a heatmap for metrics columns
- --maximize-metrics METRICS: Metrics where larger values are better (e.g. accuracy)
- --minimize-metrics METRICS: Metrics where smaller values are better (e.g. loss)
For easy visualization of results, the %dws_history command supports two styles of color coding. The --heatmap option will color code the background of metrics cells from dark red (worst results) to dark green (best results). For common metrics, like accuracy and loss, DWS knows the directionality of the metric (higher is better vs. lower is better). For less common or custom metrics, you can use the --maximize-metrics and --minimize-metrics options to specify this. Here is an example heatmap:

The second style of coloring takes a baseline snapshot. Any metric values better than the
baseline have their text colored green, any metric values close to the baseline are bold
black text, and any metric values worse than the baseline are colored red. The --baseline=SNAPSHOT option enables this display mode. Here is an example:

%dws_snapshot
usage: dws_snapshot [-h] [-m MESSAGE] [-t TAG]
Save the notebook and create a new snapshot
optional arguments:
- -h, --help: show this help message and exit
- -m MESSAGE, --message MESSAGE: Message describing the snapshot
- -t TAG, --tag TAG: Tag for the snapshot. Note that a given tag can only be used once (without deleting the old one).
%dws_lineage_table
usage: dws_lineage_table [-h] [--snapshot SNAPSHOT]
Show a table of lineage for the workspace’s resources
optional arguments:
- -h, --help: show this help message and exit
- --snapshot SNAPSHOT: If specified, print lineage as of the specified snapshot hash or tag
%dws_lineage_graph
usage: dws_lineage_graph [-h] [--resource RESOURCE] [--snapshot SNAPSHOT]
Show a graph of lineage for a resource
optional arguments:
- -h, --help: show this help message and exit
- --resource RESOURCE: Graph lineage from this resource. Defaults to the results resource. Error if not specified and there is more than one.
- --snapshot SNAPSHOT: If specified, graph lineage as of the specified snapshot hash or tag
%dws_results
usage: dws_results [-h] [--resource RESOURCE] [--snapshot SNAPSHOT]
Show results from a run (results.json file)
optional arguments:
- -h, --help: show this help message and exit
- --resource RESOURCE: Look for the results.json file in this resource. Otherwise, will look in all results resources and return the first match.
- --snapshot SNAPSHOT: If specified, get results as of the specified snapshot or tag. Otherwise, looks at current workspace and then most recent snapshot.
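Putting a few of the magics together, a typical interactive session might run cells like these (the message and tag are hypothetical):
%dws_info
%dws_results
%dws_snapshot -m "tuned gamma for the SVC" -t SVC-3
%dws_history --max-count 5 --heatmap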
Scikit-learn¶
This module (dataworkspaces.kits.scikit_learn) provides integration with the scikit-learn framework. The main class provided here is LineagePredictor, which wraps any class following sklearn’s predictor protocol. It captures inputs, model parameters and results. This module also provides Metrics and its subclasses, which support the computation of common metrics and the writing of them to a results file. Finally, there is train_and_predict_with_cv(), which runs a common sklearn classification workflow, including grid search.
dataworkspaces.kits.scikit_learn.load_dataset_from_resource(resource_name: str, subpath: Optional[str] = None, workspace_dir: Optional[str] = None) → sklearn.utils.Bunch¶
Load a dataset (data and targets) from the specified resource, and return an sklearn-style Bunch (a dictionary-like object). The bunch will include at least three attributes:
- data - a NumPy array of shape number_samples * number_features
- target - a NumPy array of length number_samples
- resource - a ResourceRef that provides the resource name and subpath (if any) for the data
Some other attributes may also be present, depending on the data set:
- DESCR - text containing a full description of the data set (for humans)
- feature_names - an array of length number_features containing the name of each feature.
- target_names - an array containing the name of each target class
Data sets may define their own attributes as well (see below).
Parameters
- resource_name - The name of the resource containing the dataset.
- subpath - Optional subpath within the resource where this specific dataset is located. If not specified, the root of the resource is used.
- workspace_dir - The root directory of your workspace in the local file system. Usually, this can be left unspecified and inferred by DWS, which will search up from the current working directory.
Creating a Dataset
To create a dataset in your resource that is suitable for importing by this function, you simply need to create a file for each attribute you want in the bunch and place all these files in the same directory within your resource. The names of the files should be ATTRIBUTE.extn, where ATTRIBUTE is the attribute name (e.g. data or DESCR) and .extn is a file extension indicating the format. Supported file extensions are:
- .txt or .rst - text files
- .csv - csv files. These are read in using numpy.loadtxt(). If this fails because the csv does not contain all numeric data, pandas is used to read in the file. It is then converted back to a numpy array.
- .csv.gz or .csv.bz2 - these are compressed csv files which are treated the same way as csv files (numpy and pandas will automatically uncompress before parsing).
- .npy - this is a file containing a serialized NumPy array saved via numpy.save(). It is loaded using numpy.load().
-
class
dataworkspaces.kits.scikit_learn.
LineagePredictor
(predictor, metrics: Union[str, type], input_resource: Union[str, dataworkspaces.utils.lineage_utils.ResourceRef], results_resource: Union[str, dataworkspaces.utils.lineage_utils.ResourceRef, None] = None, model_save_file: Optional[str] = None, workspace_dir: Optional[str] = None, verbose: bool = False)[source]¶ This is a wrapper for adding lineage to any predictor in sklearn. To use it, instantiate the predictor (for classification or regression) and then create a new instance of LineagePredictor.
The initializer finds the associated workspace and initializes a Lineage instance. The input_resource is recorded in this lineage. Other methods call the underlying wrapped predictor's methods, with additional functionality as needed (see below).
Parameters
- predictor - Any sklearn predictor instance. It must have fit and predict methods.
- metrics - Either a string naming a metrics type or a subclass of Metrics. If a string, it should be one of: binary_classification, multiclass_classification, or regression.
- input_resource - Resource providing the input data to this model. May be specified by name, by a local file path, or via a ResourceRef.
- results_resource - (optional) Resource where the results are to be stored. May be specified by name, by a local file path, or via a ResourceRef. If not specified, will try to infer from the workspace.
- model_save_file - (optional) Name of file to store a (joblib-formatted) serialization of the trained model upon completion of the fit() method. This should be a relative path, as it is stored under the results resource. If model_save_file is not specified, no model is saved.
- workspace_dir - (optional) Directory specifying the workspace. Usually can be inferred from the current directory.
- verbose - If True, print a lot of detailed information about the execution of Data Workspaces.
Example
Here is an example usage of the wrapper, taken from the Quick Start:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from dataworkspaces.kits.scikit_learn import load_dataset_from_resource
from dataworkspaces.kits.scikit_learn import LineagePredictor

dataset = load_dataset_from_resource('sklearn-digits-dataset')
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.5, shuffle=False)

classifier = LineagePredictor(SVC(gamma=0.001),
                              metrics='multiclass_classification',
                              input_resource=dataset.resource,
                              model_save_file='digits.joblib')

classifier.fit(X_train, y_train)
score = classifier.score(X_test, y_test)
Methods
-
fit
(X, y, *args, **kwargs)[source]¶ The underlying fit() method of a predictor trains the predictor based on the input data (X) and labels (y).
If the input resource is an api resource, the wrapper captures the hash of the inputs. If
model_save_file
was specified, it also saves the trained model.
-
predict
(X)[source]¶ The underlying
predict()
method is called directly, without affecting the lineage.
-
score
(X, y, sample_weight=None)[source]¶ This method makes predictions from a trained model and scores them according to the metrics specified when the wrapper was instantiated.
If the input resource is an api resource, the wrapper captures its hash. The wrapper runs the wrapped predictor's predict() method to generate predictions. A metrics object is instantiated to compute the metrics for the predictions and a results.json file is written to the results resource. The lineage data is saved and finally the score is computed from the predictions and returned to the caller.
-
class
dataworkspaces.kits.scikit_learn.
Metrics
(expected, predicted, sample_weight=None)[source]¶ Metrics and its subclasses are convenience classes for sklearn metrics. The subclasses of Metrics are used by
train_and_predict_with_cv()
in printing a metrics report and generating the metrics json file.
-
class
dataworkspaces.kits.scikit_learn.
BinaryClassificationMetrics
(expected, predicted, sample_weight=None)[source]¶ Given an array of expected (target) values and the actual predicted values from a classifier, compute metrics that make sense for a binary classifier, including accuracy, precision, recall, roc auc, and f1 score.
-
class
dataworkspaces.kits.scikit_learn.
MulticlassClassificationMetrics
(expected, predicted, sample_weight=None)[source]¶ Given an array of expected (target) values and the actual predicted values from a classifier, compute metrics that make sense for a multi-class classifier, including accuracy and sklearn’s “classification report” showing per-class metrics.
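As a rough illustration of what the binary classification metrics correspond to in scikit-learn (this is a sketch of the underlying calls, not the class's actual implementation):
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, f1_score)

def binary_classification_metrics(expected, predicted, sample_weight=None):
    # Sketch of the metrics reported by BinaryClassificationMetrics.
    return {
        'accuracy': accuracy_score(expected, predicted, sample_weight=sample_weight),
        'precision': precision_score(expected, predicted, sample_weight=sample_weight),
        'recall': recall_score(expected, predicted, sample_weight=sample_weight),
        'roc_auc': roc_auc_score(expected, predicted, sample_weight=sample_weight),
        'f1_score': f1_score(expected, predicted, sample_weight=sample_weight),
    }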
TensorFlow¶
Integration with Tensorflow 1.x and 2.0
This is an experimental API and subject to change.
Wrapping a Keras Model
Below is an example of wrapping one of the standard tf.keras model classes,
based on https://www.tensorflow.org/tutorials/keras/basic_classification.
Assume we have a workspace already set up, with two resources: a Source Data
resource of type api-resource, which is used to capture the hash of
input data as it is passed to the model, and a Results resource to
keep the metrics. The only change we need to do to capture the lineage from
the model is to wrap the model’s class, using
add_lineage_to_keras_model_class()
.
Here is the code:
# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
from dataworkspaces.kits.tensorflow1 import add_lineage_to_keras_model_class, CheckpointConfig
# Wrap our model class. This is the only DWS-specific change needed.
# We add an optional checkpoint configuration, which will cause checkpoints
# to be written to the workspace's scratch directory and then the best
# checkpoint copied to the results resource.
keras.Sequential = add_lineage_to_keras_model_class(keras.Sequential,
checkpoint_config=CheckpointConfig(model='fashion',
monitor='loss'))
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
model = keras.Sequential([
keras.layers.Flatten(input_shape=(28, 28)),
keras.layers.Dense(128, activation=tf.nn.relu),
keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5)
test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Test accuracy:', test_acc)
This will create a results.json
file in the results resource. It will
look like this:
{
"step": "test",
"start_time": "2019-09-26T11:33:22.100584",
"execution_time_seconds": 26.991521,
"parameters": {
"optimizer": "adam",
"loss_function": "sparse_categorical_crossentropy",
"epochs": 5,
"fit_batch_size": null,
"evaluate_batch_size": null
},
"run_description": null,
"metrics": {
"loss": 0.3657455060243607,
"acc": 0.8727999925613403
}
}
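You can read these results back programmatically. Here is a short sketch using the get_results() function from the Integration API (documented later in this document):
from dataworkspaces.api import get_results

# Returns a (parsed_json_dict, logical_path) tuple, or None if nothing is found.
found = get_results()
if found is not None:
    results, path = found
    print(path, '->', results['metrics'])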
Subclassing from a Keras Model
If you subclass from a Keras Model class, you can just use
add_lineage_to_keras_model_class()
as a decorator. Here is an example:
@add_lineage_to_keras_model_class
class MyModel(keras.Model):
def __init__(self):
# The Tensorflow documentation tends to specify the class name
# when calling the superclass __init__ function. Don't do this --
# it breaks if you use class decorators!
#super(MyModel, self).__init__()
super().__init__()
self.dense1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
self.dense2 = tf.keras.layers.Dense(5, activation=tf.nn.softmax)
def call(self, inputs):
x1 = self.dense1(inputs)
return self.dense2(x1)
model = MyModel()
import numpy as np
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
model.fit(np.zeros(20).reshape((5,4)), np.ones(5), epochs=5)
test_loss, test_acc = model.evaluate(np.zeros(16).reshape(4,4), np.ones(4))
print('Test accuracy:', test_acc)
Supported datatypes for API Resources
If you are using the API Resource Type for your input resource, the model wrapper will hash the incoming data parameters and include the hash values in the data lineage. To compute the hashes, Data Workspaces must access the underlying data representation. The following data types are currently supported:
- NumPy ndarray
- Pandas DataFrame and Series
- Tensorflow Tensor and Dataset, as well as tuples and dictionaries containing these types. These types are supported if you are either running Tensorflow 2.x (graph or eager mode) or 1.x in eager mode only. This restriction is due to the inability to access the underlying tensor representation when Tensorflow is running in graph mode in version 1.x.
If you are using another data representation, or running Tensorflow 1.x in graph mode, you can always use a resource type that stores the data in files (e.g. git or local-files) and pass in the input resource name to the wrapper function.
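For example, if your training data lives in a git resource, you might pass the resource name to the wrapper instead of relying on hashing of in-memory data. The resource names raw-data and results below are hypothetical:
from tensorflow import keras
from dataworkspaces.kits.tensorflow import add_lineage_to_keras_model_class

# 'raw-data' and 'results' are hypothetical resource names in your workspace.
WrappedSequential = add_lineage_to_keras_model_class(keras.Sequential,
                                                     input_resource='raw-data',
                                                     results_resource='results')
model = WrappedSequential([
    keras.layers.Dense(10, activation='softmax', input_shape=(4,))
])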
API
-
class
dataworkspaces.kits.tensorflow.
DwsModelCheckpoint
(model_name: str, monitor: str = 'val_loss', save_best_only: bool = False, mode: str = 'auto', save_freq: Union[str, int] = 'epoch', results_resource: Union[str, dataworkspaces.utils.lineage_utils.ResourceRef, None] = None, workspace_dir: Optional[str] = None, verbose: Union[int, bool] = 0)[source]¶ Subclass of tf.keras.callbacks.ModelCheckpoint which will save checkpoints to the workspace’s scratch space and then move the most recent/best checkpoint to the results directory at the end of the run.
You can instantiate this class directly and pass it to the callbacks parameter of the model’s fit() method:
model.fit(train_images, train_labels, epochs=10,
          callbacks=[DwsModelCheckpoint('fashion', monitor='loss',
                                        save_best_only=True)])
You can also pass a CheckpointConfig instance to the add_lineage_to_keras_model_class() wrapper function.
-
class
dataworkspaces.kits.tensorflow.
CheckpointConfig
[source]¶ Configuration for checkpoints, to be passed as a parameter to add_lineage_to_keras_model_class(), instead of directly instantiating DwsModelCheckpoint.
The checkpoints are initially written under the workspace’s scratch space. At the end of training, the best checkpoint is copied to the results resource.
The configuration fields are:
- model_name - name of the model to use in checkpoint files
- monitor - metric to monitor; defaults to val_loss
- save_best_only - if True, only checkpoints better than the previous are kept
- mode - how to determine whether a metric is the “best”: auto, min, or max
- save_freq - 'epoch' or an integer
-
mode
¶ Alias for field number 3
-
model_name
¶ Alias for field number 0
-
monitor
¶ Alias for field number 1
-
save_best_only
¶ Alias for field number 2
-
save_freq
¶ Alias for field number 4
-
dataworkspaces.kits.tensorflow.
add_lineage_to_keras_model_class
(Cls: type, input_resource: Union[str, dataworkspaces.utils.lineage_utils.ResourceRef, None] = None, results_resource: Union[str, dataworkspaces.utils.lineage_utils.ResourceRef, None] = None, workspace_dir: Optional[str] = None, checkpoint_config: Optional[dataworkspaces.kits.tensorflow.CheckpointConfig] = None, verbose: bool = False) → type[source]¶ This function wraps a Keras model class with a subclass that overrides key methods to make calls to the data lineage API.
Parameters:
- Cls – the class being wrapped
- input_resource – optional input resource to this model. The resource may be specified by name, by a local file path, or via a ResourceRef. If no input is specified, will try to infer from the workspace.
- results_resource – optional resource where the results are to be stored. May be specified by name, by a local file path, or via a ResourceRef. If not specified, will try to infer from the workspace.
- workspace_dir – optional directory specifying the workspace. Usually can be inferred from the current directory.
- checkpoint_config – optional instance of CheckpointConfig, which is used to enable checkpointing on fit() and fit_generator().
- verbose – if True, print extra debugging information.
The following methods are wrapped:
- __init__() - loads the workspace and adds dws-specific class members
- compile() - captures the optimizer and loss_function parameter values
- fit() - captures the epochs and batch_size parameter values; if the input is an API resource, captures hash values of the training data, otherwise captures the input resource name. If the input is an API resource and is either a Keras Sequence or a generator, wraps the generator and captures the hashes of returned values as it is iterated through.
- evaluate() - captures the batch_size parameter value; if the input is an API resource, captures hash values of the test data, otherwise captures the input resource name; captures metrics and writes them to the results resource. If the input is an API resource and is either a Keras Sequence or a generator, wraps the generator and captures the hashes of returned values as it is iterated through.
6. Resource Reference¶
This section provides a little detail on how to use specific resource types. For specific command line options, please see the Command Reference.
Git resources¶
The git
resource type provides project tracking and management for Git repositories.
There are actually two types of git resources supported:
- A standalone repository. This can be either one controlled by the user or, for source data, a 3rd party repository to be treated as a read-only resource.
- A subdirectory of the data workspace’s git repository can be treated as a separate resource. This is especially convenient for small projects, where multiple types of data (source data, code, results, etc.) can be kept in a single repository but versioned independently.
When running the dws add git ...
command, the type of repository (standalone vs.
subdirectory of the main workspace) is automatically detected. In either case, it is
expected that there is a local copy of the repository available when adding it as
a resource to the workspace. It is recommended, but not required, to have a remote
origin
, so that the push
, pull
and clone
commands can work with
the resource.
Examples¶
When initializing a new workspace, one can add sub-directory resources for any or all of the resource roles (source-data, code, intermediate-data, and results).
This is done via the --create-resources
option as follows:
$ mkdir example-ws
$ cd example-ws/
$ dws init --create-resources=code,results
Have now successfully initialized a workspace at /Users/dws/code/t/example-ws
Will now create sub-directory resources for code, results
Added code to git repository
Have now successfully Added Git repository subdirectory code in role 'code' to workspace
Added results to git repository
Have now successfully Added Git repository subdirectory results in role 'results' to workspace
Finished initializing resources:
code: ./code
results: ./results
$ ls
code results
Here is an example from the Quick Start where we
add an entire third party repository to our workspace as a read-only resource.
We first clone it into a subdirectory of the workspace and then tell dws
about it:
git clone https://github.com/jfischer/sklearn-digits-dataset.git
dws add git --role=source-data --read-only ./sklearn-digits-dataset
Support for Large Files: Git-lfs and Git-fat integration¶
It can be nice to manage your golden source data in a Git repository. Unfortunately, due to its architecture and focus as a source code tracking system, Git can have significant performance issues with large files. Furthermore, hosting services like GitHub place limits on the size of individual files and on commit sizes. To get around this, various extensions to Git have sprung up. Data Workspaces currently integrates with two of them, git-lfs and git-fat.
Git-lfs¶
Git-lfs (large file storage) is a utility which interacts with a
git hosting service using a special protocol. This protocol is supported
by most popular Git hosting services/servers, including GitHub and
GitLab.
You need to manually install the git-lfs
executable (see
https://git-lfs.github.com for details).
Data Workspaces automatically determines whether a particular git repository
is using git-lfs by looking for any references to git-lfs in a
.gitattributes file within the repository. This is done for both the
workspace’s metadata repository and any git resources. DWS will also ensure
that the user is correctly configured for git-lfs, by running git-lfs install
if the user does not have an associated entry for git-lfs in their .gitconfig
file.
We support the following integration points with git-lfs
:
- The git repo for the workspace itself can be git-lfs enabled when it is created. This is done through the --git-lfs-attributes command line option on dws init. See the Command Reference entry for details (or the example below).
- Any dws push or dws pull of a git-lfs-enabled workspace will automatically call the associated git-lfs command for the workspace’s main repo.
- If you add a git repository as a resource to the workspace, and it has references to git-lfs in a .gitattributes file, then any dws push or dws pull commands will automatically call the associated git-lfs commands.
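For example, here is a brief sketch of enabling git-lfs when initializing a workspace; the *.csv pattern is only an illustration, so use the patterns that match your own large files:
$ mkdir lfs-example
$ cd lfs-example/
$ dws init --create-resources=source-data --git-lfs-attributes='*.csv'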
Git-fat¶
Git-fat allows you to
store your large files on a host you control that is accessible via
ssh
(or other protocols supported through rsync
). The large
files themselves are hashed and stored on the (remote) server. The
metadata for these files is stored in the git repository and versioned
with the rest of your git files.
Git-fat is just
a Python script, which we ship as a part of the dataworkspaces
Python
package. [1] Running pip install dataworkspaces
will put git-fat
into your path and make it available to your git
commands and dws
commands.
We support the following integration points with git-fat
:
- The git repo for the workspace itself can be git-fat enabled when it is created. This is done through command line options on dws init. See the Command Reference entry for details (or the example below).
- Any dws push or dws pull of a git-fat-enabled workspace will automatically call the associated git-fat command for the workspace’s main repo.
- If you add a git repository as a resource to the workspace, and it has a .gitfat file, then any dws push or dws pull commands will automatically call the associated git-fat commands.
- As mentioned above, git-fat is included in the dataworkspaces package and installed in your path.
[1] | Unfortunately, git-fat is written in Python 2, so you will need to have Python 2 available on your system to use it. |
Example¶
Here is an example using git-fat to store all gzipped files of the workspace’s main git repo on a remote server.
First, we set up a directory on our remote server to store the large files:
fat@remote-server $ mkdir ~/fat-store
Now, back on our personal machine, we initialize a workspace, specifying the remote server and that .gz files should be managed by git-fat:
local $ mkdir git-fat-example
local $ cd git-fat-example/
local $ dws init --create-resources=source-data \
--git-fat-remote=remote-server:/home/fat/fat-store \
--git-fat-user=fat --git-fat-attributes='*.gz'
local $ ls
source-data
A bit later, we’ve added some .gz files to our source data resource. We
take a snapshot and then dws push
to the origin:
local $ ls source-data
README.txt census-state-populations.csv.gz zipcode.csv.gz
local $ dws snapshot s1
local $ dws push # this will also push to the remote fat store
If we now go to the remote store, we can see the hashed files:
fat@remote-server $ ls fat-store
26f2cac452f70ad91da3ccd05fc40ba9f03b9f48 d9cc0c11069d76fe9435b9c4ca64a335098de2d7
Our local workspace has our full files, which can be used by our scripts as-is. However, if you look at the origin repository, you will find the content of each .gz file replaced by a single line referencing the hash. If you clone this repo, you will get the full files, through the magic of git-fat.
Adding resources using rclone¶
The rclone resource type leverages the rclone command line utility to provide synchronization with a variety of remote data services.
dws add rclone [options] source-repo target-repo
The dws add rclone command adds a remote repository that was previously configured using rclone.
Example¶
We use rclone config to set up a repository pointing to a local directory:
$ rclone config show
; empty config
$ rclone config create localfs local unc true
The configuration file (typically at ~/.config/rclone/rclone.conf
)
now looks like this:
[localfs]
type = local
config_automatic = yes
unc = true
Next, we use the backend to add a repository to dws:
$ dws add rclone --role=source-data localfs:/Users/rupak/tmp tmpfiles
This creates a local directory tmpfiles and copies the contents of /Users/rupak/tmp to it.
Similarly, we can make a remote S3 bucket:
$ rclone config
Current remotes:
Name Type
==== ====
localfs local
e) Edit existing remote
n) New remote
d) Delete remote
r) Rename remote
c) Copy remote
s) Set configuration password
q) Quit config
e/n/d/r/c/s/q> n
name> s3bucket
Type of storage to configure.
# Pick choice 4 for S3 and configure the bucket
...
# set configuration parameters
Once the S3 bucket is configured, we can get files from it:
$ dws add rclone --role=source-data s3bucket:mybucket s3files
Configuration Files¶
By default, we use the default configuration file used by rclone. This is the file printed out by:
$ rclone config file
and usually resides in $HOME/.config/rclone/rclone.conf
However, you can specify a different configuration file:
$ dws add rclone --config=/path/to/configfile --role=source-data localfs:/Users/rupak/tmp tmpfiles
In this case, make sure the config file you are using has the remote localfs
defined.
7. Internals: Developer’s Guide¶
This section is a guide for people working on the development of Data Workspaces or people who wish to extend it (e.g. through their own resource types or kits).
Installation and setup for development¶
In summary:
- Install Python 3 via the Anaconda distribution.
- Make sure you have the git and make utilities on your system.
- Create a virtual environment called dws and activate it.
- Install mypy via pip.
- Clone the main data-workspaces-core repo.
- Install the dataworkspaces package into your virtual environment.
- Download and configure rclone.
- Run the tests.
Here are the details:
We recommend using Anaconda3
for development, as it can be easily installed on Mac and Linux and includes
most of the packages you will need for development and data science projects.
You will also need some basic system utilities: git
and make
(which may
already be installed).
Once you have Anaconda3 installed, create a virtual environment for your work:
conda create --name dws
To activate the environment:
conda activate dws
You will need the mypy
type checker, which is run as part of the tests.
It is best to install via pip to get the latest version (some older versions
may be buggy). Once you have activated your environment, mypy
may be installed
as follows:
pip install mypy
Next, clone the Data Workspaces main source tree:
git clone git@github.com:data-workspaces/data-workspaces-core.git
Now, we install the data workspaces library, via pip
, using an editable
mode so that our source tree changes are immediately visible:
cd data-workspaces-core
pip install --editable `pwd`
With this setup, you should not have to configure PYTHONPATH
.
Next, we install rclone
, a file copying utility used by the rclone resource.
You can download the latest rclone
executable from http://rclone.org. Just make
sure that the executable is available in your executable path. Alternatively,
on Linux, you can install rclone
via your system’s package manager. To
configure rclone
, see the instructions here in the
Resource Reference.
Now, you should be ready to run the tests:
cd tests
make test
The tests will print a lot to the console. If all goes well, it should end with something like this:
----------------------------------------------------------------------
Ran 40 tests in 23.664s
OK
Overall Design¶
Here is a block diagram of the system architecture:

The core of the system is the Workspaces API, which is located in
dataworkspaces/workspace.py
. This API provides a base Workspace
classes for
the workspace and a base Resource
class for resources. There are also several
mixin classes which define extensions to the basic functionality. See
Core Workspace and Resource API for details.
The Workspace
class has one or more backends which implement the
storage and management of the workspace metadata. Currently, there is only
one complete backend, the git backend, which stores its metadata in
a git repository.
Independent of the workspace backend are resource types, which provide
concrete implementations of the Resource
base class. These currently include
resource types for git, git subdirectories, rclone, and local files.
Above the workspace API sits the Command API, which implements command functions
for operations like init, clone, push, pull, snapshot and restore. There is a thin layer above this for programmatic access (dataworkspaces.api
) as well as a full
command line interface, implemented using the
click package
(see dataworkspaces.dws
).
The Lineage API is also implemented on top of the basic workspace interface and provides a way for pipeline steps to record their inputs, outputs, and code dependencies.
Finally, kits provide integration with user-facing libraries and applications, such as Scikit-learn and Jupyter notebooks.
Code Layout¶
The code is organized as follows:
dataworkspaces/
- api.py - API to run a subset of the workspace commands from Python. This is useful for building integrations.
- dws.py - the command line interface
- errors.py - common exception class definitions
- lineage.py - the generic lineage api
- workspace.py - the core api for workspaces and resources
- backends/ - implementations of workspace backends
- utils/ - lower level utilities used by the upper layers
- resources/ - implementations of the resource types
- commands/ - implementations of the individual dws commands
- third_party/ - third-party code (e.g. git-fat)
- kits/ - adapters to specific external technologies
Git Database Layout¶
When using the git backend, a data workspace is contained within a Git repository.
The metadata about resources,
snapshots and lineage is stored in the subdirectory .dataworkspace
. The various
resources can be other subdirectories of the workspace’s repository or may be
external to the main Git repo.
The layout for the files under the .dataworkspace
directory is as follows:
.dataworkspace/
- config.json - overall configuration (e.g. workspace name, global params)
- local_params.json - local parameters (e.g. hostname); not checked into git
- resources.json - lists all resources and their config parameters
- resource_local_params.json - configuration for resources that is local to this machine (e.g. path to the resource); not checked into git
- current_lineage/ - contains lineage files reflecting the current state of each resource; not checked into git
- file/ - contains metadata for local-files based resources; in particular, has the file-level hash information for snapshots
- snapshots/ - snapshot metadata
  - snapshot-<HASHCODE>.json - lists the hashes for each resource in the snapshot. The hash of this file is the hash of the overall snapshot.
  - snapshot_history.json - metadata for the past snapshots
- snapshot_lineage/ - contains lineage data for past snapshots
  - <HASHCODE>/ - directory containing the current lineage files at the time of the snapshot associated with the hashcode. Unlike current_lineage, this is checked into git.
In designing the workspace database, we try to follow the following guidelines:
- Use JSON as our file format where possible - it is human readable and editable and easy to work with from within Python.
- Local or uncommitted state is not stored in Git (we use .gitignore to keep the files outside of the repo). Such files are initialized by dws init and dws clone.
- Avoid git merge conflicts by storing data in separate files where possible. For example, the resources.json file should really be broken up into one file per resource, stored under a common directory (see issue #13).
- Use git’s design as an inspiration. It provides an efficient and flexible representation.
Command Design¶
The bulk of the work for each command is done by the core Workspace API
and its backends. The command function itself (dataworkspaces.commands.COMMAND_NAME
)
performs parameter-checking, calls the associated parts of Workspace API,
and handles user interactions when needed.
Resource Design¶
Resources are orthogonal to commands and represent the collections of files to be versioned.
A resource may have one of four roles:
- Source Data Set - this should be treated read-only by the ML pipeline. Source data sets can be versioned.
- Intermediate Data - derived data created from the source data set(s) via one or more data pipeline stages.
- Results - the outputs of the machine learning / data science process.
- Code - code used to create the intermediate data and results, typically in a git repository or Docker container.
The treatment of resources may vary based on the role. We now look at resource functionality per role.
Source Data Sets¶
We want the ability to name source data sets and swap them in and out without changing other parts of the workspace. This still needs to be implemented.
Intermediate Data¶
For intermediate data, we may want to delete it from the current state of the workspace if it becomes out of date (e.g. a data source version is changed or swapped out). This still needs to be implemented.
Results¶
In general, results should be additive.
For the snapshot
command, we move the results to a specific subdirectory per
snapshot. The name of this subdirectory is determined by a template that can
be changed by setting the parameter results.subdir
. By default, the template
is: {DAY}/{DATE_TIME}-{USER}-{TAG}
. The moving of files is accomplished via the
method results_move_current_files(rel_path, exclude)
on the Resource
class. The snapshot()
method of the resource is still called as usual, after
the result files have been moved.
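As a sketch of changing this template programmatically (assuming the results.subdir parameter name described above and the workspace APIs documented later in this guide):
from dataworkspaces.workspace import find_and_load_workspace

# Load the workspace by searching up from the current directory.
workspace = find_and_load_workspace(batch=True, verbose=False)

# Change the per-snapshot results subdirectory template.
workspace.set_global_param('results.subdir', '{DATE_TIME}-{USER}-{TAG}')
workspace.save("Changed results.subdir template")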
Individual files may be excluded from being moved to a subdirectory. This is done
through a configuration command. Need to think about where this would be stored –
in the resources.json file? The files would be passed in the exclude set to
results_move_current_files
.
If we run restore
to revert the workspace to an
older state, we should not revert the results database. It should always
be kept at the latest version. This is done by always putting results
resources into the leave set, as if specified in the --leave
option.
If the user puts a results resource in the --only
set, we will error
out for now.
Integration API¶
The module dataworkspaces.api
provides a simplified, high level programmatic
interface to Data Workspaces. It is for integration with third-party tooling.
This is an API for selected Data Workspaces management functions.
-
class
dataworkspaces.api.
ResourceInfo
[source]¶ Named tuple representing the results from a call to
get_resource_info()
.-
local_path
¶ Alias for field number 3
-
name
¶ Alias for field number 0
-
resource_type
¶ Alias for field number 2
-
role
¶ Alias for field number 1
-
-
class
dataworkspaces.api.
SnapshotInfo
[source]¶ Named tuple representing the results from a call to
get_snapshot_history()
-
hashval
¶ Alias for field number 1
-
message
¶ Alias for field number 4
-
metrics
¶ Alias for field number 5
-
snapshot_number
¶ Alias for field number 0
-
tags
¶ Alias for field number 2
-
timestamp
¶ Alias for field number 3
-
-
dataworkspaces.api.
get_api_version
()[source]¶ The API version is maintained independently of the overall DWS version. It should be more stable.
-
dataworkspaces.api.
get_resource_info
(workspace_uri_or_path: Optional[str] = None, verbose: bool = False)[source]¶ Returns a list of ResourceInfo instances, describing the resources defined for this workspace.
-
dataworkspaces.api.
get_results
(workspace_uri_or_path: Optional[str] = None, tag_or_hash: Optional[str] = None, resource_name: Optional[str] = None, verbose: bool = False) → Optional[Tuple[Dict[str, Any], str]][source]¶ Get a results file as a parsed json dict. If no resource or snapshot is specified, searches all the results resources for a file. If a snapshot is specified, we look in the subdirectory where the results have been moved. If no snapshot is specified, and we don’t find a file, we look in the most recent snapshot.
Returns a tuple with the results and the logical path (resource:/subpath) to the results. If nothing is found, returns None.
-
dataworkspaces.api.
get_snapshot_history
(workspace_uri_or_path: Optional[str] = None, reverse: bool = False, max_count: Optional[int] = None, verbose: bool = False) → Iterable[dataworkspaces.api.SnapshotInfo][source]¶ Get the history of snapshots, starting with the oldest first (unless :reverse: is True). Returns a list of SnapshotInfo instances, containing the snapshot number, hash, tag, timestamp, and message. If :max_count: is specified, returns at most that many snapshots.
-
dataworkspaces.api.
get_version
()[source]¶ Get the version string for the installed version of Data Workspaces
-
dataworkspaces.api.
make_lineage_graph
(output_file: str, workspace_uri_or_path: Optional[str] = None, resource_name: Optional[str] = None, tag_or_hash: Optional[str] = None, width: int = 1024, height: int = 800, verbose: bool = False) → None[source]¶ Write a lineage graph as an html/javascript page to the specified file.
-
dataworkspaces.api.
make_lineage_table
(workspace_uri_or_path: Optional[str] = None, tag_or_hash: Optional[str] = None, verbose: bool = False) → Iterable[Tuple[str, str, str, Optional[List[str]]]][source]¶ Make a table of the lineage for each resource. The columns are: ref, lineage type, details, inputs
-
dataworkspaces.api.
restore
(tag_or_hash: str, workspace_uri_or_path: Optional[str] = None, only: Optional[List[str]] = None, leave: Optional[List[str]] = None, verbose: bool = False) → int[source]¶ Restore to a previous snapshot, identified by either its hash or its tag (if one was specified). Parameters:
- only - an optional list of resources to restore. If specified, all other resources will be left as-is.
- leave - an optional list of resources to leave as-is. only and leave should not both be specified.
Returns the number of resources changed.
-
dataworkspaces.api.
take_snapshot
(workspace_uri_or_path: Optional[str] = None, tag: Optional[str] = None, message: str = '', verbose: bool = False) → str[source]¶ Take a snapshot of the workspace, using the tag and message, if provided. Returns the snapshot hash (which can be used to restore to this point).
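Here is a short sketch of using this API from a script; the tag name exp-001 is illustrative:
from dataworkspaces.api import (take_snapshot, get_snapshot_history,
                                get_results, restore)

# Take a snapshot of the workspace found by searching up from the
# current directory.
hashval = take_snapshot(tag='exp-001', message='baseline experiment')
print('snapshot hash:', hashval)

# List the snapshot history, oldest first.
for snap in get_snapshot_history():
    print(snap.snapshot_number, snap.hashval, snap.message)

# Fetch the results.json associated with the tagged snapshot, if any.
found = get_results(tag_or_hash='exp-001')
if found is not None:
    results, path = found
    print(path, '->', results)

# Restore the workspace to the tagged snapshot.
restore('exp-001')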
Core Workspace API¶
Here is the detailed documentation for the Core Workspace API, found in
dataworkspaces.workspace
.
Main definitions of the workspace abstractions
Workspace backends, like git, subclass from the Workspace
base
class. Resource implementations (e.g. local files or git resource)
subclass from the Resource
base class.
Optional capabilities for both workspace backends and resource backends are defined via abstract mixin classes. These classes do not inherit from the base workspace/resource classes, to avoid issues with multiple inheritance.
Complex operations involving resources use the following pattern, where COMMAND is the command and CAPABILITY is the capability needed to perform the command:
class CAPABILITYWorkspaceMixin:
...
def _COMMAND_precheck(self, resource_list:List[CAPABILITYResourceMixin]) -> None:
# Backend can override to add more checks
for r in resource_list:
r.COMMAND_precheck()
def COMMAND(self, resource_list:List[CAPABILITYResourceMixin]) -> None:
self._COMMAND_precheck(resource_list)
...
for r in resource_list:
r.COMMAND()
The module dataworkspaces.commands.COMMAND
should look like this:
def COMMAND_command(workspace, ...):
if not isinstance(workspace, CAPABILITYWorkspaceMixin):
raise ConfigurationError("Workspace %s does not support CAPABILITY"%
workspace.name)
mixin = cast(CAPABILITYWorkspaceMixin, workspace)
... error checking ...
resource_list = ...
workspace.COMMAND(resource_list)
workspace.save("Completed command COMMAND")
Core Classes¶
-
class
dataworkspaces.workspace.
Workspace
(name: str, dws_version: str, batch: bool = False, verbose: bool = False)[source]¶ -
add_resource
(name: str, resource_type: str, role: str, *args, **kwargs) → dataworkspaces.workspace.Resource[source]¶ Add a resource to the repository for tracking.
-
as_lineage_ws
() → dataworkspaces.workspace.SnapshotWorkspaceMixin[source]¶ If this workspace supports snapshots and lineage, cast it to a SnapshotWorkspaceMixin. Otherwise, raise a NotSupportedError exception.
-
as_snapshot_ws
() → dataworkspaces.workspace.SnapshotWorkspaceMixin[source]¶ If this workspace supports snapshots, cast it to a SnapshotWorkspaceMixin. Otherwise, raise a NotSupportedError exception.
-
batch
= None¶ attribute: True if input from user should be avoided (bool)
-
clone_resource
(name: str) → dataworkspaces.workspace.LocalStateResourceMixin[source]¶ Instantiate the resource locally. This is used in cases where the resource has local state.
-
dws_version
= None¶ attribute: Version of dataworkspaces that was used to create the workspace (str)
-
get_global_param
(param_name: str) → Any[source]¶ Returns the value of the global param if set, otherwise the default value. If the param is not defined, throws ParamNotFoundError.
-
get_instance
() → str[source]¶ Return a unique identifier for this instance of the workspace. For lineage tracking, it is assumed that only one pipeline is running at a time in an instance. If the workspace exists on a local filesystem, then it should correspond to the machine and path where the workspace resides. Typically, some combination of hostname and user are sufficient.
Uniqueness of the instance is important for things like naming the results snapshot subdirectories.
-
get_local_param
(param_name: str) → Any[source]¶ Returns the value of the local param if set, otherwise the default value. If the param is not defined, throws ParamNotFoundError.
-
get_names_for_resources_that_need_to_be_cloned
() → Iterable[str][source]¶ Find all the resources that have local state, but no local parameters (not even an empty dict). These need to be cloned. This is to be called during the pull() command.
-
get_names_of_resources_with_local_state
() → Iterable[str][source]¶ Return an iterable of the resource names in the workspace that have local state.
-
get_resource
(name: str) → dataworkspaces.workspace.Resource[source]¶ Get the associated resource from the workspace metadata.
-
get_resource_names
() → Iterable[str][source]¶ Return an iterable of resource names. The names should be returned in a consistent order, specifically the order in which they were added to the workspace. This supports backwards compatibility for operations like snapshots.
-
get_resource_role
(resource_name) → str[source]¶ Get the role of a resource without having to instantiate it.
-
get_resource_type
(resource_name) → str[source]¶ Get the type of a resource without having to instantiate it.
-
get_resources
() → Iterable[dataworkspaces.workspace.Resource][source]¶ Iterate through all the resources
-
get_scratch_directory
() → str[source]¶ Return an absolute path for the local scratch directory to be used by this workspace.
-
get_workspace_local_path_if_any
() → Optional[str][source]¶ If the workspace maintains local state and has a “home” directory, return it. Otherwise, return None.
This is useful for things like providing defaults for resource local paths or providing special handling for resources enclosed in the workspace (e.g. GitRepoResource vs. GitSubdirResource)
-
map_local_path_to_resource
(path: str, expecting_a_code_resource: bool = False) → dataworkspaces.utils.lineage_utils.ResourceRef[source]¶ Given a path on the local filesystem, map it to a resource and the path within the resource. Raises PathNotAResourceError if no match is found.
Note: this does not check whether the path already exists.
-
name
= None¶ attribute: A short name for this workspace (str)
-
set_global_param
(name: str, value: Any) → None[source]¶ Validate and set a global parameter. Setting does not necessarily take effect until save() is called
-
set_local_param
(name: str, value: Any) → None[source]¶ Validate and set a local parameter. Setting does not necessarily take effect until save() is called
-
suggest_resource_name
(resource_type: str, role: str, *args)[source]¶ Given the arguments passed in for creating a resource, suggest a (unique) name for the resource.
-
validate_local_path_for_resource
(proposed_resource_name: str, proposed_local_path: str) → None[source]¶ When creating a resource, validate that the proposed local path is usable for the resource. By default, this checks existing resources with local state to see if they have conflicting paths and, if a local path exists for the workspace, whether there is a conflict (the entire workspace cannot be used as a resource path).
Subclasses may want to add more checks. For subclasses that do not support any local state, including in resources, they can override the base implementation and throw an exception.
-
validate_resource_name
(resource_name: str, subpath: Optional[str] = None, expected_role: Optional[str] = None) → None[source]¶ Validate that the given resource name and optional subpath are valid in the current state of the workspace. Otherwise throws a ConfigurationError.
-
verbose
= None¶ attribute: Print detailed logging (bool)
-
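As a brief sketch of how these methods fit together, a backend-agnostic script might enumerate the resources of the workspace found from the current directory (all calls below are documented in this section):
from dataworkspaces.workspace import find_and_load_workspace

workspace = find_and_load_workspace(batch=True, verbose=False)

for name in workspace.get_resource_names():
    print(name,
          workspace.get_resource_role(name),
          workspace.get_resource_type(name))

# Scratch space for temporary artifacts such as model checkpoints.
print('scratch directory:', workspace.get_scratch_directory())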
-
class
dataworkspaces.workspace.
ResourceRoles
[source]¶ This class defines constants for the four resource roles.
-
CODE
= 'code'¶
-
INTERMEDIATE_DATA
= 'intermediate-data'¶
-
RESULTS
= 'results'¶
-
SOURCE_DATA_SET
= 'source-data'¶
-
-
class
dataworkspaces.workspace.
Resource
(resource_type: str, name: str, role: str, workspace: dataworkspaces.workspace.Workspace)[source]¶ Base class for all resources
-
get_params
() → Dict[str, Any][source]¶ Get the parameters that define the configuration of the resource globally.
-
is_imported
() → bool[source]¶ Returns True if this resource has an imported parameter and it is True.
-
name
= None¶ attribute: unique name for this resource within the workspace (str)
-
resource_type
= None¶ attribute: name for this resource’s type (e.g. git, local-files, etc.) (str)
-
role
= None¶ Role of the resource, one of
ResourceRoles
-
validate_subpath_exists
(subpath: str) → None[source]¶ Validate that the subpath is valid within this resource. Otherwise should raise a ConfigurationError.
-
workspace
= None¶ attribute: The workspace that contains this resource (Workspace)
-
-
dataworkspaces.workspace.
RESOURCE_ROLE_CHOICES
= ['source-data', 'intermediate-data', 'code', 'results']¶ The list of valid resource role names.
Factory Classes and Functions¶
-
class
dataworkspaces.workspace.
WorkspaceFactory
[source]¶ This class collects the various ways of instantiating a workspace: creating from an existing one, initing a new one, and cloning into a new environment.
Each backend should implement a subclass and provide a singleton instance as the FACTORY member of the module.
-
static
clone_workspace
(local_params: Dict[str, Any], batch: bool, verbose: bool, *args) → dataworkspaces.workspace.Workspace[source]¶ Clone an existing workspace into the local environment. Note that hostname is used as an instance identifier (TODO: make this more generic).
This only clones the workspace itself, any local state resources should be cloned separately.
If a workspace has no local state, this factory method might not do anything.
-
-
dataworkspaces.workspace.
load_workspace
(uri: str, batch: bool, verbose: bool) → dataworkspaces.workspace.Workspace[source]¶ Given a requested workspace backend, and backend-specific parameters, instantiate and return a workspace. The workspace is specified by a uri, where the backend-type is the scheme and rest is interpreted by the backend.
The backend name / scheme is used to load a backend module whose name is dataworkspaces.backends.SCHEME.
-
dataworkspaces.workspace.
find_and_load_workspace
(batch: bool, verbose: bool, uri_or_local_path: Optional[str] = None) → dataworkspaces.workspace.Workspace[source]¶ This tries to find the workspace and load it. There are three cases:
- If uri_or_local_path is a uri, we call load_workspace() directly
- If uri_or_local_path is specified, but not a uri, we interpret it as a local path and try to instantiate a git-backend workspace at that location in the local filesystem.
- If uri_or_local_path is not specified, we start at the current directory and search up the directory tree until we find something that looks like a git backend workspace.
TODO: In the future, this should also look for a config file that might specify the workspace or list workspaces by name.
-
dataworkspaces.workspace.
init_workspace
(backend_name: str, workspace_name: str, hostname: str, batch: bool, verbose: bool, scratch_dir: str, *args, **kwargs) → dataworkspaces.workspace.Workspace[source]¶ Given a requested workspace backend, and backend-specific parameters, initialize a new workspace, then instantiate and return it.
A backend name is a module name. The module should have an init_workspace() function defined.
TODO: the hostname should be generalized as an “instance name”, but we also need backward compatibility. TODO: is this function now redundant? Compare to
load_workspace()
.
-
class
dataworkspaces.workspace.
ResourceFactory
[source]¶ Abstract factory class to be implemented for each resource type.
-
clone
(params: Dict[str, Any], workspace: dataworkspaces.workspace.Workspace) → dataworkspaces.workspace.LocalStateResourceMixin[source]¶ Instantiate a local copy of the resource that came from the remote origin. We don’t yet have local params, since this resource is not yet on the local machine. If not in batch mode, this method can ask the user for any additional information needed (e.g. a local path). In batch mode, it should either come up with a reasonable default or error out if not enough information is available.
-
from_command_line
(role: str, name: str, workspace: dataworkspaces.workspace.Workspace, *args, **kwargs) → dataworkspaces.workspace.Resource[source]¶ Instantiate a resource object from the add command’s arguments
-
from_json
(params: Dict[str, Any], local_params: Dict[str, Any], workspace: dataworkspaces.workspace.Workspace) → dataworkspaces.workspace.Resource[source]¶ Instantiate a resource object from saved params and local params
-
has_local_state
() → bool[source]¶ Return true if this resource has local state and needs a clone step the first time it is used.
-
suggest_name
(workspace: dataworkspaces.workspace.Workspace, role: str, *args) → str[source]¶ Given the arguments passed in to create a resource, suggest a name for the case where the user did not provide one via --name. This will be used by suggest_resource_name() to find a short, but unique name for the resource.
-
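To illustrate the shape of a resource-type implementation, here is a minimal sketch. The class names, the 'my-resource-type' identifier, and the stateless behavior are all made up for illustration and omit the details a real resource type needs:
from typing import Any, Dict
from dataworkspaces.workspace import Resource, ResourceFactory

class MyResource(Resource):
    # Hypothetical resource type with no local state.
    def get_params(self) -> Dict[str, Any]:
        return {'resource_type': self.resource_type,
                'name': self.name,
                'role': self.role}

    def validate_subpath_exists(self, subpath: str) -> None:
        pass  # sketch: accept any subpath

class MyResourceFactory(ResourceFactory):
    def from_command_line(self, role, name, workspace, *args, **kwargs):
        return MyResource('my-resource-type', name, role, workspace)

    def from_json(self, params, local_params, workspace):
        return MyResource(params['resource_type'], params['name'],
                          params['role'], workspace)

    def has_local_state(self) -> bool:
        return False  # no clone step needed in this sketch

    def clone(self, params, workspace):
        raise NotImplementedError("this sketch has no local state to clone")

    def suggest_name(self, workspace, role, *args) -> str:
        return 'my-resource'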
Mixins for Files and Local State¶
-
class
dataworkspaces.workspace.
FileResourceMixin
[source]¶ This is a mixin to be implemented by resources which provide a hierarchy of files. Examples include a git repo, local filesystem, or S3 bucket. A database would be a resource that does NOT implement this API.
-
add_results_file
(data: Union[Dict[str, Any], List[Any]], rel_dest_path: str) → None[source]¶ Save JSON results data to the specified path in the resource. Note that, although this is usually used for results role resources, it could also be used for intermediate-data resources if they are exported (causing the lineage file to be written to the resource).
TODO: this is used for both results and lineage files. Perhaps we should either rename it to something like add_json_file() or create a separate call for lineage.
-
delete_file
(rel_path: str) → None[source]¶ Delete a file from the resource. If the resource is read-only or otherwise does not support modifications, should throw a NotSupportedError.
-
does_subpath_exist
(subpath: str, must_be_file: bool = False, must_be_directory: bool = False) → bool[source]¶ Return True if the subpath is valid within this resource, False otherwise. If must_be_file is True, return True only if the subpath corresponds to a file. If must_be_directory is True, return True only if the subpath corresponds to a directory.
-
read_results_file
(subpath: str) → Dict[str, Any][source]¶ Read and parse json results data from the specified path in the resource. If the path does not exist or is not a file throw a ConfigurationError.
-
results_copy_current_files
(rel_dest_root: str, exclude_files: Set[str], exclude_dirs_re: Pattern[AnyStr]) → None[source]¶ A snapshot is being taken, and we want to copy the files in the resource to the relative subdirectory rel_dest_root. We should exclude the files in the set exclude_files and exclude any directories matching exclude_dirs_re (e.g. the directory to which the files are being moved).
By default results_move_current_files() is called, but the copy is used when we export the resource.
-
results_move_current_files
(rel_dest_root: str, exclude_files: Set[str], exclude_dirs_re: Pattern[AnyStr]) → None[source]¶ A snapshot is being taken, and we want to move the files in the resource to the relative subdirectory rel_dest_root. We should exclude the files in the set exclude_files and exclude any directories matching exclude_dirs_re (e.g. the directory to which the files are being moved).
-
-
class
dataworkspaces.workspace.
LocalStateResourceMixin
[source]¶ Mixin for the resource api for resources with local state that need to be “cloned”
-
get_local_params
() → Dict[str, Any][source]¶ Get the parameters that define any local configuration of the resource (e.g. local filepaths)
-
get_local_path_if_any
() → Optional[str][source]¶ If the resource has an associated local path on the system, return it. Otherwise, return None. Even if it has local state, this might not be a file-based resource. Thus, the return value is an Optional string.
-
pull_precheck
()[source]¶ Perform any prechecks before updating this resource from the remote origin.
-
Mixins for Synchronized and Centralized Workspaces¶
Workspace backends should inherit from one of either
SyncedWorkspaceMixin
or CentralWorkspaceMixin
.
-
class
dataworkspaces.workspace.
SyncedWorkspaceMixin
[source]¶ This mixin is for workspaces that support synchronizing with a master copy via push/pull operations.
-
publish
(*args) → None[source]¶ Make a local repo available at a remote location. For example, we may make it available on GitHub, GitLab or some similar service.
-
pull_resources
(resource_list: List[dataworkspaces.workspace.LocalStateResourceMixin]) → None[source]¶ Download latest updates from remote origin. By default, includes any resources that support syncing via the LocalStateResourceMixin.
Note that this does not handle the actual workspace pull or the cloning of new resources.
-
pull_workspace
() → dataworkspaces.workspace.SyncedWorkspaceMixin[source]¶ Pull the workspace itself and return a new workspace object reflecting the latest state changes.
-
push
(resource_list: List[dataworkspaces.workspace.LocalStateResourceMixin]) → None[source]¶ Upload updates to remote origin.
Backend subclass also needs to handle syncing of the workspace itself. If this is called with an empty set of resources, then we are just syncing the workspace. Pushing the workspace should include pushing of any new resources.
-
-
class
dataworkspaces.workspace.
CentralWorkspaceMixin
[source]¶ This mixin is for workspaces that have a central store and do not need synchronization of the workspace itself. They still may need to synchronize individual resources.
-
get_resources_that_need_to_be_cloned
() → List[str][source]¶ Return a list of resources with local state that are not present in the local system. This is used after a pull to clone these resources.
-
Mixins for Snapshot Functionality¶
To support snapshots, the interfaces defined by
SnapshotWorkspaceMixin
and SnapshotResourceMixin
should
be implemented by workspace backends and resources, respectively.
SnapshotMetadata
defines the metadata to be stored for each
snapshot.
-
class
dataworkspaces.workspace.
SnapshotMetadata
(hashval: str, tags: List[str], message: str, hostname: str, timestamp: str, relative_destination_path: str, restore_hashes: Dict[str, Optional[str]], metrics: Optional[Dict[str, Any]] = None, updated_timestamp: Optional[str] = None)[source]¶ The metadata we store for each snapshot (in addition to the manifest). relative_destination_path refers to the path used in resources that copy their current state to a subdirectory for each snapshot.
-
class
dataworkspaces.workspace.
SnapshotWorkspaceMixin
[source]¶ Mixin class for workspaces that support snapshots and restores.
-
delete_snapshot
(hash_val: str, include_resources=False) → None[source]¶ Given a snapshot hash, delete the entry from the workspace’s metadata. If include_resources is True, then delete any data from the associated resources (e.g. snapshot subdirectories).
-
get_lineage_store
() → dataworkspaces.utils.lineage_utils.LineageStore[source]¶ Return the store for lineage data. If this workspace backend does not support lineage for some reason, the call should raise a ConfigurationError.
-
get_most_recent_snapshot
() → Optional[dataworkspaces.workspace.SnapshotMetadata][source]¶ Helper function to return the metadata for the most recent snapshot (by timestamp). Returns None if no snapshot found
-
get_next_snapshot_number
() → int[source]¶ Return a number that can be used for this snapshot. For a given local copy of the workspace, it is guaranteed to be unique and increasing. It is not guaranteed to be globally unique (need to combine with hostname to get that).
-
get_snapshot_by_partial_hash
(partial_hash: str) → dataworkspaces.workspace.SnapshotMetadata[source]¶ Given a partial hash for the snapshot, find the snapshot whose hash starts with this prefix and return the metadata associated with the snapshot.
-
get_snapshot_by_tag
(tag: str) → dataworkspaces.workspace.SnapshotMetadata[source]¶ Given a tag, return the associated snapshot metadata. This lookup could be slower if a reverse index is not kept.
-
get_snapshot_by_tag_or_hash
(tag_or_hash: str) → dataworkspaces.workspace.SnapshotMetadata[source]¶ Given a string that is either a tag or a (partial) hash corresponding to a snapshot, return the associated snapshot metadata. Throws a ConfigurationError if no entry is found.
-
get_snapshot_manifest
(hash_val: str) → List[Any][source]¶ Returns the snapshot manifest for the given hash as a parsed JSON structure. The top-level dict maps resource names to resource parameters.
-
get_snapshot_metadata
(hash_val: str) → dataworkspaces.workspace.SnapshotMetadata[source]¶ Given the full hash of a snapshot, return the metadata. This lookup should be quick.
-
list_snapshots
(reverse: bool = True, max_count: Optional[int] = None) → Iterable[dataworkspaces.workspace.SnapshotMetadata][source]¶ Returns an iterable of snapshot metadata, sorted by timestamp ascending (or descending if reverse is True). If max_count is specified, return at most that many snapshots.
-
remove_tag_from_snapshot
(hash_val: str, tag: str) → None[source]¶ Remove the specified tag from the specified snapshot. Throw an InternalError if either the snapshot or the tag do not exist.
-
restore
(snapshot_hash: str, restore_hashes: Dict[str, Optional[str]], restore_resources: List[SnapshotResourceMixin]) → None[source]¶ Restore the specified resources to the specified hashes. The list should have been previously filtered to include only those with valid (not None) restore hashes.
-
save_snapshot_metadata_and_manifest
(metadata: dataworkspaces.workspace.SnapshotMetadata, manifest: bytes) → None[source]¶ Save the snapshot metadata and manifest using the hash in metadata.hashval.
-
snapshot
(tag: Optional[str] = None, message: str = '') → Tuple[dataworkspaces.workspace.SnapshotMetadata, bytes][source]¶ Take snapshot of the resources in the workspace, and metadata for the snapshot and a manifest in the workspace. We assume that the tag does not already exist (checks can be made in the command before calling this method).
We also copy the lineage data if the workspace supports lineage.
Returns the snapshot metadata and the (binary) snapshot hash. These should be saved into the workspace by the caller (i.e. the snapshot command). We don’t do that here, as further interactions with the user may be needed. In particular, if the hash is identical to a previous hash, we ask the user if they want to overwrite.
-
supports_lineage
() → bool[source]¶ Return True if this workspace’s backend supports lineage, False otherwise
-
-
class
dataworkspaces.workspace.
SnapshotResourceMixin
[source]¶ Mixin for the resource api for resources that can take snapshots.
-
copy_imported_lineage
(lineage_store: dataworkspaces.utils.lineage_utils.LineageStore) → None[source]¶ If imported lineage, copy the lineage.json file to the lineage store. The pull_resources() method on the workspace will call it after pulling the resource.
If the resource does not store files locally, this default implementation will need to be overridden.
-
delete_snapshot
(workspace_snapshot_hash: str, resource_restore_hash: str, relative_path: str) → None[source]¶ Delete any state associated with the snapshot, including any files under relative_path
-
restore_precheck
(restore_hashval: str) → None[source]¶ Run any prechecks before restoring to the specified hash value (aka certificate). This should throw a ConfigurationError if the restore would fail for some reason.
-
snapshot
() → Tuple[Optional[str], Optional[str]][source]¶ Take the actual snapshot of the resource and return a tuple of two hash values, the first for comparison, and the second for restoring. The comparison hash value is the one we save in the snapshot manifest. The restore hash value is saved in the snapshot metadata. In many cases both hashes are the same. If the resource does not support restores, it can return None for the second hash. This will cause attempted restores involving this resource to error out.
-
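To make the snapshot contract concrete, here is a sketch of the kind of hash computation a file-based resource might use for its snapshot() return values; the hashing scheme is illustrative only and not how any built-in resource type is implemented:
import hashlib
import os
from typing import Optional, Tuple

def snapshot_directory(local_path: str) -> Tuple[Optional[str], Optional[str]]:
    # Hash every file's bytes in a stable order to get a content hash.
    h = hashlib.sha1()
    for dirpath, _, filenames in sorted(os.walk(local_path)):
        for fname in sorted(filenames):
            with open(os.path.join(dirpath, fname), 'rb') as f:
                h.update(f.read())
    hashval = h.hexdigest()
    # Use the same value for both the comparison hash (stored in the snapshot
    # manifest) and the restore hash (stored in the snapshot metadata).
    return (hashval, hashval)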