Usage

databooks is a tool designed to make the life of Jupyter notebook users easier, especially when it comes to sharing and versioning notebooks. That is because Jupyter notebooks are actually JSON files with extra metadata that is useful to Jupyter itself but unnecessary for many users. When committing notebooks, you also commit all that metadata, which can cause issues (such as noisy diffs and merge conflicts) down the line. This is where databooks comes in.

The package currently has 4 main features, exposed as CLI commands:

  1. databooks meta: to remove unnecessary notebook metadata that can cause git conflicts
  2. databooks fix: to fix conflicts after they've occurred, by parsing the conflicting versions of the file and computing their differences in a Jupyter-friendly way, so you (the user) can resolve them manually in the Jupyter interface
  3. databooks assert: to assert that notebook metadata conforms to desired values - for example, that a notebook has sequential execution counts, certain cell tags, etc.
  4. databooks show: to show a rich representation of the notebooks in the terminal

databooks meta

The only thing you need to pass is a path. We have sensible defaults to do the rest.

databooks meta path/to/notebooks

With that, for each notebook in the path, by default:

  • It will remove execution counts
  • It won't remove cell outputs
  • It will remove metadata from all cells (such as cell tags or ids)
  • It will remove all metadata from your notebook (including kernel information)
  • It won't overwrite files for you

Nonetheless, the tool is highly configurable. You can choose to remove cell outputs by passing --rm-outs. If there is some metadata you'd like to keep, such as cell tags, you can do so by passing --cell-meta-keep tags. And if you want to save the cleaned notebook, you can either pass a prefix (--prefix ...) or a suffix (--suffix ...) to be added to the file name before writing, or simply overwrite the source file (--overwrite).
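
For example, combining the flags above into a single call (the path is illustrative):

databooks meta path/to/notebooks --rm-outs --cell-meta-keep tags --overwrite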

databooks fix

With databooks meta we try to avoid git conflicts; with databooks fix we fix conflicts after they have occurred. As with databooks meta, the only required argument here is a path.

databooks fix path/to/notebooks

For each notebook in the path that has git conflicts:

  • It will keep the metadata from the notebook in HEAD
  • For the conflicting cells, it will wrap the differences in special marker cells, much like the conflict markers in normal git conflicts

Similarly to databooks meta, the default behavior can be changed via a pyproject.toml configuration file or by specifying CLI arguments. You could, for instance, keep the metadata from the notebook in BASE (as opposed to HEAD). If you know you only care about the notebook cells in HEAD or BASE, you can pass --cells-head or --no-cells-head and not worry about fixing conflicted cells in Jupyter.
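
For instance, to resolve conflicting cells by keeping only the BASE version (using the flag named above):

databooks fix path/to/notebooks --no-cells-head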

You can also pass a special --cell-fields-ignore parameter, which removes the given cell fields from both versions of the conflicting notebook before comparing them. Depending on your Jupyter version, cells may have an id field that is unique to each cell - meaning all cells would be considered different, even when they have the same source and outputs, simply because their ids differ. By removing id and execution_count (which we do by default), only the actual code and outputs are compared to determine whether the cells have changed.
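
For instance, to make the default behavior explicit - assuming, as is common for multi-value CLI options, that the flag is repeated once per field:

databooks fix path/to/notebooks --cell-fields-ignore id --cell-fields-ignore execution_count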

Note

If a notebook with conflicts (and therefore no longer valid JSON/Jupyter) is committed to the repo, databooks fix will not consider the file as something to fix - the same behavior as git.

Fun fact

"Fix" may be a misnomer: the "broken" JSON in the notebook is not actually fixed - instead we compare the versions of the notebook that caused the conflict.

databooks assert

In databooks meta, we remove unwanted metadata. But sometimes we still want some metadata (such as cell tags) - or, more than that, we want that metadata to have certain values. This is where databooks assert comes in: we can use this command to ensure that the metadata is present and has the desired values.

databooks assert is akin to (and inspired by) Python's assert. Therefore, the user must pass a path and a string with the expression to be evaluated for each notebook.

databooks assert path/to/notebooks --expr "python expression to assert on notebooks"

This can be used, for example, to make sure that the notebook cells were executed in order, that the notebook contains markdown cells, or that no notebook exceeds a maximum number of cells.

Evidently, there are some limitations to the expressions that a user can pass.

Variables in scope:

All the variables available in your assert expressions are subclasses of Pydantic models. Therefore, you can use these models as regular Python objects (i.e., to access the cell types, one could write [cell.cell_type for cell in nb.cells]). For convenience, here is a list of the currently supported variables that can be used in assert expressions (example commands follow the list):

  • nb: Jupyter notebook found in path
  • raw_cells: notebook "raw" cells
  • md_cells: notebook markdown cells
  • code_cells: notebook code cells
  • exec_cells: executed notebook code cells
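
For example, drawing only on these variables (and provided the parser allows the operations - see "Built-in functions" below), you could check that a notebook contains markdown cells, or that every code cell was executed:

databooks assert path/to/notebooks --expr "len(md_cells) > 0"
databooks assert path/to/notebooks --expr "len(exec_cells) == len(code_cells)"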

Built-in functions:

These limitations are designed to allow anyone to use databooks assert safely. Under the hood we use Python's built-in eval, and eval is really dangerous. To mitigate that (and for your safety), we parse the string and only allow a restricted set of operations. Check out our tests to see what is and isn't allowed, and see the source to see how that happens!

It's also worth mentioning that, to avoid repetitive typing, you can configure the tool so that you don't have to pass the expression string each time. An even more powerful approach is to combine it with pre-commit or CI/CD. Check out the rest of the "Usage" section for more info!
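
As a sketch of what that configuration could look like in pyproject.toml - the section and key names below are assumptions, so check the configuration docs for the exact schema:

[tool.databooks.assert]  # section name is an assumption
expr = "len(nb.cells) < 64"  # hypothetical key storing the expression string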

Recipes

It can be a bit repetitive and tedious to write out expressions to be asserted - or even hard to think of how to express these assertions about notebooks. With that in mind, we also include "user recipes". These recipes store some useful expressions to be checked, serving both as shorthand for common assertions and as inspiration for you to come up with your own! Feel free to submit a PR with your recipe, or open an issue if you're having trouble coming up with a recipe for your goal.

has-tags

  • Description: Assert that there is at least one cell with tags.
  • Source: any(getattr(cell.metadata, 'tags', []) for cell in nb.cells)

has-tags-code

  • Description: Assert that there is at least one code cell with tags.
  • Source: any(getattr(cell.metadata, 'tags', []) for cell in code_cells)

max-cells

  • Description: Assert that there are fewer than 64 cells in the notebook.
  • Source: len(nb.cells) < 64

no-empty-code

  • Description: Assert that there are no empty code cells in the notebook.
  • Source: all(cell.source for cell in code_cells)

seq-exec

  • Description: Assert that the executed code cells were executed sequentially (similar effect to when you 'restart kernel and run all cells').
  • Source: [c.execution_count for c in exec_cells] == list(range(1, len(exec_cells) + 1))

seq-increase

  • Description: Assert that the executed code cells were executed in increasing order.
  • Source: [c.execution_count for c in exec_cells] == sorted([c.execution_count for c in exec_cells])

startswith-md

  • Description: Assert that the first cell in notebook is a markdown cell.
  • Source: nb.cells[0].cell_type == 'markdown'
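
Assuming recipes are selected by name with a --recipe flag (check databooks assert --help for the exact interface), running one could look like:

databooks assert path/to/notebooks --recipe seq-exec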

Tip

If your use case is more complex and cannot be translated into a single expression, you can always download databooks and use it as a part of your script!
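
As a minimal sketch, assuming the Pydantic-style API these docs suggest (the import path and the parse_file call are assumptions - check the source for the exact interface):

from databooks.data_models.notebook import JupyterNotebook  # import path is an assumption

# Load the notebook as a Pydantic model, like the `nb` variable in assert expressions
nb = JupyterNotebook.parse_file("path/to/notebook.ipynb")

# Cells behave like regular Python objects
code_cells = [cell for cell in nb.cells if cell.cell_type == "code"]
assert all(cell.source for cell in code_cells), "found an empty code cell"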

databooks show

Sometimes we may want to quickly visualize a notebook. However, it can be a bit cumbersome to start the Jupyter server, navigate to the file we'd like to inspect, and open it just to take a look. Moreover, merely opening the file in Jupyter may already modify the notebook metadata.

This is where databooks show comes in. Simply specify the path(s) to the notebook(s) to visualize them in the terminal. You can optionally pass --pager to open a scrollable window that won't actually print the notebook contents to the terminal.
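
For example (assuming the scrollable window is enabled with the --pager flag):

databooks show path/to/notebooks --pager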