
This tutorial walks through the core functionality of Dyff.

You will learn how to create a dataset, download and serve a model from Hugging Face Hub, run an evaluation, and report on your findings.

If you are connected to a remote Dyff instance, you can complete this tutorial on a macOS, Windows, or Linux machine. Local development clusters are currently supported only on Linux.

Structure of a safety audit

At a high level, auditing an AI/ML system consists of running the system on input data, collecting the outputs, computing performance metrics on those outputs, and aggregating scores from multiple datasets and metrics into an audit report.
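The four stages above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Dyff code: `toy_system`, `DATASETS`, `accuracy`, and `run_audit` are hypothetical names invented for this sketch, and the "system" is a trivial keyword classifier.

```python
from statistics import mean

# Hypothetical stand-in for the system under test: a trivial sentiment classifier.
def toy_system(text: str) -> str:
    return "positive" if "good" in text.lower() else "negative"

# Two tiny hypothetical datasets of (input, expected label) pairs.
DATASETS = {
    "reviews": [("Good value", "positive"), ("Broke in a week", "negative")],
    "tweets": [("so good!", "positive"), ("not good at all", "negative")],
}

def accuracy(outputs, labels):
    """Fraction of outputs that match the expected labels."""
    return mean(1.0 if o == y else 0.0 for o, y in zip(outputs, labels))

def run_audit(system, datasets, metrics):
    """Run the system on each dataset, score the outputs, and aggregate a report."""
    report = {}
    for dataset_name, rows in datasets.items():
        inputs = [x for x, _ in rows]
        labels = [y for _, y in rows]
        outputs = [system(x) for x in inputs]  # run the system, collect outputs
        report[dataset_name] = {
            metric_name: metric(outputs, labels)  # compute each metric
            for metric_name, metric in metrics.items()
        }
    return report

report = run_audit(toy_system, DATASETS, {"accuracy": accuracy})
print(report)
```

The report is a nested mapping from dataset name to metric name to score, which is the kind of aggregate an audit report summarizes.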

Dyff structures the audit process into well-defined steps called workflows. The following figure shows the data flow through the various workflows ordered from left to right.

digraph dyff_domain_model {
    node [ordering=in, margin=0];
    newrank=true;
    rankdir=LR;

    "Dataset" [label="Dataset", shape="box", href="../apis/dyff/user-guide/core-resources.html#dataset", target="_top"];
    "InferenceService" [label="Inference\nService", shape="box", href="../apis/dyff/user-guide/core-resources.html#inference-service", target="_top"];
    "InferenceSession" [label="Inference\nSession", shape="box", href="../apis/dyff/user-guide/core-resources.html#inference-session", target="_top"];
    "Evaluation" [label="Evaluation", shape="box", href="../apis/dyff/user-guide/core-resources.html#evaluation", target="_top"];
    "Module" [label="Module", shape="box", href="../apis/dyff/user-guide/core-resources.html#module", target="_top"];
    "Method" [label="Method", shape="octagon", href="../apis/dyff/user-guide/core-resources.html#method", target="_top"];
    "Measurement" [label="Measurement", shape="box", href="../apis/dyff/user-guide/core-resources.html#measurement", target="_top"];
    "SafetyCase" [label="SafetyCase", shape="box", href="../apis/dyff/user-guide/core-resources.html#safety-case", target="_top"];
    "Model" [label="Model", shape="box", style="dashed", href="../apis/dyff/user-guide/core-resources.html#model", target="_top"];
    "Score" [label="Score", shape="box", style="dashed", href="../apis/dyff/user-guide/core-resources.html#score", target="_top"];

    "Dataset" -> "Evaluation";
    "InferenceService" -> "InferenceSession" [label=<<i> 1 ... n </i>>];
    "Evaluation" -> "Method" [label=<<i> 1 ... n </i>>];
    "Module" -> "Method" [label=<<i> n ... 1 </i>>];
    "Method" -> "Measurement";
    "Method" -> "SafetyCase";
    "SafetyCase" -> "Score" [label=<<i> 1 ... n </i>>];
    "Model" -> "InferenceService" [label=<<i> 1 ... n </i>>];

    { rank=same; edge [style=invis]; InferenceService -> Dataset; }
    { rank=same; edge [dir=both]; InferenceSession -> Evaluation; }
    { rank=same; edge [style=invis]; Module -> Method; }
    { rank=same; edge [style=invis]; Measurement -> SafetyCase; }
}

The workflows in the figure correspond to different resource types in the Dyff Platform. To run a workflow, you use the Dyff API to create an instance of the corresponding resource with an appropriate configuration. Each resource is stored in Dyff’s database, and its status is updated as the workflow proceeds.


An InferenceService is the “system under test”. Dyff requires that the system is packaged as a Web service that runs in a Docker container and provides an HTTP API for making inferences on input data. This highly generic interface allows Dyff to perform audits directly on production-ready intelligent systems. Dyff can automatically wrap models from popular sources like Hugging Face as InferenceServices.
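To make the "Web service with an HTTP API" requirement concrete, here is a minimal sketch that uses only the Python standard library. It is not the interface Dyff expects: the `/predict` route, the JSON payload shape, and the `predict` function are assumptions made for illustration, and the Docker packaging step is omitted.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(payload: dict) -> dict:
    # Hypothetical "model": report the length of the input text.
    return {"length": len(payload["text"])}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":  # hypothetical route
            self.send_error(404)
            return
        size = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(size))
        body = json.dumps(predict(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def call_service(port: int, payload: dict) -> dict:
    """Send one inference request to the local service and decode the reply."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request(
            "POST",
            "/predict",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(connection.getresponse().read())
    finally:
        connection.close()

# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = call_service(server.server_port, {"text": "hello"})
server.shutdown()
server.server_close()
print(result)
```

Because the contract is just "HTTP in, HTTP out", the same client code works whether the service wraps a toy function, as here, or a production system.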


A Model describes the artifacts that comprise an inference model. Dyff can create InferenceServices automatically for models from common sources like Hugging Face. In most cases, the model artifacts such as neural network weights simply get loaded into a “runner” container to expose the model as an inference service, so services backed by models are cheap to create.


An InferenceSession is a running instance of an InferenceService. Multiple replicas of the service can be run in a single session to increase throughput. Dyff automatically orchestrates the computational resources required, including GPU accelerators for neural network models.

InferenceSessions serve two purposes. First, platform users can use them to perform inference interactively via the Dyff API. This is useful for prototyping evaluations. Second, InferenceSessions are used by Evaluations; the evaluation machinery is implemented as a “client” of the session that feeds in input data taken from a Dataset.


A Dataset is a set of input instances on which to evaluate systems. Dyff uses the Apache Arrow format to represent datasets. The Arrow dataset format is a columnar format optimized for data science and machine learning workflows. In most cases, Arrow data can be converted to and from JSON and pandas DataFrame formats. An Arrow dataset has a static schema describing the names and types of columns. Dyff includes various schemas for common types of data.
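The following standard-library sketch illustrates two ideas from this description: a columnar layout governed by a static schema, and round-tripping with JSON-style records. It does not use Arrow itself (real code would typically use a library such as `pyarrow`), and the `SCHEMA` mapping and helper names are hypothetical.

```python
import json

# Hypothetical static schema: column name -> Python type.
SCHEMA = {"id": int, "text": str}

def rows_to_columns(rows, schema):
    """Convert row-oriented records into a columnar layout, enforcing the schema."""
    columns = {name: [] for name in schema}
    for row in rows:
        if set(row) != set(schema):
            raise ValueError(
                f"row columns {sorted(row)} do not match schema {sorted(schema)}"
            )
        for name, expected_type in schema.items():
            value = row[name]
            if not isinstance(value, expected_type):
                raise TypeError(f"column {name!r} expects {expected_type.__name__}")
            columns[name].append(value)
    return columns

def columns_to_rows(columns):
    """Convert the columnar layout back into a list of records."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]

rows = json.loads('[{"id": 0, "text": "hello"}, {"id": 1, "text": "world"}]')
columns = rows_to_columns(rows, SCHEMA)
round_tripped = columns_to_rows(columns)
print(columns)
```

Storing each column contiguously is what makes per-column operations cheap, and the fixed schema means a malformed row is rejected when the data is loaded instead of surfacing later in an analysis.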


An Evaluation is the process of making an inference for each instance in a dataset using a given inference service – for example, classifying all of the images in ImageNet using a particular neural network model. The result of an evaluation is another Apache Arrow dataset containing the inference outputs.
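Conceptually, an evaluation is a loop over the dataset that records one output per input. The sketch below assumes a hypothetical `infer` callable standing in for the inference service and hypothetical `id`, `text`, and `prediction` column names; it shows the shape of the computation, not how Dyff implements it.

```python
from typing import Callable, Iterable, List

def run_evaluation(dataset: Iterable[dict], infer: Callable[[str], str]) -> List[dict]:
    """Run inference on every instance and collect the outputs as a new dataset."""
    outputs = []
    for instance in dataset:
        # Carry the input's id into the output so results can be joined
        # back to their inputs during later analysis.
        outputs.append({"id": instance["id"], "prediction": infer(instance["text"])})
    return outputs

def toy_infer(text: str) -> str:
    # Hypothetical stand-in for a call to the inference service.
    return text.upper()

dataset = [{"id": 0, "text": "hello"}, {"id": 1, "text": "dyff"}]
evaluation_output = run_evaluation(dataset, toy_infer)
print(evaluation_output)
```

Keeping an identifier on every output row is the design choice worth noting: downstream scoring needs it to match each prediction with the corresponding ground-truth input.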


A Method is an implementation of an analysis process. Methods take inputs – either Evaluation outputs or the outputs of other intermediate Methods – and produce either Measurements or SafetyCases. Any number of different applicable methods can be run against a single set of evaluation outputs.


A Module contains Python code artifacts that can be used by Methods. At runtime, the root directory of the Module will be mounted at a known location and added to the PYTHONPATH so that it can be imported. Methods can use multiple modules.
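The import mechanism described here can be simulated locally. In this sketch a temporary directory plays the role of the mounted Module root, and adding it to `sys.path` has the same effect as adding it to `PYTHONPATH`. The file name `my_metrics.py` and the function `error_rate` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as module_root:
    # Stand-in for the Module's root directory, containing one Python file.
    Path(module_root, "my_metrics.py").write_text(
        "def error_rate(errors, total):\n"
        "    return errors / total\n"
    )
    sys.path.insert(0, module_root)  # same effect as adding it to PYTHONPATH
    importlib.invalidate_caches()    # make sure the new file is discovered
    try:
        my_metrics = importlib.import_module("my_metrics")
        rate = my_metrics.error_rate(1, 4)
    finally:
        sys.path.remove(module_root)

print(rate)
```

Analysis code that runs with the directory on its path can then use a plain `import my_metrics`, with no packaging or installation step.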


A Measurement is the result of applying a scoring Method to raw inference outputs, transforming them into meaningful performance statistics. For a simple classification task, for example, the method might assign a 0-1 score to each instance according to whether the correct label was among the top-\(k\) highest-scoring predicted labels. The output of a measurement is another Arrow dataset.
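The top-\(k\) scoring rule in this example can be written directly. The sketch below is a plain-Python illustration with hypothetical names (`top_k_score`, the `scores` column, the `labels` mapping); it produces one 0-1 score per instance, which is the kind of per-instance table a measurement would contain.

```python
def top_k_score(label_scores: dict, correct_label: str, k: int) -> int:
    """Return 1 if the correct label is among the k highest-scoring labels, else 0."""
    ranked = sorted(label_scores, key=label_scores.get, reverse=True)
    return 1 if correct_label in ranked[:k] else 0

# Hypothetical raw inference outputs: per-label scores for each instance.
outputs = [
    {"id": 0, "scores": {"cat": 0.7, "dog": 0.2, "fox": 0.1}},
    {"id": 1, "scores": {"cat": 0.5, "dog": 0.1, "fox": 0.4}},
]
# Hypothetical ground-truth labels keyed by instance id.
labels = {0: "cat", 1: "dog"}

measurement = [
    {"id": row["id"], "top2": top_k_score(row["scores"], labels[row["id"]], k=2)}
    for row in outputs
]
print(measurement)
```

Instance 0 scores 1 because "cat" is ranked first; instance 1 scores 0 because "dog" is ranked third, outside the top 2.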


In previous Dyff versions, the Report resource provided similar functionality to the combination of Method + Measurement, but with less flexibility. Report is deprecated and will be removed in a future version.


A SafetyCase is an artifact suitable for human consumption that presents the overall evidence for the safety or un-safety of a system for a particular use case.

Safety cases are also generated by applying Methods, but the methods that produce safety cases are implemented as Jupyter notebooks. The notebook generates an output document consisting of formatted text, figures, tables, and other multimedia content. The Dyff Platform uses Jupyter tools to render the output documents as HTML pages and serves them on the Web.


In addition to the main safety case document, SafetyCases may also “export” one or more scalar-valued Scores. Scores are used by the Dyff frontend to populate various performance summary and comparison views.