Tutorial

This tutorial walks through the core functionality of Dyff.

You will learn how to create a dataset, download and serve a model from Hugging Face Hub, run an evaluation, and report on your findings.

If connected to a remote Dyff instance, this tutorial can be completed on a macOS, Windows, or Linux machine. For local development clusters, only Linux is supported at this time.

Structure of a safety audit

At a high level, auditing an AI/ML system consists of running the system on input data, collecting the outputs, computing performance metrics on those outputs, and aggregating scores from multiple datasets and metrics into an audit report.

Dyff structures the audit process into well-defined steps called workflows. The following figure shows the data flow through the various workflows ordered from left to right.

[Figure: data flow between Dyff workflows. Model → InferenceService (1…n); InferenceService → InferenceSession (1…n); InferenceSession ↔ User; Dataset → Evaluation; InferenceSession ↔ Evaluation; Evaluation → Measurement (1…n); Measurement → SafetyCase (n…1).]

The workflows in the figure correspond to different resource types in the Dyff Platform. To run a workflow, you use the Dyff API to create instances of the corresponding resources with appropriate configurations. Each resource is stored in Dyff’s database, and its status is updated as the workflow proceeds.
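
As a rough sketch, creating a resource with the Python client might look like the following; the API key, account name, and creation method shown are illustrative assumptions, not a definitive interface:

    from dyff.client import Client

    # Illustrative sketch: the API key and account are placeholders, and the
    # dataset-creation call is an assumption about the client interface.
    dyffapi = Client(api_key="YOUR_API_KEY")

    # Creating a resource starts the corresponding workflow; Dyff stores the
    # resource and updates its status as the workflow proceeds.
    dataset = dyffapi.datasets.create_arrow_dataset(
        "path/to/dataset",        # local directory containing the Arrow data
        account="my-account",
        name="tutorial-dataset",
    )
    print(dataset.id, dataset.status)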

InferenceService

An InferenceService is the “system under test”. Dyff requires that the system be packaged as a Web service that runs in a Docker container and exposes an HTTP API for making inferences on input data. This highly generic interface allows Dyff to audit production-ready intelligent systems directly. Dyff can automatically wrap models from popular sources like Hugging Face as InferenceServices.
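
To make the interface concrete, here is a minimal sketch of calling such an HTTP inference API with plain Python; the endpoint path and JSON fields are assumptions for illustration, not Dyff’s actual inference schema:

    import requests

    # Hypothetical endpoint and payload: real services define their own
    # routes and input/output schemas.
    response = requests.post(
        "http://localhost:8080/generate",
        json={"prompt": "Translate to French: Hello, world."},
        timeout=30,
    )
    response.raise_for_status()
    print(response.json())  # e.g. a list of generated completions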

Model

A Model describes the artifacts that comprise an inference model. Dyff can create InferenceServices automatically for models from common sources like Hugging Face. In most cases, the model artifacts (such as neural network weights) are simply loaded into a “runner” container that exposes the model as an inference service, so services backed by models are cheap to create.

InferenceSession

An InferenceSession is a running instance of an InferenceService. Multiple replicas of the service can be run in a single session to increase throughput. Dyff automatically orchestrates the computational resources required, including GPU accelerators for neural network models.

InferenceSessions serve two purposes. First, platform users can use them to perform inference interactively via the REST API. This is useful for prototyping evaluations. Second, InferenceSessions are used by Evaluations; the evaluation machinery is implemented as a “client” of the session that feeds in input data taken from a Dataset.
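
The second use is easiest to picture as a client loop that feeds input instances to the running session; the sketch below uses plain Python with a hypothetical endpoint and field names, not Dyff’s actual evaluation machinery:

    import requests

    # Hypothetical sketch of the "evaluation as a client of the session" idea:
    # send each input instance to the session's inference endpoint.
    # The endpoint URL and JSON fields are assumptions for illustration.
    inputs = [
        {"text": "What is 2 + 2?"},
        {"text": "Name a prime number."},
    ]
    outputs = []
    for instance in inputs:
        response = requests.post(
            "http://session.example.internal/infer",
            json=instance,
            timeout=60,
        )
        response.raise_for_status()
        outputs.append(response.json())
    # In a real evaluation, Dyff collects these outputs into another Arrow dataset.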

Dataset

A Dataset is a set of input instances on which to evaluate systems. Dyff uses the Apache Arrow format to represent datasets. The Arrow dataset format is a columnar format optimized for data science and machine learning workflows. It is mostly inter-convertible with JSON and Pandas DataFrame formats. An Arrow dataset has a static schema describing the names and types of columns. Dyff includes various schemas for common types of data.
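
The following standalone sketch shows what a small Arrow dataset with a static schema looks like; the column names are hypothetical and not one of Dyff’s built-in schemas:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical columns; Dyff provides its own schemas for common data types.
    schema = pa.schema([
        ("_index_", pa.int64()),   # assumed per-instance index column
        ("text", pa.string()),     # the input instance itself
    ])

    table = pa.table(
        {"_index_": [0, 1], "text": ["What is 2 + 2?", "Name a prime number."]},
        schema=schema,
    )

    pq.write_table(table, "dataset.parquet")  # Arrow data stored as Parquet
    print(table.to_pandas())                  # round-trips to a pandas DataFrame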

Evaluation

An Evaluation is the process of making an inference for each instance in a dataset using a given inference service – for example, classifying all of the images in ImageNet using a particular neural network model. The result of an evaluation is another Apache Arrow dataset containing the inference outputs. For example, for a typical classifier, each output instance would contain the top-\(k\) highest-scoring label predictions. Dyff scales for large evaluations by running multiple replicas of the system under test.
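
Because the evaluation output is just another Arrow dataset, it can be loaded and inspected with standard tools; the file name and layout below are assumptions for illustration:

    import pyarrow.parquet as pq

    # Hypothetical layout: each output row pairs an input instance index with
    # the model's top-k predicted labels and their scores.
    outputs = pq.read_table("evaluation_outputs.parquet")
    print(outputs.schema)
    print(outputs.to_pandas().head())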

Measurement

A Measurement is the result of transforming raw inference outputs into meaningful performance statistics by applying a scoring Method to the raw outputs. For a simple classification task, for example, the Method might assign a 0-1 score to each instance according to whether the correct label was among the top-\(k\) highest-scoring predicted labels. The output of a measurement is another Arrow dataset, and any number of applicable Methods can be run against a single set of evaluation outputs.
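
For intuition, here is a minimal top-\(k\) scoring rule in plain Python; it illustrates the idea only and is not Dyff’s Method interface:

    def top_k_score(predicted_labels: list[str], true_label: str, k: int = 5) -> int:
        """Return 1 if the true label is among the top-k predictions, else 0."""
        return int(true_label in predicted_labels[:k])

    # Hypothetical (prediction, ground-truth) pairs; a Measurement would
    # produce one such 0-1 score per instance in the evaluation output.
    rows = [
        (["cat", "dog", "fox"], "cat"),
        (["dog", "fox", "cat"], "owl"),
    ]
    scores = [top_k_score(preds, truth, k=3) for preds, truth in rows]
    print(scores)                        # [1, 0]
    print(sum(scores) / len(scores))     # aggregate accuracy: 0.5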

Note

In previous Dyff versions, the Report resource provided similar functionality to the combination of Method + Measurement, but with less flexibility. Report is deprecated and will be removed in a future version.

SafetyCase

Finally, a SafetyCase is an artifact suitable for human consumption that summarizes the results of one or more Measurements to describe the overall evidence for or against the safety of a system in a particular use case.

Safety cases are also generated by applying Methods, but the Methods that produce safety cases are implemented as Jupyter notebooks. The notebooks generate output documents consisting of formatted text, figures, tables, and other multimedia content. The Dyff Platform uses Jupyter tooling to render the output documents as HTML pages and serve them on the Web.
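
As a hedged illustration, the body of such a notebook can be as simple as loading measurement scores and rendering a summary table; the file name and column names below are assumptions:

    import pandas as pd

    # Illustrative notebook cell: load per-instance scores produced by a
    # Measurement and summarize them. File name and column names are
    # hypothetical.
    scores = pd.read_parquet("measurement_scores.parquet")
    summary = scores.groupby("category")["score"].mean().to_frame("accuracy")
    summary  # in a notebook, this renders as an HTML table in the safety case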