Tutorial¶
This tutorial walks through the core functionality of Dyff.
You will learn how to create a dataset, download and serve a model from Hugging Face Hub, run an evaluation, and report on your findings.
If connected to a remote Dyff instance, this tutorial can be completed on a macOS, Windows, or Linux machine. For local development clusters, only Linux is supported at this time.
Structure of a safety audit¶
At a high level, auditing an AI/ML system consists of running the system on input data, collecting the outputs, computing performance metrics on those outputs, and aggregating scores from multiple datasets and metrics into an audit report.
Dyff structures the audit process into well-defined steps called workflows. The following figure shows the data flow through the various workflows ordered from left to right.
The workflows in the figure correspond to different resource types in the Dyff Platform. To run a workflow, you use the Dyff API to create instances of the corresponding resources with appropriate configurations. The resource is stored in Dyff’s database, and its status is updated as the workflow proceeds.
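For orientation, here is a minimal sketch of how a user typically connects to a Dyff instance before creating any resources. It assumes the Dyff Python client package; the endpoint URL and API key are placeholders, and the exact constructor arguments may differ in your client version.

```python
# Minimal sketch: connecting to a Dyff instance with the Python client.
# The endpoint and API key are placeholders; consult your instance's
# documentation for the actual values and constructor arguments.
from dyff.client import Client

dyffapi = Client(
    api_key="YOUR_API_KEY",                  # issued by your Dyff instance
    endpoint="https://api.example.com/v0",   # placeholder endpoint URL
)
```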
InferenceService¶
An InferenceService is the “system under test”. Dyff requires that the system
be packaged as a Web service that runs in a Docker container and provides an
HTTP API for making inferences on input data. This highly generic interface
allows Dyff to perform audits directly on production-ready intelligent systems.
Dyff can automatically wrap models from popular sources like Hugging Face as
InferenceServices.
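As a rough illustration of what that interface looks like from a client’s perspective, the sketch below sends a JSON request to a locally running service container. The port, endpoint path, and request/response fields are assumptions for illustration; each InferenceService defines its own inference schema.

```python
# Illustrative only: calling a containerized inference service over HTTP.
# The port, path, and JSON fields below are assumptions; each service
# defines its own request and response schema.
import requests

response = requests.post(
    "http://localhost:8080/generate",   # hypothetical local endpoint
    json={"text": "Translate to French: Hello, world!"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```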
Model¶
A Model describes the artifacts that comprise an inference model. Dyff can
create InferenceServices automatically for models from common sources like
Hugging Face. In most cases, the model artifacts, such as neural network
weights, simply get loaded into a “runner” container to expose the model as an
inference service, so services backed by models are cheap to create.
InferenceSession¶
An InferenceSession is a running instance of an InferenceService. Multiple
replicas of the service can be run in a single session to increase throughput.
Dyff automatically orchestrates the computational resources required, including
GPU accelerators for neural network models.
InferenceSessions serve two purposes. First, platform users can use them to perform inference interactively via the REST API. This is useful for prototyping evaluations. Second, InferenceSessions are used by Evaluations; the evaluation machinery is implemented as a “client” of the session that feeds in input data taken from a Dataset.
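As a sketch of the interactive case, the request below calls a running session through the platform’s REST API. The route pattern, session ID, and token handling are assumptions for illustration; consult the API reference of your Dyff instance for the actual paths and authentication scheme.

```python
# Illustrative sketch: interactive inference against a running
# InferenceSession via the platform's REST API. The route, session ID,
# and token below are placeholders, not the documented Dyff routes.
import requests

DYFF_ENDPOINT = "https://api.example.com/v0"   # placeholder instance URL
SESSION_ID = "<session-id>"                    # placeholder session ID
SESSION_TOKEN = "<session-token>"              # placeholder access token

response = requests.post(
    f"{DYFF_ENDPOINT}/inferencesessions/{SESSION_ID}/infer",  # hypothetical route
    headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
    json={"text": "What is the capital of France?"},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```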
Dataset¶
A Dataset is a set of input instances on which to evaluate systems. Dyff uses
the Apache Arrow format to represent datasets. The Arrow dataset format is a
columnar format optimized for data science and machine learning workflows, and
it is largely interconvertible with JSON and Pandas DataFrame formats. An Arrow
dataset has a static schema describing the names and types of its columns.
Dyff includes various schemas for common types of data.
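For a concrete sense of the format, the sketch below builds a tiny Arrow table with pyarrow and converts it to Pandas and plain Python records. The column names and schema are made up for illustration and are not one of Dyff’s built-in schemas.

```python
# A tiny Arrow dataset built with pyarrow. The column names and schema are
# made up for illustration; Dyff ships its own schemas for common data types.
import pyarrow as pa

schema = pa.schema(
    [
        ("_index_", pa.int64()),  # hypothetical instance identifier column
        ("text", pa.string()),    # hypothetical input text column
    ]
)

table = pa.table(
    {
        "_index_": [0, 1, 2],
        "text": [
            "Translate to French: Hello",
            "Translate to French: Goodbye",
            "Translate to French: Thank you",
        ],
    },
    schema=schema,
)

# Arrow tables convert readily to and from Pandas and JSON-like records.
df = table.to_pandas()
records = table.to_pylist()
```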
Evaluation¶
An Evaluation is the process of making an inference for each instance in a
dataset using a given inference service – for example, classifying all of the
images in ImageNet using a particular neural network model. The result of an
evaluation is another Apache Arrow dataset containing the inference outputs.
For a typical classifier, for instance, each output instance would contain the
top-\(k\) highest-scoring label predictions. Dyff scales to large evaluations
by running multiple replicas of the system under test.
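The sketch below shows one way to inspect such outputs locally once they have been downloaded; the directory path and column layout are assumptions, since the actual output schema depends on the inference service’s interface.

```python
# Illustrative only: loading evaluation outputs stored as an Arrow/Parquet
# dataset. The path is a placeholder, and the column layout depends on the
# inference service's output schema.
import pyarrow.dataset as ds

outputs = ds.dataset("evaluation-outputs/", format="parquet")  # placeholder path
df = outputs.to_table().to_pandas()
print(df.head())  # e.g. one row per input instance with its top-k predictions
```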
Measurement¶
A Measurement is the result of transforming raw inference outputs into
meaningful performance statistics by applying a scoring Method to those
outputs. For a simple classification task, for example, the method might assign
a 0-1 score to each instance according to whether the correct label was among
the top-\(k\) highest-scoring predicted labels. The output of a measurement is
another Arrow dataset. Any number of applicable methods can be run against a
single set of evaluation outputs.
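A toy example of the kind of scoring a Method might implement is sketched below; the column names ("predictions", "label", "_index_") are illustrative, not part of a required Dyff schema.

```python
# Toy sketch of a top-k scoring method: each instance gets a 0-1 score for
# whether its true label appears among the top-k predictions. Column names
# are illustrative.
import pyarrow as pa

def topk_scores(evaluation_outputs: pa.Table, k: int = 5) -> pa.Table:
    scores = []
    for row in evaluation_outputs.to_pylist():
        top_k = row["predictions"][:k]  # labels assumed ordered by score
        scores.append(
            {"_index_": row["_index_"], "topk": int(row["label"] in top_k)}
        )
    # The measurement result is itself an Arrow dataset.
    return pa.Table.from_pylist(scores)
```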
Note
In previous Dyff versions, the Report resource provided functionality similar
to the combination of Method + Measurement, but with less flexibility. Report
is deprecated and will be removed in a future version.
SafetyCase¶
Finally, a SafetyCase is an artifact suitable for human consumption that
summarizes the results of one or more Measurements to describe the overall
evidence for the safety or unsafety of a system for a particular use case.
Safety cases are also generated by applying Methods, but the methods that
produce safety cases are implemented as Jupyter notebooks. The notebooks
generate output documents consisting of formatted text, figures, tables, and
other multimedia content. The Dyff Platform, using Jupyter tools, renders the
output documents as HTML pages and serves them on the Web.
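To give a flavor of what such a notebook might contain, the cell below loads measurement scores and renders a small summary; the path and column name are placeholders, not a Dyff convention.

```python
# Sketch of a notebook cell a safety-case method might contain: load
# measurement scores and render a short summary. The path and column
# name are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.read_parquet("measurement-outputs/")   # placeholder path
summary = scores["topk"].agg(["mean", "count"])    # e.g. headline top-k accuracy
print(summary)

fig, ax = plt.subplots()
scores["topk"].value_counts().sort_index().plot.bar(ax=ax, title="Top-k hit rate")
```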