Tutorial¶
This tutorial walks through the core functionality of Dyff.
You will learn how to create a dataset, run inference on the data, analyze the results, and report on your findings.
Dyff API clients¶
You will interact with the Dyff API using a Python client. In the tutorial examples, we will assume that the client instance is called dyffapi. API functions are grouped by the resource type that they affect. For example, to retrieve a Dataset resource, you would use the get() function for datasets:
dyffapi.datasets.get("dataset-id")
We recommend that you complete this tutorial using the DyffLocalPlatform client. It implements most of the Dyff API entirely locally within Python. The main things that it can’t do are:
Run inference with large AI models that require exotic hardware, and
Query resources by their attributes (because it doesn’t have a database engine).
We provide inference service mock-ups that you can use for prototyping. DyffLocalPlatform is intended to be API-compatible with the real Dyff API Client; in fact, in many tutorial examples, we declare the client instance as:
dyffapi: Client | DyffLocalPlatform = ...
In most cases, you should be able to swap out the “local” client for the “remote” client and everything will still work.
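For reference, setting up the two clients might look something like the sketch below. The import paths and constructor arguments shown are assumptions and may differ in your installed Dyff packages, so check the installation documentation for the exact names.
# Assumed import paths and constructor arguments -- verify against your
# installed Dyff packages before copying this.
from dyff.client import Client
from dyff.audit.local.platform import DyffLocalPlatform

# Prototype locally first:
dyffapi: Client | DyffLocalPlatform = DyffLocalPlatform(
    storage_root=".dyff-local",  # hypothetical: where local artifacts are kept
)

# Later, swap in the remote client (hypothetical constructor argument):
# dyffapi = Client(api_key="YOUR-API-KEY")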
Note
Use DyffLocalPlatform for prototyping and debugging, and swap it for the remote Client when everything is ready. This will save you time, give you better error messages, and avoid spending money on cloud computing resources.
Structure of a safety audit¶
Auditing an AI system consists of running the system on input data, collecting the outputs, analyzing those outputs, and presenting conclusions in an audit report.
Dyff structures the audit process into steps called workflows. The following figure shows the data flow through the various workflows ordered from left to right.
The workflows in the figure correspond to different resource types in the Dyff Platform. To run a workflow, you use the Dyff API to create instances of the corresponding resources with appropriate configurations. The resource is stored in Dyff’s database and its status is updated as the work proceeds.
InferenceService¶
An InferenceService is the “system under test”.
Dyff requires that the system be packaged as a Web service that runs in a Docker
container and provides an HTTP API for making inferences on input data. This
generic interface allows Dyff to perform audits directly on
production-ready systems. Dyff can automatically wrap models from popular sources like Hugging Face as
InferenceServices.
Model¶
A Model describes the artifacts that comprise an
inference model. Dyff can create InferenceServices automatically for models from
common sources like Hugging Face. In most cases, the model artifacts such
as neural network weights simply get loaded into a “runner” container to expose
the model as an inference service, so services backed by models are cheap to
create.
InferenceSession¶
An InferenceSession is a running instance of an
InferenceService. Multiple replicas of the service can be run in a single
session to increase throughput. Dyff automatically orchestrates the
computational resources required, including GPU accelerators for neural network
models.
InferenceSessions serve two purposes. First, platform users can use them to perform inference interactively via the Dyff API. This is useful for experimentation and prototyping. Second, InferenceSessions are used by Evaluations; the evaluation machinery is implemented as a “client” of the session that feeds in input data taken from a Dataset.
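As a rough sketch of interactive use, the calls below follow the dyffapi.<resource-group> pattern introduced earlier; the method names and arguments are hypothetical placeholders, so consult the API reference for the real calls.
# Hypothetical sketch of interactive inference; inferencesessions.create(),
# .client(), and their arguments are placeholder names, not the real API.
session = dyffapi.inferencesessions.create(
    inferenceservice="inferenceservice-id",  # the service to instantiate
    replicas=1,                              # more replicas -> more throughput
)
session_client = dyffapi.inferencesessions.client(session.id)
response = session_client.infer({"text": "Hello, Dyff!"})
print(response)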
Dataset¶
A Dataset is a set of input instances on which to
evaluate systems. Dyff uses the Apache Arrow format to represent datasets.
The Arrow format is mostly inter-convertible with JSON and Pandas
DataFrame formats. An Arrow dataset has a static schema describing the names and
types of columns. Dyff includes various schemas for common
types of data.
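As a rough illustration of working with Arrow data directly, the following uses the pyarrow library; the column names and schema are invented for this example and are not a Dyff-prescribed schema.
# Illustration using the pyarrow library; the column names and schema are
# invented for this example, not a Dyff-prescribed schema.
import pyarrow as pa

schema = pa.schema([("index", pa.int64()), ("text", pa.string())])
table = pa.Table.from_pylist(
    [
        {"index": 0, "text": "What is the capital of France?"},
        {"index": 1, "text": "Summarize this paragraph in one sentence."},
    ],
    schema=schema,
)
records = table.to_pylist()   # back to JSON-like records
df = table.to_pandas()        # or to a Pandas DataFrame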
Evaluation¶
An Evaluation is the process of making an
inference for each instance in a dataset using a given inference service – for
example, classifying all of the images in ImageNet using a particular neural
network model. The result of an evaluation is another Apache Arrow dataset
containing the inference outputs.
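To sketch how this might look through the client, again following the dyffapi.<resource-group> pattern, the create() arguments below are hypothetical placeholders rather than the real request fields.
# Hypothetical sketch; the create() arguments are placeholder names, not the
# real request fields -- see the API reference for the actual schema.
evaluation = dyffapi.evaluations.create(
    dataset="dataset-id",                    # input Dataset to run through the service
    inferenceservice="inferenceservice-id",  # the system under test
)
# The evaluation's status can then be polled with the standard get() call:
evaluation = dyffapi.evaluations.get(evaluation.id)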
Method¶
A Method is an implementation of an analysis process. Methods take inputs – either Evaluation outputs or the outputs of other intermediate Methods – and produce either Measurements or SafetyCases. Any number of different applicable methods can be run against a single set of evaluation outputs.
Module¶
A Module contains Python code artifacts that can be used by Methods. At runtime, the root directory of the Module will be mounted at a known location and added to the PYTHONPATH so that it can be imported. Methods can use multiple modules.
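To make this concrete, imagine a Module whose root directory contains a small Python package; the package and function names below are invented for illustration. Because the Module root is on PYTHONPATH at runtime, Method code can import it like any other package.
# Hypothetical Module layout (the package and function names are invented):
#
#   my_metrics/
#       __init__.py
#       accuracy.py     # defines top_k_score()
#
# Inside a Method, the package is imported normally:
from my_metrics.accuracy import top_k_score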
Measurement¶
A Measurement is a new Arrow dataset resulting from applying a Method to one or
more input datasets – either the outputs of Evaluations, or other Measurements.
For a simple classification task, for example, the method might assign a 0-1
score to each instance according to whether the correct label was among the
top-k highest-scoring predicted labels.
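As an illustration of that kind of per-instance scoring (not Dyff’s actual implementation), a top-k score can be computed like this:
# Illustrative top-k scoring; not Dyff's actual implementation.
def top_k_score(predicted_labels: list[str], true_label: str, k: int = 5) -> int:
    """Return 1 if the true label is among the k highest-scoring predictions."""
    return 1 if true_label in predicted_labels[:k] else 0

# Example: predictions are already sorted from highest to lowest score.
print(top_k_score(["cat", "dog", "fox"], true_label="dog", k=2))  # -> 1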
One important use case for Measurements is to post-process Evaluation outputs to anonymize them or remove sensitive information. The result can then be shared publicly, allowing others to analyze it.
Note
In previous Dyff versions, the Report resource provided similar functionality to the combination of Method + Measurement, but with less flexibility. Report is deprecated and will be removed in a future version.
SafetyCase¶
A SafetyCase is an artifact suitable for human consumption that presents the
overall evidence for or against the safety of a system for a particular use case.
Like measurements, safety cases are generated by running a Method, but the methods that produce safety cases are
implemented as Jupyter notebooks. The notebook generates an output document
consisting of formatted text, figures, tables, and other multi-media content.
The Dyff Platform, using Jupyter tools, renders the output documents as HTML
pages and serves them on the Web.
Score¶
In addition to the main safety case document, SafetyCases may also “export” one or more scalar-valued Scores. Scores are used by the Dyff App to populate various performance summary and comparison views.