Tutorial¶
This tutorial walks through the core functionality of Dyff.
You will learn how to create a dataset, download and serve a model from Hugging Face Hub, run an evaluation, and report on your findings.
If connected to a remote Dyff instance, this tutorial can be completed on a macOS, Windows, or Linux machine. For local development clusters, only Linux is supported at this time.
Structure of a safety audit¶
At a high level, auditing an AI/ML system consists of running the system on input data, collecting the outputs, computing performance metrics on those outputs, and aggregating scores from multiple datasets and metrics into an audit report.
Dyff structures the audit process into well-defined steps called workflows. The following figure shows the data flow through the various workflows ordered from left to right.
The workflows in the figure correspond to different resource types in the Dyff Platform. To run a workflow, you use the Dyff API to create instances of the corresponding resources with appropriate configurations. The resource is stored in Dyff’s database, and its status is updated as the workflow proceeds.
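For orientation, here is a minimal sketch of how a user typically connects to a Dyff instance before creating any resources. It assumes the Dyff Python client package; the endpoint URL and API key are placeholders, and the exact constructor arguments may differ in your client version.

```python
# Minimal sketch: connecting to a Dyff instance with the Python client.
# The endpoint and API key are placeholders; consult your instance's
# documentation for the actual values and constructor arguments.
from dyff.client import Client

dyffapi = Client(
    api_key="YOUR_API_KEY",                  # issued by your Dyff instance
    endpoint="https://api.example.com/v0",   # placeholder endpoint URL
)
```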
InferenceService¶
An InferenceService is the “system under test”. Dyff requires that the system
be packaged as a Web service that runs in a Docker container and provides an
HTTP API for making inferences on input data. This highly generic interface
allows Dyff to perform audits directly on production-ready intelligent systems.
Dyff can automatically wrap models from popular sources like Hugging Face as
InferenceServices.
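As a rough illustration of what that interface looks like from a client’s perspective, the sketch below sends a JSON request to a locally running service container. The port, endpoint path, and request/response fields are assumptions for illustration; each InferenceService defines its own inference schema.

```python
# Illustrative only: calling a containerized inference service over HTTP.
# The port, path, and JSON fields below are assumptions; each service
# defines its own request and response schema.
import requests

response = requests.post(
    "http://localhost:8080/generate",   # hypothetical local endpoint
    json={"text": "Translate to French: Hello, world!"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```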
Model¶
A Model describes the artifacts that comprise an inference model. Dyff can
create InferenceServices automatically for models from common sources like
Hugging Face. In most cases, the model artifacts, such as neural network
weights, simply get loaded into a “runner” container to expose the model as an
inference service, so services backed by models are cheap to create.
InferenceSession¶
An InferenceSession is a running instance of an InferenceService. Multiple
replicas of the service can be run in a single session to increase throughput.
Dyff automatically orchestrates the computational resources required, including
GPU accelerators for neural network models.
InferenceSessions serve two purposes. First, platform users can use them to perform inference interactively via the REST API. This is useful for prototyping evaluations. Second, InferenceSessions are used by Evaluations; the evaluation machinery is implemented as a “client” of the session that feeds in input data taken from a Dataset.
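As a sketch of the interactive case, the request below calls a running session through the platform’s REST API. The route pattern, session ID, and token handling are assumptions for illustration; consult the API reference of your Dyff instance for the actual paths and authentication scheme.

```python
# Illustrative sketch: interactive inference against a running
# InferenceSession via the platform's REST API. The route, session ID,
# and token below are placeholders, not the documented Dyff routes.
import requests

DYFF_ENDPOINT = "https://api.example.com/v0"   # placeholder instance URL
SESSION_ID = "<session-id>"                    # placeholder session ID
SESSION_TOKEN = "<session-token>"              # placeholder access token

response = requests.post(
    f"{DYFF_ENDPOINT}/inferencesessions/{SESSION_ID}/infer",  # hypothetical route
    headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
    json={"text": "What is the capital of France?"},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```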
Dataset¶
A Dataset is a set of input instances on which to evaluate systems. Dyff uses
the Apache Arrow format to represent datasets. The Arrow dataset format is a
columnar format optimized for data science and machine learning workflows, and
it is largely interconvertible with JSON and Pandas DataFrame formats. An Arrow
dataset has a static schema describing the names and types of its columns.
Dyff includes various schemas for common types of data.
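For a concrete sense of the format, the sketch below builds a tiny Arrow table with pyarrow and converts it to Pandas and plain Python records. The column names and schema are made up for illustration and are not one of Dyff’s built-in schemas.

```python
# A tiny Arrow dataset built with pyarrow. The column names and schema are
# made up for illustration; Dyff ships its own schemas for common data types.
import pyarrow as pa

schema = pa.schema(
    [
        ("_index_", pa.int64()),  # hypothetical instance identifier column
        ("text", pa.string()),    # hypothetical input text column
    ]
)

table = pa.table(
    {
        "_index_": [0, 1, 2],
        "text": [
            "Translate to French: Hello",
            "Translate to French: Goodbye",
            "Translate to French: Thank you",
        ],
    },
    schema=schema,
)

# Arrow tables convert readily to and from Pandas and JSON-like records.
df = table.to_pandas()
records = table.to_pylist()
```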
Evaluation¶
An Evaluation is the process of making an inference for each instance in a
dataset using a given inference service – for example, classifying all of the
images in ImageNet using a particular neural network model. The result of an
evaluation is another Apache Arrow dataset containing the inference outputs.
For a typical classifier, for instance, each output instance would contain the
top-\(k\) highest-scoring label predictions. Dyff scales to large evaluations
by running multiple replicas of the system under test.
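The sketch below shows one way to inspect such outputs locally once they have been downloaded; the directory path and column layout are assumptions, since the actual output schema depends on the inference service’s interface.

```python
# Illustrative only: loading evaluation outputs stored as an Arrow/Parquet
# dataset. The path is a placeholder, and the column layout depends on the
# inference service's output schema.
import pyarrow.dataset as ds

outputs = ds.dataset("evaluation-outputs/", format="parquet")  # placeholder path
df = outputs.to_table().to_pandas()
print(df.head())  # e.g. one row per input instance with its top-k predictions
```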
Measurement¶
A Measurement is the result of transforming raw inference outputs into
meaningful performance statistics by applying a scoring Method to those
outputs. For a simple classification task, for example, the method might assign
a 0-1 score to each instance according to whether the correct label was among
the top-\(k\) highest-scoring predicted labels. The output of a measurement is
another Arrow dataset. Any number of applicable methods can be run against a
single set of evaluation outputs.
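A toy example of the kind of scoring a Method might implement is sketched below; the column names ("predictions", "label", "_index_") are illustrative, not part of a required Dyff schema.

```python
# Toy sketch of a top-k scoring method: each instance gets a 0-1 score for
# whether its true label appears among the top-k predictions. Column names
# are illustrative.
import pyarrow as pa

def topk_scores(evaluation_outputs: pa.Table, k: int = 5) -> pa.Table:
    scores = []
    for row in evaluation_outputs.to_pylist():
        top_k = row["predictions"][:k]  # labels assumed ordered by score
        scores.append(
            {"_index_": row["_index_"], "topk": int(row["label"] in top_k)}
        )
    # The measurement result is itself an Arrow dataset.
    return pa.Table.from_pylist(scores)
```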
Note
In previous Dyff versions, the Report resource provided functionality similar
to the combination of Method + Measurement, but with less flexibility. Report
is deprecated and will be removed in a future version.
SafetyCase¶
Finally, a SafetyCase is an artifact suitable for human consumption that
summarizes the results of one or more Measurements to describe the overall
evidence for the safety or unsafety of a system for a particular use case.
Safety cases are also generated by applying Methods, but the methods that
produce safety cases are implemented as Jupyter notebooks. The notebooks
generate output documents consisting of formatted text, figures, tables, and
other multimedia content. The Dyff Platform, using Jupyter tools, renders the
output documents as HTML pages and serves them on the Web.
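To give a flavor of what such a notebook might contain, the cell below loads measurement scores and renders a small summary; the path and column name are placeholders, not a Dyff convention.

```python
# Sketch of a notebook cell a safety-case method might contain: load
# measurement scores and render a short summary. The path and column
# name are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

scores = pd.read_parquet("measurement-outputs/")   # placeholder path
summary = scores["topk"].agg(["mean", "count"])    # e.g. headline top-k accuracy
print(summary)

fig, ax = plt.subplots()
scores["topk"].value_counts().sort_index().plot.bar(ax=ax, title="Top-k hit rate")
```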