Analyze evaluation results

The raw outputs of an AI/ML system don’t tell us anything about its safety or performance properties; they need to be further analyzed to extract insights and conclusions. Within Dyff, we use the generic term “analysis” for all of the data processing activities that happen downstream of the “evaluation” step. To run an analysis, we first need to implement an analysis Method and upload it to Dyff.

An analysis Method can produce two different kinds of outputs:

Measurements

A Measurement is a set of numbers that quantify some aspect of system performance. Measurements are scoped to a single Evaluation, meaning that the Measurement quantifies performance for a specific run of the AI/ML system on one specific input dataset. A typical example of a Measurement is the mean classification error on a labeled input dataset.

Safety Cases

A SafetyCase is a document intended for human readers that compiles and presents evidence for the safety (or lack of safety) of a system in a particular use case and context. Safety cases are scoped to a single AI/ML system, represented in Dyff by an InferenceService resource. A safety case will typically include many related Measurements of the system produced using various input datasets and analysis methods. Safety cases in Dyff are rendered as HTML documents containing text, tables, charts, and other graphics.

In this guide, we focus on implementing Methods that create Measurements. Safety cases are covered in the Safety Case Guide.

Implementing a measurement Method in Python

An analysis Method is just a function that takes some data as inputs, does some computation, and produces an output. Dyff supports implementing Methods as Python functions. In this guide, we’ll implement a Method that computes the mean word length of the input prompts and generated text completions produced from an evaluation of a generative language model.

Define the schema of the output Measurement

The first step is to define the schema of the Measurements that will be produced by the Method. In our example, we do this by creating a Pydantic model type that describes the measurement data:

# mypy: disable-error-code="import-untyped"
import re
import statistics
from typing import Iterable

import pandas
import pyarrow
import pydantic

from dyff.schema.dataset import ReplicatedItem
from dyff.schema.dataset.arrow import arrow_schema


class WordLengthScoredItem(ReplicatedItem):
    meanPromptWordLength: float = pydantic.Field(
        description="Mean number of characters in the words in the prompt text."
    )
    meanCompletionWordLength: float = pydantic.Field(
        description="Mean number of characters in the words in the system completions."
    )
    wordOfTheDay: str = pydantic.Field(
        description="The value of the 'wordOfTheDay' argument to the Method."
    )

Notice that the data model inherits from ReplicatedItem, which adds the fields _index_ and _replication_. This is because we want to have one row in the Measurement output for each (input, output) pair generated by the system, and each output is uniquely identified by an (_index_, _replication_) tuple.
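As a quick sanity check, you can inspect the Arrow schema generated from the model (a sketch; the exact field order may differ):

schema = arrow_schema(WordLengthScoredItem)
print(schema.names)
# Expect the identity fields inherited from ReplicatedItem alongside
# the fields declared above, e.g.:
# ['_index_', '_replication_', 'meanPromptWordLength',
#  'meanCompletionWordLength', 'wordOfTheDay']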

Implement the Method as a function

Now, we actually implement the analysis method. You can implement the method in Python by defining a Python function with a signature that follows a certain pattern, illustrated here:

def word_length(
    args: dict[str, str],
    *,
    prompts: pyarrow.dataset.Dataset,
    completions: pyarrow.dataset.Dataset,
) -> Iterable[pyarrow.RecordBatch]:

The function must take one positional argument, which is a dict mapping argument names to argument values. When running the Method, you can pass in configuration settings via the arguments dict.
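Note that the argument values arrive as strings (the dict is typed dict[str, str]), so any non-string setting must be parsed inside the function. For instance, a hypothetical numeric threshold argument (not part of this guide's Method) could be handled like this:

# 'threshold' is a hypothetical argument shown for illustration;
# argument values are always strings and must be converted explicitly.
threshold = float(args.get("threshold", "0.5"))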

The function must also define one or more keyword-only arguments. Each keyword argument receives a PyArrow Dataset instance, which is bound to the data associated with a specific Dyff entity when the method runs. For example, almost all analysis methods need to access the outputs of the AI/ML system to evaluate its performance, so you might define a keyword argument called outputs to receive this data. The names of these arguments are arbitrary; in the example, we call the outputs dataset completions.

The names and types of all arguments and data inputs, and the type and schema of the output, must be declared when creating the Method resource through the Dyff API. Remember that inputs of type MethodInputKind.Dataset will have a schema conforming to the required fields for input data, and inputs of type MethodInputKind.Evaluation will have a schema conforming to the required fields for output data.

In our example, we want to access both the input and output data from the evaluation, so that we can compute the mean word lengths for both the inputs and outputs. So, we specify two keyword-only arguments, one for each of these data artifacts. Then, we just have to do some basic Pandas operations to create our output measurement, taking advantage of the easy conversion between Pandas and PyArrow data formats.

Note

In a “production-quality” implementation, you should consider processing the data incrementally as a stream of batches whenever possible, rather than loading the entire dataset into memory in a DataFrame.
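As a rough sketch of that incremental style, here is a hypothetical variant that scores only the completions data, one batch at a time (merging two input streams batch-by-batch takes more bookkeeping, and this variant would need its own, smaller output schema):

def completion_word_length_streaming(
    args: dict[str, str],
    *,
    completions: pyarrow.dataset.Dataset,
) -> Iterable[pyarrow.RecordBatch]:
    # Hypothetical sketch: process one RecordBatch at a time instead of
    # materializing the whole dataset in a single DataFrame.
    # _mean_word_length is the helper from the main example, assumed
    # lifted to module scope.
    for batch in completions.to_batches():
        df = batch.to_pandas()
        df["meanCompletionWordLength"] = df.apply(
            lambda row: _mean_word_length(row["responses"][0]["text"]), axis=1
        )
        df = df.drop(columns="responses")
        df["wordOfTheDay"] = args["wordOfTheDay"]
        yield pyarrow.RecordBatch.from_pandas(df)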

Putting it all together, the full implementation is shown below. After processing the data, we return the results as a stream of PyArrow batches, using the schema we defined earlier to specify the PyArrow schema of the result:

def word_length(
    args: dict[str, str],
    *,
    prompts: pyarrow.dataset.Dataset,
    completions: pyarrow.dataset.Dataset,
) -> Iterable[pyarrow.RecordBatch]:
    schema = arrow_schema(WordLengthScoredItem)
    prompts_df: pandas.DataFrame = prompts.to_table().to_pandas()
    completions_df: pandas.DataFrame = completions.to_table().to_pandas()

    def _mean_word_length(text: str) -> float:
        # Split on runs of whitespace and discard empty tokens so that
        # leading, trailing, or repeated spaces don't skew the mean.
        words = [word for word in re.split(r"\s+", text.strip()) if word]
        if not words:
            return 0.0
        return statistics.mean(len(word) for word in words)

    # Score each row, then drop the raw text columns, which are not
    # part of the output schema.
    prompts_df["meanPromptWordLength"] = prompts_df.apply(
        lambda row: _mean_word_length(row["text"]), axis=1
    )
    prompts_df = prompts_df.drop(columns="text")
    completions_df["meanCompletionWordLength"] = completions_df.apply(
        lambda row: _mean_word_length(row["responses"][0]["text"]), axis=1
    )
    completions_df = completions_df.drop(columns="responses")

    measurement_df = prompts_df.merge(completions_df, on="_index_")
    measurement_df["wordOfTheDay"] = args["wordOfTheDay"]

    yield from pyarrow.Table.from_pandas(measurement_df, schema=schema).to_batches()

Deploying and running the method

To run an analysis method on Dyff, you need to create three resources:

  1. A Module containing the implementation code.

  2. A Method that describes the method and its inputs and outputs, and references the Module from step (1).

  3. A Measurement that references the Method from step (2) and specifies the IDs of specific resources to pass as inputs.

The APIs for completing these steps are the same whether you’re using the DyffLocalPlatform or a remote Client. You can use an instance of the local platform to develop and test these specifications, then simply swap in a remote Client to create the resources for real.

Create a Module

A Module is just a directory tree containing code files. You create a Module in basically the same way as a Dataset. Assuming you’ve implemented your method in a file called my_package.py in the directory /home/me/dyff/my-module, you would create and upload the package like this:

from __future__ import annotations

from pathlib import Path

import my_package

from dyff.audit.local import DyffLocalPlatform
from dyff.schema.dataset import arrow
from dyff.schema.platform import *
from dyff.schema.requests import *

ACCOUNT: str = ...
ROOT_DIR: Path = Path("/home/me/dyff")

# Develop using the local platform
dyffapi = DyffLocalPlatform(
    storage_root=ROOT_DIR / ".dyff-local",
)
# When you're ready, switch to the remote platform:
# dyffapi = Client(...)

module_root = str(ROOT_DIR / "my-module")
module = dyffapi.modules.create_package(
    module_root,
    account=ACCOUNT,
    name="my-module",
)
dyffapi.modules.upload_package(module, module_root)
print(module.json(indent=2))

Create a Method

The Method resource basically specifies the “function signature” of your method. There’s a lot to specify, but it’s all pretty straightforward. The comments in the example explain some of the fields in the specification further:

method_description = """
# Summary

Computes the mean length of words in the input and output
datasets. The description uses [Markdown](https://www.markdownguide.org) syntax.
"""
method_request = MethodCreateRequest(
    name="mean-word-length",
    # The analysis results describe one Evaluation
    scope=MethodScope.Evaluation,
    description=method_description,
    # The method is implemented as the Python function 'my_package.word_length()'
    implementation=MethodImplementation(
        kind=MethodImplementationKind.PythonFunction,
        pythonFunction=MethodImplementationPythonFunction(
            fullyQualifiedName="my_package.word_length",
        ),
    ),
    # The method accepts one argument called 'wordOfTheDay'
    parameters=[
        MethodParameter(keyword="wordOfTheDay", description="A cromulent word"),
    ],
    # The method accepts two PyArrow datasets as inputs:
    # - The one called 'prompts' is from a Dataset resource (i.e., system inputs)
    # - The one called 'completions' is from an Evaluation resource (system outputs)
    inputs=[
        MethodInput(kind=MethodInputKind.Dataset, keyword="prompts"),
        MethodInput(kind=MethodInputKind.Evaluation, keyword="completions"),
    ],
    # The method produces a Measurement
    output=MethodOutput(
        kind=MethodOutputKind.Measurement,
        measurement=MeasurementSpec(
            name="mean-word-length",
            description="This is also **Markdown**.",
            # There is (at least) one row per input (_index_ x _replication_)
            level=MeasurementLevel.Instance,
            # This is the schema of the output
            schema=DataSchema(
                arrowSchema=arrow.encode_schema(
                    arrow.arrow_schema(my_package.WordLengthScoredItem)
                ),
            ),
        ),
    ),
    # The Module containing the Method code
    modules=[module.id],
    account=ACCOUNT,
)
method = dyffapi.methods.create(method_request)
print(method.json(indent=2))

Note that inputs can also be Measurements and Reports. For example, you could run one Method to produce an instance-level Measurement, then run another Method that takes the instance-level measurement as input and computes dataset-level summary statistics.
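As a hedged sketch, a follow-up Method that consumes the instance-level Measurement might be declared like this; MethodInputKind.Measurement and MeasurementLevel.Dataset are assumed by analogy with the enum members used above, and my_package.word_length_summary is a hypothetical function (consult the dyff.schema.platform reference for the exact names):

summary_request = MethodCreateRequest(
    name="mean-word-length-summary",
    scope=MethodScope.Evaluation,
    description="Dataset-level summary of the instance-level measurement.",
    implementation=MethodImplementation(
        kind=MethodImplementationKind.PythonFunction,
        pythonFunction=MethodImplementationPythonFunction(
            # Hypothetical summary function
            fullyQualifiedName="my_package.word_length_summary",
        ),
    ),
    # The instance-level Measurement becomes an input
    inputs=[
        MethodInput(kind=MethodInputKind.Measurement, keyword="scores"),
    ],
    output=MethodOutput(
        kind=MethodOutputKind.Measurement,
        measurement=MeasurementSpec(
            name="mean-word-length-summary",
            description="Summary statistics over all instances.",
            # One row for the whole dataset, rather than one per instance
            level=MeasurementLevel.Dataset,
            schema=...,  # declared the same way as above
        ),
    ),
    modules=[module.id],
    account=ACCOUNT,
)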

Create a Measurement

Finally, the Measurement resource represents the computational work needed to run your Method on specific inputs. The “request” class here is called AnalysisCreateRequest because the same request class is also used to create other analysis resources such as SafetyCases. You can think of this step as “invoking” the method with specified arguments:

dataset_id: str = ...
evaluation_id: str = ...
analysis_request = AnalysisCreateRequest(
    account=ACCOUNT,
    method=method.id,
    arguments=[
        AnalysisArgument(keyword="wordOfTheDay", value="embiggen"),
    ],
    inputs=[
        AnalysisInput(keyword="prompts", entity=dataset_id),
        AnalysisInput(keyword="completions", entity=evaluation_id),
    ],
)
measurement = dyffapi.measurements.create(analysis_request)
print(measurement.json(indent=2))

Full Example

from __future__ import annotations

from pathlib import Path

import my_package

from dyff.audit.local import DyffLocalPlatform
from dyff.schema.dataset import arrow
from dyff.schema.platform import *
from dyff.schema.requests import *

ACCOUNT: str = ...
ROOT_DIR: Path = Path("/home/me/dyff")

# Develop using the local platform
dyffapi = DyffLocalPlatform(
    storage_root=ROOT_DIR / ".dyff-local",
)
# When you're ready, switch to the remote platform:
# dyffapi = Client(...)

module_root = str(ROOT_DIR / "my-module")
module = dyffapi.modules.create_package(
    module_root,
    account=ACCOUNT,
    name="my-module",
)
dyffapi.modules.upload_package(module, module_root)
print(module.json(indent=2))

method_description = """
# Summary

Computes the mean length of words in the input and output
datasets. The description uses [Markdown](https://www.markdownguide.org) syntax.
"""
method_request = MethodCreateRequest(
    name="mean-word-length",
    # The analysis results describe one Evaluation
    scope=MethodScope.Evaluation,
    description=method_description,
    # The method is implemented as the Python function 'my_package.word_length()'
    implementation=MethodImplementation(
        kind=MethodImplementationKind.PythonFunction,
        pythonFunction=MethodImplementationPythonFunction(
            fullyQualifiedName="my_package.word_length",
        ),
    ),
    # The method accepts one argument called 'wordOfTheDay'
    parameters=[
        MethodParameter(keyword="wordOfTheDay", description="A cromulent word"),
    ],
    # The method accepts two PyArrow datasets as inputs:
    # - The one called 'prompts' is from a Dataset resource (i.e., system inputs)
    # - The one called 'completions' is from an Evaluation resource (system outputs)
    inputs=[
        MethodInput(kind=MethodInputKind.Dataset, keyword="prompts"),
        MethodInput(kind=MethodInputKind.Evaluation, keyword="completions"),
    ],
    # The method produces a Measurement
    output=MethodOutput(
        kind=MethodOutputKind.Measurement,
        measurement=MeasurementSpec(
            name="mean-word-length",
            description="This is also **Markdown**.",
            # There is (at least) one row per input (_index_ x _replication_)
            level=MeasurementLevel.Instance,
            # This is the schema of the output
            schema=DataSchema(
                arrowSchema=arrow.encode_schema(
                    arrow.arrow_schema(my_package.WordLengthScoredItem)
                ),
            ),
        ),
    ),
    # The Module containing the Method code
    modules=[module.id],
    account=ACCOUNT,
)
method = dyffapi.methods.create(method_request)
print(method.json(indent=2))

dataset_id: str = ...
evaluation_id: str = ...
analysis_request = AnalysisCreateRequest(
    account=ACCOUNT,
    method=method.id,
    arguments=[
        AnalysisArgument(keyword="wordOfTheDay", value="embiggen"),
    ],
    inputs=[
        AnalysisInput(keyword="prompts", entity=dataset_id),
        AnalysisInput(keyword="completions", entity=evaluation_id),
    ],
)
measurement = dyffapi.measurements.create(analysis_request)
print(measurement.json(indent=2))