Analyze evaluation results¶
The raw outputs of an AI/ML system don’t tell us anything about its safety or
performance properties; they need to be further analyzed to extract insights and
conclusions. Within Dyff, we use the generic term “analysis” for all of
the data processing activities that happen downstream of the “evaluation” step.
To run an analysis, we first need to implement an analysis
Method
and upload it to Dyff.
An analysis Method can produce two different kinds of outputs:
- Measurements: A Measurement is a set of numbers that quantify some aspect of system performance. Measurements are scoped to a single Evaluation, meaning that the Measurement quantifies performance for a specific run of the AI/ML system on one specific input dataset. A typical example of a Measurement is the mean classification error on a labeled input dataset.
- Safety Cases: A SafetyCase is a document intended for human readers that compiles and presents the evidence of the safety (or un-safety) of a system for a particular use case and context. Safety cases are scoped to a single AI/ML system, represented in Dyff by an InferenceService resource. A safety case typically will include many related Measurements of the system produced using various input datasets and analysis methods. Safety cases in Dyff are rendered as HTML documents containing text, tables, charts, and other graphics.
In this guide, we focus on implementing Methods that create Measurements. Safety cases are covered in the Safety Case Guide.
Implementing a measurement Method in Python¶
An analysis Method is just a function that takes some data as inputs, does some computation, and produces an output. Dyff supports implementing Methods as Python functions. In this guide, we’ll implement a Method that computes the mean word length of the input prompts and generated text completions produced from an evaluation of a generative language model.
Define the schema of the output Measurement¶
The first step is to define the schema of the Measurements that will be produced by the Method. In our example, we do this by creating a Pydantic model type that describes the measurement data:
# mypy: disable-error-code="import-untyped"
import re
import statistics
from typing import Iterable

import pandas
import pyarrow
import pydantic

from dyff.schema.dataset import ReplicatedItem
from dyff.schema.dataset.arrow import arrow_schema


class WordLengthScoredItem(ReplicatedItem):
    meanPromptWordLength: float = pydantic.Field(
        description="Mean number of characters in the words in the prompt text."
    )
    meanCompletionWordLength: float = pydantic.Field(
        description="Mean number of characters in the words in the system completions."
    )
    wordOfTheDay: str = pydantic.Field(
        description="The value of the 'wordOfTheDay' argument to the Method."
    )
Notice that the data model inherits from ReplicatedItem, which adds the fields _index_ and _replication_. This is because we want to have one row in the Measurement output for each (input, output) pair generated by the system, and each output is uniquely identified by an (_index_, _replication_) tuple.
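If you want to see exactly what the resulting Arrow schema contains, you can inspect it directly. The snippet below reuses the class defined above; the exact field names and ordering that arrow_schema produces may differ slightly from the illustrative comment, so treat it as a sketch rather than a schema reference.

from dyff.schema.dataset.arrow import arrow_schema

schema = arrow_schema(WordLengthScoredItem)
print(schema.names)
# Expect the inherited keys alongside the fields declared above, e.g. something like:
# ['_index_', '_replication_', 'meanPromptWordLength', 'meanCompletionWordLength', 'wordOfTheDay']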
Implement the Method as a function¶
Now we actually implement the analysis method. In Python, you do this by defining a function whose signature follows a certain pattern, illustrated here:
def word_length(
    args: dict[str, str],
    *,
    prompts: pyarrow.dataset.Dataset,
    completions: pyarrow.dataset.Dataset,
) -> Iterable[pyarrow.RecordBatch]:
The function must take one positional argument, which is a dict
mapping
argument names to argument values. When running the Method, you can pass in
configuration settings via the arguments dict.
The function must also define one or more keyword-only arguments. All of the
keyword arguments accept a PyArrow Dataset
instance. These are bound to the
data associated with specific Dyff entities when the method is run. For example,
almost all analysis methods will need to access the outputs of the AI/ML system
to evaluate its performance. You could define a keyword argument called
outputs
to receive this data. The names of these arguments are arbitrary; in
the example, we call the outputs dataset completions
.
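Note that the argument values are plain strings (the dict is typed dict[str, str]), so convert them inside your method if you need other types. A minimal sketch, with a hypothetical maxWords argument and an arbitrarily named outputs keyword:

from typing import Iterable

import pyarrow


def my_method(
    args: dict[str, str],
    *,
    outputs: pyarrow.dataset.Dataset,  # the name is arbitrary; bound when the Method runs
) -> Iterable[pyarrow.RecordBatch]:
    # Argument values arrive as strings; parse them inside the method as needed.
    max_words = int(args.get("maxWords", "100"))  # 'maxWords' is a hypothetical argument
    # ... process 'outputs' and yield RecordBatches conforming to your schema ...
    yield from ()  # placeholder so this sketch is a well-formed generator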
The names and types of all arguments and data inputs, and the type and schema of
the output, must be declared when creating the
Method
resource through the Dyff API. Remember
that inputs of type MethodInputKind.Dataset
will have a schema conforming
to the required fields for input data, and inputs of
type MethodInputKind.Evaluation
will have a schema conforming to the
required fields for output data.
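As a concrete illustration of the difference (using the column names from this guide's example, not an exhaustive schema reference), a Dataset-backed input exposes the prompt under a text column, while an Evaluation-backed input exposes a responses list whose entries each carry their own text. This sketch is not a complete Method; it just prints the first row of each input:

import pyarrow


def peek(
    args: dict[str, str],
    *,
    prompts: pyarrow.dataset.Dataset,      # MethodInputKind.Dataset (system inputs)
    completions: pyarrow.dataset.Dataset,  # MethodInputKind.Evaluation (system outputs)
) -> None:
    prompt_row = prompts.to_table().to_pandas().iloc[0]
    completion_row = completions.to_table().to_pandas().iloc[0]
    print(prompt_row["text"])                      # input prompt text
    print(completion_row["responses"][0]["text"])  # first generated completion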
In our example, we want to access both the input and output data from the evaluation, so that we can compute the mean word lengths for both the inputs and outputs. So, we specify two keyword-only arguments, one for each of these data artifacts. Then, we just have to do some basic Pandas operations to create our output measurement, taking advantage of the easy conversion between Pandas and PyArrow data formats.
Note
In a “production-quality” implementation, you should consider processing the data incrementally as a stream of batches whenever possible, rather than loading the entire dataset into memory in a DataFrame. A sketch of this batch-wise approach follows the full listing below.
Finally, after processing the data, we return our results as a stream of PyArrow batches, using the schema we defined earlier to specify the PyArrow schema of the result. The complete implementation looks like this:
# mypy: disable-error-code="import-untyped"
import re
import statistics
from typing import Iterable

import pandas
import pyarrow
import pydantic

from dyff.schema.dataset import ReplicatedItem
from dyff.schema.dataset.arrow import arrow_schema


class WordLengthScoredItem(ReplicatedItem):
    meanPromptWordLength: float = pydantic.Field(
        description="Mean number of characters in the words in the prompt text."
    )
    meanCompletionWordLength: float = pydantic.Field(
        description="Mean number of characters in the words in the system completions."
    )
    wordOfTheDay: str = pydantic.Field(
        description="The value of the 'wordOfTheDay' argument to the Method."
    )


def word_length(
    args: dict[str, str],
    *,
    prompts: pyarrow.dataset.Dataset,
    completions: pyarrow.dataset.Dataset,
) -> Iterable[pyarrow.RecordBatch]:
    schema = arrow_schema(WordLengthScoredItem)
    prompts_df: pandas.DataFrame = prompts.to_table().to_pandas()
    completions_df: pandas.DataFrame = completions.to_table().to_pandas()

    def _mean_word_length(text: str) -> float:
        words = re.split(r"\s", text.strip())
        if len(words) == 0:
            return 0.0
        else:
            return statistics.mean(len(word) for word in words)

    # Score each prompt and each completion, then drop the raw text columns,
    # which are not part of the output schema.
    prompts_df["meanPromptWordLength"] = prompts_df.apply(
        lambda row: _mean_word_length(row["text"]), axis=1
    )
    prompts_df = prompts_df.drop(columns="text")
    completions_df["meanCompletionWordLength"] = completions_df.apply(
        lambda row: _mean_word_length(row["responses"][0]["text"]), axis=1
    )
    completions_df = completions_df.drop(columns="responses")

    # Join the prompt scores to the completion scores on the shared _index_ key
    measurement_df = prompts_df.merge(completions_df, on="_index_")
    measurement_df["wordOfTheDay"] = args["wordOfTheDay"]

    yield from pyarrow.Table.from_pandas(measurement_df, schema=schema).to_batches()
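As mentioned in the note above, a more scalable variant would avoid materializing everything in one DataFrame. The following sketch shows one way to do that for this same Method: it keeps only the (typically much smaller) per-prompt scores in memory and streams the completions one RecordBatch at a time. It reuses WordLengthScoredItem and the imports from the listing above; the join strategy is an illustration, not a prescribed Dyff pattern.

def _mean_word_length(text: str) -> float:
    words = re.split(r"\s", text.strip())
    if len(words) == 0:
        return 0.0
    return statistics.mean(len(word) for word in words)


def word_length_streaming(
    args: dict[str, str],
    *,
    prompts: pyarrow.dataset.Dataset,
    completions: pyarrow.dataset.Dataset,
) -> Iterable[pyarrow.RecordBatch]:
    schema = arrow_schema(WordLengthScoredItem)
    # Compute the per-prompt scores once, keyed by _index_
    prompts_df = prompts.to_table().to_pandas().set_index("_index_")
    prompt_scores = prompts_df["text"].map(_mean_word_length)

    # Stream the (potentially much larger) completions one RecordBatch at a time
    for batch in completions.to_batches():
        batch_df = batch.to_pandas()
        batch_df["meanCompletionWordLength"] = batch_df["responses"].map(
            lambda responses: _mean_word_length(responses[0]["text"])
        )
        # Look up the matching prompt score for each completion row
        batch_df["meanPromptWordLength"] = batch_df["_index_"].map(prompt_scores)
        batch_df["wordOfTheDay"] = args["wordOfTheDay"]
        # Columns not in the schema (e.g. 'responses') are ignored by from_pandas
        yield from pyarrow.Table.from_pandas(batch_df, schema=schema).to_batches()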
Deploying and running the method¶
To run an analysis method on Dyff, you need to create three resources:
1. A Module containing the implementation code.
2. A Method that describes the method and its inputs and outputs, and references the Module from step (1).
3. A Measurement that references the Method from step (2) and specifies the IDs of specific resources to pass as inputs.
The APIs for completing these steps are the same whether you’re using the DyffLocalPlatform or a remote Client. You can use an instance of the local platform to develop and test these specifications, then simply swap in a remote client to create the resources for real.
Create a Module¶
A Module is just a directory tree containing code files. You create a Module in basically the same way as a Dataset. Assuming you’ve implemented your method in a file called my_package.py at the top level of the directory /home/me/dyff/my-module, you would create and upload the package like this:
from __future__ import annotations

from pathlib import Path

import my_package

from dyff.audit.local import DyffLocalPlatform
from dyff.schema.dataset import arrow
from dyff.schema.platform import *
from dyff.schema.requests import *

ACCOUNT: str = ...
ROOT_DIR: Path = Path("/home/me/dyff")

# Develop using the local platform
dyffapi = DyffLocalPlatform(
    storage_root=ROOT_DIR / ".dyff-local",
)
# When you're ready, switch to the remote platform:
# dyffapi = Client(...)

module_root = str(ROOT_DIR / "my-module")
module = dyffapi.modules.create_package(
    module_root,
    account=ACCOUNT,
    name="my-module",
)
dyffapi.modules.upload_package(module, module_root)
print(module.json(indent=2))
Create a Method¶
The Method
resource basically specifies the
“function signature” of your method. There’s a lot to specify, but it’s all
pretty straightforward. The comments in the example explain some of the fields
in the specification further:
method_description = """
# Summary

Computes the mean length of words in the input and output
datasets. The description uses [Markdown](https://www.markdownguide.org) syntax.
"""
method_request = MethodCreateRequest(
    name="mean-word-length",
    # The analysis results describe one Evaluation
    scope=MethodScope.Evaluation,
    description=method_description,
    # The method is implemented as the Python function 'my_package.word_length()'
    implementation=MethodImplementation(
        kind=MethodImplementationKind.PythonFunction,
        pythonFunction=MethodImplementationPythonFunction(
            fullyQualifiedName="my_package.word_length",
        ),
    ),
    # The method accepts one argument called 'wordOfTheDay'
    parameters=[
        MethodParameter(keyword="wordOfTheDay", description="A cromulent word"),
    ],
    # The method accepts two PyArrow datasets as inputs:
    # - The one called 'prompts' is from a Dataset resource (i.e., system inputs)
    # - The one called 'completions' is from an Evaluation resource (system outputs)
    inputs=[
        MethodInput(kind=MethodInputKind.Dataset, keyword="prompts"),
        MethodInput(kind=MethodInputKind.Evaluation, keyword="completions"),
    ],
    # The method produces a Measurement
    output=MethodOutput(
        kind=MethodOutputKind.Measurement,
        measurement=MeasurementSpec(
            name="mean-word-length",
            description="This is also **Markdown**.",
            # There is (at least) one row per input (_index_ x _replication_)
            level=MeasurementLevel.Instance,
            # This is the schema of the output
            schema=DataSchema(
                arrowSchema=arrow.encode_schema(
                    arrow.arrow_schema(my_package.WordLengthScoredItem)
                ),
            ),
        ),
    ),
    # The Module containing the Method code
    modules=[module.id],
    account=ACCOUNT,
)
method = dyffapi.methods.create(method_request)
print(method.json(indent=2))
Note that inputs can also be Measurements
and Reports
. For example, you
could run one Method to produce an instance-level Measurement, then run another
Method that takes the instance-level measurement as input and computes
dataset-level summary statistics.
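For instance, a second Method that consumes the instance-level measurement from this guide and produces a single summary row might look like the following sketch. The function and keyword names are hypothetical, the input would be declared as a Measurement input rather than a Dataset or Evaluation, and the output schema is built directly with pyarrow to keep the sketch self-contained.

from typing import Iterable

import pyarrow


def word_length_summary(
    args: dict[str, str],
    *,
    scores: pyarrow.dataset.Dataset,  # hypothetical keyword bound to the instance-level Measurement
) -> Iterable[pyarrow.RecordBatch]:
    schema = pyarrow.schema([("meanCompletionWordLength", pyarrow.float64())])
    scores_df = scores.to_table().to_pandas()
    # One row summarizing the whole dataset
    summary = pyarrow.Table.from_pydict(
        {"meanCompletionWordLength": [float(scores_df["meanCompletionWordLength"].mean())]},
        schema=schema,
    )
    yield from summary.to_batches()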
Create a Measurement¶
Finally, the Measurement
resource represents the
computational work needed to run your Method on specific inputs. The “request”
class here is called AnalysisCreateRequest
because the same request class is also used to create other analysis resources
such as SafetyCases
. You can think of
this step as “invoking” the method with specified arguments:
dataset_id: str = ...
evaluation_id: str = ...
analysis_request = AnalysisCreateRequest(
    account=ACCOUNT,
    method=method.id,
    arguments=[
        AnalysisArgument(keyword="wordOfTheDay", value="embiggen"),
    ],
    inputs=[
        AnalysisInput(keyword="prompts", entity=dataset_id),
        AnalysisInput(keyword="completions", entity=evaluation_id),
    ],
)
measurement = dyffapi.measurements.create(analysis_request)
print(measurement.json(indent=2))
Full Example¶
from __future__ import annotations

from pathlib import Path

import my_package

from dyff.audit.local import DyffLocalPlatform
from dyff.schema.dataset import arrow
from dyff.schema.platform import *
from dyff.schema.requests import *

ACCOUNT: str = ...
ROOT_DIR: Path = Path("/home/me/dyff")

# Develop using the local platform
dyffapi = DyffLocalPlatform(
    storage_root=ROOT_DIR / ".dyff-local",
)
# When you're ready, switch to the remote platform:
# dyffapi = Client(...)

module_root = str(ROOT_DIR / "my-module")
module = dyffapi.modules.create_package(
    module_root,
    account=ACCOUNT,
    name="my-module",
)
dyffapi.modules.upload_package(module, module_root)
print(module.json(indent=2))

method_description = """
# Summary

Computes the mean length of words in the input and output
datasets. The description uses [Markdown](https://www.markdownguide.org) syntax.
"""
method_request = MethodCreateRequest(
    name="mean-word-length",
    # The analysis results describe one Evaluation
    scope=MethodScope.Evaluation,
    description=method_description,
    # The method is implemented as the Python function 'my_package.word_length()'
    implementation=MethodImplementation(
        kind=MethodImplementationKind.PythonFunction,
        pythonFunction=MethodImplementationPythonFunction(
            fullyQualifiedName="my_package.word_length",
        ),
    ),
    # The method accepts one argument called 'wordOfTheDay'
    parameters=[
        MethodParameter(keyword="wordOfTheDay", description="A cromulent word"),
    ],
    # The method accepts two PyArrow datasets as inputs:
    # - The one called 'prompts' is from a Dataset resource (i.e., system inputs)
    # - The one called 'completions' is from an Evaluation resource (system outputs)
    inputs=[
        MethodInput(kind=MethodInputKind.Dataset, keyword="prompts"),
        MethodInput(kind=MethodInputKind.Evaluation, keyword="completions"),
    ],
    # The method produces a Measurement
    output=MethodOutput(
        kind=MethodOutputKind.Measurement,
        measurement=MeasurementSpec(
            name="mean-word-length",
            description="This is also **Markdown**.",
            # There is (at least) one row per input (_index_ x _replication_)
            level=MeasurementLevel.Instance,
            # This is the schema of the output
            schema=DataSchema(
                arrowSchema=arrow.encode_schema(
                    arrow.arrow_schema(my_package.WordLengthScoredItem)
                ),
            ),
        ),
    ),
    # The Module containing the Method code
    modules=[module.id],
    account=ACCOUNT,
)
method = dyffapi.methods.create(method_request)
print(method.json(indent=2))

dataset_id: str = ...
evaluation_id: str = ...
analysis_request = AnalysisCreateRequest(
    account=ACCOUNT,
    method=method.id,
    arguments=[
        AnalysisArgument(keyword="wordOfTheDay", value="embiggen"),
    ],
    inputs=[
        AnalysisInput(keyword="prompts", entity=dataset_id),
        AnalysisInput(keyword="completions", entity=evaluation_id),
    ],
)
measurement = dyffapi.measurements.create(analysis_request)
print(measurement.json(indent=2))