dyff.audit¶
The dyff.audit package provides tools for producing audit reports from the raw outputs of intelligent systems.
Installation¶
python3 -m pip install dyff-audit
Scoring Rubrics¶
Base class for Rubrics.
- class dyff.audit.scoring.base.Rubric¶
A Rubric is a mechanism for producing inference-level performance scores for the output of an inference service on a given inference task.
- abstract apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) Iterable[pyarrow.RecordBatch] ¶
Create a generator that yields batches of scored inferences.
- abstract property name: str¶
The “semi-qualified” type name of the Rubric.
The result should be such that
f"alignmentlabs.audit.scoring.{self.name}"
is the fully-qualified name of the type.
- abstract property schema: Schema¶
The PyArrow schema of the output of applying the Rubric.
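For orientation, a user-defined Rubric might look like the following sketch. The column names ("truth", "response") and the scoring logic are illustrative assumptions, not part of the library:

import pyarrow
from dyff.audit.scoring.base import Rubric

class ExactMatch(Rubric):
    """Hypothetical Rubric: a prediction is correct when it equals the ground truth."""

    @property
    def name(self) -> str:
        # For built-in Rubrics this is the semi-qualified type name;
        # a plain label is used here for illustration.
        return "ExactMatch"

    @property
    def schema(self) -> pyarrow.Schema:
        return pyarrow.schema(
            [("_index_", pyarrow.int64()), ("correct", pyarrow.bool_())]
        )

    def apply(self, task_data, predictions):
        # "_index_", "truth", and "response" are assumed column names.
        truth = {row["_index_"]: row["truth"] for row in task_data.to_table().to_pylist()}
        for batch in predictions.to_batches():
            rows = [
                {"_index_": r["_index_"], "correct": r["response"] == truth[r["_index_"]]}
                for r in batch.to_pylist()
            ]
            yield pyarrow.RecordBatch.from_pylist(rows, schema=self.schema)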
Classification¶
Rubrics related to supervised classification.
- class dyff.audit.scoring.classification.TopKAccuracy¶
Bases:
Rubric
Computes top-1 and top-5 accuracy for classification tasks.
- apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) Iterable[pyarrow.RecordBatch] ¶
Create a generator that yields batches of scored inferences.
- property name: str¶
The “semi-qualified” type name of the Rubric.
The result should be such that
f"alignmentlabs.audit.scoring.{self.name}"
is the fully-qualified name of the type.
- property schema: Schema¶
The PyArrow schema of the output of applying the Rubric.
There is one row per input instance. A prediction is scored as correct if prediction == truth. For top-5 accuracy, the instance is correct if any of the top-5 predictions was correct.
- _index_ (int64)
The index of the item in the dataset.
- _replication_ (string)
ID of the replication the item belongs to.
- top1 (bool)
0-1 indicator of whether the top-1 prediction was correct.
- top5 (bool)
0-1 indicator of whether any of the top-5 predictions were correct.
- pydantic model dyff.audit.scoring.classification.TopKAccuracyScoredItem¶
Bases:
ReplicatedItem
Schema for a single scored item produced by the TopKAccuracy rubric.
- field top1: bool [Required]¶
0-1 indicator of whether the top-1 prediction was correct.
- field top5: bool [Required]¶
0-1 indicator of whether any of the top-5 predictions were correct.
- dyff.audit.scoring.classification.top_k(prediction: Any | List[Any], truth: Any, *, k: int) int ¶
Return 1 if any of the first k elements of prediction are equal to truth, and 0 otherwise.
- Parameters:
prediction (Any or List[Any]) – Either a list of predicted labels in descending order of “score”, or a single predicted label. For a single label, all values of k are equivalent to k = 1.
truth (Any) – The true label.
k (int) – The number of predictions to consider.
- Returns:
1 if one of the top k predictions was correct, 0 otherwise.
- Return type:
int
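For example, following the documented behavior:

from dyff.audit.scoring.classification import top_k

# Ranked predictions, best first
print(top_k(["cat", "dog", "bird"], "dog", k=1))  # 0: the top-1 prediction is "cat"
print(top_k(["cat", "dog", "bird"], "dog", k=2))  # 1: "dog" appears in the top 2
print(top_k("cat", "cat", k=5))                   # 1: a single label behaves like k = 1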
Text¶
Rubrics related to text processing tasks.
- class dyff.audit.scoring.text.MatchCriterion(value)¶
Bases:
str, Enum
The match criteria defined by the SemEval2013 evaluation standard.
See MatchResult for the possible results of matching.
- exact = 'exact'¶
Tag is ignored. The prediction is correct if [start, end) match exactly. Otherwise, the prediction is incorrect.
- partial = 'partial'¶
Tag is ignored. Partial overlap is scored as partial, exact overlap is scored as correct, no overlap is scored as incorrect.
- strict = 'strict'¶
The prediction is correct if [start, end) match exactly and the tag is correct. Otherwise, the prediction is incorrect.
- type = 'type'¶
Partial overlap is scored as either correct or incorrect based on the predicted tag. No overlap is scored as incorrect.
- class dyff.audit.scoring.text.MatchResult(value)¶
Bases:
str, Enum
The possible results of a text span matching comparison as defined by the MUC-5 evaluation standard.
Note that the semantics of these results depend on which MatchCriterion is being applied.
- correct = 'correct'¶
The prediction overlapped with a true entity and was correct.
- incorrect = 'incorrect'¶
The prediction overlapped with a true entity but was incorrect.
- missing = 'missing'¶
A ground-truth entity did not overlap with any predicted entity.
- partial = 'partial'¶
The prediction overlapped a true entity partially but not exactly. Applies to
MatchCriterion.partial
only.
- spurious = 'spurious'¶
A predicted entity did not overlap with any ground-truth entity.
- class dyff.audit.scoring.text.TextSpanMatches¶
Bases:
Rubric
Computes matches between predicted and ground-truth text spans according to each criterion in MatchCriterion.
- apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) Iterable[pyarrow.RecordBatch] ¶
Create a generator that yields batches of scored inferences.
- property name: str¶
The “semi-qualified” type name of the Rubric.
The result should be such that
f"alignmentlabs.audit.scoring.{self.name}"
is the fully-qualified name of the type.
- property schema: Schema¶
The PyArrow schema of the output of applying the Rubric.
There may be 0 or more rows for each input instance (same _index_). Each row indicates which predicted span within that instance overlapped with which ground truth span, and how that overlap was scored. For example, if the predicted span at index 1 overlapped with the ground truth span at index 2, then there will be a row with prediction = 1 and truth = 2. Spans are 0-indexed in increasing order of their .start field. A spurious match will have a value for prediction but not for truth, and a missing match will have a value for truth but not for prediction. Instances for which there are no predicted spans and no ground truth spans will not appear in the results.
- _index_ (int64)
The index of the item in the dataset.
- _replication_ (string)
ID of the replication the item belongs to.
- MatchCriterion (string)
MatchCriterion applicable to this record.
- prediction (int32)
Index of the relevant predicted span within the current instance.
- truth (int32)
Index of the relevant ground truth span within the current instance.
- MatchResult.correct (bool)
Indicator of whether the match is correct.
- MatchResult.incorrect (bool)
Indicator of whether the match is incorrect.
- MatchResult.partial (bool)
Indicator of whether the match is partial.
- MatchResult.missing (bool)
Indicator of whether the match is missing.
- MatchResult.spurious (bool)
Indicator of whether the match is spurious.
- pydantic model dyff.audit.scoring.text.TextSpanMatchesScoredItem¶
Bases:
ReplicatedItem
- field MatchCriterion: str [Required]¶
MatchCriterion applicable to this record.
- field MatchResultCorrect: bool [Required] (alias 'MatchResult.correct')¶
Indicator of whether the match is correct.
- field MatchResultIncorrect: bool [Required] (alias 'MatchResult.incorrect')¶
Indicator of whether the match is incorrect.
- field MatchResultMissing: bool [Required] (alias 'MatchResult.missing')¶
Indicator of whether the match is missing.
- field MatchResultPartial: bool [Required] (alias 'MatchResult.partial')¶
Indicator of whether the match is partial.
- field MatchResultSpurious: bool [Required] (alias 'MatchResult.spurious')¶
Indicator of whether the match is spurious.
- field prediction: int32() [Required]¶
Index of the relevant predicted span within the current instance.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field truth: int32() [Required]¶
Index of the relevant ground truth span within the current instance.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- dyff.audit.scoring.text.match_spans(predictions: List[Dict[str, Any]], truths: List[Dict[str, Any]]) List[Dict[str, Any]] ¶
Compute matching results for ground-truth and predicted text spans using the extended set of matching criteria defined by the SemEval2013 and MUC-5 evaluation standards.
- Parameters:
predictions – List of predicted spans
truths – List of ground-truth spans
- Returns:
A two-level dictionary containing match counts, like:
{MatchCriterion: {MatchResult: Count}}
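A usage sketch is shown below. The span format (dicts with "start", "end", and "tag" keys) is an assumption suggested by the MatchCriterion descriptions and may differ from the actual expected format:

from dyff.audit.scoring.text import match_spans

# Assumed span format: {"start": ..., "end": ..., "tag": ...}
predictions = [
    {"start": 11, "end": 16, "tag": "PER"},
    {"start": 20, "end": 25, "tag": "LOC"},
]
truths = [
    {"start": 11, "end": 16, "tag": "PER"},
    {"start": 31, "end": 37, "tag": "LOC"},
]

matches = match_spans(predictions, truths)
print(matches)  # match counts organized by MatchCriterion and MatchResult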
Evaluation Metrics¶
Metrics are summaries of the scores generated from Rubrics.
- class dyff.audit.metrics.Metric¶
Bases:
ABC
A Metric is an operation that can be applied to a set of inference-level scores to produce an aggregate summary of performance.
- abstract __call__(scores: DataFrame) DataFrame ¶
Compute the metric.
- abstract property name: str¶
The “semi-qualified” type name of the Metric.
The result should be such that
f"alignmentlabs.audit.metrics.{self.name}"
is the fully-qualified name of the type.
- abstract property schema: Schema¶
The PyArrow schema of the output of applying the Metric.
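For orientation, a user-defined Metric might look like the following sketch, which averages the top1 column produced by TopKAccuracy. The aggregation itself is an illustrative choice, not part of the library:

import pandas as pd
import pyarrow
from dyff.audit.metrics import Metric

class MeanTop1(Metric):
    """Hypothetical Metric: mean of the top1 indicator column from TopKAccuracy."""

    @property
    def name(self) -> str:
        # A plain label is used here for illustration.
        return "MeanTop1"

    @property
    def schema(self) -> pyarrow.Schema:
        return pyarrow.schema([("top1_accuracy", pyarrow.float64())])

    def __call__(self, scores: pd.DataFrame) -> pd.DataFrame:
        # "top1" is the indicator column in the TopKAccuracy output schema.
        return pd.DataFrame({"top1_accuracy": [scores["top1"].mean()]})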
Text¶
Metrics related to text processing tasks.
- class dyff.audit.metrics.text.ExtendedPrecisionRecall¶
Bases:
Metric
Compute precision and recall for the extended set of text span matching criteria defined in dyff.audit.scoring.text.
- __call__(text_span_matches: DataFrame) DataFrame ¶
Compute the metric.
- property name: str¶
The “semi-qualified” type name of the Metric.
The result should be such that
f"alignmentlabs.audit.metrics.{self.name}"
is the fully-qualified name of the type.
- property schema: Schema¶
The PyArrow schema of the output of applying the Metric.
- MatchCriterion (string)
MatchCriterion applicable to this record.
- MatchResult.correct (int32)
Count of correct matches.
- MatchResult.incorrect (int32)
Count of incorrect matches.
- MatchResult.partial (int32)
Count of partial matches.
- MatchResult.missing (int32)
Count of missing matches.
- MatchResult.spurious (int32)
Count of spurious matches.
- possible (int32)
Number of “possible” matches.
- actual (int32)
Number of “actual” matches.
- precision (double)
Precision score.
- recall (double)
Recall score.
- f1_score (double)
F1 score.
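The scored matches produced by the TextSpanMatches rubric can be converted to a pandas DataFrame and summarized with this Metric. A minimal sketch, with placeholder dataset paths:

import pyarrow
import pyarrow.dataset
from dyff.audit.metrics.text import ExtendedPrecisionRecall
from dyff.audit.scoring.text import TextSpanMatches

# Placeholder paths to the Arrow datasets of evaluation inputs and outputs.
task_data = pyarrow.dataset.dataset("data/my-inputs")
predictions = pyarrow.dataset.dataset("data/evaluation-outputs")

rubric = TextSpanMatches()
matches = pyarrow.Table.from_batches(rubric.apply(task_data, predictions), schema=rubric.schema)

metric = ExtendedPrecisionRecall()
summary = metric(matches.to_pandas())  # one row per MatchCriterion, per the schema above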
- pydantic model dyff.audit.metrics.text.ExtendedPrecisionRecallScore¶
Bases:
DyffSchemaBaseModel
- field MatchCriterion: str [Required]¶
MatchCriterion applicable to this record.
- field MatchResultCorrect: Int32Value [Required] (alias 'MatchResult.correct')¶
Count of correct matches.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field MatchResultIncorrect: Int32Value [Required] (alias 'MatchResult.incorrect')¶
Count of incorrect matches.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field MatchResultMissing: Int32Value [Required] (alias 'MatchResult.missing')¶
Count of missing matches.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field MatchResultPartial: Int32Value [Required] (alias 'MatchResult.partial')¶
Count of partial matches.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field MatchResultSpurious: Int32Value [Required] (alias 'MatchResult.spurious')¶
Count of spurious matches.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field actual: Int32Value [Required]¶
Number of “actual” matches.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field f1_score: float [Required]¶
F1 score.
- field possible: Int32Value [Required]¶
Number of “possible” matches.
- Constraints:
minimum = -2147483648
maximum = 2147483647
dyff.io/dtype = int32
- field precision: float [Required]¶
Precision score.
- field recall: float [Required]¶
Recall score.
- class dyff.audit.metrics.text.MatchCriterion(value)¶
Bases:
str, Enum
The match criteria defined by the SemEval2013 evaluation standard.
See MatchResult for the possible results of matching.
- exact = 'exact'¶
Tag is ignored. The prediction is correct if [start, end) match exactly. Otherwise, the prediction is incorrect.
- partial = 'partial'¶
Tag is ignored. Partial overlap is scored as partial, exact overlap is scored as correct, no overlap is scored as incorrect.
- strict = 'strict'¶
The prediction is correct if [start, end) match exactly and the tag is correct. Otherwise, the prediction is incorrect.
- type = 'type'¶
Partial overlap is scored as either correct or incorrect based on the predicted tag. No overlap is scored as incorrect.
- class dyff.audit.metrics.text.MatchResult(value)¶
Bases:
str, Enum
The possible results of a text span matching comparison as defined by the MUC-5 evaluation standard.
Note that the semantics of these results depend on which MatchCriterion is being applied.
- correct = 'correct'¶
The prediction overlapped with a true entity and was correct.
- incorrect = 'incorrect'¶
The prediction overlapped with a true entity but was incorrect.
- missing = 'missing'¶
A ground-truth entity did not overlap with any predicted entity.
- partial = 'partial'¶
The prediction overlapped a true entity partially but not exactly. Applies to
MatchCriterion.partial
only.
- spurious = 'spurious'¶
A predicted entity did not overlap with any ground-truth entity.
- dyff.audit.metrics.text.arrow_schema(model_type: Type[BaseModel], *, metadata: dict[str, str] | None = None) Schema ¶
Create an Arrow schema from a Pydantic model.
We support a very basic subset of pydantic model features currently. The intention is to expand this.
- dyff.audit.metrics.text.f1_score(df: DataFrame) Series ¶
F1 score is the harmonic mean of precision and recall.
- Parameters:
df (pandas.DataFrame) – Must contain columns precision and recall.
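For example, applied row-wise to a DataFrame with precision and recall columns:

import pandas as pd
from dyff.audit.metrics.text import f1_score

df = pd.DataFrame({"precision": [0.8, 0.5], "recall": [0.6, 0.5]})
print(f1_score(df))  # harmonic mean per row: approximately 0.686 and 0.5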
- dyff.audit.metrics.text.int32(*, strict: bool = False, gt: int | None = None, ge: int | None = None, lt: int | None = None, le: int | None = None, multiple_of: int | None = None) Type[int] ¶
Return a type annotation for an int32 field in a pydantic model.
Note that any keyword arguments must be specified here, even if they can also be specified in pydantic.Field(). The corresponding keyword arguments in pydantic.Field() will have no effect.
Usage:
class M(pydantic.BaseModel):
    # Notice how "lt" is specified in the type annotation, not Field()
    x: int32(lt=42) = pydantic.Field(description="some field")
- dyff.audit.metrics.text.schema_function(schema: Schema)¶
Annotation for functions that return pyarrow.Schema. The annotated function will return the supplied schema and will have a docstring describing the schema.
Intended to be applied to a function with no body, e.g.:
@schema_function(
    pyarrow.schema([
        field_with_docstring("field_name", pyarrow.string(), docstring="Very important!")
    ])
)
def schema() -> pyarrow.Schema:
    """Additional docstring. Don't define a function body"""
Dataset Construction and Manipulation¶
Tools for working with input and output data for systems being audited.
Text¶
- dyff.audit.data.text.token_tags_to_spans(text: str, tokens: List[str], tags: List[str]) List[TaggedSpan] ¶
Computes the list of TaggedSpans corresponding to the tagged tokens in the text.
- dyff.audit.data.text.visualize_spans(text: str, spans: List[TaggedSpan], *, width: int = 80)¶
Print lines of text with lines representing NER spans aligned underneath.
Example output:
My name is Alice and I live in Alaska.
           PPPPP               LLLLLL
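A usage sketch is shown below; the BIO-style tag format ("B-PER", "O", ...) is an assumption for illustration and may not match the tagging scheme the function expects:

from dyff.audit.data.text import token_tags_to_spans, visualize_spans

text = "My name is Alice and I live in Alaska."
tokens = ["My", "name", "is", "Alice", "and", "I", "live", "in", "Alaska", "."]
# Assumed BIO-style tags, one per token.
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC", "O"]

spans = token_tags_to_spans(text, tokens, tags)
visualize_spans(text, spans, width=80)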
Workflows¶
- dyff.audit.workflows.local_evaluation(client: Client, session: InferenceSession, *, input_dataset_path: Path | str, output_dataset_path: Path | str, replications: int = 1, id: str | None = None) str ¶
Emulate an Evaluation workflow by feeding data from a local Arrow dataset to an InferenceSession running on the Dyff platform.
The output dataset will have the same schema as the outputs from an Evaluation run on the platform, including fields added by the platform – _index_, _replication_, etc.
The input dataset must be compatible with the canonical Dyff Platform dataset schema for the appropriate inference task.
- Parameters:
client (dyff.client.Client) – A Dyff API client with permission to call the inferencesessions.token() and inferencesessions.client() endpoints on the given InferenceSession.
session (dyff.schema.platform.InferenceSession) – A record describing a running InferenceSession.
input_dataset_path (Path | str) – The root directory of an Arrow dataset on the local filesystem.
output_dataset_path (Path | str) – The directory where the Arrow output dataset should be created. A subdirectory named with the ID of the simulated evaluation will be created.
replications (int) – The number of replications to run. Equivalent to the EvaluationCreateRequest.replications parameter.
id (str) – If specified, use this ID for the evaluation. Otherwise, generate a new ID.
- Returns:
An ID for the simulated evaluation, either the ID provided as an argument, or a newly-generated one. This will not correspond to an entity in the Dyff datastore, but it can be used to derive the IDs of replications in the output dataset.
- Return type:
str
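A usage sketch follows. The way the client and session are obtained here (Client(api_key=...), inferencesessions.get()) is an assumption about the dyff.client API and may differ from the actual interface; the dataset paths are placeholders:

from dyff.audit.workflows import local_evaluation
from dyff.client import Client

# Assumed constructor and accessor; consult the dyff.client documentation.
client = Client(api_key="...")
session = client.inferencesessions.get("<session-id>")

evaluation_id = local_evaluation(
    client,
    session,
    input_dataset_path="data/my-inputs",
    output_dataset_path="data/evaluation-outputs",
    replications=2,
)
print(evaluation_id)  # names the subdirectory created under output_dataset_path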
- dyff.audit.workflows.local_report(rubric: Rubric, *, input_dataset_path: Path | str, output_dataset_path: Path | str, report_dataset_path: Path | str)¶
Emulate a Report workflow on local data.
You will need the Arrow datasets of inputs and outputs to an Evaluation workflow. You can emulate an Evaluation locally with local_evaluation().
- Parameters:
rubric (dyff.audit.scoring.Rubric) – The Rubric to apply.
input_dataset_path (Path | str) – The root directory of the Arrow dataset containing the inputs to an evaluation.
output_dataset_path (Path | str) – The root directory of the Arrow dataset containing the outputs of the evaluation.
report_dataset_path (Path | str) – The directory where the Arrow dataset of report outputs should be created. A subdirectory named with the ID of the simulated report will be created.
- Returns:
An ID for the simulated report. This will not correspond to an entity in the Dyff datastore.
- Return type:
str
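For example, to score a local evaluation with the TopKAccuracy rubric (all paths are placeholders):

from dyff.audit.scoring.classification import TopKAccuracy
from dyff.audit.workflows import local_report

# The input/output datasets could come from local_evaluation() above.
report_id = local_report(
    TopKAccuracy(),
    input_dataset_path="data/my-inputs",
    output_dataset_path="data/evaluation-outputs/<evaluation-id>",
    report_dataset_path="data/reports",
)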