Utilities

Scoring Rubrics

Base class for Rubrics.

class dyff.audit.scoring.base.Rubric

A Rubric is a mechanism for producing inference-level performance scores for the output of an inference service on a given inference task.

abstract apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) → Iterable[pyarrow.RecordBatch]

Create a generator that yields batches of scored inferences.

abstract property name: str

The “semi-qualified” type name of the Rubric.

The result should be such that f"dyff.audit.scoring.{self.name}" is the fully-qualified name of the type.

abstract property schema: Schema

The PyArrow schema of the output of applying the Rubric.

Classification

Rubrics related to supervised classification.

class dyff.audit.scoring.classification.TopKAccuracy

Bases: Rubric

Computes top-1 and top-5 accuracy for classification tasks.

apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) → Iterable[pyarrow.RecordBatch]

Create a generator that yields batches of scored inferences.

property name: str

The “semi-qualified” type name of the Rubric.

The result should be such that f"dyff.audit.scoring.{self.name}" is the fully-qualified name of the type.

property schema: Schema

The PyArrow schema of the output of applying the Rubric.

There is one row per input instance. A prediction is scored as correct if prediction == truth. For top-5 accuracy, the instance is correct if any of the top-5 predictions was correct.

_index_ (int64)

The index of the item in the dataset.

_replication_ (string)

ID of the replication the item belongs to.

top1 (bool)

0-1 indicator of whether the top-1 prediction was correct.

top5 (bool)

0-1 indicator of whether any of the top-5 predictions were correct.

pydantic model dyff.audit.scoring.classification.TopKAccuracyScoredItem

Bases: ReplicatedItem

Placeholder.

field top1: bool [Required]

0-1 indicator of whether the top-1 prediction was correct.

field top5: bool [Required]

0-1 indicator of whether any of the top-5 predictions were correct.

dyff.audit.scoring.classification.top_k(prediction: Any | List[Any], truth: Any, *, k: int) → int

Return 1 if any of the first k elements of prediction are equal to truth, and 0 otherwise.

Parameters:
  • prediction (Any or List(Any)) – Either a list of predicted labels in descending order of “score”, or a single predicted label. For a single label, all values of k are equivalent to k = 1.

  • truth (Any) – The true label.

  • k (int) – The number of predictions to consider.

Returns:

1 if one of the top k predictions was correct, 0 otherwise.

Return type:

int
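
The behaviour described above can be illustrated with a minimal sketch. This is not the library's implementation; top_k_sketch is a hypothetical stand-in written only from the description of top_k, and it assumes that a non-list prediction is treated as a one-element ranking.

```python
from typing import Any, List, Union


def top_k_sketch(prediction: Union[Any, List[Any]], truth: Any, *, k: int) -> int:
    """Return 1 if truth is among the first k predictions, and 0 otherwise."""
    # Assumption: a single label is treated as a one-element ranking, so
    # every k >= 1 behaves like k = 1.
    ranked = prediction if isinstance(prediction, list) else [prediction]
    return int(any(p == truth for p in ranked[:k]))


ranking = ["cat", "dog", "bird"]  # labels in descending order of score
print(top_k_sketch(ranking, "dog", k=1))  # 0: the top-1 label is "cat"
print(top_k_sketch(ranking, "dog", k=5))  # 1: "dog" is within the top 5
print(top_k_sketch("dog", "dog", k=5))    # 1: single label, same as k = 1
```

Under these assumptions, the top1 and top5 columns of TopKAccuracy would correspond to k=1 and k=5 applied to the same prediction.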

Text

Rubrics related to text processing tasks.

class dyff.audit.scoring.text.MatchCriterion(value)

Bases: str, Enum

The match criteria defined by the SemEval2013 evaluation standard.

See MatchResult for the possible results of matching.

exact = 'exact'

Tag is ignored. The prediction is correct if [start, end) match exactly. Otherwise, the prediction is incorrect.

partial = 'partial'

Tag is ignored. Partial overlap is scored as partial, exact overlap is scored as correct, no overlap is scored as incorrect.

strict = 'strict'

The prediction is correct if [start, end) match exactly and the tag is correct. Otherwise, the prediction is incorrect.

type = 'type'

Partial overlap is scored as either correct or incorrect based on the predicted tag. No overlap is scored as incorrect.

class dyff.audit.scoring.text.MatchResult(value)

Bases: str, Enum

The possible results of a text span matching comparison as defined by the MUC-5 evaluation standard.

Note that the semantics of these results depend on which MatchCriterion is being applied.

correct = 'correct'

The prediction overlapped with a true entity and was correct.

incorrect = 'incorrect'

The prediction overlapped with a true entity but was incorrect.

missing = 'missing'

A ground-truth entity did not overlap with any predicted entity.

partial = 'partial'

The prediction overlapped a true entity partially but not exactly. Applies to MatchCriterion.partial only.

spurious = 'spurious'

A predicted entity did not overlap with any ground-truth entity.
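
To make the four criteria concrete, here is a rough sketch that scores one predicted span against one ground-truth span. It is not the library's code: score_pair_sketch is a hypothetical helper, spans are assumed to be dicts with "start", "end" (exclusive), and "tag" keys, and the "type" criterion is assumed to accept any overlap (exact or partial) with a matching tag. Unmatched spans (missing or spurious) are outside its scope.

```python
from typing import Any, Dict


def score_pair_sketch(pred: Dict[str, Any], truth: Dict[str, Any]) -> Dict[str, str]:
    """Score one (prediction, truth) pair under each match criterion."""
    same_bounds = (pred["start"], pred["end"]) == (truth["start"], truth["end"])
    # Half-open intervals [start, end) overlap when each one starts before
    # the other one ends.
    overlaps = pred["start"] < truth["end"] and truth["start"] < pred["end"]
    same_tag = pred["tag"] == truth["tag"]
    return {
        "exact": "correct" if same_bounds else "incorrect",
        "partial": "correct" if same_bounds
        else ("partial" if overlaps else "incorrect"),
        "strict": "correct" if same_bounds and same_tag else "incorrect",
        # Assumption: any overlap counts, and only the tag decides the result.
        "type": "correct" if overlaps and same_tag else "incorrect",
    }


truth = {"start": 11, "end": 16, "tag": "PER"}
short_pred = {"start": 11, "end": 14, "tag": "PER"}   # right tag, wrong bounds
wrong_tag = {"start": 11, "end": 16, "tag": "LOC"}    # right bounds, wrong tag

print(score_pair_sketch(short_pred, truth))
print(score_pair_sketch(wrong_tag, truth))
```

The two example predictions show why the criteria are reported separately: the first is correct only under "type" (and partial under "partial"), while the second is correct under "exact" and "partial" but not under "strict" or "type".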

class dyff.audit.scoring.text.TextSpanMatches

Bases: Rubric

Computes matches between predicted and ground-truth text spans according to each criterion in MatchCriterion.

apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) → Iterable[pyarrow.RecordBatch]

Create a generator that yields batches of scored inferences.

property name: str

The “semi-qualified” type name of the Rubric.

The result should be such that f"dyff.audit.scoring.{self.name}" is the fully-qualified name of the type.

property schema: Schema

The PyArrow schema of the output of applying the Rubric.

There may be 0 or more rows for each input instance (same _index_). Each row indicates which predicted span within that instance overlapped with which ground-truth span, and how that overlap was scored. Spans are 0-indexed in increasing order of their .start field. For example, if the predicted span at index 1 overlapped with the ground-truth span at index 2, then there will be a row with prediction = 1 and truth = 2. A spurious match will have a value for prediction but not for truth, and a missing match will have a value for truth but not for prediction. Instances for which there are no predicted spans and no ground-truth spans will not appear in the results.

_index_ (int64)

The index of the item in the dataset.

_replication_ (string)

ID of the replication the item belongs to.

MatchCriterion (string)

MatchCriterion applicable to this record.

prediction (int32)

Index of the relevant predicted span within the current instance.

truth (int32)

Index of the relevant ground truth span within the current instance.

MatchResult.correct (bool)

Indicator of whether the match is correct.

MatchResult.incorrect (bool)

Indicator of whether the match is incorrect.

MatchResult.partial (bool)

Indicator of whether the match is partial.

MatchResult.missing (bool)

Indicator of whether the match is missing.

MatchResult.spurious (bool)

Indicator of whether the match is spurious.

pydantic model dyff.audit.scoring.text.TextSpanMatchesScoredItem

Bases: ReplicatedItem

field MatchCriterion: str [Required]

MatchCriterion applicable to this record.

field MatchResultCorrect: bool [Required] (alias 'MatchResult.correct')

Indicator of whether the match is correct.

field MatchResultIncorrect: bool [Required] (alias 'MatchResult.incorrect')

Indicator of whether the match is incorrect.

field MatchResultMissing: bool [Required] (alias 'MatchResult.missing')

Indicator of whether the match is missing.

field MatchResultPartial: bool [Required] (alias 'MatchResult.partial')

Indicator of whether the match is partial.

field MatchResultSpurious: bool [Required] (alias 'MatchResult.spurious')

Indicator of whether the match is spurious.

field prediction: int32() [Required]

Index of the relevant predicted span within the current instance.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field truth: int32() [Required]

Index of the relevant ground truth span within the current instance.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

dyff.audit.scoring.text.match_spans(predictions: List[Dict[str, Any]], truths: List[Dict[str, Any]]) → List[Dict[str, Any]]

Compute matching results for ground-truth and predicted text spans using the extended set of matching criteria defined by the SemEval2013 and MUC-5 evaluation standards.

Parameters:
  • predictions – List of predicted spans

  • truths – List of ground-truth spans

Returns:

A two-level dictionary containing match counts, like: {MatchCriterion: {MatchResult: Count}}

Evaluation Metrics

Metrics are summaries of the scores generated from Rubrics.

class dyff.audit.metrics.Metric

Bases: ABC

A Metric is an operation that can be applied to a set of inference-level scores to produce an aggregate summary of performance.

abstract __call__(scores: DataFrame) → DataFrame

Compute the metric.

abstract property name: str

The “semi-qualified” type name of the Metric.

The result should be such that f"dyff.audit.metrics.{self.name}" is the fully-qualified name of the type.

abstract property schema: Schema

The PyArrow schema of the output of applying the Metric.

Text

Metrics related to text processing tasks.

class dyff.audit.metrics.text.ExtendedPrecisionRecall

Bases: Metric

Compute precision and recall for the extended set of text span matching criteria defined in dyff.audit.scoring.text.

__call__(text_span_matches: DataFrame) → DataFrame

Compute the metric.

property name: str

The “semi-qualified” type name of the Metric.

The result should be such that f"dyff.audit.metrics.{self.name}" is the fully-qualified name of the type.

property schema: Schema

The PyArrow schema of the output of applying the Metric.

MatchCriterion (string)

MatchCriterion applicable to this record.

MatchResult.correct (int32)

Count of correct matches.

MatchResult.incorrect (int32)

Count of incorrect matches.

MatchResult.partial (int32)

Count of partial matches.

MatchResult.missing (int32)

Count of missing matches.

MatchResult.spurious (int32)

Count of spurious matches.

possible (int32)

Number of “possible” matches.

actual (int32)

Number of “actual” matches.

precision (double)

Precision score.

recall (double)

Recall score.

f1_score (double)

F1 score.

F1 score.
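
The schema lists the counts and the derived scores but does not state the formulas that connect them. The sketch below shows one plausible reading, assuming the conventional MUC-style definitions (possible = correct + incorrect + partial + missing; actual = correct + incorrect + partial + spurious) with half credit for partial matches. Both the formulas and the helper name precision_recall_sketch are assumptions, not confirmed behaviour of ExtendedPrecisionRecall.

```python
from typing import Dict, Union

Number = Union[int, float]


def precision_recall_sketch(counts: Dict[str, int]) -> Dict[str, Number]:
    """Derive possible/actual/precision/recall from per-result counts."""
    cor, inc, par = counts["correct"], counts["incorrect"], counts["partial"]
    mis, spu = counts["missing"], counts["spurious"]
    possible = cor + inc + par + mis  # spans present in the ground truth
    actual = cor + inc + par + spu    # spans produced by the system
    # Assumption: a partial match earns half the credit of a correct one.
    credit = cor + 0.5 * par
    return {
        "possible": possible,
        "actual": actual,
        # Assumption: scores default to 0.0 when the denominator is zero.
        "precision": credit / actual if actual else 0.0,
        "recall": credit / possible if possible else 0.0,
    }


row = precision_recall_sketch(
    {"correct": 6, "incorrect": 1, "partial": 2, "missing": 1, "spurious": 2}
)
print(row["possible"], row["actual"])  # 10 11
```

For criteria where partial is always zero, the half-credit term vanishes and the scores reduce to correct / actual and correct / possible.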

pydantic model dyff.audit.metrics.text.ExtendedPrecisionRecallScore

Bases: DyffSchemaBaseModel

field MatchCriterion: str [Required]

MatchCriterion applicable to this record.

field MatchResultCorrect: Int32Value [Required] (alias 'MatchResult.correct')

Count of correct matches.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field MatchResultIncorrect: Int32Value [Required] (alias 'MatchResult.incorrect')

Count of incorrect matches.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field MatchResultMissing: Int32Value [Required] (alias 'MatchResult.missing')

Count of missing matches.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field MatchResultPartial: Int32Value [Required] (alias 'MatchResult.partial')

Count of partial matches.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field MatchResultSpurious: Int32Value [Required] (alias 'MatchResult.spurious')

Count of spurious matches.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field actual: Int32Value [Required]

Number of “actual” matches.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field f1_score: float [Required]

F1 score.

field possible: Int32Value [Required]

Number of “possible” matches.

Constraints:
  • minimum = -2147483648

  • maximum = 2147483647

  • dyff.io/dtype = int32

field precision: float [Required]

Precision score.

field recall: float [Required]

Recall score.

class dyff.audit.metrics.text.MatchCriterion(value)

Bases: str, Enum

The match criteria defined by the SemEval2013 evaluation standard.

See MatchResult for the possible results of matching.

exact = 'exact'

Tag is ignored. The prediction is correct if [start, end) match exactly. Otherwise, the prediction is incorrect.

partial = 'partial'

Tag is ignored. Partial overlap is scored as partial, exact overlap is scored as correct, no overlap is scored as incorrect.

strict = 'strict'

The prediction is correct if [start, end) match exactly and the tag is correct. Otherwise, the prediction is incorrect.

type = 'type'

Partial overlap is scored as either correct or incorrect based on the predicted tag. No overlap is scored as incorrect.

class dyff.audit.metrics.text.MatchResult(value)

Bases: str, Enum

The possible results of a text span matching comparison as defined by the MUC-5 evaluation standard.

Note that the semantics of these results depend on which MatchCriterion is being applied.

correct = 'correct'

The prediction overlapped with a true entity and was correct.

incorrect = 'incorrect'

The prediction overlapped with a true entity but was incorrect.

missing = 'missing'

A ground-truth entity did not overlap with any predicted entity.

partial = 'partial'

The prediction overlapped a true entity partially but not exactly. Applies to MatchCriterion.partial only.

spurious = 'spurious'

A predicted entity did not overlap with any ground-truth entity.

dyff.audit.metrics.text.arrow_schema(model_type: Type[BaseModel], *, metadata: dict[str, str] | None = None) → Schema

Create an Arrow schema from a Pydantic model.

Only a basic subset of pydantic model features is currently supported. The intention is to expand this.

dyff.audit.metrics.text.f1_score(df: DataFrame) → Series

F1 score is the harmonic mean of precision and recall.

Parameters:

df (pandas.DataFrame) – Must contain columns precision and recall.
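
As a small worked illustration of the harmonic mean, here is a scalar sketch using plain floats instead of a DataFrame. The helper f1_sketch is hypothetical, and returning 0.0 when both inputs are zero is an assumption; the library function may handle that case differently (for example, by producing NaN).

```python
def f1_sketch(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    total = precision + recall
    # Assumption: define the score as 0.0 when precision + recall is zero,
    # which avoids a division by zero.
    return 2.0 * precision * recall / total if total else 0.0


# The harmonic mean is pulled toward the smaller of the two values.
print(f1_sketch(0.5, 1.0))
print(f1_sketch(0.0, 0.0))  # 0.0
```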

dyff.audit.metrics.text.int32(*, strict: bool = False, gt: int | None = None, ge: int | None = None, lt: int | None = None, le: int | None = None, multiple_of: int | None = None) → Type[int]

Return a type annotation for an int32 field in a pydantic model.

Note that any keyword arguments must be specified here, even if they can also be specified in pydantic.Field(). The corresponding keyword arguments in pydantic.Field() will have no effect.

Usage:

import pydantic
from dyff.audit.metrics.text import int32  # import path as documented above

class M(pydantic.BaseModel):
    # Notice how "lt" is specified in the type annotation, not Field()
    x: int32(lt=42) = pydantic.Field(description="some field")

dyff.audit.metrics.text.schema_function(schema: Schema)

Annotation for functions that return pyarrow.Schema. The annotated function will return the supplied schema and will have a docstring describing the schema.

Intended to be applied to a function with no body, e.g.:

@schema_function(
  pyarrow.schema([
    field_with_docstring("field_name", pyarrow.string(), docstring="Very important!")
  ])
)
def schema() -> pyarrow.Schema:
  """Additional docstring. Don't define a function body"""

Dataset Construction and Manipulation

Tools for working with input and output data for systems being audited.

Text

dyff.audit.data.text.token_tags_to_spans(text: str, tokens: List[str], tags: List[str]) → List[TaggedSpan]

Computes the list of TaggedSpans corresponding to the tagged tokens in the text.
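
The conversion from tagged tokens to character spans can be sketched roughly as follows. Several things here are assumptions and not taken from the library: SpanSketch is a hypothetical stand-in for TaggedSpan, the tag "O" is assumed to mark untagged tokens, tags are assumed to be plain labels without B-/I- prefixes, adjacent tokens with the same tag are merged into one span, and tokens are assumed to occur in the text in order.

```python
from typing import List, NamedTuple


class SpanSketch(NamedTuple):
    # Hypothetical stand-in for TaggedSpan; the real type may differ.
    start: int
    end: int  # exclusive
    tag: str


def tokens_to_spans_sketch(
    text: str, tokens: List[str], tags: List[str], *, outside: str = "O"
) -> List[SpanSketch]:
    spans: List[SpanSketch] = []
    cursor = 0          # search position, so repeated tokens resolve in order
    prev_tag = outside  # tag of the previous token
    for token, tag in zip(tokens, tags):
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        if tag != outside:
            if spans and prev_tag == tag:
                # Same tag as the previous token: extend the current span.
                spans[-1] = spans[-1]._replace(end=end)
            else:
                spans.append(SpanSketch(start, end, tag))
        prev_tag = tag
    return spans


text = "My name is Alice and I live in Alaska."
tokens = ["My", "name", "is", "Alice", "and", "I", "live", "in", "Alaska", "."]
tags = ["O", "O", "O", "PER", "O", "O", "O", "O", "LOC", "O"]
for span in tokens_to_spans_sketch(text, tokens, tags):
    print(span, repr(text[span.start:span.end]))
```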

dyff.audit.data.text.visualize_spans(text: str, spans: List[TaggedSpan], *, width: int = 80)

Print lines of text with lines representing NER spans aligned underneath.

Example output:

My name is Alice and I live in Alaska.
           PPPPP               LLLLLL
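
A rough sketch of how such output could be produced is shown here. It is not the library's implementation: render_spans_sketch is a hypothetical helper that returns a string instead of printing, spans are assumed to be (start, end, tag) tuples with an exclusive end, the text is assumed to contain no newlines, and the marker character is assumed to be the first letter of the tag, which is inferred only from the example output.

```python
from typing import List, Tuple


def render_spans_sketch(
    text: str, spans: List[Tuple[int, int, str]], *, width: int = 80
) -> str:
    # Build one marker character per character of text.
    marks = [" "] * len(text)
    for start, end, tag in spans:
        for i in range(start, min(end, len(text))):
            marks[i] = tag[0]  # assumption: mark with the tag's first letter
    underline = "".join(marks)

    # Wrap the text and its marker line together, width characters at a time.
    lines = []
    for offset in range(0, len(text), width):
        lines.append(text[offset:offset + width])
        lines.append(underline[offset:offset + width].rstrip())
    return "\n".join(lines)


sentence = "My name is Alice and I live in Alaska."
print(render_spans_sketch(sentence, [(11, 16, "PER"), (31, 37, "LOC")]))
```

With these inputs the sketch reproduces the two-line example output: the sentence, then a row of P and L markers under "Alice" and "Alaska".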

Workflows

dyff.audit.workflows.local_evaluation(client: Client, session: InferenceSession, *, input_dataset_path: Path | str, output_dataset_path: Path | str, replications: int = 1, id: str | None = None) → str

Emulate an Evaluation workflow by feeding data from a local Arrow dataset to an InferenceSession running on the Dyff platform.

Deprecated since version 0.3.3: Use DyffLocalPlatform to get similar functionality.

The output dataset will have the same schema as the outputs from an Evaluation run on the platform, including fields added by the platform – _index_, _replication_, etc.

The input dataset must be compatible with the canonical Dyff Platform dataset schema for the appropriate inference task.

Parameters:
  • client (dyff.client.Client) – A Dyff API client with permission to call the inferencesessions.token() and inferencesessions.client() endpoints on the given InferenceSession.

  • session (dyff.schema.platform.InferenceSession) – A record describing a running InferenceSession.

  • input_dataset_path (Path | str) – The root directory of an Arrow dataset on the local filesystem.

  • output_dataset_path (Path | str) – The directory where the Arrow output dataset should be created. A subdirectory named with the ID of the simulated evaluation will be created.

  • replications (int) – The number of replications to run. Equivalent to the EvaluationCreateRequest.replications parameter.

  • id (str) – If specified, use this ID for the evaluation. Otherwise, generate a new ID.

Returns:

An ID for the simulated evaluation, either the ID provided as an argument, or a newly-generated one. This will not correspond to an entity in the Dyff datastore, but it can be used to derive the IDs of replications in the output dataset.

Return type:

str

dyff.audit.workflows.local_report(rubric: Rubric, *, input_dataset_path: Path | str, output_dataset_path: Path | str, report_dataset_path: Path | str)

Emulate a Report workflow on local data.

Deprecated since version 0.3.3: Use DyffLocalPlatform to get similar functionality.

You will need the Arrow datasets of inputs and outputs to an Evaluation workflow. You can emulate an Evaluation locally with local_evaluation().

Parameters:
  • rubric (dyff.audit.scoring.Rubric) – The Rubric to apply.

  • input_dataset_path (Path | str) – The root directory of the Arrow dataset containing the inputs to an evaluation.

  • output_dataset_path (Path | str) – The root directory of the Arrow dataset containing the outputs of the evaluation.

  • report_dataset_path (Path | str) – The directory where the Arrow dataset of report outputs should be created. A subdirectory named with the ID of the simulated report will be created.

Returns:

An ID for the simulated report. This will not correspond to an entity in the Dyff datastore.

Return type:

str