Utilities
Scoring Rubrics
Base class for Rubrics.
- class dyff.audit.scoring.base.Rubric
- A Rubric is a mechanism for producing inference-level performance scores for the output of an inference service on a given inference task.
 - abstract apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) → Iterable[pyarrow.RecordBatch]
- Create a generator that yields batches of scored inferences.
 - abstract property name: str
- The “semi-qualified” type name of the Rubric. The result should be such that f"alignmentlabs.audit.scoring.{self.name}" is the fully-qualified name of the type.
 - abstract property schema: Schema
- The PyArrow schema of the output of applying the Rubric.
 
Classification
Rubrics related to supervised classification.
- class dyff.audit.scoring.classification.TopKAccuracy
- Bases: Rubric
- Computes top-1 and top-5 accuracy for classification tasks.
 - apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) → Iterable[pyarrow.RecordBatch]
- Create a generator that yields batches of scored inferences.
 - property name: str
- The “semi-qualified” type name of the Rubric. The result should be such that f"alignmentlabs.audit.scoring.{self.name}" is the fully-qualified name of the type.
 - property schema: Schema
- The PyArrow schema of the output of applying the Rubric. There is one row per input instance. A prediction is scored as correct if prediction == truth. For top-5 accuracy, the instance is correct if any of the top-5 predictions was correct.
- _index_ (int64)
- The index of the item in the dataset.
- _replication_ (string)
- ID of the replication the item belongs to.
- top1 (bool)
- 0-1 indicator of whether the top-1 prediction was correct.
- top5 (bool)
- 0-1 indicator of whether any of the top-5 predictions were correct.
 
 
- pydantic model dyff.audit.scoring.classification.TopKAccuracyScoredItem
- Bases: ReplicatedItem
- Placeholder.
 - field top1: bool [Required]
- 0-1 indicator of whether the top-1 prediction was correct.
 - field top5: bool [Required]
- 0-1 indicator of whether any of the top-5 predictions were correct.
 
- dyff.audit.scoring.classification.top_k(prediction: Any | List[Any], truth: Any, *, k: int) → int
- Return 1 if any of the first k elements of prediction are equal to truth, and 0 otherwise.
- Parameters:
- prediction (Any or List(Any)) – Either a list of predicted labels in descending order of “score”, or a single predicted label. For a single label, all values of k are equivalent to k = 1.
- truth (Any) – The true label.
- k (int) – The number of predictions to consider.
 
- Returns:
- 1 if one of the top k predictions was correct, 0 otherwise.
- Return type:
- int 
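The behaviour described for top_k can be illustrated with a minimal sketch. The function name top_k_sketch is hypothetical and this is not the library's own code; it only follows the documented contract (a list is read as a ranking in descending order of score, a single label behaves like k = 1).

```python
from typing import Any, List, Union


def top_k_sketch(prediction: Union[Any, List[Any]], truth: Any, *, k: int) -> int:
    """Hypothetical stand-in that follows the documented top_k contract."""
    # A single label is treated as a one-element ranking, so every k >= 1
    # gives the same answer as k = 1.
    ranked = prediction if isinstance(prediction, list) else [prediction]
    return int(any(label == truth for label in ranked[:k]))


ranking = ["cat", "dog", "fox"]  # descending order of score
print(top_k_sketch(ranking, "dog", k=1))  # 0
print(top_k_sketch(ranking, "dog", k=5))  # 1
print(top_k_sketch("dog", "dog", k=5))    # 1
```

Under this reading, the top1 and top5 columns of TopKAccuracy would correspond to calling the function with k=1 and k=5 on the same row.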
 
Text
Rubrics related to text processing tasks.
- class dyff.audit.scoring.text.MatchCriterion(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
- Bases: str, Enum
- The match criteria defined by the SemEval2013 evaluation standard. See MatchResult for the possible results of matching.
 - exact = 'exact'
- Tag is ignored. The prediction is correct if [start, end) match exactly. Otherwise, the prediction is incorrect.
 - partial = 'partial'
- Tag is ignored. Partial overlap is scored as partial, exact overlap is scored as correct, no overlap is scored as incorrect.
 - strict = 'strict'
- The prediction is correct if [start, end) match exactly and the tag is correct. Otherwise, the prediction is incorrect.
 - type = 'type'
- Partial overlap is scored as either correct or incorrect based on the predicted tag. No overlap is scored as incorrect.
 
- class dyff.audit.scoring.text.MatchResult(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
- Bases: str, Enum
- The possible results of a text span matching comparison as defined by the MUC-5 evaluation standard. Note that the semantics of these results depend on which MatchCriterion is being applied.
 - correct = 'correct'
- The prediction overlapped with a true entity and was correct.
 - incorrect = 'incorrect'
- The prediction overlapped with a true entity but was incorrect.
 - missing = 'missing'
- A ground-truth entity did not overlap with any predicted entity.
 - partial = 'partial'
- The prediction overlapped a true entity partially but not exactly. Applies to MatchCriterion.partial only.
 - spurious = 'spurious'
- A predicted entity did not overlap with any ground-truth entity.
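How the four criteria turn one predicted span and one ground-truth span into a result can be sketched as follows. This is an illustration of the descriptions above, not the library's implementation. The span layout (a dict with start, end, and tag keys) and the function name score_pair are assumptions, and for the type criterion the sketch assumes that an exact overlap counts as an overlap.

```python
from typing import Any, Dict

Span = Dict[str, Any]  # assumed layout: {"start": int, "end": int, "tag": str}


def score_pair(pred: Span, truth: Span, criterion: str) -> str:
    """Score one predicted span against one ground-truth span (hypothetical helper)."""
    same_bounds = (pred["start"], pred["end"]) == (truth["start"], truth["end"])
    # Half-open intervals [start, end) overlap when each starts before the other ends.
    overlaps = pred["start"] < truth["end"] and truth["start"] < pred["end"]
    same_tag = pred["tag"] == truth["tag"]

    if criterion == "strict":
        return "correct" if same_bounds and same_tag else "incorrect"
    if criterion == "exact":
        return "correct" if same_bounds else "incorrect"
    if criterion == "partial":
        if same_bounds:
            return "correct"
        return "partial" if overlaps else "incorrect"
    if criterion == "type":
        return "correct" if overlaps and same_tag else "incorrect"
    raise ValueError(f"unknown criterion: {criterion}")


pred = {"start": 0, "end": 5, "tag": "PER"}
truth = {"start": 0, "end": 8, "tag": "PER"}
for criterion in ("strict", "exact", "partial", "type"):
    print(criterion, score_pair(pred, truth, criterion))
# strict incorrect
# exact incorrect
# partial partial
# type correct
```

The example shows why the same pair can receive different results: the boundaries differ, so only the criteria that tolerate partial overlap give credit.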
 
- class dyff.audit.scoring.text.TextSpanMatches
- Bases: Rubric
- Computes matches between predicted and ground-truth text spans according to each criterion in MatchCriterion.
 - apply(task_data: pyarrow.datasets.Dataset, predictions: pyarrow.datasets.Dataset) → Iterable[pyarrow.RecordBatch]
- Create a generator that yields batches of scored inferences.
 - property name: str
- The “semi-qualified” type name of the Rubric. The result should be such that f"alignmentlabs.audit.scoring.{self.name}" is the fully-qualified name of the type.
 - property schema: Schema
- The PyArrow schema of the output of applying the Rubric. There may be 0 or more rows for each input instance (same _index_). Each row indicates which predicted span within that instance overlapped with which ground truth span, and how that overlap was scored. For example, if the 1st predicted span overlapped with the 2nd ground truth span, then there will be a row with prediction = 1 and truth = 2. Spans are 0-indexed in increasing order of their .start field. A spurious match will have a value for prediction but not for truth, and a missing match will have a value for truth but not for prediction. Instances for which there are no predicted spans and no ground truth spans will not appear in the results.
- _index_ (int64)
- The index of the item in the dataset.
- _replication_ (string)
- ID of the replication the item belongs to.
- MatchCriterion (string)
- MatchCriterion applicable to this record.
- prediction (int32)
- Index of the relevant predicted span within the current instance.
- truth (int32)
- Index of the relevant ground truth span within the current instance.
- MatchResult.correct (bool)
- Indicator of whether the match is correct.
- MatchResult.incorrect (bool)
- Indicator of whether the match is incorrect.
- MatchResult.partial (bool)
- Indicator of whether the match is partial.
- MatchResult.missing (bool)
- Indicator of whether the match is missing.
- MatchResult.spurious (bool)
- Indicator of whether the match is spurious.
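The row layout described for this schema can be made concrete with a small sketch that pairs spans by overlap and leaves one side empty for spurious and missing matches. The pairing strategy, the dict-based span layout, and the name pair_spans are assumptions made for illustration; the library's actual matching procedure may differ.

```python
from typing import Any, Dict, List, Optional

Span = Dict[str, Any]  # assumed layout: {"start": int, "end": int}


def pair_spans(predictions: List[Span], truths: List[Span]) -> List[Dict[str, Optional[int]]]:
    """Emit one row per overlap; None marks the absent side (hypothetical helper)."""
    # Indices refer to the spans in increasing order of their start offset.
    preds = sorted(predictions, key=lambda s: s["start"])
    golds = sorted(truths, key=lambda s: s["start"])
    rows: List[Dict[str, Optional[int]]] = []
    matched_truths = set()
    for p_idx, pred in enumerate(preds):
        hits = [
            t_idx
            for t_idx, gold in enumerate(golds)
            if pred["start"] < gold["end"] and gold["start"] < pred["end"]
        ]
        if not hits:
            rows.append({"prediction": p_idx, "truth": None})  # spurious
        for t_idx in hits:
            matched_truths.add(t_idx)
            rows.append({"prediction": p_idx, "truth": t_idx})
    for t_idx in range(len(golds)):
        if t_idx not in matched_truths:
            rows.append({"prediction": None, "truth": t_idx})  # missing
    return rows


rows = pair_spans(
    [{"start": 0, "end": 5}, {"start": 20, "end": 25}],
    [{"start": 10, "end": 15}, {"start": 21, "end": 25}],
)
print(rows)
# [{'prediction': 0, 'truth': None}, {'prediction': 1, 'truth': 1}, {'prediction': None, 'truth': 0}]
```

An instance with no predicted and no ground-truth spans yields an empty list here, which matches the statement that such instances do not appear in the results.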
 
 
- pydantic model dyff.audit.scoring.text.TextSpanMatchesScoredItem
- Bases: ReplicatedItem
 - field MatchCriterion: str [Required]
- MatchCriterion applicable to this record.
 - field MatchResultCorrect: bool [Required] (alias 'MatchResult.correct')
- Indicator of whether the match is correct.
 - field MatchResultIncorrect: bool [Required] (alias 'MatchResult.incorrect')
- Indicator of whether the match is incorrect.
 - field MatchResultMissing: bool [Required] (alias 'MatchResult.missing')
- Indicator of whether the match is missing.
 - field MatchResultPartial: bool [Required] (alias 'MatchResult.partial')
- Indicator of whether the match is partial.
 - field MatchResultSpurious: bool [Required] (alias 'MatchResult.spurious')
- Indicator of whether the match is spurious.
 - field prediction: int32() [Required]
- Index of the relevant predicted span within the current instance.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field truth: int32() [Required]
- Index of the relevant ground truth span within the current instance.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 
 
- dyff.audit.scoring.text.match_spans(predictions: List[Dict[str, Any]], truths: List[Dict[str, Any]]) → List[Dict[str, Any]]
- Compute matching results for ground-truth and predicted text spans using the extended set of matching criteria defined by the SemEval2013 and MUC-5 evaluation standards.
- Parameters:
- predictions – List of predicted spans.
- truths – List of ground-truth spans.
 
- Returns:
- A two-level dictionary containing match counts, like: {MatchCriterion: {MatchResult: Count}}
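The nested {MatchCriterion: {MatchResult: Count}} shape mentioned above can be pictured with a short sketch that tallies (criterion, result) pairs. The helper count_matches and its input format are hypothetical, and plain strings stand in for the enum members.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_matches(scored: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Tally (criterion, result) pairs into a two-level dictionary (hypothetical helper)."""
    counts: Dict[str, Counter] = {}
    for criterion, result in scored:
        counts.setdefault(criterion, Counter())[result] += 1
    # Convert the Counters to plain dicts so the result is a simple nested mapping.
    return {criterion: dict(tally) for criterion, tally in counts.items()}


scored = [
    ("strict", "correct"),
    ("strict", "incorrect"),
    ("exact", "correct"),
    ("exact", "correct"),
]
print(count_matches(scored))
# {'strict': {'correct': 1, 'incorrect': 1}, 'exact': {'correct': 2}}
```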
 
Evaluation Metrics
Metrics are summaries of the scores generated from Rubrics.
- class dyff.audit.metrics.Metric
- Bases: ABC
- A Metric is an operation that can be applied to a set of inference-level scores to produce an aggregate summary of performance.
 - abstract __call__(scores: DataFrame) → DataFrame
- Compute the metric.
 - abstract property name: str
- The “semi-qualified” type name of the Metric. The result should be such that f"alignmentlabs.audit.metrics.{self.name}" is the fully-qualified name of the type.
 - abstract property schema: Schema
- The PyArrow schema of the output of applying the Metric.
 
Text
Metrics related to text processing tasks.
- class dyff.audit.metrics.text.ExtendedPrecisionRecall
- Bases: Metric
- Compute precision and recall for the extended set of text span matching criteria defined in dyff.audit.scoring.text.
 - __call__(text_span_matches: DataFrame) → DataFrame
- Compute the metric.
 - property name: str
- The “semi-qualified” type name of the Metric. The result should be such that f"alignmentlabs.audit.metrics.{self.name}" is the fully-qualified name of the type.
 - property schema: Schema
- The PyArrow schema of the output of applying the Metric.
- MatchCriterion (string)
- MatchCriterion applicable to this record.
- MatchResult.correct (int32)
- Count of correct matches.
- MatchResult.incorrect (int32)
- Count of incorrect matches.
- MatchResult.partial (int32)
- Count of partial matches.
- MatchResult.missing (int32)
- Count of missing matches.
- MatchResult.spurious (int32)
- Count of spurious matches.
- possible (int32)
- Number of “possible” matches.
- actual (int32)
- Number of “actual” matches.
- precision (double)
- Precision score.
- recall (double)
- Recall score.
- f1_score (double)
- F1 score.
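The page does not spell out how possible, actual, precision, and recall are derived from the counts. The sketch below uses the definitions commonly given for MUC-style and SemEval-2013 scoring, under the assumption that the library follows them: possible counts the ground-truth spans, actual counts the predicted spans, and partial matches earn half credit where partial credit applies. The function name and the zero-denominator convention are hypothetical.

```python
from typing import Dict


def precision_recall_sketch(counts: Dict[str, int], *, partial_credit: bool) -> Dict[str, float]:
    """Derive summary columns from match counts (assumed MUC-style definitions)."""
    cor = counts.get("correct", 0)
    inc = counts.get("incorrect", 0)
    par = counts.get("partial", 0)
    mis = counts.get("missing", 0)
    spu = counts.get("spurious", 0)

    possible = cor + inc + par + mis  # spans present in the ground truth
    actual = cor + inc + par + spu    # spans produced by the system
    credit = cor + 0.5 * par if partial_credit else float(cor)
    # Returning 0.0 for an empty denominator is a convention chosen for this sketch.
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    return {"possible": possible, "actual": actual, "precision": precision, "recall": recall}


counts = {"correct": 3, "incorrect": 1, "partial": 2, "missing": 2, "spurious": 0}
print(precision_recall_sketch(counts, partial_credit=False))
# {'possible': 8, 'actual': 6, 'precision': 0.5, 'recall': 0.375}
```

With partial_credit=True the same counts give a credit of 4.0, so precision rises to 4/6 and recall to 0.5.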
 
 
- pydantic model dyff.audit.metrics.text.ExtendedPrecisionRecallScore
- Bases: DyffSchemaBaseModel
 - field MatchCriterion: str [Required]
- MatchCriterion applicable to this record.
 - field MatchResultCorrect: Int32Value [Required] (alias 'MatchResult.correct')
- Count of correct matches.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field MatchResultIncorrect: Int32Value [Required] (alias 'MatchResult.incorrect')
- Count of incorrect matches.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field MatchResultMissing: Int32Value [Required] (alias 'MatchResult.missing')
- Count of missing matches.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field MatchResultPartial: Int32Value [Required] (alias 'MatchResult.partial')
- Count of partial matches.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field MatchResultSpurious: Int32Value [Required] (alias 'MatchResult.spurious')
- Count of spurious matches.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field actual: Int32Value [Required]
- Number of “actual” matches.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field f1_score: float [Required]
- F1 score.
 - field possible: Int32Value [Required]
- Number of “possible” matches.
- Constraints:
- minimum = -2147483648
- maximum = 2147483647
- dyff.io/dtype = int32
 
 - field precision: float [Required]
- Precision score.
 - field recall: float [Required]
- Recall score.
 
- class dyff.audit.metrics.text.MatchCriterion(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
- Bases: str, Enum
- The match criteria defined by the SemEval2013 evaluation standard. See MatchResult for the possible results of matching.
 - exact = 'exact'
- Tag is ignored. The prediction is correct if [start, end) match exactly. Otherwise, the prediction is incorrect.
 - partial = 'partial'
- Tag is ignored. Partial overlap is scored as partial, exact overlap is scored as correct, no overlap is scored as incorrect.
 - strict = 'strict'
- The prediction is correct if [start, end) match exactly and the tag is correct. Otherwise, the prediction is incorrect.
 - type = 'type'
- Partial overlap is scored as either correct or incorrect based on the predicted tag. No overlap is scored as incorrect.
 
- class dyff.audit.metrics.text.MatchResult(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
- Bases: str, Enum
- The possible results of a text span matching comparison as defined by the MUC-5 evaluation standard. Note that the semantics of these results depend on which MatchCriterion is being applied.
 - correct = 'correct'
- The prediction overlapped with a true entity and was correct.
 - incorrect = 'incorrect'
- The prediction overlapped with a true entity but was incorrect.
 - missing = 'missing'
- A ground-truth entity did not overlap with any predicted entity.
 - partial = 'partial'
- The prediction overlapped a true entity partially but not exactly. Applies to MatchCriterion.partial only.
 - spurious = 'spurious'
- A predicted entity did not overlap with any ground-truth entity.
 
- dyff.audit.metrics.text.arrow_schema(model_type: Type[BaseModel], *, metadata: dict[str, str] | None = None) → Schema
- Create an Arrow schema from a Pydantic model. Only a very basic subset of pydantic model features is currently supported. The intention is to expand this.
- dyff.audit.metrics.text.f1_score(df: DataFrame) → Series
- F1 score is the harmonic mean of precision and recall.
- Parameters:
- df (pandas.DataFrame) – Must contain columns precision and recall.
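The harmonic mean can be written out for a single row as a minimal sketch. It uses plain floats rather than a DataFrame, the function name is hypothetical, and returning 0.0 when both inputs are zero is an assumption of this sketch, not documented behaviour of f1_score.

```python
def f1_sketch(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall for one row (hypothetical helper)."""
    total = precision + recall
    # Avoid dividing by zero when both scores are zero (assumed convention).
    return 2.0 * precision * recall / total if total else 0.0


rows = [
    {"precision": 0.5, "recall": 0.5},
    {"precision": 1.0, "recall": 0.0},
]
print([f1_sketch(row["precision"], row["recall"]) for row in rows])
# [0.5, 0.0]
```

The second row shows the main property of the harmonic mean: a perfect precision cannot compensate for a recall of zero.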
 
- dyff.audit.metrics.text.int32(*, strict: bool = False, gt: int | None = None, ge: int | None = None, lt: int | None = None, le: int | None = None, multiple_of: int | None = None) → Type[int]
- Return a type annotation for an int32 field in a pydantic model. Note that any keyword arguments must be specified here, even if they can also be specified in pydantic.Field(). The corresponding keyword arguments in pydantic.Field() will have no effect.
- Usage:
    class M(pydantic.BaseModel):
        # Notice how "lt" is specified in the type annotation, not Field()
        x: int32(lt=42) = pydantic.Field(description="some field")
- dyff.audit.metrics.text.schema_function(schema: Schema)
- Annotation for functions that return pyarrow.Schema. The annotated function will return the supplied schema and will have a docstring describing the schema.
- Intended to be applied to a function with no body, e.g.:
    @schema_function(
        pyarrow.schema([
            field_with_docstring("field_name", pyarrow.string(), docstring="Very important!")
        ])
    )
    def schema() -> pyarrow.Schema:
        """Additional docstring. Don't define a function body"""
Dataset Construction and Manipulation
Tools for working with input and output data for systems being audited.
Text
- dyff.audit.data.text.token_tags_to_spans(text: str, tokens: List[str], tags: List[str]) → List[TaggedSpan]
- Computes the list of TaggedSpans corresponding to the tagged tokens in the text.
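One way such a conversion can work is sketched below. The tagging scheme is not stated on this page, so the sketch assumes BIO-style tags ("B-PER", "I-PER", "O"), assumes tokens occur in the text in order, and returns plain dicts instead of TaggedSpan objects. The function name bio_tags_to_spans is hypothetical.

```python
from typing import Any, Dict, List, Optional


def bio_tags_to_spans(text: str, tokens: List[str], tags: List[str]) -> List[Dict[str, Any]]:
    """Merge BIO-tagged tokens into character spans [start, end) (hypothetical helper)."""
    spans: List[Dict[str, Any]] = []
    current: Optional[Dict[str, Any]] = None
    cursor = 0
    for token, tag in zip(tokens, tags):
        # Locate the token at or after the end of the previous token.
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        if tag.startswith("I-") and current is not None and current["tag"] == tag[2:]:
            current["end"] = end  # continue the open span
        elif tag.startswith(("B-", "I-")):
            current = {"start": start, "end": end, "tag": tag[2:]}
            spans.append(current)
        else:
            current = None  # an "O" tag closes any open span
    return spans


text = "I love New York"
print(bio_tags_to_spans(text, ["I", "love", "New", "York"], ["O", "O", "B-LOC", "I-LOC"]))
# [{'start': 7, 'end': 15, 'tag': 'LOC'}]
```

An "I-" tag that has no open span of the same type starts a new span in this sketch; other conventions for malformed tag sequences are possible.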
- dyff.audit.data.text.visualize_spans(text: str, spans: List[TaggedSpan], *, width: int = 80)
- Print lines of text with lines representing NER spans aligned underneath.
- Example output:
    My name is Alice and I live in Alaska.
               PPPPP               LLLLLL
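The aligned underline in the example output can be reproduced with a short sketch. It handles a single line only (no wrapping at width), takes spans as plain dicts rather than TaggedSpan objects, and marks each span with the first letter of its tag. All of these choices, and the name underline_spans, are assumptions for illustration.

```python
from typing import Any, Dict, List


def underline_spans(text: str, spans: List[Dict[str, Any]]) -> str:
    """Return the text plus a marker line aligned under each span (hypothetical helper)."""
    marks = [" "] * len(text)
    for span in spans:
        # Fill the half-open character range [start, end) with the tag's first letter.
        for i in range(span["start"], min(span["end"], len(text))):
            marks[i] = span["tag"][0]
    return text + "\n" + "".join(marks).rstrip()


text = "My name is Alice and I live in Alaska."
spans = [
    {"start": 11, "end": 16, "tag": "PER"},
    {"start": 31, "end": 37, "tag": "LOC"},
]
print(underline_spans(text, spans))
```

Printing the result places five P characters under "Alice" and six L characters under "Alaska".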
Workflows
- dyff.audit.workflows.local_evaluation(client: Client, session: InferenceSession, *, input_dataset_path: Path | str, output_dataset_path: Path | str, replications: int = 1, id: str | None = None) → str
- Emulate an Evaluation workflow by feeding data from a local Arrow dataset to an InferenceSession running on the Dyff platform.
- Deprecated since version 0.3.3: Use DyffLocalPlatform to get similar functionality.
- The output dataset will have the same schema as the outputs from an Evaluation run on the platform, including fields added by the platform (_index_, _replication_, etc.).
- The input dataset must be compatible with the canonical Dyff Platform dataset schema for the appropriate inference task.
- Parameters:
- client (dyff.client.Client) – A Dyff API client with permission to call the inferencesessions.token() and inferencesessions.client() endpoints on the given InferenceSession.
- session (dyff.schema.platform.InferenceSession) – A record describing a running InferenceSession.
- input_dataset_path (Path | str) – The root directory of an Arrow dataset on the local filesystem. 
- output_dataset_path (Path | str) – The directory where the Arrow output dataset should be created. A subdirectory named with the ID of the simulated evaluation will be created. 
- replications (int) – The number of replications to run. Equivalent to the EvaluationCreateRequest.replications parameter.
- id (str) – If specified, use this ID for the evaluation. Otherwise, generate a new ID. 
 
- Returns:
- An ID for the simulated evaluation, either the ID provided as an argument, or a newly-generated one. This will not correspond to an entity in the Dyff datastore, but it can be used to derive the IDs of replications in the output dataset. 
- Return type:
- str 
 
- dyff.audit.workflows.local_report(rubric: Rubric, *, input_dataset_path: Path | str, output_dataset_path: Path | str, report_dataset_path: Path | str)
- Emulate a Report workflow on local data.
- Deprecated since version 0.3.3: Use DyffLocalPlatform to get similar functionality.
- You will need the Arrow datasets of inputs and outputs to an Evaluation workflow. You can emulate an Evaluation locally with local_evaluation().
- Parameters:
- rubric (dyff.audit.scoring.Rubric) – The Rubric to apply. 
- input_dataset_path (Path | str) – The root directory of the Arrow dataset containing the inputs to an evaluation. 
- output_dataset_path (Path | str) – The root directory of the Arrow dataset containing the outputs of the evaluation. 
- report_dataset_path (Path | str) – The directory where the Arrow dataset of report outputs should be created. A subdirectory named with the ID of the simulated report will be created. 
 
- Returns:
- An ID for the simulated report. This will not correspond to an entity in the Dyff datastore. 
- Return type:
- str