dyff-operator

dyff-operator is the “kernel” of the platform. A Kubernetes operator is simply a daemon process called a controller that watches the k8s system state for events pertaining to certain custom resources and responds by taking actions via the k8s API.

dyff-operator controls the following Dyff k8s Custom Resource kinds:

  • Audit (audits.dyff.io)

  • Evaluation (evaluations.dyff.io)

  • InferenceService (inferenceservices.dyff.io)

  • InferenceSession (inferencesessions.dyff.io)

  • Model (models.dyff.io)

  • Report (reports.dyff.io)

We use the generic term workflows to refer to the computational work implied by one of these resources. For example, the evaluation workflow consists of:

  1. starting a Deployment with one or more replicas of an inference runner container, and an associated k8s Service to provide an HTTP interface;

  2. starting a Job that wraps an evaluation client container, which loads data from a dataset and performs inference on it by making HTTP requests to the inference service; and

  3. once the client job is finished, starting a Job that wraps an output verification container, which checks that the output of the client job is complete and correctly formatted.

When an Evaluation resource is created, the controller sees this event and creates the k8s resources needed for steps (1) and (2). The controller then watches the Job resource that wraps the evaluation client. When the job status reaches a completed state, the controller creates the additional k8s resources needed for step (3). Throughout the process, the controller sets conditions on the /status subresource to indicate progress of the workflow or to signal failures.
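As an illustration, the controller's decision logic for this workflow can be sketched as a pure function of the observed child resources. This is a hypothetical simplification, not the actual dyff-operator implementation; the resource keys and the status encoding are invented for the example.

```python
def next_evaluation_actions(child_resources: dict) -> list[str]:
    """Given the child resources observed for an Evaluation, return the
    kinds of k8s resources the controller should create next.

    Keys like "inference-deployment" and statuses like "Complete" are
    invented for this sketch; the real controller inspects actual k8s
    resource statuses.
    """
    actions = []
    # Steps (1) and (2): the inference Deployment/Service and the client Job.
    if "inference-deployment" not in child_resources:
        actions += ["Deployment", "Service"]
    if "client-job" not in child_resources:
        actions.append("Job/evaluation-client")
    # Step (3): create the verification Job only after the client Job completes.
    client = child_resources.get("client-job", {})
    if client.get("status") == "Complete" and "verify-job" not in child_resources:
        actions.append("Job/verify-evaluation-output")
    return actions
```

Because reconciliation is a function of observed state rather than of the triggering event, the controller behaves the same whether it sees events live or replays them after a restart.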

The “steps” of each workflow are implemented as arbitrary executable programs packaged as Docker images. Currently, this code lives under dyff/apps:

evaluation_client

The “client” part of an Evaluation. Reads data from a dataset, makes inference API calls over HTTP, and writes the inference outputs to another dataset.
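The core loop of such a client might look like the following sketch. It is illustrative only: the real evaluation_client reads and writes Arrow datasets and performs inference over HTTP, whereas here the inference call is an injected function and the output field names are invented for the example.

```python
from typing import Callable, Iterable

def run_evaluation(
    dataset: Iterable[dict], infer: Callable[[dict], dict]
) -> list[dict]:
    """Map inference over every dataset item, tagging each output with the
    item's index so that completeness can be verified afterwards."""
    outputs = []
    for index, item in enumerate(dataset):
        # In the real client, this is an HTTP request to the inference Service.
        response = infer(item)
        outputs.append({"_index_": index, "responses": [response]})
    return outputs
```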

fetch_model

Downloads an ML model into storage.

mocks/inferenceservice

A mock inference service for testing.

run_report

Does the computational work of a Report. Reads output data from an Evaluation, applies a scoring Rubric, and writes the results to another dataset.

runners/vllm

The “server” part of an Evaluation. Uses the vLLM package to run LLMs.

verify_evaluation_output

Checks that Evaluation outputs are complete and correctly formatted.
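The completeness half of that check can be sketched as: every dataset index must appear in the outputs exactly once. This is a hypothetical simplification; the format checks performed by the real container are omitted.

```python
def outputs_are_complete(outputs: list[dict], dataset_size: int) -> bool:
    """True iff every index 0..dataset_size-1 appears exactly once.

    The "_index_" field name follows the output schema used elsewhere in
    this document; the function itself is illustrative, not Dyff code.
    """
    indices = sorted(o["_index_"] for o in outputs)
    return indices == list(range(dataset_size))
```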

Design Guidelines

dyff-operator developers should familiarize themselves with the Kubernetes API conventions. We don’t follow these to the letter, but they should be the default starting point when designing new functionality.

dyff-operator is a standalone component

The most important design principle for the dyff-operator is that it should be useful by itself, driven using nothing more than the k8s API (for example, via kubectl).

Components of the dyff-operator MUST NOT interact with any Dyff Platform services that are not managed by the dyff-operator itself.

dyff-operator components MUST NOT depend on the dyff-api package, only the dyff package (client components).

Dyff Custom Resources MUST contain all information necessary to run the workflow.

Specifically: if a resource A depends on a resource B, then any information about B needed to execute A must be included in the manifest of A. Workflows must not assume that other referenced resources are present in the k8s system database.
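For illustration, the rule can be phrased in code: building A copies the fields of B that the workflow needs into A's own spec, rather than storing only a reference to B. The field names below are hypothetical.

```python
def build_evaluation_spec(dataset: dict, session: dict) -> dict:
    """Denormalize: embed the dependency fields the workflow needs, so the
    workflow never has to look up the Dataset or InferenceSession later.

    "storageURI" and the other field names are invented for this sketch.
    """
    return {
        "dataset": dataset["id"],  # the reference is kept for bookkeeping
        # Copies of dependency data, embedded so the workflow is self-contained:
        "datasetStorageURI": dataset["storageURI"],
        "inferenceSession": {
            "image": session["image"],
            "args": list(session["args"]),
        },
    }
```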

Workflows and Workflow Steps

The controller executes workflows by creating one or more Pods that run the Docker containers implementing the steps of the workflow, passing them appropriate arguments and configuration. The Pods are usually managed indirectly via a Job or another workload resource. The controller may also create supporting k8s resources such as ConfigMaps.

Workflow Steps MUST exit with an appropriate integer status code: zero for “success” or non-zero for “failure”.

Rationale: exit codes are how workload resources like Job detect failed Pods and respond to them.
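A minimal sketch of a Step entry point that satisfies this rule (illustrative; run_step is not part of any Dyff package):

```python
import sys
from typing import Callable

def run_step(step: Callable[[], None]) -> int:
    """Run one workflow step and translate its outcome into an exit code."""
    try:
        step()
        return 0  # zero: the owning Job counts the Pod as succeeded
    except Exception as exc:
        print(f"step failed: {exc}", file=sys.stderr)
        return 1  # non-zero: the Job's failure handling kicks in

if __name__ == "__main__":
    # A real step would do its work in the callable; sys.exit propagates
    # the integer status code to the container runtime.
    sys.exit(run_step(lambda: None))
```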

Workflow Steps MUST NOT interact with the Kubernetes API.

Communicating Workflow Status

Dyff Custom Resources MUST use k8s status conditions to indicate the progress of the workflow.

Example:

status:
  conditions:
    - lastTransitionTime: "2023-12-13T18:27:59Z"
      message: Evaluation is complete
      reason: Complete
      status: "True"
      type: Complete
    - lastTransitionTime: "2023-12-13T18:07:35Z"
      message: Evaluation is complete
      reason: Complete
      status: "False"
      type: Failed

Example Dyff Resource Manifest

This is an example of an Evaluation manifest. This example includes the /status subresource, which is managed by the dyff-operator.

apiVersion: dyff.io/v1alpha1
kind: Evaluation
metadata:
  labels:
    dyff.io/account: example
    dyff.io/component: evaluation
    dyff.io/id: f167452938ef4a8c91c7373a37bd6af0
    dyff.io/workflow: evaluation
  name: eval-f167452938ef4a8c91c7373a37bd6af0
  namespace: default
spec:
  account: example
  dataset: db5facc1e37f48e58db0d47297a3634c
  id: f167452938ef4a8c91c7373a37bd6af0
  inferenceSession:
    accelerator:
      gpu:
        hardwareTypes:
          - nvidia.com/gpu-a100
        memory: 16Gi
      kind: GPU
    args:
      - --model
      - /dyff/mnt/model/models--tiiuae--falcon-7b/snapshots/898df1396f35e447d5fe44e0a3ccaaaa69f30d36
      - --download-dir
      - /dyff/mnt/model
      - --dtype
      - float16
    dependencies:
      - kind: ReadOnlyVolume
        readOnlyVolume:
          claimName: model-371288ec69724bf8bebf51811c581f6b-rox
          mountPath: /dyff/mnt/model
          name: model
    env:
      - name: HF_DATASETS_OFFLINE
        value: "1"
      - name: HF_HOME
        value: /dyff/mnt/model
      - name: HUGGINGFACE_HUB_CACHE
        value: /dyff/mnt/model
      - name: TRANSFORMERS_CACHE
        value: /dyff/mnt/model
      - name: TRANSFORMERS_OFFLINE
        value: "1"
    image: us-central1-docker.pkg.dev/dyff-354017/dyff-system/ul-dsri/dyff/dyff/vllm-runner:latest
    replicas: 1
    resources:
      requests:
        memory: 16Gi
    useSpotPods: true
  interface:
    endpoint: generate
    inputPipeline:
      - configuration: '{"prompt": "$.text"}'
        kind: TransformJSON
    outputPipeline:
      - configuration: '{"collections": ["text"]}'
        kind: ExplodeCollections
    outputSchema:
      arrowSchema: /////ygDAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAMAAABkAgAAwAEAAAQAAAC6/f//AAABDBgAAABcAAAACAAAABwAAAABAAAAYAAAAAkAAAByZXNwb25zZXMAAAABAAAABAAAALD9//8gAAAABAAAABMAAABJbmZlcmVuY2UgcmVzcG9uc2VzAAcAAABfX2RvY19fAGT///8QABQACAAGAAcADAAAABAAEAAAAAAAAQ0YAAAAIAAAAAQAAAACAAAAeAAAABQAAAAEAAAAaXRlbQAAAACk////Zv7//wAAAQUUAAAAUAAAAAgAAAAUAAAAAAAAAAQAAAB0ZXh0AAAAAAEAAAAEAAAAVP7//xgAAAAEAAAACQAAAFRleHQgZGF0YQAAAAcAAABfX2RvY19fAAQABAAEAAAAxv7//wAAAQIUAAAAlAAAAAgAAAAgAAAAAAAAABAAAABfcmVzcG9uc2VfaW5kZXhfAAAAAAEAAAAEAAAAwP7//1QAAAAEAAAARgAAAFRoZSBpbmRleCBvZiB0aGUgcmVzcG9uc2UgYW1vbmcgcmVzcG9uc2VzIHRvIHRoZSBjb3JyZXNwb25kaW5nIF9pbmRleF8AAAcAAABfX2RvY19fANj+//8AAAABQAAAAHL///8AAAEFFAAAAHwAAAAIAAAAHAAAAAAAAAANAAAAX3JlcGxpY2F0aW9uXwAAAAEAAAAEAAAAaP///zwAAAAEAAAALgAAAElEIG9mIHRoZSByZXBsaWNhdGlvbiB0aGUgcmVzcG9uc2UgYmVsb25ncyB0by4AAAcAAABfX2RvY19fAAQABgAEAAAAAAASABgACAAGAAcADAAAABAAFAASAAAAAAABAhQAAAB4AAAACAAAABQAAAAAAAAABwAAAF9pbmRleF8AAQAAAAwAAAAIAAwABAAIAAgAAAA0AAAABAAAACQAAABUaGUgaW5kZXggb2YgdGhlIGl0ZW0gaW4gdGhlIGRhdGFzZXQAAAAABwAAAF9fZG9jX18ACAAMAAgABwAIAAAAAAAAAUAAAAAAAAAA
  replications: 48
  workersPerReplica: 32
status:
  conditions:
    - lastTransitionTime: "2023-12-13T18:27:59Z"
      message: Evaluation is complete
      reason: Complete
      status: "True"
      type: Complete
    - lastTransitionTime: "2023-12-13T18:07:35Z"
      message: Evaluation is complete
      reason: Complete
      status: "False"
      type: Failed