dyff-operator
dyff-operator is the “kernel” of the platform. A Kubernetes operator is simply a daemon process called a controller that watches the k8s system state for events pertaining to certain custom resources and responds by taking actions via the k8s API.
dyff-operator controls the following Dyff k8s Custom Resource kinds:
Audit – audits.dyff.io
Evaluation – evaluations.dyff.io
InferenceService – inferenceservices.dyff.io
InferenceSession – inferencesessions.dyff.io
Model – models.dyff.io
Report – reports.dyff.io
We use the generic term workflows to refer to the computational work implied by one of these resources. For example, the evaluation workflow consists of:
1. Starting a Deployment with one or more replicas of an inference runner container, and an associated k8s Service to provide a web interface;
2. starting a Job that wraps an evaluation client container, which loads data from a dataset and performs inference on it by making HTTP requests to the inference service; and
3. once the client job is finished, starting a Job that wraps an output verification container, which checks that the output of the client job is complete and correctly formatted.
When an Evaluation resource is created, the controller sees this event and
creates the k8s resources needed for steps (1) and (2). The controller then
watches the Job resource that wraps the evaluation client. When the job
status reaches a completed state, the controller creates the additional k8s
resources needed for step (3). Throughout the process, the controller sets
conditions
on the /status subresource to indicate progress of the workflow or to signal
failures.
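The control flow described above can be sketched as a pure decision function. This is an illustrative sketch only, not the operator's actual code: the `Observed` class, the `next_actions` function, and the action names are all hypothetical, and the real controller acts through the k8s API rather than returning strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observed:
    """Hypothetical snapshot of the child resources of one Evaluation."""

    inference_service_exists: bool = False  # Deployment + Service, step (1)
    client_job_exists: bool = False         # evaluation client Job, step (2)
    client_job_state: Optional[str] = None  # None, "succeeded", or "failed"
    verify_job_exists: bool = False         # verification Job, step (3)
    verify_job_state: Optional[str] = None  # None, "succeeded", or "failed"


def next_actions(obs: Observed) -> List[str]:
    """Return the (hypothetical) actions one reconcile pass would take."""
    actions: List[str] = []
    # Steps (1) and (2) are created as soon as the Evaluation is seen.
    if not obs.inference_service_exists:
        actions.append("create_inference_service")
    if not obs.client_job_exists:
        actions.append("create_client_job")
    if actions:
        return actions
    # Watch the client Job; a failure is reported through a condition.
    if obs.client_job_state == "failed":
        return ["set_failed"]
    if obs.client_job_state != "succeeded":
        return []  # still running, nothing to do yet
    # Step (3) starts only after the client Job has completed.
    if not obs.verify_job_exists:
        return ["create_verify_job"]
    if obs.verify_job_state == "succeeded":
        return ["set_complete"]
    if obs.verify_job_state == "failed":
        return ["set_failed"]
    return []
```

Because the function depends only on the observed state, calling it repeatedly with the same snapshot yields the same actions, which is the property a controller needs when it handles the same event more than once.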
The “steps” of each workflow are implemented as arbitrary executable programs
packaged as Docker images. Currently, this code lives under dyff/apps:
evaluation_client: The “client” part of an Evaluation. Reads data from a dataset, makes inference API calls over HTTP, and writes the inference outputs to another dataset.
fetch_model: Downloads an ML model into storage.
mocks/inferenceservice: A mock inference service for testing.
run_report: Does the computational work of a Report. Reads output data from an Evaluation, applies a scoring Rubric, and writes the results to another dataset.
runners/vllm: The “server” part of an Evaluation. Uses the vLLM package to run LLMs.
verify_evaluation_output: Checks that Evaluation outputs are complete and correctly formatted.
Design Guidelines
dyff-operator developers should familiarize themselves with the Kubernetes API conventions. We don’t follow these to the letter, but they should be the default starting point for designing new functionality.
dyff-operator is a standalone component
The most important design principle for the dyff-operator is that it should be
useful by itself, using only the k8s API implemented by kubectl.
Components of the dyff-operator MUST NOT interact with any Dyff Platform services that are not managed by the dyff-operator itself.
dyff-operator components MUST NOT depend on the dyff-api package, only the dyff package (client components).
Dyff Custom Resources MUST contain all information necessary to run the workflow.
Specifically: If a resource A depends on a resource B, then any information about B needed to execute A should be included in the manifest of A. Workflows must not assume that other referenced resources will be present in the k8s system database.
Workflows and Workflow Steps
The controller executes workflows by creating one or more Pods that run Docker
containers that implement the steps of the workflow and passing them appropriate
arguments and configuration information. The Pods usually are managed indirectly
via a Job or other container resource. The controller may also create other
k8s resources like ConfigMap.
Workflow Steps MUST exit with an appropriate integer status code, either zero for “success” or non-zero for “failure”.
Rationale: This is necessary for container resources like Job to respond to failure statuses.
Workflow Steps MUST NOT interact with the Kubernetes API.
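As an illustration of the exit-code rule, a step written in Python might wrap its work so that any unhandled error becomes a non-zero exit status. This is a minimal sketch under assumptions: `exit_code_for` is a hypothetical helper, not part of the dyff package, and the existing steps under dyff/apps may be structured differently.

```python
import traceback
from typing import Callable


def exit_code_for(step: Callable[[], None]) -> int:
    """Run a workflow step and map its outcome to a process exit code.

    Returns 0 if the step finishes normally and 1 if it raises, so that
    the Job wrapping the container can observe the failure. The step only
    reports through its exit code and logs; it never calls the k8s API.
    """
    try:
        step()
    except Exception:
        # Print the traceback to stderr so it shows up in the Pod logs.
        traceback.print_exc()
        return 1
    return 0
```

The container entry point would then end with `sys.exit(exit_code_for(run_step))`, where `run_step` stands for whatever function does the step's real work.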
Communicating Workflow Status
Dyff Custom Resources MUST use k8s status conditions to indicate the progress of the workflow.
Example:
status:
  conditions:
  - lastTransitionTime: "2023-12-13T18:27:59Z"
    message: Evaluation is complete
    reason: Complete
    status: "True"
    type: Complete
  - lastTransitionTime: "2023-12-13T18:07:35Z"
    message: Evaluation is complete
    reason: Complete
    status: "False"
    type: Failed
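The following sketch shows one way such a conditions list could be maintained and queried. It follows the general Kubernetes convention that lastTransitionTime changes only when a condition's status changes. Whether dyff-operator applies exactly this rule is an assumption, and `set_condition` and `is_true` are hypothetical helper names.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

Condition = Dict[str, str]


def set_condition(
    conditions: List[Condition],
    type_: str,
    status: str,
    reason: str,
    message: str,
    now: Optional[datetime] = None,
) -> List[Condition]:
    """Return a new conditions list with the given condition set.

    lastTransitionTime is refreshed only when the status value changes.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    updated: List[Condition] = []
    found = False
    for cond in conditions:
        if cond["type"] != type_:
            updated.append(dict(cond))  # copy unrelated conditions unchanged
            continue
        found = True
        unchanged = cond["status"] == status
        updated.append({
            "lastTransitionTime": cond["lastTransitionTime"] if unchanged else stamp,
            "message": message,
            "reason": reason,
            "status": status,
            "type": type_,
        })
    if not found:
        updated.append({
            "lastTransitionTime": stamp,
            "message": message,
            "reason": reason,
            "status": status,
            "type": type_,
        })
    return updated


def is_true(conditions: List[Condition], type_: str) -> bool:
    """Report whether the condition of the given type has status "True"."""
    return any(c["type"] == type_ and c["status"] == "True" for c in conditions)
```

A client that only needs to know whether a workflow has finished can call `is_true(conditions, "Complete")` or `is_true(conditions, "Failed")` on the list read from the /status subresource.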
Example Dyff Resource Manifest
This is an example of an Evaluation manifest. This example includes the
/status subresource, which is managed by the dyff-operator.
apiVersion: dyff.io/v1alpha1
kind: Evaluation
metadata:
  labels:
    dyff.io/account: example
    dyff.io/component: evaluation
    dyff.io/id: f167452938ef4a8c91c7373a37bd6af0
    dyff.io/workflow: evaluation
  name: eval-f167452938ef4a8c91c7373a37bd6af0
  namespace: default
spec:
  account: example
  dataset: db5facc1e37f48e58db0d47297a3634c
  id: f167452938ef4a8c91c7373a37bd6af0
  inferenceSession:
    accelerator:
      gpu:
        hardwareTypes:
        - nvidia.com/gpu-a100
        memory: 16Gi
      kind: GPU
    args:
    - --model
    - /dyff/mnt/model/models--tiiuae--falcon-7b/snapshots/898df1396f35e447d5fe44e0a3ccaaaa69f30d36
    - --download-dir
    - /dyff/mnt/model
    - --dtype
    - float16
    dependencies:
    - kind: ReadOnlyVolume
      readOnlyVolume:
        claimName: model-371288ec69724bf8bebf51811c581f6b-rox
        mountPath: /dyff/mnt/model
        name: model
    env:
    - name: HF_DATASETS_OFFLINE
      value: "1"
    - name: HF_HOME
      value: /dyff/mnt/model
    - name: HUGGINGFACE_HUB_CACHE
      value: /dyff/mnt/model
    - name: TRANSFORMERS_CACHE
      value: /dyff/mnt/model
    - name: TRANSFORMERS_OFFLINE
      value: "1"
    image: us-central1-docker.pkg.dev/dyff-354017/dyff-system/ul-dsri/dyff/dyff/vllm-runner:latest
    replicas: 1
    resources:
      requests:
        memory: 16Gi
    useSpotPods: true
  interface:
    endpoint: generate
    inputPipeline:
    - configuration: '{"prompt": "$.text"}'
      kind: TransformJSON
    outputPipeline:
    - configuration: '{"collections": ["text"]}'
      kind: ExplodeCollections
    outputSchema:
      arrowSchema: /////ygDAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAMAAABkAgAAwAEAAAQAAAC6/f//AAABDBgAAABcAAAACAAAABwAAAABAAAAYAAAAAkAAAByZXNwb25zZXMAAAABAAAABAAAALD9//8gAAAABAAAABMAAABJbmZlcmVuY2UgcmVzcG9uc2VzAAcAAABfX2RvY19fAGT///8QABQACAAGAAcADAAAABAAEAAAAAAAAQ0YAAAAIAAAAAQAAAACAAAAeAAAABQAAAAEAAAAaXRlbQAAAACk////Zv7//wAAAQUUAAAAUAAAAAgAAAAUAAAAAAAAAAQAAAB0ZXh0AAAAAAEAAAAEAAAAVP7//xgAAAAEAAAACQAAAFRleHQgZGF0YQAAAAcAAABfX2RvY19fAAQABAAEAAAAxv7//wAAAQIUAAAAlAAAAAgAAAAgAAAAAAAAABAAAABfcmVzcG9uc2VfaW5kZXhfAAAAAAEAAAAEAAAAwP7//1QAAAAEAAAARgAAAFRoZSBpbmRleCBvZiB0aGUgcmVzcG9uc2UgYW1vbmcgcmVzcG9uc2VzIHRvIHRoZSBjb3JyZXNwb25kaW5nIF9pbmRleF8AAAcAAABfX2RvY19fANj+//8AAAABQAAAAHL///8AAAEFFAAAAHwAAAAIAAAAHAAAAAAAAAANAAAAX3JlcGxpY2F0aW9uXwAAAAEAAAAEAAAAaP///zwAAAAEAAAALgAAAElEIG9mIHRoZSByZXBsaWNhdGlvbiB0aGUgcmVzcG9uc2UgYmVsb25ncyB0by4AAAcAAABfX2RvY19fAAQABgAEAAAAAAASABgACAAGAAcADAAAABAAFAASAAAAAAABAhQAAAB4AAAACAAAABQAAAAAAAAABwAAAF9pbmRleF8AAQAAAAwAAAAIAAwABAAIAAgAAAA0AAAABAAAACQAAABUaGUgaW5kZXggb2YgdGhlIGl0ZW0gaW4gdGhlIGRhdGFzZXQAAAAABwAAAF9fZG9jX18ACAAMAAgABwAIAAAAAAAAAUAAAAAAAAAA
  replications: 48
  workersPerReplica: 32
status:
  conditions:
  - lastTransitionTime: "2023-12-13T18:27:59Z"
    message: Evaluation is complete
    reason: Complete
    status: "True"
    type: Complete
  - lastTransitionTime: "2023-12-13T18:07:35Z"
    message: Evaluation is complete
    reason: Complete
    status: "False"
    type: Failed