dyff-operator¶
dyff-operator is the “kernel” of the platform. A Kubernetes operator is simply a daemon process called a controller that watches the k8s system state for events pertaining to certain custom resources and responds by taking actions via the k8s API.
dyff-operator controls the following Dyff k8s Custom Resource kinds:
Audit – audits.dyff.io
Evaluation – evaluations.dyff.io
InferenceService – inferenceservices.dyff.io
InferenceSession – inferencesessions.dyff.io
Model – models.dyff.io
Report – reports.dyff.io
We use the generic term workflows to refer to the computational work implied by one of these resources. For example, the evaluation workflow consists of:
1. Starting a Deployment with one or more replicas of an inference runner container, and an associated k8s Service to provide a web interface;
2. starting a Job that wraps an evaluation client container, which loads data from a dataset and performs inference on it by making HTTP requests to the inference service; and
3. once the client job is finished, starting a Job that wraps an output verification container, which checks that the output of the client job is complete and correctly formatted.
When an Evaluation resource is created, the controller sees this event and creates the k8s resources needed for steps (1) and (2). The controller then watches the Job resource that wraps the evaluation client. When the job status reaches a completed state, the controller creates the additional k8s resources needed for step (3). Throughout the process, the controller sets conditions on the /status subresource to indicate progress of the workflow or to signal failures.
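The decision logic described above can be sketched as a pure function from observed state to actions. This is a minimal illustration, not the operator's actual implementation: the `ObservedState` type, the status strings, and the action names are all hypothetical and exist only to make the ordering of the steps concrete.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObservedState:
    """Hypothetical snapshot of the k8s resources owned by one Evaluation."""

    inference_deployment_exists: bool = False
    # None means the Job has not been created yet (assumed status strings).
    client_job_status: Optional[str] = None        # "Running", "Succeeded", "Failed"
    verification_job_status: Optional[str] = None  # "Running", "Succeeded", "Failed"


def next_actions(state: ObservedState) -> List[str]:
    """Return the actions a controller might take for the observed state."""
    # A failed Job in any step ends the workflow with a failure condition.
    if "Failed" in (state.client_job_status, state.verification_job_status):
        return ["set_condition:Failed"]
    # Step (3) finished successfully: the workflow is complete.
    if state.verification_job_status == "Succeeded":
        return ["set_condition:Complete"]

    actions: List[str] = []
    # Steps (1) and (2) are created together when the resource first appears.
    if not state.inference_deployment_exists:
        actions += ["create:Deployment", "create:Service"]
    if state.client_job_status is None:
        actions.append("create:Job/evaluation-client")
    # Step (3) starts only after the client Job has completed.
    elif state.client_job_status == "Succeeded" and state.verification_job_status is None:
        actions.append("create:Job/verify-output")
    return actions
```

For a newly created resource, `next_actions(ObservedState())` returns the three create actions for steps (1) and (2); while the client Job is still running, it returns an empty list and the controller simply waits for the next event.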
The “steps” of each workflow are implemented as arbitrary executable programs packaged as Docker images. Currently, this code lives under dyff/apps:
evaluation_client
The “client” part of an Evaluation. Reads data from a dataset, makes inference API calls over HTTP, and writes the inference outputs to another dataset.
fetch_model
Downloads an ML model into storage.
mocks/inferenceservice
A mock inference service for testing.
run_report
Does the computational work of a Report. Reads output data from an Evaluation, applies a scoring Rubric, and writes the results to another dataset.
runners/vllm
The “server” part of an Evaluation. Uses the vLLM package to run LLMs.
verify_evaluation_output
Checks that Evaluation outputs are complete and correctly formatted.
Design Guidelines¶
dyff-operator developers should familiarize themselves with the Kubernetes API conventions. We don’t follow these to the letter, but they should be the default starting point for designing new functionality.
dyff-operator is a standalone component¶
The most important design principle for the dyff-operator is that it should be useful by itself, using only the k8s API as exposed by kubectl.
Components of the dyff-operator MUST NOT interact with any Dyff Platform services that are not managed by the dyff-operator itself.
dyff-operator components MUST NOT depend on the dyff-api package, only the dyff package (client components).
Dyff Custom Resources MUST contain all information necessary to run the workflow.
Specifically: If a resource A depends on a resource B, then any information about B needed to execute A should be included in the manifest of A. Workflows must not assume that other referenced resources will be present in the k8s system database.
Workflows and Workflow Steps¶
The controller executes workflows by creating one or more Pods that run Docker containers that implement the steps of the workflow, and by passing them appropriate arguments and configuration information. The Pods are usually managed indirectly via a Job or other container resource. The controller may also create other k8s resources like ConfigMap.
Workflow Steps MUST exit with an appropriate integer status code, either zero for “success” or non-zero for “failure”.
Rationale: This is necessary for container resources like Job to respond to failure statuses.
Workflow Steps MUST NOT interact with the Kubernetes API.
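As an illustration of the exit-code requirement, a step's entry point might wrap its work so that any unhandled error becomes a non-zero status. This is only a sketch: `run_step` is a hypothetical helper, not part of the dyff package, and the step body is passed in as an ordinary callable.

```python
import sys
from typing import Callable


def run_step(step: Callable[[], None]) -> int:
    """Run a workflow step and map its outcome to a process exit code."""
    try:
        step()
    except Exception as exc:
        # Any unhandled error counts as failure; report it on stderr so it
        # shows up in the container logs.
        print(f"step failed: {exc}", file=sys.stderr)
        return 1
    return 0
```

A container entry point would then end with something like `sys.exit(run_step(my_step))`, where `my_step` is the function that does the step's work. Note that the wrapper reports status only through the exit code and never calls the Kubernetes API.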
Communicating Workflow Status¶
Dyff Custom Resources MUST use k8s status conditions to indicate the progress of the workflow.
Example:
status:
  conditions:
  - lastTransitionTime: "2023-12-13T18:27:59Z"
    message: Evaluation is complete
    reason: Complete
    status: "True"
    type: Complete
  - lastTransitionTime: "2023-12-13T18:07:35Z"
    message: Evaluation is complete
    reason: Complete
    status: "False"
    type: Failed
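One way to maintain such a list is an insert-or-update helper keyed on the condition type. The sketch below is an assumption about how this could be done, not the operator's code: `set_condition` is a hypothetical name, and the `InProgress` reason in the usage note is invented for illustration. It follows the Kubernetes convention that lastTransitionTime changes only when the status value changes.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

Condition = Dict[str, str]


def set_condition(
    conditions: List[Condition],
    type_: str,
    status: str,
    reason: str,
    message: str,
    now: Optional[datetime] = None,
) -> List[Condition]:
    """Insert or update the condition of the given type, in place."""
    timestamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    for cond in conditions:
        if cond["type"] == type_:
            # Only a change of status counts as a transition.
            if cond["status"] != status:
                cond["lastTransitionTime"] = timestamp
            cond["status"] = status
            cond["reason"] = reason
            cond["message"] = message
            return conditions
    # No condition of this type yet: append a new one.
    conditions.append(
        {
            "lastTransitionTime": timestamp,
            "message": message,
            "reason": reason,
            "status": status,
            "type": type_,
        }
    )
    return conditions
```

For example, calling `set_condition(conds, "Complete", "False", "InProgress", "Evaluation is running")` when the workflow starts and `set_condition(conds, "Complete", "True", "Complete", "Evaluation is complete")` when it finishes leaves a single `Complete` condition whose timestamp records the moment of completion.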
Example Dyff Resource Manifest¶
This is an example of an Evaluation manifest. This example includes the /status subresource, which is managed by the dyff-operator.
apiVersion: dyff.io/v1alpha1
kind: Evaluation
metadata:
  labels:
    dyff.io/account: example
    dyff.io/component: evaluation
    dyff.io/id: f167452938ef4a8c91c7373a37bd6af0
    dyff.io/workflow: evaluation
  name: eval-f167452938ef4a8c91c7373a37bd6af0
  namespace: default
spec:
  account: example
  dataset: db5facc1e37f48e58db0d47297a3634c
  id: f167452938ef4a8c91c7373a37bd6af0
  inferenceSession:
    accelerator:
      gpu:
        hardwareTypes:
        - nvidia.com/gpu-a100
        memory: 16Gi
      kind: GPU
    args:
    - --model
    - /dyff/mnt/model/models--tiiuae--falcon-7b/snapshots/898df1396f35e447d5fe44e0a3ccaaaa69f30d36
    - --download-dir
    - /dyff/mnt/model
    - --dtype
    - float16
    dependencies:
    - kind: ReadOnlyVolume
      readOnlyVolume:
        claimName: model-371288ec69724bf8bebf51811c581f6b-rox
        mountPath: /dyff/mnt/model
        name: model
    env:
    - name: HF_DATASETS_OFFLINE
      value: "1"
    - name: HF_HOME
      value: /dyff/mnt/model
    - name: HUGGINGFACE_HUB_CACHE
      value: /dyff/mnt/model
    - name: TRANSFORMERS_CACHE
      value: /dyff/mnt/model
    - name: TRANSFORMERS_OFFLINE
      value: "1"
    image: us-central1-docker.pkg.dev/dyff-354017/dyff-system/ul-dsri/dyff/dyff/vllm-runner:latest
    replicas: 1
    resources:
      requests:
        memory: 16Gi
    useSpotPods: true
  interface:
    endpoint: generate
    inputPipeline:
    - configuration: '{"prompt": "$.text"}'
      kind: TransformJSON
    outputPipeline:
    - configuration: '{"collections": ["text"]}'
      kind: ExplodeCollections
    outputSchema:
      arrowSchema: /////ygDAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAMAAABkAgAAwAEAAAQAAAC6/f//AAABDBgAAABcAAAACAAAABwAAAABAAAAYAAAAAkAAAByZXNwb25zZXMAAAABAAAABAAAALD9//8gAAAABAAAABMAAABJbmZlcmVuY2UgcmVzcG9uc2VzAAcAAABfX2RvY19fAGT///8QABQACAAGAAcADAAAABAAEAAAAAAAAQ0YAAAAIAAAAAQAAAACAAAAeAAAABQAAAAEAAAAaXRlbQAAAACk////Zv7//wAAAQUUAAAAUAAAAAgAAAAUAAAAAAAAAAQAAAB0ZXh0AAAAAAEAAAAEAAAAVP7//xgAAAAEAAAACQAAAFRleHQgZGF0YQAAAAcAAABfX2RvY19fAAQABAAEAAAAxv7//wAAAQIUAAAAlAAAAAgAAAAgAAAAAAAAABAAAABfcmVzcG9uc2VfaW5kZXhfAAAAAAEAAAAEAAAAwP7//1QAAAAEAAAARgAAAFRoZSBpbmRleCBvZiB0aGUgcmVzcG9uc2UgYW1vbmcgcmVzcG9uc2VzIHRvIHRoZSBjb3JyZXNwb25kaW5nIF9pbmRleF8AAAcAAABfX2RvY19fANj+//8AAAABQAAAAHL///8AAAEFFAAAAHwAAAAIAAAAHAAAAAAAAAANAAAAX3JlcGxpY2F0aW9uXwAAAAEAAAAEAAAAaP///zwAAAAEAAAALgAAAElEIG9mIHRoZSByZXBsaWNhdGlvbiB0aGUgcmVzcG9uc2UgYmVsb25ncyB0by4AAAcAAABfX2RvY19fAAQABgAEAAAAAAASABgACAAGAAcADAAAABAAFAASAAAAAAABAhQAAAB4AAAACAAAABQAAAAAAAAABwAAAF9pbmRleF8AAQAAAAwAAAAIAAwABAAIAAgAAAA0AAAABAAAACQAAABUaGUgaW5kZXggb2YgdGhlIGl0ZW0gaW4gdGhlIGRhdGFzZXQAAAAABwAAAF9fZG9jX18ACAAMAAgABwAIAAAAAAAAAUAAAAAAAAAA
  replications: 48
  workersPerReplica: 32
status:
  conditions:
  - lastTransitionTime: "2023-12-13T18:27:59Z"
    message: Evaluation is complete
    reason: Complete
    status: "True"
    type: Complete
  - lastTransitionTime: "2023-12-13T18:07:35Z"
    message: Evaluation is complete
    reason: Complete
    status: "False"
    type: Failed
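A client that has loaded this manifest into a dictionary could summarize the workflow state from the conditions alone. The function below is a hypothetical sketch: the name `workflow_outcome` and its return strings are not part of Dyff, and it assumes that a condition with status "True" means the state named by its type currently holds.

```python
from typing import Any, Dict


def workflow_outcome(manifest: Dict[str, Any]) -> str:
    """Summarize a manifest's status conditions as a single word."""
    conditions = manifest.get("status", {}).get("conditions", [])
    # Collect the types of all conditions that currently hold.
    active = {c["type"] for c in conditions if c.get("status") == "True"}
    if "Failed" in active:
        return "failed"
    if "Complete" in active:
        return "complete"
    # No terminal condition is set (or /status has not been populated yet).
    return "in progress"
```

Applied to the manifest above, where `Complete` is "True" and `Failed` is "False", the function returns "complete".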