Workflow status
Every Dyff workflow resource has .status and .reason fields that are set
by the platform to record the progress of the workflow. The following diagram
shows the possible paths that the .status of a resource can take. The boxes
with thick edges represent “terminal” statuses. Deleted is a special status
that we will describe at the end of this section.
Workflow status transitions
About status
When you create a new resource in Dyff, you are telling the Dyff
platform to execute the computations of an associated workflow in order to
progress its status from an initial Created status to a “success” status –
either Complete or Ready. For example, the Evaluation workflow
requires spinning up an InferenceSession containing one or more replicas of
an InferenceService, feeding data to the session from a Dataset, and
storing and verifying the outputs. This idea of the system trying to “reconcile”
the status of a resource is similar to how the Kubernetes platform works.
The .status field records the last “milestone” in the workflow that has been
reached. When you create a resource, it starts its lifecycle in the Created
status. The Created status means that the resource specification has been
added to the Dyff datastore, but no work has been done yet.
Many workflows require some computational work to happen. The Admitted
status means that this computational work has begun.
Some workflows do not result in any computation. For example, when uploading a
Dataset, the data passes from your local filesystem directly to URLs
obtained from Dyff, so the workflow never enters the Admitted status.
Terminal statuses can be divided into “success”, “failure”, and “early
termination”. The names of these statuses depend on the nature of the workflow.
For workflows that perform a computational “job”, like Evaluations, the success
status is called Complete, and the failure status is called Failed. For
workflows that produce an artifact that is meant to be consumed by other
workflows, such as building an InferenceService, the success status is
called Ready and the failure status is called Error. Any workflow that
has not reached a terminal status may be terminated by the user, or sometimes by
Dyff, in which case it enters the Terminated status.
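The grouping of terminal statuses described above can be sketched as a small helper. This is an illustrative sketch only, not part of the Dyff client: the string values are assumed to match the status names used in this section, the function names (is_terminal, outcome, wait_for_terminal) are hypothetical, and get_status stands in for whatever call you use to fetch the current .status of a resource. Deleted is deliberately left out because it is treated as a special status.

```python
import time
from typing import Callable, Optional

# Status names as written in this section. The exact string values used by
# the Dyff API are an assumption here.
SUCCESS_STATUSES = frozenset({"Complete", "Ready"})
FAILURE_STATUSES = frozenset({"Failed", "Error"})
EARLY_TERMINATION_STATUSES = frozenset({"Terminated"})
TERMINAL_STATUSES = (
    SUCCESS_STATUSES | FAILURE_STATUSES | EARLY_TERMINATION_STATUSES
)


def is_terminal(status: str) -> bool:
    """Return True if the status is one of the terminal statuses."""
    return status in TERMINAL_STATUSES


def outcome(status: str) -> Optional[str]:
    """Classify a terminal status; return None for non-terminal statuses."""
    if status in SUCCESS_STATUSES:
        return "success"
    if status in FAILURE_STATUSES:
        return "failure"
    if status in EARLY_TERMINATION_STATUSES:
        return "early termination"
    return None


def wait_for_terminal(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll a caller-supplied status getter until a terminal status appears.

    get_status is a hypothetical callable supplied by the caller; this
    sketch does not assume any particular Dyff client method.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if is_terminal(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in status {status!r} after timeout")
        time.sleep(poll_seconds)
```

Passing the status getter in as a callable keeps the sketch independent of how the resource is actually fetched, and makes the polling loop easy to exercise with canned values.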
All statuses
Created
The Created status means that the resource specification has been added to
the Dyff datastore, but no work has been done yet. The following reason
values are associated with the Created status:
None
The reason will be None if Dyff has not yet processed the resource specification. This is the reason you will see in the resource specification returned by the resource creation endpoints.
QuotaLimit
This means that the workflow is waiting to be admitted because admitting it would cause computational resource use to exceed one or more quotas that are set for your account. For example, you may have a quota of 1 GPU on your account. If you create two Evaluation resources that each require a GPU, one of those resources will wait in the Created status with reason = QuotaLimit.
UnsatisfiedDependency
This means that your workflow depends on a resource that has not yet reached an appropriate success status. For example, you might create a Report that references the results of an Evaluation when that evaluation is still running. The report will wait in the Created status with reason = UnsatisfiedDependency until the evaluation completes successfully.
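The reason values listed above can be turned into a short human-readable explanation of why a resource is still waiting. The following is a minimal sketch under stated assumptions: explain_created is a hypothetical helper, not a Dyff function, and it assumes that a missing reason arrives as Python None and that the other reasons arrive as the strings shown in this section.

```python
from typing import Optional


def explain_created(reason: Optional[str]) -> str:
    """Describe why a resource is still in the Created status.

    Assumes `reason` is None or one of the reason names documented for
    the Created status; anything else is reported as unrecognized.
    """
    if reason is None:
        return "Dyff has not yet processed the resource specification."
    if reason == "QuotaLimit":
        return "Waiting: admitting the workflow would exceed an account quota."
    if reason == "UnsatisfiedDependency":
        return "Waiting: a dependency has not yet reached a success status."
    return f"Unrecognized reason: {reason}"
```

For example, a Report created while its Evaluation is still running would map to the UnsatisfiedDependency message until the evaluation completes successfully.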
Admitted
The Admitted status means that computational work has begun in support of
the workflow. Currently, the reason will always be None in the
Admitted status.
None
The reason will be None if the workflow is in the first “stage” of its computation. Most workflows have only one computational step, so their reason will always be None in the Admitted status.
Ready and Complete
These statuses indicate that the workflow completed successfully. The reason
will be None.
Failed and Error
These statuses indicate that something went wrong. They will always have an
associated reason.
SchemaError
Applies to: all resources
This means that there was an error when creating the Kubernetes resource manifests needed to run the computational workloads for the workflow. This is usually due to a bug in Dyff; please report this to the developers.
FailedDependency
Applies to: all resources
This means that the workflow depends on a resource that is in a failed or deleted status.
InferenceFailed
Applies to: Evaluation
The inference step of an evaluation workflow failed. Typically, this indicates a problem with the underlying inference service. For example, it may have raised an exception for one of the inference inputs, or it might have taken too long to return a response, resulting in a timeout error.
VerificationFailed
Applies to: Evaluation
The verification step of an evaluation workflow failed. For example, there may be missing or duplicated responses. Usually, this is due to an internal error in the platform, as the inference step is supposed to check for these errors and retry the problematic instances. The verification step is a fail-safe that is expected to always succeed.
BuildFailed
Applies to: InferenceService
This is seen when an InferenceService calls for building a Docker container and the container build failed.
FetchFailed
Applies to: Model
This is seen when a Model calls for fetching model data from a remote source (e.g., downloading neural network weights from Hugging Face) and the fetch operation failed.
RunFailed
Applies to: Report
There was an error while running a report.
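The failure reasons above can be summarized as a lookup table keyed by reason. This is a sketch for illustration only: the table and the helper names (reason_applies, likely_platform_bug) are hypothetical, and the keys are assumed to match the reason strings exactly as written in this section.

```python
from typing import Dict, Optional

# Failure reason -> resource kind it applies to.
# None means the reason applies to all resources.
FAILURE_REASON_APPLIES_TO: Dict[str, Optional[str]] = {
    "SchemaError": None,
    "FailedDependency": None,
    "InferenceFailed": "Evaluation",
    "VerificationFailed": "Evaluation",
    "BuildFailed": "InferenceService",
    "FetchFailed": "Model",
    "RunFailed": "Report",
}

# Reasons this section describes as usually caused by a bug or internal
# error in the platform rather than by the submitted workflow.
USUALLY_PLATFORM_BUGS = frozenset({"SchemaError", "VerificationFailed"})


def reason_applies(reason: str, kind: str) -> bool:
    """Return True if a documented failure reason can occur for this kind."""
    if reason not in FAILURE_REASON_APPLIES_TO:
        return False
    applies_to = FAILURE_REASON_APPLIES_TO[reason]
    return applies_to is None or applies_to == kind


def likely_platform_bug(reason: str) -> bool:
    """Return True for reasons that usually warrant a report to developers."""
    return reason in USUALLY_PLATFORM_BUGS
```

A table like this is convenient for deciding whether a failure is worth retrying or should be reported, but the mapping is only as complete as the list of reasons in this section.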