Workflow status¶
Every Dyff workflow resource has .status
and .reason
fields that are set
by the platform to record the progress of the workflow. The following diagram
shows the possible paths that the .status
of a resource can take. The boxes
with thick edges represent “terminal” statuses. Deleted
is a special status
that we will describe at the end of this section.
About status¶
When you create a new resource in Dyff, you are telling the Dyff
platform to execute the computations of an associated workflow in order to
progress its status from an initial Created
status to a “success” status –
either Complete
or Ready
. For example, the Evaluation
workflow
requires spinning up an InferenceSession
containing one or more replicas of
an InferenceService
, feeding data to the session from a Dataset
, and
storing and verifying the outputs. This idea of the system trying to “reconcile”
the status of a resource is similar to how the Kubernetes platform works.
The .status
field records the last “milestone” in the workflow that has been
reached. When you create a resource, it starts its lifecycle in the Created
status. The Created
status means that the resource specification has been
added to the Dyff datastore, but no work has been done yet.
Many workflows require some computational work to happen. The Admitted
status means that this computational work has begun.
Some workflows do not result in any computation. For example, when uploading a
Dataset
, the data passes from your local filesystem directly to URLs
obtained from Dyff, so the workflow never enters the Admitted
status.
Terminal statuses can be divided into “success”, “failure”, and “early
termination”. The names of these statuses depend on the nature of the workflow.
For workflows perform a computational “job”, like Evaluations
, the success
status is called Complete
, and the failure status is called Failed
. For
workflows that produce an artifact that is meant to be consumed by other
workflows, such as building an InferenceService
, the success status is
called Ready
and the failure status is called Error
. Any workflow that
has not reached a terminal status may be terminated by the user, or sometimes by
Dyff, in which case it enters the Terminated
status.
All statuses¶
Created
¶
The Created
status means that the resource specification has been added to
the Dyff datastore, but no work has been done yet. The following reason
values are associated with the Created
status:
None
The
reason
will beNone
if Dyff has not yet processed the resource specification. This is thereason
you will see in the resource specification returned by the resource creation endpoints.QuotaLimit
This means that the workflow is waiting to be admitted because admitting it would cause computational resource use to exceed one or more quotas that are set for your account. For example, you may have a quota of 1 GPU on your account. If you create two
Evaluation
resources that each require a GPU, one of those resources will wait in theCreated
status withreason = QuotaLimit
.UnsatisfiedDependency
This means that your workflow depends on a resource that has not yet reached an appropriate success status. For example, you might create a
Report
that references the results of anEvaluation
when that evaluation is still running. The report will wait in theCreated
status withreason = UnsatisfiedDepencency
until the evaluation completes successfully.
Admitted
¶
The Admitted
status means that computational work has begun in support of
the workflow. Currently, the reason
will always be None
in the
Admitted
status.
None
The
reason
will beNone
if the workflow is in the first “stage” of its computation. Most workflows have only one computational step, so theirreason
will always beNone
in theAdmitted
status.
Ready
and Completed
¶
These statuses indicate that the workflow completed successfully. The reason
will be None
.
Failed
and Error
¶
These statuses indicate that something went wrong. They will always have an
associated reason
.
SchemaError
Applies to: all resources
This means that there was an error when creating the Kubernetes resource manifests needed to run the computational workloads for the workflow. This is usually due to a bug in Dyff; please report this to the developers.
FailedDependency
Applies to: all resources
This means that the workflow depends on a resource that is in a failed or deleted status.
InferenceFailed
Applies to:
Evaluation
The inference step of an evaluation workflow failed. Typically, this indicates a problem with the underlying inference service. For example, it may have raised an exception for one of the inference inputs, or it might have taken too long to return a response, resulting in a timeout error.
VerificationFailed
Applies to:
Evaluation
The verification step of an evaluation workflow failed. For example, there may be missing or duplicated responses. Usually, this is due to an internal error in the platform, as the inference step is supposed to check for these errors and retry the problematic instances. The verification step is a fail-safe that is expected to always succeed.
BuildFailed
Applies to:
InferenceService
This is seen when an
InferenceService
calls for building a Docker container and the container build failed.FetchFailed
Applies to:
Model
This is seen when a
Model
calls for fetching model data from a remote source (e.g., downloading neural network weights from Hugging Face) and the fetch operation failed.RunFailed
Applies to:
Report
There was an error while running a report.