Storage¶
Dyff workflows correspond to kinds of data records. Each record has a globally
unique .id field and belongs to a single .account. The .id is a UUID4
identifier encoded as a hexadecimal string with the dashes removed. Large data
associated with workflows is stored in object storage “buckets” at a path that
can be determined from the corresponding entity.
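For illustration, the following minimal sketch shows how an identifier in this
format can be produced in Python; the variable name is ours, and the exact
generation code used by Dyff may differ.

```python
import uuid

# A record .id is a UUID4 rendered as hex with the dashes removed,
# i.e. a 32-character hexadecimal string.
record_id = uuid.uuid4().hex
assert len(record_id) == 32 and "-" not in record_id
print(record_id)
```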
Kafka¶
The dyff.workflows.state topic in Kafka is the source of truth for workflow
records. Services that require access to stored workflow records are
implemented as Kafka consumers of this topic. Services may build local
persistent data stores derived from the Kafka topic. They must always be able
to reconstruct their local store by replaying the Kafka messages.
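As a rough sketch of this pattern (not the actual Dyff service code), a
consumer that rebuilds a local in-memory store by replaying
dyff.workflows.state from the beginning might look like the following. The
broker address, consumer group id, partition count, and the JSON message
encoding keyed by .id are all assumptions.

```python
import json
from confluent_kafka import Consumer, TopicPartition

# Hypothetical configuration; broker and group id are assumptions.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "my-derived-store",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})

local_store: dict[str, dict] = {}  # .id -> latest record state

# Replay the topic from offset 0 (assuming a single partition for brevity).
consumer.assign([TopicPartition("dyff.workflows.state", 0, 0)])
try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            break  # caught up; a real service would keep consuming
        if msg.error():
            raise RuntimeError(msg.error())
        record = json.loads(msg.value())
        local_store[msg.key().decode()] = record
finally:
    consumer.close()
```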
Dyff API Data Store¶
The dyff-api service maintains a database of workflow records to facilitate
user queries. We refer to this database as the datastore. The current version
of Dyff uses MongoDB as its database engine, but the datastore is exposed
through a DB-agnostic interface, and the intention is for future versions to
support swappable datastore backends.
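The sketch below illustrates the general shape of such a DB-agnostic interface
with a MongoDB implementation behind it; the class and method names here are
purely illustrative and do not reflect Dyff’s actual interface.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Optional

import pymongo


class QueryBackend(ABC):
    """Hypothetical DB-agnostic datastore interface (illustrative only)."""

    @abstractmethod
    def get(self, kind: str, id: str) -> Optional[dict]: ...

    @abstractmethod
    def query(self, kind: str, **filters) -> Iterable[dict]: ...


class MongoDBBackend(QueryBackend):
    """MongoDB implementation of the illustrative interface."""

    def __init__(self, uri: str, database: str = "workflows"):
        self._db = pymongo.MongoClient(uri)[database]

    def get(self, kind: str, id: str) -> Optional[dict]:
        return self._db[kind].find_one({"_id": id})

    def query(self, kind: str, **filters) -> Iterable[dict]:
        return self._db[kind].find(filters)
```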
Bulk Storage in Buckets¶
Large data artifacts such as the actual input instances associated with a
dataset are stored in “storage buckets” in canonical paths that are determined
by the .id of the associated workflow and certain other properties. Object
storage is manipulated through an abstract interface that supports swappable
backends. By default, Dyff uses an s3 backend, which is compatible with all
major cloud providers and with the minio system.
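To give a feel for how an s3-compatible backend is used, here is a minimal
sketch of an upload; the bucket name, path layout, endpoint, and credentials
are assumptions, and the real canonical path is computed by Dyff’s storage
layer from the record’s .id and other properties.

```python
import boto3

# Hypothetical record id and path scheme, for illustration only.
dataset_id = "8b6e2f0c4a5d4e1f9c3b7a2d6e8f0a1b"
key = f"datasets/{dataset_id}/data.parquet"

# The same s3 client works against AWS and S3-compatible stores such as
# MinIO; only the endpoint URL and credentials change.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",  # assumption: a local MinIO deployment
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)
s3.upload_file("data.parquet", "dyff-datasets", key)
```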