Storage

Dyff workflows correspond to kinds of data records. Each record has a globally unique .id field and belongs to a single .account. The .id is a UUIDv4 identifier encoded as a hexadecimal string with the dashes removed. Large data associated with a workflow is stored in object storage “buckets” at a path that can be derived from the corresponding entity.
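
The following minimal sketch illustrates the .id encoding and a bucket path derived from it. The bucket name, account value, and path layout are hypothetical; only the dashless UUIDv4 encoding comes from the description above.

```python
import uuid

# A UUIDv4 rendered as a hexadecimal string with the dashes removed,
# as described above.
record_id = uuid.uuid4().hex          # e.g. "3f2c9a0d6b1e4b8f9a7c5d2e1f0a9b8c"
account = "example-account"           # hypothetical account identifier

# Hypothetical canonical object-storage path derived from the record's .id;
# the real path scheme is determined by Dyff's storage layer.
artifact_path = f"s3://dyff-data/datasets/{record_id}/data.parquet"
print(record_id, artifact_path)
```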

Kafka

The dyff.workflows.state topic in Kafka is the source of truth for workflow records. Services that require access to stored workflow records are implemented as Kafka consumers of this topic. Services may build local persistent data stores derived from the topic, but they must always be able to reconstruct those stores by replaying the Kafka messages.
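
As a rough sketch of the replay pattern, the consumer below rebuilds a local view of workflow records by reading the topic from the beginning. The kafka-python client, broker address, JSON message format, and in-memory store are assumptions for illustration; only the topic name comes from the text.

```python
import json
from kafka import KafkaConsumer  # kafka-python; the actual client library may differ

# Replay the topic from the earliest offset to reconstruct local state.
consumer = KafkaConsumer(
    "dyff.workflows.state",
    bootstrap_servers="localhost:9092",   # assumed broker address
    auto_offset_reset="earliest",         # start from the beginning of the log
    enable_auto_commit=False,
    consumer_timeout_ms=10_000,           # stop iterating when caught up (for the sketch)
)

local_store: dict[str, dict] = {}  # in-memory stand-in for a persistent store
for message in consumer:
    record = json.loads(message.value)
    local_store[record["id"]] = record  # the latest message for a given .id wins
```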


Dyff API Data Store

The dyff-api service maintains a database of workflow records to facilitate user queries. We refer to this database as the datastore. The current version of Dyff uses MongoDB as its database engine, but the datastore is exposed through a DB-agnostic interface, and the intention is for future versions to support swappable datastore backends.
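
A hypothetical sketch of what a DB-agnostic datastore interface with a MongoDB backend could look like is shown below. The class and method names are illustrative, not the actual dyff-api interface; pymongo is assumed as the MongoDB client.

```python
from abc import ABC, abstractmethod
from pymongo import MongoClient


class QueryBackend(ABC):
    """Illustrative DB-agnostic interface for workflow record queries."""

    @abstractmethod
    def get(self, kind: str, id: str) -> dict | None: ...

    @abstractmethod
    def query(self, kind: str, **filters) -> list[dict]: ...


class MongoDBBackend(QueryBackend):
    """One possible backend implementation; others could be swapped in."""

    def __init__(self, connection_string: str, database: str = "workflows"):
        self._db = MongoClient(connection_string)[database]

    def get(self, kind: str, id: str) -> dict | None:
        return self._db[kind].find_one({"id": id})

    def query(self, kind: str, **filters) -> list[dict]:
        return list(self._db[kind].find(filters))
```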

Bulk Storage in Buckets

Large data artifacts, such as the actual input instances associated with a dataset, are stored in “storage buckets” at canonical paths determined by the .id of the associated workflow and certain other properties. Object storage is manipulated through an abstract interface that supports swappable backends. By default, Dyff uses an S3 backend, which is compatible with all major cloud providers and with the MinIO system.
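
The sketch below shows one way such a swappable storage interface might look, with an S3 backend that can also point at MinIO via a custom endpoint URL. The class names, bucket, key layout, and endpoint are assumptions for illustration; boto3 is assumed as the S3 client, and the real Dyff storage abstraction will differ.

```python
from abc import ABC, abstractmethod

import boto3


class StorageBackend(ABC):
    """Illustrative abstract object-storage interface with swappable backends."""

    @abstractmethod
    def put(self, key: str, local_path: str) -> None: ...


class S3Backend(StorageBackend):
    def __init__(self, bucket: str, endpoint_url: str | None = None):
        # A custom endpoint_url lets the same code target MinIO or any
        # S3-compatible service instead of AWS.
        self._s3 = boto3.client("s3", endpoint_url=endpoint_url)
        self._bucket = bucket

    def put(self, key: str, local_path: str) -> None:
        self._s3.upload_file(local_path, self._bucket, key)


def dataset_key(dataset_id: str, filename: str) -> str:
    # Hypothetical canonical key derived from the workflow record's .id.
    return f"datasets/{dataset_id}/{filename}"


backend = S3Backend(bucket="dyff-data", endpoint_url="http://localhost:9000")
backend.put(dataset_key("3f2c9a0d6b1e4b8f9a7c5d2e1f0a9b8c", "data.parquet"),
            "data.parquet")
```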