Production cluster

This guide walks through the deployment and administration of a production-grade instance of the Dyff platform.

Dyff architecture block diagram.

About the Dyff platform

Dyff is a “batteries-included” platform for conducting AI safety research. Rather than providing only Python libraries for modeling safety cases, Dyff offers a large-scale Kubernetes-native platform that supports testing Large Language Models (LLMs) at full speed and against large datasets.

In addition, the Dyff platform is completely open source, including the component projects as well as the hosted version of Dyff available at app.dyff.io. This means that Dyff can be fully self-hosted at your organization’s site, so you can perform safety evaluations without your organization’s data or models leaving your data center. You can also study our live production deployment to learn how to operate Dyff successfully.

Dyff is designed from the ground up to run on any Kubernetes cluster, with no dependence on any specific cloud provider. That said, our autoscaling functionality has been tested primarily on Google Cloud, and we are most familiar with operating production Dyff in that context.

If you encounter issues with operating a Dyff component in a certain cloud environment, please create an issue on the relevant Dyff project and we can start a conversation about it.

If you are not ready for production and just want a local development cluster, see the Local cluster tutorial.

Platform components

A typical Dyff platform deployment consists of the following components.

Backing services

In addition to the Dyff-specific components, the following backing services are typically present:

  • Kafka preserves the entire history of workflow events and is the system of record for workflows.

  • MongoDB stores the auth database and provides a materialized view of workflow results.

  • S3-compatible object storage holds Dyff inputs, intermediate artifacts, and outputs.
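
The division of labor between Kafka and MongoDB described above follows a common event-sourcing pattern: an append-only log is the source of truth, and a queryable view is derived from it. The sketch below illustrates that pattern only; it is not Dyff's implementation. The `WorkflowEvent` type, its fields, the status names, and the `materialize` function are all hypothetical, and a plain Python list and dict stand in for Kafka and MongoDB so that the example runs without either service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowEvent:
    """One entry in the append-only event log (stand-in for a Kafka record)."""

    workflow_id: str
    status: str  # hypothetical status names: "Created", "Running", "Completed"


def materialize(events):
    """Fold the full event history into a 'latest state per workflow' view.

    The returned dict stands in for the MongoDB materialized view.
    """
    view = {}
    for event in events:  # events are applied in log order
        view[event.workflow_id] = {
            "_id": event.workflow_id,
            "status": event.status,
        }
    return view


# The log is the system of record: it keeps every event, not just the latest.
log = [
    WorkflowEvent("eval-1", "Created"),
    WorkflowEvent("eval-2", "Created"),
    WorkflowEvent("eval-1", "Running"),
    WorkflowEvent("eval-1", "Completed"),
]

# The view holds only the current state and can be rebuilt from the log.
view = materialize(log)
print(view["eval-1"]["status"])
print(view["eval-2"]["status"])
```

Running the sketch prints `Completed` and then `Created`. Because the view is a pure function of the log, it can be discarded and rebuilt by replaying the history, which is the practical consequence of treating the event log as the system of record.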