Create a Dataset

In this section, we will create our first Dyff resource: the Dataset.

Create an Arrow dataset locally

Dyff uses the Apache Arrow format for storing datasets. To upload a new dataset, you first need to create an Arrow dataset locally. Arrow is a column-oriented format that is largely inter-convertible with Pandas DataFrames and JSON. Arrow is strongly typed, and every Arrow dataset must have an associated schema. In this guide, we’re going to create a simple text dataset with a single data column called text.
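To get a feel for the format, here is a minimal standalone sketch, using plain pyarrow only (no Dyff involved), of round-tripping row-oriented data through an Arrow table:

import pyarrow

# Arrow stores data by column; from_pylist infers a schema from the
# row dicts unless one is passed explicitly.
table = pyarrow.Table.from_pylist([{"text": "Call me Ishmael. "}])
print(table.schema)       # text: string
print(table.to_pylist())  # back to a list of row dicts
# table.to_pandas() gives the equivalent DataFrame (requires pandas)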

Define some example text data

For this guide, we’re going to create a simple text dataset whose rows are opening phrases from some famous novels:

DATA = [
    "Call me Ishmael. ",
    "It was the best of times, ",
    "A screaming comes across the sky. ",
    "It was a bright cold day in April, ",
    "In a hole in the ground there lived a hobbit. ",
]

Define the dataset schema

We’ll use some utilities from the Dyff client library to create the necessary schema:

schema = DataSchema.make_input_schema(text.Text)
arrow_schema = arrow.decode_schema(schema.arrowSchema)

Dyff expects datasets to have certain additional “metadata” columns and structure. The required structure is described in detail in the Data Schemas guide. For now, you just need to know that input datasets must have an integer column called _index_ that uniquely identifies each row in the dataset. The DataSchema.make_input_schema() function creates a schema with the required structure for input datasets. Its argument is the schema of the data itself; in this case, that’s dyff.schema.dataset.text.Text, which has a single field text of type str. After creating the schema, we “decode” the Arrow schema: the DataSchema object stores it in a binary-encoded form, and arrow.decode_schema() recovers the corresponding pyarrow.Schema Python object.
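As a quick check (a sketch, not a required part of the workflow), you can inspect the decoded pyarrow.Schema directly; the exact set of metadata columns may vary with your Dyff version:

# arrow_schema is an ordinary pyarrow.Schema. Its fields should include
# the required _index_ column alongside the declared text column.
print(arrow_schema.names)
assert "_index_" in arrow_schema.names
assert "text" in arrow_schema.names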

Generate the dataset

Now we define the data-generating function. To write an Arrow dataset to disk, we need an Iterable of pyarrow.RecordBatch objects. Record batches can be constructed in several ways; usually the easiest is from what PyArrow calls the pylist format, a list of dict objects in which each dict represents one “row” of the dataset. When generating the data, we also need to add the metadata fields required by Dyff; in this case, that’s the _index_ field:

def data_generator():
    pylist = [{"_index_": i, "text": t} for i, t in enumerate(DATA)]
    yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)
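A single RecordBatch is fine for five rows. For larger datasets, the same pattern extends to yielding many smaller batches; the sketch below is illustrative (the batch size of 1000 is an arbitrary choice, not a Dyff requirement), and note that _index_ must stay globally unique across batches:

def data_generator_batched(batch_size: int = 1000):
    # Yield the data in chunks so no single RecordBatch has to hold the
    # whole dataset in memory; _index_ is offset by the chunk start so
    # it remains unique across batches.
    for start in range(0, len(DATA), batch_size):
        chunk = DATA[start : start + batch_size]
        pylist = [{"_index_": start + i, "text": t} for i, t in enumerate(chunk)]
        yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)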

Write the dataset to disk

Now we need to write the dataset in Arrow format. When interacting with Dyff Platform datasets, always use the functions in dyff.schema.dataset.arrow rather than the corresponding pyarrow functions, because the ones in the Dyff client library set some necessary default parameters. We’ll use a temporary directory to hold the files.

with tempfile.TemporaryDirectory() as tmpdir:
    arrow.write_dataset(
        data_generator(), output_path=tmpdir, feature_schema=arrow_schema
    )
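If you want to sanity-check the files before uploading, you can read them back inside the same with block using plain pyarrow. This is an optional sketch, and it assumes the written files are in a format that pyarrow.dataset can discover automatically (Parquet in the typical case):

    # Optional sanity check (inside the with block): read the files back
    # with plain pyarrow and print the rows, including the _index_ column.
    import pyarrow.dataset

    written = pyarrow.dataset.dataset(tmpdir)
    print(written.to_table().to_pylist())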

Upload the dataset

Uploading the dataset happens in two steps. First, you create a Dataset record; Dyff assigns a unique ID to the dataset and returns a full Dataset object. Second, you upload the actual data, providing the Dataset object so that Dyff knows where to store it. The hashes of the dataset files must match the hashes calculated in the create step. Continuing inside the with block from the previous step:

    dataset = dyffapi.datasets.create_arrow_dataset(
        tmpdir,
        account=ACCOUNT,
        name="famous-first-phrases",
    )
    dyffapi.datasets.upload_arrow_dataset(dataset, tmpdir)

    print(dataset.json(indent=2))
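The printed JSON is the complete Dataset record. In later steps you will typically refer to the dataset by its platform-assigned ID; a small usage sketch:

# The ID was assigned by Dyff in the create step; you’ll use it to
# reference this dataset in later steps (e.g., when running evaluations).
print(dataset.id)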

Full example

import tempfile

import pyarrow

from dyff.audit.local import DyffLocalPlatform
from dyff.client import Client
from dyff.schema.dataset import arrow, text
from dyff.schema.platform import DataSchema

dyffapi: Client | DyffLocalPlatform = ...
ACCOUNT: str = ...

DATA = [
    "Call me Ishmael. ",
    "It was the best of times, ",
    "A screaming comes across the sky. ",
    "It was a bright cold day in April, ",
    "In a hole in the ground there lived a hobbit. ",
]

schema = DataSchema.make_input_schema(text.Text)
arrow_schema = arrow.decode_schema(schema.arrowSchema)


def data_generator():
    pylist = [{"_index_": i, "text": t} for i, t in enumerate(DATA)]
    yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)


with tempfile.TemporaryDirectory() as tmpdir:
    arrow.write_dataset(
        data_generator(), output_path=tmpdir, feature_schema=arrow_schema
    )

    dataset = dyffapi.datasets.create_arrow_dataset(
        tmpdir,
        account=ACCOUNT,
        name="famous-first-phrases",
    )
    dyffapi.datasets.upload_arrow_dataset(dataset, tmpdir)

    print(dataset.json(indent=2))
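One loose end: dyffapi is deliberately left as ... above. As a rough, version-dependent sketch (the constructor arguments are assumptions; consult the dyff client documentation for your version), instantiation might look like this:

# Hypothetical setup -- constructor arguments may differ across versions.
dyffapi = Client(api_key="YOUR_API_TOKEN")  # remote Dyff Platform
# or, for local experimentation without a remote platform:
# dyffapi = DyffLocalPlatform()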