Create a Dataset

In this section, we will create our first Dyff resource: the Dataset.

Create an Arrow dataset locally

Dyff uses the Apache Arrow format for storing datasets. To upload a new dataset, you first need to create an Arrow dataset locally. Arrow is a column-oriented format that is largely inter-convertible with Pandas DataFrames and JSON. Arrow is strongly typed, and every Arrow dataset must have an associated schema. In this guide, we’re going to create a simple text dataset with a single data column called text.
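To get a feel for the format, here is a minimal standalone sketch, using plain pyarrow only (no Dyff involved), of round-tripping row-oriented data through an Arrow table:

import pyarrow

# Arrow stores data by column; from_pylist infers a schema from the
# row dicts unless one is passed explicitly.
table = pyarrow.Table.from_pylist([{"text": "Call me Ishmael. "}])
print(table.schema)       # text: string
print(table.to_pylist())  # back to a list of row dicts
# table.to_pandas() gives the equivalent DataFrame (requires pandas)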

Define some example text data

For this guide, we’re going to create a simple text dataset whose rows are opening phrases from some famous novels:

DATA = [
    "Call me Ishmael. ",
    "It was the best of times, ",
    "A screaming comes across the sky. ",
    "It was a bright cold day in April, ",
    "In a hole in the ground there lived a hobbit. ",
]

Define the dataset schema

We’ll use some utilities from the Dyff client library to create the necessary schema:

schema = DataSchema.make_input_schema(text.Text)
arrow_schema = arrow.decode_schema(schema.arrowSchema)

Dyff expects datasets to have certain additional “metadata” columns and structure. The required structure is described in detail in the Data Schemas guide. For now, you just need to know that input datasets must have an integer column called _index_ that uniquely identifies each row in the dataset. The DataSchema.make_input_schema() function creates a schema with the required structure for input datasets. Its argument is the schema of the data itself; in this case, that’s dyff.schema.dataset.text.Text, which has a single field text of type str. After creating the schema, we “decode” the Arrow schema: the DataSchema object stores it in a binary-encoded form, and arrow.decode_schema() recovers the corresponding pyarrow.Schema Python object.
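As a quick check (a sketch, not a required part of the workflow), you can inspect the decoded pyarrow.Schema directly; the exact set of metadata columns may vary with your Dyff version:

# arrow_schema is an ordinary pyarrow.Schema. Its fields should include
# the required _index_ column alongside the declared text column.
print(arrow_schema.names)
assert "_index_" in arrow_schema.names
assert "text" in arrow_schema.names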

Generate the dataset

Now we define the data-generating function. To write an Arrow dataset to disk, we need an Iterable of pyarrow.RecordBatch objects. Record batches can be constructed in several ways; usually the easiest is from what PyArrow calls the pylist format, a list of dict objects in which each dict represents one “row” of the dataset. When generating the data, we also need to add the metadata fields required by Dyff; in this case, that’s the _index_ field:

def data_generator():
    pylist = [{"_index_": i, "text": t} for i, t in enumerate(DATA)]
    yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)
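A single RecordBatch is fine for five rows. For larger datasets, the same pattern extends to yielding many smaller batches; the sketch below is illustrative (the batch size of 1000 is an arbitrary choice, not a Dyff requirement), and note that _index_ must stay globally unique across batches:

def data_generator_batched(batch_size: int = 1000):
    # Yield the data in chunks so no single RecordBatch has to hold the
    # whole dataset in memory; _index_ is offset by the chunk start so
    # it remains unique across batches.
    for start in range(0, len(DATA), batch_size):
        chunk = DATA[start : start + batch_size]
        pylist = [{"_index_": start + i, "text": t} for i, t in enumerate(chunk)]
        yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)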

Write the dataset to disk

Now we need to write the dataset in Arrow format. When interacting with Dyff Platform datasets, always use the functions in dyff.schema.dataset.arrow rather than the corresponding pyarrow functions, because the ones in the Dyff client library set some necessary default parameters. We’ll use a temporary directory to hold the files.

with tempfile.TemporaryDirectory() as tmpdir:
    arrow.write_dataset(
        data_generator(), output_path=tmpdir, feature_schema=arrow_schema
    )
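If you want to sanity-check the files before uploading, you can read them back inside the same with block using plain pyarrow. This is an optional sketch, and it assumes the written files are in a format that pyarrow.dataset can discover automatically (Parquet in the typical case):

    # Optional sanity check (inside the with block): read the files back
    # with plain pyarrow and print the rows, including the _index_ column.
    import pyarrow.dataset

    written = pyarrow.dataset.dataset(tmpdir)
    print(written.to_table().to_pylist())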

Upload the dataset

Uploading the dataset happens in two steps. First, you create a Dataset record; Dyff assigns a unique ID to the dataset and returns a full Dataset object. Second, you upload the actual data, providing the Dataset object so that Dyff knows where to store it. The hashes of the dataset files must match the hashes calculated in the create step. Continuing inside the with block from the previous step:

    dataset = dyffapi.datasets.create_arrow_dataset(
        tmpdir,
        account=ACCOUNT,
        name="famous-first-phrases",
    )
    dyffapi.datasets.upload_arrow_dataset(dataset, tmpdir)

    print(dataset.json(indent=2))
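The printed JSON is the complete Dataset record. In later steps you will typically refer to the dataset by its platform-assigned ID; a small usage sketch:

# The ID was assigned by Dyff in the create step; you’ll use it to
# reference this dataset in later steps (e.g., when running evaluations).
print(dataset.id)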

Full example

import tempfile

import pyarrow

from dyff.audit.local import DyffLocalPlatform
from dyff.client import Client
from dyff.schema.dataset import arrow, text
from dyff.schema.platform import DataSchema

dyffapi: Client | DyffLocalPlatform = ...
ACCOUNT: str = ...

DATA = [
    "Call me Ishmael. ",
    "It was the best of times, ",
    "A screaming comes across the sky. ",
    "It was a bright cold day in April, ",
    "In a hole in the ground there lived a hobbit. ",
]

schema = DataSchema.make_input_schema(text.Text)
arrow_schema = arrow.decode_schema(schema.arrowSchema)


def data_generator():
    pylist = [{"_index_": i, "text": t} for i, t in enumerate(DATA)]
    yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)


with tempfile.TemporaryDirectory() as tmpdir:
    arrow.write_dataset(
        data_generator(), output_path=tmpdir, feature_schema=arrow_schema
    )

    dataset = dyffapi.datasets.create_arrow_dataset(
        tmpdir,
        account=ACCOUNT,
        name="famous-first-phrases",
    )
    dyffapi.datasets.upload_arrow_dataset(dataset, tmpdir)

    print(dataset.json(indent=2))
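One loose end: dyffapi is deliberately left as ... above. As a rough, version-dependent sketch (the constructor arguments are assumptions; consult the dyff client documentation for your version), instantiation might look like this:

# Hypothetical setup -- constructor arguments may differ across versions.
dyffapi = Client(api_key="YOUR_API_TOKEN")  # remote Dyff Platform
# or, for local experimentation without a remote platform:
# dyffapi = DyffLocalPlatform()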