Create a Dataset
In this section, we will create our first Dyff resource: the Dataset.
Create an Arrow dataset locally
Dyff uses the Apache Arrow format for storing datasets. To upload a new dataset, you first need to create an Arrow dataset locally. Arrow is a column-oriented format that’s mostly inter-convertible with Pandas DataFrames and JSON. In this guide, we’re going to create a simple text dataset with one data column called text. Arrow is strongly typed, and all Arrow datasets must have an associated schema.
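For example, a small table moves between Arrow, Pandas, and JSON-style rows with one call each. This is a minimal sketch using plain pyarrow, separate from the pipeline we build below, and it assumes pandas is installed:

import pyarrow

table = pyarrow.table({"text": ["Call me Ishmael. "]})  # build an Arrow table
df = table.to_pandas()  # Arrow -> Pandas DataFrame
table2 = pyarrow.Table.from_pandas(df)  # Pandas -> Arrow
rows = table.to_pylist()  # Arrow -> list of dicts (JSON-like rows)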
Define some example text data
For this guide, we’re going to create a simple text dataset consisting of opening phrases from some famous novels:
DATA = [
    "Call me Ishmael. ",
    "It was the best of times, ",
    "A screaming comes across the sky. ",
    "It was a bright cold day in April, ",
    "In a hole in the ground there lived a hobbit. ",
]
Define the dataset schema
We’ll use some utilities from the Dyff client library to create the necessary schema:
schema = DataSchema.make_input_schema(text.Text)
arrow_schema = arrow.decode_schema(schema.arrowSchema)
Dyff expects datasets to have certain additional “metadata” columns and structure. The required structure is described in detail in the Data Schemas guide. For now, you just need to know that input datasets must have an integer column called _index_ that uniquely identifies each row in the dataset. The DataSchema.make_input_schema() function creates a schema with the required structure for input datasets. The argument is the schema of the data; in this case, that’s dyff.schema.dataset.text.Text, which has a single field text of type str. After creating the schema, we need to “decode” the Arrow schema, because the DataSchema object stores it in a binary-encoded format.
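The decoded schema is an ordinary pyarrow.Schema, so you can inspect it directly if you’re curious. This is illustrative only; the exact set of metadata columns Dyff adds may vary:

print(arrow_schema.names)  # expect '_index_' and 'text', plus any Dyff metadata columns
print(arrow_schema.field("text").type)  # a string type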
Generate the dataset
Now we define the data-generating function. To write an Arrow dataset to disk, we need to define an Iterable of pyarrow.RecordBatch objects. The record batches can be constructed in multiple ways; usually, the easiest way is from what PyArrow calls the pylist format, which is a list of dict objects, each of which represents one “row” in the dataset. When generating the data, we need to add the metadata fields required by Dyff; in this case, that’s the _index_ field:
def data_generator():
    pylist = [{"_index_": i, "text": t} for i, t in enumerate(DATA)]
    yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)
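Our toy dataset fits in a single record batch, but the writer accepts any iterable of batches, so larger datasets can be streamed in chunks without holding everything in memory. A sketch of that pattern (the chunk size of 1000 is an arbitrary choice, not a Dyff requirement):

def data_generator_chunked(chunk_size: int = 1000):
    for start in range(0, len(DATA), chunk_size):
        chunk = DATA[start : start + chunk_size]
        # offset by 'start' so that _index_ stays unique across batches
        pylist = [{"_index_": start + i, "text": t} for i, t in enumerate(chunk)]
        yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)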
Write the dataset to disk
Now we need to write the dataset in Arrow format. When interacting with Dyff Platform datasets, always use the functions in dyff.schema.dataset.arrow rather than the corresponding pyarrow functions, because the ones in the Dyff client library set some necessary default parameters. We’ll use a temporary directory to hold the files.
with tempfile.TemporaryDirectory() as tmpdir:
    arrow.write_dataset(
        data_generator(), output_path=tmpdir, feature_schema=arrow_schema
    )
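Before uploading, you can optionally read the files back (still inside the with block) to sanity-check them. This sketch reads with plain pyarrow and assumes pyarrow.dataset.dataset() can auto-discover the files that arrow.write_dataset() produced:

    # Optional sanity check, still inside the with block. Assumption:
    # pyarrow.dataset.dataset() can read the files written above.
    import pyarrow.dataset
    readback = pyarrow.dataset.dataset(tmpdir)
    print(readback.to_table().to_pylist())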
Upload the dataset
Uploading the dataset happens in two steps. First, you create a Dataset record. Dyff assigns a unique ID to the dataset and returns a full Dataset object. Second, you upload the actual data, providing the Dataset object so that Dyff knows where to store it. The hashes of the dataset files must match the hashes calculated in the create step. Continuing inside the same with block from the previous step:
    dataset = dyffapi.datasets.create_arrow_dataset(
        tmpdir,
        account=ACCOUNT,
        name="famous-first-phrases",
    )
    dyffapi.datasets.upload_arrow_dataset(dataset, tmpdir)

    print(dataset.json(indent=2))
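Ingestion on the platform side may take a moment after the upload completes. As a hedged sketch, you could re-fetch the record to check its status; this assumes your client version exposes a datasets.get() accessor, so check the client API reference:

# Assumption: the client exposes datasets.get(); verify against your client version
fetched = dyffapi.datasets.get(dataset.id)
print(fetched.status)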
Full example
import tempfile

import pyarrow

from dyff.audit.local import DyffLocalPlatform
from dyff.client import Client
from dyff.schema.dataset import arrow, text
from dyff.schema.platform import DataSchema

dyffapi: Client | DyffLocalPlatform = ...
ACCOUNT: str = ...

DATA = [
    "Call me Ishmael. ",
    "It was the best of times, ",
    "A screaming comes across the sky. ",
    "It was a bright cold day in April, ",
    "In a hole in the ground there lived a hobbit. ",
]

schema = DataSchema.make_input_schema(text.Text)
arrow_schema = arrow.decode_schema(schema.arrowSchema)


def data_generator():
    pylist = [{"_index_": i, "text": t} for i, t in enumerate(DATA)]
    yield pyarrow.RecordBatch.from_pylist(pylist, schema=arrow_schema)


with tempfile.TemporaryDirectory() as tmpdir:
    arrow.write_dataset(
        data_generator(), output_path=tmpdir, feature_schema=arrow_schema
    )

    dataset = dyffapi.datasets.create_arrow_dataset(
        tmpdir,
        account=ACCOUNT,
        name="famous-first-phrases",
    )
    dyffapi.datasets.upload_arrow_dataset(dataset, tmpdir)

    print(dataset.json(indent=2))