Advanced Inference Service Usage

In some situations, you may need to create your own inference services so that you can customize their behavior. The most common reasons to do this are to set different default parameters for the underlying inference model or to alter the input-output interface of the service.

Creating a working inference service can be tricky, because you need to specify low-level details like which container image should be used to run the service and what compute hardware is required, and some models are not compatible with certain images or hardware. Usually, you won’t care about these details, because you just want a configuration that works. So, our advice is to start with a working inference service specification for the underlying model you want to use, make a copy of it, and change only the parts that are relevant to your use case.
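The copy-and-modify approach can be sketched with plain Python data. This is only an illustration: `working_spec`, `derive_spec`, and the placeholder values are hypothetical and are not part of the Dyff API. The point is to deep-copy a known-good specification and touch only the fields you care about.

```python
import copy

# Hypothetical known-good specification, written as a plain dict for illustration.
# Field names loosely mirror the examples in this guide; values are placeholders.
working_spec = {
    "name": "opt-125m-default",
    "runner": {
        "args": ["--dtype", "float16"],
        "resources": {"storage": "300Mi", "memory": "8Gi"},
    },
}


def derive_spec(base: dict, name: str, extra_args: list) -> dict:
    """Return a modified deep copy of `base`, leaving `base` itself untouched."""
    spec = copy.deepcopy(base)
    spec["name"] = name
    # Append to the copied args list; everything else is inherited unchanged.
    spec["runner"]["args"] = spec["runner"]["args"] + list(extra_args)
    return spec


custom_spec = derive_spec(
    working_spec, "opt-125m-custom", ["--served-model-name", "my-model"]
)
```

Using a deep copy matters here because the nested `runner` dict would otherwise be shared between the original and the derived specification.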

To create a new inference service, we construct an InferenceServiceCreateRequest:

service_request = InferenceServiceCreateRequest(
    account="example",
    name=service_name,
    model=model_id,
    runner=runner,
    interface=interface,
)

Most of the configuration happens in the runner and interface parameters.

The InferenceServiceRunner

The runner specifies the compute environment that will support the inference service. Usually, the runner is a generic “wrapper” container that loads a specific model at run-time. For example, Dyff uses vLLM to run large language models.

The main parts of the runner specification are:

  1. The container image to use

  2. Runner-specific command line arguments to pass to the container

  3. Optional accelerator hardware to use (e.g., GPUs)

  4. Memory and storage requirements for the model

This example creates a runner that runs the facebook/opt-125m model on an Nvidia T4 GPU, with additional configuration to support serving the OpenAI API interface to the model:

runner = InferenceServiceRunner(
    kind=InferenceServiceRunnerKind.VLLM,
    image=ContainerImageSource(
        host="registry.gitlab.com",
        name="dyff/workflows/vllm-runner",
        digest="sha256:6a5107db7bb5dac4d231f5c0a97f34d95c17bda40ff6428b00e1c621352eb125",
        tag="0.6.3",
    ),
    # Command line args; format is specific to the runner
    args=[
        # T4 GPUs don't support the 'bfloat16' format this model defaults to
        "--dtype",
        "float16",
        # The OpenAI interface requires us to specify name(s) for the served model
        "--served-model-name",
        model_id,
        model_name,
    ],
    accelerator=Accelerator(
        kind="GPU",
        gpu=AcceleratorGPU(
            hardwareTypes=["nvidia.com/gpu-t4"],
            count=1,
        ),
    ),
    resources=ModelResources(
        storage="300Mi",
        memory="8Gi",
    ),
)

The InferenceInterface

The interface determines the input and output formats of the service. This is the interface that will be used when the service is referenced in an Evaluation. Fully specifying the interface allows evaluations to be reproducible. The interface can be bypassed, however, when using inference sessions in interactive mode. In this mode, clients can call any of the endpoints provided by the service directly, using each endpoint's “native” data format. For example, the vLLM runner serves the OpenAI inference routes under /openai/v1/, such as /openai/v1/completions.

This example adapts the OpenAI completions endpoint to fit the standard format expected by text completion evaluations:

interface = InferenceInterface(
    # The inference endpoint served by the runner container
    endpoint="openai/v1/completions",
    # The output records should look like: {"text": "To be, or not to be"}
    outputSchema=DataSchema.make_output_schema(
        DyffDataSchema(
            components=["text.Text"],
        ),
    ),
    # How to convert inputs to the service to the format the runner expects
    # Input: {"text": "The question"}
    # Adapted: {"prompt": "The question", "model": <model_id>}
    inputPipeline=[
        SchemaAdapter(
            kind="TransformJSON",
            configuration={"prompt": "$.text", "model": model_id},
        ),
    ],
    # How to convert the runner output to match outputSchema
    # Output: {"choices": [{"text": "that is the question"}, {"text": "that is a tautology"}]}
    # Adapted: [{"text": "that is the question"}, {"text": "that is a tautology"}]
    outputPipeline=[
        SchemaAdapter(
            kind="ExplodeCollections",
            configuration={"collections": ["choices"]},
        ),
        SchemaAdapter(
            kind="TransformJSON",
            configuration={"text": "$.choices.text"},
        ),
    ],
)
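To make the data flow through the two pipelines concrete, here is a minimal sketch that mimics each adapter step with plain Python functions. These functions are hypothetical stand-ins, not the Dyff implementations of TransformJSON or ExplodeCollections. The sketch assumes that each element of "choices" is an object with a "text" field, as in the OpenAI completions response format, and uses a placeholder model name.

```python
def transform_input(record: dict, model_id: str) -> dict:
    """Mimic the input TransformJSON step: map 'text' to 'prompt', add 'model'."""
    return {"prompt": record["text"], "model": model_id}


def explode_choices(response: dict) -> list:
    """Mimic ExplodeCollections on 'choices': emit one record per list element."""
    return [{**response, "choices": choice} for choice in response["choices"]]


def transform_output(record: dict) -> dict:
    """Mimic the output TransformJSON step: extract the value at '$.choices.text'."""
    return {"text": record["choices"]["text"]}


# Input side: evaluation record -> request body for the completions endpoint
request_body = transform_input({"text": "The question"}, model_id="example-model")

# Output side: raw endpoint response -> list of records matching outputSchema
raw_response = {
    "choices": [
        {"text": "that is the question"},
        {"text": "that is a tautology"},
    ]
}
adapted = [transform_output(r) for r in explode_choices(raw_response)]
```

After the explode step, each record holds a single choice under "choices", which is why the path "$.choices.text" resolves to one string per record.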

Available Accelerators

Note: Accelerators like GPUs can be expensive to run. Choose the minimum hardware allocation that suits your use case.

Dyff currently supports only GPU-type hardware accelerators. The available hardware types depend on the environment where Dyff is deployed. For the Dyff instance maintained by DSRI, the supported hardware types are:

nvidia.com/gpu-t4

nvidia.com/gpu-a100

nvidia.com/gpu-a100-80gb

Only certain gpu_count values are supported for each hardware type, and each supported combination has a maximum memory resource request, as listed below. These constraints are determined by the Google Cloud platform.

gpu_type                    gpu_count    memory (max)
nvidia.com/gpu-t4           1            287Gi
nvidia.com/gpu-t4           2            287Gi
nvidia.com/gpu-t4           4            587Gi
nvidia.com/gpu-a100         1            60Gi
nvidia.com/gpu-a100         2            134Gi
nvidia.com/gpu-a100         4            296Gi
nvidia.com/gpu-a100         8            618Gi
nvidia.com/gpu-a100         16           1250Gi
nvidia.com/gpu-a100-80gb    1            134Gi
nvidia.com/gpu-a100-80gb    2            296Gi
nvidia.com/gpu-a100-80gb    4            618Gi
nvidia.com/gpu-a100-80gb    8            1250Gi
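It can be convenient to check a planned accelerator request against these limits before submitting it. The following is a rough sketch, not part of Dyff: the lookup table simply restates the values listed above for the DSRI instance, and `check_accelerator_request` is a hypothetical helper that takes the memory request as a whole number of Gi.

```python
# Maximum memory (in Gi) per (gpu_type, gpu_count), restating the values above.
MAX_MEMORY_GI = {
    ("nvidia.com/gpu-t4", 1): 287,
    ("nvidia.com/gpu-t4", 2): 287,
    ("nvidia.com/gpu-t4", 4): 587,
    ("nvidia.com/gpu-a100", 1): 60,
    ("nvidia.com/gpu-a100", 2): 134,
    ("nvidia.com/gpu-a100", 4): 296,
    ("nvidia.com/gpu-a100", 8): 618,
    ("nvidia.com/gpu-a100", 16): 1250,
    ("nvidia.com/gpu-a100-80gb", 1): 134,
    ("nvidia.com/gpu-a100-80gb", 2): 296,
    ("nvidia.com/gpu-a100-80gb", 4): 618,
    ("nvidia.com/gpu-a100-80gb", 8): 1250,
}


def check_accelerator_request(gpu_type: str, gpu_count: int, memory_gi: int) -> int:
    """Return the memory limit in Gi, or raise ValueError if the request is invalid."""
    key = (gpu_type, gpu_count)
    if key not in MAX_MEMORY_GI:
        supported = sorted(c for (t, c) in MAX_MEMORY_GI if t == gpu_type)
        raise ValueError(
            f"unsupported combination {key}; supported counts for this type: {supported}"
        )
    limit = MAX_MEMORY_GI[key]
    if memory_gi > limit:
        raise ValueError(f"{memory_gi}Gi exceeds the {limit}Gi limit for {key}")
    return limit


t4_limit = check_accelerator_request("nvidia.com/gpu-t4", 1, memory_gi=8)
```

Other Dyff deployments may support different hardware types and limits, so the table in such a helper would need to match the target environment.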

Available Runners

Note that command line arguments to runners are specified as a list of tokens. These will be joined with spaces when forwarded to the runner. So, if you want to specify the flag --dtype float16 to the runner, your args list will look like: args=["--dtype", "float16"].
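If you already have the flags written as a single command line string, the standard library can produce the token list for you. This is a small sketch using `shlex.split`; the model names are placeholders.

```python
import shlex

# A command line as you might type it in a shell (placeholder model names)
command_line = "--dtype float16 --served-model-name my-model my-alias"

# Split into the list-of-tokens form expected by the args parameter
args = shlex.split(command_line)

# Joining the tokens with spaces recovers the original command line
forwarded = " ".join(args)
```

Here `args` is the list `["--dtype", "float16", "--served-model-name", "my-model", "my-alias"]`, with each flag and each value as a separate token.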

vLLM

This is a thin wrapper around the API servers provided by the vLLM project. Refer to the vLLM source code for the canonical documentation of endpoints and command line arguments.

Note that using the OpenAI API endpoints requires image version 0.3.2 or greater.

kind : InferenceServiceRunnerKind.VLLM

images : https://gitlab.com/dyff/workflows/vllm-runner/container_registry

endpoints :

/openai/v1/* (Requires image version >=0.3.2)

OpenAI v1 API endpoints, as implemented in vllm.entrypoints.openai.api_server. Not all models support all endpoints, and you may need to specify which interface you want (e.g., "completions" vs. "embeddings").

args : These are some of the most commonly used command line arguments for the vLLM runner. Refer to the vLLM source code for a complete list.

--tensor-parallel-size

If using gpu_count > 1 in your accelerator configuration, set --tensor-parallel-size to the same value as gpu_count.

--dtype

Overrides the default model weight data type. Some accelerators, such as Nvidia T4 GPUs, do not support certain common dtypes such as bfloat16, so you may need to force a supported dtype.

--trust-remote-code

This flag is needed when running Hugging Face models that require custom Python code. Always review all custom code for the specific version of the model you are running before using this flag.

--served-model-name name1 [name2 [name3 ...]]

Model names that the OpenAI API endpoints will accept. You must pass one of the specified names in the "model" field in all requests to the OpenAI API endpoints.
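The guidance for these arguments can be collected into a small helper that assembles the args list. This is a sketch under assumptions: `build_vllm_args` is a hypothetical function, not part of Dyff or vLLM, and it only covers the four flags described above.

```python
def build_vllm_args(gpu_count, served_model_names=None, dtype=None,
                    trust_remote_code=False):
    """Assemble a token list for the vLLM runner from a few common settings."""
    args = []
    if gpu_count > 1:
        # Keep tensor parallelism in step with the number of GPUs requested
        args += ["--tensor-parallel-size", str(gpu_count)]
    if dtype is not None:
        # e.g. "float16" on hardware that does not support bfloat16
        args += ["--dtype", dtype]
    if trust_remote_code:
        # Only enable after reviewing the model's custom code
        args.append("--trust-remote-code")
    if served_model_names:
        # Names that requests to the OpenAI endpoints may use in "model"
        args += ["--served-model-name", *served_model_names]
    return args


runner_args = build_vllm_args(
    gpu_count=2,
    served_model_names=["facebook/opt-125m"],
    dtype="float16",
)
```

Every token is a string, including the GPU count, because the list is forwarded to the runner as command line text.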

When using the vLLM runner, you should request a minimum of 8Gi of memory, even if the model you’re running is smaller than that. This provides enough memory for the vLLM engine.
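One way to apply this rule is to compute the memory request as the larger of the model's own requirement and the 8Gi floor. The helpers below are hypothetical and only understand whole-number quantities with the binary suffixes Ki, Mi, Gi, and Ti, which is a simplification of the quantity formats a real deployment may accept.

```python
# Binary unit suffixes and their sizes in bytes
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a string such as '300Mi' or '8Gi' to a number of bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat the value as plain bytes


def vllm_memory_request(model_memory: str, floor: str = "8Gi") -> str:
    """Return the larger of the model's memory requirement and the floor."""
    if parse_quantity(model_memory) >= parse_quantity(floor):
        return model_memory
    return floor


small_model_request = vllm_memory_request("300Mi")
large_model_request = vllm_memory_request("60Gi")
```

With these inputs, the small model is raised to the 8Gi floor while the larger request is passed through unchanged.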