Advanced Inference Service Usage
In some situations, you may need to create your own inference services so that you can customize their behavior. The most common reasons to do this are to set different default parameters for the underlying inference model or to alter the input-output interface of the service.
Creating a working inference service can be tricky, because you need to specify low-level details like which container image should be used to run the service and what compute hardware is required, and some models are not compatible with certain images or hardware. Usually, you won’t care about these details, because you just want a configuration that works. So, our advice is to start with a working inference service specification for the underlying model you want to use, make a copy of it, and change only the parts that are relevant to your use case.
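The copy-and-modify approach can be sketched with plain Python dictionaries. This is only an illustration of the workflow, not the Dyff client API: the `working_spec` dictionary, its field values, and the `derive_spec` helper are hypothetical stand-ins, and `--max-model-len` is used simply as an example of an extra vLLM flag you might want to add.

```python
import copy

# Hypothetical stand-in for a known-good service specification. The field
# names mirror the examples in this guide; the values are made up.
working_spec = {
    "name": "opt-125m-baseline",
    "model": "example-model-id",
    "runner": {
        "args": ["--dtype", "float16"],
        "resources": {"storage": "300Mi", "memory": "8Gi"},
    },
}


def derive_spec(base, *, name, extra_args=()):
    """Copy a working spec and change only the name and the runner args."""
    spec = copy.deepcopy(base)  # deep copy so the working spec is never mutated
    spec["name"] = name
    spec["runner"]["args"] = [*spec["runner"]["args"], *extra_args]
    return spec


custom_spec = derive_spec(
    working_spec,
    name="opt-125m-short-context",
    extra_args=["--max-model-len", "1024"],
)
```

Everything that was not explicitly overridden, such as the resource requests, is carried over unchanged from the working specification.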
To create a new inference service, we construct an InferenceServiceCreateRequest:
service_request = InferenceServiceCreateRequest(
    account="example",
    name=service_name,
    model=model_id,
    runner=runner,
    interface=interface,
)
Most of the configuration happens in the runner and interface parameters.
The InferenceServiceRunner
The runner specifies the compute environment that will support the inference service. Usually, the runner is a generic “wrapper” container that loads a specific model at run-time. For example, Dyff uses vLLM to run large language models.
The main parts of the runner specification are:
The container image to use
Runner-specific command line arguments to pass to the container
Optional accelerator hardware to use (e.g., GPUs)
Memory and storage requirements for the model
This example creates a runner that runs the facebook/opt-125m model on an Nvidia T4 GPU, with additional configuration to support serving the OpenAI API interface to the model:
runner = InferenceServiceRunner(
    kind=InferenceServiceRunnerKind.VLLM,
    image=ContainerImageSource(
        host="registry.gitlab.com",
        name="dyff/workflows/vllm-runner",
        digest="sha256:6a5107db7bb5dac4d231f5c0a97f34d95c17bda40ff6428b00e1c621352eb125",
        tag="0.6.3",
    ),
    # Command line args; format is specific to the runner
    args=[
        # T4 GPUs don't support the 'bfloat16' format this model defaults to
        "--dtype",
        "float16",
        # The OpenAI interface requires us to specify name(s) for the served model
        "--served-model-name",
        model_id,
        model_name,
    ],
    accelerator=Accelerator(
        kind="GPU",
        gpu=AcceleratorGPU(
            hardwareTypes=["nvidia.com/gpu-t4"],
            count=1,
        ),
    ),
    resources=ModelResources(
        storage="300Mi",
        memory="8Gi",
    ),
)
The InferenceInterface
The interface determines the input and output formats of the service. This is the interface that will be used when the service is referenced in an Evaluation. Fully specifying the interface allows the evaluations to be reproducible. However, the interface can be bypassed when using inference sessions in interactive mode. In this mode, clients can call any of the endpoints provided by the service directly, using whatever their “native” data format is. For example, the vLLM runner serves the OpenAI inference routes at /openai/v1/completions, etc.
This example adapts the OpenAI completions endpoint to fit the standard format expected by text completion evaluations:
interface = InferenceInterface(
    # The inference endpoint served by the runner container
    endpoint="openai/v1/completions",
    # The output records should look like: {"text": "To be, or not to be"}
    outputSchema=DataSchema.make_output_schema(
        DyffDataSchema(
            components=["text.Text"],
        ),
    ),
    # How to convert inputs to the service to the format the runner expects
    # Input: {"text": "The question"}
    # Adapted: {"prompt": "The question", "model": <model_id>}
    inputPipeline=[
        SchemaAdapter(
            kind="TransformJSON",
            configuration={"prompt": "$.text", "model": model_id},
        ),
    ],
    # How to convert the runner output to match outputSchema
    # Output: {"choices": [{"text": "that is the question"}, {"text": "that is a tautology"}]}
    # Adapted: [{"text": "that is the question"}, {"text": "that is a tautology"}]
    outputPipeline=[
        SchemaAdapter(
            kind="ExplodeCollections",
            configuration={"collections": ["choices"]},
        ),
        SchemaAdapter(
            kind="TransformJSON",
            configuration={"text": "$.choices.text"},
        ),
    ],
)
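To make the data flow through the two pipelines concrete, here is a simplified pure-Python simulation. It is not Dyff's implementation of the adapters: `transform_json` and `explode_collection` are hypothetical helpers that only mimic the behavior described in the comments above, the `model_id` value is a placeholder, and the response shape assumes each element of "choices" is an object with a "text" field, as in the OpenAI completions API.

```python
def transform_json(record, configuration):
    """Build a new record from `configuration`.

    Values of the form "$.a.b" are looked up in `record`; anything else is
    copied through as a literal.
    """
    output = {}
    for key, value in configuration.items():
        if isinstance(value, str) and value.startswith("$."):
            current = record
            for part in value[2:].split("."):
                current = current[part]
            output[key] = current
        else:
            output[key] = value
    return output


def explode_collection(record, collection):
    """Return one record per element of the list stored under `collection`."""
    return [{**record, collection: item} for item in record[collection]]


model_id = "example-model-id"  # placeholder, not a real model ID

# Input pipeline: evaluation input -> request body for the runner
request_body = transform_json(
    {"text": "The question"},
    {"prompt": "$.text", "model": model_id},
)

# Output pipeline: runner response -> records matching the output schema
response = {
    "choices": [
        {"text": "that is the question"},
        {"text": "that is a tautology"},
    ]
}
output_records = [
    transform_json(exploded, {"text": "$.choices.text"})
    for exploded in explode_collection(response, "choices")
]
```

After the explode step, each intermediate record holds a single choice under "choices", which is why the path "$.choices.text" resolves to one string per record.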
Available Accelerators
Note
Accelerators like GPUs can be expensive to run. Choose the minimum hardware allocation that suits your use case.
Dyff currently supports only GPU-type hardware accelerators. The available hardware types depend on the environment where Dyff is deployed. For the Dyff instance maintained by DSRI, the supported hardware types are:
nvidia.com/gpu-t4
nvidia.com/gpu-a100
nvidia.com/gpu-a100-80gb
Only certain gpu_count values are supported, and there are associated memory resource request constraints. These constraints are determined by the Google Cloud platform.
gpu_type                  gpu_count  memory (max)
nvidia.com/gpu-t4         1          287Gi
nvidia.com/gpu-t4         2          287Gi
nvidia.com/gpu-t4         4          587Gi
nvidia.com/gpu-a100       1          60Gi
nvidia.com/gpu-a100       2          134Gi
nvidia.com/gpu-a100       4          296Gi
nvidia.com/gpu-a100       8          618Gi
nvidia.com/gpu-a100       16         1250Gi
nvidia.com/gpu-a100-80gb  1          134Gi
nvidia.com/gpu-a100-80gb  2          296Gi
nvidia.com/gpu-a100-80gb  4          618Gi
nvidia.com/gpu-a100-80gb  8          1250Gi
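If you build service specifications programmatically, it can help to check a request against these limits before submitting it. The following is a minimal sketch that hard-codes the values listed above for the DSRI-maintained instance. The `MAX_MEMORY_GI` table and the `check_accelerator_request` helper are hypothetical and not part of Dyff, and the helper only understands memory quantities written with the "Gi" suffix.

```python
# Maximum memory (in Gi) per (gpu_type, gpu_count), copied from the list above.
MAX_MEMORY_GI = {
    "nvidia.com/gpu-t4": {1: 287, 2: 287, 4: 587},
    "nvidia.com/gpu-a100": {1: 60, 2: 134, 4: 296, 8: 618, 16: 1250},
    "nvidia.com/gpu-a100-80gb": {1: 134, 2: 296, 4: 618, 8: 1250},
}


def check_accelerator_request(gpu_type, gpu_count, memory):
    """Return a list of problems; an empty list means the request fits."""
    limits = MAX_MEMORY_GI.get(gpu_type)
    if limits is None:
        return [f"unsupported gpu_type {gpu_type!r}"]
    if gpu_count not in limits:
        return [
            f"unsupported gpu_count {gpu_count} for {gpu_type}; "
            f"supported values: {sorted(limits)}"
        ]
    if not memory.endswith("Gi"):
        return [f"this sketch only understands 'Gi' quantities, got {memory!r}"]
    requested_gi = float(memory[:-2])
    if requested_gi > limits[gpu_count]:
        return [f"memory {memory} exceeds the {limits[gpu_count]}Gi maximum"]
    return []


problems = check_accelerator_request("nvidia.com/gpu-t4", 1, "8Gi")
```

Returning a list of messages rather than raising keeps the helper easy to use in a pre-submission check that reports every rejected configuration at once.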
Available Runners
Note that command line arguments to runners are specified as a list of tokens. These will be joined with spaces when forwarded to the runner. So, if you want to specify the flag --dtype float16 to the runner, your args list will look like: args=["--dtype", "float16"].
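If you already have the flags written as a single command line string, the standard library's `shlex.split` produces the token list in the expected form. This is a small sketch; the model names `my-model` and `my-alias` are placeholders.

```python
import shlex

command_line = "--dtype float16 --served-model-name my-model my-alias"

# Split the command line into the list of tokens expected by `args`
args = shlex.split(command_line)

# Joining the tokens with spaces recovers the original command line
rejoined = " ".join(args)
```

Here `args` is `["--dtype", "float16", "--served-model-name", "my-model", "my-alias"]`: each flag and each value is its own list element.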
vLLM
This is a thin wrapper around the API servers provided by the vLLM project. Refer to the vLLM source code for the canonical documentation of endpoints and command line arguments.
Note that using the OpenAI API endpoints requires image version 0.3.2 or greater.
kind: InferenceServiceRunnerKind.VLLM
images: https://gitlab.com/dyff/workflows/vllm-runner/container_registry
endpoints:
/openai/v1/* (Requires image version >= 0.3.2)
OpenAI v1 API endpoints, as implemented in vllm.entrypoints.openai.api_server. Not all models support all endpoints, and you may need to specify which interface you want (e.g., "completions" vs. "embeddings").
args: These are some of the most commonly used command line arguments for the vLLM runner. Refer to the vLLM source code for a complete list.
--tensor-parallel-size
If using gpu_count > 1 in your accelerator configuration, set --tensor-parallel-size to the same value as gpu_count.
--dtype
Overrides the default model weight data type. Some accelerators, such as Nvidia T4 GPUs, do not support certain common dtypes such as bfloat16, so you may need to force a supported dtype.
--trust-remote-code
This flag is needed when running Hugging Face models that require custom Python code. Always review all custom code for the specific version of the model you are running before using this flag.
--served-model-name name1 [name2 [name3 ...]]
Model names that the OpenAI API endpoints will accept. You must pass one of the specified names in the "model" field in all requests to the OpenAI API endpoints.
When using the vLLM runner, you should request a minimum of 8Gi of memory, even if the model you’re running is smaller than that. This provides enough memory for the vLLM engine.
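The guidance on the individual flags can be collected into a small helper that assembles the args list. This is a sketch under stated assumptions: `build_vllm_args` is a hypothetical convenience function, not part of Dyff or vLLM, and it covers only the four flags described above.

```python
def build_vllm_args(served_model_names, gpu_count=1, dtype=None,
                    trust_remote_code=False):
    """Assemble a token list for the vLLM runner's `args` parameter."""
    if not served_model_names:
        raise ValueError("at least one served model name is required")
    args = []
    if gpu_count > 1:
        # Tensor parallelism should match the number of requested GPUs
        args += ["--tensor-parallel-size", str(gpu_count)]
    if dtype is not None:
        # For example "float16" on T4 GPUs, which do not support bfloat16
        args += ["--dtype", dtype]
    if trust_remote_code:
        # Only enable after reviewing the model's custom code
        args.append("--trust-remote-code")
    args += ["--served-model-name", *served_model_names]
    return args


args = build_vllm_args(["my-model"], gpu_count=2, dtype="float16")
```

Every token is a string, including the GPU count, because the list is forwarded to the runner as command line text.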