Create the InferenceSession

Copy the ID from the InferenceService output and pass it to the generation command as --service_id=<id>, with session as the final argument.

Example generation command

python generate-huggingface-resource-scripts.py \
    --hf_name=allenai/olmo-2-0325-32b-instruct \
    --hf_revision=0000000000000000000000000000000000000000 \
    --model_size=70Gi \
    --model_id=f6288ef018b34ff285dde430af40035d \
    --gpu_type=nvidia.com/gpu-a100 \
    --gpu_count=4 \
    --inference_api=completions \
    '--service_args=--max-model-len 90000' \
    '--inference_args={"max_tokens": 2000}' \
    --service_name=allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000 \
    --service_id=d8445b6174c94f6aa1099c1a4004297a \
    session
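The shell quoting around --service_args and --inference_args is easy to get wrong, so here is a minimal sketch that assembles the same argument list in Python. The helper name build_session_command is hypothetical and not part of the script; the flags and values are copied from the example above. It only builds and prints the command, and does not run it.

```python
import json
import shlex


def build_session_command(service_id: str, inference_args: dict, max_model_len: int) -> list[str]:
    """Assemble the example generation command as an argument list.

    Hypothetical helper: the flags and values mirror the example command;
    nothing here is part of generate-huggingface-resource-scripts.py itself.
    """
    return [
        "python",
        "generate-huggingface-resource-scripts.py",
        "--hf_name=allenai/olmo-2-0325-32b-instruct",
        "--hf_revision=0000000000000000000000000000000000000000",
        "--model_size=70Gi",
        "--model_id=f6288ef018b34ff285dde430af40035d",
        "--gpu_type=nvidia.com/gpu-a100",
        "--gpu_count=4",
        "--inference_api=completions",
        # One argument containing a space, so no shell quoting is needed in list form.
        f"--service_args=--max-model-len {max_model_len}",
        # json.dumps produces the JSON text the example passes inside single quotes.
        f"--inference_args={json.dumps(inference_args)}",
        "--service_name=allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000",
        f"--service_id={service_id}",
        "session",
    ]


cmd = build_session_command(
    service_id="d8445b6174c94f6aa1099c1a4004297a",
    inference_args={"max_tokens": 2000},
    max_model_len=90000,
)
# shlex.join adds the shell quoting needed to paste the command into a terminal.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) would run the command without any shell quoting at all.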

Run the generated script

python sessions/allenai--olmo-2-0325-32b-instruct--completions--c90000-t2000.py
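The script path above appears to be the --service_name value with each "/" replaced by "--", placed under sessions/ with a .py suffix. That mapping is inferred from this one example rather than from documented behavior of the generator, so treat the following helper (session_script_path is a hypothetical name) as a convenience sketch for locating the file.

```python
from pathlib import PurePosixPath


def session_script_path(service_name: str, directory: str = "sessions") -> str:
    """Guess the generated script path for a given service name.

    Assumption: "/" becomes "--" and ".py" is appended, as observed in the
    single example in this guide.
    """
    filename = service_name.replace("/", "--") + ".py"
    return str(PurePosixPath(directory) / filename)


print(session_script_path("allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000"))
# prints: sessions/allenai--olmo-2-0325-32b-instruct--completions--c90000-t2000.py
```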

Verify and tear down

Verify that the model eventually starts up and responds to an inference request.
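Because startup can take a while, one way to check "eventually" is to retry a probe until it succeeds or a deadline passes. The sketch below is generic: wait_until_ready and the probe callable are hypothetical, and the probe is where you would send your own inference request. No Dyff API call is assumed here.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 30.0,
) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires.

    probe is a caller-supplied function (hypothetical) that sends one
    inference request and returns True if the model answered. Exceptions
    raised by probe are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception as exc:
            print(f"not ready yet: {exc}")
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The default timeout and interval are placeholders; choose values that match how long your model normally takes to load.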

After verifying that the session works, delete it, either through the Dyff API or by deleting the Kubernetes inferencesession resource. The generated session script does not delete the session for you.

from dyff.client import Client

dyffapi = Client()
dyffapi.inferencesessions.delete("<InferenceSession ID>")
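To avoid leaving a session behind when verification fails partway through, the delete call above can be wrapped in a context manager. This is a minimal sketch: delete_session_on_exit is a hypothetical helper, and the only client method it relies on is the inferencesessions.delete call shown above.

```python
from contextlib import contextmanager


@contextmanager
def delete_session_on_exit(dyffapi, session_id: str):
    """Yield the session ID, then delete the session even if the body raises.

    dyffapi is expected to be a dyff.client.Client instance; only
    dyffapi.inferencesessions.delete(session_id) is used.
    """
    try:
        yield session_id
    finally:
        dyffapi.inferencesessions.delete(session_id)


# Usage (not executed here):
#
#     from dyff.client import Client
#
#     dyffapi = Client()
#     with delete_session_on_exit(dyffapi, "<InferenceSession ID>") as session_id:
#         ...  # send a test inference request and check the response
```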