Create the InferenceSession
Copy the id from the InferenceService output and add --service_id=<id>
to the generation command, with session as the final argument.
Example generation command
python generate-huggingface-resource-scripts.py \
--hf_name=allenai/olmo-2-0325-32b-instruct \
--hf_revision=0000000000000000000000000000000000000000 \
--model_size=70Gi \
--model_id=f6288ef018b34ff285dde430af40035d \
--gpu_type=nvidia.com/gpu-a100 \
--gpu_count=4 \
--inference_api=completions \
'--service_args=--max-model-len 90000' \
'--inference_args={"max_tokens": 2000}' \
--service_name=allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000 \
--service_id=d8445b6174c94f6aa1099c1a4004297a \
session
Run the generated script
python sessions/allenai--olmo-2-0325-32b-instruct--completions--c90000-t2000.py
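The script path above appears to be derived from --service_name by replacing each "/" with "--" and writing the result under sessions/. The following is a minimal sketch of that mapping, inferred only from the example in this section and not from the generator's source. The helper name session_script_path and the output_dir parameter are hypothetical.

```python
from pathlib import PurePosixPath


def session_script_path(service_name: str, output_dir: str = "sessions") -> str:
    """Guess the generated script path for a given --service_name.

    Assumption: the generator replaces each "/" with "--" and appends ".py",
    as the example command and script path in this section suggest.
    """
    file_name = service_name.replace("/", "--") + ".py"
    return str(PurePosixPath(output_dir) / file_name)


print(session_script_path("allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000"))
# prints sessions/allenai--olmo-2-0325-32b-instruct--completions--c90000-t2000.py
```

If the generated file is not where this sketch predicts, list the sessions/ directory instead of relying on the guess.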
Verify and tear down
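Verify that the session starts and that the model responds to an inference request. Startup is not immediate, so allow time for the model to load before treating a failed request as an error.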
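Because the model takes time to start, verification usually means polling until a test request succeeds. The sketch below is a generic polling helper, not part of the Dyff client. The name wait_until_ready, its default timeout and interval, and the is_ready callable are all assumptions; is_ready stands for whatever check you use, such as sending a small inference request and returning True on a valid response.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 1800.0,
    poll_interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the check succeeded, False if the timeout was reached.
    is_ready is a hypothetical, caller-supplied readiness check.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

The sleep and clock parameters exist only so the helper can be exercised in tests without real waiting.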
After verifying functionality, delete the session, either through the Dyff API or by deleting the Kubernetes inferencesession resource. The generated session script does not delete the session for you.
from dyff.client import Client

# Delete the InferenceSession through the Dyff API.
dyffapi = Client()
dyffapi.inferencesessions.delete("<InferenceSession ID>")
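Since the generated script leaves the session running, it can help to wrap verification so that deletion happens even if a check raises. This is a minimal sketch under that assumption. The context manager session_cleanup is hypothetical and takes the delete function as an argument, so it does not depend on any particular client.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def session_cleanup(session_id: str, delete: Callable[[str], None]) -> Iterator[str]:
    """Yield session_id, then call delete(session_id) on exit, even on error."""
    try:
        yield session_id
    finally:
        delete(session_id)


# Possible usage with the client call shown above:
# with session_cleanup("<InferenceSession ID>", dyffapi.inferencesessions.delete) as sid:
#     ...  # run verification checks against the session here
```

Passing the delete function in keeps the helper testable with a stand-in and makes no claims about the client beyond the delete call already shown.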