Create the InferenceService¶
Most of the configuration is specified when the service is created. The most important options are described below.
Example terminal command¶
python generate-huggingface-resource-scripts.py \
--hf_name=allenai/olmo-2-0325-32b-instruct \
--hf_revision=b96024342a77a69aa0dda815c3454a671f477463 \
--model_size=70Gi \
--model_id=f6288ef018b34ff285dde430af40035d \
--gpu_type="nvidia.com/gpu-a100" \
--gpu_count=4 \
--inference_api=completions \
--service_args='--max-model-len 90000' \
--inference_args='{"max_tokens": 2000}' \
--service_name="allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000" \
service
GPU configuration¶
--gpu_type='nvidia.com/gpu-a100'
--gpu_count=4
These options request a single-node configuration (--node_count=1) with 4 NVIDIA
A100 GPUs per node.
Command-line arguments to the service runner¶
--service_args='--max-model-len 90000'
service_args is a string containing command-line options. Shell quoting rules apply
within this string.
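To illustrate what shell quoting rules mean for this string, the sketch below tokenizes an example value with Python's standard shlex module. It assumes the generator splits service_args the way a POSIX shell would; the actual implementation may differ. The --example-flag option is hypothetical and is there only to show how inner quotes are handled.

```python
import shlex

# Example value for --service_args. "--example-flag" is a made-up option used
# only to demonstrate quoting; "--max-model-len 90000" comes from the command above.
service_args = "--max-model-len 90000 --example-flag 'two words'"

# shlex.split applies POSIX shell quoting rules: whitespace separates tokens,
# and quoted text stays together as a single token with the quotes removed.
tokens = shlex.split(service_args)

print(tokens)
# ['--max-model-len', '90000', '--example-flag', 'two words']
```

Because the whole string is itself quoted on the command line, any inner quotes must use the other quote style or be escaped.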
When using the vLLM runner, it is important to set --max-model-len inside
--service_args to a value no larger than the model’s maximum input size. For
HuggingFace models, this can be found in config.json under the key
max_position_embeddings (some older models use a different key). A lower value
conserves memory, but most use cases, especially reasoning, require a long context
window, so avoid going below 90000 when the model supports it. In practice, most 7–14B
models fit on A100s and 15–70B models fit on H100s, all on a single node. If memory is
still tight at that context length, the following parameters can be tuned:
--max-num-batched-tokens
--max-num-seqs
--max-cudagraph-capture-size
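The rule for choosing --max-model-len can be sketched as a small check against the model's config.json. This is a minimal illustration, not part of the generator script: the helper names are hypothetical, and the example config value is made up rather than taken from a real model.

```python
import json
from typing import Optional


def max_context_from_config(config_text: str) -> Optional[int]:
    """Return max_position_embeddings from config.json text, or None if the key is absent."""
    config = json.loads(config_text)
    return config.get("max_position_embeddings")


def choose_max_model_len(requested: int, config_text: str) -> int:
    """Cap the requested context length at the model's maximum input size."""
    limit = max_context_from_config(config_text)
    if limit is None:
        # Some older models use a different key; this sketch then keeps the
        # requested value and leaves the check to the reader.
        return requested
    return min(requested, limit)


# Made-up config content for illustration only.
example_config = '{"max_position_embeddings": 131072}'
print(choose_max_model_len(90000, example_config))  # 90000
```

The resulting number is what would go after --max-model-len inside --service_args.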
Runtime arguments to the service¶
--inference_args='{"max_tokens": 2000}'
inference_args is a string containing a JSON object. Its contents are merged into each API request sent to the service at runtime.
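As a rough picture of what merging means here, the sketch below combines a request body with the parsed inference_args. It assumes a shallow merge in which values from inference_args take precedence over the request; the source does not state the precedence, so treat that as an assumption. The function name and the request fields are hypothetical.

```python
import json


def merge_inference_args(request: dict, inference_args: str) -> dict:
    """Return a copy of the request with the inference_args JSON object merged in."""
    extra = json.loads(inference_args)
    if not isinstance(extra, dict):
        raise ValueError("inference_args must be a JSON object")
    merged = dict(request)  # copy, so the caller's request is not modified
    merged.update(extra)    # assumption: inference_args win on conflicting keys
    return merged


# Hypothetical request body for a completions-style API.
request = {"prompt": "Hello", "max_tokens": 16}
print(merge_inference_args(request, '{"max_tokens": 2000}'))
# {'prompt': 'Hello', 'max_tokens': 2000}
```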
Inference API and service name¶
--inference_api=completions
--service_name=allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000
The --inference_api flag selects which API the service should expose. The
--service_name can be used to differentiate services created with different options.
Our convention is that the service name looks like
<model/name>/<inference_api>[/<non-default options>]. Here we used c for
“context” and t for (output) “tokens”.
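The naming convention can be written down as a small helper. This is only a sketch: the function is hypothetical, and joining the non-default options with a hyphen is inferred from the example name above rather than from a documented rule.

```python
from typing import Sequence


def build_service_name(model_name: str, inference_api: str,
                       options: Sequence[str] = ()) -> str:
    """Build <model/name>/<inference_api>[/<non-default options>]."""
    parts = [model_name, inference_api]
    if options:
        # Assumption: options are hyphen-joined, as in "c90000-t2000".
        parts.append("-".join(options))
    return "/".join(parts)


name = build_service_name(
    "allenai/olmo-2-0325-32b-instruct",
    "completions",
    ["c90000", "t2000"],  # c = context length, t = output tokens
)
print(name)
# allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000
```

When every option is left at its default, the trailing segment is omitted and the name ends at the inference API.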
Run the generated script¶
Run the script that was generated in the previous step. Scripts are generated in the
services/ subdirectory.
python services/allenai--olmo-2-0325-32b-instruct--completions--c90000-t2000.py
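Judging from the example above, the script file name appears to be the service name with each "/" replaced by "--", plus a .py extension. The helper below encodes that observation; it is a hypothetical convenience function, not the generator's actual logic.

```python
from pathlib import PurePosixPath


def script_path_for(service_name: str) -> str:
    """Guess the generated script path for a service name (inferred from the example)."""
    file_name = service_name.replace("/", "--") + ".py"
    return str(PurePosixPath("services") / file_name)


print(script_path_for("allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000"))
# services/allenai--olmo-2-0325-32b-instruct--completions--c90000-t2000.py
```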