Create the InferenceService¶

Most of the configuration is specified when the service is created. The example command below is followed by a description of its most important options.

Example terminal command¶

python generate-huggingface-resource-scripts.py \
    --hf_name=allenai/olmo-2-0325-32b-instruct \
    --hf_revision=b96024342a77a69aa0dda815c3454a671f477463 \
    --model_size=70Gi \
    --model_id=f6288ef018b34ff285dde430af40035d \
    --gpu_type="nvidia.com/gpu-a100" \
    --gpu_count=4 \
    --inference_api=completions \
    --service_args='--max-model-len 90000' \
    --inference_args='{"max_tokens": 2000}' \
    --service_name="allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000" \
    service

GPU configuration¶

--gpu_type='nvidia.com/gpu-a100'
--gpu_count=4

These options request a single-node configuration (--node_count=1) with four NVIDIA A100 GPUs per node.

Command-line arguments to the service runner¶

--service_args='--max-model-len 90000'

service_args is a string containing command-line options. Shell quoting rules apply within this string.
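To see how shell quoting rules divide such a string into separate arguments, Python's standard shlex module can be used as a stand-in. This is only an illustrative sketch: whether the generator script itself relies on shlex is an assumption, and --example-flag is a hypothetical option used purely to show quoting.

```python
import shlex

# The value passed to --service_args in the example command.
service_args = "--max-model-len 90000"

# shlex.split applies POSIX shell quoting rules to the string.
tokens = shlex.split(service_args)
print(tokens)

# Quotes inside the string keep several words together as one argument.
# --example-flag is hypothetical and only illustrates the quoting.
quoted_tokens = shlex.split("--max-model-len 90000 --example-flag 'two words'")
print(quoted_tokens)
```

Checking a string this way before passing it to --service_args can catch unbalanced quotes early, since shlex.split raises ValueError for them.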

For the vLLM runner, set --max-model-len (inside --service_args) to a value no larger than the model's maximum input size. For Hugging Face models, this limit is found in config.json under the key max_position_embeddings (some older models use a different key). A lower value conserves memory, but most use cases, especially reasoning, require a long context window, so avoid going below 90000. In practice, most 7–14B models fit on A100s and 15–70B models fit on H100s, all on a single node. If memory is still tight at that context length, the following parameters can be tuned:

--max-num-batched-tokens
--max-num-seqs
--max-cudagraph-capture-size
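The rule for choosing --max-model-len can be sketched as a small helper that caps a desired context length at the limit from config.json. The function name pick_max_model_len and the config value below are hypothetical; read the real limit from the model's own config.json.

```python
import json


def pick_max_model_len(config: dict, desired: int) -> int:
    """Return `desired`, capped at the model's max_position_embeddings."""
    limit = config.get("max_position_embeddings")
    if limit is None:
        # Some older models store the limit under a different key.
        raise KeyError("max_position_embeddings not found in config")
    return min(desired, limit)


# Hypothetical config.json content; the value is made up for illustration.
example_config = json.loads('{"max_position_embeddings": 131072}')

max_model_len = pick_max_model_len(example_config, desired=90000)
service_args = f"--max-model-len {max_model_len}"
print(service_args)
```

With a model whose limit is smaller than the desired value, the helper returns the model's limit instead, which keeps the setting within the maximum input size.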

Runtime arguments to the service¶

--inference_args='{"max_tokens": 2000}'

inference_args is a string containing a JSON object. Its contents are merged into each API request sent to the service at runtime.
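The merge can be pictured as a dictionary update, as in the sketch below. The helper merge_inference_args and the request payload are hypothetical, and the precedence shown (values from inference_args override the request's own) is an assumption, not confirmed behavior of the service.

```python
import json


def merge_inference_args(request_body: dict, inference_args: str) -> dict:
    """Merge the JSON object in `inference_args` into a request body."""
    extra = json.loads(inference_args)
    if not isinstance(extra, dict):
        raise ValueError("inference_args must be a JSON object")
    # Assumption: keys from inference_args win over keys in the request.
    return {**request_body, **extra}


# Hypothetical request payload for a completions-style API.
request_body = {"prompt": "Hello"}

merged = merge_inference_args(request_body, '{"max_tokens": 2000}')
print(merged)
```

Parsing the string with json.loads before creating the service is also a quick way to confirm that the value given to --inference_args is valid JSON.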

Inference API and service name¶

--inference_api=completions
--service_name=allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000

The --inference_api flag selects which API the service exposes. Use --service_name to distinguish services created with different options. By convention, a service name has the form <model/name>/<inference_api>[/<non-default options>]. In this example, c90000 stands for the context length (--max-model-len 90000) and t2000 for the output token limit (max_tokens of 2000).
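As an illustration of the naming convention, the following sketch assembles a service name from its parts. The helper build_service_name is hypothetical and not part of the generator script; it only mirrors the pattern described above.

```python
def build_service_name(model_name: str, inference_api: str, options: str = "") -> str:
    """Join the parts as <model/name>/<inference_api>[/<non-default options>]."""
    parts = [model_name, inference_api]
    if options:
        # The options segment is only present for non-default settings.
        parts.append(options)
    return "/".join(parts)


# c = context length, t = output token limit, as in the example command.
options = f"c{90000}-t{2000}"
name = build_service_name("allenai/olmo-2-0325-32b-instruct", "completions", options)
print(name)
```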

Run the generated script¶

Run the script generated in the previous step. Generated scripts are written to the services/ subdirectory.

python services/allenai--olmo-2-0325-32b-instruct--completions--c90000-t2000.py
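Judging from the example above, the script file name appears to be the service name with each / replaced by --, plus a .py extension. The sketch below reproduces that mapping; the helper script_path_for is hypothetical, and the rule is inferred from this single example rather than documented behavior.

```python
from pathlib import PurePosixPath


def script_path_for(service_name: str) -> str:
    """Guess the generated script path for a given service name."""
    # Inferred rule: "/" in the service name becomes "--" in the file name.
    file_name = service_name.replace("/", "--") + ".py"
    return str(PurePosixPath("services") / file_name)


path = script_path_for("allenai/olmo-2-0325-32b-instruct/completions/c90000-t2000")
print(path)
```

If the guessed path does not exist, listing the contents of the services/ subdirectory shows the actual file name.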