Usage Guide¶
This guide explains how to use the FIRST Inference Gateway to access Large Language Models through an OpenAI-compatible API.
Authentication¶
The Gateway uses Globus Auth for authentication. You need a valid access token to make requests.
Getting Your Access Token¶
Use the provided authentication script:
First-Time Setup¶
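Run the authenticate command once to log in with your Globus account:
# Authenticate with Globus and store tokens locally
python inference-auth-token.py authenticate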
This stores refresh and access tokens locally (typically in ~/.globus/app/...).
Getting an Access Token¶
# Retrieve your current valid access token
export MY_TOKEN=$(python inference-auth-token.py get_access_token)
echo "Token stored in MY_TOKEN environment variable."
The script automatically refreshes expired tokens using your stored refresh token.
Force Re-authentication¶
If you need to change accounts or encounter permission errors:
# Log out from Globus
# Visit: https://app.globus.org/logout
# Force re-authentication
python inference-auth-token.py authenticate --force
Token Validity¶
- Access tokens: Valid for 48 hours
- Refresh tokens: Valid for 6 months of inactivity
- Some institutions may enforce more frequent re-authentication (e.g., weekly)
Making Inference Requests¶
The Gateway provides an OpenAI-compatible API with two main endpoints:
Chat Completions Endpoint¶
For conversational interactions:
Federated endpoint (routes across multiple backends):
curl -X POST http://127.0.0.1:8000/resource_server/v1/chat/completions \
-H "Authorization: Bearer $MY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"messages": [
{"role": "user", "content": "Explain the concept of Globus Compute in simple terms."}
],
"max_tokens": 150
}'
Specific backend (targets a particular cluster/framework):
curl -X POST http://127.0.0.1:8000/resource_server/local/vllm/v1/chat/completions \
-H "Authorization: Bearer $MY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"messages": [
{"role": "user", "content": "Explain the concept of Globus Compute in simple terms."}
],
"max_tokens": 150
}'
Completions Endpoint¶
For text completion:
curl -X POST http://127.0.0.1:8000/resource_server/v1/completions \
-H "Authorization: Bearer $MY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"prompt": "The future of AI is",
"max_tokens": 100
}'
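The same call through the OpenAI Python SDK (a sketch; client configuration is covered under Using the OpenAI Python SDK below):
import os
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/resource_server/v1",
    api_key=os.environ["MY_TOKEN"]  # Your Globus access token
)

response = client.completions.create(
    model="facebook/opt-125m",
    prompt="The future of AI is",
    max_tokens=100
)
print(response.choices[0].text)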
Streaming Responses¶
Both endpoints support streaming. Set stream: true in your request:
Python example with streaming:
import os
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/resource_server/v1",
    api_key=os.environ["MY_TOKEN"]  # Your Globus access token
)

stream = client.chat.completions.create(
    model="facebook/opt-125m",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
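The same streaming request with curl, with "stream": true set in the body (the -N flag disables curl's output buffering so chunks print as they arrive):
curl -N -X POST http://127.0.0.1:8000/resource_server/v1/chat/completions \
-H "Authorization: Bearer $MY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "facebook/opt-125m",
"messages": [{"role": "user", "content": "Explain quantum computing"}],
"max_tokens": 150,
"stream": true
}'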
Testing Streaming with ngrok¶
For testing streaming with remote endpoints, use ngrok to create a secure tunnel:
1. Install ngrok: visit ngrok.com
2. Start the tunnel
3. Update your test script to use the ngrok URL (see the example below)
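For example (a sketch, assuming the Gateway is listening locally on port 8000):
# Expose the locally running Gateway through an ngrok tunnel
ngrok http 8000
Then replace the host in your test script's base_url (or curl URL) with the HTTPS forwarding address ngrok prints, keeping the /resource_server/v1 path.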
Using the OpenAI Python SDK¶
The Gateway is fully compatible with the OpenAI Python SDK:
import os
import openai

# Configure the client to point to the Gateway
client = openai.OpenAI(
    base_url="http://localhost:8000/resource_server/v1",
    api_key=os.environ["MY_TOKEN"]  # Your Globus access token
)

# Make a request
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)
Available Models¶
To see available models, contact your Gateway administrator or check the deployment's model catalog.
Common model naming conventions:
- facebook/opt-125m, facebook/opt-1.3b, etc.
- meta-llama/Meta-Llama-3-8B-Instruct
- openai/gpt-4o-mini (if configured with an OpenAI backend)
The exact models available depend on your deployment's configuration.
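If your deployment exposes the OpenAI-compatible models route (an assumption; not every deployment enables it), you can also query it with the SDK:
import os
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/resource_server/v1",
    api_key=os.environ["MY_TOKEN"]
)

# List the model identifiers the Gateway reports (assumes the /v1/models route is enabled)
for model in client.models.list():
    print(model.id)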
Batch Processing¶
For large-scale inference workloads, the Gateway supports batch processing:
curl -X POST http://127.0.0.1:8000/resource_server/v1/batches \
-H "Authorization: Bearer $MY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"input_file_path": "/path/to/input.jsonl",
"output_folder_path": "/path/to/output/",
"model": "meta-llama/Meta-Llama-3-8B-Instruct"
}'
Input file format (JSONL):
{"messages": [{"role": "user", "content": "What is AI?"}], "max_tokens": 100}
{"messages": [{"role": "user", "content": "Explain ML"}], "max_tokens": 100}
Check batch status:
curl -X GET http://127.0.0.1:8000/resource_server/v1/batches/<batch_id> \
-H "Authorization: Bearer $MY_TOKEN"
Benchmarking¶
The repository includes a benchmark script for performance testing:
Download Dataset¶
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json -P examples/load-testing/
Run Benchmark¶
python examples/load-testing/benchmark-serving.py \
--backend vllm \
--model facebook/opt-125m \
--base-url http://127.0.0.1:8000/resource_server/v1/chat/completions \
--dataset-name sharegpt \
--dataset-path examples/load-testing/ShareGPT_V3_unfiltered_cleaned_split.json \
--output-file benchmark_results.jsonl \
--num-prompts 100
Additional options:
- --request-rate: Control the request rate (requests per second)
- --max-concurrency: Limit the number of concurrent requests
- --disable-ssl-verification: For testing with self-signed certificates
- --disable-stream: Test non-streaming mode
Request Parameters¶
Common Parameters¶
| Parameter | Type | Description |
|---|---|---|
| model | string | Model identifier (required) |
| messages | array | List of message objects (chat endpoint) |
| prompt | string | Text prompt (completions endpoint) |
| max_tokens | integer | Maximum tokens to generate |
| temperature | float | Sampling temperature (0.0-2.0) |
| top_p | float | Nucleus sampling parameter |
| stream | boolean | Enable streaming responses |
| stop | string/array | Stop sequences |
Example with Advanced Parameters¶
curl -X POST http://127.0.0.1:8000/resource_server/v1/chat/completions \
-H "Authorization: Bearer $MY_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [{"role": "user", "content": "Write a story"}],
"max_tokens": 500,
"temperature": 0.7,
"top_p": 0.9,
"stream": false,
"stop": ["\n\n"]
}'
Error Handling¶
The Gateway returns standard HTTP status codes:
| Code | Meaning |
|---|---|
| 200 | Success |
| 400 | Bad Request (invalid parameters) |
| 401 | Unauthorized (invalid or missing token) |
| 403 | Forbidden (insufficient permissions) |
| 404 | Not Found (model or endpoint not available) |
| 429 | Too Many Requests (rate limited) |
| 500 | Internal Server Error |
| 503 | Service Unavailable (backend offline) |
Error response format:
{
"error": {
"message": "Model not found",
"type": "invalid_request_error",
"code": "model_not_found"
}
}
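With the OpenAI Python SDK, these status codes surface as typed exceptions (a sketch; the exact error payloads depend on the deployment):
import os
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/resource_server/v1",
    api_key=os.environ["MY_TOKEN"]
)

try:
    response = client.chat.completions.create(
        model="facebook/opt-125m",
        messages=[{"role": "user", "content": "Hello"}],
        max_tokens=50
    )
except openai.AuthenticationError:
    # 401: token missing, expired, or invalid; re-run the auth script
    print("Authentication failed; refresh your Globus token.")
except openai.RateLimitError:
    # 429: back off and retry later
    print("Rate limited; retry with backoff.")
except openai.APIStatusError as e:
    # Any other non-2xx response
    print(f"Request failed with status {e.status_code}: {e.message}")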
Best Practices¶
- Token Management
  - Store tokens securely
  - Refresh tokens before they expire
  - Never commit tokens to version control
- Request Optimization
  - Use appropriate max_tokens values
  - Implement retry logic with exponential backoff (see the sketch after this list)
  - Use streaming for long responses
- Rate Limiting
  - Respect rate limits
  - Implement client-side throttling
  - Use batch processing for bulk operations
- Error Handling
  - Handle network errors gracefully
  - Implement timeout logic
  - Log errors for debugging
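A minimal retry sketch with exponential backoff (the retry counts, delays, and timeout are illustrative, not Gateway requirements):
import os
import time
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/resource_server/v1",
    api_key=os.environ["MY_TOKEN"],
    timeout=60  # request timeout in seconds (illustrative)
)

def chat_with_retry(messages, max_retries=5):
    """Retry transient failures (rate limits, connection errors, 5xx) with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="facebook/opt-125m",
                messages=messages,
                max_tokens=150
            )
        except (openai.RateLimitError, openai.APIConnectionError, openai.InternalServerError):
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            time.sleep(wait)
    raise RuntimeError("Request failed after retries")

response = chat_with_retry([{"role": "user", "content": "What is machine learning?"}])
print(response.choices[0].message.content)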
Support¶
For issues, questions, or feature requests:
- Check the GitHub repository
- Contact your Gateway administrator
- Review the Administrator Setup Guide for deployment issues