FIRST

Welcome to the documentation for the Federated Inference Resource Scheduling Toolkit (FIRST). FIRST provides AI model inference as a service across distributed HPC clusters through an OpenAI-compatible API.

What is FIRST?

FIRST (Federated Inference Resource Scheduling Toolkit) is a system for secure, remote execution of inference on AI models through an OpenAI-compatible API. It validates and authorizes inference requests to scientific computing clusters using Globus Auth and Globus Compute.
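
Because the API is OpenAI-compatible, a client talks to FIRST with a standard chat-completions request, passing a Globus access token as the bearer credential. The sketch below builds such a request with only the Python standard library; the gateway URL, token value, and model identifier are hypothetical placeholders, and the request is constructed but not sent.

```python
"""Sketch: an OpenAI-style chat-completions request aimed at a FIRST gateway.

The gateway host, token, and model id below are illustrative placeholders;
in practice the bearer token is an access token issued by Globus Auth.
"""
import json
import urllib.request

GATEWAY_URL = "https://first.example.org"   # hypothetical gateway host
GLOBUS_TOKEN = "<globus-access-token>"      # obtained via Globus Auth

payload = {
    "model": "example/model-id",            # hypothetical model identifier
    "messages": [
        {"role": "user", "content": "Summarize the FIRST toolkit."}
    ],
}

# Standard OpenAI-compatible route; FIRST authorizes the token, then routes
# the request to an inference backend (e.g. vLLM) via Globus Compute.
request = urllib.request.Request(
    url=f"{GATEWAY_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {GLOBUS_TOKEN}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# To actually send it (requires a reachable gateway):
#   with urllib.request.urlopen(request) as resp:
#       print(json.load(resp))
```

The same request works with any OpenAI-compatible client; only the base URL and credential differ from a vanilla OpenAI setup.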

System Architecture

The Inference Gateway consists of several components:

  • API Gateway: Django-based REST/Ninja API that handles authorization and request routing
  • Globus Auth: Authentication and authorization service
  • Globus Compute Endpoints: Remote execution framework on HPC clusters (or local machines)
  • Inference Server Backend: High-performance inference service for LLMs (e.g., vLLM)


Key Features

  • Federated Access: Route requests across multiple HPC clusters automatically
  • OpenAI-Compatible: Works with existing OpenAI SDK and tools
  • Secure: Globus Auth integration with group-based access control
  • High Performance: Support for vLLM and other optimized inference backends
  • Flexible: Deploy via Docker, bare metal, or Kubernetes
  • Scalable: Auto-scaling and resource management for HPC environments
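
Since the gateway speaks the OpenAI wire protocol, existing tooling such as the official OpenAI Python SDK can be pointed at it by changing only the base URL and credential. A minimal sketch, assuming a hypothetical gateway URL and model id (the import is guarded so the snippet is illustrative even without the SDK installed):

```python
"""Sketch: reusing the official OpenAI Python SDK against a FIRST gateway."""
GATEWAY_URL = "https://first.example.org/v1"  # hypothetical gateway base URL
GLOBUS_TOKEN = "<globus-access-token>"        # Globus token in place of an API key

try:
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=GATEWAY_URL, api_key=GLOBUS_TOKEN)
    # A standard chat-completions call, routed by FIRST to an HPC-hosted model
    # (requires a reachable gateway, so it is left commented here):
    # response = client.chat.completions.create(
    #     model="example/model-id",  # hypothetical model identifier
    #     messages=[{"role": "user", "content": "Hello"}],
    # )
except ImportError:
    client = None  # SDK not installed; snippet shown for illustration only
```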

Example Deployment

For a production example, see the ALCF Inference Endpoints documentation.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
