Monitoring & Troubleshooting¶
This guide covers monitoring the FIRST Inference Gateway and troubleshooting common issues.
Monitoring¶
Application Logs¶
Docker Deployment¶
# View all logs
docker-compose logs -f
# View gateway logs only
docker-compose logs -f inference-gateway
# Last 100 lines
docker-compose logs --tail=100 inference-gateway
Bare Metal Deployment¶
# Application logs
tail -f logs/django_info.log
# Gunicorn logs
tail -f logs/backend_gateway.error.log
tail -f logs/backend_gateway.access.log
# Systemd service logs
sudo journalctl -u inference-gateway -f
Database Monitoring¶
# Connection stats
psql -h localhost -U inferencedev -d inferencegateway -c "SELECT * FROM pg_stat_activity;"
# Table sizes
psql -h localhost -U inferencedev -d inferencegateway -c "
SELECT schemaname,tablename,pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables WHERE schemaname='public' ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;"
Redis Monitoring¶
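A few redis-cli commands cover most day-to-day checks. This is a minimal sketch assuming Redis runs locally on the default port; add -h, -p, or -a options to match your deployment:
# Liveness
redis-cli ping
# Memory usage, hit/miss stats, connected clients
redis-cli info memory
redis-cli info stats
redis-cli info clients
# Rolling ops/sec and keyspace overview
redis-cli --stat
# Number of keys in the current database
redis-cli dbsize
# Recent slow commands
redis-cli slowlog get 10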
Globus Compute Endpoints¶
# List endpoints
globus-compute-endpoint list
# Check status
globus-compute-endpoint status my-endpoint
# View logs
globus-compute-endpoint log my-endpoint -n 100
# Follow logs
tail -f ~/.globus_compute/my-endpoint/endpoint.log
Common Issues¶
Gateway Won't Start¶
Symptoms: Container/service fails to start
Check:
# Docker
docker-compose logs inference-gateway
# Bare metal
sudo journalctl -u inference-gateway -n 50
python manage.py check
Common Causes:
- Missing environment variables
- Database connection failure
- Port already in use
- Syntax error in settings
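To narrow these causes down, the checks below are a rough sketch; the port 8000 and the .env location are assumptions based on the defaults used elsewhere in this guide:
# Is the port already bound by another process?
ss -tlnp | grep 8000
# Which environment variables are defined? (prints names only, not values)
grep -v '^#' .env | cut -d= -f1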
Database Connection Errors¶
Symptoms: OperationalError: could not connect to server
Solutions:
# Verify PostgreSQL is running
sudo systemctl status postgresql
docker-compose ps postgres
# Test connection
psql -h localhost -U inferencedev -d inferencegateway
# Check pg_hba.conf
sudo nano /etc/postgresql/*/main/pg_hba.conf
# Restart PostgreSQL
sudo systemctl restart postgresql
Authentication Failures¶
Symptoms: 401 Unauthorized, Globus token errors
Solutions:
- Verify Globus application credentials in .env
- Check that the scope was created successfully
- Force re-authentication (see the sketch below)
- Verify redirect URIs match in Globus app settings
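The following is a hedged sketch of those steps: the GLOBUS_* variable names are illustrative (match them to your settings module), and the token cache path is the Globus Compute SDK default, which can vary by SDK version, so confirm it before deleting anything:
# List which Globus credential variables are set (names only, values not printed)
grep -E '^GLOBUS' .env | cut -d= -f1
# Force re-authentication by clearing cached tokens
# (default Globus Compute SDK token store; verify the path for your SDK version)
rm ~/.globus_compute/storage.db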
Globus Compute Errors¶
Symptoms: Function execution failures, timeout errors
Solutions:
# Check endpoint is running
globus-compute-endpoint list
# Restart endpoint
globus-compute-endpoint restart my-endpoint
# View detailed logs
globus-compute-endpoint log my-endpoint -n 200
# Verify function UUID is allowed
cat ~/.globus_compute/my-endpoint/config.yaml
Model Not Found¶
Symptoms: Model 'xxx' not found errors
Solutions:
- Verify the fixture was loaded (see the sketch below)
- Check that the model name matches the fixture entry exactly
- Reload fixtures if the entry is missing
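As a sketch, the commands below assume the gateway stores its model registry in a Django model reachable from manage.py; the import path and fixture file name are illustrative, so substitute the ones from your deployment:
# List registered model names (import path is illustrative)
python manage.py shell -c "from models.models import Model; print(list(Model.objects.values_list('name', flat=True)))"
# Reload the fixture (fixture file name is illustrative)
python manage.py loaddata models.json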
Slow Response Times¶
Causes:
- Cold start (first request to endpoint)
- GPU not available
- Model loading time
- Network latency
Solutions:
- Enable hot nodes (min_blocks > 0 in Globus Compute config)
- Monitor GPU usage with nvidia-smi
- Check vLLM logs for bottlenecks
- Increase the Gunicorn timeout (see the sketch below)
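A minimal sketch of the Gunicorn timeout change; the WSGI module name, bind address, and 300-second value are assumptions, so match them to your service unit or start script:
# --timeout and --workers are standard Gunicorn flags
gunicorn inference_gateway.wsgi:application --bind 0.0.0.0:8000 --workers 4 --timeout 300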
Out of Memory Errors¶
Symptoms: OOM kills, CUDA out of memory
Solutions:
# vLLM: Reduce GPU memory usage
--gpu-memory-utilization 0.7
# vLLM: Use quantization
--quantization awq
# vLLM: Reduce context length
--max-model-len 2048
# System: Add swap space
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
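Put together, a vLLM launch with these flags might look like the sketch below; the model name and port are placeholders, and --quantization awq applies only if the checkpoint is actually AWQ-quantized:
# Model name and port are placeholders; match them to your deployment
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8001 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 2048
# Add --quantization awq only for an AWQ-quantized checkpoint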
Health Checks¶
Manual Health Checks¶
# Gateway health
curl http://localhost:8000/
# vLLM health
curl http://localhost:8001/health
# Database connectivity
python manage.py dbshell
# Redis connectivity
redis-cli ping
Automated Health Monitoring¶
Create a health check script:
#!/bin/bash
# health_check.sh

# Check gateway
if curl -s http://localhost:8000/ > /dev/null; then
    echo "✓ Gateway is healthy"
else
    echo "✗ Gateway is down"
    systemctl restart inference-gateway
fi

# Check database
if psql -h localhost -U inferencedev -d inferencegateway -c "SELECT 1;" > /dev/null 2>&1; then
    echo "✓ Database is healthy"
else
    echo "✗ Database is down"
fi
Add to crontab:
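For example, an entry like this runs the check every five minutes; the script path and log file are placeholders:
*/5 * * * * /opt/inference-gateway/health_check.sh >> /var/log/health_check.log 2>&1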
Performance Metrics¶
Key Metrics to Monitor¶
- Request Rate: Requests per second
- Latency: Response time (p50, p95, p99)
- Error Rate: Percentage of failed requests
- Queue Depth: Pending Globus Compute tasks
- GPU Utilization: GPU memory and compute usage
- Database Connections: Active connections
- Cache Hit Rate: Redis cache effectiveness
Prometheus Metrics¶
If using Prometheus, key metrics to track:
# Request metrics
http_requests_total
http_request_duration_seconds
# Globus Compute metrics
globus_compute_tasks_submitted
globus_compute_tasks_completed
globus_compute_tasks_failed
# System metrics
process_cpu_seconds_total
process_resident_memory_bytes
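With those metric names, typical PromQL queries look like the sketch below; the p95 query assumes http_request_duration_seconds is exported as a histogram:
# Request rate over the last 5 minutes
rate(http_requests_total[5m])
# 95th percentile request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Globus Compute failure ratio
rate(globus_compute_tasks_failed[5m]) / rate(globus_compute_tasks_submitted[5m])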
Troubleshooting Checklist¶
When issues occur, work through this checklist:
- [ ] Check application logs
- [ ] Verify all services are running
- [ ] Test database connectivity
- [ ] Check Redis connectivity
- [ ] Verify Globus Compute endpoints are online
- [ ] Test authentication flow
- [ ] Check network connectivity
- [ ] Review recent configuration changes
- [ ] Check disk space
- [ ] Monitor resource usage (CPU, RAM, GPU)
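For the resource-related items, a quick snapshot like the one below usually suffices; these are standard Linux tools, with nvidia-smi applying only on GPU nodes and docker-compose ps only to Docker deployments:
df -h              # disk space
free -h            # memory
uptime             # load average
nvidia-smi         # GPU memory and utilization
docker-compose ps  # container status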
Getting Help¶
If you're still stuck:
- Check documentation: Review the relevant setup guides
- Search issues: Look for similar issues on GitHub
- Enable debug logging: set DEBUG=True temporarily
- Collect information:
- Version information
- Error messages and stack traces
- Configuration (sanitize secrets!)
- Relevant log excerpts
- Open an issue: Provide all collected information