Monitoring & Troubleshooting¶
This guide covers monitoring the FIRST Inference Gateway and troubleshooting common issues.
Monitoring¶
Application Logs¶
Docker Deployment¶
# View all logs
docker-compose logs -f
# View gateway logs only
docker-compose logs -f inference-gateway
# Last 100 lines
docker-compose logs --tail=100 inference-gateway
Bare Metal Deployment¶
# Application logs
tail -f logs/django_info.log
# Gunicorn logs
tail -f logs/backend_gateway.error.log
tail -f logs/backend_gateway.access.log
# Systemd service logs
sudo journalctl -u inference-gateway -f
Database Monitoring¶
# Connection stats
psql -h localhost -U inferencedev -d inferencegateway -c "SELECT * FROM pg_stat_activity;"
# Table sizes
psql -h localhost -U inferencedev -d inferencegateway -c "
SELECT schemaname,tablename,pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables WHERE schemaname='public' ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;"
Redis Monitoring¶
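A few redis-cli commands cover most day-to-day checks. This is a minimal sketch assuming Redis runs locally on the default port; add -h, -p, or -a options to match your deployment:
# Liveness
redis-cli ping
# Memory usage, hit/miss stats, connected clients
redis-cli info memory
redis-cli info stats
redis-cli info clients
# Rolling ops/sec and keyspace overview
redis-cli --stat
# Number of keys in the current database
redis-cli dbsize
# Recent slow commands
redis-cli slowlog get 10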
Globus Compute Endpoints¶
# List endpoints
globus-compute-endpoint list
# Check status
globus-compute-endpoint status my-endpoint
# View logs
globus-compute-endpoint log my-endpoint -n 100
# Follow logs
tail -f ~/.globus_compute/my-endpoint/endpoint.log
Common Issues¶
Gateway Won't Start¶
Symptoms: Container/service fails to start
Check:
# Docker
docker-compose logs inference-gateway
# Bare metal
sudo journalctl -u inference-gateway -n 50
python manage.py check
Common Causes:
- Missing environment variables
- Database connection failure
- Port already in use
- Syntax error in settings
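To narrow these causes down, the checks below are a rough sketch; the port 8000 and the .env location are assumptions based on the defaults used elsewhere in this guide:
# Is the port already bound by another process?
ss -tlnp | grep 8000
# Which environment variables are defined? (prints names only, not values)
grep -v '^#' .env | cut -d= -f1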
Database Connection Errors¶
Symptoms: OperationalError: could not connect to server
Solutions:
# Verify PostgreSQL is running
sudo systemctl status postgresql
docker-compose ps postgres
# Test connection
psql -h localhost -U inferencedev -d inferencegateway
# Check pg_hba.conf
sudo nano /etc/postgresql/*/main/pg_hba.conf
# Restart PostgreSQL
sudo systemctl restart postgresql
Authentication Failures¶
Symptoms: 401 Unauthorized, Globus token errors
Solutions:
- Verify Globus application credentials in .env
- Check that the scope was created successfully
- Force re-authentication (see the sketch below)
- Verify redirect URIs match in Globus app settings
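The following is a hedged sketch of those steps: the GLOBUS_* variable names are illustrative (match them to your settings module), and the token cache path is the Globus Compute SDK default, which can vary by SDK version, so confirm it before deleting anything:
# List which Globus credential variables are set (names only, values not printed)
grep -E '^GLOBUS' .env | cut -d= -f1
# Force re-authentication by clearing cached tokens
# (default Globus Compute SDK token store; verify the path for your SDK version)
rm ~/.globus_compute/storage.db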
Globus Compute Errors¶
Symptoms: Function execution failures, timeout errors
Solutions:
# Check endpoint is running
globus-compute-endpoint list
# Restart endpoint
globus-compute-endpoint restart my-endpoint
# View detailed logs
globus-compute-endpoint log my-endpoint -n 200
# Verify function UUID is allowed
cat ~/.globus_compute/my-endpoint/config.yaml
Model Not Found¶
Symptoms: Model 'xxx' not found errors
Solutions:
- Verify the fixture was loaded (see the sketch below)
- Check that the model name matches the fixture entry exactly
- Reload fixtures if the entry is missing
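As a sketch, the commands below assume the gateway stores its model registry in a Django model reachable from manage.py; the import path and fixture file name are illustrative, so substitute the ones from your deployment:
# List registered model names (import path is illustrative)
python manage.py shell -c "from models.models import Model; print(list(Model.objects.values_list('name', flat=True)))"
# Reload the fixture (fixture file name is illustrative)
python manage.py loaddata models.json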
Slow Response Times¶
Causes:
- Cold start (first request to endpoint)
- GPU not available
- Model loading time
- Network latency
Solutions:
- Enable hot nodes (min_blocks > 0 in Globus Compute config)
- Monitor GPU usage with nvidia-smi
- Check vLLM logs for bottlenecks
- Increase the Gunicorn timeout (see the sketch below)
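A minimal sketch of the Gunicorn timeout change; the WSGI module name, bind address, and 300-second value are assumptions, so match them to your service unit or start script:
# --timeout and --workers are standard Gunicorn flags
gunicorn inference_gateway.wsgi:application --bind 0.0.0.0:8000 --workers 4 --timeout 300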
Out of Memory Errors¶
Symptoms: OOM kills, CUDA out of memory
Solutions:
# vLLM: Reduce GPU memory usage
--gpu-memory-utilization 0.7
# vLLM: Use quantization
--quantization awq
# vLLM: Reduce context length
--max-model-len 2048
# System: Add swap space
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
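Put together, a vLLM launch with these flags might look like the sketch below; the model name and port are placeholders, and --quantization awq applies only if the checkpoint is actually AWQ-quantized:
# Model name and port are placeholders; match them to your deployment
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8001 \
  --gpu-memory-utilization 0.7 \
  --max-model-len 2048
# Add --quantization awq only for an AWQ-quantized checkpoint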
Health Checks¶
Manual Health Checks¶
# Gateway health
curl http://localhost:8000/
# vLLM health
curl http://localhost:8001/health
# Database connectivity
python manage.py dbshell
# Redis connectivity
redis-cli ping
Automated Health Monitoring¶
Create a health check script:
#!/bin/bash
# health_check.sh

# Check gateway
if curl -s http://localhost:8000/ > /dev/null; then
    echo "✓ Gateway is healthy"
else
    echo "✗ Gateway is down"
    systemctl restart inference-gateway
fi

# Check database
if psql -h localhost -U inferencedev -d inferencegateway -c "SELECT 1;" > /dev/null 2>&1; then
    echo "✓ Database is healthy"
else
    echo "✗ Database is down"
fi
Add to crontab:
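For example, an entry like this runs the check every five minutes; the script path and log file are placeholders:
*/5 * * * * /opt/inference-gateway/health_check.sh >> /var/log/health_check.log 2>&1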
Performance Metrics¶
Key Metrics to Monitor¶
- Request Rate: Requests per second
- Latency: Response time (p50, p95, p99)
- Error Rate: Percentage of failed requests
- Queue Depth: Pending Globus Compute tasks
- GPU Utilization: GPU memory and compute usage
- Database Connections: Active connections
- Cache Hit Rate: Redis cache effectiveness
Prometheus Metrics¶
If using Prometheus, key metrics to track:
# Request metrics
http_requests_total
http_request_duration_seconds
# Globus Compute metrics
globus_compute_tasks_submitted
globus_compute_tasks_completed
globus_compute_tasks_failed
# System metrics
process_cpu_seconds_total
process_resident_memory_bytes
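With those metric names, typical PromQL queries look like the sketch below; the p95 query assumes http_request_duration_seconds is exported as a histogram:
# Request rate over the last 5 minutes
rate(http_requests_total[5m])
# 95th percentile request latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Globus Compute failure ratio
rate(globus_compute_tasks_failed[5m]) / rate(globus_compute_tasks_submitted[5m])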
Troubleshooting Checklist¶
When issues occur, work through this checklist:
- [ ] Check application logs
- [ ] Verify all services are running
- [ ] Test database connectivity
- [ ] Check Redis connectivity
- [ ] Verify Globus Compute endpoints are online
- [ ] Test authentication flow
- [ ] Check network connectivity
- [ ] Review recent configuration changes
- [ ] Check disk space
- [ ] Monitor resource usage (CPU, RAM, GPU)
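For the resource-related items, a quick snapshot like the one below usually suffices; these are standard Linux tools, with nvidia-smi applying only on GPU nodes and docker-compose ps only to Docker deployments:
df -h              # disk space
free -h            # memory
uptime             # load average
nvidia-smi         # GPU memory and utilization
docker-compose ps  # container status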
Getting Help¶
If you're still stuck:
- Check documentation: Review the relevant setup guides
- Search issues: Look for similar issues on GitHub
- Enable debug logging: set DEBUG=True temporarily
- Collect information:
- Version information
- Error messages and stack traces
- Configuration (sanitize secrets!)
- Relevant log excerpts
- Open an issue: Provide all collected information