Monitoring & Troubleshooting
This page covers how to monitor a running ReArch deployment and resolve common problems.
Health Checks
All services in the Docker Compose stack include health checks:
| Service | Health Endpoint | Interval |
|---|---|---|
| Frontend | GET / (nginx) | 30s |
| Backend | GET /health | 30s |
| MCP Proxy | GET /health | 30s |
| Keycloak | GET /health/ready | 30s |
| Redis | redis-cli ping | 30s |
| MongoDB | mongosh --eval "db.runCommand('ping')" | 30s |
Use `docker service ls` to check the status of all services in the Swarm stack. A service showing 0/1 replicas has no running task, which often points to a failing health check.
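To spot under-replicated services without reading the whole table, the replica column can be filtered. The following is a minimal sketch, not part of ReArch: the helper name `unhealthy_services` is hypothetical, and it assumes input lines of the form `NAME RUNNING/DESIRED`.

```shell
# Hypothetical helper: print services whose running replica count
# is below the desired count.
# Input (stdin): one "NAME RUNNING/DESIRED" pair per line.
unhealthy_services() {
  awk '{
    split($2, r, "/")                 # r[1] = running, r[2] = desired
    if (r[1] + 0 < r[2] + 0) print $1
  }'
}
```

On a Swarm manager it could be fed with `docker service ls --format '{{.Name}} {{.Replicas}}' | unhealthy_services`. Empty output means every service has reached its desired replica count.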
Viewing Logs
Docker Swarm
    # All logs for a service
    docker service logs rearch_backend --follow
    # Recent logs with timestamps
    docker service logs rearch_backend --since 1h --timestamps
    # All services
    docker service logs rearch_frontend --follow
    docker service logs rearch_mcp-proxy --follow
    docker service logs rearch_keycloak --follow
Local Development
    ./development.sh logs            # All services
    ./development.sh logs backend    # Specific service
    ./development.sh logs sessions   # Conversation containers
Common Issues
Conversation container fails to start
Symptoms: Container status shows “error” in the session sidebar. The jobs dashboard shows a failed container setup job.
Common causes:
- Docker image not found: rebuild the image from the repository settings.
- Docker daemon unreachable: verify that the backend can access `/var/run/docker.sock`.
- Insufficient disk space: clean up old images with `docker image prune`.
- Network not found: ensure the overlay network exists (`docker network ls`).
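The daemon and network checks can be scripted. A rough sketch, assuming a POSIX shell: the helper names `socket_available` and `network_exists` are illustrative, and the network name to look for depends on your deployment.

```shell
# Hypothetical helper: succeed if the given path is a Unix socket
# that the current user can read and write.
socket_available() {
  [ -S "$1" ] && [ -r "$1" ] && [ -w "$1" ]
}

# Hypothetical helper: succeed if the network name in $1 appears
# in the list of names read from stdin (one name per line).
network_exists() {
  grep -Fxq -- "$1"
}
```

For example, `socket_available /var/run/docker.sock || echo "socket not accessible"` is most meaningful when run inside the backend container, since that is the process that needs the socket. The network check could be fed with `docker network ls --format '{{.Name}}' | network_exists <network-name>`.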
Agent not responding
Agent not responding
Symptoms: Messages are sent but the agent does not reply.
Check:
- Container status is “running” in the session sidebar.
- The LLM provider is configured with a valid API key (Administration > Settings > LLM Providers).
- The selected model is enabled and the provider API is reachable from the backend.
- Backend logs show no errors related to the AI SDK.
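For the log check, error lines can be filtered out of recent backend output. A minimal sketch: the helper name `error_lines` is hypothetical, and the pattern is a guess at typical wording, since the backend's log format is not described on this page.

```shell
# Hypothetical helper: print only log lines that mention an error.
# The pattern is an assumption; adjust it to the backend's log format.
error_lines() {
  grep -Ei 'error|exception|unauthorized|rate limit' || true
}
```

It could be used as `docker service logs rearch_backend --since 1h 2>&1 | error_lines`. The `|| true` keeps the helper from reporting failure when no lines match, which is the healthy case.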
MCP tools not available
MCP tools not available
Symptoms: The agent cannot use MCP tools like GitHub, Sentry, etc.
Check:
- MCP servers are configured and enabled in Administration > MCP Servers.
- The MCP proxy is healthy (`docker service logs rearch_mcp-proxy`).
- The proxy status indicator on the MCP Servers page shows “connected”.
- `MCP_PROXY_SECRET` matches between the backend and the MCP proxy service.
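One way to compare the secret on both sides is to read each service's environment and compare the values without printing them. This sketch assumes `MCP_PROXY_SECRET` is set as a plain environment variable on both Swarm services (not through Docker secrets or a mounted file); the helper names `env_value` and `service_env` are hypothetical.

```shell
# Hypothetical helper: print the value of variable $1 from
# KEY=VALUE lines on stdin. Values may themselves contain "=".
env_value() {
  awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# Hypothetical helper: list a Swarm service's environment,
# one KEY=VALUE entry per line. $1 = service name.
service_env() {
  docker service inspect \
    --format '{{range .Spec.TaskTemplate.ContainerSpec.Env}}{{println .}}{{end}}' "$1"
}
```

A comparison could then read `[ "$(service_env rearch_backend | env_value MCP_PROXY_SECRET)" = "$(service_env rearch_mcp-proxy | env_value MCP_PROXY_SECRET)" ] && echo match || echo MISMATCH`.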
Keycloak login redirect loop
Keycloak login redirect loop
Symptoms: Clicking login redirects back to the login page repeatedly.
Check:
- The `rearch-app` client in Keycloak has the correct Valid redirect URIs and Web origins.
- The frontend is not behind Traefik forward-auth middleware (it must be publicly accessible).
- Cookies are not being blocked by the browser.
See Keycloak Troubleshooting for more.
WebSocket connection fails
WebSocket connection fails
Symptoms: Real-time updates (agent typing, job status) do not appear. The frontend falls back to polling.
Check:
- Traefik is configured to pass WebSocket upgrades on the `api.<domain>` route.
- No intermediate proxy (e.g., Cloudflare, AWS ALB) is stripping the `Upgrade: websocket` header.
- The backend’s `FRONTEND_URL` is set correctly for CORS.
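A manual handshake can show whether the upgrade survives the proxy chain. This is a sketch under assumptions: both helper names are hypothetical, the URL passed in must be the backend's actual WebSocket endpoint (its path is not documented on this page), and the key is the sample value from RFC 6455.

```shell
# Hypothetical helper: succeed if the HTTP response on stdin starts
# with status 101 (Switching Protocols), i.e. the upgrade was accepted.
upgrade_accepted() {
  head -n 1 | grep -Eq '^HTTP/[0-9.]+ 101'
}

# Hypothetical helper: send a WebSocket handshake over HTTP/1.1 and
# print the response headers. $1 = full https:// URL of the endpoint.
ws_handshake() {
  curl -s -i -N --http1.1 --max-time 5 \
    -H 'Connection: Upgrade' \
    -H 'Upgrade: websocket' \
    -H 'Sec-WebSocket-Version: 13' \
    -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' \
    "$1"
}
```

Running `ws_handshake <url> | upgrade_accepted && echo "upgrade ok"` from outside the cluster and again from a host behind the proxy helps narrow down which hop drops the header.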
High memory usage
High memory usage
Symptoms: The host is running out of memory.
Check:
- Number of active conversation containers (`docker ps | grep conv-`). Each container consumes memory based on its template and workload.
- Idle container cleanup is configured to stop inactive containers automatically.
- MongoDB memory usage: consider setting `--wiredTigerCacheSizeGB` if MongoDB is consuming too much RAM.
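The container count from the first check can be turned into a single number for quick comparison over time. A minimal sketch, assuming conversation container names contain `conv-` as in the command above; the helper name is hypothetical.

```shell
# Hypothetical helper: count container names on stdin that contain "conv-".
count_conversation_containers() {
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so the exit status is neutralized.
  grep -c 'conv-' || true
}
```

It could be fed with `docker ps --format '{{.Names}}' | count_conversation_containers`. For per-container memory, `docker stats --no-stream --format '{{.Name}} {{.MemUsage}}'` prints a one-off snapshot.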
Metrics to Monitor
Metrics to Monitor
For production deployments, consider monitoring:
| Metric | Source | Why |
|---|---|---|
| Active conversation containers | docker ps count | Capacity planning |
| AI cost per day | Usage Analytics dashboard | Budget control |
| Job queue depth | Redis / BullMQ | Detect backlogs |
| Disk usage | Host filesystem | Image builds consume disk |
| MongoDB connections | MongoDB logs | Connection pool exhaustion |
| Failed jobs count | Jobs dashboard | Operational health |
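As one example of automating a row from this table, disk usage can be checked against a threshold from a cron job. This is a sketch, not a ReArch feature: the helper names are hypothetical, and the path to check (for instance Docker's data directory, `/var/lib/docker` by default) and the threshold are up to the operator.

```shell
# Hypothetical helper: print the used-space percentage (without "%")
# of the filesystem that holds path $1.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: report usage for path $1 and return 1 when it
# is at or above the threshold $2 (in percent).
check_disk() {
  pct=$(disk_used_pct "$1")
  if [ "$pct" -ge "$2" ]; then
    echo "WARN: $1 is ${pct}% full"
    return 1
  fi
  echo "OK: $1 is ${pct}% full"
}
```

For example, `check_disk /var/lib/docker 85` returns a non-zero status once the filesystem reaches 85% usage, which a scheduler or alerting wrapper can act on.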