Troubleshooting¶
Common problems and the first thing to check. If your problem isn't here, open an issue on GitHub.
Service won't start¶
Look at the bottom of the journal for the actual exception. Most common causes:
- Database connection refused → DATABASE_URL is wrong, or PostgreSQL isn't running. Test with: psql -h localhost -U p3guardian -d p3guardian
- Telegram bot token rejected → token is wrong or revoked. Test with: curl https://api.telegram.org/bot<TOKEN>/getMe
- Permission denied on log file → the p3guardian user isn't in the adm group. Fix: sudo usermod -a -G adm p3guardian && sudo systemctl restart p3guardian
- Missing Python dep → run sudo -u p3guardian /opt/p3guardian/.venv/bin/pip install -e /opt/p3guardian to refresh.
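To get to the exception at the bottom of the journal quickly, a small helper along these lines may be useful. It is only a sketch: the function name last_errors is hypothetical, and the grep pattern assumes the service logs Python-style tracebacks, which this page does not confirm.

```shell
# Hypothetical helper: print the most recent error-looking journal lines for a unit.
# Assumes Python-style "Traceback" / "...Error" / "...Exception" text in the journal.
last_errors() {
    local unit="${1:-p3guardian}" count="${2:-5}"
    journalctl -u "$unit" -n 200 --no-pager \
        | grep -E 'Traceback|Error|Exception' \
        | tail -n "$count"
}

# Usage: last_errors p3guardian 5
```

If this prints nothing, widen the search with a larger -n value or read the raw journal output directly.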
No events appearing in the database¶
If a specific source has zero rows but you expect activity:
- Check the file is growing. Run sudo tail -f /var/log/<file> from another terminal. Does it scroll?
- Check the monitor is registered. Run journalctl -u p3guardian | grep "Monitor started". Is the source in the list?
- Check log file permissions. Run ls -la /var/log/<file>. Is it readable by the adm group?
- Known issue: stale FD after logrotate. If a log file was rotated while Fenrir held an open FD, the tailer can get stuck reading the deleted file. Symptom: sudo ls -l /proc/$(systemctl show -p MainPID --value p3guardian)/fd/ | grep <file> shows an FD pointing to a different inode than the on-disk file. Fix: restart p3guardian.service. The long-term fix landed in 0.5.x (LogTailer does periodic inode checks).
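The stale-FD symptom can also be checked by comparing device and inode numbers directly. The helper below is a sketch, not part of Fenrir: same_file is a hypothetical name, and it relies on GNU stat (the -c format option).

```shell
# Hypothetical helper: succeed if two paths refer to the same file (device + inode).
# Uses GNU stat; -L follows symlinks, including /proc/<pid>/fd/<n> links.
same_file() {
    local a b
    a="$(stat -L -c '%d:%i' "$1")" || return 2
    b="$(stat -L -c '%d:%i' "$2")" || return 2
    [ "$a" = "$b" ]
}

# Usage (PID, FD number and log path are placeholders):
#   same_file "/proc/$PID/fd/$FD" /var/log/<file> || echo "stale FD: restart p3guardian.service"
```

A non-zero result means either the two paths differ or one of them could not be read, so run it with enough privileges to inspect the service's /proc entry.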
AI investigations not firing¶
Symptoms: HIGH events appear in the events table, but investigation_jobs stays empty.
Check:
- Severity threshold: only HIGH and CRITICAL trigger. MEDIUM does not. Normal.
- Dedupe window: if the same (category, ip) was investigated in the last 60 minutes, subsequent events are deduped. Run SELECT * FROM investigation_jobs WHERE category=... ORDER BY created_at DESC to confirm.
- Concurrency cap: if 2 investigations are already running, new ones queue up. journalctl -u p3guardian | grep ANALYST shows starts/dones.
- AI backend reachable: curl -s https://openrouter.ai/api/v1/models -H "Authorization: Bearer $OPENROUTER_API_KEY" should return JSON with the model list.
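For the dedupe check above, it can help to work out how long the 60-minute window still has to run. The sketch below assumes the window is counted from the previous job's created_at, which this page implies but does not state; dedupe_seconds_left is a hypothetical name.

```shell
# Hypothetical helper: seconds left in the 60-minute dedupe window.
# Assumes the window is counted from the previous job's created_at (not confirmed here).
# Args: epoch seconds of the last investigation, optional "now" in epoch seconds.
dedupe_seconds_left() {
    local last="$1" now="${2:-$(date +%s)}"
    local left=$(( last + 60 * 60 - now ))
    if [ "$left" -gt 0 ]; then echo "$left"; else echo 0; fi
}

# Usage: dedupe_seconds_left 1700000000
```

In PostgreSQL, EXTRACT(EPOCH FROM created_at) returns the timestamp as epoch seconds suitable for the first argument.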
Telegram alerts not arriving¶
# 1. Bot is up
curl https://api.telegram.org/bot<TOKEN>/getMe
# 2. Chat ID is right
# Send /start to the bot from the target chat, then:
curl https://api.telegram.org/bot<TOKEN>/getUpdates | grep chat
# the "id" in the chat object is your TELEGRAM_CHAT_ID
# 3. Bot is allowed in the chat
# If it's a group, the bot must be added as a member.
# If it's a channel, the bot must be admin.
# 4. Server can reach Telegram
curl -I https://api.telegram.org/
Restart p3guardian after fixing.
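Step 2 above can be narrowed down to just the chat IDs. The filter below is a rough sketch: chat_ids is a hypothetical name, and it assumes the response is compact JSON in which "id" is the first key of each "chat" object. If that assumption does not hold for your output, read the JSON by eye instead.

```shell
# Hypothetical helper: list distinct chat IDs found in a getUpdates response on stdin.
# Assumes compact JSON in which "id" is the first key of each "chat" object.
chat_ids() {
    grep -oE '"chat":\{"id":-?[0-9]+' | grep -oE -- '-?[0-9]+$' | sort -u
}

# Usage: curl -s "https://api.telegram.org/bot<TOKEN>/getUpdates" | chat_ids
```

Group and channel IDs are negative numbers, so keep the leading minus sign when copying the value into TELEGRAM_CHAT_ID.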
Dashboard is empty / 502¶
# Is the service running?
systemctl is-active p3guardian
# Is it bound to the expected port?
sudo ss -tlnp | grep 8443
# Is nginx reverse-proxy talking to the right backend?
sudo nginx -T 2>/dev/null | grep -A 3 'proxy_pass.*8443'
# Can you reach it locally?
curl -s http://127.0.0.1:8443/api/stats | python3 -m json.tool
If everything looks right but the public URL still 502s, check cloudflared:
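A minimal sketch of such a check, assuming cloudflared runs as a systemd unit named cloudflared (an assumption, not something this page states). The function name tunnel_check is hypothetical.

```shell
# Hypothetical helper: report whether the tunnel unit is active; on failure, show recent logs.
# The unit name "cloudflared" is an assumption; pass a different name if your install differs.
tunnel_check() {
    local unit="${1:-cloudflared}"
    if [ "$(systemctl is-active "$unit")" = "active" ]; then
        echo "$unit: active"
    else
        echo "$unit: NOT active"
        journalctl -u "$unit" -n 20 --no-pager
        return 1
    fi
}

# Usage: tunnel_check            (default unit name)
#        tunnel_check my-tunnel  (custom unit name)
```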
"I'm getting too many emails from root"¶
Not an alert problem — a system mail problem. fail2ban and sudo can spam root@ (forwarded by /etc/aliases). To silence:
# Stop fail2ban from emailing on every ban (in jail.d/*.conf):
sudo sed -i 's|action_mwl|action_|g; s|action_mw|action_|g' /etc/fail2ban/jail.d/*.conf
sudo systemctl reload fail2ban
# Stop sudo from emailing on bad password:
echo 'Defaults !mail_badpass' | sudo tee /etc/sudoers.d/99-no-mail-badpass
sudo chmod 0440 /etc/sudoers.d/99-no-mail-badpass
sudo visudo -cf /etc/sudoers.d/99-no-mail-badpass
Real Fenrir alerts go via Telegram, not email. The email channel only carries genuine system errors (cron failures, disk full, kernel panic).
"PII anonymizer not loading"¶
It's optional. Either install it:
Or accept that cloud LLM calls send raw prompts (you'll see a one-time warning in the journal). For development, fine. For production with cloud LLMs, install it.
Compliance checks all "fail" or "partial"¶
Run the compliance audit interactively to see why:
curl -s "http://127.0.0.1:8443/api/compliance/run?framework=GDPR" \
| python3 -c "import json,sys; [print(c['control_id'], c['status'], c.get('details','')) for c in json.load(sys.stdin)['controls']]"
For each control with status=fail, look at evidence — it tells you which sub-check failed.
Common false-failures:
- GDPR-5.1.f firewall_active=false when UFW is actually running → you're on a Fenrir version older than 0.5.x that didn't fall back to systemctl is-active. Update.
- GDPR-5.1.f ssl_configured_sites=0 → you terminate TLS at Cloudflare upstream, not in nginx. 0.5.x+ accepts this and counts cloudflared as valid. Update.
- GDPR-33 fail with 0 actual breaches → an old breach record in the DB has status=open with deadline passed. Either close it (UPDATE breach_notifications SET status='closed', closed_at=NOW() WHERE id=...) or set BREACH_NOTIFICATION_EMAIL to provide the missing evidence.
Investigation reports show verdict=inconclusive¶
The agent couldn't reach a confident decision in 5 tool rounds. Common reasons:
- Insufficient log data. The event source's logs were rotated or empty when the agent looked. Check that retention is reasonable (we recommend keeping at least 7 days of auth.log).
- Cloud LLM rate-limited. Look for HTTP 429 errors in the journal during the investigation window. Solution: use a different OpenRouter model or a dedicated API key.
- Playbook is too narrow. If a category fires often but is always inconclusive, the playbook is asking for impossible signals. Edit the playbook to widen it.
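To check the rate-limit case, counting journal lines that mention 429 in the relevant window is usually enough. This is a sketch under an assumption: the log format is not documented here, so the helper simply matches 429 as a standalone number. The name count_429 is hypothetical.

```shell
# Hypothetical helper: count journal lines mentioning HTTP 429 since a given time.
# Assumes the status code appears in the log text as the standalone number 429.
count_429() {
    local since="${1:-1 hour ago}"
    journalctl -u p3guardian --since "$since" --no-pager | grep -cw '429'
}

# Usage: count_429 "2 hours ago"
```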
Server load is too high¶
Fenrir's idle CPU is < 1%. RAM is ~200 MB. If you're seeing higher:
# What's hot?
sudo top -p "$(systemctl show -p MainPID --value p3guardian)"
# How many tool processes are running?
sudo pgrep -af 'p3guardian|grep|find' | head
If you see a find /usr /bin /sbin /opt -perm -4000 running for minutes, that's the baseline_monitor's setuid scan on a slow disk. Bump the interval in app.py from 600s to 1800s.
If you see many simultaneous Python subprocesses spawning, the AI tool-loop may be in a tight loop. Check the dispatcher concurrency cap (default 2) is in effect.
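To spot a long-running setuid scan without watching top, the elapsed time of each find process can be listed. The helper below is a sketch: long_running_find is a hypothetical name, and it relies on the etimes column (elapsed seconds) of the procps ps.

```shell
# Hypothetical helper: list find processes that have been running longer than N seconds.
# Uses procps "ps" with header-less columns: pid, elapsed seconds, full command line.
long_running_find() {
    local min_secs="${1:-60}"
    ps -eo pid=,etimes=,args= \
        | awk -v t="$min_secs" '$2 > t && $3 ~ /(^|\/)find$/ { print }'
}

# Usage: long_running_find 120
```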
Resetting the baseline¶
If you've intentionally added many users / services / ports and want Fenrir to accept the new state as normal:
Next baseline run (within 10 min) will reseed the snapshot. No alerts in the meantime.
Still stuck¶
Open an issue with /tmp/fenrir.log attached (redact any sensitive content): https://github.com/P3consultingtech/p3guardian/issues