Troubleshooting

Common failure modes, what causes them, and how to fix them.

Start with `pipedai-worker doctor`
Most worker-side problems surface in the diagnostics command. Run it first; it prints pass/fail per check (`claude` on PATH, `claude` executes, auth env, API reachability, filesystem permissions, `worker.json` validity) with remediation hints.
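
If you provision workers from a script, you can gate on the result. A minimal sketch, assuming `doctor` exits non-zero when any check fails (verify this against your CLI version):

```bash
# Assumption: doctor exits non-zero when any check fails.
if ! pipedai-worker doctor --api-url=https://api-beta.pipedai.app; then
  echo "doctor reported failing checks; follow its remediation hints first" >&2
  exit 1
fi
```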

Worker shows offline in the dashboard

Workers heartbeat on every poll. The API marks a worker offline when `lastSeenAt` is more than three minutes old. Check, in order (a scripted version follows the list):

  • The worker process is actually running (`ps aux | grep pipedai-worker`, or `systemctl status pipedai-worker`).
  • The worker can reach the API. From the worker host, `curl -sI <NEXT_PUBLIC_PIPEDAI_API_URL>/api/v1/health` should return 200.
  • The worker token isn't revoked. The dashboard's Settings → API Keys tab shows the revoked state per key. If revoked, re-register with a fresh `wrk_` token.
  • Filesystem permissions on `~/.pipedai/worker.json`: the worker reads it on startup, and doctor catches this.
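
A minimal script version of these checks, assuming `PIPEDAI_API_URL` holds your API base URL:

```bash
#!/usr/bin/env bash
# Offline-worker triage; assumes PIPEDAI_API_URL is set to the API base URL.
set -u

# 1. Is the worker process running?
pgrep -f pipedai-worker >/dev/null && echo "process: running" || echo "process: NOT running"

# 2. Can this host reach the API? Expect HTTP 200.
curl -s -o /dev/null -w 'API health: HTTP %{http_code}\n' "${PIPEDAI_API_URL}/api/v1/health"

# 3. Is worker.json readable by this user?
test -r ~/.pipedai/worker.json && echo "worker.json: readable" || echo "worker.json: NOT readable"
```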

Trigger fires but the run never executes

Two common causes:

  • The trigger's assigned worker is offline, so runs queue but don't execute. The banner on the Workers list page surfaces this. Reassign the trigger to a different online worker, or bring the original back (see the check below).
  • The trigger is disabled. Flip the enabled toggle in the trigger editor.
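
A quick way to tell the two apart from the command line. The runs-listing endpoint below is hypothetical (only the health path elsewhere on this page is confirmed); check the API reference for the real path and auth scheme:

```bash
# Hypothetical endpoint and query params; verify against the API reference.
curl -s -H "Authorization: Bearer ${PIPEDAI_API_TOKEN}" \
  "${PIPEDAI_API_URL}/api/v1/runs?status=queued&triggerId=${TRIGGER_ID}"
# Queued runs piling up points at an offline worker; no runs at all suggests
# the trigger is disabled or never fired.
```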

Run fails with infra-fault, retries indefinitely (or stops too soon)

PipedAI auto-retries runs tagged `faultClass="infra"` with backoff (30s / 2min / 10min) up to `Trigger.maxRetries` (default 3). Common infra-fault sources:

  • Claude Max usage-window exhaustion. The worker classifies these specifically and surfaces a structured error message like “Claude Max usage window exhausted; retry in ~24 min.”
  • Network errors reaching the API or the MCP (`ENOTFOUND`, `ECONNREFUSED`, 5xx).
  • Claude binary missing or filesystem unwritable on the worker host.

If the same trigger fails infra-fault repeatedly, raise `maxRetries` on the trigger or address the upstream cause; a quick host-side check is sketched below. If you're hitting Claude Max usage windows, consider moving that trigger to api-key mode on a different worker.
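
To rule out the host-side sources in one pass, run a sketch like this on the worker machine (`PIPEDAI_API_URL` is a placeholder for your API base URL):

```bash
# Check the host-side infra-fault sources named above.
command -v claude >/dev/null && echo "claude: on PATH" || echo "claude: MISSING"
touch ~/.pipedai/.write-test 2>/dev/null \
  && { rm -f ~/.pipedai/.write-test; echo "filesystem: writable"; } \
  || echo "filesystem: NOT writable"
curl -s -o /dev/null -w 'API health: HTTP %{http_code}\n' "${PIPEDAI_API_URL}/api/v1/health"
```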

Run fails with client-fault, no retry

Client faults are not auto-retried because the same input would fail again. The worker classifies a run as client-fault when it sees:

  • MCP returns 4xx (401/403/404/405/406/410), which usually means the MCP service token is wrong, the `Authorization` header is malformed, or the MCP path is wrong.
  • The assistant emits an explicit `[error: …]` tag in the final message.

Fix the underlying problem (rotate the MCP token via the trigger editor, fix the prompt, fix the MCP), then trigger an on-demand run from the dashboard's “Run now” button to verify.
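
To reproduce an MCP 4xx outside the worker, hit the MCP directly with the trigger's service token. The URL below is a placeholder, and the bearer-token header shape is an assumption; match both to your MCP's configuration:

```bash
# Placeholder URL; substitute your MCP endpoint. A 401/403 here reproduces the
# client-fault the worker saw; a 2xx means the token and path are fine.
curl -sI -H "Authorization: Bearer ${MCP_SERVICE_TOKEN}" "https://your-mcp.example.com/mcp" | head -n1
```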

Webhook deliveries failing

Open the trigger editor: the Completion webhook section surfaces the last 5 deliveries with HTTP code, attempt count, and a (truncated) error message. Common causes:

  • HTTP 5xx / 408 / 429 / network error: retried up to 3 times with backoff 1s / 4s / 16s. If all three fail, the row shows `failed` and the run is otherwise unaffected; pipeline execution doesn't wait on webhook delivery.
  • HTTP 4xx other than 408 / 429: permanent failure, no retry. This usually means your endpoint rejected the body shape, the signature, or the authentication header. Verify your handler reads the raw body bytes (not the parsed JSON) when computing the HMAC.
  • Receiver not reachable from api-beta.pipedai.app: DNS failure, firewall block, or the receiver only accepts internal traffic. Confirm with `curl` from any machine outside your private network.

See the webhooks guide for verification samples in Node, Python, and Ruby.
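
For a quick shell-side sanity check before wiring up a handler, you can recompute the signature over the raw request body with `openssl`. Hex encoding and HMAC-SHA256 are assumptions here; treat the webhooks guide as authoritative:

```bash
# Recompute the HMAC over the exact raw bytes received (saved to raw-body.bin),
# never over re-serialized JSON. Assumes hex-encoded HMAC-SHA256 with the
# webhook secret in WEBHOOK_SECRET; confirm both in the webhooks guide.
computed=$(openssl dgst -sha256 -hmac "${WEBHOOK_SECRET}" -r < raw-body.bin | cut -d' ' -f1)
echo "computed: ${computed}"
echo "received: ${SIGNATURE_HEADER_VALUE}"   # value from the signature header
```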

Failover worker didn't take over

The takeover happens at poll time — the failover worker has to poll the API for the reassignment to land. Check, in order:

  • The failover worker is itself online (poll interval default is 60s; if it's offline too, no takeover).
  • The primary has been silent for at least 10 minutes: `lastSeenAt` on the worker row needs to be stale beyond that threshold. Confirm on the workers list page, or with the sketch after this list.
  • The trigger has queued runs to reassign. Triggers with no queued work just stay assigned to the primary (the takeover only acts on queued-status runs).
  • A successful takeover writes a `run.failover-claimed` audit-log entry; if the audit log shows none for a window where you expected one, the conditions above weren't met.
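
To check the primary's staleness from the command line, a sketch with a hypothetical worker endpoint (the workers list page shows the same `lastSeenAt` value):

```bash
# Hypothetical endpoint; verify the real path in the API reference.
last_seen=$(curl -s -H "Authorization: Bearer ${PIPEDAI_API_TOKEN}" \
  "${PIPEDAI_API_URL}/api/v1/workers/${WORKER_ID}" | jq -r '.lastSeenAt')
echo "primary lastSeenAt: ${last_seen}"   # takeover needs this to be >10 minutes stale
```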

“Cannot demote/remove the last owner”

Every environment must have at least one owner. Promote another member to owner first, then demote or remove the original.

“Environment is not empty”

Environment delete is blocked while it has active triggers or registered workers. The 409 response includes details, `{ activeTriggers, registeredWorkers }`, so you know exactly how much draining is left. Disable or soft-delete the triggers and revoke worker tokens, then retry the delete.
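
What a blocked delete looks like from `curl`. The endpoint path is hypothetical; the 409 body shape comes from the response described above:

```bash
# Hypothetical endpoint; verify the real path in the API reference.
curl -s -X DELETE -H "Authorization: Bearer ${PIPEDAI_API_TOKEN}" \
  "${PIPEDAI_API_URL}/api/v1/environments/${ENV_ID}"
# A 409 body includes { "activeTriggers": ..., "registeredWorkers": ... },
# i.e. the counts still blocking the delete.
```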

Worker won't register — 401 Invalid token

  • The `wrk_` token is single-use. If you already used it, the API has already exchanged it for the worker's `wkt_` token. Generate a fresh `wrk_` in Settings → API Keys → Worker Registration.
  • The token might be revoked. Check the API Keys list for a Revoked badge.
  • Confirm the API URL is correct: the registration endpoint lives on pipedai-api, not on the dashboard hostname (see the check below).
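
The health endpoint from earlier is a quick way to confirm you're pointing at the API host:

```bash
# Expect HTTP 200 from the API host, not the dashboard hostname.
curl -sI https://api-beta.pipedai.app/api/v1/health | head -n1
```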

Still stuck?

File an issue on `Marolence/pipedai-worker` with the output of `pipedai-worker doctor` and the trigger / run IDs that demonstrate the problem.

```bash
pipedai-worker doctor --api-url=https://api-beta.pipedai.app
```