GetStartupInfo returns map[string]any, and type-asserting the tools/skills entries without checking ok is fragile. The current implementation always stores the correct types, but a future refactor could cause a silent nil dereference. Add ok checks with an explicit nil fallback.
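A minimal sketch of the checked assertion described above. The helper name `stringSlice` and the `[]string` element type are assumptions for illustration; the real entries in GetStartupInfo may use different types.

```go
package main

import "fmt"

// stringSlice reads key from a map[string]any and returns it as []string.
// It returns nil when the key is missing or holds another type, instead of
// panicking or handing back an unchecked zero value.
// (Hypothetical helper; the []string type is an assumption.)
func stringSlice(info map[string]any, key string) []string {
	v, ok := info[key].([]string)
	if !ok {
		return nil // explicit nil fallback
	}
	return v
}

func main() {
	info := map[string]any{
		"tools":  []string{"shell", "web"},
		"skills": 42, // wrong type on purpose
	}
	fmt.Println(stringSlice(info, "tools"))         // [shell web]
	fmt.Println(stringSlice(info, "skills") == nil) // true
}
```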
When running PicoClaw inside Termux or termux-chroot, HTTPS
requests fail with X509 certificate errors because the Go TLS
stack does not automatically detect the Termux CA bundle path.
This change adds automatic detection of Termux environments and
sets SSL_CERT_FILE to the correct CA bundle path before any
network operations. The detection checks:
- HOME or PATH contains 'com.termux'
- Common CA bundle locations in Termux prefix
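The detection could look roughly like the sketch below. The function names are hypothetical, the single candidate path is the usual Termux prefix location and is an assumption rather than the list used in this change, and leaving an already-set SSL_CERT_FILE untouched is a choice made for the sketch.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isTermux reports whether HOME or PATH mentions the Termux app id.
func isTermux(home, path string) bool {
	return strings.Contains(home, "com.termux") || strings.Contains(path, "com.termux")
}

// firstExisting returns the first candidate that exists as a regular file,
// or "" when none does.
func firstExisting(candidates []string) string {
	for _, p := range candidates {
		if st, err := os.Stat(p); err == nil && !st.IsDir() {
			return p
		}
	}
	return ""
}

func main() {
	// Assumed candidate list; the real change may probe other locations.
	candidates := []string{
		"/data/data/com.termux/files/usr/etc/tls/cert.pem",
	}
	// Must run before the first TLS handshake, because Go loads the system
	// roots lazily and reads SSL_CERT_FILE at that point.
	if os.Getenv("SSL_CERT_FILE") == "" && isTermux(os.Getenv("HOME"), os.Getenv("PATH")) {
		if p := firstExisting(candidates); p != "" {
			os.Setenv("SSL_CERT_FILE", p)
		}
	}
	fmt.Println("SSL_CERT_FILE =", os.Getenv("SSL_CERT_FILE"))
}
```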
Fixes #2944
Export MakeBackup for external use, add ResetToDefaults function that
backs up current config, creates defaults, and preserves security
credentials. Add `picoclaw config reset` CLI command with --force flag.
* feat(model): add `picoclaw model add` for custom OpenAI-compatible endpoints
Onboards a model from a user-supplied API base + key by hitting
GET <base>/models, prompting the user to pick one, and writing the entry
into model_list[] (with api_keys) plus setting it as the default model.
This was previously only available in the TUI launcher (issue #2208) and
is now accessible from the CLI:
picoclaw model add -b URL -k KEY [-m MODEL] [-n ALIAS]
* chore: remove deprecated picoclaw-launcher-tui
Per RFC #2208, the TUI launcher is deprecated in favor of the CLI; its
"online model picker" feature has been ported to `picoclaw model add` in
the previous commit. This drops the binary and all build/release/docs
references:
- delete cmd/picoclaw-launcher-tui/ and assets/launcher-tui.jpg
- Makefile: remove the `build-launcher-tui` target
- .goreleaser.yaml: drop the build entry plus the `picoclaw-launcher-tui`
ids from the launcher docker image, macOS notarize list, and nfpms
contents
- docker/Dockerfile.goreleaser.launcher: drop the COPY for the TUI binary
- READMEs (root + 8 locales): remove the "TUI Launcher" section and
screenshot link
- docs/guides/docker.*: update the "launcher image includes …" sentence
to reflect the two remaining binaries
`make build` still succeeds; `go build ./web/backend` (the launcher
target) still succeeds. `picoclaw-launcher` (web console) is unaffected.
* Fix Windows build flow
* build(makefile): make windows recipes shell-safe
- avoid backslash line-continuation in Windows build-launcher recipe
- replace cmd-specific if-not-exist with PowerShell check in web build-frontend
* build(web): avoid shell-expanding powershell vars in windows recipe
- rewrite build-frontend Windows command without PowerShell local vars
- keep install-stamp hash check logic
* refactor: support explicit model list providers
* fix(web): preserve explicit model providers
* fix(web): preserve legacy provider prefixes on model updates
fix(models): normalize explicit provider-prefixed ids
fix(api): preserve legacy model updates across providers
fix(agent): preserve config identity for explicit provider refs
* fix ci
* membench: add LLM-as-Judge evaluation mode
Add --eval-mode=llm to membench for LLM-based answer generation and
semantic scoring via an OpenAI-compatible API endpoint.
New files:
- llm_client.go: generic OpenAI-compatible chat completion client
with support for API key, configurable timeout, and optional
chat_template_kwargs (for llama.cpp thinking models)
- eval_llm.go: LLM answer generation + LLM-as-Judge scoring for
both legacy and seahorse retrieval modes
Changes to main.go:
- --eval-mode flag (token|llm) to select evaluation strategy
- --api-base, --api-key, --model flags with env var fallback
(MEMBENCH_API_BASE, MEMBENCH_API_KEY, MEMBENCH_MODEL)
- --no-thinking flag for llama.cpp + Qwen thinking models
- --limit flag to cap QA questions per sample for quick testing
* style: fix golangci-lint formatting (gofmt + golines)
* fix: address Copilot review feedback
- Validate --model is required for LLM eval mode
- Use rune-based truncation to preserve valid UTF-8
- Precompute totalQA count outside inner loop
- Log SearchMessages errors instead of silently skipping
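The rune-based truncation mentioned above can be sketched as follows. `truncateRunes` is a hypothetical name; the point is that slicing a `[]rune` never cuts a multi-byte character in half, whereas slicing the string by bytes can.

```go
package main

import "fmt"

// truncateRunes returns at most max runes of s, keeping the result valid UTF-8.
func truncateRunes(s string, max int) string {
	if max <= 0 {
		return ""
	}
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max])
}

func main() {
	fmt.Println(truncateRunes("héllo wörld", 5)) // héllo
}
```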
* fix: address Copilot review round 2
- Validate --eval-mode accepts only 'token' or 'llm'
- Normalize base URL to avoid /v1/v1 duplication
- Separate token/LLM results for correct PrintComparison labeling
- Log ExpandMessages errors instead of silently ignoring
- Short-circuit with 0 scores when no context retrieved (match token eval)
- Add --timeout flag wired to LLMClientOptions.Timeout
* fix: address review P1+P2 — sort alignment, failure sentinel, score parser
- P1: Replace hand-rolled sortByRank with sort.Slice (ascending, best
first) matching eval.go's EvalSeahorse — ensures BudgetTruncate keeps
best-ranked messages when truncation occurs
- P2: Use -1.0 sentinel for LLM API failures and parse errors, distinct
from genuine 0.0 score; aggregateMetrics skips -1.0 entries for F1
averaging while still counting HitRate
- P2: Use regexp \b([1-5])\b for judge score extraction instead of
first-digit scan — avoids misparses on '5/5', 'Score: 3' etc.
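A sketch of the score extraction with a failure sentinel, assuming the judge replies with a score from 1 to 5 somewhere in free text. `parseJudgeScore` and `failedScore` are hypothetical names, and whether the real code normalizes the score afterwards is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// failedScore marks an API or parse failure, distinct from a genuine 0.0.
const failedScore = -1.0

// scoreRe matches a standalone digit 1-5; \b keeps it from matching
// inside longer numbers such as "10" or "2024".
var scoreRe = regexp.MustCompile(`\b([1-5])\b`)

// parseJudgeScore returns the first standalone 1-5 digit in reply,
// or failedScore when there is none.
func parseJudgeScore(reply string) float64 {
	m := scoreRe.FindStringSubmatch(reply)
	if m == nil {
		return failedScore
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return failedScore
	}
	return float64(n)
}

func main() {
	for _, reply := range []string{"Score: 3", "5/5", "10", "no score given"} {
		fmt.Printf("%q -> %v\n", reply, parseJudgeScore(reply))
	}
}
```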
* fix: address Copilot review round 2
- Fix F1/HitRate weighted aggregation: track ValidF1Count separately so
computeModeAgg weights F1 by valid scores only, not TotalQuestions
- No-context retrieval failure uses 0.0 (genuine bad score) instead of
-1.0 sentinel (reserved for API/parse failures)
- Validate --timeout > 0 to prevent disabling HTTP timeouts
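The sentinel-aware averaging can be illustrated with the sketch below. The struct and function are hypothetical simplifications: only the F1 side is shown, and the field names other than ValidF1Count and TotalQuestions are assumptions.

```go
package main

import "fmt"

// aggResult is a simplified stand-in for the per-sample metrics.
type aggResult struct {
	MeanF1         float64
	ValidF1Count   int // scores that are not the -1.0 sentinel
	TotalQuestions int
}

// aggregate averages F1 over valid scores only. Entries equal to -1.0
// (API or parse failure) still count toward TotalQuestions but not toward
// the mean; a genuine 0.0 is a valid score and is included.
func aggregate(scores []float64) aggResult {
	res := aggResult{TotalQuestions: len(scores)}
	sum := 0.0
	for _, s := range scores {
		if s < 0 {
			continue // sentinel: skip for F1 averaging
		}
		sum += s
		res.ValidF1Count++
	}
	if res.ValidF1Count > 0 {
		res.MeanF1 = sum / float64(res.ValidF1Count)
	}
	return res
}

func main() {
	fmt.Printf("%+v\n", aggregate([]float64{1.0, -1.0, 0.5}))
	fmt.Printf("%+v\n", aggregate([]float64{-1.0, -1.0}))
}
```

When combining several samples, weighting each MeanF1 by its ValidF1Count rather than by TotalQuestions keeps failed calls from diluting the average.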
* fix: remove hardcoded /v1 from API base URL
Users now provide the full versioned path in --api-base (e.g. /v1, /v4).
Code only appends /chat/completions. Default changed to
http://127.0.0.1:8080/v1 for backward compatibility.
* fix: address Copilot review round 3
- ValidF1Count=0 when all scores are sentinel (no forced =1)
- Backward compat: old eval JSON without ValidF1Count falls back to
TotalQuestions in computeModeAgg
- Skip empty section in PrintComparison when tokenResults is empty
- Update --api-base flag help to document /v1 default and version path
- Add sentinel aggregation unit tests (partial, all, weighted)
* feat: add --retries flag with exponential backoff for transient LLM errors
Retry on timeout, 5xx, and 429 (rate limit) with 1s/2s/4s backoff.
Default 3 retries, configurable via --retries. Context cancellation
is respected between retries.
* fix: address Copilot review round 4
- runReport splits results by mode suffix into token/llm for PrintComparison
- backward compat fallback (ValidF1Count=0 -> TotalQuestions) only for
non-LLM modes; LLM modes keep ValidF1Count=0 when all scores sentinel
- MaxRetries==0 means no retry; only negative falls back to default 3
- truncateStr uses []rune to avoid cutting multi-byte UTF-8 characters
- Complete() returns error on empty LLM response (vs silent empty string)
* feat: --no-thinking adapts to llama.cpp, Ollama, and GLM backends
Send all three disable-thinking fields simultaneously:
- chat_template_kwargs.enable_thinking=false (llama.cpp, GLM)
- think=false (Ollama 0.9+)
- thinking.type=disabled (GLM/Zhipu)
Each backend picks the field it recognizes and ignores the rest.
Also bumps max_tokens from 512 to 2048 for thinking models.
* feat: mixed model eval + concurrent QA workers
- Add --judge-model, --judge-api-base, --judge-api-key flags for separate judge model
- Add --concurrency flag (default 1) with semaphore-based goroutine pool
- Add reasoning_content fallback for GLM/DeepSeek style responses
- Prepend /no_think to system prompt for Ollama /v1 compatibility
- Reduce default MaxTokens from 2048 to 512 (answers are 1-3 sentences)
- Extract evalQAWorker and buildSeahorseContext for shared concurrent logic
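A semaphore-based goroutine pool of the kind described for --concurrency might look like the sketch below. `runConcurrent` is a hypothetical name and the `work` callback stands in for the per-question evaluation; each goroutine writes only to its own result slot, so no extra locking is needed.

```go
package main

import (
	"fmt"
	"sync"
)

// runConcurrent evaluates n items with at most `concurrency` goroutines
// in flight, using a buffered channel as a counting semaphore.
func runConcurrent(n, concurrency int, work func(i int) float64) []float64 {
	if concurrency < 1 {
		concurrency = 1
	}
	results := make([]float64, n)
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while the pool is full
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = work(i)
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	scores := runConcurrent(5, 2, func(i int) float64 { return float64(i * i) })
	fmt.Println(scores) // [0 1 4 9 16]
}
```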
---------
Co-authored-by: BeaconCat <BeaconCat@users.noreply.github.com>
* feat(updater): add web self-update endpoint and updater package
* feat(selfupgrade): use GetTestReleaseAPIURL for testing when url is empty
* feat(selfupgrade): use only GetTestReleaseAPIURL
* feat(upgrade): make the CLI `$0 update` command work
* fix(ci): fix ci err
* fix(test): fix ci test
* fix(ci): fix ci lint fmt err
* test(updater): add test for updater
* fix(ci): fix ci lint var copy err
* fix(ci): retry ci
* updater: require checksum verification, prefer API digest, verify SHA256, fix zip extraction, update tests
* fix(lint): lint fixed
* fix(lint): lint fixed2
* updater: stream download and verify sha256; add http client timeout and progress
Avoid double-download by streaming asset into temp file while computing SHA256 and verifying against checksum; replace http.Get with shared httpClient (2m timeout) to prevent hangs; add simple stderr progress display; remove unused helpers.
* feat(web): display backend version info in sidebar
* fix(web): improve version parsing and timeout behavior
* refactor(web): remove useless --version fallback
* feat(web): implement version info caching and improve retrieval logic
* fix(web): clarify version timeout rationale
* fix(web): harden gateway version probing and tests
* style(web): split regexp to two lines for lint
The agent path now publishes to outbound bus directly (since #2100),
making the deliver=true direct-to-bus shortcut and the directive type
prompt wrapping redundant. All cron jobs now uniformly route through
the agent. This is an intentional behavior change: old jobs with
deliver=true will execute through the agent instead of bypassing it.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>