When running PicoClaw inside Termux or termux-chroot, HTTPS
requests fail with X509 certificate errors because the Go TLS
stack does not automatically detect the Termux CA bundle path.
This change adds automatic detection of Termux environments and
sets SSL_CERT_FILE to the correct CA bundle path before any
network operations. The detection checks:
- HOME or PATH contains 'com.termux'
- Common CA bundle locations in Termux prefix
Fixes#2944
* refactor: support explicit model list providers
* fix(web): preserve explicit model providers
* fix(web): preserve legacy provider prefixes on model updates
fix(models): normalize explicit provider-prefixed ids
fix(api): preserve legacy model updates across providers
fix(agent): preserve config identity for explicit provider refs
* fix ci
* membench: add LLM-as-Judge evaluation mode
Add --eval-mode=llm to membench for LLM-based answer generation and
semantic scoring via an OpenAI-compatible API endpoint.
New files:
- llm_client.go: generic OpenAI-compatible chat completion client
with support for API key, configurable timeout, and optional
chat_template_kwargs (for llama.cpp thinking models)
- eval_llm.go: LLM answer generation + LLM-as-Judge scoring for
both legacy and seahorse retrieval modes
Changes to main.go:
- --eval-mode flag (token|llm) to select evaluation strategy
- --api-base, --api-key, --model flags with env var fallback
(MEMBENCH_API_BASE, MEMBENCH_API_KEY, MEMBENCH_MODEL)
- --no-thinking flag for llama.cpp + Qwen thinking models
- --limit flag to cap QA questions per sample for quick testing
* style: fix golangci-lint formatting (gofmt + golines)
* fix: address Copilot review feedback
- Validate --model is required for LLM eval mode
- Use rune-based truncation to preserve valid UTF-8
- Precompute totalQA count outside inner loop
- Log SearchMessages errors instead of silently skipping
* fix: address Copilot review round 2
- Validate --eval-mode accepts only 'token' or 'llm'
- Normalize base URL to avoid /v1/v1 duplication
- Separate token/LLM results for correct PrintComparison labeling
- Log ExpandMessages errors instead of silently ignoring
- Short-circuit with 0 scores when no context retrieved (match token eval)
- Add --timeout flag wired to LLMClientOptions.Timeout
* fix: address review P1+P2 — sort alignment, failure sentinel, score parser
- P1: Replace hand-rolled sortByRank with sort.Slice (ascending, best
first) matching eval.go's EvalSeahorse — ensures BudgetTruncate keeps
best-ranked messages when truncation occurs
- P2: Use -1.0 sentinel for LLM API failures and parse errors, distinct
from genuine 0.0 score; aggregateMetrics skips -1.0 entries for F1
averaging while still counting HitRate
- P2: Use regexp \b([1-5])\b for judge score extraction instead of
first-digit scan — avoids misparses on '5/5', 'Score: 3' etc.
* fix: address Copilot review round 2
- Fix F1/HitRate weighted aggregation: track ValidF1Count separately so
computeModeAgg weights F1 by valid scores only, not TotalQuestions
- No-context retrieval failure uses 0.0 (genuine bad score) instead of
-1.0 sentinel (reserved for API/parse failures)
- Validate --timeout > 0 to prevent disabling HTTP timeouts
* fix: remove hardcoded /v1 from API base URL
Users now provide the full versioned path in --api-base (e.g. /v1, /v4).
Code only appends /chat/completions. Default changed to
http://127.0.0.1:8080/v1 for backward compatibility.
* fix: address Copilot review round 3
- ValidF1Count=0 when all scores are sentinel (no forced =1)
- Backward compat: old eval JSON without ValidF1Count falls back to
TotalQuestions in computeModeAgg
- Skip empty section in PrintComparison when tokenResults is empty
- Update --api-base flag help to document /v1 default and version path
- Add sentinel aggregation unit tests (partial, all, weighted)
* feat: add --retries flag with exponential backoff for transient LLM errors
Retry on timeout, 5xx, and 429 (rate limit) with 1s/2s/4s backoff.
Default 3 retries, configurable via --retries. Context cancellation
is respected between retries.
* fix: address Copilot review round 4
- runReport splits results by mode suffix into token/llm for PrintComparison
- backward compat fallback (ValidF1Count=0 -> TotalQuestions) only for
non-LLM modes; LLM modes keep ValidF1Count=0 when all scores sentinel
- MaxRetries==0 means no retry; only negative falls back to default 3
- truncateStr uses []rune to avoid cutting multi-byte UTF-8 characters
- Complete() returns error on empty LLM response (vs silent empty string)
* feat: --no-thinking adapts to llama.cpp, Ollama, and GLM backends
Send all three disable-thinking fields simultaneously:
- chat_template_kwargs.enable_thinking=false (llama.cpp, GLM)
- think=false (Ollama 0.9+)
- thinking.type=disabled (GLM/Zhipu)
Each backend picks the field it recognizes and ignores the rest.
Also bumps max_tokens from 512 to 2048 for thinking models.
* feat: mixed model eval + concurrent QA workers
- Add --judge-model, --judge-api-base, --judge-api-key flags for separate judge model
- Add --concurrency flag (default 1) with semaphore-based goroutine pool
- Add reasoning_content fallback for GLM/DeepSeek style responses
- Prepend /no_think to system prompt for Ollama /v1 compatibility
- Reduce default MaxTokens from 2048 to 512 (answers are 1-3 sentences)
- Extract evalQAWorker and buildSeahorseContext for shared concurrent logic
---------
Co-authored-by: BeaconCat <BeaconCat@users.noreply.github.com>
* feat(updater): add web self-update endpoint and updater package
* feat(selfupgrade): when url empty, using GetTestReleaseAPIURL for test .
* feat(selfupgrade): only GetTestReleaseAPIURL .
* feat(upgrade): cli $0 update work well!
* fix(ci): fix ci err
* fix(test): fix ci test
* fix(ci): fix ci lint fmt err
* test(updater): add test for updater
* fix(ci): fix ci lint var copy err
* fix(ci): retry ci
* updater: require checksum verification, prefer API digest, verify SHA256, fix zip extraction, update tests
* fix(lint): lint fixed
* fix(lint): lint fixed2
* updater: stream download and verify sha256; add http client timeout and progress
Avoid double-download by streaming asset into temp file while computing SHA256 and verifying against checksum; replace http.Get with shared httpClient (2m timeout) to prevent hangs; add simple stderr progress display; remove unused helpers.
* feat(web): display backend version info in sidebar
* fix(web): improve version parsing and timeout behavior
* refactor(web): remove useless --version fallback
* feat(web): implement version info caching and improve retrieval logic
* fix(web): clarify version timeout rationale
* fix(web): harden gateway version probing and tests
* style(web): split regexp to two lines for lint
The agent path now publishes to outbound bus directly (since #2100),
making the deliver=true direct-to-bus shortcut and the directive type
prompt wrapping redundant. All cron jobs now uniformly route through
the agent. This is an intentional behavior change: old jobs with
deliver=true will execute through the agent instead of bypassing it.
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
On Android, /etc/resolv.conf does not exist, causing Go's default DNS
resolution to fail. This adds an init() hook that:
1. Detects missing /etc/resolv.conf (Android environment)
2. Configures a custom resolver with PreferGo: true
3. Supports multiple DNS servers via PICOCLAW_DNS_SERVER env var
- Semicolon-separated: "8.8.8.8:53;1.1.1.1:53"
- Single server also works: "8.8.8.8"
- Auto-appends :53 if port omitted
4. Round-robin rotation across configured servers
5. Defaults to Google DNS + Cloudflare DNS
Also patches http.DefaultTransport to use the custom resolver.