- stop exposing the raw Pico token to the frontend
- add /api/pico/info for non-secret Pico connection metadata
- proxy /pico/ws through the launcher with same-origin and dashboard auth checks
- inject the upstream Pico websocket protocol server-side
- update frontend chat connection flow and Vite websocket proxy path
- refresh related docs and tests
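
A minimal sketch of the kind of same-origin check a websocket proxy can apply before forwarding a connection. The helper name originMatchesHost and the example addresses are hypothetical; the launcher's actual check is not shown in this log and may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// originMatchesHost reports whether a browser Origin header value names the
// same host (and port) as the Host the request was sent to.
// Hypothetical helper for illustration only.
func originMatchesHost(origin, host string) bool {
	if origin == "" {
		return false // assumption: a missing Origin is treated as cross-origin
	}
	u, err := url.Parse(origin)
	if err != nil || u.Host == "" {
		return false
	}
	return strings.EqualFold(u.Host, host)
}

func main() {
	// The addresses below are arbitrary examples.
	fmt.Println(originMatchesHost("http://127.0.0.1:9000", "127.0.0.1:9000"))
	fmt.Println(originMatchesHost("http://other.example", "127.0.0.1:9000"))
}
```

A real handler would run this check together with the dashboard auth check before upgrading the connection.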
* refactor(docs): reorganize docs by type and locale
* chore(docs): add docs layout lint target and contributor guidance
Introduce a lint-docs script and Makefile target for common
documentation naming and placement checks. Expand docs/README.md
with layout and translation conventions, and update CONTRIBUTING.md
to point contributors to the new docs guidance and validation step.
* docs: add section index pages and fix localized doc links
- add reader navigation to docs/README.md
- add index pages for guides, reference, operations, security, architecture, and migration
- update localized project README links to prefer existing translated docs
* docs: fix broken wecom link in Malay README
Add loop-split.md explaining the 12-file split of the original
4384-line loop.go, covering the file map, extraction method,
and future phase 2 plans.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Propagate the configured HTTP client and proxy settings to the
SearXNG search provider.
Allow web_fetch to connect to the configured proxy as the first hop
without bypassing the existing private-host checks for redirect
targets and fetched URLs.
Add tests for loopback proxy fetches and SearXNG proxy propagation.
- split the tools page into focused components and a shared hook
- add separate Tool Library and Web Search tabs
- refresh web search settings layout and localized copy
- make provider expansion keyboard accessible
- restore wrapping for long tool names in library cards
- allow custom styling for KeyInput
Persist channel settings through the current channel_list schema, keeping common
channel fields at the top level and channel-specific fields under settings.
Return common fields and default config shapes from channel config endpoints, and
add coverage for nested patches, missing channel defaults, and secret handling.
* membench: add LLM-as-Judge evaluation mode
Add --eval-mode=llm to membench for LLM-based answer generation and
semantic scoring via an OpenAI-compatible API endpoint.
New files:
- llm_client.go: generic OpenAI-compatible chat completion client
with support for API key, configurable timeout, and optional
chat_template_kwargs (for llama.cpp thinking models)
- eval_llm.go: LLM answer generation + LLM-as-Judge scoring for
both legacy and seahorse retrieval modes
Changes to main.go:
- --eval-mode flag (token|llm) to select evaluation strategy
- --api-base, --api-key, --model flags with env var fallback
(MEMBENCH_API_BASE, MEMBENCH_API_KEY, MEMBENCH_MODEL)
- --no-thinking flag for llama.cpp + Qwen thinking models
- --limit flag to cap QA questions per sample for quick testing
* style: fix golangci-lint formatting (gofmt + golines)
* fix: address Copilot review feedback
- Validate --model is required for LLM eval mode
- Use rune-based truncation to preserve valid UTF-8
- Precompute totalQA count outside inner loop
- Log SearchMessages errors instead of silently skipping
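
For illustration, rune-based truncation can be sketched as below. The function name truncateRunes is hypothetical and stands in for whatever helper membench uses; the point is only that slicing []rune never cuts a multi-byte character in half.

```go
package main

import "fmt"

// truncateRunes returns at most max runes of s. Converting to []rune first
// means the cut always falls on a character boundary, so the result stays
// valid UTF-8 when the input is valid.
func truncateRunes(s string, max int) string {
	if max <= 0 {
		return ""
	}
	r := []rune(s)
	if len(r) <= max {
		return s
	}
	return string(r[:max])
}

func main() {
	fmt.Println(truncateRunes("héllo wörld", 5))
}
```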
* fix: address Copilot review round 2
- Validate --eval-mode accepts only 'token' or 'llm'
- Normalize base URL to avoid /v1/v1 duplication
- Separate token/LLM results for correct PrintComparison labeling
- Log ExpandMessages errors instead of silently ignoring
- Short-circuit with 0 scores when no context is retrieved (matches token eval)
- Add --timeout flag wired to LLMClientOptions.Timeout
* fix: address review P1+P2 — sort alignment, failure sentinel, score parser
- P1: Replace hand-rolled sortByRank with sort.Slice (ascending, best
first) matching eval.go's EvalSeahorse — ensures BudgetTruncate keeps
best-ranked messages when truncation occurs
- P2: Use -1.0 sentinel for LLM API failures and parse errors, distinct
from genuine 0.0 score; aggregateMetrics skips -1.0 entries for F1
averaging while still counting HitRate
- P2: Use regexp \b([1-5])\b for judge score extraction instead of
first-digit scan — avoids misparses on '5/5', 'Score: 3' etc.
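
A rough sketch of the score parser and failure sentinel described above. The names parseJudgeScore and failedScore are hypothetical, and the sketch returns the raw 1-5 value without any normalization the real code may apply.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// failedScore marks an API failure or an unparseable judge reply, so it can
// be told apart from a genuinely bad score.
const failedScore = -1.0

// scoreRe matches a single digit 1-5 that stands alone as a word.
var scoreRe = regexp.MustCompile(`\b([1-5])\b`)

// parseJudgeScore extracts the first standalone 1-5 digit from the reply,
// or returns failedScore when none is found.
func parseJudgeScore(reply string) float64 {
	m := scoreRe.FindStringSubmatch(reply)
	if m == nil {
		return failedScore
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return failedScore
	}
	return float64(n)
}

func main() {
	for _, reply := range []string{"Score: 3", "5/5", "10 out of 10", "unsure"} {
		fmt.Printf("%q -> %v\n", reply, parseJudgeScore(reply))
	}
}
```

Because of the word boundaries, a reply such as "10 out of 10" yields the sentinel rather than being read as a score of 1.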
* fix: address Copilot review round 2
- Fix F1/HitRate weighted aggregation: track ValidF1Count separately so
computeModeAgg weights F1 by valid scores only, not TotalQuestions
- No-context retrieval failure uses 0.0 (genuine bad score) instead of
-1.0 sentinel (reserved for API/parse failures)
- Validate --timeout > 0 to prevent disabling HTTP timeouts
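
The aggregation rule can be sketched as follows, assuming a per-sample record with F1, ValidF1Count, and TotalQuestions fields. The type and function names (sampleResult, summarize, weightedF1) are hypothetical; only the field names come from the notes above.

```go
package main

import "fmt"

// failedScore marks API or parse failures (assumed sentinel value).
const failedScore = -1.0

// sampleResult is a hypothetical per-sample summary.
type sampleResult struct {
	F1             float64
	ValidF1Count   int
	TotalQuestions int
}

// summarize averages F1 over valid scores only; sentinel entries still
// count toward TotalQuestions.
func summarize(scores []float64) sampleResult {
	res := sampleResult{TotalQuestions: len(scores)}
	var sum float64
	for _, s := range scores {
		if s == failedScore {
			continue
		}
		sum += s
		res.ValidF1Count++
	}
	if res.ValidF1Count > 0 {
		res.F1 = sum / float64(res.ValidF1Count)
	}
	return res
}

// weightedF1 combines samples, weighting each F1 by its valid-score count
// rather than by TotalQuestions.
func weightedF1(samples []sampleResult) float64 {
	var sum float64
	var n int
	for _, s := range samples {
		sum += s.F1 * float64(s.ValidF1Count)
		n += s.ValidF1Count
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	a := summarize([]float64{1.0, failedScore, 0.5})
	b := summarize([]float64{failedScore})
	fmt.Println(a, b, weightedF1([]sampleResult{a, b}))
}
```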
* fix: remove hardcoded /v1 from API base URL
Users now provide the full versioned path in --api-base (e.g. /v1, /v4).
Code only appends /chat/completions. Default changed to
http://127.0.0.1:8080/v1 for backward compatibility.
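
As a sketch of the resulting URL handling: the caller supplies the versioned base and the code appends only /chat/completions. The function name chatCompletionsURL is hypothetical, and trimming trailing slashes is an assumption about how the base is tidied.

```go
package main

import (
	"fmt"
	"strings"
)

// chatCompletionsURL appends the endpoint path to a caller-supplied base
// that already contains the API version (for example .../v1 or .../v4).
func chatCompletionsURL(apiBase string) string {
	// Assumption: trailing slashes are trimmed to avoid a double slash.
	return strings.TrimRight(apiBase, "/") + "/chat/completions"
}

func main() {
	fmt.Println(chatCompletionsURL("http://127.0.0.1:8080/v1"))
}
```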
* fix: address Copilot review round 3
- Keep ValidF1Count=0 when every score is a sentinel (no longer forced to 1)
- Backward compat: old eval JSON without ValidF1Count falls back to
TotalQuestions in computeModeAgg
- Skip empty section in PrintComparison when tokenResults is empty
- Update --api-base flag help to document /v1 default and version path
- Add sentinel aggregation unit tests (partial, all, weighted)
* feat: add --retries flag with exponential backoff for transient LLM errors
Retry on timeout, 5xx, and 429 (rate limit) with 1s/2s/4s backoff.
Default 3 retries, configurable via --retries. Context cancellation
is respected between retries.
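
A minimal sketch of retry with doubling backoff and context cancellation. All names here (retryableStatus, doWithRetries, simulate) are hypothetical; the base delay is a parameter so the example runs quickly, whereas the commit describes 1s/2s/4s.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retryableStatus reports whether an HTTP status is worth retrying.
func retryableStatus(code int) bool { return code == 429 || code >= 500 }

// doWithRetries calls fn up to maxRetries+1 times, doubling the delay after
// each failed attempt. Any non-nil error is treated as transient here
// (an assumption standing in for timeout detection).
func doWithRetries(ctx context.Context, maxRetries int, base time.Duration,
	fn func() (int, error)) (int, error) {
	delay := base
	for attempt := 0; ; attempt++ {
		code, err := fn()
		if err == nil && !retryableStatus(code) {
			return code, nil
		}
		if attempt >= maxRetries {
			if err == nil {
				err = fmt.Errorf("giving up after %d attempts: status %d", attempt+1, code)
			}
			return code, err
		}
		select {
		case <-ctx.Done(): // stop waiting as soon as the caller cancels
			return code, ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
}

// simulate runs a fake call that returns 503 for the first `failures`
// attempts and 200 afterwards.
func simulate(failures int) (code, calls int, err error) {
	code, err = doWithRetries(context.Background(), 3, time.Millisecond,
		func() (int, error) {
			calls++
			if calls <= failures {
				return 503, nil
			}
			return 200, nil
		})
	return code, calls, err
}

func main() {
	fmt.Println(simulate(2))
}
```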
* fix: address Copilot review round 4
- runReport splits results by mode suffix into token/llm for PrintComparison
- backward compat fallback (ValidF1Count=0 -> TotalQuestions) only for
non-LLM modes; LLM modes keep ValidF1Count=0 when all scores sentinel
- MaxRetries==0 means no retry; only negative falls back to default 3
- truncateStr uses []rune to avoid cutting multi-byte UTF-8 characters
- Complete() returns error on empty LLM response (vs silent empty string)
* feat: --no-thinking adapts to llama.cpp, Ollama, and GLM backends
Send all three disable-thinking fields simultaneously:
- chat_template_kwargs.enable_thinking=false (llama.cpp, GLM)
- think=false (Ollama 0.9+)
- thinking.type=disabled (GLM/Zhipu)
Each backend picks the field it recognizes and ignores the rest.
Also bumps max_tokens from 512 to 2048 for thinking models.
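
The combined request fields might look like the sketch below. The helper name noThinkingBody is hypothetical and the body is reduced to the model plus the three disable-thinking fields; a real request would also carry messages and other parameters.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// noThinkingBody builds a request body that sets all three disable-thinking
// fields at once, relying on each backend to ignore the ones it does not know.
func noThinkingBody(model string) (string, error) {
	body := map[string]any{
		"model":                model,
		"chat_template_kwargs": map[string]any{"enable_thinking": false}, // llama.cpp, GLM
		"think":                false,                                    // Ollama
		"thinking":             map[string]any{"type": "disabled"},       // GLM/Zhipu
	}
	b, err := json.Marshal(body)
	return string(b), err
}

func main() {
	s, err := noThinkingBody("example-model") // placeholder model name
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```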
* feat: mixed model eval + concurrent QA workers
- Add --judge-model, --judge-api-base, --judge-api-key flags for separate judge model
- Add --concurrency flag (default 1) with semaphore-based goroutine pool
- Add reasoning_content fallback for GLM/DeepSeek style responses
- Prepend /no_think to system prompt for Ollama /v1 compatibility
- Reduce default MaxTokens from 2048 to 512 (answers are 1-3 sentences)
- Extract evalQAWorker and buildSeahorseContext for shared concurrent logic
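
A semaphore-based goroutine pool of the kind mentioned above can be sketched like this. The function runConcurrent and its signature are hypothetical and are not the evalQAWorker code; a buffered channel caps how many workers run at once.

```go
package main

import (
	"fmt"
	"sync"
)

// runConcurrent runs work(i) for i in [0, n) with at most `concurrency`
// goroutines in flight, and returns the results in input order.
func runConcurrent(n, concurrency int, work func(i int) int) []int {
	if concurrency < 1 {
		concurrency = 1
	}
	results := make([]int, n)
	sem := make(chan struct{}, concurrency) // counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while the pool is full
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = work(i)     // each goroutine writes its own index
		}(i)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(runConcurrent(5, 2, func(i int) int { return i * i }))
}
```

Writing each result to its own slice index keeps the output order stable without a mutex.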
---------
Co-authored-by: BeaconCat <BeaconCat@users.noreply.github.com>
Scope tool result deduplication to each assistant tool-call block so providers
that reuse call IDs across separate turns do not lose valid tool results. Also
drop invalid empty tool call IDs and orphaned tool messages after validation.
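
One way to picture the scoping rule is the sketch below. The message type and the sanitize function are hypothetical simplifications and not the project's actual data model; the set of pending call IDs is reset at every assistant tool-call block, so an ID reused in a later turn is accepted again.

```go
package main

import "fmt"

// message is a simplified, hypothetical chat message.
type message struct {
	Role        string
	ToolCallIDs []string // set on assistant messages that call tools
	ToolCallID  string   // set on tool result messages
}

// sanitize drops empty call IDs, duplicate tool results within one
// assistant block, and tool results that match no pending call.
func sanitize(msgs []message) []message {
	var out []message
	var pending map[string]bool // call IDs awaiting a result in this block
	for _, m := range msgs {
		switch m.Role {
		case "assistant":
			pending = map[string]bool{} // new block: forget earlier IDs
			var ids []string
			for _, id := range m.ToolCallIDs {
				if id != "" {
					ids = append(ids, id)
					pending[id] = true
				}
			}
			m.ToolCallIDs = ids
			out = append(out, m)
		case "tool":
			if m.ToolCallID == "" || !pending[m.ToolCallID] {
				continue // empty ID, duplicate, or orphan
			}
			delete(pending, m.ToolCallID)
			out = append(out, m)
		default:
			pending = nil // any other message ends the block
			out = append(out, m)
		}
	}
	return out
}

func main() {
	msgs := []message{
		{Role: "assistant", ToolCallIDs: []string{"call_1"}},
		{Role: "tool", ToolCallID: "call_1"},
		{Role: "tool", ToolCallID: "call_1"}, // duplicate, dropped
		{Role: "user"},
		{Role: "assistant", ToolCallIDs: []string{"call_1"}}, // ID reused
		{Role: "tool", ToolCallID: "call_1"},                 // kept
	}
	fmt.Println(len(sanitize(msgs)))
}
```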