Introduce a lint-docs script and Makefile target for common
documentation naming and placement checks. Expand docs/README.md
with layout and translation conventions, and update CONTRIBUTING.md
to point contributors to the new docs guidance and validation step.
Add loop-split.md explaining the 12-file split of the original
4384-line loop.go, covering the file map, extraction method,
and future phase 2 plans.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Propagate the configured HTTP client and proxy settings to the
SearXNG search provider.
Allow web_fetch to connect to the configured proxy as the first hop
without bypassing the existing private-host checks for redirect
targets and fetched URLs.
Add tests for loopback proxy fetches and SearXNG proxy propagation.
- split the tools page into focused components and a shared hook
- add separate Tool Library and Web Search tabs
- refresh web search settings layout and localized copy
- make provider expansion keyboard accessible
- restore wrapping for long tool names in library cards
- allow custom styling for KeyInput
Persist channel settings through the current channel_list schema, keeping common
channel fields at the top level and channel-specific fields under settings.
Return common fields and default config shapes from channel config endpoints, and
add coverage for nested patches, missing channel defaults, and secret handling.
Replace generic mockProvider with modelRecordingProvider that captures
the model parameter passed to Chat(). After delegation from alpha to
beta, assert the recorded model is "model-beta" — proving the child
turn actually ran with the target agent's configuration, not the
caller's.
Also add wiring tests:
- TestDelegateToolNotRegistered_SingleAgent: single-agent has no
delegate in its tool registry
- TestDelegateToolRegistered_MultiAgent: both agents in a two-agent
setup have the delegate tool
Ref: #2148
Remove the IsToolEnabled("delegate") check — there is no "delegate"
entry in ToolsConfig, so the check was always true. The only real
gate is len(agents) > 1, which is the intended behavior: delegate
is auto-registered in multi-agent setups.
Ref: #2148
Add table-driven test with case and whitespace variants (ALPHA,
" Alpha ", " alpha ") that should all be caught by the self-check
after normalization.
Ref: #2148
Apply routing.NormalizeAgentID to the raw agent_id input before any
logic runs. This prevents case/whitespace variants like "ALPHA" or
" alpha " from bypassing the self-delegation guard while still
resolving to the same agent in the registry.
The normalized value is used consistently for self-check, allowlist,
SpawnSubTurn, and result attribution.
Ref: #2148
Register the delegate tool in registerSharedTools when multiple agents
are configured. Gated independently from the subagent tool — delegate
uses SubTurn directly and does not depend on SubagentManager.
Self-delegation is prevented by injecting the current agent ID.
Permission is enforced via CanSpawnSubagent (reuses allow_agents config).
Single-agent setups are unaffected: the tool is not registered when
only one agent exists in the registry.
Ref: #2148
12 test cases covering:
- success path with result attribution
- agent_id validation (missing, empty, whitespace, wrong type)
- task validation (missing, empty, whitespace)
- permission denied / allowed via allowlist checker
- self-delegation blocked
- nil spawner, spawner error, nil result from spawner
- open access when no allowlist checker is set
Ref: #2148
delegate(agent_id, task) hands off a task to a named agent and blocks
until the result is ready. The target agent runs with its own config
via the TargetAgentID mechanism in SubTurnConfig.
Key behaviors:
- Self-delegation explicitly rejected
- Permission gated by subagents.allow_agents (D4)
- Spawner errors preserve the underlying error via WithError
- Nil result from spawner handled gracefully
- Response attributed with target agent ID
Ref: #2148
Add multi-agent test setup (newMultiAgentLoop) with two agents using
distinct models (model-alpha, model-beta).
Three new tests:
- UsesTargetAgent: parent=alpha delegates to beta, event log confirms
child runs as agent_id=beta with model=model-beta
- NotFound: TargetAgentID pointing to nonexistent agent returns error
- EmptyModelAccepted: empty Model field accepted when TargetAgentID
provides the model implicitly
Ref: #2148
When TargetAgentID is set, spawnSubTurn resolves the target AgentInstance
from the registry and uses it as the base for the child turn. This gives
the child turn the target's workspace, model, tools, and system prompt
instead of inheriting from the caller.
Model validation is relaxed: empty Model is accepted when TargetAgentID
provides the model implicitly via the resolved agent instance.
Ref: #2148
* membench: add LLM-as-Judge evaluation mode
Add --eval-mode=llm to membench for LLM-based answer generation and
semantic scoring via an OpenAI-compatible API endpoint.
New files:
- llm_client.go: generic OpenAI-compatible chat completion client
with support for API key, configurable timeout, and optional
chat_template_kwargs (for llama.cpp thinking models)
- eval_llm.go: LLM answer generation + LLM-as-Judge scoring for
both legacy and seahorse retrieval modes
Changes to main.go:
- --eval-mode flag (token|llm) to select evaluation strategy
- --api-base, --api-key, --model flags with env var fallback
(MEMBENCH_API_BASE, MEMBENCH_API_KEY, MEMBENCH_MODEL)
- --no-thinking flag for llama.cpp + Qwen thinking models
- --limit flag to cap QA questions per sample for quick testing
* style: fix golangci-lint formatting (gofmt + golines)
* fix: address Copilot review feedback
- Validate --model is required for LLM eval mode
- Use rune-based truncation to preserve valid UTF-8
- Precompute totalQA count outside inner loop
- Log SearchMessages errors instead of silently skipping
* fix: address Copilot review round 2
- Validate --eval-mode accepts only 'token' or 'llm'
- Normalize base URL to avoid /v1/v1 duplication
- Separate token/LLM results for correct PrintComparison labeling
- Log ExpandMessages errors instead of silently ignoring
- Short-circuit with 0 scores when no context retrieved (match token eval)
- Add --timeout flag wired to LLMClientOptions.Timeout
* fix: address review P1+P2 — sort alignment, failure sentinel, score parser
- P1: Replace hand-rolled sortByRank with sort.Slice (ascending, best
first) matching eval.go's EvalSeahorse — ensures BudgetTruncate keeps
best-ranked messages when truncation occurs
- P2: Use -1.0 sentinel for LLM API failures and parse errors, distinct
from genuine 0.0 score; aggregateMetrics skips -1.0 entries for F1
averaging while still counting HitRate
- P2: Use regexp \b([1-5])\b for judge score extraction instead of
first-digit scan — avoids misparses on '5/5', 'Score: 3' etc.
* fix: address Copilot review round 2
- Fix F1/HitRate weighted aggregation: track ValidF1Count separately so
computeModeAgg weights F1 by valid scores only, not TotalQuestions
- No-context retrieval failure uses 0.0 (genuine bad score) instead of
-1.0 sentinel (reserved for API/parse failures)
- Validate --timeout > 0 to prevent disabling HTTP timeouts
* fix: remove hardcoded /v1 from API base URL
Users now provide the full versioned path in --api-base (e.g. /v1, /v4).
Code only appends /chat/completions. Default changed to
http://127.0.0.1:8080/v1 for backward compatibility.
* fix: address Copilot review round 3
- ValidF1Count=0 when all scores are sentinel (no forced =1)
- Backward compat: old eval JSON without ValidF1Count falls back to
TotalQuestions in computeModeAgg
- Skip empty section in PrintComparison when tokenResults is empty
- Update --api-base flag help to document /v1 default and version path
- Add sentinel aggregation unit tests (partial, all, weighted)
* feat: add --retries flag with exponential backoff for transient LLM errors
Retry on timeout, 5xx, and 429 (rate limit) with 1s/2s/4s backoff.
Default 3 retries, configurable via --retries. Context cancellation
is respected between retries.
* fix: address Copilot review round 4
- runReport splits results by mode suffix into token/llm for PrintComparison
- backward compat fallback (ValidF1Count=0 -> TotalQuestions) only for
non-LLM modes; LLM modes keep ValidF1Count=0 when all scores sentinel
- MaxRetries==0 means no retry; only negative falls back to default 3
- truncateStr uses []rune to avoid cutting multi-byte UTF-8 characters
- Complete() returns error on empty LLM response (vs silent empty string)
* feat: --no-thinking adapts to llama.cpp, Ollama, and GLM backends
Send all three disable-thinking fields simultaneously:
- chat_template_kwargs.enable_thinking=false (llama.cpp, GLM)
- think=false (Ollama 0.9+)
- thinking.type=disabled (GLM/Zhipu)
Each backend picks the field it recognizes and ignores the rest.
Also bumps max_tokens from 512 to 2048 for thinking models.
* feat: mixed model eval + concurrent QA workers
- Add --judge-model, --judge-api-base, --judge-api-key flags for separate judge model
- Add --concurrency flag (default 1) with semaphore-based goroutine pool
- Add reasoning_content fallback for GLM/DeepSeek style responses
- Prepend /no_think to system prompt for Ollama /v1 compatibility
- Reduce default MaxTokens from 2048 to 512 (answers are 1-3 sentences)
- Extract evalQAWorker and buildSeahorseContext for shared concurrent logic
---------
Co-authored-by: BeaconCat <BeaconCat@users.noreply.github.com>
Scope tool result deduplication to each assistant tool-call block so providers
that reuse call IDs across separate turns do not lose valid tool results. Also
drop invalid empty tool call IDs and orphaned tool messages after validation.
Default the sample web search provider to auto, route Sogou vs DuckDuckGo dynamically based on query/UI language, and sync frontend language changes back to the backend so Current Service and runtime selection stay aligned.