Files
BeaconCat f1b659e5ef membench: add LLM-as-Judge evaluation mode (#2484)
* membench: add LLM-as-Judge evaluation mode

Add --eval-mode=llm to membench for LLM-based answer generation and
semantic scoring via an OpenAI-compatible API endpoint.

New files:
- llm_client.go: generic OpenAI-compatible chat completion client
  with support for API key, configurable timeout, and optional
  chat_template_kwargs (for llama.cpp thinking models)
- eval_llm.go: LLM answer generation + LLM-as-Judge scoring for
  both legacy and seahorse retrieval modes

Changes to main.go:
- --eval-mode flag (token|llm) to select evaluation strategy
- --api-base, --api-key, --model flags with env var fallback
  (MEMBENCH_API_BASE, MEMBENCH_API_KEY, MEMBENCH_MODEL)
- --no-thinking flag for llama.cpp + Qwen thinking models
- --limit flag to cap QA questions per sample for quick testing

* style: fix golangci-lint formatting (gofmt + golines)

* fix: address Copilot review feedback

- Validate --model is required for LLM eval mode
- Use rune-based truncation to preserve valid UTF-8
- Precompute totalQA count outside inner loop
- Log SearchMessages errors instead of silently skipping

* fix: address Copilot review round 2

- Validate --eval-mode accepts only 'token' or 'llm'
- Normalize base URL to avoid /v1/v1 duplication
- Separate token/LLM results for correct PrintComparison labeling
- Log ExpandMessages errors instead of silently ignoring
- Short-circuit with 0 scores when no context retrieved (match token eval)
- Add --timeout flag wired to LLMClientOptions.Timeout

* fix: address review P1+P2 — sort alignment, failure sentinel, score parser

- P1: Replace hand-rolled sortByRank with sort.Slice (ascending, best
  first) matching eval.go's EvalSeahorse — ensures BudgetTruncate keeps
  best-ranked messages when truncation occurs
- P2: Use -1.0 sentinel for LLM API failures and parse errors, distinct
  from genuine 0.0 score; aggregateMetrics skips -1.0 entries for F1
  averaging while still counting HitRate
- P2: Use regexp \b([1-5])\b for judge score extraction instead of
  first-digit scan — avoids misparses on '5/5', 'Score: 3' etc.

* fix: address Copilot review round 2

- Fix F1/HitRate weighted aggregation: track ValidF1Count separately so
  computeModeAgg weights F1 by valid scores only, not TotalQuestions
- No-context retrieval failure uses 0.0 (genuine bad score) instead of
  -1.0 sentinel (reserved for API/parse failures)
- Validate --timeout > 0 to prevent disabling HTTP timeouts

* fix: remove hardcoded /v1 from API base URL

Users now provide the full versioned path in --api-base (e.g. /v1, /v4).
Code only appends /chat/completions. Default changed to
http://127.0.0.1:8080/v1 for backward compatibility.

* fix: address Copilot review round 3

- ValidF1Count=0 when all scores are sentinel (no forced =1)
- Backward compat: old eval JSON without ValidF1Count falls back to
  TotalQuestions in computeModeAgg
- Skip empty section in PrintComparison when tokenResults is empty
- Update --api-base flag help to document /v1 default and version path
- Add sentinel aggregation unit tests (partial, all, weighted)

* feat: add --retries flag with exponential backoff for transient LLM errors

Retry on timeout, 5xx, and 429 (rate limit) with 1s/2s/4s backoff.
Default 3 retries, configurable via --retries. Context cancellation
is respected between retries.

* fix: address Copilot review round 4

- runReport splits results by mode suffix into token/llm for PrintComparison
- backward compat fallback (ValidF1Count=0 -> TotalQuestions) only for
  non-LLM modes; LLM modes keep ValidF1Count=0 when all scores sentinel
- MaxRetries==0 means no retry; only negative falls back to default 3
- truncateStr uses []rune to avoid cutting multi-byte UTF-8 characters
- Complete() returns error on empty LLM response (vs silent empty string)

* feat: --no-thinking adapts to llama.cpp, Ollama, and GLM backends

Send all three disable-thinking fields simultaneously:
- chat_template_kwargs.enable_thinking=false (llama.cpp, GLM)
- think=false (Ollama 0.9+)
- thinking.type=disabled (GLM/Zhipu)
Each backend picks the field it recognizes and ignores the rest.
Also bumps max_tokens from 512 to 2048 for thinking models.

* feat: mixed model eval + concurrent QA workers

- Add --judge-model, --judge-api-base, --judge-api-key flags for separate judge model
- Add --concurrency flag (default 1) with semaphore-based goroutine pool
- Add reasoning_content fallback for GLM/DeepSeek style responses
- Prepend /no_think to system prompt for Ollama /v1 compatibility
- Reduce default MaxTokens from 2048 to 512 (answers are 1-3 sentences)
- Extract evalQAWorker and buildSeahorseContext for shared concurrent logic

---------

Co-authored-by: BeaconCat <BeaconCat@users.noreply.github.com>
2026-04-15 21:15:17 +08:00

199 lines
5.6 KiB
Go

package main
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"strings"
"time"
)
// LLMClient wraps an OpenAI-compatible chat completion endpoint.
type LLMClient struct {
BaseURL string
Model string
APIKey string
NoThinking bool // send chat_template_kwargs to disable thinking (llama.cpp specific)
MaxRetries int // max retry attempts for transient errors (0 = no retry)
Client *http.Client
}
// LLMClientOptions configures the LLM client.
type LLMClientOptions struct {
BaseURL string
Model string
APIKey string
Timeout time.Duration
NoThinking bool
MaxRetries int // max retry attempts (default 3)
}
// NewLLMClient creates a client for an OpenAI-compatible chat completion API.
func NewLLMClient(opts LLMClientOptions) *LLMClient {
if opts.Timeout == 0 {
opts.Timeout = 120 * time.Second
}
maxRetries := opts.MaxRetries
if maxRetries < 0 {
maxRetries = 3
}
return &LLMClient{
BaseURL: strings.TrimRight(opts.BaseURL, "/"),
Model: opts.Model,
APIKey: opts.APIKey,
NoThinking: opts.NoThinking,
MaxRetries: maxRetries,
Client: &http.Client{
Timeout: opts.Timeout,
},
}
}
type chatRequest struct {
Model string `json:"model"`
Messages []chatMessage `json:"messages"`
Temperature float64 `json:"temperature"`
MaxTokens int `json:"max_tokens"`
ChatTemplateKwargs map[string]any `json:"chat_template_kwargs,omitempty"` // llama.cpp
Think *bool `json:"think,omitempty"` // Ollama
Thinking map[string]any `json:"thinking,omitempty"` // GLM (智谱)
}
type chatMessage struct {
Role string `json:"role"`
Content string `json:"content"`
}
type chatResponse struct {
Choices []struct {
Message struct {
Content string `json:"content"`
ReasoningContent string `json:"reasoning_content,omitempty"`
} `json:"message"`
} `json:"choices"`
}
// Complete sends a chat completion request and returns the assistant's reply.
func (c *LLMClient) Complete(ctx context.Context, systemPrompt, userPrompt string) (string, error) {
sysContent := systemPrompt
if c.NoThinking && sysContent != "" {
// Prepend /no_think tag — works with Ollama /v1 endpoint and
// Qwen chat templates where the JSON think field is ignored.
sysContent = "/no_think\n" + sysContent
}
messages := []chatMessage{}
if sysContent != "" {
messages = append(messages, chatMessage{Role: "system", Content: sysContent})
}
messages = append(messages, chatMessage{Role: "user", Content: userPrompt})
body := chatRequest{
Model: c.Model,
Messages: messages,
Temperature: 0.1,
MaxTokens: 512,
}
if c.NoThinking {
// llama.cpp: chat_template_kwargs
body.ChatTemplateKwargs = map[string]any{
"enable_thinking": false,
}
// Ollama (0.9+): think field
thinkFalse := false
body.Think = &thinkFalse
// GLM (智谱): thinking field
body.Thinking = map[string]any{
"type": "disabled",
}
}
jsonBody, err := json.Marshal(body)
if err != nil {
return "", fmt.Errorf("marshal request: %w", err)
}
endpoint := strings.TrimRight(c.BaseURL, "/") + "/chat/completions"
req, err := http.NewRequestWithContext(ctx, "POST", endpoint, bytes.NewReader(jsonBody))
if err != nil {
return "", fmt.Errorf("create request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
if c.APIKey != "" {
req.Header.Set("Authorization", "Bearer "+c.APIKey)
}
var respBody []byte
var lastErr error
for attempt := 0; attempt <= c.MaxRetries; attempt++ {
if attempt > 0 {
backoff := time.Duration(1<<(attempt-1)) * time.Second // 1s, 2s, 4s, ...
log.Printf("LLM retry %d/%d after %v: %v", attempt, c.MaxRetries, backoff, lastErr)
select {
case <-ctx.Done():
return "", ctx.Err()
case <-time.After(backoff):
}
// Rebuild request (body reader is consumed)
req, err = http.NewRequestWithContext(ctx, "POST", endpoint, bytes.NewReader(jsonBody))
if err != nil {
return "", fmt.Errorf("create request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
if c.APIKey != "" {
req.Header.Set("Authorization", "Bearer "+c.APIKey)
}
}
var resp *http.Response
resp, lastErr = c.Client.Do(req)
if lastErr != nil {
continue // network/timeout error → retry
}
respBody, lastErr = io.ReadAll(resp.Body)
resp.Body.Close()
if lastErr != nil {
continue
}
if resp.StatusCode == 429 || resp.StatusCode >= 500 {
lastErr = fmt.Errorf("API error %d: %s", resp.StatusCode, string(respBody))
continue // rate limit or server error → retry
}
if resp.StatusCode != 200 {
return "", fmt.Errorf("API error %d: %s", resp.StatusCode, string(respBody))
}
lastErr = nil
break
}
if lastErr != nil {
return "", fmt.Errorf("after %d retries: %w", c.MaxRetries, lastErr)
}
var chatResp chatResponse
if err := json.Unmarshal(respBody, &chatResp); err != nil {
return "", fmt.Errorf("parse response: %w", err)
}
if len(chatResp.Choices) == 0 {
return "", fmt.Errorf("no choices in response")
}
content := strings.TrimSpace(chatResp.Choices[0].Message.Content)
// Strip any residual <think>...</think> blocks
if idx := strings.Index(content, "</think>"); idx >= 0 {
content = strings.TrimSpace(content[idx+len("</think>"):])
}
// Fallback: GLM/DeepSeek put thinking output in reasoning_content when thinking is enabled
if content == "" && chatResp.Choices[0].Message.ReasoningContent != "" {
content = strings.TrimSpace(chatResp.Choices[0].Message.ReasoningContent)
}
if content == "" {
return "", fmt.Errorf("empty LLM response")
}
return content, nil
}