mirror of https://github.com/sipeed/picoclaw.git synced 2026-06-12 18:08:54 +00:00

Mauro b114dcaeb1 feat(model): llm rate limiting (#2198)

* feat(model): rate limiting

* fix(agent): preserve per-model identity in rate limiting and fallback

* fix test

2026-04-02 19:26:26 +08:00

Dynamic Rate Limiting

PicoClaw helps avoid HTTP 429 (Too Many Requests) errors from LLM provider APIs by enforcing a configurable per-model requests-per-minute (RPM) limit before each request is sent. Unlike the reactive cooldown/fallback system, which activates only after a 429 is received, rate limiting is proactive: it keeps the outbound request rate within the provider's free-tier or plan limits.

How it works

Token-bucket algorithm

Each rate-limited model gets a token bucket:

Capacity = rpm (burst size equals the per-minute limit)
Refill rate = rpm / 60 tokens per second
Each LLM call consumes one token; if the bucket is empty, the call blocks until a token is refilled or the request context is cancelled
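The three rules above can be sketched in Go as follows. This is a minimal illustration, not the code in `pkg/providers/ratelimiter.go`: the `TokenBucket` type, its fields, and the `TryAcquire` method are hypothetical names chosen for this example, and it assumes `rpm` is positive (an `rpm` of `0` means no limiter is created at all).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a hypothetical stand-in for PicoClaw's RateLimiter.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64   // burst size, equal to rpm
	rate     float64   // refill rate in tokens per second (rpm / 60)
	tokens   float64   // tokens currently available
	last     time.Time // time of the last refill
}

// NewTokenBucket assumes rpm > 0.
func NewTokenBucket(rpm int) *TokenBucket {
	return &TokenBucket{
		capacity: float64(rpm),
		rate:     float64(rpm) / 60,
		tokens:   float64(rpm), // the bucket starts full
		last:     time.Now(),
	}
}

// take refills the bucket, then consumes one token if possible. When the
// bucket is empty it reports how long until the next token is due.
func (b *TokenBucket) take() (ok bool, wait time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	return false, time.Duration((1 - b.tokens) / b.rate * float64(time.Second))
}

// TryAcquire consumes a token without blocking.
func (b *TokenBucket) TryAcquire() bool {
	ok, _ := b.take()
	return ok
}

// Wait blocks until a token is consumed or ctx is cancelled.
func (b *TokenBucket) Wait(ctx context.Context) error {
	for {
		ok, wait := b.take()
		if ok {
			return nil
		}
		timer := time.NewTimer(wait)
		select {
		case <-ctx.Done():
			timer.Stop()
			return ctx.Err()
		case <-timer.C:
			// A token should be due now; loop and try again.
		}
	}
}

func main() {
	b := NewTokenBucket(3)
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	// Calls 1 to 3 use the initial burst; call 4 would need about 20 s,
	// so it gives up when the 100 ms deadline expires.
	for i := 1; i <= 4; i++ {
		fmt.Printf("call %d: err=%v\n", i, b.Wait(ctx))
	}
}
```

The mutex is held only while the token count is updated, never while sleeping, so concurrent callers do not block each other on the lock itself.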

Call chain integration

AgentLoop.callLLM()
  └─ FallbackChain.Execute()         ← iterate candidates
       ├─ CooldownTracker.IsAvailable()   ← skip if post-429 cooldown active
       ├─ RateLimiterRegistry.Wait()      ← NEW: block until token available
       └─ provider.Chat()                 ← actual LLM HTTP call

The rate limiter runs after the cooldown check and before the provider call, so:

Candidates already in cooldown are skipped entirely (no token consumed)
Candidates that are available get throttled to the configured RPM

The same check applies in ExecuteImage.

Thread safety

RateLimiterRegistry is safe for concurrent use. The per-limiter token bucket uses a fine-grained mutex so concurrent goroutines each acquire their own token independently.

Configuration

Set rpm on any model in model_list:

model_list:
  - model_name: gpt-4o-free
    model: openai/gpt-4o
    api_base: https://api.openai.com/v1
    rpm: 3          # max 3 requests per minute
    api_keys:
      - sk-...

  - model_name: claude-haiku
    model: anthropic/claude-haiku-4-5
    rpm: 60         # max 60 requests per minute (example value; check your plan's limit)
    api_keys:
      - sk-ant-...

  - model_name: local-llm
    model: openai/llama3
    api_base: http://localhost:11434/v1
    # no rpm → unrestricted

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `rpm` | `int` | `0` | Requests per minute. `0` means no limit. |

Interaction with fallbacks

When a model has fallbacks configured, each candidate is rate-limited independently:

model_list:
  - model_name: gpt4-with-fallback
    model: openai/gpt-4o
    rpm: 5
    fallbacks:
      - gpt-4o-mini   # must also be in model_list; its own rpm applies

If the current candidate's bucket is empty and there are more candidates available, PicoClaw skips the locally saturated candidate and tries the next fallback immediately. Only the last remaining candidate waits for a token to refill. If the context deadline is hit while waiting on that last candidate, the wait error propagates.
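The skip-or-wait rule can be sketched as a small candidate loop. This is an illustration under stated assumptions rather than the code in `pkg/providers/fallback.go`: the `limiter` interface, `fixedLimiter`, `candidate`, `laterAvailable`, `pick`, and `demo` are all hypothetical names, the provider call is reduced to returning the chosen candidate's name, and "more candidates available" is interpreted here as "a later candidate is not in cooldown".

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// limiter is a hypothetical interface; PicoClaw's actual types may differ.
type limiter interface {
	TryAcquire() bool               // non-blocking: take a token if one is ready
	Wait(ctx context.Context) error // blocking: wait for a token or cancellation
}

// fixedLimiter is a test double holding a fixed number of tokens (no refill).
type fixedLimiter struct{ tokens int }

func (f *fixedLimiter) TryAcquire() bool {
	if f.tokens == 0 {
		return false
	}
	f.tokens--
	return true
}

func (f *fixedLimiter) Wait(ctx context.Context) error {
	if f.TryAcquire() {
		return nil
	}
	<-ctx.Done() // never refills, so only cancellation ends the wait
	return ctx.Err()
}

type candidate struct {
	name       string
	inCooldown bool
	lim        limiter // nil means no rpm configured
}

// laterAvailable reports whether any candidate after index i is out of cooldown.
func laterAvailable(cands []candidate, i int) bool {
	for _, c := range cands[i+1:] {
		if !c.inCooldown {
			return true
		}
	}
	return false
}

// pick returns the name of the candidate that would receive the request.
func pick(ctx context.Context, cands []candidate) (string, error) {
	for i, c := range cands {
		if c.inCooldown {
			continue // skipped before the limiter, so no token is consumed
		}
		if c.lim != nil {
			if laterAvailable(cands, i) {
				if !c.lim.TryAcquire() {
					continue // locally saturated: try the next fallback
				}
			} else if err := c.lim.Wait(ctx); err != nil {
				return "", err // last remaining candidate: wait error propagates
			}
		}
		return c.name, nil
	}
	return "", errors.New("no candidate available")
}

// demo runs pick with an already-cancelled context so that waiting on an
// empty last bucket fails immediately instead of blocking.
func demo(cands []candidate) string {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	name, err := pick(ctx, cands)
	if err != nil {
		return "error: " + err.Error()
	}
	return name
}

func main() {
	primary := candidate{name: "gpt-4o", lim: &fixedLimiter{tokens: 0}}
	fallback := candidate{name: "gpt-4o-mini", lim: &fixedLimiter{tokens: 1}}
	cands := []candidate{primary, fallback}
	fmt.Println(demo(cands)) // primary is saturated, so the fallback serves
	fmt.Println(demo(cands)) // both are empty now: the wait error surfaces
}
```

Trying a non-blocking acquire on every candidate except the last keeps latency low when a fallback has spare capacity, while the blocking wait on the final candidate avoids failing a request that only needs a short delay.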

For model_list aliases that resolve to the same underlying provider/model, rate limiting is keyed by the stable config identity (for example model_name) rather than the resolved runtime model string. This preserves distinct RPM settings for multi-key and alias-based configurations.
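To see why the key matters, the sketch below registers two aliases of the same underlying model in two registries, one keyed by `model_name` and one keyed by the resolved model string. It is illustrative only: `modelConfig`, `bucket`, `registry`, and the `gpt-4o-paid` entry with its `rpm` of 500 are assumptions made for this example, not PicoClaw's actual types or configuration.

```go
package main

import (
	"fmt"
	"sync"
)

// modelConfig mirrors only the model_list fields used in this sketch.
type modelConfig struct {
	ModelName string // stable config identity (model_name)
	Model     string // resolved provider/model string
	RPM       int
}

// bucket is a placeholder for the per-model token bucket.
type bucket struct{ rpm int }

// registry is a hypothetical stand-in for RateLimiterRegistry.
type registry struct {
	mu      sync.Mutex
	buckets map[string]*bucket
}

func newRegistry() *registry {
	return &registry{buckets: make(map[string]*bucket)}
}

// register creates a bucket for key when rpm > 0 and none exists yet.
func (r *registry) register(key string, rpm int) {
	if rpm <= 0 {
		return // rpm 0 means no limit, so no bucket is needed
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.buckets[key]; !ok {
		r.buckets[key] = &bucket{rpm: rpm}
	}
}

// rpmFor returns the configured rpm for key, or 0 if the key is unlimited.
func (r *registry) rpmFor(key string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if b, ok := r.buckets[key]; ok {
		return b.rpm
	}
	return 0
}

func main() {
	models := []modelConfig{
		{ModelName: "gpt-4o-free", Model: "openai/gpt-4o", RPM: 3},
		{ModelName: "gpt-4o-paid", Model: "openai/gpt-4o", RPM: 500}, // hypothetical alias
	}
	byName, byModel := newRegistry(), newRegistry()
	for _, m := range models {
		byName.register(m.ModelName, m.RPM)
		byModel.register(m.Model, m.RPM)
	}
	// Keyed by model_name: two buckets, each with its own rpm.
	fmt.Println(len(byName.buckets), byName.rpmFor("gpt-4o-free"), byName.rpmFor("gpt-4o-paid"))
	// Keyed by model string: the aliases collapse into one bucket.
	fmt.Println(len(byModel.buckets), byModel.rpmFor("openai/gpt-4o"))
}
```

With the model string as the key, the second alias silently inherits the first alias's limit, which is the collision that keying by config identity avoids.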

Burst behaviour

The bucket starts full (burst = RPM). For rpm: 3, the first 3 requests fire instantly; subsequent requests are spaced ~20 s apart.

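Because the burst size is tied to rpm, the way to reduce burstiness for strict APIs is to set a lower rpm. This shrinks the initial burst and lowers the steady-state refill rate at the same time.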
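The timing for `rpm: 3` can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration (`earliestSend` is not part of PicoClaw); it assumes all requests arrive back to back at t=0 against a full bucket.

```go
package main

import (
	"fmt"
	"time"
)

// earliestSend returns when the n-th request (1-based) can be sent if all
// requests arrive at t=0 against a full bucket with capacity rpm and a
// refill rate of rpm/60 tokens per second. It assumes rpm > 0.
func earliestSend(rpm, n int) time.Duration {
	if n <= rpm {
		return 0 // served from the initial burst
	}
	// Each request beyond the burst needs one refilled token,
	// and one token takes 60/rpm seconds to refill.
	return time.Duration(n-rpm) * time.Minute / time.Duration(rpm)
}

func main() {
	for n := 1; n <= 6; n++ {
		fmt.Printf("request %d: t=%v\n", n, earliestSend(3, n))
	}
}
```

For `rpm: 3` this prints `0s` for requests 1 to 3, then `20s`, `40s`, and `1m0s` for requests 4 to 6.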

Files changed

| File | What |
|------|------|
| `pkg/providers/ratelimiter.go` | `RateLimiter` (token bucket) + `RateLimiterRegistry` |
| `pkg/providers/ratelimiter_test.go` | Unit tests for limiter and registry |
| `pkg/providers/fallback.go` | `FallbackCandidate.RPM` field; `FallbackChain.rl`; `Wait()` call in `Execute`/`ExecuteImage` |
| `pkg/agent/model_resolution.go` | Resolves candidates from `model_list`, preserving stable config identity and propagating `RPM` into `FallbackCandidate` |
| `pkg/agent/loop.go` | Build `RateLimiterRegistry`, register all agents' candidates, pass to `NewFallbackChain` |
