# ASR (Automatic Speech Recognition)

This package handles speech-to-text for PicoClaw voice input.

If you are new to ASR setup, the simplest mental model is:

1. Add one or more ASR-capable entries to `model_list`.
2. Point `voice.model_name` at the one you want to use.
3. Put the API key in `.security.yml`.

## Quick Recommendation

For most new users, start with one of these:

| Provider | Example model | Why start here |
| --- | --- | --- |
| [Groq](https://console.groq.com/keys) | `groq/whisper-large-v3-turbo` | Fast Whisper-style transcription through a straightforward OpenAI-compatible API. Groq currently advertises a free tier of 2,000 requests per day. |
| [ElevenLabs](https://elevenlabs.io/pricing) | `elevenlabs/scribe_v1` | Easy setup and strong speech-to-text quality. ElevenLabs currently advertises a free plan that includes speech-to-text usage. |

Pricing and free-plan limits can change, so check each provider's current pricing before depending on them in production.

## How ASR Configuration Works

PicoClaw does not keep ASR API keys inside the `voice` section.

Instead:

- `voice.model_name` chooses a named entry from `model_list`.
- The matching `model_list` entry describes the actual provider and model.
- `.security.yml` stores the API key for that named model entry.

This is the recommended pattern because it is explicit, reusable, and consistent with the rest of PicoClaw's model configuration.

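
The lookup chain described above can be sketched in a few lines of Python. This is an illustration only, not PicoClaw's actual code: the function name `resolve_asr_model`, the returned dictionary shape, the `provider/model` split, and the choice of the first key in `api_keys` are all assumptions made for the example. Both arguments are the already-parsed contents of `config.json` and `.security.yml`.

```python
def resolve_asr_model(config: dict, security: dict) -> dict:
    """Follow voice.model_name -> model_list entry -> .security.yml key.

    Hypothetical helper for illustration; PicoClaw's real resolver may differ.
    """
    name = config.get("voice", {}).get("model_name")
    if not name:
        raise ValueError("voice.model_name is not set")

    # Find the model_list entry whose model_name matches voice.model_name.
    entry = next(
        (m for m in config.get("model_list", []) if m.get("model_name") == name),
        None,
    )
    if entry is None:
        raise ValueError(f"no model_list entry named {name!r}")

    # API keys live in .security.yml, keyed by the same model name.
    keys = security.get("model_list", {}).get(name, {}).get("api_keys") or []
    if not keys:
        raise ValueError(f"no api_keys for {name!r} in .security.yml")

    # Assumption: "provider/model" is split at the first slash.
    provider, _, model_id = entry.get("model", "").partition("/")
    return {
        "name": name,
        "provider": provider,
        "model": model_id,
        "api_key": keys[0],  # assumption: use the first configured key
    }
```

The point of the sketch is that the `voice` section only ever carries a name. Everything else, including the secret, is found by following that name.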
## Recommended Setup
### Option A: Groq Whisper

`config.json`

```json
{
  "voice": {
    "model_name": "groq-asr",
    "echo_transcription": true
  },
  "model_list": [
    {
      "model_name": "groq-asr",
      "model": "groq/whisper-large-v3-turbo"
    }
  ]
}
```

`.security.yml`

```yaml
model_list:
  groq-asr:
    api_keys:
      - "gsk_your_groq_key"
```

Notes:

- You can omit `api_base` and PicoClaw will use Groq's default API base automatically.
- If you set `api_base` manually for Groq Whisper, both of these forms work:
  - `https://api.groq.com/openai/v1`
  - `https://api.groq.com/openai/v1/audio/transcriptions`
- Any OpenAI-compatible model whose name contains `whisper` can use the Whisper transcription path, not only `whisper-large-v3-turbo`.

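
One way to picture why both `api_base` forms work is a small normalization step that always ends up at the `/audio/transcriptions` endpoint. The Python sketch below is an assumption about the behavior, not PicoClaw's code. The helper names are hypothetical, the default base is taken from the first form listed above, and matching `whisper` case-insensitively is a guess.

```python
TRANSCRIPTIONS_SUFFIX = "/audio/transcriptions"
# Assumption: Groq's default base is the first form listed in the notes.
GROQ_DEFAULT_BASE = "https://api.groq.com/openai/v1"


def transcription_url(api_base: str = "") -> str:
    """Return the full transcription endpoint for either accepted form."""
    base = (api_base or GROQ_DEFAULT_BASE).rstrip("/")
    if base.endswith(TRANSCRIPTIONS_SUFFIX):
        return base  # already the full endpoint
    return base + TRANSCRIPTIONS_SUFFIX


def uses_whisper_path(model: str) -> bool:
    """Heuristic: any model name containing 'whisper' takes the Whisper path."""
    return "whisper" in model.lower()
```

With this normalization, an omitted `api_base`, the short form, and the full endpoint all resolve to the same URL.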
### Option B: ElevenLabs

`config.json`

```json
{
  "voice": {
    "model_name": "elevenlabs-asr",
    "echo_transcription": true
  },
  "model_list": [
    {
      "model_name": "elevenlabs-asr",
      "model": "elevenlabs/scribe_v1"
    }
  ]
}
```

`.security.yml`

```yaml
model_list:
  elevenlabs-asr:
    api_keys:
      - "sk-elevenlabs-your-key"
```

### Option C: OpenAI Whisper

`config.json`

```json
{
  "voice": {
    "model_name": "openai-asr"
  },
  "model_list": [
    {
      "model_name": "openai-asr",
      "model": "openai/whisper-1"
    }
  ]
}
```

`.security.yml`

```yaml
model_list:
  openai-asr:
    api_keys:
      - "sk-openai-your-key"
```

## Other ASR-Capable Model Types

PicoClaw currently supports three main ASR routes:

| Route | Example models | Behavior |
| --- | --- | --- |
| ElevenLabs ASR | `elevenlabs/scribe_v1` | Uses the ElevenLabs transcription API. |
| Whisper endpoint models | `openai/whisper-1`, `groq/whisper-large-v3` | Uses an OpenAI-compatible `/audio/transcriptions` endpoint. |
| Audio-capable chat models **(Under construction)** | `openai/gpt-4o-audio-preview`, `gemini/gemini-2.5-flash` | Sends audio to a multimodal chat model and asks it to transcribe. |

If you are unsure which one to pick, choose Groq Whisper or ElevenLabs first.

## How PicoClaw Chooses a Transcriber

`DetectTranscriber` resolves ASR in this order:

1. **Preferred path**: resolve `voice.model_name` against `model_list`.
2. If that resolved model is:
   - `elevenlabs/...`, PicoClaw uses the ElevenLabs transcriber.
   - an OpenAI-compatible Whisper model, PicoClaw uses the Whisper transcriber.
   - an audio-capable chat model, PicoClaw uses `AudioModelTranscriber`.
3. **Fallback path**: if `voice.model_name` is not set, PicoClaw performs a compatibility scan through `model_list` for legacy auto-detected ASR entries.

Fallback scanning exists for backward compatibility. New configurations should set `voice.model_name` explicitly.

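
The resolution order above can be modeled roughly as follows. This Python sketch is not the real `DetectTranscriber`, and several details are assumptions made for illustration: the route labels, the hard-coded set of audio-capable chat models (taken from the examples in this document), returning `None` when `voice.model_name` matches nothing, and a fallback scan that picks the first entry matching any route.

```python
from typing import Optional

# Assumption: a fixed set stands in for real audio-capability detection.
AUDIO_CHAT_MODELS = {"openai/gpt-4o-audio-preview", "gemini/gemini-2.5-flash"}


def classify(model: str) -> Optional[str]:
    """Map a provider/model string to a hypothetical route label."""
    if model.startswith("elevenlabs/"):
        return "elevenlabs"
    if "whisper" in model:
        return "whisper"
    if model in AUDIO_CHAT_MODELS:
        return "audio_model"
    return None


def detect_transcriber(config: dict) -> Optional[str]:
    """Preferred path first; fallback scan only if voice.model_name is unset."""
    model_list = config.get("model_list", [])
    name = config.get("voice", {}).get("model_name")

    if name:
        # Preferred path: resolve the named entry and classify it.
        for entry in model_list:
            if entry.get("model_name") == name:
                return classify(entry.get("model", ""))
        return None  # assumption: an unknown name yields no transcriber

    # Fallback path: take the first entry that looks ASR-capable.
    for entry in model_list:
        route = classify(entry.get("model", ""))
        if route is not None:
            return route
    return None
```

Setting `voice.model_name` explicitly keeps the outcome independent of the order of `model_list`, which is one reason the fallback scan is not recommended for new configurations.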
## Common Mistakes

- Defining an ASR model in `model_list` but forgetting to set `voice.model_name`.
- Putting the API key in `voice` instead of `.security.yml`.
- Using a non-ASR model and expecting Whisper-style transcription behavior.
- Setting a custom `api_base` that points to the wrong provider endpoint.

## Minimal Checklist

Before testing voice input, make sure:

- `voice.model_name` matches a `model_list[].model_name`.
- The matching `.security.yml` entry contains a valid API key.
- The selected model is actually ASR-capable.
- Voice input is enabled for the channel you are using.
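
Most of this checklist can be automated with a small pre-flight script. The Python sketch below is a hypothetical helper, not part of PicoClaw. It takes the already-parsed contents of `config.json` and `.security.yml`, uses a simple name-based heuristic for ASR capability, and treats any non-empty string as a key, since it cannot tell whether a key is actually valid. It does not check the channel setting, because that configuration is not described here.

```python
from typing import List


def looks_asr_capable(model: str) -> bool:
    """Heuristic only: ElevenLabs prefix or a Whisper-style model name."""
    return model.startswith("elevenlabs/") or "whisper" in model


def check_voice_config(config: dict, security: dict) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems: List[str] = []
    voice = config.get("voice", {})
    entries = {m.get("model_name"): m for m in config.get("model_list", [])}
    name = voice.get("model_name")

    if not name:
        problems.append("voice.model_name is not set")
    elif name not in entries:
        problems.append(f"voice.model_name {name!r} matches no model_list entry")
    else:
        model = entries[name].get("model", "")
        if not looks_asr_capable(model):
            problems.append(f"model {model!r} does not look ASR-capable")
        keys = security.get("model_list", {}).get(name, {}).get("api_keys") or []
        if not any(isinstance(k, str) and k.strip() for k in keys):
            problems.append(f"no API key for {name!r} in .security.yml")

    # Common mistake: a key placed in the voice section.
    if "api_key" in voice or "api_keys" in voice:
        problems.append("API key found under voice; move it to .security.yml")
    return problems
```

An empty result only means the configuration is structurally consistent. A rejected key or a wrong `api_base` will still surface only when a real transcription request is made.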