mirror of
https://github.com/sipeed/picoclaw.git
synced 2026-06-12 18:08:54 +00:00
0f395ce110
* refactor: update ASR and TTS implementations * fix lint * Integrating asr/tts models w/ new security config * update documents * add arbitrary whisper transcriptor support * update documents * fix lint * add mimo tts
138 lines
4.0 KiB
Markdown
138 lines
4.0 KiB
Markdown
# TTS (Text-to-Speech)
|
|
|
|
This package handles speech synthesis for PicoClaw.
|
|
|
|
If you are new to TTS setup, the simplest workflow is:
|
|
|
|
1. Add a TTS-capable entry to `model_list`.
|
|
2. Point `voice.tts_model_name` at that entry.
|
|
3. Put the API key in `.security.yml`.
|
|
|
|
## Quick Recommendation
|
|
|
|
For most users, these are the best starting points:
|
|
|
|
| Provider | Why start here |
|
|
| --- | --- |
|
|
| [OpenAI](https://platform.openai.com/docs/guides/text-to-speech) | Best-supported path in PicoClaw today. The current TTS implementation is built around the OpenAI-compatible `/audio/speech` API shape, and OpenAI is the safest default. |
|
|
| [Xiaomi MiMo](https://platform.xiaomimimo.com) | A good second option if you want an OpenAI-compatible provider endpoint and are already using MiMo models in the rest of your stack. |
|
|
|
|
## How TTS Configuration Works
|
|
|
|
PicoClaw does not keep TTS API keys inside `voice`.
|
|
|
|
Instead:
|
|
|
|
- `voice.tts_model_name` selects a named entry from `model_list`.
|
|
- That `model_list` entry provides the provider, model ID, API base, and proxy settings.
|
|
- `.security.yml` stores the API key for the same named model entry.
|
|
|
|
This is the recommended and supported configuration pattern.
|
|
|
|
## Recommended Setup
|
|
|
|
### Option A: OpenAI
|
|
|
|
`config.json`
|
|
|
|
```json
|
|
{
|
|
"voice": {
|
|
"tts_model_name": "openai-tts"
|
|
},
|
|
"model_list": [
|
|
{
|
|
"model_name": "openai-tts",
|
|
"model": "openai/tts-1"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
`.security.yml`
|
|
|
|
```yaml
|
|
model_list:
|
|
openai-tts:
|
|
api_keys:
|
|
- "sk-openai-your-key"
|
|
```
|
|
|
|
### Option B: Xiaomi MiMo
|
|
|
|
`config.json`
|
|
|
|
```json
|
|
{
|
|
"voice": {
|
|
"tts_model_name": "mimo-tts"
|
|
},
|
|
"model_list": [
|
|
{
|
|
"model_name": "mimo-tts",
|
|
"model": "mimo/mimo-v2-tts"
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
`.security.yml`
|
|
|
|
```yaml
|
|
model_list:
|
|
mimo-tts:
|
|
api_keys:
|
|
- "your-mimo-key"
|
|
```
|
|
|
|
If you use a custom MiMo endpoint, you can also set `api_base` explicitly. Otherwise PicoClaw will use the provider default.
|
|
|
|
## What PicoClaw Sends Today
|
|
|
|
The current TTS runtime uses an OpenAI-compatible speech request with these defaults:
|
|
|
|
- Endpoint: `/audio/speech`
|
|
- Response format: `opus`
|
|
- Voice: `alloy`
|
|
- Model: taken from the selected `model_list` entry
|
|
|
|
That means:
|
|
|
|
- `openai/tts-1` works naturally.
|
|
- Other OpenAI-compatible providers can work if they accept the same request format.
|
|
- PicoClaw currently does not expose a user-facing config field for changing the TTS voice from `alloy`.
|
|
|
|
## How PicoClaw Chooses a TTS Provider
|
|
|
|
`DetectTTS` resolves TTS in this order:
|
|
|
|
1. **Preferred path**: resolve `voice.tts_model_name` against `model_list`.
|
|
2. If a matching model entry exists and has an API key, PicoClaw creates an OpenAI-compatible TTS provider using that model's settings.
|
|
3. **Fallback path**: if `voice.tts_model_name` is not set or cannot be resolved, PicoClaw scans `model_list` for the first entry whose model string contains `tts` and has an API key.
|
|
|
|
Fallback scanning exists for compatibility. New configs should set `voice.tts_model_name` explicitly.
|
|
|
|
## Notes About API Base Handling
|
|
|
|
PicoClaw normalizes the configured base URL for TTS:
|
|
|
|
- For OpenAI, a base like `https://api.openai.com` or `https://api.openai.com/v1` becomes `https://api.openai.com/v1/audio/speech`.
|
|
- For other OpenAI-compatible providers, PicoClaw preserves the configured base path and ensures it ends with `/audio/speech`.
|
|
- If `api_base` is omitted, PicoClaw uses the provider default base when the model prefix is known.
|
|
|
|
## Common Mistakes
|
|
|
|
- Setting `voice.tts_model_name` to a name that does not exist in `model_list`.
|
|
- Adding a TTS model but forgetting to put its API key in `.security.yml`.
|
|
- Assuming PicoClaw will automatically use provider-specific custom voices.
|
|
- Using a provider endpoint that is not compatible with the OpenAI `/audio/speech` request format.
|
|
|
|
## Minimal Checklist
|
|
|
|
Before testing `send_tts`, make sure:
|
|
|
|
- `voice.tts_model_name` matches a `model_list[].model_name`.
|
|
- The matching `.security.yml` entry contains a valid API key.
|
|
- The chosen provider supports an OpenAI-compatible speech synthesis endpoint.
|
|
- Your selected model is actually a TTS-capable model.
|