Klone V2 Pro
Klone V2 Pro is our voice-cloning model. Synthesize speech in the style of a Voice Arena community voice using its share_id, or (on enterprise tiers) from your own reference audio - with style hints, fixed durations, and telephony-ready output codecs.
Overview
Klone V2 Pro is the voice-cloning model used in the TTS playground when you select the Klone V2 Pro model. It synthesizes speech in the style of a Voice Arena community voice identified by an 8-character share_id (visible on each voice in Voice Arena → Mine tab, or via search).
Key characteristics
- Community / cloned voices via share_id, sent as a multipart form request.
- 50+ language codes; billed at 0.5 credits per character (a multiplier may apply on paid public voices).
- Languages: use short ISO-style codes (e.g. hi, en, ta).
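As a rough worked example of the billing rule above, the sketch below estimates the credit cost of a piece of text. It assumes every character, including spaces and punctuation, is billed. The `multiplier` argument is a hypothetical stand-in for the paid-public-voice multiplier. Neither detail is confirmed on this page.

```python
CREDITS_PER_CHAR = 0.5  # rate quoted in the list above


def estimate_credits(text: str, multiplier: float = 1.0) -> float:
    """Estimate credits for `text`.

    Assumption: all characters (spaces and punctuation included) are billed,
    and `multiplier` models the surcharge on paid public voices.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return len(text) * CREDITS_PER_CHAR * multiplier
```

Under these assumptions, a 100-character prompt costs 50 credits at the base rate, or 100 credits with a multiplier of 2.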
Reference voice - provide exactly one
Choose the voice to clone using exactly one of the sources below. Sending more than one returns 400 ambiguous_audio_source; sending none returns 400 missing_audio_source.
| Field | Type | Tier | Description |
|---|---|---|---|
| share_id | string | All tiers | Catalog / community voice share ID. Uses the stored voice embedding; audio / audio_url / audio_text are ignored. |
| audio | file | Enterprise | Reference audio upload (WAV/MP3/M4A/FLAC/OGG/WebM). Max 10 MB, max 30 s. |
| audio_url | string | Enterprise | HTTPS URL of a reference clip. audio_text (transcript) is required when using a URL. |
Using audio / audio_url on a non-enterprise tier returns 403 enterprise_only.
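The "exactly one source" rule can be checked on the client before a request is sent. The sketch below mirrors the error codes named above. The order of the checks, the `enterprise` flag, and the `audio_text_required` label are assumptions made for illustration, since this page does not say which code is returned for a missing transcript.

```python
from typing import Optional


def check_reference_source(
    share_id: Optional[str] = None,
    audio_path: Optional[str] = None,
    audio_url: Optional[str] = None,
    audio_text: Optional[str] = None,
    enterprise: bool = False,
) -> Optional[str]:
    """Return the documented error code for a bad combination, else None."""
    supplied = [
        name
        for name, value in (
            ("share_id", share_id),
            ("audio", audio_path),
            ("audio_url", audio_url),
        )
        if value
    ]
    if not supplied:
        return "missing_audio_source"
    if len(supplied) > 1:
        return "ambiguous_audio_source"
    if supplied[0] != "share_id" and not enterprise:
        return "enterprise_only"
    if supplied[0] == "audio_url" and not audio_text:
        # Hypothetical label: the page states the requirement but not the code.
        return "audio_text_required"
    return None
```

A `None` result only means the combination looks valid locally; the server remains the authority on tier and ownership checks.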
Generate audio
Catalog / community voice
Returns audio inline (WAV by default). Replace YOUR_FONADA_API_KEY with your API key and YOUR_SHARE_ID with the voice's share ID from Voice Arena.
curl -X POST "https://api.fonada.ai/v1/voice-clone/chunks" \
-H "Authorization: Bearer YOUR_FONADA_API_KEY" \
-F "share_id=YOUR_SHARE_ID" \
-F "text=Hello, this is a voice clone test using a catalog voice." \
-F "language=hi" \
-o cloned_output.wav
API endpoint
Base URL: https://api.fonada.ai
Endpoint: /v1/voice-clone/chunks
Method: POST
Content-Type: multipart/form-data
Authorization: Bearer YOUR_FONADA_API_KEY
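For environments without a third-party HTTP client, the request described above could be assembled with the Python standard library alone. This is a minimal sketch for text-only fields: `encode_multipart` and `build_request` are illustrative helper names, not part of any Fonada SDK, and file uploads (the enterprise `audio` field) would need a filename and per-part content type that this sketch leaves out.

```python
import urllib.request
import uuid

API_URL = "https://api.fonada.ai/v1/voice-clone/chunks"


def encode_multipart(fields: dict[str, str]) -> tuple[bytes, str]:
    """Encode text-only form fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(api_key: str, fields: dict[str, str]) -> urllib.request.Request:
    """Build (but do not send) the POST request for the endpoint above."""
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": content_type,
        },
    )
```

The request can then be sent with `urllib.request.urlopen(req, timeout=180)` and the response bytes written to a file.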
Synthesis parameters (multipart form)
| Field | Type | Default | Description |
|---|---|---|---|
| text | string | required | Text to synthesize in the cloned voice. Empty → 400 empty_text. |
| language | string | hi | ISO code (e.g. hi, en, ta) — not the full language name used in Fonada V1. |
| speed | float | 1.0 | Speed factor, range 0.25–4.0. Overridden by duration if both set. |
| duration | float | none | Fixed output duration in seconds, range 0.1–60.0. Overrides speed; single-chunk text only. |
| instruct | string | none | Comma-separated style hints (e.g. "angry, whispering"), forwarded verbatim. |
| audio_text | string | none | Transcript of the reference clip. Optional for uploads (auto-transcribed if omitted); required for audio_url. Ignored when share_id is used. |
| output_audio_codec | string | wav | Output audio format - see Output audio formats below. |
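The ranges and the speed/duration precedence in the table can be enforced before the form is built. The following sketch assumes both ranges are inclusive and drops `speed` whenever `duration` is set, since the server would override it anyway. `build_synthesis_fields` is a hypothetical helper name, and the exceptions simply reuse the documented error codes as messages.

```python
from typing import Optional

SPEED_RANGE = (0.25, 4.0)      # from the table; inclusive bounds are an assumption
DURATION_RANGE = (0.1, 60.0)   # seconds; inclusive bounds are an assumption


def build_synthesis_fields(
    text: str,
    language: str = "hi",
    speed: float = 1.0,
    duration: Optional[float] = None,
    instruct: Optional[str] = None,
    output_audio_codec: str = "wav",
) -> dict[str, str]:
    """Validate parameters locally and return the form fields to send."""
    if not text.strip():
        raise ValueError("empty_text")
    if not SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]:
        raise ValueError("invalid_speed")
    fields = {
        "text": text,
        "language": language,
        "output_audio_codec": output_audio_codec,
    }
    if duration is not None:
        if not DURATION_RANGE[0] <= duration <= DURATION_RANGE[1]:
            raise ValueError("invalid_duration")
        # duration overrides speed, so speed is left out of the form.
        fields["duration"] = str(duration)
    elif speed != 1.0:
        fields["speed"] = str(speed)
    if instruct:
        fields["instruct"] = instruct
    return fields
```

The reference-voice field (`share_id`, `audio`, or `audio_url`) still has to be added to the returned dictionary before sending.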
Output audio formats
The model always produces 24 kHz PCM WAV; non-wav values are transcoded server-side. Set the output_audio_codec form field to control the format.
| Codec | Aliases | Encoding | Sample rate | Content-Type | Long text |
|---|---|---|---|---|---|
| wav | — | PCM WAV (default) | 24 kHz | audio/wav | Any length |
| mp3 | — | MP3 (128 kbps) | 24 kHz | audio/mpeg | Max 450 chars |
| opus | — | Opus in Ogg (128 kbps) | 48 kHz ¹ | audio/ogg | Max 450 chars |
| pcm | linear16 | Raw headerless 16-bit LE PCM | 24 kHz | audio/pcm | Any length |
| mulaw | ulaw | μ-law (telephony) | 8 kHz | audio/basic | Any length |
| alaw | — | A-law (telephony) | 8 kHz | audio/x-alaw-basic | Any length |
¹ Opus always operates at 48 kHz by design (the encoder upsamples the 24 kHz source).
pcm, mulaw, and alaw are headerless raw streams (see Playing raw output below). mp3 / opus cannot be concatenated across chunks, so the SDK produces them in a single request capped at 450 characters.
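The codec table translates naturally into a lookup that resolves aliases and flags over-long text for mp3 / opus before a request is made. This is a sketch only: the values are copied from the table above, while `CodecInfo`, `resolve_codec`, and the case-insensitive matching are assumptions rather than documented behaviour.

```python
from typing import NamedTuple, Optional


class CodecInfo(NamedTuple):
    content_type: str
    sample_rate_hz: int
    max_chars: Optional[int]  # None means any length


CODECS = {
    "wav": CodecInfo("audio/wav", 24000, None),
    "mp3": CodecInfo("audio/mpeg", 24000, 450),
    "opus": CodecInfo("audio/ogg", 48000, 450),
    "pcm": CodecInfo("audio/pcm", 24000, None),
    "mulaw": CodecInfo("audio/basic", 8000, None),
    "alaw": CodecInfo("audio/x-alaw-basic", 8000, None),
}
ALIASES = {"linear16": "pcm", "ulaw": "mulaw"}


def resolve_codec(name: str, text: str = "") -> tuple[str, CodecInfo]:
    """Normalize a codec name and check the text length limit, if any."""
    key = name.strip().lower()  # case-insensitivity is an assumption
    key = ALIASES.get(key, key)
    if key not in CODECS:
        raise ValueError("invalid_output_audio_codec")
    info = CODECS[key]
    if info.max_chars is not None and len(text) > info.max_chars:
        raise ValueError(f"{key} is limited to {info.max_chars} characters")
    return key, info
```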
Response: raw audio bytes in the requested codec (WAV by default). Use -o filename.ext to save the file.
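Because pcm and mulaw responses carry no header, it can be convenient to wrap the returned bytes in a WAV container so ordinary players open them. The sketch below uses only the standard library and assumes mono audio at the sample rates listed in the table. The μ-law expansion follows the standard G.711 decoding formula, and the function names are illustrative.

```python
import io
import struct
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit little-endian mono PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 16-bit samples
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buf.getvalue()


def ulaw_byte_to_linear(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~u & 0xFF
    sign, exponent, mantissa = u & 0x80, (u >> 4) & 0x07, u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def mulaw_to_wav(data: bytes) -> bytes:
    """Decode an 8 kHz mu-law stream and wrap it as 16-bit PCM WAV."""
    samples = struct.pack(
        f"<{len(data)}h", *(ulaw_byte_to_linear(b) for b in data)
    )
    return pcm_to_wav(samples, sample_rate=8000)
```

For example, `open("cloned.wav", "wb").write(pcm_to_wav(resp_bytes))` would turn a saved pcm response into a playable file.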
Private voices: Only the voice owner (or an admin) may synthesize with a private voice's share_id. Public voices can be used by any authenticated customer with a valid API key.
Response headers
- Content-Type: media type for the codec (see table above)
- Content-Disposition: inline; filename="voice_clone.<ext>"
- X-Output-Codec: normalized codec name (wav, mp3, opus, pcm, mulaw, alaw)
- X-Processing-Time-Ms: end-to-end server processing time
- X-Upstream-Time-Ms: upstream cloner round-trip time
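The timing headers make it possible to separate upstream synthesis time from the rest of the server-side work. The sketch below accepts any header mapping (for example `resp.headers` from an HTTP client) and assumes both timing values are plain numbers in milliseconds. The `overhead_ms` figure is a derived convenience, not something the API returns.

```python
from typing import Mapping, Optional


def _to_float(value: Optional[str]) -> Optional[float]:
    """Parse a numeric header value, returning None if absent or malformed."""
    try:
        return float(value) if value is not None else None
    except ValueError:
        return None


def summarize_response_headers(headers: Mapping[str, str]) -> dict:
    """Collect the documented response headers into a small summary dict."""
    lowered = {key.lower(): value for key, value in headers.items()}
    processing = _to_float(lowered.get("x-processing-time-ms"))
    upstream = _to_float(lowered.get("x-upstream-time-ms"))
    overhead = (
        processing - upstream
        if processing is not None and upstream is not None
        else None
    )
    return {
        "codec": lowered.get("x-output-codec"),
        "content_type": lowered.get("content-type"),
        "processing_ms": processing,
        "upstream_ms": upstream,
        "overhead_ms": overhead,  # transcode, queueing, and so on
    }
```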
Playing raw output
pcm, mulaw, and alaw are headerless — tell the player the format and sample rate:
# PCM (s16le, 24 kHz, mono)
ffplay -f s16le -ar 24000 -ac 1 cloned.pcm
# mu-law (8 kHz, mono)
ffplay -f mulaw -ar 8000 -ac 1 cloned.ulaw
# A-law (8 kHz, mono)
ffplay -f alaw -ar 8000 -ac 1 cloned.alaw
# Convert raw mu-law to a playable WAV
ffmpeg -f mulaw -ar 8000 -ac 1 -i cloned.ulaw cloned.wav
More REST examples
curl -X POST "https://api.fonada.ai/v1/voice-clone/chunks" \
-H "Authorization: Bearer YOUR_FONADA_API_KEY" \
-F "share_id=YOUR_SHARE_ID" \
-F "text=Hello from voice cloning." \
-F "language=en" \
-F "output_audio_codec=mp3" \
-o cloned.mp3
# Telephony mu-law (8 kHz) for IVR / SIP
curl -X POST "https://api.fonada.ai/v1/voice-clone/chunks" \
-H "Authorization: Bearer YOUR_FONADA_API_KEY" \
-F "share_id=YOUR_SHARE_ID" \
-F "text=Welcome to support. How can I help you today?" \
-F "language=en" \
-F "output_audio_codec=mulaw" \
-o cloned.ulaw
# Enterprise: clone from your own reference audio + style + fixed duration
curl -X POST "https://api.fonada.ai/v1/voice-clone/chunks" \
-H "Authorization: Bearer YOUR_FONADA_API_KEY" \
-F "text=This is a custom cloned voice." \
-F "language=en" \
-F "audio=@/path/to/reference.wav" \
-F "audio_text=Exact transcript of the reference recording." \
-F "instruct=cheerful, energetic" \
-F "duration=5.0" \
-o cloned.wav

import httpx

with open("cloned.mp3", "wb") as f:
    resp = httpx.post(
        "https://api.fonada.ai/v1/voice-clone/chunks",
        headers={"Authorization": "Bearer YOUR_FONADA_API_KEY"},
        data={
            "text": "Hello from voice cloning.",
            "language": "en",
            "share_id": "YOUR_SHARE_ID",
            "output_audio_codec": "mp3",
        },
        timeout=180.0,
    )
    resp.raise_for_status()
    print("codec:", resp.headers.get("X-Output-Codec"))
    print("upstream ms:", resp.headers.get("X-Upstream-Time-Ms"))
    f.write(resp.content)

Python SDK (model v2)
Voice cloning is exposed through the existing TTSClient - set model="v2" and provide a reference voice. A dedicated VoiceCloneClient is also available if you prefer a voice-clone-only entry point. The SDK auto-splits long text, paces requests to the rate limit, and merges the audio for you.
pip install fonadalabs

from fonadalabs.tts.client import TTSClient

client = TTSClient(api_key="YOUR_FONADA_API_KEY")
audio_bytes = client.generate_audio(
    text="Hello! This speech is generated with voice cloning.",
    language="English",
    model="v2",
    share_id="YOUR_SHARE_ID",
    output_file="output.wav",
)
print(f"Generated {len(audio_bytes):,} bytes of WAV audio")

import asyncio
from fonadalabs.tts.client import TTSClient

async def main():
    client = TTSClient()
    audio = await client.generate_audio_async(
        "Hello from async voice cloning.",
        language="English",
        model="v2",
        share_id="YOUR_SHARE_ID",
    )
    return audio

asyncio.run(main())

Long text & chunking
- Input is split in a sentence-aware way (soft limit 100, hard limit 200 characters per chunk); you do not chunk text yourself.
- A per-API-key request tunnel paces calls (~10 req/min) so long jobs do not hit rate limits. Estimate: minutes ≈ chunks ÷ 10.
- wav / pcm / mulaw / alaw merge cleanly at any length; mp3 / opus are single-request, max 450 chars.
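To get a feel for how long a job might take, the chunking rule can be approximated locally. The sketch below is not the SDK's actual splitter: it packs whole sentences up to the soft limit, breaks any single sentence longer than the hard limit at word boundaries, and then applies the minutes ≈ chunks ÷ 10 estimate. The sentence-boundary regex and both function names are assumptions.

```python
import re

SOFT_LIMIT = 100  # preferred chunk size from the list above
HARD_LIMIT = 200  # no chunk may exceed this


def split_text(text: str, soft: int = SOFT_LIMIT, hard: int = HARD_LIMIT) -> list[str]:
    """Approximate sentence-aware chunking (illustrative, not the SDK's code)."""
    # Split after ., !, ? or the Devanagari danda when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Break an oversized sentence at spaces so no piece exceeds `hard`.
        pieces = []
        while len(sentence) > hard:
            cut = sentence.rfind(" ", 0, hard + 1)
            if cut <= 0:
                cut = hard  # no space available: cut mid-word
            pieces.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if sentence:
            pieces.append(sentence)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if current and len(candidate) > soft:
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


def estimate_minutes(chunk_count: int, requests_per_minute: int = 10) -> float:
    """Apply the rule of thumb minutes ≈ chunks ÷ 10."""
    return chunk_count / requests_per_minute
```

Under this approximation, a text that splits into 25 chunks would take roughly 2.5 minutes at the paced rate.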
Rate limits
| Period | Limit |
|---|---|
| Minute | 10 requests |
| Hour | 100 requests |
| Day | 500 requests |
Plus a global concurrency cap of 40 in-flight requests. Exceeding any of these limits returns 429 rate_limit_exceeded with a retry_after_seconds hint.
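A client that calls the endpoint directly may want to honour the retry hint. The sketch below computes a wait time from a 429 response body. It assumes `retry_after_seconds` sits inside the `detail` object of the error JSON, which this page does not spell out, and it falls back to exponential backoff when the hint is missing. The function name and the 60-second cap are illustrative choices.

```python
import json
from typing import Optional


def retry_delay_seconds(
    status: int, body: str, attempt: int, cap: float = 60.0
) -> Optional[float]:
    """Return how long to wait before retrying, or None if a retry is pointless."""
    if status != 429:
        return None
    try:
        detail = json.loads(body).get("detail", {})
    except (ValueError, AttributeError):
        detail = {}
    if not isinstance(detail, dict):
        detail = {}
    if detail.get("error") == "credits_exhausted":
        return None  # also a 429, but waiting will not restore credits
    hint = detail.get("retry_after_seconds")  # assumed location of the hint
    if isinstance(hint, (int, float)) and hint >= 0:
        return min(float(hint), cap)
    return min(2.0 ** attempt, cap)  # fallback: exponential backoff
```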
Errors
All errors return JSON: {"detail": {"error": "<code>", "message": "<text>"}}
| HTTP | error | Cause |
|---|---|---|
| 400 | empty_text | text was empty / whitespace |
| 400 | invalid_output_audio_codec | Codec not in the supported set |
| 400 | ambiguous_audio_source | More than one reference source supplied |
| 400 | missing_audio_source | No reference source supplied |
| 400 | invalid_speed / invalid_duration | speed or duration out of range |
| 400 | audio_too_large / audio_too_long | Reference upload exceeds 10 MB / 30 s |
| 403 | enterprise_only | audio / audio_url on a non-enterprise tier |
| 403 | invalid_api_key / inactive_api_key | Bad or disabled API key |
| 404 | share not found | Unknown share_id |
| 429 | credits_exhausted | Insufficient credits / credit limit reached |
| 429 | rate_limit_exceeded | Per-user rate limit or concurrency cap reached |
| 500 | encoding_failed | Server-side transcode to the requested codec failed |
| 502 | upstream_error | Upstream cloner failed, was unreachable, or returned no audio |
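The documented error shape can be turned into a typed exception so calling code can branch on the error code instead of parsing strings. This is a sketch under assumptions: `VoiceCloneError` and `parse_error` are hypothetical names, treating `rate_limit_exceeded` and `upstream_error` as retryable is a judgment call, and the plain-string `detail` branch is a guess at how the 404 "share not found" case might be returned.

```python
import json

# Judgment call: only these codes are treated as worth retrying.
RETRYABLE = {"rate_limit_exceeded", "upstream_error"}


class VoiceCloneError(Exception):
    """Error parsed from a {"detail": {"error": ..., "message": ...}} body."""

    def __init__(self, status: int, code: str, message: str):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message

    @property
    def retryable(self) -> bool:
        return self.code in RETRYABLE


def parse_error(status: int, body: str) -> VoiceCloneError:
    """Build a VoiceCloneError from an HTTP status and raw response body."""
    try:
        payload = json.loads(body)
    except ValueError:
        payload = None
    detail = payload.get("detail") if isinstance(payload, dict) else None
    if isinstance(detail, dict):
        return VoiceCloneError(
            status,
            str(detail.get("error", "unknown")),
            str(detail.get("message", "")),
        )
    if isinstance(detail, str):
        # Assumed shape for errors such as "share not found".
        return VoiceCloneError(status, detail, detail)
    return VoiceCloneError(status, "unknown", body[:200])
```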
Related endpoints
| Endpoint | Purpose |
|---|---|
| POST /v1/voice-clone/chunks | One-shot synchronous clone (this doc) — returns audio inline |
| POST /v1/voice-clone/generate | Asynchronous clone with storage chunks + job tracking (for very long text) |
| GET /v1/voice-clone/languages | List supported language codes |
Browser / playground integration: The Fonadalabs web app calls the fonada-api Supabase edge function with action: 'voice-clone-tts' so your API key stays server-side. For long-running jobs the platform may use async chunk delivery; the direct /v1/voice-clone/chunks call above is the synchronous inline-audio response shape for integrations.
See also: Voice Arena documentation for creating voices, share IDs, and pricing.
Voice quality
Klone V2 Pro
Voice cloning
Community voices from Voice Arena — clone-style synthesis via share_id and language code.
- API endpoint: /v1/voice-clone/chunks
- Languages: 50+ (ISO codes)
- Voice param: share_id
- Billing: 0.5 credits / character
- Output: wav, mp3, opus, pcm, mulaw, alaw
FAQ
What is Klone V2 Pro?
Klone V2 Pro is our voice-cloning model. Instead of choosing a fixed system voice, you synthesize speech in the style of a Voice Arena community voice identified by an 8-character share_id, or (on enterprise tiers) from your own reference audio.