Teaching a machine to say sàndali · Alessandro Catorcini

The hook: a $150 problem with a surprising bottom

I write historical novels. They exist in three languages — English, Italian, Romanian — and I’d like them to exist as audiobooks too. The obvious path is ElevenLabs: paste the text, pick a voice, pay roughly $100–150 per book, done. Multiply that across a back catalog and every future title in three languages, and “just pay for it” stops being obviously correct. So I did what any reasonable person with a couple of RTX 3090s and an unhealthy tolerance for yak-shaving would do: I decided to see whether I could build the pipeline myself, at zero marginal cost per book.

The English side turned out to be easy. The Italian side turned out to be a doctoral thesis in disguise.

This is the story of how a straightforward “evaluate some TTS engines” afternoon became a multi-day engineering project spanning speech synthesis, Italian phonology, open-source licensing law, an LLM bake-off, and finally a from-a-clean-base fine-tune of a 0.5B model on 124 hours of public-domain audiobook narration. Along the way almost every initial assumption got mugged by reality. Those muggings are the interesting part, so I’ll dwell on them.

I set up four TTS systems in isolated virtualenvs on the render box (dual 3090s), pointed them all at the same ~700-word excerpt of one of my English chapters, and gave each the same reference recording of my own voice to clone. Then I listened blind and wrote down verdicts before looking at which was which.

System	Voice	Verdict
F5-TTS	cloned	Clearly my voice. A bit quiet, occasional artifacts. Best identity match.
Kokoro (am_adam preset)	preset	Clear, precise, a little artificial. Best raw quality — but not my voice.
Tortoise (fast preset)	cloned	Inaudible, artifact soup. The “fast” preset destroys quality.
XTTS v2	cloned	Stop-and-go cadence from a paragraph-chunking bug. Between the other two.

The takeaway was a classic engineering tension: Kokoro won on raw quality, F5 won on voice identity, and nobody won on both. XTTS’s stutter was clearly a plumbing bug (it was chunking on paragraphs), not a model limitation. So for English, the path was clear enough: XTTS with proper sentence-level chunking, or F5 if identity mattered more than polish.
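A minimal sketch of what sentence-level chunking could look like, assuming a naive punctuation-based splitter. The function name and the `max_chars` limit are illustrative assumptions, not XTTS's API or the pipeline's actual code.

```python
import re

def sentence_chunks(text: str, max_chars: int = 250) -> list[str]:
    """Split text into sentence-sized chunks, never returning empty ones."""
    # Split after ., !, ? or an ellipsis when followed by whitespace (naive on purpose).
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:  # skip empty pieces so no empty chunk is ever emitted
            continue
        # Pack short sentences together, but start a new chunk past max_chars.
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Paragraph breaks are treated as ordinary whitespace here, so a blank line never becomes a chunk boundary on its own.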

Then I switched languages, and the floor fell out.

Act II — Italian breaks everything, and it’s not a bug

I recorded 52 seconds of myself speaking Italian as a reference and ran the same excerpt-cloning exercise on my Italian novel, Lo Scudo Più Lontano. I iterated XTTS through four revisions, fixing real bugs each time: paragraph→sentence chunking, an empty-chunk guard, trailing-silence trimming, a 12 ms cosine crossfade between chunks. XTTS v4 was genuinely good — correct phonetics, decent prosody, slightly stiff. Then I tried the models everyone raves about:

Fish-Speech 1.5 — perfect rhythm and intonation. And then it face-plants on diacritics: perché, città come out wrong.
OpenAudio-S1-mini — also perfect rhythm. And it still mangles words like sàndali (sandals).
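For scale, the chunk-joining fix from the XTTS revisions is small. A minimal pure-Python sketch of a 12 ms raised-cosine crossfade between two mono chunks, assuming float samples held in plain lists and a 24 kHz sample rate (the function name and defaults are illustrative, not the pipeline's actual code):

```python
import math

def cosine_crossfade(a: list[float], b: list[float],
                     sample_rate: int = 24_000, fade_ms: float = 12.0) -> list[float]:
    """Join two mono chunks, overlapping them with a raised-cosine crossfade."""
    n = min(int(sample_rate * fade_ms / 1000), len(a), len(b))
    if n == 0:
        return a + b
    out = a[:-n]
    for i in range(n):
        # w rises smoothly from near 0 to near 1; the two weights always sum to 1.
        w = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / n)
        out.append(a[len(a) - n + i] * (1.0 - w) + b[i] * w)
    out.extend(b[n:])
    return out
```

The joined clip is shorter than the two chunks laid end to end by the length of the fade, which is why trailing-silence trimming and crossfading interact.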

Here’s the thing that took me an embarrassingly long time to accept: this is not a bug you can fix with better chunking or conditioning. It’s a property of the Italian language colliding with how these models were trained.

The proparoxytone problem

Italian orthography does not mark lexical stress on most words. You write sandali, and a competent reader knows it’s sàn-da-li — stress on the third-from-last syllable (a proparoxytone, or in Italian sdrucciola). But nothing in the spelling tells you that. The default statistical guess — the one every English-primary multilingual model reaches for — is to stress the penultimate syllable: san-dà-li. Wrong. Confidently, ridiculously wrong, in a way that makes a native listener’s skin crawl.

English-trained models have no Italian pronunciation lexicon. So on any word where the stress isn’t where the “usual” rule would put it, they guess, and they guess penultimate. Proparoxytones are the biggest bucket of exceptions, and Italian is full of them: àlbero, màcchina, fàbbrica, telèfono, tàvolo, pèttine, nùvola, mùsica, mèdico…

At this early stage I over-generalized wildly — I heard a long garbled Fish-1.5 passage and concluded “it gets all the early-accent words wrong.” That turned out to be false, and correcting it later mattered a lot. But in the moment, the conclusion was stark enough that I made a pragmatic call: ElevenLabs for Italian production, local pipeline for English. Ship the easy win, pay for the hard one.

That should have been the end of it. Except the problem was too interesting to leave alone — and it smelled like a product.

Act III — The rabbit hole has a floor, and the floor is a taxonomy

The more I poked at Italian stress, the more structure appeared. This wasn’t one problem; it was a family of them, and they had genuinely different shapes. With a native ear (mine) doing the labeling, a taxonomy fell out, with five categories that collapse into a handful of engineering strategies:

1. Singleton mis-stresses. Words like sàndali that the model just gets wrong. Context-free. A word is either on the naughty list or it isn’t. → a hand-curated exception lexicon.

2. Heterophonic homographs — omografi non omofoni, the truly nasty ones. Same spelling, two stress positions, two meanings, and you can only tell which from context:

prìncipi (princes) vs princìpi (principles)
àncora (anchor) vs ancóra (still / again)
sùbito (immediately) vs subìto (undergone)
càpitano (they happen) vs capitàno (captain)
séguito (retinue) vs seguìto (followed)
nòcciolo (pit/core) vs nocciòlo (hazel tree)

A pure dictionary lookup cannot solve these, because the key collides. You need to know what the sentence means.
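The collision is easy to see mechanically. A small sketch, assuming the stress marks are ordinary grave and acute accents that can be removed with Unicode decomposition (the helper name is hypothetical):

```python
import unicodedata

def strip_marks(word: str) -> str:
    """Remove grave/acute accents, giving the spelling as normally printed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if c not in "\u0300\u0301")

PAIRS = [("prìncipi", "princìpi"), ("àncora", "ancóra"), ("sùbito", "subìto"),
         ("càpitano", "capitàno"), ("séguito", "seguìto"), ("nòcciolo", "nocciòlo")]

# Key every stressed form by its unmarked spelling, as a dictionary lookup would.
lookup: dict[str, list[str]] = {}
for first, second in PAIRS:
    for form in (first, second):
        lookup.setdefault(strip_marks(form), []).append(form)
```

Every key in `lookup` ends up holding two candidate readings, so the key alone cannot choose between them.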

3. Productive enclitic/conjugated pairs. vèstiti (get dressed!, imperative + clitic) vs vestìti (dressed, participle). pèntiti, divèrtiti, concèntrati. This is not a list you can enumerate; it’s generated by the grammar. An imperative verb keeps its stem stress and tacks on unstressed pronouns, so the added syllables leave that stress in proparoxytone position; the identically spelled participle stresses its ending instead. Thousands of pairs, all following a rule. → a morphological generator, not a list.

4. Vowel-quality pairs — pèsca (peach) vs pésca (fishing), lègge (law) vs légge (he reads). Same stress position, different vowel aperture (open è/ò vs closed é/ó). Real, but even native speakers from most regions neutralize this distinction, and enforcing it can sound hypercorrect. → deferred to a premium tier.

5. Consonant voicing — razza /rattsa/ (breed) vs /raddza/ (ray). Not even representable in text. → accept-and-flag, premium at best.

The clarifying move was mapping these categories onto three engine buckets: generate (morphological rules over a stressed-lemma lexicon), detect (runtime part-of-speech/morphology disambiguation), and curate (a hand list for the irreducible residue). The core intellectual property, it became clear, was going to be a stress-marked lemma lexicon plus a morphological engine, because the homograph and enclitic categories (2 and 3) could largely be generated or detected, leaving only the true singletons of category 1 to list by hand.

I gave the thing a name: accento. A stress-injection front-end that sits in front of any TTS model. Its whole job is to look at Italian text and mark the stresses the model would otherwise get wrong, before handing the text off. Crucially, I confirmed the intervention point was clean: inspecting the S1-mini frontend showed accent marks are not stripped — an accented character becomes its own byte-level token (the accent is token 6362), reaches the model intact, and reliably moves the stress. Feed the model sàndali and it says SAN-da-li. The lever worked.

Act IV — “Are any of these words actually wrong?” The A/B that saved months of wasted work

Before building a lexicon, I did the thing I should always do first and often don’t: I ran a controlled experiment to find out how big the problem actually was, instead of trusting my earlier gut impression.

Six sentences. About twelve proparoxytones. Peak-normalized audio so loudness wouldn’t bias my ear. Native-speaker judged (me, carefully, blind to which was baseline). The result reframed the entire project:

S1-mini already gets ~90%+ of Italian proparoxytone stress right out of the box.

Of the ~12 test words, exactly one — sandali — was wrong in the baseline. àlbero, macchina, fabbrica, capita, telefono, tavolo, pettine, nuvola, musica, medico, rapida — all correct, unprompted. My earlier “it gets everything wrong” was an artifact of one bad long passage. The real problem wasn’t 100% of proparoxytones; it was the ~10% the model happens to miss.

Meta-lesson #1: measure the problem before you build the solution. I was one gut-impression away from hand-building a 120,000-word pronunciation lexicon to solve a problem that needed a few hundred entries.

This flipped the product shape entirely. No acoustic fine-tune needed (yet). No giant lexicon. Just S1-mini + a stress-injection preprocessor driven by a curated exception list of the ~10% the model mis-stresses — plus the homograph disambiguator and the morphological generator for the productive cases. And a hand-curated few-hundred-word list is, by construction, clean IP: it sidesteps every licensing landmine in the existing Italian pronunciation resources. (More on those landmines shortly — they’re a saga.)

Later, running accento’s own acoustic detector against a real chapter of my prose (543 content words, 59 proparoxytones) put a number on the pain: the true model mis-stress rate on proparoxytones is roughly 20–30%. Since proparoxytones are only ~11% of running text (59 of 543), that works out to about 12–18 mis-stressed words in the chapter, or roughly 2–3% of content words. So it’s consistent with “90%+ of words correct overall”, and it’s still more than enough to sound ridiculous several times a page. Both things are true at once. The 90% is why the model is usable; the 20–30% is why it needs accento.

Act V — Building accento: lexicon, morphology, and an acoustic lie-detector

With the shape settled, the build went fast, in green-tests-only increments (ruff + mypy --strict + pytest, permissive dependencies only).

The exception lexicon (Stage 1) — context-free singleton fixes. Minimal-intervention by design: the injector only adds marks, and a property test asserts strip_stress(output) == input, so it can never corrupt text, only annotate it.
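A minimal sketch of that add-only property. It rests on one assumption that should be stated loudly: Italian orthographic accents sit on the final vowel (città, perché), so any accent on a non-final letter is treated as an injected stress mark. The two-entry lexicon and the function bodies are illustrative, not accento's real code.

```python
import re
import unicodedata

# Hypothetical exception lexicon: plain spelling -> stress-marked spelling.
EXCEPTIONS = {"sandali": "sàndali", "tacere": "tacére"}
WORD = re.compile(r"[^\W\d_]+")  # runs of letters, accented ones included
ACCENTS = "\u0300\u0301"         # combining grave and acute

def inject_stress(text: str) -> str:
    """Mark only lowercase words found in the exception lexicon."""
    def mark(m: re.Match) -> str:
        marked = EXCEPTIONS.get(m.group(0))
        return unicodedata.normalize("NFC", marked) if marked else m.group(0)
    return WORD.sub(mark, unicodedata.normalize("NFC", text))

def strip_stress(text: str) -> str:
    """Remove accents from non-final letters; keep final ones (città, perché)."""
    def unmark(m: re.Match) -> str:
        word = m.group(0)
        head = unicodedata.normalize("NFD", word[:-1])
        head = "".join(c for c in head if c not in ACCENTS)
        return unicodedata.normalize("NFC", head) + word[-1]
    return WORD.sub(unmark, unicodedata.normalize("NFC", text))
```

For NFC input, `strip_stress(inject_stress(text)) == text` is the round-trip a property test would assert. Capitalised words are simply skipped in this sketch rather than re-cased.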

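Stage 2a: POS disambiguation. Many homograph pairs differ by part of speech (àncora the noun after an article vs ancóra the adverb). A part-of-speech tagger plus rules resolves these deterministically. spaCy’s it_core_news_sm pipeline does the tagging. (spaCy the library is MIT-licensed; the pretrained Italian pipeline carries its own license, which has to be checked separately from the library’s.)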
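A sketch of the rule layer, assuming the tagger has already produced (word, coarse POS) pairs with Universal POS tags. The rule table and function name are hypothetical, and no tagger is called here, so the example runs without any model download.

```python
# (word, coarse POS) pairs as a tagger might emit them.
Tagged = list[tuple[str, str]]

# Hypothetical rule table: plain spelling -> {POS tag: stress-marked form}.
POS_RULES = {
    "ancora": {"NOUN": "àncora", "ADV": "ancóra"},
    "capitano": {"NOUN": "capitàno", "VERB": "càpitano"},
}

def disambiguate(tokens: Tagged) -> list[str]:
    """Replace a homograph with its stressed form when the POS tag decides it."""
    out = []
    for word, pos in tokens:
        forms = POS_RULES.get(word.lower(), {})
        out.append(forms.get(pos, word))  # unknown tag: leave the word unmarked
    return out
```

Leaving the word untouched on an unexpected tag keeps the stage conservative: a tagging error degrades to the model's default guess instead of a forced wrong stress.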

Stage 2b — the LLM semantic pass. For same-part-of-speech pairs (prìncipi/princìpi, both nouns), you genuinely need to understand the sentence. An LLM annotates the intended reading. (Which LLM became a whole investigation — see Act VII.)

The morphological engine (the real IP). Two pure rule-based generators, zero new dependencies:

an enclitic generator (imperative-stem + clitic → proparoxytone, vs the paroxytone participle), which is high-yield and near-linear — feed it ~1–2k pronominal verbs, get ~1–2k homograph pairs by rule;
a plural-collapse generator (a proparoxytone -e/-o singular and an -io noun whose -io→-i plural lands on the same surface form, as with prìncipe → prìncipi and princìpio → princìpi).

On twelve hand-built gold pairs, the engine reproduced 12/12, zero misses, no rule gaps, and auto-routed each to 2a or 2b by the part-of-speech relation.
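The enclitic rule can be sketched in a few lines. This is an illustration under stated assumptions, not the engine itself: it takes the stressed tu-imperative from a lexicon as input, handles only -are and -ire verbs, and builds only the masculine-plural participle. All names are hypothetical.

```python
import unicodedata

def plain(word: str) -> str:
    """Spelling with grave/acute marks removed."""
    nfd = unicodedata.normalize("NFD", word)
    kept = "".join(c for c in nfd if c not in "\u0300\u0301")
    return unicodedata.normalize("NFC", kept)

# Stressed masculine-plural participle ending, per conjugation.
PARTICIPLE_ENDING = {"are": "àti", "ire": "ìti"}

def enclitic_pair(infinitive: str, stressed_imperative: str, clitic: str = "ti"):
    """Return (imperative+clitic, participle) if they collide in plain spelling."""
    ending = infinitive[-3:]
    if ending not in PARTICIPLE_ENDING:
        return None
    participle = infinitive[:-3] + PARTICIPLE_ENDING[ending]
    imperative = stressed_imperative + clitic  # stress stays on the stem
    if plain(imperative) == plain(participle):
        return imperative, participle
    return None
```

Given vestire and the imperative vèsti, this yields the vèstiti / vestìti pair; a verb like finire, whose imperative is finìsci, produces no collision and is dropped.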

The acoustic detector — the QC gate. This is the piece I’m proudest of and the one that taught me the most. It’s a forced-alignment-based stress detector (torchaudio’s forced_align, an Apache-2.0 Italian wav2vec2 CTC model, librosa’s pyin for pitch) that listens to the generated audio and reports where the stress actually landed. It’s both the dev-time diagnostic (“which words does this model get wrong?”) and the production QC gate (“did the model say what we injected?”).

Building it surfaced a genuinely surprising empirical fact:

S1-mini realizes injected stress as an F0 (pitch) accent, not as lengthening.

Italian stress is normally duration-dominant — stressed syllables are longer. But this model, when you inject a mark, raises the pitch rather than stretching the syllable. That’s unusual, and it meant the detector needed an F0-weighted mode to reliably tell “baseline sandali (wrong)” from “marked sandali (right).” A perfect example of a truth you only learn by measuring the actual system rather than reasoning from linguistics textbooks.
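A toy sketch of what an F0-weighted mode might compute, assuming forced alignment has already produced per-syllable durations and a pitch tracker has produced per-syllable mean F0. The scoring formula, the 0.7 weight, and the measurements below are all invented for illustration; they are not accento's detector or real data.

```python
from dataclasses import dataclass

@dataclass
class Syllable:
    text: str
    duration_s: float   # from forced alignment
    mean_f0_hz: float   # from a pitch tracker, voiced frames only

def stressed_index(syllables: list[Syllable], f0_weight: float = 0.7) -> int:
    """Index of the most prominent syllable under a weighted duration/F0 score."""
    mean_dur = sum(s.duration_s for s in syllables) / len(syllables)
    mean_f0 = sum(s.mean_f0_hz for s in syllables) / len(syllables)

    def score(s: Syllable) -> float:
        # Each cue is expressed relative to the word's own average.
        dur = s.duration_s / mean_dur
        f0 = s.mean_f0_hz / mean_f0
        return f0_weight * f0 + (1.0 - f0_weight) * dur

    return max(range(len(syllables)), key=lambda i: score(syllables[i]))

# Made-up values for a marked "sàndali": pitch peaks on the first syllable,
# while the second syllable is marginally the longest.
word = [Syllable("san", 0.18, 210.0), Syllable("da", 0.20, 170.0),
        Syllable("li", 0.17, 160.0)]
```

On these invented numbers the default weighting picks the first syllable, while a duration-only score (`f0_weight=0.0`) picks the second, which is the failure mode an F0-weighted mode is meant to avoid.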

Act VI — The licensing detective story (or: why a small player can’t just grab the data)

Here is the part that surprised me most, and the part I think generalizes furthest beyond audiobooks.

If you want to sell a model — or a lexicon, or anything trained on data — the license of every input matters in a way that’s easy to hand-wave until you actually read the terms. The framing I settled on, after reading primary sources and the 2025 wave of US copyright rulings (Bartz v. Anthropic, Kadrey v. Meta, Thomson Reuters v. Ross):

Non-commercial / ShareAlike / GPL data is barred from a sold model’s training set regardless of the unsettled “are model weights a derivative work?” question — because the non-commercial term is a separate contractual bar that fair-use / text-and-data-mining doctrine does not excuse.

Translation: even if you win the “weights aren’t a derivative” argument, you still signed (by using the data) a contract that said “not for commercial use.” So the conservative practice is to quarantine non-commercial and research-only data to evaluation only, and train exclusively on genuinely permissive sources.

For Italian speech corpora, that sorted the world into:

Train-safe (clean, commercial): Common Voice Italian (CC0, best accent diversity — but you now have to pull it from the Mozilla Data Collective, not Hugging Face, as of late 2025); VoxPopuli IT (CC0 data — but not their CC-BY-NC models); M-AILABS IT (BSD); MLS IT (CC-BY, keep attribution records). Roughly 460+ clean hours, though read-register-heavy.

Eval-only: CLIPS (the best diatopic/regional diversity in existence — and research-only, plus broadcast-audio rights, plus GDPR concerns; you’d need written commercial permission from the university that holds it). FLEURS.

Avoid entirely: EMILIA (YouTube-scraped provenance — the exact taint to stay away from); VoxForge (GPL on the corpus itself).

And then the real trap, the pronunciation lexicon — the very thing accento needs:

No free, stress-marked, clean-provenance, commercially-usable Italian pronunciation lexicon exists.

Every candidate is poisoned for a sold product: PhonItalia is non-commercial (and its site’s DNS is dead); WikiPron’s data is CC-BY-SA (ShareAlike copyleft, which forces your whole product open); eSpeak-ng is GPLv3 (fine server-side, poison to embed in an SDK); Epitran is MIT but has no stress marks (disqualified); the convenient wrappers (CharsiuG2P, gruut) sit on top of BY-SA/GPL data and inherit the taint.

This is why the exception-lexicon strategy is a moat, not just a convenience:

Meta-lesson #2: clean data provenance isn’t box-ticking for a small player — it’s a competitive necessity. As I put it to myself at one point: we’re not big enough to scoff at it the way OpenAI can. A big company can absorb a lawsuit as a cost of doing business. I can’t. So a hand-built, native-verified stress lexicon — clean by construction, with a provenance column on every row — becomes the defensible asset precisely because the clean version doesn’t exist off-the-shelf. The constraint is the moat.

I kept a licensing ledger for every source. That discipline paid off repeatedly.
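A ledger like this can be as simple as a list of records with a tier column. A minimal sketch, using only the sources and tiers named above; the field names and the `train_safe` helper are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LedgerEntry:
    source: str
    license: str
    tier: str      # "train", "eval_only", or "avoid"
    note: str = ""

LEDGER = [
    LedgerEntry("Common Voice Italian", "CC0", "train"),
    LedgerEntry("VoxPopuli IT (data)", "CC0", "train"),
    LedgerEntry("M-AILABS IT", "BSD", "train"),
    LedgerEntry("MLS IT", "CC-BY", "train", "keep attribution records"),
    LedgerEntry("CLIPS", "research-only", "eval_only"),
    LedgerEntry("EMILIA", "YouTube-scraped provenance", "avoid"),
    LedgerEntry("VoxForge", "GPL", "avoid"),
]

def train_safe(ledger: list[LedgerEntry]) -> list[str]:
    """Only sources cleared for training a sold model."""
    return [entry.source for entry in ledger if entry.tier == "train"]
```

Making the tier an explicit field means the training data loader can refuse anything that is not marked `train`, rather than relying on memory.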

Act VII — The Stage-2b bake-off: the smallest model that clears the bar

Stage 2b needs an LLM to resolve same-POS homographs from context. The instinct is to reach for a frontier model. I instead built a labeled evaluation set — 52 items, 15 of them deliberate genre-flips — and ran a ladder of models from smallest to largest, at temperature 0, through the mandatory local GPU scheduler. The rule: pick the smallest model that clears the bar.

Model	License	Overall	Genre-flips
Llama 3.2 3B	Llama Community	73%	67% — out
Qwen2.5 7B	Apache-2.0	86%	80% — just under
Mistral-Small 3.2 24B	Apache-2.0	96%	87% — winner
Aya 32B	CC-BY-NC	98%	100% — ceiling, but non-commercial
Qwen3.5 122B-a10b	Apache-2.0	100%	100% — perfect, but 81 GB resident

The winner was Mistral-Small 3.2 24B: Apache-licensed, local, free, sub-second, already on the Mac. No frontier model needed. And then a run of frontier hosted models all scored a perfect 52/52 — as did Aya, Mistral-Medium, and the 122B Qwen. Six models tied at the top.

Meta-lesson #3: the task saturated. When your hardest eval is a tie at the top across a few-cents hosted-API call and an 81 GB open model, the eval has stopped discriminating at the high end. Its real value is separating the small models (73/86/96%) from the capable tier — not choosing among the capable ones. And a saturated task means the cheapest thing that saturates it wins. For a hosted-API path that’s the cheapest small tier, not a flagship. For the actual product it’s local Mistral-Small, because API means egress + per-book cost + a dependency, and the whole point was zero marginal cost.
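The selection rule itself is mechanical. A sketch using the scores from the table; the pass thresholds (90% overall, 85% on genre-flips) are assumptions chosen for illustration, since the exact bar is not stated here, and the license filter only excludes the non-commercial entry.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    params_b: float     # total parameters, in billions
    license: str
    overall: float
    genre_flips: float

CANDIDATES = [
    Candidate("Llama 3.2 3B", 3, "Llama Community", 0.73, 0.67),
    Candidate("Qwen2.5 7B", 7, "Apache-2.0", 0.86, 0.80),
    Candidate("Mistral-Small 3.2 24B", 24, "Apache-2.0", 0.96, 0.87),
    Candidate("Aya 32B", 32, "CC-BY-NC", 0.98, 1.00),
    Candidate("Qwen3.5 122B-a10b", 122, "Apache-2.0", 1.00, 1.00),
]
NON_COMMERCIAL = {"CC-BY-NC"}

def smallest_passing(candidates, min_overall=0.90, min_flips=0.85):
    """Smallest commercially usable model that clears both thresholds."""
    eligible = [c for c in candidates
                if c.license not in NON_COMMERCIAL
                and c.overall >= min_overall
                and c.genre_flips >= min_flips]
    return min(eligible, key=lambda c: c.params_b) if eligible else None
```

Under those assumed thresholds the rule lands on Mistral-Small; raising both to 100% moves it to the 122B Qwen, which is the trade-off the ladder makes visible.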

Two more corrections fell out of this act, both worth their weight:

Genre priors do not belong in the prompt. I tried conditioning the model with a genre hint (“this is Roman historical fiction, so fòro means forum not hole”). It helped the obvious classical flips and broke the reverse cases — Mistral flipped demoni→devils despite an explicit “of hell” in context; Qwen turned pork coppa into a trophy. Small and mid models treat a soft prior as a hard override.

Meta-lesson #4: priors don’t belong in the LLM prompt. The correct architecture is: the LLM reads context only; priors (genre, frequency) get applied deterministically, downstream, as a tiebreaker on genuine ambiguity — which requires an abstain/confidence signal. So the locked v1 is Mistral-Small with a plain prompt, and the deterministic prior layer is a v2 refinement (and, not incidentally, the lever that could later lift a cheap 7B over the bar).
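A sketch of what that downstream layer might look like, assuming Stage 2b can return a reading plus some confidence or abstain signal. The prior table, the 0.8 threshold, and the function name are hypothetical; this describes the v2 idea, not shipped code.

```python
from typing import Optional

# Hypothetical genre priors, applied outside the LLM.
GENRE_PRIOR = {("roman_historical", "foro"): "forum"}

def resolve(word: str, llm_reading: Optional[str], confidence: float,
            genre: str, threshold: float = 0.8) -> Optional[str]:
    """Trust the context-only LLM when confident; otherwise fall back to the prior."""
    if llm_reading is not None and confidence >= threshold:
        return llm_reading                  # explicit context wins outright
    prior = GENRE_PRIOR.get((genre, word))
    # The prior acts only as a tiebreaker; with no prior, keep the LLM's best guess.
    return prior if prior is not None else llm_reading
```

Because the prior never reaches the prompt, an explicit cue in the sentence can no longer be overridden by a genre hint.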

The QC gate validates execution, not the decision. I’d cheerfully assumed the acoustic QC gate would recover Mistral’s missing 4% — “96% + QC → 100%.” Wrong, and the error is instructive. The gate checks whether the TTS model said what accento injected. If Stage 2b makes a decision error — picks the wrong reading, injects a valid-but-wrong stress — the TTS dutifully says the wrong word, and the gate compares the audio to the wrong target and passes it.

Meta-lesson #5: know what your QC gate actually validates. Mine validates execution (did the model pronounce what we told it to?), not decision (did we tell it the right thing?). There’s no independent oracle for the decision except a human or a better model. Conflating the two would have shipped a subtly-wrong product with a green dashboard.

Act VIII — The gut-punch: the best base is non-commercial

Everything so far was built on OpenAudio-S1-mini, the best-sounding Italian base I’d found. Then I actually read its license, and the project pivoted hard.

S1-mini is non-commercial. The Hugging Face card says CC-BY-NC-SA-4.0; the GitHub repo says a custom Fish Audio Research License that also covers the weights. They conflict on which applies — and both forbid commercial use.

Worse, the ShareAlike clause is a one-way trap: any fine-tune or derivative must stay CC-BY-NC-SA. You cannot launder it to Apache by fine-tuning. The maintainers explicitly declined to productize a commercial self-host license (the GitHub issue is closed, “not planned”); the only sanctioned commercial route is their hosted API — i.e., not self-hosting, i.e., not zero-marginal-cost. And conservatively, the non-commercial term reaches the output: selling an audiobook made with S1-mini is itself commercial use. My novels are for sale. So even the “personal” use isn’t clean if it’s monetized.

This split the product cleanly and painfully:

Personal / non-monetized track: keep S1-mini. It works, it’s good.
Any commercial track (SaaS, OSS redistribution): S1-mini is dead. Full stop.

The consolation: accento itself is base-agnostic. The lexicon, the morphological engine, the homograph inventory, the acoustic QC gate — those are facts about Italian, not about the model. Not wasted. Only the model-specific parts (which words this base gets wrong; whether this base honors injection) would need re-validation on a new base.

Act IX — The clean alternatives fail, and the answer becomes “teach one Italian”

So I went hunting for a commercially-clean base with good Italian and voice cloning. Two candidates, both 0.5B and lighter than S1:

Chatterbox Multilingual (Resemble AI, MIT) — 23 languages including Italian, but it embeds an imperceptible neural watermark in all output.
CosyVoice 2 (Alibaba, Apache-2.0) — Italian + cross-lingual cloning, no watermark.

I ran a loudness-matched blind A/B. Both failed, and the way they failed drew a bright architectural line:

CosyVoice2 was unintelligible — broken phoneme structure. Root cause: the 0.5B model has no Italian training; it fakes Italian cross-lingually using Chinese/English phonemes, so it produces the wrong sounds, not just the wrong stress.
Chatterbox ran away too fast and detonated into noise at 1:37 (an end-of-sequence/repetition glitch).

Both were “light years behind” S1-mini. And here’s the distinction that mattered:

A mispronounced caduceo is accento’s job — a stress error, fixable, expected, don’t count it against a base. But unintelligibility, artifacts, and runaway pace are base-quality failures accento cannot touch. Stress injection can’t fix phonotactics. There is no respelling that forces a model to produce a sound it never learned.

The conclusion was bleak but clarifying: clean + good-Italian + voice-cloning, at S1 quality, does not exist off the shelf. Which left exactly one honest path forward. If no clean base speaks good Italian, then take a clean base with the right architecture and permissive weights, and teach it Italian — fine-tune it.

Not train from scratch. From-scratch is a research program: tens of thousands of hours, $30–150k of compute, 6–18 months, an H100 cluster. A fine-tune gets ~90% of the quality for ~1% of the cost, and it fits on an owned A6000. The base: CosyVoice2-0.5B, weights verified Apache-2.0 (Qwen2.5-0.5B backbone, no NC rider, no watermark). I explicitly rejected OpenF5 — Apache-stamped but trained on Emilia-YODAS/YouTube data, i.e. laundered EMILIA taint. Clean provenance all the way down or not at all.

Act X — 124 hours of clean audio, and a LoRA that surprised me

Data strategy, locked: clean provenance only, because volume was never the bottleneck for a fine-tune (hundreds of hours, not tens of thousands) — provenance was. I explicitly rejected scraping Italian talk radio: copyrighted broadcast plus non-consenting real voices is exactly the taint I’d spent Act VI avoiding, and it’s fatal for a sold model. The anchor instead: LibriVox Italian — public-domain audiobook narration, which is also the ideal domain for an audiobook product — supplemented by Common Voice, MLS, M-AILABS, VoxPopuli, every source’s license logged.

The pilot: prove the thesis cheap

Before building the real corpus, a ~50-hour LoRA pilot to answer one question: can a clean Apache 0.5B model learn Italian well enough to be worth it? Two prongs ran in parallel — a data prong assembling exactly 50.00 hours (29,711 utterances, 24 kHz mono, every record carrying source/license/sha256) and a training prong standing up a CosyVoice2 LoRA rig on the A6000.

A pile of real-world gotchas got ground through here — ModelScope’s git-lfs handing back 4 KB pointer files instead of the 5.3 GB checkpoint (fixed by copying the render box’s working copy); torchrun segfaulting under WSL (fixed with a direct-Python launcher); the transformers pin at 4.51.3 because 5.x produced runaway generation. Then a correction that reset expectations: the “50 hours” was M-AILABS at 16 kHz upsampled — because it turned out M-AILABS and MLS are both only 16 kHz, and raw LibriVox is the only native-24 kHz clean source. Fine for a proof-of-concept (intelligibility and stress don’t need >8 kHz bandwidth), but the pilot output would sound band-limited even if it passed. Fidelity was explicitly not on trial yet.

The LoRA fine-tuned the LLM (text→token) component only — 20.57M of 514.6M params trainable (4%) — with the flow decoder and HiFi-GAN vocoder frozen. Three automated go/no-go gates: intelligibility (ASR word-error-rate), stress (accento’s own detector), and injection responsiveness.

The verdict: 2 of 3 gates passed — a NO-GO overall, but a strong partial GO, and the core thesis validated.

G1 intelligibility: WER 0.262 → 0.119, a 55% drop. PASS. The clearest possible signal: stock CosyVoice2 rendered a sentence as the garbled “Duccio raccorsi ai sandali di coio”; the fine-tune produced “Lucio raccolse i sandali di cuoio.” It learned Italian.
G2 stress: 2/11 → 7/11 proparoxytones correct. PASS.
G3 injection: 27% → 55%. FAIL (bar was 90%). And the cause was diagnosable, not a ceiling: the pilot trained on plain narration, so the model never once saw an injected accent mark — it couldn’t learn a mark→stress override it was never shown. That 27→55% without any marked training data was, if anything, encouraging. The fix is obvious: oversample stress-marked examples in training.
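For reference, the G1 metric is plain word-level edit distance. A minimal sketch of WER and of the relative drop quoted above; the tokenisation (lowercase, punctuation ignored) is an assumption, and the gate's real scoring may normalise text differently.

```python
import re

def words(text: str) -> list[str]:
    """Lowercased word tokens with punctuation dropped."""
    return re.findall(r"\w+", text.lower())

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = words(reference), words(hypothesis)
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def relative_drop(before: float, after: float) -> float:
    """Fractional improvement from one error rate to another."""
    return (before - after) / before
```

Scored against “Lucio raccolse i sandali di cuoio”, the garbled stock sentence has four of six words wrong, and `relative_drop(0.262, 0.119)` is about 0.546, which rounds to the 55% quoted.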

The verdict that actually mattered

Automated gates are necessary but they are not the product. So I generated a full raw chapter — my own novel, my cloned voice, no injection — from both stock and the fine-tune, loudness-matched, and listened. The fine-tune read the chapter in 508 seconds; stock dragged through the same text in 735 seconds at 1.35 s/word, stumbling over Italian it didn’t know. My verdict, written down at the time:

“Unexpectedly good.”

Meta-lesson #6: metrics are not a good read — my ear is the verdict. WER-0.119 is a number. “Unexpectedly good, and caduceo now comes out correct” is a product. The fine-tune had genuinely acquired Italian: it learned caduceo (stock failed it), and its shorter runtime was natural cadence (0.68 s/word) rather than dragging. No metric told me it was shippable; my ear did.

That by-ear read also cleanly separated two remaining error classes, and the separation is now an architectural law:

tacere mis-stressed (TÀ-cere for ta-CÉ-re) → a stress error → accento’s lane, fixable by the lexicon + the injection follow-up.
chiarore with the “ch” read as a palatal /tʃ/ instead of the correct /k/ → a consonant/G2P error → outside accento entirely. Stress injection can’t change phonetics. This one gets fixed by training — a fuller, cleaner fine-tune teaching the systematic ch + e/i = /k/ rule the 16 kHz pilot only half-absorbed.

Architecture law: accento owns stress; the base model owns phonetics. sàndali and tacere are the front-end’s job. chiarore is training’s job. Knowing which bucket a defect falls into tells you which lever to pull.

Act XI — The production corpus, and clearing the A6000

With the thesis proven by ear, I built the real thing: a native-24 kHz corpus from raw public-domain LibriVox Italian. Catalog the solo-reader recordings via the API, download, convert to 24 kHz mono, then VAD-segment into 3–15 s clips and transcribe with faster-whisper large-v3 (which ran at ~51× realtime — the whole corpus in about 3 hours, 86% retention), then dedup against the M-AILABS overlap.

Final corpus: 123.95 clean hours at 24 kHz, 42,480 utterances, 15 readers, entirely public domain, every utterance in a JSONL manifest with provenance and a SHA-256. Deduplication dropped 7,206 M-AILABS-overlap utterances, and the final manifest shows zero rejects and zero dupes. That 124 hours is short of the 150–300 hours the plan wanted, but it still clears the “is it enough?” bar: the pilot got “unexpectedly good” on 38 effective hours of 16 kHz, so 124 hours of native 24 kHz is a clear step up. Expandable if the eval demands it (relax the solo-reader filter, or fold in the 128 hours of supplementary 16 kHz M-AILABS).
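A sketch of one manifest record and a first-seen dedup pass. The field names are assumptions. Hashing the audio only catches byte-identical files, so overlap with a resampled corpus such as M-AILABS would need a different key (for example a normalised transcript), which is why the key is a parameter here.

```python
import hashlib
import json

def manifest_line(audio_bytes: bytes, text: str, source: str, license: str) -> str:
    """One JSONL record: transcript, provenance, and a SHA-256 of the audio."""
    record = {"text": text, "source": source, "license": license,
              "sha256": hashlib.sha256(audio_bytes).hexdigest()}
    return json.dumps(record, ensure_ascii=False)

def dedup(lines: list[str], key: str = "sha256") -> list[str]:
    """Keep the first record for each value of `key`, preserving order."""
    seen: set[str] = set()
    kept = []
    for line in lines:
        value = json.loads(line)[key]
        if value not in seen:
            seen.add(value)
            kept.append(line)
    return kept
```

Passing `key="text"` dedups on the transcript instead of the audio hash.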

One honest note on the cost of doing this on owned hardware: the A6000 lives on a shared box that also runs a couple of other GPU services. Freeing it for a multi-day training commitment meant evicting those services (once I’d confirmed the box was idle and free to use). No cloud, but not free either — GPU time is a real, contended resource, and the discipline of flag-don’t-kill on shared infrastructure is its own small operational lesson.

And a final risk that had to be cleared before committing: would CosyVoice2’s text frontend strip the à/è/ì/ò/ù marks before tokenizing and silently break injection? A one-day spike answered it: CosyVoice2’s active frontend is wetext (not the optional multi-gigabyte Chinese ttsfrd — do not install that), frontend-on vs frontend-off produces byte-identical token IDs, accented words tokenize distinctly (accent = in-vocab token 6362, the same token id as S1-mini), and the round-trip preserves accents. Injection transfers. The decision that fell out: run CosyVoice2 frontend-off and let accento own Italian text normalization (numbers→words, symbols, abbreviations, sentence splitting) — because wetext routes Italian through its English branch and would spell “3” as “three.” That makes accento’s normalization stage load-bearing, which is fine; it’s expected scope.

Every gate cleared. Green light.

Act XII — The training run that almost died at epoch 3

The full run went out: the 124-hour native-24 kHz corpus, an oversampled stress-marked subset generated by accento’s own frontend (so the marks the model trains on are byte-identical to the marks it’ll see at inference — the fix for the pilot’s failed G3 gate), a higher-rank LoRA (r=32/α=64, ~5.6% of params trainable, LLM-only, flow decoder and vocoder frozen), five epochs on the exclusive A6000. The final training file was 70,800 lines — the 42,480 plain utterances plus 28,320 marked ones, 40% marked.
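The 40% figure pins down the marked count once the plain count is fixed. A small worked check (the helper name is illustrative):

```python
def marked_count(plain_count: int, marked_fraction: float) -> int:
    """Marked examples needed so they make up marked_fraction of the final file."""
    # marked / (plain + marked) = f  =>  marked = plain * f / (1 - f)
    return round(plain_count * marked_fraction / (1.0 - marked_fraction))

plain = 42_480
marked = marked_count(plain, 0.40)
total = plain + marked
```

With 42,480 plain utterances this gives 28,320 marked ones and a 70,800-line file, matching the numbers above.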

It crashed at epoch 3. The WSL VM didn’t error — it wedged, went unresponsive, and took the GPU with it.

The autopsy is a tidy little cautionary tale about the seams between systems. The out-of-memory kill landed at the epoch boundary, where memory peaks: cross-validation eval, plus a 2 GB checkpoint write, plus a 28 GB training parquet being read off a /mnt/d 9p mount — all against a .wslconfig that capped the VM at 56 GB of a 64 GB host and a swap file that had somehow died at 0 bytes. Zero cushion. The fix was unglamorous and total: cap the VM at 48 GB, give it a real 32 GB swap, and move the training data off the slow 9p mount onto WSL-native ext4. And a second, sneakier bug surfaced in the recovery: the resume path loaded the checkpoint before injecting the LoRA adapters, which silently reset training to scratch while looking for all the world like a clean resume. I only caught it by asserting the LoRA tensors were actually present after the load. Restart-from-scratch wearing a resume costume is the kind of bug that wastes a day and a night of GPU time while the loss curve smiles at you.
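The guard that catches the fake resume can be expressed without any framework. A sketch over bare parameter names, assuming adapter tensors carry `lora_` in their names (a common convention, but an assumption here) and that the checkpoint was loaded non-strictly, so unmatched keys are dropped silently:

```python
def dropped_lora_keys(model_keys: set[str], checkpoint_keys: set[str]) -> set[str]:
    """LoRA tensors saved in the checkpoint that the model has no slot for."""
    return {k for k in checkpoint_keys if "lora_" in k and k not in model_keys}

def check_resume(model_keys: set[str], checkpoint_keys: set[str]) -> None:
    """Fail loudly if loading would silently discard the trained adapters."""
    dropped = dropped_lora_keys(model_keys, checkpoint_keys)
    if dropped:
        raise RuntimeError(
            f"{len(dropped)} LoRA tensors in the checkpoint have no matching "
            "parameter; inject the adapters before loading the checkpoint"
        )
```

Run against the model's parameter names just before the load, this turns a silent restart from scratch into an immediate error.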

Then came my favorite small humiliation of the whole project. Mid-run I checked the box: GPU at 38°C, near-zero utilization. I confidently theorized a monitoring artifact — surely the telemetry was stale, the job was still grinding. Then I did the arithmetic: there’s no way there’s load at 38°C. A GPU under a training run runs hot; one sitting at 38°C simply has no work in front of it. The telemetry wasn’t lying. The GPU was genuinely idle — but not because the job had crashed. Because it had finished, cleanly, at epoch 4, and the post-training sequence hadn’t auto-fired (a network blip had killed the poller). I’d invented a sensor ghost to avoid the simpler truth that the thing had worked. Twice now, in two different registers, the lesson was the same: trust the instrument, not the story you’d prefer.

The merged model — call it accento-full — dropped in as a stock-compatible checkpoint, zero missing keys. It had learned Italian, in my voice, at 24 kHz. Now it had to become an audiobook, which turned out to be a different and longer problem than making the model good.

Act XIII — Generate-and-select: stop trying to make the model perfect

Here is the reframe that unlocked the back half of the project, and it’s the one idea I’d most want a reader to steal.

Injection is stochastic. Even on the fine-tuned model, marking sàndali and asking for audio lands the stress correctly only ~60–72% of the time on any given synthesis. The flow-matching decoder resamples noise on every call; identical input, different read. My instinct — the wrong instinct — was to chase that number toward 100%: more marked training data, a stronger adapter, a fixed random seed, temperature tricks. Months of work to make a stochastic process deterministic.

The right move was to stop fighting the stochasticity and exploit it. An audiobook is produced offline. There is no latency budget. If a span comes out wrong, I can simply generate it again — and again — and keep the take that’s right.

Meta-lesson #7: you don’t need a perfect model; you need a good selector. Online TTS has to be right on the first try. An audiobook has unlimited retries. That single fact demotes the entire “get the model to 100%” research program and promotes a much easier one: synthesize each span N times, and use the acoustic stress detector — the QC gate I’d already built — as a chooser, keeping the take where the injected stress actually landed. An imperfect model plus a good selector plus free retries equals a reliably correct read.
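A back-of-the-envelope calculation shows why free retries matter so much. The sketch below assumes every take is an independent draw with the same success rate, which is an idealization: a genuinely hard word tends to fail repeatedly, so treat the result as an optimistic estimate rather than a prediction.

```python
def p_at_least_one_correct(p_single: float, n_takes: int) -> float:
    """Chance that at least one of n independent takes lands the stress."""
    return 1.0 - (1.0 - p_single) ** n_takes


# Per-take success rates taken from the ~60-72% range quoted above.
for p in (0.60, 0.72):
    for n in (1, 3, 5):
        print(f"p={p:.2f} n={n}: {p_at_least_one_correct(p, n):.3f}")
```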

This is where the acoustic detector earned its keep three times over: dev diagnostic, production QC gate, and now selector. Feed it five takes of a span and it ranks them by whether the marked syllables carry the stress. Marked-word correctness jumped from the raw ~72% to ~89% across the chapter — and the pathological cases resolved for free. Andècca — an invented proper name from my novel, a word no model has ever seen — came out wrong on the first three synths and correct on the fourth. Without generate-and-select I’d have hand-fixed it forever. With it, the pipeline just… retried until the detector was satisfied.
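The loop itself is small. Here is a minimal sketch of generate-and-select, with the synthesizer and the detector passed in as plain callables. Both signatures are assumptions made for illustration; the real engine and detector interfaces may look quite different.

```python
from typing import Callable, Optional, Tuple


def generate_and_select(
    text: str,
    synthesize: Callable[[str], bytes],
    stress_score: Callable[[bytes, str], float],
    max_takes: int = 5,
    accept_at: float = 1.0,
) -> Tuple[Optional[bytes], float, int]:
    """Synthesize up to max_takes reads and keep the best-scoring one.

    stress_score returns the fraction of marked syllables in `text` that the
    detector judged correctly stressed (1.0 means every mark landed).
    Returns (best_audio, best_score, takes_used).
    """
    best_audio, best_score = None, -1.0
    for take in range(1, max_takes + 1):
        audio = synthesize(text)
        score = stress_score(audio, text)
        if score > best_score:
            best_audio, best_score = audio, score
        if best_score >= accept_at:
            return best_audio, best_score, take   # stop early on a clean take
    return best_audio, best_score, max_takes


# Stand-in engine: wrong on the first three takes, right on the fourth.
takes = iter([b"bad", b"bad", b"bad", b"good", b"bad"])
audio, score, used = generate_and_select(
    "Andècca",
    synthesize=lambda _text: next(takes),
    stress_score=lambda a, _text: 1.0 if a == b"good" else 0.0,
)
print(used, score)
```

Keeping the best take even when no take clears the bar means a stubborn span still ships its least-wrong read, and the returned score flags it for a human to check.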

Act XIV — A thousand paper cuts between “it can say it” and “it’s an audiobook”

Nobody warns you that the gap between a model that pronounces Italian correctly and a listenable audiobook is filled with unglamorous plumbing. A partial inventory of what broke, and what fixed it:

Choppiness. The first full-chapter render was, in my own words, “really hard to listen to, choppy and broken up.” I misdiagnosed it as a model regression before realizing that a single sentence in isolation had zero artifacts. So the model was fine and the stitching was the culprit: 78 sentences hard-concatenated with dead air between them. Four iterations of a stitcher followed (v2 fade edges → v3 → v4), each fixing the last one’s overcorrection. v3 over-trimmed and clipped word endings (“missing about 20–30 ms to finish the word”); v4 backed the trailing-trim threshold off from 20 ms to 6 ms, added a 35 ms trailing margin, and, as the load-bearing rule, only ever fades silence, never speech. The final recipe: group sentences into ~45-word spans to minimize seams, use 15 ms cosine crossfades at the joins, and fade the paragraph pauses.
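For readers who have not written a stitcher, a cosine crossfade looks roughly like this. It is a pure-Python sketch on lists of float samples, assuming mono audio at 24 kHz. It does not check that the overlap falls in silence, which the real stitcher insists on, and the function name is hypothetical.

```python
import math
from typing import List


def cosine_crossfade(a: List[float], b: List[float],
                     sample_rate: int = 24_000,
                     fade_ms: float = 15.0) -> List[float]:
    """Join two mono clips, overlapping the tail of `a` with the head of `b`.

    Over the overlap, `a` is weighted by a raised cosine going 1 -> 0 and `b`
    by its complement, so the two weights always sum to 1.
    """
    n = int(sample_rate * fade_ms / 1000.0)
    n = min(n, len(a), len(b))        # never overlap more than we have
    if n == 0:
        return a + b
    out = a[:-n]                      # untouched part of the first clip
    for i in range(n):
        w_out = 0.5 * (1.0 + math.cos(math.pi * (i + 0.5) / n))
        out.append(a[len(a) - n + i] * w_out + b[i] * (1.0 - w_out))
    out.extend(b[n:])                 # untouched part of the second clip
    return out


joined = cosine_crossfade([1.0] * 1000, [1.0] * 1000)
print(len(joined))                    # 15 ms at 24 kHz overlaps 360 samples
```

Because the two weights sum to 1 at every sample, a constant signal passes through the seam unchanged, which is a convenient property to test against.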

“Capitolo I” read as “Capitolo i” — the letter, not the number. Text normalization, previously a stub, became a real stage: Roman and Arabic numerals expanded to Italian words in context (“Capitolo 1” → “Capitolo primo”), and the loanword Torque — which the model mangled every single time as a rare foreign string — respelled to Torc to force the pronunciation. (Sometimes the right fix for an out-of-distribution word is to change the spelling, not the model.)
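A stripped-down version of that normalization stage might look like the following. It is only a sketch: it handles chapter headings and a one-entry respelling table, covers ordinals up to ten, and leaves anything else untouched. The table and function names are illustrative, not accento's.

```python
import re

# Masculine ordinals, since "capitolo" is masculine.
ORDINALS = {1: "primo", 2: "secondo", 3: "terzo", 4: "quarto", 5: "quinto",
            6: "sesto", 7: "settimo", 8: "ottavo", 9: "nono", 10: "decimo"}
RESPELL = {"Torque": "Torc"}          # respelling taken from the text
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
CHAPTER = re.compile(r"\bCapitolo\s+([IVXLCDM]+|\d+)\b")


def roman_to_int(s: str) -> int:
    """Convert a Roman numeral using the usual subtractive rule."""
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        v = ROMAN[ch]
        total += -v if ROMAN.get(nxt, 0) > v else v
    return total


def _chapter(m: "re.Match") -> str:
    token = m.group(1)
    n = int(token) if token.isdigit() else roman_to_int(token)
    word = ORDINALS.get(n)
    # Outside the small table, leave the heading exactly as written.
    return f"Capitolo {word}" if word else m.group(0)


def normalize(text: str) -> str:
    """Expand chapter numerals, then apply forced respellings."""
    text = CHAPTER.sub(_chapter, text)
    for src, dst in RESPELL.items():
        text = re.sub(rf"\b{re.escape(src)}\b", dst, text)
    return text


print(normalize("Capitolo I"))
print(normalize("Capitolo 4"))
print(normalize("il Torque d'oro"))
```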

None of this is intellectually deep. All of it is the difference between a demo and a product, and skipping it is why so many local-TTS “it works!” threads produce audio nobody would actually listen to for eight hours.

Act XV — The Mussolini problem, and two different ways a metric lied to me

Correctness solved, a new complaint arrived, and it was about feeling: “The read sounds like a speech from Mussolini.” Every stressed word landed like a podium thump. Intelligible, correct, and exhausting — declamatory oratory where the book wanted an intimate narrator.

My first fix was elegant, measured, and completely wrong. I reasoned that generate-and-select was picking the take with the sharpest stress, so I added a second selection criterion: among the stress-correct takes, prefer the calmest one (lowest pitch-accent prominence). I measured it and reported a triumph — emphasis down from 6.56 semitones to 1.52, a 4.3× reduction, at zero cost to correctness. I baked it into the pipeline and handed over a “calm” clip.

When I listened, though, the two clips were identical — I even double-checked I hadn’t swapped the filenames.

And the truth was worse than a swapped file. I ran the verification properly this time — measuring the marked-word emphasis on the assembled files with a stable global pitch reference instead of the noisy per-span one — and the dramatic 4× gap evaporated to 0.3 semitones, inaudible. The original “6.56 vs 1.52” had been a measurement artifact: the per-span F0 metric was garbage, with one span reporting 12.8 semitones and another reporting a negative value (a “peak” below the span’s own median). Averaging that noise had manufactured a spectacular improvement that did not exist. The two clips genuinely differed — different takes, different waveforms — but they sounded the same because, acoustically, on any honest measure, they were the same.
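How a "peak" can land below its own reference is easy to reproduce. The sketch below assumes prominence is measured as a word's peak F0 in semitones above a reference pitch; the exact metric used in the project is not specified here, and the F0 values are made up purely to show the failure mode.

```python
import math
from statistics import median


def semitones(f0_hz: float, ref_hz: float) -> float:
    """Interval between a pitch and a reference, in semitones."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def prominence(word_f0, ref_hz: float) -> float:
    """Peak F0 of a word relative to a reference pitch, in semitones."""
    return semitones(max(word_f0), ref_hz)


# Made-up F0 tracks (Hz) for illustration; not project data.
span_f0 = [180.0, 185.0, 190.0, 200.0, 210.0]   # a short, high-pitched span
word_f0 = [182.0, 186.0, 188.0]                  # the marked word inside it
chapter_median = 150.0                           # hypothetical global reference

per_span = prominence(word_f0, median(span_f0))
global_ref = prominence(word_f0, chapter_median)
print(f"per-span reference: {per_span:+.2f} st")
print(f"global reference:   {global_ref:+.2f} st")
```

With only a handful of frames in a span, the median swings with whatever else happens to be in that span, so the same word can score negative against one reference and clearly positive against another. A single chapter-wide reference removes that source of noise.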

Meta-lesson #8: a single metric will lie to you; the artifact is ground truth. This was the second time in the project a number walked me the wrong way — WER first, pitch-accent prominence now — and both times my ear caught it. A noisy metric doesn’t just add error bars; it can fabricate a clean, confident, entirely fictional result and hand it to you with a straight face. Verify against the actual artifact, and be most suspicious precisely when the number tells you exactly what you hoped to hear.

So the declamation was real, and inter-take selection could not touch it — because every take emphasized the marked words. The pounding wasn’t in the selection. It was in the marks themselves: the model faithfully hammers every accent I inject, and over a full chapter that accumulates into oratory. (Which finally reconciled a puzzle — a short marked clip sounded fine, but the full chapter sounded like a rally. Same per-word emphasis; it’s the density, compounding over minutes.)

The real fix was to inject fewer marks — and it was hiding in plain sight, in accento’s own founding principle, which the pipeline had quietly abandoned. Minimal intervention. I’d drifted into marking all 36 stress-candidate words in the chapter. But the fine-tuned model, asked unmarked, already gets most of them right. So: synthesize each span plain first, let the acoustic detector flag which words actually came out mis-stressed, and mark only those. Everything the model already nails stays untouched — read with natural, un-hammered stress.
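In code, the minimal-intervention pass is a plain read followed by a set intersection. This sketch passes the synthesizer and detector in as callables with assumed signatures; the names and the toy example are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set


def words_to_mark(
    span_words: List[str],
    candidates: Iterable[str],
    synthesize_plain: Callable[[List[str]], bytes],
    misstressed: Callable[[bytes, List[str]], Set[str]],
) -> Set[str]:
    """Return only the candidate words the unmarked read got wrong."""
    audio = synthesize_plain(span_words)
    wrong = misstressed(audio, span_words)
    return wrong & set(candidates)


def apply_marks(span_words: List[str], marked_forms: Dict[str, str],
                to_mark: Set[str]) -> List[str]:
    """Swap in the accent-marked spelling for the flagged words only."""
    return [marked_forms[w] if w in to_mark else w for w in span_words]


# Toy example with stand-in engine and detector.
words = ["i", "sandali", "sono", "comodi"]
forms = {"sandali": "sàndali", "comodi": "còmodi"}
flagged = words_to_mark(
    words,
    candidates=forms,
    synthesize_plain=lambda w: b"audio",
    misstressed=lambda audio, w: {"sandali"},   # pretend only this one failed
)
print(apply_marks(words, forms, flagged))
```

Only the flagged word picks up a mark; the other candidate keeps the model's natural, unforced stress.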

The numbers made the case on their own. Of 36 candidate words, the model stressed 25 correctly with no mark at all. Only 11 needed intervention — a 69% cut in forced emphases. And because we now fix only what’s broken (and retry only those spans), marked-word correctness went up, not down: 94.4%, better than the 88.9% of the mark-everything version, for less compute.
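The percentages are easy to re-derive. One assumption is flagged in the code: the two correctness figures are treated as fractions of the same 36 words, which is what makes them come out to 32 and 34 correct.

```python
candidates = 36         # stress-candidate words in the chapter (from the text)
correct_unmarked = 25   # stressed correctly with no mark (from the text)

needs_mark = candidates - correct_unmarked
cut = 1 - needs_mark / candidates
print(needs_mark, f"{cut:.0%}")

# Assumption: both correctness figures are out of the same 36 words.
for hits in (32, 34):
    print(hits, f"{hits / candidates:.1%}")
```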

Meta-lesson #9: do less. Minimal intervention beat maximal on every axis at once. Marking only the words the model gets wrong produced a read that was simultaneously more correct and more natural than marking everything — because every unnecessary mark was both a redundant correction and a little podium thump. The founding principle had been right all along; the regression was mine for forgetting it. When a system fights you, check whether you’re intervening more than the problem requires.

Act XVI — “Day and night”

I re-rendered the chapter with minimal marking and sent it over. The verdict:

“It sounds great otherwise.”

And then, unprompted:

“Day and night. You find audiobooks with worse quality than this.”

That is the sentence the whole project was chasing. Not a WER, not a gate, not a semitone count — a native-speaking author, listening to his own novel in his own cloned voice, saying it’s better than things people are paid to produce. The metrics were only ever proxies for this line.

What’s left is trim, not construction: one word (tacere, at a paragraph’s end) that the trailing-trim clipped, and a few pauses to restore — including a specific dramatic beat, a pause after the whispered «E il Cervo», that I wanted restored by name. These are stitching-level fixes, the last few paper cuts. The read is done.

What actually built this

The honest through-line, compressed: the constraints did the design work, and the ear made every real decision.

The non-commercial license of the best-sounding model forced the fine-tune. The non-existence of a clean Italian pronunciation lexicon turned a hand-built one into a moat. The saturated LLM eval picked the cheapest model that cleared it. The offline nature of audiobooks demoted “perfect model” to “good selector.” And every automated gate (WER, stress detector, pitch-accent prominence) was, at best, a proxy, and twice a proxy actively lied; the acceptance test was always a human listening.

Nine meta-lessons, and they rhyme: measure the problem before you build (it was 10%, not 100%); provenance is a moat for the small player; the saturated eval picks the cheapest winner; priors go downstream, not in the prompt; know what your QC gate actually validates; metrics are not a good read; you need a good selector, not a perfect model; a lone metric will fabricate a result — trust the artifact; and do less. Most of them are the same lesson wearing different clothes: stay honest about what you actually know, and let the real target — the ear — arbitrate.

The result is a thing that did not exist when I started: a clean, Apache-licensed, watermark-free, provably sourced Italian voice model that says sàndali correctly, narrates a novel without sounding like a rally, and that I am fully allowed to sell. Next comes scaling it across the whole book and deciding on the three delivery shapes: a quality-max personal build, a cheapest-to-scale service, and an OSS package that ships accento with a bring-your-own-model slot.

But the hard question — can a person with two consumer GPUs, a clean-data conscience, and a good ear build an Italian audiobook voice that beats what you can buy? — has an answer now.

Day and night.

The code

accento is open source: github.com/acato/accento — the stress front-end, the morphological engine, the acoustic QC gate, and the provenance manifests, all under Apache-2.0.