SUB/WAVE

MANUAL · 05

How the DJ works.

There's no human at the desk. An LLM picks every track, writes every line, and a text-to-speech voice reads it out. Here's how that adds up to a station that sounds like a station.

PICKING TRACKS

One song ends, the DJ chooses the next.

Every time a track finishes, the DJ picks what follows. It builds a pool of candidates from your library — songs in a similar mood, similar artists, recently-added and frequently-played albums, matching playlists — and the LLM chooses from that pool, steering by the time of day, the weather, and the current mood. When nothing's been requested, it runs a fallback playlist so the music never stops.
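
As a rough illustration, the selection step might look like the sketch below. It is not the station's code: `build_pool`, `pick_next`, the `choose` callback standing in for the LLM, and the shape of the context dictionary are all assumptions, and the exact condition that triggers the fallback playlist is a guess.

```python
from typing import Callable, Iterable, Optional, Sequence


def build_pool(*sources: Iterable[str]) -> list[str]:
    """Merge candidate sources into one pool, dropping duplicates but keeping order."""
    seen: set[str] = set()
    pool: list[str] = []
    for source in sources:
        for track in source:
            if track not in seen:
                seen.add(track)
                pool.append(track)
    return pool


def pick_next(
    pool: Sequence[str],
    context: dict[str, str],
    choose: Callable[[Sequence[str], dict[str, str]], Optional[str]],
    fallback: Sequence[str],
) -> str:
    """Ask the chooser (a stand-in for the LLM) for a track from the pool."""
    choice = choose(pool, context) if pool else None
    if choice in pool:
        return choice
    # Assumption: with no usable choice, play the top of the fallback playlist.
    return fallback[0]
```

Checking that the answer is actually in the pool keeps a confused model from naming a song the library doesn't have.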

THE VOICES

Personas, picked at random.

The operator gives the DJ one to ten souls — distinct personas, each with its own name and character. Before each spoken moment the station picks one at random, so the voice on air shifts through the day rather than reading from a single script. Each line is generated fresh; the DJ doesn't repeat itself.
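
The random pick can be sketched in a few lines. The `Persona` fields and the `pick_persona` helper are hypothetical names used only for illustration; the one-to-ten limit comes from the text above.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Persona:
    name: str
    character: str


def pick_persona(personas: Sequence[Persona], rng: Optional[random.Random] = None) -> Persona:
    """Pick one persona uniformly at random for the next spoken moment."""
    if not 1 <= len(personas) <= 10:
        raise ValueError("expected between one and ten personas")
    chooser = rng if rng is not None else random
    return chooser.choice(personas)
```

Passing a seeded `random.Random` makes the pick repeatable, which is handy when testing.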

The spoken audio is rendered by a text-to-speech engine — a fast local voice by default, or a more natural one if the operator configures it.

THE VOICE ENGINE

Local voices, or the cloud.

The DJ's words are written by the language model, but turning them into speech is a separate job, handled by one of four text-to-speech engines the operator chooses under Admin → TTS voice. Three run on your own hardware; one is hosted.

  • Piper: a local engine, and the default. It's compact, runs on practically any hardware, and renders speech faster than real time. The voice is clear but a little synthetic. Piper is also the station's safety net (see THE DJ NEVER GOES SILENT).
  • Kokoro: a local neural model that sounds markedly more natural, closer to a real broadcaster. It's heavier: it loads a model into memory and takes longer per line, so it's happiest with a bit of CPU and RAM headroom. It offers a range of voices, with a British selection surfaced in the console.
  • Chatterbox: a local model that clones a voice from a short reference clip, so each persona can have its own distinct sound. It also renders paralinguistic cues such as [laugh] and [sigh] as real sounds. It is the most capable local engine and the heaviest (comfortable on a GPU, slow on CPU), and it is opt-in: the operator has to build it into the controller image.
  • Cloud: hosted text-to-speech through OpenAI or ElevenLabs, using an API key. The most lifelike and expressive of the four, but it costs per use and depends on the network being up.

You don't have to commit to one. The operator can assign a different engine per kind of segment — say a rich cloud voice for station IDs, but a fast local voice for routine time checks — with everything else falling through to a default engine.
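
The per-segment assignment amounts to a lookup with a default. In this sketch the segment kinds, the engine keys, and the choice of Kokoro as the default are illustrative assumptions, not the station's actual configuration.

```python
# Hypothetical configuration: segment kinds and engine keys are illustrative.
DEFAULT_ENGINE = "kokoro"
SEGMENT_ENGINES = {
    "station_id": "cloud",
    "time_check": "piper",
}


def engine_for(segment_kind: str) -> str:
    """Return the engine assigned to this kind of segment, else the default."""
    return SEGMENT_ENGINES.get(segment_kind, DEFAULT_ENGINE)
```

With this mapping, `engine_for("weather")` falls through to the default because no override is listed for it.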

THE DJ NEVER GOES SILENT

If a voice ever fails — a cloud outage, a model that isn't installed — the station drops to a local engine automatically. Piper is always there as the last resort, so a spoken segment is never lost to a missing voice.
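
A minimal sketch of that safety net, assuming each engine is a plain function from text to audio bytes. The `speak` helper and the `engines` mapping are hypothetical, and the real chain may try other local engines before it reaches Piper.

```python
from typing import Callable, Mapping

Synth = Callable[[str], bytes]


def speak(
    text: str,
    preferred: str,
    engines: Mapping[str, Synth],
    last_resort: str = "piper",
) -> tuple[str, bytes]:
    """Render text with the preferred engine, dropping to the last resort on failure."""
    # dict.fromkeys keeps order and avoids trying the same engine twice.
    for name in dict.fromkeys([preferred, last_resort]):
        synth = engines.get(name)
        if synth is None:
            continue  # engine not installed in this build
        try:
            return name, synth(text)
        except Exception:
            continue  # e.g. a cloud outage; move on to the next engine
    raise RuntimeError("no text-to-speech engine produced audio")
```

The function returns the name of the engine that actually spoke, so a caller can log when a fallback happened.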

ENABLING CHATTERBOX

Piper and Kokoro ship inside the controller image, and the cloud engine just needs an API key — but Chatterbox is the exception. Its model and PyTorch runtime are large, so it is opt-in at build time rather than bundled by default. Rebuild the controller image with the WITH_CHATTERBOX build argument, then recreate the container:

cd docker
docker compose build --build-arg WITH_CHATTERBOX=1 controller
docker compose up -d controller

Once the image is built, Chatterbox shows up as an available engine under Admin → TTS voice. To give a persona a cloned voice, drop a short reference WAV into state/chatterbox-voices/ and pick it on the Personas page — without one, Chatterbox uses its built-in voice. Until the image is rebuilt, selecting Chatterbox simply falls back to Piper.
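
The reference-clip rule can be pictured as a small lookup: use the persona's WAV if it is present, otherwise use the built-in voice. The `reference_clip` helper is hypothetical; only the state/chatterbox-voices/ directory comes from the text above.

```python
from pathlib import Path
from typing import Optional

VOICES_DIR = Path("state/chatterbox-voices")


def reference_clip(clip_name: Optional[str], voices_dir: Path = VOICES_DIR) -> Optional[Path]:
    """Return the persona's reference WAV if it exists, else None for the built-in voice."""
    if not clip_name:
        return None
    candidate = voices_dir / clip_name
    if candidate.suffix.lower() == ".wav" and candidate.is_file():
        return candidate
    return None
```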

WHEN IT TALKS

Links, IDs, the time, the weather.

Between tracks the DJ does what radio DJs do — a short link tying one song to the next, a station ID, the time at the top of the hour, a weather note when the conditions change. Spoken segments ride over the music: the track ducks down while the DJ talks, then comes back up.
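
Ducking can be pictured as a gain curve on the music: full volume, a short ramp down, a hold while the DJ talks, and a ramp back up. The sketch below assumes linear ramps, and the ducked level and fade length are made-up defaults rather than the station's settings.

```python
def music_gain(
    t: float,
    speech_start: float,
    speech_end: float,
    ducked: float = 0.25,
    fade: float = 0.5,
) -> float:
    """Music gain (0.0 to 1.0) at time t, in seconds, around one spoken segment."""
    if t <= speech_start - fade or t >= speech_end + fade:
        return 1.0  # well clear of the speech: full volume
    if t < speech_start:
        # Ramp down so the music is already ducked when the DJ starts.
        progress = (t - (speech_start - fade)) / fade
        return 1.0 - progress * (1.0 - ducked)
    if t <= speech_end:
        return ducked  # hold under the voice
    # Ramp back up once the DJ has finished.
    progress = (t - speech_end) / fade
    return ducked + progress * (1.0 - ducked)
```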

How chatty the station is depends on a frequency setting the operator chooses — quiet, moderate, or aggressive. A quiet station checks the time every couple of hours and drops the occasional ID; an aggressive one gives you full idents and weather updates through the hour.
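
One way to picture the setting is as a table of intervals. In this sketch only the quiet figure echoes the text ("every couple of hours"); the moderate and aggressive values, and the `time_check_due` helper, are placeholders invented for illustration.

```python
from enum import Enum


class Frequency(Enum):
    QUIET = "quiet"
    MODERATE = "moderate"
    AGGRESSIVE = "aggressive"


# Minutes between time checks. Placeholder values, not the station's real ones.
TIME_CHECK_INTERVAL_MIN = {
    Frequency.QUIET: 120,
    Frequency.MODERATE: 60,
    Frequency.AGGRESSIVE: 30,
}


def time_check_due(setting: Frequency, minutes_since_last: int) -> bool:
    """True once enough time has passed for another time check at this setting."""
    return minutes_since_last >= TIME_CHECK_INTERVAL_MIN[setting]
```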

SHOWS & SESSIONS

It keeps a thread going.

The DJ runs in sessions — a continuous block with a memory of what it's already played and said, so its links stay coherent instead of starting cold each time. A session can be a scheduled show the operator paints onto a weekly grid, or an autonomous block keyed to the time of day and the dominant mood. When the show changes or the block ages out, the session rolls over to a fresh one and carries a short handoff forward.
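
The rollover can be sketched as below. The `Session` fields, the age limit, and the wording of the handoff are all assumptions; the sketch only shows the idea of starting fresh while carrying a short note forward.

```python
from dataclasses import dataclass, field

MAX_AGE_MINUTES = 180  # hypothetical limit before an autonomous block ages out


@dataclass
class Session:
    key: str  # a show name, or a time-of-day and mood label
    handoff: str = ""
    played: list[str] = field(default_factory=list)
    said: list[str] = field(default_factory=list)


def next_session(
    current: Session,
    wanted_key: str,
    age_minutes: int,
    max_age: int = MAX_AGE_MINUTES,
) -> Session:
    """Keep the current session unless the show changed or the block aged out."""
    if current.key == wanted_key and age_minutes < max_age:
        return current
    # Carry only a short summary forward, not the whole memory.
    recent = ", ".join(current.played[-3:]) or "nothing yet"
    return Session(key=wanted_key, handoff=f"Previously on {current.key}: {recent}")
```

The new session starts with empty `played` and `said` lists, so its only link to the past is the handoff string.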