Justy: Okay so Resemble AI just dropped something called DramaBox and the pitch is basically: write a screenplay, get a performance. Like, the PROMPT is the director.
Cody: I saw that. I was pretty skeptical of the name. DramaBox sounds like a subscription for, like, reality TV recaps or something.
Justy: It really does. Anyway — how was your week, did you end up making that flight?
Cody: Barely. I got to the gate and they were already boarding the last group. I was the last person on the plane, sat in the middle seat, no overhead space. Classic.
Justy: That is the most Cody travel story. Okay, DramaBox — I actually think this one's interesting for a pretty specific reason.
Cody: So it's Resemble AI's expressive TTS, and it's built as an IC-LoRA fine-tune on top of LTX-2.3 — that's Lightricks' audio-only model, the 3.3 billion parameter version. The LoRA is already merged into the base weights, so you're downloading one 6.6 gigabyte file for the transformer and a separate 1.9 gigabyte file for the audio components. Then there's a Gemma-3 12B text encoder at 4-bit quantization on top, roughly another 8 gigabytes. So you're looking at about 24 gigabytes
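Editor's note: the sizes Cody quotes can be tallied in a few lines. This is only a sketch of the arithmetic. The component labels are ours, and treating the gap between the weight files and the 24 gigabyte figure as working memory is an assumption, since on-disk size and in-memory size do not have to match.

```python
# Sizes in gigabytes as quoted in the episode; the labels are ours.
COMPONENTS_GB = {
    "transformer (LoRA already merged)": 6.6,
    "audio components": 1.9,
    "Gemma-3 12B text encoder, 4-bit (approximate)": 8.0,
}

# The "about 24 gigabytes" figure from the episode.
QUOTED_PEAK_GB = 24.0


def weights_total_gb(components):
    """Sum the quoted component sizes, rounded to one decimal place."""
    return round(sum(components.values()), 1)


def headroom_gb(peak_gb, components):
    """Quoted peak minus quoted weights (assumed to be working memory)."""
    return round(peak_gb - weights_total_gb(components), 1)


if __name__ == "__main__":
    print("weights:", weights_total_gb(COMPONENTS_GB), "GB")
    print("headroom:", headroom_gb(QUOTED_PEAK_GB, COMPONENTS_GB), "GB")
```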
Justy: Okay so that's not a laptop project. But the part that caught me — the actual product insight here — is that the prompt IS the performance. Not a style dropdown, not a slider. You write something like: 'A woman speaks warmly, then she laughs, then her voice cracks.' That's the whole instruction set. And you can optionally pass a 10-second voice reference to clone the timbre. So delivery style and speaker identity are two separate knobs, which I think is the real unlock for pr
Cody: Yeah, and the prompt writing rules are actually pretty specific. There's a real inside-quotes versus outside-quotes distinction that matters a lot. Dialogue goes inside the quotes, and the model produces the actual sounds. Laughs have to be written as one word, 'Hahaha' or 'Hehehe', never separated. If you write 'Ha ha ha' with spaces it apparently just... speaks the letters weirdly. Stage directions go outside the quotes: 'She sighs deeply,' 'He gulps nervously,' 'A long pause.
Justy: I mean, that is such a 'read the docs before you ship' moment. But honestly for audiobook production or game dialogue this is a really different workflow than what exists. You're writing a script the way you'd write it for a voice actor, and the model interprets it like one.
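Editor's note: the quoting rules Cody describes are easy to check before a prompt is sent anywhere. The sketch below is a hypothetical pre-flight linter, not part of DramaBox. It assumes dialogue is wrapped in straight double quotes, and every function name is ours.

```python
import re

# Assumption: dialogue is wrapped in straight double quotes.
QUOTED = re.compile(r'"([^"]*)"')
# A laugh syllable repeated with spaces or hyphens, e.g. "Ha ha ha".
SPACED_LAUGH = re.compile(r'\b(ha|he|ho)(?:[\s-]+\1)+\b', re.IGNORECASE)


def split_prompt(prompt):
    """Return (dialogue_segments, stage_directions) for a prompt string."""
    dialogue = QUOTED.findall(prompt)
    # re.split with one capture group alternates outside, inside, outside...
    outside = QUOTED.split(prompt)[::2]
    directions = [part.strip() for part in outside if part.strip()]
    return dialogue, directions


def join_laughs(text):
    """Collapse a spaced laugh into one word: 'Ha ha ha' becomes 'Hahaha'."""
    def _join(match):
        return re.sub(r'[\s-]+', '', match.group(0))
    return SPACED_LAUGH.sub(_join, text)


def lint_prompt(prompt):
    """Return one warning per spaced laugh found inside quoted dialogue."""
    dialogue, _ = split_prompt(prompt)
    return [
        f"spaced laugh in dialogue: {m.group(0)!r}"
        for segment in dialogue
        for m in SPACED_LAUGH.finditer(segment)
    ]


if __name__ == "__main__":
    draft = 'She sighs deeply. "Ha ha ha, you really did that?" A long pause.'
    print(lint_prompt(draft))
    print(join_laughs(draft))
```

Running the linter on a draft script flags laughs that would need rewriting, and join_laughs applies the one-word form automatically.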
Cody: That part I buy. My hesitation is more about the VRAM floor. Twenty-four gigabytes peak — that's an A10 or better, minimum. There's a ZeroGPU demo space on Hugging Face if you just want to try it. And the inference settings are worth knowing — cfg-scale at 2.5 is the default, lower gets you more natural, higher gets you more text-faithful. stg-scale is skip-token guidance at 1.5. Those two together are basically the 'how much do I trust the prompt versus how much does the mod
Justy: Hm.
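Editor's note: the cfg-scale trade-off Cody mentions matches the usual classifier-free guidance pattern, where the model's prompt-conditioned prediction is pushed away from its unconditioned one. The sketch below shows that generic formula on plain lists. Whether DramaBox applies cfg-scale exactly this way is an assumption, and the function name is ours.

```python
def apply_cfg(uncond, cond, scale):
    """Generic classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 0 returns the unconditioned prediction, scale = 1 returns the
    conditioned one, and larger values push further toward the prompt.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 1.0]
    # 2.5 is the default cfg-scale quoted in the episode.
    print(apply_cfg(uncond, cond, 2.5))
```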
Cody: Oh, and there's a watermarking thing baked in by default — every output from inference.py gets tagged with something called Resemble Perth. It's a neural watermark, imperceptible, survives MP3 compression and editing, near-100% detection accuracy. You can disable it with a flag for debugging but it's on by default.
Justy: I actually think that's the quietest but most important detail in the whole repo. Especially right now. Someone's going to use this to make a fake voicemail and Resemble just made sure there's a receipt.
Cody: Yeah. I mean it's only as good as the detection infrastructure behind it, but the fact that it's on by default — not opt-in — that's the right call.
Justy: Exactly. Okay, there's also a fine-tuning path if you want to train your own LoRA on top of DramaBox — not from raw LTX-2.3, on top of DramaBox itself. So you're starting with the expressive prior already baked in.
Cody: The LoRA targets 288 pairs across 48 transformer blocks: rank 128, alpha 128, 10k steps with a 500-step warmup. JSONL manifest is the recommended data format. One thing worth flagging: load the LoRA at inference rather than pre-merging, because pre-merged checkpoints degrade output.
Justy: That's such a weird gotcha. Like, normally merging is just... fine.
Cody: It's probably an artifact of how the IC-LoRA conditioning interacts with the base weights after the first merge. When you merge again you're compounding something the architecture wasn't designed for. That's my read, anyway — I could be wrong.
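Editor's note: the fine-tuning numbers Cody lists are easier to read once they are worked through. The sketch below uses the standard LoRA merge, W + (alpha / rank) * B @ A, on tiny nested lists. It assumes the 288 pairs are spread evenly over the 48 blocks and that the warmup is linear. Neither detail is stated in the episode, and none of this is DramaBox's own training code.

```python
# Hyperparameters as quoted in the episode.
RANK = 128
ALPHA = 128
TARGET_PAIRS = 288
BLOCKS = 48
TOTAL_STEPS = 10_000
WARMUP_STEPS = 500


def lora_scale(alpha, rank):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def pairs_per_block(pairs, blocks):
    """Pairs per block, assuming an even split (our assumption)."""
    if pairs % blocks:
        raise ValueError("pairs do not divide evenly across blocks")
    return pairs // blocks


def warmup_factor(step, warmup_steps):
    """Learning-rate multiplier for a linear warmup (assumed shape)."""
    return min(1.0, step / warmup_steps)


def merge_lora(w, a, b, scale):
    """Return W + scale * (B @ A) for small nested-list matrices.

    w is (out x in), b is (out x rank), a is (rank x in).
    """
    rows, cols, rank = len(w), len(w[0]), len(a)
    return [
        [
            w[i][j] + scale * sum(b[i][k] * a[k][j] for k in range(rank))
            for j in range(cols)
        ]
        for i in range(rows)
    ]


if __name__ == "__main__":
    print("scale:", lora_scale(ALPHA, RANK))
    print("pairs per block:", pairs_per_block(TARGET_PAIRS, BLOCKS))
    print("warmup share of run:", WARMUP_STEPS / TOTAL_STEPS)
```

With rank and alpha both at 128 the scaling factor is 1, so the low-rank update is added to the weights at full strength.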
Justy: See, Cody finding the edge case is the most reliable thing about this podcast. Alright — if you want to poke at this, it's ResembleAI slash DramaBox on Hugging Face, MIT-adjacent community license from Lightricks. Good one to have open in a tab.