
#45 — Orpheus

March 19, 2025 · 2 min read


Why it matters: Canopy Labs has released Orpheus, a groundbreaking family of speech-LLMs that finally brings human-level speech generation to the open-source community, challenging the dominance of closed-source models.

The big picture: Until now, open-source TTS solutions have lagged behind proprietary offerings in quality and emotional intelligence. Orpheus changes that paradigm with state-of-the-art performance even in its smallest configurations.

What's new

Canopy Labs is releasing four model sizes based on the Llama architecture:

  • Medium (3B parameters)
  • Small (1B parameters)
  • Tiny (400M parameters)
  • Nano (150M parameters)

Between the lines: Even the smallest models deliver "extremely high quality, aesthetically pleasing speech generation," making this technology accessible across various computing environments.

Technical innovation

Orpheus leverages Llama-3b as its backbone, trained on:

  • 100,000+ hours of English speech data
  • Billions of text tokens

The edge: This dual training approach enhances TTS performance while maintaining sophisticated language understanding.

Standout capabilities

Zero-shot voice cloning: Without specific training for this task, Orpheus demonstrates emergent voice cloning abilities that match or exceed industry leaders like ElevenLabs and PlayHT.

Emotion control: The model can be taught specific emotional expressions with minimal fine-tuning examples, responding to inline tags such as `<laugh>` and `<sigh>`, and even handling disfluencies naturally.
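As a rough illustration of how tag-based control looks in practice, the sketch below embeds an emotive tag directly in the input text. The tag names follow the style shown above; the `build_prompt` helper and the "voice: text" prompt layout are assumptions for illustration, not Canopy Labs' actual API.

```python
# Hypothetical prompt construction for tag-based emotion control.
# Inline tags like <laugh> sit directly in the text the model reads;
# the voice-name prefix and this helper are illustrative assumptions.

def build_prompt(voice: str, text: str) -> str:
    """Prefix a voice name so the model can condition on a speaker."""
    return f"{voice}: {text}"

prompt = build_prompt("tara", "I can't believe it worked <laugh> ... well, mostly.")
print(prompt)
```

The key point is that emotion control rides on ordinary text tokens, so no separate conditioning channel is needed at inference time.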

Production-ready features

Real-time performance: Orpheus supports output streaming with roughly 200 ms latency, which can be cut to around 25-50 ms by streaming input text into the KV cache.

By the numbers: Streaming inference runs faster than real-time playback even on an A100 40GB GPU with the 3B parameter model.

Technical differentiators

Canopy Labs made two unconventional design choices:

  1. Using a flattened sequence decoding approach (7 tokens per frame)
  2. Implementing a non-streaming CNN-based tokenizer with a sliding window modification

The bottom line: These choices enable real-time generation without the "popping" artifacts common in other SNAC-based speech LLMs.
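To make the flattened-sequence idea concrete, here is a minimal sketch (plain Python, not Canopy Labs' code) of interleaving seven codebook tokens per audio frame into one autoregressive stream and recovering the frames afterwards:

```python
TOKENS_PER_FRAME = 7  # one token per codec level per frame, as described above


def flatten_frames(frames):
    """Interleave each frame's 7 codebook tokens into one flat sequence."""
    seq = []
    for frame in frames:
        assert len(frame) == TOKENS_PER_FRAME
        seq.extend(frame)
    return seq


def unflatten_sequence(seq):
    """Regroup a flat decoded sequence back into 7-token frames."""
    assert len(seq) % TOKENS_PER_FRAME == 0
    return [seq[i:i + TOKENS_PER_FRAME]
            for i in range(0, len(seq), TOKENS_PER_FRAME)]


frames = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]
flat = flatten_frames(frames)
assert unflatten_sequence(flat) == frames
```

Because the model emits one flat stream, a complete frame is available every seven tokens, which is what makes low-latency streaming decode possible.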

What's next

Canopy Labs hints at releasing an open-source end-to-end speech model "in the coming weeks," using the same architecture and training methodology.

How to try it: Demos and code are available on GitHub, Hugging Face, and through an interactive Google Colab notebook.
