
#45 — Orpheus

March 19, 2025 · 2 min read


Why it matters: Canopy Labs has released Orpheus, a groundbreaking family of speech-LLMs that finally brings human-level speech generation to the open-source community, challenging the dominance of closed-source models.

The big picture: Until now, open-source TTS solutions have lagged behind proprietary offerings in quality and emotional intelligence. Orpheus changes that paradigm with state-of-the-art performance even in its smallest configurations.

What's new

Canopy Labs is releasing four model sizes based on the Llama architecture:

  • Medium (3B parameters)
  • Small (1B parameters)
  • Tiny (400M parameters)
  • Nano (150M parameters)

Between the lines: Even the smallest models deliver "extremely high quality, aesthetically pleasing speech generation," making this technology accessible across various computing environments.

Technical innovation

Orpheus leverages Llama-3b as its backbone, trained on:

  • 100,000+ hours of English speech data
  • Billions of text tokens

The edge: This dual training approach enhances TTS performance while maintaining sophisticated language understanding.

Standout capabilities

Zero-shot voice cloning: Without specific training for this task, Orpheus demonstrates emergent voice cloning abilities that match or exceed industry leaders like ElevenLabs and PlayHT.

Emotion control: The model can be taught specific emotional expressions with minimal fine-tuning examples, responding to inline tags such as `<laugh>` and `<sigh>`, and even handling disfluencies naturally.
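As a rough illustration of how tag-based control looks in practice, the sketch below embeds an emotive tag directly in the input text. The tag names follow the style shown above; the `build_prompt` helper and the "voice: text" prompt layout are assumptions for illustration, not Canopy Labs' actual API.

```python
# Hypothetical prompt construction for tag-based emotion control.
# Inline tags like <laugh> sit directly in the text the model reads;
# the voice-name prefix and this helper are illustrative assumptions.

def build_prompt(voice: str, text: str) -> str:
    """Prefix a voice name so the model can condition on a speaker."""
    return f"{voice}: {text}"

prompt = build_prompt("tara", "I can't believe it worked <laugh> ... well, mostly.")
print(prompt)
```

The key point is that emotion control rides on ordinary text tokens, so no separate conditioning channel is needed at inference time.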

Production-ready features

Real-time performance: Orpheus supports output streaming with roughly 200 ms latency, which can be cut to around 25-50 ms by streaming input text into the KV cache.

By the numbers: Streaming inference runs faster than real-time playback even on an A100 40GB GPU with the 3B parameter model.

Technical differentiators

Canopy Labs made two unconventional design choices:

  1. Using a flattened sequence decoding approach (7 tokens per frame)
  2. Implementing a non-streaming CNN-based tokenizer with a sliding window modification

The bottom line: These choices enable real-time generation without the "popping" artifacts common in other SNAC-based speech LLMs.
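To make the flattened-sequence idea concrete, here is a minimal sketch (plain Python, not Canopy Labs' code) of interleaving seven codebook tokens per audio frame into one autoregressive stream and recovering the frames afterwards:

```python
TOKENS_PER_FRAME = 7  # one token per codec level per frame, as described above


def flatten_frames(frames):
    """Interleave each frame's 7 codebook tokens into one flat sequence."""
    seq = []
    for frame in frames:
        assert len(frame) == TOKENS_PER_FRAME
        seq.extend(frame)
    return seq


def unflatten_sequence(seq):
    """Regroup a flat decoded sequence back into 7-token frames."""
    assert len(seq) % TOKENS_PER_FRAME == 0
    return [seq[i:i + TOKENS_PER_FRAME]
            for i in range(0, len(seq), TOKENS_PER_FRAME)]


frames = [[1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14]]
flat = flatten_frames(frames)
assert unflatten_sequence(flat) == frames
```

Because the model emits one flat stream, a complete frame is available every seven tokens, which is what makes low-latency streaming decode possible.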

What's next

Canopy Labs hints at releasing an open-source end-to-end speech model "in the coming weeks," using the same architecture and training methodology.

How to try it: Demos and code are available on GitHub, Hugging Face, and through an interactive Google Colab notebook.
