AI Spotlight

Hume Introduces OCTAVE, a Speech-Language Model with Voice and Personality Generation

Hume's OCTAVE generates voices and personalities from prompts or recordings, enabling real-time, multi-speaker, and interactive AI experiences.

Hiraku

Dec 25, 2024 • 2 min read

Image composed by Hiraku for illustrative purposes.

Hume, a company known for its work in AI-driven vocal and emotional intelligence, has introduced OCTAVE (Omni-Capable Text and Voice Engine). This new model merges the speech-language capabilities of Hume’s EVI 2 with the functionality of popular systems such as OpenAI’s Voice Engine, ElevenLabs’ TTS Voice Design, and Google DeepMind’s NotebookLM.

Richer Voices and Personalities

One of OCTAVE’s standout features is its ability to generate both a voice and a distinct personality from either a short descriptive prompt or a brief audio sample of about five seconds. The model can produce the desired accent, vocal register, emotional undertone, and other aspects of speaking style, while also generating fitting language and mannerisms. According to Hume, this extends to varied prompts, ranging from “gravelly as if gargling hot asphalt” to “gentle therapist” or “wizard mentor.”

Instant Voice and Personality Adoption

OCTAVE can glean key attributes of any speaker—like tone, accent, and disposition—from a sample as short as five seconds. It then “clones” these characteristics, removing background noise and other artifacts, to produce new speech in a similar voice. This process enables near-instant adaptation to a specific individual’s speaking style and personality, making it possible to continue a conversation in a speaker’s voice even after they have stopped talking.

Real-Time Interaction and Multi-Speaker Dialog

Hume reports that OCTAVE can handle live conversation in real time, generating fluent replies in the voice it has learned or created. It can also switch seamlessly between multiple synthetic voices and personalities within the same conversation, a feature the company likens to Google DeepMind’s NotebookLM. All it needs is a single short recording of each speaker to match their vocal and personality traits. Hume says this approach allows for richer, more realistic interactions than pipelines that handle transcription, language response, and speech generation with separate models.

On-Par Language Capabilities

Beyond voice generation, OCTAVE performs competitively on text-based reasoning tasks. Hume compared the performance of OCTAVE 3B with that of similarly sized large language models on benchmarks like MMLU, CommonsenseQA, PIQA, and ARC-Easy. According to Hume, the results suggest that OCTAVE’s speech features do not come at the expense of text comprehension and reasoning abilities, which the company presents as evidence that the model can serve as an all-in-one system for voice-enabled AI applications.

Release and Future Prospects

According to Hume, OCTAVE is still under development, and the company is proceeding cautiously with its release. Selected partners have early access to a limited version for safety and effectiveness testing, with a broader rollout planned in the months ahead. Hume emphasizes the potential for more sophisticated AI experiences, from tailoring virtual personalities to individual users to orchestrating multi-speaker AI conversations. The company has invited feedback from developers and end users on what innovations could be built with OCTAVE’s wide-ranging capabilities.
