Sesame AI Demo
An effort to build an 'emotionally intelligent' Conversational Speech Model (CSM), one that can understand and use tone and emotional context in conversation.
The model is a variant of the Llama architecture: text tokens come from a Llama tokenizer, and audio is processed with Mimi, a split-RVQ tokenizer.
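To make the "split-RVQ" part concrete, here is a minimal sketch of residual vector quantization, the technique behind Mimi-style audio tokenizers: each stage quantizes the residual left by the previous stage, so a frame becomes a small stack of codebook indices. The codebook sizes and dimensions below are illustrative, not Mimi's actual configuration.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage picks the nearest
    codeword to the residual left by the previous stage."""
    codes = []
    residual = x.astype(float)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    # reconstruction is the sum of the chosen codewords across stages
    return sum(cb[i] for i, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
# 3 stages, 16 entries each, embedding dim 4 (toy numbers)
codebooks = [rng.normal(size=(16, 4)) for _ in range(3)]
x = rng.normal(size=4)
codes = rvq_encode(x, codebooks)   # e.g. one token per stage
x_hat = rvq_decode(codes, codebooks)
```

Each audio frame thus yields one token per quantizer stage, which is what lets the decoder model audio as discrete sequences alongside text tokens.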
Trained on ~1M hours of predominantly English audio.
Model Sizes:
- Tiny: 1B backbone, 100M decoder
- Small: 3B backbone, 250M decoder
- Medium: 8B backbone, 300M decoder
Evaluation
The researchers found that current publicly available evaluation methods were already saturated. In order to make meaningful gains with their CSM, they created their own evaluations.
Beyond Word Error Rate and Speaker Similarity measures, they added analyses of Pronunciation and Consistency.
Both the Small and Medium models score better than OpenAI's models on Homograph Accuracy (word pronunciation).
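For reference, Word Error Rate is the word-level edit distance between a reference transcript and a hypothesis, normalized by the reference length. A minimal sketch (the example sentence with the homograph "lead" is my own, not from the paper):

```python
def wer(ref, hyp):
    """Word Error Rate: (substitutions + deletions + insertions) / len(ref),
    computed via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)] / len(r)

print(wer("the lead guitarist took the lead",
          "the leed guitarist took the leed"))  # 2 errors / 6 words
```

Note that WER only checks the transcribed words; it cannot tell whether "lead" was pronounced as the metal or the verb, which is exactly why a separate Homograph Accuracy metric is needed.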
Subjective metrics include a Comparative Mean Opinion Score (CMOS) using the Expresso dataset, which includes emotional variations. Human evaluators compared audio generated by the model against a ground-truth human sample.
When given no context, evaluators actually favored the CSM over the actual human sample as 'more like human speech', 52.9% to 47.1%.
When provided with context, 66.7% chose the actual human sample.
Open Source
The models will be available under the Apache 2.0 license. GitHub repo
My Evaluation
- The model can respond with, at minimum, these tones: happy, sad (or at least low-key), and presentational
- Does not let you interrupt it, unlike Kyutai's model
- Handles memory well and seems pretty good at only referring to memories if relevant
System Prompt
I believe I was able to extract the system prompt:
"Respond in the manner of a well educated, witty (young woman). Always strive for charm and intellectual stimulation. Avoid vulgarity. Appearances are everything."