UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations

Abstract

Spoken dialogue models currently lack fine-grained control over speech style, a capability that is critical for human-like interaction but often overlooked in favor of purely functional capabilities such as reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key stylistic dimensions: emotion, speed, volume, accent, language, and composite styles.

Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on the multi-dimensional control tasks defined in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis.
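The two headline metrics are reported in different units: the MOS gain as a percentage and the IFR gain in percentage points. The minimal sketch below shows the difference, assuming the MOS figure is a relative change over the base model and that IFR is the share of responses judged to follow the style instruction. Both readings are assumptions, not definitions taken from the paper. The function names and all numbers are hypothetical and are not results from UltraVoice.

```python
from typing import Sequence


def relative_improvement(base: float, tuned: float) -> float:
    """Relative change in percent (assumed reading of the MOS gain)."""
    return (tuned - base) / base * 100.0


def instruction_following_rate(followed: Sequence[bool]) -> float:
    """Percentage of responses judged to follow the style instruction.

    Assumed definition; the per-response judgments are taken as given.
    """
    return 100.0 * sum(followed) / len(followed)


def percentage_point_gain(base_rate: float, tuned_rate: float) -> float:
    """Absolute difference between two rates that are already in percent."""
    return tuned_rate - base_rate


# Made-up values, for illustration only.
base_mos, tuned_mos = 2.0, 2.5
base_ifr = instruction_following_rate([True, False, False, True])
tuned_ifr = instruction_following_rate([True, True, False, True])

print(relative_improvement(base_mos, tuned_mos))   # relative gain in %
print(percentage_point_gain(base_ifr, tuned_ifr))  # gain in percentage points
```

Under these assumptions, a rate that rises from 50% to 75% gains 25 percentage points, even though the relative change is 50%. This is why the two figures in the abstract are not directly comparable.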

Dataset Overview

**Overview of the UltraVoice Dataset Construction and Stylistic Coverage.** The upper-left section details the four-step construction process: text corpus curation, style injection and response generation, stylized speech synthesis, and quality control and filtering. The ring chart on the right visualizes the dataset’s control dimensions (inner ring) and their finer-grained sub-dimensions (outer ring). The lower panel provides examples of the six speech style dimensions: emotion, speed, volume, language, accent, and composite styles (e.g., combinations of speed, volume, and emotion).

830+ Hours of Speech

6 Style Dimensions

Sub-dimensions

100K+ Dialogues

Fine-Grained Style Control Dimensions

Explore our dataset's comprehensive coverage of speech style dimensions. Each dimension contains multiple sub-types with instruction-response pairs demonstrating fine-grained control.
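One way to picture the organization described above is a flat list of records, each tagged with a control dimension and a sub-type. The sketch below is purely illustrative: the `StyleSample` class, its field names, and the example sub-types and texts are hypothetical and do not reflect the dataset's actual schema. Only the six dimension names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# The six control dimensions named in the abstract.
DIMENSIONS = ("emotion", "speed", "volume", "accent", "language", "composite")


@dataclass(frozen=True)
class StyleSample:
    """Hypothetical record for one instruction-response pair."""
    dimension: str      # one of DIMENSIONS
    sub_dimension: str  # finer sub-type; values here are placeholders
    instruction: str    # spoken query carrying the style instruction
    response: str       # text of the stylized spoken response

    def __post_init__(self) -> None:
        # Reject dimensions outside the six listed above.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")


def group_by_dimension(samples):
    """Group samples into a dict keyed by control dimension."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample.dimension].append(sample)
    return dict(groups)


# Placeholder samples, invented for illustration only.
samples = [
    StyleSample("speed", "fast", "Answer quickly: what is a haiku?", "..."),
    StyleSample("speed", "slow", "Answer slowly: what is a haiku?", "..."),
    StyleSample("volume", "low", "Answer quietly: what is a haiku?", "..."),
]

for dimension, items in group_by_dimension(samples).items():
    print(dimension, [s.sub_dimension for s in items])
```

Keeping the dimension and sub-type as explicit fields makes it simple to select samples for a single style, for example when evaluating one control dimension at a time.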

Base Model vs. SFT Model Comparison

Compare the performance of the base model and the fine-tuned model on various style control tasks. The table below shows representative samples across different control dimensions.

Control Dimension	Sub-Dimension	Audio Query	Base Model	SFT Model