Soul App's Open-Source Model Brings Human-like Naturalness to AI Podcasts

Soul AI Lab, the AI technology team behind the social platform Soul App, has officially open-sourced its voice podcast generation model, SoulX-Podcast. Designed specifically for multi-speaker, multi-turn dialogue scenarios, the model supports multiple languages and dialects, including Mandarin, English, Sichuanese, and Cantonese, along with controllable paralinguistic styles. It can stably generate natural, fluent multi-turn voice dialogues exceeding 60 minutes in length, with accurate speaker switching and rich prosodic variation.

Beyond podcast-specific applications, SoulX-Podcast also achieves outstanding performance in general speech synthesis and voice cloning tasks, delivering a more authentic and expressive voice experience.

Demo Page: https://soul-ailab.github.io/soulx-podcast
Technical Report: https://arxiv.org/pdf/2510.23541
Source Code: https://github.com/Soul-AILab/SoulX-Podcast
Hugging Face: https://huggingface.co/collections/Soul-AILab/soulx-podcast

Key Capabilities: Fluid Multi-Turn Dialogue, Multi-Dialect Support, Ultra-Long Podcast Generation

1. Zero-Shot Cloning for Multi-Turn Dialogue: In zero-shot podcast generation scenarios, SoulX-Podcast demonstrates exceptional speech synthesis capabilities. It not only accurately reproduces the timbre and style of the reference audio but also dynamically adapts prosody and rhythm to the dialogue context, keeping every conversation natural and rhythmically engaging. Whether in extended multi-turn dialogues or emotionally nuanced exchanges, SoulX-Podcast consistently maintains vocal coherence and authentic expression. Additionally, the model supports controllable generation of various paralinguistic elements, such as laughter and throat clearing, enhancing the immediacy and expressiveness of synthesized speech (see the sketch after this list).

2. Multi-Lingual and Cross-Dialect Voice Cloning: In addition to Mandarin and English, SoulX-Podcast supports several major Chinese dialects, including Sichuanese, Henanese, and Cantonese. More notably, the model achieves cross-dialect voice cloning: even when given only a Mandarin reference recording, it can generate natural speech with the phonetic characteristics of the target dialect.

3. Ultra-Long Podcast Generation: SoulX-Podcast supports the generation of ultra-long podcasts, exceeding 60 minutes, while consistently maintaining stable timbre and style throughout.
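To make these controls concrete, the sketch below shows one plausible way to script such an episode: one line of tagged text per speaker turn, reference clips anchoring each voice for zero-shot cloning, and inline tokens marking paralinguistic events. The tag syntax ([S1], <|laughter|>) and the generate_podcast entry point are hypothetical illustrations, not the repository's documented interface.

```python
# Sketch of scripting a multi-speaker episode as tagged text. The tag syntax
# ([S1], <|laughter|>) and generate_podcast() are hypothetical stand-ins for
# illustration; consult the SoulX-Podcast repository for the real interface.

def generate_podcast(script: str, speaker_refs: dict, dialect: str) -> bytes:
    """Placeholder for the model's inference entry point (hypothetical)."""
    turns = script.count("\n") + 1
    print(f"[stub] synthesizing {turns} turns in {dialect}")
    return b""  # a real call would return synthesized audio

dialogue = [
    ("[S1]", "Welcome back! Today we're talking about open-source speech models."),
    ("[S2]", "Thanks for having me <|laughter|>, a topic I never get tired of."),
    ("[S1]", "Let's talk dialects. Can a Mandarin reference voice cover Cantonese?"),
]

# One line per turn: the speaker tag drives turn-taking; inline tokens such as
# <|laughter|> mark paralinguistic events within a turn.
script = "\n".join(f"{tag} {text}" for tag, text in dialogue)

audio = generate_podcast(
    script=script,
    speaker_refs={"[S1]": "host_ref.wav", "[S2]": "guest_ref.wav"},  # zero-shot timbre anchors
    dialect="Cantonese",  # cross-dialect cloning from Mandarin-only references
)
```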

Voice-Focused: AI Reimagining Emotional Connections

Voice has always served as a vital medium for conveying both information and emotion, uniquely capable of imbuing communication with "emotional warmth" and a sense of "companionship." On Soul, users actively engage in real-time voice interactions to express themselves, share experiences, and build new relationships. As a result, voice has become an "emotional bond" that connects users, making "voice-based social interaction" one of the platform’s defining features.

In advancing AI-powered social networking, Soul has prioritized capabilities such as intelligent dialogue, voice generation, and emotionally expressive communication. Previously, the platform underwent a comprehensive upgrade of its end-to-end full-duplex voice call model, equipping AI with the ability to autonomously manage conversation flow. The AI can now proactively break silences, appropriately interject, listen while speaking, perceive temporal semantics, and engage in parallel discussions, delivering interactive dialogues that mirror daily life and offer a human-like experience of emotional companionship.

Concurrently, the team has introduced a suite of in-house developed voice foundation models, including large-scale models for voice generation, speech recognition, and voice dialogue. These capabilities have been deployed across diverse scenarios such as "AI Companions" and "Audio Partyrooms" (multi-user voice interaction environments).

For example, in September this year, two of Soul’s AI Companions — Meng Zhishi and Yu Ni — hosted a dialogue lasting approximately 40 minutes in an audio partyroom. Without any additional promotion and relying solely on the virtual humans’ organic reach, the event quickly went viral within the community. The room’s engagement metrics set a new platform record, receiving an enthusiastic response from a broad user base.

This success provided Soul’s AI technology and virtual IP operation teams with a key insight: "Virtual IP + AI Audio Dialogue" is emerging as a major growth driver within the virtual content ecosystem. It not only demonstrates the charismatic appeal and expressive power of virtual beings but also reveals new potential for AI in content creation and social interaction.

However, at that time, the industry lacked robust open-source podcast generation models capable of reliably supporting multi-turn natural dialogue. Moreover, when scaling from single-speaker monologues to multi-speaker conversations and long-form podcasts, systems commonly faced several challenges: over-reliance on the immediate text, which produces unnatural transitions for lack of broader dialogue awareness; insufficient control over paralinguistic elements and dialects, which leaves generated dialogue sounding robotic and lacking authentic interactivity; and difficulty adapting emotional state to the conversation while keeping speaker timbre consistent across turns, which ultimately breaks immersion.

To address these challenges, the Soul team decided to open-source SoulX-Podcast, aiming to collaborate with the AIGC community and collectively explore the vast possibilities of AI voice in content creation, social expression, and virtual ecosystems.

Collaborative Exploration: Expanding Possibilities for AI and Social Interaction

Although recent open-source research has begun to explore multi-speaker, multi-turn speech synthesis for podcast and dialogue scenarios, existing work remains largely confined to Mandarin and English, offering limited support for widely used Chinese dialects such as Cantonese, Sichuanese, and Henanese. Furthermore, in multi-turn voice dialogues, appropriate paralinguistic expressions, such as sighs, breaths, and laughter, are essential for enhancing vividness and naturalness, yet these nuances remain underexplored in current models.

SoulX-Podcast is designed to address these very gaps. By integrating support for extended multi-speaker dialogues, comprehensive dialect coverage, and controllable paralinguistic generation, the model brings synthesized podcast speech closer to real-world communication, making it more expressive, engaging, and immersive for listeners.

The overall architecture of SoulX-Podcast adopts the widely used "LLM + Flow Matching" paradigm for speech generation: the LLM predicts discrete semantic tokens, and a flow-matching module then maps them to acoustic features. For semantic token modeling, SoulX-Podcast is built upon the Qwen3-1.7B foundation model, initialized with its original parameters to fully leverage its robust language understanding capabilities.
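For intuition, the sketch below mocks this two-stage division of labor in PyTorch: a small Transformer stand-in for the LLM emits semantic tokens, and a flow-matching network, integrated with a simple Euler sampler, turns the token condition into mel-spectrogram frames. All module names, dimensions, and the eight-step sampler are illustrative assumptions, not the actual SoulX-Podcast implementation.

```python
import torch
import torch.nn as nn

# Illustrative two-stage "LLM + flow matching" pipeline. Shapes, names, and
# the Euler sampler are assumptions for exposition, not the real model.

class SemanticLM(nn.Module):
    """Stage 1 (stand-in for the Qwen3-based LLM): text -> semantic tokens."""
    def __init__(self, vocab=4096, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                            # (B, T) token ids
        return self.head(self.body(self.embed(tokens)))   # (B, T, vocab) logits

class VelocityField(nn.Module):
    """Stage 2: flow-matching net predicting d(mel)/dt from the noisy mel,
    the time t, and the semantic-token condition."""
    def __init__(self, mel_dim=80, cond_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, mel_dim))

    def forward(self, x_t, t, cond):                      # x_t: (B, T, mel_dim)
        t_feat = t.expand(*x_t.shape[:2], 1)              # broadcast time per frame
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

@torch.no_grad()
def sample_mel(field, cond, steps=8):
    """Euler integration of the learned ODE from noise (t=0) to mel (t=1)."""
    x = torch.randn(cond.shape[0], cond.shape[1], 80)
    for i in range(steps):
        t = torch.full((1,), i / steps)
        x = x + field(x, t, cond) / steps
    return x  # (B, T, 80) mel frames; a vocoder would render the waveform

lm, field = SemanticLM(), VelocityField()
tokens = torch.randint(0, 4096, (1, 50))     # toy "text" ids
cond = lm.embed(lm(tokens).argmax(-1))       # semantic tokens -> condition
mel = sample_mel(field, cond)
print(mel.shape)                             # torch.Size([1, 50, 80])
```

One reason this paradigm is popular for long-form speech: flow matching can synthesize acoustics in a handful of ODE steps rather than the long denoising chains of classic diffusion, which keeps hour-scale generation tractable.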

Although SoulX-Podcast is specifically designed for multi-speaker, multi-turn dialogues, it also demonstrates exceptional performance in conventional single-speaker speech synthesis and zero-shot voice cloning tasks. In podcast generation benchmarks, the model achieves top-tier results in both speech intelligibility and speaker similarity compared with recent related work.
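For context, "speech intelligibility" and "speaker similarity" in such benchmarks are conventionally measured by transcribing the synthesized audio with an ASR system and scoring word error rate against the input script, and by taking the cosine similarity between speaker embeddings of the reference and synthesized clips. The sketch below shows that standard recipe with stand-in components; the technical report's exact ASR model and speaker encoder are not assumed here.

```python
import torch
import torch.nn.functional as F
from jiwer import wer  # pip install jiwer

# Intelligibility: transcribe the synthesized audio with an ASR system, then
# score word error rate (WER) against the input script. Lower is better.
script     = "welcome back to the show"
transcript = "welcome back to the show"  # stand-in for a real ASR output
print("WER:", wer(script, transcript))   # 0.0 -> perfectly intelligible

# Speaker similarity: cosine similarity between speaker embeddings of the
# reference clip and the synthesized clip. Higher is better. A real pipeline
# would use a trained speaker encoder; this stub returns random vectors.
def embed(waveform: torch.Tensor) -> torch.Tensor:
    return torch.randn(192)  # 192-dim is a common speaker-embedding size

ref_wav, syn_wav = torch.randn(16000), torch.randn(16000)  # 1 s at 16 kHz
sim = F.cosine_similarity(embed(ref_wav), embed(syn_wav), dim=0)
print("speaker similarity:", round(float(sim), 3))
```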

The open-source release of SoulX-Podcast marks a significant milestone in Soul's engagement with the open-source community. The Soul AI technology team has announced plans to continue enhancing core interactive capabilities, including conversational speech synthesis, full-duplex voice calls, human-like expressiveness, and visual interaction, and to accelerate the integration of these technologies across diverse application scenarios. The ultimate goal is to deliver more immersive, intelligent, and emotionally resonant experiences that foster user well-being and a stronger sense of belonging.