Motivational Speech Synthesis

Text-to-motivational-speech with adjustable motivational factor to control motivational prosody

Artistic research deconstructing the performative excess of motivational western subcultures



With the growing popularity of fitness and entrepreneurship in Western subcultures, video clips of so-called motivational speech have received millions of views across social media. These audiovisual artifacts usually show excerpts from presentations or interviews in which people (in most cases male business leaders, authors, and other influential figures) lay out the instructions, principles, and strategies they consider optimal for success. Paired with epic and emotional background music, the videos are meant to act as a vehicle for self-motivation and goal pursuit. Addressing a primary target group of men, they often tie success to wealth, professional growth, or appeal to women, and present characteristics like weakness, fragility, or discontinuity as obstacles to it. For listeners, the ultimate goal of motivational speech is to acquire and shape a mindset that keeps them on the right path to achievement. Motivational speech emerges as a phenomenon in a society of self-optimization, embedded in an ethos of constant productivity, self-isolation, competition, and meritocracy.

Motivational speech has emerged as a popular audiovisual phenomenon within Western subcultures, conveying strategies and principles for individual success through expressive, high-energy delivery. This paper artistically explores methods for synthesizing the distinctive prosodic patterns inherent to motivational speech, while critically examining its sociocultural foundations. Drawing on recent advances in emotion-controllable text-to-speech (TTS) systems and speech emotion recognition (SER), we employ deep learning models and frameworks to replicate and analyze motivational speech. Within our proposed architecture, we introduce a one-dimensional motivational factor, which controls the intensity of motivational prosody. Situated within broader discourses on self-optimization and meritocracy, Motivational Speech Synthesis contributes to the field of emotional speech synthesis, while also prompting reflection on the societal values mediated through such narratives.



Proposed architecture

The following diagram illustrates our EmoKnob-based architecture for synthesizing motivational speech. Motivational intensity is controlled via averaged speaker embeddings, which we derive by selecting and averaging speech samples from our dataset that correspond to different levels of motivational intensity. These embeddings are generated in discrete increments along our one-dimensional motivational factor, ranging from 0 (low intensity) to 1 (high intensity). During inference, the embedding closest to the desired motivational factor is selected, which adjusts the emotional delivery of the generated speech.

Architecture
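
The selection step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `average_embeddings_by_level` and `select_embedding`, the list-based embedding format, and the five example levels are all hypothetical, and the sketch leaves out how speaker embeddings are extracted or passed to the TTS model.

```python
def average_embeddings_by_level(samples, levels):
    """Average speaker embeddings per motivational level.

    samples: list of (motivational_factor, embedding) pairs, where each
             embedding is a list of floats of equal length (assumed format).
    levels:  discrete increments along the motivational factor, e.g.
             [0.0, 0.25, 0.5, 0.75, 1.0] (assumed step size).
    """
    groups = {level: [] for level in levels}
    for factor, embedding in samples:
        # Assign each sample to the level closest to its motivational factor.
        nearest = min(levels, key=lambda level: abs(level - factor))
        groups[nearest].append(embedding)

    averaged = {}
    for level, members in groups.items():
        if not members:
            continue  # no samples for this level, so no embedding is stored
        dim = len(members[0])
        averaged[level] = [
            sum(vec[i] for vec in members) / len(members) for i in range(dim)
        ]
    return averaged


def select_embedding(averaged, motivational_factor):
    """Return (level, embedding) for the level closest to the requested factor."""
    if not 0.0 <= motivational_factor <= 1.0:
        raise ValueError("motivational_factor must lie in [0, 1]")
    level = min(averaged, key=lambda lv: abs(lv - motivational_factor))
    return level, averaged[level]
```

Because the embeddings exist only at discrete levels, a requested factor such as 0.3 resolves to the nearest available level rather than to an interpolated embedding.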

Motivational Factor

The visualization below presents a three-dimensional representation of emotional speech samples drawn from motivational speeches, projected into Valence-Arousal-Dominance (VAD) emotion space. Here, the scales range from negative to positive emotions (valence), calm to stimulated emotions (arousal), and submissive to dominant emotions (dominance). Each point represents a segment of motivational speech collected from YouTube, embedded using a deep learning-based speech emotion recognition model built on wav2vec 2.0. To distill these complex emotional patterns into a single interpretable value, we apply the UMAP dimensionality reduction algorithm, resulting in our motivational factor: a continuous scale ranging from 0 (low motivational intensity) to 1 (high motivational intensity). Explore the interactive plot to observe how different datapoints align along this motivational intensity spectrum.
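
As a rough sketch of how VAD points could be reduced to such a factor, the code below projects them to one dimension with the `umap-learn` package and rescales the result to the range 0 to 1. Several parts are assumptions rather than details from this project: the min-max rescaling, the `flip` option, the fixed random seed, and the helper names `project_to_1d` and `to_motivational_factor` are hypothetical.

```python
def project_to_1d(vad_points, seed=0):
    """Reduce (valence, arousal, dominance) triples to one UMAP dimension.

    Requires the third-party packages numpy and umap-learn; they are imported
    lazily so the rescaling helper below works without them.
    """
    import numpy as np
    import umap

    data = np.asarray(vad_points, dtype="float32")
    reducer = umap.UMAP(n_components=1, random_state=seed)
    return reducer.fit_transform(data)[:, 0].tolist()


def to_motivational_factor(projection, flip=False):
    """Min-max rescale a 1-D projection to [0, 1] (assumed normalization).

    flip=True reverses the scale. The direction of a UMAP axis is arbitrary,
    so which end means "high intensity" has to be checked separately.
    """
    lo, hi = min(projection), max(projection)
    if hi == lo:
        raise ValueError("projection has no spread; cannot rescale")
    scaled = [(value - lo) / (hi - lo) for value in projection]
    return [1.0 - s for s in scaled] if flip else scaled
```

UMAP coordinates are not bounded, so some rescaling is needed to reach a 0 to 1 scale; min-max is only one option. One way to decide whether `flip` is needed would be to compare the scaled values against arousal for a few known high-energy segments.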


Audio Examples

Below are synthesized audio samples demonstrating how our motivational factor influences speech prosody. For each of seven motivational speech prompts, we generated audio at varying motivational intensity levels, ranging from 0.00 (low motivational intensity) to 1.00 (high motivational intensity). Listen to these examples to perceive how adjusting the motivational factor affects the expressiveness, tone, and emotional delivery of synthesized motivational speech.

Anything you want, you can get.
0.00
0.25
0.50
0.75
1.00
No goal is too far away to be reached.
0.00
0.25
0.50
0.75
1.00
Nobody is coming to save you.
0.00
0.25
0.50
0.75
1.00
Reject weakness, embrace discipline.
0.00
0.25
0.50
0.75
1.00
Seeing, believing, planning and doing it.
0.00
0.25
0.50
0.75
1.00
Work hard in silence, shock them with your success.
0.00
0.25
0.50
0.75
1.00
You are strong, resilient, consistent, focused.
0.00
0.25
0.50
0.75
1.00

Although Motivational Speech Synthesis strives for artistic reflection, we want to emphasize that this project does not aim to judge anyone who actually benefits from motivational speech or similar phenomena. We do not expose or examine the people who consume motivational speech. Instead, we focus on deconstructing the circumstances and attitudes underlying these narratives, which appear as symptoms of a society driven by growth and success.