Motivational Speech Synthesis

Text-to-motivational-speech with adjustable motivational factor to control motivational prosody

Artistic research deconstructing the performative excess of motivational western subcultures



With the growing popularity of fitness and entrepreneurship in western subcultures, video clips of so-called motivational speech have received millions of views across social media. These audiovisual artifacts usually show excerpts from presentations or interviews in which people (in most cases male business leaders, authors, and other influential figures) lay out supposedly optimal instructions, principles, and strategies for success. Paired with epic and emotional background music, these videos are meant to act as a vehicle for self-motivation and goal pursuit. Addressing a primary target group of men, they often tie success to wealth, professional growth, or appeal to women, while portraying characteristics like weakness, fragility, or discontinuity as obstacles to it. A listener's ultimate goal with motivational speech is to acquire and shape a mindset that keeps them on the right path to achievement. Motivational speech emerges as a phenomenon in a society of self-optimization, embedded in an ethos of constant productivity, self-isolation, competition, and meritocracy.

Motivational Speech Synthesis aims to reflect the inherent patterns and attitudes of motivational speech. Motivational speech generalizes toward one universal way to success and anticipates a forward movement into the listener’s future. Mirroring both, we use deep learning techniques to average web-scraped motivational speech into a text-to-motivational-speech model. A one-dimensional motivational factor lets the model scale its motivational prosody. Representing the promise of social mobility, this parameter aligns with phrases like “The harder you work, the more you can get.” With this work, we question the idea of a universal road to success that is defined and reached only through individual strength and the right mindset, and we critically examine the sociocultural foundations of motivational speech. Motivational Speech Synthesis addresses aspects of our work ethic and how we approach our goals and challenges in life.



Proposed architecture

The following diagram illustrates our EmoKnob-based architecture for synthesizing motivational speech. Motivational intensity is controlled via averaged speaker embeddings, which we derive by selecting and averaging speech samples from our dataset that correspond to different levels of motivational intensity. These embeddings are generated in discrete increments along our one-dimensional motivational factor, ranging from 0 (low intensity) to 1 (high intensity). During inference, the embedding closest to the desired motivational factor is selected, which allows the emotional delivery of the generated speech to be adjusted in steps.
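The averaging and nearest-level selection described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the use of NumPy, and the assignment of each sample to its nearest level are assumptions, and the speaker embeddings are treated as plain vectors from some unspecified encoder.

```python
import numpy as np

def build_level_embeddings(embeddings, factors, levels):
    """Average the embeddings of all samples whose factor is nearest to each level.

    embeddings: (n, d) speaker embeddings (the encoder is assumed, not specified here)
    factors:    (n,) motivational factor per sample, in [0, 1]
    levels:     discrete factor values at which averaged embeddings are stored
    """
    embeddings = np.asarray(embeddings, dtype=float)
    factors = np.asarray(factors, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Index of the nearest level for every sample.
    nearest = np.abs(factors[:, None] - levels[None, :]).argmin(axis=1)
    table = {}
    for i, level in enumerate(levels):
        members = embeddings[nearest == i]
        if len(members):  # levels without any samples are skipped
            table[float(level)] = members.mean(axis=0)
    return table

def select_embedding(table, factor):
    """Return (level, embedding) for the stored level closest to the requested factor."""
    level = min(table, key=lambda stored: abs(stored - factor))
    return level, table[level]
```

At inference time, a call such as `select_embedding(table, 0.8)` would return the averaged embedding stored for the nearest available level, which would then condition the synthesis model.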

Architecture

Motivational Factor

The visualization below presents a 3-dimensional representation of emotional speech samples drawn from motivational speeches, projected into Valence-Arousal-Dominance (VAD) emotion space. The axes range from negative to positive emotions (valence), calm to stimulated emotions (arousal), and submissive to dominant emotions (dominance). Each point represents a segment of motivational speech collected from YouTube, embedded using a deep learning-based speech emotion recognition model built on wav2vec 2.0. To distill these complex emotional patterns into a single interpretable value, we apply the UMAP dimensionality reduction algorithm. The result is our motivational factor, a continuous scale ranging from 0 (low motivational intensity) to 1 (high motivational intensity). Explore the interactive plot to observe how different datapoints align along this motivational intensity spectrum.
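One way to turn VAD points into a single value in [0, 1] is sketched below. It is only an illustration under stated assumptions: the function names are hypothetical, the projection uses the `umap-learn` package with one output component, and the min-max scaling step is an assumed way of mapping the projection onto the 0 to 1 range. The direction of a UMAP axis is arbitrary, so the sketch offers an optional `flip` argument; how the project orients its scale is not stated in the text.

```python
import numpy as np

def reduce_to_1d(vad_points, random_state=0):
    """Project (n, 3) VAD points to one dimension with UMAP (requires umap-learn)."""
    import umap  # imported lazily so the scaling helper works without umap-learn
    reducer = umap.UMAP(n_components=1, random_state=random_state)
    return reducer.fit_transform(np.asarray(vad_points, dtype=float))[:, 0]

def to_motivational_factor(projection, flip=False):
    """Min-max scale a 1-D projection to [0, 1]; optionally reverse its direction."""
    p = np.asarray(projection, dtype=float)
    span = p.max() - p.min()
    if span == 0:
        # All points coincide, so no ordering can be derived.
        return np.zeros_like(p)
    scaled = (p - p.min()) / span
    return 1.0 - scaled if flip else scaled
```

With both pieces, `to_motivational_factor(reduce_to_1d(vad_points))` would yield one factor per speech segment.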


Audio Examples

Below are synthesized audio samples demonstrating how our motivational factor influences speech prosody. For each of seven motivational speech prompts, we generated audio at varying motivational intensity levels, ranging from 0.00 (low motivational intensity) to 1.00 (high motivational intensity). Listen to these examples to perceive how adjusting the motivational factor affects the expressiveness, tone, and emotional delivery of synthesized motivational speech.

Motivational factor levels: 0.00, 0.25, 0.50, 0.75, 1.00

Prompts:
Anything you want, you can get.
No goal is too far away to be reached.
Nobody is coming to save you.
Reject weakness, embrace discipline.
Seeing, believing, planning and doing it.
Work hard in silence, shock them with your success.
You are strong, resilient, consistent, focused.

Even though Motivational Speech Synthesis strives for artistic reflection, we want to emphasize that this project does not aim to judge anyone who actually benefits from motivational speech or similar phenomena. We do not expose or scrutinize people who consume motivational speech. Instead, we focus on deconstructing the underlying circumstances and attitudes of those narratives, which arise as symptoms of a society driven by growth and success.