LumenVox speech products (TTS)

Overview

Neural Text-to-Speech (TTS) converts supplied text into an audio file or audio stream. The input is either plain text or an SSML (Speech Synthesis Markup Language) document. Audio can be retrieved as a single byte array or in segments (chunks). Client applications can then play back the resulting audio as required.
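
As a rough illustration of the two retrieval modes, the sketch below joins streamed chunks back into a single byte array. It does not use the LumenVox API: fake_audio_chunks is a hypothetical stand-in for a streaming response, and collect_audio is an illustrative helper name.

```python
from typing import Iterable, Iterator


def fake_audio_chunks(audio: bytes, chunk_size: int = 4) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response: yields fixed-size chunks."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def collect_audio(chunks: Iterable[bytes]) -> bytes:
    """Join streamed chunks into one byte array, as a client might before playback."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
    return bytes(buffer)


if __name__ == "__main__":
    audio = collect_audio(fake_audio_chunks(b"0123456789"))
    print(len(audio), "bytes collected")
```

A client that plays audio as it arrives would consume each chunk directly instead of buffering the whole stream first.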

Each supported language comes with a choice of distinctive male or female voices, so an implementation can respond to users in the voice that suits them best.

SSML

SSML gives software rich control over Text-to-Speech (TTS) synthesis by providing precise instructions that modify the generated voice. LumenVox supports SSML 1.0 for controlling the pronunciation, tone, and stress of synthesized speech. With LumenVox you can choose a language, pick a voice, and customize your selection if needed.
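
To make the language, voice, and customization choices concrete, here is a minimal sketch that builds an SSML 1.0 document with Python's standard library. The element and attribute names (speak, voice, prosody, xml:lang) come from the W3C SSML 1.0 specification. The voice name "ExampleVoice" and the helper build_ssml are placeholders, not LumenVox identifiers.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, lang: str = "en-US", voice_name: str = "ExampleVoice") -> str:
    """Build a small SSML 1.0 document that selects a language and voice.

    "ExampleVoice" is a placeholder; use a voice name your deployment provides.
    """
    # Serialize the SSML namespace as the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice_name})
    # prosody customizes delivery; "slow" is one of the rate labels in SSML 1.0.
    prosody = ET.SubElement(voice, f"{{{SSML_NS}}}prosody", {"rate": "slow"})
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Welcome to our service."))
```

Building the document with an XML library rather than string concatenation means special characters in the text are escaped automatically.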

Text Normalization for TTS

Input text contains not only words but also other written elements, such as numbers, dates, acronyms, abbreviations, symbols, and punctuation. All such elements must first be converted to actual words or pauses before they are synthesized. This conversion is performed internally by the synthesizer. Overall, LumenVox follows W3C standards for normalization.
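
The idea of normalization can be shown with a deliberately tiny sketch. This is not how the LumenVox synthesizer works internally: the abbreviation table, the digit-by-digit reading of numbers, and the function name normalize are all assumptions chosen to keep the example short.

```python
import re

# Toy lookup tables; a real normalizer is context-sensitive and far larger.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "%": " percent"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def normalize(text: str) -> str:
    """Expand a few abbreviations and symbols, then read digits one by one."""
    for written, spoken in ABBREVIATIONS.items():
        text = text.replace(written, spoken)
    # Replace every digit with its word, padded so neighbours stay separated.
    text = re.sub(r"\d", lambda m: f" {DIGITS[int(m.group())]} ", text)
    # Collapse the extra whitespace introduced above.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize("Dr. Lee pays 5% at 42 Main St."))
```

Even this toy shows why normalization is hard: "St." could mean "Street" or "Saint", and "42" could be read as "forty-two", so a production normalizer has to use context.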

Extending Text Normalization

For some of our TTS voices, customers can extend text normalization to handle more cases. This is accomplished via PLS (Pronunciation Lexicon Specification) lexicons.
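
As a hedged sketch of what such a lexicon can look like, the code below generates a PLS 1.0 document that maps written forms to spoken expansions using the lexeme, grapheme, and alias elements from the W3C PLS 1.0 specification. The helper name build_lexicon and the sample entry are illustrative. Which voices accept lexicons, and how a lexicon is loaded, is not covered here.

```python
import xml.etree.ElementTree as ET
from typing import Mapping

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(aliases: Mapping[str, str], lang: str = "en-US") -> str:
    """Build a PLS 1.0 lexicon mapping written forms to spoken expansions."""
    # Serialize the PLS namespace as the default namespace (no prefix).
    ET.register_namespace("", PLS_NS)
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        # version, alphabet and xml:lang are required on the root element.
        {"version": "1.0", "alphabet": "ipa", f"{{{XML_NS}}}lang": lang},
    )
    for written, spoken in aliases.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = written
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = spoken
    return ET.tostring(lexicon, encoding="unicode")


if __name__ == "__main__":
    print(build_lexicon({"W3C": "World Wide Web Consortium"}))
```

The alias element substitutes plain words for the written form; PLS also offers a phoneme element for cases where the pronunciation itself must be spelled out.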

The “say-as” element

The SSML “say-as” element lets users annotate fragments of text to force a specific interpretation or pronunciation. It should be used only when the default normalization rules fail and produce speech that differs from what users expect. Refer to the LumenVox Knowledgebase for more information on TTS customization using SSML.
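
A short sketch of the annotation follows. The interpret-as and format attributes are defined by SSML 1.0, but the specification leaves their values open. The values "date", "mdy", and "characters" used here come from the separate W3C Note on say-as attribute values, and whether a given voice honors them is an assumption to verify. The helpers say_as and speak are illustrative names.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap a text fragment in an SSML say-as element."""
    attrs = f" interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    # escape() protects characters such as & and < inside the fragment.
    return f"<say-as{attrs}>{escape(text)}</say-as>"


def speak(body: str, lang: str = "en-US") -> str:
    """Wrap already-escaped SSML content in a speak root element."""
    return (f'<speak version="1.0" xmlns="{SSML_NS}" '
            f"xml:lang={quoteattr(lang)}>{body}</speak>")


if __name__ == "__main__":
    date = say_as("03/04/2025", "date", "mdy")
    print(speak(f"Your appointment is on {date}."))
```

Without the annotation, a normalizer has to guess whether 03/04/2025 means March 4 or April 3; the format hint removes that ambiguity.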
