Offering overview

LumenVox Speech Technology

As a technology leader for over two decades, LumenVox has continued to evolve its technology. Always at the cutting edge, we enable modern, advanced, and precise applications, deployed by thousands of partners to multitudes of customers and end-users. Our offerings have the following advantages:

Cutting edge end-to-end DNN
1. Built on a foundation of AI (Artificial Intelligence) and deep machine learning, and using the latest algorithms in convolutional neural networks, our speech products outpace competitors and deliver the most accurate speech-enabled voice experiences. In every test or proof of concept, partners are in awe of the results.
Platform independence
1. Recently re-architected for containerization and micro-services, our products are built as cloud-native, and can be deployed on any operating system and computing platform.
Flexible deployment
1. Using open-source container orchestration software like Kubernetes or kubeadm (depending on size), our software can be deployed on-prem, in a private or public cloud, or even in multi-cloud or hybrid models.
Modern, flexible communication protocols
1. Although we recommend, and have implemented, what we consider the best communication protocol for integrating with us, partners can choose from several options covering all popular industry standards.
Feature completeness to manage deployments
1. No deployment is complete without access to the data it generates. Using our APIs, you can access data and create management reports. We also provide a web interface to manage deployments, customize configuration, and run diagnostics, ensuring smooth operations, as well as an interface to analyze the performance of your speech application.
Partner success
1. We offer technology, not professional services solutions, so we don’t compete with our partners; however, we stick around to ensure our clients are successful and use best practices. Our flexible licensing is another expression of commitment to partner success.
Voice Biometrics technologies – part of our technology stack
1. Voice Biometrics technologies deepen our expertise in voice and audio processing. In addition to improving performance of speech applications, this technology enables biometric authentication, and provides the convenience of one trusted partner for clients interested in both.

Speech Products Summary

LumenVox enables integration of systems and applications with speech technology, covering:

ASR and Transcription ASR
- Automatic Speech Recognition (ASR) converts speech to text. It is available in streaming mode, typically using predefined grammars, as well as in offline and batch modes, typically for long, free-form transcription based on statistical language models (SLMs).
TTS
- Text to Speech (TTS), taking a text file and converting it into audio, which can then be played back by the client application, responding to end users with spoken words.
CPA / AMD
- Call Progress Analysis (CPA) with Answering Machine Detection (AMD) distinguishes machines from live humans and business from residence, and accurately delivers human responders to agents, or messages into voice recording, always in perfect timing.
NLU
- Natural Language Understanding (NLU) assists in understanding a speaker’s intent by applying natural language processing to supplied text. It includes several subproducts, such as Sentiment Analysis, Call Summarization, Language Detection, and Language Translation.
Speaker Diarization
- Detects the different speakers within a mono audio recording.
Voice Biometrics
- When implemented for authentication, voice prints are collected from end-users and later compared to real-time audio. Anti-fraud measures further safeguard the system. Our voice biometrics is deeply integrated in the technology stack: it is used within our speech recognition and, conversely, our ASR is used to verify spoken passwords.

Refer to the LumenVox Containerized Voice Biometrics Product Guide for information specific to implementing voice biometrics authentication.

Incoming Speech Channels

The LumenVox Speech products are available for use through a multitude of channels. A customer integration with LumenVox obtains audio samples from the channel and communicates with the LumenVox API using industry-standard protocols. The channels that can be used include:

Telephony (inbound or outbound)
IVR (Interactive Voice Response system) utilizing mobile or landlines
Smartphone applications
Web applications
Desktop applications
Video calls (if the audio can be provided)
Messaging platforms (e.g., WhatsApp, Messenger, etc.)
Virtual assistants, multimodal chatbots and conversational AI apps

Use Cases for Deployment of LumenVox Speech Products

Our speech products support many use cases. Some examples are:

ASR
1. Use within IVRs for call flow routing
2. Use by mobile applications or smartphone assistants for verbal requests from the user
3. Use within motor vehicles for verbal requests from the driver
4. Voice-bots and interactive mobile, web and multimodal applications
5. Live/offline call center agent and customer transcription
6. Live transcription of sports and news events
7. Subtitling and transcription of movies and television programs
8. Medical dictation
9. Adding speech recognition to hardware devices
TTS
1. Use within IVRs for call flow routing
2. Use by mobile applications or smartphone assistants for verbal responses back to the user
3. Use within motor vehicles for responses to the driver
4. Voice-bots and interactive mobile, web and multimodal applications
5. Navigation systems, e.g., providing audio for directions
6. Converting books, articles, or other media into audio format
7. Public address (PA) systems: text can be converted to speech for announcements, e.g., within airports
8. Outbound message delivery with custom messages (appointment reminders, etc.)
CPA / AMD
1. Outbound call detection of automated answering, fax, human, residence, or commercial
2. Connecting predictive dialer applications to live customers and live agents
3. Busy, fax and SIT tone detection to optimize agent time (also part of predictive dialing)
4. Predictive dialing for precise message payload delivery, for time-sensitive message services such as medical appointment reminders
Voice Biometrics
1. The LumenVox Containerized Voice Biometrics Product Guide describes use case details.

Definitions

ASR – automatic speech recognition, software that takes human speech audio as input and provides the spoken text as output. ASR is sometimes referred to as STT (Speech to Text).

TTS – text to speech, the opposite of ASR: software that takes as input text that needs to be communicated to a human listener and provides synthesized audio output, which can be played back to a user.

IVR – interactive voice response, a system with automated menu choices that uses both ASR and TTS to interact with human callers. The system prompts for a choice, the user says their choice, and the system responds and acts accordingly.

Directed Dialog – the specification of what the caller will be asked and of all the options for system action based on every possible response. Directed dialogs follow specified paths until the end of the dialog. Part of the specification is a grammar of the possible answers.

Grammar – usually hand-crafted, it is a ruleset that models what users are expected to say. It acts as a filter, limiting the comparison and constraining the acoustic search when the ASR tries to identify what a user said. In addition to providing the set of words users are expected to speak, a grammar also provides some semantic interpretation. For example, it can tell you to interpret all items in a group like {‘yes’, ‘yeah’, ‘ok’, ‘of course’, ‘sure’, ‘that’s fine’, ‘got it’} as =YES.
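
As an illustration only, the semantic interpretation role of a grammar can be sketched in a few lines of Python. The GRAMMAR dictionary, the NO phrases, and the interpret function are hypothetical names and data chosen for this example. Production grammars are written in a dedicated grammar format rather than as a Python dictionary.

```python
# Hypothetical toy grammar: each semantic tag maps to the phrases that trigger it.
# The YES group comes from the definition above; the NO group is an assumption.
GRAMMAR = {
    "YES": ["yes", "yeah", "ok", "of course", "sure", "that's fine", "got it"],
    "NO": ["no", "nope", "no thanks"],
}


def interpret(utterance, grammar=GRAMMAR):
    """Return the semantic tag for an utterance, or None if it is out of grammar."""
    # Normalize case, curly apostrophes, and repeated whitespace before matching.
    normalized = " ".join(utterance.lower().replace("\u2019", "'").split())
    for tag, phrases in grammar.items():
        if normalized in phrases:
            return tag
    return None  # the closed set of rules does not cover this utterance


if __name__ == "__main__":
    for text in ["Of course", "nope", "maybe later"]:
        print(text, "->", interpret(text))
```

Anything outside the listed phrases returns None, which mirrors the closed-set nature of grammar-based recognition.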

Grammar Based ASR – automatic speech recognition based on the closed-set rules of a grammar.

Transcription ASR – automatic speech recognition based on the open rules of a statistical language model (SLM).

Continuous Transcription – in continuous mode, the engine processes the incoming audio one chunk at a time. Voice activity detection (VAD) detects when speech ends, and the audio processing then performs word prediction (a first pass, a final pass, and post-processing) for every chunk, without waiting for the end of the audio.
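
The chunk-at-a-time behavior can be sketched as follows. This is a minimal illustration, not the engine's actual voice activity detection: it assumes a simple average-energy threshold, and the function name, threshold, and frame counts are hypothetical.

```python
def split_on_silence(frames, threshold=500, min_silence_frames=3):
    """Yield one list of frames per speech chunk.

    frames: iterable of lists of 16-bit sample values.
    A chunk ends once min_silence_frames consecutive low-energy frames follow speech.
    """
    chunk, pending = [], []
    for frame in frames:
        energy = sum(abs(s) for s in frame) / len(frame) if frame else 0
        if energy >= threshold:
            chunk.extend(pending)   # keep short pauses inside the chunk
            pending = []
            chunk.append(frame)
        elif chunk:
            pending.append(frame)
            if len(pending) >= min_silence_frames:
                yield chunk         # end of speech: hand the chunk to recognition
                chunk, pending = [], []
    if chunk:
        yield chunk                 # flush whatever speech remains at end of audio


if __name__ == "__main__":
    speech, silence = [1000] * 4, [0] * 4
    stream = [speech, speech, silence, speech, silence, silence, silence, speech]
    for i, c in enumerate(split_on_silence(stream)):
        print("chunk", i, "has", len(c), "frames")
```

Because the function is a generator, each chunk becomes available as soon as its trailing silence is seen, which is the point of continuous mode.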

Enhanced Transcription – in enhanced transcription mode, transcription is combined with grammars. This allows for natural language processing (NLP) tasks such as slot filling and entity recognition.
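
A rough sketch of slot filling over a transcript is shown below. It stands in for grammars with regular expressions, which is an assumption made for brevity. The slot names, patterns, and the fill_slots function are hypothetical.

```python
import re

# Hypothetical slot rules: each slot name maps to a pattern with one capture group.
SLOT_RULES = {
    "amount": re.compile(r"\b(\d+(?:\.\d{2})?) dollars\b"),
    "account_type": re.compile(r"\b(checking|savings)\b"),
}


def fill_slots(transcript, rules=SLOT_RULES):
    """Return a dict of slot name -> matched value for every rule found in the text."""
    text = transcript.lower()
    slots = {}
    for name, pattern in rules.items():
        match = pattern.search(text)
        if match:
            slots[name] = match.group(1)
    return slots


if __name__ == "__main__":
    print(fill_slots("Please move 250 dollars from my Savings account"))
```

The free-form transcript supplies the words, and the rules pick out only the pieces an application needs.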

Statistical Language Model (SLM) – a model of an entire language specifying relationships between words. Essentially a very large database of word relationships and their probabilities, created by machine learning algorithms. SLMs are used for speech recognition as an alternative to a grammar. An ASR engine uses grammars or SLMs to predict the next word during a recognition process.

Acoustic Model – a database of the sounds of a language and the text they correspond to, that is, the relationship between sounds and the written form of the language. An ASR engine compares incoming audio to this model and outputs words built from the sounds. These are strings of letters that are treated as a first draft of the words (the final word output is formed when enforcing the SLM prediction or grammar rules).

Training an acoustic model – running thousands of hours of audio with transcribed speech through automated training which compares the transcripts to the sounds. The training process captures how words are pronounced by a diverse set of people and builds the database of sound and text.

Training a language model – running millions of lines of text through automated training, which captures the word relationships and builds a database of word relationship probabilities.
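
To make the idea of a database of word relationship probabilities concrete, here is a minimal bigram sketch. It is far simpler than a production statistical language model, and the function names and the three-line corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(lines):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(model, word):
    """Return {next_word: probability} for the given word, or {} if unseen."""
    followers = model.get(word, Counter())
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}


if __name__ == "__main__":
    corpus = [
        "i want to pay my bill",
        "i want to check my balance",
        "i need to pay",
    ]
    model = train_bigram_model(corpus)
    print(next_word_probabilities(model, "to"))
```

With this toy corpus, "pay" follows "to" in two of three cases, so it is the more probable next word.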

Natural Language Understanding (NLU)

This is a subfield of artificial intelligence (AI) and computational linguistics focused on the ability of a computer system to understand and interpret human language as it is naturally spoken or written. Unlike simple keyword matching, NLU involves comprehending the nuances, context, and intent behind user inputs. It is used to deliver products like Call Summarization and Sentiment Analysis.

Call Summarization

The process of automatically generating a concise summary of the content and key points discussed during a phone call or voice conversation, based on the transcribed text.

Sentiment Analysis

Also known as opinion mining, sentiment analysis is a process used to identify, extract, and quantify subjective information from text data. It involves determining the emotional tone, attitude, or sentiment expressed in a piece of text. This sentiment can typically be categorized as positive, negative, or neutral.

Language Identification

Also known as language detection or language recognition, language identification is the process of determining the language of a given piece of text or spoken content. It involves analyzing the input to identify the language in which it is written or spoken.

Language Translation

The process of converting text or spoken content from one language (the source language) into another language (the target language) while preserving the meaning and context of the original message.

Speaker Diarization

The process of partitioning an audio recording into segments based on who is speaking when. It involves identifying and labeling different speakers within an audio stream, which is crucial for understanding multi-speaker environments such as meetings, interviews, and call centers. Essentially, it answers the questions: "Who spoke?" and "When did they speak?"
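
The "who spoke when" output can be pictured as a list of labeled time segments. The sketch below assumes that some upstream step has already assigned a speaker label to each fixed-length frame (None for silence) and only shows how those labels collapse into segments. The function name and the 0.5-second frame length are hypothetical.

```python
from itertools import groupby


def frames_to_segments(frame_labels, frame_seconds=0.5):
    """Collapse per-frame speaker labels into (speaker, start_s, end_s) segments."""
    segments, start = [], 0
    for speaker, run in groupby(frame_labels):
        end = start + len(list(run))
        if speaker is not None:          # skip runs of silence
            segments.append((speaker, start * frame_seconds, end * frame_seconds))
        start = end
    return segments


if __name__ == "__main__":
    labels = ["A", "A", "B", "B", "B", None, "A"]
    for speaker, begin, finish in frames_to_segments(labels):
        print(f"speaker {speaker}: {begin:.1f}s to {finish:.1f}s")
```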

At the heart of LumenVox’ proprietary technology are the algorithms that create the models, and the algorithms that decipher them.

Speech audio formats supported

Uncompressed 16-bit signed little-endian samples (Linear PCM). Only used for mono-channel audio files.
8-bit audio samples using G.711 PCMU/mu-law. Only used for mono-channel audio files.
8-bit audio samples using G.711 PCMA/a-law. Only used for mono-channel audio files.
WAV-formatted audio. These files contain headers that specify the format, bit rate, number of channels, etc. Used for mono, stereo, or multi-channel audio. All channels must be linear16, mu-law, or a-law. Required for multiple audio channel processing.
FLAC-formatted audio. These files contain headers that specify the format, bit rate, number of channels, etc. Used for mono, stereo, or multi-channel audio. All channels must be linear16, mu-law, or a-law.
MP3 formatted audio
OPUS formatted audio
M4A formatted audio
Audio packed into MP4 container
GSM Audio
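
The relationship between the headerless formats and WAV can be sketched with the Python standard library. The example below decodes 8-bit G.711 mu-law codes to 16-bit little-endian linear PCM, wraps PCM in a WAV container, and reads back the header fields. It does not use any LumenVox API. The function names are hypothetical, and the 8000 Hz default sample rate is an assumption.

```python
import io
import struct
import wave

MULAW_BIAS = 0x84  # bias constant used by G.711 mu-law companding


def mulaw_to_linear16(code):
    """Decode one 8-bit G.711 mu-law code to a signed 16-bit linear sample."""
    code = ~code & 0xFF                      # mu-law codes are stored bit-inverted
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if code & 0x80 else magnitude


def mulaw_bytes_to_pcm16(data):
    """Convert mu-law bytes to 16-bit signed little-endian PCM bytes."""
    samples = [mulaw_to_linear16(b) for b in data]
    return struct.pack("<%dh" % len(samples), *samples)


def make_wav(pcm16, channels=1, sample_rate=8000):
    """Wrap raw 16-bit PCM bytes in a WAV container (header plus data)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)                  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm16)
    return buffer.getvalue()


def describe_wav(wav_bytes):
    """Read the header fields of an uncompressed PCM WAV held in memory."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "frames": wav.getnframes(),
        }


if __name__ == "__main__":
    pcm = mulaw_bytes_to_pcm16(bytes([0xFF, 0x80, 0x00, 0xFF]))
    print(describe_wav(make_wav(pcm, channels=2)))
```

Raw PCM, mu-law, and a-law streams carry no such metadata, so the format must be supplied separately, whereas a WAV header states the channel count and rate itself.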
