Product Glossary
Speech
Automatic Speech Recognition (ASR) - Software that converts human speech audio into text. It can be used in real-time streaming mode, typically with predefined grammars, or in offline/batch mode, typically for long, free-form transcription based on a statistical language model (SLM). LumenVox ASR uses a Deep Neural Network (DNN) engine.
Call Progress Analysis (CPA) - Software that uses Voice Activity Detection (VAD) and Answering Machine Detection (AMD) to distinguish machines from humans, and businesses from residences. CPA accurately connects human responders to agents or delivers messages to voicemail. The outbound messaging application is then told what to do next: hand off to a live agent or leave a personalized voicemail.
Text to Speech (TTS) - Software that converts text into audio. This audio can then be played back by the client application, responding to end users with spoken words. Each language supported by LumenVox comes with a choice of male or female voices.
Directed Dialog - The specification of what the caller will be asked and which system actions follow from each possible response. Directed dialogs follow predefined paths until the dialog ends.
Grammar - A set of rules that models what users are expected to say. It acts as a filter, limiting the comparison and constraining the acoustic search when the ASR tries to identify what a user said. Besides defining the set of words users are expected to speak, a grammar can also provide semantic interpretation: for example, it can map every item in the group {‘yes’, ‘yeah’, ‘ok’, ‘of course’, ‘sure’, ‘that’s fine’, ‘got it’} to the single value YES.
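As an illustrative sketch (not a LumenVox-specific grammar), the yes/no mapping above can be expressed in the W3C SRGS XML format with semantic interpretation tags:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" root="confirm" tag-format="semantics/1.0">
  <rule id="confirm" scope="public">
    <one-of>
      <!-- every affirmative phrase returns the same semantic value -->
      <item>yes <tag>out = "YES";</tag></item>
      <item>yeah <tag>out = "YES";</tag></item>
      <item>of course <tag>out = "YES";</tag></item>
      <item>sure <tag>out = "YES";</tag></item>
      <item>no <tag>out = "NO";</tag></item>
    </one-of>
  </rule>
</grammar>
```

Whatever phrase the caller actually says, the application receives only the semantic value (YES or NO), which keeps the dialog logic simple.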
Grammar Based ASR - Automatic speech recognition based on the closed-set rules of a grammar.
Transcription ASR - Automatic speech recognition based on the open rules of a statistical language model (SLM).
Continuous Transcription - In continuous mode the engine processes incoming audio one chunk at a time. The VAD detects when speech ends, and the engine performs word prediction (a first pass, a final pass, and post-processing) for every chunk as it arrives, without waiting for the end of the audio.
Enhanced Transcription - In enhanced transcription mode, transcription is combined with grammars, enabling NLP-style tasks such as slot filling and entity recognition.
Interactive Voice Response (IVR) - A system with automated menu choices that uses both ASR and TTS to interact with human callers. The system prompts for a choice, the user says their choice, and the system responds and acts accordingly.
Statistical Language Model (SLM) - A model of an entire language specifying relationships between words.
Acoustic Model - A database relating the sounds of a language to its written form. The ASR compares incoming audio to this model and outputs words built from the sounds. These strings of letters are treated as a first draft of the words; the final word output is formed when the SLM prediction or grammar rules are enforced.
Training an Acoustic Model - Running thousands of hours of transcribed audio through an automated training process that compares the transcripts to the sounds. Training captures how words are pronounced by a diverse set of speakers and builds the database of sound-to-text mappings.
Training a Language Model - Running millions of lines of text through automated training, which captures the relationships between words and builds a database of word-relationship probabilities.
Speech Synthesis Markup Language (SSML) - A standardized markup language used to control various aspects of speech synthesis, the artificial production of human speech. SSML is a W3C (World Wide Web Consortium) standard and is commonly used in text-to-speech (TTS) systems to control prosody, pronunciation, and word emphasis.
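As a short illustrative fragment (the wording and attribute values are examples, not LumenVox defaults), an SSML document controlling emphasis, pauses, and prosody might look like:

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Your balance is <emphasis level="strong">two hundred dollars</emphasis>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="+2st">
    This sentence is spoken more slowly and at a higher pitch.
  </prosody>
</speak>
```

The TTS engine renders the plain text while honoring the markup: stressing the emphasized span, inserting a half-second pause, and adjusting rate and pitch inside the prosody element.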
Natural Language Understanding (NLU) - A subfield of artificial intelligence (AI) and computational linguistics focused on a computer system's ability to understand and interpret human language as it is naturally spoken or written. Unlike simple keyword matching, NLU involves comprehending the nuances, context, and intent behind user inputs. It underpins products such as Call Summarization and Sentiment Analysis.
Call Summarization - The process of automatically generating a concise summary of the content and key points discussed during a phone call or voice conversation, based on the transcribed text.
Sentiment Analysis - Also known as opinion mining; the process of identifying, extracting, and quantifying subjective information from text data. It involves determining the emotional tone, attitude, or sentiment expressed in a piece of text, typically categorized as positive, negative, or neutral.
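As a toy illustration of the idea only (a naive lexicon-based sketch; production sentiment analysis uses trained models, not hand-picked word lists):

```python
# A naive lexicon-based sentiment sketch. The word lists are illustrative;
# real systems learn sentiment from data rather than keyword matching.
POSITIVE = {"great", "good", "happy", "excellent", "thanks"}
NEGATIVE = {"bad", "angry", "terrible", "unhappy", "problem"}

def sentiment(text: str) -> str:
    # Strip common punctuation so "great!" still matches "great".
    words = [w.strip(".,!?") for w in text.lower().split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Thanks, the agent was great!"))  # positive
```

Even this toy version shows the output categories the definition describes; the hard part, which trained models handle, is context such as negation and sarcasm.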
Speaker Diarization - The process of partitioning an audio recording into segments based on who is speaking when. It involves identifying and labeling different speakers within an audio stream, which is crucial for understanding multi-speaker environments such as meetings, interviews, and call centers. Essentially, it answers the questions: "Who spoke?" and "When did they speak?"
Language Identification - Also known as language detection or language recognition; the process of determining the language of a given piece of text or spoken content by analyzing the input.
Language Translation - The process of converting text or spoken content from one language (the source language) into another (the target language) while preserving the meaning and context of the original message.
MRCP - Media Resource Control Protocol, a communication protocol between applications (voice IVR platforms) and the ASR and TTS resources serving them (taking in the audio or outputting the text). It is primarily used by voice application platforms, call centers, call recording providers, telephony trunks, and network switches. MRCP itself is standardized by the IETF (MRCPv1 in RFC 4463, MRCPv2 in RFC 6787), while its payloads use W3C speech standards such as SRGS grammars and SSML. While we prefer our customers to use the gRPC-based LumenVox API for our speech and voice services, we recognize that MRCP remains prevalent in customer service platforms; hence, we support all major voice platforms and IVRs using MRCP. There are two main versions in use: MRCPv1 and MRCPv2.
gRPC - A cross-platform, open-source, high-performance Remote Procedure Call (RPC) framework, initially created by Google. This is the preferred and recommended protocol for communication with LumenVox software. gRPC can run in any environment and can efficiently connect services in and across data centers, with support for load balancing, tracing, health checking, and authentication. It is optimized for scalability and is cloud- and microservices-native. All major programming languages can access our API via gRPC.
Voice Biometrics
Voice Biometrics - A technology used for recognizing a person based on the unique characteristics of their voice. This is often used for authentication and verification purposes.
Enrollment - The process of recording a person's voice to create a voiceprint. During enrollment, a user's voice features are captured and stored in a database for future comparison.
Voiceprint - A digital model representing the unique characteristics of an individual's voice, used for comparison during authentication processes.
Authentication/Verification - The process of verifying the identity of a person using their voiceprint. This can involve matching the individual's live voice input against a stored voiceprint.
Speaker Identification - The process of determining which registered speaker's voiceprint matches a given voice input from a group of possible speakers.
Text-Dependent Verification - A form of voice verification where the speaker must say a specific phrase or set of phrases, e.g. “My voice is my password.” The system compares the spoken input against the pre-recorded version of the same phrase.
Text-Independent Verification - Voice verification that does not require the speaker to say a specific phrase. The system can verify the speaker based on any spoken input.
False Accept Rate (FAR) - The rate at which an unauthorized person is incorrectly accepted by the voice biometric system as an authorized user.
False Reject Rate (FRR) - The rate at which an authorized user is incorrectly rejected by the voice biometric system.
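To make the two error rates concrete, here is a minimal sketch of how FAR and FRR are computed from evaluation counts (the numbers are hypothetical, not LumenVox benchmarks):

```python
def far_frr(impostor_accepts: int, impostor_attempts: int,
            genuine_rejects: int, genuine_attempts: int) -> tuple[float, float]:
    """Compute the two standard biometric error rates.

    FAR = unauthorized attempts wrongly accepted / all impostor attempts
    FRR = authorized attempts wrongly rejected / all genuine attempts
    """
    far = impostor_accepts / impostor_attempts
    frr = genuine_rejects / genuine_attempts
    return far, frr

# Hypothetical evaluation: 3 of 1000 impostor attempts accepted,
# 20 of 500 genuine attempts rejected.
far, frr = far_frr(3, 1000, 20, 500)
print(f"FAR={far:.1%}, FRR={frr:.1%}")  # FAR=0.3%, FRR=4.0%
```

The two rates trade off against each other: lowering the acceptance threshold reduces FRR but raises FAR, so systems are tuned to the balance the use case requires.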
Liveness Detection - A method used to ensure that the voice input comes from a live human and not a recording or synthetic imitation. One common approach is prompting the user to say a random phrase such as “An apple a day keeps the doctor away”.
Playback Detection - The process of determining whether recorded audio, rather than live speech, is being played back into the production system.
Footprint Playback Detection - Identifies audio files that have already been received and processed by the voice biometric engine. It operates on the premise that a human never repeats a passphrase exactly the same way. This is used in active verification use cases.
Channel Playback Detection - Identifies audio that is being played back through some sort of loudspeaker device. This has applications in active, passive, and hybrid use cases.
Synthetic Speech Detection - Identifies audio that has been generated by some form of synthesizer. This is primarily used in passive use cases but can extend to active use cases.
Biometric Model - A biometric DNN (Deep Neural Network) model is an advanced machine-learning architecture used to perform tasks such as identification, authentication, and verification. A custom model is often required for specific text-dependent phrases.