LumenVox speech products (ASR & Transcription)
ASR
Overview
Our Automatic Speech Recognition (ASR) software takes audio, either as a single audio file or as streamed audio segments/chunks, and derives text from it using our Deep Neural Network (DNN) engine. Behind the scenes, an acoustic model is used to decipher the sounds, and then either a hand-crafted grammar or a statistical language model (SLM) is used to predict the words. This can be done either offline or in real-time streaming mode.
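As a rough illustration of these two input modes, the sketch below shows a one-shot file decode and a streaming decode. The client object and its method names are hypothetical placeholders, not the actual LumenVox API.

```python
# Minimal sketch of the two input modes, assuming a hypothetical client object;
# method names (decode, push_audio, final_result) are illustrative assumptions,
# not the actual LumenVox API.

def batch_decode(client, path):
    """Offline mode: send one complete audio file and get text back."""
    with open(path, "rb") as f:
        return client.decode(audio=f.read())

def streaming_decode(client, chunks):
    """Real-time mode: send audio segments/chunks as they arrive."""
    for chunk in chunks:
        client.push_audio(chunk)
    return client.final_result()
```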
The Engine
The following engine-block diagram shows the basic structure of our engine. It consists of three main blocks, where blocks 1 and 2 perform an acoustic task, and block 3 performs a linguistic task. The input and output interactions with the engine are managed by the LumenVox API.
Block 1 quite simply pre-processes the audio and prepares it for use by the DNN.
Block 2 contains the neural network. It is diagrammed as two separate steps for simplicity, but it really is one: during speech recognition, the network in this block takes in the audio features and outputs candidate text in a single pass.
Block 3 uses grammars or a language model to make sense of the candidate text and turn it into words that are meaningful in their context and in the order they are presented. Block 3 no longer uses the audio; the work done in this block is entirely linguistic, making words from a stream of letters.
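As a toy illustration of Block 3's purely linguistic job (not the engine's actual algorithm), the sketch below maps noisy candidate text onto the closest word allowed by a small grammar, using only standard-library fuzzy matching.

```python
import difflib

# Hypothetical small menu grammar; Block 3 only ever sees text, never audio.
GRAMMAR_WORDS = ["checking", "savings", "operator"]

def linguistic_pass(candidate_text):
    """Map a noisy acoustic candidate onto the closest grammar word, if any."""
    matches = difflib.get_close_matches(candidate_text, GRAMMAR_WORDS, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(linguistic_pass("chekking"))  # -> checking
```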
The new engine, with its improved scoring algorithm, is available from version 4.2 and provides better accuracy and performance. The engine also utilizes fewer resources, making scaling more manageable.
Acoustic models
At LumenVox, our science researchers have created a modern machine learning process that incorporates all the activities required to build an acoustic model into a single, self-improving, deep-layered network. During training, it creates an extensible acoustic model that imposes no limit on the number of dialects and accents that can exist for each word, and it does not require separate lexicons or other resources to produce accurate output.
This leads to improved efficiency and cost effectiveness, as the need to create separate acoustic models for accents and close dialects of a language is removed. Our engine can accurately spell words it has never been trained on or encountered before, which can benefit industries like medical transcription.
Grammar ASR
Grammars are rulesets that define valid word and phrase choices, for use as part of the linguistic task of the ASR engine. A grammar is used as a constrained subset of the language. This works well for use cases where users are expected to use only a small set of possible words, like menu choices or digits. Grammars also help predict which combinations of words are likely, which helps in cases of ambiguous pronunciation, so they assist the speech recognition process. They also provide some semantic interpretation.
Many of our customers choose grammar ASR because it is a perfect fit for their use case. Implementing grammars is optimal for IVRs and automated customer service systems, where user responses can be directed toward a small, predictable vocabulary. Such systems are used to automate call routing or repetitive tasks, such as activating a credit card. Grammar ASR works best for short utterances. Since grammars narrow the room for error compared to unconstrained decoding, our grammar ASR delivers very high accuracy.
Grammar Files
Our engine supports the SRGS standard for grammars. SRGS (Speech Recognition Grammar Specification) is a W3C standard. SRGS grammars are text files written in either ABNF or XML form, and they specify the patterns of words that are expected to appear together and the order in which they can appear. LumenVox provides some out-of-the-box grammars, but most clients easily author their own to suit their business needs.
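For illustration, here is a minimal SRGS XML grammar of the kind described above, held in a Python string (in practice it would live in its own grammar file). The yes/no rule and its literal semantic tags are our own example, not a LumenVox-shipped grammar.

```python
# A minimal SRGS XML grammar (W3C format) accepting "yes" or "no" and returning
# a literal semantic interpretation. Rule names and tags are illustrative.
YES_NO_GRXML = """\
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="answer"
         tag-format="semantics/1.0-literals">
  <rule id="answer" scope="public">
    <one-of>
      <item>yes <tag>affirmative</tag></item>
      <item>no <tag>negative</tag></item>
    </one-of>
  </rule>
</grammar>
"""
```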
Grammars Performance Enhancement
When your application starts decoding audio, it sends a grammar along with the audio to our speech engine. For the engine to read it, the grammar must be compiled into a binary format. The speech engine now keeps compiled grammars in its cache, allowing large grammars to be used without waiting for compilation at load time, as long as they have not been modified.
Aliases & Lexicons
While the use of phonemes has been rendered obsolete, the use of aliases has been introduced. By using an alias you can tell the system that if it comes across word A, it should spell it as word B. This is a very powerful tool for matching predicted words to grammars. The example below illustrates the syntax that can be used for aliases/lexicons and highlights two use cases: a) words can be replaced (in the first example, the word vulnerable is replaced with Shaky), or b) interpreted (where the decoded ASR text "eye phone" is interpreted as iPhone).
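As a sketch of what such mappings can look like, here is a lexicon in the W3C Pronunciation Lexicon (PLS) style covering both cases from the text. LumenVox's exact alias syntax may differ, so treat this as illustrative only.

```python
# Illustrative alias lexicon in W3C PLS style; the actual LumenVox syntax may
# differ (see the Knowledgebase). Each <lexeme> maps a decoded form to an alias.
ALIAS_LEXICON = """\
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" alphabet="ipa" xml:lang="en-US"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme>vulnerable</grapheme>
    <alias>Shaky</alias>        <!-- case a: one word replaced with another -->
  </lexeme>
  <lexeme>
    <grapheme>eye phone</grapheme>
    <alias>iPhone</alias>       <!-- case b: decoded text interpreted as a product name -->
  </lexeme>
</lexicon>
"""
```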
Refer to the LumenVox Knowledgebase for articles on authoring and using SRGS grammars.
Transcription
Humans don’t naturally limit themselves to speaking in accordance with a grammar. To decipher what humans say when they express themselves naturally, you need transcription ASR. Unlike grammar ASR, which limits what users can say as valid responses during the speech recognition process, transcription ASR places no such limits on user vocabulary. When transforming audio to text, the engine chooses words based on a statistical language model (SLM) that covers the entire language.
LumenVox ASR implementations that choose transcription ASR often use it for long-form transcription, which offers expanded possibilities and use cases. However, we have seen an increase in demand for transcription ASR even for IVR and short-form utterances. Enterprises want to understand more and do more with what customers say. Call center quality assurance and agent assist, performance and sentiment analysis, chatbots, and video and conference transcription, to name a few use cases, are all made possible with transcription ASR.
Statistical Language Models (SLM)
When a closed-set grammar does not suffice (usually for long-form transcription), our engine uses SLMs, applying language model rules instead of grammar rules for word prediction. Each SLM is, in essence, a giant database of word relationships and the probabilities assigned to them, trained on millions of lines of text in the target language.
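As a rough intuition for how such probabilities combine (a textbook n-gram formulation, not necessarily the exact model class LumenVox trains), an SLM scores a word sequence by chaining conditional word probabilities:

```latex
% Textbook n-gram scoring of a word sequence w_1..w_n (illustrative, order k)
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-k+1}, \dots, w_{i-1})
```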
Following the example of our acoustic models, our language models can accommodate global use. Each language model encapsulates word relationships for all accents and close dialects, even for those spoken across the world. Thus, for example, our Spanish model can handle Mexican or European Spanish, as well as Spanish dialects spoken all over South America.
As an added feature, we do allow the specification of a dialect, to accommodate dialects that spell otherwise-identical words differently. One example is British and American English, where the correct spelling is highly desired by transcription readers of each dialect, e.g., colour/color.
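A minimal sketch of such a request is below, assuming hypothetical parameter names; the real LumenVox API fields may differ.

```python
# Hypothetical request settings: one global English model, with a dialect hint
# that controls spelling in the transcript (e.g., "colour" vs. "color").
# Field names are assumptions for illustration, not the documented API.
request_settings = {
    "language": "en",      # single global model covers all accents and close dialects
    "dialect": "en-GB",    # spelling preference: "colour", "organise", ...
}
```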
Check the LumenVox website for the latest information on languages available for transcription ASR.
SLM Augmentation: Phrase Lists
LumenVox makes it easy to augment the language resources to enhance accuracy in recognizing domain-specific terms, words, and phrases. This may be needed in domains such as the medical field, where pharmaceutical or disease names are often not included in typical modeling of a language. It can also be used to differentiate between two similar words where the engine consistently chooses one, but in your business domain it is more correct to choose the other. To get the term or phrase right, you can use the Phrase Lists feature, which mimics the use of a grammar. The feature also allows phrase weighting to be adjusted, to ensure a preference toward your business terms. Phrase list weighting can be further adjusted using the built-in probability boost settings.
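A minimal sketch of such a phrase list follows, assuming hypothetical field names; the documented format may differ (see the Knowledgebase article referenced below).

```python
# Hypothetical phrase list payload: domain terms plus per-phrase weights and a
# probability boost. Field names are illustrative assumptions, not the
# documented LumenVox schema.
phrase_list = {
    "phrases": [
        {"text": "atorvastatin", "weight": 2.0},          # drug name missing from a general SLM
        {"text": "myocardial infarction", "weight": 1.5}, # preferred domain phrase
    ],
    "probability_boost": 1.2,  # assumption: maps to the built-in boost setting
}
```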
Refer to the LumenVox Knowledgebase for more information on using Phrase Lists.
Text Normalization (Transcription Post-Processing)
What is Text Normalization
In the conversational AI industry, Text Normalization (TN) is a general name for the post-processing of the text output of a speech recognition engine, preparing it to be read.
LumenVox supports configurable text normalization to make transcription output readable. There are multiple levels of such normalization: capitalization, punctuation, inverse normalization (e.g., turning sequences of digit words into formatted numbers for counts, currency, dates, etc.), as well as redaction of sensitive information.
Usually, readability is not an issue for short, grammar-based ASR. Raw text can be used to feed a menu choice or a short collection of strings (e.g., a phone number or amount) into a system without worrying about readability. Readability is an issue for long-form transcription. Therefore, when discussing TN, we are referring to working with the output of Transcription ASR, not Grammar ASR.
Why we need Text Normalization
Humans communicate with each other through language, either spoken or written. The spoken form and the written form are not identical. The spoken form came first; when we invented writing, we developed the written form. When we move from speaking to writing, or from reading to speaking, our brains translate from one form to the other automatically.
Spoken form: the simplest representation of words. All the words are explicitly represented as they are to be pronounced. That’s the standard, the normal form, without any formatting or shortcuts, no extra symbols, punctuation, or anything to aid in the reading of it, since you are just speaking it. This form is long, e.g., “one hundred fifty two thousand four hundred eighty three dollars”. It’s how we say things.
Written form: the caveat of the spoken form is that it’s not easy to write or read. Humans invented a way to condense the information when writing. In the above example there are 10 words, which can be condensed into one: $152,483.
Additionally, when speaking, humans make pauses that create sentences, and use intonation for commanding, asking, or stating facts, none of which comes across when the words are written down. Punctuation and capitalization were added to the written form to make up for this missing information.
When written, the text is transformed into numbers, dates, times, acronyms, abbreviations, etc. The written form is non-standard, not normal, and contains formatting and symbols. It is also open to interpretation and can be ambiguous and culture or localization dependent. In some countries correct formatting would have the comma and the period switched. As another example, an American reader will read a date differently than a European reader. The standard spoken form January third is written in the US as 1.3.2022, but Europeans will write it as 3.1.2022. They will read the American format as the first of March.
Humans prefer the written form for text because it’s cleaner and more readable: even though the representation is ambiguous, humans are pretty good at using context to resolve this ambiguity.
However, speech recognition algorithms generate the spoken form as output. They also ‘prefer’ the spoken form as input. Many machine learning systems work best with the spoken form because it’s the simplest and most explicit representation, it’s standard and normal, which leaves less room for error.
Since humans prefer the written form, yet computer systems take in and give out the spoken form, we have to develop algorithms to ‘translate’ the given form into the preferred form, depending on the task and who is going to consume the text.
Why is it called Text Normalization
Normalization is a concept borrowed from mathematics. Normalizing a measurement or an amount means to transform it to some standard form, so that it may be easily compared to other measurements or amounts.
Text normalization is therefore the process of transforming text into a single, standard, explicit, “normalized” form, which allows it to be identified in one way that’s not open to interpretation. This is exactly how we defined the spoken form. Scientifically speaking, text normalization is the transformation:
Written form (complex, condensed, formatted) -> Spoken form (simple, standard, normal)
This is not what we are doing with ASR post-processing. With ASR post-processing we are doing the opposite: ‘inverse text normalization’:
Spoken form (simple, standard, normal) -> Written form (complex, condensed, formatted)
Since ‘inverse text normalization’ is such a mouthful, the shorter ‘text normalization’ has become a standard umbrella name for most of the steps within ASR post-processing.
In fact, there is an algorithm that does true ‘text normalization’: TTS (Text to Speech). TTS systems do the opposite of ASR post-processing: they take a written form and simplify it, normalize it, so that they can feed it into an algorithm that synthesizes audio speech that talks back to the user.
In summary:
- Scientific jargon: ASR post-processing is really inverse text normalization.
- Industry jargon: text normalization is an umbrella term for ASR post-processing that includes inverse text normalization.
LumenVox Text Normalization Offering
LumenVox provides the following capabilities under the TN umbrella:
- Inverse text normalization
- Punctuation and capitalization
- Sensitive information redaction
These capabilities can each be turned on or off using flags that you set when using the LumenVox API. Text normalization adds computational cost to the processing, so additional CPU and memory are required when it is enabled.
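As an illustration, the sketch below toggles the three capabilities per request; the flag names are assumptions, not the documented LumenVox API settings.

```python
# Hypothetical TN flags for a transcription request; names are illustrative
# assumptions, not the actual LumenVox API settings.
tn_settings = {
    "inverse_text_normalization": True,   # "two point three million" -> "2.3 million"
    "punctuation_capitalization": True,   # truecasing (see module 2 below)
    "redaction": False,                   # PII marking, currently English only
}
# Note: enabling these adds CPU/memory cost, as noted above.
```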
1. Inverse Text Normalization Module
The LumenVox ITN module performs the following ‘spoken -> written’ form conversions:
Rule | Example
Cardinal numbers | "one hundred twenty-three million four hundred fifty-six thousand seven hundred eighty-nine" -> 123,456,789
Numbers + "million" or "billion" | "two point three million" -> 2.3 million
Ordinal numbers | "first" -> 1st
Two or more numbers < 100 in a row | "one ninety-two" -> 192
Times | "seven a m eastern standard time" -> 07:00 a.m. EST
Dates | "may fifth twenty twelve" -> may 5 2012
Decimals | "three hundred and three dot five" -> 303.5
Alphanumeric sequences | "a b c d one two three" -> ABCD123
Measurements (e.g., %, km, °C) | "fifty pounds" -> 50 lbs
Currencies | "fifty pounds" -> £50
2. Punctuation and Capitalization Module
The LumenVox punctuation and capitalization module performs ‘true-casing’: punctuation and capitalization restoration. This includes the following modifications, illustrated in the sketch after this list:
- Add punctuation (commas, periods, question marks, and language-specific marks)
- Add capitalization (options are: first letter capital, all uppercase, all lowercase)
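A before/after illustration of truecasing follows; the normalized output shown is our own hand-written example, not captured engine output.

```python
# Illustrative truecasing: raw spoken-form ASR output vs. restored punctuation
# and capitalization. The "after" string is a hand-written example.
raw_transcript = "hi this is dana how can i help you today"
truecased      = "Hi, this is Dana. How can I help you today?"
```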
3. Sensitive Information Redaction Module
The LumenVox redaction module, currently available in English only, performs PII marking: it marks whether a word contains PII (personally identifiable information, i.e., sensitive information), and of what kind. The table below lists the recognizers; an illustrative sketch follows it.
Recognizer | What’s covered
Passwords | Including alphanumeric and special characters
Social insurance / security # | Australia, Canada, IBAN, USA
PIN | 4- to 6-digit sequences
Bank account # | Australia, Canada, IBAN, USA
Names | General
Driver’s License | Australia, Canada, EU
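Here is an illustrative sketch of PII marking on a short transcript; the bracketed tag format is an assumption, not the engine's actual output schema.

```python
# Illustrative PII marking/redaction; the [PIN] tag format is an assumption,
# not the engine's documented output schema.
raw_text    = "sure my pin is one two three four"
marked_text = "sure my pin is [PIN]"   # the 4-digit sequence is marked as PIN-type PII
```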
Summary of Text Normalization Benefits
Although the primary motivator is readability, the benefits of text normalization go beyond human consumption. There are secondary benefits to downstream applications.
- Smoothing the data: normalization can help make the transcript more similar to training data. If the training data includes samples like "I'd like the order fulfilled at 12:15" rather than "I'd like the order fulfilled at quarter past twelve" for a ‘set delivery time’ intent, then having TN turned on may improve getting the intent right.
- Understanding the data: having punctuation and capitalization on can be useful for downstream processing by other systems. You can split the utterance into multiple sentences and process them separately. There’s more flexibility for other workflows such as discourse summarization, and automatically generating FAQs.
Downstream tasks that can benefit:
- Intent engine accuracy (classification and slot filling)
- Use in search and search results relevance and ranking
- Content summarization, automatic generation of FAQs
- Analytics accuracy, sentiment analysis, and more
Ask your LumenVox sales rep about languages supported for transcription text normalization.
Continuous Transcription Mode
In continuous mode the engine processes the incoming audio one chunk at a time. The VAD detects when speech ends, and the audio processing performs word prediction, running a first pass, a final pass, and post-processing for every chunk continuously, without waiting for the end of the audio. This is preferable for cases where fast real-time output is needed, such as IVRs or agent-assist implementations, as well as for live subtitling.
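Below is a conceptual sketch of consuming continuous-mode results, with a stub in place of the real VAD and engine; the names are illustrative, not the LumenVox client API.

```python
# Conceptual sketch of continuous mode: each VAD-delimited segment is decoded,
# post-processed, and finalized immediately, without waiting for end of audio.
# The VAD/engine is stubbed; names are illustrative, not the LumenVox API.
def finalized_segments(audio_chunks):
    """Stub: pretend the VAD grouped chunks into finished speech segments."""
    for chunk in audio_chunks:
        yield chunk  # first pass, final pass, and post-processing happen here

for text in finalized_segments(["hello thanks for calling", "i lost my card"]):
    print("FINAL:", text)   # safe to display immediately (e.g., live subtitles)
```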
Comparing Streaming with Partial Results to Continuous Mode
Continuous transcription is very similar to streaming transcription with partial results enabled. However, in continuous transcription you have to wait for a pause before looking at a result, and that result is final, which makes it a good fit for subtitling, for example. In contrast, in streaming you can get an interim result every few words, and it can change while the speech continues. This is good, as mentioned, for other real-time use cases like dictation.
Enhanced Transcription Mode
In enhanced mode, grammars are provided for the transcription either by creating a transcription interaction and supplying one or more grammar files, or through MRCP, where an enhanced transcription grammar is processed through the ASR. Enhanced transcription provides for NLP-type tasks like slot filling and entity recognition by combining transcription with grammars.
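A minimal sketch of such a request follows, combining a transcription interaction with grammar files; the field names are illustrative assumptions, not the documented LumenVox API or MRCP syntax.

```python
# Hypothetical enhanced-transcription request: free-form transcription plus
# grammars for slot filling / entity recognition. Field names are assumptions.
request = {
    "interaction": "transcription",
    "language": "en",
    "grammars": ["dates.grxml", "currency_amounts.grxml"],  # illustrative files
}
# Expected result (conceptually): the full transcript, plus matches for any
# grammar-covered slots (e.g., a date entity) found inside it.
```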