TTS1 Swedish text normalization
The LumenVox TTS1 Text-To-Speech synthesizer works internally by synthesizing words. However, input text documents contain not only words, such as mjölk and socker, but also various other written elements, such as numbers (15), dates (3/4/2003), acronyms (USA), abbreviations (t.ex.), symbols ($), etc. All such elements must first be converted to actual words before they can be synthesized. This conversion, which takes place inside the synthesizer, is called text normalization.
The Swedish TTS1 Text-To-Speech voices correctly normalize and synthesize the majority of Swedish texts. This document describes how LumenVox accomplishes the task of text normalization.
The user may extend LumenVox's text normalization by using PLS lexicons (as defined in the W3C Pronunciation Lexicon Specification Recommendation).
Please note that this article does not apply to our TTS2 voices.
Text Structure
This section describes how unannotated input text is split into paragraphs, sentences and words.
Paragraph
Paragraphs are separated by empty lines.
Paragraphs may be explicitly marked with SSML elements <p>.
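The empty-line rule can be illustrated with a short sketch. This is not LumenVox's implementation; the function name split_paragraphs is hypothetical, and the code assumes that a line containing only spaces or tabs counts as empty.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split plain text into paragraphs at runs of one or more empty lines."""
    # Assumption: a line holding only spaces or tabs counts as an empty line.
    # Two or more consecutive line breaks (with optional blanks) end a paragraph.
    blocks = re.split(r"(?:\r?\n[ \t]*){2,}", text.strip())
    return [block.strip() for block in blocks if block.strip()]
```

A single line break inside a paragraph is kept, so only blank lines act as separators.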
Sentence
By default, a sentence contains fewer than 1000 characters. Longer sentences are broken into multiple smaller sentences.
Sentences may be explicitly marked with SSML elements <s>.
Word
By default, a word contains fewer than 100 characters. Longer words are broken into multiple smaller words.
Words without any vowels will be spelled out.
LumenVox will properly handle words with colons, such as the standard declension suffixes (:a, :ar, :arna, :en, etc.) or numeral suffixes.
Supported characters
LumenVox accepts all Unicode characters. LumenVox handles most characters found in texts based on the Latin script.
Punctuation
Punctuation plays a key role in the way texts are interpreted by the TTS system. LumenVox supports the majority of punctuation marks found in Swedish texts. Ultimately, however, all punctuation marks that affect pauses or intonation are mapped to the following marks.
Punctuation marks | Pause | Intonation |
, | small | slightly rising |
; : | medium | falling |
. ! | long | falling |
? | long | rising or falling |
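For applications that want to reason about prosody before sending text to the synthesizer, the table can be expressed as a simple lookup. The names PROSODY and prosody_for are hypothetical; the values are copied from the table and nothing here reflects how the synthesizer stores them internally.

```python
# Pause and intonation per punctuation mark, copied from the table above.
PROSODY = {
    ",": ("small", "slightly rising"),
    ";": ("medium", "falling"),
    ":": ("medium", "falling"),
    ".": ("long", "falling"),
    "!": ("long", "falling"),
    "?": ("long", "rising or falling"),
}

def prosody_for(mark: str):
    """Return (pause, intonation) for a mark, or None if it is not in the table."""
    return PROSODY.get(mark)
```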
Default normalization rules
This section describes in general how LumenVox normalizes input text, excluding text fragments marked with the SSML say-as element.
This section is not exhaustive. LumenVox normalizes many kinds of text elements, but only the most common are described here.
Number
Cardinal number
A cardinal number is either any single digit (0, 1, …, 9) or a sequence of digits not starting with 0.
Longer cardinal numbers may make use of dot as a thousands separator.
Examples
- 10.000 will be pronounced tio-tusen.
- 256 will be pronounced två-hundra-femtiosex.
- 4358 will be pronounced fyra-tusen-tre-hundra-femtioåtta.
- 1.000 will be pronounced ett-tusen.
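The definition of a cardinal number can be sketched as a regular expression. This is only an illustration of the rule as written, not the synthesizer's actual grammar; the names CARDINAL and parse_cardinal are hypothetical, and the code assumes that a thousands separator must always be followed by exactly three digits.

```python
import re
from typing import Optional

# A lone digit, a digit run without a leading zero, or groups of three
# digits separated by dots (1.000, 10.000, 1.000.000).
CARDINAL = re.compile(r"(?:\d|[1-9]\d*|[1-9]\d{0,2}(?:\.\d{3})+)")

def parse_cardinal(token: str) -> Optional[int]:
    """Return the integer value of a cardinal token, or None if it is not one."""
    if CARDINAL.fullmatch(token) is None:
        return None
    # Drop the thousands separators before converting.
    return int(token.replace(".", ""))
```

A token such as 0123 is rejected here because it starts with 0, which matches the separate rule for sequences of digits.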
Signed integer
A signed integer consists of a sign character followed immediately by a cardinal number. Valid sign characters are the plus sign (+), the minus sign (−, U+2212) and the plus-minus sign (±). The common hyphen-minus character (-, U+002D) and other dash-like characters are also accepted as sign characters, but they are ambiguous and are best avoided.
Examples
- +5 will be pronounced plus fem.
- -3.000 will be pronounced minus tre-tusen.
Real number
A cardinal or signed integer followed immediately by a comma and a sequence of digits will be recognized as a real number. The comma is the decimal separator; the dot remains the thousands separator.
Examples
- 4,5 will be pronounced fyra komma fem.
- -3,1 will be pronounced minus tre komma en.
- 1.000,12 will be pronounced ett-tusen komma tolv.
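The examples above use a comma as the decimal separator and a dot as the thousands separator. A minimal sketch of parsing that notation follows. The names REAL and parse_real are hypothetical, and the sketch deliberately leaves out the plus-minus sign, since it has no single numeric value.

```python
import re
from decimal import Decimal
from typing import Optional

# Optional sign, integer part (with optional dot-separated thousands groups),
# a comma, then the decimal digits.
REAL = re.compile(
    r"(?P<sign>[+\-\u2212]?)"
    r"(?P<int>\d|[1-9]\d*|[1-9]\d{0,2}(?:\.\d{3})+)"
    r",(?P<frac>\d+)"
)

def parse_real(token: str) -> Optional[Decimal]:
    """Return the value of a Swedish-style real number, or None if it does not match."""
    match = REAL.fullmatch(token)
    if match is None:
        return None
    digits = match.group("int").replace(".", "")
    value = Decimal(f"{digits}.{match.group('frac')}")
    # Both the hyphen-minus and the true minus sign (U+2212) negate the value.
    return -value if match.group("sign") in ("-", "\u2212") else value
```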
Ordinal number
A cardinal number with suffixed a, :a, e or :e is interpreted as an ordinal number of the given gender. The :dra, :dje and :de ordinal suffixes are also supported in numerals that end with 2, 3 or 4 respectively.
- 21a will be pronounced tjugo-första.
- 42:a will be pronounced fyrtio-andra.
- 6e will be pronounced sjätte.
- 1.000.000:e will be pronounced en miljonte.
- 62:dra will be pronounced sextio-andra.
- 3:dje will be pronounced tredje.
- 44:de will be pronounced fyrtio-fjärde.
Roman numeral
LumenVox supports various Roman numerals.
All uppercase Roman numerals with an appropriate lowercase ordinal suffix separated with colon are pronounced as ordinal numbers.
- LI:e will be pronounced femtio-förste.
- XXVIII:a will be pronounced tjugoåtta.
Uppercase Roman numerals in names of monarchs will be read as ordinal numbers preceded with the word den.
- kejsarinnan Katarina I will be pronounced kejsarinnan katarina den första.
- Gustav I:e will be pronounced gustav den förste.
Small uppercase and lowercase Roman numerals in other contexts will be pronounced as cardinal numbers.
- Punkt IX will be pronounced punkt nio.
- II Världskriget will be pronounced andra världskriget.
- xxii will be pronounced tjugotvå.
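To check which number a Roman numeral stands for before relying on the rules above, the standard conversion can be sketched as follows. The function name roman_to_int is hypothetical, and the sketch does not validate its input: a malformed numeral such as IIII is simply summed to 4.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Convert an upper- or lowercase Roman numeral to an integer."""
    values = [ROMAN_VALUES[ch] for ch in numeral.upper()]
    total = 0
    for i, value in enumerate(values):
        # Subtractive notation: a smaller value placed before a larger one
        # is subtracted (IX = 10 - 1), otherwise values are added.
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total
```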
Fraction
A fraction consists of the following elements in order:
- An optional sign character.
- An optional whole number (cardinal) followed by the space character.
- The numerator (a cardinal number).
- The slash (/ U+002F) or the solidus character (⁄ U+2044, named FRACTION SLASH in Unicode).
- The denominator (a cardinal number).
Fractions with the slash character are recognized only for the most common denominators. Fractions with the solidus character are always correctly recognized.
Examples
- 3/4 will be pronounced tre fjärdedelar.
- 2 1/2 will be pronounced två och en halv.
- -7 2/3 will be pronounced minus sju och två tredjedelar.
- 15/5678 (solidus only) will be pronounced femton fem-tusen-sex-hundra-sjuttioåttondelar.
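The element order listed above can be sketched as a small parser. This is an illustration only: the names FRACTION and parse_fraction are hypothetical, cardinals are simplified to plain digit runs, and the plus-minus sign is left out because it has no single numeric value.

```python
import re
from fractions import Fraction
from typing import Optional

# Optional sign, optional whole number plus a space, numerator,
# slash (U+002F) or fraction slash (U+2044), denominator.
FRACTION = re.compile(
    r"(?P<sign>[+\-\u2212]?)"
    r"(?:(?P<whole>\d+) )?"
    r"(?P<num>\d+)[/\u2044](?P<den>\d+)"
)

def parse_fraction(token: str) -> Optional[Fraction]:
    """Return the value of a fraction or mixed number, or None if it does not match."""
    match = FRACTION.fullmatch(token)
    if match is None or int(match.group("den")) == 0:
        return None
    value = Fraction(int(match.group("num")), int(match.group("den")))
    if match.group("whole"):
        value += int(match.group("whole"))
    return -value if match.group("sign") in ("-", "\u2212") else value
```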
Sequence of digits
Sequences of more than one digit starting with 0 are always read as a sequence of digits.
Digits in fixed formats, such as telephone numbers or social security numbers, are handled similarly.
Examples
- 0123 will be pronounced noll ett två tre.
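The leading-zero rule and the digit-by-digit reading can be sketched as follows. The helper names are hypothetical; the Swedish digit names are the ones used in the example above.

```python
DIGIT_NAMES = ["noll", "ett", "två", "tre", "fyra", "fem", "sex", "sju", "åtta", "nio"]

def is_digit_sequence(token: str) -> bool:
    """True for tokens of more than one ASCII digit that start with 0."""
    return (
        len(token) > 1
        and token.startswith("0")
        and all(ch in "0123456789" for ch in token)
    )

def read_digits(token: str) -> str:
    """Spell out each ASCII digit of the token, ignoring other characters."""
    return " ".join(DIGIT_NAMES[int(ch)] for ch in token if ch in "0123456789")
```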
Unit and measurement
LumenVox handles a wide variety of commonly as well as rarely used units, including metric and imperial systems. Some unit symbols are always recognized, others need a preceding number.
Examples
- 14'5" will be pronounced fjorton fot fem tum.
- 1h2m30s will be pronounced en timme två minuter trettio sekunder.
- 5 tsp will be pronounced fem teskedar.
- 1 tbsp will be pronounced en matsked.
- 2,6 GHz will be pronounced två komma sex gigahertz.
- 25 km/h will be pronounced tjugofem kilometer i timmen.
- 8 nmi will be pronounced åtta nautiska mil.
- -0,01% will be pronounced minus noll komma noll ett procent.
- 90° will be pronounced nittio grader.
Currency
LumenVox supports a wide list of currencies in multiple formats. Valid currency symbols include commonly used symbols such as £, $, €, ¥, ?, $AU, SG$, as well as many of the ISO 4217 currency codes (uppercase only).
The number may be followed by the words miljon, miljoner, miljard, miljarder, biljon, biljoner, or their abbreviations. In this case the currency will be pronounced at the end.
The value may have a thousands separator which may be either a dot or a space.
- 50€ will be pronounced femtio euro.
- EUR5.27 will be pronounced fem euro och tjugosju cent.
- $10 will be pronounced tio dollar.
- £5,27 will be pronounced fem pund och tjugosju pence.
- GBP 1.000 will be pronounced ett tusen brittiska pund.
- ¥1 miljon will be pronounced en miljon yen.
- ¥5,27 will be pronounced fem yen och tjugosju sen.
- CHF6M will be pronounced sex miljoner schweiziska franc.
- € 20 000 will be pronounced tjugo-tusen euro.
- C$ 2,3 mn will be pronounced två komma tre miljoner kanadensiska dollar.
Time
LumenVox supports time specified in both the 12-hour and the 24-hour clock.
- 1:59 will be pronounced ett och femtionio.
- 2:00 will be pronounced två noll noll.
- 01:59am will be pronounced noll ett femtionio a m.
- 2 AM will be pronounced två a m.
- 13:00 will be pronounced tretton noll noll.
- 10:25:30 will be pronounced tio tjugofem och trettio.
- 07:53:10 A.M. will be pronounced noll sju femtiotre och tio a m.
LumenVox also handles duration specified in multiple formats.
- 5'30" (only for seconds greater than 11) will be pronounced fem minuter trettio sekunder.
- 5m30s will be pronounced fem minuter trettio sekunder.
- 3h10m will be pronounced tre timmar tio minuter.
- 1t30m25s will be pronounced en timme trettio minuter tjugofem sekunder.
Date
One-digit numbers for the day and for the month may have an optional leading zero.
Supported formats for month expressions: numbers (4, 04), name (April), abbreviation (Apr).
The year should be expressed with 4 digits.
European format (D/M/Y, D-M-Y, D.M.Y), default for Swedish voices:
- 12/maj/1995 will be pronounced tolfte maj nitton-hundra-nittiofem.
- 12-Apr-2007 will be pronounced tolfte april två-tusen-sju.
- 20.3.2011 will be pronounced tjugonde mars två-tusen-elva.
Standard US format (M/D/Y, M-D-Y), with month name:
- Dec/31/1999 will be pronounced trettio första december nitton-hundra-nittionio.
- April-25-1999 will be pronounced tjugo femte april nitton-hundra-nittionio.
ISO 8601 standard (Y-M-D, Y/M/D, Y.M.D), only 4-digit year:
- 2007/01/01 will be pronounced två-tusen-sju noll ett noll ett.
- 2007-Jan-01 will be pronounced första januari två-tusen-sju.
- 2007-Januari-01 will be pronounced första januari två-tusen-sju.
Other common formats:
- 01/6 -97 will be pronounced första i sjätte nittiosju.
Range
LumenVox interprets ranges of numbers, measurements, time and date.
- 15-20 April will be pronounced femtonde till tjugonde april.
- 1939-1945 will be pronounced nitton-hundra-trettionio till nitton-hundra-fyrtiofem.
Abbreviations
Most abbreviations will be expanded to full words. There will be no sentence break on the dot sign (full stop) following a supported abbreviation. To force a sentence break, use two dots: one to mark the abbreviation and one to mark the end of the sentence.
Example
- Kiruna centrum ligger 550 m ö.h. will be interpreted as kiruna centrum ligger fem-hundra-femtio meter över havet.
Initialisms
Initialisms with a period (dot) following each letter (e.g. E.U., H.D.M.I.) will be pronounced by spelling out each letter.
Common initialisms without periods (e.g. EU, HDMI) will also be recognized and properly pronounced.
All vowelless words are recognized as initialisms.
Examples
- H.D.M.I. will be pronounced h d m i.
- i EU will be pronounced i e u.
- IT-branschen will be pronounced i t branschen.
- SVT will be pronounced s v t.
- pwq will be pronounced p w q.
Street address
In most cases LumenVox properly recognizes and normalizes street addresses in Sweden.
Example
- Stockholm University, SE-10691 Stockholm, Sweden will be pronounced stockholm university, s e ett hundra sex nittioett stockholm, sweden.
Telephone number
LumenVox recognizes most Swedish telephone number formats and groups digits in pairs or triplets.
Examples
- 08-501 361 01 will be pronounced as noll åtta fem-hundra-ett tre-hundra-sextioett noll ett.
- telefon: 123456 will be pronounced as telefon, tolv trettiofyra femtiosex.
Identifier
Non-words not described elsewhere will be treated as identifiers. This group includes mixes of letters and digits, such as r121, as well as URLs, e-mail addresses, or fancy proper names unknown to the synthesizer.
Numbers within identifiers such as r121, x01, b987654 will be read as numbers if they consist of up to 4 digits, and will be read as a series of digits otherwise.
Punctuation characters within identifiers will be pronounced.
Examples
- er125lp will be pronounced er ett-hundra-tjugofem l p.
- http://www.lumenvox.com will be pronounced h t t p kolon snedstreck snedstreck w w w punkt lumenvox punkt com.
- B!0 will be pronounced b utropstecken noll.
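The identifier rules can be sketched as a tokenizer that labels each part of an identifier. This is a rough model of the description above, not the synthesizer's code: the function name split_identifier and the labels are hypothetical, and whether a letter run is spelled out or read as a word is not modeled.

```python
import re

def split_identifier(identifier: str) -> list[tuple[str, str]]:
    """Split an identifier into labeled runs of letters, digits and punctuation."""
    parts = []
    # Digit runs, letter runs, or any other single character.
    for chunk in re.findall(r"[0-9]+|[^\W\d_]+|.", identifier):
        if chunk[0] in "0123456789":
            # Up to 4 digits are read as a number, longer runs digit by digit.
            label = "number" if len(chunk) <= 4 else "digits"
        elif chunk.isalpha():
            label = "letters"
        else:
            label = "punctuation"
        parts.append((chunk, label))
    return parts
```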
SSML say-as attribute values
The SSML element say-as gives users the possibility to annotate fragments of text in order to force particular interpretation.
Marking a fragment with say-as disables most default normalization rules that would otherwise have been applied. It is therefore advisable to use say-as sparingly, only when the default normalization rules fail and produce speech different from what the user expects.
The W3C has published a Working Group Note describing say-as attribute values for SSML 1.0, which LumenVox mostly follows.
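When SSML is generated programmatically, a small helper can keep the say-as markup well-formed. The sketch below uses only the Python standard library; the function name say_as is hypothetical, and which interpret-as and format values are accepted is determined by the sections that follow, not by this code.

```python
from xml.sax.saxutils import escape, quoteattr

def say_as(text: str, interpret_as: str, **attrs: str) -> str:
    """Wrap text in an SSML say-as element, escaping markup characters."""
    parts = [f"interpret-as={quoteattr(interpret_as)}"]
    # Extra attributes such as format or detail are passed as keyword arguments.
    parts += [f"{name}={quoteattr(value)}" for name, value in attrs.items()]
    return f"<say-as {' '.join(parts)}>{escape(text)}</say-as>"
```

For example, say_as("01/02/03", "date", format="ymd") produces the date markup used in the examples below, and ampersands or angle brackets in the text are escaped.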
Date
LumenVox will interpret a value as a date, when used within say-as with interpret-as="date". This works just as defined in the W3C note. The format attribute may be set to any of the following: mdy, dmy, ymd, md, dm, ym, my, d, m, y.
Examples
- <say-as interpret-as="date" format="ymd">01/02/03</say-as> will be pronounced tredje i andra noll ett.
- <say-as interpret-as="date">1234</say-as> will be pronounced tolv-hundra-trettiofyra.
Duration
A token like 7'10" would by default be recognized as a length in feet and inches. However, it can be forced to be read as a duration in minutes and seconds by wrapping it in a say-as element with interpret-as="time".
Example
- <say-as interpret-as="time">2'10"</say-as> will be pronounced två minuter och tio sekunder.
Telephone number
Telephone numbers may be marked with the say-as element having interpret-as="telephone". In a telephone number LumenVox will read most digits and letters individually, as well as properly read the extension number and the characters * and #.
Examples
- <say-as interpret-as="telephone">1-800-555234</say-as> will be pronounced ett åtta-hundra femtiofem femtiotvå trettiofyra.
Character string
LumenVox will read individual characters for text within the say-as element having interpret-as="characters". The format attribute is ignored. The detail attribute may be used to force pauses, as described in the W3C Note.
Examples
- <say-as interpret-as="characters">speed</say-as> will be pronounced s p e e d.
- <say-as interpret-as="characters" detail="3 1 2">1a3BZ7</say-as> will be pronounced ett a tre b z sju.
Cardinal number
LumenVox will attempt to read values within say-as having interpret-as="cardinal" as cardinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
Examples
- <say-as interpret-as="cardinal">1999</say-as> will be pronounced ett-tusen-nio-hundra-nittionio.
- <say-as interpret-as="cardinal">CLI</say-as> will be pronounced ett-hundra-femtioen.
Ordinal number
LumenVox will attempt to read values within say-as having interpret-as="ordinal" as ordinal numbers. The format and detail attributes are ignored. Roman numerals are supported.
Examples
- <say-as interpret-as="ordinal">21</say-as> will be pronounced tjugo-första.
- <say-as interpret-as="ordinal">VI</say-as> will be pronounced sjätte.
Fraction
LumenVox will interpret values within say-as having interpret-as="fraction" as common fractions. The syntax for fractions is any of the following:
Fraction
["+" | "-" | "±"] cardinal "/" cardinal
Non-negative mixed number
["+" | "±"] cardinal "+" cardinal "/" cardinal
Negative mixed number
"-" cardinal "-" cardinal "/" cardinal
where cardinal is a number as defined under Cardinal number above.
Examples
- <say-as interpret-as="fraction">2/9</say-as> will be pronounced två niondelar.
- <say-as interpret-as="fraction">3+1/2</say-as> will be pronounced tre och en halv.
- <say-as interpret-as="fraction">-2-3/8</say-as> will be pronounced minus två och tre åttondelar.
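The three syntaxes can be checked before sending text to the synthesizer. The sketch below encodes them as regular expressions; the names CARD, PATTERNS and classify_fraction are hypothetical, and the cardinal pattern follows the definition in the Cardinal number section under the assumption that thousands groups contain exactly three digits.

```python
import re
from typing import Optional

# Cardinal: a lone digit, a digit run without a leading zero,
# or dot-separated groups of three digits.
CARD = r"(?:\d|[1-9]\d*|[1-9]\d{0,2}(?:\.\d{3})+)"

PATTERNS = {
    "fraction": re.compile(rf"[+\-±]?{CARD}/{CARD}"),
    "non-negative mixed number": re.compile(rf"[+±]?{CARD}\+{CARD}/{CARD}"),
    "negative mixed number": re.compile(rf"-{CARD}-{CARD}/{CARD}"),
}

def classify_fraction(value: str) -> Optional[str]:
    """Return the name of the matching syntax, or None if none matches."""
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None
```

Note that the space-separated form of a mixed number (2 1/2), which is accepted in unannotated text, does not match any of these say-as syntaxes.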
Measurement
Measurements may be marked with say-as having interpret-as="unit" (or interpret-as="measure"). The valid syntax is the following:
Unit
symbol [ "2" | "3" | "4" | "²" | "³" ] [ "/" unit ]
Measurement
number unit
Adjective measurement
number "-" unit
A unit symbol may be almost any of the standard metric, imperial or other unit symbols, e.g. N (newtons), kJ (kilojoules), mi (miles), sqft (square feet), MiB (mebibytes), ly (light years), tbsp (tablespoons), °F (degrees Fahrenheit), psi (pounds per square inch), etc. The unit name does not contain periods (dots). In general the unit symbols are case sensitive, so B is bytes and b is bits, but unambiguous symbols are matched case-insensitively, so that the proper Hz and the improper hz, HZ and hZ will all be treated as the frequency unit hertz.
The SI prefixes as well as binary prefixes may be prepended to unit symbols, if appropriate.
A unit symbol may be suffixed with a power like 2 or 3, so that m2 is square meters and s3 is seconds cubed.
A number may be either a cardinal, a signed integer, a real number, or a fraction, as described above.
Examples
- <say-as interpret-as="unit">2nmi</say-as> will be pronounced två nautiska mil.
- <say-as interpret-as="unit">1+1/2tsp</say-as> will be pronounced en och en halv tesked.
- <say-as interpret-as="unit">5m/s2</say-as> will be pronounced fem meter per kvadratsekund.
- <say-as interpret-as="unit">2,100rpm</say-as> will be pronounced två-tusen-ett-hundra varv per minut.
- <say-as interpret-as="unit">2.7µF</say-as> will be pronounced två komma sju microfarad.
Street address
Street addresses or parts of an address may be marked with say-as having interpret-as="address". This will force special pronunciation of Swedish postal codes (grouping them into three plus two digits).
Examples
- <say-as interpret-as="address">Alphyddevägen 55, 13135 Nacka</say-as> will be pronounced alphyddevägen femtiofem, ett-hundra-trettioett trettiofem nacka.
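The three-plus-two grouping of postal codes can be sketched as a small formatting helper. The function name group_postal_code is hypothetical, and the sketch assumes a postal code of exactly five ASCII digits, optionally containing whitespace.

```python
import re

def group_postal_code(code: str) -> str:
    """Format a five-digit Swedish postal code as three plus two digits."""
    digits = re.sub(r"\s", "", code)
    if re.fullmatch(r"[0-9]{5}", digits) is None:
        raise ValueError(f"not a five-digit postal code: {code!r}")
    return f"{digits[:3]} {digits[3:]}"
```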
LumenVox special characters
As mentioned at the very beginning of this text, it is sometimes necessary to modify texts to be synthesized in order to make them compatible with the system constraints and achieve the expected output. LumenVox provides a set of special characters that work only in certain contexts, changing the way texts are being synthesized in terms of pronunciation or intonation. The characters are language-specific and do not apply to other languages unless specified otherwise in the language-specific documentation.
Force rising intonation
A question mark followed by a caret, also known as a circumflex (?^), can be used to force rising intonation in a question. Wh-questions (questions starting with an interrogative pronoun) have falling intonation by default. This can be changed by appending a caret to the question mark.
Example
- Hur mår du?^ will result in a rising intonation.
Force falling intonation
A question mark followed by an underscore (?_) can be used to force the intonation of a question to be falling. Yes/No questions by default have a rising intonation. This can be changed by appending the underscore character to the question mark.
Example
- Är du okej?_ will result in a falling intonation.
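An application that builds prompts dynamically could append these markers with a helper like the one below. The function name force_intonation is hypothetical; the sketch only appends the caret or underscore described above and does not check whether the question is a wh-question or a yes/no question.

```python
def force_intonation(question: str, rising: bool) -> str:
    """Append ^ (rising) or _ (falling) to text ending with a question mark."""
    base = question.rstrip()
    if not base.endswith("?"):
        raise ValueError("expected text ending with a question mark")
    return base + ("^" if rising else "_")
```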