Synthesizer SPEAK

The SPEAK method from the client to the server provides the synthesizer resource with the speech text and initiates speech synthesis and streaming. The SPEAK method can carry voice and prosody header fields that define the behavior of the voice being synthesized, as well as the actual marked-up text to be spoken. If voice and prosody parameters are specified as part of the speech markup text, they take precedence over the values specified in the header fields and those set using a previous SET-PARAMS request.

Voice parameters can be set at three levels of scope. The highest precedence goes to those specified within the speech markup text. Next are those specified in the header fields of the SPEAK request, which apply to that SPEAK request only. Last are the session default values, which can be set using the SET-PARAMS request and apply to the rest of the session.
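The precedence order above can be sketched as a simple lookup. This is only an illustration of the rule, not LumenVox code: the function name `resolve_voice_param` and the sample values are hypothetical, and each scope is modeled as a plain dictionary.

```python
from typing import Mapping, Optional


def resolve_voice_param(
    name: str,
    markup: Mapping[str, str],
    speak_headers: Mapping[str, str],
    session_defaults: Mapping[str, str],
) -> Optional[str]:
    """Return the effective value of one voice or prosody parameter.

    Scopes are checked from highest to lowest precedence:
    speech markup, then SPEAK header fields, then SET-PARAMS session defaults.
    """
    for scope in (markup, speak_headers, session_defaults):
        if name in scope:
            return scope[name]
    return None  # not set at any level


# Hypothetical values, for illustration only.
session_defaults = {"Voice-gender": "male", "Prosody-volume": "soft"}
speak_headers = {"Prosody-volume": "medium"}
markup = {"Prosody-rate": "0.8"}

print(resolve_voice_param("Prosody-volume", markup, speak_headers, session_defaults))
print(resolve_voice_param("Voice-gender", markup, speak_headers, session_defaults))
```

Here the SPEAK header overrides the session default for Prosody-volume, while Voice-gender falls through to the session default because neither the markup nor the SPEAK request sets it.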

If the resource is idle and the SPEAK request is being actively processed, the resource will respond with a success status code and a request-state of IN-PROGRESS.

If the resource is in the speaking or paused state (i.e., it is in the middle of processing a previous SPEAK request), it returns a success status code and a request-state of PENDING. This means that the SPEAK request is queued and will be processed after the currently active SPEAK request is completed.

For the synthesizer resource, this is the only request that can return a request-state of IN-PROGRESS or PENDING. When synthesis of the text is complete, the resource will issue a SPEAK-COMPLETE event with the request-id of the SPEAK message and a request-state of COMPLETE.
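The request-state behavior described above can be modeled with a small queue. The sketch below is a simplified illustration under stated assumptions: the class `SynthesizerSketch` and its methods are hypothetical, the speaking and paused states are both treated as "a request is active", and no audio or network handling is included.

```python
from collections import deque


class SynthesizerSketch:
    """Toy model of how SPEAK requests map to request-states."""

    def __init__(self):
        self.active = None      # request-id currently being processed, if any
        self.pending = deque()  # queued request-ids, in arrival order

    def speak(self, request_id):
        """Accept a SPEAK request and return its request-state."""
        if self.active is None:
            # Resource is idle: the request is processed immediately.
            self.active = request_id
            return "IN-PROGRESS"
        # Resource is speaking or paused: the request waits in the queue.
        self.pending.append(request_id)
        return "PENDING"

    def finish_active(self):
        """Finish the active request and promote the next queued one.

        Returns the (request-id, request-state) pair that a
        SPEAK-COMPLETE event would carry.
        """
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        finished = self.active
        self.active = self.pending.popleft() if self.pending else None
        return finished, "COMPLETE"


synth = SynthesizerSketch()
print(synth.speak(543257))    # first request while idle
print(synth.speak(543258))    # second request while the first is active
print(synth.finish_active())  # completion of the first request
```

After the first request completes, the queued request becomes the active one, which matches the description of PENDING requests being processed once the current SPEAK finishes.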



MRCPv1 SPEAK Example:

C->S: SPEAK 543257 MRCP/1.0
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:application/synthesis+ssml
      Content-Length:104

      <?xml version="1.0"?>
      <speak>
        <paragraph>
          <sentence>You have 4 new messages.</sentence>
          <sentence>The first is from <say-as
          type="name">Stephanie Williams</say-as>
          and arrived at <break/>
          <say-as type="time">3:45pm</say-as>.</sentence>

          <sentence>The subject is <prosody
          rate="0.8">ski trip</prosody></sentence>
        </paragraph>
      </speak>

S->C: MRCP/1.0 543257 200 IN-PROGRESS

S->C: SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0
      Completion-Cause:000 normal

MRCPv2 SPEAK Example:

C->S: MRCP/2.0 ... SPEAK 543257
      Channel-Identifier:32AECB23433802@speechsynth
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:application/ssml+xml
      Content-Length:...

      <?xml version="1.0"?>
      <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
        <p>
          <s>You have 4 new messages.</s>
          <s>The first is from Stephanie Williams and arrived at
            <break/>
            <say-as interpret-as="vxml:time">0342p</say-as>.
          </s>
          <s>The subject is
            <prosody rate="0.8">ski trip</prosody>
          </s>
        </p>
      </speak>

S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
      Channel-Identifier:32AECB23433802@speechsynth
      Speech-Marker:timestamp=857206027059

S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
      Channel-Identifier:32AECB23433802@speechsynth
      Completion-Cause:000 normal
      Speech-Marker:timestamp=857206027059

 

Language Selection when using Plain Text Synthesis

When working with TTS synthesis over MRCP, it is preferable to use the SSML Content-Type as described above, because it gives you greater flexibility and control over how words and phrases are pronounced, as well as fine control over the voices used.

Sometimes it is necessary to use plain text for synthesis instead of SSML. This may be a preference, or the platform you are using may not support SSML, in which case you have little choice. In these cases, it is important to understand how to control the TTS language used during synthesis. Prior to LumenVox version 14.1, the language was fixed by the SYNTHESIS_LANGUAGE setting configured in your client_property.conf file. Starting with version 14.1, LumenVox supports the Speech-Language header when processing plain text synthesis requests (with Content-Type: text/plain), as shown in the examples below:

MRCPv1 SPEAK Example using plain text with the British English language:

C->S: SPEAK 543257 MRCP/1.0
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:text/plain
      Speech-Language:en-GB
      Content-Length:31

      LumenVox TTS is the Bee's Knees

S->C: MRCP/1.0 543257 200 IN-PROGRESS

S->C: SPEAK-COMPLETE 543257 COMPLETE MRCP/1.0
      Completion-Cause:000 normal

MRCPv2 SPEAK Example using plain text with the British English language:

C->S: MRCP/2.0 ... SPEAK 543257
      Channel-Identifier:32AECB23433802@speechsynth
      Voice-gender:female
      Prosody-volume:medium
      Content-Type:text/plain
      Speech-Language:en-GB
      Content-Length:31

      LumenVox TTS is the Bee's Knees

S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
      Channel-Identifier:32AECB23433802@speechsynth
      Speech-Marker:timestamp=857206027059

S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
      Channel-Identifier:32AECB23433802@speechsynth
      Completion-Cause:000 normal
      Speech-Marker:timestamp=857206027059

Please note that the Speech-Language header is ignored when using SSML (Content-Type: application/ssml+xml), since the SSML must contain a language specifier, which is used instead of any Speech-Language header that may be present. The Speech-Language header is therefore only used with Content-Type: text/plain requests.
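As a rough client-side illustration of the plain text case, the sketch below assembles an MRCPv1 SPEAK message like the one in the example above. The helper `build_speak_v1` is hypothetical and not part of any LumenVox API. It assumes UTF-8 text, CRLF line endings, and a Content-Length counted in bytes of the body, and it only builds the message; transporting it to the server is out of scope. It adds Speech-Language only for text/plain, since the header is ignored for SSML.

```python
def build_speak_v1(request_id, body, content_type="text/plain",
                   language=None, extra_headers=None):
    """Build an MRCPv1 SPEAK message as bytes (illustrative helper)."""
    payload = body.encode("utf-8")

    headers = dict(extra_headers or {})
    headers["Content-Type"] = content_type
    # Speech-Language is only meaningful for plain text synthesis;
    # SSML carries its own language specifier.
    if language is not None and content_type == "text/plain":
        headers["Speech-Language"] = language
    # Content-Length is the size of the body in bytes, not characters.
    headers["Content-Length"] = str(len(payload))

    lines = [f"SPEAK {request_id} MRCP/1.0"]
    lines += [f"{name}:{value}" for name, value in headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + payload


message = build_speak_v1(
    543257,
    "LumenVox TTS is the Bee's Knees",
    language="en-GB",
    extra_headers={"Voice-gender": "female", "Prosody-volume": "medium"},
)
print(message.decode("utf-8"))
```

Computing Content-Length from the encoded body rather than hard-coding it avoids mismatches when the text changes or contains non-ASCII characters.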


Copyright (C) 2001-2025, Ai Software, LLC d/b/a LumenVox