Media server connectivity
The Media Server provides connectivity to clients over a variety of network protocols, all adhering to published standards. This makes connecting to LumenVox products easy and well documented, and a variety of libraries can be used to interface with the LumenVox Media Server.
Because this communication involves a variety of networking protocols and standards, there are a number of protocols and acronyms that need to be described, each of which serves an important function. You may not need to understand how they all work, or what they all mean, in order to use them, since many speech-related products these days offer these types of interfaces.
Control Protocols
Control Protocols are used to establish connections (sessions) when connecting to LumenVox using MRCP to make requests of the LumenVox Automatic Speech Recognition (ASR), Text-To-Speech (TTS) or other services.
LumenVox supports both RTSP and SIP control protocols. These protocols can be used to open or establish connections to the LumenVox Media Server from a client application. Typically you would use one or the other of these protocols, depending on your specific needs. However, both can be active at the same time within the Media Server, so you could, for example, use RTSP for audio synthesis (TTS) and SIP for speech recognition (ASR) if needed.
It is worth noting that the LumenVox API can be used directly without involving any MRCP connections or using the LumenVox Media Server at all. Our various Application Programming Interfaces are well documented and fully supported, as described in our Core API documentation. The choice of whether to use our API or MRCP is left up to you. We have many developers around the world using both options. Each has its own benefits, but either can be used to perform most of the ASR and TTS tasks that the majority of applications need.
Because Media Server connectivity is primarily network based, it can be used to connect processes within a single machine, a local network or across the Internet as needed, with similar speed and performance. This can be especially useful in today's cloud and mobile dominated world.
RTSP - Real Time Streaming Protocol
This is an IETF-defined network control protocol that was designed for entertainment and communications systems controlling media streams (such as those used by the LumenVox Media Server for ASR and TTS audio streams).
RTSP is used with MRCP Version 1 (MRCPv1) implementations.
Diagram showing an RTSP session between an MRCP client application and the LumenVox Media Server
Note that from a networking perspective, RTSP sessions use TCP connections only, and do not use UDP sockets for communication. The default communication port for RTSP is 554, however it is important to note that you cannot have two processes running on the same physical machine that are both using the same port, so occasionally it may be necessary to change the port number being used. See Media Server Specific Parameters for information on making such changes.
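Because only one process can listen on a given TCP port, it can help to check whether the default RTSP port is already taken before deciding to change it. The following is a minimal sketch using only the Python standard library; the function name is_tcp_port_free is hypothetical and is not part of any LumenVox tool.

```python
import socket


def is_tcp_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Already in use, or the current user may not bind this port.
            return False
    return True


if __name__ == "__main__":
    # 554 is the default RTSP port mentioned above. Ports below 1024 usually
    # require elevated privileges, so False can also mean "not permitted".
    print("RTSP port 554 free:", is_tcp_port_free(554))
```

If the check reports that the port is unavailable, either stop the other process or configure a different RTSP port on both the Media Server and the client.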
There are two settings in MRCPv1 that are used to specify the references for recognizer and synthesizer resources. If there is a mismatch in these strings between the client and the Media Server, whenever the resources are requested, the client or server may not understand which is being asked for. These resource definitions (resource URLs) can generally be changed at either the client end or the Media Server end so that they match (it doesn't really matter which end you change). For details on changing these resource URLs, see the media_server.conf article.
LumenVox supports the IETF RFC2326 memo describing Real Time Streaming Protocol.
SIP - Session Initiation Protocol
This is an IETF-defined signaling protocol that has been widely adopted for controlling communication sessions, such as VoIP calls. Version 28 of this draft is described here. This protocol was originally designed in 1996 by Henning Schulzrinne and Mark Handley, and has become popular in recent years to the point where many speech application developers have adopted it. Most VXML implementations use SIP connectivity, which can be used to easily connect telephony and speech systems in a controlled Interactive Voice Response (IVR) environment.
SIP is used with MRCP Version 2 (MRCPv2) implementations.
Diagram showing a SIP session between an MRCP client application and the LumenVox Media Server
From a networking perspective, SIP differs from RTSP in that it may use either UDP or TCP connections to communicate, depending on the client requirements. SIP over TCP support was added in LumenVox version 11.1, so both UDP and TCP connections are now fully supported for SIP. Prior to version 11.1, only UDP connections were supported for SIP.
A SIP session differs further from RTSP in that, once the session is established, a second communication channel (always TCP-based) is opened for the MRCP traffic. The port numbers used for MRCP are negotiated during the session initialization (via SDP). This means that all of the session control information is sent using SIP over UDP or TCP, while all of the MRCP information is sent over its own dedicated TCP connection.
This separation can be useful for network engineers needing to control how traffic is being routed. For example, using SIP, it is possible to configure proxy servers and routers to send the SIP (session control) traffic via one network path, and MRCP (resource control) traffic via a different path. When configuring large systems, connecting to Session Border Controllers (SBCs) and proxy servers, this may be beneficial, however this topic is beyond the scope of this overview.
LumenVox supports the IETF RFC3261 memo describing Session Initiation Protocol.
Media Resource Control Protocol - MRCP
This protocol is used once a session has been established using either SIP or RTSP. It differs from those protocols, which control the overall state of the session or connection. Instead, MRCP is used to control the various speech resources that are used within the session. For example, if the client application wants to request some TTS audio, or if it wants to request speech recognition, MRCP would be used to communicate these requests.
As was mentioned above, RTSP uses Version 1 of MRCP, while SIP uses Version 2 of MRCP. Both versions perform very similar functions, however there are subtle differences between them which may need to be considered if you intend to write the MRCP protocol handler yourself (this is not a small task).
As well as communicating requests, responses and events between the client and Media Server, these messages also control audio media which may be streamed between the two. These types of audio/media streams are transported by another protocol, called RTP, which is described below.
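To make the requests, responses and events mentioned above more concrete, here is a small sketch that classifies the first line of an MRCPv2 message, following the start-line layouts given in RFC 6787. The function name and the sample lines (including their message-length values) are illustrative assumptions, not captured LumenVox traffic, and the sketch does not attempt to parse headers or message bodies.

```python
def classify_mrcpv2_start_line(line: str) -> dict:
    """Classify an MRCPv2 start line as a request, response, or event."""
    tokens = line.strip().split(" ")
    if len(tokens) not in (4, 5) or not tokens[0].startswith("MRCP/"):
        raise ValueError(f"not an MRCPv2 start line: {line!r}")
    version, length = tokens[0], int(tokens[1])

    if tokens[2].isdigit():
        # Response: version length request-id status-code request-state
        if len(tokens) != 5:
            raise ValueError(f"malformed response line: {line!r}")
        return {"kind": "response", "version": version, "length": length,
                "request_id": int(tokens[2]), "status": int(tokens[3]),
                "state": tokens[4]}
    if len(tokens) == 4:
        # Request: version length method-name request-id
        return {"kind": "request", "version": version, "length": length,
                "method": tokens[2], "request_id": int(tokens[3])}
    # Event: version length event-name request-id request-state
    return {"kind": "event", "version": version, "length": length,
            "event": tokens[2], "request_id": int(tokens[3]),
            "state": tokens[4]}


if __name__ == "__main__":
    # Illustrative start lines; the length values are placeholders.
    for sample in ("MRCP/2.0 386 SPEAK 543257",
                   "MRCP/2.0 81 543257 200 IN-PROGRESS",
                   "MRCP/2.0 109 SPEAK-COMPLETE 543257 COMPLETE"):
        print(classify_mrcpv2_start_line(sample))
```

MRCPv1 messages carried over RTSP use a different start-line layout, which is one of the differences a hand-written protocol handler would need to account for.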
LumenVox supports the IETF RFC4463 memo describing MRCP (v1) and also the IETF RFC6787 memo describing MRCP Version 2.
Real-time Transport Protocol - RTP
This protocol is used to transport audio streams over a network. These streams may carry either TTS or ASR audio. Typically, TTS traffic flows away from the Media Server and ASR traffic flows towards the Media Server.
There are several types of audio that can be transported using this protocol. LumenVox supports PCMU (ulaw) and PCMA (alaw) encoded at 8 kHz. No other formats are supported.
Typically audio is split into small packets of data representing around 20 ms of time. These small packets are streamed one after the other from one end to the other. The receiver puts these small packets together and uses the audio stream for whatever it needs - playing out audio to a speaker, or sending audio into the speech recognizer.
Each packet has various attributes, describing its format and also which packet number it is within a sequence. This can be important because UDP datagrams are used to transport RTP audio. UDP is very efficient for this task, however packets can become lost or get out of sequence in certain situations. The receiver reviews the sequence information associated with each packet and reassembles the stream as best as it can to maintain audio quality.
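The packet size and sequence handling described above can be sketched as follows. This assumes the fixed 12-byte header layout from RFC 3550 and one byte per sample for PCMU/PCMA; the names RtpPacket, build_rtp_packet, parse_rtp_packet and reorder are hypothetical. It is deliberately simplified: it ignores header extensions, padding, and sequence-number wraparound, all of which a real receiver would need to handle.

```python
import struct
from typing import List, NamedTuple

SAMPLE_RATE_HZ = 8000   # PCMU/PCMA sample rate
PACKET_MS = 20          # typical packet duration
# One byte per sample for PCMU/PCMA, so 20 ms is 160 payload bytes.
SAMPLES_PER_PACKET = SAMPLE_RATE_HZ * PACKET_MS // 1000


class RtpPacket(NamedTuple):
    payload_type: int   # 0 = PCMU, 8 = PCMA
    sequence: int
    timestamp: int
    ssrc: int
    payload: bytes


def build_rtp_packet(payload_type: int, sequence: int, timestamp: int,
                     ssrc: int, payload: bytes) -> bytes:
    """Pack a minimal RTP packet (version 2, no padding, extension or CSRCs)."""
    header = struct.pack("!BBHII", 0x80, payload_type & 0x7F,
                         sequence & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


def parse_rtp_packet(data: bytes) -> RtpPacket:
    """Parse the fixed RTP header and return the packet fields and payload."""
    if len(data) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    if first >> 6 != 2:
        raise ValueError("not an RTP version 2 packet")
    header_len = 12 + 4 * (first & 0x0F)   # skip any CSRC identifiers
    return RtpPacket(second & 0x7F, sequence, timestamp, ssrc, data[header_len:])


def reorder(packets: List[RtpPacket]) -> List[RtpPacket]:
    """Put packets back into sequence order (wraparound is not handled)."""
    return sorted(packets, key=lambda p: p.sequence)


if __name__ == "__main__":
    silence = b"\xff" * SAMPLES_PER_PACKET   # 0xFF is digital silence in PCMU
    wire = [build_rtp_packet(0, seq, seq * SAMPLES_PER_PACKET, 0x1234, silence)
            for seq in (2, 0, 1)]            # simulate out-of-order arrival
    ordered = reorder([parse_rtp_packet(w) for w in wire])
    print([p.sequence for p in ordered], len(ordered[0].payload))
```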
In addition to audio, Dual Tone Multi-Frequency (DTMF, or touch-tone) packets are sent over the RTP stream. It is important to understand that the beep itself is not sent as audio, since such "in-band" beeps would interfere with speech recognition. Instead, these DTMF tones are sent as RTP Events, which are special packets indicating which key was pressed.
The decision as to which RTP ports are used within a session is negotiated whenever the session is established. This associates the RTP stream with a specific resource (recognizer or synthesizer), which also determines the stream's direction.
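As an illustration of what such an RTP Event carries, the sketch below decodes the 4-byte event payload defined in RFC 2833 (event code, end flag, volume, duration). The function name parse_dtmf_event and the sample bytes are assumptions made for this example; the payload type number used for these events is negotiated per session and is not shown here.

```python
import struct

# Event codes 0-15 map to the sixteen DTMF keys in this order.
DTMF_KEYS = "0123456789*#ABCD"


def parse_dtmf_event(payload: bytes) -> dict:
    """Decode the 4-byte telephone-event payload carried in an RTP packet."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_KEYS):
        raise ValueError(f"event code {event} is not a DTMF key")
    return {
        "key": DTMF_KEYS[event],
        "end": bool(flags & 0x80),   # E bit: set on the final packet of the key press
        "volume": flags & 0x3F,      # tone level, expressed as -dBm0
        "duration": duration,        # in RTP timestamp units (8000 per second here)
    }


if __name__ == "__main__":
    # Key "5", end of event, volume 10, duration 800 units (100 ms at 8 kHz).
    print(parse_dtmf_event(bytes([5, 0x8A, 0x03, 0x20])))
```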
LumenVox supports the IETF RFC3550 memo describing Real-time Transport Protocol and also the IETF RFC2833 memo describing the use of DTMF over RTP.
Session Description Protocol - SDP
This protocol is used in conjunction with SIP and RTSP to establish multimedia sessions. This essentially means that using SDP you can describe the various audio streams and MRCP streams that are needed within a session.
SDP is used when negotiating a session and is used to describe the streaming media initialization parameters, including which audio format to use and which ports for RTP and MRCP (in the case of SIP sessions) should be used.
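The following sketch shows how the media descriptions in an SDP body expose the audio format and the ports for RTP and MRCP. The sample offer is modeled on the recognizer examples in RFC 6787 and uses a documentation IP address; it is illustrative rather than a capture of LumenVox traffic, and the parse_media_lines helper is hypothetical. In an MRCPv2 client offer the control channel port is the placeholder 9, and the server's answer supplies the real MRCP port.

```python
SAMPLE_SDP = "\r\n".join([
    "v=0",
    "o=client 0 0 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=application 9 TCP/MRCPv2 1",   # MRCP control channel (TCP)
    "a=setup:active",
    "a=connection:new",
    "a=resource:speechrecog",
    "a=cmid:1",
    "m=audio 49170 RTP/AVP 0",        # RTP audio stream, payload type 0
    "a=rtpmap:0 PCMU/8000",
    "a=sendonly",                     # client sends audio to the recognizer
    "a=mid:1",
]) + "\r\n"


def parse_media_lines(sdp: str) -> list:
    """Collect each m= line, with the a= attributes that follow it."""
    media = []
    for line in sdp.splitlines():
        if line.startswith("m="):
            kind, port, proto, *formats = line[2:].split()
            media.append({
                "media": kind,
                "port": int(port.split("/")[0]),
                "proto": proto,
                "formats": formats,
                "attributes": [],
            })
        elif line.startswith("a=") and media:
            media[-1]["attributes"].append(line[2:])
    return media


if __name__ == "__main__":
    for description in parse_media_lines(SAMPLE_SDP):
        print(description["media"], description["port"], description["proto"])
```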
LumenVox supports the IETF RFC4566 memo describing the Session Description Protocol.
Putting it all together
Using a combination of the above protocols, you can connect the LumenVox Media Server to a wide range of applications, from large automated telephony-based IVR systems to desktop applications and in-car systems. Any application requiring speech technology can use MRCP to connect across networks, or within a single self-contained system.
How client applications can use MRCP to connect to the LumenVox Media Server