Text-to-Speech Powered by Amazon Polly

For a more natural-sounding Text-to-Speech experience and greater control over speech parameters, you can request that Text-to-Speech powered by Amazon Polly be enabled for your account.

You can choose between Amazon Polly, which provides natural-sounding Text-to-Speech, and Amazon Polly Neural, which provides higher-quality voices (see the Supported Languages table below for the languages available in Amazon Polly Neural).

Enabling Text-to-Speech powered by Amazon Polly

To enable Amazon Polly or Amazon Polly Neural in your account, please reach out to your Customer Success Manager (CSM) or Professional Services representative.

Usage costs

Using Text-to-Speech powered by Amazon Polly incurs costs. Please reach out to your Customer Success Manager (CSM) for more details.

Supported Languages

Please find below the list of supported languages with Amazon Polly:

[Table: languages supported by Amazon Polly and Amazon Polly Neural]

Listen to the audio samples of supported languages and voices by visiting this page.

For languages that are available in Studio but not supported by Text-to-Speech powered by Amazon Polly, Studio’s standard Text-to-Speech voice is used.

Controlling Speech Parameters with Text-to-Speech powered by Amazon Polly

To control speech parameters, the text in the Text-to-Speech section must be placed between the opening <speak> tag and the closing </speak> tag.

Example: 
<speak>Hello and thank you for calling our Support line.</speak>

Within these tags, you can use different control mechanisms to adapt the speech to specific needs.

Adding a pause
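
A minimal sketch of adding a pause, assuming the standard SSML <break> tag as documented for Amazon Polly. The tag and its attributes are not described elsewhere in this article, so treat the details as assumptions to verify against the Amazon Polly SSML reference. The pause length can be given with the time attribute, in seconds (s) or milliseconds (ms), or with the strength attribute using a predefined value such as "medium" or "strong". The comment line is for the reader only and should be removed before pasting the message into Studio.

```xml
<!-- Assumption: standard SSML <break> tag. Remove this comment line before use. -->
<speak>Hello and thank you for calling our Support line. <break time="1s"/> Please hold while we connect you. <break strength="strong"/> Your call is important to us.</speak>
```

Unlike the other tags in this article, <break> wraps no text: it is written as a single self-closing tag at the point where the pause should occur.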

Emphasizing words

To emphasize words, use the <emphasis> tag. Emphasizing words changes the speaking rate and volume. To do that:
1. Wrap the part of the speech you want to emphasize in the opening <emphasis level="value"> tag and the closing </emphasis> tag. The supported values are the following:
 - "strong": Increases the volume and slows the speaking rate so that the speech is louder and slower.
 - "moderate": Increases the volume and slows the speaking rate, but less than strong. Moderate is the default.
 - "reduced": Decreases the volume and speeds up the speaking rate. Speech is softer and faster.
Example:
<speak>Hello and thank you for calling our <emphasis level="strong">Support line</emphasis>. We wish you a great day.</speak>

Note: Controlling the emphasis for words is not available for Amazon Polly Neural voices.

Controlling Volume, Speaking Rate, and Pitch

To control the volume, speaking rate, and pitch of the voice, use the <prosody> tag as follows:

Controlling Volume:

1. Wrap the part of the speech whose volume you want to control in the opening <prosody volume="value"> tag and the closing </prosody> tag. The supported values are the following:
 - "default": Resets volume to the default level for the current voice.
 - "silent", "x-soft", "soft", "medium", "loud", "x-loud": Sets the volume to a predefined value for the current voice.
 - "+ndB", "-ndB": Changes volume relative to the current level. A value of +0dB means no change, +6dB means approximately twice the current volume, and -6dB means approximately half the current volume.
Example:
<speak>Hello and thank you for calling our <prosody volume="loud">Support line</prosody>. We wish you a great day.</speak>

Controlling Speaking Rate:

1. Wrap the part of the speech whose speaking rate you want to control in the opening <prosody rate="value"> tag and the closing </prosody> tag. The supported values are the following:
  - "x-slow", "slow", "medium", "fast", "x-fast": Sets the speaking rate to a predefined value for the selected voice.
  - "n%": Sets the speaking rate as a percentage of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. This value has a range of 20-200%.

Example 1:
<speak>Hello and thank you for calling our <prosody rate="fast">Support line</prosody>. We wish you a great day.</speak>

Example 2:
<speak>Hello and thank you for calling our <prosody rate="60%">Support line</prosody>. We wish you a great day.</speak>

Controlling Pitch:

1. Wrap the part of the speech whose pitch you want to control in the opening <prosody pitch="value"> tag and the closing </prosody> tag. The supported values are the following:
 - "x-low", "low", "medium", "high", "x-high": Sets the pitch to a predefined value for the current voice.
 - "+n%" or "-n%": Adjusts pitch by a relative percentage. For example, a value of +0% means no baseline pitch change, +5% gives a slightly higher baseline pitch, and -5% gives a slightly lower baseline pitch.
Example:
<speak>Hello and thank you for calling our <prosody pitch="x-low">Support line</prosody>. We wish you a great day.</speak>

Note: Amazon Polly Neural voices only support the Volume and Speaking Rate attributes, and not the Pitch attribute.
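
The volume and rate attributes can also be set together on a single <prosody> tag. The sketch below assumes that combining attributes this way is accepted, as in standard SSML. It uses only value forms from the lists above and leaves out pitch so that it also applies to Amazon Polly Neural voices. The message wording is made up for illustration, and the comment line should be removed before pasting the message into Studio.

```xml
<!-- Illustrative only: volume and rate combined in one <prosody> tag. Remove this comment line before use. -->
<speak>Hello and thank you for calling our Support line. <prosody volume="+6dB" rate="90%">Please note that our opening hours have changed.</prosody> We wish you a great day.</speak>
```

Setting both attributes on one tag avoids nesting two <prosody> tags around the same text.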

Controlling how special types of words are spoken

To control how special types of words are spoken, use the <say-as> tag with the interpret-as attribute. To do that:
1. Wrap the part of the speech you want to control in the opening <say-as interpret-as="value"> tag and the closing </say-as> tag. The supported values are the following:
  - "characters" or "spell-out": Spells out each letter of the text, as in a-b-c.
    Note: The values "characters" and "spell-out" are not supported by Amazon Polly Neural voices.
  - "cardinal" or "number": Interprets the numerical text as a cardinal number, as in 1,234.
  - "ordinal": Interprets the numerical text as an ordinal number, as in 1,234th.
  - "digits": Spells out each digit individually, as in 1-2-3-4.
  - "fraction": Interprets the numerical text as a fraction. This works for both common fractions, such as 3/20, and mixed fractions, such as 2 ½. To have a mixed fraction such as 4 ½ read as "four and a half", write it as 4+1/2 (example: <say-as interpret-as="fraction">4+1/2</say-as> is read as "four and a half").
  - "unit": Interprets the numerical text as a measurement. The value should be either a number or a fraction followed by a unit with no space in between, as in 1/2inch, or by just a unit, as in 1meter.
  - "date": Interprets the text as a date. The format of the date must be specified with the format attribute, using the following syntax: <say-as interpret-as="date" format="format">[date]</say-as>
    The following formats can be used:
     - "mdy": Month-day-year.
     - "dmy": Day-month-year.
     - "ymd": Year-month-day.
     - "md": Month-day.
     - "dm": Day-month.
     - "ym": Year-month.
     - "my": Month-year.
     - "d": Day.
     - "m": Month.
     - "y": Year.
     - "yyyymmdd": Year-month-day. If you use this format, you can make Amazon Polly skip parts of the date using question marks.
  - "time": Interprets the numerical text as a duration, in minutes and seconds, as in 1'21".

Example using "date" with the format attribute:
<speak>Hello and thank you for calling our support line. We will be closed on <say-as interpret-as="date" format="dmy">25/12/2021</say-as>. Thank you for understanding.</speak>

Example using "characters":
<speak>Hello and thank you for calling our Support line. Here is a promo code for you <say-as interpret-as="characters">U74HDDKOM</say-as>. We wish you a great day.</speak>
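
The other interpret-as values follow the same pattern. The messages below are illustrative sketches: the wording and numbers are made up, and each one wraps only the part of the text that needs special handling. The comment lines describe each message and should be removed before pasting a message into Studio.

```xml
<!-- "digits": read a reference number one digit at a time -->
<speak>Your ticket number is <say-as interpret-as="digits">1234</say-as>.</speak>

<!-- "ordinal": read a number as a position -->
<speak>You are <say-as interpret-as="ordinal">3</say-as> in the queue.</speak>

<!-- "fraction": mixed fraction written as whole+numerator/denominator -->
<speak>The estimated wait is <say-as interpret-as="fraction">4+1/2</say-as> minutes.</speak>

<!-- "unit": number or fraction followed by the unit, with no space in between -->
<speak>The adapter is <say-as interpret-as="unit">1/2inch</say-as> wide.</speak>

<!-- "date" with "yyyymmdd": question marks skip the year -->
<speak>We will be closed on <say-as interpret-as="date" format="yyyymmdd">????1225</say-as>.</speak>
```

Because the exact spoken result can vary by language and voice, it is worth listening to each message with the selected voice before publishing it.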

Controlling phonetic pronunciation

To adjust how specific words are pronounced, use the <phoneme> tag. The tag requires two attributes that indicate how the word should be pronounced:
- alphabet (supported values):
   - "ipa": Indicates that the International Phonetic Alphabet (IPA) will be used.
   - "x-sampa": Indicates that the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) will be used.
- ph:
   - Specifies the phonetic symbols for pronunciation. For more information, consult this page.

The two examples below use the <phoneme> tag, one with ipa and the other with x-sampa:

Example (using “ipa”):
<speak> You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>. I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.</speak>

Example (using “x-sampa”):
<speak> You say, <phoneme alphabet='x-sampa' ph='pI"kA:n'>pecan</phoneme>. I say, <phoneme alphabet='x-sampa' ph='"pi.k{n'>pecan</phoneme>. 
</speak>

Note: The "ipa" and "x-sampa" alphabets require different quotation marks around attribute values. With "ipa", enclose the attribute values in double quotation marks ("). With "x-sampa", enclose them in single quotation marks ('), because X-SAMPA uses the double quotation mark (") to mark the primary stress of a word.

Note: If you use SSML tags with Text-to-Speech powered by Amazon Polly to control speech parameters and later decide to switch back to Studio’s standard Text-to-Speech, you will need to remove the SSML tags from your messages to ensure that they are played correctly.