SSML FAQ

Support for SSML elements

https://cloud.google.com/text-to-speech/docs/ssml-beta

The following sections describe the SSML elements and options that can be used on Notevibes.

<speak>

The root element of the SSML response.

To learn more about the speak element, see the W3 specification.

Example

<speak>
  my SSML content
</speak>
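Because the <speak> element must be the root of the document, it can help to check a string before submitting it. The following is a minimal Python sketch using only the standard library. The function name is_valid_speak_root is hypothetical and is not part of any Notevibes or Google API.

```python
import xml.etree.ElementTree as ET


def is_valid_speak_root(ssml: str) -> bool:
    """Return True if ssml is well-formed XML whose root element is <speak>."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError:
        # Unclosed tags, stray "&" characters, and similar errors land here.
        return False
    # Drop a "{namespace}" prefix, if the document declares one.
    return root.tag.rsplit("}", 1)[-1] == "speak"


print(is_valid_speak_root("<speak>my SSML content</speak>"))
print(is_valid_speak_root("<speak>unclosed"))
```

This only checks well-formedness and the root element name. It does not verify that the child elements are ones the service supports.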

<break>

An empty element that controls pausing or other prosodic boundaries between words. Using <break> between any pair of tokens is optional. If this element is not present between words, the break is automatically determined based on the linguistic context.

To learn more about the break element, see the W3 specification.

Attributes

Attribute Description
time Sets the length of the break in seconds or milliseconds (e.g. "3s" or "250ms").
strength Sets the strength of the output's prosodic break in relative terms. Valid values are: "x-weak", "weak", "medium", "strong", and "x-strong". The value "none" indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break that the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses.

Example

The following example shows how to use the <break> element to pause between steps:

<speak>
  Step 1, take a deep breath. <break time="200ms"/>
  Step 2, exhale.
  Step 3, take a deep breath again. <break strength="weak"/>
  Step 4, exhale.
</speak>
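When break tags are generated programmatically, the two attributes can be validated before the tag is emitted. The sketch below is illustrative only: make_break is a hypothetical helper, and the time pattern (a non-negative number followed by "s" or "ms") is an assumption based on the examples in the attribute table.

```python
import re

# Strength names taken from the attribute table, plus "none".
BREAK_STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}
# Assumed format: a non-negative number followed by "s" or "ms".
TIME_PATTERN = re.compile(r"\d+(\.\d+)?(ms|s)")


def make_break(time=None, strength=None):
    """Build a <break/> tag, validating whichever attributes are given."""
    attrs = []
    if time is not None:
        if not TIME_PATTERN.fullmatch(time):
            raise ValueError(f"invalid break time: {time!r}")
        attrs.append(f'time="{time}"')
    if strength is not None:
        if strength not in BREAK_STRENGTHS:
            raise ValueError(f"invalid break strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print(make_break(time="200ms"))
print(make_break(strength="weak"))
```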

<say-as>

This element lets you indicate information about the type of text construct that is contained within the element. It also helps specify the level of detail for rendering the contained text.

The <say-as> element has one required attribute, interpret-as, which determines how the value is spoken. The optional attributes format and detail may be used depending on the particular interpret-as value.

The interpret-as attribute supports values such as "cardinal", "ordinal", "characters", "fraction", "unit", "date", "time", and "telephone".

To learn more about the say-as element, see the W3 specification.
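As a rough illustration of how the required interpret-as attribute and the optional format and detail attributes fit together, here is a small Python sketch. The helper say_as is hypothetical, and the attribute values used in the calls ("cardinal", "date", "yyyymmdd") are assumptions for illustration; check which values your voice actually supports.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(text, interpret_as, fmt=None, detail=None):
    """Wrap text in a <say-as> element; interpret_as is required."""
    if not interpret_as:
        raise ValueError("interpret-as is required")
    attrs = [f"interpret-as={quoteattr(interpret_as)}"]
    if fmt is not None:
        attrs.append(f"format={quoteattr(fmt)}")
    if detail is not None:
        attrs.append(f"detail={quoteattr(detail)}")
    # escape() protects characters such as "&" and "<" in the spoken text.
    return f"<say-as {' '.join(attrs)}>{escape(text)}</say-as>"


print(say_as("12345", "cardinal"))
print(say_as("1960-09-10", "date", fmt="yyyymmdd", detail="1"))
```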

<audio>

Supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output.

Attributes

Attribute Required Default Values
src yes n/a A URI referring to the audio media source. Supported protocol is https.
clipBegin no 0 A TimeDesignation that is the offset from the audio source's beginning to start playback from. If this value is greater than or equal to the audio source's actual duration, then no audio is inserted.
clipEnd no infinity A TimeDesignation that is the offset from the audio source's beginning to end playback at. If the audio source's actual duration is less than this value, then playback ends at that time. If clipBegin is greater than or equal to clipEnd, then no audio is inserted.
speed no 100% The ratio of the output playback rate to the normal input rate, expressed as a percentage. The format is a positive Real Number followed by %. The currently supported range is 50% (half speed) to 200% (double speed). Values outside that range may or may not be adjusted to fall within it.
repeatCount no 1 (or 10 if repeatDur is set) A Real Number specifying how many times to insert the audio (after clipping, if any, by clipBegin and/or clipEnd). Fractional repetitions aren't supported, so the value is rounded to the nearest integer. Zero is not a valid value, so it is treated as unspecified and the default value applies.
repeatDur no infinity A TimeDesignation that is a limit on the duration of the inserted audio after the source is processed for the clipBegin, clipEnd, repeatCount, and speed attributes (rather than the normal playback duration). If the duration of the processed audio is less than this value, then playback ends at that time.
soundLevel no +0dB Adjusts the sound level of the audio by soundLevel decibels. The maximum range is +/-40dB, but the effective range may be smaller, and output quality may not be good over the entire range.
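The attributes above interact, so a small calculation can make the rules easier to follow. The Python sketch below estimates how many seconds of audio would be inserted for a given source length. It is one reading of the table, not a description of the actual implementation: the function name is hypothetical, out-of-range speeds are assumed to be clamped, and "nearest integer" is assumed to round halves up with a minimum of one repetition.

```python
import math


def effective_audio_seconds(source_dur, clip_begin=0.0, clip_end=math.inf,
                            speed_pct=100.0, repeat_count=None,
                            repeat_dur=math.inf):
    """Estimate the inserted audio length; all times are in seconds."""
    # No audio if clipping starts at or after the source end or the clip end.
    if clip_begin >= source_dur or clip_begin >= clip_end:
        return 0.0
    clipped = min(clip_end, source_dur) - clip_begin
    # Assumption: speeds outside [50%, 200%] are clamped into that range.
    speed = min(max(speed_pct, 50.0), 200.0)
    # A missing or zero repeatCount uses the default: 1, or 10 with repeatDur.
    if not repeat_count:
        repeat_count = 10 if math.isfinite(repeat_dur) else 1
    # Assumption: halves round up, and at least one repetition is played.
    repeats = max(1, math.floor(repeat_count + 0.5))
    total = clipped / (speed / 100.0) * repeats
    # repeatDur caps the total; shorter audio simply ends early.
    return min(total, repeat_dur)


# A 10 s file, clipped to 2 s..6 s, played at double speed, three times.
print(effective_audio_seconds(10.0, clip_begin=2.0, clip_end=6.0,
                              speed_pct=200.0, repeat_count=3))
```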

The following are the currently supported settings for audio:

The contents of the <audio> element are optional and are used if the audio file cannot be played or if the output device does not support audio. The contents may include a <desc> element, in which case the text contents of that element are used for display. For more information, see the Recorded Audio section in the Responses Checklist.

The src URL must also be an https URL (Google Cloud Storage can host your audio files on an https URL).

To learn more about media responses, see the media response section in the Responses guide.

To learn more about the audio element, see the W3 specification.

Example

<speak>
  <audio src="https://actions.google.com/.../cat_purr_close.ogg">
    <desc>a cat purring</desc>
    PURR (sound didn't load)
  </audio>
</speak>

<p>, <s>

Sentence and paragraph elements.

To learn more about the p and s elements, see the W3 specification.

Example

<p><s>This is sentence one.</s><s>This is sentence two.</s></p>
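If sentences are already available as separate strings, the markup above can be produced with a short helper. This is a minimal Python sketch; the function name paragraph is hypothetical.

```python
from xml.sax.saxutils import escape


def paragraph(sentences):
    """Wrap each sentence in <s> and the whole group in <p>."""
    body = "".join(f"<s>{escape(s)}</s>" for s in sentences)
    return f"<p>{body}</p>"


print(paragraph(["This is sentence one.", "This is sentence two."]))
```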

Best practices

<sub>

Indicates that the text in the alias attribute value replaces the contained text for pronunciation.

You can also use the sub element to provide a simplified pronunciation of a difficult-to-read word. The last example below demonstrates this use case in Japanese.

To learn more about the sub element, see the W3 specification.

Examples

<sub alias="World Wide Web Consortium">W3C</sub>
<sub alias="にっぽんばし">日本橋</sub>

<prosody>

Used to customize the pitch, speaking rate, and volume of text contained by the element. Currently the rate, pitch, and volume attributes are supported.

The rate and volume attributes can be set according to the W3 specifications. There are three options for setting the value of the pitch attribute:

Option Description
Relative Specify a relative value (e.g. "low", "medium", "high", etc.) where "medium" is the default pitch.
Semitones Increase or decrease pitch by "N" semitones using "+Nst" or "-Nst" respectively. Note that "+/-" and "st" are required.
Percentage Increase or decrease pitch by "N" percent by using "+N%" or "-N%" respectively. Note that "%" is required but "+/-" is optional.
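The three pitch formats in the table can be told apart with a couple of regular expressions. The following Python sketch is illustrative only: classify_pitch is a hypothetical helper, allowing a decimal "N" is an assumption, and the relative names beyond "low", "medium", and "high" are taken from the W3 specification rather than from this page.

```python
import re

# "x-low", "x-high", and "default" are assumed from the W3 specification.
RELATIVE_PITCHES = {"x-low", "low", "medium", "high", "x-high", "default"}
SEMITONES = re.compile(r"[+-]\d+(\.\d+)?st")   # sign and "st" both required
PERCENTAGE = re.compile(r"[+-]?\d+(\.\d+)?%")  # "%" required, sign optional


def classify_pitch(value):
    """Return "relative", "semitones", "percentage", or None if invalid."""
    if value in RELATIVE_PITCHES:
        return "relative"
    if SEMITONES.fullmatch(value):
        return "semitones"
    if PERCENTAGE.fullmatch(value):
        return "percentage"
    return None


for candidate in ["low", "-2st", "2st", "+10%", "10%", "10"]:
    print(candidate, classify_pitch(candidate))
```

Note that "2st" is rejected because the sign is required for semitones, while "10%" is accepted because the sign is optional for percentages.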

To learn more about the prosody element, see the W3 specification.

Example

The following example uses the <prosody> element to speak slowly at 2 semitones lower than normal:

<prosody rate="slow" pitch="-2st">Can you hear me now?</prosody>

<emphasis>

Used to add or remove emphasis from text contained by the element. The <emphasis> element modifies speech similarly to <prosody>, but without the need to set individual speech attributes.

This element supports an optional "level" attribute with the following valid values: "strong", "moderate", "none", and "reduced".

To learn more about the emphasis element, see the W3 specification.

Example

The following example uses the <emphasis> element to make an announcement:

<emphasis level="moderate">This is an important announcement</emphasis>

<par>

A parallel media container that allows you to play multiple media elements at once. The only allowed content is a set of one or more <par>, <seq>, and <media> elements. The order of the <media> elements is not significant.

Unless a child element specifies a different begin time, the implicit begin time for the element is the same as that of the <par> container. If a child element has an offset value set for its begin or end attribute, the element's offset is relative to the beginning time of the <par> container. For the root <par> element, the begin attribute is ignored and the beginning time is when the SSML speech synthesis process starts generating output for the root <par> element (i.e. effectively time "zero").

Example

<speak>
  <par>
    <media xml:id="question" begin="0.5s">
      <speak>Who invented the Internet?</speak>
    </media>
    <media xml:id="answer" begin="question.end+2.0s">
      <speak>The Internet was invented by cats.</speak>
    </media>
    <media begin="answer.end-0.2s" soundLevel="-6dB">
      <audio
        src="https://actions.google.com/.../cartoon_boing.ogg"/>
    </media>
    <media repeatCount="3" soundLevel="+2.28dB"
      fadeInDur="2s" fadeOutDur="0.2s">
      <audio
        src="https://actions.google.com/.../cat_purr_close.ogg"/>
    </media>
  </par>
</speak>
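To see how the begin values in a <par> container resolve to a timeline, consider the following Python sketch. It is a simplification: resolve_par is a hypothetical helper, the durations (2000 ms, 3000 ms, 1000 ms, 9000 ms) are made up because real durations depend on the synthesized speech and audio files, and a syncbase may only refer to an item listed earlier.

```python
def resolve_par(items):
    """Resolve start and end times (in ms) for children of a <par>.

    items is a list of (item_id, begin, duration_ms). begin is either an
    offset in ms from the container start, or a tuple
    (syncbase_id, "begin" or "end", offset_ms).
    """
    times = {}
    for item_id, begin, duration in items:
        if isinstance(begin, tuple):
            base_id, edge, offset = begin
            base_start, base_end = times[base_id]
            start = (base_start if edge == "begin" else base_end) + offset
        else:
            start = begin
        times[item_id] = (start, start + duration)
    return times


timeline = resolve_par([
    ("question", 500, 2000),                    # begin="0.5s"
    ("answer", ("question", "end", 2000), 3000),  # begin="question.end+2.0s"
    ("boing", ("answer", "end", -200), 1000),     # begin="answer.end-0.2s"
    ("purr", 0, 9000),                          # no begin: container start
])
print(timeline)
```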

<seq>

A sequential media container that allows you to play media elements one after another. The only allowed content is a set of one or more <seq>, <par>, and <media> elements. The order of the media elements is the order in which they are rendered.

The begin and end attributes of child elements can be set to offset values (see Time Specification below). Those child elements' offset values will be relative to the end of the previous element in the sequence or, in the case of the first element in the sequence, relative to the beginning of its <seq> container.

Example

<speak>
  <seq>
    <media begin="0.5s">
      <speak>Who invented the Internet?</speak>
    </media>
    <media begin="2.0s">
      <speak>The Internet was invented by cats.</speak>
    </media>
    <media soundLevel="-6dB">
      <audio
        src="https://actions.google.com/.../cartoon_boing.ogg"/>
    </media>
    <media repeatCount="3" soundLevel="+2.28dB"
      fadeInDur="2s" fadeOutDur="0.2s">
      <audio
        src="https://actions.google.com/.../cat_purr_close.ogg"/>
    </media>
  </seq>
</speak>
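In a <seq> container each offset is measured from the end of the previous child, which the following Python sketch makes explicit. The helper resolve_seq is hypothetical, and the durations are made-up values, since real ones depend on the synthesized speech and audio files.

```python
def resolve_seq(children):
    """Resolve (start_ms, end_ms) pairs for children of a <seq>.

    children is a list of (begin_offset_ms, duration_ms) in document order.
    Times are relative to the start of the <seq> container.
    """
    timeline = []
    previous_end = 0  # the first child is relative to the container start
    for begin_offset, duration in children:
        start = previous_end + begin_offset
        end = start + duration
        timeline.append((start, end))
        previous_end = end
    return timeline


# begin="0.5s", begin="2.0s", then a child with no begin attribute.
print(resolve_seq([(500, 2000), (2000, 3000), (0, 1000)]))
```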

<media>

Represents a media layer within a <par> or <seq> element. The allowed content of a <media> element is an SSML <speak> or <audio> element. The following table describes the valid attributes for a <media> element.

Attributes

Attribute Required Default Values
xml:id no no value A unique XML identifier for this element. Encoded entities are not supported. The allowed identifier values match the regular expression "([-_#]|\p{L}|\p{D})+". See XML-ID for more information.
begin no 0 The beginning time for this media container. Ignored if this is the root media container element (treated the same as the default of "0"). See the Time specification section below for valid string values.
end no no value A specification for the ending time for this media container. See the Time specification section below for valid string values.
repeatCount no 1 A Real Number specifying how many times to insert the media. Fractional repetitions aren't supported, so the value will be rounded to the nearest integer. Zero is not a valid value and is therefore treated as being unspecified and has the default value in that case.
repeatDur no no value A TimeDesignation that is a limit on the duration of the inserted media. If the duration of the media is less than this value, then playback ends at that time.
soundLevel no +0dB Adjusts the sound level of the audio by soundLevel decibels. The maximum range is +/-40dB, but the effective range may be smaller, and output quality may not be good over the entire range.
fadeInDur no 0s A TimeDesignation over which the media will fade in from silent to the optionally-specified soundLevel. If the duration of the media is less than this value, the media will be mid-fade in at the end of playback.
fadeOutDur no 0s A TimeDesignation over which the media will fade out from the optionally-specified soundLevel until it is silent. If the duration of the media is less than this value, the media will be mid-fade out at the beginning of playback.
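The xml:id pattern in the table uses Unicode property classes that Python's built-in re module does not support, so a character-by-character check is one way to approximate it. In the sketch below, is_valid_media_id is a hypothetical helper, and reading \p{D} as "decimal digit" is an assumption.

```python
def is_valid_media_id(value):
    """Approximate the xml:id pattern: one or more of "-", "_", "#",
    Unicode letters, or decimal digits."""
    return bool(value) and all(
        # str.isalpha() covers the Unicode letter categories;
        # str.isdecimal() covers decimal digits (assumed meaning of \p{D}).
        ch in "-_#" or ch.isalpha() or ch.isdecimal()
        for ch in value
    )


for candidate in ["question", "foo_id", "answer-2", "bad id", ""]:
    print(repr(candidate), is_valid_media_id(candidate))
```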

Time specification

A time specification, used for the value of begin and end attributes of <media> elements and media containers (<par> and <seq> elements), is either an offset value (for example, +2.5s) or a syncbase value (for example, foo_id.end-250ms).
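The two forms can be distinguished mechanically. The Python sketch below parses a time specification into a small dictionary. It is illustrative only: parse_time_spec is a hypothetical helper, and the exact grammar (optional sign, a number, a unit of "s" or "ms", and syncbase edges named "begin" or "end") is an assumption inferred from the examples on this page.

```python
import re

# Assumed offset grammar: optional sign, a number, then "s" or "ms".
OFFSET = re.compile(r"([+-]?)(\d+(?:\.\d+)?)(ms|s)")
# Assumed syncbase grammar: <id>.begin or <id>.end, then an optional offset.
SYNCBASE = re.compile(r"(.+)\.(begin|end)([+-]\d+(?:\.\d+)?(?:ms|s))?")


def to_ms(sign, number, unit):
    """Convert the captured parts of an offset to signed milliseconds."""
    value = float(number) * (1000 if unit == "s" else 1)
    return -value if sign == "-" else value


def parse_time_spec(spec):
    """Parse an offset or syncbase time specification."""
    match = OFFSET.fullmatch(spec)
    if match:
        return {"kind": "offset", "offset_ms": to_ms(*match.groups())}
    match = SYNCBASE.fullmatch(spec)
    if match:
        base_id, edge, offset = match.groups()
        offset_ms = to_ms(*OFFSET.fullmatch(offset).groups()) if offset else 0.0
        return {"kind": "syncbase", "id": base_id, "edge": edge,
                "offset_ms": offset_ms}
    raise ValueError(f"not a valid time specification: {spec!r}")


print(parse_time_spec("+2.5s"))
print(parse_time_spec("foo_id.end-250ms"))
```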