Visualising acoustic cues

3D Gaussians with plotly

In one of my research projects, I analysed the outcomes of two accent adaptation experiments, one conducted in English and one in Swedish. The two studies employed the same paradigm yet produced contradictory results: one showed a positive effect of exposure, the other a null effect. After ruling out Type I and Type II errors and low statistical power, we constructed ideal observers to determine whether both results would have been predicted by models of speech perception that posit distributional learning when listeners adapt to unfamiliar talkers.

Such models presuppose that:

  • the listener perceives the segments of speech that make up a word (e.g. bed is made up of three categories of sounds: b-e-d).
  • the listener is sensitive to the acoustic correlates that signal the presence of the sound and distinguish it from other sounds.
  • more than one acoustic cue measurement can be mapped to the same sound category.
  • the researcher knows which cues matter the most for identifying a given sound category.

In distributional learning, the listener tracks the regularities of acoustic cues and their co-occurrence with the sound category of interest. Many acoustic cue measurements can map to the same category, thus forming a distribution.
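
Under Gaussian assumptions, "tracking the regularities" amounts to estimating a mean vector and a covariance matrix per category from the observed cue measurements. Here is a minimal sketch with made-up durations; the values are illustrative, not measurements from the studies:

```python
import numpy as np

# Hypothetical cue measurements (ms) for word-final /d/ tokens:
# columns are vowel duration, closure duration, burst duration.
d_tokens = np.array([
    [172.0, 55.0, 18.0],
    [165.0, 60.0, 22.0],
    [181.0, 48.0, 15.0],
    [158.0, 62.0, 25.0],
])

# Distributional learning under Gaussian assumptions reduces to
# estimating a mean vector and covariance matrix per category.
mu_d = d_tokens.mean(axis=0)              # category mean, shape (3,)
sigma_d = np.cov(d_tokens, rowvar=False)  # category covariance, shape (3, 3)
```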

One could loosely think of “bed” and “bet” as being differentiated by only the final sound (d/t). The acoustic cues that map to the categories “d” and “t” at the end of a word have been identified as three temporal measures: the duration of the preceding vowel sound (“e”), the duration of closure (to produce a d or t at the end of a word, the talker has to momentarily stop the airflow through the vocal tract), and the duration of the burst of air that follows the release of the closure.
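
With each token represented as a three-dimensional cue vector (vowel, closure, burst durations), one way to sketch the kind of ideal observer mentioned earlier is as a Bayes classifier over the two category Gaussians. The parameters and function below are hypothetical placeholders, not the models from the project:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical category parameters (ms); in practice these would be
# estimated from production data, as in the previous sketch.
mu_d, sigma_d = np.array([170.0, 55.0, 20.0]), np.diag([120.0, 40.0, 15.0])
mu_t, sigma_t = np.array([120.0, 80.0, 45.0]), np.diag([100.0, 50.0, 30.0])

def p_d_given_cues(x, prior_d=0.5):
    """Posterior probability of /d/ given a cue vector
    [vowel_dur, closure_dur, burst_dur]."""
    like_d = multivariate_normal.pdf(x, mean=mu_d, cov=sigma_d)
    like_t = multivariate_normal.pdf(x, mean=mu_t, cov=sigma_t)
    return prior_d * like_d / (prior_d * like_d + (1 - prior_d) * like_t)

# A token with a long vowel and a short burst should lean towards /d/.
print(p_d_given_cues(np.array([165.0, 60.0, 22.0])))
```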

Figure: Production of “d”-ending and “t”-ending words by native and non-native talkers of US English. Points show the durational measures of vowel, closure, and burst for each word; ellipses represent the parameterised cue distributions under Gaussian assumptions.

We could, in principle, map all the combinations of cue measurements that occur each time “d” is produced, and all the combinations that occur each time “t” is produced, and visualise the distributions that divide the two categories of sounds. We could then compare data from native talkers with that of non-native talkers and see whether and how they differ. With these visual models we could reasonably predict that the greater the similarity, the less there is for the listener to learn about the new talker, since the talker's cues would fall within an expected range.
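
This is where plotly comes in. The sketch below illustrates the general approach rather than the exact code behind the figures: simulate stand-in cue measurements for each category, scatter them in 3D, and render each fitted Gaussian as an ellipsoid surface at a fixed Mahalanobis distance:

```python
import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(1)

# Simulated stand-ins for the production data (ms):
# columns = vowel duration, closure duration, burst duration.
d_tokens = rng.multivariate_normal([170, 55, 20], np.diag([120, 40, 15]), 50)
t_tokens = rng.multivariate_normal([120, 80, 45], np.diag([100, 50, 30]), 50)

def gaussian_ellipsoid(x, scale=2.0, n=30):
    """Surface of the fitted Gaussian at `scale` standard deviations."""
    mu, cov = x.mean(axis=0), np.cov(x, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    theta, phi = np.meshgrid(np.linspace(0, np.pi, n),
                             np.linspace(0, 2 * np.pi, n))
    sphere = np.stack([np.sin(theta) * np.cos(phi),
                       np.sin(theta) * np.sin(phi),
                       np.cos(theta)], axis=-1)
    # Stretch the unit sphere along the eigenvectors, then recentre.
    pts = sphere * (scale * np.sqrt(vals)) @ vecs.T + mu
    return pts[..., 0], pts[..., 1], pts[..., 2]

fig = go.Figure()
for tokens, name, colours in [(d_tokens, "d", "Blues"),
                              (t_tokens, "t", "Reds")]:
    fig.add_trace(go.Scatter3d(
        x=tokens[:, 0], y=tokens[:, 1], z=tokens[:, 2],
        mode="markers", name=name, marker=dict(size=3)))
    ex, ey, ez = gaussian_ellipsoid(tokens)
    fig.add_trace(go.Surface(
        x=ex, y=ey, z=ez, opacity=0.3,
        colorscale=colours, showscale=False))
fig.update_layout(scene=dict(xaxis_title="vowel (ms)",
                             yaxis_title="closure (ms)",
                             zaxis_title="burst (ms)"))
fig.show()
```

A Surface trace over a parametrised sphere is one convenient way to get a smooth ellipsoid; the translucent opacity keeps the raw points visible inside the fitted distribution.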

In this project we compared the production of stop consonants (“d” and “t”) in word-final contexts (e.g. bat and bad) by native and non-native speakers of both English and Swedish.

Figure: Same as the previous plot, but for Swedish word-final “d” and “t”.