Scientific Reports | (2020) 10:19580 |
www.nature.com/scientificreports
The role of isochrony in speech
perception in noise
Vincent Aubanel* & Jean‑Luc Schwartz
The role of isochrony in speech—the hypothetical division of speech units into equal duration
intervals—has been the subject of a long‑standing debate. Current approaches in neurosciences have
brought new perspectives in that debate through the theoretical framework of predictive coding
and cortical oscillations. Here we assess the comparative roles of naturalness and isochrony in the
intelligibility of speech in noise for French and English, two languages representative of two well‑
established contrastive rhythm classes. We show that both top‑down predictions associated with the
natural timing of speech and to a lesser extent bottom‑up predictions associated with isochrony at
a syllabic timescale improve intelligibility. We found a similar pattern of results for both languages,
suggesting that temporal characterisation of speech from different rhythm classes could be unified
around a single core speech unit, with neurophysiologically defined duration and linguistically
anchored temporal location. Taken together, our results suggest that isochrony does not seem to
be a main dimension of speech processing, but may be a consequence of neurobiological processing
constraints, manifesting in behavioural performance and ultimately explaining why isochronous
stimuli occupy a particular status in speech and human perception in general.
A fundamental property of mammalian brain activity is its oscillatory nature, resulting in the alternation between
excitable and inhibited states of neuronal assemblies1. The crucial characteristic of heightened excitability is that
it provides, for sensory areas, increased sensitivity and shorter reaction times, ultimately leading to optimised
behaviour. This idea formed the basis of the Active Sensing theoretical framework2,3 and has found widespread
experimental support.
Oscillatory activity in relation to speech, a complex sensory signal, has initially been described as speech
entrainment, or tracking, a view which proposes that cortical activity can be matched more or less directly to
some characteristics of the speech signal such as the amplitude envelope4,5. The need to identify particular events
to be the support of speech tracking has in turn prompted the question of which units oscillatory activity would
entrain to. The syllable has usually been taken as the right candidate6,7, given the close match, under clear speech
conditions, between the timing of syllable boundaries and that of the amplitude envelope's larger variations. These
conditions are however far from being representative of how speech is usually experienced: connected speech is
notoriously characterised by the lack of acoustically salient syllable boundaries8.
In parallel, early works on speech rhythm, also inspired by evident similarities with the timing of music9, have
led scholars to focus on periodic aspects of speech. The isochrony hypothesis extended impressionistic descrip-
tions of speech sounding either morse-like or machine gun-like10,11, and led to the rhythmic class hypothesis12,
stating that languages fall into distinct rhythmic categories depending on which unit is used to form the isoch-
ronous stream. Two main classes emerged: stress-timed languages (e.g., English) based on isochronous feet and
syllable-timed languages (e.g., French) which assume equal-duration syllables. Still, the isochrony hypothesis and
the related rhythmic class hypothesis, in spite (or by virtue) of their simple formulation and intuitive account,
have been the source of a continuous debate (see review in13).
As current theories are formulated, as reviewed above, isochrony in speech would present some advan-
tages: speech units delivered at an ideal isochronous pace would be maximally predictable and lead to maximum
entrainment, through alleviating the need for potentially costly phase-reset mechanisms14. However, naturally
produced speech is rarely isochronous, if at all, and this departure from a hypothetical isochronous form, that
is, the variation of sub-rhythmic unit durations, is in fact used to encode essential information at all linguistic (pre-
lexical, lexical, prosodic, pragmatic, discursive) and para-linguistic levels. Two apparently contradictory hypoth-
eses are therefore at play here, the first one positing a beneficial role for isochrony in speech processing, and the
second one seeing natural speech timing as a gold standard, with any departure from it impairing recognition.
In this study we attempt to disentangle the role of the two temporal dimensions of isochrony and natural-
ity in speech perception. We report on two experiments conducted separately on spoken sentences in French
and English, each representative of the two rhythmic classes. For this aim, we exploited the Harvard corpus for
Content courtesy of Springer Nature, terms of use apply. Rights reserved
Vol:.(1234567890)
Scientic Reports | (2020) 10:19580 |
www.nature.com/scientificreports/
English15 and its recently developed French counterpart, the Fharvard corpus16. Both corpora contain sentences
composed of 5–7 keywords, recorded by one talker for each language. Sentences were annotated at two hierarchi-
cal rhythmic levels: the accent group level and the syllable level, respectively forming the basis of the two main
language rhythmic classes mentioned above. We retimed naturally produced sentences to an isochronous form
(or a matched anisochronous form, see hereafter) by locally compressing or elongating speech portions cor-
responding to rhythmic units at the two corresponding levels (accent group and syllable). Retiming was opera-
tionalised around P-centres, that is, the time at which listeners report the occurrence of the unit17,18, and which
provide crucial pivotal events at the meeting point between bottom-up acoustic saliency cues and top-down
information about the onset of linguistic units. Unmodied time onsets of either accent (acc) or syllable (syl)
rhythmic units served as a reference for the natural rhythm (NAT) condition, from which isochronous (ISO) and
anisochronous (ANI) conditions were defined. Altogether, this provided 5 temporal versions of each sentence in
each corpus: the unmodied natural version (NAT), the isochronous stimuli at the accent (ISO.acc) and syllable
(ISO.syl) levels, and the anisochronous stimuli at the accent (ANI.acc) and syllable (ANI.syl) levels. The ANI
conditions served as controls for the ISO conditions through the application of identical net temporal distortions
from the NAT sentences, though in a non-isochronous way (see "Methods").
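The retiming logic can be illustrated with the following sketch (in Python; the function names and the normalisation are our own, and the paper's exact procedure and its distortion measure, Eq. (1), are specified in its Methods). It maps natural P-centre times onto an evenly spaced grid spanning the same overall interval, then quantifies the net displacement:

```python
def isochronous_targets(pcentres):
    """Map natural P-centre times onto an evenly spaced grid that keeps the
    first and last events fixed, so overall sentence duration is preserved."""
    n = len(pcentres)
    step = (pcentres[-1] - pcentres[0]) / (n - 1)
    return [pcentres[0] + i * step for i in range(n)]

def temporal_distortion(original, retimed):
    """Illustrative distortion index: mean absolute displacement of unit
    P-centres, normalised by the sentence span (a hypothetical stand-in for
    the delta of Eq. (1), which is defined in the paper's Methods)."""
    span = original[-1] - original[0]
    return sum(abs(o - r) for o, r in zip(original, retimed)) / (len(original) * span)

# Natural syllable P-centres (seconds) of a toy four-unit sentence:
nat = [0.0, 0.2, 0.5, 0.9]
iso = isochronous_targets(nat)       # evenly spaced: 0.0, 0.3, 0.6, 0.9
delta = temporal_distortion(nat, iso)
```

An anisochronous control that redistributes the same per-unit displacements irregularly would, by construction, yield the same distortion value while destroying regularity.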
We then evaluated the consequences of these modifications of naturalness towards isochrony on the ability of
listeners to process and understand the corresponding speech items. Sentence stimuli were mixed with stationary
speech-shaped noise to shift comprehension to below-ceiling levels. Then, for both languages separately, the set
of the five types of sentences in noise was presented to native listeners, and the proportion of recognised key-
words was taken as the index of the intelligibility of the corresponding sentence in the corresponding condition.
We show that naturalness is the main ingredient of intelligibility, while isochrony at the syllable level—but not
at the accent group level, whatever the rhythmic class of the considered language—plays an additional though
quantitatively smaller beneficial role. This provides for the first time an integrated coherent framework combin-
ing predictive cues related to bottom-up isochrony and top-down naturalness, describing speech intelligibility
properties independently of the language rhythmic class.
Results
Natural timing leads to greater intelligibility than either isochronously or anisochronously
retimed speech in both languages. We first report the effect of temporal distortion on intelligibility,
separately by retiming condition, for the two languages. Figure 1 shows intelligibility results as the proportion of
keywords correctly recognised by French and English listeners (top panel) and the temporal distortion applied to
sentences in each condition (bottom panel, see "Methods", Eq. (1) for computation details). Net temporal distor-
tion from natural speech at the condition level appears to be reflected in listeners' performance, with increased
temporal distortion associated with decreased intelligibility in both languages.
Extending the analysis done for the English data and reported in19, we fitted a generalised linear mixed-effect
model to the French data. Table 1 gathers the results of simultaneous generalised hypotheses on the condition
effects formulated separately for each language.
As verified in Fig. 1 and in the first 4 rows of Table 1, intelligibility of unmodified naturally timed sentences
for French was significantly higher than that of sentences in any temporally-modified condition. This result replicates
what was obtained for English, and confirms that any temporal distortion leads to degraded intelligibility. In
contrast to English however, where accent-isochronously retimed sentences were significantly more intelligible
than accent-anisochronously retimed ones, no such effect is observed for French (Table 1, row 5). Similarly, the
tendency for an isochronous versus anisochronous intelligibility difference at the syllable level observed in English
is absent in French (Table 1, row 6). Indeed, an overall benefit of isochronous over anisochronous transformation
is observed for English but not for French, when combining the two rhythmic levels (Table 1, row 7).
As shown by the last row of Table 1, syllable-level distortion led to a greater intelligibility decrease than accent-
level distortion in both French and English. This relates to the greater amount of distortion applied by the syllable-
level than by the accent-level modifications of the sentences (see Fig. 1, bottom panel).
In sum, while temporal distortion appears to be the main predictor of intelligibility for both languages, the
independent role of isochrony seems to differ between the two languages. We present in the next section evidence
for a common pattern underlying these surface dierences.
Syllable‑level isochrony plays a secondary role in both languages, even in naturally timed sen‑
tences. We defined several rhythm metrics to quantify, for each sentence in any temporal condition, the
departure of either accent group or syllable timing from two canonical rhythm types: natural or isochronous
rhythm. For the two hierarchical rhythmic levels considered here, this amounts to 4 metrics altogether: depar-
ture from naturally timed accent groups or syllables (respectively dnat.acc and dnat.syl), and departure from
isochronous accent groups or syllables (respectively diso.acc and diso.syl, see "Methods" and Table 5).
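Concretely, departure metrics of this kind can be realised by comparing the inter-onset intervals of a sentence against a reference timing, natural or isochronous. The sketch below is a hypothetical illustration (the paper's actual dnat and diso definitions are given in its Methods and Table 5):

```python
def intervals(onsets):
    """Inter-onset intervals of a sequence of unit onset times."""
    return [b - a for a, b in zip(onsets, onsets[1:])]

def departure_from(observed, reference):
    """Hypothetical departure metric: mean absolute difference between
    corresponding inter-onset intervals, normalised by the mean reference
    interval. With an evenly spaced reference this plays the role of diso;
    with the natural onsets as reference, the role of dnat."""
    obs, ref = intervals(observed), intervals(reference)
    mean_ref = sum(ref) / len(ref)
    return sum(abs(o - r) for o, r in zip(obs, ref)) / (len(ref) * mean_ref)

# Syllable onsets of a toy sentence vs. an evenly spaced (isochronous) grid:
nat = [0.0, 0.2, 0.5, 0.9]
iso = [0.0, 0.3, 0.6, 0.9]
diso_like = departure_from(nat, iso)  # > 0: natural timing is not isochronous
dnat_like = departure_from(nat, nat)  # 0 by construction for NAT sentences
```

By design, a NAT sentence has zero departure from naturality but a non-zero departure from isochrony, which is why the analyses below keep only the metrics that vary within each condition group.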
Figure 2 shows intelligibility scores as a function of the temporal distortion applied to the sentences along
the 4 metrics, for all 5 experimental conditions, for English and French.
We analysed the joint role of isochrony and naturality in the different temporal conditions using logistic
regression modelling (see "Methods"). We conducted three separate analyses by grouping natural, isochronous
and anisochronous sentences respectively. This was done to avoid including subsets of data where a given metric
would yield a zero value by design (see Fig. 2). The three analyses are presented in the next subsections, each
corresponding to a highlighted region in Fig. 2.
Departure from isochrony in naturally timed sentences (Fig. 2 region A). Naturally-timed sen-
tences have a null departure from naturality by design, but their departure from an isochronous form at both
Figure 1. Top panel: proportion of words correctly recognised in each experimental condition (NAT, ISO.acc, ANI.acc, ISO.syl, ANI.syl) for French and English. Error bars show 95% confidence intervals over 26 and 27 subjects in the two languages respectively. Bottom panel: average sentence temporal distortion (δ function computed on speech units matching the temporal condition, see Eq. (1)). By construction, temporal distortion is null for natural sentences (NAT condition) and identical for isochronously and anisochronously retimed sentences at a given rhythmic level, that is, in ISO.acc and ANI.acc conditions on one hand, and in ISO.syl and ANI.syl conditions on the other hand. Error bars show 95% confidence intervals over 180 sentences for French and English. Data for English was previously reported in19.
Table 1. Simultaneous generalised hypotheses tests for the effect of condition on intelligibility, formulated
on two independent models for French and English respectively. From left to right: comparison tested and,
for each language, comparison estimate and associated z and p values, with classical visual significativity
indication. Data for English has been previously reported in19.

Row  Comparison         French                          English
                        Est.     z        p             Est.     z       p
1    ISO.acc, NAT       −0.509   −10.94   <0.001 ***    −0.545   −12.16  <0.001 ***
2    ANI.acc, NAT       −0.420   −9.09    <0.001 ***    −0.722   −16.06  <0.001 ***
3    ISO.syl, NAT       −0.862   −18.45   <0.001 ***    −1.017   −22.40  <0.001 ***
4    ANI.syl, NAT       −0.820   −17.59   <0.001 ***    −1.127   −24.68  <0.001 ***
5    ISO.acc, ANI.acc   −0.088   −1.93    0.273         0.177    4.00    <0.001 ***
6    ISO.syl, ANI.syl   −0.042   −0.91    0.878         0.110    2.43    0.092 .
7    ISO, ANI           −0.130   −2.01    0.233         0.287    4.54    <0.001 ***
8    syl, acc           −0.753   −11.59   <0.001 ***    −0.878   −13.80  <0.001 ***
the accent and the syllable level can be evaluated by the diso.acc and diso.syl metrics respectively. We therefore
included in the analysis the diso.acc and diso.syl metrics, and discarded the dnat.acc and dnat.syl metrics (see
Fig. 2 region A). Starting from the initial logistic regression model predicting intelligibility with the full interac-
tion of language (French and English), diso.acc and diso.syl as fixed effects, we found that the simplest equivalent
model was a model with only the diso.acc and diso.syl factors, without interaction (see Table 2 and "Methods").
The resulting model shows that for natural sentences, intelligibility is positively correlated with departure from
accent isochrony (i.e., increased accent group irregularity is associated with better intelligibility) and negatively
correlated with departure from syllable isochrony (i.e., the more isochronous naturally timed syllables are, the
better the sentence is recognised). Importantly, this result does not depend on the language considered, with
both French and English showing the same pattern of results. Fixed-effect sizes are markedly small, as the major-
ity of the variance is explained by random effects, as expected with the material used here. But the fixed effects are
nevertheless real and quantitatively important, as seen in Fig. 2.
Departure from natural timing in isochronously retimed sentences (Fig. 2 region B). Next we
assessed to what extent intelligibility in isochronous conditions can be predicted from the departure from natu-
ral rhythm, at the accent and syllable levels (see Fig. 2 region B). From an initial fully interactional model with
language, dnat.acc and dnat.syl as predictors, the simplest equivalent model consisted of only the dnat.syl factor
(see Table 3).
This indicates that in conditions where sentences are isochronously transformed, intelligibility is significantly
negatively correlated with departure from natural syllabic rhythmicity. Crucially, departure from accent group
natural rhythm does not play a role, and the results are identical for both languages considered.
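The model-selection logic used throughout these analyses (a full interaction model pruned to the simplest statistically equivalent one) rests on likelihood-ratio tests. The arithmetic can be sketched as follows; the closed-form chi-square tail shown holds for even degrees of freedom, which covers the m3-vs-m4 comparison of Table 3 (χ² = 6.12, Df = 6):

```python
from math import exp

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square distribution, using
    the closed form valid for even degrees of freedom:
    exp(-x/2) * sum_{i < df/2} (x/2)^i / i!"""
    assert df % 2 == 0, "closed form shown here requires even df"
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= half / (i + 1)
    return exp(-half) * total

def likelihood_ratio_test(loglik_full, loglik_reduced, df_diff):
    """Compare a full model against a nested, simpler model: twice the
    log-likelihood difference is asymptotically chi-square distributed,
    with df equal to the number of parameters dropped."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df_diff)

# Tail probability matching the Table 3 comparison:
p = chi2_sf(6.12, 6)   # approximately 0.41: the simpler model is equivalent
```

A large p value here means the extra parameters of the full model do not buy a significantly better fit, so the reduced model is retained.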
Figure 2. Intelligibility as a function of temporal distortion, as measured by the four metrics (rows) defined
in Table 5. Data are grouped according to experimental condition (colors), the type of modification of the
experimental condition (column groupings) and language (columns). Three subsets of data are highlighted
for subsequent analysis (see text): (A) departure from isochrony of naturally timed sentences; (B) departure
from natural rhythm of isochronously retimed sentences; (C) departure from natural rhythm and isochrony
of anisochronously retimed sentences. Regression lines show linear modelling of the data points, disregarding
subject and sentence random variation.
Departure from isochrony and natural timing in anisochronously retimed sentences (Fig. 2
region C). In this last step we evaluated whether intelligibility of anisochronously retimed sentences could be
predicted by a combination of the four rhythmic distortion metrics (diso.acc, diso.syl, dnat.acc and dnat.syl, see
Fig. 2 region C). Indeed, anisochronously retimed speech departs from both natural and isochronous canonical
forms of timing, and in particular all four metrics take non-zero values for these sentences. From an initial
fully interactional model crossing language with the four rhythm metrics, we found that the simplest equivalent
model consisted of the additive model of factors dnat.syl and diso.syl (see Table 4).
These results refine and extend what was obtained in the previous two analyses. First, the rhythmic unit of
the accent group (whether defined at the stressed syllable level in English or at the accentual phrase level in French)
does not provide any explanatory power in predicting intelligibility in anisochronously timed speech. Second, the
role of natural syllable timing is confirmed and is the strongest predictor of intelligibility in that model, as shown
by its z-value (Table 4B) and its effect size (Table 4C). Third, a role for the departure from isochronously-timed
syllables is detected. This means that in these conditions where the timing of speech is most unpredictable, there
is a tendency for isochronous syllables to be associated with increased intelligibility. We note however that there is
a necessary correlation between the dnat and diso metrics for anisochronous sentences, exemplified by the fact that a
close-to-isochronous natural sentence has to be distorted by a low diso value to be rendered isochronous, and that
its anisochronous counterpart, being distorted by the same quantity by design, will be close to both the natural
and the isochronous version. A quantitative analysis confirmed this (Pearson's product-moment correlation
Table 2. (A) Initial (m1) and equivalent simpler (m2) model for the role of departure from isochrony in
naturally timed sentences. The formulae of the fixed effects are given for the two models, together with the
result of a likelihood-ratio test between the two models. (B) m2 model coefficients, with associated p values.
(C) Fixed-effect sizes with lower and upper confidence levels.

(A) Model selection
Model  Fixed effects                    AIC      LR test (m1, m2): χ²   Df   p(>χ²)
m1     language ∗ diso.acc ∗ diso.syl   6728.4   3.36                   5    0.65
m2     diso.acc + diso.syl              6721.7

(B) Equivalent model (m2) coefficients
             Estimate   SE       z value   Pr(>|z|)
(Intercept)  0.8633     0.3230   2.673     0.00752 **
diso.acc     1.0651     0.4633   2.299     0.02150 *
diso.syl     −1.5151    0.5757   −2.632    0.00850 **

(C) Fixed-effects size
Effect     R²      Lower CL   Upper CL
m2         0.028   0.014      0.048
diso.syl   0.018   0.006      0.034
diso.acc   0.013   0.004      0.028
Table 3. (A) Initial (m3) and equivalent simpler (m4) models for the role of departure from natural rhythm in
isochronously retimed sentences. The formulae of the fixed effects are given for the two models, together with
the result of a likelihood-ratio test between the two models. (B) m4 model coefficients, with associated p values.
(C) Fixed-effect sizes with lower and upper confidence levels.

(A) Model selection
Model  Fixed effects                    AIC      LR test (m3, m4): χ²   Df   p(>χ²)
m3     language ∗ dnat.acc ∗ dnat.syl   13,507   6.12                   6    0.41
m4     dnat.syl                         13,502

(B) Equivalent model (m4) coefficients
             Estimate   SE       z value   Pr(>|z|)
(Intercept)  0.7382     0.1424   5.185     2.16e−07 ***
dnat.syl     −2.6231    0.2749   −9.541    <2e−16 ***

(C) Fixed-effects size
Effect     R²      Lower CL   Upper CL
m4         0.045   0.033      0.058
dnat.syl   0.045   0.033      0.058
between diso.syl and dnat.syl: French: r = 0.72, p < 0.01; English: r = 0.56, p < 0.01). The specific contribution
of syllabic isochrony to intelligibility therefore appears to be small at best for anisochronous sentences. Finally,
as for the above analyses, this pattern of results applies equally to both French and English.
Discussion
In the current study, we set out to characterise the possible role of isochrony in speech perception in noise. Isochrony
is contrasted with naturalness, where the former refers to an ideally perfectly regular timing of speech units, while
the latter refers to the timing of speech units as they occur in naturally produced speech. We included a third set of
anisochronous conditions, in which the timing of speech events bears the same degree of temporal distortion from
naturally timed speech as isochronous speech, while being irregular. We tested temporally modified sentences
at the accent and the syllable levels, two hierarchically nested linguistic levels in English and French. These two
languages are traditionally described as being representative of two distinct rhythmic classes, based on a hypo-
thetical underlying isochrony of accent versus syllable units respectively in natural speech production12,13,20.
A first important result of this study is the replication from English to French that isochronous forms of
speech are always less intelligible than naturally timed speech. In fact, in a paradigm such as the current one, where
the internal rhythmic structure of speech is changed but the sentence duration remains the same, any temporal
distortion to the naturally produced timings of speech units appears to be detrimental, with the amount of
temporal distortion being a strong predictor of intelligibility decrease. This result goes against the hypothesis
that isochronous speech, by virtue of a supposedly ideally regular timing of its units, would be easier to track,
by alleviating the need for constant phase-resetting. For traditional linguistic accounts too, these results further
debunk the isochrony hypothesis, which assumes that produced speech is based on an underlying ideally
isochronous form12 (see review in13,20). In fact, our results suggest that the natural timing statistics are actively
used by listeners in decoding words, even though each sentence is encountered only once.
At the condition level, while an advantage of isochronous over anisochronous forms of speech was found for
English19, no such trend was observed for French. While possible prosodic or idiosyncratic effects might account
for this difference (see Supplementary Materials), this led us to examine, separately by condition, the relationship
between intelligibility and the four timing metrics, namely departure from the natural timing of speech units and
departure from a hypothetical underlying isochronous form of those speech units. This analysis unveiled a rather
consistent portrait across the two talkers of the two languages, according to the type of sentence retiming. For
naturally-timed sentences, intelligibility correlates with the degree of departure from an ideal isochronous form
of the sentence at the syllable level. That is, naturally isochronous sentences at the syllable level are significantly
better recognised than naturally anisochronous sentences. For isochronously retimed sentences, intelligibility
was strongly correlated with departure from naturality (see also Fig. 1, bottom). Importantly, this result is valid
only for the syllable level—departure from natural timing of accent groups did not explain intelligibility vari-
ation. For anisochronous sentences, the two syllabic rhythm metrics, that is, departure from naturally timed
syllables and departure from syllabic isochrony, were found to be correlated with intelligibility, though the latter
to a much lesser extent. This shows that both temporal dimensions of syllabic rhythm are actively relied upon
in speech comprehension. In addition, the simultaneous variation of the two temporal dimensions for this type
of stimuli enabled us to estimate the relative size of these effects, since the role of departure from
naturally timed syllables was about ten times stronger than the role of departure from isochronously timed syl-
lables (Table 4C)—a ratio however possibly underestimated owing to the necessary correlation that exists between
Table 4. (A) Initial (m5) and equivalent simpler (m6) model for the role of departure from isochrony and
natural rhythm in anisochronously retimed sentences. The formulae of the fixed effects are given for the two
models, together with the result of a likelihood-ratio test between the two models. (B) m6 model coefficients,
with associated p values. (C) Fixed-effect sizes with lower and upper confidence levels.

(A) Model selection
Model  Fixed effects                                          AIC      LR test (m5, m6): χ²   Df   p(>χ²)
m5     language ∗ dnat.acc ∗ dnat.syl ∗ diso.acc ∗ diso.syl   13,610   39.35                  29   0.095
m6     dnat.syl + diso.syl                                    13,591

(B) Equivalent model (m6) coefficients
             Estimate   SE       z value   Pr(>|z|)
(Intercept)  0.9614     0.1669   5.760     8.38e−09 ***
dnat.syl     −2.9418    0.3852   −7.637    2.22e−14 ***
diso.syl     −0.5311    0.2742   −1.937    0.0527 .

(C) Fixed-effects size
Effect     R²      Lower CL   Upper CL
m6         0.059   0.046      0.073
dnat.syl   0.030   0.021      0.041
diso.syl   0.003   0.000      0.007
the two metrics, by design, for these sentences. Again, the regularity of accent group timing did not provide any
significant increase in intelligibility in any condition—contrary to natural sentences, as we will discuss later.
Crucially, none of the three analyses found language to be a contributing factor in explaining intelligibility, with the
same pattern of result applying to speech produced by both English and French talkers. Of course, as we tested
only one talker in each language, there is the possibility that an observed difference (or lack thereof) between
languages might be imputable to confounded talker characteristics (see Supplementary Material for an analysis
of talker idiosyncrasies on a range of acoustic parameters). With that caveat in mind, given that those languages
were selected as being representative of different rhythmic classes, the non-discriminative nature of the results
may have two implications. First, it suggests that isochrony effects could apply across the board to any language,
independently of their rhythmic patterning at a global level—though, of course, this will have to be tested on
other languages and multiple speakers in future studies. Second, our results suggest that the syllable is a core
unit in temporal speech processing. While a primary role for the syllable in speech processing has long been
proposed7,21,22 (but see8), the fact that isochrony effects are stronger for syllables than accent groups even for a
stress-timed language (which would supposedly favour accent-group isochrony) indicates that the temporal scale
associated with the syllable is key—rather than its linguistic functional value. Instead, our results further support
the usefulness of the notion of a neurolinguistic unit defined at the crossroads of linguistic and neurophysiologi-
cal constraints, such as the so-called "theta-syllable"23, which would apply universally to languages independently
of their linguistically-defined rhythm class. We suggest that this unit combines high-level perceptually-grounded
temporal anchoring, with the P-centre being a good candidate, with physiologically-based temporal constraints
determined by neural processing in the theta range of 4–8 Hz (or up to 10 Hz, see24,25).
Taken together, the data presented here are compatible with a model of cortical organisation of speech pro-
cessing with co-occurring and inter-dependent bottom-up and top-down activity25,26. While early studies have
focussed on exogenous effects of rhythm on cortical activity, the role of endogenous activity is being progressively
acknowledged and understood27. In this study, we use intelligibility as a proxy for successful cortical process-
ing, following the long-established link between intelligibility and entrainment28–30. Our data allow us, across two
languages with markedly different rhythmic structures, to suggest unifying principles: the dependence of intel-
ligibility of naturally-timed sentences on their underlying isochrony suggests a bottom-up response component
that benefits from regularity at the syllable level. In isochronous sentences, the strong response variation with
departure from the sentences' natural rhythm indicates that learned patterns are a strong determinant of speech
processing. Unifying both distortion types, anisochronous sentences display a graded dependence on both
temporal distortions, and additionally provide a quantification of the relative magnitude of the effects, since in
this type of sentences, top-down processing has a stronger role than bottom-up information. These hypotheses
need to be backed up with neurophysiological data, and a first work in that framework is reported in31.
A point of interest of our results is the positive correlation in natural sentences between intelligibility and
departure from accent-level isochrony in both English and French. Local intelligibility variations associated
with regular versus irregular accent-level prominences could explain this effect: irregular accent-level units in a
sentence might trigger fewer but louder prominences compared to isochronously distributed accent
units, which in turn may prove more intelligible when mixed at a fixed signal-to-noise ratio over the entire
sentence, as opposed to a more uniform distribution of energetic masking over the sentences. The psychoacoustic
consequences of this hypothesis are difficult to control and outside the scope of the current study, but could open
productive avenues in speech intelligibility enhancing modifications32.
The sentence-long speech material employed here can be considered a good match to ecologically valid speech
perception conditions (over more classical controlled tasks involving lexical decision or segment-level categorical
perception33). In everyday communicative settings however, listeners do make use of a much larger temporal
context for successfully decoding speech, efficiently incorporating semantic, prosodic or discursive cues from a
temporal span in the range of minutes rather than seconds. It is however difficult to predict the potential effect
that isochronously retimed speech might produce in a broader temporal context, as it may become an explicit
feature of the stimulus to be perceived. Indeed, none of the participants in the current study reported the isoch-
ronous quality of modied speech, but with sustained stimulation beyond the duration of a sentence, listeners
may be able to project a temporal canvas onto upcoming speech. One could hypothesise both a facilitatory
effect due to the explicit nature of the isochronous timing, as found in rhythmic priming studies34–36, but also a
detrimental effect due to the sustained departure from optimally-timed delivery of information found in natural
speech, as defended here. e relevance of examining long-term oscillatory entrainment in the delta and theta
range also needs to be posed, although phase-locking eects beyond the duration of a phrase seem unlikely37,38.
In conclusion, in light of the data presented here, we propose an account of the role of isochrony in speech
perception which unifies previous hypotheses on the role of isochrony in speech production and perception39,40
but, importantly, quantifies its contribution with respect to the "elephant-in-the-room" factor, that is, the natural
timing of speech. First, to avoid any ambiguity, we propose that isochrony does not have a primary role in
speech perception or production. Indeed, through the effect of the temporal manipulations of speech reported
here, it is clear that isochronous forms of speech do not provide a benefit in speech recognition in noise. In fact,
the natural timing statistics of speech appear to contain an essential ingredient, used by listeners in the form of
learned information, to which listeners attend by recruiting top-down cortical activity, and which allows them to
process sentences even though they are presented for the first time and degraded by noise. We however show
that isochrony does play a role, albeit a secondary one, which is compatible with a by-product of neuronal
models of oscillatory activity41–43. Importantly, the theta-syllable provides the pivot between isochrony in speech
and oscillations in the brain in this context. Entrainment to oscillatory activity has received a lot of attention in
recent years and broad experimental support, in particular in specific experimental settings where the
material is explicitly presented in an isochronous form (in visual, auditory and musical contexts38,44–47). The fact
that isochrony effects are observed in a context where isochrony is not an explicit characteristic of the experimental
material suggests that these effects could reflect physiological mechanisms which, if sustained, can lead to the
kind of behavioural outcomes reported in controlled setups. In sum, we suggest that isochrony is not a requirement
for successful speech processing but rather a kind of underlying timing prior, actively suppressed by the
necessities of flexible and efficient linguistic information encoding. Because of the plausible neuroanatomical
architecture of cortical processing however, isochrony seems to represent a canonical form of timing that can be
seen as an attractor, and thereby enjoys a special status in perception and production generally, in domains such
as poetry, music and dance.
Methods
Speech material and annotation. The speech material consisted of sentences taken from the Harvard
corpus for English15 and the Fharvard corpus for French16. Both corpora are highly comparable in their structure:
they both contain 700 sentences or more, each sentence is composed of 5–7 keywords (exactly 5 keywords
for French) with mild overall semantic predictability, and sentences are phonetically balanced into lists of 10
sentences. For the current study, we used a subset of 180 sentences for each corpus, recorded by a female talker
for English48 and by a male talker for French49. Each sentence subset was randomly sampled from all sentences
of the corresponding corpus.
Sentences were annotated at two hierarchical rhythmic levels: the accent group and the syllable. The
syllable was taken as the lowest hierarchical rhythmic unit in both languages, and the accent group was
defined as the next step up the rhythmic hierarchy, i.e., the stressed syllable in English and the
accentual phrase in French50,51. While annotation of stressed syllables in English was relatively straightforward,
accentual phrase boundaries proved more difficult to determine unambiguously, as they combine several factors,
from the syntactic structure of the sentence to the particular intonational pattern used by the talker producing
the sentence. We relied on an independent annotation of accent group boundaries done by three native French
speakers. From an initial set of 280 sentences, 181 sentences received full inter-annotator agreement, and the
first 180 sentences of that set were selected for the current study.
The P-centre associated with a rhythmic unit corresponds to the onset of the vowel associated with that unit
in the case of simple consonant-vowel syllables, and is usually advanced in the case of complex syllable onsets,
or when the vowel is preceded by a semi-vowel. In both corpora, P-centre events were first automatically positioned
following an automatic forced alignment procedure52, then manually checked and corrected if necessary,
typically when a schwa was inserted or deleted in a given syllable. Given the hierarchical relationship between
accent groups and syllables, the accent group P-centre aligns with the corresponding syllable P-centre. See
Supplementary Materials for auditory examples of annotated sentences at the accent group and syllable levels.
Stimuli. Sentences were temporally modified by locally accelerating or slowing down adjacent successive
speech segments. Unmodified time onsets of either accent (acc) or syllable (syl) rhythmic units served as a reference
for the natural rhythm (NAT) condition, from which isochronous (ISO) and anisochronous (ANI) conditions
were defined, as detailed below.
We first defined t, the reference time series identifying the time onsets of the N corresponding rhythmic units of
a sentence, flanked with the sentence endpoints, i.e., t = t0, t1, ..., tN, tN+1, with t0 and tN+1 respectively the start
and end of the sentence. We noted d the associated inter-rhythmic-unit durations, i.e., di = ti+1 − ti, i = 0, ..., N.

We then defined the target time series t′ and the associated durations d′, the resulting values after temporal
transformation. The initial and final portions of the sentence were left unchanged, i.e., t′0 = t0, t′1 = t1, t′N = tN
and t′N+1 = tN+1, hence d′0 = d0 and d′N = dN. For the ISO condition, the target durations were set to the average
duration of the corresponding intervals in the reference time series NAT, i.e., d′i = (tN − t1)/(N − 1),
i = 1, ..., N − 1. For the ANI condition, we imposed that sentences were temporally transformed by the same
quantity as isochronous sentences, but with an unpredictable resulting rhythm. We achieved this by using the
following simple heuristic, consisting of applying an isochronous transformation to the time-reversed rhythmic
unit events. First, the reference time series was replaced by a pseudo reference time series made of pseudo events
such that successive pseudo reference durations were the time reversal of the original reference durations, i.e.,
rev(di) = dN−i, i = 1, ..., N − 1; then, the ANI target time series was computed from this reversed sequence by
equalising the temporal distance of the pseudo events, as in the ISO condition.

Temporal transformation was then operated by linearly compressing or expanding the successive speech segments
identified by the reference time series, applying the duration ratio of target to reference speech segments, i.e.,
applying the time-scale step function τi = d′i/di, i = 1, ..., N, to the speech signal of the sentence. This was achieved
using WSOLA53, a high-quality pitch-preserving temporal transformation algorithm.
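The ISO retiming and the time-scale step function described above can be sketched in a few lines. The following Python/NumPy illustration uses hypothetical onset values in seconds, and the function names are ours rather than the authors' code:

```python
import numpy as np

def iso_targets(t):
    """ISO condition: keep the endpoints (t0, t1, tN, tN+1) fixed and
    space the interior onsets equally, so each interior target duration
    d'_i equals the mean reference duration (tN - t1) / (N - 1)."""
    t = np.asarray(t, dtype=float)
    out = t.copy()
    out[1:-1] = np.linspace(t[1], t[-2], len(t) - 2)
    return out

def timescale_factors(t_ref, t_tgt):
    """Time-scale step function tau_i = d'_i / d_i for each speech
    segment (tau < 1 compresses, tau > 1 expands)."""
    return np.diff(t_tgt) / np.diff(t_ref)

# Hypothetical syllable onsets (s), flanked by the sentence endpoints
t = np.array([0.0, 0.20, 0.45, 0.85, 1.05, 1.50, 1.80])
t_iso = iso_targets(t)                # interior intervals all equal 0.325 s
tau = timescale_factors(t, t_iso)     # per-segment scale factors for WSOLA
```

Because the first and last segments keep their original durations, τ equals 1 there, matching the constraint that the initial and final portions of the sentence are left unchanged.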
Altogether, we obtained 5 temporal versions of each sentence in the corpus: the unmodified natural version
(NAT), the isochronous stimuli at the accent (ISO.acc) and syllable (ISO.syl) levels, and the anisochronous stimuli
at the accent (ANI.acc) and syllable (ANI.syl) levels. Figure 3 shows the result of the temporal transformation
of an example sentence for the first 3 conditions. See Supplementary Materials for examples of retimed stimuli.
Final experimental stimuli were constructed by mixing the temporally transformed sentence with speech-shaped
noise at a signal-to-noise ratio of −3 dB. This value was determined in previous studies16 to elicit a
keyword recognition score of around 60% in the unmodified speech condition, in turn providing maximum
sensitivity to the difference between unmodified and modified speech conditions. Speech-shaped noise was
constructed, separately for each talker, by filtering white noise with a 200-pole LPC filter computed on the
concatenation of all sentences of the corpus recorded by that talker.
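The noise construction and mixing can be sketched as follows. This is a minimal illustration of the general LPC approach (Levinson-Durbin on the signal autocorrelation, then all-pole filtering of white noise), not the authors' scripts; the function names and the RMS-based SNR computation are our assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """All-pole LPC coefficients a = [1, a1, ..., a_order] via the
    autocorrelation method (Levinson-Durbin recursion)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
        a[1:i + 1] += k * np.concatenate((a[1:i][::-1], [1.0]))
        err *= 1.0 - k * k
    return a

def speech_shaped_noise(speech, order=200, seed=None):
    """Filter white noise with an all-pole fit of the long-term speech
    spectrum (the paper uses a 200-pole filter on the concatenated
    corpus), rescaled here to the speech RMS."""
    rng = np.random.default_rng(seed)
    a = lpc(speech, order)
    noise = lfilter([1.0], a, rng.standard_normal(len(speech)))
    return noise * np.sqrt(np.mean(speech ** 2) / np.mean(noise ** 2))

def mix_at_snr(speech, noise, snr_db=-3.0):
    """Scale the noise so that 10*log10(P_speech / P_noise) = snr_db,
    then add it to the speech."""
    ps, pn = np.mean(speech ** 2), np.mean(noise ** 2)
    return speech + noise * np.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
```

The autocorrelation method guarantees a stable (minimum-phase) synthesis filter, which is why it is the standard choice for this kind of long-term spectral shaping.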
Temporal distortion metrics. We quantified the net amount of temporal distortion δ applied to a given
sentence by computing the root mean square of the log-transformed time-scale step function τ associated with that
sentence. Log-transformation was done so that compression and elongation by inverse values (e.g., ×1/2 and ×2,
respectively) would contribute equally to the overall distortion measure (i.e., log(1/2)² = log(2)²). The binary
logarithm was used here. Using our notation referring to discrete events, the temporal distortion from the reference
time series t with N events to the target time series t′ can therefore be written:

δ = √[ Σ (log₂ τi)² di / Σ di ],  with both sums running over i = 1, ..., N    (1)

where i = 1, ..., N is the event index, τ the time-scale factors linking t and t′, and d the durations between successive
reference events, as defined in the preceding section. Note that the di term in the numerator emerges from
the grouping of samples between successive reference events, since for all these samples the τi values are constant
by design.

By design, individual sentences undergo an identical amount of temporal distortion in the isochronous and
anisochronous transformations. By extending the application of the δ function to time instants other than the
ones used for stimulus construction, we introduce additional metrics to evaluate the departure of the rhythm
of any sentence – whether temporally modified or not – from two canonical rhythm types associated with the
sentence: its unmodified natural rhythm and its isochronous counterpart. For the two hierarchical rhythmic
levels considered here, this amounts to 4 new metrics: departure from naturally timed accent groups or syllables
(respectively dnat.acc and dnat.syl), and departure from isochronous accent groups or syllables (respectively
diso.acc and diso.syl, see Table 5).
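The distortion metric translates directly into code. A minimal sketch in Python/NumPy (the function name is ours):

```python
import numpy as np

def delta(t_ref, t_tgt):
    """Temporal distortion (Eq. 1): duration-weighted root mean square
    of the binary-log time-scale factors linking the reference onsets
    t_ref to the target onsets t_tgt."""
    d = np.diff(t_ref)            # reference durations d_i
    tau = np.diff(t_tgt) / d      # time-scale step function
    return float(np.sqrt(np.sum(np.log2(tau) ** 2 * d) / np.sum(d)))
```

As intended by the log transformation, a uniform compression by 2 and a uniform expansion by 2 yield the same distortion (δ = 1), and an unmodified sentence yields δ = 0.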
Figure 3. Annotation of an example sentence (translation: The red neon lamp makes his/her hair iridescent), in
its original unmodified natural timing (A) and transformed isochronous forms at the accent (B) and syllable
(C) levels. For each panel, from top to bottom: spectrogram with target boundaries used for the transformation
overlaid in dashed lines, accent group onsets (red), syllable onsets (orange), phonemes and words.
Participants and procedure. Data for English are based on 26 participants (21 females), mean age 20.9 years
(SD = 6.3), all speaking Australian English as a native language and with no known hearing problems; they are
described in detail in19. We include the English data previously reported in19 for the purpose of comparison
with the French data. New analyses are conducted on this dataset (see "Results"). All experimental protocols
were approved by the Human Research Ethics Committee of Western Sydney University under the reference
H9495. For French, 27 participants (15 females), mean age 26.7 years (SD = 8.8), were recruited from the student
and staff population of the University of Grenoble-Alpes (UGA) and were given a 15-euro gift card as compensation
for their participation. We checked that all participants met the selection criteria, which included speaking
French as a native language and having no known hearing problems. No data were removed from the initial set.
All experimental protocols were approved by UGA's ethics committee (CERGA, agreement
IRB00010290-2017-12-12-33). For both the English and French studies, all methods were carried out in accordance
with relevant guidelines and regulations, and informed consent was obtained from all participants.
For both language groups, participants were given written instructions and the experimenter gave complementary
information when necessary. Participants then sat in front of a computer screen, where short on-screen
instructions were given before each block of the experiment. Participants were presented with speech and noise
mixtures played binaurally through Beyerdynamic DT 770 Pro 80-ohm closed headphones at a comfortable level,
set at the beginning of the experiment and kept constant throughout. Participants had to type what they heard
on a keyboard and press "Enter" to trigger the presentation of the next stimulus. Stimuli were grouped by
condition, forming 5 blocks of 36 sentences each. An additional 5 stimuli from each condition were presented
as practice at the beginning of the experiment and were not used for further analysis. The order of conditions
was counterbalanced across participants, and the order of sentences was pseudo-randomised for each participant.
The order of the practice sentences was fixed across all participants, and the order of conditions for practice
sentences was also fixed, to: NAT, ISO.acc, ANI.acc, ISO.syl, ANI.syl.
Scoring. Sentences were automatically scored with custom-made scripts that matched keywords with listeners'
typed responses. A dictionary of variants was used in both languages to correct for homophones, full-letter
spelling of digits and common spelling mistakes (e.g., *'ciggar' corrected to 'cigar' in English, and *'cigne'
corrected to 'cygne' ('swan') in French). Each sentence received a final score as the proportion of keywords
correctly recognised.

The accuracy of the automatic scoring was evaluated on a 530-sentence subset of the listeners' responses
(around 5.5% of all 9450 responses), randomly and uniformly sampled across subjects, conditions and languages.
The subset was manually scored, and we found that 98% of the sentences were correctly scored by the automatic
procedure, the 10 incorrectly scored sentences each having no more than one word typed with a spelling mistake
that was absent from the dictionary (and therefore added to it for future studies).
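The keyword matching step can be sketched as follows; the function and the variants dictionary are illustrative stand-ins for the authors' custom scripts:

```python
def score_sentence(keywords, response, variants=None):
    """Proportion of keywords found in the typed response, after
    lower-casing and mapping known variants (homophones, digit
    spellings, common misspellings) to their canonical form."""
    variants = variants or {}
    words = {variants.get(w, w) for w in response.lower().split()}
    return sum(k.lower() in words for k in keywords) / len(keywords)

# Illustrative example with a hypothetical variants dictionary
variants = {"ciggar": "cigar"}
score = score_sentence(["glow", "cigar"], "the glow of his ciggar", variants)
```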
Data modelling. The effect of experimental condition on intelligibility was analysed for French following the
previously reported analysis of the English data19. A generalised linear mixed-effects model (function glmer
from the R package lme454) was fitted to intelligibility scores, including a random intercept term by subject.
Normal distribution of residuals was visually verified. Generalised simultaneous hypotheses were formulated
and tested with the function glht from the R package multcomp55, which corrects for multiple comparisons.

Analysis of the contribution of the temporal distortion metrics to intelligibility was performed on the combined
French and English data. Three models were fitted, one for each of the three subsets of the data (see Fig. 2 and
"Results"). Fixed effects were language (French and English) and the non-all-zero-valued metrics for the given
data subset. The random effect structure included a term by sentence and by participant. For each model, we
report an analysis of variance between the initial full model and the minimal equivalent model. The latter is
obtained from the initial model by incrementally removing fixed-effect terms until no term can be removed without
Table 5. Analysis metrics to evaluate departure from naturally-timed and isochronous forms at the accent
and syllable level (rows) in each of the 5 experimental conditions (columns). Each argument of the δ function
(Eq. 1) is a time series of either accent (acc) or syllable (syl) onsets, as they occur in a given experimental
condition. For example, t_syl^ISO.acc represents the syllable onsets of a sentence as they occur in the transformed
ISO.acc experimental condition. Note that some of these distortions are equal to 0 by construction: they
are dnat.acc and dnat.syl for NAT sentences, and diso.acc and diso.syl for ISO.acc and ISO.syl sentences
respectively.

            NAT                              ISO.acc                          ISO.syl
dnat.acc    δ(t_acc^NAT, t_acc^NAT)          δ(t_acc^NAT, t_acc^ISO.acc)      δ(t_acc^NAT, t_acc^ISO.syl)
dnat.syl    δ(t_syl^NAT, t_syl^NAT)          δ(t_syl^NAT, t_syl^ISO.acc)      δ(t_syl^NAT, t_syl^ISO.syl)
diso.acc    δ(t_acc^ISO.acc, t_acc^NAT)      δ(t_acc^ISO.acc, t_acc^ISO.acc)  δ(t_acc^ISO.acc, t_acc^ISO.syl)
diso.syl    δ(t_syl^ISO.syl, t_syl^NAT)      δ(t_syl^ISO.syl, t_syl^ISO.acc)  δ(t_syl^ISO.syl, t_syl^ISO.syl)

            ANI.acc                          ANI.syl
dnat.acc    δ(t_acc^NAT, t_acc^ANI.acc)      δ(t_acc^NAT, t_acc^ANI.syl)
dnat.syl    δ(t_syl^NAT, t_syl^ANI.acc)      δ(t_syl^NAT, t_syl^ANI.syl)
diso.acc    δ(t_acc^ISO.acc, t_acc^ANI.acc)  δ(t_acc^ISO.acc, t_acc^ANI.syl)
diso.syl    δ(t_syl^ISO.syl, t_syl^ANI.acc)  δ(t_syl^ISO.syl, t_syl^ANI.syl)
significantly changing the explained variance. The distribution of residuals was visually checked. Fixed-effect
sizes were computed with the function r2beta from the R package r2glmm56.
Data availability
Example stimuli illustrating P-centre annotation and temporal modification, as well as computer code for the
temporal distortion metrics, are included in Supplementary Materials. Experimental stimuli and listeners' response
data are available for download at: https://doi.org/10.5281/zenodo.3966475.
Received: 29 April 2020; Accepted: 28 October 2020
References
1. Bishop, G. H. Cyclic changes in excitability of the optic pathway of the rabbit. Am. J. Physiol. Leg. Content 103, 213–224 (1932).
2. Schroeder, C. E. & Lakatos, P. Low-frequency neuronal oscillations as instruments of sensory selection. Trends Neurosci. 32, 9–18 (2009).
3. Schroeder, C. E., Wilson, D. A., Radman, T., Scharfman, H. & Lakatos, P. Dynamics of active sensing and perceptual selection. Curr. Opin. Neurobiol. 20, 172–176 (2010).
4. Ahissar, E. et al. Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc. Natl. Acad. Sci. USA 98, 13367–13372 (2001).
5. Luo, H. & Poeppel, D. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54, 1001–1010. https://doi.org/10.1016/j.neuron.2007.06.004 (2007).
6. Greenberg, S. Speaking in shorthand: a syllable-centric perspective for understanding pronunciation variation. Speech Commun. 29, 159–176 (1999).
7. Ghitza, O. On the role of theta-driven syllabic parsing in decoding speech: intelligibility of speech with a manipulated modulation spectrum. Front. Psychol. 3, 238. https://doi.org/10.3389/fpsyg.2012.00238 (2012).
8. Cummins, F. Oscillators and syllables: a cautionary note. Front. Psychol. 3, 364. https://doi.org/10.3389/fpsyg.2012.00364 (2012).
9. Steele, R. An Essay Towards Establishing the Melody and Measure of Speech to Be Expressed and Perpetuated by Peculiar Symbols (J. Nichols, London, 1779).
10. James, A. L. Speech Signals in Telephony (Sir I. Pitman & Sons Ltd, London, 1940).
11. Pike, K. L. The Intonation of American English (University of Michigan Press, Ann Arbor, 1945).
12. Abercrombie, D. Elements of General Phonetics (Aldine, London, 1967).
13. Cummins, F. Rhythm and speech. In The Handbook of Speech Production (ed. Redford, M. A.) 158–177 (Wiley, Chichester, 2015).
14. Peelle, J. E. & Davis, M. H. Neural oscillations carry speech rhythm through to comprehension. Front. Psychol. 3, 320. https://doi.org/10.3389/fpsyg.2012.00320 (2012).
15. Rothauser, E. H. et al. IEEE recommended practice for speech quality measurements. IEEE Trans. Audio Acoust. 17, 225–246 (1969).
16. Aubanel, V., Bayard, C., Strauss, A. & Schwartz, J.-L. The Fharvard corpus: a phonemically-balanced French sentence resource for audiology and intelligibility research. Speech Commun. 124, 68–74. https://doi.org/10.1016/j.specom.2020.07.004 (2020).
17. Morton, J., Marcus, S. & Frankish, C. Perceptual centers (P-centers). Psychol. Rev. 83, 405 (1976).
18. Scott, S. K. P-centers in speech: an acoustic analysis. Ph.D. thesis, UCL, London, UK (1993).
19. Aubanel, V., Davis, C. & Kim, J. Exploring the role of brain oscillations in speech perception in noise: intelligibility of isochronously retimed speech. Front. Hum. Neurosci. https://doi.org/10.3389/fnhum.2016.00430 (2016).
20. Arvaniti, A. The usefulness of metrics in the quantification of speech rhythm. J. Phon. 40, 351–373. https://doi.org/10.1016/j.wocn.2012.02.003 (2012).
21. Mehler, J., Dommergues, J. Y., Frauenfelder, U. & Segui, J. The syllable's role in speech segmentation. J. Verbal Learn. Verbal Behav. 20, 298–305 (1981).
22. Greenberg, S., Carvey, H., Hitchcock, L. & Chang, S. Temporal properties of spontaneous speech – a syllable-centric perspective. J. Phon. 31, 465–485. https://doi.org/10.1016/j.wocn.2003.09.005 (2003).
23. Ghitza, O. The theta-syllable: a unit of speech information defined by cortical function. Front. Psychol. https://doi.org/10.3389/fpsyg.2013.00138 (2013).
24. Ghitza, O. Behavioral evidence for the role of cortical θ oscillations in determining auditory channel capacity for speech. Front. Psychol. https://doi.org/10.3389/fpsyg.2014.00652 (2014).
25. Pefkou, M., Arnal, L. H., Fontolan, L. & Giraud, A.-L. θ-band and β-band neural activity reflects independent syllable tracking and comprehension of time-compressed speech. J. Neurosci. 37, 7930–7938. https://doi.org/10.1523/JNEUROSCI.2882-16.2017 (2017).
26. Fontolan, L., Morillon, B., Liegeois-Chauvel, C. & Giraud, A.-L. The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex. Nat. Commun. 5, 1–10 (2014).
27. Rimmele, J. M., Morillon, B., Poeppel, D. & Arnal, L. H. Proactive sensing of periodic and aperiodic auditory patterns. Trends Cogn. Sci. 22, 870–882. https://doi.org/10.1016/j.tics.2018.08.003 (2018).
28. Ding, N. & Simon, J. Z. Emergence of neural encoding of auditory objects while listening to competing speakers. Proc. Natl. Acad. Sci. USA 109, 11854–11859 (2012).
29. Ding, N. & Simon, J. Z. Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. J. Neurosci. 33, 5728–5735 (2013).
30. Peelle, J. E., Gross, J. & Davis, M. H. Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cereb. Cortex 23, 1378–1387. https://doi.org/10.1093/cercor/bhs118 (2013).
31. Strauss, A., Aubanel, V., Giraud, A.-L. & Schwartz, J.-L. Bottom-up and top-down processes cooperate around syllable P-centers to ensure speech intelligibility. (submitted).
32. Cooke, M., Aubanel, V. & Lecumberri, M. L. G. Combining spectral and temporal modification techniques for speech intelligibility enhancement. Comput. Speech Lang. 55, 26–39. https://doi.org/10.1016/j.csl.2018.10.003 (2019).
33. Giraud, A.-L. & Poeppel, D. Speech perception from a neurophysiological perspective. In The Human Auditory Cortex (eds Poeppel, D. et al.) 225–260 (Springer, New York, 2012). https://doi.org/10.1007/978-1-4614-2314-0_9.
34. Cason, N. & Schön, D. Rhythmic priming enhances the phonological processing of speech. Neuropsychologia 50, 2652–2658 (2012).
35. Cason, N., Astesano, C. & Schön, D. Bridging music and speech rhythm: rhythmic priming and audio-motor training affect speech perception. Acta Psychol. 155, 43–50. https://doi.org/10.1016/j.actpsy.2014.12.002 (2015).
36. Haegens, S. & Zion-Golumbic, E. Rhythmic facilitation of sensory processing: a critical review. Neurosci. Biobehav. Rev. 86, 150–165. https://doi.org/10.1016/j.neubiorev.2017.12.002 (2018).
37. ten Oever, S. & Sack, A. T. Oscillatory phase shapes syllable perception. Proc. Natl. Acad. Sci. USA https://doi.org/10.1073/pnas.1517519112 (2015).
38. Hickok, G., Farahbod, H. & Saberi, K. The rhythm of perception. Psychol. Sci. 26, 1006–1013. https://doi.org/10.1016/j.cub.2013.11.006 (2015).
39. Dauer, R. M. Stress-timing and syllable-timing reanalyzed. J. Phon. 11, 51–62 (1983).
40. Lehiste, I. Isochrony reconsidered. J. Phon. 5, 253–263 (1977).
41. Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I. & Schroeder, C. E. Entrainment of neuronal oscillations as a mechanism of attentional selection. Science 320, 110–113. https://doi.org/10.1126/science.1154735 (2008).
42. Giraud, A.-L. & Poeppel, D. Cortical oscillations and speech processing: emerging computational principles and operations. Nature Neurosci. 15, 511–517. https://doi.org/10.1038/nn.3063 (2012).
43. Hyafil, A., Fontolan, L., Kabdebon, C., Gutkin, B. & Giraud, A.-L. Speech encoding by coupled cortical theta and gamma oscillations. eLife 4, e06213. https://doi.org/10.7554/eLife.06213.001 (2015).
44. Busch, N. A., Dubois, J. & VanRullen, R. The phase of ongoing EEG oscillations predicts visual perception. J. Neurosci. 29, 7869–7876. https://doi.org/10.1523/JNEUROSCI.0113-09.2009 (2009).
45. Nozaradan, S., Peretz, I., Missal, M. & Mouraux, A. Tagging the neuronal entrainment to beat and meter. J. Neurosci. 31, 10234–10240 (2011).
46. Ding, N., Melloni, L., Zhang, H., Tian, X. & Poeppel, D. Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neurosci. 19, 158–164. https://doi.org/10.1038/nn.4186 (2016).
47. van Atteveldt, N. et al. Complementary fMRI and EEG evidence for more efficient neural processing of rhythmic versus unpredictably timed sounds. Front. Psychol. 6, 1663. https://doi.org/10.3389/fpsyg.2015.01663 (2015).
48. Aubanel, V., Davis, C. & Kim, J. The MAVA corpus. https://doi.org/10.4227/139/59a4c21a896a3 (2017).
49. Aubanel, V., Bayard, C., Strauss, A. & Schwartz, J.-L. The Fharvard corpus. https://doi.org/10.5281/zenodo.1462854 (2018).
50. Jun, S. A. & Fougeron, C. A phonological model of French intonation. In Intonation: Analysis, Modeling and Technology (ed. Botinis, A.) 209–242 (Kluwer Academic Publishers, Dordrecht, 2000).
51. Nespor, M. & Vogel, I. Prosodic Phonology: With a New Foreword Vol. 28 (Walter de Gruyter, Berlin, 2007).
52. Goldman, J.-P. EasyAlign: an automatic phonetic alignment tool under Praat. In Interspeech 3233–3236 (Florence, Italy, 2011).
53. Demol, M., Verhelst, W., Struyve, K. & Verhoeve, P. Efficient non-uniform time-scaling of speech with WSOLA. In International Conference on Speech and Computer (SPECOM), 163–166 (2005).
54. Bates, D., Mächler, M., Bolker, B. M. & Walker, S. C. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48. https://doi.org/10.18637/jss.v067.i01 (2015).
55. Hothorn, T., Bretz, F. & Westfall, P. Simultaneous inference in general parametric models. Biom. J. 50, 346–363 (2008).
56. Jaeger, B. r2glmm: Computes R Squared for Mixed (Multilevel) Models, R package version 0.1.2 edn. (2017).
Acknowledgements
This work was supported by the European Research Council under the European Community's Seventh Framework
Program (FP7/2007-2013 Grant Agreement No. 339152, "Speech Unit(e)s"). We thank Christine Nies for
her help in collecting data and Silvain Gerber for assistance with statistical analysis.
Author contributions
V.A. and J.-L.S. conceived and designed the study, supervised data collection, analysed the data and wrote the
paper.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-76594-1.
Correspondence and requests for materials should be addressed to V.A.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article's Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2020