The role of isochrony in speech
perception in noise
Vincent Aubanel* & Jean‑Luc Schwartz
The role of isochrony in speech—the hypothetical division of speech units into equal duration intervals—has been the subject of a long-standing debate. Current approaches in neurosciences have brought new perspectives in that debate through the theoretical framework of predictive coding and cortical oscillations. Here we assess the comparative roles of naturalness and isochrony in the intelligibility of speech in noise for French and English, two languages representative of two well-established contrastive rhythm classes. We show that both top-down predictions associated with the natural timing of speech and, to a lesser extent, bottom-up predictions associated with isochrony at a syllabic timescale improve intelligibility. We found a similar pattern of results for both languages, suggesting that temporal characterisation of speech from different rhythm classes could be unified around a single core speech unit, with neurophysiologically defined duration and linguistically anchored temporal location. Taken together, our results suggest that isochrony does not seem to be a main dimension of speech processing, but may be a consequence of neurobiological processing constraints, manifesting in behavioural performance and ultimately explaining why isochronous stimuli occupy a particular status in speech and human perception in general.
A fundamental property of mammalian brain activity is its oscillatory nature, resulting in the alternation between excitable and inhibited states of neuronal assemblies1. The crucial characteristic of heightened excitability is that it provides, for sensory areas, increased sensitivity and shorter reaction times, ultimately leading to optimised behaviour. This idea formed the basis of the Active Sensing theoretical framework2,3 and has found widespread experimental support.
Oscillatory activity in relation to speech, a complex sensory signal, has initially been described as speech entrainment, or tracking, a view which proposes that cortical activity can be matched more or less directly to some characteristics of the speech signal such as the amplitude envelope4,5. The need to identify particular events to support speech tracking has in turn prompted the question of which units oscillatory activity would entrain to. The syllable has usually been taken as the right candidate6,7, given the close match, under clear speech conditions, between the timing of syllable boundaries and that of the amplitude envelope's larger variations. These conditions are however far from representative of how speech is usually experienced: connected speech is notoriously characterised by the lack of acoustically salient syllable boundaries8.
In parallel, early work on speech rhythm, also inspired by evident similarities with the timing of music9, has led scholars to focus on periodic aspects of speech. The isochrony hypothesis extended impressionistic descriptions of speech sounding either morse-like or machine-gun-like10,11, and led to the rhythmic class hypothesis12, stating that languages fall into distinct rhythmic categories depending on which unit is used to form the isochronous stream. Two main classes emerged: stress-timed languages (e.g., English) based on isochronous feet, and syllable-timed languages (e.g., French) which assume equal-duration syllables. Still, the isochrony hypothesis and the related rhythmic class hypothesis, in spite (or by virtue) of their simple formulation and intuitive account, have been the source of a continuous debate (see review in13).
As current theories reviewed above are formulated, isochrony in speech would present some advantages: speech units delivered at an ideal isochronous pace would be maximally predictable and lead to maximum entrainment, by alleviating the need for potentially costly phase-reset mechanisms14. However, naturally produced speech is rarely isochronous, if at all, and this departure from a hypothetical isochronous form, that is, the variation of sub-rhythmic unit durations, is in fact used to encode essential information at all linguistic (pre-lexical, lexical, prosodic, pragmatic, discursive) and para-linguistic levels. Two apparently contradictory hypotheses are therefore at play here, the first positing a beneficial role for isochrony in speech processing, and the second seeing natural speech timing as a gold standard, with any departure from it impairing recognition.
In this study we attempt to disentangle the roles of the two temporal dimensions of isochrony and naturality in speech perception. We report on two experiments conducted separately on spoken sentences in French and English, each representative of one of the two rhythmic classes. For this aim, we exploited the Harvard corpus for
English15 and its recently developed French counterpart, the Fharvard corpus16. Both corpora contain sentences composed of 5–7 keywords, recorded by one talker for each language. Sentences were annotated at two hierarchical rhythmic levels: the accent group level and the syllable level, respectively forming the basis of the two main language rhythmic classes mentioned above. We retimed naturally produced sentences to an isochronous form (or a matched anisochronous form, see hereafter) by locally compressing or elongating speech portions corresponding to rhythmic units at the two corresponding levels (accent group and syllable). Retiming was operationalised around P-centres, that is, the times at which listeners report the occurrence of the units17,18, and which provide crucial pivotal events at the meeting point between bottom-up acoustic saliency cues and top-down information about the onset of linguistic units. Unmodified time onsets of either accent (acc) or syllable (syl) rhythmic units served as a reference for the natural rhythm (NAT) condition, from which isochronous (ISO) and anisochronous (ANI) conditions were defined. Altogether, this provided 5 temporal versions of each sentence in each corpus: the unmodified natural version (NAT), the isochronous stimuli at the accent (ISO.acc) and syllable (ISO.syl) levels, and the anisochronous stimuli at the accent (ANI.acc) and syllable (ANI.syl) levels. The ANI conditions served as controls for the ISO conditions through the application of identical net temporal distortions from the NAT sentences, though in a non-isochronous way (see "Methods").
We then evaluated the consequences of these modifications of naturalness towards isochrony on the ability of listeners to process and understand the corresponding speech items. Sentence stimuli were mixed with stationary speech-shaped noise to shift comprehension to below-ceiling levels. Then, for both languages separately, the set of the five types of sentences in noise was presented to native listeners, and the proportion of recognised keywords was taken as the index of the intelligibility of the corresponding sentence in the corresponding condition. We show that naturalness is the main ingredient of intelligibility, while isochrony at the syllable level—but not at the accent group level, whatever the rhythmic class of the considered language—plays an additional though quantitatively smaller beneficial role. This provides for the first time an integrated coherent framework combining predictive cues related to bottom-up isochrony and top-down naturalness, describing speech intelligibility properties independently of the language rhythmic class.
Results
Natural timing leads to greater intelligibility than either isochronously or anisochronously retimed speech in both languages. We first report the effect of temporal distortion on intelligibility, separately by retiming condition, for the two languages. Figure 1 shows intelligibility results as the proportion of keywords correctly recognised by French and English listeners (top panel) and the temporal distortion applied to sentences in each condition (bottom panel, see "Methods", Eq. (1) for computation details). Net temporal distortion from natural speech at the condition level appears to be reflected in listeners' performance, with increased temporal distortion associated with decreased intelligibility in both languages.
Extending the analysis done for the English data and reported in19, we fitted a generalised linear mixed-effect model to the French data. Table 1 gathers the results of simultaneous generalised hypothesis tests on the condition effects, formulated separately for each language.
As verified in Fig. 1 and in the first 4 rows of Table 1, intelligibility of unmodified naturally timed sentences for French was significantly higher than that of sentences in any temporally modified condition. This result replicates what was obtained for English, and confirms that any temporal distortion leads to degraded intelligibility. In contrast to English however, where accent-isochronously retimed sentences were significantly more intelligible than accent-anisochronously retimed ones, no such effect is observed for French (Table 1 row 5). Similarly, the tendency for an isochronous versus anisochronous intelligibility difference at the syllable level observed in English is absent in French (Table 1 row 6). Indeed, an overall benefit of isochronous over anisochronous transformation is observed for English but not for French when combining the two rhythmic levels (Table 1 row 7).
As shown by the last row of Table 1, syllable-level distortion led to a greater intelligibility decrease than accent-level distortion in both French and English. This relates to the greater amount of distortion applied in syllable-level than in accent-level modifications of the sentences, see Fig. 1, bottom panel.
In sum, while temporal distortion appears to be the main predictor of intelligibility for both languages, the independent role of isochrony seems to differ between the two languages. We present in the next section evidence for a common pattern underlying these surface differences.
Syllable-level isochrony plays a secondary role in both languages, even in naturally timed sentences. We defined several rhythm metrics to quantify, for each sentence in any temporal condition, the departure of either accent group or syllable timing from two canonical rhythm types: natural or isochronous rhythm. For the two hierarchical rhythmic levels considered here, this amounts to 4 metrics altogether: departure from naturally timed accent groups or syllables (respectively dnat.acc and dnat.syl), and departure from isochronous accent groups or syllables (respectively diso.acc and diso.syl, see "Methods" and Table 5).
Figure2 shows intelligibility scores as a function of the temporal distortion applied to the sentences along
the 4 metrics, for all 5 experimental conditions, for English and French.
We analysed the joint role of isochrony and naturality in the dierent temporal conditions using logistic
regression modelling (see “Methods”). We conducted three separate analyses by grouping natural, isochronous
and anisochronous sentences respectively. is was done to avoid including subsets of data where a given metric
would yield a zero-value by design (see Fig.2). e three analyses are presented in the next subsections, each
corresponding to a highlighted region in Fig.2.
Departure from isochrony in naturally timed sentences (Fig. 2 region A). Naturally timed sentences have a null departure from naturality by design, but their departure from an isochronous form at both
Figure 1. Top panel: proportion of words correctly recognised in each experimental condition (NAT, ISO.acc, ANI.acc, ISO.syl, ANI.syl) for French and English. Error bars show 95% confidence intervals over 26 and 27 subjects in the two languages respectively. Bottom panel: average sentence temporal distortion (δ function computed on speech units matching the temporal condition, see Eq. (1)). By construction, temporal distortion is null for natural sentences (NAT condition) and identical for isochronously and anisochronously retimed sentences at a given rhythmic level, that is, in ISO.acc and ANI.acc conditions on one hand, and in ISO.syl and ANI.syl conditions on the other hand. Error bars show 95% confidence intervals over 180 sentences for French and English. Data for English were previously reported in19.
Table 1. Simultaneous generalised hypothesis tests for the effect of condition on intelligibility, formulated on two independent models for French and English respectively. From left to right: comparison tested and, for each language, comparison estimate and associated z and p values, with classical visual significance indication. Data for English has been previously reported in19.

Row  Comparison          French                        English
                         Est.     z       p            Est.     z       p
1    ISO.acc, NAT       −0.509   −10.94  <0.001 ***   −0.545   −12.16  <0.001 ***
2    ANI.acc, NAT       −0.420    −9.09  <0.001 ***   −0.722   −16.06  <0.001 ***
3    ISO.syl, NAT       −0.862   −18.45  <0.001 ***   −1.017   −22.40  <0.001 ***
4    ANI.syl, NAT       −0.820   −17.59  <0.001 ***   −1.127   −24.68  <0.001 ***
5    ISO.acc, ANI.acc    0.088     1.93   0.273        0.177     4.00  <0.001 ***
6    ISO.syl, ANI.syl    0.042     0.91   0.878        0.110     2.43   0.092 .
7    ISO, ANI            0.130     2.01   0.233        0.287     4.54  <0.001 ***
8    syl, acc           −0.753   −11.59  <0.001 ***   −0.878   −13.80  <0.001 ***
the accent and the syllable level can be evaluated by the diso.acc and diso.syl metrics respectively. We therefore included in the analysis the diso.acc and diso.syl metrics, and discarded the dnat.acc and dnat.syl metrics (see Fig. 2 region A). Starting from the initial logistic regression model predicting intelligibility with the full interaction of language (French and English), diso.acc and diso.syl as fixed effects, we found that the simplest equivalent model was a model with only the diso.acc and diso.syl factors, without interaction (see Table 2 and "Methods"). The resulting model shows that for natural sentences, intelligibility is positively correlated with departure from accent isochrony (i.e., increased accent group irregularity is associated with better intelligibility) and negatively correlated with departure from syllable isochrony (i.e., the more isochronous naturally timed syllables are, the better the sentence is recognised). Importantly, this result does not depend on the language considered, with both French and English showing the same pattern of results. Fixed-effect sizes are markedly small, as the majority of the variance is explained by random effects, as expected with the material used here. But fixed effects are nevertheless real and quantitatively important, as seen in Fig. 2.
Departure from natural timing in isochronously retimed sentences (Fig. 2 region B). Next we assessed to what extent intelligibility in isochronous conditions can be predicted from the departure from natural rhythm, at the accent and syllable levels (see Fig. 2 region B). From an initial fully interactional model with language, dnat.acc and dnat.syl predictors, the simplest equivalent model consisted of only the dnat.syl factor (see Table 3).
This indicates that in conditions where sentences are isochronously transformed, intelligibility is significantly negatively correlated with departure from natural syllabic rhythmicity. Crucially, departure from accent group natural rhythm does not play a role, and the results are identical for both languages considered.
Figure2. Intelligibility as a function of temporal distortion, as measured by the four metrics (rows) dened
in Table5. Data are grouped according to experimental condition (colors), the type of modication of the
experimental condition (column groupings) and language (columns). ree subsets of data are highlighted
for subsequent analysis (see text): (A) departure from isochrony of naturally timed sentences; (B) departure
from natural rhythm of isochronously retimed sentences; (C) departure from natural rhythm and isochrony
of anisochronously retimed sentences. Regression lines show linear modelling of the data points, disregarding
subject and sentence random variation.
Departure from isochrony and natural timing in anisochronously retimed sentences (Fig. 2 region C). In this last step we evaluated whether intelligibility of anisochronously retimed sentences could be predicted by a combination of the four rhythmic distortion metrics (diso.acc, diso.syl, dnat.acc and dnat.syl, see Fig. 2 region C). Indeed, anisochronously retimed speech departs from both the natural and the isochronous canonical forms of timing, and in particular all four metrics take non-zero values for these sentences. From an initial fully interactional model crossing language with the four rhythm metrics, we found that the simplest equivalent model consisted of the additive model of factors dnat.syl and diso.syl (see Table 4).
These results refine and extend what was obtained in the previous two analyses. First, the rhythmic unit of accent group (whether defined at the stressed syllable level in English or at the accentual phrase level in French) does not provide any explanatory power in predicting intelligibility in anisochronously timed speech. Second, the role of natural syllable timing is confirmed and is the strongest predictor of intelligibility in that model, as shown by its z value (Table 4B) and its effect size (Table 4C). Third, a role for the departure from isochronously timed syllables is detected. This means that in these conditions where the timing of speech is most unpredictable, there is a tendency for isochronous syllables to be associated with increased intelligibility. We note however that there is a necessary correlation between dnat and diso metrics for anisochronous sentences, exemplified by the fact that a close-to-isochronous natural sentence has to be distorted by a low diso value to be rendered isochronous, and that its anisochronous counterpart, being distorted by the same quantity by design, will be close to both the natural and the isochronous version. A quantitative analysis confirmed this (Pearson's product-moment correlation between diso.syl and dnat.syl: French: r = 0.72, p < 0.01; English: r = 0.56, p < 0.01). The specific contribution of syllabic isochrony to intelligibility therefore appears to be small at best for anisochronous sentences. Finally, as for the above analyses, this pattern of results applies indistinctly to both French and English.
Table 2. (A) Initial (m1) and equivalent simpler (m2) model for the role of departure from isochrony in naturally timed sentences. The formulae of the fixed effects are given for the two models, and the result of a likelihood-ratio test between the two models is given on the right of the vertical separator. (B) m2 model coefficients, with associated p values. (C) Fixed-effect sizes with lower and upper confidence levels.

(A) Model selection
Model  Fixed effects                     AIC     | χ²    Df  p(>χ²)
m1     language × diso.acc × diso.syl   6728.4  | 3.36  5   0.65
m2     diso.acc + diso.syl              6721.7  |

(B) Equivalent model (m2) coefficients
              Estimate   SE       z value   Pr(>|z|)
(Intercept)    0.8633    0.3230    2.673    0.00752 **
diso.acc       1.0651    0.4633    2.299    0.02150 *
diso.syl      −1.5151    0.5757   −2.632    0.00850 **

(C) Fixed-effects size
Effect     R²      Lower CL   Upper CL
m2         0.028   0.014      0.048
diso.syl   0.018   0.006      0.034
diso.acc   0.013   0.004      0.028
Table 3. (A) Initial (m3) and equivalent simpler (m4) models for the role of departure from natural rhythm in isochronously retimed sentences. The formulae of the fixed effects are given for the two models, and the result of a likelihood-ratio test between the two models is given on the right of the vertical separator. (B) m4 model coefficients, with associated p values. (C) Fixed-effect sizes with lower and upper confidence levels.

(A) Model selection
Model  Fixed effects                     AIC     | χ²    Df  p(>χ²)
m3     language × dnat.acc × dnat.syl   13,507  | 6.12  6   0.41
m4     dnat.syl                         13,502  |

(B) Equivalent model (m4) coefficients
              Estimate   SE       z value   Pr(>|z|)
(Intercept)    0.7382    0.1424    5.185    2.16e−07 ***
dnat.syl      −2.6231    0.2749   −9.541    <2e−16 ***

(C) Fixed-effects size
Effect     R²      Lower CL   Upper CL
m4         0.045   0.033      0.058
dnat.syl   0.045   0.033      0.058
Discussion
In the current study, we set out to characterise the possible role of isochrony in speech perception in noise. Isochrony is contrasted with naturalness, where the former refers to an ideally perfectly regular timing of speech units, and the latter to the timing of speech units as they occur in naturally produced speech. We included a third set of anisochronous conditions, in which the timing of speech events bears the same degree of temporal distortion from naturally timed speech as isochronous speech, while being irregular. We tested temporally modified sentences at the accent and the syllable levels, two hierarchically nested linguistic levels in English and French. These two languages are traditionally described as being representative of two distinct rhythmic classes, based on a hypothetical underlying isochrony of accent versus syllable units respectively in natural speech production12,13,20.
A first important result of this study is the replication from English to French that isochronous forms of speech are always less intelligible than naturally timed speech. In fact, in a paradigm like the current one, where the internal rhythmic structure of speech is changed but the sentence duration remains the same, any temporal distortion to the naturally produced timings of speech units appears to be detrimental, with the amount of temporal distortion being a strong predictor of the intelligibility decrease. This result goes against the hypothesis that isochronous speech, by virtue of a supposedly ideally regular timing of its units, would be easier to track by alleviating the need for constant phase-resetting. For traditional linguistic accounts too, these results further debunk the isochrony hypothesis, which assumes that produced speech would be based on an underlying ideally isochronous form12, see review in13,20. In fact, our results suggest that the natural timing statistics are actively used by listeners in decoding words, even though each sentence is encountered only once by listeners.
At the condition level, while an advantage of isochronous over anisochronous forms of speech was found for English19, no such trend was observed for French. While possible prosodic or idiosyncratic effects might account for this difference (see Supplementary Materials), this led us to examine, separately by condition, the relationship between intelligibility and the four timing metrics, namely departure from the natural timing of speech units and departure from a hypothetical underlying isochronous form of those speech units. This analysis unveiled a rather consistent portrait across the two talkers of the two languages, according to the type of sentence retiming. For naturally timed sentences, intelligibility correlates with the degree of departure from an ideal isochronous form of the sentence at the syllable level. That is, naturally isochronous sentences at the syllable level are significantly better recognised than naturally anisochronous sentences. For isochronously retimed sentences, intelligibility was strongly correlated with departure from naturality (see also Fig. 1, bottom). Importantly, this result is valid only for the syllable level—departure from natural timing of accent groups did not explain intelligibility variation. For anisochronous sentences, the two syllabic rhythm metrics, that is, departure from naturally timed syllables and departure from syllabic isochrony, were found to be correlated with intelligibility, though the latter to a much lesser extent. This shows that both temporal dimensions of syllabic rhythm are actively relied upon in speech comprehension. In addition, the simultaneous variation of the two temporal dimensions for this type of stimuli made it possible to estimate the relative size of these effects: the role of departure from naturally timed syllables was about ten times stronger than the role of departure from isochronously timed syllables (Table 4C)—a ratio however possibly underestimated owing to the necessary correlation that exists between the two metrics, by design for these sentences.
Table 4. (A) Initial (m5) and equivalent simpler (m6) model for the role of departure from isochrony and natural rhythm in anisochronously retimed sentences. The formulae of the fixed effects are given for the two models, and the result of a likelihood-ratio test between the two models is given on the right of the vertical separator. (B) m6 model coefficients, with associated p values. (C) Fixed-effect sizes with lower and upper confidence levels.

(A) Model selection
Model  Fixed effects                                           AIC     | χ²     Df  p(>χ²)
m5     language × dnat.acc × dnat.syl × diso.acc × diso.syl   13,610  | 39.35  29  0.095
m6     dnat.syl + diso.syl                                    13,591  |

(B) Equivalent model (m6) coefficients
              Estimate   SE       z value   Pr(>|z|)
(Intercept)    0.9614    0.1669    5.760    8.38e−09 ***
dnat.syl      −2.9418    0.3852   −7.637    2.22e−14 ***
diso.syl      −0.5311    0.2742   −1.937    0.0527 .

(C) Fixed-effects size
Effect     R²      Lower CL   Upper CL
m6         0.059   0.046      0.073
dnat.syl   0.030   0.021      0.041
diso.syl   0.003   0.000      0.007
Again, the regularity of accent group timing did not provide any significant increase in intelligibility in any condition—contrary to natural sentences, as we will discuss later.
Crucially, none of the three analyses found language to be a contributing factor in explaining intelligibility, with the same pattern of results applying to speech produced by both the English and the French talkers. Of course, as we tested only one talker in each language, there is the possibility that an observed difference (or lack thereof) between languages might be imputable to confounded talker characteristics (see Supplementary Material for an analysis of talker idiosyncrasies on a range of acoustic parameters). With that caveat in mind, given that those languages were selected as being representative of different rhythmic classes, the language-independent nature of the results may have two implications. First, it suggests that isochrony effects could apply across the board to any language, independently of its rhythmic patterning at a global level—though, of course, this will have to be tested on other languages and multiple speakers in future studies. Second, our results suggest that the syllable is a core unit in temporal speech processing. While a primary role for the syllable in speech processing has long been proposed7,21,22 (but see8), the fact that isochrony effects are stronger for syllables than accent groups even for a stress-timed language (which would supposedly favour accent-group isochrony) indicates that the temporal scale associated with the syllable is key—rather than its linguistic functional value. Instead, our results further support the usefulness of the notion of a neurolinguistic unit defined at the crossroads of linguistic and neurophysiological constraints, such as the so-called "theta-syllable"23, which would apply universally to languages independently of their linguistically defined rhythm class. We suggest that this unit combines high-level, perceptually grounded temporal anchoring, with the P-centre being a good candidate, with physiologically based temporal constraints determined by neural processing in the theta range of 4–8 Hz (or up to 10 Hz, see24,25).
Taken together, the data presented here are compatible with a model of cortical organisation of speech processing with co-occurring and inter-dependent bottom-up and top-down activity25,26. While early studies have focussed on exogenous effects of rhythm on cortical activity, the role of endogenous activity is being progressively acknowledged and understood27. In this study, we use intelligibility as a proxy for successful cortical processing, following the long-established link between intelligibility and entrainment28–30. Our data allow us, across two languages with markedly different rhythmic structures, to suggest unifying principles: the dependence of intelligibility of naturally timed sentences on their underlying isochrony suggests a bottom-up response component that benefits from regularity at the syllable level. In isochronous sentences, the strong response variation with departure from the sentences' natural rhythm indicates that learned patterns are a strong determinant of speech processing. Unifying both distortion types, anisochronous sentences display a graded dependence on both temporal distortions, and additionally provide a quantification of the relative magnitude of the effects, since in this type of sentences top-down processing has a stronger role than bottom-up information. These hypotheses need to be backed up with neurophysiological data, and a first work in that framework is reported in31.
A point of interest in our results is the positive correlation in natural sentences between intelligibility and departure from accent-level isochrony in both English and French. Local intelligibility variations associated with regular versus irregular accent-level prominences could explain this effect: irregular accent-level units in a sentence might trigger fewer but louder prominences compared to isochronously distributed accent units, which in turn may prove more intelligible when mixed at a fixed signal-to-noise ratio over the entire sentence, as opposed to a more uniform distribution of energetic masking over the sentence. The psychoacoustic consequences of this hypothesis are difficult to control and outside the scope of the current study, but could open productive avenues in speech-intelligibility-enhancing modifications32.
The sentence-long speech material employed here can be considered a good match to ecologically valid speech perception conditions (over more classical controlled tasks involving lexical decision or segment-level categorical perception33). In everyday communicative settings however, listeners do make use of a much larger temporal context for successfully decoding speech, efficiently incorporating semantic, prosodic or discursive cues from a temporal span in the range of minutes rather than seconds. It is however difficult to predict the potential effect that isochronously retimed speech might produce in a broader temporal context, as it may become an explicit feature of the stimulus to be perceived. Indeed, none of the participants in the current study reported the isochronous quality of modified speech, but with sustained stimulation beyond the duration of a sentence, listeners may be able to project a temporal canvas onto upcoming speech. One could hypothesise both a facilitatory effect due to the explicit nature of the isochronous timing, as found in rhythmic priming studies34–36, but also a detrimental effect due to the sustained departure from the optimally timed delivery of information found in natural speech, as defended here. The relevance of examining long-term oscillatory entrainment in the delta and theta range also needs to be considered, although phase-locking effects beyond the duration of a phrase seem unlikely37,38.
In conclusion, in light of the data presented here, we propose an account of the role of isochrony in speech perception which unifies previous hypotheses on the role of isochrony in speech production and perception39,40 but, importantly, quantifies its contribution with respect to the "elephant-in-the-room" factor, that is, the natural timing of speech. First, to avoid any ambiguity, we propose that isochrony does not have a primary role in speech perception or production. Indeed, through the effect of the temporal manipulations of speech reported here, it is clear that isochronous forms of speech do not provide a benefit in speech recognition in noise. In fact, the natural timing statistics of speech appear to contain an essential ingredient used by listeners in the form of learned information, to which listeners attend by recruiting top-down cortical activity, and which allows them to process sentences even though they are presented for the first time and degraded by noise. We show however that isochrony does play a role, albeit a secondary one, which is compatible with a by-product of neuronal models of oscillatory activity41–43. Importantly, the theta-syllable provides the pivot between isochrony in speech and oscillations in the brain in this context. Entrainment to oscillatory activity has received a lot of attention in recent years and broad experimental support, in particular in specific experimental settings where the material is explicitly presented in an isochronous form (in visual, auditory and musical contexts38,44–47). The fact that isochrony effects are observed in a context where isochrony is not an explicit characteristic of the experimental material suggests that these effects could reflect physiological mechanisms which, if sustained, can lead to the kind of behavioural outcomes reported in controlled setups. In sum, we suggest that isochrony is not a requirement for successful speech processing but rather a kind of underlying timing prior, actively suppressed by the necessities of flexible and efficient linguistic information encoding. Because of the plausible neuroanatomical architecture of cortical processing however, isochrony seems to represent a canonical form of timing that can be seen as an attractor, and thereby enjoys a special status in perception and production generally, as in poetry, music and dance.
Methods
Speech material and annotation. The speech material consisted of sentences taken from the Harvard corpus for English15 and the Fharvard corpus for French16. Both corpora are highly comparable in their structure: they both contain 700 sentences or more, each sentence is composed of 5–7 keywords (exactly 5 keywords for French) with mild overall semantic predictability, and sentences are phonetically balanced into lists of 10 sentences. For the current study, we used a subset of 180 sentences from each corpus, recorded by a female talker for English48 and by a male talker for French49. Each sentence subset was randomly sampled from all sentences of each corpus.
Sentences were annotated at two hierarchical rhythmic levels: the accent group and the syllable level. The syllable level was taken as the lowest hierarchical rhythmical unit in both languages, and the accent group was defined as the next step up the rhythmic hierarchy, i.e., the stressed syllable in English and the accentual phrase in French50,51. While annotation of stressed syllables in English was relatively straightforward, accentual phrase boundaries proved more difficult to determine unambiguously, as they combine several factors, from the syntactic structure of the sentence to the particular intonational pattern used by the talker producing the sentence. We relied on an independent annotation of accent group boundaries done by three native French speakers. From an initial set of 280 sentences, 181 sentences received full inter-annotator agreement, and the first 180 sentences of that set were selected for the current study.
The P-centre associated with a rhythmic unit corresponds to the onset of the vowel associated with that unit in the case of simple consonant-vowel syllables, and is usually advanced in the case of complex syllable onsets, or when the vowel is preceded by a semi-vowel. In both corpora, P-centre events were first automatically positioned following an automatic forced alignment procedure52, then manually checked and corrected if necessary, typically when a schwa was inserted or deleted in a given syllable. Given the hierarchical relationship between accent groups and syllables, the accent group P-centre aligns with the corresponding syllable P-centre. See Supplementary Materials for auditory examples of annotated sentences at the accent group and syllable levels.
Stimuli. Sentences were temporally modified by locally accelerating or slowing down adjacent successive speech segments. Unmodified time onsets of either accent (acc) or syllable (syl) rhythmic units served as a reference for the natural rhythm (NAT) condition, from which isochronous (ISO) and anisochronous (ANI) conditions were defined, as detailed below.
We rst dened t the reference time series identifying the time onsets of the N corresponding rhythmic units of
a sentence anked with the sentence endpoints, i.e.,
t=t0,t1,...,tN,tN+1
, with
t0
and
tN+1
respectively the start
and end of the sentence. We noted d the associated inter-rhythmic unit durations, i.e.,
di
=
ti+1
ti,i
=
0, ...,N
.
We then dened the target time series
t
and the associated durations
d
, the resulting values aer temporal
transformation. e initial and nal portions of the sentence were le unchanged, i.e.,
t
0
=t
0
,t
1
=t
1
,t
N
=t
N
and
t
N+1
=t
N
+
1
, hence
d
0
=d
0
, and
d
N
=d
N
. For the ISO condition, the target durations were set to the average
duration of the corresponding intervals in the reference time series NAT, i.e.,
d
i
=
t
N
t
1
N1
,
i=1, ...,N1
. For
the ANI condition, we imposed that sentences were temporally transformed by the same quantity as isochronous
sentences, but resulted in an unpredictable rhythm. We achieved this by using the following simple heuristic, con-
sisting of applying an isochronous transformation to the time-reversed rhythmic units events. First, the reference
time series was replaced by a pseudo reference time series made of pseudo events such that successive pseudo ref-
erence durations were the time reversal of the original reference durations, i.e.,
rev(di)
=
dNi,i
=
1, ...,N
1
;
then, the target time series ANI were computed from this reversed sequence by equalising the temporal distance
of the pseudo events as in the ISO condition.
Temporal transformation was then operated by linearly compressing or expanding the successive speech segments identified by the reference time series, applying the duration ratio of target to reference speech segments, i.e., applying a time-scale step function $\tau_i = \frac{d'_i}{d_i}$, $i = 1, \ldots, N$, to the speech signal of the sentence. This was achieved using WSOLA53, a high-quality pitch-preserving temporal transformation algorithm.
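For concreteness, here is a minimal numpy sketch of how the ISO and ANI target onset series and the time-scale step function could be derived (function names are ours; onsets are assumed to be in seconds, and the actual waveform transformation is delegated to a WSOLA implementation, not reproduced here):

```python
import numpy as np

def iso_targets(t):
    """Equalise the interior durations of an onset series
    t = [t0, t1, ..., tN, tN+1]. Endpoints and the first/last units stay
    fixed; interior inter-unit durations are set to (tN - t1)/(N - 1)."""
    t = np.asarray(t, dtype=float)
    d_target = np.diff(t)
    d_target[1:-1] = (t[-2] - t[1]) / (len(t) - 3)
    return t[0] + np.concatenate([[0.0], np.cumsum(d_target)])

def ani_reference(t):
    """Pseudo reference series whose interior durations are the time
    reversal of the original ones (the ANI heuristic)."""
    t = np.asarray(t, dtype=float)
    d = np.diff(t)
    d_rev = np.concatenate([d[:1], d[1:-1][::-1], d[-1:]])
    return t[0] + np.concatenate([[0.0], np.cumsum(d_rev)])

def time_scale_factors(t_ref, t_target):
    """Per-segment step function tau_i = d'_i / d_i, fed to the
    time-scale modification (WSOLA in the study)."""
    return np.diff(t_target) / np.diff(t_ref)

# ISO: segments delimited by the natural onsets t are scaled to iso_targets(t).
# ANI: segments delimited by ani_reference(t) are scaled to
# iso_targets(ani_reference(t)); the real unit onsets then land at
# irregular times while the net distortion matches the ISO condition.
```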
Altogether, we obtained 5 temporal versions of each sentence in the corpus: the unmodified natural version (NAT), the isochronous stimuli at the accent (ISO.acc) and syllable (ISO.syl) levels, and the anisochronous stimuli at the accent (ANI.acc) and syllable (ANI.syl) levels. Figure 3 shows the result of the temporal transformation of an example sentence for the first 3 conditions. See Supplementary Materials for examples of retimed stimuli.
Final experimental stimuli were constructed by mixing the temporally transformed sentence with speech-shaped noise at a signal-to-noise ratio of −3 dB. This value was determined in previous studies16 to elicit a keyword recognition score of around 60% in the unmodified speech condition, in turn providing maximum sensitivity to the difference between unmodified and modified speech conditions. Speech-shaped noise was constructed, separately for each talker, by filtering white noise with a 200-pole LPC filter computed on the concatenation of all sentences of the corpus recorded by that talker.
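As a rough sketch of this step (assuming 1-D float arrays at a shared sampling rate; librosa.lpc is one readily available LPC estimator, though the study does not state which implementation was used):

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def speech_shaped_noise(all_sentences, n_samples, order=200, rng=None):
    """White noise filtered by a 200-pole LPC model of one talker's
    long-term spectrum; `all_sentences` is the concatenation of all of
    that talker's recorded sentences."""
    rng = np.random.default_rng() if rng is None else rng
    a = librosa.lpc(all_sentences.astype(float), order=order)  # all-pole coefficients
    return lfilter([1.0], a, rng.standard_normal(n_samples))   # impose speech spectrum

def mix_at_snr(speech, noise, snr_db=-3.0):
    """Scale the noise so the speech-to-noise power ratio equals snr_db,
    then add it to the speech."""
    noise = noise[:len(speech)]
    gain = np.sqrt(np.mean(speech**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
    return speech + gain * noise
```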
Temporal distortion metrics. We quantified the net amount of temporal distortion $\delta$ applied to a given sentence by computing the root mean square of the log-transformed time-scale step function $\tau$ associated to that sentence. Log-transformation was done so that compression and elongation by inverse values (e.g., $\times\frac{1}{2}$ and $\times 2$ respectively) would contribute equally to the overall distortion measure (i.e., $\log(\frac{1}{2})^2 = \log(2)^2$). Binary logarithm was used here. Using our notation referring to discrete events, the temporal distortion from the reference time series $\mathbf{t}$ with $N$ events to the target time series $\mathbf{t}'$ can therefore be written:

$$\delta = \sqrt{\frac{\sum_{i=1}^{N} (\log \tau_i)^2\, d_i}{\sum_{i=1}^{N} d_i}} \qquad (1)$$

where $i = 1, \ldots, N$ is the event index, $\tau$ the time-scale factors linking $\mathbf{t}$ and $\mathbf{t}'$, and $\mathbf{d}$ the durations between successive reference events, as defined in the preceding section. Note that the $d_i$ term in the numerator emerges from the grouping of samples between successive reference events, since for all these samples $\tau_i$ values are constant by design.

By design, individual sentences undergo an identical amount of temporal distortion in isochronous and anisochronous transformations. By extending the application of the $\delta$ function to time instants other than the ones used for stimulus construction, we introduce additional metrics to evaluate the departure of the rhythm of any sentence—whether temporally modified or not—from two canonical rhythm types associated with the sentence: its unmodified natural rhythm and its isochronous counterpart. For the two hierarchical rhythmic levels considered here, this amounts to 4 new metrics: departure from naturally timed accent groups or syllables (respectively dnat.acc and dnat.syl), and departure from isochronous accent groups or syllables (respectively diso.acc and diso.syl, see Table 5).
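Equation (1) translates directly into code; a minimal numpy sketch, using the binary logarithm as in the study:

```python
import numpy as np

def delta(t_ref, t_target):
    """Net temporal distortion of Eq. (1): duration-weighted RMS of the
    binary-log time-scale factors between two onset series."""
    d = np.diff(np.asarray(t_ref, dtype=float))      # reference durations d_i
    tau = np.diff(np.asarray(t_target, dtype=float)) / d
    return float(np.sqrt(np.sum(np.log2(tau)**2 * d) / np.sum(d)))
```

By construction delta(t, t) = 0, and, computed over the segmentation used for stimulus construction, the ISO and ANI transformations of a sentence yield the same value.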
Figure3. Annotation of an example sentence (translation: e red neon lamp makes his/her hair iridescent), in
its original unmodied natural timing (A) and transformed isochronous forms at the accent (B) and syllable
(C) levels. For each panel, from top to bottom: spectrogram with target boundaries used for the transformation
overlaid in dashed lines, accent group onsets (red), syllable onsets (orange), phonemes and words.
Participants and procedure. Data for English are based on 26 participants (21 females) with a mean age of 20.9 years (SD = 6.3), all speaking Australian English as a native language and with no known hearing problems; they are described in detail in19. We include the data for English previously reported in19 for the purpose of comparison with the French data. New analyses are conducted on this dataset (see "Results"). All experimental protocols were approved by the Human Research Ethics Committee of Western Sydney University under the reference H9495. For French, 27 participants (15 females) with a mean age of 26.7 years (SD = 8.8) were recruited from the student and staff population of the University of Grenoble-Alpes (UGA) and were given a 15-euro gift card as compensation for their participation. We checked that all participants met the selection criteria, which included speaking French as a native language and having no known hearing problems. No data were removed from the initial set. All experimental protocols were approved by UGA's ethics committee (CERGA, agreement IRB00010290-2017-12-12-33). For both the English and French studies, all methods were carried out in accordance with relevant guidelines and regulations, and informed consent was obtained from all participants.
For both language groups, participants were given written instructions and the experimenter gave complementary information when necessary. Participants then sat in front of a computer screen, where short on-screen instructions were given before each block of the experiment. Participants were presented with speech and noise mixtures played binaurally through Beyer Dynamic DT 770 Pro 80 ohm closed headphones at a comfortable level set at the beginning of the experiment and maintained constant throughout the experiment. Participants had to type what they heard on a keyboard and press "Enter" to trigger the presentation of the next stimulus. Stimuli were grouped by condition, forming 5 blocks of 36 sentences each. An additional 5 stimuli from each condition were presented as practice at the beginning of the experiment and were not used for further analysis. The order of conditions was counterbalanced across participants, and the order of sentences was pseudo-randomised for each participant. The order of the practice sentences was fixed across all participants, and the order of conditions for practice sentences was also fixed, to: NAT, ISO.acc, ANI.acc, ISO.syl, ANI.syl.
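The exact counterbalancing scheme is not specified in the paper; one plausible sketch (the rotation design and all names here are our assumptions):

```python
import random

CONDITIONS = ["NAT", "ISO.acc", "ANI.acc", "ISO.syl", "ANI.syl"]

def condition_order(participant_index):
    """Rotate the block order across participants so each condition
    appears equally often in each serial position."""
    k = participant_index % len(CONDITIONS)
    return CONDITIONS[k:] + CONDITIONS[:k]

def sentence_order(sentence_ids, participant_index):
    """Pseudo-randomise sentence order per participant, reproducibly."""
    rng = random.Random(participant_index)
    ids = list(sentence_ids)
    rng.shuffle(ids)
    return ids
```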
Scoring. Sentences were automatically scored with custom-made scripts that matched keywords with listeners' typed responses. A dictionary of variants was used in both languages to correct for homophones, full-letter spelling of digits and common spelling mistakes (e.g., *'ciggar' corrected to 'cigar' in English, and *'cigne' corrected to 'cygne' ('swan') in French). Each sentence received a final score as the proportion of keywords correctly recognised.
The accuracy of the automatic scoring was evaluated on a 530-sentence subset of the listeners' responses (around 5.5% of all 9450 responses), randomly and uniformly sampled across subjects, conditions and languages. The subset was manually scored, and we found that 98% of the sentences were correctly scored by the automatic procedure, with each of the 10 incorrectly scored sentences having no more than one word typed with a spelling mistake that was absent from the dictionary (and therefore added to it for future studies).
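The keyword-matching logic can be sketched as follows (a minimal sketch with our own naming; the authors' actual scripts and variant dictionaries are not reproduced here):

```python
import re

def score_sentence(keywords, response, variants):
    """Proportion of keywords found in a typed response, after mapping
    known variants (homophones, spelled-out digits, common misspellings)
    to canonical forms, e.g. variants = {"ciggar": "cigar"}."""
    tokens = re.findall(r"[\w'-]+", response.lower())
    tokens = {variants.get(tok, tok) for tok in tokens}
    hits = sum(1 for kw in keywords if kw.lower() in tokens)
    return hits / len(keywords)
```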
Data modelling. The effect of experimental condition on intelligibility was analysed for French following the previously reported analysis of the English data19. A generalised linear mixed-effect model (function glmer from the R package lme454) was fitted to intelligibility scores, including a random intercept term by subject. Normal distribution of residuals was visually verified. Generalised simultaneous hypotheses were formulated and tested with the function glht from the R package multcomp55, which corrects for multiple comparisons.
Analysis of the contribution of the temporal distortion metrics to intelligibility was performed on combined French and English data. Three models were fitted, one for each of the three subsets of the data (see Fig. 2 and "Results"). Fixed effects were language (French and English) and the metrics that were not all-zero valued for the given data subset. The random-effect structure included a term by sentence and by participant. For each model, we report an analysis of variance between the initial full model and the minimal equivalent model. The latter is obtained from the initial model by incrementally removing fixed-effect terms until no term can be removed without significantly changing the explained variance. Visual distribution of residuals was checked. Fixed-effect sizes were computed with the function r2beta from the R package r2glmm56.
Table 5. Analysis metrics to evaluate departure from naturally-timed and isochronous forms at the accent and syllable level (rows) in each of the 5 experimental conditions (columns). Each argument of the $\delta$ function (Eq. 1) is a time series of either accent (acc) or syllable (syl) onsets, as they occur in a given experimental condition. For example, $t_{syl}^{ISO.acc}$ represents the syllable onsets of a sentence as they occur in the transformed ISO.acc experimental condition. Note that some of these distortions are equal to 0 by construction: they are dnat.acc and dnat.syl for NAT sentences, and diso.acc and diso.syl for ISO.acc and ISO.syl sentences respectively.

          NAT                              ISO.acc                              ISO.syl
dnat.acc  δ(t_acc^NAT, t_acc^NAT)          δ(t_acc^NAT, t_acc^ISO.acc)          δ(t_acc^NAT, t_acc^ISO.syl)
dnat.syl  δ(t_syl^NAT, t_syl^NAT)          δ(t_syl^NAT, t_syl^ISO.acc)          δ(t_syl^NAT, t_syl^ISO.syl)
diso.acc  δ(t_acc^ISO.acc, t_acc^NAT)      δ(t_acc^ISO.acc, t_acc^ISO.acc)      δ(t_acc^ISO.acc, t_acc^ISO.syl)
diso.syl  δ(t_syl^ISO.syl, t_syl^NAT)      δ(t_syl^ISO.syl, t_syl^ISO.acc)      δ(t_syl^ISO.syl, t_syl^ISO.syl)

          ANI.acc                          ANI.syl
dnat.acc  δ(t_acc^NAT, t_acc^ANI.acc)      δ(t_acc^NAT, t_acc^ANI.syl)
dnat.syl  δ(t_syl^NAT, t_syl^ANI.acc)      δ(t_syl^NAT, t_syl^ANI.syl)
diso.acc  δ(t_acc^ISO.acc, t_acc^ANI.acc)  δ(t_acc^ISO.acc, t_acc^ANI.syl)
diso.syl  δ(t_syl^ISO.syl, t_syl^ANI.acc)  δ(t_syl^ISO.syl, t_syl^ANI.syl)
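Using the delta and iso_targets sketches given above, the four per-sentence metrics of Table 5 could be computed as follows (again a sketch with our own naming; the onset dictionaries are assumed to hold accent- and syllable-level onset series with matching event counts):

```python
def rhythm_metrics(nat, cond):
    """Four departure metrics of Table 5. `nat` and `cond` map rhythmic
    level ("acc", "syl") to onset series in the natural version and in
    the experimental condition at hand."""
    return {
        "dnat.acc": delta(nat["acc"], cond["acc"]),
        "dnat.syl": delta(nat["syl"], cond["syl"]),
        "diso.acc": delta(iso_targets(nat["acc"]), cond["acc"]),
        "diso.syl": delta(iso_targets(nat["syl"]), cond["syl"]),
    }
```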
Data availability
Example stimuli illustrating P-centre annotation and temporal modification, as well as computer code for the temporal distortion metrics, are included in Supplementary Materials. Experimental stimuli and listeners' response data are available for download at: https://doi.org/10.5281/zenodo.3966475.
Received: 29 April 2020; Accepted: 28 October 2020
References
1. Bishop, G. H. Cyclic changes in excitability of the optic pathway of the rabbit. Am. J. Physiol. Leg. Content 103, 213–224 (1932).
2. Schroeder, C. E. & Lakatos, P. Low-frequency neuronal oscillations as instruments of sensory selection. Trends Neurosci. 32, 9–18 (2009).
3. Schroeder, C. E., Wilson, D. A., Radman, T., Scharfman, H. & Lakatos, P. Dynamics of active sensing and perceptual selection. Curr. Opin. Neurobiol. 20, 172–176 (2010).
4. Ahissar, E. et al. Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc. Natl. Acad. Sci. USA 98, 13367–13372 (2001).
5. Luo, H. & Poeppel, D. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54, 1001–1010. https://doi.org/10.1016/j.neuron.2007.06.004 (2007).
6. Greenberg, S. Speaking in shorthand: a syllable-centric perspective for understanding pronunciation variation. Speech Commun. 29, 159–176 (1999).
7. Ghitza, O. On the role of theta-driven syllabic parsing in decoding speech: intelligibility of speech with a manipulated modulation spectrum. Front. Psychol. 3, 238. https://doi.org/10.3389/fpsyg.2012.00238 (2012).
8. Cummins, F. Oscillators and syllables: a cautionary note. Front. Psychol. 3, 364. https://doi.org/10.3389/fpsyg.2012.00364 (2012).
9. Steele, R. An Essay Towards Establishing the Melody and Measure of Speech to Be Expressed and Perpetuated by Peculiar Symbols (J. Nichols, London, 1779).
10. James, A. L. Speech Signals in Telephony (Sir I. Pitman & Sons Ltd, London, 1940).
11. Pike, K. L. The Intonation of American English (University of Michigan Press, Ann Arbor, 1945).
12. Abercrombie, D. Elements of General Phonetics (Aldine, London, 1967).
13. Cummins, F. Rhythm and speech. In The Handbook of Speech Production (ed. Redford, M. A.) 158–177 (Wiley, Chichester, 2015).
14. Peelle, J. E. & Davis, M. H. Neural oscillations carry speech rhythm through to comprehension. Front. Psychol. 3, 320. https://doi.org/10.3389/fpsyg.2012.00320 (2012).
15. Rothauser, E. H. et al. IEEE recommended practice for speech quality measurements. IEEE Trans. Audio Acoust. 17, 225–246 (1969).
16. Aubanel, V., Bayard, C., Strauss, A. & Schwartz, J.-L. The Fharvard corpus: a phonemically-balanced French sentence resource for audiology and intelligibility research. Speech Commun. 124, 68–74. https://doi.org/10.1016/j.specom.2020.07.004 (2020).
17. Morton, J., Marcus, S. & Frankish, C. Perceptual centers (P-centers). Psychol. Rev. 83, 405 (1976).
18. Scott, S. K. P-centers in speech: an acoustic analysis. Ph.D. thesis, UCL, London, UK (1993).
19. Aubanel, V., Davis, C. & Kim, J. Exploring the role of brain oscillations in speech perception in noise: intelligibility of isochronously retimed speech. Front. Hum. Neurosci. https://doi.org/10.3389/fnhum.2016.00430 (2016).
20. Arvaniti, A. The usefulness of metrics in the quantification of speech rhythm. J. Phon. 40, 351–373. https://doi.org/10.1016/j.wocn.2012.02.003 (2012).
21. Mehler, J., Dommergues, J. Y., Frauenfelder, U. & Segui, J. The syllable's role in speech segmentation. J. Verb. Learn. Verb. Behav. 20, 298–305 (1981).
22. Greenberg, S., Carvey, H., Hitchcock, L. & Chang, S. Temporal properties of spontaneous speech—a syllable-centric perspective. J. Phon. 31, 465–485. https://doi.org/10.1016/j.wocn.2003.09.005 (2003).
23. Ghitza, O. The theta-syllable: a unit of speech information defined by cortical function. Front. Psychol. https://doi.org/10.3389/fpsyg.2013.00138 (2013).
24. Ghitza, O. Behavioral evidence for the role of cortical
θ
oscillationsin determining auditory channel capacity for speech. Front.
Psychol.https ://doi.org/10.3389/fpsyg .2014.00652 (2014).
25. Peou, M., Arnal, L. H., Fontolan, L. & Giraud, A.-L.
θ
-band and
β
-band neural activity reects independent syllable tracking
and comprehension of time-compressed speech. J. Neurosci. 37, 7930–7938. https ://doi.org/10.1523/JNEUR OSCI.2882-16.2017
(2017).
26. Fontolan, L., Morillon, B., Liegeois-Chauvel, C. & Giraud, A.-L. e contribution of frequency-specic activity to hierarchical
information processing in the human auditory cortex. Nat. Commun. 5, 1–10 (2014).
27. Rimmele, J. M., Morillon, B., Poeppel, D. & Arnal, L. H. Proactive sensing of periodic and aperiodic auditory patterns. Tren ds
Cogn. Sci. 22, 870–882. https ://doi.org/10.1016/j.tics.2018.08.003 (2018).
28. Ding, N. & Simon, J. Z. Emergence of neural encoding of auditory objects while listening to competing speakers. Proc. Natl. Acad.
Sci. USA 109, 11854–11859 (2012).
29. Ding, N. & Simon, J. Z. Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. J. Neurosci.
33, 5728–5735 (2013).
30. Peelle, J. E., Gross, J. & Davis, M. H. Phase-locked responses to speech in human auditory cortex are enhanced during comprehen-
sion. Cereb. Cortex 23, 1378–1387. https ://doi.org/10.1093/cerco r/bhs11 8 (2013).
31. Strauss, A., Aubanel, V., Giraud, A.-L. & Schwartz, J.-L. Bottom-up and top-down processes cooperate around syllable P-centers
to ensure speech intelligibility. (submitted).
32. Cooke, M., Aubanel, V. & Lecumberri, M. L. G. Combining spectral and temporal modification techniques for speech intelligibility enhancement. Comput. Speech Lang. 55, 26–39. https://doi.org/10.1016/j.csl.2018.10.003 (2019).
33. Giraud, A.-L. & Poeppel, D. Speech perception from a neurophysiological perspective. In The Human Auditory Cortex (eds Poeppel, D. et al.) 225–260 (Springer, New York, 2012). https://doi.org/10.1007/978-1-4614-2314-0_9.
34. Cason, N. & Schön, D. Rhythmic priming enhances the phonological processing of speech. Neuropsychologia 50, 2652–2658 (2012).
35. Cason, N., Astesano, C. & Schön, D. Bridging music and speech rhythm: rhythmic priming and audio-motor training affect speech perception. Acta Psychol. 155, 43–50. https://doi.org/10.1016/j.actpsy.2014.12.002 (2015).
36. Haegens, S. & Zion-Golumbic, E. Rhythmic facilitation of sensory processing: a critical review. Neurosci. Biobehav. Rev. 86, 150–165. https://doi.org/10.1016/j.neubiorev.2017.12.002 (2018).
37. ten Oever, S. & Sack, A. T. Oscillatory phase shapes syllable perception. Proc. Natl. Acad. Sci. USA https://doi.org/10.1073/pnas.1517519112 (2015).
38. Hickok, G., Farahbod, H. & Saberi, K. The rhythm of perception. Psychol. Sci. 26, 1006–1013. https://doi.org/10.1016/j.cub.2013.11.006 (2015).
39. Dauer, R. M. Stress-timing and syllable-timing reanalyzed. J. Phon. 11, 51–62 (1983).
40. Lehiste, I. Isochrony reconsidered. J. Phon. 5, 253–263 (1977).
41. Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I. & Schroeder, C. E. Entrainment of neuronal oscillations as a mechanism of attentional selection. Science 320, 110–113. https://doi.org/10.1126/science.1154735 (2008).
42. Giraud, A.-L. & Poeppel, D. Cortical oscillations and speech processing: emerging computational principles and operations. Nature Neurosci. 15, 511–517. https://doi.org/10.1038/nn.3063 (2012).
43. Hyafil, A., Fontolan, L., Kabdebon, C., Gutkin, B. & Giraud, A.-L. Speech encoding by coupled cortical theta and gamma oscillations. eLife 4, e06213. https://doi.org/10.7554/eLife.06213.001 (2015).
44. Busch, N. A., Dubois, J. & VanRullen, R. The phase of ongoing EEG oscillations predicts visual perception. J. Neurosci. 29, 7869–7876. https://doi.org/10.1523/JNEUROSCI.0113-09.2009 (2009).
45. Nozaradan, S., Peretz, I., Missal, M. & Mouraux, A. Tagging the neuronal entrainment to beat and meter. J. Neurosci. 31, 10234–
10240 (2011).
46. Ding, N., Melloni, L., Zhang, H., Tian, X. & Poeppel, D. Cortical tracking of hierarchical linguistic structures in connected speech. Nature Neurosci. 19, 158–164. https://doi.org/10.1038/nn.4186 (2016).
47. van Atteveldt, N. et al. Complementary fMRI and EEG evidence for more efficient neural processing of rhythmic versus unpredictably timed sounds. Front. Psychol. 6, 1663. https://doi.org/10.3389/fpsyg.2015.01663 (2015).
48. Aubanel, V., Davis, C. & Kim, J. The MAVA corpus. https://doi.org/10.4227/139/59a4c21a896a3 (2017).
49. Aubanel, V., Bayard, C., Strauss, A. & Schwartz, J.-L. The Fharvard corpus. https://doi.org/10.5281/zenodo.1462854 (2018).
50. Jun, S. A. & Fougeron, C. A phonological model of French intonation. In Intonation: Analysis, Modeling and Technology (ed. Botinis,
A.) 209–242 (Kluwer Academic Publishers, Dordrecht, 2000).
51. Nespor, M. & Vogel, I. Prosodic Phonology: With a New Foreword Vol. 28 (Walter de Gruyter, Berlin, 2007).
52. Goldman, J.-P. EasyAlign: an automatic phonetic alignment tool under Praat. In Interspeech 3233–3236 (Florence, Italy, 2011).
53. Demol, M., Verhelst, W., Struyve, K. & Verhoeve, P. Efficient non-uniform time-scaling of speech with WSOLA. In International Conference on Speech and Computer (SPECOM), 163–166 (2005).
54. Bates, D., Mächler, M., Bolker, B. M. & Walker, S. C. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48. https://doi.org/10.18637/jss.v067.i01 (2015).
55. Hothorn, T., Bretz, F. & Westfall, P. Simultaneous inference in general parametric models. Biom. J. 50, 346–363 (2008).
56. Jaeger, B. r2glmm: Computes R Squared for Mixed (Multilevel) Models, R package version 0.1.2 edn. (2017).
Acknowledgements
This work was supported by the European Research Council under the European Community’s Seventh Framework Program (FP7/2007-2013 Grant Agreement No. 339152, “Speech Unit(e)s”). We thank Christine Nies for her help in collecting data and Silvain Gerber for assistance with statistical analysis.
Author contributions
V.A. and J.-L.S. conceived and designed the study, supervised data collection, analysed the data and wrote the
paper.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41598-020-76594-1.
Correspondence and requests for materials should be addressed to V.A.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional aliations.
Open Access is article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. e images or other third party material in this
article are included in the articles Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.
© The Author(s) 2020
The MAVA corpus (MARCS Auditory-Visual Australian recordings of IEEE sentences) is a collection of high quality audiovisual recordings of 205 phonetically balanced sentences from the IEEE sentence database, recorded by a native Australian English female talker. The audio channel is annotated at the word and phoneme level. In addition, for the video channel, frame-by-frame lip contour X Y coordinates are provided. The center of the lip region is used as a reference for deriving four video regions: full face, upper face, lower face and lips. All files are freely available for download under the Creative Commons BY-NC-SA licence.