Abstracts |
Plenary Talks | SPS1 |
SPS2 | SPS3 |
SPS4 | SPS5 |
OS1 | OS2 |
OS3 | OS4 |
OS5 | PS1 |
PS2 | PS3 |
PS4 | PS5 |
PS6 | PS7 |
PS8 | Vitrine
|
|
Keynote Speaker 1 - Tuesday, May 2, 10:00 - 10:45
|
1 of 3 |
Julia Hirschberg, Columbia University
Recognizing and Conveying Speaker State Prosodically
Extended Abstract:
A speaker's mental state is often conveyed by acoustic and prosodic factors, as well
as the words they choose and the gestures they use. Considerable research has
been done in recent years to detect emotional state in IVR systems, so that angry
or frustrated users can be directed to a human agent. Other research has sought to
identify a wider variety of emotions and intentions in recorded meetings, again from
acoustic and prosodic cues. From the perspective of speech generation, the problem
of conveying emotional state has emerged as a critical topic in the continuing
effort to make TTS systems sound more like real human beings. Computer game
designers as well as IVR system developers all cite the limits of prosodic and
emotional 'naturalness' as a barrier to using current systems.
In this talk I will describe ongoing research in the speech group at Columbia,
designed to expand the variety of speaker states which may be identified and
produced by acoustic and prosodic variation. I will describe recent work in the
detection of confidence and uncertainty in a physics tutoring system (joint work
with the University of Pittsburgh), work to identify the acoustic and prosodic
characteristics of 'charismatic' speech across cultures, and research into the acoustic
and prosodic indicators of deceptive speech (joint work with the University
of Colorado and SRI International). I will also describe recent progress in the
automatic detection of prosodic features which should make both recognition and
generation of the prosodic characteristics of speaker state more accurate.
| |
Keynote Speaker 2 - Wednesday, May 3, 09:00 - 09:45
|
2 of 3 |
Hartmut R. Pfitzinger, University of Munich
Five Dimensions of Prosody: Intensity, Intonation, Timing, Voice Quality, and Degree of Reduction
Extended Abstract:
This talk gives an overview of methods for analysis, modification, and synthesis
of the prosodic properties of speech. The term prosodic properties is supposed
to cover all phenomena that are not segmental and that are described on several
tiers parallel to the segmental tier. Firth [1] called them prosodies and according
to him e. g. distant assimilation (such as Turkish vowel harmony) is also covered
by his term. This very broad meaning of the term prosodies is much closer to my
view than the very common habit of saying prosody and meaning only intonation.
One of the main purposes of this talk is to demonstrate that the manipulation
of intonation or timing alone can sometimes produce prosodically contradictory
stimuli which in turn inconsistently degrade perception results.
I entitled the talk "Five Dimensions of Prosody" because the first three dimensions
intensity, intonation, and timing are very well known, and Campbell and Mokhtari
[2] named voice quality the fourth prosodic dimension. Obviously, I have added
another dimension, one which I would like to name the degree of reduction. The
remainder of the talk is concerned with describing the analysis, modification, and
synthesis of each of these five prosodies.
Intensity can be measured easily and reliably by means of root-mean-square, or rectifying
and averaging, or, more precisely, by smoothing the instantaneous amplitude
achieved via Hilbert transformation. It is often argued that intensity, whether
short-term or long-term, has obviously minor communicative functions as can be
seen from any broadcasting where the short-term amplitude is generally strongly
manipulated to maximize loudness of speech without any noticeable impact on the
meaning of the speech. But intensity should be taken into account when naturalness
is important (e. g. in speech synthesis) or e. g. when perception stimuli with
shifted word accents are necessary. In this case the shift of the intonation peak
should be accompanied by a shift of an intensity peak.
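As an illustration, the first two intensity measures named above (root-mean-square, and rectifying and averaging) can be sketched as follows. This is a minimal sketch under stated assumptions: the frame length, hop size, sample rate and test tone are arbitrary choices and do not come from the talk, and the Hilbert-envelope variant is left out because it needs an FFT library.

```python
import math


def frame_starts(n_samples, frame_len, hop):
    """Start indices of all complete frames."""
    return range(0, n_samples - frame_len + 1, hop)


def rms_contour(samples, frame_len, hop):
    """Short-term root-mean-square intensity, one value per frame."""
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
        for s in frame_starts(len(samples), frame_len, hop)
    ]


def rectified_average_contour(samples, frame_len, hop):
    """Short-term intensity by full-wave rectifying and averaging."""
    return [
        sum(abs(x) for x in samples[s:s + frame_len]) / frame_len
        for s in frame_starts(len(samples), frame_len, hop)
    ]


def to_db(value, floor=1e-12):
    """Convert a linear amplitude to decibels relative to 1.0."""
    return 20.0 * math.log10(max(value, floor))


# Hypothetical test signal: 100 Hz sine, amplitude 0.5, 8 kHz sample rate.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2.0 * math.pi * 100.0 * n / SAMPLE_RATE)
        for n in range(800)]

rms = rms_contour(tone, frame_len=160, hop=80)
avg = rectified_average_contour(tone, frame_len=160, hop=80)
print(len(rms), round(to_db(rms[0]), 2), round(to_db(avg[0]), 2))
```

For a stationary sine the two contours differ only by a constant factor, so either would serve to locate an intensity peak; they diverge for signals whose waveform shape changes over time.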
Intonation is more complicated: although there exist countless F0 or glottal epoch
detection algorithms (Hess [3] gives an overview), none of them is absolutely reliable
and many of them work only on high-quality speech recordings, or on a limited
range of F0 values, or use an inferior voiced/unvoiced-detection method, or are
sensitive to amplitude variations, or suffer from other shortcomings. However, it
turned out that the error rates of modern algorithms are sufficiently low and that the
remaining errors can often be removed by post-processing. Subsequent smoothing, extrapolation, and
parameterization are necessary to make intonation accessible to meaningful modification,
each of these methods with its own algorithmic problems. It turned out
that the command-response model of Fujisaki allows for a very powerful parameterization
of intonation contours in many languages [4]. Finally, pitch-synchronous
overlap and add (PSOLA) is a very effective way to synthesize the new signal but
it has problems especially with strong F0-changes and high-pitched female voices.
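The command-response model mentioned above can be sketched in its commonly published form: the logarithm of F0 is a base value plus the responses to impulse-like phrase commands and step-like accent commands. The sketch below is illustrative only. The constants alpha = 3.0, beta = 20.0 and gamma = 0.9 are typical textbook defaults, and the command timings and amplitudes are invented for the example rather than taken from the talk.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, with ceiling gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_f0, phrase_cmds, accent_cmds,
                alpha=3.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset_time, amplitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(base_f0)
    for onset, amplitude in phrase_cmds:
        log_f0 += amplitude * phrase_response(t - onset, alpha)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (accent_response(t - onset, beta, gamma)
                               - accent_response(t - offset, beta, gamma))
    return math.exp(log_f0)


# Hypothetical utterance: one phrase command and one accent command.
phrases = [(0.0, 0.5)]
accents = [(0.3, 0.6, 0.4)]
contour = [fujisaki_f0(0.01 * k, 100.0, phrases, accents) for k in range(150)]
print(round(max(contour), 1))
```

Modifying intonation then amounts to editing a handful of command times and amplitudes instead of the raw F0 samples, which is what makes such a parameterization attractive for resynthesis.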
Timing is even more complicated: In 1998 [5] I invented a model to estimate
perceptual local speech rate (PLSR), a prosodic contour similar to F0 contours,
easy to interpret and to modify. It is based on a linear combination of the local
syllable rate and the local phone rate both of which are estimated from manual
segmentations of phones and syllable centers, and it deviates on average
by 10 percent from the perceptual speech rate, which is precise enough for phonetic
studies and speech synthesis. Other methods such as Z-score-based duration contours
[6] suffer from probable inconsistencies in their essential knowledge bases,
i. e. prototypical mean durations and standard deviations of all speech segments,
and from nonlinear elasticity of the segments. For simple copy-synthesis, dynamic
time warping (DTW) is more appropriate since it needs no segmentation of either
speech signal. Speech pauses, and especially hesitations and repairs, further complicate
the timing structure of speech. Their positions and durations are difficult
to predict [7].
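The reason DTW needs no segmentation is that it aligns two sequences frame by frame. A minimal sketch of the textbook algorithm follows, assuming one scalar feature per frame and an absolute-difference local cost. Real copy-synthesis would use frame-wise spectral vectors and usually path constraints, neither of which is specified in the abstract.

```python
def dtw(a, b):
    """Return (total_cost, path) aligning sequence a to sequence b.

    path is a list of (index_in_a, index_in_b) pairs from start to end.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Hypothetical feature tracks: the second is a time-stretched copy of the first.
source = [1.0, 2.0, 3.0, 4.0]
target = [1.0, 1.0, 2.0, 3.0, 3.0, 4.0]
distance, warp = dtw(source, target)
print(distance, warp)
```

The warping path says which source frame corresponds to each target frame, so the local stretching or compression needed for copy-synthesis can be read directly from it.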
Voice quality is difficult to measure, modify, and synthesize. Convincing approaches
are based on epoch detection methods and inverse filtering techniques,
two significant sources of error. The goal is to obtain and parameterize the glottal-
flow waveform which is supposed to carry all voice quality properties. While the
paper of Campbell and Mokhtari [2] is based only on the normalized amplitude
quotient (NAQ), which represents a continuum from breathy to modal or even
pressed voice quality, a more holistic approach of Mokhtari, Pfitzinger, and Ishi
[8] consisted in applying a principal components analysis (PCA) to a database
of glottal-flow waveforms for the purpose of later reconstructing and interpolating
all underlying glottal-flow waveforms from just a few principal components
(PCs). A starting point to cover a wide range of laryngeal variations was the
typology of phonation by Laver [9] and his recordings. It turned out that the first
PC mainly accounted for F0 variations which raises the question as to whether
the prosodic dimension intonation is better subsumed under the fourth dimension
since variations of F0 also influence voice quality.
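The idea of reconstructing waveforms from a few principal components can be sketched as below. This is only an illustration under stated assumptions: it extracts a single component by power iteration in pure Python, and the "pulses" are an invented toy data set (scaled copies of one template), not glottal-flow data from [8].

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def first_principal_component(rows, iterations=200):
    """Return (mean, unit vector of the first PC) via power iteration."""
    mean = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    dim = len(mean)
    vec = [1.0 / math.sqrt(dim)] * dim  # arbitrary non-zero start vector
    for _ in range(iterations):
        # Multiply by the scatter matrix without forming it: X^T (X v).
        scores = [sum(x * v for x, v in zip(row, vec)) for row in centered]
        new = [sum(s * row[k] for s, row in zip(scores, centered))
               for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break  # start vector was orthogonal to all variation
        vec = [x / norm for x in new]
    return mean, vec


def reconstruct(row, mean, pc):
    """Approximate a waveform from the mean and its score on one PC."""
    score = sum((x - m) * p for x, m, p in zip(row, mean, pc))
    return [m + score * p for m, p in zip(mean, pc)]


# Hypothetical data: four scaled copies of one pulse template.
template = [0.0, 0.5, 1.0, 0.5, 0.0]
pulses = [[gain * t for t in template] for gain in (0.5, 1.0, 1.5, 2.0)]

mean, pc = first_principal_component(pulses)
rebuilt = reconstruct(pulses[0], mean, pc)
print([round(x, 3) for x in rebuilt])
```

Because the toy data vary along one direction only, one component reconstructs them exactly. Real waveform databases need several components, which could be obtained by repeating the procedure on the residual.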
The degree of reduction is hardly ever interpreted as a prosody, and for good reason:
in order to estimate the degree of reduction a huge amount of phonological
knowledge is necessary. That is, the canonical form must be known for any utterance
to count the number of elisions and insertions (effects in the time domain), and
the target formant frequencies (or articulatory target positions) must be known
to estimate the segmental undershoot (frequency domain effect). This is closely
related to Lindblom's HH theory of phonetic variation [10]. One problem is that
there is not only purely mechanical coarticulation, constrained by inherent properties
of the speech organs, but also coarticulation rules learned during language
acquisition. Even though this prosody is very difficult to estimate, first approaches
are highly desirable since it is very important when manipulating speech. E. g.
shifting the word accent from one syllable to another is a real problem because
the former unstressed syllable usually is produced in a strongly reduced way (in
English often as a Schwa) while the stressed syllable generally has a non-central
vowel quality and a longer duration. Thus, the target syllable should become
de-reduced and the source syllable reduced.
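The time-domain part of this estimate, counting elisions and insertions against a canonical form, can be sketched with a sequence alignment. The sketch below uses Python's standard difflib matcher, which is a swapped-in general-purpose aligner and not a method from the talk. The phone labels are hypothetical ASCII transcriptions chosen for the example, and the matcher does not guarantee a minimal edit script.

```python
from difflib import SequenceMatcher


def count_reduction_effects(canonical, realized):
    """Return (elisions, insertions, substitutions) between two phone lists."""
    elisions = insertions = substitutions = 0
    matcher = SequenceMatcher(a=canonical, b=realized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        removed, added = i2 - i1, j2 - j1
        if tag == "delete":
            elisions += removed
        elif tag == "insert":
            insertions += added
        elif tag == "replace":
            # Pair up what can be paired; the surplus is elided or inserted.
            substitutions += min(removed, added)
            elisions += max(0, removed - added)
            insertions += max(0, added - removed)
    return elisions, insertions, substitutions


# Hypothetical transcriptions: a reduced and an epenthetic realization.
reduced = count_reduction_effects(
    ["p", "r", "o", "b", "a", "b", "l", "i"],
    ["p", "r", "o", "b", "l", "i"])
epenthetic = count_reduction_effects(
    ["h", "a", "m", "s", "t", "e", "r"],
    ["h", "a", "m", "p", "s", "t", "e", "r"])
print(reduced, epenthetic)
```

Such counts cover only the time-domain effects. The frequency-domain part, segmental undershoot relative to target formant frequencies, needs acoustic measurements and is not addressed by this sketch.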
It should be clear that each of the above-mentioned prosodies has its segmental
and supra-segmental manifestation. Actually, from my point of view the terms
low-frequency components and high-frequency components of prosodies describe
the speech facts in a better way. In this view, even the articulatory movements,
and thus every detail that constitutes speech, could become prosodies.
At the end of the talk two applications of prosodic modifications are demonstrated:
one is speech morphing between two utterances of different speakers, which means
estimating equally-spaced intermediate utterances with all prosodic properties
changing in equal steps from one speaker to another. And the other application
is in the field of computer-aided language learning (CALL). Here, we try to show
that an automatic prosodic correction of the speech signal of a language learner
and its auditory feedback help the learner to acquire a foreign language faster than
by hearing the corrections spoken with the teacher's voice [11].
References
[1] Firth, J. R. (1948). Sounds and prosodies. Transactions of the Philological
Society, pp. 127-152.
[2] Campbell, N.; Mokhtari, P. (2003). Voice quality: the 4th prosodic dimension.
In Proc. of the XVth Int. Congress of Phonetic Sciences, vol. 3, pp. 2417-2420,
Barcelona.
[3] Hess, W. (1983). Pitch determination of speech signals: Algorithms and devices.
Springer-Verlag, Berlin, Heidelberg, New York.
[4] Fujisaki, H. (2004). Information, prosody, and modeling with emphasis on
tonal features of speech. In Proc. of the 2nd Int. Conf. on Speech Prosody, pp.
1-10, Nara, Japan.
[5] Pfitzinger, H. R. (1998). Local speech rate as a combination of syllable and
phone rate. In Proc. of ICSLP '98, vol. 3, pp. 1087-1090, Sydney.
[6] Campbell, W. N. (2000). Timing in speech: A multi-level process. In Horne,
M., ed., Prosody: Theory and experiment. Studies presented to Gösta Bruce, pp.
281-334. Kluwer Academic Publishers, Dordrecht.
[7] Pfitzinger, H. R.; Reichel, U. D. (2006). Text-based and signal-based prediction
of break indices and pause durations. In Proc. of the 3rd Int. Conf. on Speech
Prosody, Dresden, Germany.
[8] Mokhtari, P.; Pfitzinger, H. R.; Ishi, C. T. (2003). Principal components of
glottal waveforms: Towards parameterisation and manipulation of laryngeal voice quality.
In Proc. of the ISCA Tutorial and Research Workshop on Voice Quality:
Functions, Analysis and Synthesis (Voqual'03), pp. 133-138, Geneva.
[9] Laver, J. (1980). The phonetic description of voice quality. Cambridge University
Press, Cambridge.
[10] Lindblom, B. E. F. (1990). Explaining phonetic variation: A sketch of the
HH theory. In Hardcastle, W. J.; Marchal, A., eds., Speech production and speech
modelling, Nr. 55 in Nato ASI series D: Behavioural and social sciences, pp.
403-439. Kluwer Academic Publishers, Dordrecht, Boston, London.
[11] Bissiri, M. P.; Pfitzinger, H. R.; Tillmann, H. G. (2006). Lexical stress training
of German compounds through resynthesis and emphasis. Accepted for Proc. of
InSTIL Workshop at CALICO, Hawaii.
|
|
Keynote Speaker 3 - Thursday, May 4, 09:00 - 09:45
|
3 of 3 |
Chiu-yu Tseng, Institute of Linguistics, Academia Sinica, Taiwan
Fluent Speech Prosody and Discourse Organization: Evidence of Top-down Governing and Implications to Speech Technology
Extended Abstract:
Both linguists and engineers ask questions about language and speech, but their
concerns differ. Although both communities look for what makes up communication,
linguists look for what constitutes the abstract linguistic system in the human
mind and brain, while engineers look for ways to model and simulate speech for
technology implementation. What if the question addressed is fluent speech of
Mandarin Chinese, and the answers are to satisfy both linguists and engineers?
Put in paraphrase, the question then becomes what is there to be studied in addition
to lexical tones and intonation for the linguists, and how could fluent speech
prosody be simulated in addition to adding up tones and intonations for the engineers.
Trying boldly to bring answers to both communities, we decided first to adopt a
corpus approach to phonetic studies, an attempt to remedy the traditional phonetic
approach by looking at more samples. To ensure the corpora contain fluent
prosody information, we collected narratives of read discourses rather than canonical
phrases. A total of 9 sets of speech corpora with different prosodic features
were recorded over a decade (http://www.myet.com/COSPRO). We then designed
a perceptually based annotation system that emphasized boundary information
and boundary breaks and manually labeled the corpora. The annotation
consistently identified multiple-phrase speech paragraphs and various kinds of
prosodic units within them. We studied the acoustic phonetic correlates of the annotated
paragraphs, units and boundary breaks in detail, and through quantitative analyses,
found systematic cross-phrase patterns in every acoustic parameter for each
unit identified. That is, F0 contours, syllable duration patterns, intensity distribution
patterns, and on top of it, systematic boundary information and boundary
breaks are found across phrases. These patterns are not only cross-speaker but also
cross-speaking-rate. It became obvious that what constitutes fluency is neither in
the tonal realization of each syllable, nor in the individual phrase intonation, but
rather, in the association between and among intonation phrases (IP). The association
stems from higher-level governing by the discourse. What these associations
or associative prosodic relationships reflect is mainly top-down governing.
A framework of the multi-phrase hierarchy is subsequently constructed to account
for fluent speech prosody. The term Prosodic Phrase Grouping (PG) was proposed
for the framework to denote how intonation phrases (IP) were grouped to
form a higher and larger prosodic unit; a unit that roughly corresponds to speech
paragraphs in narratives or spoken discourses. Central to the framework is the
notion that individual phrasal intonations are subjacent sister constituents subject
to higher level constraints that specify layered modifications at each prosodic
level; while ultimate output fluent prosody is achieved by adding up contributions
from each prosodic layer. From our data analyses, we were able to show just how
cumulative modifications account for the overall patterns in fluent speech, in particular,
syllable duration as well as boundary pause patterns (Tseng et al., 2005).
Subsequently, we were able to derive acoustic templates for each prosodic unit in
the framework, namely, templates for global F0 contours, syllable durations and
intensity distribution. These templates facilitated constructing a modular model
of multiple-phrase grouping with 4 corresponding acoustic modules for speech synthesis
applications.
By the same logic, we also view spoken discourse prosody as yet another higher
node that groups PGs into sister constituents. Our more recent work aims to
establish discourse prosody organization from the PG upward. Again looking at
the larger picture we studied relative F0 range narrowing vs. widening as well as F0
resets across PGs and boundaries. So far we have found two types of prosodic links
that involved F0 narrowing and subsequent F0 reset. One type of F0 narrowing
is duration triggered and redundant, which we term Prosodic Fillers (PF);
another is lexically and/or syntactically triggered and obligatory, which we term
Discourse Markers (DM). The main function of these two links appears to be to provide
a major source of melodic and rhythmic variation in output prosody. They also
turned out to be predictable from text analyses.
In summary, what the prosodic specifications discussed above revealed is essentially
the global overall relative prosodic relationships across phrases in fluent
speech; what they reflected is top-down governing of semantic constraints from
the discourse and cognitive constraints from the speaker. All of them are crucial
to on-line speech planning and processing of discourse information. We argue
that any prosody framework of fluent speech should include top-down information,
specify how intonation phrases are formed, and take into consideration perceptual
effects on on-line processing. Moreover, how discourse prosody is organized deserves
further attention. Technology developments could serve as the best testing ground
for these findings. As for a tone language such as Mandarin Chinese, in addition
to syllable tones and phrase intonations, there also exists a cross-phrase melody,
rhythm and loudness pattern necessary to form its fluent speech prosody. We
believe these non-tonal aspects not only bear cross-linguistic significance, but also
merit more attention in studies of tone languages in general.
|
|
|
Special Session 1 (SPS 1)
Prosody and Affective Computing
Organizers: Noam Amir, Nick Campbell and Jianhua Tao
Tuesday, May 2, 11:10 - 13:10 |
Special Session 1: Prosody and Affective Computing |
1 of 6 |
The Prosody of Pet Robot Directed Speech: Evidence from Children
AUTHOR(S):
Batliner, Anton; Chair for Pattern Recognition, University of Erlangen-Nuremberg
Biersack, Sonja; Department of Psychology, University of Stirling
Steidl, Stefan; Chair for Pattern Recognition, University of Erlangen-Nuremberg
Abstract:
In this paper, we present a database with emotional children's speech in a human-robot
scenario: the children were giving instructions to Sony's pet robot dog AIBO,
with AIBO showing both obedient and disobedient behaviour. In such a scenario,
a specific type of partner-centered interaction can be observed. We aimed at
finding prosodic correlates of children's emotional speech and were interested to
see which speech registers children use when talking to AIBO. For interpretation,
we left the weighting and categorization of prosodic features to a statistical classifier.
The parameters found to be most important were word duration, average energy,
variation in pitch and energy, and harmonics-to-noise ratio. The data moreover
suggests that the children used a register that resembled mostly child-directed and
pet-directed speech and to some extent computer-directed speech.
| |
Special Session 1: Prosody and Affective Computing |
2 of 6 |
Modelling personality features by changing prosody in synthetic speech
AUTHOR(S):
Trouvain, Jürgen; Phonetik-Büro Trouvain, Saarbrücken & Institute of Phonetics,
Saarland University
Schmidt, Sarah; Institute of Computer Science, Saarland University
Schröder, Marc; DFKI GmbH, Saarbrücken
Schmitz, Michael; Institute of Computer Science, Saarland University
Barry, William J.; Institute of Phonetics, Saarland University
Abstract:
This study explores how features of brand personalities can be modelled with
the prosodic parameters pitch level, pitch range, articulation rate and loudness.
Experiments with parametrical diphone synthesis showed that listeners rated the
prosodically changed versions better than a baseline version for the dimensions
"sincerity", "competence", "sophistication", "excitement" and "ruggedness". The
contributions of prosodic features such as lower pitch and an enlarged pitch range
are analyzed and discussed.
|
|
|
Special Session 1: Prosody and Affective Computing |
3 of 6 |
Modeling Emotion Expression and
Perception Behavior in Auditive Emotion Evaluation
AUTHOR(S):
Grimm, Michael; Universität Karlsruhe (TH), Karlsruhe
Kroschel, Kristian; Universität Karlsruhe (TH), Karlsruhe
Narayanan, Shrikanth; University of Southern California (USC), Los Angeles
Abstract:
In this paper, we consider both speaker dependent and listener dependent aspects
in the assessment of emotions in speech. We model the speaker dependencies in
emotional speech production by two parameters, Emotion Expression Bias and
Emotion Expression Amplification. Similarly, we model the listener's emotion
perception behavior by a simple parametric model, the correlation with the mean
value of all evaluators. These models form a basis for improving current automatic
emotion recognition schemes. An emotional speech database of the four emotion
categories angry, happy, neutral, and sad was evaluated on three emotion primitives,
valence, activation, and dominance. The assessment results were used to
analyze the variations of the class centroids in the 3D emotion space as a function
of speaker and listener. We found that the models are simple and efficient for
describing individual emotion expression styles and emotion perception behavior
in speech.
|
|
|
Special Session 1: Prosody and Affective Computing |
4 of 6 |
Perception of Non-Verbal Emotional Listener Feedback
AUTHOR(S):
Schröder, Marc; DFKI GmbH
Heylen, Dirk; University of Twente
Poggi, Isabella; University of Rome
Abstract:
This paper reports on a listening test assessing the perception of short non-verbal
emotional vocalisations emitted by a listener as feedback to the speaker. We
clarify the concepts of backchannel and feedback, and investigate the use of affect
bursts as a means of giving emotional feedback via the backchannel. Experiments
with German and Dutch subjects confirm that the recognition of emotion from
affect bursts in a dialogical context is similar to their perception in isolation. We
also investigate the acceptability of affect bursts when used as listener feedback.
Acceptability appears to be linked to display rules for emotion expression. While
many ratings were similar between Dutch and German listeners, a number of clear
differences were found, suggesting language-specific affect bursts.
|
|
|
Special Session 1: Prosody and Affective Computing |
5 of 6 |
Expressive Speech Synthesis:
Evaluation of a Voice Quality Centered Coder on the Different Acoustic Dimensions
AUTHOR(S):
Audibert, Nicolas; ICP
Vincent, Damien; France Telecom, R&D Division
Aubergé, Véronique; ICP
Rosec, Olivier; France Telecom, R&D Division
Abstract:
Expressive speech is intrinsically multi-dimensional. Each acoustic dimension has
specific weights depending on the nature of the expressed affects. The quantity
of information carried by each dimension separately, as well as the processing implied
to carry it has been perceptively measured for a set of natural mono-syllabic
utterances. It has been shown that no parameter alone is able to carry the whole
emotion information. These stimuli (anxiety, disappointment, disgust, disquiet, joy,
resignation, sadness) were resynthesized with an LF-ARX algorithm, and evaluated
in the same perceptive protocol extended to the VQ parameters (source, filter
and residue). The comparison of results between natural, TD-Psola resynthesized
and LF-ARX resynthesized stimuli (1) globally confirms the relative weights of
each dimension (2) diagnoses local minor artifacts of resynthesis (3) validates the
efficiency of the LF-ARX algorithm (4) measures the relative importance of each
of LF-ARX parameters.
|
|
|
Special Session 1: Prosody and Affective Computing |
6 of 6 |
On the Structure of Spoken Language
AUTHOR(S):
Campbell, Nick; Advanced Telecommunications Research Institute, Kyoto
Abstract:
The special structure of spoken language is often described as "ill-formed" but
this paper shows that it is ideally suited to the simultaneous expression of (a)
propositional content (i. e., linguistic information) and (b) speaker-state, discourse
management cues, and speaker-listener-relationships (i. e., affective information).
This paper shows that by the frequent insertion of so-called "fillers" and other
repetitive fragments, the speaker provides the listener with constant reference
points for evaluating affective states as displayed by voice-quality information.
|
|
|
Poster Session 1 (PS 1)
Prosody and Speech Perception
Tuesday, May 2, 14:30 - 16:00
Chair: Anne Cutler |
Poster Session 1: Prosody and Speech Perception |
1 of 24 |
Timing in News and Weather Forecasts: Implications for Perception
AUTHOR(S):
Shevchenko, Tatiana; Moscow State Linguistic University
Uglova, Natalia; Moscow State Linguistic University
Abstract:
This paper addresses the problem of prosodic text organization in the situation
of severe time limits for TV information programs. It is a search for techniques
used to make a compromise between temporal constraints and the demand for
distinctiveness of speech targeted at mass audience. Tempo and pitch characteristics
of American TV news and weather forecasts (9 items from 4 stations, total
time 10 min) are explored with reference to genre, region and gender of newsreaders.
Combinations of features account for native speakers' perception of speech as
'fast' or 'too fast'. The diagnostic parameters are: length of uninterrupted speech
units, types of pauses, number of accents per unit, accented and unaccented syllable
length, F0 max and F0 intervals in key words and units. The data obtained,
when compared to previous research results on interviews, reading, public speaking
and spontaneous talk, revealed phonation/pause time ratio to be most relevant.
|
|
Poster Session 1: Prosody and Speech Perception |
2 of 24 |
Identification of language and accent through visual speech
AUTHOR(S):
Irwin, Amy; Institute of Hearing Research
Thomas, Sharon; Institute of Hearing Research
Abstract:
Facial movements can be utilised in the processing of visual speech and form the
basis of speechreading. However, the production of speech by different talkers can
be variable; physiology, accent and speech rate can all change the appearance of the
visual signal. The focus of this report is an investigation into the effects of language
and accent variation on speechreading, an area previously lacking in systematic
research. Results from two experiments indicate, firstly, that the visual differences
between French and English (both accent and language) can be discriminated
through visual speech. Secondly, in a comparison of speechreading performance,
sentences produced using a French accent were found to be significantly more
difficult to speechread by English observers than those produced in an English
accent. This research indicates the importance of further study into the effects of
accent on speechreading.
|
|
Poster Session 1: Prosody and Speech Perception |
3 of 24 |
Dialect identification through prosodic information: an experimental
approach
AUTHOR(S):
Dimou, Athanassia; Université Paris 7
Chalamandaris, Aimilios; ILSP
Abstract:
The purpose of this paper is to investigate whether native Greek adults can identify
their mother tongue from synthesized stimuli which contain only prosodic - melodic
and rhythmic - information. More specifically, we investigate whether
Greek native speakers are able to discriminate their mother dialect from another
dialect of Greece on the basis of prosodic information alone. In the first section we present
the main idea behind our work, in the second section we present the procedure we
followed in order to complete this pilot study, while in the two final sections one
can find the results and the conclusions of our experiments.
|
|
Poster Session 1: Prosody and Speech Perception |
4 of 24 |
Fake geminates in French: a production and perception study
AUTHOR(S):
Meisenburg, Trudel; University of Osnabrück
Abstract:
This paper examines the role of consonantal quantity from Latin to the Romance
languages, concentrating on the situation in contemporary French, where fake or
apparent geminates quite frequently arise in morpheme concatenation, often as a
consequence of schwa deletion. A series of production and perception experiments
shows that the required surface contrasts are neither represented nor identified
consistently; rather, speakers tend to delete geminates in favor of a
simplified syllable structure, but at the cost of morpheme identity.
|
|
Poster Session 1: Prosody and Speech Perception |
5 of 24 |
Interpretation - Perception - Analysis
AUTHOR(S):
Dohalská-Zichová, Marie; Institute of Phonetics, Charles University in Prague
Škardová, Radka; Institute of Phonetics, Charles University in Prague
Abstract:
The aim of this experiment was to determine, via perception tests, how two
phonetic groups (i.e. the French and Czechs with proficient knowledge of French)
and two non-phonetic control groups of listeners perceive the differences in the
individual prosodic demonstration of two types of artistic interpretation of the
poem "Mon rêve familier". At the same time the aim was to compare and contrast
subjective perceptual levels with objective measurements of F0, intensity and time
values. If we take into account the fractional representation and the importance
of individual values for the perception of accents, then we can conclude that both
the French and Czechs consider the time value (T) the crucial one, whereas the value
in second place differs: for Czechs it is intensity followed by
frequency (T-I-F0); for the French, by contrast, it is frequency
followed by intensity (T-F0-I).
|
|
Poster Session 1: Prosody and Speech Perception |
6 of 24 |
Perception of Anger in French as Foreign Language: Experimental Protocol
and Preliminary Results
AUTHOR(S):
Mathon, Catherine; EA333 "Atelier de Recherches sur la Parole"
de Abreu, Sophie; EA333 "Atelier de Recherches sur la Parole"
Perekopska, Daniela; EA333 "Atelier de Recherches sur la Parole"
Abstract:
Learners of a foreign language need to perceive the emotions of their interlocutors.
They also need to be able to reproduce an emotion in a satisfactory
prosodic pattern in the foreign language. Otherwise, the communication will fail.
Our project deals with 3 main questions: How are emotions perceived in a foreign
language? Will a learner be able to reproduce such an emotion and how? How
will these (re)productions be recognized by native speakers? We first concentrated
on the study of the emotion called Anger. This paper aims to show whether prosody
provides enough information to allow students of French as a Foreign language
(FFL) to recognize this emotion. The perceptual test presented here is original
because of the use of a spontaneous corpus of French containing real emotions. We
focus here on the first stage of our research: the results of the perception of anger
by Czech and Portuguese speakers. We insist on the methodology as well as the
experimental protocol of our work.
|
|
Poster Session 1: Prosody and Speech Perception |
7 of 24 |
Exploring Expressive Speech Space in an Audio-book
AUTHOR(S):
Wang, Lijuan; Dept. of Electronic Engineering, Tsinghua University, Beijing
Zhao, Yong; Microsoft Research Asia, Beijing
Chu, Min; Microsoft Research Asia, Beijing
Chen, Yining; Microsoft Research Asia, Beijing
Soong, Frank; Microsoft Research Asia, Beijing
Cao, Zhigang; Dept. of Electronic Engineering, Tsinghua University, Beijing
Abstract:
In this paper, an audio-book, in which a professional voice talent performs multiple
characters, is exploited to investigate the expressiveness of speech. The expressive
speech space of the sole speaker is explored by finding the distances between
acoustic models of multiple characters and the perceived proximity between their
speech utterances. Using the speech of ten characters as test data, the character
confusion is evaluated in both acoustic and perceptual spaces. We find that the
average precision to differentiate one character from the others is 81.7 % in the
acoustic space and 72.6 % in the perceptual space. It is interesting that the objective
measure outperforms the subjective measure. Furthermore, the acoustic
distance measured by normalized Kullback-Leibler divergence (NKLD) between
two characters is highly correlated with the perceptual distance with correlation
coefficient 0.814. Therefore, NKLD can objectively measure the perceptual similarity
between groups of utterances.
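
As a rough illustration of the kind of comparison described in this abstract, the sketch below computes a symmetric Kullback-Leibler divergence between two diagonal-covariance Gaussians and a Pearson correlation between acoustic and perceptual distances. The abstract does not give the authors' normalization or acoustic models, so the single-Gaussian model, the symmetric form, and all function names here are assumptions made for illustration only.

```python
import math

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance (assumed model)."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

def symmetric_kl(mu_p, var_p, mu_q, var_q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return (kl_diag_gauss(mu_p, var_p, mu_q, var_q)
            + kl_diag_gauss(mu_q, var_q, mu_p, var_p))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

Given one acoustic distance and one perceptual distance per character pair, `pearson(acoustic, perceptual)` would yield a coefficient of the kind reported above.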
|
|
| Poster Session 1: Prosody and Speech Perception |
8 of 24 |
A Comparative Study of Sentential Stress Distribution in Mandarin
Multi-Style Speeches
AUTHOR(S):
Bao, Mingzhen; University of Florida
Chu, Min; Microsoft Research Asia
Abstract:
This paper compares the distribution of sentential stresses among three speaking
styles: Lyric, Critical, and Explanatory; and extends our previous study in the
base phrase level to the sentence construction level and the prosodic word level.
The results show that 1) The distributions of both rhythmic and semantic stresses
act the same among styles within prosodic words; 2) In the sentence construction
level, the distribution of rhythmic stress is quite similar across three styles, while
semantic stress presents more diversity among speaking styles. The Explanatory
style shares a similar tendency with the Neutral style. The Lyric and Critical
styles differ from the Neutral style in subject-predicate, predicate-object, adjunct-subject,
and adjunct-object constructions. Generally, speaking styles have fewer
effects on rhythmic stress distribution than on semantic stress. Such effects are
more obvious in the sentence construction and the base phrase levels than the
prosodic word level.
|
|
| Poster Session 1: Prosody and Speech Perception |
9 of 24 |
Reliable Prominence Identification in English Spontaneous Speech
AUTHOR(S):
Tamburini, Fabio; DSLO - University of Bologna
Abstract:
This paper presents a follow-up of a study on the automatic detection of prosodic
prominence in spontaneous speech. Prosodic prominence involves two different
prosodic features, pitch accent and stress, that are typically based on four acoustic
parameters: fundamental frequency (F0) movements, overall syllable energy,
syllable nuclei duration and mid-to-high-frequency emphasis. A careful measurement
of these acoustic parameters makes it possible to build an automatic system
capable of identifying prominent syllables in utterances with performance comparable
with the inter-human agreement reported in the literature even when tested
on spontaneous speech.
|
|
| Poster Session 1: Prosody and Speech Perception |
10 of 24 |
Form and Function of Falling Pitch Contours in English
AUTHOR(S):
Kleber, Felicitas; Institute of Phonetics and Digital Speech Processing (IPDS),
Christian-Albrechts-University at Kiel
Abstract:
This paper presents the results of a set of perception experiments concerning the
phonological status of early, medial and late F0 peak synchronization in English
and the nature of the contrast between these categories. By means of one identification
and two discrimination tasks, it has been shown that subjects perceive
a categorical-like change when the F0 maximum of a peak is shifted into the
stressed vowel and a gradual change when the F0 maximum is moved into the
following unstressed vowel. Therefore, we conclude that the early peak constitutes
a phonological category as opposed to medial peaks; late peaks form a phonetic
continuum.
|
|
| Poster Session 1: Prosody and Speech Perception |
11 of 24 |
Relevance of F0 peak shape and alignment for the perception of a functional
contrast in Russian
AUTHOR(S):
Rathcke, Tamara; Institute of Phonetics and Digital Speech Processing
Abstract:
This paper reports a perception experiment carried out to investigate the perceptually
relevant properties of yes/no-questions and contrastive emphasis in modern
Russian spoken by young people in Kaliningrad. Only melodic cues were involved
in the test stimuli such as alignment and shape of F0 peaks as well as presence
of a peak plateau. A semantic congruity test was performed to investigate these
form-function relations. Results indicate that peak alignment is the strongest cue
for the perceptual distinction of the investigated categories. Contour shape (including
plateau property) serves as a secondary cue, whereas the effect of a plateau
seems to be very small. Results are discussed in terms of phonological modeling of
Russian intonation based on an experimental approach including the investigation
of intonational forms in relation to linguistic functions.
|
|
| Poster Session 1: Prosody and Speech Perception |
12 of 24 |
Categorical Perception of intonational contrasts in European Portuguese
AUTHOR(S):
Falé, Isabel; Onset - CEL, Lab. Psicoling, DLGR, FLUL; Universidade Aberta
Faria, Isabel Hub; Onset - CEL, Lab. Psicoling, DLGR, FLUL
Abstract:
European Portuguese intonational contrast between statement and question contours
was tested on a Categorical Perception based paradigm. From two natural
sentences, one produced by a male speaker and the other by a female speaker, one multi-step
continuum per sentence was created, from declarative to question contour,
through acoustic manipulation (PSOLA), and presented to 20 EP listeners, who
performed two tasks: an identification and a discrimination task. For the identification
test, subjects had to categorize each presented stimulus. In addition to
response data, reaction times of the identification task were also collected. Experimental
design and procedures were developed with E-Prime. Identification results
confirmed that the contrast is indeed categorical. However, identification reaction
times measurements point to continuous rather than categorical perception. The
absence of a consistent peak of discrimination in the crossover between categories
supports the continuous perception view.
|
|
| Poster Session 1: Prosody and Speech Perception |
13 of 24 |
Secondary stress in Brazilian Portuguese: the interplay between production
and perception studies
AUTHOR(S):
Arantes, Pablo; State University of Campinas
Barbosa, Plinio; State University of Campinas
Abstract:
This paper reports experiments on speech production showing that secondary
stress in Brazilian Portuguese (BP) can be best described as phrase-initial prominence
cued by greater duration and pitch accent excursion in initial position. It
also reports a perception experiment in which clicks were associated with consecutive
V-to-V positions in stress groups. Mean click detection RTs are gradient, but
show no influence of initial lengthening. RTs near the phrasally stressed position
are shorter and almost 60 % of RT variance can be accounted for by produced
timing patterns.
|
|
| Poster Session 1: Prosody and Speech Perception |
14 of 24 |
Perception of Cantonese level tones influenced by context position
AUTHOR(S):
Zheng, Hongying; City University of Hong Kong
Peng, Gang; The Chinese University of Hong Kong
Tsang, Peter W-M.; City University of Hong Kong
Wang, William S-Y.; The Chinese University of Hong Kong
Abstract:
When humans perceive speech sounds, they categorize the sounds into one or
another phoneme category. Perception of speech sound depends on context. Previous
studies on categorical perception of lexical tones were mainly done in an
absolute manner without context. In these experiments we explore the influence
of context on the categorical perception of lexical tones. In particular, we ask
whether the position of the context with respect to the target syllable influences
the categoricalness of the perception. Two experiments on natural and synthesized
speech both show that categorical boundaries of identification curves are sharper
when the context is to the right of the target syllable than when the context is
to the left of the target syllable. Moreover, steeper peaks are obtained in the
discrimination curve from the right-context continuum. These peaks agree with and enhance
the identification results. Explanations of the phenomenon are suggested in the
paper.
|
|
| Poster Session 1: Prosody and Speech Perception |
15 of 24 |
Perception of Isolated Tone2 words in Mandarin Chinese
AUTHOR(S):
Xu, Lei; Linguistics, Ohio State University, Columbus
Speer, Shari R.; Linguistics, Ohio State University, Columbus
Abstract:
Many tone3 words in Mandarin undergo "third tone sandhi" - a phonological rule
that changes the first tone3 word in a tone3+tone3 sequence to a tone2 word. Spoken
tone2 words that have tone3 counterparts are thus ambiguous. A cross-modal
priming experiment examined lexical tone processing during word recognition.
Participants saw Chinese characters of 4 kinds: identical, different-only-in-tone,
irrelevant to the auditory word or nonword. Visual targets were preceded by auditory
primes of 4 types: tone2 word with tone3 counterpart, tone2 word w/out
tone3 counterpart, tone3 word with tone2 counterpart, or tone3 word w/out tone2
counterpart. RTs were longer for tone2 words with tone3 counterparts than for
tone2 words w/out tone3 counterparts, while RTs to tone3 words with or w/out
tone2 counterparts did not differ. Results suggest integration of tonal and segmental
information during word recognition, without recourse to a separable "tonal
level".
|
|
| Poster Session 1: Prosody and Speech Perception |
16 of 24 |
Perception of L2 Tones: L1 Lexical Tone Experience May Not Help
AUTHOR(S):
Wang, Xinchun; California State University, Fresno
Abstract:
This study investigates whether adult L2 learners' experience with lexical tones
and pitch accent in their first language facilitates the acquisition of L2 lexical
tones. Three groups of beginning learners of Mandarin with different L1 prosodic
experience: native Hmong (a tone language), native Japanese (a pitch accent
language), and native English (a non-tone, non-pitch accent language) speakers
participated as listeners in a perception test on the four Mandarin tones. Results
showed that native English listeners performed as well as native Japanese
listeners but native Hmong speakers performed significantly worse than the native
Japanese and native English speakers in perceptual accuracy of Mandarin tones.
The findings suggest that experience with lexical tones and pitch accent may not
always facilitate learning. The lack of exact mapping of L2 tones onto L1 tones
may interfere with the acquisition of nonnative tones especially at the initial stage
of learning.
|
|
| Poster Session 1: Prosody and Speech Perception |
17 of 24 |
Lexical Accent Status Affects Perceived Prominence of Intonational
Peaks in Japanese
AUTHOR(S):
Shinya, Takahito; University of Massachusetts, Amherst
Abstract:
This study shows that lexical accent status affects perceived prominence of fundamental
frequency (F0) peaks in Japanese. In Japanese, word accent type can be
identified from two different sources: lexical accent status and phonetic F0 contour
shape. This study examines whether listeners compensate for the accentual
boost of an accented word based only on the word's lexical accent status, when
no F0 contour information is available. A perceptual experiment was conducted
in which participants judged the relative prominence between two F0 peaks. The
experiment showed that for a given second F0 peak height, the first F0 peak height
was higher when the first word was lexically accented than when it was lexically
unaccented in order for the two words to be equal in perceived prominence. This
suggests that the accentual boost of an accented word is subtracted in perception.
It is concluded that lexical accent status as phonological knowledge affects
perceived prominence of F0 peaks.
|
|
| Poster Session 1: Prosody and Speech Perception |
18 of 24 |
The recognition of Japanese-accented and unaccented English words by
Japanese listeners
AUTHOR(S):
Yoneyama, Kiyoko; Daito Bunka University
Abstract:
This study investigated whether Japanese listeners learning English employ two
types of lexical information (word frequency and neighborhood density) when they
recognize English words. English words recorded by a native speaker of English
and a native speaker of Japanese were presented to Japanese university students in
a noise condition. The results of word recognition scores showed that Japanese listeners
employed both lexical and pre-lexical levels of information in English word
recognition. They were sensitive to both probabilistic phonotactics (bottom-up
acoustic information) and word frequency (lexical information). A strong correlation
between probabilistic phonotactics and neighborhood density still predicts that
Japanese listeners are influenced by neighborhood density in English word recognition.
|
|
| Poster Session 1: Prosody and Speech Perception |
19 of 24 |
The contribution of silent pauses to the perception of prosodic boundaries
in Korean read speech.
AUTHOR(S):
Hirst, Daniel; CNRS, UMR 6057
Abstract:
This paper discusses the importance of silent pauses in the perception of prosodic
boundaries in Korean speech. It is suggested that in speech in general, and in
particular in spontaneous speech, silent pauses are neither necessary nor sufficient
for the perception of prosodic boundaries. In read speech, however, there is a high
correlation between the presence of a pause and the perception of a boundary. An
experiment was carried out to determine whether removing the silent boundary
from an extract of speech had a significant effect on the perception of boundaries in
Korean read speech. Results suggest that while the presence of a silent boundary
slightly reinforces the perception of a prosodic boundary, subjects are in general
capable of perceiving the boundary without the silent pause.
|
|
| Poster Session 1: Prosody and Speech Perception |
20 of 24 |
The perception of intended speech rate in English, French, and German
by French listeners
AUTHOR(S):
Dellwo, Volker; Dept. of Phonetics and Linguistics, University College London
Ferragne, Emmanuel; Laboratoire Dynamique du Langage, Univ. Lyon 2
Pellegrino, Francois; Laboratoire Dynamique du Langage, Univ. Lyon 2
Abstract:
Speakers are able to produce speech at different intended rates when prompted
to do so. The question addressed in the present research is to what degree different
intended rate categories are perceptually relevant when objective measures of
speech rate (e.g. syllables/second) are variable and to what degree listeners are
able to identify intended speech rates in languages other than their native language.
Initial results from an experiment with French listeners rating speech rates
in French, German, and English show that, despite varying objective speech rates,
listeners are well able to identify intended speech rate across different languages.
|
|
| Poster Session 1: Prosody and Speech Perception |
21 of 24 |
Comparing Perceptual Local Speech Rate of German and Japanese
Speech
AUTHOR(S):
Pfitzinger, Hartmut R.; Institute of Phonetics and Speech Communication, University
of Munich
Tamashima, Miyuki; Institute of Phonetics and Speech Communication, University
of Munich
Abstract:
Possibly everybody who listens to people talking to each other in an unknown language
gains the impression that they are speaking very fast. To test the effect of
language background on perceptual local speech rate (PLSR) we conducted a fully
symmetrical perception experiment in which two groups with different language
backgrounds judge the speech rates of stimuli from both languages. 160 short
German and Japanese speech stimuli are judged by 40 German and Japanese subjects.
Japanese listeners overshoot German speech rate by 7.5 % on a PLSR scale
and German listeners overshoot Japanese speech rate by 9.1 %. An explanation is
that unknown languages appear to be spoken faster because listeners are unable
to identify and attenuate redundant features of the unknown speech and, at the
same time, they unconsciously insert additional phonetic items to reduce the mismatch
between the large number of recognized phonetic items and the phonotactic
structure of their native languages.
|
|
| Poster Session 1: Prosody and Speech Perception |
22 of 24 |
The Role of the Accented-Vowel Onset in the Perception of German
Early and Medial Peaks
AUTHOR(S):
Niebuhr, Oliver; Institut für Phonetik und digitale Sprachverarbeitung, Christian-
Albrechts-Universität Kiel
Abstract:
Starting from a series of speech stimuli representing an F0 peak shift continuum
from German early to medial peak, a series of non-speech stimuli is created. These
non-speech stimuli show the F0 and intensity courses of the original speech stimuli,
but with a constant formant structure. The results of a perception experiment
reveal that the organisation of the peak shift continuum found for the identification
of early and medial peaks in the speech stimuli can be replicated by the non-speech
stimuli, indicating that early and medial peaks are signalled by an interplay of the
F0 and intensity courses without reference to the spectral change at the accented-vowel
onset.
|
|
| Poster Session 1: Prosody and Speech Perception |
23 of 24 |
Clause position within a sentence: human vs. machine recognition
AUTHOR(S):
Palková, Zdena; Institute of Phonetics, Charles University in Prague
Volín, Jan; Institute of Phonetics, Charles University in Prague
Abstract:
The paper presents a combined experiment in which recognition of a prosodic
phrase position within a larger syntactic structure by human listeners is confronted
with recognition by artificial neural networks. Apart from the success rate we are
predominantly interested in similarities in the error pattern of the two recognition
modes. The results suggest that the automatic recognition could help to determine
which of the selected parameters are relevant for human listeners, since it provides
linguistically interpretable outcome.
|
|
| Poster Session 1: Prosody and Speech Perception |
24 of 24 |
Lateralized processing in human auditory cortex during the perception
of emotional prosody
AUTHOR(S):
Wendt, Beate; Leibniz-Institute for Neurobiology
Brechmann, André; Leibniz-Institute for Neurobiology
Gaschler-Markefski, Birgit; Leibniz-Institute for Neurobiology
Scheich, Henning; Leibniz-Institute for Neurobiology
Ackermann, Hermann; University of Tübingen
Abstract:
The aim of the present fMRI-study was to investigate the influence of different
word prosodies on the activation of the auditory cortex. Pseudowords and semantically
neutral words were presented with neutral prosody in experiment I and
with emotional prosodies in experiment II. In both studies there was a left lateralized
activation for speech perception on planum temporale. In our experiments
the emotional information was task-irrelevant and even distracted from the lexical
task. The performance in the detection of words and pseudowords was significantly
better in the prosodically neutral condition. Thus, the current results contribute
to the clarification of the controversial issue whether prosodies lateralize brain activation
to the right, i.e. if lexical rather than prosodic information is in the focus
of a task involving prosodic speech material, a right hemisphere dominance cannot
be expected. Future experiments with prosody identification tasks will extend
these findings.
| |
|
Poster Session 2 (PS 2)
Analysis and Formulation of Prosody
Tuesday, May 2, 14:30 - 16:00
Chair: Daniel Hirst |
| Poster Session 2: Analysis and Formulation of Prosody |
1 of 22 |
Phonetics vs. phonology in Tamil wh-questions
AUTHOR(S):
Keane, Elinor; Christ Church, Oxford & Oxford University Phonetics Laboratory
Abstract:
Wh-questions in Tamil are not distinguished from declarative utterances by either
pitch accent type or boundary tone. Acoustic analysis of data from 18 speakers
comparing wh-questions with corresponding declaratives revealed that the lexical
marking of interrogativity is nevertheless accompanied by differences in intonation.
The most consistent result was raising of f0 peaks in question words and in a
majority of speakers, including all the females, sentence offset f0 was significantly
higher in questions. This tended to be accompanied by lowering of f0 peaks following
question words, resulting in some compression of the pitch register. In marking
interrogativity Tamil thus manipulates gradient phonetic parameters, adding further
fuel to the debate about whether such parameters can directly signal linguistic
information or are mediated via some elaborated phonological representation.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
2 of 22 |
Empirical Validation of Hand-labelled Nuclear Accent Patterns
AUTHOR(S):
Grabe, Esther; Phonetics Laboratory, University of Oxford
Kochanski, Greg; Phonetics Laboratory, University of Oxford
Coleman, John; Phonetics Laboratory, University of Oxford
Abstract:
In this paper, we explore the interface between intonational phonology and speech
technology, in search of bridges between the disciplines. In a corpus containing
speech data from seven dialects of English, we hand-labelled over 700 nuclear accents
and identified seven accent types. Then we used four-term mathematical
models to describe the fundamental frequency patterns associated with the accents.
A statistical analysis showed that the models for six of the seven accents
differed significantly from each other. Our hand-labels were associated with consistently
different f0 patterns. Our approach bridges the gap between intonational
phonology and speech technology. It provides quantitative, empirically testable
models of intonation labels that can be implemented in applications.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
3 of 22 |
Phonologies and Phonetics of French Prosody
AUTHOR(S):
Martin, Philippe; Université Paris 7 Denis Diderot
Abstract:
Studies on French intonation are quite diversified, to the point where, looking at
the descriptive results, one might wonder whether all researchers analyzed the same
language. Remarkable prosodic characteristics found in one study are not found
in another, and different theoretical approaches give very different insights into the
data, despite very similar experimental material. In this paper we attempt to
highlight some converging aspects of two types of linguistic description of French
intonation, one developed in the Autosegmental-Metrical framework and the other
from a phonosyntactic point of view. In particular, the contrast of melodic slope
may be totally hidden with one approach, and appear as the main characteristic
of French intonation with the other.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
4 of 22 |
Text-based and Signal-based Prediction of Break Indices and Pause Durations
AUTHOR(S):
Pfitzinger, Hartmut R.; Institute of Phonetics and Speech Communication, University
of Munich
Reichel, Uwe D.; Institute of Phonetics and Speech Communication, University of
Munich
Abstract:
The relation between symbolic and signal features of prosodic boundaries is experimentally
studied using prediction methods. Text-based break index prediction
turns out to be fairly good, but signal-based prediction and pause duration prediction
perform worse. A possible reason is that random signal feature variations,
as usually produced by humans, are hard to predict.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
5 of 22 |
Analysis of Polish Segmental Duration with CART
AUTHOR(S):
Breuer, Stefan; Institute of Communication Sciences, University of Bonn
Francuzik, Katarzyna; Institute of Linguistics, Adam Mickiewicz University Poznań
Demenko, Grażyna; Institute of Linguistics, Adam Mickiewicz University Poznań
Abstract:
Segmental duration was investigated in a database of Polish read speech (from one
male speaker). The material was labeled automatically and then manually verified.
The dependence of phone duration on a set of features was verified with the
CART algorithm. The duration phenomena were analyzed in relation to syllable,
foot and phrase structure. The results showed the need for segmental as well as
suprasegmental modeling for the analysis of segmental duration.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
6 of 22 |
The stylization of intonation contours
AUTHOR(S):
Demenko, Grażyna; Institute of Linguistics, Adam Mickiewicz University Poznań
Wagner, Agnieszka; Institute of Linguistics, Adam Mickiewicz University Poznań
Abstract:
This paper presents the stylization of intonation contours and clustering of F0
movements on accented and post-accented syllables based on annotated speech
corpora. Special software - PitchLine - has been developed to enable the flexible
quasi-automatic segmentation and parametrization of intonation curves. The
experimental material obtained from a 15 min passage read by a male speaker included
more than 1200 annotated accents and several hundred phrase boundaries.
The accuracy of the stylization method was evaluated by measuring the NMSE
between original and stylized F0 contours and in a perception study. Stylized F0
contours which were perceived as very different from the original ones required
further analysis and re-stylization. Finally, 640 mono-tonal accents formed 6 clusters
and 580 bi-tonal accents formed another 6 clusters. The results of clustering
confirmed the correctness of the stylization rules.
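
To make the evaluation step concrete, the sketch below computes a normalized mean square error between an original and a stylized F0 contour sampled at the same time points. The abstract does not state which normalization was used; dividing by the energy of the original contour is an assumption, and the function name `nmse` is hypothetical.

```python
def nmse(original, stylized):
    """Normalized mean square error between two F0 contours.

    Assumed definition: sum of squared differences divided by the sum of
    squared original values. Both contours must share the same sampling.
    """
    if len(original) != len(stylized):
        raise ValueError("contours must have the same number of samples")
    error = sum((o - s) ** 2 for o, s in zip(original, stylized))
    energy = sum(o ** 2 for o in original)
    if energy == 0:
        raise ValueError("original contour has zero energy")
    return error / energy
```

A value of 0 means the stylized contour reproduces the original exactly; larger values indicate a coarser approximation.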
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
7 of 22 |
Automatic Pitch Stylization Enhanced with Top-Down Processing
AUTHOR(S):
Wypych, Mikolaj; IFTR, Polish Academy of Sciences
Abstract:
In the article an original method of pitch stylization from speech waveform and its
orthographic transcript is presented. In addition to bottom-up data processing,
the top-down step is employed. The top-down step allows for the reduction of contextual
variability of intonational structure constituents. Software implementation
of the stylization method for the Polish language is described. The design takes
advantage of components borrowed from an existing automatic intonation recognizer.
Fundamental frequency extraction in the design is performed using a comb
filter. In a subsequent stage, a syllable-wise pitch stylization is performed, followed
by contextual pitch tracking. Intonational structure is recognized by an intonational
parser based on Hidden Markov Models. The intonation model conveying
an annotation system is taken from the recent intonation grammar for Polish by
Jassem. Components of the design were developed in parallel which allowed for
the coordination of tradeoffs between the modules. Training set and exemplary
results are presented together with a discussion of future improvements.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
8 of 22 |
Evaluation of Pitch Detection Algorithms in Adverse Conditions
AUTHOR(S):
Kotnik, Bojan; University of Maribor
Hoege, Harald; Siemens AG
Kacic, Zdravko; University of Maribor
Abstract:
Robust fundamental frequency estimation in adverse conditions is important in
various speech processing applications. In this paper a new pitch detection algorithm
(PDA) based on the autocorrelation of the Hilbert envelope of the LP
residual is compared to another well established algorithm from Goncharoff. A
set of evaluation criteria is collected on which the two PDAs are compared.
In order to evaluate the algorithms in adverse conditions, a suitable reference
database was constructed. This reference database consists of parts of the Spanish
SPEECON speech database where recordings of 60 speakers were selected and
manually pitch marked. The recordings cover several adverse conditions, such as noise
in the car cabin and reverberation in office rooms. The evaluation highlights the
good performance of the new algorithm in comparison but shows that low SNR
conditions and strong reverberation are still a demanding challenge for future pitch
detection algorithms.
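
For readers unfamiliar with autocorrelation-based pitch detection, the sketch below estimates F0 from the autocorrelation peak of a single frame. It is a deliberately simplified stand-in: it operates on the raw signal rather than on the Hilbert envelope of the LP residual used by the algorithm in this abstract, and the function name, search range, and test signal are assumptions.

```python
import math

def autocorr_f0(frame, sample_rate, f0_min=50.0, f0_max=400.0):
    """Estimate F0 (Hz) as sample_rate / lag of the autocorrelation maximum."""
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    if min_lag < 1 or min_lag > max_lag:
        raise ValueError("frame too short for the requested F0 range")
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag

# Synthetic 200 Hz sine sampled at 8 kHz (50 ms frame).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
estimate = autocorr_f0(frame, sample_rate)
```

On clean periodic input the peak falls at the true period; noise and reverberation blur that peak, which is the difficulty the evaluation above addresses.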
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
9 of 22 |
A General Approach for Automatic Extraction of Tone Commands in the Command-Response Model for Tone Languages
AUTHOR(S):
Gu, Wentao; The University of Tokyo
Hirose, Keikichi; The University of Tokyo
Fujisaki, Hiroya; The University of Tokyo
Abstract:
Although the command-response model for the process of F0 contour generation
has been successfully applied to many languages, the inverse problem, viz., automatic
derivation of the model parameters from an observed F0 contour, is more
challenging, especially for tone languages which have both polarities of tone commands.
Since the polarity of tone commands cannot be inferred directly from the
F0 contour itself, the information on tone identity and timing needs to be incorporated.
The current study gives a general approach for the first-order estimation
of tone command parameters for tone languages, taking Mandarin and Cantonese
as two examples. After a rule-based recognition of the tone command patterns
within each syllable, the timing and amplitude of tone commands will be deduced.
The experiments show that the method gives good results of analysis for both
dialects.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
10 of 22 |
Comparison of Tonal Co-articulation between Intra- and Inter-word Disyllables in Mandarin
AUTHOR(S):
Wang, Xiaodong; Department of Electronic Engineering
Gu, Wentao; Department of Information and Communication Engineering
Hirose, Keikichi; Department of Information and Communication Engineering
Sun, Qinghua; Department of Electronic Engineering
Minematsu, Nobuaki; Department of Frontier Informatics
Abstract:
Features of tonal co-articulation in Mandarin speech are studied. Though several
previous works investigated how prosodic features of syllables are affected by surrounding
syllables, most of them selected nonsense syllable sequences as speech
material without specific consideration on word boundary. In the present study,
however, a comparison is given on tonal co-articulation between intra-word and
inter-word cases. The speech material is designed so that, in each pair of sentences, the target
disyllables share exactly the same tonal context but differ in the position of the word
boundary, which falls either at the beginning or in the middle of the target. Mean F0 and F0
range are adopted as prosodic features of each syllable, and the differences in mean F0
between the second and the first syllables of the target are calculated and compared
across sentence pairs. Analysis of 16 disyllabic tone combinations shows that the effect of
word boundary location on tonal co-articulation differs depending on the
tone combination.
|
|
| Poster Session 2: Analysis and Formulation of Prosody |
11 of 22 |
Alignment of Medial and Late Peaks in German Spontaneous Speech
AUTHOR(S):
Niebuhr, Oliver; Institute of Phonetics and Digital Speech Processing, University
of Kiel
Ambrazaitis, Gilbert; Center for Languages and Literature, Lund University
Abstract:
Starting from a corpus of German spontaneous speech, the phonetic realisations
of the two KIM categories medial and late peak were investigated in prenuclear
position. The results show that, for both categories, the onset of the rising F0
movement (L) is comparably aligned around the accented-syllable onset, whereas
the F0 maximum (H) is independently aligned and predominantly located before
the accented-syllable offset or after the onset of the following unaccented syllable,
respectively. The data further suggest that also from the AM point of view the
two prenuclear rises are different at the phonological level. Finally, the possibility
is pointed out that the alignment patterns found for prenuclear rises in other
studies are to some extent due to a combination of categories like the medial and
late peak.
| | | Poster Session 2: Analysis and Formulation of Prosody | 12 of 22 | Emotional, linguistic or just cute? The function of pitch contours in infant- and foreigner-directed speech AUTHOR(S):
Knoll, Monja; University of Portsmouth
Uther, Maria; University of Portsmouth
MacLeod, Norman; The Natural History Museum
O'Neill, Mark; The Natural History Museum
Walsh, Stig; The Natural History Museum
Abstract:
Infant-directed speech (IDS) is characterised by acoustic modifications to adult-directed
speech (ADS) including increased pitch, emotional affect and pitch contour
exaggeration. Pitch contour function in IDS has not been determined, but
may be important for emotional expression, gaining attention or have a linguistic
role. Here two algorithmic approaches (DAISY and Eigenshape analysis) were
used to analyse pitch contour shape in three speech recipient groups, with human
raters as a qualitative comparison. Speech samples of target words in ten mothers
were recorded while they talked to their infants and to a British- (control) and
foreign adult confederate (linguistic condition). 167 pitch contours were extracted
and converted to a standard format for the three approaches. Results indicate
that IDS mostly contains exaggerated contours; FDS and ADS possess mainly
flat curves. These results suggest an attentional-emotional role for the IDS pitch
contours.
| | | Poster Session 2: Analysis and Formulation of Prosody | 13 of 22 | Tone Ratios Combined with F0 Register in Cantonese as Speaker-dependent CharacteristicAUTHOR(S):
Li, Yujia; The Chinese University of Hong Kong
Abstract:
F0 is considered to provide speaker-specific information to some extent. Based on
the wide agreement that extrinsic F0 is helpful for speaker identity, this paper
investigates the possibility of using both extrinsic and intrinsic features of
the Cantonese tone system as speaker-dependent characteristics. Considering the special
characteristics of the Cantonese tone system, relative tone ratios and F0 register
are proposed to model the tone systems generated by different speakers. The investigation
is carried out through both recognition and analysis. The results primarily
show the potential of using such features for speaker characterization.
| | | Poster Session 2: Analysis and Formulation of Prosody | 14 of 22 | Functional-oriented articulatory modeling of tones and intonationsAUTHOR(S):
Prom-on, Santitham; Department of Computer Engineering, King Mongkut's University
of Technology Thonburi, Thailand
Xu, Yi; Department of Phonetics and Linguistics, University College London, UK
Thipakorn, Bundit; Department of Computer Engineering, King Mongkut's University
of Technology Thonburi, Thailand
Abstract:
In this paper we report results of applying the quantitative target approximation
model (qTA) to simulate function-specific F0 contours in Mandarin. The qTA
model is based on a set of assumptions about the biophysical and neural control
mechanisms of pitch production. To simulate F0 contours for tone and focus, we
extracted qTA parameters that are tone-specific and adjustment parameters that
are focus-specific. The accuracy and effectiveness of this approach were tested
through a series of synthesis experiments. In the baseline case, the results were
fair with just tonal specifications. Further experiments showed additional improvements
when the parameters became more function-specific.
| | | Poster Session 2: Analysis and Formulation of Prosody | 15 of 22 | Analysis and Modelling of Question Intonation in American EnglishAUTHOR(S):
Sityaev, Dmitry; Toshiba Research Europe Ltd
Burrows, Tina; Toshiba Research Europe Ltd
Jackson, Peter; Toshiba Research Europe Ltd
Knill, Katherine; Toshiba Research Europe Ltd
Abstract:
This paper addresses the modelling in text-to-speech of the rising intonation pattern
in American English which is often found in yes-no questions. A small corpus
containing yes-no questions was recorded and analysed. F0 was then modelled using
an automatic procedure. The paper also reports on the stability of alignment
of F0 targets in rising intonation patterns.
| | | Poster Session 2: Analysis and Formulation of Prosody | 16 of 22 | A Method for Decomposing and Modeling Jitter in Expressive Speech in ChineseAUTHOR(S):
Wang, Lei; Dept. Computer Science, Tianjin University
Li, Aijun; Institute of Linguistics, Chinese Academy of Social Sciences
Fang, Qiang; Institute of Linguistics, Chinese Academy of Social Sciences
Abstract:
Jitter is considered one of the most crucial factors in synthesizing
natural emotional speech. Unlike the traditional methods of measuring jitter
in emotional speech, this paper proposes that the jitter in speech can be
decomposed into two parts, namely deterministic jitter and random jitter.
Deterministic jitter is associated with identifiable causes, possibly the effect
of emotional state, while random jitter results from random events that have
nothing to do with emotion. In addition, two different methods of modeling the
jitter distribution are described: jitter decomposition is based on the fact that the
mixed jitter can be divided into a deterministic part and a random part, while the
algorithm based on GMM tries to simulate the shape of the histogram of the jitter
distribution. The results provide a qualitative analysis of the two methods. Much
work remains to be done in the future to analyse them in more detail
and quantitatively.
| | | Poster Session 2: Analysis and Formulation of Prosody | 17 of 22 | Intensity as a macroprosodic variable in CzechAUTHOR(S):
Duběda, Tomáš; Institute of Phonetics, Charles University in Prague
Abstract:
The present paper provides an acoustic description of macrointensity patterns of
stress units (prosodic words) in read Czech, as reflected by the intensity of syllable
nuclei. Normalized intensity values show that there is a gradual macrodynamic
decrease over the inter-pause group, followed typically by a significant intensity
reset. Local intensity drops occur between the last two syllables of stress units;
in addition, there is a major intensity drop before the pause. Syllables bearing
perceived accents do not show intensity peaks.
| | | Poster Session 2: Analysis and Formulation of Prosody | 18 of 22 | How far can prosodic cues help in word segmentation?AUTHOR(S):
Bartkova, Katarina; France Telecom
Abstract:
Prosodic cues are of great importance in parsing the speech signal into prosodic and
lexical units. Automatic speech recognition systems try to use prosodic parameters
to detect boundaries of prosodic units and thus help the acoustic decoding process.
Although the automatic detection of major prosodic boundaries is most of the time
reliable, minor boundary detections are prone to error. A deeper understanding
of the prosodic parameters in spontaneous speech would improve their modeling
and their use by automatic systems. This study analyses filled and silent pause
occurrences and two prosodic parameters, duration of pauses and vowels and F0
slopes, measured on a spontaneous speech corpus in French. The results of the
analysis revealed that a simple local comparison of the parameter values with the
values measured in the vicinity of the segment under consideration can provide
valuable information on the lexical boundaries as well as on prosodic patterns of
the lexical units.
| | | Poster Session 2: Analysis and Formulation of Prosody | 19 of 22 | Acoustic Features of Japanese Vowel-Vowel Hiatus at Prosodic BoundariesAUTHOR(S):
Kitazawa, Shigeyoshi; Shizuoka University
Abstract:
We investigated V-V hiatus through J-ToBI labeling and listening to whole phrases
to estimate degree of discontinuity and, if possible, to determine the exact boundary
between two phrases. Appropriate boundaries were found in most cases at the
maximum perceptual score. Using the open quotient (OQ) from electroglottography (EGG),
pitch marks and spectrograms, the acoustic-phonological features of
V-V hiatus were found to be phrase-initial glottalization and phrase-final nasalization,
as well as phrase-final lengthening and phrase-initial shortening of the morae. A
small F0 dip observable at the boundary of V-V hiatus was found to be a universal
indication of glottalization. The test materials are taken from the "Japanese
MULTEXT" and consist of particle - vowel (36), adjective - vowel (5), and word
- word (4) sequences.
| | | Poster Session 2: Analysis and Formulation of Prosody | 20 of 22 | Secondary Association of Tones in Castilian SpanishAUTHOR(S):
Face, Timothy; University of Minnesota
Abstract:
This paper considers the role of secondary association of tones in Castilian Spanish.
Recent studies have shown that Castilian Spanish has three contrasting bitonal
rising pitch accents, posing a problem for standard Autosegmental-Metrical theory,
which allows only a binary distinction between L*+H and L+H*. It is argued that
this three-way contrast can be accounted for in a principled and constrained way
if pitch accent tones can have secondary associations to metrical units much the
same way that edge tones have been proposed to have secondary associations in
several languages. In this way, primary association results in the association of
the strong tone (or head) of the pitch accent with the tone-bearing unit, while
secondary association more directly affects phonetic alignment. It is argued that
secondary association of edge tones also exists in Castilian Spanish and is able
to explain two pitch range effects that have been observed, but not explained, in
previous analyses.
| | | Poster Session 2: Analysis and Formulation of Prosody | 21 of 22 | L-tone affixation: Evidence from German dialectsAUTHOR(S):
Kügler, Frank; Institut für Linguistik
Abstract:
In a comparison of the tonal grammars of two German dialects, Swabian and
Upper Saxon German, we observe a particular type of intonation contour that
is similar in surface form, yet differs phonologically. Phonetically, the contour's
shape is rising-falling; phonologically, the Swabian contour reads as L*H +L 0%,
and the one of Upper Saxon as L+ H*L 0%. Both contours are marked ones, and
arise through a process that we call L-affixation, which is indicated by the '+'
diacritic. Both contours share a similar semantico-pragmatic meaning, i.e. they
express narrow focus. An alternative interpretation of the postnuclear low tone in
Swabian as a phrase accent is rejected.
| | | Poster Session 2: Analysis and Formulation of Prosody | 22 of 22 | Rhythmic factors in weak-syllable insertion: An internet corpus studyAUTHOR(S):
Quené, Hugo; Utrecht University
Abstract:
Dutch language users often insert an inflectional schwa after an adverb, in certain
grammatical constructions. The main hypothesis here is that this insertion, which
is often ungrammatical, is driven by speakers' tendency towards regular speech
rhythm, which overrides the fine grammatical nuances conveyed by absence of
inflection. This rhythmicity hypothesis was investigated in a huge text corpus,
viz. all web pages written in Dutch. The proportion of weak-syllable insertion
was obtained for a sample of test phrases, varying in rhythmic context around the
insertion point. Logistic regression of these proportions shows large and significant
effects of rhythmic context on the odds of weak-syllable insertion. Hence, this
insertion may well be due to rhythmical factors in speech production, in addition
to lexical-grammatical factors.
| |
Special Session 2 (SPS 2)
Audio-Visual Prosody Processing
Organizers: Marc Swerts, Denis Burnham and Sascha Fagel
Tuesday, May 2, 16:00 - 18:00
|
| Special Session 2: Auditory-Visual Prosody Processing | 1 of 6 | Measuring and modeling audiovisual prosody for animated agentsAUTHOR(S):
Granström, Björn; Center for Speech Technology, KTH
House, David; Center for Speech Technology, KTH
Abstract:
Understanding the interactions between visual expressions, dialogue functions and
the acoustics of the corresponding speech presents a substantial challenge. The
context of much of our work in this area is to create an animated talking agent
capable of displaying realistic communicative behavior and suitable for use in conversational
spoken language systems, e.g. a virtual language teacher. In this
presentation we will give some examples of recent work, primarily at KTH, involving
the collection and analysis of a database for audiovisual prosody. We will
report on methods for the acquisition and modeling of visual and acoustic data,
and provide some examples of analysis of head nods and eyebrow settings.
| | | Special Session 2: Auditory-Visual Prosody Processing | 2 of 6 | Hearing and Seeing Beats: The influence of visual beats on the production and perception of prominenceAUTHOR(S):
Krahmer, Emiel; Tilburg University
Swerts, Marc; Tilburg University
Abstract:
Speakers can employ a variety of means to indicate that a word is important,
including pitch accents and visual cues such as manual gestures, head nods and
eyebrow movements (collectively referred to as visual beats). In this paper, we look
at the relation between visual and auditory cues for prominence, based on data
collected with an original experimental paradigm in which speakers were instructed
to realize a particular target sentence with different distributions of auditory and
visual cues. The first experiment revealed that visual beats have a significant
effect on the spoken realization of the target words. When a speaker produces
a visual beat, the word uttered simultaneously is produced with relatively more
spoken emphasis, irrespective of the position of the auditory accent. The second
experiment showed that when participants see a speaker realize one of these beat
gestures on a word, they perceive this word as more prominent than when they
do not see the beat gesture.
|
| Special Session 2: Auditory-Visual Prosody Processing | 3 of 6 | Manipulating Uncertainty: The contribution of different audiovisual prosodic cues to the perception of confidence AUTHOR(S):
Dijkstra, Christel; Tilburg University
Krahmer, Emiel; Tilburg University
Swerts, Marc; Tilburg University
Abstract:
When answering factual questions, speakers can signal whether they are uncertain
about the correctness of their answer using prosodic cues such as fillers ("uh"), a
rising intonation contour or a marked facial expression. It has been shown that on
the basis of such cues, observers can make adequate estimates about the speaker's
level of confidence, but it is unclear which of these cues have the largest impact
on perception. To find the relative strength of the three aforementioned cues,
a novel perception experiment was performed in which answers were artificially
manipulated in such a way that all possible combinations of the cues of interest
could be judged by participants. Results showed that while all three factors had
a significant influence on the perception results, this effect was by far the largest
for facial expressions.
| |
| Special Session 2: Auditory-Visual Prosody Processing | 4 of 6 | Visual Correlates of Prosodic Contrastive Focus in French: Description and Inter-Speaker VariabilityAUTHOR(S):
Dohen, Marion; Institut de la Communication Parlée / Human Information Science
Research Labs - ATR
Lœvenbruck, Hélène; Institut de la Communication Parlée
Hill, Harold; Human Information Science Research Labs - ATR
Abstract:
This study is a follow-up of previous studies we conducted on the visible articulatory
correlates of French prosodic contrastive focus. A two-speaker analysis
using an automatic lip-tracking device had shown that these correlates existed
and were used in visual perception. However, the articulatory strategies depended
on the speaker. The purpose of this study was thus to extend the analysis to
other speakers, examine the similarities and variabilities and try to identify global
tendencies. We recorded five speakers of French with a 3D optical tracker using
a 13 sentence (subject-verb-object) corpus and four focus conditions (S, V, O or
neutral). An articulatory analysis confirmed that visible articulatory correlates
exist for all the speakers. The strategies used are mainly of two types: absolute
and differential. An analysis of other facial movements showed that an eyebrow
raising and/or a head nod can signal focus. This association is however highly
inter- and intra-speaker dependent.
| | | Special Session 2: Auditory-Visual Prosody Processing | 5 of 6 | Audio and Audio-visual Effects of a Short English Emotional Sentence on Japanese L2's and English L1's Cognition, and Physio-acoustic CorrelateAUTHOR(S):
Isei-Jaakkola, Toshiko; The University of Tokyo
Sun, Qinghua; The University of Tokyo
Hirose, Keikichi; The University of Tokyo
Abstract:
The cognition test results of audio (A) and audio-visual (AV) effects on nine English
emotions in a short sentence were compared to the physio-acoustic features
of sound used for the cognition tests. Two groups of Japanese learners of English
(JL2) and one group of English speakers (EL1) participated in these A and AV
cognition tests. In the physio-acoustic analyses we used F0 and intensity contours
and calculated the area of sentential patterns and three forms of distance: area-,
average, and pattern-distance for each emotion. It was found that the order of
the correct answer ratios using dialogues, a short statement, and a word, was:
dialogues > short statement > word in A, and word > short statement in AV. The
relationships between these cognition tests and physio-acoustic analyses confirmed
that although there was not high correlation between them, intensity seems to be
more correlated to the cognition test results for audio by both JL2 and EL1 than
F0.
| | | Special Session 2: Auditory-Visual Prosody Processing | 6 of 6 | Emotional McGurk EffectAUTHOR(S):
Fagel, Sascha; Technical University Berlin
Abstract:
Speaking is a physiological process that manifests in the acoustic and in the optic
domain and hence is audible and visible. These two modalities influence each
other in perception. Under normal circumstances the speech information in both
channels is coherent and complementary and integrated to a percept. But if the
information is conflicting and nevertheless integrated then the percept in one of
the modalities might be changed by the other modality. The experiment described
here shows that when the video of an utterance spoken in one emotion is dubbed
with the audio of the utterance spoken in another emotion the perceived emotion
might be a third - neither present in the auditory nor in the visual modality.
| |
Oral Session 1 (OS 1)
Prosodic Variability
Wednesday, May 3, 09:45 - 11:25
Chair: Gösta Bruce
|
| Oral Session 1: Prosodic Variability | 1 of 5 | Pronunciation Variant Selection for Spontaneous Speech Synthesis - A Summary of Experimental Results AUTHOR(S):
Werner, Steffen; Dresden University of Technology
Hoffmann, Rüdiger; Dresden University of Technology
Abstract:
To make synthesized speech more natural and colloquial the regularity of synthesized
speech has to be overcome and spontaneous speech effects have to be
integrated into the synthesis process. In a first step towards spontaneous speech
we introduced different duration control methods in speech synthesis. In this paper
we summarize the results of previous work on changing the speaking rate indirectly
by controlling the grapheme-to-phoneme conversion through different pronunciation
variant selection algorithms. The presented results of listening experiments
show a significant improvement in the category colloquial impression. To evaluate
the quality of the most outstanding variant selection approach compared to the
canonical synthesis, we performed a new listening test on longer speech samples.
The variant synthesis applying a pronunciation variant sequence model achieved a
significantly lower listening effort and a higher overall rating (MOS) compared to the
canonical synthesis.
| | | Oral Session 1: Prosodic Variability | 2 of 5 | Explaining cross-linguistic differences in effects of lexical stress on spoken-word recognitionAUTHOR(S):
Cutler, Anne; Max Planck Institute for Psycholinguistics
Pasveer, Dennis; Max Planck Institute for Psycholinguistics
Abstract:
Experiments have revealed cross-language differences in listeners' use of stress
information in recognising spoken words. Previous comparisons of the Spanish
and English vocabularies suggested that the differences might reflect the extent to
which considering stress in spoken-word recognition allows rejection of unwanted
competition from embedded words. This hypothesis was tested on the vocabularies
of Dutch and German, for which word recognition results resemble those from
Spanish more than those from English. The vocabulary statistics likewise revealed
that in each language, the reduction of embeddings resulting from consideration
of stress more closely resembles the reduction achieved in Spanish than in English.
| | | Oral Session 1: Prosodic Variability | 3 of 5 | Dialect Alignment SignaturesAUTHOR(S):
Ní Chasaide, Ailbhe; Phonetics and Speech Laboratory, Trinity College Dublin
Dalton, Martha; Phonetics and Speech Laboratory, Trinity College Dublin
Abstract:
This paper considers the hypothesis that dialects may have characteristic patterns
in the alignment of the melodic contour with the segmental or syllabic tiers. Peak
alignment was measured in initial prenuclear accented syllables for three dialects of
Connaught Irish: Cois Fharraige, Inis Oirr and Mayo. The size of the anacrusis
varied between two (PN2), one (PN1) and no (PN0) unstressed syllables before
the accented one. Results support the hypothesis and indicate that the fine timing
of peak alignment does differ systematically among the three dialects. In the first,
Cois Fharraige, peaks remain fixed across anacrusis conditions, being aligned to
the right edge of the accented syllable. The two other dialects reveal more variable
peak timing: Inis Oirr is moderately variable showing a tendency for the peak to
fall within the stressed vowel, but shifting rightwards to the syllable boundary
when there is no anacrusis (PN0). The Mayo dialect is extremely variable across
the prenuclear conditions. It is argued that such fine time alignment differences
may be important to the differentiation of even closely related dialects.
| | | Oral Session 1: Prosodic Variability | 4 of 5 | Emotional Prosody - Does Culture Make A Difference? AUTHOR(S):
Burkhardt, Felix; T-Systems Enterprise Services
Audibert, Nicolas; ICP - University of Stendhal, Grenoble
Malatesta, Lori; IVML - Technical University of Athens
Türk, Oytun; R&D Dept., Sestek Inc., Istanbul
Arslan, Levent; R&D Dept., Sestek Inc., Istanbul
Aubergé, Véronique; ICP - University of Stendhal, Grenoble
Abstract:
We report on a multilingual comparison study on the effects of prosodic changes
on emotional speech. The study was conducted in France, Germany, Greece and
Turkey. Semantically identical sentences expressing emotionally relevant content
were translated into the target languages and were manipulated systematically
with respect to pitch range, duration model, and jitter simulation. Perception
experiments in the participating countries showed relevant effects irrespective of
language. Nonetheless, some effects of language are also reported.
| | | Oral Session 1: Prosodic Variability | 5 of 5 | Estonian and English rhythm: a two-dimensional quantification based on syllables and feetAUTHOR(S):
Asu, Eva Liina; Institute of the Estonian Language
Nolan, Francis; University of Cambridge
Abstract:
This paper expands a recent pilot experiment on Estonian rhythm within the
quantificational approach to the study of rhythm, using the Pairwise Variability
Index (PVI). The PVI expresses the average difference between adjacent phonological
units such as vowels, consonantal intervals or syllables. It is argued here
that confining the application of the PVI to the level of the syllable (or its components)
misses the essence of Estonian rhythm and indeed of phonetic rhythm in
general, and the first experiment reported in this paper quantifies Estonian rhythm
in terms of the durational PVI of both the syllable and (innovatively) the foot. In
the second experiment, results are compared with the same measures for another
language with strong stress, English. Both languages have a similar, relatively low
foot PVI, but English has a considerably higher syllable PVI reflecting its radical
reduction of unstressed syllables in polysyllabic feet.
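As an illustration of the measure described in this abstract, the sketch below computes a raw PVI (the mean absolute difference between adjacent durations) and a rate-normalized variant that divides each difference by the mean of the pair. The abstract does not say which variant the authors apply, so both are shown as assumptions; the function names and the duration values are hypothetical. The same functions apply to syllable durations or to foot durations.

```python
from typing import Sequence


def raw_pvi(durations: Sequence[float]) -> float:
    """Mean absolute difference between successive interval durations."""
    if len(durations) < 2:
        raise ValueError("PVI needs at least two intervals")
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)


def normalized_pvi(durations: Sequence[float]) -> float:
    """Like raw_pvi, but each difference is divided by the pair's mean, times 100."""
    if len(durations) < 2:
        raise ValueError("PVI needs at least two intervals")
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical syllable durations in milliseconds for one utterance.
syllable_durations = [180.0, 90.0, 200.0, 80.0]
syllable_pvi = normalized_pvi(syllable_durations)
```

A sequence of equal durations yields a PVI of zero, so strongly reduced unstressed syllables next to full ones push the syllable-level value up, which is the pattern the abstract reports for English.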
| |
Oral Session 2 (OS 2)
Prosody in Dialogue Speech
Wednesday, May 3, 11:50 - 13:10
Chair: Mark Hasegawa-Johnson
|
| Oral Session 2: Prosody in Dialogue Speech | 1 of 4 | Intonational variation in adolescent conversational speech: rural versus urban patternsAUTHOR(S):
Fletcher, Janet; University of Melbourne
Loakes, Deborah; University of Melbourne
Abstract:
The conversational speech of ten female adolescents was analyzed intonationally
with a view to determining whether there is variation between rural and urban varieties
in Australian English. The data revealed that urban females use marginally
more 'uptalk' than their rural counterparts, as well as more sustained, level tunes.
These differences and other aspects of intonational variation are presented in terms
of a prevailing intonational model of English, and discourse annotation schema.
| | | Oral Session 2: Prosody in Dialogue Speech | 2 of 4 | The Friendliness Perception of Dialogue SpeechAUTHOR(S):
Tao, Jianhua; Institute of Automation, Chinese Academy of Sciences, Beijing
Huang, Lixing; Institute of Automation, Chinese Academy of Sciences, Beijing
Kang, Yongguo; Institute of Automation, Chinese Academy of Sciences, Beijing
Yu, Jian; Institute of Automation, Chinese Academy of Sciences, Beijing
Abstract:
The paper focuses on the analysis and perception of friendliness in dialogue speech.
To do that, the paper uses the concept of a "perception vector", which contains
information on emotions and softness. In creating the "perception vector", and
to simulate perceptual ambiguity, the paper allows the listeners to label the
speech with multiple emotions and to assign them to "one choice", "first choice" and
"second choice". The paper then performs a correlation analysis between friendliness
and "perception vectors"; the results show that friendliness is positively
correlated with "softness", "happiness" and "anger". Finally, the paper trains a
classification tree model to predict the degree of friendliness from acoustic features. With
the classification tree model, we obtain ranking scores of the acoustic parameters'
importance for perceptually synthesized speech. Results show that the F0 mean
plays the most important role in emotion perception, and that Ee is the most important
parameter related to voice quality for the perception model.
| | | Oral Session 2: Prosody in Dialogue Speech | 3 of 4 | Immediate effects of intonational prominence in a visual search task AUTHOR(S):
Ito, Kiwako; Linguistics, Ohio State University
Speer, Shari R.; Linguistics, Ohio State University
Abstract:
Studies of spontaneous speech show that speakers consistently mark contrastive
words using pitch accent. To investigate how listeners process contrastive accentual
prominence, eye-movements were monitored as participants listened to directions
and searched for ornaments to decorate holiday trees. Eye movements to target
ornament cells were earlier when intonation felicitously marked contrast on a color
adjective (e.g. First, hang the green drum → Next, hang the ORANGE drum) than
when it did not (→ orange DRUM). Felicitous emphatic accent placement induced
earlier fixations to the target compared to lack of emphasis (→ orange drum). In
addition, infelicitous use of accent on the modifier (e.g. green drum → ORANGE
ball) led to incorrect initial fixations to the preceding cell (e.g. drum) before
the noun itself was processed. These results demonstrate immediate processing
of accentual information on a modifier leading to a strong expectation about the
upcoming discourse entity.
| | | Oral Session 2: Prosody in Dialogue Speech | 4 of 4 | Spoken Dialogue System Using Recognition of User's Feedback for Rhythmic DialogueAUTHOR(S):
Fujie, Shinya; Waseda University
Miyake, Riho; Waseda University
Kobayashi, Tetsunori; Waseda University
Abstract:
A method for recognizing the user's feedback during the system's utterance is proposed
and its application to a spoken dialogue system is discussed. In human
conversation, we can infer the dialogue partner's internal state from such
feedback. Our research topics are (1) developing a feedback recognizer based on prosodic
information and (2) appropriately controlling the system's utterance timing
according to the user's feedback. The implemented recognizer can distinguish
between back-channel and ask-back word-independently, using features based on prosodic
information and a statistical recognition method. Experiments with the spoken dialogue
system equipped with this function reveal when it should generate the next utterance
after receiving the user's feedback.
| |
Poster Session 3 (PS 3)
Prosody and Speech Production
Wednesday, May 3, 14:30 - 16:00
Chair: Grażyna Demenko |
| Poster Session 3: Prosody and Speech Production | 1 of 24 | Register in Mah Meri: A preliminary phonetic analysis AUTHOR(S):
Stevens, Mary; University of Melbourne
Kruspe, Nicole; University of Melbourne
Hajek, John; University of Melbourne
Abstract:
This paper presents the results of a first phonetic investigation of register in Mah
Meri, a Southern Aslian language spoken in Peninsular Malaysia, and part of the
larger Austroasiatic family spread throughout South and Southeast Asia. Voice
register, a complex of laryngeal and supralaryngeal properties, is a common areal
feature amongst members of the Austroasiatic family (particularly the Mon-Khmer
group) but has never previously been reported to occur in an Aslian language. We
consider general spectral appearance, duration and f0 in order to see how well they
correlate with perceived differences in register.
| | | Poster Session 3: Prosody and Speech Production | 2 of 24 | Prosody As Marker of Discourse Segmentation in SuyáAUTHOR(S):
Oliveira, Jr., Miguel; University of Manchester
Abstract:
The present study investigates whether - as in several well-documented languages
-, prosody plays a role in the signaling of discourse segmentation in Suyá, an
Amazonian language of the Jê group. Inspired by the literature, the following
prosodic variables were selected for analysis: pause, pitch reset and boundary
tones.
| | | Poster Session 3: Prosody and Speech Production | 3 of 24 | Quantitative analysis of intonation patterns in statements and questions in Cantonese AUTHOR(S):
Ma, Joan K.-Y.; University of Hong Kong
Ciocca, Valter; University of Hong Kong
Whitehill, Tara L.; University of Hong Kong
Abstract:
The aim of this study was to investigate intonation patterns in Cantonese using
a quantitative approach. The command-response model was employed to explore
the differences between intonations, and the effects of lexical tone on fundamental
frequency contours of intonation. Two intonation types, with six tonal contrasts
embedded at the final position, were collected from twelve native Cantonese
speakers. Results showed that F0 in questions was raised for the entire utterance,
which was mainly associated with baseline frequency changes. An additional positive
boundary tone command occurred towards the end of the final syllable of
questions, which denoted the final-rise in F0 in questions. A lengthened duration
of the tone command towards the end of questions was also observed. The amplitude
of the final-rise in the contours of questions was affected by the tone of
the final syllable, with significantly higher amplitude noted for the boundary tone
command of tones 25 and 21.
| | | Poster Session 3: Prosody and Speech Production | 4 of 24 | Interaction between the Scottish English System of Prominence and Vowel Length AUTHOR(S):
Gordeeva, Olga; Speech Science Research Centre, Queen Margaret University College,
Edinburgh
Abstract:
This study looks into interaction between the quasi-phonemic vowel length contrast
in Scottish English and its word-prosodic system. We show that under the same
phrasal accent the phonetically short vowels of the morphologically conditioned
quasi-phonological contrast are produced with significantly more laryngeal effort
(spectral balance) than the long ones, while the vowels do not differ in quality,
overall intensity or fundamental frequency. This difference is explained by employing
the concept of "functional load". Duration must be kept short to mark the
short vowel length, while both word-stress and phrasal accent require lengthening.
Therefore, the additional laryngeal effort in the short vowels serves a prominence-enhancing
function. This finding supports the hypothesis proposed by Beckman
that phonological categories of word-prosodic systems featuring "stress-accent" are
not necessarily phonetically uniform language-internally.
| | | Poster Session 3: Prosody and Speech Production | 5 of 24 | Preliminary Results of Prosodic Effects on Domain-initial Segments in Hamkyeong Korean AUTHOR(S):
Kim, Sung-A; Dong-A University, Korea
Abstract:
This paper investigates the domain-initial strengthening in English and Hamkyeong
Korean, a pitch accent dialect spoken in the northern part of North Korea. The
question addressed in the present study is whether the domain-initial strengthening
effect is observed at the domain-initial vowels as well as domain-initial consonants.
In the experiment, durations of initial-syllable vowels in various prosodic domains
were compared to those of second vowels in real-word tokens for both languages.
Hamkyeong Korean, like English, turned out to strengthen the domain-initial consonants.
With regard to vowel durations, we found no significant prosodic effect
in English. Yet, Hamkyeong Korean showed significant differences between durations
of initial and non-initial vowels in the prosodic domains. The findings in
the study are theoretically important as they show that the potentially-universal
phenomenon of initial strengthening is subject to language specific variations in
its implementation.
| | | Poster Session 3: Prosody and Speech Production | 6 of 24 | Syntax and syllable count as predictors of French tonal groups: Drawing links to memory for prosody AUTHOR(S):
Gilbert, Annie C.; Université de Montréal
Boucher, Victor J.; Université de Montréal
Abstract:
While the role and origin of prosodic structures remain unclear, there is evidence
that prosody bears an intriguing relationship with serial memory processes and
grouping effects. This link is seen in the fact that the recall of presented prosodic
patterns and their production in speech are both restricted in terms of a syllable
count. The present experiment complements previous studies by examining the
effects of syntactic structure as opposed to constituent length on produced tonal
groups. Forty subjects produced, in quasi-spontaneous conditions, given utterances
with differing NP, VP structures or differing lengths. The results show that
constituent length is the major predictor, whereas syntactic structure appears as
a secondary factor.
| | | Poster Session 3: Prosody and Speech Production | 7 of 24 | Articulatory Strengthening and Prosodic Hierarchy AUTHOR(S):
Cao, Jianfeng; Institute of Linguistics, Chinese Academy of Social Sciences
Zheng, Yuling; Institute of Ethnology and Anthropology, Chinese Academy of Social
Sciences
Abstract:
This paper reports a set of results based on spectral and EPG measurements
of read-speech corpora in Mandarin Chinese, aimed at observing the relationship
between articulatory strengthening and the prosodic hierarchy. The data
obtained from both acoustic and physiological measurements indicate that the articulatory
manifestation of any segment in real speech is closely related to its
prosodic position or status in connected speech. It therefore becomes possible to
predict the hierarchical organization of speech prosody from the degree of such
articulatory strengthening. At the same time, this evidence further reveals the existence
of anticipatory planning in speech production. Consequently, our finding
should not only benefit Chinese speech processing but also provide a
new perspective on the mechanism of speech production in general.
| | | Poster Session 3: Prosody and Speech Production | 8 of 24 | Articulatory and acoustic correlates of prenuclear and nuclear accents AUTHOR(S):
Mücke, Doris; IfL Phonetik, University of Cologne
Grice, Martine; IfL Phonetik, University of Cologne
Becker, Johannes; IfL Phonetik, University of Cologne
Hermes, Anne; IfL Phonetik, University of Cologne
Baumann, Stefan; IfL Phonetik, University of Cologne
Abstract:
We investigate acoustic and articulatory anchors for F0 targets corresponding to
prenuclear and nuclear accent peaks in German, both across two different articulation
rates and across two different syllable structures. We found that the alignment
of turning points in the F0 signal with minima and maxima in the kinematic signal
was more stable than with segment boundaries in the acoustic signal. Whereas in
Dutch the H peak of a rising prenuclear (L*+H) accent has been shown to occur
at the edge of the accented syllable, in German the peak occurs during the vowel
in the postaccented syllable. In articulatory terms, the peak aligns with articulatory
gestures corresponding to the vowel. Like in English and Dutch, nuclear
peaks in German are aligned earlier in the acoustic signal than prenuclear ones.
The alignment of F0 peaks with the kinematic signal was highly systematic, and
can be interpreted as a shift from a gesture corresponding to a vowel to a gesture
corresponding to a consonant.
| | | Poster Session 3: Prosody and Speech Production | 9 of 24 | Prosodic Marking of Focus Domains - Categorical or Gradient? AUTHOR(S):
Baumann, Stefan; IfL-Phonetik, University of Cologne
Grice, Martine; IfL-Phonetik, University of Cologne
Steindamm, Susanne; IfL-Phonetik, University of Cologne
Abstract:
This paper reports on a production experiment in German eliciting focus domains
of various sizes, ranging from broad to narrow focus, as well as contrastive focus.
Results show that speakers use categorical as well as gradient prosodic means to
indicate different focus structures, with an increase of prominence-lending cues
as the focus domain narrows. Contrast is shown to enhance certain differences
between narrow and broad focus. There is a clear indication that speakers differ
considerably as to the combination of strategies they employ for marking focus
structure.
| | | Poster Session 3: Prosody and Speech Production | 10 of 24 | L tone downtrends in Korean across utterance types AUTHOR(S):
Kim, Kyung-hee; IfL-Phonetik, University of Cologne
Abstract:
Research on global pitch trends has shown that statements and different types
of questions in Dutch all display distinct patterns, and suggests that these may
be influenced by the presence of accentual prominence on wh-words and whether
syntactic cues to interrogativity are present. This implies that there would be
different pitch trends in a language such as Korean which lacks accentual prominence
and which does not have to have an interrogative syntax in unmarked yes-no
questions. We test this implication by comparing the results in [11] with similar
statements and question types in Korean, concentrating in this paper on the scaling
of L tones. Further, we differentiate between the pitch trends towards the end
of the utterances and those in the rest of the utterance, so as to investigate the
contribution of final lowering to the shape of global trends.
| | | Poster Session 3: Prosody and Speech Production | 11 of 24 | The domain of realization of the L-phrase tone in American English AUTHOR(S):
Barnes, Jonathan; Boston University
Shattuck-Hufnagel, Stefanie; MIT
Brugos, Alejna; Boston University
Veilleux, Nanette; Simmons College
Abstract:
The phonetic realization of intonational targets in the f0 contour is not always
straightforwardly predicted by their affiliations in the segmental string, and the
phrase tones of American English are a type of target for which several hypotheses
about the domain of realization have been advanced. By varying the metrical
structure of target words at the end of a phrase produced with the H* L- H%
'surprised dismay' contour, we determined that a) the right edge of the L-, signaled
by the beginning of the rise for the H%, occurs close to the right edge of the phrase,
b) the left edge of the L-, signaled by the end of the fall from the H*, stretches
leftward to seek a prominent syllable, and c) there is significant variation in the
resolution of the various factors that influence these two inflection point locations.
| | | Poster Session 3: Prosody and Speech Production | 12 of 24 | Prosodic Encoding of Topic and Focus in Mandarin AUTHOR(S):
Wang, Bei; University of Potsdam
Xu, Yi; University College London & Haskins Laboratories
Abstract:
In this study, we investigate whether and how focus and topic can be separately
encoded in Mandarin. A total of 60 sentences with three lengths and five tone
combinations were recorded in four topic-focus conditions: initial focus, new topic,
implicit topic and given topic, by six speakers. The results of acoustic analysis
show that new topic is encoded with a raised pitch range on the initial word.
Focus, in contrast, is encoded with an expanded pitch range on the focused word
and a suppressed pitch range on the subsequent words.
| | | Poster Session 3: Prosody and Speech Production | 13 of 24 | Contextual Tonal Variations and Pitch Targets in Cantonese AUTHOR(S):
Wong, Ying Wai; The Chinese University of Hong Kong
Abstract:
With Cantonese as the target language, this study investigates the phonetic details
of contextual tonal variations in disyllabic tonal sequences. It was found that the
main source of F0 (fundamental frequency) contour deviation from the canonical
form comes from carryover effect, which is assimilatory in nature. Furthermore,
based on the Target Approximation (TA) model, an optimization problem was
formulated as an attempt to unveil mathematically pitch targets of the six lexical
tones in Cantonese. Finally, implications of our results on tone production and
perception are discussed.
| | | Poster Session 3: Prosody and Speech Production | 14 of 24 | Realization of Cantonese Rising Tones under Different Speaking Rates AUTHOR(S):
Wong, Ying Wai; The Chinese University of Hong Kong
Abstract:
The two Cantonese rising tones, high-rising and low/mid-low rising tones, were
found to maintain their distinct slopes of F0(fundamental frequency)-rise and offset
F0 under different speaking rates. This suggests the two as possible acoustic cues
for rising tone discrimination. The rising contours, under whichever speaking
rate, reside in an area temporally near the syllable offset. Furthermore, through
tests with different alignment methods, the rising contours were found to show
the most significant overlap when aligned with the offset of the host syllable. Finally,
discussions on characterization of rising tones within the Target Approximation
(TA) model are presented.
| | | Poster Session 3: Prosody and Speech Production | 15 of 24 | Thai tonal contrast under changes in speech rate and stress AUTHOR(S):
Nitisaroj, Rattima; Georgetown University
Abstract:
This study investigates how the five lexical tones in Thai are realized on primary-,
secondary-, and unstressed syllables produced at fast, normal and slow rate. The
results revealed that 1) speech rate does not have any significant effect on F0
height, excursion size and F0 peak and valley location of Thai tones, 2) tones on
primary-stressed syllables have a larger excursion size than those on secondary- and
unstressed syllables, and 3) the five-way tonal contrast in the language is
maintained regardless of changes in speech rate and stress.
| | | Poster Session 3: Prosody and Speech Production | 16 of 24 | Rate sensitivity of syllable in French: a perceptual illusion? AUTHOR(S):
Pasdeloup, Valérie; Université de Rennes 2 & LPL, UMR 6057 CNRS, Université
d'Aix-en-Provence
Espesser, Robert; LPL, UMR 6057 CNRS, Université d'Aix-en-Provence
Faraj, Malika
Abstract:
This study takes place within the framework of Gestalt theory. The aim of this
work is to determine the way the prosodic scene reorganises itself according to the
variation of speech rate. How do the forms constituted by stressed syllables interact
with the ground of unstressed syllables? We present a study of the temporal
structure of a one thousand word speech corpus. The corpus was produced at
three different rates (normal, fast and slow) by one speaker with two repetitions.
The goal is to constrain the rhythmical structure of speech in order to observe
how rhythmic patterns depend on the variation of speech rate. Results show that
rhythm is not elastic. When speech rate changes, syllabic duration does not vary
in the same way for stressed and for unstressed syllables. Unstressed syllables
have very little elasticity compared with stressed syllables. This result supports
the hypothesis that the unstressed syllable is an anchor point in the rhythmic
structure of French.
| | | Poster Session 3: Prosody and Speech Production | 17 of 24 | Production of word stress in German: Children and adults AUTHOR(S):
Schneider, Katrin; Institute for Natural Language Processing, University of Stuttgart
Möbius, Bernd; Institute for Natural Language Processing, University of Stuttgart
Abstract:
This study investigates the acoustic correlates of contrastive word stress in bisyllabic
and trisyllabic German words, produced by children and their parents.
Results of the acoustic analysis of speech data are reported that were collected
from three children aged 2;3 to 6;1 and their mothers during a period of two years.
The results suggest that German children between 2 and 6 years of age are able to
produce contrastive word stress but differ in their choice and usage of the parameters
that mark stress. We found that, for German, vowel duration is the most
reliable correlate of word stress in the utterances produced by all three children as
well as their mothers. Adult-like usage of fundamental frequency, intensity, and
several voice quality parameters appears to be acquired later than that of duration;
this observation may be confounded by the finding that these parameters appear
to be used less consistently than duration to mark stress even by the mothers.
| | | Poster Session 3: Prosody and Speech Production | 18 of 24 | Stress and Accent in Catalan and Spanish: Patterns of duration, vowel quality, overall intensity, and spectral balance AUTHOR(S):
Prieto, Pilar; ICREA-UAB
Ortega-Llebaria, Marta; University of Texas-Austin
Abstract:
This article is concerned with the acoustic correlates that characterize stress and
accent in Catalan and Spanish. We analyzed four acoustic correlates of stress
(syllable duration, vowel quality, overall intensity, and spectral balance) in stressed
and unstressed syllables in both accented and unaccented positions. Given that
Spanish and Catalan differ greatly in their use of vowel reduction to mark stressed
positions, we test whether they will also differ in the way they use the other acoustic
correlates to signal the presence of stress and accent. Along with the findings of
Sluijter & collaborators (1996, 1997) and Campbell & Beckman (1997) on Dutch
and English, Catalan and Spanish reveal systematic differences in the acoustic
characterization of stress and accent. Specifically, while syllable duration, vowel
quality, and spectral tilt are reliable acoustic correlates of the stress difference in
both languages, accentual differences are acoustically marked by overall intensity
cues.
| | | Poster Session 3: Prosody and Speech Production | 19 of 24 | Acoustic Cues of Stress and Accent in Catalan AUTHOR(S):
Astruc-Aguilera, Lluïsa; University of Cambridge (from Feb 2006, Associate Lecturer,
The Open University)
Prieto, Pilar; Universitat Autònoma de Barcelona
Abstract:
This paper examines the phonetic correlates of stress and accent in Catalan. We
analyzed five acoustic correlates of stress (syllable duration, spectral balance, vowel
quality, vowel pitch, and vowel intensity) in two stress conditions and in two accent
conditions, which is to say, in stressed and unstressed syllables in both accented
and unaccented environments (that is, appositions in sentences such as Vol la vela,
la vella '(S)he wants the sail, the old sail' vs. right-dislocated subjects in Vol la
vela, la vella '(S)he wants the sail, the old lady'). Along with the findings of Sluijter
& collaborators and Campbell & Beckman on Dutch and on English, Catalan
reveals systematic differences in the acoustic characterization along the accent
and stress dimensions. Syllable duration, spectral balance, and vowel quality are
reliable acoustic correlates of the stress differences, while accentual differences are
acoustically marked by intensity and pitch cues.
| | | Poster Session 3: Prosody and Speech Production | 20 of 24 | Boundaries and Tonal articulation in Taiwanese Min AUTHOR(S):
Pan, Ho-hsien; National Chiao Tung University
Tai, Yi-hsin; National Chiao Tung University
Abstract:
This study investigated the effect of the boundary on Taiwanese falling tones at
the domain final and domain initial positions across the intonational phrase (IP),
tone group (TW), word (WRD) and syllable (SYL) boundaries. The boundaries
were placed at the same position within sentences produced with broad focus. The
results showed that at domain-final, the f0 of falling tones decreased at a slower
rate before IP and TW boundaries than before WRD and SYL boundaries. On
the contrary, at the domain initial position, the ranking for f0 decreasing rate was
IP, then TW, then SYL, and finally WRD. It is proposed that f0 decreasing rate,
reflecting the vocal fold vibration, varies as a function of approaching and receding
boundaries. At supra-segmental levels, the velocity of f0 decrease slows down as
the approaching boundary weakens, whereas the velocity of f0 descending speeds
up as receding boundary strengthens.
| | | Poster Session 3: Prosody and Speech Production | 21 of 24 | Declination and supra-laryngeal articulation in Cantonese - EPG study AUTHOR(S):
Yuen, Ivan; Queen Margaret University College
Abstract:
Supra-laryngeal declination was reported in Italian and English. Such findings
suggest that declination is not confined to the laryngeal sub-system and its acoustic
output - F0. This paper intended to examine the supra-laryngeal articulation
and declination in Hong Kong Cantonese (a tone language) and tested whether
declination also affects supra-laryngeal articulation. In light of recent findings in
the effect of prosodic positions on articulation, it is the second goal of this paper to
investigate any interaction of prosodic positions and declination on supra-laryngeal
articulation. Results showed no supra-laryngeal declination; however, declination
interacts with prosodic positions in F0 scaling.
| | | Poster Session 3: Prosody and Speech Production | 22 of 24 | Effects of stress on intonational structure in Greek AUTHOR(S):
Baltazani, Mary; University of Ioannina
Abstract:
This paper presents the results of a production experiment that examines the
effects of stress on the realization of tonal events in the intonation of Greek. Words
in three different stress categories - final, penultimate and antepenultimate stress
- were examined in two different prosodic positions: at the edge of an intermediate
phrase and in phrase medial position. The results show that stress position affects
the alignment and scaling of tones at the edge of an intermediate phrase but not in
phrase medial position. Moreover, a phrase final word showed considerably longer
duration than the same word in phrase medial position.
| | | Poster Session 3: Prosody and Speech Production | 23 of 24 | Time-domain Noise Subtraction Applied in the Analysis of Lombard Speech AUTHOR(S):
Mixdorff, Hansjörg; TFH Berlin University of Applied Sciences
Grauwinkel, Katja; TFH Berlin University of Applied Sciences
Vainio, Martti; University of Helsinki
Abstract:
This paper presents results of the comparison between speech produced in silence
and speech in noise, also known as Lombard speech. A temporal filtering algorithm
was developed which successfully removes the ambient noise from recordings of
Lombard speech by locating and subtracting a recording of the noise performed
in the same environment. The filtering algorithm yields overall noise attenuation
between 15 and 30 dB without distorting the speech signal like spectral filtering
approaches. In the subsequent acoustic analyses we examined the effect of varying
levels of noise on vowel formants, glottal spectra and intensity. For most vowels
we found significant rises in F1 and F2, but little variation in formant bandwidth.
The overall rise in intensity between silent and 80 dB babble noise conditions was
found to be 9 dB. With growing effort, higher harmonics are boosted by up to
6 dB whereas the average speech rate only drops by 5 %.
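The idea of locating a separately recorded noise signal inside a recording and subtracting it in the time domain can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the noise reaches the microphone unchanged (no gain or room-response differences), uses plain cross-correlation to find the offset, and the function name `subtract_known_noise` is hypothetical.

```python
import numpy as np

def subtract_known_noise(recording, noise):
    """Locate `noise` inside `recording` and subtract it sample by sample.

    Both arguments are 1-D arrays at the same sampling rate, and `noise`
    must be no longer than `recording` (assumptions of this sketch).
    """
    # Slide the noise over the recording; the correlation peak marks the
    # offset at which the two signals line up best.
    corr = np.correlate(recording, noise, mode="valid")
    lag = int(np.argmax(corr))
    # Work on a float copy so the caller's array is left untouched.
    cleaned = np.array(recording, dtype=float)
    cleaned[lag:lag + len(noise)] -= noise
    return cleaned, lag
```

A practical system would also have to estimate gain and filtering differences between the reference noise and the noise as captured alongside the speech; this sketch leaves those steps out.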
| | | Poster Session 3: Prosody and Speech Production | 24 of 24 | Lombard speech: Auditory (A), Visual (V) and AV effects AUTHOR(S):
Davis, Chris; Department of Psychology, The University of Melbourne
Kim, Jeesun; Department of Psychology, The University of Melbourne & Graduate
School of Education, Sejong University
Grauwinkel, Katja; TFH Berlin University of Applied Sciences
Mixdorff, Hansjörg; TFH Berlin University of Applied Sciences
Abstract:
This study examined Auditory (A) and Visual (V) speech (speech-related head
and face movement) as a function of noise environment. Measures of AV speech
were recorded for 3 males and 1 female for 10 sentences spoken in quiet as well
as four styles of background noise (Lombard speech). Auditory speech was analyzed
in terms of overall intensity, duration, spectral tilt and prosodic parameters
employing Fujisaki model based parameterizations of F0 contours. Visual speech
was analyzed in terms of Principal Components (PC) of head and face movement.
Compared to speech in quiet, Lombard speech was louder, of longer duration,
had more energy at higher frequencies (particularly with babble speech) and had
greater mean amplitude of accent and phrase commands.
| |
Poster Session 4 (PS 4)
Syntax, Semantics, Pragmatics and Prosody
Wednesday, May 3, 14:30 - 16:00
Chair: Zdena Palková |
| Poster Session 4: Syntax, Semantics, Pragmatics | 1 of 21 | A dynamical model for generating prosodic structure AUTHOR(S):
Barbosa, Plinio; IEL/State University of Campinas
Abstract:
The performance of the Monnin-Grosjean (MG) algorithm for predicting prosodic
structure is compared with that of a system of dependency-grammar-based local
markers (the DG system). Analyses of Brazilian Portuguese paragraphs read by
five speakers reveal that the MG algorithm performs as well as the DG system
when V-to-V normalised durations at word and phrase stress boundaries are used
as indexes of prominence. These two procedures, however, have proved unsuccessful
in dealing with individual variability. To overcome such a limitation, a
dynamical model is proposed. Because it couples syntactic and regularity constraints,
the main advantage of the model is the plausible simulation of speaker variability.
Seven simulations were carried out by changing three model parameters: coupling
strength, conditional probability of phrase stress placement, and V-to-V duration
mean.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 2 of 21 | An automatic method for revising ill-formed sentences based on N-grams AUTHOR(S):
Athanaselis, Theologos; Institute for Language and Speech Processing
Bakamides, Stelios; Institute for Language and Speech Processing
Dologlou, Ioannis; Institute for Language and Speech Processing
Abstract:
A good indicator of whether a person really knows the context of language is the
ability to use the appropriate words in a sentence in the correct order. "Scrambled"
words produce meaningless, ill-formed sentences. Since the language
model is extracted from a large text corpus, it encodes the local dependencies
of words. Word order errors usually violate syntactic rules locally, and
therefore N-grams can be used to fix ill-formed sentences. This paper
presents an approach for repairing word order errors in text by reordering words
in a sentence and choosing the version that maximizes the number of trigram hits
according to a language model. The novelty of this method concerns the use of
an efficient confusion matrix technique for reordering the words. The comparative
advantage of this method is that it works with a large set of words and avoids
the laborious and costly process of collecting word order errors for creating error
patterns.
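The selection criterion described above (choose the reordering with the most trigram hits) can be sketched as follows. This sketch substitutes a brute-force search over all permutations for the confusion matrix technique mentioned in the abstract, and it replaces the language model with a plain set of trigrams collected from a toy corpus; the function names are hypothetical.

```python
from itertools import permutations

def collect_trigrams(sentences):
    """Build the set of word trigrams seen in a list of tokenised sentences."""
    known = set()
    for words in sentences:
        for i in range(len(words) - 2):
            known.add(tuple(words[i:i + 3]))
    return known

def trigram_hits(words, known_trigrams):
    """Count how many consecutive word triples of `words` are known trigrams."""
    return sum(
        tuple(words[i:i + 3]) in known_trigrams
        for i in range(len(words) - 2)
    )

def best_reordering(words, known_trigrams):
    """Return the permutation of `words` with the most trigram hits.

    Brute force: the search space grows factorially with sentence length,
    so this is only practical for short sentences.
    """
    return max(permutations(words), key=lambda p: trigram_hits(p, known_trigrams))
```

For example, with trigrams collected from the single sentence "the cat sat on the mat", calling `best_reordering("sat the on cat mat the".split(), known)` recovers the original word order, since that is the only arrangement in which all four trigrams match.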
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 3 of 21 | Focal Pitch Accents and Subject Positions in Spanish: Comparing Close-to-Standard Varieties and Argentinean Porteño AUTHOR(S):
Gabriel, Christoph; University of Osnabrück, FB 7
Abstract:
In Spanish focus can be signaled by both prosodic and syntactic means. However,
it remains controversial how these two components depend on one another.
Based on the analysis of experimental data I argue that in Spanish focus is primarily
expressed by intonation. Unlike most Spanish dialects, Argentinean Porteño
allows for a tonal distinction between neutral and contrastive focus in IP-final
position; in other positions focus is signaled through increased F0 values and/or
syllable-internal early peak alignment. In addition, reordering of constituents can
apply. Movement as a facultative strategy of focus marking is avoided in sentences
with a full DP object, but strongly preferred with a clitic object. The variation
encountered in the data is accounted for by combining Minimalist phrase structure
building with the insights of the optimality-theoretic model of overlapping
constraints.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 4 of 21 | Prosodic Realization of Information Structure Categories in Standard Chinese AUTHOR(S):
Chen, Yiya; Radboud University Nijmegen
Braun, Bettina; Max Planck Institute for Psycholinguistics
Abstract:
This paper investigates the prosodic realization of information structure categories
in Standard Chinese. A number of proper names with different tonal combinations
were elicited as a grammatical subject in five pragmatic contexts. Results show
that both duration and F0 range of the tonal realizations were adjusted to signal
the information structure categories (i.e. theme vs. rheme and background vs.
focus). Rhemes consistently induced a longer duration and a more expanded F0
range than themes. Focus, compared to background, generally induced lengthening
and F0 range expansion (the presence and magnitude of which, however, are
dependent on the tonal structure of the proper names). Within the rheme focus
condition, corrective rheme focus induced more expanded F0 range than normal
rheme focus.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 5 of 21 | Emphasis, Syllable Duration, and Tonal Realization in Standard Chinese AUTHOR(S):
Chen, Yiya; Radboud University Nijmegen
Abstract:
This study examines the durational and F0 adjustments employed to convey degrees
of emphasis in Standard Chinese (SC). Three speakers produced four lexical
tones with varied preceding and following tones. Corrective focus, with two degrees
of emphasis on the target syllable (i.e. Emphasis and More-Emphasis), was
elicited, in addition to a No-Emphasis condition (as the baseline for comparison).
Results showed a gradual increase of syllable duration: The magnitude of increase
from the No-Emphasis to the Emphasis condition and that from the Emphasis
to the More-Emphasis condition were comparable. F0 range expansion, however,
was non-gradual. While there was a robust increase of F0 range from the No-
Emphasis to the Emphasis condition, the expansion from the Emphasis to the
More-Emphasis condition was reduced. The F0 contours of the individual tones
suggest that when emphasized, tones were realized with distinctive F0 patterns,
adapting to the tonal contexts and the increase of duration.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 6 of 21 | Tonal Constituents and Meanings of Yes-No Questions AUTHOR(S):
Hedberg, Nancy; Simon Fraser University
Sosa, Juan; Simon Fraser University
Fadden, Lorna; Simon Fraser University
Abstract:
We analyzed the different meanings associated with the tonal contours of 104
positive yes-no questions from the CallHome Corpus of American English. We
take into consideration such broad constituents as the head, nucleus and tail of
intonational phrases, as well as ToBI sequences of pitch accents, phrase accents
and boundary tones. The meaning of a question as unmarked or marked in a
variety of ways is shown to depend upon the intonational contours associated with
these broad constituents, and even with the contour associated with the question as
a whole.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 7 of 21 | Prosodic Properties of Constituents Associated with Stressed 'auch' in German AUTHOR(S):
Sudhoff, Stefan; Universität Leipzig, Institut für Linguistik
Lenertová, Denisa; Universität Leipzig, Institut für Linguistik
Abstract:
We report a production experiment and two perception studies examining the
prosodic characteristics of constituents associated with the stressed variant of the
German particle 'auch' (also) in potentially ambiguous constructions. The results
show that these elements are marked by perceptually relevant rising pitch accents,
but that there is no 1:1 mapping between the prosodic realization and the status
of being associated with 'auch'.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 8 of 21 | Russian personal pronouns in Syntax and Phonology AUTHOR(S):
Mleinek, Ina; University of Leipzig
Werkmann, Valja; University of Leipzig
Abstract:
The question we will address is how far the syntactic positions of Russian personal
pronouns affect their phonological properties. To this aim we examined their
phonological behaviour in three structural slots within the sentence (first experiment)
and then in the right-peripheral position associated with sentence stress (second
experiment). Probing Rappaport's 1988 idea of the verb as the prosodic host
for de-stressed Russian personal pronouns we wanted to know whether prosodic
cliticization (indicated by times of silence/pauses and steps up/down of F0 values)
is rather determined (1) by focus; (2) by morphosyntactic categories; or (3)
by direction. The result is a combination of all three possibilities, and according
to our second experiment, Russian personal pronouns are functional words in the
broad focus condition while in conditions with contrastive and minimal foci, Russian
personal pronouns can receive sentence stress and thus behave like lexical or
content words.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 9 of 21 | Can prosodic cues and function words guide syntactic processing and acquisition? AUTHOR(S):
Millotte, Séverine; Laboratoire de Sciences Cognitives et Psycholinguistique
(EHESS-ENS-CNRS)
Wales, Roger; Faculty of Humanities and Social Sciences, La Trobe University
Dupoux, Emmanuel; Laboratoire de Sciences Cognitives et Psycholinguistique
(EHESS-ENS-CNRS)
Christophe, Anne; Laboratoire de Sciences Cognitives et Psycholinguistique
(EHESS-ENS-CNRS)
Abstract:
We studied the use of phonological phrase boundaries and function words in syntactic
processing. French adults performed an abstract word detection task on
jabberwocky sentences. We created two conditions:
- "with function word" condition: targets were directly preceded by a function
word, as in "[une bamoule] [dri se froliter]" ("bamoule" is a noun), and "[tu
bamoules] [saman ti]" ("bamoule" is a verb)
- "without function word" condition: targets were not directly preceded by a function
word; sentence beginnings differed by their prosodic and syntactic structures,
as in "[une cramona bamoule] [camiche dabou]" (noun target) vs "[une cramona]
[bamoule muche] [le mirtou]" (verb target).
Function words and prosodic cues allow listeners to start building a syntactic structure.
Adults were able to use phonological phrase boundaries to define syntactic
boundaries, and function words to label these constituents.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 10 of 21 | Acoustic prominence and reference accessibility in language production AUTHOR(S):
Watson, Duane; University of Illinois Urbana-Champaign
Arnold, Jennifer; University of North Carolina, Chapel Hill
Tanenhaus, Michael; University of Rochester
Abstract:
Two experiments explored discourse and communicative factors that contribute
to the perceived prominence of a word in an utterance, and how that prominence
is realized acoustically. In Experiment 1 two hypotheses were tested: (1) acoustic
prominence is a product of the given-new status of a word and (2) acoustic
prominence depends on the degree to which a referent is accessible, where greater
acoustic prominence is used for less accessible entities. In a referential communication
task, speakers used acoustic prominence to indicate referent accessibility
change, independent of given-new status. In Experiment 2 a variant of Tic Tac Toe
was used to investigate whether effects of accessibility are driven by a need to signal
the importance of a word or to indicate the word's predictability. The results
indicate that both importance and predictability contribute to the prominence of
a word, but in different ways.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 11 of 21 | More than pointing with the prosodic focus: The Valence-Intensity-Domain (VID) model AUTHOR(S):
Aubergé, Véronique; ICP CNRS
Rilliard, Albert; ICP CNRS
Abstract:
This paper summarizes several perception experiments showing that the morphology
of the prosodic focus conveys more than the information of the deixis function:
(1) the binary valence - yes/no focus - which is perceptively quite categorical
(a magnet effect is clear on the basis of an identification and a discrimination
experiment), (2) the intensity information, used by the speaker to give his preference
between two focused elements, (3) the information of the focus domain,
that is, segmentation cues about the focused element (phonological unit
or word unit), which are perceptively identified by listeners. The morphological
cues revealing Valence-Intensity-Domain are observed in particular in a morphing
procedure that makes clear the thresholds of quite categorical behaviors.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 12 of 21 | Focus-related pitch range manipulation (and peak alignment effects) in Egyptian Arabic AUTHOR(S):
Hellmuth, Sam; Department of Linguistics, SOAS, University of London & Institut
für Linguistik, Universität Potsdam
Abstract:
This paper explores focus-related effects on pitch range and on peak alignment
in Egyptian Arabic (EA), and interaction between them. Qualitative analysis of
elicited focus data shows that even when post-focal and 'given', EA words bear
a pitch accent. Quantitative analysis reveals gradient effects of focus in the form
of pitch range manipulation, but this reflects identificational/contrastive focus,
not information focus. Peak alignment shows an indirect effect of post-focal F0
compression. It is argued that pitch range manipulation is used in EA to express
identificational/contrastive focus only (not given new/information focus); the
effect is argued to be phonologically gradient because the effects emerge not only
on focused items, as F0 expansion, but also on post-focal items, in the form of F0
compression and earlier peak alignment.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 13 of 21 | An Experimental Study on the Assignment of Focus Accent in Mandarin AUTHOR(S):
Wang, Yunjia; Peking University
Chu, Min; Microsoft Research Asia
Abstract:
This paper investigates the distribution of focus-related accents in the broad focus
domain in Chinese Mandarin through 300 natural sentences. The results show that
focus-related accent tends to be assigned to the predicate in a subject-predicate
structure, to the object in a predicate-object structure, and to the head in an
adjunct-head structure unless the head is highly predictable. From these observations,
we conclude that, in a broad focus structure in Chinese Mandarin, the
focus-related accent is normally assigned to the innermost constituent of the sentence
if this constituent has enough semantic weight; otherwise, the accent is placed
in the constituent that has the closest syntactic relationship to the innermost one.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 14 of 21 | Predicting Prosodic Phrasing Using Linguistic Features AUTHOR(S):
Yoon, Tae-Jin; University of Illinois at Urbana-Champaign
Abstract:
The prosodic structure of speech is based on complex interaction within and between
several different levels of linguistic and paralinguistic organization. Though
leading theories of prosody maintain that prosody is shaped through the interaction
of grammatical factors from phonology, syntax, semantics, and pragmatics,
there is no consensus on how to model their interaction. I provide a new probabilistic
model of the mapping between prosody and phonology, syntax, and argument
structure. The model encodes phonological features, shallow syntactic constituent
structure, and basic argument structure. A machine learning experiment using
these features to predict prosodic phrase boundaries achieves more than 92 % accuracy
in predicting prosodic boundary location. An experiment for predicting
the strength of prosodic boundaries achieves 88.06 % accuracy. This study sheds
light on the relationship between prosodic phrase structure and other grammatical
structures.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 15 of 21 | Utterance Final Forms in Dialogues by Young Japanese: A Syntactic and Prosodic Analysis AUTHOR(S):
Nishinuma, Yukihiro; CNRS Laboratoire Parole et Langage
Hayashi, Akiko; Chuo University
Yabe, Hiroko; Tokyo Gakugei University
Abstract:
This work reports findings on the relationship between speaker-sex and linguistic
behavior among young Japanese in explanation-giving dialogues. The relationship
between speaker-sex and (1) the choice of utterance final forms; (2) the prosodic
characteristics on these forms, has thus been examined. Data obtained from 110
students of the Tokyo area revealed no statistically significant effect of the sex factor
in the syntactic forms used. However, utterance final syllables had a statistically
significant effect both on rhythm and intonation.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 16 of 21 | Cross-dialectal Turn Exchange Rhythm in English Interviews AUTHOR(S):
Fon, Janice; Graduate Institute of Linguistics, National Taiwan University
Abstract:
This study looked at the relationship between rhythm and exchange type in a
stress-timed language, British English, and a syllable-timed language, Singaporean
English, using a spontaneous speech corpus. Exchange intervals (EIs) were measured
and different exchange types were labeled. Results showed that in a dialog,
EIs were generally limited to a narrow range. However, within the range, EIs
had four functions. First, EIs indicated linguistic rhythm. Singaporean English
tended to have shorter EIs. Second, EIs reflected the cognitive load and tightness
of coupling in differentiating various exchange types. EIs in pairs requiring
more cognitive resources and tighter coupling were longer than in those not as
cognitively-loaded but were shorter than in pairs not as tightly coupled. Moreover,
EIs reflected discourse organization. Topic initiation EIs were longer than
topic ending ones. Finally, the degree of politeness correlated positively with EI.
Asian females' EIs were lengthened.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 17 of 21 | The Effect of Paralinguistic Emphasis on F0 Contours of Cantonese Speech AUTHOR(S):
Gu, Wentao; The University of Tokyo
Hirose, Keikichi; The University of Tokyo
Fujisaki, Hiroya; The University of Tokyo
Abstract:
Emphasis has a significant effect on F0 contours in various languages, among which
tone languages require more careful study because their F0 contours show complex
interaction between lexical tones and phrase intonation. Here we employ the
command-response model to investigate the effect of paralinguistic emphasis in
Cantonese, a typical tone language with nine lexical tones. Following our previous
study on target syllables in a fixed carrier frame, the current study continues to
investigate the utterances with natural context, in which the effects of emphasis
with different scopes and on different parts of utterance are compared. It shows
that the major effect of emphasis is not on tone commands but on phrase commands.
The narrowness/broadness of emphasis can be distinguished by the number
of phrase commands being affected in the phonetic realization. By use of the
command-response model, F0 contours for expressive speech conveying emphasis
information can be generated efficiently.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 18 of 21 | Prosodic and informational aspects of polar questions in Neapolitan Italian AUTHOR(S):
Crocco, Claudia; University of Salerno
Abstract:
In this paper the relation between prosodic form and meaning is investigated in a
sample of polar questions in Neapolitan Italian, taken from 4 Map Task dialogues.
The sample is analyzed from both the informational and the prosodic point of
view. The information analysis found 4 groups of questions, distinguished by their
function or by the degree of accessibility of the referents they contain. The groups
were then put in relation to the conversational Map Task moves, and to the results
of the prosodic analysis. The results of this analysis show that yes-no questions (YNQs)
in Neapolitan Italian have a common prosodic pattern. Their different functions,
i.e. confirmation-seeking and information-seeking, are expressed with a variety
of means that, together with the information provided by the context, concur to
orient the interpretation.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 19 of 21 | Argument Structure and Focus Projection in Korean AUTHOR(S):
Kim, Hee-Sun; Stanford University
Jun, Sun-Ah; UCLA
Lee, Hyuck-Joon; UCLA
Kim, Jong-Bok; Kyunghee University
Abstract:
It has been claimed that syntactic structures and the argument types can determine
the domain of focus: focus on a particular type of internal argument may
project its focus domain to a larger syntactic constituent than the focused item. It
is also known that focus often has prosodic reflections through the manipulations
of prosodic phrasing, prominence relation of words, and duration. This paper examines
the relationship between the focus projection and the argument structure
in Korean by investigating the prosodic correlates of focus. Results show that
there is no sensitivity of argument type in projecting the domain of focus to Verb
Phrase. Regardless of argument types or word order, VP focus was prosodically
marked at the VP-initial word by initiating a large intonational phrase boundary,
raising its pitch peak, and lengthening the VP-initial syllable and word. The results
do not support the claim that the argument structure is an important factor
in focus projection.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 20 of 21 | Interface between information structure and intonation in Dutch WH-questions AUTHOR(S):
Chen, Aoju; Max Planck Institute for Psycholinguistics
Abstract:
This study concerns how accent placement is pragmatically governed in Dutch
WH-questions. Central to this issue are questions such as how intonation of the
WH-word is related to information structure of the non-WH word part, whether
topical constituents can be accented, and what is the nature of the accents in the
non-WH word part. Different treatments of these questions in earlier approaches
result in conflicting predictions on the intonation of WH-questions. We addressed
these questions by analysing a corpus of 90 naturally occurring WH-questions.
Results show that the intonation of the WH-word is related to the information
structure of the non-WH word part. Moreover, topical constituents can be accented
and accents in the non-WH word part are not necessarily phonetically reduced.
Moreover, we have observed that the speaker may have communicative motivation
to accent the WH-word or adverbs not part of the presupposition in addition to
pragmatic motivation.
| | | Poster Session 4: Syntax, Semantics, Pragmatics | 21 of 21 | Syntactic and prosodic parenthesis AUTHOR(S):
Peters, Jörg; Radboud University Nijmegen
Abstract:
This paper examines the view that parentheticals obligatorily form an intonational
phrase and break up the intonational phrase of the matrix sentence into two intonational
phrases. The analysis of spontaneous speech data of Hamburg German
shows that neither do all parentheticals form a distinct intonational phrase nor do
all parentheticals break up the intonational phrase of the matrix sentence. The
most frequent type of prosodic integration is prosodic parenthesis, which is the
insertion of one intonational phrase into another and parallels parenthesis on the
syntactic level. Additional analyses reveal that the size of the parenthetical and
the syntactic integration of the parenthetical into the matrix sentences affect its
prosodic integration. Finally, it is argued that the distinction between syntactic
and prosodic parenthesis can solve common problems in defining parentheticals.
| |
Special Session 3 (SPS 3): Understanding Emotions in Speech: Neural and Cross-cultural Evidence
Organizers: Sonja A. Kotz and Marc D. Pell
Wednesday, May 3, 16:00 - 18:00 |
| Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence | 1 of 5 | Non-verbal expressions of emotion - acoustics, valence and cross cultural factors AUTHOR(S):
Scott, Sophie; Institute of Cognitive Neuroscience, University College London
Sauter, Disa; Institute of Cognitive Neuroscience, University College London
Abstract:
This presentation will address aspects of the expression of emotion in non-verbal
vocal behaviour, specifically attempting to determine the roles of both positive
and negative emotions, their acoustic bases, and the extent to which these are
recognized in non-Western cultures.
| | | Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence | 2 of 5 | Implicit recognition of vocal emotions in native and non-native speech AUTHOR(S):
Pell, Marc D.; School of Communication Sciences and Disorders, McGill University,
Montréal
Abstract:
There is evidence for both cultural-specificity and 'universality' in how listeners
recognize vocal expressions of emotion from speech. This paper summarizes some
of the early findings using the Facial Affect Decision Task which speak to the
implicit processing of vocal emotions as inferred from "emotion priming" effects on
a conjoined facial expression. We provide evidence that English listeners register
the emotional meanings of prosody when processing sentences spoken by native
(English) as well as non-native (Arabic) speakers who encoded vocal emotions in
a culturally appropriate manner. As well, we discuss the timecourse for activating
emotion-related knowledge in a native and non-native language which may differ
due to cultural influences on vocal emotion expression.
| | | Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence | 3 of 5 | Examining the neural mechanisms involved in the affective and pragmatic coding of prosody AUTHOR(S):
Grandjean, Didier; Swiss Centre of Affective Sciences, University of Geneva
Scherer, Klaus R.; Swiss Centre of Affective Sciences, University of Geneva
Abstract:
The vocal expression of humans includes expressions of emotions, such as anger or
happiness, and pragmatic intonations, such as interrogative or affirmative, embedded
within the language. These two types of prosody are differently affected by
the so-called push and pull effects. Push effects, influenced by psychophysiological
activities, strongly affect emotional prosody, whereas pull effects, influenced by
cultural rules of expression, predominantly affect intonation or pragmatic prosody,
even though both processes influence all prosodic production. Two empirical studies
are described that exemplify the possibilities of dissociating emotional and linguistic
prosody decoding at the neurological level. The first study was conducted to
investigate the impairments in prosody recognition related to left or right temporoparietal
brain-damaged patients. The second study used electroencephalography
in healthy participants to investigate the timing of information processing during
emotional and linguistic prosody recognition tasks. The results highlight the importance
of considering not only the distinction of different types of prosody, but
also the relevance of the task realized by the participants to better understand
information processes related to human vocal expression at the suprasegmental
level.
| | | Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence | 4 of 5 | Development of the Brain Mechanism for Understanding Speakers' Intent from Speech AUTHOR(S):
Imaizumi, Satoshi; Prefectural University of Hiroshima
Noguchi, Yuki; Prefectural University of Hiroshima
Homma, Midori; Prefectural University of Hiroshima
Yamasaki, Kazuko; Prefectural University of Hiroshima
Maruishi, Masaharu; Hiroshima Prefectural Rehabilitation Center
Muranaka, Hiroyuki; Hiroshima Prefectural Rehabilitation Center
Abstract:
To clarify how the brain understands the speaker's mind for verbal acts, fMRI
images obtained from 24 subjects and behavioral data obtained from 339 subjects
were analyzed when they judged the linguistic meanings or emotional manners
of spoken phrases. The target phrases had linguistically positive or negative
meanings and were uttered warmheartedly or coldheartedly by a woman speaker.
The results of the fMRI analyses suggest that neural resources responsible for the
speakers' mind reading are distributed over the superior temporal sulci, inferior
frontal regions, medial frontal regions and posterior cerebellum. The correct judgment
of the speaker intentions significantly increased with age for the phrases with
inconsistent linguistic and emotional valences. Female children showed faster development
than male children. The neural mechanism for interpreting a speaker's real
intentions from spoken phrases develops slowly during school age.
| | | Special Session 3: Understanding Emotions in Speech: Neural and Cross-cultural Evidence | 5 of 5 | efMRI Evidence for Implicit Emotional Prosodic Processing AUTHOR(S):
Kotz, Sonja A.; Max Planck Institute of Human Cognitive and Brain Sciences,
Leipzig
Paulmann, Silke; Max Planck Institute of Human Cognitive and Brain Sciences,
Leipzig
Raettig, Tim; Max Planck Institute of Human Cognitive and Brain Sciences,
Leipzig
Abstract:
The current efMRI experiment investigated the potential right hemisphere dominance
of emotional prosodic processing under implicit task demands. Participants
evaluated the relative tonal height (high, medium, low) of intelligible and unintelligible
sentences spoken by a trained female speaker of German with three prosodic
contours: happy, angry, and neutral. The results confirm the activation of a bilateral
fronto-striato-temporal network with no clear right hemispheric preference
for emotional prosodic processing. The data suggest that (1) task demands do not
significantly alter lateralization of function in the current context, and (2) frontostriatal
brain areas engage during implicit processing of emotional prosody, and thus
do not seem to be task specific.
| |
Oral Session 3 (OS 3): Prosody and Speech Perception
Thursday, May 4, 09:45 - 11:25
Chair: Ailbhe Ní Chasaide |
| Oral Session 3: Prosody and Speech Perception | 1 of 5 | Phrase-Final Pitch Discrimination in English AUTHOR(S):
Cummins, Fred; University College Dublin
Doherty, Colin; Royal College of Surgeons in Ireland
Dilley, Laura; The Ohio State University
Abstract:
We investigate the discrimination of phrase final pitch contours within a continuum
from statement to question in English. Previous work in German and Dutch
has raised questions about the relationship between discrimination sensitivity and
category structure within this continuum. To clarify the relationship between
linguistic category and simple auditory discrimination, we employ both speech
and non-speech stimuli. For all stimuli, we find a discrimination peak at the point
in the continuum where a pitch fall changes to a pitch rise. This peak does not
appear to be related to the category boundary for speech stimuli, as revealed in
a labeling task. Discrimination was somewhat better for non-speech stimuli than
speech.
| | | Oral Session 3: Prosody and Speech Perception | 2 of 5 | The role of articulation rate in distinguishing fast and slow speakers AUTHOR(S):
Koreman, Jacques; Saarland University
Abstract:
This article discusses differences in articulation rate between fast and slow speakers
in a production experiment. It is shown that fast and slow speakers differ in
their articulation rates, both in terms of the number of phones in the canonical
form (intended rate) as well as the number of phones present in the actual realization
(realized rate). The articulatory precision index, which indicates the relative
deletion rate, also differs for these speakers. The same differences are observed
for fast and slow inter-pause stretches in a large German database of spontaneous
speech. Both in the database and for the production experiment, however, there
is considerable overlap between the measurements for fast and slow speakers. This
shows that other factors also play a role in distinguishing fast and slow speakers or
inter-pause stretches. The relationship between these factors and the articulation
rates is discussed.
| | | Oral Session 3: Prosody and Speech Perception | 3 of 5 | Toddlers are sensitive to prosodic correlates of disfluency in spontaneous speech AUTHOR(S):
Soderstrom, Melanie; Brown University
Morgan, James L.; Brown University
Abstract:
The ability to distinguish fluent from disfluent speech could play an important
role in infants' acquisition of their first language. Across two experiments using a
Headturn Preference Procedure, we show that infants are able to distinguish fluent
from disfluent speech based on its prosodic characteristics, and show a preference
for listening to fluent English. In the first experiment, 22-month-old, but not
10-month-old, infants preferred to listen to fluent adult-directed speech samples
over disfluent matched speech samples. In the second experiment, lexical and
grammatical information were removed. Older infants still discriminated fluent
from disfluent speech, but showed the reverse preference, for disfluent speech.
| | | Oral Session 3: Prosody and Speech Perception | 4 of 5 | Modelling Hesitation for Synthesis of Spontaneous Speech AUTHOR(S):
Carlson, Rolf; TMH, CSC, KTH, Stockholm, Sweden
Gustafson, Kjell; Acapela Group, Stockholm, Sweden
Strangert, Eva; Phonetics, Umeå University, Sweden
Abstract:
The current work deals with the modelling of one type of disfluency, hesitations.
A perceptual experiment using speech synthesis was designed to evaluate two
duration features found to be correlates to hesitation, pause duration and final
lengthening. A variation of F0 slope before the hesitation was also included. The
most important finding is that it is the total duration increase that is the valid cue
rather than the contribution by either factor. In addition, our findings lead us to
assume an interaction with syntax. The absence of strong effects of the induced
F0 variation was unexpected and we consider several possible explanations for this
result.
| | | Oral Session 3: Prosody and Speech Perception | 5 of 5 | Neural correlates of rhythm processing in speech perception AUTHOR(S):
Geiser, Eveline; University of Zurich
Schmidt, Conny; University of Zurich
Jancke, Lutz; University of Zurich
Meyer, Martin; University of Zurich
Abstract:
The present study investigates the neural correlates of speech perception. Metric
and non-metric German pseudo-sentences were compared in an fMRI investigation.
One group of subjects was to decide which type of sentence they had heard.
A second group performed a prosody task on the same stimuli. Group analysis
revealed activation in the supplementary motor area (SMA) for the explicit processing
group. This activation was not present in the implicit processing group. A
direct contrast between the metric and the non-metric sentences for the implicit
processing group revealed significant activation in the left planum temporale (PT)
for the metric condition. Our results suggest that rhythm processing relies on
neural correlates different from those related to speech melody processing. The
implicit perception of unexpected speech rhythm relies on brain areas which have
earlier been associated with temporal auditory processing in the left hemisphere.
| |
Oral Session 4 (OS 4): Affective Speech
Thursday, May 4, 11:50 - 13:10
Chair: Véronique Aubergé |
| Oral Session 4: Affective Speech | 1 of 4 | Expressing anger and joy with the size code AUTHOR(S):
Chuenwattanapranithi, Suthathip; Department of Computer Engineering, King
Mongkut's University of Technology Thonburi
Xu, Yi; University College London and Haskins Laboratories
Thipakorn, Bundit; Department of Computer Engineering, King Mongkut's University
of Technology Thonburi
Maneewongvatana, Songrit; Department of Computer Engineering, King Mongkut's
University of Technology Thonburi
Abstract:
This paper reports our finding of the use of a proposed biological code, the
size code, in anger and joy speech. In searching for explanations for an F0 peak
delay phenomenon related to angry speech that cannot be accounted for by known
articulatory constraints, we hypothesized that the delay was due to the lowering
of the larynx to exaggerate body size, a biological code known to be used by
animals. Our analysis of the formant frequencies in existing emotional speech
databases revealed that anger speech had lowered formants and joy speech had
raised formants. The results confirm our hypothesis and suggest that the size
code is being actively used by humans to express emotions.
| | | Oral Session 4: Affective Speech | 2 of 4 | Emotion Elicitation in a Computerized Gambling Game AUTHOR(S):
Aharonson, Vered; Tel Aviv Academic College of Engineering
Amir, Noam; Tel Aviv University
Abstract:
We have designed a novel computer controlled environment that elicits emotions
in subjects while they are uttering short identical phrases. The paradigm is based
on Damasio's experiment for eliciting apprehension and is implemented in a voice
activated computer game. For six subjects we have obtained recordings of dozens of
identical sentences, which are coupled to events in the game - gain or loss of points.
Prosodic features of the recorded utterances were extracted and classified. The
resultant classifier gave 78-85 % recognition of presence/absence of apprehension.
| | | Oral Session 4: Affective Speech | 3 of 4 | Pauses in Deceptive Speech AUTHOR(S):
Benus, Stefan; Columbia University
Enos, Frank; Columbia University
Hirschberg, Julia; Columbia University
Shriberg, Elizabeth; SRI & ICSI
Abstract:
We use a corpus of spontaneous interview speech to investigate the relationship
between the distributional and prosodic characteristics of silent and filled pauses
and the intent of an interviewee to deceive an interviewer. Our data suggest that
the use of pauses correlates more with truthful than with deceptive speech, and
that prosodic features extracted from filled pauses themselves as well as features
describing contextual prosodic information in the vicinity of filled pauses may
facilitate the detection of deceit in speech.
| | | Oral Session 4: Affective Speech | 4 of 4 | Mapping Voice to Affect: Japanese Listeners AUTHOR(S):
Yanushevskaya, Irena; Phonetics and Speech Laboratory, Trinity College Dublin
Gobl, Christer; Phonetics and Speech Laboratory, Trinity College Dublin
Ní Chasaide, Ailbhe; Phonetics and Speech Laboratory, Trinity College Dublin
Abstract:
This paper reports the results of perception tests administered to speakers of
Japanese as part of a cross-language investigation of how voice quality and f0 combine
in the signalling of affect. Three types of synthesised stimuli were presented:
(1) 'VQ only' involving variations in voice quality and a neutral f0; (2) 'f0 only',
with different f0 contours and modal voice; and (3) combined 'VQ + f0' stimuli,
where combinations of (1) and (2) were employed. Overall, stimuli involving voice
quality variation (1 and 3) proved to be most consistently associated with affect.
In series (2) only stimuli with very high f0 yielded high affective ratings. Some
striking differences emerge in the ratings obtained for Japanese subjects compared
to those obtained for speakers of Hiberno-English, suggesting that the generation
of expressive speech synthesis will need to be sensitive to language specific uses of
the voice.
| |
Poster Session 5 (PS 5): Speech Technology - Part I: Speech Synthesis
Thursday, May 4, 14:30 - 16:00
Chairs: Keikichi Hirose / Plinio Barbosa |
| Poster Session 5: Speech Technology - Part I: Speech Synthesis | 1 of 34 | Rule-based Prosody Prediction for German Text-to-Speech Synthesis AUTHOR(S):
Becker, Stephanie; Saarland University, Saarbrücken
Schröder, Marc; DFKI GmbH, Saarbrücken
Barry, William J.; Saarland University, Saarbrücken
Abstract:
This paper presents two empirical studies that examine the influence of different
linguistic aspects on prosody in German. First, we analysed a German corpus with
respect to the effect of syntax and information status on prosody. Second, we conducted
a listening test which investigated the prosodic realisation of constituents
in the German 'Vorfeld' depending on their information status. The results were
used to improve the prosody prediction in the German text-to-speech synthesis
system MARY.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 2 of 34 | Duration Prediction in Mandarin TTS System AUTHOR(S):
Guo, Qing; Fujitsu Research and Develop Center China, Beijing
Katae, Nobuyuki; Fujitsu Laboratories Ltd.
Abstract:
This paper reports the methodology and results of decision-tree-based duration
prediction for the Mandarin text-to-speech system developed by Fujitsu Laboratories.
Syllable initials and finals are the basic units in our duration study. In this
paper, factors influencing the finals, such as phrase boundary and phone context,
are discussed in detail. Experiments indicate that the most important determinant
of duration is whether the right phrase boundary is at the prosodic word level or
at a higher level. Furthermore, the degree of phrase-boundary
vowel lengthening may vary depending on the type of final. This paper also
explains the methods used for objective evaluation of the duration
prediction model. Finally, prosody evaluation results indicate that the
prosody generated by our prosody generation module is much better than that of
two well-known Mandarin TTS systems.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 3 of 34 | Adaptation of Prosodic Phrasing Models AUTHOR(S):
Bell, Peter; Centre for Speech Technology Research, University of Edinburgh
Burrows, Tina; Speech Technology Group, Toshiba Research Europe Ltd
Taylor, Paul; Department of Engineering, University of Cambridge
Abstract:
There is considerable variation in the prosodic phrasing of speech between different
speakers and speech styles. Due to the time and cost of obtaining large
quantities of data to train a model for every variation, it is desirable to develop
models that can be adapted to new conditions with a limited amount of training
data. We describe a technique for adapting HMM-based phrase boundary prediction
models which alters a statistical distribution of prosodic phrase lengths. The
adapted models show improved prediction performance across different speakers
and types of spoken material.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 4 of 34 | F0 and Segment Duration in Formant Synthesis of Speaker Age AUTHOR(S):
Schötz, Susanne; Linguistics and Phonetics, Centre for Languages and Literature,
Lund University
Abstract:
This paper describes the work with F0 and segment duration when developing a
prototype system for analysis of speaker age using data-driven formant synthesis.
The system was developed to extract 23 parameters from the test words -
spoken by four differently aged female speakers of the same dialect and family
- and to generate synthetic copies. Audio-visual feedback enabled the user to
compare the natural and synthetic versions and facilitated parameter adjustment.
Next, weighted linear interpolation was used in a first crude attempt to synthesize
speaker age. Evaluation of the system revealed its strengths and weaknesses, and
suggested further improvements. F0 and duration performed better than most
other parameters.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 5 of 34 | High Resolution Speech F0 Modification AUTHOR(S):
Bardi, Tamas; Faculty of Information Technology, Peter Pazmany Catholic University,
Budapest
Abstract:
The present paper proposes a new algorithm for pitch modification that can
change the fundamental frequency of speech with a resolution fine enough
to be at least comparable with human pitch perception. Using the proposed
method, measurements of just noticeable changes in speech prosody become possible.
High-resolution F0 manipulation is achieved without explicit over-sampling
of the signal; our FFT-based fast interpolation technique is used instead. Our algorithm
is based on the LP-PSOLA method. Although its frequency resolution was
enhanced especially for research purposes, the need for it may arise from
real applications of expressive speech synthesis in the future.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 6 of 34 | Effects of Prosodic Factors on Spectral Balance: Analysis and Synthesis AUTHOR(S):
Miao, Qi; Oregon Health & Science University
Niu, Xiaochuan; Oregon Health & Science University
Klabbers, Esther; Oregon Health & Science University
van Santen, Jan; Oregon Health & Science University
Abstract:
In natural speech, prosodic factors such as accent, stress, phrasal position and
speaking style play important roles in controlling several acoustic features, including
segmental duration, pitch, and spectral balance. To synthesize speech that
sounds natural, these effects need to be accurately modeled. In this study we describe
and evaluate a synthesis method that mimics the effects of prosodic factors
on spectral balance. We measure spectral balance by using the energy in four
broad frequency bands that correspond to formant frequency ranges. An additive
model is used to capture the effects of prosodic factors on spectral balance. A new
sinusoidal synthesis module is implemented under Festival to predict the target
spectral balance from analysis results and apply it to the amplitude parameters of
the sinusoidal model during synthesis. We evaluate an important strength of this
system, which is its ability to reduce spectral discontinuities in unit concatenation.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 7 of 34 | Decomposition of Pitch Curves in the General Superpositional Intonation Model AUTHOR(S):
Mishra, Taniya; Oregon Health & Science University
van Santen, Jan; Oregon Health & Science University
Klabbers, Esther; Oregon Health & Science University
Abstract:
This paper describes and applies a new algorithm for decomposing pitch curves
into component curves, in accordance with the General Superpositional Model of
Intonation. According to this model, which is a generalization of the Fujisaki
model, a pitch contour can be described as the sum of component curves that are
each associated with different phonological levels, including the phrase, foot, and
phoneme. The algorithm assumes that the phrase curve is locally linear during
intervals spanned by a foot. The algorithm was evaluated using synthetically
generated curves, and was found to accurately recover the synthetic component
curves. The algorithm was also evaluated in a perceptual experiment, where speech
generated by concatenation of accent curves was shown to produce better speech
quality than speech based on direct concatenation of "raw" pitch curve fragments.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 8 of 34 | An innovative F0 modeling approach for emphatic affirmative speech, applied to the Greek language AUTHOR(S):
Giannopoulos, Gergios; Institute for Language and Speech Processing
Chalamandaris, Aimilios; Institute for Language and Speech Processing
Abstract:
In this paper we present an innovative algorithm for modelling the fundamental
frequency F0 for the Greek language, for sentences containing emphatic segments.
The main idea of our approach is the definition of a specific set of intonation word
models, derived from a spoken corpus, which suffices for modeling the
pitch contour of arbitrarily long, similarly structured sentences. Our method is based
on a prosodic unit selection approach. The system was designed and trained on a
spoken corpus of 120 naturally uttered sentences of weather forecasts, containing
emphasis segments and has proved to be very efficient in coping with similarly
structured sentences. In the first section of the paper we present a brief review
of the existing literature in this field, along with analogous approaches for
other languages. In the second section we present our method and the design
procedure. The last two sections contain the preliminary results acquired from
our experiments as well as conclusions and refer to future work that needs to be
carried out.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 9 of 34 | Prosody generation in the Speech-to-Speech Translation Framework AUTHOR(S):
Agüero, Pablo Daniel; UPC
Adell, Jordi; UPC
Bonafonte, Antonio; UPC
Abstract:
This paper deals with speech synthesis in the framework of speech-to-speech translation.
Our current focus is to translate speeches or conversations between humans
so that a third person can listen to them in their own language. In this framework
the style is not written but spoken, and the original speech includes a lot of nonlinguistic
information (such as speaker emotion). In this work we propose the use of
prosodic features in the original speech to produce prosody in the target language.
Relevant features are found using an unsupervised clustering algorithm that finds,
in a bilingual speech corpus, intonation clusters in the source speech which are
relevant in the target speech. Preliminary results already show a significant improvement
in the synthetic quality.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 10 of 34 | Facing data scarcity using variable feature vector dimension AUTHOR(S):
Agüero, Pablo Daniel; UPC
Bonafonte, Antonio; UPC
Abstract:
This paper focuses on three key points of intonation modelling: interpolation of
fundamental frequency contour, sentence by sentence parameter extraction and
data scarcity. In some cases, these introduce noise and inconsistency into the training
data, reducing the performance of machine learning techniques. We assume that
the F0 contour is segmented into prosodic units (such as accent groups, minor
phrases, etc.). Each segment of the F0 contour has a corresponding feature vector
with linguistic and non-linguistic components. We propose to address the limitations
mentioned above with a clustering technique that uses different feature
vector dimensions. The clustering of feature vectors also produces a partition of
the F0 contour space. The proposal consists of a procedure to select the dimension
that yields the best predicted fundamental frequency contour, in the RMSE
sense, compared to a reference contour. Experimental results show an improvement
compared to other approaches.
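
The selection step described in this abstract can be illustrated with a minimal sketch. It assumes that a predicted F0 contour is already available for each candidate feature vector dimension; the function names `rmse` and `select_dimension`, and the dictionary layout, are hypothetical and are not taken from the paper.

```python
import math


def rmse(predicted, reference):
    """Root mean squared error between two equal-length F0 contours (in Hz)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def select_dimension(predictions_by_dim, reference):
    """Return the feature vector dimension whose prediction has the lowest RMSE.

    predictions_by_dim maps a candidate dimension (hypothetical key) to the
    F0 contour predicted when clustering with that dimension.
    """
    return min(predictions_by_dim,
               key=lambda dim: rmse(predictions_by_dim[dim], reference))
```

For example, `select_dimension({3: [100.0, 110.0], 5: [101.0, 119.0]}, [100.0, 120.0])` picks dimension 5, because its contour is closer to the reference.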
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 11 of 34 | Disfluent Speech Analysis and Synthesis: a preliminary approach AUTHOR(S):
Adell, Jordi; Universitat Politècnica de Catalunya
Bonafonte, Antonio; Universitat Politècnica de Catalunya
Escudero, David; Universidad de Valladolid
Abstract:
High quality speech synthesisers based on unit selection exist, but
they follow a reading style approach. New applications such as
Speech-to-Speech Translation or Speech User Interfaces call for a talking style,
which is more natural in these contexts. Disfluencies are a major characteristic
of talking style. It is thus convenient to be able to generate disfluent speech. In
the present paper a preliminary analysis of the pitch and segmental duration of
repetitions and filled pauses is presented. Simple rules to predict these prosodic features
are derived from this analysis and used for synthesis. Evaluation shows
an increase in naturalness while overall quality decreases.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 12 of 34 | Structural Data-Driven Prosody Model for TTS Synthesis AUTHOR(S):
Romportl, Jan; Department of Cybernetics, University of West Bohemia in Pilsen
Abstract:
This paper introduces a new data-driven prosody model for the text-to-speech
system ARTIC. The model is intended to be almost language-independent and
to generate naturally sounding intonation with a link to semantics. It is based
on text parametrisation using a new prosodic grammar and on automatic speech
corpora analysis methods. Its performance is evaluated through the listening tests
presented in the paper.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 13 of 34 | Language- and Speaker-Specific Implementation of Intonation Contours in Multilingual TTS Synthesis AUTHOR(S):
Lobanov, Boris; United Institute of Informatics Problems, Minsk
Tsirulnik, Liliya; United Institute of Informatics Problems, Minsk
Zhadinets, Dmitry; United Institute of Informatics Problems, Minsk
Karnevskaya, Helena; Minsk Linguistic State University
Abstract:
The paper is concerned with the study of complete/incomplete phrase intonation
and its language- and speaker-specific peculiarities. A phrase, according to the
model used, is represented by a sequence of accentual units consisting of prenucleus,
nucleus and post-nucleus. The procedure of speech test material preparation
and techniques for language- and speaker-specific intonation analysis are
described. The results of intonation analysis were obtained on material from
native Russian and Polish speakers reading a text aloud. The implementation of
intonation 'portraits' in the unified text-to-speech synthesis system for Slavonic
languages with the ability of personal speaking manner cloning is discussed.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 14 of 34 | Statistical Study of Speaker's Peculiarities of Utterances into Phrases Segmentation AUTHOR(S):
Lobanov, Boris; United Institute of Informatics Problems, Minsk
Tsirulnik, Liliya; United Institute of Informatics Problems, Minsk
Abstract:
The report is concerned with an experimental study of idiosyncrasies in the
segmentation of utterances into phrases observed in the speech of a popular Russian
TV anchorman and two TV news speakers. The audio recordings were first
transcribed, and the primary and secondary stresses, as well as phrase boundaries
and phrase intonation types, were identified. Comparative statistical estimates
of the relative frequencies of occurrence of pauses of various durations, of
phrases with different numbers of accent units (AU), and
of pairs of phrases with various numbers of AUs were computed. The
results of the study have been applied to a system for individual voice cloning
using text-to-speech synthesis.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 15 of 34 | Rule-based Generation of Phrase Components in Two-step Synthesis of Fundamental Frequency Contours of Mandarin AUTHOR(S):
Sun, Qinghua; Graduate School of Engineering, University of Tokyo
Hirose, Keikichi; Graduate School of Information Science and Technology, University
of Tokyo
Gu, Wentao; Graduate School of Information Science and Technology, University
of Tokyo
Minematsu, Nobuaki; Graduate School of Frontier Sciences, University of Tokyo
Abstract:
A rule-based method was developed for realizing phrase components in our two-step
generation of fundamental frequency (F0) contours of Mandarin. Motivated
by the F0 contour generation process model, the two-step scheme treats (logarithmic)
F0 contours as a superposition of tone components on phrase components,
which are assumed to be responses to phrase commands. Overly long phrase components cause
a flat F0 contour close to the baseline, which is not the case in human speech. In
the case of tone languages such as Mandarin, tone components can be negative.
Hence, to give a margin for downward F0 movement, phrase components need to
stay above a certain level, requiring more frequent phrase commands than in
non-tonal languages. Based on these facts, simple rules were constructed for
phrase component generation. Speech synthesis was conducted using F0 contours
generated by the method. The results of a listening test showed that the method
achieves good control of F0 contours.
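
The decay of a phrase component toward the baseline can be illustrated with a small sketch of the phrase control mechanism in the standard F0 contour generation process (Fujisaki) model, whose impulse response is alpha^2 * t * exp(-alpha * t) for t >= 0. The value alpha = 3.0 per second is a commonly quoted setting and is an assumption here, as are the function names and command values. The rules proposed in the paper are not given in the abstract and are not reproduced.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    if t < 0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def phrase_component(t, commands, alpha=3.0):
    """Sum of the responses to all phrase commands at time t (seconds).

    commands is a list of (onset_time, magnitude) pairs; values are illustrative.
    """
    return sum(mag * phrase_response(t - onset, alpha) for onset, mag in commands)


def log_f0(t, base_hz, commands, tone_component=0.0, alpha=3.0):
    """ln F0(t) = ln(baseline) + phrase component + tone component."""
    return math.log(base_hz) + phrase_component(t, commands, alpha) + tone_component
```

With a single command at 0 s, the phrase component is almost zero three seconds later, so a negative tone component would push F0 below the baseline. Adding a second command at 2 s restores some headroom, which matches the abstract's observation that tone languages call for more frequent phrase commands.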
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 16 of 34 | Efficient Speech Synthesis System using the Deterministic plus Stochastic Model AUTHOR(S):
Erro, Daniel; Technical University of Catalonia
Moreno, Asunción; Technical University of Catalonia
Abstract:
In this paper, a high-quality concatenative synthesis system using the deterministic
plus stochastic model of speech is described, in which the prosodic modifications
are performed by means of very simple and efficient operations, as we reported in
a previous work. In particular, pitch-synchrony is not necessary, and linear interpolations
substitute other types of estimation. The method for the concatenation
of units has been improved in order to avoid waveform and spectral mismatches.
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 17 of 34 | Towards an Automatic Foreign Accent Reduction Tool AUTHOR(S):
Cho, Kwansun; University of Florida
Harris, John G.; University of Florida
Abstract:
An automatic tool to reduce foreign accent is described and evaluated. An unaccented
speech utterance was used to improve three prosodic features of a corresponding
foreign-accented utterance. The duration, pitch and intensity of the
foreign-accented speech utterance were modified using DTW (Dynamic Time Warping),
WSOLA (Waveform Similarity Overlap Add), and other automatic speech
processing algorithms. The modified speech utterance was then evaluated to determine
the perceived foreign accent compared to the original. Fifteen native
speakers of American English took part in the perceptual test to rate the degree
of foreign accent in Korean-accented American English. The results show that
the modified Korean-accented utterances were perceived to have a lower degree of
foreign accent than the original Korean-accented utterances.
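
As background for the alignment step, the following is a minimal sketch of classic dynamic time warping between two sequences. The abstract does not say which features were aligned, so the one-dimensional frame values (for example, frame energies) and the absolute-difference local cost are assumptions, and `dtw_path` is a hypothetical name.

```python
def dtw_path(a, b):
    """Align two non-empty numeric sequences with dynamic time warping.

    Returns (total_cost, path), where path is a list of (index_in_a, index_in_b)
    pairs from the first frames to the last frames.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

The returned path indicates which frames of one utterance correspond to which frames of the other. A time-scale modification method such as WSOLA could then use that mapping to stretch or compress segments.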
| | | Poster Session 5: Speech Technology - Part I: Speech Synthesis | 18 of 34 | Efficient Technique for Quantization of Pitch Contours AUTHOR(S):
Nurminen, Jani; Nokia Research Center
Himanen, Sakari; Nokia Research Center
Rämö, Anssi; Nokia Research Center
Abstract:
This paper introduces an efficient technique for pitch contour quantization designed
mainly for applications that require storage of speech or prosodic information
at a high compression ratio. Instead of quantizing the estimated pitch
values directly, the proposed technique forms and quantizes a simplified model of
the pitch contour. The simplified contour is constructed in such a manner that
the amount of information needed for describing it is minimized. At the same
time, the deviation from the original contour is maintained below a predetermined
limit. In addition to the high compression ratio, the contour representation offers
benefits in pitch-synchronous decoding. The proposed technique is implemented
and evaluated in a practical storage speech coder. According to the evaluation,
the performance of the quantization technique is very promising as it achieves perceptually
satisfactory quality at an average bit rate of about 100 bits per second.
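
The idea of a simplified contour whose deviation from the original stays below a fixed limit can be sketched with a greedy piecewise-linear approximation. This is a simple stand-in and not the authors' method: it does not minimize the amount of information, and it omits the quantizer and bitstream. The names `simplify_contour` and `reconstruct` are hypothetical.

```python
def _fits(pitch, i, j, max_dev):
    """True if a straight line from sample i to sample j stays within max_dev."""
    for k in range(i + 1, j):
        interp = pitch[i] + (pitch[j] - pitch[i]) * (k - i) / (j - i)
        if abs(pitch[k] - interp) > max_dev:
            return False
    return True


def simplify_contour(pitch, max_dev):
    """Return breakpoint indices of a piecewise-linear approximation of pitch.

    Each segment is extended greedily until the next sample would push the
    deviation from the original contour above max_dev.
    """
    if len(pitch) < 3:
        return list(range(len(pitch)))
    breakpoints = [0]
    start = 0
    while start < len(pitch) - 1:
        end = start + 1
        while end + 1 < len(pitch) and _fits(pitch, start, end + 1, max_dev):
            end += 1
        breakpoints.append(end)
        start = end
    return breakpoints


def reconstruct(pitch, breakpoints):
    """Rebuild a full-length contour by interpolating between breakpoints."""
    if not breakpoints:
        return []
    out = []
    for a, b in zip(breakpoints, breakpoints[1:]):
        for k in range(a, b):
            out.append(pitch[a] + (pitch[b] - pitch[a]) * (k - a) / (b - a))
    out.append(pitch[breakpoints[-1]])
    return out
```

Only the breakpoint positions and their pitch values would need to be stored, which is where the compression comes from in this kind of representation.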
| Poster Session 5 (PS 5): Speech Technology - Part II: Speech Recognition and Understanding
Thursday, May 4, 14:30 - 16:00
Chairs: Keikichi Hirose / Plinio Barbosa |
| | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 19 of 34 | F0 Characteristics of Yes-No Question Intonation in Arabic and English: Disambiguation Techniques for Use in ASR AUTHOR(S):
Barrett, Leslie; EDGAR Online
Hata, Kazue; Santa Barbara
Abstract:
This paper presents preliminary research into the possibility of using F0 information
to enhance the performance of speech-to-speech translation engines and
speech recognition software for Arabic and English. Specifically, we aim to find
factors that differentiate yes-no questions in both languages from other sentential
types. Although previous research using cross-linguistic question data has shown
F0 rise to be the main indicator of yes-no questions, the particular F0 characteristics
used by listeners as perceptual cues varied. Using comparative language
data, the aim of this study was to find reliable question indicators that could be
detected by automated means. In an experiment with short sentences read by a
native speaker of each language, we examined aspects of F0 contours in the two
languages to find reliable recognition thresholds. Results indicate that reliable
indicators of yes-no questions do exist for both languages and occur within the
sentence-final 50 centiseconds.
| | | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 20 of 34 | Dependency Analysis of Spontaneous Monologue Speech Using Pause and F0 Information: A Preliminary Study AUTHOR(S):
Takagi, Kazuyuki; The University of Electro-Communications
Ozeki, Kazuhiko; The University of Electro-Communications
Abstract:
This paper deals with the problem of exploiting prosodic information in syntactic
analysis of spontaneous monologue utterances of non-professional speakers. Duration
of pauses at phrase boundaries and relative F0 contour features, which
improve parsing accuracy of read sentences, were also found to be effective for
parsing spontaneous speech. Dependency analysis was performed by the minimum
penalty parser on academic presentation speech recorded in the Corpus of Spontaneous
Japanese, a large-scale database of spontaneous Japanese with rich linguistic annotations.
Preliminary experiments on relatively clean parts of the monologue
data utterances showed that the pause and F0 features are effective in improving
the accuracy of dependency analysis of spontaneous utterances, and that combined
use of both features will give further improvement. Although this is a preliminary
study, the results are promising.
| | | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 21 of 34 | Prosodic effects in parsing early vs. late closure sentences by second language learners and native speakers AUTHOR(S):
Hwang, Hyekyung; University of Hawaii
Schafer, Amy J.; University of Hawaii
Abstract:
The Informative Boundary Hypothesis (IBH: [4]) claims that a prosodic boundary
is interpreted relative to preceding boundaries. This study tests predictions
of the IBH with Korean learners of English and English native speakers in a
prosody experiment on the resolution of an Early vs. Late Closure ambiguity
in spoken English sentences. A control experiment assessed and controlled for
English morpho-syntactic knowledge in the main experiment. The main experiment
presented the syntactically ambiguous portion of sentences in a forced-choice
continuation-selection task. The results showed that 1) Korean L2ers at all levels
used relative boundary size to disambiguate sentences, like L1ers; 2) intonation
phrase boundaries provided stronger evidence for syntactic boundaries than intermediate
phrase boundaries, especially for the L2ers; and 3) the IBH's 3-way
categorization of relative boundary size - larger/same-size/smaller - appears insufficient
for this syntactic structure.
| | | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 22 of 34 | Speech Recognition Only with Supra-segmental Features - Hearing Speech as Music AUTHOR(S):
Minematsu, Nobuaki; Graduate School of Frontier Sciences, The University of
Tokyo
Nishimura, Tazuko; Graduate School of Medicine, The University of Tokyo
Murakami, Takao; Graduate School of Information Science and Technology, The
University of Tokyo
Hirose, Keikichi; Graduate School of Information Science and Technology, The
University of Tokyo
Abstract:
This paper proposes a novel paradigm of speech recognition where only the suprasegmental
features are used. Absolute properties of speech events such as formants
and spectrums are completely discarded and only the relative and differential properties
of the events are extracted as phonic contrasts. They are considered as suprasegmental
features and mathematically shown not to carry non-linguistic features
such as speaker, age, gender, etc. This implies that speaker-independent
speech recognition should be possible with the reference models built only with
a single speaker's speech. Experiments of vowel sequence recognition show that
this expectation is correct and that the performance of the new paradigm is better
than that of the conventional paradigm using more than four thousand speakers.
Hearing sounds through capturing only their contrasts is often done when hearing
musical sounds, indicating that the proposed paradigm hears speech as music.
| | | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 23 of 34 | Employing Intonational Events Parameterization for Emotion Recognition AUTHOR(S):
Zervas, Panagiotis; Patras University
Mporas, Iosif; Patras University
Fakotakis, Nikolaos; Patras University
Abstract:
Fujisaki's modeling of the pitch contour is considered in this article for the task
of emotion recognition from speech signals. To evaluate the resulting attributes,
we utilized a decision tree inducer as well as an instance-based learning algorithm.
The datasets utilized for training the classification models were extracted
from two emotional speech databases. Results showed that knowledge extracted
from Fujisaki's parameters benefited all prediction models: an average increase
of 9.52 % in the total accuracy of all models was attained.
| | | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 24 of 34 | Unsupervised Learning of Tone and Pitch Accent AUTHOR(S):
Levow, Gina-Anne; University of Chicago
Abstract:
Recognition of tone and intonation is essential for speech recognition and language
understanding. However, most approaches to this recognition task have relied
upon extensive collections of manually tagged data obtained at substantial time
and financial cost. In this paper, we explore unsupervised clustering approaches to
recognize pitch accent in English and tones in Mandarin Chinese. In unsupervised
Mandarin tone clustering experiments, we achieve 57-87 % accuracy on materials
ranging from broadcast news to clean lab speech. For English pitch accent in
broadcast news materials, results reach 78 %. These results indicate that the
intrinsic structure of tone and pitch accent acoustics can be exploited to reduce
the need for costly labeled training data for tone learning and recognition.
| | | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 25 of 34 | Classification of Statement and Question Intonations in Mandarin AUTHOR(S):
Liu, Fang; The University of Chicago
Surendran, Dinoj; The University of Chicago
Xu, Yi; University College London & Haskins Laboratories
Abstract:
Conflicting reports abound in the literature regarding the critical characteristics
of statement and question intonations in Mandarin. In this paper, decision trees
with three different sets of feature vectors are implemented to determine the most
significant elements in an utterance that signify its sentence type (statement vs.
question). For 10-syllable utterances, the highest correct classification rate (85 %)
is achieved when normalized (to remove the effects of speaker, tone, and focus)
final F0's of the 7th and the last syllables are included in the tree construction.
This performance is close to previously reported human performance (89 %) for
the same testing set. The results confirm the previous finding that the difference
between statement and question intonations in Mandarin is manifested by an
increasing departure from a common starting point toward the end of the sentence.
| | | Poster Session 5: Speech Technology - Part II: Speech Recognition and Understanding | 26 of 34 | Perceptual Optimization of the Chinese Accent-Index Detector AUTHOR(S):
Zhu, Weibin; Institute of Information Science, Beijing Jiaotong University
Abstract:
For a TTS system, building an AI-supported prosody module with a data-driven
method is practicable only if a large corpus annotated with AI (Accent Index)
is available. An approach has been proposed to label Chinese AI
automatically. Although preliminary experiments showed the effectiveness and
efficiency of the approach, certain issues remain unsolved: the evaluation
and the optimization of the AI detector. A small sub-corpus has been
manually labeled with AI to serve as a reference for evaluating
the performance. A measure, CC (Correlative-Coefficient), computed between
the auto-detected and the manually annotated AI sets, is proposed as the criterion
for optimizing the detector. Thanks to the use of CC, not only has the detector
been refined and optimized, but the auto-detected AI has also been given a
subjective prosodic meaning.
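
The use of CC as an optimization criterion can be sketched as follows. The abstract does not define the coefficient, so the Pearson correlation is assumed here, and the function names and the idea of comparing a few named detector settings are hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def pick_best_setting(auto_by_setting, manual):
    """Return the detector setting whose output correlates best with manual AI.

    auto_by_setting maps a setting name (hypothetical) to the AI values the
    detector produced on the manually labeled sub-corpus.
    """
    return max(auto_by_setting,
               key=lambda name: correlation(auto_by_setting[name], manual))
```

A higher CC means the automatic labels track the manual reference more closely, so the setting with the highest value would be retained.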
| Poster Session 5 (PS 5): Speech Technology - Part III: Annotation and Speech Corpus Creation
Thursday, May 4, 14:30 - 16:00
Chairs: Keikichi Hirose / Plinio Barbosa |
| | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 27 of 34 | The Prosodizer - Automatic Prosodic Annotation of Speech Synthesis Databases AUTHOR(S):
Braunschweiler, Norbert; Speech Technology Group, CRL Toshiba Europe Ltd.
Abstract:
Prosodic annotations are used for locating and characterizing prominent parts in
utterances as well as identifying and describing boundaries of coherent stretches
of speech. In speech synthesis prosodic annotations can be used to improve the
unit selection process and subsequently yield more natural sounding synthesis.
A method for automatic prosodic annotations of speech is described in this paper.
This method is implemented in a computer program called Prosodizer that
integrates acoustic features of F0 and RMS as well as syntactic and segmental
information like POS tags and syllable boundaries. Design and preliminary performance
results are described.
| | | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 28 of 34 | Automatic Accent Annotation with Limited Manually Labeled Data AUTHOR(S):
Chen, YiNing; Microsoft Research Asia
Lai, Min; Department of Electronic Engineering & Information Science, University
of Science & Technology of China
Chu, Min; Microsoft Research Asia
Soong, Frank K.; Microsoft Research Asia
Zhao, Yong; Microsoft Research Asia
Hu, Fangyu; Department of Electronic Engineering & Information Science, University
of Science & Technology of China
Abstract:
In this paper we investigate an automatic accent labeling procedure using classifiers
trained from limited manually labeled data. Different methods are proposed and
compared in a framework of multi-classifiers, including: a linguistic classifier, an
acoustic classifier and a combined one. The linguistic classifier is first used to label
POS-determined content words as accented and function words as unaccented.
The corresponding labels are then used to train accented and unaccented vowel
HMMs separately. The combined classifier is then used to combine the decisions
of the linguistic and acoustic classifiers' outputs to minimize labeling errors. The
performance can be further improved when the acoustic classifier is re-trained with
the whole corpus which is re-labeled by the combined classifiers. The final accent
labeling accuracy is improved to 94.0 %. Compared with 97.2 %, the self-agreement
ratio of a well-trained human annotator, this accuracy is fairly satisfactory.
| | | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 29 of 34 | Prosodic boundaries in spontaneous Russian: perceptual annotation and automatic classification AUTHOR(S):
Nesterenko, Irina; Laboratoire Parole et Langage
Abstract:
Perceptual experiments with French- and Russian-speaking subjects were used to locate
intonation phrase boundaries under different experimental conditions. Once
inter-listener agreement had been evaluated, we built an automatic predictor based on human
boundary/no-boundary judgments and then evaluated how well the predictor
behaves. This predictor operates on acoustic features and we looked for an optimal
combination of features to mimic perceptual experiment results.
| | | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 30 of 34 | Semi-Automatic Prosodic Transcription of Spoken Spanish in XML AUTHOR(S):
Velázquez, Eduardo; Freie Universität Berlin
Abstract:
XML (Extensible Mark-up Language) is designed to represent hierarchical structures;
in this case, it shows the structure of the prosodic components of spoken
language. The XML-based transcription system proposed here allows the input of
1) the phonetic parameters of F0, intensity and duration of each syllable, their relative
variation and standard values to facilitate discrimination and comparison; 2)
the distribution of feet; 3) the boundaries and characterization of intonation units
and utterances, and 4) other conversational phenomena such as pauses, overlaps,
interruptions, etc. This mark-up language is currently being used as an analysis
tool for a corpus of digitally-recorded conversations in the Mexican and Iberian
vernaculars of spoken Spanish.
| | | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 31 of 34 | MalToBI - Building an Annotated Corpus of Spoken Maltese AUTHOR(S):
Vella, Alexandra; University of Malta
Farrugia, Paulseph-John; University of Malta
Abstract:
Research on the phonetics and phonology, particularly the prosody, of Maltese is
limited. This is partly due to the lack of structured resources such as a corpus of
spoken Maltese, for use in research. Such a corpus, especially one including some
element of prosodic annotation, could be a useful tool for further research on the
prosodic structure, amongst other aspects, of Maltese. It could also be important
for continuing development of Text-to-Speech resources in the local context.
Recognition of the necessity for such a corpus gave rise to MalToBI, a project involving
the collection of a relatively small body of spoken Maltese, together with
the development of a Tone and Break Indices (ToBI) framework adapted for use
with Maltese. This paper outlines some aspects of Maltese prosody, describes the
development and design considerations involved in building this corpus and reports
on the progress made so far as well as intentions for future work.
| | | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 32 of 34 | Shape Display: Task Design and Corpus Collection AUTHOR(S):
Fon, Janice; Graduate Institute of Linguistics, National Taiwan University
Abstract:
This study introduces a new paradigm for spontaneous dialog elicitation and a
small multilingual corpus collected using this paradigm. Pairs of subjects were
seated in separate booths and were each given a felt-covered board and a bag
of assorted felt pieces of various shapes and colors. The goal was to make the
layout of the felt pieces the same on the two boards with the fewest moves. In
order to test how accommodating the paradigm is to cross-linguistic/cross-cultural
experimental designs, 32 speakers of three different languages, English, Mandarin
(Guoyu and Putonghua), and Japanese participated in the study. Subjects found
the paradigm entertaining and engaged themselves in the game without paying
much conscious attention to their linguistic performances. The elicited dialogs
were spontaneous enough to allow further phonetic and discourse research.
| | | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 33 of 34 | Optimization of MFNs for Signal-based Phrase Break Prediction AUTHOR(S):
Hofmann, Michael; Dresden University of Technology
Jokisch, Oliver; Dresden University of Technology
Abstract:
The automatic prosodic annotation of large speech corpora is receiving increasing attention
because appropriate databases for the training of prosodic models in speech
synthesis and recognition are needed. On the linguistic level, correct phrase and accent
marking are essential processing steps. The authors developed a neural network
based method for signal-based phrase break prediction and tested this method
across two different speech databases.
The structure of the multilayer feed-forward neural network (MFN) was
optimized and adapted to the target database and to the specific annotation task.
The method is rather data sensitive - depending on different human labelers and
small differences across training databases, like frequency of occurrence or strength
of phrase breaks. The MFN method can be easily adapted to the characteristics
of different databases (long or short phrases, special formats like dates or web
addresses, etc.). If applied to different databases which contain phrase markers of
human experts, phrase break recognition rates vary from 79% up to 97%.
| | | Poster Session 5: Speech Technology - Part III: Annotation and Speech Corpus Creation | 34 of 34 | Automatic Construction of a Prosodically Rich Text Corpus for Speech Synthesis Systems AUTHOR(S):
Lambert, Tanya
Abstract:
This paper presents a method for an automatic compilation of a phonologically
rich text database, which is used in a concatenative text-to-speech (TTS) synthesis
system. In the method described here, linguistic features are predicted from text
using Festival's linguistic engine. A set of phonological units for a specific text is
compiled from AVLs. The set of phonological units is used in set cover algorithms
in conjunction with the corresponding rich transcription of text in order to generate
a compact and phonologically rich text corpus. This is an efficient way for
generating database prompts with a specific prosodic content; the prompts can
then be recorded and converted into voice. The method described here can be
used for languages other than English.
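The set cover step mentioned in this abstract can be illustrated with the classic greedy heuristic. The following is a minimal sketch only, assuming each candidate sentence has already been mapped to the set of phonological units it contains; the function name `greedy_set_cover` and the toy data are hypothetical and are not taken from the paper, which may use a different set cover variant.

```python
from typing import Dict, List, Set


def greedy_set_cover(sentences: Dict[str, Set[str]],
                     target_units: Set[str]) -> List[str]:
    """Greedily pick sentence ids until all coverable target units are covered."""
    uncovered = set(target_units)
    chosen: List[str] = []
    while uncovered:
        # Pick the sentence covering the most still-uncovered units.
        # Iterating over sorted ids makes ties resolve to the smallest id.
        best_id = max(sorted(sentences),
                      key=lambda sid: len(sentences[sid] & uncovered))
        gain = sentences[best_id] & uncovered
        if not gain:
            break  # the remaining units occur in no candidate sentence
        chosen.append(best_id)
        uncovered -= gain
    return chosen


# Toy example: unit labels are placeholders, not real phonological units.
candidates = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"d"},
    "s4": {"a"},
}
prompts = greedy_set_cover(candidates, {"a", "b", "c", "d"})
```

The greedy choice does not guarantee the smallest possible prompt list, but it is a common way to obtain a compact one.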
| |
Poster Session 6 (PS 6): Prosody and Affect Thursday, May 4, 14:30 - 16:00
Chair: Jürgen Trouvain |
| | Poster Session 6: Prosody and Affect | 1 of 13 | Optical Cues to the Visual Perception of Lexical and Phrasal Stress in English AUTHOR(S):
Scarborough, Rebecca; Stanford University, USA
Keating, Patricia; University of California, Los Angeles, USA
Baroni, Marco; University of Bologna, Italy
Cho, Taehong; Hanyang University, Korea
Mattys, Sven; University of Bristol, England
Alwan, Abeer; University of California, Los Angeles, USA
Auer, Edward; University of Kansas, USA
Bernstein, Lynne; House Ear Institute
Abstract:
In a study of optical cues to the visual perception of stress, three American English
talkers spoke words that differed in lexical stress and sentences that differed
in phrasal stress, while video and movements of the face were recorded. In a
production analysis, stressed vs. unstressed syllables from these utterances were
compared along many measures of facial movement, which were generally larger
and faster under stress. In a visual perception experiment, 16 perceivers identified
the location of stress in forced-choice judgments of video clips of these utterances
(without audio). Phrasal stress (54 % correct vs. 25 % chance) was
better-perceived than lexical stress (62 % correct vs. 50 % chance). The relation
of the visual intelligibility of the prosody of these utterances to the optical characteristics
of their production is discussed, with analysis of which cues are associated
with successful visual perception.
| |
| Poster Session 6: Prosody and Affect | 2 of 13 | Some gender and cultural differences in perception of affective expressions AUTHOR(S):
Erickson, Donna; Gifu City Women's College
Abstract:
This study investigates whether people can understand vocal affective expression
in a language that is not their native language, as well as whether there is a
difference in the way males and females understand vocal affective expressions.
We investigated the affectively-neutral Japanese word /banana/ as uttered with
five different affective expressions: angry, sad, surprised, suspicious, and happy.
The listeners were 20 American listeners, 9 Korean listeners, and 20 Japanese
listeners who were asked to indicate which affect they heard. The results showed
that the perception of affect differed according to the native language as well as
to the gender of the listener.
| | | Poster Session 6: Prosody and Affect | 3 of 13 | Signalling affect in Mandarin Chinese - the role of non-lexical utterance-final edge tones AUTHOR(S):
Mueller-Liu, Patricia; Institute of Phonetics, Saarland University
Abstract:
Of the five pitch-phenomena contained in Y.R. Chao's framework of Mandarin
Chinese intonation, the phenomenon termed 'successive tonal addition' has proved
highly elusive. Using communicatively-based spontaneous speech samples, the
first instrumental evidence of successive tonal addition is presented here, found
to consist of non-lexical pitch-movements added to the lexical tones of utterance-final
syllables. Investigation into the functions of these phenomena, referred to as
'edge tones', showed these to be affective in nature, signalling emotio-attitudinal
messages.
| | | Poster Session 6: Prosody and Affect | 4 of 13 | Paralinguistic Effects on Voice Quality: A Study in Japanese AUTHOR(S):
Menezes, Caroline; National Institute for Japanese Language
Maekawa, Kikuo; National Institute for Japanese Language
Abstract:
This study analyzes two spectral properties in vowel segments, H1-H2 (related to
glottal opening) and H1-A3 (related to the speed of vocal fold closing gesture) in
an attempt to infer the voice quality variation associated with different types of
paralinguistic information (PI). Results suggest that both glottal opening
and closing speed of the glottis differ significantly depending on PI. However, for
some PI types there were also significant syllable effects. The correlation between
pitch (F0) and these two voice parameters was very low leading to the conclusion
that just pitch differences cannot account for the observed voice quality variation.
Significant differences were also noted for the power of speech waveform (RMS)
according to PI. Inter-speaker variation was noted especially for 'suspicion'.
| | | Poster Session 6: Prosody and Affect | 5 of 13 | Neutral Speech Corpora - a test for neutrality AUTHOR(S):
Matte, Ana; UFMG
Abstract:
What is neutral speech? This paper reports the results of research
in phonostylistics of Brazilian Portuguese with the objective of determining the
necessary experimental conditions for recording so-called neutral speech. The experiment
was designed to test these two hypotheses: 1) The phrase, the minimal
prosodic unit, is also the minimal unit of meaning in studies of expressing emotion
in speech, even when our focus is on the production of complete texts that should
be taken as a single unit of meaning. 2) The speaker's reported self-impressions
can indicate certain sentences that have been affected by the reactions of the
speaker, which conflict with the objective of recording neutral speech, and therefore
should be rejected from a corpus of referential speech. The results obtained
validated both of the hypotheses and enabled us to formulate a single unique test
for neutral speech, recommended for the process of purging of referential corpora
in experimental phonology.
| | | Poster Session 6: Prosody and Affect | 6 of 13 | Emotion Recognition Using IG-based Feature Compensation and Continuous Support Vector Machines AUTHOR(S):
Wu, Chung-Hsien; Department of Computer Science and Information Engineering
Chuang, Ze-Jing; Department of Computer Science and Information Engineering
Abstract:
This paper presents an approach to feature compensation for emotion recognition
from speech signals. In this approach, the intonation groups (IGs) of the input
speech signals are firstly extracted. The speech features in each selected intonation
group are then extracted. With the assumption of linear mapping between feature
spaces in different emotional states, a feature compensation approach is proposed
to characterize the feature space with better discriminability among emotional
states. The compensation vector with respect to each emotional state is estimated
using the Minimum Classification Error (MCE) algorithm. For the final emotional
state decision, the compensated IG-based feature vectors are used to train Continuous
Support Vector Machines (CSVMs) for each emotional state. The CSVM
kernel function was chosen experimentally to be a radial basis function, and the experimental
result shows the proposed approach can obtain encouraging performance
for emotion recognition.
| | | Poster Session 6: Prosody and Affect | 7 of 13 | Emotion Recognition in the Noise Applying Large Acoustic Feature Sets AUTHOR(S):
Schuller, Bjoern; Technische Universitaet Muenchen
Arsic, Dejan; Technische Universitaet Muenchen
Wallhoff, Frank; Technische Universitaet Muenchen
Rigoll, Gerhard; Technische Universitaet Muenchen
Abstract:
Speech emotion recognition is mostly studied under ideal acoustic conditions:
acted and elicited samples in studio quality are used, alongside a few studies on
spontaneous field data. However, although noise plays an important
role in speech processing, its influence has so far hardly been considered here.
We therefore discuss affect estimation under noise conditions. On three well-known
public databases - DES, EMO-DB, and SUSAS - effects of post-recording
noise addition in diverse dB levels, and performance under noise conditions during
signal capturing, are shown. To cope with this new challenge we extend generation
of functionals by extraction of a large 4k hi-level feature set out of more than 60
partially novel base contours. These comprise, among others, intonation, intensity,
formants, HNR, MFCC, and VOC19. Fast Information-Gain-Ratio filter-selection
picks attributes according to noise conditions. Results are presented using Support
Vector Machines as classifier.
| | | Poster Session 6: Prosody and Affect | 8 of 13 | Speech Rates in French Expressive Speech AUTHOR(S):
Beller, Grégory; IRCAM
Hueber, Thomas; IRCAM
Schwarz, Diemo; IRCAM
Rodet, Xavier; IRCAM
Abstract:
Expressive speech is a useful tool in cinema, theater and contemporary music.
In this paper we present a study on the influence of expressivity on the speech
rates of a French actor. It involves a relational database containing expressive
and neutral spoken French. We first describe the analysis partly based on a unit-selection
Text-to-Speech system. The range of data permits a statistical approach
to the speech rate. A dynamic description of the French speech rate is offered which
demonstrates its evolution in speech. Finally, several results are given concerning
pauses and breathing that help to distinguish between anger and happiness.
| | | Poster Session 6: Prosody and Affect | 9 of 13 | Temporal Interaction of Emotional Prosody and Emotional Semantics: Evidence from ERPs AUTHOR(S):
Paulmann, Silke; Max Planck Institute for Human Cognitive and Brain Sciences
Kotz, Sonja; Max Planck Institute for Human Cognitive and Brain Sciences
Abstract:
Emotional prosody helps us to understand how other people feel. Also, emotions
are transferred verbally. In order to further substantiate the underlying mechanisms
of emotional prosodic processing we investigated the interaction of both
emotional prosody and semantics with event-related brain potentials (ERPs) utilizing
a prosodic and interactive (prosodic/semantic) violation paradigm. Results
suggest that the time-course of the two channels differs. While a pure emotional
prosodic violation elicited a positivity between 450 ms and 600 ms, a violation of
both emotional prosody and semantics elicited a negativity between 500 ms and
650 ms. This suggests that emotional prosody and emotional semantics follow
a different time-course. This holds true for all six emotional prosodies investigated.
Also, the obtained results suggest that emotional prosody and semantics
contribute differentially during the interaction of both information types.
| | | Poster Session 6: Prosody and Affect | 10 of 13 | Voiced and Unvoiced Content of fear-type emotions in the SAFE Corpus AUTHOR(S):
Clavel, Chloé; THALES Research and Technology
Vasilescu, Ioana; ENST-TSI
Richard, Gaël; ENST-TSI
Devillers, Laurence; LIMSI-CNRS
Abstract:
The present research focuses on the development of a fear detection system for
surveillance applications based on acoustic cues. The emotional speech material
used for this study comes from the previously collected SAFE Database (Situation
Analysis in a Fictional and Emotional Database) which consists of audiovisual
sequences extracted from movie fictions. We address here the question of a specific
detection model based on unvoiced speech. For this purpose a set of features is
considered for voiced and unvoiced speech. The salience of each feature is evaluated
by computing the Fisher Discriminant Ratio for fear versus neutral discrimination.
This study confirms that the voiced content and the prosodic features in particular
are the most relevant. Finally the detection system merges information conveyed
by both voiced and unvoiced acoustic content to enhance its performance. Fear is
recognized with a success rate of 69.5 %.
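The abstract does not spell out how the Fisher Discriminant Ratio is computed. A common two-class, single-feature form is (mean difference) squared divided by the sum of the class variances, and the sketch below assumes that form; the function name and the sample values are hypothetical, and the authors' exact formulation may differ.

```python
from statistics import mean, pvariance
from typing import Sequence


def fisher_discriminant_ratio(class_a: Sequence[float],
                              class_b: Sequence[float]) -> float:
    """Two-class FDR for one feature: (mu_a - mu_b)^2 / (var_a + var_b)."""
    mu_a, mu_b = mean(class_a), mean(class_b)
    # Population variance is used here; a sample variance would also be defensible.
    denom = pvariance(class_a) + pvariance(class_b)
    if denom == 0:
        raise ValueError("both classes have zero variance; the ratio is undefined")
    return (mu_a - mu_b) ** 2 / denom


# Invented feature values for two classes (e.g. fear vs. neutral).
fdr = fisher_discriminant_ratio([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A larger ratio means the feature separates the two classes better relative to its spread within each class, which is what makes it usable for ranking features by salience.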
| | | Poster Session 6: Prosody and Affect | 11 of 13 | Attitudinal Patterns in Brazilian Portuguese Intonation: Analysis and Synthesis AUTHOR(S):
de Morães, Joao Antônio; Universidade Federal do Rio de Janeiro
Stein, Cirineu Cecote; Universidade Federal do Rio de Janeiro
Abstract:
The main goal of this paper is to investigate the prosodic manifestation of the
following attitudinal states: consideration, despair, disappointment, irony, justification,
obviousness, and uncertainty. The sentence O Carlos Alberto já sabe.
[Carlos Alberto already knows it.] was pronounced by a subject, who tried to
convey each of these attitudes. Afterwards, it was presented to 20 panelists, who
were asked to identify the original intention of each enunciation. The attitudes
were, in general, correctly identified. The acoustic analysis revealed that the attitudinal
patterns make use of distinct prosodic parameters in their manifestation:
some are linked to segmental duration, be it global or localized; in other cases, the
decisive prosodic component is the fundamental frequency. Auditory tests using
speech resynthesis made it possible to evaluate the relative weight of the prosodic
characteristics identified in the analysis.
| | | Poster Session 6: Prosody and Affect | 12 of 13 | Comparing vocal parameters in spontaneous and posed child-directed speech AUTHOR(S):
Schaeffler, Felix; Department of Philosophy and Linguistics, Umeå
Kempe, Vera; Department of Psychology, Stirling University
Biersack, Sonja; Department of Psychology, Stirling University
Abstract:
Research on the facial expression of emotion distinguishes between correlates of
posed vs. spontaneous emotion expression. Similar research in the vocal domain is
lacking. In this study, we compare changes in a range of vocal parameters between
posed vs. spontaneous adult-directed (AD) and child-directed (CD) speech. CDS
is a highly affectively charged speech register which lends itself well to the study
of posed vs. spontaneous emotion expression. A group of mothers addressed an
adult and their child, and a group of non-mothers addressed an imaginary adult
and an imaginary child. The results confirm adjustments in pitch, formants and
speech rate typically reported for CDS in both groups. At the same time, they
show that source parameters not in service of linguistic function, such as shimmer
(perturbations in fundamental period amplitude) and harmonics-to-noise ratio,
show clear group effects, suggesting that they may constitute veridical indicators
of spontaneous emotion expression.
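As an illustration of the shimmer measure mentioned above, the sketch below computes a commonly used "local shimmer": the mean absolute difference between the peak amplitudes of consecutive fundamental periods, divided by the mean amplitude. This definition, the function name and the input values are assumptions made for illustration; the abstract does not state which shimmer variant was used.

```python
from typing import Sequence


def local_shimmer(period_amplitudes: Sequence[float]) -> float:
    """Mean absolute amplitude difference of consecutive periods over mean amplitude."""
    if len(period_amplitudes) < 2:
        raise ValueError("at least two periods are needed")
    diffs = [abs(a - b)
             for a, b in zip(period_amplitudes, period_amplitudes[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_amp = sum(period_amplitudes) / len(period_amplitudes)
    return mean_diff / mean_amp


# Invented per-period peak amplitudes; real values would come from a pitch-period tracker.
steady = local_shimmer([1.0, 1.0, 1.0])
unsteady = local_shimmer([1.0, 3.0])
```

A perfectly steady voice gives a value of zero; larger values indicate more cycle-to-cycle amplitude perturbation.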
| | | Poster Session 6: Prosody and Affect | 13 of 13 | How prosodic attitudes can be false friends: Japanese vs. French social affects AUTHOR(S):
Shochi, Takaaki; ICP
Aubergé, Véronique; ICP
Rilliard, Albert; ICP
Abstract:
The attitudes of the speaker during a verbal interaction are affects linked to the
speaker's intentions, and are shaped by language and culture. They make up a very
large part of the affects expressed during an interaction and are voluntarily controlled. This
paper describes several experiments which show that some attitudes belong to both
Japanese and French and are implemented in perceptually similar prosody, but
that some Japanese attitudes do not exist in French and/or are wrongly decoded by French
listeners. Results are presented for 12 attitudes and three levels of language proficiency (naive,
beginner, intermediate). It must particularly be noted that French listeners who are naive
in Japanese can recognize admiration, authority and irritation very well; that they
do not discriminate Japanese question and declaration before the intermediate level;
and that extreme Japanese politeness is interpreted as impoliteness by French
listeners, even when they speak Japanese well.
| |
Oral Session 5 (OS 5): Prosody in Pathology and Ageing Thursday, May 4, 16:00 - 17:40
Chair: Joan Ma
|
| Oral Session 5: Prosody in Pathology and Ageing | 1 of 5 | Ageing and Speech Prosody AUTHOR(S):
Zellner Keller, Brigitte; Institut de Psychologie, UNIL
Abstract:
Ageing is part of the normal evolution of human beings. Demographic projections
to 2030 indicate that more than 60 countries will have at least 2 million people
aged 65 or older. Yet knowledge about speech in the elderly is still dispersed and
incomplete, in particular in the area of normal ageing. Prosody within a linguistic
community is triggered by a number of parameters which are investigated (see this
conference). Yet, little is currently known about the longitudinal evolution of this
speech component. This paper is a first state-of-the-art review of speech prosody and
ageing, with the hope that more researchers in speech sciences will investigate this
domain.
| | | Oral Session 5: Prosody in Pathology and Ageing | 2 of 5 | Evaluation of Tracheoesophageal Substitute Voices Using Prosodic Features AUTHOR(S):
Haderlein, Tino; Universität Erlangen-Nürnberg
Nöth, Elmar; Universität Erlangen-Nürnberg
Schuster, Maria; Universität Erlangen-Nürnberg
Eysholdt, Ulrich; Universität Erlangen-Nürnberg
Rosanowski, Frank; Universität Erlangen-Nürnberg
Abstract:
Tracheoesophageal (TE) speech is one way to restore the ability to speak after
laryngectomy, i.e. after the removal of the larynx. TE speech often shows low
audibility and intelligibility which makes it a challenge for the patients to communicate.
In speech rehabilitation the patient's voice quality has to be evaluated.
As no objective means of classification exists so far and automation of this
procedure is desirable, we performed initial experiments for automatic evaluation
using prosodic features. Our reference was a set of scores for several evaluation
criteria for TE speech from five experienced raters. Correlation coefficients of up
to 0.84 between human and automatic rating are promising for future work.
| | | Oral Session 5: Prosody in Pathology and Ageing | 3 of 5 | Functionality and perceived atypicality of expressive prosody in children with autism spectrum disorders AUTHOR(S):
Peppe, Sue; Queen Margaret University College, Edinburgh
Martinez Castilla, Pastora; Universidad Autonoma Madrid
Lickley, Robin; Queen Margaret University College, Edinburgh
Mennen, Ineke; Queen Margaret University College, Edinburgh
McCann, Joanne; Queen Margaret University College, Edinburgh
O'Hare, Anne; University of Edinburgh
Rutherford, Marion; Royal Hospital for Sick Children, Edinburgh
Abstract:
People with autism are perceived to have 'odd' prosody, but is it malfunctioning?
A new prosody test assesses the functionality of prosody in four aspects of
speech (phrasing, affect, turn-end and focus) by tasks that elicit utterances in
which prosody alone conveys the meaning. The test was used with 100 typically-developing
children (TD), 39 with Asperger's syndrome (AspS) and 31 with high-functioning
autism (HFA). In the results, HFA < TD on all six tasks, HFA < AspS
on four, and AspS < TD on one. In perception experiments, judges rated the
atypicality of the prosody in samples of conversation from participants in each of
the three groups. Correlation between the judges' ratings was high, and ANOVAs
showed differences between groups similar to those found in the test results. The
ratings correlated significantly (mainly at the 0.01 level) with the test's output
scores. The findings support the ecological validity of the test for use as a clinical
assessment tool.
| | | Oral Session 5: Prosody in Pathology and Ageing | 4 of 5 | Dysprosody in Parkinson's disease: Musical scale production and intonation patterns analysis AUTHOR(S):
Rigaldie, Karine; Laboratoire Jacques Lordat
Nespoulous, Jean-Luc; Laboratoire Jacques Lordat
Vigouroux, Nadine; IRIT
Abstract:
This article aims to gain a better understanding of prosodic disturbances in Parkinson's
disease via an acoustic analysis. Our aim is twofold: firstly, to identify
phonetic and prosodic parameters that are specific to this pathology; secondly,
to study the effect of a pharmacological treatment (based on dopamine) on these
patients' speech production. In order to determine the effect of dopamine, oral
productions of 8 parkinsonian patients have been collected, in the OFF and ON
states, and have then been compared to those of control subjects. The specific
aim of this study is (a) to examine the ability of patients to handle the variations
in fundamental frequency of their voice as well as to master the rise in frequency
required by the task (i.e. production of the musical scale and intonation patterns)
and (b) to measure the palliative effects that can be induced, at least partly, in
the management of frequency by a treatment based on L-Dopa.
| | | Oral Session 5: Prosody in Pathology and Ageing | 5 of 5 | Consonant and Vowel Duration in Parkinsonian Speech AUTHOR(S):
Duez, Danielle; CNRS
Abstract:
The current study compared consonant and vowel duration in speech read by 10
French Parkinsonian speakers and 10 control speakers. The results show a different
impact of Parkinson's disease (PD) on speech segments. Consonants were
shortened in PD speech while vowels were significantly longer. This results from
the concomitance of articulatory movements of reduced amplitude and orofacial
bradykinesia. As a consequence, syllabic productions are of the same duration in
PD speech as in normal speech. The durational contrast of consonants was maintained;
for vowels there was less agreement with the normal pattern of intrinsic
duration, especially for high vowels.
| |
Special Session 4 (SPS 4): Prosody in Automatic Speech Recognition Organizer: Sin-Horng Chen
Friday, May 5, 09:00 - 11:00 |
| Special Session 4: Prosody in Automatic Speech Recognition | 1 of 6 | Recognizing Mandarin Chinese Fluent Speech Using Prosody Information - An Initial Investigation AUTHOR(S):
Tseng, Chiu-yu; Institute of Linguistics, Academia Sinica
Abstract:
By applying our hierarchical prosody framework for fluent speech that specifies
boundary breaks and boundary information, we were able to recognize speech
paragraphs and various levels of prosodic units within each such paragraph. These
recognized prosodic units are not unrelated speech units but rather, sister constituents
that entail higher-up syntactic as well as semantic relationships that cumulatively
make up fluent continuous speech. Note how this top-down approach differs
from most bottom-up approaches. The former offers information from higher
up linguistic association whereas the latter treats identified Chinese syllables as
discrete unrelated units or lexical words at most, leaving structural information
unaddressed. We believe using top-down prosody information may very well offer
new breaking ground in fluent speech recognition.
| | | Special Session 4: Prosody in Automatic Speech Recognition | 2 of 6 | Detection of Fillers Using Prosodic Features in Spontaneous Speech Recognition of Japanese AUTHOR(S):
Hirose, Keikichi; The University of Tokyo
Abe, Yu; The University of Tokyo
Minematsu, Nobuaki; The University of Tokyo
Abstract:
A new scheme of detecting fillers in spontaneous speech recognition was developed.
When a filler hypothesis appears during the decoding process, a prosodic module
checks the morpheme, hypothesized as filler, and outputs the filler likelihood score.
When the likelihood score exceeds a threshold, a prosodic score is added to the
language score of the hypothesis as a bonus. The prosodic module is constructed
using a five-layered perceptron. A comparative recognition experiment with and
without the prosodic module was conducted for 100 utterances of spontaneous
Japanese speech. Seven fillers originally misrecognized as non-fillers are correctly
recognized as fillers when the prosodic module is used. No fillers originally
recognized as fillers are wrongly recognized as non-fillers. Although a few non-filler
morphemes are misrecognized as other non-filler morphemes after the introduction
of the prosodic module, they can be corrected by properly setting parameters of
the recognizer.
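The rescoring rule described in this abstract can be sketched as follows. This is an illustrative simplification: the abstract does not say how the prosodic bonus is derived from the filler likelihood, so the sketch assumes it is the likelihood scaled by a weight. The function name, the default threshold and the weight are all hypothetical.

```python
def rescore_filler_hypothesis(language_score: float,
                              filler_likelihood: float,
                              threshold: float = 0.5,
                              bonus_weight: float = 1.0) -> float:
    """Add a prosodic bonus to the language score when the filler likelihood exceeds the threshold."""
    if filler_likelihood > threshold:
        # Assumed form of the bonus: a weighted copy of the prosodic likelihood.
        return language_score + bonus_weight * filler_likelihood
    # Below the threshold the hypothesis keeps its original language score.
    return language_score


# Invented scores for two filler hypotheses during decoding.
boosted = rescore_filler_hypothesis(-10.0, 0.8, threshold=0.5, bonus_weight=2.0)
unchanged = rescore_filler_hypothesis(-10.0, 0.3, threshold=0.5, bonus_weight=2.0)
```

Because the bonus is only ever added, a hypothesis already recognized as a filler cannot lose score through this step, which is consistent with the reported result that no correct fillers were lost.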
| | | Special Session 4: Prosody in Automatic Speech Recognition | 3 of 6 | A New Approach of Using Temporal Information in Mandarin Speech Recognition AUTHOR(S):
Yang, Jyh-Her; National Chiao Tung University
Liao, Yuan-Fu; National Taipei University of Technology
Wang, Yih-Ru; National Chiao Tung University
Chen, Sin-Horng; National Chiao Tung University
Abstract:
In this paper, a new approach of using temporal information to assist in Mandarin
speech recognition is discussed. It incorporates two types of temporal information
into the recognition search. One is a statistical syllable duration model which
considers the influences of 411 base-syllables, 5 tones, 4 position-in-word factors,
and 3 position-in-sentence factors on syllable duration. Another is the timing information
of modeling three types of inter-syllable boundary including intra-word,
inter-word without punctuation mark (PM), and inter-word with PM. The uses of
these two types of temporal information are expected to be useful for improving
the segmentation accuracies in both acoustic decoding and linguistic decoding. Experimental
results showed that the base-syllable/character/word recognition rates
were slightly improved for both the MATBN and Treebank databases.
| | | Special Session 4: Prosody in Automatic Speech Recognition | 4 of 6 | Exploiting Glottal and Prosodic Information for Robust Speaker Verification AUTHOR(S):
Liao, Yuan-Fu; National Taipei University of Technology, Taiwan
Zeng, Zhi-Ren; National Taipei University of Technology, Taiwan
Chen, Zi-He; National Central University, Taiwan
Juang, Yau-Tarng; National Central University, Taiwan
Abstract:
In this paper, three different levels of speaker cues including the glottal, prosodic
and spectral information are integrated together to build a robust speaker verification
system. The major purpose is to resist the distortion of channels and handsets.
In particular, the dynamic behavior of the normalized amplitude quotient (NAQ) and
prosodic feature contours is modeled using Gaussian mixture models (GMMs)
and two latent prosody analysis (LPA)-based approaches, respectively. The proposed
methods are evaluated on the standard one-speaker detection task of the
2001 NIST Speaker Recognition Evaluation Corpus, where only one 2-minute training
sample and 30 seconds of trial speech (on average) are available. Experimental results have
shown that the proposed approach could improve the equal error rates (EERs) of
maximum a posteriori (MAP)-adapted GMMs and GMMs+T-norm approaches from
12.4 % and 9.5 % to 10.3 % and 8.3 % and finally to 7.8 %, respectively.
| | | Special Session 4: Prosody in Automatic Speech Recognition | 5 of 6 | Affect-Robust Speech Recognition by Dynamic Emotional Adaptation AUTHOR(S):
Schuller, Bjoern; Technische Universitaet Muenchen
Stadermann, Jan; Technische Universitaet Muenchen
Rigoll, Gerhard; Technische Universitaet Muenchen
Abstract:
Automatic Speech Recognition fails to a certain extent when confronted with
highly affective speech. In order to cope with this problem we suggest dynamic
adaptation to the actual user emotion. The ASR framework is built on a hybrid
ANN/HMM mono-phone 5k bi-gram LM recognizer. On this basis we show
adaptation to the affective speaking style. Speech emotion recognition takes place
prior to the actual recognition task to choose appropriate models. We therefore
focus on fast emotion recognition based on low extra feature extraction effort. As
databases for proof-of-concept we use a single digit task and sentences from the
well-known WSJ corpus. These have been re-recorded in acted neutral and angry
speaking styles under ideal acoustic conditions to exclude other influences. The effectiveness
of acoustic emotion recognition is also demonstrated on the SUSAS corpus. We
finally evaluate the need for adaptation and demonstrate significant superiority of
our dynamic approach over static adaptation.
| | | Special Session 4: Prosody in Automatic Speech Recognition | 6 of 6 | Improved Large Vocabulary Mandarin Speech Recognition Using Prosodic Features AUTHOR(S):
Huang, Jui-Ting; National Taiwan University
Lee, Lin-shan; National Taiwan University
Abstract:
This paper presents a new framework for improved large vocabulary Mandarin
speech recognition using prosodic features. The prosodic information is formulated
in a probabilistic model compatible with the conventional maximum a posteriori
(MAP) framework for large vocabulary speech recognition. A set of prosodic
features considering the special characteristics of Mandarin Chinese is developed,
and both syllable-level and prosodic-word-level prosodic models are trained with
the decision tree algorithm. A two-pass recognition process is used, in which
each word arc in the word graph output by the first pass is rescored in the
second pass using the two prosodic models. The experiments show reasonable
improvements in recognition accuracy. This approach does NOT require a prosodically
labeled training corpus and works for the large-scale speaker-independent task.
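The two-pass rescoring described in this abstract can be pictured with a minimal Python sketch. It assumes a log-linear combination of the first-pass score and a prosodic score, with a hypothetical interpolation weight; the abstract does not state the actual scoring formula, and the names WordArc, rescore and best_path are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class WordArc:
    word: str
    first_pass_logprob: float  # acoustic + language model score from pass 1
    prosodic_logprob: float    # hypothetical combined score of the prosodic models


def rescore(arcs, prosody_weight=0.5):
    """Return (word, new_score) pairs; the weighted sum is an assumption."""
    return [
        (arc.word, arc.first_pass_logprob + prosody_weight * arc.prosodic_logprob)
        for arc in arcs
    ]


def best_path(paths, prosody_weight=0.5):
    """Pick the path (list of arcs) with the highest total rescored log score."""
    def path_score(path):
        return sum(score for _, score in rescore(path, prosody_weight))
    return max(paths, key=path_score)
```

With prosody_weight set to 0 the second pass reproduces the first-pass ranking, so the weight controls how much the prosodic models are allowed to change the result.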
| |
Special Session 5 (SPS 5):
Articulatory-Functional Approaches to Speech Prosody Organizer: Yi Xu
Friday, May 5, 11:20 - 13:20 |
| Special Session 5: Articulatory-Functional Approaches to Speech Prosody | 1 of 4 | The Roles of Physiology, Physics and Mathematics in Modeling Prosodic Features of Speech AUTHOR(S):
Fujisaki, Hiroya; Professor Emeritus, The University of Tokyo
Abstract:
This paper presents the author's view on prosody, information, and models, as
well as on the roles of physiology, physics and mathematics in modeling, and
describes the theoretical and experimental bases of the command-response model
for the mechanisms of F0 contour generation, which has been extensively used
in the analysis and synthesis of F0 contours of utterances of various languages.
Although the model represents only those factors that are inherent to the control
mechanism of F0, it allows one to identify those factors that carry communicative
functions of speech as input commands and as parameters of the mechanism.
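For orientation, the command-response model is usually written as ln F0(t) = ln Fb + sum of phrase components + sum of accent components, where phrase commands are impulses and accent commands are pedestal (on/off) functions. The Python sketch below follows that commonly cited form; the default values of alpha, beta and gamma are typical illustrative settings and are assumptions, not values taken from this paper.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, Ga(t), capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, fb, phrase_cmds, accent_cmds,
               alpha=3.0, beta=20.0, gamma=0.9):
    """Compute F0 (Hz) at each time point.

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    fb: baseline frequency in Hz
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0, alpha)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1, beta, gamma)
                            - accent_response(t - t2, beta, gamma))
        contour.append(math.exp(log_f0))
    return contour
```

Before any command has started, the contour sits at the baseline fb; each command then adds its response in the logarithmic domain, which is why the commands can be read as the inputs that carry communicative functions.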
| | | Special Session 5: Articulatory-Functional Approaches to Speech Prosody | 2 of 4 | Planning Compensates for the Mechanical Limitations of Articulation AUTHOR(S):
Kochanski, Greg P.; The University of Oxford
Shih, Chilin; University of Illinois at Urbana-Champaign
Abstract:
We explore a simple model of speech articulation. The model consists of an articulator
combined with the ability to remember and improve the neural drive signal
for the articulator. Over many productions, the system learns a neural drive signal
that provides an accurate match for acoustically-defined targets. In fact, the
match can be better than expected, yielding narrower regions of coarticulation
than the intrinsic muscle response time. Further, despite the time delay introduced
by the muscle, the articulatory response has no time delay, because the
learned neural drive signal occurs in advance of changes in the acoustic targets.
Finally, we test the model against tonal production data from Mandarin conversation,
and show that it can represent non-trivial surface intonation patterns with
simple and linguistically reasonable targets.
| | | Special Session 5: Articulatory-Functional Approaches to Speech Prosody | 3 of 4 | What is Emphasis and How is it Coded? AUTHOR(S):
Kohler, Klaus J.; Institute of Phonetics and Digital Speech Processing (IPDS),
Christian-Albrechts-University at Kiel
Abstract:
The meaning category emphasis is examined with regard to its semantic, pragmatic,
and affective components and their prosodic coding in German, English,
and Dutch. In particular, a distinction is made between emphasis for focus, which
singles out elements of discourse by making them more salient than others, and
emphasis for intensity, which intensifies the meaning contained in the elements.
To evaluate intensity negatively a force accent comes into play, which is signalled
by non-pitch features. The question of universals is also addressed.
| | | Special Session 5: Articulatory-Functional Approaches to Speech Prosody | 4 of 4 | Speech prosody as articulated communicative functions AUTHOR(S):
Xu, Yi; University College London
Abstract:
Speech prosody, just like the segmental aspect of speech, conveys communicative
meanings by encoding functional contrasts. The contrasts are realized through articulation,
a biomechanical process with specific constraints. Prosodic phonology
or any other theory of prosody therefore cannot be autonomous from either communicative
functions or biophysical mechanisms. Successful modeling of speech
prosody can be achieved only if communicative functions and biophysical mechanisms
are treated as the core rather than the margins of prosody.
| |
Poster Session 7 (PS 7): Cross-linguistic Studies and Prosodic Variability Friday, May 5, 14:40 - 16:10
Chair: Kjell Gustafson |
| Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 1 of 26 | Stress Patterns of Complex German Cardinal Numbers AUTHOR(S):
Wagner, Petra; Universität Bonn
Paulson, Meike; Universität Bonn
Abstract:
German cardinal numbers show variable stress patterns on the phonetic surface.
Former studies showed that these cannot be explained by stress shift. In a combined
production and perception study, the hypothesis is tested that German cardinal
numbers are of a hybrid phonological nature: sentence medially, they behave
like compounds following the CSR, while they behave like phonological phrases
following the NSR when occurring phrase finally. The hypotheses were tested and
confirmed for the majority of cases.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 2 of 26 | The Temporal Structure of Penta- and Hexasyllabic Words in Estonian AUTHOR(S):
Lippus, Pärtel; University of Tartu
Pajusalu, Karl; University of Tartu
Teras, Pire; University of Tartu
Abstract:
This article concentrates on five- and six-syllable Estonian words consisting of two
or more metric feet of the first quantity degree (Q1), comparing the temporal
structures of the feet. After an introductory discussion of the problems related to
secondary stressed feet, the article first of all deals with half-length of unstressed
syllables in Q1 feet. This is followed by an analysis of durations and duration
ratios of primary and secondary stressed Q1 feet of five- and six-syllable words. It
appears that in these long words the temporal structure of Q1 feet is not similar.
It differs from the structure of Q1 feet of shorter (di- to tetrasyllabic) words where
there is a significant lengthening of the unstressed vowel (V2). The results show
that in Estonian the whole structure of the prosodic word determines the temporal
structure of feet.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 3 of 26 | Intonational Differences in Lombard Speech: Looking Beyond F0 Range AUTHOR(S):
Welby, Pauline; Institut de la Communication Parlée
Abstract:
Previous studies on speech in noise have generally reported an increase in fundamental
frequency (F0). I examine three other potential intonational differences:
choice of intonation pattern, tonal scaling, and tonal alignment. Seven French
speakers read short paragraphs in quiet and in 80 dB white noise. Four speakers
increased F0 range across the target accentual phrases in noise. Six speakers upscaled
individual tones; there was great inter-speaker variability in tonal scaling,
in contrast with an earlier study on Dutch. No influence of noise on intonation
pattern type was found; there was no tendency to produce more "early rises" in
noise, even though these rises are cues to word segmentation. Producing an early
rise (thus a LHLH or LHH pattern) may not add to the salience of the commonly
produced LH pattern. In addition, no difference in tonal alignment was found,
in contrast to the findings of an earlier study. This may be due to paradigm
differences between the studies.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 4 of 26 | A Perceptual Study on Variability in Break Allocation within Chinese Sentences AUTHOR(S):
Chu, Min; Microsoft Research Asia
Dong, Honghui; Institute of Automation, CAS
Tao, Jianhua; Institute of Automation, CAS
Abstract:
This paper investigates the variability of break allocations within Chinese sentences
by perceptual experimentation. The results confirm the existence of prosodic
chunks. We have found that (1) prosodic chunks are the basic units in the rhythmic
organization of Chinese utterances (breaks can generally be allocated by chunk
boundaries and breaks placed within a chunk will significantly decrease the naturalness
of synthesized speech); (2) given prosodic chunks, multiple break solutions
are acceptable. Furthermore, breaks can be allocated by chunk boundaries
using simple rules that impose a length-balance constraint without considering the
syntactic or semantic structure of a sentence.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 5 of 26 | Contextual Variability of Third-Tone Sandhi in Taiwan Mandarin AUTHOR(S):
Chen, Chun-Mei; University of Texas at Austin
Abstract:
This study investigates the phonetic property of Third-Tone Sandhi in Taiwan
Mandarin and the effects of contextual variability. The goal of this study is to
provide empirical evidence for the description of Tone 2 (T2) and Tone 3 (T3) in
Taiwan Mandarin and further to account for the phonetic features of T2 and T3 in
Third-Tone Sandhi Contexts. The results show that isolated T2 is different from
isolated T3 in Taiwan Mandarin. The phonetic T2 (< /T3/) derived from Third-
Tone Sandhi Rule in Sandhi Context has more raising effect than the underlying
T2 in the same Sandhi Context. The greater raising effect of the T3 in Sandhi
Context was supported by its longer vowel duration. Third-Tone Sandhi Rule
turns T3T3 into T2T3, and anticipatory dissimilation enhances the raising effect
on the Sandhi.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 6 of 26 | Voice quality variations throughout the study of the accent of Liverpool AUTHOR(S):
Coadou, Marion; Laboratoire Parole et Langage
Abstract:
Voice quality is a term which is frequently used by phoneticians; however, defining
it precisely is quite difficult. The voice quality of a speaker is the result of the interaction
between organic and phonetic factors (Abercrombie, D., 1967 and Laver,
J., 1980). The organic factors may refer, for example, to the size or the shape
of the vocal tract. The phonetic factors, which are studied here, can be due to
muscular adjustments learnt by the speakers in their social environment. First of
all, this study proposes a definition of some key-concepts in order to understand
voice quality. Then, the corpus is analysed using the Vocal Profile Analysis
Scheme. This pilot study on four subjects from Liverpool shows that it is possible
to observe variations of voice quality between various accents of the British Isles.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 7 of 26 | Prosodic Structure Affects the Production and Perception of Voice-Assimilated German Fricatives AUTHOR(S):
Kuzla, Claudia; Max-Planck-Institut für Psycholinguistik
Ernestus, Mirjam; Max-Planck-Institut für Psycholinguistik
Mitterer, Holger; Max-Planck-Institut für Psycholinguistik
Abstract:
Prosodic structure has long been known to constrain phonological processes. More
recently, it has also been recognized as a source of fine-grained phonetic variation
of speech sounds. In particular, segments in domain-initial position undergo
prosodic strengthening, which also implies more resistance to coarticulation in
higher prosodic domains. The present study investigates the combined effects
of prosodic strengthening and assimilatory devoicing on word-initial fricatives in
German, the functional implication of both processes for cues to the fortis-lenis
contrast, and the influence of prosodic structure on listeners' compensation for
assimilation. Results indicate that 1. Prosodic structure modulates duration and
the degree of assimilatory devoicing, 2. Phonological contrasts are maintained by
speakers, but differ in phonetic detail across prosodic domains, and 3. Compensation
for assimilation in perception is moderated by prosodic structure and lexical
constraints.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 8 of 26 | Is there a distinction between H+!H* and H+L* in standard German? Evidence from an acoustic and auditory analysis AUTHOR(S):
Rathcke, Tamara; Institute of Phonetics and Digital Speech Processing Kiel
Harrington, Jonathan; Institute of Phonetics and Digital Speech Processing Kiel
Abstract:
This paper is concerned with intonation in German and whether there is a phonological
distinction between two types of early peaks H+L* and H+!H*. Speech
perception and production data are presented to shed light on this issue. The results
show little evidence for a phonological distinction between these categories.
The results are interpreted in terms of the relationship between downstep and
early peak placement in German.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 9 of 26 | Acoustic Differentiation of L- and L-L% in Switchboard and Radio News Speech AUTHOR(S):
Kim, Heejin; University of Illinois at Urbana-Champaign
Yoon, Tae-Jin; University of Illinois at Urbana-Champaign
Cole, Jennifer; University of Illinois at Urbana-Champaign
Hasegawa-Johnson, Mark; University of Illinois at Urbana-Champaign
Abstract:
Acoustic evidence for a distinction between low-toned intermediate (ip) and intonational
phrase (IP) boundaries is presented from two speech corpora representing
spontaneous, conversational speech and scripted broadcast speech. Robust effects
of the two boundary levels are found in the phrase-final syllable rime in both
corpora. Nucleus duration is longer and the F0 value at rime end is lower at IP
boundaries compared to ip boundaries. Glottalization is also more frequent before
an IP boundary. Other effects of boundary level on the F0 and intensity contours
over the phrase-final rime are evident but variable across the two corpora. These
findings support the Beckman-Pierrehumbert theory of intonation (Beckman and
Pierrehumbert 1986) in its recognition of two levels of prosodic phrasing.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 10 of 26 | Additive Effects of Phrase Boundary on English Accented Vowels AUTHOR(S):
Lee, Eun-Kyung; University of Illinois at Urbana-Champaign
Cole, Jennifer; University of Illinois at Urbana-Champaign
Kim, Heejin; University of Illinois at Urbana-Champaign
Abstract:
This paper investigates cumulative effects of strengthening and lengthening on English
vowels across two prominence-bearing prosodic factors, phrasal accent and
prosodic phrase boundary. F1, F2 and duration measures are compared across
vowels in three prosodic contexts: ip-medial unaccented, ip-medial accented, and
ip-final accented. The results show that for most vowels there is only one degree of
vowel strengthening, conditioned by phrasal accent, without any additive strengthening
effect of prosodic phrase boundary. Lengthening is observed in both accent
and added phrase boundary conditions, and the effect is consistently cumulative
for at least some vowels, suggesting a gradient increase of duration as a function of
the strength of prosodic structure. This finding also provides compelling evidence
that strengthening and lengthening effects are two independent mechanisms that
serve to mark prosodically strong positions.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 11 of 26 | Is irregular phonation a reliable cue towards the segmentation of continuous speech in American English? AUTHOR(S):
Surana, Kushan; MIT
Slifka, Janet; MIT
Abstract:
This paper analyzes the potential use of irregular phonation as a cue for the
segmentation of continuous speech. The analysis is conducted on two dialect
regions of the TIMIT database which consists of read, isolated utterances. The
data set encompasses 114 speakers resulting in 1331 hand-labeled irregular tokens.
The study shows that 78 % of the irregular tokens occur at word boundaries and
5 % occur at syllable boundaries. Of the irregular tokens at syllable boundaries,
72 % are either at the junction of a compound word (e.g., "outcast") or at the
junction of a base word and a suffix. Of the irregular tokens which do not occur at
word or syllable boundaries, 70 % occur adjacent to voiceless consonants mostly
in utterance-final location. These observations support irregular phonation as an
acoustic cue for syntactic boundaries in connected speech. Detection of regions of
irregular phonation could improve speech recognition and lexical access models.
[Work supported by NIH # DC02978.]
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 12 of 26 | A preliminary study of prosodic patterns in two varieties of suburban youth speech in France AUTHOR(S):
Le Gac, David; University of Rouen
Jamin, Mikaël; Nottingham University
Iryna, Lehka; University of Rouen
Abstract:
This paper presents the first results of research on the prosodic characteristics of
French speakers living in two poor multi-ethnic suburbs located in the north of
Paris and in Rouen. The emphasis is on the acoustic analysis and the comparison
of some particular prosodic patterns which are frequently used in the suburban
youth speech. We show that there is no noteworthy difference between speakers
from both suburbs. In particular, we found that both groups of speakers use
rise-fall patterns associated with short syllables at the end of IP. This pattern
is atypical in standard French, and its presence in both groups suggests that it
constitutes a prosodic marker that is essential to the suburban accent identification.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 13 of 26 | Evidence for 'soft' preplanning in tonal production: Initial scaling in Romance AUTHOR(S):
Prieto, Pilar; ICREA-Universitat Autónoma de Barcelona
D'Imperio, Mariapaola; CNRS-Université de Provence
Elordieta, Gorka; Euskal Herriko Unibertsitatea
Frota, Sónia; Universidade de Lisboa
Vigário, Marina; Universidade do Minho
Abstract:
In this study, the scaling of utterance-initial f0 values and H initial peaks is
examined in several Romance languages as a function of phrasal length. The motivation
for this study stems from contradictory claims in the literature regarding
whether the height of the initial f0 values and peaks is governed by a look-ahead
or preplanning mechanism. A total of ten speakers of five Romance language varieties
(Catalan, Italian, Standard and Northern European Portuguese, and Spanish)
read a total of 3720 declarative utterances (744 utterances per language) of
varying length in number of pitch accents and syllables. The data reveal that the
majority of speakers tend to begin higher in longer utterances. The failure to find
a correlation between phrase length and initial scaling for all speakers within languages
shows that we are dealing with soft preplanning (in [3]'s terms), that is, an
optional production mechanism that may be overridden by other tonal features.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 14 of 26 | A scaling contrast in Majorcan Catalan interrogatives AUTHOR(S):
Vanrell, Maria del Mar; Universitat Autónoma de Barcelona
Abstract:
This paper reports the application of the Categorical Perception Paradigm (CP)
to a pitch height contrast in Majorcan Catalan. The first hypothesis is that pitch
height is the primary perceptual cue in distinguishing yes-no questions from wh-questions
in Majorcan Catalan. The second hypothesis predicts that, as in previous
studies, the application of the CP involves the presence of order of presentation
effects in the results of the discrimination task. The results show that the primary
perceptual cue is the presence of upstep in yes-no questions and confirm the
existence of an order of presentation effect that deserves further investigation.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 15 of 26 | Morphotonology for TTS in Niger-Congo languages AUTHOR(S):
Gibbon, Dafydd; Universität Bielefeld
Urua, Eno-Abasi; University of Uyo
Abstract:
Many East Asian languages have lexical (i.e. phonemic) prosody; African languages
are also frequently mentioned as tone languages. However, tone functionality
in African tone languages is fundamentally morphosyntactic rather than
phonemic: (a) tonal pattern types are restricted to particular parts of speech, (b)
tones may be inflectional and play a role in (c) derivational and (d) compounding
word formation patterns, and (e) in syntactic phrasal templates. The aim of
this paper is to document the morphosyntactic functionality of tones in African
languages within a typological context as compared to East Asian tone languages
such as Mandarin, and to develop finite state architectures for tone handling in
practical Text-To-Speech synthesis in health and agriculture information projects
in Ivory Coast and Nigeria. Morphosyntactic tone is illustrated for Ibibio (Lower
Cross, South-Eastern Nigeria).
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 16 of 26 | Non- and Quasi-lexical Realizations of 'Positive Response' in Korean, Polish and Thai AUTHOR(S):
Karpiński, Maciej; Adam Mickiewicz University
Kleśta, Janusz; Adam Mickiewicz University
Szalkowska, Emilia; Adam Mickiewicz University
Abstract:
This paper presents a basic comparative study of Korean, Polish and Thai short
words, quasi-words and vocalizations used to perform the dialogue moves collectively
referred to as "positive responses" in map task dialogues. Some of these
units are produced as non-linguistic vocalizations, while others are "fully legitimate"
linguistic entities. The frequencies of occurrence for the analyzed units
were quite high and similar for the three languages. The numbers of expression
categories were almost identical. However, the tendencies found in the Korean and
Thai intonational contours were more distinct than for Polish. The inventories of
units for all the three languages included borrowings. The nasal vocalization mhm
not only ranked among the most popular expression categories for each of the languages,
but was also consistently produced with a rising contour. The normalized
pitch change was remarkably higher in the Polish expressions than in the Korean
and Thai units.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 17 of 26 | Replicating in Naxi (Tibeto-Burman) an Experiment Designed for Yorùbá: An Approach To 'Prominence-Sensitive Prosody' vs. 'Calculated Prosody' AUTHOR(S):
Michaud, Alexis; Laboratoire de Phonétique et Phonologie (UMR 7018) CNRS/
Paris 3 Sorbonne Nouvelle
Abstract:
An experiment originally designed to investigate the tones of Yorùbá (H, M and
L) is here replicated for Naxi, a Tibeto-Burman language which likewise has H,
M and L tones. The data consist of sentences in which all syllables bear the
same tone. For Naxi, the stylisation of the F0 curves raises difficulties that were
apparently not present in Yorùbá: in Naxi, intonational junctures are manifested
by lengthening and a downward tilt in F0 which may not be adequately captured
by the two-point stylisation used for Yorùbá. The typological discussion suggests
that there may be a continuum between (i) the 'calculated prosody' of languages
such as Ngamambo, whose prosodic structure hinges on the calculation of a tone
sequence, and (ii) the 'prominence-sensitive prosody' of languages such as English,
Chinese or Vietnamese (and to a lesser extent Naxi), in which intonation appears to
reflect phrasing and informational structure in a flexible, typically noncategorical
way.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 18 of 26 | Pitch and Voice Quality Characteristics of the Lexical Word-Tones of Tamang, as Compared with Level Tones (Naxi data) and Pitch-plus-Voice-Quality Tones (Vietnamese data) AUTHOR(S):
Michaud, Alexis; Laboratoire de Phonétique et Phonologie (UMR 7018) CNRS/
Paris 3 Sorbonne Nouvelle
Mazaudon, Martine; LACITO, UMR 7107 CNRS/ Paris 3 & 4
Abstract:
The tones of Tamang (Sino-Tibetan family) involve both F0 and voice quality
characteristics: two of the four tones (tones 3 and 4) were reported to be breathy
in studies from the 1970s. For the present research, audio and electroglottographic
data were collected from 5 speakers in their 30s or 40s. Voice quality is estimated
by computing the glottal open quotient. The present results (bearing on 788
syllables) show that in the speech of three speakers, tones 3 and 4 have a higher
open quotient (providing an indirect cue to breathiness) than tones 1 and 2. The
difference in open quotient between the four tones for the other two speakers
is negligible or inconsistent. The Tamang data are compared with similar data
from Naxi, which possesses level tones, and from Vietnamese, which possesses
pitch-plus-voice-quality tones. The results appear to confirm that Tamang tones
possess several correlates; they offer an insight on ongoing change in the prosodic
system of Tamang.
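The open quotient used in this abstract as a voice quality estimate is, by its usual definition, the fraction of each glottal cycle during which the glottis is open. A minimal Python sketch of that ratio follows; it assumes that glottal closing and opening instants have already been detected on the electroglottographic signal, and the function and argument names are hypothetical rather than taken from the paper.

```python
def open_quotients(closures, openings):
    """Open quotient per glottal cycle.

    closures: ascending times (s) of successive glottal closing instants
    openings: time (s) of the opening instant inside each cycle,
              so len(openings) == len(closures) - 1
    """
    quotients = []
    for i, t_open in enumerate(openings):
        period = closures[i + 1] - closures[i]    # full cycle duration
        open_phase = closures[i + 1] - t_open     # opening to next closure
        quotients.append(open_phase / period)
    return quotients
```

A higher value means the glottis stays open for a larger share of the cycle, which is why it serves as an indirect cue to breathiness.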
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 19 of 26 | The intonation of Banyumas Javanese AUTHOR(S):
Stoel, Ruben; Universiteit Leiden
Abstract:
I will present an analysis of the intonation of the Banyumas dialect of Javanese (an
Austronesian language spoken in Indonesia), based on the autosegmental-metrical
framework. As Javanese is a language without word stress, I assume that there
are no pitch accents. Accentual Phrases (AP) are marked by boundary tones.
A H% tone marks the end of a pre-nuclear AP, while the nuclear AP ends in a
HL%, LH%, or HL0% tone. This tone marks the end of the focus. Any postfocal
material appears in an encliticized AP. This material must correspond to a
syntactic XP. Contrastive focus at the word level is possible in only a few special
constructions.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 20 of 26 | Syllable cut and energy contour: a contrastive study of German and Hungarian AUTHOR(S):
Mády, Katalin; Institute of German Studies, Pázmány Péter Catholic University
Tronka, Krisztián Z.; Institute of German Studies, Pázmány Péter Catholic University
Reichel, Uwe D.; Department of Phonetics and Speech Communication, University
of Munich
Abstract:
Syllable cut is said to be a phonologically distinctive feature in some languages
where the difference in vowel quantity is accompanied by a difference in vowel
quality like in German. There have been several attempts to find the corresponding
phonetic correlates for syllable cut, from which the energy measurements of vowels
by Spiekermann proved appropriate for explaining the difference between long
and short vowels. On this basis, we intended to compare German as a syllable
cut language and Hungarian where the feature was not expected to be relevant.
However, the phonetic correlates of syllable cut found in this study do not entirely
confirm Spiekermann's results. It seems that the energy features of vowels are
more strongly connected to their duration than to their quality.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 21 of 26 | Lexical Stress Realisation: Native vs. ESL Speech AUTHOR(S):
Jian, Hua-Li; National Cheng Kung University
Abstract:
English stress placement in phrase-medial and phrase-final positions is investigated. Current
results indicate that Taiwanese ESL learners realise polysyllabic words that
carry various degrees of stress in two prosodic positions with considerable differences
relative to the native American English speakers, and the differences are
demonstrated from acoustical and phonetic perspectives.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 22 of 26 | Acoustic and perceptual cues for compound-phrasal contrasts in Vietnamese AUTHOR(S):
Nguyen, Thu; University of Queensland
Ingram, John; University of Queensland
Abstract:
This paper reports two experiments that examined the acoustic and perceptual
cues that Vietnamese use to distinguish between compounds and noun phrases. 15
minimal sets of the two patterns classified into three different word/phrase types
(noun-adjective (hoa [flower] hôong [pink]: pink flower), noun-verb (bò [ox] cày
[plough]: ox ploughing), and noun-noun (bàn [table] giây [paper]: paper table))
were recorded in two experimental conditions: one with a picture-naming task
and one with a minimal pair sentence task by 45 Vietnamese native speakers
of 3 dialects (Hanoi, Hue, and Saigon). In a perception task, the meaning of the
patterns is identified in a forced choice test by 15 listeners. The results showed that
while there is evidence that Vietnamese use juncture and pre-pausal lengthening to
distinguish between compounds and phrases, no significant acoustic and perceptual
evidence was found to support a claim for contrastive stress patterns between
compounds and noun phrases in Vietnamese.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 23 of 26 | Pitch Range is not Pitch Range AUTHOR(S):
Ulbrich, Christiane; University of Ulster
Abstract:
This paper presents a phonetic analysis of pitch range as perceived and measured
at the utterance and syllable levels. A previous analysis of read speech showed that
German speakers produced a larger pitch range at the utterance level, whereas Swiss
German speakers produced a larger pitch range at the syllable level. This analysis
was based on the production of broadcasters reading news messages and a fairytale,
both stylistically very restricted and largely standardized. Therefore, in the
present study semi-spontaneous and spontaneous utterances are analyzed to provide evidence
that these findings are cross-linguistic rather than discourse-specific. The evidence
was provided by auditory annotation and acoustic measurements.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 24 of 26 | Pitch range variation in child affective speech
AUTHOR(S):
Grichkovtsova, Ioulia; QMUC
Mennen, Ineke; QMUC
Abstract:
This study investigates pitch range variation in the affective speech of bilingual
and monolingual children. Cross-linguistic differences in affective speech may lead
bilingual children to express emotions differently in their two different languages.
A cross-linguistically comparable corpus of 6 bilingual Scottish-French children
and 12 monolingual peers was recorded according to the developed methodology.
The results show that the majority of children use pitch range measurements
(overall level and span) to realize differences between some emotions. Monolingual
children use the analyzed acoustic parameters in a much more homogeneous way than
bilinguals. Some results of bilingual children do not strictly correspond to those
of monolinguals, and show bidirectional interference.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 25 of 26 | The Effect of Glottalization on Voice Preference
AUTHOR(S):
Ding, Hongwei; Dresden University of Technology
Jokisch, Oliver; Dresden University of Technology
Hoffmann, Rüdiger; Dresden University of Technology
Abstract:
The impact of phrasal prosody on glottalization is documented in many publications.
Besides prosodic boundary and stress, other influencing factors such
as the speaking style have been studied. The work reported here examines the
relationship between the objective preference of listeners and the occurrence of
speakers' glottalization. The speech data in six languages were used for the multilingual
speaker selection in speech synthesis and have been compiled to listening
test phrases. Additional experiments, concerning the influence of reading style
on glottalization, were conducted with prosodically constant words or phrases in
two languages. Evaluating the statistics from this investigation, we come to the
following conclusions: (a) The occurrence and degree of glottalization can differ
across speakers. (b) As a prosodic effect, glottalization is NOT undesired
for speakers. (c) A well-defined reading style can increase the occurrence.
| | | Poster Session 7: Cross-linguistic Studies and Prosodic Variability | 26 of 26 | Transcribing intonational variation at different levels of analysis
AUTHOR(S):
Post, Brechtje; University of Cambridge
Delais-Roussarie, Elisabeth; CNRS / Université de Paris 7
Abstract:
In the transcription system for Intonational Variation (IVTS, derived from IViE),
prosodic features are transcribed on (1) the rhythmic tier, (2) the local phonetic
tier, (3) the global phonetic tier, and (4) the phonological tier. Each tier offers a
range of labels which share a general architecture, but language-specific parameters
determine which subset of labels a transcriber can choose from for the transcription
of a particular language variety, and how the different tiers are associated with one
another. In this paper, we will argue that the multi-linear architecture of IV-based
systems offers transparency, flexibility and standardization, three key advantages
in qualitative and quantitative studies of intonational variation across languages
and language varieties.
| |
Poster Session 8 (PS 8): Language Acquisition and Learning, Conversational Speech, and Neural Processing - Friday, May 5, 14:40 - 16:10
Chair: Nobuaki Minematsu |
Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 1 of 17 | Acquisition of Prosody in a Spanish-English Bilingual Child
AUTHOR(S):
Kim, Sahyang; Wayne State University
Andruski, Jean; Wayne State University
Casielles, Eugenia; Wayne State University
Nathan, Geoff; Wayne State University
Work, Richard; Wayne State University
Abstract:
This study examined the pattern of prosodic phrasing and the distribution of
post-lexical pitch accent types in a Spanish-English bilingual child. We collected
utterances from natural interactions between parents and the child, and analyzed
them using MAE ToBI and SP ToBI. We compared prosodic development across
ages, and compared the child's speech production with his parents' productions. Results
showed that both the child and his parents divide their short utterances into
smaller prosodic phrases and that most content words bear a post-lexical pitch accent,
which can make the word segmentation task easy for children. The majority
of the child's English words were produced with H*. This was similar to his father's
pitch accent pattern, but he produced a higher number of H* than his father. He
could produce the L+H* Spanish nuclear pitch accent with a similar frequency to
that found in his input, but could not produce as many L*+H as his mother in
the prenuclear pitch accent context.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 2 of 17 | Intonation Phrasing in Chinese EFL Learners' Read Speech
AUTHOR(S):
Chen, Hua; Nantong University
Abstract:
Intonation phrasing refers to the system of intonational choices that a speaker
has when associating complete intonation patterns with a text. The number of
patterns and the boundaries may vary and convey different meanings. This study
investigates the intonation phrasing patterns in Chinese EFL learners' read speech.
Recordings of 45 Chinese students were compared with those of 8 British native
speakers. The recorded speech was annotated and analyzed with
Praat, and the learners' prosodic features were compared with those of native
speakers in order to find the non-native-like aspects of learners' oral performance.
Findings show that learners differ from native speakers in 1) the frequency of
boundary markers, and 2) the realization of some tonality constraints. The study
has important implications for China's EFL pedagogy as well as for the improvement
of rating rubrics for China's oral English tests.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 3 of 17 | Prosodic characteristics in the Speech of Chinese EFL learners
AUTHOR(S):
Makarova, Veronika; University of Saskatchewan
Zhou, Xia; University of Saskatchewan
Abstract:
This study reports some prosodic characteristics in the quasi-spontaneous classroom
speech of Chinese EFL learners. Recordings of ten dialogues produced by
twenty second-year non-English majors were analyzed to extract the following
features: durations of inter- and intra-turn pauses, duration of filled pauses,
numbers of words per tone unit, tone unit durations, speech rates and pitch accent
type (tone) statistics. The deviations from standard native speech in the
areas of tonality and tonicity are also considered. The paper offers some practical
suggestions aimed at improving the prosodic characteristics of the English speech
of Chinese EFL learners.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 4 of 17 | A Rhythmic Analysis on Chinese EFL Speech
AUTHOR(S):
Li, Aijun; Institute of Linguistics, Chinese Academy of Social Sciences
Yin, Zhigang; Institute of Linguistics, Chinese Academy of Social Sciences
Zu, Yiqing; MOTOROLA Research Center China
Abstract:
This paper, based on a phonetic experiment, presents a contrastive study of the
rhythmic patterns of Chinese learners of English as a foreign language (CL2) as
compared with those of native speakers of both standard British and American
English (EL1), in terms of their respective pitch accent distribution patterns, prosodic
structures and duration patterns.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 5 of 17 | Unstressed vowels in non-native German
AUTHOR(S):
Gut, Ulrike; English Department, University of Freiburg
Abstract:
Vowel reduction and deletion are prominent correlates of stress in German and
some preliminary investigations have suggested that this constitutes an area of
difficulty for non-native speakers. This paper explores the production of vowels
in unstressed syllables by learners of German, focusing especially on the acoustic
properties of duration and formant structure. It is shown that the realization
of unstressed vowels in non-native German is influenced by the speakers' native
language (L1), but not by speaking style.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 6 of 17 | Native Intuitions of Speakers of a Lexical Accent System in L2 Acquisition of Stress. The Case of Russian Learners of Polish
AUTHOR(S):
Kijak, Anna; Utrecht Institute of Linguistics OTS
Abstract:
Native speakers of a lexical accent system (Russians) were tested on their L2
acquisition of a phonological stress system (Polish). In Russian, a sizeable part
of the lexicon is marked underlyingly for accents, and claims about the position of
default stress vary. This makes it interesting to investigate which L1 characteristics
(distribution of lexical accents vs. phonological default) are transferred to L2 (if
any). 35 Russian subjects were tested on their L2 production of Polish stress.
The data show a very consistent and almost uniform source of mistakes: the
stem-final position. These results mirror one of the claims about default stress
in Russian, suggesting that L2 errors originated from L1 transfer of that default.
L1 transfer generally did not reflect the distribution of lexical accents (though the
latter were not completely excluded, they were restricted in their type). Results
at the individual level suggest that different subjects may have followed two different L2
learning paths.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 7 of 17 | Using Prosodic and Voice Quality Features for Paralinguistic Information Extraction
AUTHOR(S):
Ishi, Carlos Toshinori; ATR/IRC
Ishiguro, Hiroshi; ATR/IRC
Hagita, Norihiro; ATR/IRC
Abstract:
The use of voice quality features in addition to prosodic features is proposed for
automatic extraction of paralinguistic information (like speech acts, attitudes and
emotions) in dialog speech. Perceptual experiments and acoustic analysis are conducted
for monosyllabic utterances spoken in several speaking styles, carrying a
variety of paralinguistic information. Acoustic parameters related to prosodic
and voice quality features potentially representing the variations in speaking styles
are evaluated. Experimental results indicate that prosodic features are effective
for identifying some groups of speech acts with specific functions, while voice quality
features are useful for identifying utterances with emotional or attitudinal
expressivity.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 8 of 17 | A trial of communicative prosody generation based on control characteristic of one word utterance observed in real conversational speech
AUTHOR(S):
Greenberg, Yoko; GITS, Waseda University
Shibuya, Nagisa; GITS, Waseda University
Tsuzaki, Minoru; Kyoto City University of Arts
Kato, Hiroaki; ATR Human Information Science Labs
Sagisaka, Yoshinori; GITS, Waseda University
Abstract:
Aiming at prosody control for conversational speech synthesis, communicative
prosodies were generated based on the prosodic characteristics derived from the
one-word utterance "n". First, a large number of "n" tokens recorded in a real environment
were analyzed using an F0 generation model to see what kinds of prosodic variation
could exist and how they were generated. Based on the results of the analysis
of "n", simple conversion rules to other speaking styles expressing three dimensions
of perceptual impression (confident-doubtful, allowable-unacceptable and
positive-negative) were established. Finally, a naturalness evaluation test was conducted
to see how effectively the prosody conversion rules derived from "n" could
be applied to authentic phrases. The results showed the validity of applying
the conversion rules to actual phrases. This indicates the possibility of systematic
prosody control for conversational speech synthesis using a corpus-based approach.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 9 of 17 | Intonational cues to discourse structure in Bari and Pisa Italian: perceptual evidence
AUTHOR(S):
Savino, Michelina; University of Bari
Grice, Martine; University of Cologne
Gili Fivela, Barbara; University of Lecce
Marotta, Giovanna; University of Pisa
Abstract:
Perception experiments for Bari and Pisa Italian showed that listeners can reliably
distinguish final and non-final utterances in discourse by means of intonation. Bari
listeners were also able to distinguish a third category, signalling that the end of
the discourse unit is approaching (penultimate position). This was not the case
for Pisa listeners.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 10 of 17 | The intonation of polar questions in two central varieties of Italian
AUTHOR(S):
Giordano, Rosa; University of Naples Federico II - University of Salerno
Abstract:
Growing attention is now being paid to the contrastive analysis of prosodic
structures and melodies (see, among others, considerations by Ladd 1996 or studies
by Grabe and other scholars and Peters et al. 2004). This paper presents a
contrastive analysis of question tunes in two regional varieties of
Italian (Lazio and Umbria), represented by a sample of map-task dialogues collected
in Rome and Perugia. A consistent similarity emerges, as these varieties share not only
the same intonative forms but also the same positional constraints and
the same distribution of the marked prosodic devices. Furthermore, different
accent types seem to be related to different kinds of questions. Differences between
the two varieties are found in the presence and use of some accentual and edge
tones.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 11 of 17 | Interaction of verb accentuation and utterance finality in Bangla
AUTHOR(S):
Dutta, Indranil; University of Illinois at Urbana-Champaign
Hock, Hans Henrich; University of Illinois at Urbana-Champaign
Abstract:
In this study we present data from three experiments that provide robust, unambiguous
evidence that Bangla conforms to the cross-linguistic avoidance of prominence
on utterance-final verbs in SOV languages.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 12 of 17 | Two contours, two meanings: the intonation of jaja in German phone conversations
AUTHOR(S):
Golato, Andrea; University of Illinois at Urbana-Champaign
Fagyal, Zsuzsanna; University of Illinois at Urbana-Champaign
Abstract:
This paper shows that jaja 'yes yes' sequences in German conversations carry two
distinct interactional meanings cued by their intonation and sequential placement.
Combined Conversation Analytic (CA) and Intonation Phonological analyses indicate
that jaja tokens uttered with H* L-% intonation (following GToBI) convey
that the previous speaker has persisted too long in a specific course of (verbal) action
which should therefore be stopped. By contrast, jaja tokens with L+H* L-%
intonation are used in situations of fractured intersubjectivity, i.e., immediately
after speakers misalign: with the jaja turn, its speaker treats the action/content of
the previous speaker's utterance as either unwarranted or self-evident. Speaking
rate and regional dialectal differences notwithstanding, the two types of contour
show significantly different peak alignment, and correspond to two distinct 'peak
accent' nuclear contours.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 13 of 17 | The Prosody of Suspects' Responses during Police Interviews
AUTHOR(S):
Fadden, Lorna; Simon Fraser University
Abstract:
This paper reports on the results of a pilot study on the prosody of Western Canadian
suspects' speech as it occurs during the course of investigative interviews with
police. Suspects' responses are categorized according to the type of information
they contain, and the prosodic characteristics of each response type are described.
It will be shown in this exploratory study that the various response types pattern
consistently across a group of suspects and that it is possible to construct a
set of prosodic profiles consisting of pitch range, average pitch, speech rate and
hesitation values associated with each response type.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 14 of 17 | Prosodic signalling of (un)expected information in South Swedish - An interactive manipulation experiment
AUTHOR(S):
Ambrazaitis, Gilbert; Center for Languages and Literature, Lund University
Abstract:
Starting from the German pitch peak timing categories and their communicative
functions, it is asked how these functions would be expressed in South Swedish.
The aim is to get a first impression as regards potentially relevant prosodic parameters
associated with the expression of expected vs. unexpected information
in South Swedish. For that, an interactive manipulation experiment is conducted,
where subjects manipulate the pitch contour and duration of monosyllabic test
utterances until the sound output adequately represents a given communicative
function. Swedish has a tonal word accent distinction, and all test words have
accent 1, normally produced with an early pitch fall. It is thus hypothesized that
in South Swedish, unlike in German, expected vs. unexpected information will not be expressed
through a difference in pitch peak timing. The results indeed clearly
hint at unexpected information being signalled by means of a higher, rather than
a later, pitch peak.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 15 of 17 | An fMRI study of multimodal deixis: preliminary results on prosodic, syntactic, manual and ocular pointing
AUTHOR(S):
Carota, Francesca; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Lœvenbruck, Hélène; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Vilain, Coriandre; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Baciu, Monica; Laboratoire de Psychologie et NeuroCognition, UMR CNRS 5105, UPMF, Grenoble
Abry, Christian; Institut de la Communication Parlée, UMR CNRS 5009, INPG, Univ. Stendhal, Grenoble
Lamalle, Laurent; INSERM IFR nº 1, RMN biomédicale, Unité IRM 3T, CHU de Grenoble
Pichat, Cédric; Laboratoire de Psychologie et NeuroCognition, UMR CNRS 5105, UPMF, Grenoble
Segebarth, Christoph; Unité Mixte INSERM / Univ. J. Fourier, U594, Grenoble
Abstract:
Deixis or pointing plays a crucial role in language acquisition and speech communication.
In this paper we present an innovative fMRI approach to
examine deixis, conceived as a unitary communicative strategy which employs
different verbal and non-verbal speech devices to achieve the pragmatic goal of
bringing relevant information to the interlocutors' attention. We designed a unified
fMRI paradigm for multimodal deixis, integrating four conditions of verbal
and non-verbal pointing: 1) prosodic focus, 2) syntactic extraction, 3) index finger
pointing, 4) eye pointing. Sixteen subjects were examined while they gave oral,
manual and ocular responses inside the 3T magnet imager. Preliminary results
based on a random-effects analysis with a group of 8 subjects show that all pointing
conditions recruit a left parieto-frontal network relative to the control condition.
The findings suggest that different modalities of deixis depend on a common
cerebral network.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 16 of 17 | The Use of Multi-pitch Patterns for Evaluating the Positive and Negative Valence of Emotional Speech
AUTHOR(S):
Cook, Norman D.; Kansai University
Fujisawa, Takashi X.; Kansai University
Abstract:
We report the application of a psychophysical model of harmony perception to the
analysis of speech intonation. The model was designed to reproduce the empirical
findings on the perception of musical chords, but it does not depend on specific
musical scales or tuning systems. Application to speech intonation produces values
corresponding to the total dissonance, tension and affective valence among the
dominant pitches used in the speech utterance.
| | | Poster Session 8: Language Acquisition and Learning, Conversational Speech, and Neural Processing | 17 of 17 | The neural mechanisms for understanding self and speaker's mind from emotional speech: an event-related fMRI study
AUTHOR(S):
Homma, Midori; Graduate School of Comprehensive Scientific Research
Imaizumi, Satoshi; Graduate School of Comprehensive Scientific Research
Maruishi, Masaharu; Hiroshima Prefectural Rehabilitation Center, Hiroshima
Muranaka, Hiroyuki; Hiroshima Prefectural Rehabilitation Center, Hiroshima
Abstract:
Using linguistically positive and negative words uttered either pleasantly or unpleasantly
by four speakers, we examined the brain regions that mediate speech
communication through event-related functional magnetic resonance imaging (fMRI)
analyses. Subjects were adult listeners who evaluated either the speaker's mind, their
own mind, or (as a control condition) the number of letters for spoken stimuli which
were randomly presented through earphones. In both the self and speaker-mind
judgment tasks, the dorsal medial prefrontal cortex (dMPFC), which has been implicated
in theory of mind or self-referential processing, was significantly activated, in
addition to the classical cortical regions involved in processing linguistic semantics
and emotional prosody of speech. These results suggest that the mental state attribution
accomplished by the dorsal medial prefrontal cortex plays an important
role in understanding our own and the speaker's mind in speech communication.
| |
| Exhibition
From the Historic Acoustic-phonetic Collection of the
TU Dresden
Tuesday to Friday
| 1 of 1 | Measuring Pitch with Historic Phonetic Devices
AUTHOR(S):
Mehnert, Dieter; Technische Universität Dresden
Hoffmann, Rüdiger; Technische Universität Dresden
Abstract:
Measuring pitch is one of the most important but also most difficult tasks in experimental
phonetics. It is interesting to study how these difficulties were solved in
the times before the computer was introduced into phonetic laboratories. In this
paper, this is discussed using a number of exhibits from the acoustic-phonetic collection
of TU Dresden. There will be a small exhibition of historic devices
at the conference Speech Prosody 2006. This paper is intended to accompany the
exhibition.
|
|