Acoustic-to-Articulatory inversion: Methods and Acquisition of articulatory data

By ASPI consortium members

Deliverable D2.2

Final Report on Speech Inversion Methods

By ASPI consortium members

Deliverable D3.2

Final report on design, acquisition and processing of articulatory data

By ASPI consortium members

Deliverable D6.2

Final report on evaluation results

By ASPI consortium members

Multimodal acquisition technology

Coupling electromagnetic sensors and ultrasound images for tongue tracking: acquisition set up and preliminary results

By Michael Aron, Erwan Kerrien, Marie-Odile Berger, and Yves Laprie

Abstract: This paper describes a new method for coupling ultrasound images with three-dimensional electromagnetic data in order to recover larger parts of the tongue during speech production. The electromagnetic data are superimposed on the ultrasound images after spatial and temporal calibration. Successful fusion results are presented on various speech sequences. A complete setup for evaluating the electromagnetic system is further presented.


Design, acquisition and processing of articulatory data

Adaptive Multimodal Fusion by Uncertainty Compensation

V. Pitsikalis, A. Katsamanis, G. Papandreou, and P. Maragos

Abstract: While the accuracy of feature measurements heavily depends on changing environmental conditions, studying the consequences of this fact in pattern recognition tasks has received relatively little attention to date. In this work we explicitly take into account feature measurement uncertainty and we show how classification rules should be adjusted to compensate for its effects. Our approach is particularly fruitful in multimodal fusion scenarios, such as audiovisual speech recognition, where multiple streams of complementary time-evolving features are integrated. For such applications, provided that the measurement noise uncertainty for each feature stream can be estimated, the proposed framework leads to highly adaptive multimodal fusion rules which are widely applicable and easy to implement. We further show that previous multimodal fusion methods relying on stream weights fall under our scheme under certain assumptions; this provides novel insights into their applicability for various tasks and suggests new practical ways for estimating the stream weights adaptively. The potential of our approach is demonstrated in audio-visual speech recognition using either synchronous or asynchronous models.
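The variance-compensation idea can be sketched with a toy two-class, two-stream Gaussian example (the classes, feature values, and noise variances below are invented for illustration; the paper's actual framework covers HMM-based recognition):

```python
import numpy as np

def stream_loglik(x, mu, var_model, var_noise):
    """Log-likelihood of x under N(mu, var_model + var_noise):
    the model variance is inflated by the measurement-noise variance."""
    var = var_model + var_noise
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify(x_audio, x_video, models, var_noise_a, var_noise_v):
    """Pick the class maximizing the summed per-stream log-likelihoods."""
    scores = {c: stream_loglik(x_audio, mu_a, va, var_noise_a)
                 + stream_loglik(x_video, mu_v, vv, var_noise_v)
              for c, (mu_a, va, mu_v, vv) in models.items()}
    return max(scores, key=scores.get)

# Two toy classes with 1-D audio and video features: (mu_a, var_a, mu_v, var_v).
models = {"pa": (0.0, 1.0, 0.0, 1.0),
          "ba": (3.0, 1.0, 3.0, 1.0)}

# The audio feature points to "ba", the video feature to "pa".
# With clean audio, the audio stream decides:
print(classify(2.9, 0.5, models, var_noise_a=0.01, var_noise_v=0.01))   # ba
# With very noisy audio, the decision automatically leans on video:
print(classify(2.9, 0.5, models, var_noise_a=100.0, var_noise_v=0.01))  # pa
```

Inflating each stream's variance by its noise variance is what makes the fusion rule adaptive: a stream's effective weight shrinks as its measurement uncertainty grows, without any explicit stream-weight tuning.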

Multigrid Geometric Active Contour Models

G. Papandreou and P. Maragos

Abstract: Geometric active contour models are very popular PDE tools in image analysis and computer vision. We present a new multigrid algorithm for the fast evolution of level set-based geometric active contours and compare it with other established numerical schemes. We overcome the main bottleneck associated with most numerical implementations of geometric active contours, namely the need for very small time-steps to avoid instability, by employing a very stable fully-2D implicit-explicit time integration numerical scheme. The proposed scheme is more accurate and has improved rotational invariance properties compared with alternative split schemes, particularly when large time-steps are used. We then apply properly designed multigrid methods to efficiently solve the resulting sparse linear system. The combined algorithm allows for the rapid evolution of the contour and convergence to its final configuration after very few iterations. Image segmentation experiments demonstrate the efficiency and accuracy of the method.
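Why implicit time integration removes the small-time-step bottleneck can be illustrated on a much simpler problem than the paper's 2-D scheme: one backward-Euler step of 1-D linear diffusion, which stays stable at a time-step far beyond the explicit limit dt ≤ h²/2 (grid size and values below are arbitrary):

```python
import numpy as np

# One backward-Euler step of 1-D diffusion with periodic boundaries:
# solve (I - dt * L) phi_new = phi, where L is the discrete Laplacian.
# Explicit Euler would require dt <= 0.5 * h**2 = 0.5 here; we use dt = 100.
n, h, dt = 50, 1.0, 100.0
L = np.zeros((n, n))
for i in range(n):
    L[i, i] = -2.0 / h**2
    L[i, (i - 1) % n] += 1.0 / h**2
    L[i, (i + 1) % n] += 1.0 / h**2

phi = np.sign(np.arange(n) - n // 2).astype(float)  # step-like level-set profile
phi_new = np.linalg.solve(np.eye(n) - dt * L, phi)

# The implicit step smooths the interface without blowing up:
# (I - dt*L) is an M-matrix with unit row sums, so the update obeys
# a discrete maximum principle for any dt.
print(np.max(np.abs(phi_new)) <= np.max(np.abs(phi)))  # → True
```

In the paper's setting this implicit solve is the expensive part, which is exactly what the multigrid methods are brought in to accelerate.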

Acoustic-to-articulatory Inversion methods

Evaluation of speech inversion using an articulatory classifier

By Olov Engwall

Abstract: This paper presents an evaluation method for statistically based speech inversion, in which the estimated vocal tract shapes are classified into phoneme categories based on their articulatory correspondence with prototype vocal tract shapes. The prototypes are created from the original articulatory data, and the classifier hence makes it possible to interpret the results of the inversion in terms of, e.g., confusions between different articulations and the success in estimating different places of articulation. The articulatory classifier was used to evaluate acoustic and audiovisual speech inversion of VCV words and Swedish sentences performed with linear estimation and an artificial neural network.
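The evaluation idea can be sketched as a nearest-prototype classifier (the toy shapes, labels, and Euclidean distance below are assumptions for illustration, not the paper's actual data or metric):

```python
import numpy as np

def build_prototypes(shapes, labels):
    """Prototype per phoneme: the mean of its measured articulatory vectors."""
    return {lab: np.mean([s for s, l in zip(shapes, labels) if l == lab], axis=0)
            for lab in set(labels)}

def classify_shape(shape, protos):
    """Assign an estimated vocal tract shape to the nearest prototype."""
    return min(protos, key=lambda lab: np.linalg.norm(shape - protos[lab]))

# Toy 2-parameter "vocal tract shapes" for two vowel categories.
shapes = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.0], [0.9, 0.2]])
labels = ["i", "i", "a", "a"]
protos = build_prototypes(shapes, labels)

# An inverted shape is judged by the phoneme category it falls into:
print(classify_shape(np.array([0.1, 0.8]), protos))  # i
```

Tallying such decisions over a test set yields the confusion matrix between articulations that the abstract mentions.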

Evaluation of phonetic constraints used in acoustic-to-articulatory inversion

By Blaise Potard, Yves Laprie, and Anne Bonneau

Abstract: One of the main challenges in acoustic-to-articulatory inversion is the incorporation of constraints in order to reduce the under-determination of the problem. This paper is dedicated to the evaluation of phonetic constraints we proposed in a previous work. We analyzed the vocal tract shapes recovered for three vowels uttered by a female speaker for whom both the speech signal and X-ray images are available. It turns out that the phonetic constraints derived from standard phonetic knowledge are quite effective at retaining relevant vocal tract shapes. In addition, acoustic-to-articulatory inversion appears to be an efficient evaluation tool for exploring the acoustical properties of an articulatory model.

Adapting visual data to a linear articulatory model

By Blaise Potard and Yves Laprie

Abstract: The goal of this work is to investigate audiovisual-to-articulatory inversion. It is well established that acoustic-to-articulatory inversion is an underdetermined problem. On the other hand, there is strong evidence that human speakers/listeners exploit the multimodality of speech, and more particularly the articulatory cues: the view of the visible articulators, i.e. the jaw and lips, improves speech intelligibility. It is thus interesting to add constraints provided by direct visual observation of the speaker’s face. Visual data were obtained by stereovision, enabling the 3D recovery of jaw and lip movements. These data were processed to fit the nature of the parameters of Maeda’s articulatory model. Inversion experiments were conducted.
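The adaptation step can be sketched as a linear least-squares fit of model parameters to the visible measurements (the basis B, mean shape, and visible-point indices below are synthetic stand-ins, not Maeda's actual model):

```python
import numpy as np

# A linear articulatory model: shape = mean + B @ p, with parameters p.
# Only the rows corresponding to points visible on the face (jaw, lips)
# are observed; p is recovered from those rows by least squares.
rng = np.random.default_rng(0)
n_points, n_params = 10, 3
B = rng.standard_normal((n_points, n_params))  # stand-in linear basis
mean = rng.standard_normal(n_points)           # stand-in mean shape

p_true = np.array([0.5, -1.0, 0.3])
visible = [0, 1, 2, 3]                         # indices observable via stereovision
obs = mean[visible] + B[visible] @ p_true      # noise-free "measurements"

p_hat, *_ = np.linalg.lstsq(B[visible], obs - mean[visible], rcond=None)
print(np.allclose(p_hat, p_true))  # → True
```

In the inversion itself, the recovered jaw and lip parameters then act as constraints that prune the set of vocal tract shapes compatible with the acoustics.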

Articulatory studies

Quantal aspects of non-anterior sibilant fricatives: a simulation study

By Martine Toda and Shinji Maeda

Abstract: The quantal theory (Stevens, 1972; 1989) argues for the existence of stable and unstable regions in the acoustics when the articulatory parameters are varied continuously. Such a relation has been shown to be involved in the /s/-/ʃ/ contrast in English (Perkell et al., 1979). However, the quantal aspects within non-anterior sibilant phonemes such as /ɕ/-/ʂ/ in Chinese are not well established. This paper reports an acoustic modeling experiment on sibilant-like configurations in which the front cavity and tongue constriction lengths are systematically varied. The results suggest that the articulatory space is parceled into several ‘quantal’ regions delimited by unstable regions, due to (1) the jump of the lowest prominent spectral peak towards higher frequencies when the first front cavity resonance coincides with a free zero of the palatal channel and is weakened (when a pressure source is assumed to be located at the exit of the constriction); and (2) the interaction of the prominent peak resonance with back cavity resonances at their crossing points, with changes in formant affiliation. The relatively stable regions delimited by these discontinuities are assumed to provide robust prototypes for non-anterior sibilants, although subject-dependent factors can cause some modifications in the quantal patterns.
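The role of front-cavity length can be illustrated with the standard quarter-wavelength approximation for a tube closed at the constriction end (a textbook simplification, not the paper's acoustic model):

```python
# First resonance of the front cavity, approximated as a uniform tube
# closed at the constriction end and open at the lips: f1 = c / (4 * L).
# Shortening the cavity raises the lowest prominent spectral peak,
# as in the contrast between /ʃ/ (longer cavity) and /s/ (shorter).
C = 35000.0  # speed of sound in warm, moist air, in cm/s (approximate)

def front_cavity_f1(length_cm):
    """First quarter-wavelength resonance of the front cavity, in Hz."""
    return C / (4.0 * length_cm)

for length in (3.0, 2.0, 1.0):  # cavity lengths in cm
    print(f"L = {length:.1f} cm -> f1 ≈ {front_cavity_f1(length):.0f} Hz")
```

The paper's simulations go beyond this approximation by including the palatal channel and back cavity, which is where the zeros, resonance crossings, and hence the quantal discontinuities arise.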

Two articulatory strategies for realizing the acoustic contrast between the sibilants /s/ and /ʃ/ in French

By Martine Toda

Abstract: This paper reports two articulatory strategies used in the realization of the /s/-/ʃ/ contrast in French, based on MRI data from seven native speakers. These strategies are ‘tongue position adjustment’ and ‘tongue shape adjustment’. Examination of the articulation of the subjects whose frication noise was ‘deviant’ showed that they had already used all the possibilities for compensation within their articulatory strategy. A better normalization of their frication noise would have required complex gestures (e.g. tongue backing and doming), which are presumably avoided by virtue of articulatory economy.