ASPI is a project on audiovisual-to-articulatory inversion. The participants are LORIA (Magrit and Speech teams, Nancy), ENST (TSI Department, Paris), ICCS-NTUA (CVSP Group, Athens), KTH (Speech Communication & Technology Group, Stockholm), and ULB (Waves & Signals Department, Brussels).

Audiovisual-to-articulatory inversion consists of recovering the dynamics of the vocal tract shape (from vocal folds to lips) from the acoustic speech signal, supplemented by image analysis of the speaker’s face. Recovering this information automatically would be a major breakthrough in speech research and technology: a vocal tract representation of a speech signal would be valuable from a theoretical point of view and practically useful in many speech processing applications (language learning, automatic speech processing, speech coding, speech therapy, the film industry, and others).

There is strong evidence that human speakers and listeners exploit the multimodality of speech, and more particularly the articulatory cues: seeing the visible articulators, i.e. the jaw and lips, improves speech intelligibility. From neurophysiology we know that there is a close link between articulatory and acoustic cognitive representations of speech, with the sensorimotor control of speech production being represented in so-called mirror neurons. Audiovisual-to-articulatory inversion is, however, still an unresolved problem. The main difficulty is that there is no one-to-one mapping between the acoustic and articulatory domains, so a large number of vocal tract shapes can produce the same speech spectrum. Indeed, the problem is under-determined: there are more unknowns to determine than input data available. One important issue is thus to add constraints that are both sufficiently restrictive and realistic from a phonetic point of view, in order to eliminate false solutions. These constraints derive mainly from images of the vocal tract, which yield an approximate model of speech production, and/or from images of the speaker’s face, which provide information about the visible articulators, as human listeners use. Besides the theoretical problems, one of the major challenges in this domain is the lack of articulatory data covering both the speaker’s vocal tract and face.
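The non-uniqueness described above can be illustrated with a deliberately simplified sketch. It assumes a hypothetical linear forward map `A` from six "articulatory" parameters to three "acoustic" values; the real articulatory-to-acoustic mapping is nonlinear, and none of the names below come from the project. The sketch shows two different parameter vectors producing the same output, and a prior-based constraint (Tikhonov regularization, chosen here only for illustration) selecting a single solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy forward model (an assumption, not the ASPI model):
# 6 "articulatory" parameters are mapped linearly to 3 "acoustic" values.
n_params, n_obs = 6, 3
A = rng.standard_normal((n_obs, n_params))

x_true = rng.standard_normal(n_params)   # the shape we would like to recover
y = A @ x_true                           # the observed "acoustics"

# Rows of vt beyond the rank of A span its null space: moving along them
# changes the parameters without changing the output.
_, _, vt = np.linalg.svd(A)
null_basis = vt[n_obs:]
x_other = x_true + 2.0 * null_basis[0]   # a different shape, same acoustics

same_output = np.allclose(A @ x_true, A @ x_other)


def constrained_inverse(A, y, x_prior, lam=1e-3):
    """Minimise ||A x - y||^2 + lam * ||x - x_prior||^2 (Tikhonov form)."""
    n = A.shape[1]
    lhs = A.T @ A + lam * np.eye(n)
    rhs = A.T @ y + lam * x_prior
    return np.linalg.solve(lhs, rhs)


# The prior stands in for external knowledge, for example what face images
# might suggest about the visible articulators (an illustrative assumption).
x_prior = x_true + 0.05 * rng.standard_normal(n_params)
x_hat = constrained_inverse(A, y, x_prior)
```

Without the prior term the system has infinitely many exact solutions; the constraint makes the minimizer unique and keeps it close to the prior while still fitting the observation.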


The design of audiovisual-to-articulatory inversion involves two kinds of interdependent tasks. The first is the development of inversion methods that successfully address the main acknowledged difficulties, i.e. the impossibility of using standard spectral vectors as input data, the non-uniqueness of inverse solutions, and the possible lack of phonetic relevance of inverse solutions.

The main difficulty is the large number of solutions. Constraints must therefore be added to the inversion process to restrict the solutions of the approximate articulatory model, for both the static and the dynamic aspects of the vocal tract. The constraints will be provided by physiology, phonetics, and articulatory information obtained from images of the speaker's face. This requires methods for efficiently extracting the chosen parameter sets from features. The objective is to ensure that the inversion results are phonetically valid. The work will focus on algorithms and methods for incorporating constraints, as well as on evaluating the resulting solutions.

A further task is the evaluation and application of inversion methods and constraints in order to define realistic articulatory parameters, derived from the reconstruction of articulatory data and from speaker adaptation. The evaluation of the audiovisual-to-articulatory inversion process comprises two crucial aspects.

The development of new inversion methods and their evaluation depend on the availability of appropriate articulatory data sets. One of the tasks is to collect such data. The data collection techniques should cover the vocal tract from glottis to lips, including both the internal articulators and the visible area. The data should be observed for speakers but for equivalent 125 solutions. Furthermore, the techniques should not pose health hazards to subjects, should not disturb natural articulation, and should ensure a good-quality speech signal.

At present, imaging techniques answer two of these requirements, and the requirements as a whole are far from being fulfilled.

We have MRI images of the vocal tract and articulatory models.

Another important source of data will be dynamic X-ray pictures.






