ASPI is a project on audiovisual-to-articulatory inversion. The participants are LORIA (Magrit and Speech teams, Nancy), ENST (TSI Department, Paris), ICCS-NTUA (CVSP Group, Athens), KTH (Speech Communication & Technology Group, Stockholm), and ULB (Waves & Signals Department, Brussels).
Audiovisual-to-articulatory inversion consists of recovering the dynamics of the vocal tract shape (from vocal folds to lips) from the acoustic speech signal, supplemented by image analysis of the speaker’s face. Being able to recover this information automatically would be a major breakthrough in speech research and technology, as a vocal tract representation of a speech signal would be both beneficial from a theoretical point of view and practically useful in many speech processing applications (language learning, automatic speech processing, speech coding, speech therapy, the film industry...).
There is strong evidence that human speakers/listeners exploit the multimodality of speech, and in particular its articulatory cues: the view of the visible articulators, i.e. the jaw and lips, improves speech intelligibility. From neurophysiology we know that there is a close link between the articulatory and acoustic cognitive representations of speech, with the sensorimotor control of speech production being represented in so-called mirror neurons. Audiovisual-to-articulatory inversion is, however, a presently unresolved problem. The main difficulty is that there is no one-to-one mapping between the acoustic and articulatory domains: a large number of vocal tract shapes can produce the same speech spectrum. The problem is under-determined, as there are more unknowns to determine than input data available. One important issue is thus to add constraints that are both sufficiently restrictive and phonetically realistic, in order to eliminate false solutions. These constraints mainly derive from images of the vocal tract, used to build an approximate model of speech production, and/or images of the speaker’s face, used to extract information about the visible articulators as human listeners do. Besides these theoretical problems, one of the major challenges in this domain is the lack of articulatory data covering both the speaker’s vocal tract and face.
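As a toy illustration of this under-determination (not the project's actual model), consider a hypothetical linear map from two articulatory parameters to a single acoustic observation: distinct articulations collapse onto the same acoustics, so no unique inverse exists.

```python
# Toy illustration only (not the project's model): a hypothetical linear
# articulatory-to-acoustic map with two unknowns but one observation.
def toy_forward(jaw, tongue):
    # pretend "formant" in Hz; the coefficients are made up
    return 500.0 + 800.0 * jaw + 800.0 * tongue

# Two different articulations, acoustically indistinguishable:
assert toy_forward(0.25, 0.50) == toy_forward(0.50, 0.25)  # both 1100.0 Hz
```

Any one-dimensional observation of a two-parameter system leaves a whole curve of consistent articulations; this is why additional constraints are needed.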
The design of audiovisual-to-articulatory inversion involves two kinds of interdependent tasks. The first is the development of inversion methods that successfully address the main acknowledged difficulties, i.e. the impossibility of using standard spectral vectors directly as input data, the non-uniqueness of inverse solutions, and the possible lack of phonetic relevance of inverse solutions.
The second task is the construction of an articulatory database that comprises dynamic images of the vocal tract together with the speech signal uttered, for several male and female speakers. The database is needed first to derive faithful articulatory models from medical images (X-ray or MRI) of one subject, and then to study the adaptation of these models to other speakers.
For the inversion itself the main objectives are:
- 1 - Development of audiovisual-to-articulatory inversion methods including both static (inversion for one speech frame) and dynamic (inversion for one sentence) conditions,
- 2 - Investigation of additional constraints and optimization techniques to reduce the under-determination of the inversion,
- 3 - Evaluation of the inversion methods on articulatory data.
For the construction of the articulatory database:
- 4 - Design and acquisition of multimodal articulatory and audiovisual speech data that enable both the development of articulatory models and the assessment of inversion methods,
- 5 - Design of a low-cost acquisition technology based on ultrasound and facial motion capture,
- 6 - Exploitation of existing databases (mainly previously acquired X-ray images).
Fundamentally, audiovisual-to-articulatory inversion is an acoustical problem in which the input data are the resonance frequencies of the vocal tract. However, these frequencies cannot be extracted easily from the speech signal, so the inversion needs to be generalised to accept standard spectral data as input. Objective 1 involves the development of inversion methods using articulatory models, or concatenations of uniform tubes, cones or other acoustic horns, to investigate how easily the underlying numerical frameworks can be extended and adapted to standard spectral input data.
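For the simplest such acoustic model, a single uniform tube closed at the glottis and open at the lips, the resonances fall at odd quarter-wavelength frequencies. The sketch below assumes this textbook model; the function name and defaults are illustrative, not from the project:

```python
def tube_resonances(length_m, n=3, c=340.0):
    """Resonance frequencies (Hz) of a uniform tube closed at one end
    (glottis) and open at the other (lips): f_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n + 1)]

# A 17 cm tube, roughly an adult male vocal tract length:
resonances = tube_resonances(0.17)
```

For the 17 cm tube this yields approximately 500, 1500 and 2500 Hz, the classic neutral-vowel formant pattern; real inversion must of course go beyond a single uniform tube.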
The inversion method will build on the analysis-by-synthesis paradigm, which uses an articulatory synthesiser to compute speech spectra from articulatory or geometrical parameters and stores the resulting relations in a codebook. Explicit or implicit table lookup methods are then used to recover the set of articulatory parameters at each time frame. Once inverse solutions have been recovered at each time frame of the speech signal, articulatory trajectories are built from these local solutions by an optimal path search algorithm, using dynamic programming, regularisation techniques or physical constraints to obtain smooth trajectories. Visual information from video images can be incorporated in both the static and dynamic conditions, in order to determine the positions and articulatory trajectories of e.g. the lips and jaw, and to supplement the acoustic speech signal with cues on vocal tract length, lip rounding, speaker anatomy, etc.
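The pipeline above (codebook synthesis, per-frame lookup, dynamic-programming smoothing) can be sketched as follows. The one-parameter forward map is a toy stand-in for an articulatory synthesiser, chosen to be deliberately many-to-one so that the smoothness constraint has spurious branches to reject:

```python
# Minimal sketch of codebook inversion with dynamic-programming smoothing.
# The forward map and its single parameter are toy assumptions, not the
# project's synthesiser: |p - 0.3| has two preimages for most values,
# mimicking the non-uniqueness of acoustic-to-articulatory inversion.

def synth(p):
    return abs(p - 0.3)  # toy articulatory-to-acoustic map

# Codebook: tabulated (parameter, spectrum) pairs from the synthesiser.
codebook = [(i / 100.0, synth(i / 100.0)) for i in range(101)]

def invert(obs, k=6, lam=5.0):
    # Per frame, keep the k acoustically closest codebook entries.
    cands = [sorted(codebook, key=lambda e: abs(e[1] - o))[:k] for o in obs]
    # Dynamic programming: acoustic match plus a smoothness penalty
    # lam * (p_t - p_{t-1})^2 between consecutive frames.
    cost = [abs(a - obs[0]) for p, a in cands[0]]
    back = []
    for t in range(1, len(obs)):
        new_cost, ptr = [], []
        for p, a in cands[t]:
            trans = [cost[j] + lam * (p - cands[t - 1][j][0]) ** 2
                     for j in range(len(cands[t - 1]))]
            j = min(range(len(trans)), key=trans.__getitem__)
            new_cost.append(trans[j] + abs(a - obs[t]))
            ptr.append(j)
        cost, back = new_cost, back + [ptr]
    # Backtrack the cheapest smooth trajectory.
    j = min(range(len(cost)), key=cost.__getitem__)
    path = [cands[-1][j][0]]
    for t in range(len(obs) - 2, -1, -1):
        j = back[t][j]
        path.append(cands[t][j][0])
    return path[::-1]

# A smooth articulatory gesture, observed only acoustically:
truth = [0.8 - 0.04 * t for t in range(10)]
recovered = invert([synth(p) for p in truth])
```

On this toy example the early frames are acoustically unambiguous and anchor the path; from there the smoothness term rejects the spurious mirror-branch candidates, so the recovered trajectory stays on the true branch.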
As the main difficulty is the very large number of solutions, Objective 2 is dedicated to the introduction of constraints into the inversion procedure, to adjust the faithfulness of the articulatory model's approximation with respect to the static and dynamic characteristics of the human vocal tract. The constraints will be provided by physiology, phonetics, and articulatory information extracted from images of the speaker’s face. This requires suitable and numerically efficient methodologies for the geometric tracking of deformable features. Objective 2 is crucial to the project, since it is intended to guarantee that the inversion results are phonetically, geometrically, acoustically and dynamically relevant. It will focus on the algorithmic and numerical frameworks that enable the incorporation of constraints, as well as on the evaluation of their respective merits.
Objective 3 is the evaluation and comparison of inversion methods and constraints in terms of the robustness and reliability of the place of articulation and other phonetic parameters derived from the reconstructed articulatory data and normalised for speaker adaptation. The evaluation of an audiovisual-to-articulatory inversion procedure comprises two aspects. The first is acoustic faithfulness: the inverted data must be able to reproduce a speech signal as close as possible to the original, with closeness generally evaluated by measuring the distance between the original and synthetic resonance frequencies. The second is articulatory faithfulness, which requires that the synthetic output of the articulatory model be compared to vocal tract images of the speaker uttering the same speech segment.
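A plausible acoustic-faithfulness measure of the kind described, here an RMS relative error between matched resonance-frequency tracks, could look like the following; the exact metric is an assumption, not the project's prescribed one:

```python
import math

def formant_distance(orig, inv):
    """RMS relative error (%) between matched formant tracks.
    Each argument is a list of frames; each frame is a list of
    resonance frequencies in Hz, in corresponding order."""
    errs = [((fo - fi) / fo) ** 2
            for o_frame, i_frame in zip(orig, inv)
            for fo, fi in zip(o_frame, i_frame)]
    return 100.0 * math.sqrt(sum(errs) / len(errs))

# Synthetic formants uniformly 5% above the originals:
orig = [[500.0, 1500.0, 2500.0], [600.0, 1400.0, 2400.0]]
inv = [[f * 1.05 for f in frame] for frame in orig]
print(round(formant_distance(orig, inv), 3))  # → 5.0
```

A relative (rather than absolute) error weights low and high formants comparably, which matters since F1 errors of a few tens of Hz are perceptually more important than the same error on F3.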
The development of new inversion methods and their evaluation is hence tightly connected to the availability of appropriate articulatory databases. Objective 4 aims to meet this requirement. Ideally, the imaging techniques used to collect this kind of data should (i) cover the whole vocal tract (from glottis to lips), with all the articulators and the face visible, and (ii) provide a temporal resolution sufficient for tracking the dynamics of the vocal tract. The ideal frame rate should enable the observation of a burst release (the fastest articulatory event), which lasts approximately 8 ms for /p/ for instance, i.e. at least 125 frames per second. Furthermore, the imaging techniques must (iii) not involve any known health hazard for the subjects, (iv) not perturb natural articulation and (v) not degrade the quality of the speech signal recorded together with the images.
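The frame-rate figure follows from simple arithmetic: to observe an event lasting d seconds at least once, the imaging must run at 1/d frames per second or faster.

```python
def required_fps(event_duration_s):
    """Minimum frame rate (frames/s) needed to observe an articulatory
    event of the given duration at least once."""
    return 1.0 / event_duration_s

# An ~8 ms burst release for /p/:
assert required_fps(0.008) == 125.0
```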
At present, no single imaging technique meets all of the above requirements, which are far from being fulfilled even partially by the available methods. It is therefore necessary to combine several imaging techniques. MRI will be used separately for sustained phonemes (see Figure 1) and continuous speech, whereas 3D motion capture of the face area could be used together with ultrasound (US), electromagnetic articulography or cineradiography. In the latter case the data would be acquired for a limited duration, subject to ethical approval.
Figure 1: MRI image of the vocal tract with the articulatory model superimposed
Objective 5 is dedicated to the development of an inexpensive acquisition system that enables the simultaneous tracking of the tongue and face, combining ultrasound for the tongue, magnetic sensors for the tongue apex and 3D motion capture for the face (see Figure 2). These three modalities do not involve any known health hazard and are relatively low-cost and compact compared to other imaging techniques. However, this will require a substantial implementation effort in the domain of computer vision, to perform the geometrical calibration and image registration, and to synchronise the ultrasound stream with the 3D motion capture and position sensors. Another potentially important source of data is existing dynamic X-ray films.
Figure 2: Stereovision and 3D tracking of markers painted on the speaker's face
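Synchronising the ultrasound stream against the motion-capture and sensor streams could, for instance, rely on estimating the inter-stream delay by cross-correlation; the following is a minimal sketch under that assumption (the signal values are made up):

```python
def best_lag(ref, sig, max_lag=20):
    """Estimate the delay of `sig` relative to `ref` (in samples) by
    maximising the cross-correlation over candidate lags."""
    def corr(lag):
        pairs = [(ref[i], sig[i + lag]) for i in range(len(ref))
                 if 0 <= i + lag < len(sig)]
        return sum(r * s for r, s in pairs)
    return max(range(-max_lag, max_lag + 1), key=corr)

# A toy articulatory trace and a copy of it delayed by 3 samples:
ref = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 2, 5, 2, 0, 0]
sig = [0, 0, 0] + ref[:-3]
print(best_lag(ref, sig))  # → 3
```

In practice the streams would also have different sampling rates, so resampling to a common rate would precede the lag search; hardware triggers, when available, are more reliable than signal-based alignment.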
Objective 6 is the exploitation of these existing data, which are valuable because they cover a large number of speakers and therefore allow the investigation of static and dynamic speaker normalisation.