ASPI Presentation



ASPI is a project on audiovisual-to-articulatory inversion. The participants are LORIA (Magrit and Speech teams) (Nancy), ENST (TSI Department) (Paris), ICCS-NTUA (CVSP Group) (Athens), KTH (Speech Communication & Technology Group) (Stockholm), and ULB (Waves & Signals Department) (Brussels).

Audiovisual-to-articulatory inversion consists in recovering the dynamics of the vocal tract shape (from vocal folds to lips) from the acoustic speech signal, supplemented by image analysis of the speaker’s face. Recovering this information automatically would be a major breakthrough in speech research and technology: a vocal tract representation of a speech signal would be valuable from a theoretical point of view and practically useful in many speech processing applications (language learning, automatic speech processing, speech coding, speech therapy, film industry...).

There is strong evidence that human speakers and listeners exploit the multimodality of speech, and more particularly the articulatory cues: seeing the visible articulators, i.e. jaw and lips, improves speech intelligibility. From neurophysiology we know that there is a close link between articulatory and acoustic cognitive representations of speech, with the sensorimotor control of speech production being represented in so-called mirror neurons. Audiovisual-to-articulatory inversion is, however, still an unresolved problem. The main difficulty is that there is no one-to-one mapping between the acoustic and articulatory domains: a large number of vocal tract shapes can produce the same speech spectrum. Indeed, the problem is under-determined, as there are more unknowns to determine than input data available. One important issue is thus to add constraints that are both sufficiently restrictive and realistic from a phonetic point of view, in order to eliminate false solutions. These constraints mainly derive from images of the vocal tract, which provide an approximate model of speech production, and/or from images of the speaker’s face, which provide information about the visible articulators, as human listeners use. Besides these theoretical problems, one of the major challenges in this domain is the lack of articulatory data covering both the speaker’s vocal tract and face.


The design of audiovisual-to-articulatory inversion involves two kinds of interdependent tasks. The first is the development of inversion methods that successfully answer the main acknowledged difficulties, i.e. the impossibility of using standard spectral vectors as input data, the non-uniqueness of inverse solutions and the possible lack of phonetic relevance of inverse solutions.
The second task is the construction of an articulatory database that comprises dynamic images of the vocal tract together with the recorded speech signal, for male and female speakers.

For the inversion, the main objectives are:

1 - Development of audiovisual-to-articulatory inversion methods that include both static inversion (for speech frames) and dynamic inversion,
2 - Diversification of additional constraints and optimization of techniques to reduce the under-determination of the inversion,
3 - Evaluation of the inversion methods on articulatory data.

For the articulatory database, the main objectives are:

1 - Design and acquisition of multimodal articulatory and audiovisual speech data to enable both the development of articulatory models and the assessment of inversion methods,
2 - Design of an acquisition technology based on ultrasound and facial motion capture, and study of existing databases (mainly X-ray images).

The inversion needs to be able to accept standard spectral data as input. Objective 1 involves the development of inversion methods using articulatory models, the diversification of the underlying modeling frameworks, and their adaptation to the use of standard spectral vectors as input data.

The inversion methods will build on the analysis-by-synthesis paradigm, which consists in using an articulatory synthesizer to compute speech spectra from articulatory or geometrical parameters and to store the results in a codebook. Explicit and implicit table lookup methods are used to recover, at each time frame, the set of articulatory inverse solutions. Once inverse solutions have been recovered at each time frame of the speech signal, articulatory trajectories are built from these local solutions by an optimal path search algorithm: dynamic programming, regularisation techniques or physiological constraints are used to obtain smooth trajectories. Visual information from video images can be incorporated through these static and dynamic constraints, in order to determine the position of articulatory trajectories, e.g. of the lips and jaw, and to supplement the acoustic speech signal with clues on the vocal tract length, lip rounding, speaker anatomy, etc.
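The codebook lookup and path search described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_synthesize` is a hypothetical stand-in for an articulatory synthesizer, the function names are invented, and the path search is a plain dynamic programming pass that trades acoustic distance against jumps between consecutive articulatory vectors. It is not the project's actual method.

```python
import numpy as np


def toy_synthesize(articulatory_vector):
    """Hypothetical stand-in for an articulatory synthesizer.

    It maps two parameters to a one-value "spectrum" (their sum), so many
    articulatory vectors share the same acoustics, as in the real problem.
    """
    return np.array([articulatory_vector[0] + articulatory_vector[1]])


def build_codebook(synthesize, articulatory_samples):
    """Store each articulatory vector together with its synthesized spectrum."""
    spectra = np.array([synthesize(a) for a in articulatory_samples])
    return np.asarray(articulatory_samples, dtype=float), spectra


def lookup_candidates(codebook, observed_spectrum, k):
    """Table lookup: indices and acoustic distances of the k closest entries."""
    _, spectra = codebook
    dist = np.linalg.norm(spectra - observed_spectrum, axis=1)
    idx = np.argsort(dist, kind="stable")[:k]
    return idx, dist[idx]


def smooth_trajectory(codebook, observed_spectra, k=5, smoothness=1.0):
    """Pick one candidate per frame so that the sum of acoustic distances
    plus smoothness * articulatory jumps is minimal (dynamic programming)."""
    artic, _ = codebook
    cand_idx, cand_cost = [], []
    for spectrum in observed_spectra:
        idx, cost = lookup_candidates(codebook, spectrum, k)
        cand_idx.append(idx)
        cand_cost.append(cost)

    n_frames = len(observed_spectra)
    total = cand_cost[0].copy()   # best cost of a path ending at each candidate
    back = []                     # back[t - 1][c]: best predecessor of c at frame t
    for t in range(1, n_frames):
        prev = artic[cand_idx[t - 1]]
        cur = artic[cand_idx[t]]
        # jump[c, p]: articulatory distance from previous candidate p to current c
        jump = np.linalg.norm(cur[:, None, :] - prev[None, :, :], axis=2)
        scores = total[None, :] + smoothness * jump
        best_prev = np.argmin(scores, axis=1)
        total = scores[np.arange(len(cur)), best_prev] + cand_cost[t]
        back.append(best_prev)

    # Backtrack from the cheapest final candidate.
    j = int(np.argmin(total))
    path = [j]
    for t in range(n_frames - 1, 0, -1):
        j = int(back[t - 1][j])
        path.append(j)
    path.reverse()
    return np.array([artic[cand_idx[t][path[t]]] for t in range(n_frames)])
```

Visual constraints could enter such a scheme as an extra per-frame cost, for instance a penalty on candidates whose lip opening disagrees with the value measured on the face images; that extension is not shown here.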

As the main difficulty is that there is a large number of solutions, Objective 2 concerns the introduction of constraints on the inversion process and the approximation of the articulatory model with respect to the static and dynamic characteristics of the human vocal tract. The constraints will be provided by physiology, phonetics, and articulatory information extracted from images of the speaker’s face. This requires modeling and an efficient methodology for the geometrical tracking of deformable features. Objective 2 addresses the requirement that the inversion results be phonetically, geometrically, acoustically and dynamically relevant. It will focus on the algorithms and modeling frameworks that enable the incorporation of constraints, as well as on the evaluation of their respective interest.

Objective 3 is the evaluation and comparison of inversion methods and constraints, in particular the reliability of the place of articulation and other phonetic parameters derived from the recovered articulatory data, and of the modeling for speaker adaptation. The evaluation of the audiovisual-to-articulatory inversion process comprises two aspects.

The development of new inversion methods and their evaluation however depend on the availability of appropriate articulatory databases. The project aims at answering this requirement. Ideal imaging techniques to collect this kind of data should cover the whole vocal tract from glottis to lips, with all the articulators and the face visible, and give a time resolution sufficient for tracking the dynamics of the vocal tract. The ideal frame rate should enable the observation of a burst, for a /p/ for instance, i.e. 125 frames per second. Furthermore, imaging techniques must not involve any health hazard for the subjects, must not perturb the natural articulation, and must not degrade the quality of the speech signal recorded together with the images.
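The link between frame rate and time resolution can be made concrete with a small worked example. The sketch below converts a frame rate into a frame period and counts how many frames are guaranteed to fall inside an event of a given duration. The 20 ms event duration used in the usage note is a hypothetical value chosen for illustration, not a figure from the project.

```python
def frame_period_ms(frames_per_second):
    """Time between two consecutive frames, in milliseconds."""
    return 1000.0 / frames_per_second


def frames_within(event_duration_ms, frames_per_second):
    """Minimum number of frames guaranteed to fall inside an event
    of the given duration, whatever its alignment with the frame clock."""
    return int(event_duration_ms // frame_period_ms(frames_per_second))
```

At 125 frames per second the frame period is 8 ms, so a hypothetical 20 ms event is always captured by at least two frames, whereas at 25 frames per second (40 ms period) it may fall entirely between two frames.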

At present, no single imaging technique answers the above requirements completely, and the requirements are far from being fulfilled.
Figure 1: MRI image of the vocal tract with the articulatory model superimposed.

Objective 1 concerns the development of an inexpensive acquisition system that enables the simultaneous tracking of the tongue and the face by combining ultrasound for the tongue, magnetic sensors for the tongue apex and 3D motion capture for the face (see Figure 2). These three modalities do not involve any health hazard and complement other imaging techniques. However, this will require important developments, in particular geometrical calibration, image registration and synchronisation of ultrasound with respect to 3D motion capture and position sensors. Another important source of data is the existing dynamic X-ray pictures.
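Synchronising modalities that are sampled at different rates can be sketched as a nearest-timestamp alignment. The example below is only an illustration under stated assumptions: both streams are assumed to carry timestamps on a common clock, the function name `align_nearest` is hypothetical, and any clock offset or drift between devices is assumed to have been corrected beforehand.

```python
import numpy as np


def align_nearest(ref_times, other_times):
    """For each reference timestamp, return the index of the nearest
    timestamp in other_times.

    other_times must be sorted in increasing order and contain at least
    two values. Both arrays are assumed to share the same clock and unit.
    """
    ref_times = np.asarray(ref_times, dtype=float)
    other_times = np.asarray(other_times, dtype=float)
    # Index of the first timestamp in other_times that is >= each reference time.
    right = np.searchsorted(other_times, ref_times)
    right = np.clip(right, 1, len(other_times) - 1)
    left = right - 1
    # Keep whichever neighbour is closer (the left one in case of a tie).
    choose_left = (ref_times - other_times[left]) <= (other_times[right] - ref_times)
    return np.where(choose_left, left, right)
```

With such an index array, each frame of one modality (for example an ultrasound image) can be paired with the closest sample of another (for example a motion capture frame) before any geometrical registration is attempted.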




Objective 2 is the exploitation of these data. These data are interesting because they cover a large number of speakers and therefore allow the diversification of static and dynamic speaker modeling.

