
system operates by maintaining a collection of 'world-model' hypotheses, giving rise to an aggregate prediction for the observed features, which is then reconciled with the actual observations by adjusting the model hypotheses. With this kind of system, the way to detect speech against music is to run the system: if the speech hypotheses are able to find enough support in the observations, they will be invoked and pursued. Never mind that there may be lots of other stuff going on at the same time - the system would generate other, independent hypotheses to account for that energy, and the process of combined prediction and reconciliation would absorb the distortion and masking of overlapped features.

My use of the conditional will have tipped you off that I'm not particularly close to implementing this system. I am, however, trying - see, for instance, my paper from Mohonk last year, which is available at http://www.icsi.berkeley.edu/real/papers.html .

It seems a shame for the speech recognition community to put effort into pattern-recognition solutions for detecting the music in signals, just so that they can avoid running recognition over those episodes. A more flexible and general-purpose model of the signal itself, one that accepted that most sounds consist of more than one source, might solve not only the problem of detecting when music is present, but also the recognition of the simultaneous speech.

I don't think it's possible to detect the presence of speech with anything less complex than a speech recognizer, but I do feel that a speech recognizer is probably three-quarters of the solution.

Not exactly an answer to the original question, but an alternative perspective that I hope may be of interest!

-- DAn Ellis <dpwe(at)icsi.berkeley.edu>
   <http://www.icsi.berkeley.edu/~dpwe/>
   International Computer Science Institute, Berkeley CA
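
The predict-and-reconcile loop described in the message can be illustrated with a deliberately tiny sketch. This is not the author's system, which the message says is not yet implemented. Everything here is an assumption made for illustration: each hypothesis is reduced to a fixed four-channel energy template with one adjustable gain, sources are assumed to add linearly, and reconciliation is done by projected gradient descent on the prediction error. The names `TEMPLATES`, `predict`, `reconcile` and `supported` are hypothetical.

```python
# Toy sketch only: the templates, the additive mixing model and the update
# rule are illustrative assumptions, not the system described in the message.

# Hypothetical per-channel energy templates, one per source hypothesis.
TEMPLATES = {
    "speech": [1.0, 0.8, 0.1, 0.0],
    "music": [0.0, 0.2, 0.9, 1.0],
}


def predict(templates, gains):
    """Aggregate prediction: sum of every hypothesis's template times its gain."""
    n_channels = len(next(iter(templates.values())))
    return [
        sum(gains[name] * templates[name][i] for name in templates)
        for i in range(n_channels)
    ]


def reconcile(templates, observed, steps=500, rate=0.05):
    """Adjust hypothesis gains until the combined prediction matches the data."""
    gains = {name: 0.0 for name in templates}
    for _ in range(steps):
        predicted = predict(templates, gains)
        error = [o - p for o, p in zip(observed, predicted)]
        for name, template in templates.items():
            # Move the gain along the error gradient, keeping energy non-negative.
            step = rate * sum(e * t for e, t in zip(error, template))
            gains[name] = max(0.0, gains[name] + step)
    return gains


def supported(gains, threshold=0.2):
    """A hypothesis counts as 'invoked' if it found enough support in the data."""
    return {name for name, gain in gains.items() if gain > threshold}


def mix(templates, true_gains):
    """Build a synthetic observation from known source gains (for testing)."""
    return predict(templates, true_gains)


if __name__ == "__main__":
    overlapped = mix(TEMPLATES, {"speech": 0.7, "music": 1.0})
    print(supported(reconcile(TEMPLATES, overlapped)))
    music_only = mix(TEMPLATES, {"speech": 0.0, "music": 1.0})
    print(supported(reconcile(TEMPLATES, music_only)))
```

In this toy setting, the speech hypothesis is kept or dropped according to how much of the observation it explains once the music hypothesis has claimed its share of the energy. No separate speech/music classifier is involved, which is the point the message argues for. A real system would need far richer hypotheses than fixed templates.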

This message came from the mail archive
maintained by:
DAn Ellis <dpwe@ee.columbia.edu>
Electrical Engineering Dept., Columbia University