Re: effect of phase on pitch (Peter Cariani)

Subject: Re: effect of phase on pitch
From:    Peter Cariani  <peter(at)>
Date:    Thu, 5 Feb 1998 15:53:14 +0000

R. Parncutt wrote:

> Pondering the evolutionary origins of the ear's "phase deafness" in most
> naturally occurring sounds, I have come up with the following argument. Does
> it make sense? Is there other literature on this subject that I have missed?

I definitely agree that the auditory system is essentially phase-deaf, except
around the edges (which is why the edges are interesting). However, where we
would differ, I think, is that I hold that the phase-deafness of the system
is a result of interspike interval analyses and of mechanisms that
integrate/fuse invariant phase relationships into unified objects, whereas
you would hold that the system is phase-deaf because it uses rate-place
representations. Is this fair?

> In everyday listening environments, phase relationships are typically
> jumbled unrecognizably when sound is reflected off environmental objects;
> that is, when reflected sounds of varying amplitudes (depending on the
> specific configuration and physical properties of the reflecting materials)
> are added onto sound traveling in a direct line from the source. Thus, phase
> information does not generally carry information that can reliably aid a
> listener in identifying sound sources in a reverberant environment
> (Terhardt, 1988; see also Terhardt, 1991, 1992).

Let's consider an echo off one surface that introduces a time delay. To the
extent that the echo's time pattern resembles that of the original stimulus,
and depending upon the delay between the two, the sound and its echo can be
fused into one object. In an ecological situation, sound-reflecting surfaces
and their properties do not change rapidly. The phase structure of echoes
combined with the phase structure of the direct sound will then form an
invariant whole, so that if one has a mechanism for fusing together repeated
relative-phase patterns, echoes become fused with the direct signal (i.e.,
fusion is a different strategy for "echo suppression"). At short delays
(<15 msec) one hears only one sound; at longer delays the timbre of the one
sound changes; and at really long delays one hears two sounds. These
differences would be related to how the auditory system integrates recurrent
patterns with different delays. In such situations one would not generally be
able to distinguish one particular phase pattern from another, but it would
be important that the time structure of the signal and that of the echo be
largely similar in order for fusion to take place.

I don't think things get much more Gibsonian than this. If the auditory
system operates this way, then there is an invariant time pattern in the
sound environment, shared by the sound and its echo, that is extracted by the
auditory system. One way to think about this is that the auditory system
brings the correlation structure of sound & echo into the nervous system by
means of phase-locked discharges. This phase-locking is seen in every sensory
system, albeit on different time scales, so stimulus-driven time structure
has been around at least as long as sensory receptors and sensory neurons.
Essentially, if the fine structure of the stimulus is present in the timings
of discharges, then it is possible to carry out very, very general kinds of
pattern recognition operations that extract invariant time structure from
what amounts to an analog, iconic representation of the sound. This is much
closer to Gibsonian ideas concerning mechanisms of perception than models
based on spectral features (perceptual atoms) and complex pattern
recognition.
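
To make the echo point concrete, here is a toy numerical sketch of my own
(arbitrary parameters, numpy only; not any particular published model): a
correlation analysis of a sound plus its single echo picks out the echo delay
as recurrent time structure, without reference to any particular phase
spectrum.

import numpy as np

fs = 20000                                  # sample rate (Hz), arbitrary
rng = np.random.default_rng(0)
direct = rng.standard_normal(int(0.2*fs))   # 200 ms of "source" noise

delay = int(0.007*fs)                       # one surface, 7-ms echo
mixture = direct.copy()
mixture[delay:] += 0.5 * direct[:-delay]

# all-order interval analysis, crudely approximated by autocorrelation:
ac = np.correlate(mixture, mixture, mode='full')[len(mixture)-1:]
best = 1 + np.argmax(ac[1:])                # ignore the zero-lag peak
print(f"strongest recurrence at {1000*best/fs:.1f} ms")   # ~7.0 ms

The echo shows up as a correlation peak at its own delay; a mechanism that
fuses patterns recurring at short delays would treat it as part of one object.
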
> This is a matter of particular concern in an ecological approach, as
> non-reverberant environments are almost non-existent in the real world
> (anechoic rooms, mountain tops). On the other hand, again in real acoustic
> environments, spectral frequencies (that is, the frequencies of isolated
> components of complex sounds, or clear peaks in a running spectrum, forming
> frequency trajectories in time-varying sounds) cannot be directly affected
> by reflection off, or transmission through, environmental obstacles. They
> might be indirectly affected as a byproduct of the effect that such
> manipulations can have on amplitudes (e.g., a weakly defined peak could be
> pushed sideways if amplitudes increased on one side and decreased on the
> other), but such phenomena could hardly affect audible sound spectra.
>
> So for the auditory system to reliably identify sound sources, it needs to
> ignore phase information, which is merely a constant distraction, and focus
> as far as possible on a signal's spectral frequencies (and to a lesser
> extent on the relative amplitudes of individual components, keeping in mind
> that these, too, are affected by reflection and transmission).

In a sense we are saying similar things here. Interspike interval
distributions and rate-place profiles are both "phase-deaf" representations,
and form analysis is based on such basic "phase-deaf" representations (a toy
demonstration of this phase-invariance appears below).

> The ear's phase deafness with regard to pitch perception is thus a positive
> attribute. In fact, it may be regarded as an important phylogenetic
> achievement - the result of a long evolutionary process in which animals
> whose ears allowed phase relationships to interfere with the identification
> of dangerous or otherwise important sound sources died before they could
> reproduce. If this scenario is correct, then it is no surprise that we are
> highly sensitive to small changes in frequency, and highly insensitive to
> phase relationships within complex sounds.

Localization of sound is important, but it is no less important to be able to
recognize the forms of sounds, to be able to distinguish and recognize
different sound sources. The reason we talk so much in terms of localization
is that we understand more about how localization mechanisms operate: what
the cues are, what the neural computations are. One could make an analogous
argument that it is important to be able to detect arbitrary recurring sound
patterns that arrive at different times, and that therefore basic mechanisms
evolved that integrate similar time patterns over many delays. Such
mechanisms would be deaf to the particular phases of sounds, but sensitive to
transient changes in phase structure.

Birds and humans detect mistuned harmonics quite well. Why is this? The
harmonic complex has a constant phase structure that recurs from period to
period, and the mistuned component has a constant phase structure that recurs
at its own unrelated period. Phase relations between the complex and the
mistuned component are constantly changing. Two sounds are heard because
invariant waveform/phase patterns are fused together and varying sets of
relations are separated. Similar considerations apply to double vowels with
different F0's.
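
Here is that demonstration, a minimal sketch of my own (the choice of eight
harmonics at 100 Hz is arbitrary): the long-term autocorrelation, used here
as a rough stand-in for an all-order interspike interval distribution, is
identical for two harmonic complexes that differ only in their component
phases, even though their waveforms differ substantially.

import numpy as np

fs, f0, n = 16000, 100, 8000       # 0.5 s: an integer number of periods
t = np.arange(n)/fs
rng = np.random.default_rng(1)

def complex_tone(phases):
    return sum(np.cos(2*np.pi*f0*k*t + p) for k, p in enumerate(phases, 1))

a = complex_tone(np.zeros(8))                  # all components in cosine phase
b = complex_tone(rng.uniform(0, 2*np.pi, 8))   # same amplitudes, scrambled phases

def autocorr(x):                   # circular autocorrelation, computed via
    return np.fft.irfft(np.abs(np.fft.rfft(x))**2)   # the power spectrum

print(np.max(np.abs(a - b)))                        # large: waveforms differ
print(np.max(np.abs(autocorr(a) - autocorr(b))))    # ~0: "phase-deaf"

The same invariance is why randomizing component phases, as in the Heinbach
resynthesis quoted next, leaves an interval-based representation unchanged so
long as the within-channel and cross-channel time patterns remain stable.
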
> Straightforward evidence of the ear's insensitivity to phase in the sounds
> of the real human environment has been provided by Heinbach (1988). He
> reduced natural sounds including speech (with or without background noise
> and multiple speakers) and music to their spectral contours, which he
> called the part-tone-time-pattern. In the process, he completely discarded
> all phase information. The length of the spectrum analysis window was
> carefully tuned to that of the ear, which depends on frequency. Finally, he
> resynthesized the original sounds, using random or arbitrary phase
> relationships. The resynthesized sounds were perceptually indistinguishable
> from the originals, even though their phase relationships had been
> shuffled.

Yes, but these sounds still had the same time pattern within each frequency
channel, and the relations of time patterns across channels were presumably
stable over the course of the stimulus. If the interchannel phase relations
were constantly changing, I think the sound would not have the same quality.
If you introduced many random delays at different timepoints into the
different frequency channels, I would think that these sounds would break
apart.

I've experimented with sequences of vowel periods having different phase
relationships. One can take the waveform of a vowel period and flip its
polarity and/or reverse it in time. This yields 4 possible operations for
each vowel period. If you apply these in an orderly, regular, repeating way,
the resulting waveform has a pitch corresponding to the recurrence period of
the whole pattern. If you randomize the sequences, the waveform has a very
noisy pitch and a very different quality, and if you introduce random time
delays between the vowel periods in addition to the random phase switches,
the pitch goes away. Now the short-term spectral structure of this sound is
constant, but the time relations between events in one vowel period and
another have been destroyed. Voice pitches of vowels can thus be seen as the
result of recurrent phase patterns that span vowel periods. It is the delay
between the patterns (the recurrence time) that determines the pitch. If
there are no recurrent phase patterns, there is no pitch. Recurrence time of
phase (time interval) defines frequency.
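
For anyone who wants to play with this, here is a sketch of the manipulation
(the damped 700-Hz resonance below is my own stand-in for a real vowel
period, and all parameter values are arbitrary):

import numpy as np

fs = 16000
n = fs // 100                      # one 10-ms "vowel period" (100 Hz)
t = np.arange(n)/fs
period = np.exp(-300*t) * np.sin(2*np.pi*700*t)   # toy formant ringing

ops = [lambda p: p,                # identity
       lambda p: -p,               # polarity flip
       lambda p: p[::-1],          # time reversal
       lambda p: -p[::-1]]         # flip and reverse

rng = np.random.default_rng(2)

def sequence(op_indices, random_gaps=False):
    out = []
    for i in op_indices:
        out.append(ops[i](period))
        if random_gaps:            # random silent gaps destroy recurrence
            out.append(np.zeros(rng.integers(0, n//2)))
    return np.concatenate(out)

regular   = sequence([0, 1, 2, 3] * 12)              # 40-ms recurrence
scrambled = sequence(rng.integers(0, 4, 48))         # noisy pitch
gapped    = sequence(rng.integers(0, 4, 48), True)   # pitch disappears

ac = np.correlate(regular, regular, 'full')[len(regular)-1:]
print(np.argmax(ac[n//2:5*n]) + n//2, "samples =", 4*n)  # peak at 4 periods

All three versions have essentially the same short-term magnitude spectra,
but only the regular sequence has a stable recurrence peak, at the 40-ms
period of the whole repeating pattern.
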
> It is nevertheless possible to create artificial stimuli for which clear,
> significant perceptual effects of phase relationships on perception can be
> demonstrated. For example, Patterson (1973, 1987) demonstrated that
> listeners can discriminate two harmonic complex tones on the basis of phase
> relationships alone.

I think that this discrimination was made on the basis of a timbre
difference. I agree that phase relations can in some cases alter the relative
influence of particular harmonics and thereby influence timbre.

> Moore (1977) demonstrated that the relative phase of the components affects
> the pitch of harmonic complex tones consisting of three components; for
> each tone, there were several possible pitches, and relative phase affected
> the probability of a listener hearing one of those as 'the' pitch.

These several possible pitches, I assume, were associated with partials that
could be heard out rather than with F0. Again, phase structure can subtly
alter the relative salience of particular harmonics, and hence the partials
that are best heard.

> Hartmann (1988) demonstrated that the audibility of a partial within a
> harmonic complex tone depends on its phase relationship with the other
> partials.

Yes.

> Meddis & Hewitt (1991b) succeeded in modeling these various phase effects,
> which (as Moore, 1977, explained) generally apply only to partials falling
> within a single critical band or auditory filter.

I think what happens is that relative phase can affect which harmonic is most
effective at creating discharges that are phase-locked to it.

> In an ecological approach, the existence of phase sensitivity in such
> stimuli (or such comparisons between stimuli) might be explained as
> follows. These stimuli (or stimulus comparisons) do not normally occur in
> the human environment. So the auditory system has not had a chance to
> 'learn' (e.g., through natural selection) to ignore the phase effects. As
> hard as the ear might 'try' to be phase deaf in the above cases, some phase
> sensitivity will always remain, for unavoidable physiological reasons.

But these effects are all extremely subtle. I don't think vowel quality ever
changes so radically that one hears a completely different vowel. But why are
there these kinds of subtle effects at all? From a rate perspective, one
could argue for some kind of slight rate suppression that depends on the
relative phases of closely spaced harmonics. The interval account would be
similar, except that instead of rate suppression, one would have interval or
synchrony suppression.
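
A minimal sketch of why phase matters only within a single filter channel
(this is the classic AM-versus-quasi-FM construction, with frequencies of my
own choosing, and scipy's Hilbert envelope standing in for whatever envelope
the hair cell actually sees): three harmonics passed together by one auditory
filter sum to very different temporal envelopes depending on their relative
phase, even though their power spectrum is identical, so the synchrony they
command from a fiber can differ.

import numpy as np
from scipy.signal import hilbert

fs = 16000
t = np.arange(0, 0.1, 1/fs)
f0, fc = 100, 1900                 # harmonics 18, 19, 20 of 100 Hz

def three_tone(mid_phase):
    return (np.cos(2*np.pi*(fc-f0)*t) +
            np.cos(2*np.pi*fc*t + mid_phase) +
            np.cos(2*np.pi*(fc+f0)*t))

for name, ph in [("AM (cosine phase)", 0.0), ("QFM (middle +90 deg)", np.pi/2)]:
    env = np.abs(hilbert(three_tone(ph)))          # temporal envelope
    depth = (env.max() - env.min()) / (env.max() + env.min())
    print(f"{name}: envelope modulation depth = {depth:.2f}")

The cosine-phase version is deeply modulated (depth near 1), the quasi-FM
version much less so (depth near 0.4), with the same three spectral lines.
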
> There could, however, be some survival value associated with the ability to
> use phase relationships to identify sound sources during the first few tens
> of ms of a sound, before the arrival of interference from reflected waves
> in typical sound environments. On this basis, we might expect phase
> relationships at least to affect timbre, even in familiar sounds.
> Supporting evidence for this idea in the case of synthesized musical
> instrument sounds has recently been provided by Dubnov & Rodet (1997). In
> the case of speech sounds, Summerfield & Assmann (1990) found that
> pitch-period asynchrony aided in the separation of concurrent vowels;
> however, the effect was greater for less familiar sounds (specifically, it
> was observed at fundamental frequencies of 50 Hz but not 100 Hz). In both
> cases, phase relationships affected timbre but not pitch.
>
> The model of Meddis & Hewitt (1991a) is capable of accounting for known
> phase dependencies in pitch perception (Meddis & Hewitt, 1991b). This
> raises the question: why might it be necessary or worthwhile to model
> something that does not have demonstrable survival value for humans
> (whereas music apparently does have survival value, as evidenced by the
> universality of music in human culture)?

It's certainly premature to judge what kinds of auditory representations have
or don't have "demonstrable survival value for humans." Phase dependencies
may be side issues in ecological terms, but they do shed light on basic
auditory mechanisms. Deciding what is evolutionarily relevant is difficult at
best. In arguing that music perception is culturally universal and therefore
must have survival value, I think one commits an evolutionary fallacy: that
every capability is the result of a particular adaptation to a particular
ecological demand. Even Steven Pinker doesn't go this far; at least he would
say that music perception could be a by-product of other adaptations. It's
very hard indeed to identify what the inherent survival value of music would
be.

And there can be generalist evolutionary strategies and general-purpose
pattern recognizers, so it is not always the case that evolutionary demands
and solutions have to be so parochial. (Most of vision isn't face
recognition, even if one thinks that face recognition is a special-purpose
module selected for a special-purpose ecological demand -- we see all sorts
of complex patterns that our evolutionary forebears never encountered. We
were not evolutionarily selected to read text such as this, but we can do it
because our visual mechanisms have sufficient generality that we can learn to
recognize letters and words.) I'd rather we avoid particularistic adaptive
"just-so" stories to explain away peculiarities of our senses.

However, studying music perception is very important even if music had/has no
inherent survival value for the species, because it gives us another window
on complex modes of auditory representation and processing. Music is an
important aspect of auditory cognition, and your work on the structure of
auditory cognition is quite valuable regardless of whether music is essential
to survival.

Very general kinds of pattern recognition mechanisms are possible and could
very well be based on the nature of the basic auditory representations. For
example, if an all-order interval analysis is carried out by the central
auditory system, then harmonic relations (octaves, fifths, low-integer
frequency ratios) all fall out of the inherent harmonic structure of time
intervals and their interactions. (I've read your book and know you don't
like these kinds of Pythagorean relations. But there they are...) Our
perception of octave similarities would be the result of very basic
similarities in interval representations rather than the result of acquired
associations. On this perspective, octave similarities and the perception of
missing fundamentals are consequences of the operation of phylogenetically
ancient neural coding systems. We may be phase-deaf, but much of our auditory
perception may be based on phase-locking nonetheless.

--Peter Cariani
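
P.S. A toy arithmetic illustration of the Pythagorean point (my own
construction, nothing more): the all-order intervals available from a
periodic sound sit at integer multiples of its period, so the fraction of
intervals shared between two tones is largest for low-integer frequency
ratios, with the octave first.

from fractions import Fraction

def intervals(f):                  # all-order intervals: k periods of f
    return {Fraction(k)/f for k in range(1, 200)}

f1 = Fraction(440)
for name, r in [("octave 2:1", Fraction(2)), ("fifth 3:2", Fraction(3, 2)),
                ("fourth 4:3", Fraction(4, 3)),
                ("major 7th 15:8", Fraction(15, 8))]:
    t1, t2 = intervals(f1), intervals(f1*r)
    print(f"{name}: shared interval fraction = {len(t1 & t2)/len(t1 | t2):.3f}")

On an interval account the privileged status of the octave needs no learned
association; it is already there in the interval statistics.
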

This message came from the mail archive
maintained by:
DAn Ellis <>
Electrical Engineering Dept., Columbia University