TECHNICAL COMMENT

Filtering Reveals Form in Temporally Structured Displays
In
their report, Lee and Blake (1) asked
whether the visual system could use temporal
microstructure to bind image regions into
unified objects, as has been proposed in some
neural models (2). Lee and Blake presented
two regions of dynamic texture. The elements
of the target region changed in synchrony
according to a random sequence, while the
elements of the background region changed
at independent times. The stimulus was designed in an attempt to remove all classical
form-giving cues such as luminance, contrast,
or motion, so that timing itself would provide
the only cue. Subjects were readily able to
distinguish the shape of the target region. Lee
and Blake posited the existence of new visual
mechanisms “exquisitely sensitive to the rich
temporal structure contained in these high-order stochastic events.” The results have
generated much excitement (3).
However, we believe that the effects can
be explained with well-known mechanisms.
The filtering properties of early vision can
convert the task into a simple static or dynamic texture discrimination problem. A sustained cell (temporal lowpass) will emphasize
static texture through the mechanisms of visual persistence; a transient cell (temporal
bandpass) will emphasize texture that is flickering or moving.
We simulated a lowpass mechanism to see
what would emerge. Lee and Blake’s stimuli
were composed of randomly oriented Gabor
elements, where the Gabor phase shifted forward or backward on each frame according to
a coin-flip. We downloaded one such movie
from their Web site and ran it through a
temporal lowpass filter (4).

Fig. 1: (A) One input frame; (B) result of temporal integration with synchronized target and unsynchronized background; (C and D) results of temporal integration when the target and background are each synchronized.

An input frame is shown in Fig. 1A; a filtered output frame is shown in Fig. 1B. At the particular moment
shown, the target region has a lower effective
contrast than does the background, providing
a strong form cue. At other moments the
target’s contrast may be above or below the
background’s contrast because of statistical
fluctuations in the reversal sequences. If a
single Gabor element happens to have a run
of multiple shifts in one direction, its effective contrast is low because of the temporal
averaging. Conversely, if it has a run of
alternating forward and backward shifts,
thus “jittering” in place, its contrast remains fairly high. Within the unsynchronized background the local contrasts
fluctuate randomly, but within the synchronized target region they all rise and fall
in unison, revealing a distinct rectangular
form.
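The contrast argument can be illustrated with a minimal one-dimensional sketch (Python with NumPy). The Gabor parameters and the quarter-cycle phase step are our own illustrative assumptions, not Lee and Blake's actual stimulus values: averaging frames over a run of same-direction phase shifts smears the carrier and lowers effective contrast, while averaging over alternating shifts leaves contrast largely intact.

```python
import numpy as np

def gabor_frame(phase, n=64, freq=4.0):
    """1-D Gabor: a Gaussian-windowed sinusoid with the given carrier phase."""
    x = np.linspace(-1, 1, n)
    return np.exp(-x**2 / 0.18) * np.cos(2 * np.pi * freq * x + phase)

def averaged_contrast(shifts, dphi=np.pi / 2):
    """Temporally average frames whose phase follows the shift sequence;
    return the RMS contrast of the averaged frame."""
    phases = np.cumsum(shifts) * dphi
    avg = np.mean([gabor_frame(p) for p in phases], axis=0)
    return np.sqrt(np.mean(avg**2))

run    = [+1, +1, +1, +1]   # four shifts in the same direction
jitter = [+1, -1, +1, -1]   # alternating shifts: the element "jitters" in place

# With quarter-cycle steps, a run visits four quadrature phases that cancel
# under averaging (contrast ~ 0); jitter revisits only two phases, so much
# of the contrast survives.
print(averaged_contrast(run), averaged_contrast(jitter))
```

The exact cancellation here is an artifact of the assumed quarter-cycle step; with other step sizes the run's contrast is reduced rather than nulled, which is all the argument requires.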
In a second experiment Lee and Blake
synchronized both the target and background
region, each to its own random sequence.
Here the target was even more clearly visible.
This result is predicted by our hypothesis.
Since both background and target are synchronized, they will both yield uniform texture contrasts after temporal filtering. There
will be moments when, by chance, one region’s contrast is high while the other’s is
low, and the target will become especially
clear. Figure 1C shows one such moment,
again the result of filtering a movie from the
Web site with the lowpass filter. Figure 1D
shows a moment when the relative contrasts
are reversed. We also ran movies through a
temporal bandpass filter (5) with a biphasic
impulse response, to simulate a transient
mechanism. Again, the target was clearly
revealed.
Our hypothesis also predicts, with the use
of either filter, Lee and Blake’s finding that
discrimination will be best when the reversal
sequences have high entropy, that is, when
the coin-flip is unbiased. The contrast cue is
best when the target “jitters” in place while
the background has a run in a single direction
(or vice versa). This condition happens most
frequently at high entropy.
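The entropy prediction follows from a short calculation: for a coin with bias p, successive shifts alternate (producing "jitter") with probability 2p(1−p) and repeat (extending a run) with probability p² + (1−p)², so jitter is most frequent when the coin is fair. A small sketch with hypothetical bias values:

```python
import numpy as np

# For a coin with bias p, the chance that two successive shifts alternate
# ("jitter") is 2p(1-p); the chance they repeat (extending a "run") is
# p^2 + (1-p)^2.  Alternation peaks at p = 0.5, i.e. at maximum entropy,
# which is where the contrast cue is strongest.
for p in (0.5, 0.7, 0.9):
    p_alt = 2 * p * (1 - p)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    print(f"p={p:.1f}  entropy={entropy:.2f} bits  P(alternate)={p_alt:.2f}")
```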
Lee and Blake’s stimuli are designed to
remove form cues from single frames and
from frame pairs. However, when one considers the full sequence, strong contrast cues
can emerge due to the spatio-temporal filtering present in early vision. These cues probably suffice to explain the perception of form
in the experiments. We do not see the need to
posit mechanisms other than those already
known to exist.
Edward H. Adelson
Hany Farid
Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139, U.S.A.
E-mail: adelson,[email protected]
References and Notes
1. S. Lee and R. Blake, Science 284, 1165 (1999).
2. W. Singer and C. Gray, Ann. Rev. Neurosci. 18, 555
(1995).
3. M. Barinaga, Science 284, 1098 (1999).
4. The lowpass impulse response was of the form h(t) = (t/τ)^2 e^(−t/τ), with τ = 0.01 s. The integration time was roughly 40 msec.
5. The bandpass impulse response was of the form h(t) = (kt/τ)^n e^(−kt/τ) [1/n! − (kt/τ)^2/(n+2)!], with τ = 0.01 s, k = 2, and n = 4. The peak response was at 5 Hz.
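The two impulse responses in notes 4 and 5 can be written down directly. The sketch below is ours; the 1-kHz sampling rate and 200-ms support are assumptions chosen for illustration:

```python
import numpy as np
from math import factorial

def lowpass_ir(t, tau=0.01):
    """Sustained (lowpass) impulse response from note 4:
    h(t) = (t/tau)^2 e^(-t/tau)."""
    return (t / tau) ** 2 * np.exp(-t / tau)

def bandpass_ir(t, tau=0.01, k=2, n=4):
    """Transient (bandpass) impulse response from note 5:
    h(t) = (kt/tau)^n e^(-kt/tau) [1/n! - (kt/tau)^2/(n+2)!]."""
    s = k * t / tau
    return s ** n * np.exp(-s) * (1 / factorial(n) - s ** 2 / factorial(n + 2))

t = np.arange(0, 0.2, 0.001)   # 200 ms sampled at 1 kHz (assumed rate)
lp, bp = lowpass_ir(t), bandpass_ir(t)

# The lowpass kernel is monophasic (never negative): it integrates over time,
# producing visual persistence.  The bandpass kernel is biphasic (changes
# sign), so its response to a sustained input is small: it signals change.
print(lp.min() >= 0, bp.min() < 0 < bp.max())
```

Filtering a pixel's time course then amounts to a discrete convolution of that time course with either kernel.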
Response: We agree with Adelson and Farid
that an appropriately designed, lowpass temporal filter applied to our stochastic animation sequences (1) could extract form defined
by luminance contrast without resort to temporal synchrony. We raised that possibility in
our report, noting that temporal integration
could produce occasional pulses in apparent
contrast when, by chance, motion elements
repetitively switched back and forth in direction over several successive frames (called
“jitter” in Adelson and Farid’s comment).
The output from Adelson and Farid’s model
(Fig. 1) confirms our intuition, showing that
contrast pulses could be synthesized by a
hypothetical temporal filter with the right
time constant. But to assert that these infrequent, hypothetical events “explain the perception of form” seems conjectural. In our
experiments, observers never saw static single frames such as Adelson and Farid are
pointing to in their filtered example; successive frames were rapidly animated, and contrast pulses were not conspicuous in these
animations. But perhaps this cue, although
not salient in the animations, is available and
utilized by observers when performing our
shape task. In our research, we created conditions in which the putative contrast pulses
would occur in the figure and in the background regions. Distributing identical contrast pulses throughout the display, we reasoned, should impair figure/ground segmentation based on perceived contrast. But exactly the opposite was found [see figure 2A in
(1)], implying that contrast summation does
not mediate performance on our task.
Adelson and Farid’s hypothetical temporal filter uncovers a possible consequence
that we did not address in our report. Specifically, in animation sequences containing
multiple successive frames without change in
the direction of motion (runs), effective contrast produced by temporal integration could
be temporarily reduced within the figural region where all elements are doing the same
thing. When that happens, this region could
stand out from the background, where elements are changing independently. Adelson
and Farid’s figure 1A depicts this hypothetical situation. Because strings of “no change”
frames are more probable at lower entropy
values, shape discrimination based on global
reductions in contrast within the figure
should be particularly easy at low entropy.
But just the opposite is true: shape from
temporal synchrony is best at high entropy,
where “no change” sequences are highly unlikely [see figure 2B in (1)].
We are grateful to Adelson and Farid for
formalizing a plausible model of temporal
integration. Using their model, we have quantitatively indexed the potential strength of
luminance cues from temporal integration
(2). We find no correlation between this
strength index and psychophysical performance on our shape discrimination task. We
have gone one step further, using this index to
create animations from which temporal integration could produce no luminance cues
whatsoever in any frames of the sequence.
We did three things to achieve this: (i) the
contrast of each moving element was randomized throughout the array and from
frame-to-frame of the animation, (ii) the average luminance of each motion element was
assigned randomly throughout the array, and
(iii) those frames causing “runs” and “jitter”
were selectively pruned from the sequence.
Observers still readily perceive shape from
temporal synchrony in these sequences that
have been purged of potential luminance cues
(3). This observation is remarkable considering that contrast randomization and luminance randomization actually introduce conflicting cues for spatial structure. Our findings undermine the supposition that temporal
integration alone can “explain the perception
of form” (1) in these stochastic displays. On
the contrary, it is revealing that temporal
integration does not erase visual signals generated by these kinds of dynamic, stochastic
events. This constitutes one more piece of
evidence that human vision contains mechanisms that preserve the temporal fine structure in dynamic events, structures that operate in the interests of spatial grouping.
Adelson and Farid also suggest that a
filter with a biphasic impulse response could
be involved in the extraction of shape from
our dynamic displays. Here, too, they confirm a point made in our report where we
noted that reversals in motion direction—the
carriers of temporal structure in our displays—could produce brief neural transients
that accurately denote points in time when
reversals occur. When applied to our displays, an appropriately tuned biphasic temporal filter accomplishes this operation (change
detection). So we agree with Adelson and
Farid that there is no need to posit the existence of new visual mechanisms sensitive to
stochastic temporal structure. Existing mechanisms provide a reasonable point of departure. Still, change detection is just a first step
in extracting shape from temporal structure.
It remains a challenge to explain how spatial
grouping is accomplished based only on irregularly occurring transients distributed
among local neural mechanisms tuned to different directions of motion.
Sang-Hun Lee
Randolph Blake
Vanderbilt Vision Research Center
Vanderbilt University
Nashville, TN 37240, U.S.A.
E-mail: [email protected]
[email protected]
References and Notes
1. S. Lee and R. Blake, Science 284, 1165 (1999).
2. The strength of the putative luminance cue for each
individual frame of the filtered sequence is computed
in the following way: First, for each frame we calculate the standard deviation of the luminance values
of all pixels within the figure region of the filtered
array (SDf) and the standard deviation of all pixels
within the background region of the filtered array
(SDb). (Note that this computation requires knowing
the exact location of the figural region within the
array, whereas in our experiments the location of this
region varied unpredictably.) We express the strength
of the luminance cue as the ratio |SDf − SDb| / (SDf + SDb). From this ratio, we can predict whether
performance under any given condition should be
better than performance under another condition,
based on luminance from temporal integration.
3. Readers may view versions of these animations at
http://www.psy.vanderbilt.edu/faculty/blake/Demos/
TI/TI.html. Note that these Web animations are running at slower frame rates than we use in the laboratory, and that the spatial resolution of the animations has been significantly reduced to minimize
downloading time.
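The strength index of note 2 is straightforward to compute on a filtered frame. The sketch below uses synthetic data; the image size, region geometry, and noise levels are illustrative assumptions:

```python
import numpy as np

def luminance_cue_strength(frame, fig_mask):
    """Strength index from note 2: |SDf - SDb| / (SDf + SDb), where SDf and
    SDb are the standard deviations of pixel luminance inside the figure
    region and the background region of one filtered frame."""
    sd_f = frame[fig_mask].std()
    sd_b = frame[~fig_mask].std()
    return abs(sd_f - sd_b) / (sd_f + sd_b)

# Toy frame: background noise with SD 1, and a central figure patch whose SD
# is reduced to 0.2, as temporal integration would do during a "run" in the
# synchronized region.
rng = np.random.default_rng(0)
frame = rng.normal(0.5, 1.0, (64, 64))
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
frame[mask] *= 0.2

# Large when the figure's contrast differs from the background's; near zero
# when the two regions are statistically matched.
print(luminance_cue_strength(frame, mask))
```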
16 July 1999; accepted 26 October 1999
17 DECEMBER 1999 VOL 286 SCIENCE www.sciencemag.org