Automatic computation of an image's statistical surprise predicts performance of human observers on a natural image detection task

T. Nathan Mundhenk a,*, Wolfgang Einhäuser b, Laurent Itti a

a Department of Computer Science, University of Southern California, Hedco Neuroscience Building, HNB 10, Los Angeles, CA 90089-2520, USA
b Department of Neurophysics, Philipps-University Marburg, Renthof 7, 35032 Marburg, Germany
* Corresponding author. E-mail address: [email protected] (T.N. Mundhenk).

Vision Research 49 (2009) 1620–1637. doi:10.1016/j.visres.2009.03.025

Article history: Received 24 July 2008; received in revised form 28 March 2009.

Keywords: Surprise; Attention; RSVP; Detection; Modeling; Saliency; Natural image; Computer vision; Masking; Statistics

Abstract

To understand the neural mechanisms underlying humans' exquisite ability at processing briefly flashed visual scenes, we present a computer model that predicts human performance in a Rapid Serial Visual Presentation (RSVP) task. The model processes streams of natural scene images presented at a rate of 20 Hz to human observers, and attempts to predict when subjects will correctly detect whether one of the presented images contains an animal (target). We find that metrics of Bayesian surprise, which models both spatial and temporal aspects of human attention, differ significantly between RSVP sequences on which subjects will detect the target (easy) and those on which subjects miss the target (hard). Extending beyond previous studies, we here assess the contribution of individual image features, including color opponencies and Gabor edges. We also investigate the effects of the spatial location of surprise in the visual field, rather than only using a single aggregate measure. A physiologically plausible feed-forward system, which optimally combines spatial and temporal surprise metrics for all features, predicts performance correctly in 79.5% of human trials. This is significantly better than a baseline maximum likelihood Bayesian model (71.7%). We can see that attention, as measured by surprise, accounts for a large proportion of observer performance in RSVP. The time course of surprise in different feature types (channels) provides additional quantitative insight into rapid bottom-up processes of human visual attention and recognition, and illuminates the phenomena of attentional blink and lag-1 sparing. Surprise also reveals classical Type-B-like masking effects intrinsic in natural image RSVP sequences. We summarize these findings with the discussion of a multistage model of visual attention.

1. Introduction

What are the mechanisms underlying human target detection in RSVP (Rapid Serial Visual Presentation) streams of images, and can they be modeled in such a way as to allow prediction of subject performance? This question is of particular interest since, when images are presented at high speed, humans can detect some but not all images of a particular type (target images; e.g. images containing an animal) which they would be able to detect with far greater accuracy at a slower rate of presentation. Our answer to this question is that two primary forces are at work related to attention and are part of a model with two or more stages (Chun & Potter, 1995; Reeves, 1982; Sperling, Reeves, Blaser, Lu, & Weichselgartner, 2001). Here we will suggest that the first stage is purely an attentional mask, with the blocking strength of attention given by image features which have already been observed.
The second stage, on the other hand, can block the perception of the image if another image is already being processed and is monopolizing its limited resources.

We consider here the metric of Bayesian surprise (Itti & Baldi, 2005, 2006) to predict how easily a target image containing an animal may be found among 19 frames of other natural images (distractors) presented at 20 Hz. In a first experiment, we show that surprise measures differ significantly between target images which subjects find easy to detect in RSVP sequences and those which are hard. We then present a second experiment which attempts to predict subject performance by utilizing the surprising features to determine the strength of attentional capture and masking. This is done using a back-propagation neural network whose inputs are the features of surprise and whose output is a prediction about the difficulty of a given RSVP sequence.

1.1. Overview of attention and target detection

It has long been argued that attention plays a crucial role in short term visual detection and recall (Duncan, 1984; Hoffman, Nelson, & Houck, 1983; Mack & Rock, 1998; Neisser & Becklen, 1975; Sperling et al., 2001; Tanaka & Sagi, 2000; VanRullen & Koch, 2003a; Wolfe, Horowitz, & Michod, 2007). This also applies to detection of targets when images are displayed, one after another, in a serial fashion (Duncan, Ward, & Shapiro, 1994; Raymond, Shapiro, & Arnell, 1992). Many studies have demonstrated that a distracting target image presented before another target image blocks its detection, a phenomenon known as the attentional blink (Einhäuser, Koch, & Makeig, 2007a; Einhäuser, Mundhenk, Baldi, Koch, & Itti, 2007b; Evans & Treisman, 2005; Maki & Mebane, 2006; Marois, Yi, & Chun, 2004; Raymond et al., 1992; Sergent, Baillet, & Dehaene, 2005). Thus, one image presented to the visual stream can interfere with another image that quickly follows, essentially acting as a forward mask (e.g. target in frame A blocks target in frame B). Additionally, with attentional blink, interference follows a time course whereby the greatest degradation of detection and recall performance for a second target image occurs when it follows the first target image by 200–400 ms (Einhäuser et al., 2007a), which is evidence of a second stage processing bottleneck. In most settings, an intermediate distractor between the first and second targets is needed to induce an attentional blink, a phenomenon known as lag-1 sparing (Raymond et al., 1992). However, for some types of stimuli, such as a strong contrast mask with varying frequencies and naturalistic colors superimposed with white noise, interference can occur very quickly (VanRullen & Koch, 2003b). This may create a situation whereby, for some types of stimuli, a target is blocked by a prior target with a very recent onset (<50 ms prior) or, by contrast, a much earlier onset (>150 ms prior). As such, there seems to be a short critical period, with a U-shaped performance curve, during which interference against the second target is reduced. That is, interference is reduced if the preceding distractor falls in a critical period of approximately 50–150 ms before the target, but is larger otherwise. We will generically refer to this interval as the sparing interval.
In addition to interference with the second target, detection of the first target itself can be blocked by backward masking (e.g. target in frame B blocks target in frame A) (Breitmeyer, 1984; Breitmeyer & Öğmen, 2006; Hogben & Di Lollo, 1972; Raab, 1963; Reeves, 1980; VanRullen & Koch, 2003b; Weisstein & Haber, 1965). However, in natural scene RSVP, backward masking occurs at a very short time interval, <50 ms, without a good ability to dwell in time (Potter, Staub, & O'Connor, 2002). That is, interference is not U-shaped in the same way as with forward masking. As we will mention later, longer intervals (>150 ms) may conversely enhance detection of the first target (e.g. target in frame B enhances target in frame A). However, backward masking in the case of RSVP still retains a U-like shape as the effects of the mask peak and decrease. The difference is that it does not have a second, almost discrete episode of new masking following a short interval of sparing, as forward masking does. That is, once the backward mask fades in effect the first time, it is finished masking. The forward mask, on the other hand, has the ability to mask twice. Putting these pieces together, there is a lack of literature showing a strong reverse attentional blink, which would be produced by a second interval of backward masking.

With different stimuli, both forward and backward masking can be observed over very short time periods by flanking images in a sequence. However, only forward masking interferes with targets observed several frames apart, following a sparing lag with a much higher target onset latency. It should be noted that these masking effects are not universally observed in all experiments. As such, the mechanisms responsible for masking are dependent to some degree on both masking and target stimuli, which may result in a target being spared or completely blocked.

Much like masking from temporal offsets, if the first target is displayed spatially offset from the second target, interference is decreased for recall of the first target (Shih, 2000). As an example, a large spatial offset would occur if a target in frame A is in the upper right hand corner while a target in frame B is in the lower left hand corner. Thus, if we overlapped the frames, the targets themselves would not overlap. As a result, recognition of individual target images allows for some parallel spatial attention (Li, VanRullen, Koch, & Perona, 2002; McMains & Somers, 2004; Rousselet, Fabre-Thorpe, & Thorpe, 2002). This is also seen if priming signals for target and distractor are spatially offset (Mounts & Gavett, 2004). However, at intervals over 100 ms, spatial overlap may actually prime targets in an attentional blink task (Visser, Zuvic, Bischof, & Di Lollo, 1999). We should gather, then, that objects offset in space lack some power to mask each other but, in contrast, may at longer time intervals lack the ability to prime each other. Additionally, it has been found that more than one object at different temporal offsets can be present in memory pre-attentively, at the same time, but only if they do not overlap at critical temporal, spatial and even feature offsets (VanRullen, Reddy, & Koch, 2004). However, there is reduced performance as more items, such as natural images, are added in parallel (Rousselet, Thorpe, & Fabre-Thorpe, 2004). Thus, even if objects do not interfere along critical dimensions, performance may degrade as a function of the number of complex distractors added.
1.2. Surprise and attention capture

Prediction of which flanking images or objects will interfere with detection of a target might be accounted for in a Bayesian metric of statistical surprise. Such a metric can be derived for attention based on measuring how a new data sample may affect the prior beliefs of an observer. Here, surprise (Itti & Baldi, 2005, 2006) is based on the conjugate prior information about observations combined with a belief about the reliability of each observation (for mathematical details, see Appendix A). Surprise is strong when a new observation causes a Bayesian learner to substantially adjust its beliefs about the world. This is encountered when the distribution of posterior beliefs differs highly from the prior.

The present paper extends our previous work (Einhäuser et al., 2007b) by optimally combining surprise measures from different low-level features. The contribution of different low-level features to "surprise masking", and thus their role in attention, can be individually assessed. Additionally, we will demonstrate how we have extended this work by creating a non-relative metric that can compare difficulty for RSVP sequences with disjoint sets of target and distractor images. That is, our original work was only able to tell us if a new ordered sequence was relatively more difficult than its original ordering. The current work will focus on giving us a parametric and absolute measure based on how many observers should be able to spot a target image in a given sequence set.

While surprise has been shown to affect RSVP performance, it remains to be seen how surprise from different types of image features interacts with recall. Importantly, critical peaks of surprise, along specific feature dimensions, can be measured and used to assess the degree to which flanking images may block one another. For instance, should an image with surprising horizontal lines have more power in masking a target than an image with surprising vertical lines? This is important since some features may be more or less informative, for instance if they have a low signal to noise ratio (SNR) between the target and distractors (Navalpakkam & Itti, 2006). Additionally, some features may be primed in human observers, making them more powerful. As an example, if features can align and enhance along temporal dimensions (Lee & Blake, 2001; Mundhenk, Landauer, Bellman, Arbib, & Itti, 2004; Mundhenk, Everist, Landauer, Itti, & Bellman, 2005) in much the same way they do spatially (Li, 1998; Li & Gilbert, 2002; Mundhenk & Itti, 2005; Yen & Finkel, 1998), then some features that appear dominant may have a fortunately higher incidence of temporal and/or spatial colinearity in image sequences.

In order to predict and eventually augment detection of target images, a metric is needed that measures the degree of interference of images in a sequence. Here we follow the hypothesis that temporally flanking stimuli interfere with target detection if they "parasitically" capture attention so that the target may fall within an attentional blink period (e.g. targets in frames A and C may interfere with a target in frame B). The challenge is then to devise measures of such attentional masking for natural images using stimulus statistics. We herein derive and combine several such measures, based on a previously suggested model of spatial attention that uses Bayesian surprise.
Note that this "surprise masking" hypothesis does not deny other sources of recognition impairment; e.g. some targets are simply difficult to identify even in spatio-temporal isolation. Instead, "surprise masking" addresses recall impairment arising from the sequential ordering of stimuli, rather than the inherent difficulty of identifying the target itself. If surprise gives a good metric of interference and masking between temporally offset targets in RSVP, we should expect that by measuring surprise between images by its spatial, temporal and feature offsets we will obtain good prediction of how well subjects will detect and recall target images. To this end, we first present evidence that surprise gives a good indication of the difficulty of detecting targets in RSVP sequences and that it elegantly reveals almost classic intrinsic masking effects in natural images. We then show for the first time a neural network model that predicts subject performance using surprise information. Additionally, we discuss the surprise masking observed between images in an RSVP sequence within the framework of a two-stage model of visual processing (Chun & Potter, 1995).

2. Methods

2.1. Surprise in brief

Although the mathematical details of the surprise model are given in Appendix A, we provide here an intuitive example for illustration. The complete surprise system subdivides into individual surprise models. Each of these individual models is defined by its feature (color opponencies, Gabor wavelet derived orientations, intensity), location in the image and scale. The respective feature values are maintained over time, in the current state, at each location in the image, by feeding the observed value to a surprise model at that location. That is, every image location, feature and scale has its own surprise model. New data coming into surprise models can then be compared with past observations in a reentrant manner (Di Lollo, Enns, & Rensink, 2000), whereby the normalcy observed by the models can guide attention when there is a statistical violation of it. Different frequencies of surprising events are accounted for with several time scales by feeding the results of successive surprise models forward.

Imagine that we present the system initially with the image of a yellow cat. If the system is continuously shown pictures of yellow cats, many surprise models will begin to develop a belief that what they should expect to see are features related to yellow cats. However, not all models will develop the same expectation. Models in the periphery will observe the corners or edges of the image, which in a general sample would be more likely to contain background material such as grass or sky. If after several pictures of yellow cats a blue chair is shown, surprise will be largest for models which had been locally exposed more often to the constant stimuli of the yellow cats. Additionally, surprise should be enhanced more for models which measure color than for models which measure intensity or line orientations since, in this example, color was more constant prior to the observation of the blue chair. Finally, surprise can be combined across different types of features (channels), scales and locations into a single surprise metric. This combination can either be hardwired (Einhäuser et al., 2007b) or optimized for the prediction task (the present paper).

As is typical for Bayesian models, there is an initial assumption about the underlying statistical process of the observations.
Since we wish to model a process of neural events, which should bring us closer to a neurobiologically plausible event of surprise, a Poisson/Gamma process is used as the basis. The Gamma process is used since it gives the probability distribution of observing Poisson distributed events with refractoriness, such as neural spike trains (Ricciardi, 1995). Surprise is then extended into a visually relevant tool by combining its functioning with an existing model of visual saliency [see also another saliency model: (Li, 2002)]. As a less abstract example, the Gamma process can be used to model the expected waiting time to observe event B after event A has been observed. If events A and B are neural spikes, then we can see how it is useful in gauging the change in firing rates related to alterations in visual features. Here the Gamma process is given with the standard α and β hyperparameters, which are updated with each new frame and are part of a conjugate prior over beliefs (Fig. 1). Abstractly, α corresponds to the expected value of visual features while β is the variance we have over the features as we gather them. Since we have a conjugate prior, the old α and β values are fed back into the model as the current best estimate when computing the new α and β values (denoted α′ and β′) for the next frame. In a sense, the new expected sample value and variance are computed by updating the old expected sample value and variance with new observations. This gives us a notion of belief as we gather more samples, since we can have an idea of confidence in the estimate of our parameters. Additionally, as observations are added, so long as our beliefs are not dramatically violated, surprise will tend to relax over time. To put this another way, metaphorically speaking, the more things a surprise model sees, the less it tends to be surprised by what it sees.
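To make the update concrete, the following minimal sketch implements one conjugate-prior update for a single surprise model, with surprise taken as the KL divergence between posterior and prior Gamma distributions (see Appendix A for the exact formulation). The forgetting factor zeta and the shape/rate parameterization are illustrative assumptions, not the paper's exact constants.

```python
import numpy as np
from scipy.special import gammaln, digamma

def gamma_kl(a1, b1, a2, b2):
    """KL(Gamma(a1, b1) || Gamma(a2, b2)) for a shape/rate parameterization."""
    return ((a1 - a2) * digamma(a1) - gammaln(a1) + gammaln(a2)
            + a2 * (np.log(b1) - np.log(b2)) + a1 * (b2 - b1) / b1)

def surprise_update(alpha, beta, sample, zeta=0.7):
    """One surprise-model update for a single location/feature/scale.

    The old (alpha, beta) act as the conjugate prior; the new observation
    folds into the posterior (alpha_new, beta_new). Surprise is the KL
    divergence between posterior and prior beliefs. zeta < 1 slowly
    forgets old data, which is why surprise relaxes over a stable input.
    """
    alpha_new = zeta * alpha + sample   # update expected feature value
    beta_new = zeta * beta + 1.0        # update confidence (observation count)
    s = gamma_kl(alpha_new, beta_new, alpha, beta)
    return alpha_new, beta_new, s
```

Run on a constant input, s shrinks toward zero frame after frame; a sudden change in the sample produces a spike, mirroring the yellow-cat/blue-chair example above.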
2.2. Using surprise to extract image statistics from sequences

With RSVP image sequences, each image is fed like a movie into the complete surprise system, one at a time. As the system observes the images, it formulates expectations about the values of each feature channel over the image frames. As with the yellow cat example, images presented in series that are similar tend to produce low levels of surprise. High peaks in surprise then tend to be observed when a homogeneous sequence of images gives way to a rather heterogeneous one. However, the term homogeneous is used loosely, since repetitions of heterogeneous images are, in terms of surprise, homogeneous. That is, certain patterns of change can be expected and believed to be normal. Even patterns of 1/f noise have an expected normalcy and as such should not necessarily be found to be surprising. It should also be noted that surprise is measured not just between images, but within images as well. We suggest that an image can stand on its own with its inherent surprise. As a result, surprise is a combination of the spatial surprise within an image and the temporal surprise between images in a sequence.

From the surprise system we can analyze surprise for different types of features as well. This is possible since the surprise model extracts initial feature maps in a manner identical to the saliency model of Itti and Koch (2001). So, for instance, for any given frame, we can tell how surprising red/green color opponents are because we have a map of the red/green color opponency for each frame. A surprise model is then created just to analyze the surprise for the red/green color opponents. The entire surprise system thus contains models of several feature channels, which can be taken as a composite in much the same way as the Itti and Koch saliency model does. A large number of features are supported within the surprise system, but we use here intensity, red/green and blue/yellow color opponencies, and Gabor orientations at 0°, 45°, 90° and 135°.

Fig. 1. With an image sequence, we run each frame through the surprise system. Feature maps such as orientation and color are extracted from the original images in the sequence. Each feature map is then processed by surprise for both space and time. In spatial surprise, a location's beliefs about what its value should be are described in the Gamma probability process with the hyperparameters α′p, β′p. This is compared with its surround (αp, βp) within the current frame in the surprise model (for details see Appendix A, Eqs. (16)–(18)). In temporal surprise, each new location (α′t, β′t) is compared with the same location in a previous old frame (αt, βt) using the same update and surprise computation as space. Both space and time have their own α and β hyperparameters. Spatial and temporal surprise for each location within the same frame are merged in a weighted sum (Σ). Additionally, the α′t, β′t and α′p, β′p are successively fed forward into models several times (six times in the current model) to give sensitivity to surprise at different time scales. Here the time scales are used to create sensitivity to recurring events at different frequencies. The complete surprise for each frame of the same feature is the product (Π) of merged spatial/temporal surprise for each time scale.
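The architecture of Fig. 1 can be sketched as a short loop over time scales, reusing surprise_update from the sketch above. The spatial term is reduced here to a crude stand-in, and the merge weights are arbitrary; this is only meant to show the feed-forward chaining and the final product across scales.

```python
import numpy as np

N_SCALES = 6  # successive surprise models per feature channel, as in Fig. 1

def spatial_surprise(m):
    """Crude placeholder for the center-surround spatial term of Fig. 1."""
    return np.abs(m - m.mean())

def combined_surprise(chain, feature_map, w_space=0.5, w_time=0.5):
    """One frame of surprise for a single feature channel.

    chain: list of N_SCALES dicts holding the (alpha, beta) arrays of the
    temporal models at each time scale. The merged space/time surprise at
    each scale feeds the next scale, and the frame's complete surprise is
    the product of the merged maps across all scales.
    """
    total = np.ones_like(feature_map)
    inp = feature_map
    for scale in chain:
        a, b, s_time = surprise_update(scale["alpha"], scale["beta"], inp)
        scale["alpha"], scale["beta"] = a, b
        merged = w_space * spatial_surprise(inp) + w_time * s_time
        total *= merged                      # product over time scales (Π)
        inp = merged                         # feed forward to the next scale
    return total
```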
Once surprise has been determined for each model (color, Gabor orientations, etc.) in the surprise system, each image in the RSVP sequence is automatically labeled in terms of surprise statistics (Fig. 2). If desired, the resolution of surprise can be as fine grained as individual pixels. This can, as mentioned, be further broken down along each feature type within the surprise system. However, advantage may be gained by using broader aggregate descriptions of surprise. That is, for each RSVP image, we here obtain a single maximum or minimum value for surprise. This is simply the maximum or minimum of the surprise values over all the model results for a frame. Maximum surprise is a useful metric if attention uses a maximum selector method (Didday, 1976) based on surprise statistics. The maximum valued location of surprise should be the location in the image most likely to be attended. By using this metric, we can assess the likelihood that attended regions will overlap in space from frame to frame, increasing their potential interaction (e.g. masking or enhancement).

Fig. 2. The top row of images are three from a sequence of 20 that were fed into the system one at a time. The bottom row shows the derived combined surprise maps, which give an indication of the statistical surprise the models find for each image in the sequence. Each map is a low-resolution image which has a surprise value at each pixel corresponding to a location in the original image. Basic statistics are derived from the maximum, minimum, mean and standard deviation of the surprise values in each surprise map, as well as the distance between maximum surprise among images, giving us five basic statistic types per image. As mentioned, the final surprise map is computed from each feature type independently, so while computing surprise we have access to the surprise for each feature. As an added note, the surprise for each independent image as expressed in the surprise map may look like a poor quality saliency map. However, what is surprising in each image frame is strongly affected by the images that precede it. This causes the spatial surprise component to be less visible in the output shown.

The overlap from this metric can only be assumed based on proximity. That is, the closer a maximum location is between frames, the more likely two surprising locations are to overlap. However, as a spatial metric this is somewhat limiting, since it does not give us a measure of the true spatial extent and distribution of surprise within a frame; for that we use the mean and standard deviation of surprise. So, within a frame, we use the point of maximum surprise to estimate the attended location, but mean and standard deviation are used as between-image statistics to show the power and tendency to overlap, which indicates to what extent natural images are competing (or co-operating) with each other. The three statistics (maximum, mean and standard deviation) combined will ideally yield a basic metric of the attentional capture an image in a sequence exhibits. Additionally, it will allow us to consider whether attention capture is competitive (Keysers & Perrett, 2002). It is then useful to analyze surprise by its overlap between images. We do this because attention capture may be more parasitic along similar feature dimensions or along more proximal spatial alignment. Put another way, interaction or interference between locations in image frames may be stronger if they are spatially overlapping or their feature statistics are highly similar. Closer spatial, temporal and feature proximity between two locations naturally lends itself to more interaction.
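The per-frame summary statistics described above are straightforward to compute from a surprise map. The sketch below (function and variable names are ours, for illustration) extracts the five basic statistics for one frame, including the spatial offset of the maximally surprising location from that of the previous frame.

```python
import numpy as np

def frame_statistics(smap, prev_max=None):
    """Summarize one surprise map (a 2-D array) into the five basic
    statistics: max, min, mean, standard deviation, and the distance of
    the maximally surprising location from the previous frame's maximum."""
    max_loc = np.unravel_index(np.argmax(smap), smap.shape)
    offset = 0.0
    if prev_max is not None:
        offset = float(np.hypot(max_loc[0] - prev_max[0],
                                max_loc[1] - prev_max[1]))
    stats = {"max": float(smap.max()), "min": float(smap.min()),
             "mean": float(smap.mean()), "std": float(smap.std()),
             "max_offset": offset}
    return stats, max_loc
```

Applied per feature channel and per frame, this yields the statistics later plotted in Figs. 4–6 and fed to the neural network of Section 4.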
2.3. Performance on RSVP and its relationship with surprise

The psychophysical experiments used here to exercise the system were described in detail in Einhäuser et al. (2007b). In brief: for the first experiment, eight human observers were shown 1000 RSVP sequences of 20 natural images. Half of the sequences contained a single target image, which was a picture of an animal. The observers' task was to indicate whether a sequence contained an animal or not. The presentation speed of 20 Hz yielded a comparably low performance, which was, however, still far above chance, with a desirable abundance of misses (no target reported despite being present) as compared to false alarms (target reported despite not being present). This large number of missed target sequences is particularly convenient when we account for the sequences in which no subject is able to spot the target, 29 in all. This is a large enough set to allow for statistical contrast with sequences on which subjects perform well.

Here we refine the utility of observer responses to assign a difficulty to all of the 500 target-present sequences. If one or none of the eight observers detected the target in a particular sequence, then it is classified as hard. If seven or eight of the human observers correctly spotted the target, the sequence is classified as easy. All other sequences are classified as intermediate. That is, if the target in a sequence of images can be spotted more often, then something about that sequence makes it easier. Note that this definition is different from Einhäuser et al. (2007b), where only targets detected by either all or none of the observers were analyzed. By the present definition, there were 73 hard, 188 easy and 239 intermediate sequences. As will be seen, partitioning image sequences in such a way yields a visible contrast between sequences labeled as easy and those labeled as hard. For visibility and clarity of illustration, intermediate sequences will be omitted from the graphs of the initial analysis, but will be included again in the neural network model that follows. This is reasonable since the intermediate sequences yield no additional insight into the workings of surprise, but instead make the important illustrated aspects more difficult for the reader to discern visually.
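In code, the labeling rule just described is a one-liner over the per-sequence hit counts (names are ours):

```python
def difficulty_label(hits, n_observers=8):
    """Label a target-present sequence from its observer hit count (0-8)."""
    if hits <= 1:
        return "hard"
    if hits >= n_observers - 1:
        return "easy"
    return "intermediate"
```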
3. Results

Initial experiments showed a cause and effect relationship between surprise levels in image sequences and performance of observers on the RSVP task. Specifically, as can be seen in the example in Fig. 3A, flanking a target image with images that register higher surprise makes the task more difficult for observers. Experimental support for a causal role of surprise is provided by experiment 2 in Einhäuser et al. (2007b), in which re-ordering image sequences according to simple global surprise statistics significantly affects recall performance.

To improve the power of prediction and augmentation in an RSVP task, we need to analyze the individual features to see if they yield helpful information beyond the combined surprise values such as those seen in Fig. 3B. This is because it has remained open whether prediction performance could be improved further by extracting refined spatio-temporal measures from the surprise data. In particular, does a feature-specific surprise system, rather than an aggregate measure over all features, further enhance prediction? Optimal feature biasing in a search (Navalpakkam & Itti, 2007) or feature priming might lend more power to certain types of features.

Analysis was carried out on all image sets which contained an animal target ("target-present" sequences). From each image in each sequence, surprise statistics were extracted for each feature channel, such as color opponencies and Gabor wavelet line responses. A mean surprise is computed by taking the average of the surprise values from all locations within a frame for any given feature channel. Mean surprise of individual Gabor wavelet channels exhibits the "W" or "M" shaped interference from flanking images to the target (Fig. 4) that was also characteristic for the full system. The M–W shape is itself very similar to forms seen in masking studies as a bi-modal enhancement of Type-B masking functions (Breitmeyer & Öğmen, 2006; Cox & Dember, 1972; Reeves, 1982; Weisstein & Haber, 1965). The important thing to notice is that the "W" pattern is seen with target enhancement while the "M" pattern is seen with target masking. The significance of the M–W jump pattern for each feature type is presented in Table 1.

Fig. 3. (A) Mean surprise produced by each image in an example RSVP sequence is shown. The sequence on the top is easy and all 8 of 8 observers manage to spot the animal target. By rearranging the image order, we can create a hard condition in which none of the eight observers spot the animal target. This was achieved by placing images in an order which produced surprise peaks both before and after the target image at ±50 ms, with relative surprise dips at ±100 ms, creating a strong "M" shape in the plot centered at the target (we also refer to this as a "W" shape in the case of easier sequences, or M–W for both). Note that the strength in the graphs shown is in the gain and not just the absolute value. A slight downward slope is observed since surprise can tend to relax over time given similar images. (B) The M–W shape can be seen more clearly when the gain of mean surprise for all hard and easy sequences combined is observed. In this illustration, the average over all gains for mean surprise is shown.

Fig. 4. Mean surprise for each frame per feature type is shown for RSVP sequences which observers found easy vs. sequences observers found hard. All sequences have been aligned similarly to Fig. 3 for analysis, so that the animal target offset is frame 0. (D) is labeled to show the properties of all the sub-graphs in Fig. 4. At −250 ms we see strong masking within this feature since surprise is very high for difficult sequences. −100 ms is the lag-1 sparing interval and as such we should expect to see no effect at this frame. At ±50 ms we see strong masking by elevated surprise; the effect at this interval is the most common. At +250 ms we see forward priming or enhancement since surprise is higher for this frame in easy sequences. All plots show the characteristic "W" and "M" shapes centered at the target, most visibly in (C) and (D). However, there is obviously more power for vertical orientation features. Additionally, the red/green opponent channel gives M–W results like the Gabor feature channels. However, blue/yellow does not exhibit the M–W pattern (at least not significantly). Oddly, it is notable that surprise is significantly different at both ±250 ms (five frames), far beyond what would be expected for simple bottom-up feature effects. Also, the lag sparing is seen in all sub-graphs at −100 ms, even in (C), highlighting the strength of the effect. Error bars have been Bonferroni corrected for multiple t-tests.

Table 1. The significance of the gain of surprise for the target frame (0 ms) over the flanking frames (−50 ms and +50 ms) is presented. For each feature type and the combined surprise map, the means of the two flanking images are combined and compared against the mean of the target. The combined mean surprise map statistics are the same as the ones presented in Fig. 3B. From the two means for the target and flankers, the t value is presented. Non-significant P values are marked "–". It is notable that mean, standard deviation and spatial offset of surprise spike for easy sequences at the target frame. The peak is strongest for easy sequences, but is also notable for hard sequences. While this helps to validate the presence of the M–W pattern, it also suggests that the "W" pattern for easy targets is stronger than the "M" pattern for hard targets.
Feature         Hard t      Hard P     Easy t      Easy P
Mean
  Combined      2.961297    <0.005     3.747544    <0.001
  Gabor 0°      1.595949    –          2.082780    <0.05
  Gabor 45°     0.290076    –          4.720356    <0.001
  Gabor 90°     3.785376    <0.001     4.855656    <0.001
  Gabor 135°    0.454917    –          4.137201    <0.001
  Red/green     0.267900    –          2.815474    <0.005
  Blue/yellow   1.152454    –          1.311841    –
Stdev
  Combined      1.704888    –          7.921573    <0.001
  Gabor 0°      2.158931    <0.05      2.709357    <0.005
  Gabor 45°     0.486151    –          8.054697    <0.001
  Gabor 90°     4.216503    <0.001     5.774448    <0.001
  Gabor 135°    1.161105    –          6.014868    <0.001
  Red/green     0.341483    –          6.308344    <0.001
  Blue/yellow   1.251205    –          2.577544    <0.01
Space
  Combined      0.385937    –          1.170551    –
  Gabor 0°      0.990172    –          1.088986    –
  Gabor 45°     0.281247    –          4.254909    <0.001
  Gabor 90°     1.056511    –          2.018851    <0.05
  Gabor 135°    0.162157    –          2.526581    <0.01
  Red/green     0.739866    –          2.390607    <0.05
  Blue/yellow   0.352024    –          0.036651    –

For most feature types, and in particular for easy sequences, mean surprise jumps significantly for the target frame when compared with the flanking frames. For targets which are difficult, there is also a significant drop in surprise compared with the flanking frames, but it is less profound.

As can be seen, Gabor surprise is increased before and after the target for "hard" as compared to "easy" sequences (Fig. 4). That is, surprise produced by edges in the image interferes with target detection. However, the difference in surprise between easy and hard targets is only significant for vertical orientations (P < .005 in both flanking images; Fig. 4C) and diagonal orientations (for the image following the target, P < .01 and .005, respectively; Fig. 4B and D) at time ±1 frame (i.e. ±50 ms) from the target frame. The vertical orientation results clearly show a stronger significant probability. Horizontal orientations do not exhibit significant effects (P ≥ .05 for any image in the sequence; Fig. 4A). We can also see that there is significance in the red/green color opponencies (for the image following the target, P < .01; Fig. 4E). [Footnote: Color opponencies are derived using the H2SV2 color space, which is described in Mundhenk et al. (2005), with source code at: http://ilab.usc.edu/wiki/index.php/HSV_And_H2SV_Color_Space.] It is notable that the effect for vertical lines not only holds for the immediate flankers of the target, but also for images that precede the target by as much as 250 ms (P < .005 for vertical and diagonal orientations). Conversely, there is evidence for long term backwards enhancement at 250 ms following the target, which is significant for the 135° orientation channel (P < .005; Fig. 4D). Thus, within the sequences viewed by the observers, vertical line statistics, when plotted and analyzed, seem to stand out more when observers are watching for animals among natural scenes. Interestingly, the significant jumps in surprise at ±250 ms illustrate that the effects of surprise seem too prolonged to be constrained to early visual cortex (Schmolesky et al., 1998).

Using the same analysis as for Fig. 4, a similar pattern can also be seen in Fig. 5 with the standard deviation of surprise within an image. However, the significance is concentrated on the target image itself. That is, if surprise values form a strong peak, which gives a greater spread of surprise values in an image, then observers are more likely to experience that a target is easy to find.
Again, the vertical line statistics yield a smaller probability value (for the target image, P < .001; Fig. 5C) than the horizontal line statistics (for the image after the target, P < .05; Fig. 5A), and red/green color opponents are significant as well (P < .05 for the target image and .01 for the image following the target; Fig. 5E). That the target image benefits most from a high standard deviation of surprise is due to strong peaks in surprise which give a wider spread of values. That is, the target image may draw more attention by having basic large peaks in surprise, while masking images gain from having a broader (larger area) generally high value. In more basic terms, attention is drawn to stronger singular surprise values, while it is blocked more effectively by flanking images with a wider surprising area. One may speculate that this reflects the locality of attention: wider surprise surfaces have a better chance to overlap with the target and are thus more likely to mask it. This hypothesis is borne out by observing the spatial information from the surprise metric. If the point of maximum surprise in the target image is offset more from the points of maximum surprise in the flanking images, observers find the target easier to spot (Fig. 6). Thus, to summarize, a more peaked surprise, with more spatial offset in the target image, seems to aid in the RSVP task. Additionally, different types of features, within a given sequence, may have more power than others. However, the red/green color opponent acts differently in this particular metric. It appears that it exhibits some attentional capture within the target image itself, directed away from the animal target. That is, the red/green channel shows distraction behavior within the target image itself if it is not blocked by the flanking frames.

Fig. 5. The standard deviation of surprise per frame, like the mean, also seems to be significant for hard vs. easy image sequences, with the M–W pattern. However, significance seems more concentrated on the target image rather than the distractor images. Again, vertical lines are more prominent. Error bars have been Bonferroni corrected for multiple t-tests.

Fig. 6. Here the mean spatial offset of maximum surprise from one frame to the next (illustrated in Fig. 2) is shown for both easy and hard sequences. Frame −5 (−250 ms) is left out since it is always zero in a differential measurement. An M–W pattern is visible, but much less so than in Figs. 4 and 5. It is notable that in three of the feature types, easy sequence targets have maximum surprise peaks further in distance from the peaks in the flanking images than hard sequences do. Again, as with the mean and standard deviation of surprise, the greatest power seems to be in the vertical orientation statistics. Red/green color results for space seem to be inverted, suggesting that it has priming spatial effects to some extent even though strong mean red/green surprise blocks, as seen in Fig. 4. Error bars have been Bonferroni corrected for multiple t-tests.

4. A neural network model to predict RSVP performance

So far we have considered statistical measures pooled over many sequences and obtained a hypothesis on how surprise modulates recall. If the hypothesis holds true, a basic computational system should be able to predict observer performance on the RSVP task. Ideally, the system predicts, given a sequence of images, how well observers will detect and report them. Additionally, the system can be agnostic to whether or not a target image is in a sequence, to determine if a sequence might have intrinsic memorability for any image in our given sequences. To do this, we added a test where we increased the difficulty for the system by testing whether it can predict the difficulty of a sequence without knowing which image in the sequence contains the target. We added this as a comparison to our standard system, which has knowledge of which frame contains the target. This also allows us to test the importance of knowing the target offset, which the evidence suggests is quite helpful in the prediction of observer performance.

Training our algorithm to predict performance on the RSVP task required three primary steps. Each step is outlined below.
These include the initial gathering of human observer data on the RSVP task, processing of the image sequences to obtain surprise values, and training using back propagation on a feed-forward neural network, which attempts to give an answer as to how difficult a given sequence should be for human observers.

4.1. Data collection

RSVP data were collected from two different experiments where image sequences were shown to eight observers (Einhäuser et al., 2007b). From the two experiments, we obtained a total of 866 unique sequences of images, which were randomly split into three balanced, equal sized sets of 288 sequences for training, testing and final validation (two sequences were randomly dropped to maintain equal sizing for all three sets). As mentioned, each sequence could be given a rank based on how many observers correctly identified the target as being present in the sequence. The rank of difficulty was a number from zero to eight, rather than just easy, intermediate and hard. This matched how many observers managed to spot the animal target. This is used as a training value, where the output of a network is its prediction as to how many observers in a new group of eight should correctly identify the target present in a given sequence. The control image sequences, the ones without animal targets, were not included for training since we are testing a computational model of bottom-up attention, which has no ability to actually identify an animal. Including non-target sequences in the system would only yield a prediction of how likely an observer is to spot a non-target, which from an attentional standpoint is the same as asking an observer to spot the target.

4.2. Surprise analysis

Each sequence was run through an algorithm which extracted the statistics of surprise or, for comparison, contrast. The contrast model (Parkhurst & Niebur, 2003), a common statistical metric which measures local variance in image intensity over 16 × 16 image patches, was used as a baseline measure. Thus, we created and trained two network models with the same image sets for contrast and surprise. The statistics were obtained by running the same set of 866 sequences through each of the systems (surprise or contrast), which returned statistics per frame.

4.3. Training using a neural network

Training was accomplished using a standard back-propagation trained feed-forward neural network (Fig. 7). The network was trained using gradient descent with momentum (Plaut, Nowlan, & Hinton, 1986).
The input was first reduced in dimension to about 1/3 the original size, from 528 features (listed in Fig. 7) to 170, using Principal Component Analysis (PCA) derived from the training set (Jolliffe, 1986). The input layer to the neural network had 170 nodes and the first hidden layer had 200 nodes. This was a number arrived at by trying different sizes in increments of 50 neurons (the validation set was not used in any way during fitting). A network with 150 hidden units has a very similar, but slightly larger, error, while 250 neurons gives an error which is a few percentage points larger. A second hidden layer had three nodes. For comparison, a two-layer network model was also tested. The output layer for all networks had one node. Tangential sigmoid transfer functions were used for the hidden layers. The output layer had a standard linear transfer function. Training was carried out for 20,000 epochs, which is sufficient for the model to converge without over fitting.

Training, testing and validation were done using three disjoint random sets with 288 sequence samples each. The training set was used for training the network model. The testing set augmented the training set to test different meta-parameters (e.g. number of neurons, PCA components). Only after all possible meta-parameter tuning was concluded was the validation set used to evaluate performance. All results shown are from the final validation set, which was not used during training or for adjusting any of the meta-parameters (network size, PCA components, etc.).

As noted, the final output from the neural network was a number from zero to eight, which represented how many observers it predicted should be able to correctly identify the target on the input sequence. That is, if of the eight observers four registered a 'hit' on that sequence, then the target value was four. A single scalar output (a number 0–8) is used because it seems to exhibit resistance to noise in our system compared with several different network types with eight or nine outputs (e.g. a binary code using a normalized maximum). To keep the output from the neural network consistent, the output value was clamped (bounded) and rounded so that the final value given was forced to be an integer between zero and eight.

Fig. 7. The network uses the mean, standard deviation and spatial offset for maximum surprise per feature type (the same which were illustrated in Figs. 4–6) for each image in a given RSVP sequence. This is taken from 22 input frames along eight different feature types. The features are then reduced in dimension using PCA. The output is the judgment about how difficult an image sequence should be for human observers.
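The training pipeline just described can be sketched in a few lines. This is a minimal stand-in using scikit-learn rather than the original implementation; the learning rate and momentum values are our assumptions, and X_train/y_train are hypothetical arrays holding the 528 per-sequence surprise statistics and the observer hit counts.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# 528 surprise statistics per sequence (22 frames x 8 channels x 3 stats)
# reduced to 170 principal components fit on the training set only.
pca = PCA(n_components=170)
Xtr = pca.fit_transform(X_train)            # X_train: (n_sequences, 528)

# Two hidden layers (200 tanh, 3 tanh) and a linear output, trained by
# gradient descent with momentum, mirroring the architecture in the text.
net = MLPRegressor(hidden_layer_sizes=(200, 3), activation='tanh',
                   solver='sgd', momentum=0.9, learning_rate_init=1e-3,
                   max_iter=20000)
net.fit(Xtr, y_train)                        # y_train: hit counts, 0-8

# Clamp and round the scalar output to an integer difficulty of 0-8.
pred = np.clip(np.rint(net.predict(pca.transform(X_val))), 0, 8)
```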
4.4. Validation and results of the neural network performance

Performance evaluation used a validation set with 288 sequences, which had not been used at any time during training or parameter setting. The idea was to try to predict how observers should perform on the RSVP task. This was done by running the validation set and obtaining the general predicted performance of observers on the RSVP task. As mentioned, this was a number from zero to eight for each image sequence, depending on how many people the trained network expected to correctly recognize the presence of a target animal in the sequence. To make this a meaningful output, we needed to define a bound for testing the null hypothesis. In this case, we should see that the network and surprise account for a large proportion of RSVP performance. The null hypothesis would predict that information available prior to image sequence processing would be at least as sufficient as surprise for the task.

To do this, we transformed the sequence prediction into a probability. The idea was to ascertain how well the network would perform if it was asked to tell us, for each image, whether it expected an observer to correctly identify a target. Basically, we will compare the results of different systems, such as surprise, to that of an ideal predictor. So, for instance, if the network output the number six for a sequence, then 75% of the time we would predict that an observer should notice the target (six out of eight times). Error is specified from this as how often the network will predict a hit compared with how often it should be expected to if it were an ideal predictor. This gives us an error with an applied utility, since it can be used directly to measure how often our network should correctly predict subject performance. An RMSE (Root Mean Square Error) gives us a good metric for network performance, but it does not provide a direct prediction of expected error against subjects in this case.

To compute error (see Fig. 8), we find the empirical error given the network output difficulty rating and how many hits should be expected given how the network classified the sequence's difficulty. This gives us an error of expected hits by the network compared with expected hits given the actual target value for all trials in the validation set. Here, the neural network performance is compared with that of the ideal predictor. As an example, for all trials in which 4 out of 8 observers correctly spotted the target animal (a difficulty of 4), our most reasonable expectation is that observers should have a 50% chance of spotting the target. Ideally, if the neural network predictor is accurate, then it should also predict that subjects will hit on those targets 50% of the time. If however it rates some of the "50%" targets as "100%" (gives them a difficulty of 8), then the model will be too optimistic and predict a greater than 50% chance of hits for the "50%" targets.

To do this (Fig. 8), the output from the network is sorted into an n × m confusion matrix of hits, where the rows i represent the actual network output y and the columns j represent the expected prior target output t. Thus, if a target in a sequence had six out of eight hits from the original human observers, this would give it a difficulty, and a target value, of six. If the neural network then ranked it as a five, it would incrementally go into cell bin (i, j) = (5, 6), where j is the prior observed difficulty of this target for subjects:

$$0 \le i \le 8,\qquad 0 \le j \le 8 \eqno(1)$$

The expected hits by the network per cell are given by the product of the number of hits binned in a cell, $N^{Y}_{ij}$, by the probability that the hits will be observed given the prior observed difficulty j, as P(j):

$$P(j) = j/8,\qquad 0 \le P(j) \le 1 \eqno(2)$$

Symbolically, we will also assume that:

$$i = j \;\Rightarrow\; P(i) = P(j) \eqno(3)$$

$N^{Y}_{ij}$ is how many sequences were binned by the network into each cell bin of the confusion matrix N at cell (i, j). Summing the columns of the expected hits gives the number of expected hits from the neural network given the prior determined difficulty class j and the network determined difficulty i. Note that this must be less than the total number of trials, which is 288.
Fig. 8. An error metric is derived as how well we should be able to predict a ninth new observer's RSVP performance for the image sequences we already have data for. We compute the predicted hits for each image from the output of the neural network, based on the difficulty rank it gave to each sequence. The final error is the difference between the network prediction and that of an ideal predictor.

$$P^{Y}(\mathrm{hits}\mid j) = \sum_{i=0}^{8} N^{Y}_{ij}\, P(i),\qquad 0 \le P^{Y}(\mathrm{hits}\mid j) \le 288 \eqno(4)$$

Also note that the sum of the expected hits must be less than (or equal to) the total number of trials:

$$0 \le \sum_{j} P^{Y}(\mathrm{hits}\mid j) \le 288 \eqno(5)$$

For comparison, we take the prior target data and compute its expected values given the difficulty as well:

$$P^{T}(\mathrm{hits}\mid j) = N^{T}_{j}\, P(j),\qquad 0 \le P^{T}(\mathrm{hits}\mid j) \le 288 \eqno(6)$$

A final sum error is derived by subtracting the number of expected hits as determined by the neural network y (4) from the expected number of hits from the real targets t (6). That difference is divided by the total number of possible hits, which yields the final empirical error E, the number of expected hits in error divided by the total number of trials:

$$E = \frac{\sum_{j}\left|P^{Y}(\mathrm{hits}\mid j) - P^{T}(\mathrm{hits}\mid j)\right|}{\sum_{j} N^{T}_{j}} \eqno(7)$$

If E is 0, then the system being tested performs the same as an ideal predictor. It should be noted that the ideal predictor in this case is highly ideal, since it is assumed to have perfect knowledge of sequence difficulty.

Baseline conditions for testing the null hypothesis were created by generating different sets of N. The condition $N_{naive}$ was created by putting the same number of hits into each bin. Thus, it is the output one would expect given a random uniform guess by the neural network. $N_{bayes}$ was created by binning hits at the mean (the mean number of hits for all sequences, which is 5.44 out of 8). That is, in the absence of additional information, we will tend to guess each sequence difficulty as the mean difficulty. Metaphorically speaking, this is particularly important since neural networks can learn to guess the mean value when they cannot understand how to produce the target value from the input features. It should be noted that $N_{bayes}$ is not naïve. It still uses prior information to determine the most likely performance for sequences in the absence of better evidence. It is used as a null hypothesis since it is the best guess for observer performance prior to image processing, but knowing the summary distribution of observer performance. Additionally, it presents a strong challenge to any model's performance since, as a maximum likelihood estimate, it has the potential to do very well depending on the nature of observer performance variance.

In addition to the surprise system, a contrast system was also trained in the same way. A range of error was computed by training both the contrast and surprise systems 25 times and computing the final validation error each time. This gives a representation of the error range produced by the initial random weight sets in the neural network training. The error is the standard 5% error over the mean and variance of the validation set error for 25 runs. It should be noted that we attempted to maintain as close a parity as possible between the training of the contrast system and the surprise system. However, there is enough difference between the two systems that there is room for broader interpretation of the performance comparison. Since we introduce the contrast system as a general metric, and this paper is not a model comparison paper, the results should not be interpreted as a critique of it.
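The error computation of Eqs. (1)–(7) reduces to a few lines over the confusion matrix. A minimal sketch with our own naming; rows index the network's rating i and columns the observed difficulty j, as in Eq. (4):

```python
import numpy as np

def empirical_error(conf):
    """Empirical error E of Eq. (7).

    conf[i, j] counts validation sequences that the network rated as
    difficulty i and that j out of 8 observers actually detected.
    """
    P = np.arange(9) / 8.0            # P(i) = i/8, Eq. (2)
    pred_hits = conf.T @ P            # P^Y(hits | j), Eq. (4)
    n_true = conf.sum(axis=0)         # N^T_j: trials per true class
    true_hits = n_true * P            # P^T(hits | j), Eq. (6)
    return np.abs(pred_hits - true_hits).sum() / n_true.sum()

# The baselines can be built the same way: N_naive spreads each column's
# counts uniformly over i, while N_bayes concentrates them at the mean
# difficulty (round(5.44)) before calling empirical_error.
```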
Fig. 9 shows the expected error from both the contrast system and the surprise system, compared with the worst possible predictions. It can be seen that the surprise system (77.1% correct; error bars shown in figure) performs better than the contrast system (74.1% correct), and that the differences are significant within the bounds of the neural network variance. The surprise and contrast data can be combined by feeding both data sets in parallel into the network. If performance of the two combined is better than the surprise system by itself, then it suggests that the contrast system contains extra useful information which the surprise system alone does not have. However, the combined model (76.6% correct) does not perform better than the original surprise system. Its performance appears slightly worse, but is not significantly different from the surprise system alone. Both the contrast and surprise systems perform much better than the worst possible naïve model (67.1% correct) and noticeably better than the null hypothesis Bayes model (71.7% correct).

It is important to note that while neither the surprise nor contrast systems perform perfectly, they should not be expected to do so, since they have no notion of the difficulty of actually recognizing the particular animal in a target image. They are only judging difficulty based on simple image statistics for attention in each sequence. Additionally, as mentioned, the task was made more difficult by withholding knowledge of the position of the target in the sequence from the training system. It can be seen that when the target frame is known, the model performs at its best, at 79.5% correct.

We can put this performance in perspective by using the naïve model as a baseline. We compute a standard performance increase $f'_{bayes}$ as the difference between the performance of the naïve model $f_{naive}$ and the Bayes model $f_{bayes}$:

$$f'_{bayes} = f_{bayes} - f_{naive} \eqno(8)$$

The same metric for the surprise system is:

$$f'_{surprise} = f_{surprise} - f_{naive} \eqno(9)$$

The gain of the surprise system over the Bayes model is computed as:

$$g_{surprise} = \frac{f'_{surprise}}{f'_{bayes}} \eqno(10)$$

This tells us that the performance increase of the surprise system is 2.7 times that of the Bayes model.
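As a quick arithmetic check, substituting the accuracies reported above (naïve 67.1%, Bayes 71.7%, and the best surprise system at 79.5% with the target frame known) into Eqs. (8)–(10) reproduces this figure:

$$g_{surprise} = \frac{79.5 - 67.1}{71.7 - 67.1} = \frac{12.4}{4.6} \approx 2.7$$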
In addition to the basic metric of surprise, when we combine it with neural network learning, it can separate RSVP sequences of natural images that are difficult for observers from those that are easy. This provides two major contributions. The first is that surprise has been shown to deliver a good metric for visual detection and recall experiments. Thus, a variety of vision experiments can be conducted which use surprise to gain insight into intrinsic attentional statistics, especially for natural images. For instance, as we have done, surprise was successfully manipulated by reordering images in sequences to modulate recall (Einhäuser et al., 2007b). Here we have extended this work by creating a non-relative metric that can compare difficulty for sequences with disjoint sets of target and distractor images. That is, our original work was only able to tell us whether a reordered sequence was relatively more difficult than its original ordering. The current work gives us a parametric and absolute measure based on how many observers should be able to spot a target. The second contribution is that a system can be created to predict performance on a variety of detection tasks. In this article, we focused on the specific RSVP task of animal detection. However, the stimulus which we used is general enough to suggest that the same procedure could be used for many other types of images and sequences.

Fig. 9. (A) The expected error in performance prediction for the RSVP task. Error bars represent the expected range of performance for the neural networks trained on the task. Both the surprise system and the contrast model perform much better than the null hypothesis would predict. Combining the image surprise data and contrast data does not yield better results. (B) A three-layer network performs slightly better than a two-layer network. Additionally, as expected, when the frame offset for the target is provided to the network, performance increases further since it can now look for the characteristic M–W patterns in the flanking images as seen in Figs. 4–6. Note that (A) and (B) show the same type of information and can be directly compared.

5.1. Patterns of the two-stage model

While the interaction between space, time and features is not completely understood, the interactions within these domains are well illustrated by our experimental results. Surprise in flanking images aligned spatially and along feature dimensions blocks the perception of a target, particularly in a window of ±50 ms around the target image. Additionally, the time course is consistent at least with respect to Gabor statistics, which show the M–W pattern of attentional capture and gating. However, we must note that this study does not explicitly establish a constant time course for any image sequence, though there is some expectation that the time course is bounded, based on the reviewed masking, RSVP and attentional blink experiments.

With significant peaks at widely divergent times, our results are compatible with the two-stage model described by Chun and Potter (1995) (for a conceptual diagram, see Fig. 10).

Fig. 10. The top row is a hypothetical summary of Figs. 4 and 5. Flanking images block the target image based on an increase in mean surprise within 100 ms and show the characteristic M–W pattern. A strong tail is observed as much as 250 ms prior. Notably, the flat portion at 100 ms is at the lag sparing interval for attentional blink. This suggests that a two-stage model similar to Chun and Potter (1995) is at work. In the first stage, fierce competition creates a mask whose power is given by an image's surprise. Wider masks are more likely to block information, while stronger surprise in an image is more likely to break through a prior blocking mask even if it is wide. In the later stages, parts of images which have passed the first stage are assembled into a coherent object or set of objects. Once the assembly is coherent enough, subordinate information from new images can help prime it and make it more distinguishable. However, information from the subordinate images will be absorbed if the second stage has reached a critical coherence. This can be summarized by stating that the effects from surprise in the first stage are a result of direct competition and blocking, while the effects seen in the latter stages are a side effect of priming, absorption or integration.

In the ±50 ms surrounding the target, it is easier for images with targets of higher surprise to directly interfere with each other. Since targets can both forward and backward mask at this interval, competition to pass the first stage is more direct. Already present targets can be suppressed by new, more surprising ones, or can block less surprising targets from entering. Then we see a period of about 100 ms where the target image is, in a sense, safer; this is at the same point in time that lag sparing is observed in attentional blink studies. After that (≥200 ms), the data show that forward masking can block the perception of the target. Conversely, the data indicate that backward masking does not occur at this interval; instead we see enhancement.

The results seen here might reflect the manner in which the visual cortex gates and pipelines information, and may be thought of as a two or three stage model as seen in Fig. 10 (stages two and three may be the same stage, but we parse them here for illustrative purposes).

(1) Very surprising stimuli capture attention from other less surprising stimuli at ±50 ms. Additionally, competition is spatially local, allowing for some types of parallel attention to occur. That is, some image features can get past the first stage in parallel if they are far enough apart from each other. Parallelism, while not illustrated by the data presented here, is expected based on the reviewed studies on split attention (Li et al., 2002; McMains & Somers, 2004; Rousselet et al., 2002, 2004).

(2) Stimuli are pipelined at 100 ms to the next stage of visual processing. This leads to a lag sparing effect, the safety period observed in Figs. 4 and 5. This is also observed in the literature we have reviewed, which illustrates lag-1 sparing in RSVP (Raymond et al., 1992) or a relaxed period of masking with a Type-B function (Breitmeyer, 1984; Breitmeyer & Öğmen, 2006; Cox & Dember, 1972; Hogben & Di Lollo, 1972; Reeves, 1982; Weisstein & Haber, 1965). The pipeline may do further processing, but is highly efficient and, most importantly, avoids conflicts.

(3) After 150 ms, a new processing bottleneck is encountered.
Unlike (1), which is a strict attention gate meant to limit input into the visual system, we suggest, as others have (Chun & Potter, 1995; Sperling et al., 2001), that (3) is a bottleneck by virtue of its intrinsically limited processing resources. Conflict can occur if an image that arrived first is not finished processing when a second image arrives. The situation is made more difficult for the second image if the first image has sequestered enough resources to prevent the second image from processing. The second image may be integrated or blocked while at the same time enhancing the first image into the third stage. That is, an image or set of targets may dominate visual resources after ∼250 ms, forcing new input stimuli into a subordinate role. This later stage may also strongly resemble several current theories of visual processing (Dehaene, Changeux, Naccache, Sackur, & Sergent, 2006; Lamme, 2004) if it is thought of in terms of a global workspace or as visual short term memory (Shih & Sperling, 2002).

To further refine what we have said, the first stage is selective for images based more strongly on surprise and is not strictly order-dependent, which is why Figs. 4 and 5 show the strong M–W pattern. The utility of this is perhaps to triage image information so that the most important nuggets are transported to later stages, which have limited processing capability. The third stage is selective based on more complex criteria such as order, which is why we observe the asymmetry at ±250 ms in Figs. 4 and 5. However, evidence also suggests that if a second target is salient enough, it may be able to co-occupy later stages of visual processing with the first target, allowing for detection of both targets in RSVP (Shih & Reeves, 2007). Importantly, the relationship between the first target and the second target appears to be asymmetric in the later stages, as was illustrated in Fig. 4 by the surprise system at the ±250 ms extremes.

Surprise and attention in the first stage do not preclude later stages from also containing attention mechanisms. An attentional searchlight (Crick, 1984; Crick & Koch, 2003) theoretical model could allow for attention at many levels. That is, as information is processed, refined and sent to higher centers of processing, additional attention processing can allow a continuation of the refinement. Selection can be pushed from bottom-up or pulled from top-down processes. Two-stage models which place attention in a critical role in later stages have been proposed. For instance, Ghorashi, Smilek, and Di Lollo (2007) suggest that lag sparing in RSVP attentional blink is due to a pause while the visual system is configured for search in the latter stages of processing. While it is unclear from the research presented whether a pause causes lag sparing, the significance of specific features we have illustrated from surprise at ±250 ms might lend credence to feature-specific attention in later stages.

5.2. Information necessity, attention gating and biological relevance

The results show that the gating of information in the visual cortex during RSVP may be based on a sound statistical mechanism. That is, target information which carries the most novel evidence is of higher value and is what is selected to be processed in a window of 100 ms. Target information separated by longer time intervals may also be indirectly affected by surprise because of interference in later stages of processing.
This happens if more than one overlapping target is able to enter later visual processing at the same time. Since surprise effects are indirect at this stage, surprise may never be able to account for all RSVP performance effects.

The mechanism for surprise in the human brain, we believe, is based on triage. Target information is held briefly during early processing, allowing a second set of target features in the following stages to also be processed. Masking at the shortest interval happens when the two sets of image data compete against each other. If the new target information is more surprising, it usurps the older target information (attention capture); otherwise it is blocked. Masking in this paradigm depends on competition within a visual pipeline. Its purpose is to create a short competition, which may even introduce a delay, for the sake of keeping the higher visual processes from becoming overloaded. After 50 ms, if target information is still intact, it moves beyond this staging area and out of the reach of competition from very new visual information. Masking and the loss of target detection after that point might be based on how the visual system constructs an image from the data it has. After 150 ms, image data is lost in assembly by virtue of an integrative mechanism in a larger working memory or global workspace.

Interestingly, if surprise does account well for the Type-B masking effects many have observed, it suggests that masking in the short interval is not the result of a flaw in the visual system, but rather of an internally evolved mechanism to directly filter irrelevant stimuli. That is, in the ±50 ms bounding a target, masking is an effect of an optimal statistical process. The absence of masking, if it were observed, would indicate that too much information is being processed by the visual system.

5.3. Generalization of results

Returning to the topic of feed-forward neural networks, another facet to mention is the limits of generalization for networks such as the one we used here. Validation was carried out using images of the same type. As such, we do not necessarily know that the network will perform the same for a completely different type of image set; as an example, one might imagine target pictures of rocks with automobile distractors. However, evidence suggests that at less than 150 ms only coarse coding is carried out by the visual system (Rousselet et al., 2002). Thus, we would expect generalization to other types of images so long as coarse information statistics are similar. Additionally, the input to the neural network was extremely general and contained a relative lack of strong identifying information. That is, we only used whole image statistics of mean, standard deviation and spatial offset for surprise. It is quite conceivable that images of other types of targets and distractors may very well exhibit the same basic statistics and, as a result, the neural network should generalize.

5.4. Comparison with previous RSVP model prediction work

An important item to note here is the difference between the system we have presented and the one presented by Serre et al. (Serre, Oliva, & Poggio, 2007). The primary difference is that our system is based on temporal differentiation across images in a sequence, whereas the model by Serre et al. is an instantaneous model. That is, it has no explicit notion of the procession of images in a sequence which, as we have shown, affects performance in RSVP. This is particularly true since the same images in a different order around the target can affect performance on the RSVP task (Einhäuser et al., 2007b). However, the RSVP task performed by the Serre model is sufficiently different that any real cross-comparison between the two models is difficult on anything other than a superficial level. Indeed, the Serre model has much better target feature granularity, which perhaps gives it better insight into feature interactions within an image. Thus, one may be able to produce an even better RSVP predictor by combining the Serre model with the one we presented here, thereby exploiting both sequence and fine-granularity aspects.

5.5. Network performance

It is also interesting to note that with the neural network, the contrast system performed almost as well as the surprise system. However, when the surprise and contrast data are combined, performance does not increase, suggesting that the contrast system does not contain any useful new information that the surprise system does not already have. That is, given that the combined surprise/contrast system performs almost exactly the same as the surprise system by itself, and that the contrast system still performs much better than chance, the surprise system appears to intrinsically contain the contrast system's information (indeed, intensity contrast, computed with center-surround receptive fields, is one of the features of the surprise system and is similar to, though not exactly the same as, the contrast system). Otherwise, we should expect to see an improvement in the performance of the combined surprise/contrast system.

Another network performance issue that should be mentioned is that in one condition the neural network was hindered by forcing it not to know the target offset in each sequence. This was done to see if a sequence of images could be labeled with a general metric, so that we could conceivably gauge the recall of any image in the sequence, not just the target. However, if knowing the target offset frame is a luxury one can afford, then we have illustrated that prediction performance can be improved even further. That is, knowing in advance which frame is the target frame aids in prediction, but the network can still make a prediction, though a less accurate one, otherwise. This is to be expected given the M–W shape for surprise as seen in Fig. 10, where surprise from the flanking images yields the most information about RSVP performance on the target.

5.6. Applications of the surprise system

In addition to the theoretical implications of our results, there are also applications which can be derived. For instance, the surprise analysis could be used to help predict recall in movies, commercials or – by adjustment of timescale – even magazine ads. Since it gives us a good metric of where covert attention is expected, it may also have utility in optimizing video compression to alter quality based on where humans are expected to notice things (Itti, 2004). As long as images are viewed in a sequence, the paradigm should hold. Even large static images, whose scanning requires multiple eye movements, effectively provide a stimulus sequence in retinal coordinates; this will be an interesting issue for further research.

6. Conclusion

Surprise is a useful metric for predicting the performance of target recall in an RSVP task when natural images are used. Additionally, in past experiments, we have shown that this can be used to alter image sequences and so change the performance of observers. An improved system of prediction can be made using a feed-forward neural network which integrates spatial, temporal and feature concerns, and prediction can even be performed without knowing the offset of the target in the sequence. Type-B and spatial masking effects are revealed by surprise, which provides further evidence that RSVP performance is related to masking and attentional capture. This also suggests that masking and attentional capture are accounted for by a statistical process, perhaps as a triage mechanism for visual information in the first stage of a two-stage model of visual processing.

Appendix A

Given a set M of possible models (or hypotheses), an observer with prior distribution P(M) over the models, and data D such that:³

$$P(M \mid D) = \frac{P(D \mid M) P(M)}{P(D)} \qquad (11)$$

Notably, the data can in this case be many things, such as a pixel value or the value of the output of a feature detector. We are also not constrained to vision alone; a variety of data is useful. To quantitatively measure how much effect the data observation (D) had on the observer's beliefs (M), one can simply use some distance measure d between the two distributions P(M|D) and P(M). This yields the definition of surprise as (Itti & Baldi, 2005, 2006):

$$S(D, M) = d\left[ P(M \mid D), P(M) \right] \qquad (12)$$

One such measure is the Kullback–Leibler (KL) distance (Kullback & Leibler, 1951), generically defined as:

$$KL = -\int p(x) \ln \frac{\tilde{p}(x)}{p(x)} \, dx \qquad (13)$$

By combining (13) and (11) we get the surprise given the data and the model as:

$$S(D, M) = \log P(D) - \int_{M} P(M) \log P(D \mid M) \, dM \qquad (14)$$

Surprise can have as its base a variety of statistical processes, so long as the KL distance can be derived. In this work, we used a Gamma/Poisson distribution since it models neural firing patterns (Ricciardi, 1995) and since it gives a natural probability over the occurrence of events. We can define the Gamma process as:

$$P(M(\lambda)) = \gamma(\lambda; \alpha, \beta) = \frac{\lambda^{\alpha - 1} \beta^{\alpha} e^{-\beta \lambda}}{\Gamma(\alpha)} \quad \text{for } \lambda > 0 \qquad (15)$$

and its corresponding KL distance as:

$$KL\left( \gamma(\lambda; \alpha', \beta'), \gamma(\lambda; \alpha, \beta) \right) = -\alpha' + \alpha \log \frac{\beta'}{\beta} + \log \frac{\Gamma(\alpha)}{\Gamma(\alpha')} + \beta \frac{\alpha'}{\beta'} + (\alpha' - \alpha) \Psi(\alpha') \qquad (16)$$

Here we note that Γ is the gamma function (not to be confused with the gamma probability distribution given in (15)) and Ψ is the digamma function, which is the log derivative of the gamma function Γ. Surprise itself in this case is the KL distance. To make this model work, we need to know what α and β are and how to update them to get α′ and β′. Importantly, in this notation, α′ is simply the update of α at the next time step. Given some input data d, such as a feature filter response, we can derive α′, which is abstractly analogous to the mean in the normal distribution. It is given as:

$$\alpha' = \alpha \cdot f + \frac{d}{\beta} \qquad (17)$$

Here f is a decay (forgetting) term which has been set to 0.7 based on earlier work that suggests this is a good value (Itti & Baldi, 2006). To reiterate, this assumes that the underlying process is Poisson. Otherwise one must use a more complex method to compute α′, such as the Newton–Raphson method (Choi & Wette, 1969), or use a completely different process for computing surprise, such as a Gaussian process.

The value of β′, which is abstractly analogous to the standard deviation in the normal distribution, can be computed as:

$$\beta' = \beta \cdot f + 1 \qquad (18)$$

In this formulation, β is independent of new data for temporal surprise, but not for spatial surprise, which computes the β term from the image surround. (18) is sufficient for the analysis of RSVP sequences since model evidence collection can start in a naïve state. Otherwise, for long sequences such as movies, β should be allowed to float based on longer-term changes in the state of evidence.

To deal with different time scales which have different gains, the results from a model are fed forward into successive models from 1 to i, where i is the ith model out of n time scales (1 ≤ i ≤ n). In this case, six time scales are used. For each scale we will compute a different α′ and thus a different surprise value. For each time scale, this takes the form of:

$$\alpha' = \alpha_2 \cdot f + \frac{\alpha_1 \cdot f + \frac{d}{\beta_1}}{\beta_2} \cdots \qquad (19)$$

Over a series of images, surprise can be computed in temporal and spatial terms. Temporal surprise is a straightforward computation with updates (17) and (18), where d is the value of a single location in an image and α is the hyperparameter computed on the previous frame as α′. Spatial surprise is computed with d also being the value of a single location, but α and β are treated like the Gamma mean and variance given the values of the surrounding locations in an image. This makes spatial surprise the surprising difference between a location and its surround. Given the data from m locations in the surround D, locations j (j ∈ D) and a Difference of Gaussians (DoG) weighting kernel w, we compute a weight sum W of the kernel as:

$$W = \sum_{j=1}^{m} w_j \qquad (20)$$

The spatial variance β̄ is computed as:

$$\bar{\beta} = \frac{1}{W} \cdot \sum_{j=1}^{m} \beta_j \cdot w_j \qquad (21)$$

Since this operates at multiple time scales, all βj are initialized to 1 for the first time scale, but are set to the previous time scale's β′ value thereafter. The expected value ᾱ at an image location, computed from β̄, is:

$$\bar{\alpha} = \frac{\bar{\beta}}{W^2} \cdot \sum_{j=1}^{m} \alpha_j \cdot w_j \qquad (22)$$

Again, αj is initialized as the data value for the surrounding image locations in the first time scale, but is set to the previous value as in (19) after that. The values ᾱ and β̄ can then be plugged into (17) and (18) for α and β respectively, which allows surprise to be computed as in (16) for space at each location in an image. Total surprise for a feature channel S at each location is then computed from the product of the sums of spatial (p) and temporal (t) surprise across all time scales n:

$$S = \left( \left( S_{p_1} + S_{t_1} \right) \cdot \ldots \cdot \left( S_{p_n} + S_{t_n} \right) \right)^{1/(3 \cdot n)} \qquad (23)$$

Each surprise map S for each feature type can be combined into a single surprise map by taking a weighted average over all the surprise maps. The combined surprise map is created from individual surprise maps in the same way we have done in the past with saliency maps (Itti & Koch, 2001; Itti, Koch, & Braun, 2000; Itti, Koch, & Niebur, 1998) to create a combined saliency map.

³ A general surprise framework is downloadable for Matlab from http://sourceforge.net/projects/surprise-mltk. The full vision code in gcc is downloadable from http://ilab.usc.edu.
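To make these update rules concrete, the following minimal Python sketch computes single-location temporal surprise under the Gamma/Poisson model of Eqs. (15)–(18). It is our own illustrative reconstruction, not the released Matlab or gcc code pointed to in footnote 3; the function names, the naive starting hyperparameters and the numerical digamma stand-in are all assumptions of the sketch:

```python
import math

def digamma(x, eps=1e-6):
    # Numerical digamma (log-derivative of the gamma function); a
    # dependency-free stand-in for scipy.special.psi.
    return (math.lgamma(x + eps) - math.lgamma(x - eps)) / (2.0 * eps)

def gamma_kl(a_new, b_new, a, b):
    """KL distance between Gamma(a', b') and Gamma(a, b), per Eq. (16)."""
    return (-a_new
            + a * math.log(b_new / b)
            + math.lgamma(a) - math.lgamma(a_new)
            + b * a_new / b_new
            + (a_new - a) * digamma(a_new))

def temporal_surprise(samples, f=0.7, a=1.0, b=1.0):
    """Surprise of each new sample at one image location, Eqs. (16)-(18).

    samples: feature-filter responses at a single location over frames.
    f:       the decay (forgetting) term, 0.7 as in the text.
    a, b:    naive starting hyperparameters (an assumption of this sketch).
    """
    out = []
    for d in samples:
        a_new = a * f + d / b   # Eq. (17): mean-like update from the data
        b_new = b * f + 1.0     # Eq. (18): variance-like update, data-free
        out.append(gamma_kl(a_new, b_new, a, b))
        a, b = a_new, b_new
    return out

# Toy usage: a steady stream with one outlier frame yields a surprise spike.
print(temporal_surprise([5.0, 5.0, 5.0, 20.0, 5.0, 5.0]))
```

Spatial surprise would follow the same pattern, with the surround statistics ᾱ and β̄ from Eqs. (20)–(22) substituted for the running α and β.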
References

Breitmeyer, B. G. (1984). Visual masking. New York: Oxford University Press.
Breitmeyer, B. G., & Öğmen, H. (2006). Visual masking: Time slices through conscious and unconscious vision. New York: Oxford University Press.
Choi, S. C., & Wette, R. (1969). Maximum likelihood estimation of the parameters of the gamma distribution and their bias. Technometrics, 11(4), 683–690.
Chun, M. M., & Potter, M. C. (1995). A two-stage model for multiple target detection in rapid serial visual presentation. Journal of Experimental Psychology: Human Perception and Performance, 21, 109–127.
Cox, S., & Dember, W. N. (1972). U-shaped metacontrast functions with a detection task. Journal of Experimental Psychology, 95, 327–333.
Crick, F. (1984). Function of the thalamic reticular complex: The searchlight hypothesis. PNAS, 81, 4586–4590.
Crick, F., & Koch, C. (2003). A framework for consciousness. Nature Neuroscience, 6(2), 119–126.
Dehaene, S., Changeux, J.-P., Naccache, L., Sackur, J., & Sergent, C. (2006). Conscious, preconscious, and subliminal processing: A testable taxonomy. Trends in Cognitive Sciences, 10(5), 204–211.
Di Lollo, V., Enns, J. T., & Rensink, R. A. (2000). Competition for consciousness among visual events: The psychophysics of reentrant visual processes. Journal of Experimental Psychology: General, 129(4), 481–507.
Didday, R. L. (1976). A model of visuomotor mechanisms in the frog optic tectum. Mathematical Biosciences, 30, 169–180.
Duncan, J. (1984). Selective attention and the organization of visual information. Journal of Experimental Psychology: General, 113(4), 501–517.
Duncan, J., Ward, R., & Shapiro, K. (1994). Direct measurement of attentional dwell time in human vision. Nature, 369(26), 313–315.
Einhäuser, W., Koch, C., & Makeig, S. (2007a). The duration of the attentional blink in natural scenes depends on stimulus category. Vision Research, 47, 597–607.
Einhäuser, W., Mundhenk, T. N., Baldi, P., Koch, C., & Itti, L. (2007b). A bottom-up model of spatial attention predicts human error patterns in rapid scene recognition. Journal of Vision, 7(10), 1–13.
Evans, K. K., & Treisman, A. (2005). Perception of objects in natural scenes: Is it really attention free? Journal of Experimental Psychology: Human Perception and Performance, 31(6), 1476–1492.
Ghorashi, S. M. S., Smilek, D., & Di Lollo, V. (2007). Visual search is postponed during the attentional blink until the system is suitably reconfigured. Journal of Experimental Psychology: Human Perception and Performance, 33(1), 124–136.
Hoffman, J. E., Nelson, B., & Houck, M. R. (1983). The role of attentional resources in automatic detection. Cognitive Psychology, 15(3), 379–410.
Hogben, J., & Di Lollo, V. (1972). Effects of duration of masking stimulus and dark interval on the detection of a test disk. Journal of Experimental Psychology, 95(2), 245–250.
Itti, L. (2004). Automatic foveation for video compression using a neurobiological model of visual attention. IEEE Transactions on Image Processing, 13(10).
Itti, L., & Baldi, P. (2005). A principled approach to detecting surprising events in video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 631–637).
Itti, L., & Baldi, P. (2006). Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems (NIPS) (Vol. 19, pp. 547–554). MIT Press.
Itti, L., & Koch, C. (2001). Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3), 194–203.
Itti, L., Koch, C., & Braun, J. (2000). Revisiting spatial vision: Towards a unifying model. JOSA-A, 17(11), 1899–1917.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Keysers, C., & Perrett, D. I. (2002). Visual masking and RSVP reveal neural competition. Trends in Cognitive Sciences, 6(3), 120–125.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.
Lamme, V. A. F. (2004). Separate neural definitions of visual consciousness and visual attention; a case for phenomenal awareness. Neural Networks, 17, 861–872.
Lee, S.-H., & Blake, R. (2001). Neural synergy in visual grouping: When good continuation meets common fate. Vision Research, 41, 2057–2064.
Li, F. F., VanRullen, R., Koch, C., & Perona, P. (2002). Rapid natural scene categorization in the near absence of attention. PNAS, 99(14), 9596–9601.
Li, W., & Gilbert, C. D. (2002). Global contour saliency and local colinear interactions. Journal of Neurophysiology, 88, 2846–2856.
Li, Z. (1998). A neural model of contour integration in the primary visual cortex. Neural Computation, 10, 903–940.
Li, Z. (2002). A saliency map in primary visual cortex. Trends in Cognitive Sciences, 6(1), 9–16.
Mack, A., & Rock, I. (1998). Inattentional blindness. Cambridge, MA: MIT Press.
Maki, W. S., & Mebane, M. W. (2006). Attentional capture triggers an attentional blink. Psychonomic Bulletin and Review, 13(1), 125–131.
Marois, R., Yi, D.-J., & Chun, M. M. (2004). The neural fate of consciously perceived and missed events in the attentional blink. Neuron, 41, 465–472.
McMains, S. A., & Somers, D. C. (2004). Multiple spotlights of attentional selection in human visual cortex. Neuron, 42, 677–686.
Mounts, J. R. W., & Gavett, B. E. (2004). The role of salience in localized attentional interference. Vision Research, 44, 1575–1588.
Mundhenk, T. N., Everist, J., Landauer, C., Itti, L., & Bellman, K. (2005). Distributed biologically based real time tracking in the absence of prior target information. In SPIE (Vol. 6006, pp. 330–341). Boston, MA.
Mundhenk, T. N., & Itti, L. (2005). Computational modeling and exploration of contour integration for visual saliency. Biological Cybernetics, 93(3), 188–212.
Mundhenk, T. N., Landauer, C., Bellman, K., Arbib, M. A., & Itti, L. (2004). Teaching the computer subjective notions of feature connectedness in a visual scene for real time vision. In SPIE (Vol. 5608, pp. 136–147). Philadelphia, PA.
Navalpakkam, V., & Itti, L. (2006). Optimal cue selection strategy. Advances in Neural Information Processing Systems (NIPS), 987–994.
Navalpakkam, V., & Itti, L. (2007). Search goal tunes visual features optimally. Neuron, 53(4), 605–617.
Neisser, U., & Becklen, R. (1975). Selective looking: Attending to visually specified events. Cognitive Psychology, 7(4), 480–494.
Parkhurst, D. J., & Niebur, E. (2003). Scene content selected by active vision. Spatial Vision, 16(2), 125–154.
Plaut, D., Nowlan, S., & Hinton, G. E. (1986). Experiments on learning by back propagation. Technical Report CMU-CS-86-126. Pittsburgh, PA: Department of Computer Science, Carnegie Mellon University.
Potter, M. C., Staub, A., & O'Connor, D. H. (2002). The time course of competition for attention: Attention is initially labile. Journal of Experimental Psychology, 28(5), 1149–1162.
Raab, D. H. (1963). Backward masking. Psychological Bulletin, 60(2), 118–129.
Raymond, J. E., Shapiro, K. L., & Arnell, K. M. (1992). Temporary suppression of visual processing in an RSVP task: An attentional blink? Journal of Experimental Psychology: Human Perception and Performance, 18, 849–860.
Reeves, A. (1980). Visual imagery in backward masking. Perception and Psychophysics, 28, 118–124.
Reeves, A. (1982). Metacontrast U-shaped functions derive from two monotonic functions. Perception, 11, 415–426.
Ricciardi, L. M. (1995). Diffusion models of neuron activity. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 299–304). Cambridge, MA: The MIT Press.
Rousselet, G. A., Fabre-Thorpe, M., & Thorpe, S. J. (2002). Parallel processing in high-level categorization of natural images. Nature Neuroscience, 5(7), 629–630.
Rousselet, G. A., Thorpe, S. J., & Fabre-Thorpe, M. (2004). Processing of one, two or four natural scenes in humans: The limit of parallelism. Vision Research, 44, 877–894.
Schmolesky, M. T., Wang, Y., Hanes, D. P., Thompson, K. G., Leutgeb, S., Schall, J. D., et al. (1998). Signal timing across the macaque visual system. Journal of Neurophysiology, 79(6), 3272–3278.
Sergent, C., Baillet, S., & Dehaene, S. (2005). Timing of the brain events underlying access to consciousness during the attentional blink. Nature Neuroscience, 8(10), 1391–1400.
Serre, T., Oliva, A., & Poggio, T. (2007). A feedforward architecture accounts for rapid categorization. PNAS, 104(15), 6424–6429.
Shih, S.-I. (2000). Recall of two visual targets embedded in RSVP streams of distractors depends on their temporal and spatial relationship. Perception and Psychophysics, 62(7), 1348–1355.
Shih, S.-I., & Reeves, A. (2007). Attention capture in rapid serial visual presentation. Spatial Vision, 20(4), 301–315.
Shih, S.-I., & Sperling, G. (2002). Measuring and modeling the trajectory of visual spatial attention. Psychological Review, 109(2), 260–305.
Sperling, G., Reeves, A., Blaser, E., Lu, Z.-L., & Weichselgartner, E. (2001). Two computational models of attention. In J. Braun, C. Koch, & J. L. Davis (Eds.), Visual attention and cortical circuits (pp. 177–214). Cambridge, MA: MIT Press.
Tanaka, Y., & Sagi, D. (2000). Attention and short-term memory in contrast detection. Vision Research, 40, 1089–1100.
VanRullen, R., & Koch, C. (2003a). Competition and selection during visual processing of natural scenes and objects. Journal of Vision, 3, 75–85.
VanRullen, R., & Koch, C. (2003b). Visual selective behavior can be triggered by a feed-forward process. Journal of Cognitive Neuroscience, 15(2), 209–217.
VanRullen, R., Reddy, L., & Koch, C. (2004). Visual search and dual tasks reveal two distinct attentional resources. Journal of Cognitive Neuroscience, 16(1), 4–14.
Visser, T. A. W., Zuvic, S. M., Bischof, W. F., & Di Lollo, V. (1999). The attentional blink with targets in different spatial locations. Psychonomic Bulletin and Review, 6(3), 432–436.
Weisstein, N., & Haber, R. N. (1965). A U-shaped backward masking function in vision. Psychonomic Science, 2, 75–76.
Wolfe, J. M., Horowitz, T. S., & Michod, K. O. (2007). Is visual attention required for robust picture memory? Vision Research, 47, 955–964.
Yen, S.-C., & Finkel, L. H. (1998). Extraction of perceptually salient contours by striate cortical networks. Vision Research, 38(5), 719–741.