How The Vision Works - Theory and Logic Group

How The Vision Works
Nariman Varahram (1228406)
Vienna University of Technology
Abstract—Computer vision is a field which deals with problems
regarding acquiring, processing, analyzing and understanding
real-world data. The goal of computer vision is to extract
information from real-world input images. These input images
can take different forms, such as video sequences, multiple
camera views or multi-dimensional data from medical scanners.
However, to reproduce human vision in a computer, one should
first understand how human vision works, in order to see what
the challenges in implementing computer vision are. Furthermore,
to understand human vision one needs to know how the mind works
when it sees the real world. This paper focuses on the mind and
on both human vision and computer vision from the point of view
of the well-known psychologist Steven Pinker, the author of
“How the Mind Works”.
Keywords—Vision, Mind, Brain, Eye, Human, Computer
I. INTRODUCTION
In most movies about robots, when cinematographers show the
world from a robot’s-eye view, they show video images of the
world decorated with contrivances such as cross-hairs, pull-down
menus, fish-eye distortion or a red tint. However, this is a
misleading portrait of vision.
If it were possible to see the world through a robot’s eyes,
or from the point of view of the human brain, it would not look
like video images with cross-hairs. Instead it would be
millions of values, each corresponding to the intensity of light
at one location on the retina, obtained by a two-dimensional
projection of the three-dimensional world in front of the eye
or the camera.
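To make this concrete, the following is a minimal Python sketch, not taken from Pinker’s text. It treats a tiny “retinal image” as a grid of light intensities. The 4x4 size, the 0 to 255 intensity scale and the function name `describe_image` are assumptions chosen only for illustration.

```python
# A toy "retinal image": each number is the light intensity at one location.
# The 0-255 scale and the 4x4 size are illustrative assumptions.
image = [
    [12,  15,  14, 13],
    [11, 200, 210, 12],
    [13, 205, 198, 14],
    [12,  14,  13, 11],
]


def describe_image(grid):
    """Return (rows, columns, darkest, brightest) for a grid of intensities."""
    values = [v for row in grid for v in row]
    return len(grid), len(grid[0]), min(values), max(values)


rows, cols, darkest, brightest = describe_image(image)
print(f"{rows}x{cols} image, intensities from {darkest} to {brightest}")
```

Nothing in the grid says “bright square on a dark surface”; that description has to be computed from the numbers, which is the task the rest of this paper is concerned with.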
To understand how computer vision works, we first need to
understand how human vision works. And to understand how
human vision works, we need to know how the human brain and
the human eyes work together.
The human mind is one of the last great frontiers of science,
and it is a truly magnificent organ. It allows us to walk on the
moon, to discover the secrets of life and of the physical
universe, and to invent advanced and complex devices. But the
mind raises many paradoxes. On the one hand, the human mind is
an engineering masterpiece: we can see, move and use common
sense better than any existing or foreseeable computer or robot.
On the other hand, it struggles to answer seemingly simple
questions. For instance, why is the thought of eating worms as
a source of protein disgusting? Why do people believe in ghosts
and spirits? Why do people fall in love? One idea that explains
both sides of this paradox is computation. Just as the function
of the heart is to pump blood and the function of the kidney is
to filter the blood, the function of the brain is information
processing, or computation. However, much of the information
that exists in the world is color-less, odor-less, weight-less
and taste-less. For instance, consider explaining a person’s
behavior, such as why the person took a taxi. It is not possible
to answer this question by solving mathematical functions or by
simulating the neural networks in the brain. However, if that
person is simply asked why he took the taxi, he can give an
explanation, for instance that he wanted to go to a hotel. In
this case, reaching the hotel is his desire, and together with
the belief that the taxi will take him there it leads to his
behavior, which is taking the taxi.
The computational theory of mind solves this problem. Part of
the reason is that beliefs and desires are color-less, odor-less
and taste-less, yet they can cause behavior. Information, as a
mathematical concept, is likewise color-less, odor-less and
taste-less, but physical devices that carry information can, by
obeying the laws of physics, cause systematic patterns of change
that are well characterized in this abstract language of
information.
The computational theory of mind also guides the study of the
physiology of the brain. Consider an old pseudo-question that
is sometimes heard in introductory psychology. A diagram of the
eyeball shows the image of the world projected onto the retina
upside down. If the image on the retina is upside down, does
that mean that some part of the brain turns the image right side
up, so that we see the world as it is, right side up? This is a
pseudo-question: there need not be any such process in the
brain, because whether the image is upside down or right side up
makes no difference to how the brain processes the information
coming from it. And information is the only property of
activity in the brain that is relevant to explaining the mind.
II. VISIONS
For centuries we have known that the body is a complex device.
For instance, the human eye has many parts that are intricately
arranged to accomplish some outcome, namely focusing an image
from rays of light onto sensitive tissue. We explain the various
parts of the eye by saying that, in some sense, they are designed
for forming an image. Now the punch line is that the mind is
also a complex device, complex enough that we cannot duplicate
even its simple functions, like seeing or retrieving
information, in a computer or robot.
This leads to the idea that, in order to understand the mind,
we have to reverse engineer it. In forward engineering we have a
goal that we want our device to accomplish, and this leads us to
build the device. In reverse engineering we start with the
device and have to figure out what it was designed to do. The
mind is a complicated device that has to solve many different
kinds of engineering problems, such as seeing in three
dimensions, moving arms and legs, understanding the physical
world and many others. These problems are different, and the
tools for solving them are different. We know that
specialization is ubiquitous in biology in general. The heart
has a different shape from the eye because the heart is designed
to pump blood and the eyes are designed to see the
three-dimensional world. Heart cells are different from eye
cells. The focus of this paper is on the eyes, on vision, and on
how the brain processes and reasons about the information
gained by the eyes.
Fig. 1. Object against a black background.
Fig. 2. Same object against a white background.
A. Human Vision
As described in the Introduction, the image of the world from
the point of view of the eyes is millions of values
corresponding to the intensity of light at the various locations
on the retina.
The task of the brain is to process these numbers and to
recover and understand the three-dimensional structure of the
world from the intensities. The brain has evolved many tricks
for doing this; however, the task is far from simple. Objects in
the real world are not always easy to make out. For instance,
Figure 1 shows an object against a dark background, so the
foreground object cannot easily be distinguished from the
background. Figure 2 shows the same object against a white
background. Comparing the two, it is easy to conclude that how
hard it is to spot the shape of an object depends on its
surroundings.
Another challenge in seeing objects is handled by a trick
called “shape from shading”, which is based on a simple law of
physics, as follows.
Imagine a light source and a surface in front of it. The steeper
the angle of the surface, the less light is reflected back.
Hence, as the surface is rotated with respect to the light
source, the light reflected from it goes from bright to dim.
This is simply how reflection works according to the laws of
physics. Our psychology takes advantage of this physical law and
runs it backward: the dimmer the image on the retina, the
steeper the angle of the surface in the world. The brain is
therefore able to reconstruct a shape from the angles of the
thousands of facets that collectively define the
three-dimensional shape of the surface. The only problem with
the “shape from shading” trick is that the brain interprets
brightness as angle and therefore assumes a uniformly, or at
least randomly, colored world. It assumes that any difference in
lightness or darkness on the retina comes from differences in
angle, and ultimately shape, in the world. This assumption is
obviously not generally true, and it predicts that surfaces
colored in a clever way should fool the shape-from-shading
module and cause us to see things that aren’t there. In fact,
that is exactly what happens: many of the contrivances of modern
life take advantage of this. For instance, television is a kind
of illusion. People spend hours and hours staring at a pane of
glass. Why do people stare at a pane of glass? Because it is
designed to display patterns of shading that the
shape-from-shading analyzer interprets as coming from
three-dimensional objects. We stare at the pane of glass because
it is engineered to defeat this part of our brain and cause us
to hallucinate a real world behind the glass.
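The forward law and its backward use can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not name the law, so Lambert’s cosine law is assumed here, the angle is measured between the surface normal and the direction of the light, and the names `brightness` and `inferred_angle` are hypothetical.

```python
import math


def brightness(albedo, angle_deg):
    """Forward physics (assumed: Lambert's cosine law). Reflected light falls
    off with the cosine of the angle between surface normal and light."""
    return albedo * math.cos(math.radians(angle_deg))


def inferred_angle(observed, assumed_albedo=1.0):
    """Run the law backward: guess the surface angle from its brightness,
    assuming the surface has a known, uniform albedo."""
    ratio = max(0.0, min(1.0, observed / assumed_albedo))
    return math.degrees(math.acos(ratio))


# A white surface tilted 60 degrees away from the light...
tilted_white = brightness(albedo=1.0, angle_deg=60.0)
# ...sends back the same amount of light as a flat surface painted mid-grey.
flat_grey = brightness(albedo=0.5, angle_deg=0.0)

print(inferred_angle(tilted_white))  # the tilt is recovered
print(inferred_angle(flat_grey))     # flat grey paint is misread as a tilt
```

The second call is the television and makeup case in miniature: a surface that is merely darker in color is interpreted as a surface that is steeper, because the backward step assumes a uniformly colored world.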
Another example is makeup. People who are skilled at applying
makeup know that if a person’s nose is too big, it can be made
to look smaller by putting a little rouge on the sides of the
nose: the brain interprets dark as a steeper angle, and hence
the nose looks narrower. More generally, many illusions and
fallacies of behavior are like makeup and television: they come
from a mismatch between the assumptions about the world that are
built into our mental modules and the structure of the current
world.
Now consider comparing two objects, for instance milk and cola.
If large numbers come from bright regions and small numbers come
from dark regions, can we conclude that a large number equals
white and corresponds to milk, and that a small number equals
black and corresponds to cola? No. The amount of light received
by the retina depends not only on how pale or dark the object
is, but also on how bright or dim the light illuminating the
object is. Yet we see milk as white even in a dark area and
cola as black even in a bright area. This means that human
consciousness matches the world as it is rather than the world
as it presents itself to the eye. The harmony between how the
world looks and how the world is must be an achievement of our
neural wizardry, because black and white don’t simply announce
themselves on the retina.
The impressive part of this process is that the human brain
deduces an object’s shape and substance from its
two-dimensional projection. Why is this impressive? Because
problems of this kind, known as “ill-posed problems”, are
generally unsolvable as stated and do not lead to a unique
solution. A patch of grey received by the eye can be either milk
in shade or cola under bright light. Vision evolved to convert
this unsolvable problem into a solvable one by adding premises
and assumptions. It follows that if we travel to another world
where these assumptions are no longer valid, or if they fail
because of unlucky and unpredictable coincidences, we fall prey
to an illusion.
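The milk and cola ambiguity can be written out as a small Python sketch. It assumes the simplest possible model, in which the light reaching the eye is the fraction of light the surface reflects multiplied by the light falling on it. The reflectance and illumination numbers are invented for illustration, and the function names are hypothetical.

```python
def light_at_retina(reflectance, illumination):
    """Light reaching the eye: fraction reflected times light falling on it."""
    return reflectance * illumination


# Illustrative numbers only (assumed, not measured).
MILK, COLA = 0.9, 0.1            # fraction of incoming light each reflects
SHADE, SUNLIGHT = 100.0, 900.0   # arbitrary illumination units

milk_in_shade = light_at_retina(MILK, SHADE)
cola_in_sun = light_at_retina(COLA, SUNLIGHT)
print(milk_in_shade, cola_in_sun)  # the same grey patch on the retina


def reflectance_given(light, assumed_illumination):
    """With an added assumption about the lighting, the problem is solvable."""
    return light / assumed_illumination


print(reflectance_given(milk_in_shade, SHADE))   # close to MILK
print(reflectance_given(cola_in_sun, SUNLIGHT))  # close to COLA
```

One number on the retina cannot be split into two unknowns, which is what makes the problem ill-posed. Supplying an assumption about the illumination makes it solvable, and supplying the wrong assumption produces an illusion.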
The next problem is seeing depth. The human eye projects the
three-dimensional world onto a two-dimensional image on the
retina, so the third dimension has to be reconstructed in the
brain. However, the information about how far away the real
object was is not received by the retina.
Another important fact to consider about human vision is that
humans are binocular, which means that they receive two
independent images, one for each eye. The image projected onto
the left eye’s retina is not exactly the same as the one
projected onto the right eye’s. This fact explains how
stereograms work. Moreover, it explains why it is impossible
for a painter to paint a nearby solid object so that the
painting cannot be distinguished from the real object. The two
eyes have slightly different views, a difference called
“binocular parallax”. Imagine looking at a soccer ball on a
table with a rugby ball behind it and a tennis ball in front of
it. Aim your eyes at the soccer ball: the soccer ball is at six
o’clock on both retinas. Now look at the projection of the
tennis ball, which is located in front of the soccer ball. In
the left eye it sits at seven o’clock, but in the right eye it
sits at five o’clock. Similarly, the projection of the rugby
ball, which is farther away, sits at five o’clock in the left
eye and at six thirty in the right. Afterward, when the brain
detects that these two images correspond to a single object in
the real world, the two individual views (each also known as a
Leonardo’s window) are combined by the mind to produce the
result, which is the image that we see. That is why it is
impossible for a painter to produce a painting that looks the
same as a nearby real object: in the case of a painting, two
identical pictures are projected to the eyes, whereas in the
case of a real object the two pictures are dissimilar.
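The link between the difference in the two views and distance can be sketched in Python. This is a simplification and not the geometry of the eyes described above: it assumes two parallel pinhole cameras rather than two eyes converged on the soccer ball, and the focal length, the separation and the disparity values are made-up numbers. Under those assumptions the standard relation is depth = focal length x separation / disparity.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Distance to a point seen by two parallel pinhole cameras.

    disparity_px is the horizontal shift of the point between the left and
    right images; a larger shift means a nearer point.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


FOCAL_PX = 800.0     # assumed focal length, in pixels
BASELINE_M = 0.065   # assumed separation of the two viewpoints, in metres

# Hypothetical disparities: the nearest ball shifts the most between views.
for name, disparity in [("tennis ball", 104.0),
                        ("soccer ball", 52.0),
                        ("rugby ball", 26.0)]:
    depth = depth_from_disparity(FOCAL_PX, BASELINE_M, disparity)
    print(name, round(depth, 2), "m")
```

A flat painting produces the same disparity for every point on the canvas, so this calculation returns one depth for the whole picture, which is the reason given above for why a painting cannot pass for a nearby solid object.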
What happens when looking at a stereogram is not complex to
explain. The scene is captured through two Leonardo’s windows
or, more generally, by two cameras, each positioned where one of
the eyes would be. The left image is placed in front of the
person’s left eye and the right image in front of the person’s
right eye. The brain then assumes that the two eyes are looking
at the same three-dimensional scene, with the only difference
between the views being the one caused by binocular parallax.
At this point the brain is fooled by the pictures: it combines
them into one picture, and parts of the image appear at
different depths. However, the brain also physically adjusts the
eyes by controlling muscles in two ways, and this is the reason
why some people cannot see stereograms.
Fig. 3. Stereograms.
Fig. 4. Position of the images in each eye when looking at a stereogram.
In the first adjustment, the brain controls the thickness of
the eye’s lens. The lens receives light from the world and
focuses it at a point on the retina. For a distant object, a
muscle inside the eyeball makes the lens thin, and for a close
object it makes the lens fat, to avoid a blurry image. Figure 5
illustrates how the muscle changes the thickness of the lens.
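Why a close object calls for a fatter lens can be illustrated with the thin-lens equation, 1/f = 1/d_o + 1/d_i. The Python sketch below is a strong simplification: it treats the eye as a single thin lens at a fixed distance from the retina, and the 17 mm image distance is an assumed, illustrative value. The function name is hypothetical.

```python
def focal_length_needed(object_distance_m, image_distance_m=0.017):
    """Thin-lens equation 1/f = 1/d_o + 1/d_i, solved for f.

    image_distance_m is the assumed fixed lens-to-retina distance.
    """
    return 1.0 / (1.0 / object_distance_m + 1.0 / image_distance_m)


far = focal_length_needed(100.0)   # an object 100 m away
near = focal_length_needed(0.25)   # an object at reading distance

# A shorter focal length means a more strongly curved (fatter) lens.
print(f"far object:  f = {far * 1000:.2f} mm")
print(f"near object: f = {near * 1000:.2f} mm")
```

Because the retina cannot move, the only way to keep both objects in focus in this model is to change the focal length, which corresponds to the muscle changing the thickness of the lens.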
The goal of the second adjustment is to aim the two eyes,
which are separated by about two and a half inches, at the same
object in the world. This is done with the help of muscles
attached to the sides of the eyes. The closer the object, the
more the eyes must be crossed. Figure 6 illustrates how the
brain controls the orientation of the eyes.
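How much the eyes must cross for a given distance follows from simple trigonometry. The Python sketch below assumes the object is straight ahead and uses an eye separation of 0.0635 m as an illustrative value; the function name `vergence_angle_deg` is hypothetical.

```python
import math


def vergence_angle_deg(distance_m, eye_separation_m=0.0635):
    """Angle between the two lines of sight when both eyes aim at a point
    straight ahead at the given distance."""
    half_separation = eye_separation_m / 2.0
    return math.degrees(2.0 * math.atan(half_separation / distance_m))


for distance in (0.25, 1.0, 10.0):
    angle = vergence_angle_deg(distance)
    print(f"{distance:5.2f} m -> {angle:.2f} degrees")
```

The angle shrinks quickly with distance, so in this model the crossing of the eyes says a lot about nearby objects and very little about distant ones.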
B. Computer Vision
One of the clearest definitions of the goal of vision comes
from the artificial intelligence researcher David Marr. He said:
“Vision is a process that produces from images of the external
world a description that is useful to the viewer and not cluttered
with irrelevant information.”
Fig. 5. Adjusting the thickness of the lens by controlling muscles with brain signals.
Fig. 6. Adjusting the orientation of the eyes by controlling muscles with brain signals.
If vision did not produce a description, then every organ and
mental faculty, such as moving, talking, planning and so on,
would need its own procedure to deduce meaning, which is not
what happens.
When the retina receives a two-dimensional pattern, vision
deduces the shape of the object from the retinal image. After
that, all parts of the mind start working to produce a
description of it. Finally, the mind attaches this description,
in a format readable by the mental modules, to the object in
three-dimensional coordinates.
Let us pretend that we have somehow built a robot that can
see and move. What will it do with what it sees? How should
it decide how to act? An intelligent being cannot treat every
object it sees as a unique entity unlike anything else in the
universe. It has to put objects in categories so that it may apply
its hard-won knowledge about similar objects, encountered in
the past, to the object at hand. But whenever one tries to
program a set of criteria to capture the members of a category,
the category disintegrates, and this is so even leaving aside
slippery concepts like “beauty” or “dialectical materialism”.
Fig. 7. The squares marked A and B are the same shade of grey.
In fact, the most challenging part of computer vision is
understanding and reasoning: giving the image meaning and a
description, as the mind does. Soon, autonomous robots
of all shapes and sizes, from cars to hospital helpers, will be
a familiar sight in public. But in order for that to happen, the
machines need to learn to navigate our environment, and that
requires a lot more than a good pair of eyes.
There are some images that our brains consistently put
together incorrectly, and these are what we call optical illusions. Optical illusions are interesting because if mathematical
models of vision can predict new ones, it’s a useful indicator
that the model is reflecting human vision accurately. Optical
illusions are intrinsically fascinating magic tricks from nature
but at the same time they are also a way to test how good the
model is.
For instance, most robots would not be fooled by the Adelson
checkerboard illusion, in which humans think two identical
grey squares are different shades (Figure 7).
Although it might seem like the machine wins this round,
robots have problems recognizing shadows and accounting
for the way they change the landscape. Computer vision
suffers badly when there are variations in lighting conditions,
and occlusions and shadows are very often taken to be real
objects.
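A toy example shows how a shadow can be taken for an object. The Python sketch below uses a deliberately naive detector that labels every dark pixel as “object”. The pixel values, the threshold and the 40% shadow factor are assumptions made up for illustration, and no real vision system is claimed to work this way.

```python
def find_objects(row, threshold=100):
    """Naive detector: call every pixel darker than the threshold 'object'."""
    return [i for i, value in enumerate(row) if value < threshold]


# One row of pixels from an assumed scene: bright road, one dark object.
road = [200, 200, 200, 40, 40, 200, 200, 200]

# The same row with a shadow over the first three pixels (40% of the light).
shadow_factor = [0.4, 0.4, 0.4, 1.0, 1.0, 1.0, 1.0, 1.0]
shadowed_road = [int(v * s) for v, s in zip(road, shadow_factor)]

print(find_objects(road))           # only the real object
print(find_objects(shadowed_road))  # the shadow is reported as object too
```

The detector compares raw intensities, so it would correctly report the two squares of the checkerboard illusion as equal, but for the same reason it cannot tell a dark surface from a bright surface in shadow.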
This is why autonomous vehicles need more than a pair
of suitably advanced cameras. Radar and laser scanners are
necessary because machine intelligence needs much more
information to recognize an object than we do. It’s not just
places and objects that robots need to recognize. To be faithful
assistants and useful workers, they need to recognize people
and our intentions. Military robots need to correctly distinguish
enemy soldiers from frightened civilians, and care robots need
to recognize not just people but their emotions.
The contextual awareness needed to safely navigate the
world is not to be taken lightly. Imagine a plastic ball rolling
into the road. Most human drivers would expect that a child
might follow it, and slow down accordingly. A robot can too,
but distinguishing between a ball and a plastic bag is difficult,
even with all of its sensors and algorithms. And that is
before we start thinking about people who might set out to
intentionally distract or confuse a robot, tricking it into driving
onto the pavement or falling down a staircase. Could a robot
recognize a fake road diversion that might be a prelude to a
theft or a hijacking?
An intelligent being has to deduce the implications of what
it knows, but only the relevant implications. Dennett points out
that this requirement poses a deep problem not only for robot
design but for epistemology, the analysis of how we know.
The problem escaped the notice of generations of philosophers,
who were left complacent by the illusory effortlessness of their
own common sense.
III. CONCLUSION
There is a big difference between seeing the world and
understanding it. Seeing is just a starting point for
understanding the world.
The hard part is getting a robot to intelligently identify what
it has detected. We take for granted what goes into creating
our own view of the road. We tend to think of the world falling
onto retinas like the picture through a camera lens, but sight
is much more complicated. The whole visual system shreds
images, breaks them up into maps of color, maps of motion,
and so on, and somehow then manages to reintegrate that. How
the brain performs this trick is still a mystery. I conclude and
predict that for the foreseeable future we will need a human
monitoring the system. It is not realistic to say that computers
will take over any time soon and make all decisions on behalf of
the driver.
REFERENCES
[1] S. Pinker, How the Mind Works. London, England: Penguin Group, 1997.
[2] S. Pinker, Interview at Cambridge University 1997. London, England: Penguin Group, 1997.
[3] Wikipedia, “Computer vision,” https://en.wikipedia.org/wiki/Computer_vision