Sat-Cam: Personal Satellite Virtual Camera
Hansung Kim1,2, Itaru Kitahara1, Kiyoshi Kogure1, Norihiro Hagita1, and Kwanghoon Sohn2

1 Intelligent Robotics and Communications Laboratories, ATR, Keihanna Science City, Kyoto 619-0288, Japan
{kitahara, kogure, hagita}@atr.jp, http://www.irc.atr.jp
2 Dept. of Electrical and Electronics Eng., Yonsei University, 134 Shinchon-dong, Seodaemun-gu, Seoul 120-749, Korea
[email protected], [email protected], http://diml.yonsei.ac.kr
Abstract. We propose and describe a novel video capturing system called Sat-Cam that can observe and record users' activity from effective viewpoints, negating the effects of unsteady mobile video cameras or of switching between a large number of video cameras in a given environment. Using real-time imagery from multiple cameras, our system generates virtual views fixed in relation to the object. The system consists of capturing and segmentation components, a 3D modeling component, and a rendering component connected over a network. Results indicate that the scenes rendered by Sat-Cam are stable and help viewers understand the activity in real time. By using the 3D modeling technique, the occlusion problem in object tracking is solved, allowing us to generate scenes from any direction and at any zoom level.
1 Introduction
With the progress of communication technology, many systems are being developed to share experiences or knowledge and to interact with other people [1][2][3][4][5][6]. As the proverb goes, "a picture is worth a thousand words": we can understand most of another person's activities by watching video of them. In this paper, we propose a novel video capturing system called Sat-Cam that can observe and record users' activity from effective viewpoints. Applications of this technology extend to surveillance, remote education or training, telecommunication, and so on.
There are two typical approaches to capturing visual information of working
records. As illustrated in Figure 1(left), the first one is to use a head-mounted
or mobile camera attached to users [5][6][7][8]. This approach can easily record
all the activities the user performs with a minimum amount of data (i.e., a single video stream). However, when a viewer who does not share the captured user's context tries to understand his/her experiences by watching the video, the mobile camera may cause the viewer inconvenience, because the video data is captured from a subjective viewpoint. For example, the sway of the camera confuses viewers; moreover, it is difficult for third parties to understand the scene from this camera alone, since the movement of the user itself cannot be observed.

Fig. 1. Approaches to capturing visual information of user's activities
The second approach, illustrated in Figure 1 (center), is to use cameras fixed in the environment. Since these cameras always provide objective and stable visual information, it is easier for third parties to understand the captured scenes. However, an enormous amount of useless video must be captured to cover the whole area at all times. To find the best view for observing the activities, we have to switch between multiple video streams. As the number of capturing cameras increases, this switching and monitoring operation can exceed a human's processing ability.
To overcome the above problems, we propose the Sat-Cam system, illustrated in Figure 1 (right), which captures a target object from a bird's-eye view. The system generates virtual views fixed relative to the object by reconstructing its 3D model in real time. The scenes rendered by Sat-Cam are stable and cover the minimal area needed to understand the user's activity, making them easier for third parties to understand.
In the next section, we provide an overview of the Sat-Cam system. Section 3 then describes the detailed algorithms used in the system. Section 4 presents the experimental setup and results, and finally, we draw conclusions in Section 5.
2 Sat-Cam System
Figure 2 shows the concept of Sat-Cam. CV (Computer Vision)-based 3D video
display systems have become feasible with the recent progress in computer and
video technologies, and indeed several systems have been developed [9][10][11]. As
“Sat-Cam” stands for “Satellite Camera,” the system aims to capture the visual
information of working records by a virtual camera that orbits the target user,
employing a 3D video processing technique. Since the virtual camera always tags along with the user, it can record all the activities the user performs as a single video stream. When Sat-Cam's point of view is set to look down on the target space like a satellite, the captured video can be easily understood by third parties. One of the most important features of this system is that it works in real time: if the Sat-Cam video were instead generated as a post-process, it would be necessary to record enormous amounts of environmental video data.

Fig. 2. Overview of Sat-Cam System
3 Algorithms of the Proposed Method
This section describes in detail the algorithms used by the system. The system
comprises three sub-systems: object segmentation in capturing PCs, 3D modeling in a 3D modeling server, and rendering virtual views in a rendering PC.
When target objects are captured by cameras, each capturing PC segments
the objects and transmits the segmented masks to a 3D modeling server. The
modeling server generates 3D models of the objects from the gathered masks,
and tracks each object through the sequence of 3D model frames. The 3D model, object IDs, and 3D positions of the objects are sent to a rendering PC via a network, and finally, the rendering PC generates a video at the designated point of view using the 3D model and texture information from the cameras.
3.1 Object Segmentation
Real-time object segmentation is one of the most important components of the
proposed system, since the performance of the segmentation decides the quality
of the final 3D model.
We realize object segmentation in color images based on Chien's [12] and Kumar's [13] algorithms, using background subtraction and inter-frame differences. First, the background is modeled with the minimum and maximum intensities of the input images, which are low-pass filtered to eliminate noise. Then, a frame-difference mask is calculated by thresholding the difference between two consecutive frames. Third, an initial object mask is constructed by OR-ing the frame-difference and background-difference masks. Fourth, we refine the initial mask with a closing operation and eliminate small regions with a region-growing technique. Finally, in order to smooth the objects' boundaries and to eliminate holes inside the objects, we apply Kumar's profile extraction technique [13] from all quarters.

Fig. 3. 3D modeling with the shape-from-silhouette technique
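To make the steps above concrete, here is a minimal sketch of the color-image segmentation stage, assuming OpenCV and NumPy; the blur kernel, the thresholds, and the use of connected components in place of region growing are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of the color-image segmentation stage (illustrative, not the
# authors' code): background subtraction + frame difference + closing + small-
# region removal, as described in Section 3.1.
import cv2
import numpy as np

def build_background_model(frames):
    """Per-pixel min/max background model of low-pass-filtered frames."""
    blurred = [cv2.GaussianBlur(f, (5, 5), 0) for f in frames]
    stack = np.stack(blurred).astype(np.int16)
    return stack.min(axis=0), stack.max(axis=0)

def segment(prev, curr, bg_min, bg_max, diff_th=5, min_area=100):
    prev = cv2.GaussianBlur(prev, (5, 5), 0).astype(np.int16)
    curr = cv2.GaussianBlur(curr, (5, 5), 0).astype(np.int16)
    # Frame-difference mask: threshold the change between consecutive frames.
    frame_diff = np.abs(curr - prev).max(axis=2) > diff_th
    # Background-difference mask: pixel falls outside the [min, max] background range.
    bg_diff = ((curr < bg_min - diff_th) | (curr > bg_max + diff_th)).any(axis=2)
    # Initial object mask by OR-ing the two masks, then morphological closing.
    mask = (frame_diff | bg_diff).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    # Eliminate small regions (stand-in for the region-growing elimination step).
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
    clean = np.zeros_like(mask)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] >= min_area:
            clean[labels == i] = 255
    return clean
```

In practice the parameter values reported in Section 4 (a frame-difference threshold of 5 and a minimum region size of 100) would be used here.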
If the target space has poor or unstable illumination, thermal cameras can be
used. In this case, the segmentation process is much simpler than for color images
since a human object is brighter than the background. We make an initial mask
by thresholding with the intra-variance from the mean of the thermal scene.
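For the thermal case, a hedged sketch of the thresholding follows; deriving the cut-off from the scene mean and standard deviation is our reading of the description above, so the exact threshold is an assumption.

```python
# Sketch of the simpler thermal-camera segmentation (threshold is illustrative).
import numpy as np

def segment_thermal(frame):
    """A warm human body appears brighter than the background, so threshold
    on intensity relative to the scene statistics."""
    mean, std = float(frame.mean()), float(frame.std())
    return (frame > mean + std).astype(np.uint8) * 255
```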
The final segmented mask is converted into binary code and transmitted to
the modeling server via UDP (User Datagram Protocol).
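As an illustration of this transmission step, the sketch below packs the 180x120 binary mask into bits and sends it over UDP; the one-byte camera-ID header, the address, and the port are hypothetical.

```python
# Sketch of shipping a binary silhouette mask to the modeling server over UDP.
import socket
import numpy as np

MODELING_SERVER = ("192.168.0.10", 50000)          # hypothetical address and port

def send_mask(sock, camera_id, mask):
    """Pack a 180x120 binary mask into bits with a 1-byte camera ID header."""
    bits = np.packbits((mask > 0).astype(np.uint8))  # 180*120/8 = 2700 bytes
    sock.sendto(bytes([camera_id]) + bits.tobytes(), MODELING_SERVER)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_mask(sock, camera_id=0, mask=segmented_mask)
```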
3.2 3D Modeling
The transmitted binary segmented images from the capturing PCs are used
to reconstruct a 3D model of the objects. The modeling PC knows the projection matrices of all the cameras because they were calibrated in advance.
We use the shape-from-silhouette technique to reconstruct a 3D model, as shown in Figure 3 [14]. A check point M(X, Y, Z) in 3D space is projected onto each image I_n with the following equation, where P_n is the projection matrix of camera C_n:

(u, v, 1)^T = P_n (X, Y, Z, 1)^T    (1)

If the projected points of M are included in the foreground regions of all the images, we select the point as an inside voxel of an object.
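A minimal sketch of this silhouette-consistency test is shown below, assuming NumPy; `proj_mats` holds the 3x4 projection matrices P_n and `masks` the binary silhouettes from Section 3.1.

```python
# Silhouette-consistency test for one 3D check point (illustrative sketch).
import numpy as np

def is_inside(point_xyz, proj_mats, masks):
    """Return True if the point projects into the foreground of every camera."""
    X = np.append(point_xyz, 1.0)                 # homogeneous 3D point
    for P, mask in zip(proj_mats, masks):
        u, v, w = P @ X                           # Eq. (1): (u, v, 1)^T = P (X, Y, Z, 1)^T
        if w <= 0:
            return False                          # behind the camera
        x, y = int(round(u / w)), int(round(v / w))
        h, wid = mask.shape
        if not (0 <= x < wid and 0 <= y < h) or mask[y, x] == 0:
            return False                          # outside the image or in background
    return True
```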
Testing all the points in the 3D space is, however, a very time-consuming process and produces a large amount of data. Therefore, we use an octree data structure for modeling. For each voxel at a given level, 27 points (i.e., the eight corners and the centers of the twelve edges, the six faces, and the cube itself) are tested. If all the check points are either included in or excluded from an object, the voxel is assigned as a full or an empty voxel, respectively. Otherwise, the voxel is split into eight sub-voxels and tested again at the next refinement level. Figure 4 shows the structure of the octree. This structure dramatically reduces the modeling time and the amount of data.
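The refinement loop might look like the following sketch, which reuses the `is_inside` test above and samples the 27 check points per voxel; the recursive list representation and the decision at the finest level are simplifying assumptions.

```python
# Sketch of the octree refinement loop (illustrative; relies on is_inside above).
import itertools
import numpy as np

def check_points(origin, size):
    """The 27 sample points of a cubic voxel: corners, edge/face centers, center."""
    offs = np.array(list(itertools.product([0.0, 0.5, 1.0], repeat=3)))
    return origin + offs * size

def build_octree(origin, size, proj_mats, masks, level, max_level):
    results = [is_inside(p, proj_mats, masks) for p in check_points(origin, size)]
    if all(results):
        return "full"                 # every check point inside: full voxel
    if not any(results):
        return "empty"                # every check point outside: empty voxel
    if level == max_level:
        return "full"                 # mixed voxel at the finest level: keep it
    # Mixed voxel: split into eight children and refine at the next level.
    half = size / 2.0
    return [build_octree(origin + np.array([dx, dy, dz]) * half, half,
                         proj_mats, masks, level + 1, max_level)
            for dx, dy, dz in itertools.product([0, 1], repeat=3)]
```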
After the modeling process, the server performs object tracking in the model.
It is very difficult, however, to track in the 3D model in real time; therefore, we
perform the tracking in the 2D plane.
Fig. 4. Octree structure
We assume that ordinary objects (humans) have a roughly constant height (e.g., 170 cm) and extract a 2D plane model by slicing the 3D model at a lower height (e.g., 120 cm). Then, we grow and label the regions on the plane model. By tracking the center of each labeled region over a series of model frames, we can identify and track each object.
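A sketch of this slice-based labeling and tracking is given below, assuming the voxel model is available as a dense boolean grid with the 2 cm pitch used in Section 4 and that SciPy's connected-component labeling stands in for the region growing; the nearest-centroid matching and its distance gate are illustrative.

```python
# Sketch of 2D slice labeling and nearest-centroid tracking (illustrative).
import numpy as np
from scipy import ndimage

VOXEL_CM = 2.0
SLICE_HEIGHT_CM = 120.0

def track_slice(voxels, prev_tracks, next_id=0, max_jump_cm=50.0):
    """Label blobs in the horizontal slice and match them to previous centroids."""
    z = int(SLICE_HEIGHT_CM / VOXEL_CM)
    plane = voxels[:, :, z]                       # occupancy at about 120 cm height
    labels, n = ndimage.label(plane)              # region growing / labeling
    centroids = ndimage.center_of_mass(plane, labels, range(1, n + 1))
    tracks = {}
    for c in centroids:
        c = np.array(c) * VOXEL_CM
        # Assign the nearest previous ID if it is close enough, else a new ID.
        best = min(prev_tracks.items(),
                   key=lambda kv: np.linalg.norm(kv[1] - c),
                   default=None)
        if best is not None and np.linalg.norm(best[1] - c) < max_jump_cm:
            tracks[best[0]] = c
        else:
            tracks[next_id] = c
            next_id += 1
    return tracks, next_id
```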
Finally, modeling parameters, 3D positions of objects with ID numbers, and
node information of an octree model are transmitted to the rendering part.
3.3 Virtual View Rendering
In the rendering part, the received 3D model is reconstructed and the virtual view of Sat-Cam is synthesized. When the octree information is received, the rendering PC reconstructs the 3D model by decoding the node information and inserts the model at the correct position in 3D space. The transmitted data from the modeling server also includes the 3D positions and object IDs of all objects. The rendering PC requests texture information for the objects from the capturing PCs and performs texture mapping onto the reconstructed 3D model.
However, the resolution of our 3D model is not sufficient for a simple (one-to-one) texture mapping method, because we place real-time processing ahead of reconstructing a fine 3D model, so the octree method describes the 3D model at several levels of resolution. Our system employs the "Projective Texture Mapping" method to solve this problem [15]. This mapping method projects the texture image onto the 3D objects as a slide projector would. Consequently, the resolution of the texture image is retained during the texture mapping process, regardless of the resolution and shape of the mapped 3D model. Moreover, since this method is implemented as OpenGL functional libraries, it is possible to take advantage of a high-speed graphics accelerator. By merging in the working space (background), which is modeled in advance, a complete 3D model of the working space and object is reconstructed.
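The idea behind projective texture mapping can be sketched as follows: each model vertex is projected with the texture camera's projection matrix, exactly as in Eq. (1), to obtain its texture coordinates. The actual system relies on OpenGL's built-in projective texturing [15]; the NumPy version below only illustrates the geometry.

```python
# Conceptual sketch of projective texture mapping: texture coordinates are
# obtained by projecting model vertices with the texture camera's 3x4 matrix.
import numpy as np

def projective_tex_coords(vertices, P, img_w, img_h):
    """Map Nx3 world-space vertices to normalized (s, t) texture coordinates."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])   # Nx4 homogeneous
    proj = homo @ P.T                                           # Nx3 rows (u, v, w)
    uv = proj[:, :2] / proj[:, 2:3]                             # perspective divide
    return uv / np.array([img_w, img_h])                        # normalize to [0, 1]
```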
Finally, the rendering PC generates scenes at the viewpoint requested by a
user. This system provides the following two modes to control the viewpoint of
a virtual camera.
Tracking mode: In this mode, the virtual camera observes the object from a position above and behind the user. The direction of the object must be known in order to control the pan and tilt values of the virtual camera. We assume that the direction in which an object moves is the same as its front direction. The direction of movement is estimated by tracking the global path of the object over the previous consecutive frames. This control mode is applied while the object is moving.

Fig. 5. Configuration of our pilot Sat-Cam System
Orbiting mode: We can make the virtual camera go around the object like
a satellite when the target object stops in one position. Thus, orbiting mode
makes it possible to observe the blind (self-occluded) spots.
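The two control modes can be sketched as simple camera-placement rules; the offsets, the up axis, and the angular speed below are illustrative parameters, not the authors' values.

```python
# Sketch of the two viewpoint-control modes (parameters are illustrative).
import numpy as np

def tracking_view(obj_pos, move_dir, back=2.0, up=1.5):
    """Place the virtual camera above and behind the object, looking at it."""
    d = move_dir / (np.linalg.norm(move_dir) + 1e-9)      # estimated front direction
    eye = obj_pos - back * d + np.array([0.0, 0.0, up])   # behind and above (z up)
    return eye, obj_pos                                   # camera position, look-at target

def orbiting_view(obj_pos, t, radius=2.0, up=1.5, omega=0.5):
    """Orbit the virtual camera around a stationary object like a satellite."""
    angle = omega * t
    eye = obj_pos + np.array([radius * np.cos(angle), radius * np.sin(angle), up])
    return eye, obj_pos
```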
4 Implementation of Sat-Cam System
As shown in Figure 5, we have implemented a distributed system using eight PCs
and six calibrated cameras (three SONY EV-100 color cameras and three AVIO
IR-30 thermal cameras). The system is realized with commercially available hardware. Six portable Celeron 800-MHz PCs are used to capture the video
streams and segment objects. The segmented information is sent via UDP over
a 100-Mb/s network to the modeling PC. The modeling and rendering PCs
have Pentium-4 CPUs, and GeForce-4 FX5200 and Quadro FX1000 graphic
accelerators, respectively.
The segmentation information from each camera has a resolution of 180×120, and the 3D space has a resolution of 256 × 128 × 256 on a 2 cm voxel grid. It covers an area of about 25 m2 and a height of 2.5 m. A 2 cm voxel grid was the smallest grid that could be processed by our current system in real time, since the computational cost increases dramatically as the resolution of the model increases. We set the parameters of the segmentation process as follows: 5 as the threshold for the frame difference, 100 as the smallest region size, and 5 as the elasticity used to extract the silhouette.
Table 1 shows a run-time analysis of our algorithm. The times listed are the average times when a single target exists in the working space. Capturing the video takes the most time in the segmentation part, 3D modeling in the modeling part, and texture mapping in the rendering part. The bottleneck of the system is the 3D modeling process, since it is performed in 3D space. Nevertheless, the frame rate of the whole system is about 10 frames per second, although it depends on the complexity of the objects.
Table 1. Run-time analysis (msec)

Segmentation                  3D Modeling                   Rendering
Function          Time        Function            Time      Function            Time
Capturing         66.15       Receiving            0.23     Receiving            0.31
Segmentation       2.44       Initialization      32.70     Capturing texture   67.02
Closing            1.35       3D modeling         85.29     Rendering           85.70
Elimination        5.28       Labeling             1.76     Flushing             0.04
Silhouette         7.48       Tracking             0.17
Transmission       0.95       Transmission         1.84
Total          83.65 ms       Total           121.99 ms     Total           103.07 ms
Frame/sec      11.95 f/s      Frame/sec        8.20 f/s     Frame/sec        9.70 f/s
Fig. 6. 3D modeling results: the upper row shows the segmentation results using thermal cameras; the lower row shows the results using color cameras. The reconstructed 3D model is shown in the rightmost cell.
Figure 6 shows snapshots of segmented images and a constructed 3D model.
Generally, thermal cameras provide more reliable segmentation information. Therefore, we assigned higher priority to the information from the thermal cameras; this priority can be adjusted, since their reliability may decrease when people wear warm clothing. The rendered scenes from the rendering PC are shown in Figure 7.
Fig. 7. Rendered scenes by Sat-Cam
5 Conclusions and Future Works
We proposed a novel video capturing system called Sat-Cam that can observe
and record the users’ activities from effective viewpoints to negate the effects of
unsteady mobile video cameras or switching between a large number of videos in
a given environment. By using real-time imagery from multiple cameras, the proposed system generates virtual views fixed in relation to the object. The system
provides stable scenes that enable viewers to understand the user's activity in real time. In future work, we will consider evaluating Sat-Cam's usefulness for sharing life-log information between users.
References
1. Xu, L.Q., Lei, B.J., Hendriks, E.: Computer Vision for a 3-D Visualisation and
Telepresence Collaborative Working Environment. BT Technology Journal, Vol. 20,
No. 1 (2002) 64-74
2. Winer, L.R., Cooperstock, J.R.: The "Intelligent Classroom": Changing Teaching and Learning with an Evolving Technological Environment. Journal of Computers and Education, Vol. 38 (2002) 253-266
3. Kuwahara, N., Noma, H., Kogure, K., Hagita, N., Tetutani, N., Iseki, H.: Wearable
Auto-Event-Recording System for Medical Nursing. Proc. INTERACT’03 (2003)
805-808
4. http://www.virtue.eu.com/
5. Kawashima, T., Nagasaki, T., Toda, M., Morita, S.: Information Summary Mechanism for Episode Recording to Support Human Memory. Proc. PRU, (2002) 49-56
6. Nakamura, Y., Ohde, J., Ohta, Y.: Structuring Personal Activity Records based on
Attention - Analyzing Videos from Head-mounted Camera. Proc. 15th ICPR, (2000)
220-223
7. Yamazoe, H., Utsumi, A., Tetsutani, N., Yachida, M.: Vision-based Human Motion
Tracking using Head-mounted Cameras and Fixed Cameras for Interaction Analysis.
Proc. ACCV, Vol.2 (2004) 682-687
8. Ohta, Y., Sugaya, Y., Igarashi, H., Ohtsuki, T., Taguchi, K.: Share-Z: Client/Server
Depth Sensing for See-Through Head-Mounted Displays, PRESENCE, Vol. 11, No.
2 (2002) 176-188
9. Kanade, T., Rander, P.W., Narayanan, P.J.: Virtualized Reality: Constructing Virtual Worlds from Real Scenes. IEEE Multimedia, Vol. 4, No. 1 (1997) 34-47
10. Wada, T., Wu, X., Matsuyama, T.: Homography Based Parallel Volume Intersection: Toward Real-Time Volume Reconstruction Using Active Cameras. Proc. of
Computer Architectures for Machine Perception 2000, (2000) 331-339
11. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-Based
Visual Hulls, ACM SIGGRAPH 2000, (2000) 369-374
12. Chien, S.I., Ma, S.Y., Chen, L.G.: Efficient Moving Object Segmentation Algorithm
using Background Registration Technique. IEEE Trans. on CSVT, Vol. 12, No. 7
(2002) 577-586
13. Kumar, P., Sengupta, K., Ranganath, S.: Real Time Detection and Recognition
of Human Profiles using Inexpensive Desktop Cameras. Proc. 15th ICPR, Vol. 1
(2000) 1096-1099
14. Kitahara, I., Ohta, Y.: Scalable 3D Representation for 3D Video Display in a
Large-scale Space. Proc. IEEE Virtual Reality (2003) 45-52
15. Everitt, C.: Projective Texture Mapping. NVIDIA SDK White Paper