Using Computer Vision Technology to Play Gramophone Records
Baozhong Tian
Department of Computer Science
Azusa Pacific University
Azusa, CA 91702-7000, USA
[email protected]
John L. Barron
Department of Computer Science
The University of Western Ontario
London, Ontario, Canada, N6A 5B7
[email protected]
Abstract
We present a non-contact optical flow based method to reproduce the sound signal from gramophone records using
3D robust scene reconstruction of the surface orientation of the walls of the grooves. The conversion of analogue data
to digital data is an important task for the preservation of historic sound recordings. We digitally viewed the grooves
of a record using a microscope that was modified to overcome the limitation of a shallow depth of field by using a
thin glass plate to obtain part of the image at a second focal length to gain better overall quality images of the groove.
The sound signal was recovered from the groove surface orientation. The overall algorithm has been tested on both undamaged and damaged real records and found to work correctly.
Keywords: Sound Signal Reproduction, Surface Reconstruction from Optical Flow, Gramophone Record, Robust Estimation, Surface Orientation, Optical Signal Reproduction/Retrieval from Gramophone Discs/Mechanical Sound Carriers.
1 Introduction
Reproducing sound mechanically started as early as 1877 on cylinder-shaped carriers, and in 1887 on disk-shaped gramophone records. The technology of recording and retrieving acoustic signals on gramophone records reached its peak during the 1970's, just before the digital format compact disc (CD) began to dominate the mass marketing of music. Although the audio quality of a CD is judged to be very good by most people, some audiophiles believe that the sampling rate of a CD (44.1kHz) is not high enough to reproduce the rich musical information faithfully. Today there is still some high-end record playing equipment in production.
There are many historical recordings that need to be archived. The problem with some of these recordings is that they have become so fragile with age and use that they cannot tolerate being played back on a traditional-style turntable with a mechanical stylus. This problem motivates research on non-contact record playing systems.
A non-contact playback system may have the following additional uses:
1. the replay of broken discs or cylinders or otherwise damaged carriers, e.g. lacquer discs in a state just before flaking off, which cannot be played mechanically, and
2. optical replay at multiple speeds.
The work presented here is a step in these directions.
1.1 Traditional Method of Sound Reproduction
We use stereo gramophone long-playing records (stereo LP) to illustrate how sound reproduction is performed. During
the record cutting procedure, the left and right channel signals control the speed of the cutting stylus in a +45/-45 lateral manner, i.e. as a composition of two mutually perpendicular speeds, while the record rotates at a constant speed. This is called modulation of the grooves. The movement of the stylus determines the slopes in the tangential direction of the groove walls. During the cutting process the left and right groove walls' modulations are kept independent of each other. When the playback stylus has a setup similar to that of the cutting stylus, stereo signals can be reproduced. The electrical signal outputs are proportional to the +45/-45 lateral speeds of the stylus while it rides along the groove and is modulated by the groove walls.
Figure 1: Illustrations of the movement of the record and the stylus: (a) top view, (b) cross section view.
Figure 1a illustrates the top view of the movements of a record and a stylus. The stylus has a tangential speed $V_T$ relative to the groove due to the record rotation $\vec{\omega}$. There are also left and right lateral movements of the stylus ($V_L$ and $V_R$) in the +45/-45 directions. Figure 1b shows a cross section view of the compound +45/-45 lateral movement of the stylus.
The major task of sound reproduction is to track the groove walls as precisely as possible. The conventional method
uses a diamond-tip stylus to run along the V-shaped groove by applying a certain tracking force on the stylus. The
problem is that the stylus has some weight, so the tracking of high frequency signals is more difficult. Other problems include groove damage such as scratches and small particles that result in annoying clicks, pops and degradation of sound over time, and the maintenance of the correct settings of the turntable, tone arm, cartridge and stylus during playback (which may require frequent adjustments).
1.2 Literature Survey of Non-Contact Record Playing Methods
[Iwai et al., 1986] developed a laser beam reflection method to reproduce the sound from all kinds of old (repaired)
wax phonograph cylinders. They used their method to reproduce the recorded talk and songs of the Ainu people in
Sakhalin and Hokkaido, Japan, made by the Polish anthropologist Bronislaw Pilsudski from 1902 to 1905 on standard
Edison wax cylinders. An optical beam, incident onto the groove cut on the wax cylinder, reflects to a detection
plane that is perpendicular to the optical axis. The time-varying position of the intersection of this reflected beam
with the detection plane while the cylinder is rotated (played) corresponds to the sound signal. The final sound signal
is then obtained by filtering this signal with a frequency equalizer. Problems such as the fidelity of the recovered
sound (decreased beamwidths result in enhanced fidelity and loudness while increased beamwidth results in loss of
consonant sounds and speckle noise due to the elimination of higher frequencies caused by the smoothing effect on the
time-varying intersection point position variation), the existence of speckle noise (due to recrystallization of the wax
caused by poor storage conditions for 100+ years), the echo of reproduced sounds (caused by increased beamwidths
overlapping multiple grooves) and the occasional tracking error (caused by the illuminating beam improperly mistracking a groove over time) had to be resolved (usually by appropriate parameter settings and/or additional hardware).
[Kessler and Ziegler, 1999, Stanke and Kessler, 2000, Ziegler, 2000] describe a contact-less method to play back the copper negatives of Edison cylinder phonograms ("galvanos") in the Berlin Phonogramm-Archiv. Note that for these negatives, the grooves on the original wax cylinders are now ridges on the galvanos. Image processing was used for tracking these ridges on the copper galvanos. A direct galvano player, consisting of an endoscope and a diamond stylus to play the record, was constructed. The sound quality is reported to be as good as that of the original cylinders played on a modern cylinder player with a diamond stylus. More information is available at two websites^1.
ELP corporation [ELP, 1997] spent ten years developing a laser turntable (invented by Robert E. Stoddard et al.
[Stoddard and Stark, 1989, Stoddard, 1989]) utilizing five laser beams to track the microgroove optically. This is a
pure analogue process, but it is so sensitive to foreign particles in the groove and on the record surface that it requires
the record to be cleaned every time it is played. This ELP laser turntable uses two of the five laser beams to read the groove walls and the other three laser beams for groove tracking. This approach has two main advantages: the laser beams are weightless and can be made as thin as 2µm in diameter, which is much thinner than a high-end stylus (4-12µm). However, the system is very complicated and expensive and it only works well with black records because of the reflective nature of the material. Coloured records may produce unpredictable results [ELP, 1997].
Because the laser turntable is very expensive (in the price range of a small car) and because it is very sensitive to the cleanliness and color of the record, it is not judged to be a feasible solution. Other research has studied the feasibility of reproducing the sound signal by image processing methods. In 2002, Ofer Springer [Springer, 2002] proposed an idea he called the virtual gramophone. Springer's idea is to scan an image of the record and write a decoder to apply a "virtual needle" following the groove spiral form. However, when the authors listened to a sample decoded sound^2, the music was judged to be barely recognizable. Inspired by Springer's idea, a group of Swedish students [Olsson et al., 2003] developed a system that used more sophisticated digital signal processing methods, such as FIR Wiener filtering and spectral subtraction, to reduce the noise level in the reproduced sound, resulting in a better reproduction than that of Springer's. Both systems used an off-the-shelf scanner, which limited the resolution of the images to a maximum of 2400dpi or 10µm per pixel. At this resolution, the quantization noise is quite high because the maximum lateral travel of the groove is about 150µm.
[Fadeyev and Haber, 2003] developed a 2D method to reconstruct mechanically recorded sound by image processing. The resolution was greatly improved by the aid of micro-photography. Their algorithm detects the groove bottom as an edge in the image and then differentiates the bottom edge shape to reproduce sound signals. The groove bottom edges are not always well defined and are sometimes distorted by dirt particles. The groove walls, which contain much of the sound information, were ignored. This project resulted in a fast optical scanner for disc records called I.R.E.N.E. (Image, Reconstruct, Erase Noise, Etc.). They also introduced a 3D method to reproduce vertically modulated records such as wax cylinders [Fadeyev et al., 2005]. It uses 3D profile scanning provided by a laser confocal scanning microscope^3. For cylinders, 3D scanning is necessary because the audio is stored in the vertical modulations of the cylinder's surface. In general, even for standard records, 3D scanning is better than 2D scanning because it allows the entire surface to be analyzed rather than just a projection or slice of it.
[Cavaglieri et al., 2001, Stotzer et al., 2003, Stotzer et al., 2004] developed a 2D method they called the VisualAudio concept. A picture of the record was taken using a large format film that was as big as the record. The film was then scanned using a rotating scanner, which is actually a line scan camera positioned above the film while the film is being rotated on a turntable. Edges were then detected from the digitized image and then sound signals were computed from the edges. Unlike the method of [Fadeyev and Haber, 2003], they used the groove and surface intersection as the edge, instead of using the groove bottom. This gives them the capability to reproduce the sound from stereo 33 rpm recordings. Also, the use of the rotating scanner eliminated the need for adjusting the sample rate as the groove turned close to the record's center. The images were rectangular, not circular as they would be from a flat-bed scanner. A 4X magnifier was fitted to the rotating scanner to get the desired image resolution. A Signal to Noise Ratio (SNR) analysis showed that a satisfying SNR of 40dB can be achieved if the standard deviation ($\sigma_n$) of the edge position noise is kept below 1.28µm. However, listening to the reproduced sound clips from their web site^4 indicated that the noise level needs to be further reduced.

^1 www.gfai.de and www.chrosterhamp.se/phono/stank/html
^2 www.cs.huji.ac.il/~springer/
^3 irene.lbl.gov/
^4 www.eif.ch/visualaudio/
[Laborelli et al., 2007] proposed a contact-less optical playing method using structured colour illumination. A
region of a record is illuminated by beams of light rays, where the colour of a light beam is dependent on the direction
of incidence. These beams are reflected by the record groove wall towards a camera, allowing direct access to the
audio signal via colour image decoding. Structured illumination also allows the exploitation of the height information
of groove walls of the record. The authors claim that their method is also potentially advantageous for the detection of
dust occlusion and the automatic interpolation of fractured records.
2 Proposed Method
We propose a sound reproduction method based on Computer Vision technologies such as optical flow and surface
reconstruction. The proposed method uses a microscope to obtain a sequence of magnified images of the groove walls
and uses 3D scene surface reconstruction to calculate the slopes of the walls. Figure 2 shows the system diagram.
[System diagram: Image Sequence Acquisition → Optical Flow → Depth Map → Surface Orientation → Raw Sound Signal → Digital Signal Processing (DSP, optional), with Groove Tracking and Motor Controllers driving the acquisition.]
Figure 2: System diagram.
The major features of the proposed method can be summarized as:
• Using as much information on the record as possible to reproduce the sound. Plenty of information is stored in the surface orientation of the groove walls, which is not used by 2D methods during their scanning/photographing processes. A 2D method only computes detectable edges such as a groove's bottom or groove-surface (land) intersections.
• Computer Vision technologies such as optical flow and depth map estimation are applied to this problem to
obtain the 3D information characterizing the groove, thus eliminating the requirement for a specialized 3D
scanning device.
• Robust estimation techniques help choose the best areas of the groove wall for the computation and reject noisy
areas which have been damaged by scratches and dirt particles, reducing the level of the noise and improving
the quality of the reproduced sound.
We discuss the individual system components below.
3 Image Sequence Acquisition
We use groups of overlapping image sequences to cover the entire groove and allow an optical flow calculation for the
frames in each group. Since images are 1280 pixels wide, we use an overlap of 100 pixels between the center images in each group. We obtained each group of images by first acquiring a short avi video clip about a groove position and then using an image processing program (ImageMagick) to separate out 36 individual color images from the avi file. Because color adds little to optical flow calculations [Barron and Klette, 2002], these color images were immediately converted to grayvalue images to reduce processing and storage costs, before any optical flow calculation is done. Each pair of consecutive images in the same group differs only by a translation. When the camera moves to the next segment of a groove to capture another group of images, the motion is such that the last image of the current group differs from the last image of the previous group by a small amount, so that the reconstructed groove from the optical flow fields is continuous when the groups are paired together.
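For concreteness, here is a minimal sketch of this splitting and grayscale conversion step. It assumes the OpenCV library rather than ImageMagick, and the function name and frame count are illustrative only:

```python
import cv2  # an assumption: any video reader and grayscale converter would do

def extract_gray_frames(avi_path, n_frames=36):
    """Split a captured avi clip into at most n_frames grayvalue images."""
    cap = cv2.VideoCapture(avi_path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        # color adds little to optical flow, so convert to grayvalue immediately
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames
```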
Figure 3: Two pieces of groove from (a) a 78 rpm SP record and (b) a 33 rpm LP record. The magnification factor is 60X.
Figure 3 shows images of grooves under a microscope. The magnification factor of the microscope is set to be
such that the field of view covers about 600µm in width so that for a camera with 640 × 480 pixel resolution, the
horizontal spatial resolution is about 1µm. The illumination is set so that the groove walls are bright while the record
surface and the groove bottom are dark. In our actual computations, we use a better microscope and a camera with higher resolution and magnification to significantly improve the image quality.
With our current setup, only a few seconds of sound are recovered, corresponding to just 3 or 4 revolutions of groove, so the scanning resolution changes little relative to the current tracking radius. As a record is scanned from its outer most edge to its inner edges, there are more pixels per wavelength in the outer grooves than in the inner grooves. As a result, the scanning resolution should be compensated according to the radius as the scanning head rotates. Our microscope is set at 400X, which gives us high enough resolution to prevent speed fluctuations within one group of scans. Accurate alignment between neighboring groups can also reduce speed fluctuations when sound pieces are stitched together.
4 Image Preprocessing
The images need to be preprocessed, i.e. we need to compute the image intensity derivatives and the optical flow
fields, etc. before we can compute the depth maps. We experimented with implementations of two standard differential
optical flow techniques, namely those of [Horn and Schunck, 1981] and [Lucas and Kanade, 1981] with differentiation
by [Simoncelli, 1994]. Those results are reported in a PhD thesis [Tian, 2006, Tian, 2008]. We obtained the best optical
flow using Horn and Schunck’s algorithm, adapted to our problem by imposing surface constraints that arise from
computing optical flow on the images of record groove walls, i.e. a local surface planarity constraint and a constraint
arising from the "V" shape of groove cross sections. All differential optical flow methods are based on the motion constraint equation. Briefly, we describe this constraint equation, describe the Lucas and Kanade and Horn and Schunck optical flow methods in light of this constraint, describe how we perform intensity differentiation and finally describe our optical flow algorithm, which is Horn and Schunck based but uses an additional groove orientation constraint.
4.1 The Motion Constraint Equation
Let $I(x, y, t)$ be the intensity function at pixel $(x, y)$ at time $t$. If we assume that the brightness pattern in the local neighborhood about $I(x, y)$ at time $t$ moves as a simple translation $(\delta x, \delta y, \delta t)$, then the brightness patterns about $I(x + \delta x, y + \delta y, t + \delta t)$ and $I(x, y, t)$ should be the same:
$$I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t). \quad (1)$$
Then a 1st order Taylor series expansion yields:
$$I(x + \delta x, y + \delta y, t + \delta t) = I(x, y, t) + I_x \delta x + I_y \delta y + I_t \delta t, \quad (2)$$
where $I_x$, $I_y$ and $I_t$ are the spatio-temporal image intensity derivatives. Using Equation (1), we can re-write Equation (2) as:
$$I_x \delta x + I_y \delta y + I_t \delta t = 0. \quad (3)$$
Dividing $\delta x$ and $\delta y$ by $\delta t$ yields the motion constraint equation:
$$I_x u + I_y v + I_t = 0, \quad (4)$$
where $u = \frac{\delta x}{\delta t}$ and $v = \frac{\delta y}{\delta t}$. This can be expressed more compactly as:
$$\nabla I \cdot \vec{v} + I_t = 0, \quad (5)$$
where $\nabla I = (I_x, I_y)$ is the spatial intensity gradient and $\vec{v} = (u, v)$ is the optical flow vector (or image velocity)
at pixel (x, y). The motion constraint equation is the basis for most differential optical flow equations. It specifies that
at each image pixel, its optical flow vector is constrained to be a point lying on a line. An additional constraint on the
optical flow is required to resolve this ambiguity (called the aperture problem).
4.2 Lucas and Kanade Optical Flow
To resolve the aperture problem, [Lucas and Kanade, 1981] assume the motion of a pixel is constant in some local neighbourhood about that pixel. Thus, two (or more) sets of different derivatives in this neighbourhood yield a non-singular linear (least-squares) system of equations whose solution gives values for $(u, v)$. The main problem with this algorithm is that the neighbourhoods have to be small in order to satisfy the constant motion assumption, making optical flow calculations very sensitive to noise. One can discriminate between "good" and "bad" optical flow vectors on the basis of the smallest eigenvalue of the least squares integration matrix [Barron et al., 1994], but note that one then typically obtains sparse optical flow fields, which are not that useful for our application.
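A minimal sketch of this per-pixel least-squares solution follows; the eigenvalue threshold value is our assumption, not a figure from [Barron et al., 1994]:

```python
import numpy as np

def lucas_kanade_pixel(Ix, Iy, It, eig_threshold=1.0):
    """Solve I_x u + I_y v + I_t = 0 in the least-squares sense over a
    neighbourhood; Ix, Iy, It are 1D arrays of derivatives in the window."""
    A = np.stack([Ix, Iy], axis=1)      # one motion-constraint row per pixel
    ATA = A.T @ A
    # the smallest eigenvalue discriminates "good" from "bad" flow vectors
    if np.linalg.eigvalsh(ATA)[0] < eig_threshold:
        return None                     # unreliable: leave the field sparse here
    u, v = np.linalg.solve(ATA, -A.T @ It)
    return u, v
```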
4.3 Horn and Schunck Optical Flow
[Horn and Schunck, 1981] resolve the aperture problem by using a global smoothness constraint in addition to the motion constraint equation to compute 100% dense flow. That is, they combine the gradient constraint in Equation (4) with a global smoothness term to constrain the estimated velocity field $\vec{v} = (u, v)$, minimizing
$$\int_D (\nabla I \cdot \vec{v} + I_t)^2 + \alpha^2 \left( \|\nabla u\|_2^2 + \|\nabla v\|_2^2 \right) \, dx \, dy \quad (6)$$
over a domain $D$ (the image), where the magnitude of the Lagrange multiplier $\alpha$ reflects the influence of the smoothness term. We used $\alpha = 1.0$ and $\alpha = 10.0$ in our work. Gauss-Seidel iterative equations are used to minimize the Euler-Lagrange equations derived from Equation (6) and yield the optical flow field as:
$$u^{k+1} = \bar{u}^k - \frac{I_x [I_x \bar{u}^k + I_y \bar{v}^k + I_t]}{\alpha^2 + I_x^2 + I_y^2} \quad \text{and} \quad (7)$$
$$v^{k+1} = \bar{v}^k - \frac{I_y [I_x \bar{u}^k + I_y \bar{v}^k + I_t]}{\alpha^2 + I_x^2 + I_y^2}, \quad (8)$$
where $k$ denotes the iteration number, $u^0$ and $v^0$ denote initial velocity estimates which are set to zero, and $\bar{u}^k$ and $\bar{v}^k$ denote neighbourhood averages of $u^k$ and $v^k$ respectively. We use at most 100 iterations in our iterative calculations, which is sufficient to achieve convergence (the average norm of the difference in the flow fields between adjacent iterations $k$ and $k+1$ is less than some threshold $\tau$).
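The iterative updates (7) and (8) translate directly into code. Here is a minimal sketch; the neighbourhood-averaging kernel and stopping values are our assumptions:

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(Ix, Iy, It, alpha=1.0, max_iter=100, tau=1e-3):
    """Iterate Equations (7) and (8) until the flow change falls below tau."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    avg = np.array([[1/12, 1/6, 1/12],      # neighbourhood average for u_bar,
                    [1/6,  0.0, 1/6 ],      # v_bar (centre pixel excluded)
                    [1/12, 1/6, 1/12]])
    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(max_iter):
        u_bar, v_bar = convolve(u, avg), convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / denom
        u_new, v_new = u_bar - Ix * common, v_bar - Iy * common
        delta = np.mean(np.hypot(u_new - u, v_new - v))
        u, v = u_new, v_new
        if delta < tau:                      # average flow change is small
            break
    return u, v
```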
4.4 Intensity Differentiation
Differentiation (to compute Ix , Iy and It ) was done using Simoncelli’s [Simoncelli, 1994] matched balanced filters
for low pass filtering (blurring using the (p5 ) kernel) and high pass filtering (differentiation using the (d5 ) kernel) [see
Table 1 for the 5-tap filters we use]. Matched filters allow comparisons between the signal and its derivatives as the
high pass filter is simply the derivative of the low pass filter and this is hypothesized to yield more accurate derivative
values [Simoncelli, 1994]. Using these two masks, Ix is computed by applying p5 in the t dimension, then p5 to those
results in the y dimension and finally d5 to those results in the x dimension. Iy is computed in a similar manner. To
compute It , we first apply p5 in the x dimension for each of 5 adjacent images, then p5 again in the y dimension on
those 5 results and finally d5 in the t dimension on the x and y smoothed results. Before performing this filtering
we use a simple averaging filter $[\frac{1}{4}, \frac{1}{2}, \frac{1}{4}]$ to slightly blur the images (this reduces the 7 input images to 5 smoothed images). Simoncelli claims that, because both of his filters were derived from the same principles, more accurate derivatives result. He demonstrated this on the Yosemite Fly-Through sequence [Simoncelli, 1994].
n      p5       d5
0      0.036    -0.108
1      0.249    -0.283
2      0.431     0.0
3      0.249     0.283
4      0.036     0.108

Table 1: Simoncelli's 5-point Matched/Balanced Kernels
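As an illustration of the filtering order described above, here is a sketch of the $I_x$ computation on an image stack; the (t, y, x) array layout and the helper name are our assumptions:

```python
import numpy as np
from scipy.ndimage import convolve1d

# Simoncelli's matched 5-tap kernels from Table 1
p5 = np.array([0.036, 0.249, 0.431, 0.249, 0.036])    # low pass (blur)
d5 = np.array([-0.108, -0.283, 0.0, 0.283, 0.108])    # high pass (derivative)

def intensity_x_derivative(stack):
    """Ix for an image stack indexed (t, y, x): p5 in t, p5 in y, d5 in x."""
    out = convolve1d(stack, p5, axis=0)   # smooth temporally
    out = convolve1d(out, p5, axis=1)     # smooth in y
    return convolve1d(out, d5, axis=2)    # differentiate in x
```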
4.5 Computing Regularized Optical Flow using a Groove Orientation Constraint
Our image sequences are generated by a single stationary camera taking images of a record surface as it rotates
underneath the camera. This is equivalent to a moving camera taking images of a stationary record surface, except now
the flow field is in the opposite direction. The standard image velocity equations [Longuet-Higgins and Prazdny, 1980] relate a velocity vector $\vec{v} = (u, v)$, measured at image location $\vec{p} = (x, y, f) = f\vec{P}/Z$ [i.e. the perspective projection of a 3D point $\vec{P} = (X, Y, Z)$], to the 3D sensor translation $\vec{U} = (U_1, U_2, U_3)$ and 3D sensor rotation $\vec{\omega} = (\omega_1, \omega_2, \omega_3)$ parameters. We can rewrite these standard equations as the sum of the translational and rotational image velocity components:
$$\vec{v}(\vec{p}, t) = \vec{v}_T(\vec{p}, t) + \vec{v}_R(\vec{p}, t), \quad (9)$$
where $\vec{v}_T$ and $\vec{v}_R$ are the translational and rotational components of image velocity:
$$\vec{v}_T(\vec{p}, t) = A_1(\vec{p}) \frac{\vec{U}}{Z} \quad \text{and} \quad \vec{v}_R(\vec{p}, t) = A_2(\vec{p})\, \vec{\omega}(t), \quad (10)$$
where:
$$A_1(\vec{p}) = \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix} \quad (11)$$
and
$$A_2(\vec{p}) = \begin{pmatrix} \frac{xy}{f} & -(f + \frac{x^2}{f}) & y \\ (f + \frac{y^2}{f}) & -\frac{xy}{f} & -x \end{pmatrix}. \quad (12)$$
Because the camera motion is a simple translation along the positive x axis, i.e. $\vec{U} = [U_1, 0, 0]^T$ and $\vec{\omega} = (0, 0, 0)$, in our application $Z$ can be computed as:
$$Z = -\frac{f U_1}{u}. \quad (13)$$
We can differentiate $Z$ with respect to $u$ as:
$$\frac{\partial Z}{\partial u} = \frac{f U_1}{u^2}. \quad (14)$$
Since $y = f\frac{Y}{Z}$, we obtain:
$$\frac{\partial y}{\partial Y} = \frac{f}{Z} = \frac{f}{-\frac{f U_1}{u}} = -\frac{u}{U_1}. \quad (15)$$
Because of the shape of the cutting head (a "V", i.e. a 45° slope) the surfaces of the groove are planar and:
$$\frac{\partial Z}{\partial Y} = \frac{1}{c}, \quad (16)$$
where $c = \pm 1$ depending on the side of the groove surface. Figure 4 illustrates this groove surface constraint.
Figure 4: Illustration of the groove surface constraint due to the shape of the groove.
From Equation (16), we have:
$$\frac{\partial Z}{\partial Y} = \frac{\partial Z}{\partial u} \frac{\partial u}{\partial y} \frac{\partial y}{\partial Y} = \frac{1}{c}. \quad (17)$$
Substituting Equations (14) and (15) into Equation (17), we obtain:
$$\frac{\partial u}{\partial y} = -\frac{u}{c \cdot f}. \quad (18)$$
Because of the groove orientation constraint in Equation (18), we modify Horn and Schunck's regularization functional to:
$$\int_D (\nabla I \cdot \vec{v} + I_t)^2 + \alpha^2 \left[ \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y} + \frac{u}{cf}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right]. \quad (19)$$
Using the approximations $\nabla^2 u \approx \bar{u} - u$ and $\nabla^2 v \approx \bar{v} - v$ [Horn and Schunck, 1981], we obtain the Euler-Lagrange equations:
$$\left(I_x^2 + \frac{\alpha^2}{c^2 f^2} + \alpha^2\right) u + I_x I_y v = \alpha^2 \bar{u} - I_x I_t \quad \text{and} \quad (20)$$
$$I_x I_y u + (I_y^2 + \alpha^2) v = \alpha^2 \bar{v} - I_y I_t. \quad (21)$$
The Gauss-Seidel method was used to obtain the iterative equations:
$$u^{n+1} = \bar{u}^n - \frac{\left(I_x^2 + \frac{1}{c^2 f^2} I_y^2 + \frac{\alpha^2}{c^2 f^2}\right) \bar{u} + I_x I_y \bar{v} + I_x I_t}{I_x^2 + \frac{1}{c^2 f^2} I_y^2 + \frac{\alpha^2}{c^2 f^2} + I_y^2 + \alpha^2} \quad \text{and} \quad (22)$$
$$v^{n+1} = \bar{v}^n - \frac{\left(I_y^2 + \frac{1}{c^2 f^2} I_y^2\right) \bar{v} + I_x I_y \bar{u} + \left(1 + \frac{1}{c^2 f^2}\right) I_y I_t}{I_x^2 + \frac{1}{c^2 f^2} I_y^2 + \frac{\alpha^2}{c^2 f^2} + I_y^2 + \alpha^2}. \quad (23)$$
Figure 5 shows the optical flow field computed using the above algorithm on a set of synthetic groove images.
Figure 5: The computed optical flow by the modified Horn & Schunck regularization using the groove orientation constraint, with a spatial gradient threshold $\|\nabla I\|_2 \geq 2.0$ and a Lagrange multiplier $\alpha = 1.0$. The flow field was subsampled by 8 and scaled by 6.0.
5 Depth Map Computation
We performed a survey [Tian and Barron, 2005] of recent algorithms for dense depth maps (from image velocities
or intensity derivatives) which appeared to give good results in the literature. All of these algorithms assume known
camera translation and rotation (or can be made to have this assumption). We looked at algorithms by [Heel, 1990],
[Matthies et al., 1989], [Hung and Ho, 1999] and [Barron et al., 2003]. A detailed description of the implementation
of these algorithms and their performance can be found in a conference paper [Tian and Barron, 2005]. Quantitative results show that the method of [Barron et al., 2003] was the best overall, although [Matthies et al., 1989] was very competitive.
5.1 Barron, Ngai and Spies
Barron, Ngai and Spies [Barron et al., 2003] proposed a Kalman filter framework for recovering a dense depth map from the time-varying optical flow fields generated by a camera translating over a scene by a known amount. They assumed local neighbourhood planarity to avoid having to compute non-pixel correspondences. That is, the surface orientation (of a plane) is what is tracked by the Kalman filter over time.
We have already seen that the standard image velocity equations [Longuet-Higgins and Prazdny, 1980] relate an image velocity vector measured at image location $\vec{p} = (x, y, f) = f\vec{P}/Z$ to the 3D sensor translation $\vec{U}$ and 3D sensor rotation $\vec{\omega}$: $\vec{v}(\vec{p}, t) = \vec{v}_T(\vec{p}, t) + \vec{v}_R(\vec{p}, t)$, where $\vec{v}_T$ and $\vec{v}_R$ are the translational and rotational components of image velocity.
We define the depth scaled camera translation as
$$\vec{u}(\vec{p}, t) = \frac{\vec{U}(t)}{\|\vec{P}(t)\|_2} = \hat{u}\, \mu(\vec{p}, t), \quad (24)$$
where $\hat{u} = \frac{\vec{U}}{\|\vec{U}\|_2} = (u_1, u_2, u_3)$ is the normalized direction of translation and $\mu(\vec{p}, t) = \frac{\|\vec{U}\|_2}{\|\vec{P}\|_2} = \frac{f \|\vec{U}\|_2}{Z \|\vec{p}\|_2}$ is the depth scaled sensor speed at $\vec{p}$ at time $t$. The focal length $f$ is assumed to be known. If we define 2 vectors:
$$\vec{r}(\vec{p}) = (r_1, r_2) = \frac{|\vec{v} - A_2(\vec{p})\vec{\omega}|}{\|\vec{p}\|_2} \quad \text{and} \quad (25)$$
$$\vec{d}(\vec{p}) = (d_1, d_2) = \frac{|A_1(\vec{p})\hat{u}|}{f}, \quad (26)$$
where $|\vec{A}|$ means each element in the vector is replaced by its absolute value, then we can solve for $\mu$ as a weighted average:
$$\mu = \frac{\frac{r_1}{d_1}|v_1| + \frac{r_2}{d_2}|v_2|}{|v_1| + |v_2|}. \quad (27)$$
5.1.1 Planar Orientation from Relative Depth
We compute the local surface orientation as a unit normal vector $\hat{\alpha} = (\alpha_1, \alpha_2, \alpha_3)$ from $\mu$ values as:
$$\hat{\alpha} \cdot \vec{p} = \frac{c\, \mu \|\vec{p}\|_2}{\|\vec{U}\|_2}. \quad (28)$$
We can solve for $\frac{\hat{\alpha}}{c}$ by setting up a linear system of equations, one for each pixel in an $n \times n$ neighbourhood where planarity has been assumed, and using a standard least squares solution calculation.
5.1.2 The Overall Calculation
We state the overall algorithm. At the initial time, t = 1:
1. We compute all the $\mu$'s as described in Equation (27).
2. In each $n \times n$ neighbourhood centered at a pixel $(i, j)$ we compute $(\frac{\hat{\alpha}}{c})_{(i,j)}$ at that pixel using Equation (28). We call these computed $\frac{\hat{\alpha}}{c}$'s the measurements and denote them as $\vec{g}_{M(i,j)}$.
3. Given these measurements, we use the $\vec{g}_{M(i,j)}$ to recompute the $\mu_{(i,j)}$'s as:
$$\mu_{(i,j)} = \frac{(\vec{g}_{M(i,j)} \cdot \vec{p}_{(i,j)}) \|\vec{U}\|_2}{\|\vec{p}_{(i,j)}\|_2}. \quad (29)$$
We apply a median filter to the $\mu_{(i,j)}$ within $5 \times 5$ neighbourhoods to remove outliers. We repeat step 2 once more with these values to obtain the final $\vec{g}$ values.
At time $t \geq 2$:
1. We compute $\mu$ at each pixel location and then compute all $\vec{g}_{M(i,j)}$'s in the same way described above for the new optical flow field. Using the image velocity measurements at time $t = i$, we use the best estimate of surface orientation at time $t = i - 1$ at location $\vec{p} - \vec{v}$ ($\Delta t = 1$), plus the measurement at $\vec{p}$ and its covariance matrix, to obtain a new best estimate at $\vec{p}$ at time $t = i$. We do this at all $\vec{p}$ locations (where possible), recompute the $\mu$ values via Equation (29) and output these as the 3D shape of the scene. At time $t = i$ we proceed as for time $t = 2$, except we use the best $\mu$ estimates from time $t = i - 1$ instead of time $t = 1$ in the Kalman filter updating.
5.1.3 The Kalman Filter Equations
Note that the components of $\frac{\hat{\alpha}}{c}$ in Equation (28) are not independent; thus we have a covariance matrix with non-zero off-diagonal elements in the Kalman filter equations. We use a standard set of Kalman filter equations to integrate the surface orientations (and hence depth) over time. Please see Barron, Ngai and Spies [Barron et al., 2003] for details of the Kalman filter equations.
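For intuition, here is a sketch of one predict/update cycle for the tracked $\frac{\hat{\alpha}}{c}$ state under a constant-state process model. All names, and the simplified process model itself, are our assumptions; see [Barron et al., 2003] for the actual equations:

```python
import numpy as np

def kalman_update(x_prev, P_prev, z_meas, R_meas, Q_process):
    """x: 3-vector state estimate (alpha/c), P: its 3x3 covariance,
    z: measurement g_M, R: measurement covariance, Q: process noise."""
    # predict: surface orientation is assumed constant between frames
    x_pred = x_prev
    P_pred = P_prev + Q_process
    # update: blend prediction and measurement according to their covariances
    K = P_pred @ np.linalg.inv(P_pred + R_meas)      # Kalman gain
    x_new = x_pred + K @ (z_meas - x_pred)
    P_new = (np.eye(len(x_prev)) - K) @ P_pred
    return x_new, P_new
```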
5.2 Experimental Technique
We generated ray-traced cube, cylinder and sphere image sequences with the camera translating to the left by (−1, 0, 0),
as shown in Figure 6.
Figure 6: Synthetic test data: (a) A marble-texture cube (b) A marble-texture cylinder and (c) A marble-texture sphere.
We marble textured these sequences so that optical flow could be computed. The texture is kept fixed to the object. We also generated a second set of image sequences with the same objects but with sinusoidal patterns instead of marble texture. These sequences allowed the correct derivatives to be computed; we used these, or the correct optical flow, to confirm the correctness of our implementations. We compute error distributions over three relative depth error ranges, ≤5%, between 5% and 15%, and ≥15%, for 4 frames (the 7th frame at the beginning of the sequences and the 19th, 27th and 36th frames in the middle and near the end of the sequences). We also compute the average error (as a percentage) and its standard deviation for the 4 frames.
5.3 Error Distributions and Depth Maps
Table 2 shows the error distributions for the three image sequences. We can see that as more images are processed by the Kalman filter, the results become more accurate. By the 36th image, about 85% of the depth values had less than 15% error. Figure 7 shows the raw depth maps for the 3 objects. We left these images unsmoothed and untexture-mapped to show how good the depth recovery really was.
Error %          <5%     5-15%   >15%    Mean±σ
7th cube         33.50   41.27   25.23   12.16±13.84
19th cube        51.63   37.56   10.81    7.19±8.51
27th cube        52.18   38.16    9.66    6.88±7.88
36th cube        52.76   35.53   11.71    7.23±8.67
7th cylinder     36.71   37.86   25.43   12.55±15.26
19th cylinder    54.26   33.25   12.48    7.29±8.87
27th cylinder    52.91   33.58   13.51    7.52±8.95
36th cylinder    52.10   34.67   13.23    7.46±8.50
7th sphere       26.03   36.38   37.59   16.91±18.18
19th sphere      41.45   38.82   19.73    9.64±10.49
27th sphere      43.99   38.33   17.68    9.01±9.91
36th sphere      45.75   37.59   16.66    8.68±9.75

Table 2: The percentage of the estimated depth values that have various relative error distributions in the experiments for the cube, cylinder and sphere using [Barron et al., 2003]'s algorithm. The last column shows the mean error and its standard deviation σ.
Figure 7: Depth maps for [Barron et al., 2003]’s algorithm for (a) the cube, (b) the cylinder and (c) the sphere at the
27th frame in the image sequences.
We report experimental results for Barron et al.'s algorithm on synthetic record groove images and on real groove images with encouraging results in this paper. Because the groove wall orientation can be described by 2 angles, one of which is constrained, and because the vertical component of image velocity is always very small (a uni-directional constraint), we believe imposing such constraints yields even better results than the original algorithm. In addition, [Barron et al., 2003] was modified to use only horizontal velocities. Now, effectively, we have only one angle of the surface orientation to track in the Kalman filter.
6 Robust Estimation of Surface Orientation
Surface orientation is computed from depth using a least squares framework. Assuming local planarity, the surface orientation $\hat{\alpha}$ of a local neighbourhood is constant and satisfies the planar equation $\hat{\alpha} \cdot \vec{P} = c$, where $\vec{P} = [X, Y, Z]$ is the 3D coordinate of a pixel and $c$ is a constant. We can solve this linear system using a robust estimation method called Local M-Estimates, as recommended in Press et al. [Press et al., 1992]. This will give more accurate surface orientation as outlier data will be suppressed.
We present a robust estimation formulation of this calculation as follows. We use the vector $\vec{g} = (g_1, g_2, g_3)$ to denote $\frac{\hat{\alpha}}{c} = (\frac{\alpha_x}{c}, \frac{\alpha_y}{c}, \frac{\alpha_z}{c})$. We can set up a least squares system:
$$W \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \vdots & \vdots & \vdots \\ a_{N1} & a_{N2} & a_{N3} \end{pmatrix} \begin{pmatrix} g_1 \\ g_2 \\ g_3 \end{pmatrix} = W \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{pmatrix}, \quad (30)$$
or
$$W A \vec{g} = W B, \quad (31)$$
where $W$ is an $N \times N$ diagonal matrix with diagonal elements acting as the weights for the $N$ equations, and
$$a_{i1} = X_i, \quad (32)$$
$$a_{i2} = Y_i, \quad (33)$$
$$a_{i3} = Z_i \quad \text{and} \quad (34)$$
$$b_i = 1. \quad (35)$$
The solution is $\vec{g} = (A^T W^2 A)^{-1} A^T W^2 B$.
Sound reproduction is as described in [Tian, 2006, Tian, 2008], except that the sampling rate is different than that of the synthetic signals. In a synthetic groove, the sampling rate is 220.5kHz because of the 10X interpolation of the sound signal when ray-tracing the groove. In the real record, the width of view in each captured image of the groove is about 0.5mm. Dividing by 1280 pixels, we obtain $\Delta_d = 0.39$µm per pixel. The tangential speed of the groove (at radius = 115mm) if played by a 78RPM record player is $V_t = \omega R = (2\pi \cdot 78/60) \cdot 115\text{mm} = 939.34$mm/s. The sample frequency at the pixel level is $f_s = \frac{V_t}{\Delta_d} = 2404.71$kHz, which is high enough to digitize the signal of a record. We take advantage of this oversampling factor and apply sub-sampling early in the calculation, at the depth recovery stage, to significantly reduce the computational burden.
There are 6000 groups of such images, with each group containing 12 images (using 36 images per group, as for the synthetic data, might improve the result slightly but the computing time would increase significantly) for the optical flow computation. It took nearly a week to capture these images and another week to process them (so currently we are nowhere near real-time; this work is a proof-of-concept only). About 3 seconds of music was reproduced in this time period.
Figure 8 shows the original and reconstructed piece of music from "A Fine Romance" by Kern-Fields, produced by Decca on a 78RPM record. The original music contains both vocal and instrumental sounds. The computed Pearson's product-moment correlation coefficient between them is $r = 0.559$. Pearson's product-moment correlation coefficient $r$ is computed as:
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \sqrt{\sum_i (y_i - \bar{y})^2}}, \quad (36)$$
where $x$ and $y$ are the sampled signals. Listening confirms^5 that the vocal sound is quite recognizable in spite of the presence of noise. The music part is lost in the noise, probably because it is already very weak during the singing section.
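Equation (36) is straightforward to evaluate; a minimal NumPy sketch:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient, Equation (36)."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd**2).sum() * (yd**2).sum())
```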
The weight matrix W plays a critical role in this robust estimation calculation. Initially, W is I (the Identity
Matrix) so that a rough solution is obtained.
^5 www.csd.uwo.ca/faculty/barron/AUDIOFILES
Figure 8: (a) The original sound wave recorded from a 78 rpm record player. (b) The recovered sound wave from
groups of groove images. (c) The wave peak envelope of the original sound. (d) The wave peak envelope of the
recovered sound. The horizontal axis represents the time and the vertical axis represents the magnitude.
Using this solution, we can refine $W$ using the Lorentzian estimator, $\rho$, as shown in [Black and Anandan, 1996]:
$$\rho(d_i, \sigma) = \log\left(1 + \frac{1}{2}\left(\frac{d_i}{\sigma}\right)^2\right) \quad (37)$$
and the influence function $\psi$ (which is the derivative of $\rho$):
$$\psi(d_i, \sigma) = \frac{2 d_i}{2\sigma^2 + d_i^2}, \quad (38)$$
where $\sigma$ is a scale parameter and $d_i$ is the residual value of each equation:
$$d_i = |a_{i1} g_1 + a_{i2} g_2 + a_{i3} g_3 - b_i|. \quad (39)$$
Then the weight matrix elements get updated as:
$$w_i = \frac{\psi(d_i, \sigma)}{d_i}. \quad (40)$$
We can re-calculate $\vec{g}$ again using the updated $W$. This procedure is repeated until one of the following stopping criteria is met:
• the total residual is smaller than some threshold: $\|d_i\|_2 < \tau_1$,
• the total residual begins to diverge: $\|d_i\|_{t-1} - \|d_i\|_t < \tau_2$, or
• the number of iterations reaches a limit.
The second threshold, $\tau_2$, which is a small positive number, allows the total residual to vary up and down a bit before the iterations are considered to be converging or diverging.
According to Black and Anandan, tuning the scale parameter $\sigma$ may work well given that the initial approximation for it is not too bad. Since, in the Lorentzian estimator, a residual $d_i$ is considered an outlier if $d_i \geq \sqrt{2}\sigma$, lowering $\sigma$ after each iteration will reveal more and more outliers. Another benefit of doing this is that the number of outliers identified could help us to determine whether the value of $\sigma$ is properly chosen and when to terminate the iterations.
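Putting Equations (30)-(40) together, the whole robust estimation is an iteratively reweighted least squares loop. Here is a minimal sketch; the threshold and iteration values are illustrative assumptions, and the fixed σ omits the per-iteration lowering discussed above:

```python
import numpy as np

def robust_surface_orientation(P, sigma=1.0, max_iter=20, tau1=1e-6, tau2=1e-3):
    """Estimate g = alpha/c from N 3D points P (an N x 3 array) with
    Lorentzian weights; rows of A are [X_i, Y_i, Z_i] and b_i = 1."""
    A, b = P, np.ones(len(P))
    w = np.ones(len(P))                 # W = I gives the initial rough solution
    prev_res = np.inf
    for _ in range(max_iter):
        W2 = np.diag(w ** 2)
        g = np.linalg.solve(A.T @ W2 @ A, A.T @ W2 @ b)   # (A^T W^2 A)^-1 A^T W^2 B
        d = np.abs(A @ g - b)                             # residuals, Eq. (39)
        res = np.linalg.norm(d)
        if res < tau1 or prev_res - res < tau2:           # stopping criteria
            break
        prev_res = res
        psi = 2.0 * d / (2.0 * sigma**2 + d**2)           # influence, Eq. (38)
        w = np.where(d > 1e-12, psi / d, 1.0)             # weights, Eq. (40)
    return g
```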
7 From Surface Orientation to Sound Signal
Once the surface orientations are computed, they need to be converted into sound signals so that they can be played. Figure 9 illustrates a piece of a groove, showing the left surface orientation $\hat{\alpha}_L$. The figure also shows the two angles ($\theta_{XY}$ and $\theta_{YZ}$) that determine $\hat{\alpha}_L$. Due to the +45/-45 modulation of the groove walls, $\theta_{YZ}$ is approximately 45 degrees at all times. Note also that locally the surfaces are planar. To extract the left channel signal, we observe that the surface orientation lies in the $z = y$ plane because of the +45/-45 stereo modulation. Accordingly, the surface orientation corresponding to the right channel lies in the $z = -y$ plane.
Figure 9: Illustration of a piece of groove showing the surface orientation α̂L of the left groove wall lies in the z = y
plane.
We define $\theta_L$ to be the modulation angle between the surface orientation $\hat{\alpha}$ and the left-channel-zero-modulation orientation $\hat{n}_L = [0, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}]$, which is the surface orientation of the left groove wall when the signal is zero. Then the ratio of the lateral speed $V_L$ and the tangential speed $V_T$ of the stylus is:
$$\frac{V_L}{V_T} = \tan \theta_L, \quad (41)$$
where
$$\theta_L = \arccos(\hat{\alpha}_L \cdot \hat{n}_L). \quad (42)$$
$V_L$ corresponds to the left channel signal and needs to be adjusted according to $V_T = \omega R$, where $R$ is the current distance to the record center. A similar method can be applied to reproduce the right channel signal, $V_R$, using the direction $\hat{n}_R = [0, -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}]$ instead of $\hat{n}_L$. A mono recorded gramophone record with a horizontal groove modulation can be treated as a special case of the stereo groove modulation, where $V_L = V_R = V$, so either the left or right surface orientation can be used to reproduce the sound signal.
For a mono SP or a wax cylinder with vertical groove modulation, we can project the surface orientation onto the vertical $x$-$z$ plane, i.e. $y = 0$, and $\theta$ and $V$ can be calculated using the above equations, except now:
$$\hat{\alpha} = [\alpha_x, 0, \alpha_z] \quad \text{and} \quad (43)$$
$$\hat{n} = [0, 0, 1]. \quad (44)$$
Due to the robustness of the algorithm introduced in Section 6, we anticipate that the algorithm will be able to reject or attenuate most of the noise, such as the pops and clicks caused by scratches or small dirt particles.
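As a summary of Equations (41) and (42), here is a sketch of computing one left-channel sample from a unit wall normal; the function and variable names are illustrative assumptions:

```python
import numpy as np

SQRT2_2 = np.sqrt(2.0) / 2.0
N_LEFT = np.array([0.0, SQRT2_2, SQRT2_2])   # zero-modulation left-wall normal

def left_channel_sample(alpha_left, omega, radius):
    """V_L from a unit surface normal alpha_left; omega is the angular speed
    (rad/s) and radius the current groove radius, so V_T = omega * radius."""
    cos_theta = np.clip(np.dot(alpha_left, N_LEFT), -1.0, 1.0)
    theta = np.arccos(cos_theta)           # Equation (42)
    return omega * radius * np.tan(theta)  # Equation (41): V_L = V_T tan(theta_L)
```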
8 Synthetic Evaluation
In order to test the above algorithms, we made 1390 groups of ray-traced groove images from a 2-second piece of a recorded sound file. Then we applied the above methods to reconstruct the surface structure and retrieve the sound from the reconstructed groove [Tian and Barron, 2006]. The simulation involves ray-tracing of groove images based on the shape features of the record grooves. We did not perform RIAA equalization on our recovered sound tracks, as this technique was introduced in 1955 and our 78 rpm records were manufactured circa 1930. At present, equalization has not been taken into account, and the sound has an excess of high frequency components. We applied a 2nd order lowpass Butterworth filter to the sound signal, where the filter response decreases by 12 dB per octave and has a cutoff frequency of $F_c = 3000$ Hz, to attenuate high frequency noise [Butterworth, 1930, Bianchi and Sorrentino, 2007]. However, in order to get a more satisfying sound, it may be preferable to modify the equalization after the normal preamp output or to apply a custom equalization to a flat transfer in the digital domain.
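The filtering step is standard; a sketch using SciPy's Butterworth design (the sampling rate fs is whatever the reproduction stage produced):

```python
from scipy.signal import butter, lfilter

def attenuate_highs(signal, fs, fc=3000.0):
    """2nd order lowpass Butterworth (12 dB per octave) with cutoff fc."""
    b, a = butter(2, fc / (fs / 2.0))   # cutoff normalized to the Nyquist rate
    return lfilter(b, a, signal)
```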
8.1 Reconstruction of the Depth and Recovery of the Sound
We fed the computed optical flow sequence to [Barron et al., 2003]’s depth recovery algorithm (discussed above),
which uses a Kalman filter to compute a smooth surface reconstruction. Figure 10 shows a recovered groove depth
map computed from optical flows using [Barron et al., 2003]’s depth recovery algorithm after 30 frames of optical
flow. The groove structure is well recovered.
Figure 10: The 3D perspective view of the reconstructed groove calculated from optical flow using the algorithm given by [Barron et al., 2003]. The horizontal axis represents the x axis and the vertical axis represents the y axis.
Figure 11 shows the synthetic sinusoid waves and the corresponding recovered sound waves for synthetic sinusoid waves at frequencies of 200Hz, 500Hz, 1KHz, 2KHz, 5KHz, 10KHz and 20KHz. For a sound wave at a frequency of 2KHz and less, the recovered wave looks good. For higher frequencies, such as 5KHz and greater, the noise eventually becomes so dominant that it overwhelms the true signal. Note that for the real data results (presented below), the sampling rate was 10 times greater. Hence, we can recover all the relevant frequency information up to 20KHz from the real data without aliasing problems.
Figure 11: Synthetic sinusoid groove walls for (a) 200Hz, (b) 500Hz, (c) 1KHz, (d) 2KHz, (i) 5KHz, (j) 10KHz and (k) 20KHz, and (e)-(h) and (l)-(n) their recovered sound waves respectively. The horizontal axis represents time and the vertical axis represents magnitude.
After the depth maps were computed, the surface orientations of the grooves were estimated using the algorithm described in Section 6. From the surface orientations, sound signals were computed as outlined in Section 7.
We tested our algorithm on 2 synthetic sound wave pieces: a male voice saying "Computers are useless, they only give you answers" and a short rendering of part of the Canadian anthem "O Canada". These two synthetic image sequences were generated using ray tracing (our program takes the camera speed (1,0,0), the group number, i.e. which segment of the groove, and the number of images per group (36) as input parameters). Since we use 7 images to compute 1 flow field, 36 images allow 30 optical flow fields to be computed: the error distribution of the depth estimation accuracy converged by the 27th depth map calculation in the above cube, cylinder and sphere experiments (see Figures 6 and 7 and Table 2).
Since we have the sound data, the shape of the groove can be determined from it. Then we ray trace it using a given focal length and distance from the view point. This is a computationally expensive process because we had to extend the ray gradually along a pre-determined path until it encountered a groove (the shape of a groove is arbitrary and we have no analytical means to solve for its coordinates other than a brute-force search).
The sound wave pieces for these 2 sequences, such as the one shown in Figure 12, were combined to form the total sound wave.
Figure 12: Sound wave: The recovered sound wave from one piece of synthesized groove shown as float numbers.
The horizontal axis represents the time and the vertical axis represents the magnitude.
The original sound wave and the recovered sound wave were compared using their wave peak envelopes, shown
in Figures 13 and 14.
Figure 13: (a) is the original sound wave of the male voice. (b) is the recovered sound wave of the male voice from
groups of groove images. (c) is the wave peak envelope of the original sound. (d) is the wave peak envelope of the
recovered sound. The horizontal axis represents the time and the vertical axis represents the magnitude.
Figure 14: (a) is the original sound wave of “O Canada”. (b) is the recovered sound wave from groups of groove
images. (c) is the wave peak envelope of the original sound. (d) is the wave peak envelope of the recovered sound.
The horizontal axis represents the time and the vertical axis represents the magnitude.
We compute the wave peak envelopes of the computed and the original waveforms and then compute Pearson's product-moment correlation coefficient between them as $r = 0.887$ for the male voice, which indicates good correlation.
Figure 14 shows original and reconstructed signals for a segment of music from “O Canada”. The computed
Pearson’s product-moment correlation coefficient between the original and reconstructed signals is r = 0.890. The
reproduction of these signals6 confirms that the sound is quite recognizable, in spite of the presence of some noise.
The purpose of using Pearson's product-moment correlation coefficient is not to provide a quantitative measurement of how good the reconstructed signal is, but rather to provide a relative indication of which type of sound is reconstructed better. In our experiment, the music shown in Figure 14 is slightly better recovered than the human voice shown in Figure 13. Qualitative listening to the original and recovered sounds also confirms this point.
A preliminary explanation for this is that the music contains more high frequency sounds than the human voice. High frequencies cause faster changes in surface orientation; hence it is easier for the algorithm to achieve a higher signal to noise ratio (SNR). Imagine if the signal consisted of mainly low frequency information. In this case, the groove surface would be nearly flat all the time, and the recovered sound might contain more noise, overwhelming the useful sound signal.
9 Real Experiments on Gramophone Records
Acquiring real images of the gramophone record groove proved to be quite challenging. The extremely small dimensions of the groove required extra care when capturing images using a microscope. Illumination, movement, etc. all have to be carefully adjusted and controlled.
The nature of our algorithm requires the use of a very short focal length objective, which results in a very shallow depth of field, i.e. only a narrow band of the groove surface is in focus while all other areas, which do not have the same depth as the focused region, are out of focus. Special modifications must be made to the light path of the microscope objective so that more than one region of the groove is in focus at the same time. We installed a thin glass plate between the objective and the record surface, covering half of the field of view, to create a dual focus objective. This was both a simple and effective solution to our shallow depth of field problem. We discuss this in detail below.
Processing the captured images is also a hard task, as the images captured are not as ideal as the images generated in the simulation. Some images are blurry while others show dirt in the groove.
To overcome all of the above mentioned difficulties, special care must be taken during the experiment and the implementation of the algorithms. These will be discussed in detail in the following sections.
9.1 System Setup
For the real data, we captured 1390 groups of images; each group contains 45 frames for computing optical flow sequences. The distance between the center images of adjacent groups is less than the width of an image in order to get some overlap among the groups for smoothly varying optical flow between neighboring groups.
The experimental setup shown in Figure 15 consists of a slowly turning Turntable, a Microscope that is used to magnify the groove by approximately 400X to get good images of the groove with its subtle variations (the groove width varies between 100 to 300 µm) and an Image Capture Device, in our case a high speed digital camera fitted to the microscope. Images were 24 bit color and were 1280 × 1024 pixels in size.
The records being photographed are 78RPM Standard Play (SP) records. We focus on the 78RPM records since
they need less magnification and are easier to illuminate than the 33RPM Long Play (LP) records.
^6 www.csd.uwo.ca/faculty/barron/AUDIOFILES
Figure 15: Setup of microscope, camera, turntable and reduction drive.
9.2 The Focal Length Issue
Since we are trying to compute the groove wall’s surface orientation, the absolute value of the depth is not critical.
However, the relative difference in the depth is important for the computation of surface orientation. The recovery
of accurate depth differences is rather difficult if the target is far away from the camera because the variance in the measurement may exceed the depth difference itself. The average depth of an SP record groove is about 120 µm. If the average error of the depth estimation is 7-8%, then the average distance between groove and camera should be less than or equal to 1.7mm, which is quite short in terms of the focal length of the microscope's objective.
In our experimental setup, a 40X objective lens is used to obtain the shortest focal length available. Since a microscope's barrel length (the distance from the objective to the port where an eyepiece is fitted) is fixed, it is focused by varying the distance to an object until the desired distance is reached. The barrel length can be considered as the image distance $u$, which in our case is 160mm, as marked on the housing of the objective lens. Because the magnification is 40X, the object distance can be computed as $v = \frac{u}{40} = \frac{160\text{mm}}{40} = 4$mm. Then we can compute the focal length, $f$, using the thin lens equation as:
$$f = \frac{1}{\frac{1}{u} + \frac{1}{v}} = \frac{1}{\frac{1}{160} + \frac{1}{4}} = 3.9\text{mm}. \quad (45)$$
This focal length (3.9mm) is longer than we expected (1.7mm), but this is as close as we can get at this time. Thus,
we use what is available and see what adjustments we need to make to compensate for this at a later stage in the
experiments.
9.3 The Dual-focal Microscope Objective
The main problem with a high power objective lens is the depth of field. For a 40X objective lens, the depth of field
is about 1µm [Nikon, 2006]. For a lower magnification factor we obtain a bigger depth of field, but at the same time,
the focal length also increases, which makes the difference in depths more difficult to distinguish. For example, a 4X
objective lens has a depth of field of about 45µm [Nikon, 2006], which covers about half of the total groove depth.
However, the focal length of this 4X objective is about 32mm according to the formula in Equation (45). Trying to
detect a maximum depth variation of 120µm with a focal length of 32mm is a very tough job! So we stick to the
available 40X objective in our experiments.
Figure 16 shows a captured image using the 40X objective. It shows that only two narrow stripes located on opposite groove walls are brought into focus. The overall image is a bit blurry because the record is turning at a certain speed. The two in-focus stripes are relatively sharp compared to the rest of the image.
Since obtaining a larger depth of field, i.e. a clear image all the way from the groove top to the bottom, is theoretically impossible using our current microscope, bringing more than one depth stripe into focus becomes a more realistic alternative approach. Figure 17 illustrates the technique of obtaining two depths of focus using a piece of thin cover glass inserted between the objective lens and the record surface, covering half of the field of view, creating a dual-focusing objective.
Figure 16: The groove image captured directly using the 40X objective lens. Notice the shallow depth of field. Only
two stripes (shown in highlighted areas) on opposite groove walls of same depth are in focus.
Figure 17: Illustration of the modified light path for a microscope objective lens, with a piece of thin cover glass
inserted between the objective lens and the record surface, covering half of the field of view. This setup forms a
dual-focusing objective lens.
When light travels through the piece of cover glass, it is shifted by a certain amount because the refraction index of glass ($\lambda = 1.515$) is different from that of air ($\lambda = 1.0$). This shift causes the distance of the in-focus object to increase by a certain amount $\delta v$. In the next section, we will show that this shifted distance $\delta v$ is approximately constant.
Figure 18a shows a captured image using the modified light path from the same 40X objective lens. Clearly there are three in-focus stripes: the top two are of the same depth, with the groove bottom in the middle, out of focus. The third stripe at the bottom represents a closer stripe on the same groove wall where the second stripe resides. So there are two in-focus regions on the lower groove wall shown in one image. By taking a sequence of such images of the moving groove walls, we can determine their locations by segmenting their optical flow field. After assigning different depth values to these stripe areas, surface orientations are computed based on this information.
Figure 18: (a) Groove image captured using the 40X objective lens. Notice there are only three stripes (highlighted areas) on opposite groove walls of different depths that are in focus. (b) Locations of computed optical flow (white dots) of a section of a moving record groove. The direction and magnitude of the optical flow are not shown because only the position information is used here. (c) Results of the Hough transform. The 3 square boxes indicate the 3 peaks that represent the 3 most clustered lines. (d) The 2 segmented regions and their centroid lines, used for computing surface orientations.
9.4 Determining the Two Depth Levels
Figure 19 shows the detailed light path, made by enlarging the right half of Figure 17.
Figure 19: Illustration of the light path showing the shift of the focus distance of the objective lens when a piece of
cover glass is inserted.
We can compute the shifted focus distance $\delta v$ as follows. From the light incident angle $\theta$ and refraction angle $\theta'$, we have:
$$L_1 = d \tan(\theta) \quad \text{and} \quad L_2 = d \tan(\theta'), \quad (46)$$
where $d$ is the thickness of the cover glass. Thus, the shifted focus distance $\delta v$ can be computed as:
$$\delta v = d \left(1 - \frac{1}{\lambda} \cdot \frac{\cos(\theta)}{\cos(\theta')}\right), \quad (47)$$
where $\lambda = \frac{\sin(\theta)}{\sin(\theta')} = 1.515$ is the refraction index of the cover glass. The value of $\frac{\cos(\theta)}{\cos(\theta')}$ is close to 1 when the incident angle is less than 22.5°. Using this approximation, Equation (47) becomes:
$$\delta v = d \cdot \left(1 - \frac{1}{\lambda}\right) = 0.037\text{mm}, \quad (48)$$
where $d = 0.11$mm. Next we compute the distances of the bottom two stripes in Figure 18 using this information. The bottom stripe, since it is not covered by glass, has a distance of 4mm. The top two are at the same distance of 4.037mm using the calculated focus distance shift.
9.4.1 Optical Flow and In Focus Regions
Optical flow is computed using Horn and Schunck’s algorithm [Horn and Schunck, 1981] modified by adding the
groove surface orientation constraint. This constraint regulates the distribution of horizontal optical flow along the
vertical direction, since it is known that one of the angles defining the wall orientation is 45°. Full details are given in [Tian, 2006, Tian, 2008]. Figure 18b shows the area where optical flow is computed. Since optical flow is used only to determine the regions that are in good focus [out-of-focus regions tend to be heavily blurred, with intensity derivatives near zero], the magnitude of the optical flow is not as important as its location. Any location that has an optical flow vector is marked for later segmentation using the Hough transform.
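Extracting these positions is straightforward. The sketch below assumes the flow has already been computed as two dense arrays u and v (the function and parameter names are our own, for illustration only):

import numpy as np

def flow_positions(u, v, eps=1e-3):
    """Collect the (x, y) positions where an optical flow vector exists.

    Out-of-focus regions are heavily blurred, so their intensity derivatives
    (and hence the computed flow) are essentially zero; everywhere else the
    position is kept for later Hough segmentation."""
    mask = np.hypot(u, v) > eps       # non-negligible flow magnitude
    ys, xs = np.nonzero(mask)
    return np.column_stack((xs, ys))  # one (x, y) row per marked position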
9.4.2 Hough Transform
The computed optical flow for the grooves in Figure 18a is shown in Figure 18b and indicates three stripe regions that are in focus. In order to segment the three regions, we compute a Hough transform [Gonzalez and Woods, 1992] using the optical flow position information.
After examining the optical flow positions in many frames, we observed that the in-focus stripes cluster along straight lines. Thus the dimensionality of the Hough transform used to segment these lines is reduced to 2, resulting in
less computation time and storage space for the accumulator array. At each position that has an optical flow vector,
many pairs of ρ and θ values can be calculated according to the normal representation of a line x cos θ + y sin θ = ρ
that passes through this position. At every optical flow position, each value of θ from −45◦ to 45◦ in steps of 1◦ is
used to compute a ρ value and the accumulator array element at [θ, ρ] is incremented by one. The Hough transform
as a θ − ρ image is shown in Figure 18c. In our experiments, since there are many images to process, quantization steps 10 times larger were used; this gives much faster computation without significantly degrading the segmentation results.
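The voting scheme can be sketched as follows, using the parameter ranges given above (θ from −45° to 45°, ρ from 0 to 1024); the function and step names are ours:

import numpy as np

def hough_accumulate(points, theta_step=1.0, rho_step=1.0, rho_max=1024):
    """Vote in a 2D (theta, rho) accumulator using the normal line form
    x*cos(theta) + y*sin(theta) = rho, once per flow position.

    Using larger steps (e.g. 10x) trades accumulator resolution for speed."""
    thetas = np.deg2rad(np.arange(-45.0, 45.0 + theta_step, theta_step))
    n_rho = int(rho_max / rho_step) + 1
    acc = np.zeros((len(thetas), n_rho), dtype=np.int32)
    for x, y in points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        bins = np.round(rhos / rho_step).astype(int)
        ok = (bins >= 0) & (bins < n_rho)          # keep rho within range
        acc[np.nonzero(ok)[0], bins[ok]] += 1
    return acc, thetas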
Three global peaks of (θ, ρ) values in the accumulator array are detected, which represent the three lines. Any
(θ, ρ) pair that is within a certain range of the three peak (θ, ρ) values is considered to belong to that region. To
segment the flow field, we compute (θ, ρ) values for each optical flow position. If it falls in one and only one of the
three peak regions, then it belongs to that region. If it falls into two or more regions, then the position lies between two regions, is usually caused by noise, and is simply discarded. Adjusting the range around each peak affects the width of the three regions: increasing the range widens them, while decreasing it makes them narrower. There is a compromise between region density and noise tolerance, and the range is determined by trial and error [θ varying from −45° to +45° and ρ varying from 0 to 1024].
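The membership test itself is simple; in the sketch below, peaks holds the three peak (θ, ρ) values and rho_tol is the trial-and-error range (both names are ours):

import numpy as np

def assign_to_regions(points, peaks, rho_tol):
    """Assign each flow position to the unique peak line it lies near.

    A position matching no peak, or more than one, lies between regions
    (usually noise) and is discarded."""
    regions = [[] for _ in peaks]
    for x, y in points:
        hits = [k for k, (theta, rho) in enumerate(peaks)
                if abs(x * np.cos(theta) + y * np.sin(theta) - rho) <= rho_tol]
        if len(hits) == 1:            # keep only unambiguous memberships
            regions[hits[0]].append((x, y))
    return regions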
As discussed in Section 9.4, only the lower two regions have different depth values. After successful segmentation
of the three stripes, the top one is discarded, as shown in Figure 18d. Also shown in this figure are the centroid lines for the two remaining regions (after a little smoothing). Void sections are filled with [x, y] positions computed from the peak (θ, ρ) values, using the θ − ρ line equation of the corresponding region. Depth values are then assigned to these two centroid lines, rather than to the full regions, to reduce the computational cost when computing surface orientation.
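One simple way to realise such a centroid line is a per-column average of the region's flow positions; this sketch leaves void columns as NaN, to be filled from the region's peak (θ, ρ) line equation as described above (names are ours):

import numpy as np

def centroid_line(region, width):
    """Per-column centroid of one segmented region.

    region is a list of (x, y) flow positions; columns with no positions
    are left as NaN and are later filled from the region's peak line."""
    ys = np.full(width, np.nan)
    pts = np.asarray(region, dtype=float)
    if pts.size == 0:
        return ys
    for x in np.unique(pts[:, 0]).astype(int):
        ys[x] = pts[pts[:, 0] == x, 1].mean()
    return ys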
9.5 Computing Surface Orientation
Earlier, we introduced a robust method of estimating the surface orientation given depth values within a small neighbourhood [Tian and Barron, 2006]. Now the depth data have been reduced to two virtually parallel lines with two different depth values; our robust algorithm also works in this case. We use a neighbourhood of 100 pixels about each line. This size is a compromise: a larger neighbourhood would attenuate high-frequency sounds, while a smaller one would amplify the effects of noise.
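As a rough illustration of the idea (a plain least-squares stand-in, not the robust estimator of [Tian and Barron, 2006], which additionally rejects outliers), a local surface orientation can be obtained by fitting a plane to the (x, y, z) samples drawn from the two centroid lines within the neighbourhood:

import numpy as np

def local_orientation(pts):
    """Least-squares plane fit z = a*x + b*y + c to an (N, 3) array of
    (x, y, z) samples taken from the two centroid lines; the unit normal
    of the plane encodes the local surface orientation."""
    A = np.column_stack((pts[:, 0], pts[:, 1], np.ones(len(pts))))
    (a, b, c), *_ = np.linalg.lstsq(A, pts[:, 2], rcond=None)
    n = np.array([a, b, -1.0])        # normal of the plane z = a*x + b*y + c
    return n / np.linalg.norm(n)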
9.6 Sound from Surface Orientation
The following images show an example of how damaged groove walls are repaired and the missing signal is filled in. Figure 20a shows a damaged groove from “Give a little, take a little” by Hank Thompson, a 78RPM record produced by RCA Victor. Nearly half of the groove wall in this image is missing. Figure 20b shows the positions of the computed optical flow; there are some vertical stripes caused by the damage. The Hough transform segments the genuine lines and removes the vertical stripes, as shown in Figure 20c. Figure 20d shows the two segmented regions and their centroid lines, used for computing the surface orientations. The missing part has been filled in using the region's default line parameters.
Figure 20: (a) An image showing the damaged groove. (b) Positions of the computed optical flow of the damaged groove. (c) Results of the Hough transform. The 3 square boxes indicate the 3 clustered line regions. (d) The two segmented regions and their centroid lines, used for computing the surface orientations. The missing part has been filled in using the region's default line parameters.
Figure 21 shows the original and reconstructed music. The computed Pearson product-moment correlation coefficient between them is r = 0.561. Listening confirms that the music is quite recognizable despite the presence of noise; the popping noise present in the original sound is not as obvious in the reconstructed sound.
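The correlation figure can be reproduced in a few lines, assuming the two waveforms have been aligned and resampled to a common length (an illustrative sketch; the array names are ours):

import numpy as np

def pearson_r(original, recovered):
    """Pearson product-moment correlation between the original and the
    recovered sound waves (equal length, already aligned in time)."""
    x = original - original.mean()
    y = recovered - recovered.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))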
Figure 21: (a) The original sound wave recorded from a 78RPM record player with audible pops. (b) The recovered
sound wave from groups of groove images. (c) The wave peak envelope of the original sound. (d) The wave peak
envelope of the recovered sound.
10 Conclusions
This framework forms a basis for reproducing sound from gramophone records using 3D reconstruction algorithms.
Experiments with real data revealed some practical limitations of the algorithm, and some effort was devoted to addressing them. These results indicate that our 3D reconstruction approach is promising for non-contact record
playing and archiving. Future work includes improving each step of the algorithm: better optical flow, better imaging
setup, better depth orientation reconstruction, and better computing resources such as faster I/O, more CPU power and
maybe a parallel (SIMD) implementation to make it real-time.
References
[Barron and Klette, 2002] Barron, J. and Klette, R. (2002). Quantitative colour optical flow. In Intl. Conf. on Pattern
Recognition (ICPR2002), volume 4, pages 251–255.
[Barron et al., 1994] Barron, J. L., Fleet, D. J., and Beauchemin, S. S. (1994). Performance of optical flow techniques.
IJCV, 12(1):43–77.
[Barron et al., 2003] Barron, J. L., Ngai, W. K. J., and Spies, H. (2003). Quantitative depth recovery from time-varying optical flow in a Kalman filter framework. In Asano, T., Klette, R., and Ronse, C., editors, LNCS 2616, Theoretical Foundations of Computer Vision: Geometry, Morphology, and Computational Imaging, pages 344–355.
[Bianchi and Sorrentino, 2007] Bianchi, G. and Sorrentino, R. (2007). Electronic Filter Simulation & Design. McGraw-Hill Professional.
[Black and Anandan, 1996] Black, M. J. and Anandan, P. (1996). The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104.
[Butterworth, 1930] Butterworth, S. (1930). On the theory of filter amplifiers. Wireless Engineer, 7:536–541.
[Cavaglieri et al., 2001] Cavaglieri, S., Johnsen, O., and Bapst, F. (2001). Optical retrieval and storage of analog
sound recordings. In The AES 20th International Conference, Budapest, Hungary.
[ELP, 1997] ELP (1997). Elp laser turntable. Internet reference: www.elpj.com.
[Fadeyev and Haber, 2003] Fadeyev, V. and Haber, C. (2003). Reconstruction of mechanically recorded sound by
image processing. J. of Audio Eng. Soc., 51(12):1172–1185.
[Fadeyev et al., 2005] Fadeyev, V., Haber, C., Maul, C., McBride, J., and Golden, M. (2005). Reconstruction of
recorded sound from an edison cylinder using three-dimensional non-contact optical surface metrology. J. of Audio
Eng. Soc., 53(6):485–508.
[Heel, 1990] Heel, J. (1990). Direct dynamic motion vision. In Proc. IEEE Conf. on Robotics and Automation.
[Horn and Schunck, 1981] Horn, B. K. P. and Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence,
17:185–204.
[Hung and Ho, 1999] Hung, Y. S. and Ho, H. T. (1999). A Kalman filter approach to direct depth estimation incorporating surface structure. IEEE PAMI, pages 570–576.
[Iwai et al., 1986] Iwai, T., Asakura, T., Ifukube, T., and Kawashima, T. (1986). Reproduction of sound from old wax
phonograph cylinders using the laser-beam reflection method. Applied Optics, 25(5):597–604. (Internet reference:
www.opticsinfobase.org/abstract.cfm?URI=ao-25-5-597).
[Kessler and Ziegler, 1999] Kessler, T. and Ziegler, S. (1999). Direct play back of negatives of historic sound
cylinders. In EVA (Electronic Media & Visual Arts) Europe’99, pages 8.1 – 8.5. (Internet reference:
www.gfai.de/projekte/spubito/papers/eva99.pdf).
[Laborelli et al., 2007] Laborelli, L., Chenot, J.-H., and Perrier, A. (2007). Non-contact phonographic discs digitisation using structured colour illumination. In Audio Engineering Society 122nd Convention, Vienna, Austria. 11 pages, Paper 7009.
[Longuet-Higgins and Prazdny, 1980] Longuet-Higgins, H. and Prazdny, K. (1980). The interpretation of a moving retinal image. Proceedings of the Royal Society of London B, Biological Sciences, 208(1173):385–397.
[Lucas and Kanade, 1981] Lucas, B. D. and Kanade, T. (1981). An iterative image-registration technique with an
application to stereo vision. In Image Understanding Workshop, pages 121–130. DARPA.
[Matthies et al., 1989] Matthies, L., Szeliski, R., and Kanade, T. (1989). Kalman filter-based algorithms for estimating
depth from image sequences. IJCV, 3(3):209–238.
[Nikon, 2006] Nikon (2006). Nikon microscopy tutorial. Internet reference: www.microscopyu.com.
[Olsson et al., 2003] Olsson, P., Öhlin, R., Olofsson, D., Vaerlien, R., and Ayrault, C. (2003). The digital needle project - group light blue. Technical report, KTH Royal Institute of Technology, Stockholm, Sweden. Internet reference: www.s3.kth.se/signal/edu/projekt/students/03/lightblue/.
[Press et al., 1992] Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1992). Numerical Recipes
in C. Cambridge University Press, 2 edition.
[Gonzalez and Woods, 1992] Gonzalez, R. C. and Woods, R. E. (1992). Digital Image Processing. Addison-Wesley Publishing Company.
[Simoncelli, 1994] Simoncelli, E. P. (1994). Design of multi-dimensional derivative filters. In IEEE Int. Conf. Image
Processing, volume 1, pages 790–793.
[Springer, 2002] Springer, O. (2002). Digital needle - a virtual gramophone. Internet reference: www.cs.huji.ac.il/˜springer/.
[Stanke and Kessler, 2000] Stanke, G. and Kessler, T. (2000). Verfahren zur Gewinnung von Tonsignalen aus Negativspuren in Kupfernegativen von Edison-Zylindern auf bildanalytischem/sensoriellen Wege (SpuBito) [Procedure to recover sound signals from the negative tracks in copper negatives of Edison cylinders in an image analysis/sensorial way (SpuBito)]. In Simon, A., editor, Das Berliner Phonogramm-Archiv 1900-2000. VWB-Verlag für Bildung und Wissenschaft, Berlin.
[Stoddard, 1989] Stoddard, R. E. (1989). Optical turntable system with reflected spot position detection. United
States Patent 4,870,631.
[Stoddard and Stark, 1989] Stoddard, R. E. and Stark, R. N. (1989). Dual beam optical turntable. United States Patent
4,870,631.
[Stotzer et al., 2003] Stotzer, S., Johnsen, O., Bapst, F., Milan, C., Sudan, C., Cavaglieri, S. S., and Pellizzari, P.
(2003). Visualaudio: an optical technique to save the sound of phonographic records. IASA Journal, pages 38–47.
[Stotzer et al., 2004] Stotzer, S., Johnsen, O., Bapst, F., Sudan, C., and Ingol, R. (2004). Phonographic sound extraction using image and signal processing. In Proc. ICASSP, Montreal, Quebec, Canada.
[Tian, 2006] Tian, B. (2006). Reproduction of Sound Signal from Gramophone Records using 3D Scene Reconstruction. PhD thesis, University of Western Ontario, London, Ontario, Canada N6A 5B7.
[Tian, 2008] Tian, B. (2008). Sound Recovery from Gramophone Records by 3D Reconstruction. VDM Verlag.
[Tian and Barron, 2005] Tian, B. and Barron, J. L. (2005). A quantitative comparison of 4 algorithms for recovering
dense accurate depth. In 2nd Canadian Conference on Computer and Robot Vision, pages 498–505, Victoria, BC,
Canada.
[Tian and Barron, 2006] Tian, B. and Barron, J. L. (2006). Reproduction of sound signal from gramophone records
using 3d scene reconstruction. In Irish Machine Vision and Image Processing Conference, pages 84–91, Dublin,
Ireland.
[Ziegler, 2000] Ziegler, S. (2000). Das Walzenprojekt zur Rettung der größten Sammlung alter Klangdokumente von traditioneller Musik aus aller Welt: Walzen und Schellackplatten des Berliner Phonogramm-Archivs [The wax cylinder project in rescue of the largest collection of old sound documents of traditional music from around the world: wax cylinders and shellac records of the Berlin Phonogramm-Archiv]. In Simon, A., editor, Das Berliner Phonogramm-Archiv 1900-2000. VWB-Verlag für Bildung und Wissenschaft, Berlin.