Using Computer Vision Technology to Play Gramophone Records

Baozhong Tian
Department of Computer Science, Azusa Pacific University, Azusa, CA 91702-7000, USA
[email protected]

John L. Barron
Department of Computer Science, The University of Western Ontario, London, Ontario, Canada, N6A 5B7
[email protected]

Abstract

We present a non-contact, optical flow based method to reproduce the sound signal from gramophone records using robust 3D reconstruction of the surface orientation of the groove walls. The conversion of analogue data to digital data is an important task for the preservation of historic sound recordings. We digitally viewed the grooves of a record using a microscope that was modified to overcome the limitation of a shallow depth of field: a thin glass plate brings part of the image into focus at a second focal length, giving better overall image quality across the groove. The sound signal was recovered from the groove surface orientation. The overall algorithm has been tested and found to work correctly on both undamaged and damaged real records.

Keywords: Sound Signal Reproduction, Surface Reconstruction from Optical Flow, Gramophone Record, Robust Estimation, Surface Orientation, Optical Signal Reproduction/Retrieval from Gramophone Discs/Mechanical Sound Carriers.

1 Introduction

Reproducing sound mechanically started as early as 1877 on cylinder-shaped carriers and in 1887 on disk-shaped gramophone records. The technology of recording and retrieving acoustic signals on gramophone records reached its peak during the 1970's, just before the digital format compact disc (CD) began to dominate the mass marketing of music. Although the audio quality of a CD is judged to be very good by most people, some audiophiles believe that the sampling rate of a CD (44.1kHz) is not high enough to reproduce rich musical information faithfully. Today there is still some high-end record playing equipment in production.

There are many historical recordings that need to be archived. The problem with some of these recordings is that they have become so fragile with age and use that they cannot tolerate being played back on a traditional turntable with a mechanical stylus. This problem motivates research on non-contact record playing systems. A non-contact playback system has the following additional uses:

1. the replay of broken discs or cylinders or otherwise damaged carriers, e.g. lacquer discs in the state just before flaking off cannot be played mechanically, and
2. optical replay at multiple speeds.

The work presented here is a step in these directions.

1.1 Traditional Method of Sound Reproduction

We use stereo gramophone long-playing records (stereo LPs) to illustrate how sound reproduction is performed. During the record cutting procedure, the left and right channel signals control the speed of the cutting stylus in a +45/-45 lateral manner, i.e. a composition of two orthogonal speeds, while the record rotates at a constant speed. This is called modulation of the grooves. The movement of the stylus determines the slopes in the tangential direction of the groove walls. During the cutting process, the left and right groove walls' modulations are kept independent of each other. When the playback stylus has a setup similar to that of the cutting stylus, stereo signals can be reproduced. The electrical signal outputs are proportional to the +45/-45 lateral speeds of the stylus as it rides along the groove and is modulated by the groove walls.
Figure 1: Illustrations of the movement of the record and the stylus: (a) top view, (b) cross section view.

Figure 1a illustrates the top view of the movements of a record and a stylus. The stylus has a tangential speed V_T relative to the groove due to the record rotation \vec{ω}. There are also left and right lateral movements of the stylus (V_L and V_R) in the +45/-45 directions. Figure 1b shows a cross section view of the compound +45/-45 lateral movement of the stylus.

The major task of sound reproduction is to track the groove walls as precisely as possible. The conventional method uses a diamond-tip stylus that runs along the V-shaped groove under a certain tracking force. The problem is that the stylus has some weight, so tracking high frequency signals is more difficult. Other problems include groove damage, such as scratches and small particles that result in annoying clicks, pops and degradation of sound over time, and the maintenance of the correct settings of the turntable, tone arm, cartridge and stylus during playback (which may require frequent adjustments).

1.2 Literature Survey of Non-Contact Record Playing Methods

[Iwai et al., 1986] developed a laser beam reflection method to reproduce the sound from all kinds of old (repaired) wax phonograph cylinders. They used their method to reproduce the recorded talk and songs of the Ainu people in Sakhalin and Hokkaido, Japan, made by the Polish anthropologist Bronislaw Pilsudski from 1902 to 1905 on standard Edison wax cylinders. An optical beam, incident onto the groove cut on the wax cylinder, reflects to a detection plane that is perpendicular to the optical axis. The time-varying position of the intersection of this reflected beam with the detection plane while the cylinder is rotated (played) corresponds to the sound signal. The final sound signal is then obtained by filtering this signal with a frequency equalizer. Several problems had to be resolved, usually by appropriate parameter settings and/or additional hardware: the fidelity of the recovered sound (decreased beamwidths result in enhanced fidelity and loudness, while increased beamwidths smooth the time-varying intersection point position, eliminating higher frequencies and hence both consonant sounds and speckle noise), the existence of speckle noise (due to recrystallization of the wax caused by poor storage conditions for 100+ years), echo in the reproduced sounds (caused by increased beamwidths overlapping multiple grooves) and occasional tracking error (caused by the illuminating beam mistracking a groove over time).

[Kessler and Ziegler, 1999, Stanke and Kessler, 2000, Ziegler, 2000] describe a contact-less method to play back the copper negatives of Edison cylinder phonograms ("galvanos") in the Berlin Phonogramm-Archiv. Note that for these negatives, grooves on the original wax cylinders are now ridges on the galvanos. Image processing was used to track these ridges on the copper galvanos. A direct galvano player, consisting of an endoscope and a diamond stylus to play the record, was constructed. The sound quality is reported to be as good as that of the original cylinders played on a modern cylinder player with a diamond stylus. More information is available at two websites (www.gfai.de and www.chrosterhamp.se/phono/stank/html).

ELP corporation [ELP, 1997] spent ten years developing a laser turntable (invented by Robert E. Stoddard et al.
[Stoddard and Stark, 1989, Stoddard, 1989]), utilizing five laser beams to track the microgroove optically. This is a pure analogue process, but it is so sensitive to foreign particles in the groove and on the record surface that it requires the record to be cleaned every time it is played. The ELP laser turntable uses two of the five laser beams to read the groove walls and the other three for groove tracking. This has two main advantages: the laser beams are weightless and can be made as thin as 2µm in diameter, which is much thinner than a high-end stylus (4-12µm). However, the system is very complicated and expensive and it only works well with black records because of the reflective nature of the material; coloured records may produce unpredictable results [ELP, 1997]. Because the laser turntable is very expensive (in the price range of a small car) and because it is very sensitive to the cleanliness and colour of the record, it is not judged to be a feasible solution.

Other research has studied the feasibility of reproducing the sound signal by image processing methods. In 2002, Ofer Springer [Springer, 2002] proposed an idea he called the virtual gramophone. Springer's idea is to scan an image of the record and write a decoder that applies a "virtual needle" following the groove's spiral form. However, when the authors listened to a sample decoded sound (www.cs.huji.ac.il/~springer/), the music was judged to be barely recognizable. Inspired by Springer's idea, a group of Swedish students [Olsson et al., 2003] developed a system using more sophisticated digital signal processing methods, such as FIR Wiener filtering and spectral subtraction, to reduce the noise level in the reproduced sound, resulting in a better reproduction than Springer's. Both systems used an off-the-shelf scanner, which limited the resolution of the images to a maximum of 2400dpi, or about 10µm per pixel. At this resolution, the quantization noise is quite high because the maximum lateral travel of the groove is about 150µm.

[Fadeyev and Haber, 2003] developed a 2D method to reconstruct mechanically recorded sound by image processing. The resolution was greatly improved by the aid of micro-photography. Their algorithm detects the groove bottom as an edge in the image and then differentiates the bottom edge shape to reproduce sound signals. The groove bottom edges are not always well defined and are sometimes distorted by dirt particles. The groove walls, which contain much sound information, were ignored. This project resulted in a fast optical scanner for disc records called I.R.E.N.E. (Image, Reconstruct, Erase Noise, Etc.). They also introduced a 3D method to reproduce vertically modulated records such as wax cylinders [Fadeyev et al., 2005]. It uses 3D profile scanning provided by a laser confocal scanning microscope (irene.lbl.gov). For cylinders, 3D scanning is necessary because the audio is stored in the vertical modulations of the cylinder's surface. In general, even for standard records, 3D scanning is better than 2D scanning because it allows the entire surface to be analyzed rather than just a projection or slice of it.

[Cavaglieri et al., 2001, Stotzer et al., 2003, Stotzer et al., 2004] developed a 2D method they called the VisualAudio concept. A picture of the record was taken using a large format film that was as big as the record. The film was then scanned using a rotating scanner, which is actually a line scan camera positioned above the film while the film is rotated on a turntable.
Edges were then detected in the digitized image and sound signals were computed from the edges. Unlike the method of [Fadeyev and Haber, 2003], they used the intersection of the groove and the record surface as the edge, instead of the groove bottom. This gave them the capability to reproduce the sound from stereo 33 rpm recordings. The use of the rotating scanner also eliminated the need to adjust the sample rate as the groove turned closer to the record's center, and the resulting images were rectangular rather than circular, as they would be if scanned by a flat-bed scanner. A 4X magnifier was fitted to the rotating scanner to obtain the desired image resolution. A Signal to Noise Ratio (SNR) analysis showed that a satisfying SNR of 40dB can be achieved if the standard deviation (σ_n) of the edge position noise is kept below 1.28µm. However, listening to the reproduced sound clips on their web site (www.eif.ch/visualaudio/) indicated that the noise level needs to be further reduced.

[Laborelli et al., 2007] proposed a contact-less optical playing method using structured colour illumination. A region of a record is illuminated by beams of light rays, where the colour of a light beam depends on its direction of incidence. These beams are reflected by the record groove walls towards a camera, allowing direct access to the audio signal via colour image decoding. Structured illumination also allows the exploitation of the height information of the record's groove walls. The authors claim that their method is also potentially advantageous for the detection of dust occlusion and the automatic interpolation of fractured records.

2 Proposed Method

We propose a sound reproduction method based on Computer Vision technologies such as optical flow and surface reconstruction. The proposed method uses a microscope to obtain a sequence of magnified images of the groove walls and uses 3D scene surface reconstruction to calculate the slopes of the walls. Figure 2 shows the system diagram.

Figure 2: System diagram (pipeline: Image Sequence Acquisition, Optical Flow, Depth Map, Surface Orientation, Raw Sound Signal, with Groove Tracking, Motor Controllers and optional Digital Signal Processing (DSP)).

The major features of the proposed method can be summarized as:

• Using as much information on the record as possible to reproduce the sound. Plenty of information is stored in the surface orientation of the groove walls, which is not used by 2D methods during their scanning/photographing processes. A 2D method only computes detectable edges such as a groove's bottom or groove-surface (land) intersections.

• Computer Vision technologies such as optical flow and depth map estimation are applied to this problem to obtain the 3D information characterizing the groove, thus eliminating the requirement for a specialized 3D scanning device.

• Robust estimation techniques help choose the best areas of the groove wall for the computation and reject noisy areas that have been damaged by scratches and dirt particles, reducing the level of the noise and improving the quality of the reproduced sound.

We discuss the individual system components below.

3 Image Sequence Acquisition

We use groups of overlapping image sequences to cover the entire groove and allow an optical flow calculation for the frames in each group. Since the images are 1280 pixels wide, we use an overlap of 100 pixels between the center images in each group.
We obtained each group of images by first acquiring a short AVI video clip about a groove position and then using an image processing program (ImageMagick) to separate out 36 individual colour images from the AVI file. Because colour adds little to optical flow calculations [Barron and Klette, 2002], these colour images were immediately converted to grayvalue images, before any optical flow calculation is done, to reduce processing and storage costs. Any two consecutive images in the same group differ only by a translation. When the camera moves to the next segment of a groove to capture another group of images, the motion is such that the first image of the new group differs from the last image of the previous group by only a small amount, so that the grooves reconstructed from the optical flow fields are continuous when pieced together.

Figure 3: Two pieces of groove from (a) a 78 rpm SP record and (b) a 33 rpm LP record. The magnifying factor is 60X.

Figure 3 shows images of grooves under a microscope. The magnification factor of the microscope is set such that the field of view covers about 600µm in width, so that for a camera with 640 × 480 pixel resolution, the horizontal spatial resolution is about 1µm. The illumination is set so that the groove walls are bright while the record surface and the groove bottom are dark. In our actual computations, we use a better microscope and a camera with higher resolution and magnification to significantly improve the image quality. With our current setup, only a few seconds of sound (3 or 4 revolutions of groove length) were recovered, so the scanning resolution makes little difference relative to the current tracking radius. As a record is scanned from its outermost edge to its inner edge, there are more pixels per wavelength in the outer grooves than in the inner grooves. As a result, the scanning resolution should be compensated according to the radius as the scanning head rotates. Our microscope is set at 400X, which gives us high enough resolution to prevent speed fluctuations within one group of scans. Accurate alignment between neighbouring groups can also reduce speed fluctuations when the sound pieces are stitched together.
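To make this acquisition step concrete, the following is a minimal sketch of the frame extraction and grayvalue conversion. We used ImageMagick for this step; the OpenCV calls, file names and frame count below are illustrative assumptions, not our actual tooling.

```python
# Illustrative sketch only: split an AVI clip into grayvalue frames.
# The pipeline described above used ImageMagick; OpenCV is assumed here.
import cv2

def extract_gray_frames(avi_path, out_pattern="frame_%02d.png", max_frames=36):
    """Read up to `max_frames` frames and save them as grayscale images."""
    cap = cv2.VideoCapture(avi_path)
    count = 0
    while count < max_frames:
        ok, frame = cap.read()                        # BGR colour frame
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        cv2.imwrite(out_pattern % count, gray)
        count += 1
    cap.release()
    return count
```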
4 Image Preprocessing

The images need to be preprocessed, i.e. we need to compute the image intensity derivatives and the optical flow fields, etc., before we can compute the depth maps. We experimented with implementations of two standard differential optical flow techniques, namely those of [Horn and Schunck, 1981] and [Lucas and Kanade, 1981], with differentiation by [Simoncelli, 1994]. Those results are reported in a PhD thesis [Tian, 2006, Tian, 2008]. We obtained the best optical flow using Horn and Schunck's algorithm, adapted to our problem by imposing surface constraints that arise from computing optical flow on images of record groove walls, i.e. a local surface planarity constraint and a constraint arising from the "V" shape of the groove cross sections. All differential optical flow is based on the motion constraint equation. Briefly, we describe this constraint equation, describe the Lucas and Kanade and Horn and Schunck optical flow methods in light of this constraint, describe how we perform intensity differentiation and finally describe our optical flow algorithm, which is Horn and Schunck based but uses an additional groove orientation constraint.

4.1 The Motion Constraint Equation

Let I(x, y, t) be the intensity function at pixel (x, y) at time t. If we assume that the brightness pattern in the local neighbourhood about I(x, y) at time t moves as a simple translation (δx, δy, δt), then the brightness patterns about I(x + δx, y + δy, t + δt) and I(x, y, t) should be the same:

I(x + δx, y + δy, t + δt) = I(x, y, t).  (1)

A 1st order Taylor series expansion yields:

I(x + δx, y + δy, t + δt) = I(x, y, t) + I_x δx + I_y δy + I_t δt,  (2)

where I_x, I_y and I_t are the spatio-temporal image intensity derivatives. Using Equation (1), we can re-write Equation (2) as:

I_x δx + I_y δy + I_t δt = 0.  (3)

Dividing δx and δy by δt yields the motion constraint equation:

I_x u + I_y v + I_t = 0, where u = δx/δt and v = δy/δt.  (4)

This can be expressed more compactly as:

∇I · \vec{v} + I_t = 0,  (5)

where ∇I = (I_x, I_y) is the spatial intensity gradient and \vec{v} = (u, v) is the optical flow vector (or image velocity) at pixel (x, y). The motion constraint equation is the basis for most differential optical flow methods. It specifies that, at each image pixel, the optical flow vector is constrained to be a point lying on a line. An additional constraint on the optical flow is required to resolve this ambiguity (called the aperture problem).

4.2 Lucas and Kanade Optical Flow

To resolve the aperture problem, [Lucas and Kanade, 1981] assume the motion of a pixel is constant in some local neighbourhood about that pixel. Thus, two (or more) sets of different derivatives in this neighbourhood yield a nonsingular linear (least-squares) system of equations that yields values for (u, v). The main problem with this algorithm is that the neighbourhoods have to be small in order to satisfy the constant motion assumption, making the optical flow calculations very sensitive to noise. One can discriminate between "good" and "bad" optical flow vectors on the basis of the smallest eigenvalue of the least squares integration matrix [Barron et al., 1994], but one then typically obtains sparse optical flow fields, which are not that useful for our application.

4.3 Horn and Schunck Optical Flow

[Horn and Schunck, 1981] resolve the aperture problem by using a global smoothness constraint in addition to the motion constraint equation to compute 100% dense flow. That is, they combine the gradient constraint in Equation (4) with a global smoothness term to constrain the estimated velocity field \vec{v} = (u, v), minimizing

\int_D (∇I · \vec{v} + I_t)^2 + α^2 (||∇u||_2^2 + ||∇v||_2^2) \, dx \, dy  (6)

over a domain D (the image), where the magnitude of the Lagrange multiplier α reflects the influence of the smoothness term. We used α = 1.0 and α = 10.0 in our work. Gauss-Seidel iterative equations are used to minimize the Euler-Lagrange equations derived from Equation (6) and yield the optical flow field as:

u^{k+1} = \bar{u}^k - \frac{I_x [I_x \bar{u}^k + I_y \bar{v}^k + I_t]}{α^2 + I_x^2 + I_y^2}  (7)

and

v^{k+1} = \bar{v}^k - \frac{I_y [I_x \bar{u}^k + I_y \bar{v}^k + I_t]}{α^2 + I_x^2 + I_y^2},  (8)

where k denotes the iteration number, u^0 and v^0 denote the initial velocity estimates, which are set to zero, and \bar{u}^k and \bar{v}^k denote neighbourhood averages of u^k and v^k respectively. We use at most 100 iterations in our iterative calculations, which is sufficient to achieve convergence (the average norm of the difference in the flow fields between adjacent iterations k and k + 1 is less than some threshold τ).
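For concreteness, a minimal NumPy sketch of the iterations in Equations (7) and (8) follows. It is not our production implementation; the 3 × 3 averaging kernel for \bar{u} and \bar{v} follows Horn and Schunck's original weights, and the α and τ values are assumed parameters.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(Ix, Iy, It, alpha=1.0, n_iters=100, tau=1e-3):
    """Iterate Equations (7) and (8) until the mean flow change is below tau."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    # Horn and Schunck's neighbourhood-average weights (centre excluded).
    avg = np.array([[1/12, 1/6, 1/12],
                    [1/6,  0.0, 1/6 ],
                    [1/12, 1/6, 1/12]])
    denom = alpha**2 + Ix**2 + Iy**2
    for _ in range(n_iters):
        u_bar, v_bar = convolve(u, avg), convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / denom
        u_new, v_new = u_bar - Ix * common, v_bar - Iy * common
        delta = np.mean(np.hypot(u_new - u, v_new - v))  # convergence test
        u, v = u_new, v_new
        if delta < tau:
            break
    return u, v
```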
4.4 Intensity Differentiation

Differentiation (to compute I_x, I_y and I_t) was done using Simoncelli's [Simoncelli, 1994] matched balanced filters for low pass filtering (blurring using the p_5 kernel) and high pass filtering (differentiation using the d_5 kernel); see Table 1 for the 5-tap filters we use. Matched filters allow comparisons between the signal and its derivatives, as the high pass filter is simply the derivative of the low pass filter, and this is hypothesized to yield more accurate derivative values [Simoncelli, 1994]. Using these two masks, I_x is computed by applying p_5 in the t dimension, then p_5 to those results in the y dimension and finally d_5 to those results in the x dimension. I_y is computed in a similar manner. To compute I_t, we first apply p_5 in the x dimension for each of 5 adjacent images, then p_5 again in the y dimension on those 5 results and finally d_5 in the t dimension on the x and y smoothed results. Before performing this filtering we use a simple averaging filter [1/4, 1/2, 1/4] to slightly blur the images (this reduces the 7 input images to 5 smoothed images). Simoncelli claims that, because both of his filters were derived from the same principles, more accurate derivatives result; he demonstrated this on the Yosemite Fly-Through sequence [Simoncelli, 1994].

n      0       1       2      3      4
p_5    0.036   0.249   0.431  0.249  0.036
d_5   -0.108  -0.283   0.0    0.283  0.108

Table 1: Simoncelli's 5-point Matched/Balanced Kernels.
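The separable filtering just described can be sketched as follows (an illustration, not our exact code): each derivative is obtained by blurring with p_5 along two dimensions and differentiating with d_5 along the third.

```python
import numpy as np
from scipy.ndimage import convolve1d

# Simoncelli's 5-tap matched kernels from Table 1.
p5 = np.array([0.036, 0.249, 0.431, 0.249, 0.036])   # low pass (blur)
d5 = np.array([-0.108, -0.283, 0.0, 0.283, 0.108])   # high pass (derivative)

def spatiotemporal_derivatives(seq):
    """seq: 5 presmoothed frames stacked as shape (t, y, x), t = 5.
    Returns Ix, Iy, It for the middle frame, as in Section 4.4.
    (convolve1d flips the kernel; for the antisymmetric d5 this only
    changes the sign convention of the derivative.)"""
    Ix = convolve1d(convolve1d(convolve1d(seq, p5, axis=0), p5, axis=1),
                    d5, axis=2)   # blur t, blur y, differentiate x
    Iy = convolve1d(convolve1d(convolve1d(seq, p5, axis=0), p5, axis=2),
                    d5, axis=1)   # blur t, blur x, differentiate y
    It = convolve1d(convolve1d(convolve1d(seq, p5, axis=2), p5, axis=1),
                    d5, axis=0)   # blur x, blur y, differentiate t
    mid = seq.shape[0] // 2
    return Ix[mid], Iy[mid], It[mid]
```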
4.5 Computing Regularized Optical Flow using a Groove Orientation Constraint

Our image sequences are generated by a single stationary camera taking images of a record surface as it rotates underneath the camera. This is equivalent to a moving camera taking images of a stationary record surface, except that the flow field is in the opposite direction. The standard image velocity equations [Longuet-Higgins and Prazdny, 1980] relate a velocity vector \vec{v} = (u, v), measured at image location \vec{p} = (x, y, f) = f \vec{P} / Z [i.e. the perspective projection of a 3D point \vec{P} = (X, Y, Z)], to the 3D sensor translation \vec{U} = (U_1, U_2, U_3) and 3D sensor rotation \vec{ω} = (ω_1, ω_2, ω_3) parameters. We can rewrite these standard equations as the sum of the translational and rotational image velocity components:

\vec{v}(\vec{p}, t) = \vec{v}_T(\vec{p}, t) + \vec{v}_R(\vec{p}, t),  (9)

where \vec{v}_T and \vec{v}_R are the translational and rotational components of image velocity:

\vec{v}_T(\vec{p}, t) = A_1(\vec{p}) \frac{\vec{U}}{Z} and \vec{v}_R(\vec{p}, t) = A_2(\vec{p}) \vec{ω}(t),  (10)

A_1(\vec{p}) = \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix}  (11)

and

A_2(\vec{p}) = \begin{pmatrix} \frac{xy}{f} & -(f + \frac{x^2}{f}) & y \\ f + \frac{y^2}{f} & -\frac{xy}{f} & -x \end{pmatrix}.  (12)

Because the camera motion is a simple translation along the positive x axis, i.e. \vec{U} = [U_1, 0, 0]^T and \vec{ω} = (0, 0, 0) in our application, Z can be computed as:

Z = -\frac{f U_1}{u}.  (13)

We can differentiate Z with respect to u as:

\frac{∂Z}{∂u} = \frac{f U_1}{u^2}.  (14)

Since y = f Y / Z, we obtain:

\frac{∂y}{∂Y} = \frac{f}{Z} = -\frac{u}{U_1}.  (15)

Because of the shape of the cutting head (a "V", i.e. a 45° slope), the surfaces of the groove are planar and:

\frac{∂Z}{∂Y} = \frac{1}{c},  (16)

where c = ±1 depending on the side of the groove surface. Figure 4 illustrates this groove surface constraint.

Figure 4: Illustration of the groove surface constraint due to the shape of the groove.

From Equation (16), we have:

\frac{∂Z}{∂Y} = \frac{∂Z}{∂u} \frac{∂u}{∂y} \frac{∂y}{∂Y} = \frac{1}{c}.  (17)

Substituting Equations (14) and (15) into Equation (17), we obtain:

\frac{∂u}{∂y} = -\frac{u}{cf}.  (18)

Because of the groove orientation constraint in Equation (18), we modify Horn and Schunck's regularization functional to:

\int_D (∇I · \vec{v} + I_t)^2 + α^2 \left( \left(\frac{∂u}{∂x}\right)^2 + \left(\frac{∂u}{∂y} + \frac{u}{cf}\right)^2 + \left(\frac{∂v}{∂x}\right)^2 + \left(\frac{∂v}{∂y}\right)^2 \right) dx \, dy.  (19)

Using the approximations ∇^2 v_1 ≈ \bar{v}_1 - v_1 and ∇^2 v_2 ≈ \bar{v}_2 - v_2 [Horn and Schunck, 1981], we obtain the Euler-Lagrange equations:

\left(I_x^2 + α^2 + \frac{α^2}{c^2 f^2}\right) u + I_x I_y v = α^2 \bar{u} - I_x I_t and  (20)

I_x I_y u + (I_y^2 + α^2) v = α^2 \bar{v} - I_y I_t.  (21)

The Gauss-Seidel method was used to obtain the iterative equations:

u^{n+1} = \bar{u}^n - \frac{\left(I_x^2 + \frac{1}{c^2 f^2} I_y^2 + \frac{α^2}{c^2 f^2}\right) \bar{u}^n + I_x I_y \bar{v}^n + I_x I_t}{I_x^2 + \frac{1}{c^2 f^2} I_y^2 + \frac{α^2}{c^2 f^2} + I_y^2 + α^2} and  (22)

v^{n+1} = \bar{v}^n - \frac{\left(I_y^2 + \frac{1}{c^2 f^2} I_y^2\right) \bar{v}^n + I_x I_y \bar{u}^n + \left(1 + \frac{1}{c^2 f^2}\right) I_y I_t}{I_x^2 + \frac{1}{c^2 f^2} I_y^2 + \frac{α^2}{c^2 f^2} + I_y^2 + α^2}.  (23)

Figure 5 shows the optical flow field computed using the above algorithm on a set of synthetic groove images.

Figure 5: The optical flow computed by the modified Horn & Schunck regularization using the groove orientation constraint, with a spatial gradient threshold ||∇I||_2 ≥ 2.0 and a Lagrange multiplier α = 1.0. The flow field is subsampled by 8 and scaled by 6.0.

5 Depth Map Computation

We performed a survey [Tian and Barron, 2005] of recent algorithms for dense depth maps (from image velocities or intensity derivatives) that appeared to give good results in the literature. All of these algorithms assume known camera translation and rotation (or can be made to have this assumption). We looked at the algorithms of [Heel, 1990], [Matthies et al., 1989], [Hung and Ho, 1999] and [Barron et al., 2003]. A detailed description of the implementation of these algorithms and their performance can be found in a conference paper [Tian and Barron, 2005]. Quantitative results show that the method of [Barron et al., 2003] was the best overall, although [Matthies et al., 1989] was very competitive.

5.1 Barron, Ngai and Spies

Barron, Ngai and Spies [Barron et al., 2003] proposed a Kalman filter framework for recovering a dense depth map from the time-varying optical flow fields generated by a camera translating over a scene by a known amount. They assumed local neighbourhood planarity to avoid having to compute non-pixel correspondences. That is, the surface orientation (of a plane) is what is tracked by the Kalman filter over time. We have already seen that the standard image velocity equations [Longuet-Higgins and Prazdny, 1980] relate an image velocity vector measured at image location \vec{p} = (x, y, f) = f \vec{P} / Z to the 3D sensor translation \vec{U} and 3D sensor rotation \vec{ω}: \vec{v}(\vec{p}, t) = \vec{v}_T(\vec{p}, t) + \vec{v}_R(\vec{p}, t), where \vec{v}_T and \vec{v}_R are the translational and rotational components of image velocity. We define the depth scaled camera translation as:

\vec{u}(\vec{p}, t) = \frac{\vec{U}(t)}{||\vec{P}(t)||_2} = \hat{u} μ(\vec{p}, t),  (24)

where \hat{u} = \vec{U} / ||\vec{U}||_2 = (u_1, u_2, u_3) is the normalized direction of translation and μ(\vec{p}, t) = ||\vec{U}||_2 / ||\vec{P}||_2 = f ||\vec{U}||_2 / (Z ||\vec{p}||_2) is the depth scaled sensor speed at \vec{p} at time t. The focal length f is assumed to be known. If we define the two vectors:

\vec{r}(\vec{p}) = (r_1, r_2) = |\vec{v} - A_2(\vec{p}) \vec{ω}|  (25)

and

\vec{d}(\vec{p}) = (d_1, d_2) = \frac{||\vec{p}||_2 |A_1(\vec{p}) \hat{u}|}{f},  (26)

where |\vec{A}| means each element in the vector is replaced by its absolute value, then we can solve for μ as the weighted average:

μ = \frac{\frac{r_1}{d_1} |v_1| + \frac{r_2}{d_2} |v_2|}{|v_1| + |v_2|}.  (27)

5.1.1 Planar Orientation from Relative Depth

We compute the local surface orientation as a unit normal vector \hat{α} = (α_1, α_2, α_3) from the μ values as:

\hat{α} · \vec{p} = \frac{c μ ||\vec{p}||_2}{||\vec{U}||_2}.  (28)

We can solve for \hat{α}/c by setting up a linear system of equations, one for each pixel in an n × n neighbourhood where planarity has been assumed, and using a standard least squares solution calculation.
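A sketch of this measurement step is given below. It assumes the special case used in this paper, \vec{U} = (U_1, 0, 0) and \vec{ω} = 0, for which A_2 \vec{ω} vanishes and the d_2 term of Equation (27) drops out; the function names are ours, for illustration only.

```python
import numpy as np

def mu_from_flow(u, v, x, y, f):
    """Depth-scaled sensor speed (Eq. 27) for pure x translation, omega = 0.
    With U_hat = (1, 0, 0), A1(p) @ U_hat = (-f, 0), so (Eq. 26) d = (||p||, 0)
    and the d2 term of the weighted average drops out."""
    p_norm = np.sqrt(x**2 + y**2 + f**2)     # ||p||_2 with p = (x, y, f)
    r1 = np.abs(u)                           # Eq. (25), since A2 * omega = 0
    w = np.abs(u) + np.abs(v) + 1e-12        # avoid division by zero
    return (r1 / p_norm) * np.abs(u) / w

def plane_orientation(mu, x, y, f, U_norm):
    """Least-squares solve of Eq. (28) for g = alpha_hat / c over an n x n
    neighbourhood, given per-pixel mu values."""
    p = np.stack([x.ravel(), y.ravel(), np.full(x.size, f)], axis=1)
    b = mu.ravel() * np.linalg.norm(p, axis=1) / U_norm
    g, *_ = np.linalg.lstsq(p, b, rcond=None)
    return g
```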
5.1.2 The Overall Calculation

We state the overall algorithm. At the initial time, t = 1:

1. We compute all the μ's as described in Equation (27).

2. In each n × n neighbourhood centered at a pixel (i, j), we compute (\hat{α}/c)_{(i,j)} at that pixel using Equation (28). We call these computed \hat{α}/c's the measurements and denote them as \vec{g}_{M(i,j)}.

3. Given these measurements, we use the \vec{g}_{M(i,j)} to recompute the μ_{(i,j)}'s as:

μ(i, j) = \frac{(\vec{g}_{M(i,j)} · \vec{p}_{(i,j)}) ||\vec{U}||_2}{||\vec{p}_{(i,j)}||_2}.  (29)

We apply a median filter to the μ(i, j) within 5 × 5 neighbourhoods to remove outliers. We repeat step 2 once more with these values to obtain the final \vec{g} values.

At time t ≥ 2:

1. We compute μ at each pixel location and then compute all the \vec{g}_{M(i,j)}'s in the same way described above for the new optical flow field. Using the image velocity measurements at time t = i, we use the best estimate of surface orientation at time t = i - 1 at location \vec{p} - \vec{v} (Δt = 1), plus the measurement at \vec{p} and its covariance matrix, to obtain a new best estimate at \vec{p} at time t = i. We do this at all locations (where possible), recompute the μ values via Equation (29) and output these as the 3D shape of the scene. At time t = i we proceed as for time t = 2, except that we use the best μ estimates from time t = i - 1 instead of time t = 1 in the Kalman filter updating.

5.1.3 The Kalman Filter Equations

Note that the components of \hat{α}/c in Equation (28) are not independent; thus we have a covariance matrix with non-zero off-diagonal elements in the Kalman filter equations. We use a standard set of Kalman filter equations to integrate the surface orientations (and hence depth) over time. Please see Barron, Ngai and Spies [Barron et al., 2003] for the details of the Kalman filter equations.

5.2 Experimental Technique

We generated ray-traced cube, cylinder and sphere image sequences with the camera translating to the left by (-1, 0, 0), as shown in Figure 6.

Figure 6: Synthetic test data: (a) a marble-texture cube, (b) a marble-texture cylinder and (c) a marble-texture sphere.

We marble-textured these sequences so that optical flow could be computed; the texture is kept fixed to the object. We also generated a second set of image sequences with the same objects but with sinusoidal patterns instead of marble texture. These sequences allowed the correct derivatives to be computed; we used these, or the correct optical flow, to confirm the correctness of our implementations. We compute error distributions for relative depth error ranges ≤ 5%, between 5% and 15%, and ≥ 15% for 4 frames (the 7th frame at the beginning of the sequences and the 19th, 27th and 36th frames in the middle and near the end of the sequences). We also compute the average error (as a percentage) and its standard deviation for the 4 frames.
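These error statistics can be computed directly from the true and estimated depth maps; a minimal sketch with the bin boundaries of Table 2 follows.

```python
import numpy as np

def depth_error_stats(Z_est, Z_true):
    """Relative depth error distribution and mean/std, as reported in Table 2."""
    err = 100.0 * np.abs(Z_est - Z_true) / np.abs(Z_true)   # percent error
    n = err.size
    bins = {"<5%":   100.0 * np.sum(err < 5) / n,
            "5-15%": 100.0 * np.sum((err >= 5) & (err <= 15)) / n,
            ">15%":  100.0 * np.sum(err > 15) / n}
    return bins, err.mean(), err.std()
```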
5.3 Error Distributions and Depth Maps

Table 2 shows the error distributions for the 7th, 19th, 27th and 36th frames of the three image sequences. We can see that, as more images are processed by the Kalman filter, the results become more accurate. By the 36th image, about 85% of the depth values had less than 15% error.

Error %          <5%     5-15%   >15%    Mean±σ
7th  cube        33.50   41.27   25.23   12.16±13.84
19th cube        51.63   37.56   10.81    7.19±8.51
27th cube        52.18   38.16    9.66    6.88±7.88
36th cube        52.76   35.53   11.71    7.23±8.67
7th  cylinder    36.71   37.86   25.43   12.55±15.26
19th cylinder    54.26   33.25   12.48    7.29±8.87
27th cylinder    52.91   33.58   13.51    7.52±8.95
36th cylinder    52.10   34.67   13.23    7.46±8.50
7th  sphere      26.03   36.38   37.59   16.91±18.18
19th sphere      41.45   38.82   19.73    9.64±10.49
27th sphere      43.99   38.33   17.68    9.01±9.91
36th sphere      45.75   37.59   16.66    8.68±9.75

Table 2: The percentage of the estimated depth values that have various relative error distributions in the experiments for the cube, cylinder and sphere using [Barron et al., 2003]'s algorithm. The last column shows the mean error and its standard deviation σ.

Figure 7 shows the raw depth maps for the 3 objects. We left these images unsmoothed and untexture-mapped to show how good the depth recovery really was.

Figure 7: Depth maps for [Barron et al., 2003]'s algorithm for (a) the cube, (b) the cylinder and (c) the sphere at the 27th frame of the image sequences.

We report experimental results for Barron et al.'s algorithm on synthetic record groove images and on real groove images, with encouraging results, in this paper. Because the groove wall orientation can be described by 2 angles, one of which is constrained, and because the vertical component of image velocity is always very small (a uni-direction constraint), we believe imposing such constraints yields even better results than the original algorithm. In addition, [Barron et al., 2003] was modified to use only horizontal velocities. Effectively, we now have only one angle of the surface orientation to track in the Kalman filter.

6 Robust Estimation of Surface Orientation

Surface orientation is computed from depth using a least squares framework. Assuming local planarity, the surface orientation \hat{α} of a local neighbourhood is constant and satisfies the planar equation \hat{α} · \vec{P} = c, where \vec{P} = [X, Y, Z] is the 3D coordinate of a pixel and c is a constant. We can solve this linear system using a robust estimation method called Local M-Estimates, as recommended in Press et al. [Press et al., 1992]. This gives more accurate surface orientation, as outlier data are suppressed.

We present a robust estimation formulation of this calculation as follows. We use the vector \vec{g} = (g_1, g_2, g_3) to denote \hat{α}/c = (α_x/c, α_y/c, α_z/c). We can set up a least squares system:

W \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \vdots & \vdots & \vdots \\ a_{N1} & a_{N2} & a_{N3} \end{pmatrix} \begin{pmatrix} g_1 \\ g_2 \\ g_3 \end{pmatrix} = W \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{pmatrix},  (30)

or

W A \vec{g} = W B,  (31)

where W is an N × N diagonal matrix whose diagonal elements act as the weights for the N equations, and:

a_{i1} = X_i,  (32)
a_{i2} = Y_i,  (33)
a_{i3} = Z_i and  (34)
b_i = 1.  (35)

The solution is \vec{g} = (A^T W^2 A)^{-1} A^T W^2 B.
We d take advantage of this oversampling factor and apply sub-sampling early in the calculation at the depth recovery stage to significantly reduce the computational burden. There are 6000 groups of such images, with each group containing 12 images (using 36 images per group as for the synthetic data might improve the result slightly but the computing time would increase significantly) for the optical flow computation. It took nearly a week to capture these images and another week to process them (so currently we are nowhere near real-time, this work is a proof-of-concept only). About 3 seconds of music was reproduced in this time period. Figure 8 shows the original and reconstructed piece of music from “A Fine Romance” by Kern-Fields, produced by Decca, a 78RPM record The original music contains both vocal instrumental sounds. The computed Pearson’s product-moment correlation coefficient between them is r = 0.559. Pearson’s product-moment correlation coefficient r is computed as: P (xi − x̄)(yi − ȳ) pP r = pP i , (36) 2 2 (x i i − x̄) i (yi − ȳ) where x and y are the sampled signals. Listening confirms5 that the vocal sound is quite recognizable in spite of the presence of noise. The music part is lost in the noise, probably because it is already very weak during the singing section. The weight matrix W plays a critical role in this robust estimation calculation. Initially, W is I (the Identity Matrix) so that a rough solution is obtained. 5 www.csd.uwo.ca/faculty/barron/AUDIOFILES (a) (b) (c) (d) Figure 8: (a) The original sound wave recorded from a 78 rpm record player. (b) The recovered sound wave from groups of groove images. (c) The wave peak envelope of the original sound. (d) The wave peak envelope of the recovered sound. The horizontal axis represents the time and the vertical axis represents the magnitude. Using this solution, we can refine W using the Lorentzian estimator, ρ, as shown in [Black and Anandan, 1996]: 2 ! 1 di ρ(di , σ) = log 1 + (37) 2 σ and the influence function ψ (which is the derivative of ρ): ψ(di , σ) = 2di , 2σ 2 + d2i (38) where σ is a scale parameter and di is the residual value of each equation: di = |ai1 g1 + ai2 g2 + ai3 g3 − bi |. (39) Then the weight matrix elements get updated as: wi = ψ(di , σ) . di (40) We can re-calculate ~g again using the updated W . This procedure is repeated until one of the following stopping criteria is met: • the total residual is smaller than some threshold: ||di ||2 < τ1 , • the total residual begins to diverge: ||di ||t− − ||di ||t < τ2 or • the number of iterations reaches a limit. The second threshold, τ2 , which is a small positive number, allows the total residual to vary up and down a bit before iterations are considered to be converging or diverging. According to Black and Anandan, tuning the scale parameter σ may work well given that the initial√ approximation for it is not too bad. Since in the Lorentzian estimator, a residual di is considered an outlier if di ≥ 2σ, lowing σ after each iteration will reveal more and more outliers. Another benefit from doing this is that the number of outliers identified could help us to determine whether the value of σ is properly chosen and when to terminate the iterations. 7 From Surface Orientation to Sound Signal Once the surface orientations are computed, they need to be interpreted into sound signals so that they can be played. Figure 9 illustrates a piece of a groove, showing the left surface orientation α̂L . 
7 From Surface Orientation to Sound Signal

Once the surface orientations are computed, they need to be interpreted as sound signals so that they can be played. Figure 9 illustrates a piece of a groove, showing the left surface orientation \hat{α}_L. The figure also shows the two angles (θ_{XY} and θ_{YZ}) that determine \hat{α}_L. Due to the +45/-45 modulation of the groove walls, θ_{YZ} is approximately 45 degrees at all times. Note also that locally the surfaces are planar. To extract the left channel signal, we observe that the surface orientation lies in the z = y plane because of the +45/-45 stereo modulation. Accordingly, the surface orientation corresponding to the right channel lies in the z = -y plane.

Figure 9: Illustration of a piece of groove showing that the surface orientation \hat{α}_L of the left groove wall lies in the z = y plane.

We define θ_L to be the modulation angle between the surface orientation \hat{α}_L and the left-channel zero-modulation orientation \hat{n}_L = [0, \frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}], which is the surface orientation of the left groove wall when the signal is zero. Then the ratio of the lateral speed V_L to the tangential speed V_T of the stylus is:

\frac{V_L}{V_T} = \tan θ_L,  (41)

where

θ_L = \arccos(\hat{α}_L · \hat{n}_L).  (42)

V_L corresponds to the left channel signal and needs to be adjusted according to V_T = ωR, where R is the current distance to the record center. A similar method can be applied to reproduce the right channel signal, V_R, using the direction \hat{n}_R = [0, -\frac{\sqrt{2}}{2}, \frac{\sqrt{2}}{2}] instead of \hat{n}_L.

A mono recorded gramophone record with horizontal groove modulation can be treated as a special case of stereo groove modulation, where V_L = V_R = V, so either the left or the right surface orientation can be used to reproduce the sound signal. For a mono SP or a wax cylinder with vertical groove modulation, we can project the surface orientation onto the vertical x-z plane, i.e. y = 0, and θ and V can be calculated using the above equations, except that now:

\hat{α} = [α_x, 0, α_z]  (43)

and

\hat{n} = [0, 0, 1].  (44)

Due to the robustness of the algorithm introduced in Section 6, we anticipate that the algorithm will be able to reject or attenuate most of the noise, such as the pops and clicks caused by scratches or small dirt particles.
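Equations (41) and (42) translate directly into code; a minimal sketch for the left channel follows (the function name and default rpm value are illustrative choices).

```python
import numpy as np

SQRT2_2 = np.sqrt(2.0) / 2.0
N_LEFT  = np.array([0.0,  SQRT2_2, SQRT2_2])  # left-wall zero-modulation normal
N_RIGHT = np.array([0.0, -SQRT2_2, SQRT2_2])  # right-wall zero-modulation normal

def left_channel(alpha_hats, R, rpm=78.0):
    """Left channel signal from unit surface normals alpha_hats (N x 3),
    following Eqs. (41) and (42). R is the current groove radius."""
    V_T = (2.0 * np.pi * rpm / 60.0) * R           # tangential speed V_T = omega*R
    cos_theta = np.clip(alpha_hats @ N_LEFT, -1.0, 1.0)
    theta_L = np.arccos(cos_theta)                 # Eq. (42)
    return V_T * np.tan(theta_L)                   # V_L = V_T tan(theta_L), Eq. (41)
```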
8 Synthetic Evaluation

In order to test the above algorithms, we made 1390 groups of ray-traced groove images from a 2-second piece of recorded sound and then applied the above methods to reconstruct the surface structure and retrieve the sound from the reconstructed groove [Tian and Barron, 2006]. The simulation involves ray-tracing groove images based on the shape features of record grooves.

We did not perform RIAA equalization on our recovered sound tracks, as this technique was introduced in 1955 and our 78 rpm records were manufactured circa 1930. At present, equalization has not been taken into account, and the sound has an excess of high frequency components. We applied a 2nd order lowpass Butterworth filter to the sound signal, whose response decreases by 12 dB per octave, with a cutoff frequency of F_c = 3000 Hz, to attenuate high frequency noise [Butterworth, 1930, Bianchi and Sorrentino, 2007]. However, in order to get a more satisfying sound, it may be preferable to modify the equalization after the normal preamp output or to apply a custom equalization to a flat transfer function in the digital domain.

8.1 Reconstruction of the Depth and Recovery of the Sound

We fed the computed optical flow sequence to [Barron et al., 2003]'s depth recovery algorithm (discussed above), which uses a Kalman filter to compute a smooth surface reconstruction. Figure 10 shows a recovered groove depth map computed from the optical flow using [Barron et al., 2003]'s depth recovery algorithm after 30 frames of optical flow. The groove structure is well recovered.

Figure 10: The 3D perspective view of the reconstructed groove calculated from optical flow using the algorithm of [Barron et al., 2003]. The horizontal axis represents the x axis and the vertical axis represents the y axis.

Figure 11 shows synthetic sinusoid waves and the corresponding recovered sound waves at frequencies of 200Hz, 500Hz, 1KHz, 2KHz, 5KHz, 10KHz and 20KHz. For a sound wave at a frequency of 2KHz or less, the recovered wave looks good. For higher frequencies, such as 5KHz and greater, the noise eventually becomes so dominant that it overwhelms the true signal. Note that for the real data results (presented below), the sampling rate was 10 times greater. Hence, we can recover all the relevant frequency information up to 20KHz from the real data without aliasing problems.

Figure 11: Synthetic sinusoid groove walls for (a) 200Hz, (b) 500Hz, (c) 1KHz, (d) 2KHz, (i) 5KHz, (j) 10KHz and (k) 20KHz, and (e)-(h) and (l)-(n) their recovered sound waves respectively. The horizontal axis represents time and the vertical axis represents magnitude.

After the depth maps were computed, the surface orientations of the grooves were estimated using the algorithm described in Section 6. From the surface orientations, sound signals were computed as outlined in Section 7. We tested our algorithm on 2 synthetic sound wave pieces: a male voice saying "Computers are useless, they only give you answers" and a short rendering of part of the Canadian anthem "O Canada". These two synthetic image sequences were generated using ray tracing; our program takes the camera speed (1, 0, 0), the group number (i.e. which segment of the groove) and the number of images per group (36) as input parameters. Since we use 7 images to compute 1 flow field, 36 images allow 30 optical flow fields to be computed: the error distribution of the depth estimation accuracy converged after the 27th depth map calculation in the above cube, cylinder and sphere experiments (see Figures 6 and 7 and Table 2). The shape of the groove is determined by the sound data. We then ray trace it using a given focal length and distance from the view point. This is a computationally expensive process, because we had to extend each ray gradually along a pre-determined path until it encountered the groove (the shape of a groove is arbitrary and we have no analytical means to solve for its coordinates other than a brute-force search). The sound wave pieces for these 2 sequences, such as the one shown in Figure 12, were combined together to form the total sound wave.

Figure 12: The recovered sound wave from one piece of synthesized groove, shown as floating point numbers. The horizontal axis represents time and the vertical axis represents magnitude.

The original sound wave and the recovered sound wave were compared using their wave peak envelopes, shown in Figures 13 and 14.
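A sketch of the envelope extraction and the correlation measure of Equation (36) is given below; the peak-spacing parameter is a tuning assumption, not a value from our experiments.

```python
import numpy as np
from scipy.signal import find_peaks

def peak_envelope(x, min_spacing=50):
    """Wave peak envelope: interpolate |x| through its local maxima.
    Assumes the signal contains at least two peaks."""
    peaks, _ = find_peaks(np.abs(x), distance=min_spacing)
    return np.interp(np.arange(len(x)), peaks, np.abs(x)[peaks])

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient, Eq. (36)."""
    xm, ym = x - x.mean(), y - y.mean()
    return np.sum(xm * ym) / np.sqrt(np.sum(xm**2) * np.sum(ym**2))

# e.g. r = pearson_r(peak_envelope(original), peak_envelope(recovered))
```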
Figure 13: (a) The original sound wave of the male voice. (b) The recovered sound wave of the male voice from groups of groove images. (c) The wave peak envelope of the original sound. (d) The wave peak envelope of the recovered sound. The horizontal axis represents time and the vertical axis represents magnitude.

Figure 14: (a) The original sound wave of "O Canada". (b) The recovered sound wave from groups of groove images. (c) The wave peak envelope of the original sound. (d) The wave peak envelope of the recovered sound. The horizontal axis represents time and the vertical axis represents magnitude.

We compute the wave peak envelopes of the computed and original waveforms and then compute Pearson's product-moment correlation coefficient between them as r = 0.887 for the male voice, which indicates good correlation. Figure 14 shows the original and reconstructed signals for a segment of music from "O Canada". The computed Pearson's product-moment correlation coefficient between the original and reconstructed signals is r = 0.890. The reproduction of these signals (www.csd.uwo.ca/faculty/barron/AUDIOFILES) confirms that the sound is quite recognizable, in spite of the presence of some noise.

The purpose of using Pearson's product-moment correlation coefficient is not to provide a quantitative measurement of how good the reconstructed signal is, but rather to provide a relative figure of which type of sound is reconstructed better. In our experiment, the music shown in Figure 14 is slightly better recovered than the human voice shown in Figure 13. Qualitative listening to the original and recovered sounds also confirms this point. A preliminary explanation is that the music contains more high frequency sounds than the human voice. High frequencies cause faster changes in surface orientation, making it easier for the algorithm to achieve a higher signal to noise ratio (SNR). Imagine a signal consisting mainly of low frequency information: in this case, the groove surface would be nearly flat all the time, and the recovered sound might contain more noise, overwhelming the useful sound signal.

9 Real Experiments on Gramophone Records

Acquiring real images of the gramophone record groove proved to be quite challenging. The extreme dimensions of the groove required extra care when capturing images using a microscope. Illumination, movement, etc. all have to be carefully adjusted and controlled. The nature of our algorithm requires the use of a very short focal length objective, which results in a very shallow depth of field, i.e. only a narrow band of the groove surface is in focus, while all other areas that do not have the same depth as the focused region are out of focus. The light path of the microscope objective must be specially modified so that more than one region of the groove is in focus at the same time. We installed a thin glass plate between the objective and the record surface, covering half of the field of view, to create a dual focus objective. This was both a simple and effective solution to our shallow depth of field problem. We discuss this in detail below. Processing the captured images is also a hard task, as the captured images are not as ideal as the images generated in the simulation: some images are blurry, while others show dirt in the groove. To overcome all the above-mentioned difficulties, special care must be taken during the experiment and in the implementation of the algorithms. These will be discussed in detail in the following sections.
9.1 System Setup

For the real data, we captured 1390 groups of images; each group contains 45 frames for computing the optical flow sequences. The distance between the center images of neighbouring groups is less than the width of an image, in order to get some overlap among the groups for smoothly varying optical flow between neighbouring groups. The experimental setup shown in Figure 15 consists of a slowly turning turntable, a microscope that magnifies the groove by approximately 400X to get good images of the groove with its subtle variations (the groove varies between 100 and 300µm in width) and an image capture device, in our case a high speed digital camera fitted to the microscope. Images were 24 bit colour and 1280 × 1024 pixels in size. The records being photographed are 78RPM Standard Play (SP) records. We focus on 78RPM records since they need less magnification and are easier to illuminate than 33RPM Long Play (LP) records.

Figure 15: Setup of microscope, camera, turntable and reduction drive.

9.2 The Focal Length Issue

Since we are trying to compute the groove wall's surface orientation, the absolute value of the depth is not critical. However, the relative difference in depth is important for the computation of surface orientation. The recovery of accurate depth differences is rather difficult if the target is far away from the camera, because the variance in the measurement may exceed the depth difference itself. The average groove depth of an SP record is about 120µm. If the average error of the depth estimation is 7-8%, then the average distance between groove and camera should be less than or equal to 1.7mm, which is quite short in terms of the focal length of a microscope objective. In our experimental setup, a 40X objective lens is used to obtain the shortest focal length available. Since a microscope's barrel length (the distance from the objective to the port where an eyepiece is fitted) is fixed, it is focused by varying the distance to the object until the desired distance is reached. The barrel length can be considered as the image distance u, which in our case is 160mm, as marked on the housing of the objective lens. Because the magnification is 40X, the object distance can be computed as v = u/40 = 160mm/40 = 4mm. Then we can compute the focal length f using the thin lens equation:

f = \frac{1}{\frac{1}{u} + \frac{1}{v}} = \frac{1}{\frac{1}{160} + \frac{1}{4}} = 3.9mm.  (45)

This focal length (3.9mm) is longer than we would like (1.7mm), but this is as close as we can get at this time. Thus, we use what is available and see what adjustments we need to make to compensate for this at a later stage in the experiments.

9.3 The Dual-focal Microscope Objective

The main problem with a high power objective lens is the depth of field. For a 40X objective lens, the depth of field is about 1µm [Nikon, 2006]. For a lower magnification factor we obtain a bigger depth of field but, at the same time, the focal length also increases, which makes differences in depth more difficult to distinguish. For example, a 4X objective lens has a depth of field of about 45µm [Nikon, 2006], which covers about half of the total groove depth. However, the focal length of this 4X objective is about 32mm according to the formula in Equation (45). Trying to detect a maximum depth variation of 120µm with a focal length of 32mm is a very tough job! So we stick to the available 40X objective in our experiments.
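As a small worked form of Equation (45) (a check only, using the 160mm barrel length and magnifications discussed above):

```python
def thin_lens_focal_length(u_mm=160.0, magnification=40.0):
    """Eq. (45): image distance u and object distance v = u / magnification
    give f = 1 / (1/u + 1/v)."""
    v = u_mm / magnification             # 160mm / 40 = 4mm
    return 1.0 / (1.0 / u_mm + 1.0 / v)

# thin_lens_focal_length() -> 3.902..., i.e. about 3.9mm
# thin_lens_focal_length(magnification=4.0) -> 32mm, as in Section 9.3
```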
Figure 16 shows an image captured using the 40X objective. It shows that only two narrow stripes, located on the opposite groove walls, are brought into focus. The overall image is a bit blurry because the record is turning at a certain speed. The two in-focus stripes are relatively sharp compared to the rest of the image. Since obtaining a larger depth of field, i.e. a clear image all the way from the groove top to the bottom, is theoretically impossible using our current microscope, bringing more than one depth stripe into focus becomes a more realistic alternative. Figure 17 illustrates the technique of obtaining two depths of focus using a piece of thin cover glass inserted between the objective lens and the record surface, covering half of the field of view and creating a dual-focusing objective.

Figure 16: The groove image captured directly using the 40X objective lens. Notice the shallow depth of field: only two stripes (shown in the highlighted areas) on opposite groove walls of the same depth are in focus.

Figure 17: Illustration of the modified light path for a microscope objective lens, with a piece of thin cover glass (d = 0.11mm) inserted between the objective lens and the record surface, covering half of the field of view. This setup forms a dual-focusing objective lens.

When light travels through the piece of cover glass, it is shifted by a certain amount because the refractive index of glass (λ = 1.515) differs from that of air (λ = 1.0). This shift causes the distance of the in-focus object to increase by a certain amount δv. In the next section, we show that this shifted distance δv is approximately constant.

Figure 18a shows an image captured using the modified light path with the same 40X objective lens. Clearly there are three in-focus stripes: the top two are at the same depth, with the groove bottom, out of focus, in the middle. The third stripe, at the bottom, represents a closer stripe on the same groove wall on which the second stripe resides. So there are two in-focus regions on the lower groove wall shown in one image. By taking a sequence of such images of the moving groove walls, we can determine their locations by segmenting their optical flow field. After assigning different depth values to these stripe areas, surface orientations are computed based on this information.

Figure 18: (a) Groove image captured using the 40X objective lens. Notice that there are now three stripes (highlighted areas) on opposite groove walls, of different depths, that are in focus. (b) Locations of the computed optical flow (white dots) of a section of a moving record groove. The direction and magnitude of the optical flow are not shown because only the position information is used here. (c) Results of the Hough transform. The 3 square boxes indicate the 3 peaks that represent the 3 most clustered lines. (d) The 2 segmented regions and their centroid lines, used for computing surface orientations.

9.4 Determining the Two Depth Levels

Figure 19 shows the detailed light path, made by enlarging the right half of Figure 17.

Figure 19: Illustration of the light path showing the shift of the focus distance of the objective lens when a piece of cover glass is inserted.

We can compute the shifted focus distance δv as follows.
From the light incident angle θ and refraction angle θ', we have:

L_1 = d \tan(θ) and L_2 = d \tan(θ'),  (46)

where d is the thickness of the cover glass. Thus, the shifted focus distance δv can be computed as:

δv = d \left(1 - \frac{1}{λ} \cdot \frac{\cos(θ)}{\cos(θ')}\right),  (47)

where λ = sin(θ)/sin(θ') = 1.515 is the refractive index of the cover glass. The ratio cos(θ)/cos(θ') is close to 1 when the incident angle is less than 22.5°. Using this approximation, Equation (47) becomes:

δv = d \left(1 - \frac{1}{λ}\right) = 0.037mm,  (48)

where d = 0.11mm. Next we compute the distances of the bottom two stripes in Figure 18 using this information. The bottom stripe, since it is not covered by the glass, has a distance of 4mm. The top two are at the same distance of 4.037mm, using the calculated focus distance shift.

9.4.1 Optical Flow and In-Focus Regions

Optical flow is computed using Horn and Schunck's algorithm [Horn and Schunck, 1981], modified by adding the groove surface orientation constraint. This constraint regulates the distribution of horizontal optical flow along the vertical direction, since it is known that one of the angles defining the wall orientation is 45°. Full details are in a PhD thesis [Tian, 2006, Tian, 2008]. Figure 18b shows the area where optical flow is computed. Since optical flow is used to determine the regions that are in good focus (out-of-focus regions tend to be heavily blurred, with intensity derivatives near zero), the magnitude of the optical flow is not as important as where it is located. Any location that has an optical flow vector is marked for later segmentation using the Hough transform.

9.4.2 Hough Transform

The computed optical flow for the grooves in Figure 18a is shown in Figure 18b and indicates three stripe regions that are in focus. In order to segment the three regions, we compute a Hough transform [Rafael C. Gonzalez, 1992] using the optical flow position information. After observing the optical flow positions in many frames, we observed that the stripes cluster in straight line shapes. Thus the dimensionality of the Hough transform used to segment these lines is reduced to 2, resulting in less computation time and storage space for the accumulator array. At each position that has an optical flow vector, many pairs of ρ and θ values can be calculated according to the normal representation of a line, x cos θ + y sin θ = ρ, that passes through this position. At every optical flow position, each value of θ from -45° to 45° in steps of 1° is used to compute a ρ value, and the accumulator array element at [θ, ρ] is incremented by one. The Hough transform as a θ-ρ image is shown in Figure 18c. In our experiments, since there are so many images to process, quantization steps 10 times bigger were used to obtain much faster computation without diminishing the segmentation results by much. Three global peaks of (θ, ρ) values in the accumulator array are detected, which represent the three lines. Any (θ, ρ) pair that is within a certain range of the three peak (θ, ρ) values is considered to belong to that region. To segment the flow field, we compute the (θ, ρ) values for each optical flow position. If they fall in one and only one of the three peak regions, then the position belongs to that region. If they fall in two or more regions, then the position lies between two regions, which is usually caused by noise, and it is simply discarded.
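The voting step can be sketched as follows; the array sizes are illustrative, and, as noted above, our implementation also used coarser quantization steps for speed.

```python
import numpy as np

def hough_accumulate(points, img_w=1280, img_h=1024, theta_step=1.0):
    """Vote x*cos(theta) + y*sin(theta) = rho for each optical flow position
    (Section 9.4.2); theta runs from -45 to +45 degrees."""
    thetas = np.deg2rad(np.arange(-45.0, 45.0 + theta_step, theta_step))
    rho_max = int(np.ceil(np.hypot(img_w, img_h)))  # largest possible |rho|
    acc = np.zeros((len(thetas), 2 * rho_max + 1), dtype=np.int32)
    for x, y in points:
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[np.arange(len(thetas)), rhos + rho_max] += 1  # offset so index >= 0
    return acc, thetas

# The three largest accumulator peaks give the (theta, rho) of the three
# in-focus stripes; each flow position is then assigned to the single peak
# region it falls in, or discarded as noise if the assignment is ambiguous.
```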
As discussed in Section 9.4, only the lower two regions have different depth values. After successful segmentation of the three stripes, the top one is discarded, as shown in Figure 18d. Also shown in this figure are the centroid lines for the two remaining regions (after a little smoothing). Void sections are filled with [x, y] values computed from the peak (θ, ρ) values using the θ–ρ line equation corresponding to each region. Depth values are then assigned to these two centroid lines to reduce the computational cost when computing surface orientation.

9.5 Computing Surface Orientation

Earlier, we introduced a robust method of estimating the surface orientation given depth values within a small neighbourhood [Tian and Barron, 2006]. The depth data have now been reduced to two virtually parallel lines with two different depth values, and our robust algorithm also works in this case. We use a neighbourhood of 100 pixels about each line for best results. This size is the best compromise: increasing it would attenuate high frequency sounds, while decreasing it would enhance noise effects.

9.6 Sound from Surface Orientation

The following images show an example of how damaged groove walls are repaired and the missing signal filled in. Figure 20a shows a damaged groove from "Give a Little, Take a Little" by Hank Thompson, a 78RPM record produced by RCA Victor. Nearly half of the groove walls in this image are missing. Figure 20b shows the positions of the computed optical flow; there are some vertical stripes caused by the damage. The Hough transform segments the stripe lines and removes the vertical stripes. Figure 20c shows the results of the Hough transform. Figure 20d shows the two segmented regions and their centroid lines for computing the surface orientations. The missing part has been filled in using each region's default line parameters.

Figure 20: (a) An image showing the damaged groove. (b) Positions of the computed optical flow of the damaged groove. (c) Results of the Hough transform. The 3 square boxes indicate the 3 clustered line regions. (d) The two segmented regions and their centroid lines for computing the surface orientations. The missing part has been filled in using its default line parameters.

Figure 21 shows the original and reconstructed music. The computed Pearson's product-moment correlation coefficient between them is r = 0.561. Listening confirms that the music is quite recognizable in spite of the presence of noise. The popping noise present in the original sound is much less obvious in the reconstructed sound.

Figure 21: (a) The original sound wave recorded from a 78RPM record player, with audible pops. (b) The recovered sound wave from groups of groove images. (c) The wave peak envelope of the original sound. (d) The wave peak envelope of the recovered sound.
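To make the evaluation step concrete, here is a minimal sketch (our own, with illustrative names and synthetic stand-in signals; it assumes the original and recovered waveforms have already been resampled to a common rate and roughly time-aligned) of computing Pearson's product-moment correlation coefficient:

import numpy as np

def pearson_r(original, recovered):
    """Pearson product-moment correlation between two aligned signals."""
    n = min(len(original), len(recovered))
    return np.corrcoef(original[:n], recovered[:n])[0, 1]

# Synthetic stand-ins for the recorded and recovered sound waves:
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)  # 1s, 440Hz tone
noisy = clean + 0.8 * rng.standard_normal(clean.size)       # noisy copy
print(f"r = {pearson_r(clean, noisy):.3f}")

A value such as the paper's r = 0.561 indicates moderate agreement; residual misalignment between the two waveforms can lower r even when the recovered audio is perceptually good.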
10 Conclusions

This framework forms a basis for reproducing sound from gramophone records using 3D reconstruction algorithms. The real data experiments revealed some practical limitations of the algorithm, and some effort was put into addressing them. The results indicate that our 3D reconstruction approach may be well suited to non-contact record playing and archiving. Future work includes improving each step of the algorithm: better optical flow, a better imaging setup, better depth and orientation reconstruction, and better computing resources, such as faster I/O, more CPU power and perhaps a parallel (SIMD) implementation to make the system real-time.

References

[Barron and Klette, 2002] Barron, J. and Klette, R. (2002). Quantitative colour optical flow. In Intl. Conf. on Pattern Recognition (ICPR2002), volume 4, pages 251–255.

[Barron et al., 1994] Barron, J. L., Fleet, D. J., and Beauchemin, S. S. (1994). Performance of optical flow techniques. IJCV, 12(1):43–77.

[Barron et al., 2003] Barron, J. L., Ngai, W. K. J., and Spies, H. (2003). Quantitative depth recovery from time-varying optical flow in a Kalman filter framework. In Asano, T., Klette, R., and Ronse, C., editors, LNCS 2616: Theoretical Foundations of Computer Vision: Geometry, Morphology, and Computational Imaging, pages 344–355.

[Bianchi and Sorrentino, 2007] Bianchi, G. and Sorrentino, R. (2007). Electronic Filter Simulation & Design. McGraw-Hill Professional.

[Black and Anandan, 1996] Black, M. J. and Anandan, P. (1996). The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104.

[Butterworth, 1930] Butterworth, S. (1930). On the theory of filter amplifiers. Wireless Engineer, 7:536–541.

[Cavaglieri et al., 2001] Cavaglieri, S., Johnsen, O., and Bapst, F. (2001). Optical retrieval and storage of analog sound recordings. In The AES 20th International Conference, Budapest, Hungary.

[ELP, 1997] ELP (1997). ELP laser turntable. Internet reference: www.elpj.com.

[Fadeyev and Haber, 2003] Fadeyev, V. and Haber, C. (2003). Reconstruction of mechanically recorded sound by image processing. J. of Audio Eng. Soc., 51(12):1172–1185.

[Fadeyev et al., 2005] Fadeyev, V., Haber, C., Maul, C., McBride, J., and Golden, M. (2005). Reconstruction of recorded sound from an Edison cylinder using three-dimensional non-contact optical surface metrology. J. of Audio Eng. Soc., 53(6):485–508.

[Gonzalez and Woods, 1992] Gonzalez, R. C. and Woods, R. E. (1992). Digital Image Processing. Addison-Wesley Publishing Company.

[Heel, 1990] Heel, J. (1990). Direct dynamic motion vision. In Proc. IEEE Conf. on Robotics and Automation.

[Horn and Schunck, 1981] Horn, B. K. P. and Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17:185–204.

[Hung and Ho, 1999] Hung, Y. S. and Ho, H. T. (1999). A Kalman filter approach to direct depth estimation incorporating surface structure. IEEE PAMI, pages 570–576.

[Iwai et al., 1986] Iwai, T., Asakura, T., Ifukube, T., and Kawashima, T. (1986). Reproduction of sound from old wax phonograph cylinders using the laser-beam reflection method. Applied Optics, 25(5):597–604. (Internet reference: www.opticsinfobase.org/abstract.cfm?URI=ao-25-5-597).

[Kessler and Ziegler, 1999] Kessler, T. and Ziegler, S. (1999). Direct play back of negatives of historic sound cylinders. In EVA (Electronic Media & Visual Arts) Europe'99, pages 8.1–8.5. (Internet reference: www.gfai.de/projekte/spubito/papers/eva99.pdf).

[Laborelli et al., 2007] Laborelli, L., Chenot, J.-H., and Perrier, A. (2007). Non-contact phonographic discs digitisation using structured colour illumination. In Audio Engineering Society 122nd Convention, Vienna, Austria (11 pages). (Paper 7009).

[Longuet-Higgins and Prazdny, 1980] Longuet-Higgins, H. and Prazdny, K. (1980). The interpretation of a moving retinal image. Proceedings of the Royal Society of London B, Biological Sciences, 208(1173):385–397.

[Lucas and Kanade, 1981] Lucas, B. D. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Image Understanding Workshop, pages 121–130. DARPA.
[Matthies et al., 1989] Matthies, L., Szeliski, R., and Kanade, T. (1989). Kalman filter-based algorithms for estimating depth from image sequences. IJCV, 3(3):209–238.

[Nikon, 2006] Nikon (2006). Nikon microscopy tutorial. Internet reference: www.microscopyu.com.

[Olsson et al., 2003] Olsson, P., Öhlin, R., Olofsson, D., Vaerlien, R., and Ayrault, C. (2003). The digital needle project - group light blue. Technical report, KTH Royal Institute of Technology, Stockholm, Sweden. Internet reference: www.s3.kth.se/signal/edu/projekt/students/03/lightblue/.

[Press et al., 1992] Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (1992). Numerical Recipes in C. Cambridge University Press, 2nd edition.

[Simoncelli, 1994] Simoncelli, E. P. (1994). Design of multi-dimensional derivative filters. In IEEE Int. Conf. Image Processing, volume 1, pages 790–793.

[Springer, 2002] Springer, O. (2002). Digital needle - a virtual gramophone. Internet reference: www.cs.huji.ac.il/~springer/.

[Stanke and Kessler, 2000] Stanke, G. and Kessler, T. (2000). Verfahren zur Gewinnung von Tonsignalen aus Negativspuren in Kupfernegativen von Edison-Zylindern auf bildanalytischem/sensoriellen Wege (SpuBiTo) / Procedure to recover sound signals from the negative tracks in copper negatives of Edison cylinders in an image-analysis/sensorial way (SpuBiTo). In Simon, A., editor, Das Berliner Phonogramm-Archiv 1900-2000. VWB-Verlag für Bildung und Wissenschaft, Berlin.

[Stoddard, 1989] Stoddard, R. E. (1989). Optical turntable system with reflected spot position detection. United States Patent 4,870,631.

[Stoddard and Stark, 1989] Stoddard, R. E. and Stark, R. N. (1989). Dual beam optical turntable. United States Patent 4,870,631.

[Stotzer et al., 2003] Stotzer, S., Johnsen, O., Bapst, F., Milan, C., Sudan, C., Cavaglieri, S., and Pellizzari, P. (2003). VisualAudio: an optical technique to save the sound of phonographic records. IASA Journal, pages 38–47.

[Stotzer et al., 2004] Stotzer, S., Johnsen, O., Bapst, F., Sudan, C., and Ingold, R. (2004). Phonographic sound extraction using image and signal processing. In Proc. ICASSP, Montreal, Quebec, Canada.

[Tian, 2006] Tian, B. (2006). Reproduction of Sound Signal from Gramophone Records using 3D Scene Reconstruction. PhD thesis, University of Western Ontario, London, Ontario, Canada.

[Tian, 2008] Tian, B. (2008). Sound Recovery from Gramophone Records by 3D Reconstruction. VDM Verlag.

[Tian and Barron, 2005] Tian, B. and Barron, J. L. (2005). A quantitative comparison of 4 algorithms for recovering dense accurate depth. In 2nd Canadian Conference on Computer and Robot Vision, pages 498–505, Victoria, BC, Canada.

[Tian and Barron, 2006] Tian, B. and Barron, J. L. (2006). Reproduction of sound signal from gramophone records using 3D scene reconstruction. In Irish Machine Vision and Image Processing Conference, pages 84–91, Dublin, Ireland.

[Ziegler, 2000] Ziegler, S. (2000). Das Walzenprojekt zur Rettung der größten Sammlung alter Klangdokumente von traditioneller Musik aus aller Welt: Walzen und Schellackplatten des Berliner Phonogramm-Archivs / The wax cylinder project in rescue of the largest collection of old sound documents of traditional music from around the world: wax cylinders and shellac records of the Berlin Phonogramm-Archiv. In Simon, A., editor, Das Berliner Phonogramm-Archiv 1900-2000. VWB-Verlag für Bildung und Wissenschaft, Berlin.