COMPARING IMAGES USING CLOUD COMPUTING

A PROJECT REPORT

Submitted by
RAJYA BADAM

in partial fulfilment for the award of the degree of
MASTER OF SCIENCE in COMPUTER SCIENCE

UNIVERSITY OF COLORADO AT COLORADO SPRINGS
UNIVERSITY OF COLORADO: COLORADO SPRINGS 80918
May 2015

This project for the Master of Science degree by Rajya Badam has been approved for the Department of Computer Science by

Advisor: Dr. Terry Boult
Member: Dr. Edward Chow
Member: Dr. Sudhanshu Semwal

May 14th 2015

Acknowledgements

I would like to thank my advisor, Dr. Terry Boult, for helping me through this project, and both my project committee members, Dr. Sudhanshu Semwal and Dr. Edward Chow, from the University of Colorado at Colorado Springs. I am so grateful to my husband, Balaji Badam, who understood me and helped in every aspect. Without him it would not have been possible to complete my project. Thank you for all your help and support. My son, Adhrit Badam, also played a role in the project by playing by himself and not disturbing me while I was working. I would also like to thank my parents, Jakkampudi Ganeswara Rao and Jakkampudi Vijaya, for their encouragement in completing this project.

TABLE OF CONTENTS

Abstract
1. Introduction
   1.1. Overview
2. Related Work
   2.1. Image Hashing
      2.1.1. SHA Algorithm
      2.1.2. Perceptual Hashing
      2.1.3. Discrete Cosine Transform
   2.2. MongoDB
      2.2.1. NoSQL Database
   2.3. Cloud Computing
3. System Design
   3.1. Exact Hashing
      3.1.1. MD5 hash
      3.1.2. SHA hash
   3.2. Approximate Hashing
      3.2.1. Perceptual Hash
      3.2.2. Discrete Cosine Hash
4. Implementation
   4.1. Uploading Images
   4.2. Data Set
   4.3. System Environment
5. Performance Evaluation
   5.1. Comparison of exact and approximate hash matches
   5.2. Image Hits
   5.3. Performance measure on different browsers
      5.3.1. Test the Amazon Server performance with different results
      5.3.2. Image with Text
      5.3.3. Image with Stain
6. Advantages
7. Learned
   7.1. Cloud Computing
   7.2. MongoDB
   7.3. Image Hashing
8. Future Work
9. Conclusion
10. References

Abstract

Comparing images is a resource-intensive task. This project involves creating an application that uses resources on the cloud and provides a web interface that allows a user to perform an image search. The application consists of a database of image URLs and their image hashes, which is used to compute an exact or approximate hash comparison of a provided image against the database. The database is composed of image URLs from various sources on the web.

The project uses Amazon Web Services as the cloud computing platform. The advantages of Amazon Web Services are that it is easy to use, flexible, cost-effective, reliable, scalable, high-performance and secure. The advantages of cloud computing are lower capital expense, lower variable cost, scalable capacity, increased speed and agility, lower maintenance costs and, finally, global accessibility.

All the image URLs, along with their respective image hash codes, will be saved into a database; the project uses the popular NoSQL database MongoDB. Users shall be able to upload an image and search for the matching image URLs. An Amazon EC2 instance will be used to host the web application. The web implementation will be designed using PHP (Hypertext Preprocessor). MongoDB will be used on the Amazon EC2 instance to host the database for this project. The image hash will be computed using Python.

The work on the project is divided into:

1. Collecting a large set of image URLs for uploading into the database
2. Generating a hash of each image
3. Uploading the image hash and URL to MongoDB
4. Creating the web application in PHP
5. Hosting the PHP web application on the Amazon EC2 instance
6. Capturing the exact or approximate match image search results

1. Introduction

Image authentication plays an important role in security and communication. Images are transferred over the Internet and are readily available for access from any part of the world. Without an authentication mechanism, it is almost impossible to tell whether an image is original or has been manipulated.

1.1 Overview

Image searching is a process of converting an image into a digital form and extracting some useful information from it. Images can be searched by computing a hash code. A checksum such as MD5 can be used to find exact matches of an image, and approximate image search can be used to find similar images. The input to the search can be either an image or a URL. The output of the search is the exact or approximate matches of the image along with their URLs. The user is able to select a URL and view the image.

Finding images that are used by other sources is the main purpose of this project. Many images are reused by other sources, and it is difficult to track stolen or reused images. It is therefore very helpful to find the places where an image is used elsewhere.

Nowadays image searching can be done using image URLs or by directly uploading images, instead of searching with a word or a phrase. Uploading an image to search for similar matches is the most common approach, and it is common to upload and retrieve images from the web.

This project creates an on-demand instance which processes a batch of images. The application makes use of Amazon EC2 and a NoSQL database.

[Figure: architecture diagram with the labels Amazon EC2, Web Application, MongoDB, Process Request, and Data]

The matches for the image that is uploaded for searching are retrieved from MongoDB.

2. Related Work

2.1 Image Hashing

Exact hashing uses the MD5 or SHA algorithms for comparing images. Hash algorithms are called secure when:

1. It is computationally infeasible to find a message that corresponds to a given message digest.
2. It is computationally infeasible to find two different messages that produce the same message digest.
3. If a message is changed even by a single character, the result will be a completely different message digest.

The hashing methods considered in this project are:

i. Exact Hashing
   i. MD5 hash
   ii. SHA hash
ii. Approximate Hashing
   i. Perceptual Hash
   ii. Discrete Cosine Hash (DCT)

2.1.1 SHA Algorithm

The Secure Hash Algorithm is a family of cryptographic hash functions published by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard (FIPS), including:

SHA-0: A retronym applied to the original version of the 160-bit hash function published in 1993 under the name "SHA". It was withdrawn shortly after publication due to an undisclosed "significant flaw" and replaced by the slightly revised version SHA-1.

SHA-1: A 160-bit hash function which resembles the earlier MD5 algorithm. This was designed by the National Security Agency (NSA) to be part of the Digital Signature Algorithm. Cryptographic weaknesses were discovered in SHA-1, and the standard was no longer approved for most cryptographic uses after 2010.

SHA-2: A family of two similar hash functions, with different block sizes, known as SHA-256 and SHA-512. They differ in the word size; SHA-256 uses 32-bit words where SHA-512 uses 64-bit words. There are also truncated versions of each standard, known as SHA-224, SHA-384, SHA-512/224 and SHA-512/256. These were also designed by the NSA.

SHA-3: A hash function formerly called Keccak, chosen in 2012 after a public competition among non-NSA designers. It supports the same hash lengths as SHA-2, and its internal structure differs significantly from the rest of the SHA family.

2.1.2 Perceptual Hashing

Perceptual image hash functions produce hash values based on the image's visual appearance. A perceptual hash is also referred to as, for example, a robust hash or a fingerprint. Such a function calculates similar hash values for similar images, whereas for dissimilar images dissimilar hash values are calculated. Finally, by using an adequate distance or similarity function to compare two perceptual hash values, it can be decided whether two images are perceptually different or not. Perceptual image hash functions can be used, for example, for the identification or integrity verification of images.

2.1.3 Discrete Cosine Transform

The DCT, like any Fourier-related transform, expresses a function or signal (a sequence of finitely many data points) in terms of a sum of sinusoids with different frequencies and amplitudes. The DCT uses only cosine functions.

2.2 MongoDB

MongoDB is one of many cross-platform document-oriented databases. It is classified as a NoSQL database, which means it does not use the traditional table-based relational database structure. It stores JSON-like documents with dynamic schemas in a format called BSON.

BSON is a computer data interchange format used mainly as a data storage and network transfer format in the MongoDB database. It is a binary form for representing simple data structures and associative arrays, called objects or documents in MongoDB. The name “BSON” is based on the term JSON and stands for “Binary JSON”.

MongoDB is open source software which is used by Craigslist, eBay, Foursquare, SourceForge, Viacom, the New York Times and others. At the time of writing it is among the most popular NoSQL database systems. It uses a document-oriented structure, which stores the business subject in the minimal number of documents.

2.2.1 NoSQL Database

A NoSQL database is used for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. It offers simplicity of design, horizontal scaling and finer control over availability. NoSQL databases use key-value, graph, or document forms of data structures.
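As an illustration of the document form, the sketch below builds one record of the kind this project stores, pairing an image URL with its hash codes, and serializes it to JSON. The field names (`url`, `sha1`, `phash`, `dct_hash`) and the example values are hypothetical, since the report does not list the actual schema, and a real deployment would pass the dictionary to a MongoDB driver instead of the `json` module.

```python
import json


def make_image_document(url, sha1_hex, phash_hex, dct_hex):
    """Build a JSON-like document for one image.

    The field names are assumptions made for this sketch only.
    """
    return {
        "url": url,
        "sha1": sha1_hex,        # exact hash: 40 hexadecimal digits
        "phash": phash_hex,      # perceptual hash: 64 bits as 16 hex digits
        "dct_hash": dct_hex,     # DCT hash: 64 bits as 16 hex digits
    }


doc = make_image_document(
    "http://example.com/everest.jpg",   # placeholder URL
    "0" * 40,                           # placeholder hash values
    "f0e1d2c3b4a59687",
    "8f373714acfcf4d0",
)

# Documents in one collection need not share the same set of fields.
doc_without_approximate_hashes = {
    "url": "http://example.com/other.jpg",
    "sha1": "1" * 40,
}

print(json.dumps(doc, indent=2))
```

Keeping all hashes of an image in a single document means that one lookup returns everything needed to compare it with a query image, with no join across tables.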
NoSQL databases are increasingly used in big data and real-time web applications. NoSQL is also read as “Not only SQL”, since such systems may also support SQL-like query languages.

2.3 Cloud Computing

Cloud computing relies on sharing of resources to achieve coherence and economies of scale, similar to a utility (like the electricity grid) over a network. At the foundation of cloud computing is the broader concept of converged infrastructure and shared services.

Cloud computing, or "the cloud", also focuses on maximizing the effectiveness of the shared resources. Cloud resources are usually not only shared by multiple users but are also dynamically reallocated per demand. This can work for allocating resources to users. For example, a cloud computer facility that serves European users during European business hours with a specific application (e.g., email) may reallocate the same resources to serve North American users during North America's business hours with a different application (e.g., a web server). This approach should maximize the use of computing power, thus also reducing environmental damage, since less power, air conditioning, rack space, etc. are required for a variety of functions. With cloud computing, multiple users can access a single server to retrieve and update their data without purchasing licenses for different applications.

Cloud computing plays an important role in this project, which uses Infrastructure as a Service (IaaS). In this most basic cloud-service model, and according to the IETF (Internet Engineering Task Force), providers of IaaS offer computers (physical or, more often, virtual machines) and other resources. (A hypervisor, such as Xen, Oracle VirtualBox, KVM, VMware ESX/ESXi, or Hyper-V, runs the virtual machines as guests. Pools of hypervisors within the cloud operational support system can support large numbers of virtual machines and the ability to scale services up and down according to customers' varying requirements.) IaaS clouds often offer additional resources such as a virtual-machine disk image library, raw block storage, file or object storage, firewalls, load balancers, IP addresses, virtual local area networks (VLANs), and software bundles. IaaS cloud providers supply these resources on demand from their large pools installed in data centers. For wide-area connectivity, customers can use either the Internet or carrier clouds (dedicated virtual private networks).

To deploy their applications, cloud users install operating-system images and their application software on the cloud infrastructure. In this model, the cloud user patches and maintains the operating systems and the application software. Cloud providers typically bill IaaS services on a utility computing basis: cost reflects the amount of resources allocated and consumed.

In this project, the comparison of images involves two main types: exact hashing and approximate hashing.

3. System Design

3.1 Overview

This system is based on computing the hash code of the uploaded image and then comparing it with the hash codes of the existing images stored in MongoDB.

[Figure: system flowchart. The query image is uploaded; exact, perceptual and DCT hash codes are computed for the uploaded image; the uploaded image data is compared with the existing images in the database; if the exact hash code matches, the exact match results are displayed; if the perceptual or DCT hash code matches, the approximate image results are displayed; if no image matches, zero results are returned.]

3.2 Setting up the system

In order to process the comparison, Amazon Web Services should be set up with all the sign-in privileges. All the rules must be set up for accessing the webpage.
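The lookup flow outlined in Section 3.1 can be sketched in a few lines. This is a minimal illustration, not the project's PHP code: an in-memory list stands in for the MongoDB collection, the record field names are hypothetical, the approximate hashes are assumed to be precomputed 64-bit integers, and the default threshold of 22 differing bits is the value this section quotes for the DCT hash (applying it to the perceptual hash as well is an assumption).

```python
import hashlib


def hamming_distance(a, b):
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def find_matches(records, query_bytes, query_phash, query_dct, threshold=22):
    """Return ("exact" | "approximate" | "none", list of matching URLs).

    records: iterable of dicts with the hypothetical keys
             "url", "sha1", "phash" and "dct_hash".
    """
    # Step 1: exact match on the SHA-1 digest of the raw image bytes.
    query_sha1 = hashlib.sha1(query_bytes).hexdigest()
    exact = [r["url"] for r in records if r["sha1"] == query_sha1]
    if exact:
        return "exact", exact

    # Step 2: approximate match if either 64-bit hash is close enough.
    approximate = [
        r["url"]
        for r in records
        if hamming_distance(r["phash"], query_phash) <= threshold
        or hamming_distance(r["dct_hash"], query_dct) <= threshold
    ]
    if approximate:
        return "approximate", approximate

    # Step 3: nothing matched.
    return "none", []


# Tiny in-memory stand-in for the database.
records = [
    {"url": "http://example.com/a.jpg",
     "sha1": hashlib.sha1(b"image-a").hexdigest(),
     "phash": 0x0F0F, "dct_hash": 0xFF00},
    {"url": "http://example.com/b.jpg",
     "sha1": hashlib.sha1(b"image-b").hexdigest(),
     "phash": (1 << 64) - 1, "dct_hash": (1 << 64) - 1},
]

print(find_matches(records, b"image-a", 0, 0))
```

Checking the exact hash first keeps the common case cheap, because an equality test on a digest needs no distance computation.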
MongoDB must be installed and set up for use. All the required data should be collected for insertion into the database. Generating a hash code for the uploaded image must be considered, and existing hashing algorithms should be researched for computing the image comparisons. Exact and approximate hashing must be performed with the maximum number of possibilities. PHP should be installed on the system for developing the web interface.

The discrete cosine transform (DCT) is an efficient means to compute a hash from frequency spectrum data, and the distance calculation is relatively simple. While it is insufficient to consider image similarity in any semantically meaningful way, it does provide a hash as an ID for an image, and it is robust against minor distortions such as small rotations, blurring and compression. The graphs below show the Hamming distances (i.e. the number of bits that differ in the 64-bit hash) for two scenarios: the intra distances, where the two images come from the same source and one is a distorted version of the other, and the inter distances, where the two compared images are altogether different images. The main point is that a threshold of twenty-two, T=22, can be applied to determine whether two images are indeed the same source image.

Note: Please ignore the x-axis on the second table. The x-axis is merely a list of comparisons between the specific images.

According to Dr. Neal Krawetz [10], a better approach is using pHash. Here is how pHash is computed:

a. Reduce the image to grayscale.
b. Resize the image to 32x32.
c. Compute the Discrete Cosine Transform (DCT) of the image. The DCT separates the image into a collection of frequencies and scalars.
d. Keep just the top-left 8x8 of the DCT. While the DCT is 32x32, the top-left 8x8 represents the lowest frequencies in the picture.
e. Compute the median value.
f. Compute the hash from the DCT. Set the 64 hash bits to 0 or 1 depending on whether each of the 64 DCT values is above or below the median value.

4. Implementation

4.1 Uploading Images

A web page created using PHP allows an image to be uploaded into the database. When an image is input, the hash code values are computed for that particular image. When the database is queried with the input image, all the hash code values are compared, and if there is any match the results are displayed. The main goal is to capture as many of the approximate images available in the database as possible.

4.2 Data Set

Around 500 images were inserted into the database, and the image search operation is performed on all the images.

4.3 System Environment

Amazon Web Services EC2 cloud computing is used for running the application.
The Apache2 web server is installed and runs on the EC2 instance.
PHP version 5.5.9 is used for the web interface.
PEAR Crypt_HMAC is used for computing the hashes.
MongoDB 3.0.1 is installed and runs on the EC2 instance.
OpenCV is installed and is used for comparing exact images.

5. Performance Evaluation

5.1 Comparison of exact and approximate hash matches

Table 1.1 shows the image hits for each hash in detail. In the Image column you can click on a particular image and see all the changes that were made using the GIMP (GNU Image Manipulation Program) tool.

Table 1.1 - Comparison of the analysis of certain images with exact, perceptual and DCT hashing methods

S No.  Effect        Exact Match  Perceptual Match  DCT Match
1.     Everest       Pass         Pass              Pass
2.     Blur          Fail         Pass              Pass
3.     Brighten      Fail         Fail              Pass
4.     Cartoon       Fail         Fail              Pass
5.     Colorize      Fail         Pass              Pass
6.     Color Levels  Fail         Fail              Pass
7.     Crop          Fail         Fail              Pass
8.     Cross Mark    Fail         Pass              Pass
9.     Curves        Fail         Fail              Pass
10.    Gray          Fail         Pass              Pass
11.    Hue           Fail         Pass              Pass
12.    Lighten       Fail         Pass              Pass
13.    Low Contrast  Fail         Fail              Pass
14.    Pixelize      Fail         Pass              Pass
15.    Posterize     Fail         Fail              Pass
16.    Scaled        Fail         Pass              Pass
17.    Sharpen       Fail         Fail              Pass
18.    Sparkle       Fail         Pass              Pass
19.    Stain         Fail         Fail              Pass
20.    Text          Fail         Pass              Pass
21.    Threshold     Fail         Fail              Fail
22.    Distort       Fail         Fail              Fail
23.    Edge          Fail         Fail              Fail

5.2 Image Hits

The exact hash worked for all the image sets that I worked on. It computes a 40-digit hexadecimal value and inserts it into the database. When we search for a particular image, the system computes its hash value and iterates through all the SHA hash values. Using the SHA hash, an image that is present in the database is matched 100% of the time.

[Figure: "Exact hash image hits". Bar chart of the percentage of exact hash matches for Image Sets 1 to 4; every set is at 100.]

Image Set 1 contains mountain images
Image Set 2 contains space images
Image Set 3 contains animal images
Image Set 4 contains flower images

For the graph analysis I took around 20 images and changed their appearance by adjusting color and brightness, adding text, cropping, scaling, etc. Running the different sets of images gives the following analysis. The exact hash always works if there is an exactly matching image in the database, whereas the perceptual hash fails for most of the approximate image comparisons. DCT works for most of the changed images. For the perceptual hash the image hit rates are 35-40%, and for the DCT hash the image hit rates are 75-80%.

[Figure: "Approximate Hash image hits". Bar chart of the percentage of hits for the Perceptual and DCT hashes across Image Sets 1 to 4.]

Figure 1.1 Comparison of images on percentage values

5.3 Performance measure on different browsers

5.3.1 Test the Amazon Server performance with different results

The query retrieval rate is fast and the image is displayed in a timely manner.
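The approximate-match results in Section 5.2 depend on the DCT-based hash outlined in Section 3.2. As a minimal sketch of steps c to f of that outline, the pure-Python code below computes a 64-bit hash from a 32x32 grayscale matrix and compares two hashes by Hamming distance. It assumes the image has already been converted to grayscale and resized (steps a and b need an image library and are omitted). The function names and the synthetic test pattern are illustrative and are not the project's implementation.

```python
import math


def dct_2d_top_left(pixels, keep=8):
    """Top-left keep x keep block of the orthonormal 2-D DCT-II of a square matrix."""
    n = len(pixels)
    # basis[u][x] = alpha(u) * cos((2x + 1) * u * pi / (2n))
    basis = [
        [
            math.sqrt((1.0 if u == 0 else 2.0) / n)
            * math.cos((2 * x + 1) * u * math.pi / (2 * n))
            for x in range(n)
        ]
        for u in range(keep)
    ]
    # The 2-D DCT is separable: transform every row, then every column.
    rows = [
        [sum(row[x] * basis[v][x] for x in range(n)) for v in range(keep)]
        for row in pixels
    ]
    return [
        [sum(rows[y][v] * basis[u][y] for y in range(n)) for v in range(keep)]
        for u in range(keep)
    ]


def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


def dct_hash(pixels):
    """64-bit hash: one bit per low-frequency coefficient, set if above the median."""
    coeffs = [c for row in dct_2d_top_left(pixels) for c in row]
    med = median(coeffs)
    value = 0
    for c in coeffs:
        value = (value << 1) | (1 if c > med else 0)
    return value


def hamming_distance(a, b):
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


# Synthetic 32x32 test pattern and a uniformly brightened copy of it.
original = [[float((7 * x + 13 * y + x * y) % 200) for x in range(32)]
            for y in range(32)]
brighter = [[p + 10.0 for p in row] for row in original]

distance = hamming_distance(dct_hash(original), dct_hash(brighter))
print("Hamming distance after brightening:", distance)
```

A uniform brightness change only shifts the lowest-frequency (DC) coefficient, so the remaining bits are expected to stay the same and the distance stays well below the T=22 threshold. This is consistent with the Brighten row of Table 1.1, where the exact hash fails and the DCT hash still matches.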
5.3.2 Everest with Text

5.3.3 Everest with Stain

[Figure: "Mt. Everest". Chart of query time in ms (axis range 1.49 to 1.53) for the Everest, Everest with Text, and Everest with Stain images; remaining labels: Web Browsers, Approximate.]

All three images were computed similarly for both the exact and the approximate hash. On the Amazon web server, all the images took a similar time to return results after the data was computed in MongoDB. These results show that the system performs quickly, as it takes only a few milliseconds to display the results.

6. Advantages

6.1 Advantages

6.1.1 Images used on the Internet without authentication can easily be detected
6.1.2 MongoDB advantages:
6.1.2.1 Schema-less: MongoDB is a document database in which one collection holds different documents. The number of fields, the content and the size of a document can differ from one document to another.
6.1.2.2 Structure of a single object is clear
6.1.2.3 No complex joins
6.1.2.4 Deep query-ability: MongoDB supports dynamic queries on documents using a document-based query language that is nearly as powerful as SQL
6.1.2.5 Ease of scale-out: MongoDB is easy to scale
6.1.2.6 Replication & High Availability
6.1.2.7 Document-Oriented Storage: data is stored in the form of JSON-style documents
6.1.2.8 Index on any attribute
6.1.2.9 Conversion / mapping of application objects to database objects not needed
6.1.2.10 Rich Queries
6.1.2.11 Auto-Sharding

7. Learned

I have learned many new concepts through this project. At first I worked on setting up the Amazon cloud computing server. Running a web server over the Internet is very useful nowadays, because the server can be accessed from anywhere. I followed Processing Images with Amazon Web Services [1], which presents a solution that uploads an image and then processes that image using Amazon Web Services (AWS).
In that solution, an image is uploaded in the browser, and a thumbnail of that image is created and saved. Processing a large image into a thumbnail saves a lot of storage space.

7.1 Cloud Computing

I implemented a user interface where the user can upload images, and AWS processes them into thumbnail images of much smaller size. For this I used PHP to run the web application, together with the following services:

7.1.1 An Amazon Elastic Compute Cloud (EC2) instance running Apache and PHP
7.1.2 An Amazon Simple Storage Service (S3) account to hold uploaded images
7.1.3 An Amazon SimpleDB account to hold metadata about those images
7.1.4 An Amazon Simple Queue Service (SQS) account to send and receive messages that involve those images

7.2 MongoDB

I installed and learned MongoDB on the Amazon cloud. Initially I worked on saving exact hash codes in the database, and then added approximate hash codes for both the perceptual and the discrete cosine transform (DCT) hashes. Each record in MongoDB has four fields: the image URL, the exact match hash code, the perceptual hash code, and the DCT hash code.

7.3 Image Hashing

Image hashing is a new concept that I learned for this project. I came to understand both exact and approximate hashing for comparing and displaying images. For the exact hash, either the MD5 or the SHA algorithm can be used. I used the SHA-1 algorithm, which produces a 160-bit hash. Approximate comparison, by contrast, is tedious and has many candidate algorithms: locality sensitive hashing, perceptual hashing, DCT hashing, etc.

Locality sensitive hashing (LSH) requires a lot of data for comparing images. It requires normalizing the database by taking all the image hashes, putting them into hash buckets and recreating the database, or loading all the records into an MVP tree and then performing the search. This is not a problem with 1000 records, but with millions of records it is.
Also, when you add a new image, you need to run the database normalization again. LSH finds near neighbors in high-dimensional space. Near neighbors are points that are a small distance apart.

8. Future Work

This project can be extended to use Locality Sensitive Hashing for comparing approximate image hashes. Web crawling of images can be done to find stolen images, and the web crawler can also be used to upload images automatically. Many more images can be uploaded into the database for testing. The scalability of the database can be increased, as there is currently a 2GB limit. Further work can reduce the false positives in the results and sort the retrieved images by image similarity score.

9. Conclusion

I have used two techniques for comparing images on the cloud using MongoDB: exact and approximate image matching. The image to be searched is inserted using the browse button, and its respective hash values are computed. The system iterates through all the hash code values in the database and, if there is any exact or approximate match, displays the results. The SHA-1 algorithm is used for the exact image search, and it is faster. Perceptual and DCT hashes are used for approximate hashing, and the results are returned much faster than expected.

10. References

1. John Fronckowiak. Processing Images with Amazon Web Services. June 26, 2008.
2. Christoph Zauner. Implementation and Benchmarking of Perceptual Image Hash Functions.
3. C. Kailasanathan, R. Safavi-Naini, P. Ogunbona. Compression Tolerant DCT Based Image Hash. May 2003.
4. Fa-Xin Yu, Yan-Qiang Lei, Yuan-Gen Wang, Zhe-Ming Lu. Robust Image Hashing Based on Statistical Invariance of DCT Coefficients. May 2009.
5. W. Dong, Z. Wang, M. Charikar, K. Li. High-Confidence Near-Duplicate Image Detection. In: ACM International Conference on Multimedia Retrieval (2012).
6. Locality Sensitive Hashing: http://web.stanford.edu/class/cs246/slides/03-lsh.pdf
7. Wei Dong, Zhe Wang, Moses Charikar, Kai Li. High-Confidence Near-Duplicate Image Detection. Feb 2006.
8. http://www.phash.org/docs/design.html
9. http://nekkidphpprogrammer.blogspot.com/2014/01/the-better-dct-perceptual-hashalgorithm.html
10. http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html