Master - College of Engineering and Applied Science

COMPARING IMAGES USING CLOUD COMPUTING
A PROJECT REPORT
Submitted by
RAJYA BADAM
in partial fulfilment for the award of the degree
of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
UNIVERSITY OF COLORADO AT COLORADO SPRINGS
UNIVERSITY OF COLORADO: COLORADO SPRINGS 80918
May 2015
This project for the Master of Science degree by
Rajya Badam
Has been approved for the
Department of Computer Science
By
Advisor: Dr. Terry Boult
Member: Dr. Edward Chow
Member: Dr. Sudhanshu Semwal
May 14th 2015
Acknowledgements
I would like to thank my advisor, Dr. Terry Boult, for helping me through this project, and both
my project members, Dr. Sudhanshu Semwal and Dr. Edward Chow, from the University of Colorado
at Colorado Springs.
I am so grateful to my husband, Balaji Badam, who understood me and helped in every aspect.
Without him it would not have been possible to complete my project. Thank you for all your help
and support. My son, Adhrit Badam, also played a role in the project by not disturbing me and
playing all by himself while I was working.
I would also like to thank my parents, Jakkampudi Ganeswara Rao and Jakkampudi Vijaya, for their
encouragement in completing this project.
TABLE OF CONTENTS
Abstract
1. Introduction
1.1. Overview
2. Related Work
2.1. Image Hashing
2.1.1. SHA Algorithm
2.1.2. Perceptual Hashing
2.1.3. Discrete Cosine Transform
2.2. Mongo DB
2.2.1. No SQL Database
2.3. Cloud Computing
3. System Design
3.1. Exact Hashing
3.1.1. MD5 hash
3.1.2. SHA hash
3.2. Approximate Hashing
3.2.1. Perceptual Hash
3.2.2. Discrete Cosine Hash
4. Implementation
4.1. Uploading Images
4.2. Data Set
4.3. System Environment
5. Performance Evaluation
5.1. Comparison of exact and approximate hash matches
5.2. Image Hits
5.3. Performance measure on different browsers
5.3.1. Test the Amazon Server performance with different results
5.3.2. Image with Text
5.3.3. Image with Stain
6. Advantages
7. Learned
7.1. Cloud Computing
7.2. MongoDB
7.3. Image Hashing
8. Future Work
9. Conclusion
10. References
Abstract
Comparing images is a resource-intensive task. This project creates an application that uses
resources on the cloud and provides a web interface that allows a user to perform an image
search. The application consists of a database of image URLs, each stored with its image hashes,
and computes an exact or an approximate hash comparison of a provided image against the database.
The database is composed of image URLs from various sources on the web.
The project uses Amazon Web Services as the cloud computing platform. The advantages of using
Amazon Web Services are that it is easy to use, flexible, cost-effective, reliable, scalable,
high-performance and secure. The advantages of cloud computing are lower capital expense, lower
variable cost, scalable capacity, increased speed and agility, lower maintenance costs and
global accessibility.
All the image URLs, along with their respective image hash codes, are saved in the popular NoSQL
database Mongo DB. Users are able to upload an image and search for the matching image URLs. An
Amazon EC2 instance hosts both the web application, which is written in PHP (PHP: Hypertext
Preprocessor), and the Mongo DB database for this project. The image hash is computed using
Python.
The work on the project is divided into:
1. Collecting a large set of image URLs for uploading into the database
2. Generating a hash of each image
3. Uploading the image hash and URL to Mongo DB
4. Creating the web application in PHP
5. Hosting the PHP web application on the Amazon EC2 instance
6. Capturing the exact or approximate match image search results
1. Introduction
Image authentication plays an important role in security and communication. Images are
transferred over the Internet and are readily available for access from any part of the world.
Without an authentication mechanism, it is almost impossible to tell whether an image is original
or has been manipulated.
1.1 Overview
Image searching is a process of converting an image into a digital form and extracting some useful
information from it. Images can be searched by computing a hash code. An MD5 checksum can be used
to find exact matches of an image by comparing hash codes. Approximate image search can be used
for finding similar images.
The input to the image search can be either an image or a URL. The output of the search is the
exact or approximate matches of the image along with their URLs. The user is able to select a
URL and view the image.
The main purpose of this project is to find images that are used by other sources, using the URL
search. Many images are reused by other sources, and it is difficult to track stolen or reused
resources, so it is very helpful to find where else an image is used.
Nowadays image searching can be done using image URLs or by directly uploading images, instead of
searching for images using a word or a phrase. Uploading an image to search for similar matches
is the most commonly used approach.
It is common to upload and retrieve images from the web. This project creates an on-demand
instance which processes a batch of images. The application makes use of Amazon EC2 and a
NoSQL database.
[Figure: system architecture showing the web application on Amazon EC2 and Mongo DB, connected by "Process Request" and "Data" flows.]
Matches for the image that is uploaded for searching are retrieved from Mongo DB.
2. Related Work
2.1 Image Hashing
Exact hashing uses the MD5 or SHA algorithms for comparing images.
Hash algorithms are called secure when:
1. It is computationally infeasible to find a message that corresponds to a given message digest.
2. It is computationally infeasible to find two different messages that produce the same message
digest.
3. If a message is changed even by a single character, the result is a completely different
message digest.
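As an illustration of the third property, the short Python sketch below hashes two byte strings that differ in a single character, using the standard `hashlib` module. The sample bytes are placeholders standing in for image file contents and the helper name `exact_hashes` is made up for this example; neither comes from the project's code or data.

```python
import hashlib


def exact_hashes(data: bytes) -> dict:
    """Return the MD5 and SHA-1 digests of the given bytes as hex strings."""
    return {
        "md5": hashlib.md5(data).hexdigest(),    # 128 bits -> 32 hex digits
        "sha1": hashlib.sha1(data).hexdigest(),  # 160 bits -> 40 hex digits
    }


# Placeholder byte strings standing in for two image files (assumption).
original = b"image bytes: mountain"
modified = b"image bytes: mountaim"  # only the last character differs

h1 = exact_hashes(original)
h2 = exact_hashes(modified)

print(h1["sha1"])
print(h2["sha1"])
print("same digest:", h1["sha1"] == h2["sha1"])
```

Because any change to the input changes the digest, an exact hash can only confirm that two files are byte-for-byte identical; it cannot say how similar two different files are.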
The hashing methods considered in this project fall into two groups:
i. Exact Hashing
   i. MD5 hash
   ii. SHA hash
ii. Approximate Hashing
   i. Perceptual Hash
   ii. Discrete Cosine Hash (DCT)
2.1.1 SHA Algorithm
The Secure Hash Algorithm is a family of cryptographic hash functions published by the National
Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard
(FIPS), including:
SHA-0: A retronym applied to the original version of the 160-bit hash function published in 1993
under the name "SHA". It was withdrawn shortly after publication due to an undisclosed
"significant flaw" and replaced by the slightly revised version SHA-1.
SHA-1: A 160-bit hash function which resembles the earlier MD5 algorithm. This was designed
by the National Security Agency (NSA) to be part of the Digital Signature Algorithm.
Cryptographic weaknesses were discovered in SHA-1, and the standard was no longer approved
for most cryptographic uses after 2010.
SHA-2: A family of two similar hash functions, with different block sizes, known as SHA-256 and
SHA-512. They differ in the word size; SHA-256 uses 32-bit words where SHA-512 uses 64-bit
words. There are also truncated versions of each standard, known as SHA-224, SHA-384,
SHA-512/224 and SHA-512/256. These were also designed by the NSA.
SHA-3: A hash function formerly called Keccak, chosen in 2012 after a public competition among
non-NSA designers. It supports the same hash lengths as SHA-2, and its internal structure differs
significantly from the rest of the SHA family.
2.1.2 Perceptual Hashing
Perceptual image hash functions produce hash values based on the image’s visual appearance. A
perceptual hash can also be referred to as e.g. a robust hash or a fingerprint. Such a function
calculates similar hash values for similar images, whereas for dissimilar images dissimilar hash
values are calculated. Finally, using an adequate distance or similarity function to compare two
perceptual hash values, it can be decided whether two images are perceptually different or not.
Perceptual image hash functions can be used e.g. for the identification or integrity verification of
images.
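One common distance function for fixed-length perceptual hashes is the Hamming distance, the number of bit positions in which two hashes differ. A minimal Python sketch, assuming the hashes are stored as hexadecimal strings of equal length; the function name and the two sample hash values are made up for illustration:

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Two made-up 64-bit hashes that differ only in the last hex digit (f vs. c).
a = "8f373714acfcf4ff"
b = "8f373714acfcf4fc"
print(hamming_distance(a, b))  # f = 1111, c = 1100 -> 2 differing bits
```

Two images would then be treated as perceptually similar when this distance falls at or below a chosen threshold.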
2.1.3 Discrete Cosine Transform
The DCT, like any Fourier-related transform, expresses a function or signal (a sequence of finitely
many data points) in terms of a sum of sinusoids with different frequencies and amplitudes. The
DCT uses only cosine functions.
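For reference, the most common variant, the DCT-II, of a sequence of N values x_0, ..., x_{N-1} can be written as follows. This is the standard textbook definition without normalization factors; it is not taken from the project's code.

```latex
X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right],
\qquad k = 0, 1, \dots, N-1
```

The coefficient X_0 is the plain sum of the samples, and larger values of k correspond to higher frequencies.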
2.2 Mongo DB
Mongo DB is a cross-platform document-oriented database. It is classified as a NoSQL database,
which does not use the traditional table-based relational database structure. It stores JSON-like
documents with dynamic schemas in a format called BSON.
BSON is a computer data interchange format used mainly as a data storage and network transfer
format in the MongoDB database. It is a binary form for representing simple data structures and
associative arrays called objects or documents in MongoDB. The name “BSON” is based on the
term JSON and stands for “Binary JSON”.
Mongo DB is open source software used by Craigslist, eBay, Foursquare, SourceForge, Viacom, the
New York Times and others. It is the most popular NoSQL database system. It uses a
document-oriented structure, which stores the business subject in the minimal number of
documents.
2.2.1 No SQL Database
A NoSQL database is used for storage and retrieval of data that is modeled by means other than
the tabular relations used in relational databases. It offers simplicity of design, horizontal
scaling and finer control over availability. NoSQL databases use key-value, graph, or document
forms of data structures.
NoSQL databases are increasingly used in big data and real-time web applications. NoSQL is
also read as "Not only SQL", since such systems may also support SQL-like query languages.
2.3 Cloud Computing
Cloud computing relies on sharing of resources to achieve coherence and economies of scale,
similar to a utility (like the electricity grid) over a network. At the foundation of cloud computing
is the broader concept of converged infrastructure and shared services.
Cloud computing, or "the cloud", also focuses on maximizing the effectiveness of the shared
resources. Cloud resources are usually not only shared by multiple users but are also dynamically
reallocated per demand. This can work for allocating resources to users. For example, a cloud
computer facility that serves European users during European business hours with a specific
application (e.g., email) may reallocate the same resources to serve North American users during
North America's business hours with a different application (e.g., a web server). This approach
should maximize the use of computing power thus reducing environmental damage as well since
less power, air conditioning, rack space, etc. are required for a variety of functions. With cloud
computing, multiple users can access a single server to retrieve and update their data without
purchasing licenses for different applications.
For this project cloud computing plays an important role. It uses Infrastructure as a Service
(IaaS). In the most basic cloud-service model, and according to the IETF (Internet Engineering
Task Force), providers of IaaS offer computers – physical or (more often) virtual machines – and
other resources. (A hypervisor, such as Xen, Oracle VirtualBox, KVM, VMware ESX/ESXi, or Hyper-V,
runs the virtual machines as guests. Pools of hypervisors within the cloud operational
support-system can support large numbers of virtual machines and the ability to scale services up
and down according to customers' varying requirements.) IaaS clouds often offer additional
resources such as a virtual-machine disk image library, raw block storage, file or object storage,
firewalls, load balancers, IP addresses, virtual local area networks (VLANs), and software
bundles. IaaS-cloud providers supply these resources on demand from their large pools installed
in data centers. For wide-area connectivity, customers can use either the Internet or carrier
clouds (dedicated virtual private networks).
To deploy their applications, cloud users install operating-system images and their application
software on the cloud infrastructure. In this model, the cloud user patches and maintains the
operating systems and the application software. Cloud providers typically bill IaaS services on a
utility computing basis: cost reflects the amount of resources allocated and consumed.
In this project the comparison of images involves mainly two types: exact hashing and
approximate hashing.
3 System Design
3.1 Overview
The system computes the hash codes for the uploaded image and then compares them with the hash
codes of the existing images stored in MongoDB.
[Figure: system flowchart. The query image is uploaded; exact, perceptual and DCT hashes are
computed for the uploaded image; the uploaded image data is compared with the existing images in
the database. If the exact hash code matches, the exact match results are displayed; if the
perceptual or DCT hash code matches, the approximate image results are displayed; if no image
matches, zero results are returned.]
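The flow described above can be sketched in Python as follows. This is an illustrative in-memory version, not the project's PHP and MongoDB code: the record layout, the field names (`url`, `sha1`, `phash`, `dct`), the sample records and the threshold value passed in are all assumptions made for the example.

```python
def bit_difference(x: int, y: int) -> int:
    """Number of bits that differ between two integer hashes."""
    return bin(x ^ y).count("1")


def search(query: dict, records: list, threshold: int) -> dict:
    """Return exact matches if any, otherwise approximate matches.

    Each record is a dict with hypothetical keys: url, sha1, phash, dct.
    """
    exact = [r["url"] for r in records if r["sha1"] == query["sha1"]]
    if exact:
        return {"type": "exact", "urls": exact}

    approx = [
        r["url"]
        for r in records
        if bit_difference(r["phash"], query["phash"]) <= threshold
        or bit_difference(r["dct"], query["dct"]) <= threshold
    ]
    if approx:
        return {"type": "approximate", "urls": approx}

    return {"type": "none", "urls": []}  # zero results


# Made-up records with short 16-bit hashes to keep the example readable.
records = [
    {"url": "http://example.com/a.jpg", "sha1": "aa11", "phash": 0xF0F0, "dct": 0x0FF0},
    {"url": "http://example.com/b.jpg", "sha1": "bb22", "phash": 0x1234, "dct": 0xFFFF},
]

print(search({"sha1": "aa11", "phash": 0, "dct": 0}, records, threshold=2))
print(search({"sha1": "zz99", "phash": 0xF0F1, "dct": 0}, records, threshold=2))
print(search({"sha1": "zz99", "phash": 0x0F0F, "dct": 0}, records, threshold=2))
```

The three calls exercise the three branches of the flow in turn: an exact match, an approximate match (the query's perceptual hash is one bit away from the first record), and no match.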
3.2 Setting up the system
To process the comparison, Amazon Web Services must be set up with all the sign-in privileges,
and the rules must be set up for accessing the web page. Mongo DB must be installed and set up
for use. All the required data must be collected for insertion into the database. A hash code
must be generated for each uploaded image, which requires researching existing hashing
algorithms for computing the image comparisons. Exact and approximate hashing must be performed
with the maximum number of possibilities. PHP must be installed on the system for developing the
web interface.
The discrete cosine transform (DCT) is an efficient means to compute a hash from frequency
spectrum data, and the distance calculation is relatively simple. While it is insufficient to
consider image similarity in any semantically meaningful way, it does provide a hash as an ID for
an image, and it is robust against minor distortions, like small rotations, blurring and
compression. The graphs below show the Hamming distances (i.e. the number of bits that differ in
the 64-bit hash) for two scenarios: the intra distances, where the two compared images come from
the same source image and one is a distorted version of the other, and the inter distances, where
the two compared images are altogether different images. The main point is that a threshold of
twenty-two, T=22, can be applied to determine if two images are indeed the same source image.
Note: Please ignore the x-axis on the second table. The x-axis is merely a list of comparisons
between the specific images.
According to Dr. Neal, a better approach is using pHash. Here is how pHash is computed:
a. Reduce the image to grayscale
b. Resize the image to 32x32
c. Compute the Discrete Cosine Transform (DCT) of the image. The DCT separates the
image into a collection of frequencies and scalars
d. Keep just the top-left 8x8 of the DCT. While the DCT is 32x32, the top-left 8x8
represents the lowest frequencies in the picture
e. Compute the median value of these 64 DCT values
f. Compute the hash from the DCT. Set each of the 64 hash bits to 0 or 1 depending on whether
the corresponding DCT value is above or below the median value.
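Steps c to f above can be sketched in pure Python as follows. This is a minimal illustration and not the project's implementation: it assumes steps a and b (grayscale conversion and resizing to 32x32) have already been done by an imaging library, it uses an unnormalized DCT-II, and the function names and the synthetic test image are made up for the example.

```python
import math
from statistics import median


def dct_1d(values):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(n)
    ]


def dct_2d(matrix):
    """2-D DCT: transform every row, then every column."""
    rows = [dct_1d(row) for row in matrix]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order


def dct_hash(gray_32x32):
    """Steps c-f: 64-bit integer hash from a 32x32 grayscale pixel matrix."""
    freq = dct_2d(gray_32x32)                                # step c
    low = [freq[r][c] for r in range(8) for c in range(8)]   # step d: top-left 8x8
    med = median(low)                                        # step e
    bits = 0
    for value in low:                                        # step f
        bits = (bits << 1) | (1 if value > med else 0)
    return bits


# Synthetic 32x32 "image" (assumption): a deterministic, irregular pixel pattern.
image = [[(x * x + 3 * x * y + 7 * y) % 256 for x in range(32)] for y in range(32)]
brighter = [[p + 5 for p in row] for row in image]  # uniform brightness shift

h1 = dct_hash(image)
h2 = dct_hash(brighter)
print(f"{h1:016x}")
print("differing bits:", bin(h1 ^ h2).count("1"))
```

A uniform brightness shift changes only the lowest-frequency coefficient of the DCT, so the two hashes in this example are expected to agree in all or nearly all of their 64 bits.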
4 Implementation
4.1 Uploading Images
A web page is created using PHP, through which an image can be uploaded into the database. When
an image is submitted, the hash code values are computed for that particular image. The database
is then queried with the input image: all the stored hash code values are compared and, if there
is any match, the results are displayed. The main goal is to capture as many approximate matches
as are available in the database.
4.2 Data Set
Around 500 images were inserted into the database, and the image search operation was performed
on all of them.
4.3 System Environment
Amazon Web Services EC2 cloud computing is used for running the application. The Apache2 web
server is installed and runs on the EC2 instance. PHP version 5.5.9 is used for the web
interface. PEAR Crypt_HMAC is used for computing the hashes. Mongo DB 3.0.1 is installed and runs
on the EC2 instance. OpenCV is installed and is used for comparing exact images.
5 Performance Evaluation
5.1 Comparison of exact and approximate hash matches
Table 1.1 lists the image hits for each hash method in detail. In the Image column you can click
on a particular image and see the changes that were made using the GIMP (GNU Image Manipulation
Program) tool.
Table 1.1 - comparison of the analysis of certain images with exact, perceptual and dct hashing
methods

S No.  Effect        Exact Match  Perceptual Match  DCT Match
1.     Everest       Pass         Pass              Pass
2.     Blur          Fail         Pass              Pass
3.     Brighten      Fail         Fail              Pass
4.     Cartoon       Fail         Fail              Pass
5.     Colorize      Fail         Pass              Pass
6.     Color Levels  Fail         Fail              Pass
7.     Crop          Fail         Fail              Pass
8.     Cross Mark    Fail         Pass              Pass
9.     Curves        Fail         Fail              Pass
10.    Gray          Fail         Pass              Pass
11.    Hue           Fail         Pass              Pass
12.    Lighten       Fail         Pass              Pass
13.    Low Contrast  Fail         Fail              Pass
14.    Pixelize      Fail         Pass              Pass
15.    Posterize     Fail         Fail              Pass
16.    Scaled        Fail         Pass              Pass
17.    Sharpen       Fail         Fail              Pass
18.    Sparkle       Fail         Pass              Pass
19.    Stain         Fail         Fail              Pass
20.    Text          Fail         Pass              Pass
21.    Threshold     Fail         Fail              Fail
22.    Distort       Fail         Fail              Fail
23.    Edge          Fail         Fail              Fail
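The pass counts in Table 1.1 can be tallied with a few lines of Python. The sketch below only re-encodes the table rows as data and counts them; the variable and function names are made up for the example.

```python
# Each row of Table 1.1 is (effect, results), where results holds one letter
# per method in the order Exact, Perceptual, DCT: "P" = Pass, "F" = Fail.
ROWS = [
    ("Everest", "PPP"), ("Blur", "FPP"), ("Brighten", "FFP"), ("Cartoon", "FFP"),
    ("Colorize", "FPP"), ("Color Levels", "FFP"), ("Crop", "FFP"),
    ("Cross Mark", "FPP"), ("Curves", "FFP"), ("Gray", "FPP"), ("Hue", "FPP"),
    ("Lighten", "FPP"), ("Low Contrast", "FFP"), ("Pixelize", "FPP"),
    ("Posterize", "FFP"), ("Scaled", "FPP"), ("Sharpen", "FFP"),
    ("Sparkle", "FPP"), ("Stain", "FFP"), ("Text", "FPP"),
    ("Threshold", "FFF"), ("Distort", "FFF"), ("Edge", "FFF"),
]
METHODS = ["Exact", "Perceptual", "DCT"]


def pass_counts(rows):
    """Count the "Pass" entries per hashing method."""
    return {m: sum(1 for _, r in rows if r[i] == "P") for i, m in enumerate(METHODS)}


counts = pass_counts(ROWS)
for method, passed in counts.items():
    print(f"{method}: {passed}/{len(ROWS)} = {100 * passed / len(ROWS):.1f}%")
```

For the 23 rows of the table this gives 1 pass for the exact hash (4.3%), 11 for the perceptual hash (47.8%) and 20 for the DCT hash (87.0%). These figures describe this one table only and are a different measurement from the per-image-set hit rates reported in Section 5.2.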
5.2 Image Hits
Exact hashing worked for all the image sets that I have worked on. It computes a 40-digit
hexadecimal value and inserts it into the database. When we search for a particular image, the
system computes its hash value and iterates through all the stored SHA hash values. Using the SHA
hash, the exact match results are 100% all the time.
[Figure: Exact hash image hits. Bar chart of exact hash matches (percentage) for Image Sets 1 to 4; every set scores 100.]
Image set 1 contains Mountain images
Image set 2 contains Space images
Image Set 3 contains Animals images
Image Set 4 contains Flowers images
For the graph analysis I took around 20 images and changed their appearance by changing color and
brightness, adding text, cropping, scaling, etc. Running the different sets of images gives the
following analysis. The exact hash always works if there is an exactly matching image in the
database, whereas the perceptual hash mostly does not work for approximate image comparison. The
DCT hash works for most of the changed images.
For the perceptual hash the image hit rates are 35-40%, and for the DCT hash the image hit rates
are 75-80%.
[Figure: Approximate hash image hits. Bar chart of hit percentage for the Perceptual and DCT hashes on Image Sets 1 to 4.]
Figure 1.1 Comparison of images on percentage values
5.3 Performance measure on different browsers
5.3.1 Test the Amazon Server performance with different results
The query retrieval rate is fast and the image is displayed in a timely manner.
5.3.2 Everest with Text
5.3.3 Everest with Stain
[Figure: Mt. Everest. Chart of time in ms (axis range 1.49 to 1.53) for the Everest, Everest with Text and Everest with Stain images; x-axis label: Web Browsers; series: Approximate.]
For all three images, both the exact and approximate hashes were computed in similar time. On the
Amazon web server all the images took a similar time to return results after querying the data in
MongoDB. The results above show that the system is fast, as it takes only a few milliseconds to
display the results.
6 Advantages
6.1 Advantages
6.1.1 Images used on the internet without authentication can easily be detected
6.1.2 Mongo DB advantages:
6.1.2.1 Schema-less: MongoDB is a document database in which one collection holds different
documents. The number of fields, content and size of the document can differ from one
document to another.
6.1.2.2 Structure of a single object is clear
6.1.2.3 No complex joins
6.1.2.4 Deep query-ability: MongoDB supports dynamic queries on documents using a
document-based query language that's nearly as powerful as SQL
6.1.2.5 Ease of scale-out: MongoDB is easy to scale
6.1.2.6 Replication & High Availability
6.1.2.7 Document-Oriented Storage: Data is stored in the form of JSON-style documents
6.1.2.8 Index on any attribute
6.1.2.9 Conversion / mapping of application objects to database objects not needed
6.1.2.10 Rich Queries
6.1.2.11 Auto-Sharding
7 Learned
I have learned many new concepts through this project. At first I worked on setting up the Amazon
cloud computing server. Running a web server over the internet is very useful nowadays, since the
server can be accessed from anywhere. I followed Processing Images with Amazon Web Services [1],
which presents a solution that uploads an image and then processes that image using Amazon Web
Services (AWS). The image is uploaded in the browser, and a thumbnail of that image is created
and saved. Processing a large image into a thumbnail saves a lot of storage space. For this I
used PHP for running the web application and used the services listed in Section 7.1.
7.1 Cloud Computing
Implemented a user interface where the user can upload images and AWS processes them into
thumbnail images of much smaller size.
7.1.1 An Amazon Elastic Compute Cloud (EC2) instance running Apache and PHP
7.1.2 An Amazon Simple Storage Service (S3) account to hold uploaded images
7.1.3 An Amazon SimpleDB account, to hold metadata about those images
7.1.4 An Amazon Simple Queue Service (SQS) account, to send and receive messages that
involve those images.
7.2 MongoDB
Installed and learned Mongo DB on the Amazon cloud. Initially I worked on saving exact hash codes
in the database and then added approximate hash codes for both the perceptual and discrete cosine
transform (DCT) hashes.
In Mongo DB each record has 4 entries: the first holds the image URL, the second the exact-match
hash code, the third the perceptual hash code and the fourth the DCT hash code.
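A record of this kind might look like the following Python dictionary, which maps directly onto a MongoDB document. The field names and all values are hypothetical placeholders, since the report does not list the actual names used in the collection, and the 64-bit length of the perceptual hash is an assumption.

```python
import json

# Hypothetical document layout: one entry per stored image.
image_record = {
    "url": "http://example.com/images/everest.jpg",  # image URL (placeholder)
    "exact_hash": "0" * 40,       # placeholder for a 40-digit SHA-1 hex digest
    "perceptual_hash": "0" * 16,  # placeholder, assuming a 64-bit hash in hex
    "dct_hash": "0" * 16,         # placeholder for the 64-bit DCT hash in hex
}

# MongoDB stores documents as BSON; JSON is the closest human-readable view.
print(json.dumps(image_record, indent=2))
```

Keeping all three hashes in the same document as the URL means a single lookup returns everything needed to report a match.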
7.3 Image Hashing
Image hashing is the new concept that I learned for this project. I came to understand both exact
and approximate hashing for comparing and displaying images. For the exact hash either the MD5 or
SHA algorithm can be used; I used the SHA algorithm with 160 bits of data. Approximate comparison,
by contrast, is more tedious and has many candidate algorithms: locality sensitive hashing,
perceptual hashing, DCT hashing, etc. Locality sensitive hashing (LSH) requires a lot of data for
comparing the images. It requires normalizing the database by taking all the image hashes,
putting them in hash buckets and recreating the database, or loading all the records into an MVP
tree and then doing the search. This is not a problem with 1000 records, but with millions of
records it is. Also, when you add a new image you need to run the database normalization again.
LSH finds near neighbors in high-dimensional space. Near neighbors are points that are a
small distance apart.
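The bucket idea can be illustrated with bit sampling, one simple LSH family for Hamming distance: each table keys a hash by a fixed subset of its bits, so hashes that differ in only a few bits are likely to share a bucket in at least one table. The sketch below is an illustration under stated assumptions and not something the project implemented; the bit positions, the number of tables, the function names and the 16-bit sample hashes are all arbitrary.

```python
from collections import defaultdict

# Two tables, each sampling a different fixed set of bit positions (arbitrary).
BIT_SETS = [(0, 5, 9, 14), (2, 7, 11, 15)]


def bucket_key(value: int, positions) -> tuple:
    """Project a hash onto the chosen bit positions."""
    return tuple((value >> p) & 1 for p in positions)


def build_index(hashes):
    """One dict of buckets per bit set: key -> list of stored hashes."""
    tables = [defaultdict(list) for _ in BIT_SETS]
    for h in hashes:
        for table, positions in zip(tables, BIT_SETS):
            table[bucket_key(h, positions)].append(h)
    return tables


def candidates(tables, query: int) -> set:
    """Stored hashes sharing a bucket with the query in any table."""
    found = set()
    for table, positions in zip(tables, BIT_SETS):
        found.update(table.get(bucket_key(query, positions), []))
    return found


stored = [0b1010101010101010, 0b0000111100001111]
index = build_index(stored)

# The query differs from stored[0] only in bit 0, which the second table ignores.
query = stored[0] ^ 0b1
print(candidates(index, query))
```

Only the hashes returned as candidates then need a full distance comparison against the query, instead of every record in the database.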
8 Future Work
This project can be extended to use Locality Sensitive Hashing for comparing approximate image
hashes. Web crawling of images can be done to find stolen images, and the web crawler can be used
to upload images automatically. Many more images can be uploaded into the database for testing.
The scalability of the database can be increased; as of now there is a 2GB limit. Further work
includes reducing the false positives in the results and sorting the retrieved images by image
similarity score.
9 Conclusion
I have used two techniques for comparing images on the cloud using MongoDB: exact and
approximate image matching. The image to be searched is selected using the browse button, and its
respective hash values are computed. The system iterates through all the hash code values in the
database and, if there is any exact or approximate match, displays the results.
The SHA-1 algorithm is used for exact image search, and it is faster.
Perceptual and DCT hashes are used for approximate hashing, and the results are much faster
than expected.
10 References
1. Processing Images with Amazon Web Services by John Fronckowiak, June 26, 2008
2. Implementation and Benchmarking of Perceptual Image Hash Functions by Christoph
Zauner
3. Compression Tolerant DCT based image hash, C.Kailasanathan, R. Safavi-Naini,
P.Ogunbona. May 2003
4. Robust Image Hashing Based on Statistical Invariance of DCT Coefficients, Fa-Xin Yu,
Yan-Qiang Lei and Yuan-Gen Wang, Zhe-Ming Lu. May 2009
5. Dong, W., Wang, Z., Charikar, M., Li, K.: High-Confidence Near-Duplicate Image
Detection. In: ACM International Conference on Multimedia Retrieval (2012)
6. Locality Sensitive Hashing: http://web.stanford.edu/class/cs246/slides/03-lsh.pdf
7. High-Confidence Near-Duplicate Image Detection. Wei Dong, Zhe Wang, Moses
Charikar, Kai Li. Feb 2006.
8. http://www.phash.org/docs/design.html
9. http://nekkidphpprogrammer.blogspot.com/2014/01/the-better-dct-perceptual-hash-algorithm.html
10. http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html