
Data Leakage Detection Using Data Allocation
Strategies
Richa Desai
Dept. of Computer Engineering
Dwarkadas J Sanghvi College of Engineering, Mumbai, India
[email protected]

Megh Jagad
Dept. of Computer Engineering
Dwarkadas J Sanghvi College of Engineering, Mumbai, India
[email protected]

Prof. Abhijit Patil (Assistant Professor)
Dept. of Computer Engineering
Dwarkadas J Sanghvi College of Engineering, Mumbai, India
[email protected]

Abstract— This paper focuses on detecting data leakage caused by authorized agents who have easy access to the data. We employ data allocation strategies in order to detect data leakage, and introduce techniques to prevent it that comply with the laws and rules of its locality. We aim to analyze a guilt model that detects guilty agents using allocation strategies, leaving the original data as it is. A guilty agent is one who leaks a portion of the distributed data. The idea is to distribute the data to the agents in an intelligent manner. We also throw light on the punishments that can be imposed on cyber criminals in different countries, based on their legal systems.

Index Terms— sensitive data; fake objects; data allocation strategies.
I. INTRODUCTION
Data leakage is, essentially, the unauthorized transfer of important and private data to an outside agent, i.e., a third party. The person or entity receiving the data is an unauthorized recipient. Data like financial information, transaction records, intellectual property (IP), personal information, and other business- or industry-related information fall under the category of sensitive data. In real-world scenarios, sensitive data needs to be shared by the distributor among various stakeholders such as employees, business partners and customers [1]. This increases the risk of confidential information falling into unauthorized hands, whether intentionally or by error. In order to identify the culprit, we have to find the leakage points in case of an error, or the guilty agents in case of intentional data leakage.
II. RELATED WORK
Data leakage detection has been a crucial problem in IT systems. Using the security software available, many security threats like hacking, intrusion, impersonation, eavesdropping and viruses can be prevented, and security mechanisms are nowadays prevalent in most forms of electronic data exchange. Guilt detection is a challenging task, however, in cases where data is handed over to authorized agents (humans) who are, in turn, expected to transfer the data to intended recipients. The following review establishes facts in line with this problem.
The data provenance problem has been around for a while; it is closely related to the origin and originality of data. The probability of guilt can be estimated by tracing the origin of the leaked objects. This paper also points to further research in this field, in which all the possible approaches to the data provenance problem are reviewed [2].

Figure 1: Possible Scenarios of Data Leakage
The presented solutions are very particular to their domain; they relate to data warehousing and assume prior knowledge of the data resources. Our problem statement here is simpler and more basic, and it does not modify the original data items to be distributed. Unlike watermarking, the set of objects is not changed when it is distributed to trusted agents; lineage tracing is performed without the use of watermarking. Watermarking technology has played a very crucial role in safeguarding intellectual property that is in electronic format. However, the data that needs to be protected has to be modified in order to embed the watermark. The distributor of watermarked images gets alerted immediately in case an image is modified or tampered with, thus establishing the occurrence of fraud. Watermarking can be used to secure data in the form of images, audio and video, though it introduces some redundancy into the digital data. A similar technique can be used to protect relational data, by adding some markings to the data for security purposes. Our approach and watermarking have one thing in common, i.e., providing identification of the origin of information. However, they are opposites in the sense that our approach doesn't require the modification of objects, as watermarking does. Other research works focus on enabling IT systems to ensure that data is received only by its intended receivers. Such policies aid in the protection of data during its transfer, and in the detection of its leakage as well. However, it is nearly impossible for these policies to satisfy the requirements of the agents, due to their restrictive nature [3].
These fake objects act like a watermark for the data set without modifying the latter. If one or more leaked objects that were fake are found to have been allocated to a particular agent, then it is more certain that that agent is guilty [4].
III. THEORY
Guilty Agents
After distribution of the objects, suppose that the distributor finds a set S ⊆ T to have been leaked. This implies that an unauthorized agent, called the target, possesses S. The target might be using S for display or some legal process. Now, all agents U1, ..., Un have some data, so any of them could be suspected of having leaked it. However, they will try to prove their innocence and claim that the data must have reached the target in some other way. We aim to calculate the probability that the data was leaked by the agents and not obtained from other sources. More data in S makes it difficult for agents to prove their innocence, whereas less data makes it difficult to prove that the target obtained the data through some other medium. We want to estimate the probability of data leakage by the agents as a whole, as well as the probability of a particular agent being guilty. For example, if agent U1 is allocated an object while the other objects are allocated to other agents, and that object leaks, we can suspect U1 more confidently. The model presented captures this intuition. We call an agent Ui guilty if it contributed one or more objects to the target. We denote the event of agent Ui being guilty for a leaked set S by Gi|S. Next, we estimate the probability of Ui being guilty given S.
Usually, the watermarking technique is used to tackle the issue of data leakage: a unique code or image is embedded whenever data is distributed. Due to the watermark, it is easy to identify leaked data held by unauthorized agents, and the leaker can be identified. Watermarks are effective in some cases; however, they require some changes to the original data. Moreover, watermarks are vulnerable to destruction by malicious attackers. Consider, for example, a situation where a company is in partnership with other companies and sharing of data is required, or where the records of a particular patient must be shared with researchers to develop new cures. The company that shares its data is called the distributor, and the companies with whom the data is shared are the agents. The main drawback of the watermarking technique is that the data being distributed has to be modified; hence, the original data isn't sent as it is.
In this paper, we will be adding fake objects in order to detect the leakage of the distributor's sensitive data, and also identify the agent who leaked it. We develop a model for assessing the guilt of agents and present various algorithms for distributing objects to agents, which enable us to find the leaker efficiently. Lastly, we consider adding fake objects to the data set to be distributed. These objects aren't real entities, but they appear real to the observer.
IV. PROBLEM SETUP AND NOTATIONS
Let's consider a set of sensitive data items, T = {t1, t2, t3, ...}, belonging to a distributor. Some of these data objects have to be shared by the distributor with agents U1, U2, ..., Un, but the data objects shouldn't be leaked to other parties. The objects can be tuples in a relation, and of any size or type.
A subset Ri ⊆ T of objects is given to an agent Ui,
determined by either a sample or explicit request:
• Sample request Ri = SAMPLE(T,mi): Any random subset
of mi records can be given to Ui from T.
• Explicit request Ri = EXPLICIT(T,condition): Agent Ui
receives all the T objects that satisfy the given condition.
Example: Let's say the customer records of a company A are in T. A marketing agency U1 is hired by company A to conduct an online survey of its customers, and requires a sample of, say, 1000 customers. Simultaneously, company A hires agent U2 to handle the billing of all California customers. Hence, all records t that satisfy the condition of the state being "California" are provided to U2. Our model can also be expanded to fulfil requests for samples of objects that satisfy conditions, such as an agent wanting a sample of 1000 California customer records.
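The two request types can be sketched in a few lines of Python. The customer relation, field names and helper names below are illustrative stand-ins, not from the paper:

```python
import random

# A toy customer relation standing in for the distributor's set T.
T = [
    {"id": 1, "state": "California"},
    {"id": 2, "state": "Texas"},
    {"id": 3, "state": "California"},
    {"id": 4, "state": "Nevada"},
]

def sample_request(T, m):
    """Ri = SAMPLE(T, m): any random subset of m records from T."""
    return random.sample(T, m)

def explicit_request(T, condition):
    """Ri = EXPLICIT(T, condition): all objects of T satisfying the condition."""
    return [t for t in T if condition(t)]

R1 = sample_request(T, 2)                                       # survey agency U1
R2 = explicit_request(T, lambda t: t["state"] == "California")  # billing agent U2
```

Note that a sample request constrains only the size of the returned set, while an explicit request is fully determined by the condition.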
To simplify, the assumption states that joint events have a very small probability.
Assumption 2: An object t ∈ S can be obtained by the target in one of only two ways:
• An agent Ui leaked t from its own set Ri; or
• The target guessed t without any help from an agent. This implies that, for all t ∈ S, the event that the target guessed t is disjoint from the events that agent Ui leaked it, for i = 1, ..., n.
VI. GUILT MODEL ANALYSIS
Here, we consider three scenarios, in each of which a target has obtained all of the distributor's objects, i.e., T = S.
V. AGENT GUILT MODEL
An agent is said to be guilty if we find that the data which got leaked had been allotted to that particular agent. The probability that a given agent Ui is guilty is represented by Pr{Gi | S} [2]. An agent might claim to be innocent by arguing that the data could have been obtained by other means, such as guessing or collection from elsewhere. Hence, we need to consider the chance of the data being guessed, called the guessing probability and denoted p. The value of p varies with the type of data, and it is an estimate rather than an exact value. The guessing probability is a crucial factor in a fault-tolerant system, making the system more reliable. An agent is considered guilty with at least the calculated probability. To calculate the probability of an agent Ui being guilty for a set S, we first find the probability of Ui leaking a single object t from S. For this, we have to know which agents t was given to; we denote this set of agents by Vt. The probability that some agent, rather than the target itself, leaked t is

Pr {some agent leaked t to S} = 1 − p    (1)

All the agents possessing t are equally capable of having leaked it, hence

Pr {Ui leaks t to S} = (1 − p) / |Vt|, if Ui ∈ Vt    (2)
6.1) Impact of the Probability
Consider a target that possesses all 16 objects of the set. All these objects are given to agent I1, whereas only 8 of them are also given to another agent, I2. The probability of each agent having leaked the data is computed for varying guessing probability p; the results are shown in the graph.
The probability of leaking an object is thus divided equally among the agents who were given that object. If no agent possesses the object, i.e., |Vt| = 0, then Pr{Ui leaks t to S} = 0. The probability that agent Ui is guilty is obtained by combining, over every object that is in both S and Ri, the probability that Ui did not leak it:

Pr {Gi | S} = 1 − ∏ t ∈ S ∩ Ri (1 − (1 − p) / |Vt|)
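The guilt estimate can be written directly as a small function. This is a minimal sketch under our own naming (guilt_probability, V); it assumes objects are hashable values and that Vt is supplied as a mapping from each object to the set of agents holding it:

```python
def guilt_probability(Ri, S, V, p):
    """Pr{Gi | S}: probability that agent i is guilty, given leaked set S.

    Ri : set of objects allocated to agent i
    S  : the leaked set
    V  : dict mapping each object t to the set of agents holding t (V_t)
    p  : probability that the target guessed an object on its own
    """
    prob_not_guilty = 1.0
    for t in S & Ri:
        # Each of the |V_t| holders of t is equally likely to have leaked it,
        # so Pr{agent i leaked t} = (1 - p) / |V_t|.
        prob_not_guilty *= 1.0 - (1.0 - p) / len(V[t])
    return 1.0 - prob_not_guilty

# A 16-object scenario like the one in Section VI: agent 1 holds all 16
# objects, agent 2 holds 8 of them, and the whole set S = T has leaked.
objects = list(range(16))
R1, R2 = set(objects), set(objects[:8])
V = {t: {i for i, R in ((1, R1), (2, R2)) if t in R} for t in objects}
S = set(objects)
g1 = guilt_probability(R1, S, V, p=0.5)
g2 = guilt_probability(R2, S, V, p=0.5)
```

With p = 0.5, agent 1 comes out far more suspicious than agent 2, since 8 of the leaked objects are held by agent 1 alone.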
We make the following assumptions about the relationship among the numerous data-leakage events [5].
Assumption 1: For all t, t′ ∈ S where t ≠ t′, the provenance of t does not depend on the provenance of t′.
Here, provenance refers to the source of a value in the leaked set. The source can be any agent who has t in its set, or the target itself.
Here, the dotted line represents Pr{G1} and the continuous line represents Pr{G2}. As p approaches 0, the probability of the target having guessed all 16 values decreases; each agent has leaked enough data to push its individual guilt probability towards 1. However, an increase in p greatly decreases the probability of I2 being guilty: its 8 objects were given to both I1 and I2, so it is difficult to blame I2 for the leakage. In contrast, the probability of I1 being guilty remains close to 1 as p increases, because the 8 objects held only by I1 aren't shared with any other agent. At the other extreme, as p approaches 1, the probability that the target guessed all the values is high, so the guilt probability of both agents approaches 0.
6.2 Dependence on Overlap of Ri with S
Here, we again consider two agents. One of them receives all the data, whereas the other receives a varying fraction of it. The probability of guilt of both agents is shown in Figure 2(a) as a function of the fraction of objects U2 holds, i.e., as a function of |R2 ∩ S|/|S|. The guessing probability p is fixed at a low value of 0.2; U1 holds all 16 objects of S, whereas U2 holds a varying fraction of them. We observe that when the objects are hard to guess (p = 0.2), we can confidently estimate an agent to be guilty even when only a few of its objects have leaked. This result supports our intuition: an agent is clearly suspicious if it holds even a very small number of incriminating objects. The rate of increase of the guilt probability goes down as p increases. The observation again matches intuition: as the objects become easier to guess, more and more evidence of leakage is needed before an agent can be confidently declared guilty.

6.3 Dependence on the Number of Agents with Ri Data
Here, we see how the sharing of S objects among agents affects their probabilities of being guilty. Consider an agent I1 with all 16 objects of T = S, and another agent I2 holding 8 of those objects, and let p = 0.5. Then, we gradually increase the number of agents holding the same 8 objects. We compute the probabilities Pr{G1} and Pr{G2} for all cases; the results are presented in the figure, with the number of agents holding the 8 objects on the x-axis. For instance, the leftmost value, x = 1.0, is the starting point with agents I1 and I2 only. The rightmost point, x = 8.0, represents I1 with 16 objects and 8 agents like I2, each with the same 8 objects [6].

VII. DATA ALLOCATION STRATEGIES
Now, we throw light on the various ways in which data objects can be allocated so that, in case of a leakage, the guilty agent can be traced from the set of distributed objects. Three algorithms for sample data requests, namely s-random, s-overlap and s-max, based on a round robin approach, are discussed. The general algorithm for allocating a data set D is given below:

Input: requests req1, req2, ..., reqn from each agent
1. SUM ← sum of all requests
2. while SUM > 0 do
3.   i ← SelectAgent()
4.   k ← SelectObject()
5.   add object k to the allocated set of agent i
6.   SUM ← SUM − 1

The running time of the given algorithm is O((η + λ) · SUM), as the loop runs SUM times, where η denotes the running time of SelectAgent() and λ denotes the running time of SelectObject().
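The loop above translates almost line-for-line into Python. This is a sketch under our own naming (allocate, round_robin_agent, s_random are illustrative); SelectAgent() and SelectObject() are passed in as functions so that the selection strategies discussed below can be plugged in:

```python
import random

def allocate(T, requests, select_agent, select_object):
    """General allocation loop: steps 1-6 of the pseudocode above."""
    R = [set() for _ in requests]      # allocated set Ri per agent
    remaining = list(requests)         # outstanding request counts
    total = sum(remaining)             # SUM <- sum of all requests
    while total > 0:                   # while SUM > 0 do
        i = select_agent(remaining)    # i <- SelectAgent()
        k = select_object(T, R, i)     # k <- SelectObject()
        R[i].add(k)                    # add k to the allocated set of i
        remaining[i] -= 1
        total -= 1                     # SUM <- SUM - 1
    return R

def round_robin_agent():
    """SelectAgent(): cycle through agents that still have pending requests."""
    state = {"last": -1}
    def pick(remaining):
        i = state["last"]
        while True:
            i = (i + 1) % len(remaining)
            if remaining[i] > 0:
                state["last"] = i
                return i
    return pick

def s_random(T, R, i):
    """SelectObject(): pick uniformly among objects agent i does not yet hold."""
    return random.choice([t for t in T if t not in R[i]])

# Three agents each request two objects from a four-object data set:
R = allocate(["t1", "t2", "t3", "t4"], [2, 2, 2], round_robin_agent(), s_random)
```

Passing the two selectors as parameters keeps the O((η + λ) · SUM) structure visible: the loop body costs exactly one call to each.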
A. Object Allocation
The SelectObject() function is used to pick a data object and allocate it to an agent i [3]. There are three ways to do this:
i) s-random: This allocation technique fulfills the agents' requests but neglects the distributor's objective. An object from the distributor's data set is selected at random, and hence a poor data allocation might result from s-random. For example, if the distributor's set D contains three objects and three agents request one object each, then under s-random all three agents might be allocated the same object.
ii) s-overlap: In the same example mentioned above, we can allocate a unique data object to every agent using the s-overlap selection. Here, the distributor creates a list of the data objects that have been allocated to the minimum number of agents, and objects are then selected randomly from this subset. If the sum of requests from all agents is less than the total size of the data set, disjoint sets are created; but as SUM increases, the resulting sets share objects.
iii) s-max: The object that yields the minimum increase in the maximum relative overlap among all pairs of agents is allocated to an agent. The running time of the s-max algorithm is O(|D| n), since both the external and internal loops run over all the agents [7].
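Of the three, s-overlap is the easiest to pin down concretely. The sketch below uses our own naming and assumes each allocated set Ri is a Python set; it picks, among the objects an agent does not yet hold, one that has so far been allocated to the fewest agents. In the three-agents/three-objects example this always yields disjoint sets:

```python
import random

def allocated_count(t, R):
    """Number of agents currently holding object t."""
    return sum(t in r for r in R)

def s_overlap(D, R, i):
    """Among objects agent i does not yet hold, pick one allocated to
    the fewest agents so far (ties broken at random)."""
    candidates = [t for t in D if t not in R[i]]
    fewest = min(allocated_count(t, R) for t in candidates)
    return random.choice([t for t in candidates if allocated_count(t, R) == fewest])

# Three agents each request one object from a three-object data set:
D = ["d1", "d2", "d3"]
R = [set(), set(), set()]
for i in range(3):
    R[i].add(s_overlap(D, R, i))
# Each pick goes to a least-allocated object, so the three sets are disjoint.
```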
B. Agent Selection
The agent selection can be done in numerous ways.
i) Round Robin: The Round Robin technique allocates one data object to every agent in turn. Consider three agents who have requested three objects each. In the Round Robin technique, the first data object is given to the first agent, then one to the second agent, and then one to the third agent. The distribution then starts again from the first agent, and so on until all the agents' requirements are satisfied. The main drawback here is that if the first agent needs only 2 objects while there are 100 agents, the first agent will have to wait for all 100 agents in order to get its second object.
ii) First Come First Serve (FCFS): In First Come First Serve, we completely satisfy an agent's requests before we start allocating data objects to the next agent. Consider the above example: in FCFS, the first agent gets all its three objects, and only then are objects allocated to the other agents. An agent that requests earlier doesn't have to wait for other agents.
iii) Longest Request First (LRF): In Longest Request First, we sort the agents in decreasing order of the number of objects they request, and then serve them FCFS: the agent with the largest request is allocated objects first, followed by those with smaller requests. In LRF, the agent with the maximum number of requests doesn't have to wait for agents with smaller requests. However, LRF has a drawback. For example, consider a data set D = { d1, d2, d3, d4, d5 } and three agents requesting 3, 4 and 2 objects. A data allocation for FCFS or LRF using s-overlap or s-max may yield R1 = { d1, d2 }, R2 = { d3, d4, d5 } and R3 = { d1 }. Now, if d1 is leaked, the distributor will suspect agents A1 and A3 equally. However, if the third agent requests two objects, then s-max yields R3 = { d1, d3 }, and if both d1 and d3 are leaked, agent A3 is identified as guilty.
iv) Shortest Request First (SRF): As observed in the above example, if an agent with a smaller number of requests is satisfied after the agents with larger requests, a poor allocation may result. Thus, we propose serving the agents with the smallest requests completely first, and then moving on to the ones with larger requests. For the above example, SRF using s-max yields R3 = { d1 }, R1 = { d2, d3 } and R2 = { d4, d5, d2 }. The SRF allocation is clearly better than the LRF allocation: in LRF, R3 is completely overlapped by R1, whereas in SRF there is no such complete overlap of one agent's set by any other agent's (R1 and R2 have d2 in common, but both also contain different data objects).
The time complexity of Round Robin and FCFS agent selection is O(1). Also, if we keep in memory a set of agents sorted by request size, then selecting an agent again takes O(1). However, sorting the agents takes O(N log N) time, or the selection can instead be implemented using a priority queue [8].
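The four policies differ only in the order in which agents are fully served, which can be made explicit with a small helper (the function name and policy labels below are ours, for illustration):

```python
def agent_order(requests, policy):
    """Order in which agents are completely served under each policy.

    requests : list of request sizes, indexed by agent arrival order
    policy   : "fcfs" (arrival order), "lrf" (largest request first),
               "srf" (smallest request first)
    """
    agents = list(range(len(requests)))
    if policy == "fcfs":
        return agents
    if policy == "lrf":
        return sorted(agents, key=lambda i: -requests[i])
    if policy == "srf":
        return sorted(agents, key=lambda i: requests[i])
    raise ValueError("unknown policy: " + policy)

# Three agents requesting 3, 4 and 2 objects, as in the LRF example:
print(agent_order([3, 4, 2], "lrf"))   # [1, 0, 2]: the 4-object request first
print(agent_order([3, 4, 2], "srf"))   # [2, 0, 1]: the 2-object request first
```

Python's sorted is stable, so agents with equal request sizes keep their arrival order, matching the FCFS tie-breaking one would expect.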
VIII. RESULTS AND ANALYSIS
We measure the algorithm performance and also evaluate the effectiveness of the approximation. The ∆ difference functions, used as objective scalarizations, evaluate a given allocation:

∆ = Σ(i ≠ j) ∆(i, j) / N, where N = n(n − 1)

min ∆ = min(i ≠ j) ∆(i, j)
The metric ∆ is the mean of the ∆(i, j) values for a particular allocation; it shows how effective guilt detection is, on average, for that allocation. The metric min ∆ is the minimum ∆(i, j) value, corresponding to the case where agent i has leaked its data while both i and some other agent j have very comparable guilt probabilities. If min ∆ is small, it will be difficult to identify i, rather than j, as the leaker.
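Given a matrix of guilt probabilities, the two metrics reduce to a mean and a minimum over ordered pairs. The sketch below assumes guilt[i][j] holds Pr{Gj | S = Ri} (the guilt of agent j when agent i's set has leaked), so that ∆(i, j) = guilt[i][i] − guilt[i][j]; this matrix representation is our own:

```python
def delta_metrics(guilt):
    """Return (mean delta, min delta) over all ordered pairs i != j.

    guilt[i][j] = Pr{Gj | S = Ri}: guilt probability of agent j when
    agent i's whole allocated set has leaked. Delta(i, j) is the margin
    by which the true leaker i stands out over agent j.
    """
    n = len(guilt)
    diffs = [guilt[i][i] - guilt[i][j]
             for i in range(n) for j in range(n) if i != j]
    N = n * (n - 1)                    # number of ordered pairs
    return sum(diffs) / N, min(diffs)

# Two agents: the true leaker always has the higher guilt probability,
# but the margin differs per pair.
mean_delta, min_delta = delta_metrics([[0.9, 0.4],
                                       [0.5, 0.8]])
```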
IX. DATA LEAKAGE IN CORPORATE WORLD
The consequences of insufficient security can't be ignored. IT professionals need to overcome these drawbacks and secure corporate assets in a way that complies with external and internal corporate specifications. Our aim is to create and uphold feasible security policies that take different business cultures and practices into account, so that employees from different backgrounds can share information safely.
Two surveys were conducted in 10 countries, selected on the basis of their varying working cultures. For each country, we consider 100 employees and an equal number of IT professionals and survey them, thus garnering 2000 respondents [9].
X. CYBER CRIME LAWS OF NATIONS
McConnell International surveyed its global network of information technology policy officials in order to assess the state of cyber security laws across the world. Countries had to provide the laws they would use to prosecute criminal acts, and each country's legislation was evaluated to check whether its criminal statutes and rules covered cyberspace and its different types of crime.
Thirty-three of the countries don't have updated laws addressing the different types of cyber-crime. Of the remaining countries, only nine have framed legislation addressing five or fewer varieties of cyber-crime, and ten have up-to-date laws to prosecute about ten types of cyber-crime.
Bringing cyberspace under the rule of law is a crucial step towards creating a secure environment for people as well as businesses. This extension is still a work in progress; companies today must first secure their own systems and data from attacks, whether from foreign or internal entities. To facilitate this self-defense, organizations should implement cyber security plans that address people, process, and technology issues. Organizations also need to educate and train their employees in security practices, and incorporate the active use of security technology such as firewalls, intrusion detection tools, anti-virus software, etc. Such system protection tools, both software and hardware, are both complex and costly, and as a result system manufacturers and operators routinely turn security features off. Also, no common standards exist to indicate the quality level of these tools. This inability to quantify the costs and benefits of information security investments puts security managers at a disadvantage when they compete for an organization's resources. There is a lot of scope for improvement in this field [10].
Table: Updated cyber-crime laws of countries. For the US, Brazil, the UK, France, Germany, Italy, China, Japan, India and Australia, the table records (Y/N) whether national legislation covers data crimes (interception, modification, theft), network crimes (interference, sabotage), access crimes (unauthorized access, virus dissemination) and related crimes (aiding and abetting cyber crimes, forgery, fraud).
XI. CONCLUSION
Data leakage is a silent type of threat. Sensitive information can be leaked accidentally or intentionally by an insider, and email, Web sites, FTP, messaging, databases and any other electronic means can be used to distribute it. Two things are important in assessing the risk of distributing data: the data allocation strategy used to distribute the tuples among the agents, and the calculation of the guilt probability, in which the leaked data set is matched against each agent's data set.
REFERENCES
[1] "Data Leakage: The High Cost of Insider Threats," Cisco Worldwide White Paper, http://www.cisco.com/en/US/solutions/collateral/ns170/ns896/ns895/white_paper_c11-506224.html, accessed January 2013.
[2] "Data Allocation Strategies in Data Leakage Detection," http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.417.1779&rep=rep1&type=pdf
[3] Srikanth Yadav, Y. Eswara Rao, V. Shanmukha Rao and R. Vasantha, "Data Allocation Strategies for Detecting Data Leakage," International Journal of Computer Trends and Technology, vol. 3, issue 1, 2012, ISSN 2231-2803, http://www.internationaljournalssrg.org
[4] Rudragouda G Patil, "Development of Data Leakage Detection Using Data Allocation Strategies," International Journal of Computer Applications in Engineering Sciences, ISSN 2231-4946, Dept. of CSE, The Oxford College of Engg, Bangalore.
[5] "Cyber Crime and Punishment? Archaic Laws Threaten Global Information," McConnell International, December 2000.
[6] "Comparative Evaluation of Algorithms for Effective Data Leakage Detection," Proceedings of the 2013 IEEE Conference on Information and Communication Technologies (ICT 2013).
[7] "Data Allocation Strategies for Leakage Detection," IOSR Journal of Computer Engineering (IOSR-JCE).
[8] "Data Leakage Detection," Stanford.
[9] S. Jajodia, P. Samarati, M. L. Sapino and V. S. Subrahmanian, "Flexible Support for Multiple Access Control Policies," ACM Trans. Database Syst., 26(2):214–260, 2001.
[10] Y. Li, V. Swarup and S. Jajodia, "Fingerprinting Relational Databases: Schemes and Specialties," IEEE Transactions on Dependable and Secure Computing, 02(1):34–45, 2005.