Data Leakage Detection Using Data Allocation Strategies

Richa Desai
Dept. of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
[email protected]

Megh Jagad
Dept. of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
[email protected]

Prof. Abhijit Patil (Assistant Professor)
Dept. of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
[email protected]

Abstract— This paper focuses on detecting data leakage caused by authorized agents who have easy access to the data. We employ data allocation strategies to detect data leakage and introduce techniques to prevent it while complying with the laws of its locality. We analyze a guilt model that identifies guilty agents using allocation strategies that leave the original data unmodified; a guilty agent is one who leaks a portion of the distributed data. The idea is to distribute the data to the agents in an intelligent manner. We also discuss the punishments that can be imposed on cyber criminals in different countries, based on their legal systems.

Index Terms— sensitive data; fake objects; data allocation strategies.

I. INTRODUCTION
Data leakage is, essentially, the unauthorized transfer of important and private data to an outside agent, i.e., a third party. The person or entity receiving the data is an unauthorized recipient. Data such as financial information, transaction records, intellectual property (IP), personal information, and other business or industry related information fall under the category of sensitive data. In real-world scenarios, a distributor needs to share sensitive data with various stakeholders such as employees, business partners and customers [1]. This increases the risk of confidential information falling into unauthorized hands, whether intentionally or by error. To identify the culprit, we have to find the leakage points in case of an error, or the guilty agents in case of intentional data leakage.

II. RELATED WORK
Data leakage detection has been a crucial problem in IT systems. Using available security software, many security threats such as hacking, intrusion, impersonation, eavesdropping and viruses can be prevented, and security mechanisms are now prevalent in most forms of electronic data exchange. Guilt detection is a challenging task, however, when data is handed over to authorized agents (humans) who are in turn expected to transfer it to intended recipients. The following review establishes facts in line with this problem.

[Figure 1: Possible Scenarios of Data Leakage]

The data provenance problem is closely related to the origin and originality of data: the probability of guilt can be traced by tracing the origin of the leaked objects. Existing research in this field reviews the possible approaches to the data provenance problem [2]. The presented solutions are very particular to their domain, are related to data warehousing, and assume prior knowledge of the data resources. Our problem statement here is simpler and more basic, and it does not modify the original data items to be distributed. Unlike watermarking, the set of objects is not changed when it is distributed to trusted agents; tracing is performed without the use of watermarking.
Watermarking technology has played a crucial role in safeguarding intellectual property that exists in electronic format. However, the data that needs to be protected has to be modified in order to embed the watermark. The distributor of watermarked images is alerted immediately if an image is modified or tampered with, thus confirming the occurrence of fraud. Watermarking can be used to secure data in the form of images, audio and video, though it introduces some redundancy into the digital data. A similar technique can be used to protect relational data, by adding markings into the data for security purposes. Our approach and watermarking have one thing in common: both provide identification of the origin of information. However, they differ fundamentally in that our approach does not require the modification of the objects, as watermarking does.

Other research works focus on enabling IT systems to ensure that data is received only by intended receivers. Such policies aid in the protection of data during its transfer and in the detection of its leakage. However, it is nearly impossible for these policies to satisfy the requirements of the agents, due to their restrictive nature [3].

Watermarking is the usual technique for tackling data leakage today: a unique mark or image is embedded whenever data is distributed. Leaked data held by unauthorized agents is then easy to identify, and the leaker can be traced. Watermarks are effective in some cases, but they require changes to the original data, and they are vulnerable to destruction by malicious attackers. Consider, for example, a company that is in partnership with other companies and must share data with them, or a hospital sharing the records of a particular patient with researchers working on new cures.

When a company shares its data, it is called the distributor of the data, and the companies with whom the data is shared are the agents. The main drawback of the watermarking technique is that the data being distributed has to be modified; the original data is not sent as it is. In this paper, we instead add fake objects in order to detect the leakage of the distributor's sensitive data and to identify the agent who leaked it. We develop a model for assessing the guilt of agents and present algorithms for distributing objects to agents that enable us to find the leaker efficiently.

Finally, we consider adding fake objects to the distributed data set. These objects are not real entities, but they appear to be so to the observer. The fake objects thus behave like a watermark for the data set without modifying it. If one or more leaked objects that were fake are found to have been allocated to a particular agent, it becomes more certain that that agent is guilty [4].
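To make the idea concrete, the following minimal Python sketch shows one way a distributor might mint fake records and keep a ledger of which agent received which fake. The record fields, the uuid-based marker and all helper names are our own illustrative assumptions, not part of the paper's method.

import uuid

def make_fake_record(template: dict) -> dict:
    """Mint a plausible-looking but fictitious record; the uuid-based
    email acts as the traceable marker (hypothetical scheme)."""
    token = uuid.uuid4().hex[:8]
    fake = dict(template)
    fake["name"] = f"Customer {token}"
    fake["email"] = f"{token}@example.com"
    return fake

# Distributor's ledger: fake-record marker -> agent it was given to.
ledger = {}

def allocate_with_fakes(real_objects, agent_id, n_fakes, template):
    """Hand an agent its real objects plus n_fakes traceable fakes."""
    batch = list(real_objects)
    for _ in range(n_fakes):
        fake = make_fake_record(template)
        ledger[fake["email"]] = agent_id   # remember who got this fake
        batch.append(fake)
    return batch

# If a record with a ledgered email later appears in a leak,
# ledger[email] names the agent that must have leaked it.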
III. THEORY
Guilty Agents. After distributing the objects, suppose the distributor finds that a set S ⊆ T has been leaked. This implies that some unauthorized party, the target, possesses S. The target might be using S for display or for some legal process. Now, all agents U1, ..., Un have some of the data, so any of them can be suspected of having leaked it. However, each will try to prove its innocence and claim that the data must have reached the target in some other way. We aim to calculate the probability that the data was leaked by the agents rather than obtained from other sources. The more data in S, the harder it is for the agents to prove their innocence; the less data in S, the harder it is to prove that the target obtained it through some other medium. We want to estimate both the probability that data was leaked by agents and the probability that a particular agent is guilty. For example, if agent U1 is allocated an object that no other agent receives, and that object leaks, we can suspect U1 with more confidence. The model presented captures this intuition. We call an agent Ui guilty if it contributed one or more objects to the target, and we denote the event that agent Ui is guilty for a leaked set S by Gi|S. Next, we estimate the probability of Ui being guilty given S.

IV. PROBLEM SETUP AND NOTATIONS
Consider a set of sensitive data items T = {t1, t2, t3, ...} belonging to a distributor. Some of these data objects have to be shared by the distributor with agents U1, U2, ..., Un, but the objects should not be leaked to other parties. The objects can be tuples in a relation, of any size or type. A subset Ri ⊆ T of objects is given to an agent Ui, determined by either a sample or an explicit request:
• Sample request Ri = SAMPLE(T, mi): any random subset of mi records from T can be given to Ui.
• Explicit request Ri = EXPLICIT(T, condition): agent Ui receives all objects in T that satisfy the given condition.

Example: Suppose the customer records of a company A are in T. A marketing agency U1 is hired by company A to conduct an online survey of customers and requires a sample of, say, 1000 customers. Simultaneously, company A hires agent U2 to handle the billing of all California customers; hence, all records t that satisfy the condition "state = California" are provided to U2. Our model can also be extended to requests for samples of objects that satisfy conditions, such as an agent wanting 1000 California customer records.

V. AGENT GUILT MODEL
An agent is said to be guilty if the data that leaked had been allotted to that particular agent. The probability that agent Ui is guilty is represented by Pr{Gi|S} [2]. An agent might claim innocence by arguing that the data could have been obtained by other means, such as guessing or collection from elsewhere. Hence, we need to consider the chance of the data being guessed, called the guessing probability and denoted by p. The value of p varies with the type of data, and it is an estimate rather than an exact value. The guessing probability is a crucial factor in a fault-tolerant system, making the system more reliable. An agent is considered guilty with at least the calculated probability. To calculate the probability that an agent Ui is guilty for a set S, we first find the probability that it leaked a single object t from S. For this, we need to know which agents were given t; call this set Vt. We have

Pr{some agent leaked t to S} = 1 − p.    (1)

All the agents possessing t are equally capable of leaking it, hence

Pr{Ui leaks t to S} = (1 − p) / |Vt|.    (2)

That is, the probability of an object being leaked is divided equally among the agents who were given that object; if no agent possesses the object (Vt = ∅), then Pr{Ui leaks t to S} = 0. The probability that agent Ui is guilty is then obtained by combining, over all objects that are in both S and Ri, the probabilities that each object was not leaked by Ui:

Pr{Gi|S} = 1 − ∏_{t ∈ S ∩ Ri} (1 − (1 − p) / |Vt|).    (3)

A computational sketch of this formula is given at the end of this section. We make the following assumptions about the relationships among the various data leakage events [5].

Assumption 1: For all t, t' ∈ S with t ≠ t', the provenance of t is independent of the provenance of t'. Here, provenance refers to the source of a value in the leaked set; the source can be any agent who has t in its set, or the target. Put simply, the assumption states that joint events have a very small probability.

Assumption 2: An object t ∈ S can be obtained by the target in only one of two ways:
• An agent Ui leaked t from its own set Ri; or
• The target guessed t without any help from the agents.
This implies that, for all t ∈ S, the event that the target guessed t is disjoint from the events that agent Ui leaked t, for i = 1, ..., n.
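As an illustration of equation (3), here is a small Python sketch that computes Pr{Gi|S} from a leaked set S, an agent's allocation Ri, the map Vt and the guessing probability p. The function and variable names are hypothetical, not from the paper.

from math import prod

def guilt_probability(S, R_i, V, p):
    """Pr{Gi|S} per equation (3): S is the leaked set, R_i the agent's
    allocation, V[t] the set of agents holding object t, p the
    guessing probability."""
    # Only objects that are both leaked and held by agent i contribute.
    common = S & R_i
    # Object t escapes blame on agent i with probability 1 - (1-p)/|V[t]|.
    innocence = prod(1 - (1 - p) / len(V[t]) for t in common)
    return 1 - innocence

# Two agents share objects {1..8}; agent 1 alone also holds {9..16}.
V = {t: {1, 2} for t in range(1, 9)}
V.update({t: {1} for t in range(9, 17)})
S = set(range(1, 17))                                       # S = T leaked
print(guilt_probability(S, set(range(1, 17)), V, p=0.5))    # agent 1, ~0.9996
print(guilt_probability(S, set(range(1, 9)),  V, p=0.5))    # agent 2, ~0.8999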
VI. GUILT MODEL ANALYSIS
Here, we consider three scenarios, in each of which the target has obtained all of the distributor's objects, i.e., T = S.

6.1 Impact of the Probability p
Suppose target T1 possesses 16 objects. All 16 objects are given to agent I1, whereas only 8 of them are also given to a second agent, I2. The probability of each agent having leaked the data is computed for varying p; a simulation sketch of this scenario appears at the end of this section. In the resulting graph, the dotted line represents Pr{G1} and the continuous line represents Pr{G2}. As the value of p approaches 0, the probability of the target having guessed all 16 values decreases, and each agent has leaked enough data to push its individual guilt probability towards 1. An increase in p, however, greatly decreases the probability of I2 being guilty: the 8 objects were given to both I1 and I2, so it is difficult to blame I2 for the leakage. The probability of I1 being guilty remains close to 1 as p increases, since the 8 objects held only by I1 are not shared with any other agent. At the extreme, as p approaches 1, the probability that the target guessed all the values is high, so the guilt probability of both agents approaches 0.

6.2 Dependence on the Overlap of Ri with S
Here, we again consider two agents: one receives all of the data, while the other receives a varying fraction of it. The probability of guilt of both agents is given in Figure 2(a) as a function of the fraction of objects U2 holds, i.e., as a function of |R2 ∩ S| / |S|. Here p is as low as 0.2, and U1 holds all 16 objects of S, whereas in the previous scenario U2 held 50% of them. We observe that when the objects are hard to guess (p = 0.2), we can confidently estimate an agent to be guilty even with few leaked objects. This result supports our intuition: an agent is clearly suspicious if it holds even a very small number of incriminating objects. The rate of increase of the guilt probability slows as p increases, and the intuition is similar: as the objects become easier to guess, more and more evidence of leakage is required.

6.3 Dependence on the Number of Agents Sharing Ri
Here, we examine how the sharing of objects in S among agents affects their probabilities of being guilty. Consider an agent I1 holding all 16 objects of S (T = S) and another agent I2 holding 8 of those objects, and let p = 0.5. We then gradually increase the number of agents holding the same 8 objects, compute the probabilities Pr{G1} and Pr{G2} for each case, and present the results in the figure, where the x-axis shows the number of agents holding the 8 objects. For instance, the leftmost point, x = 1.0, is the starting scenario with one I1 and one I2; the rightmost point, x = 8.0, represents I1 with 16 objects alongside 8 agents like I2, each holding the same 8 objects [6].
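The scenario of Section 6.1 can be reproduced with a short simulation. The sketch below (again with illustrative names, reusing the guilt formula of equation (3)) sweeps the guessing probability p and prints Pr{G1} and Pr{G2} for the 16-object/8-object setup; at small p both guilt values approach 1, and as p grows Pr{G2} falls much faster than Pr{G1}, matching the discussion above.

from math import prod

def guilt(S, R, V, p):
    # Pr{Gi|S} from equation (3)
    return 1 - prod(1 - (1 - p) / len(V[t]) for t in S & R)

# Scenario of Section 6.1: I1 holds all 16 leaked objects,
# I2 holds 8 of them, so those 8 appear in two allocations.
V = {t: {"I1", "I2"} if t < 8 else {"I1"} for t in range(16)}
S = set(range(16))
R1, R2 = set(range(16)), set(range(8))

for p in [0.0, 0.25, 0.5, 0.75, 0.9]:
    print(f"p={p:4}:  Pr(G1)={guilt(S, R1, V, p):.3f}"
          f"  Pr(G2)={guilt(S, R2, V, p):.3f}")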
VII. DATA ALLOCATION STRATEGIES
We now turn to the various ways in which data objects can be allocated so that, in case of a leak, the guilty agent can be traced from the distributed sets. Three object-selection strategies for sample data requests, namely s-random, s-overlap and s-max, are discussed, together with round-robin and related agent-selection approaches. The general allocation algorithm for a data set D is given below.

Input: requests req1, req2, ..., reqn from the agents
1. SUM ← sum of all requests
2. while SUM > 0 do
3.   i ← SelectAgent()
4.   k ← SelectObject()
5.   add k to Ri, the allocated set of agent i
6.   SUM ← SUM − 1

The running time of this algorithm is O(η λ SUM), since the loop runs SUM times, where η denotes the running time of SelectAgent() and λ denotes the running time of SelectObject().

A. Object Allocation
The SelectObject() function picks a data object to allocate to agent i [3]. There are three ways to do this:
i) s-random: This strategy fulfils the agents' requests but neglects the distributor's objective: an object is selected at random from the distributor's data set, so s-random can produce a poor allocation. For example, if the distributor's set D contains three objects and three agents request one object each, s-random may allocate the same object to all three agents.
ii) s-overlap: In the same example, s-overlap selection can allocate a unique data object to every agent (see the sketch after this list). Here, the distributor builds the list of data objects that have so far been allocated to the minimum number of agents, and then selects an object at random from this subset. If the sum of the agents' requests is smaller than the size of the data set, the resulting sets are disjoint; as SUM grows, the resulting sets share objects, but the sharing is distributed evenly.
iii) s-max: An agent is allocated the object that yields the minimum increase in the maximum relative overlap among all agents. The running time of s-max is O(|D| n), since both the external and internal loops run over all the agents [7].
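As a sketch of how SelectObject() might realize the s-overlap strategy, the following Python fragment picks, for each request, a random object among those currently held by the fewest agents. The helper names and the round-robin outer loop are our own simplifications of the general algorithm above.

import random
from collections import defaultdict

def s_overlap_select(candidates, alloc_count):
    """s-overlap: choose uniformly among the objects currently held
    by the fewest agents, keeping allocations as disjoint as possible."""
    fewest = min(alloc_count[d] for d in candidates)
    return random.choice([d for d in candidates if alloc_count[d] == fewest])

def allocate(D, requests):
    """requests: {agent: number of objects wanted}; returns {agent: set}."""
    alloc_count = defaultdict(int)          # object -> #agents holding it
    R = {a: set() for a in requests}
    pending = dict(requests)
    while any(pending.values()):            # SUM > 0 in the paper's loop
        for agent, left in pending.items(): # round-robin over the agents
            if left == 0:
                continue
            obj = s_overlap_select([d for d in D if d not in R[agent]],
                                   alloc_count)
            R[agent].add(obj)
            alloc_count[obj] += 1
            pending[agent] -= 1
    return R

# Three agents requesting one object each receive disjoint objects.
print(allocate(D=["d1", "d2", "d3"], requests={"U1": 1, "U2": 1, "U3": 1}))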
B. Agent Selection
The SelectAgent() function can be implemented in numerous ways:
i) Round Robin: objects are allocated to the agents one at a time, in turn. Consider three agents who have each requested three objects: the first data object goes to the first agent, the next to the second agent, the next to the third, and the distribution then starts again from the first agent, continuing until all agents' requests are satisfied. The main drawback is that if the first agent needs only 2 objects but there are 100 agents, the first agent must wait for all 100 agents before receiving its second object.
ii) First Come First Serve (FCFS): an agent's request is satisfied completely before objects are allocated to the next agent. In the example above, the first agent receives all three of its objects, and only then are objects allocated to the other agents. An agent that requests earlier does not have to wait for other agents.
iii) Longest Request First (LRF): the agents are sorted in decreasing order of request size and then served FCFS, i.e., the agent with the largest request is allocated objects first, followed by those with smaller requests. In LRF, the agent with the maximum number of requests does not have to wait for agents with smaller requests. However, LRF has a drawback. For example, consider a data set D = {d1, d2, d3, d4, d5} and three agents requesting 3, 4 and 2 objects. A data allocation under FCFS or LRF using s-overlap or s-max may yield R1 = {d1, d2}, R2 = {d3, d4, d5} and R3 = {d1}. If d1 is leaked, the distributor will suspect agents A1 and A3 equally. However, if the third agent's request for two objects is served, s-max yields R3 = {d1, d3}, and if both d1 and d3 are leaked, agent A3 can be identified as guilty.
iv) Shortest Request First (SRF): as the example above shows, if an agent with a small request is satisfied after agents with larger requests, the allocation may be poor. We therefore propose first satisfying the agents with the smallest requests completely, and then moving on to those with larger requests. For the example above, SRF using s-max yields R3 = {d1}, R1 = {d2, d3} and R2 = {d4, d5, d2}. The SRF allocation is clearly better than the LRF allocation: under LRF, R3 is completely overlapped by R1, while under SRF no agent's set is completely overlapped by another's (R1 and R2 share d2, but each also contains distinct objects).

The time complexity of agent selection under Round Robin and FCFS is O(1). If the agents are kept in memory sorted by request size, selection also takes O(1) per step; however, sorting the agents takes O(N log N) time, or a priority queue can be used instead [8].

VIII. RESULTS AND ANALYSIS
We measure the performance of the algorithms and evaluate the effectiveness of the approximation. To evaluate a given allocation, we use the difference functions ∆(i, j) with objective scalarization as metrics:

∆̄ = (1/N) Σ_{i≠j} ∆(i, j), where N = n(n − 1),
min ∆ = min_{i≠j} ∆(i, j).

The metric ∆̄ is the mean of the ∆(i, j) values for a particular allocation, and it shows how effective guilt detection is, on average, for that allocation. The metric min ∆ is the minimum ∆(i, j) value; it corresponds to the case where agent i has leaked its data while agent i and some other agent j have very similar guilt probabilities. If min ∆ is small, it is difficult to identify i, rather than j, as the leaker.
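A minimal sketch of these metrics, assuming ∆(i, j) = Pr{Gi|Ri} − Pr{Gj|Ri} (the guilt gap between the actual leaker i and another agent j when agent i's set leaks) and reusing the guilt formula of equation (3); all names are illustrative.

from itertools import permutations
from math import prod

def guilt(S, R, V, p):
    # Pr{Gi|S} from equation (3)
    return 1 - prod(1 - (1 - p) / len(V[t]) for t in S & R)

def delta_metrics(R, p):
    """R: {agent: allocated set}. For each ordered pair (i, j),
    Delta(i, j) = Pr{Gi|Ri} - Pr{Gj|Ri} (assumed definition).
    Returns the mean gap over N = n(n-1) pairs and the minimum gap."""
    V = {}
    for agent, objs in R.items():
        for t in objs:
            V.setdefault(t, set()).add(agent)
    gaps = [guilt(R[i], R[i], V, p) - guilt(R[i], R[j], V, p)
            for i, j in permutations(R, 2)]
    return sum(gaps) / len(gaps), min(gaps)

# The LRF example above: d1 is shared, so (U3, U1) has a zero gap.
R = {"U1": {"d1", "d2"}, "U2": {"d3", "d4", "d5"}, "U3": {"d1"}}
mean_gap, min_gap = delta_metrics(R, p=0.2)
print(f"mean Delta = {mean_gap:.3f}, min Delta = {min_gap:.3f}")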
IX. DATA LEAKAGE IN THE CORPORATE WORLD
The consequences of insufficient security cannot be ignored. IT professionals need to address these shortcomings in order to secure corporate assets while complying with external and internal corporate specifications. Our aim is to create and uphold feasible security policies that take different business cultures and practices into account, so that employees from different backgrounds can share information safely. Two surveys were conducted in 10 countries, selected on the basis of their varying working cultures. For each country, 100 employees and an equal number of IT professionals were surveyed, garnering 2000 respondents in total [9].

X. CYBER CRIME LAWS OF NATIONS
McConnell International surveyed its global network of information technology policy officials in order to assess the state of cyber security laws across the world. Countries were asked to provide the laws they would use to prosecute criminal acts, and each country's legislation was evaluated to check whether its criminal statutes covered cyberspace and its different types of crime. Thirty-three of the countries surveyed do not have updated laws addressing the different types of cyber-crime. Of the remaining countries, only nine have framed legislation addressing five or fewer varieties of cyber-crime, and ten have up-to-date laws to prosecute around ten types of cyber-crime.

Bringing cyberspace under the rule of law is a crucial step in creating a secure environment for people as well as businesses. This extension is still a work in progress; companies today must first secure their own systems and data from attacks, whether from foreign or internal entities. To facilitate this self-defense, organizations should implement cyber security plans that address people, process and technical issues, educate and train their employees in security practices, and actively use strong security technology such as firewalls, intrusion detection tools and anti-virus software. Such system protection tools, both software and hardware, are complex as well as costly, and to avoid these issues system manufacturers and operators routinely turn security features off. Moreover, no common standards exist for rating the quality of such tools. The resulting inability to quantify the costs and benefits of information security investments puts security managers at a disadvantage when competing for an organization's resources. There is a lot of scope for improvement in this field [10].

[Table: Updated Laws of Countries — for the US, Brazil, the UK, France, Germany, Italy, China, Japan, India and Australia, a Y/N matrix of whether national law covers each of ten cyber-crime categories: data crimes (interception, modification, theft), network crimes (interference, sabotage), access crimes (unauthorized access, virus dissemination) and related crimes (aiding and abetting cyber crimes, forgery, fraud). The individual Y/N cell values could not be reliably recovered from the source.]

XI. CONCLUSION
Data leakage is a silent type of threat. Sensitive information can be leaked accidentally or intentionally by an insider, and distributed through email, Web sites, FTP, messaging, databases, or any other electronic means. Two things are important in assessing the risk of distributing data: the data allocation strategy used to distribute the tuples among the agents, and the calculation of the guilt probability, in which the leaked data set is overlapped with each agent's data set.

REFERENCES
[1] "Data Leakage: The High Cost of Insider Threats," Worldwide White Paper, http://www.cisco.com/en/US/solutions/collateral/ns170/ns896/ns895/white_paper_c11-506224.html, accessed January 2013.
[2] "Data Allocation Strategies in Data Leakage Detection," http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.417.1779&rep=rep1&type=pdf.
[3] S. Yadav, Y. Eswara Rao, V. Shanmukha Rao, and R. Vasantha, "Data Allocation Strategies for Detecting Data Leakage," International Journal of Computer Trends and Technology, vol. 3, issue 1, 2012, ISSN 2231-2803, http://www.internationaljournalssrg.org.
[4] R. G. Patil, "Development of Data Leakage Detection Using Data Allocation Strategies," International Journal of Computer Applications in Engineering Sciences, ISSN 2231-4946.
[5] "Cyber Crime and Punishment? Archaic Laws Threaten Global Information," December 2000.
[6] "Comparative Evaluation of Algorithms for Effective Data Leakage Detection," in Proceedings of the 2013 IEEE Conference on Information and Communication Technologies (ICT 2013).
[7] "Data Allocation Strategies for Leakage Detection," IOSR Journal of Computer Engineering (IOSR-JCE).
[8] "Data Leakage Detection," Stanford University.
[9] S. Jajodia, P. Samarati, M. L. Sapino, and V. S. Subrahmanian, "Flexible support for multiple access control policies," ACM Trans. Database Syst., vol. 26, no. 2, pp. 214–260, 2001.
[10] Y. Li, V. Swarup, and S. Jajodia, "Fingerprinting relational databases: Schemes and specialties," IEEE Transactions on Dependable and Secure Computing, vol. 2, no. 1, pp. 34–45, 2005.