International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 6, June 2012)

Application of Binomial Probability Distribution in Cluster Analysis of Similar Categorical Variables

Parag M. Moteria¹, Dr. Y R Ghodasara²
¹Assistant Professor, MCA Department, ISTAR College, V.V.Nagar, Gujarat, India
²Associate Professor, College of AIT, Anand Agriculture University, Anand, Gujarat, India

Abstract – A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. One approach to measuring similarity or dissimilarity in categorical variables uses a contingency table and the ratio of mismatches. This paper discusses a novel approach that measures similarity or dissimilarity in categorical variables using the binomial probability distribution.

Keywords – Categorical variables, Nominal variables, Binomial probability distribution, Binomial mass function, Asymmetric binary variables, and Cluster analysis.

I. INTRODUCTION

The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering [1]. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means of distinguishing groups or classes of objects, it requires the often costly collection and labelling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful features that distinguish different groups.

II. TYPES OF DATA IN CLUSTER ANALYSIS

The following are the different types of data/variables:
1. Interval-Scaled Variables
2. Binary Variables
3. Categorical, Ordinal, and Ratio-Scaled Variables
4. Variables of Mixed Types
5. Vector Objects

III. CATEGORICAL VARIABLES

Variables that record a response as a set of categories are termed categorical. Such variables fall into three classifications: Nominal, Ordinal, and Interval [4]. Nominal variables have categories with no natural order. Nominal variables are also known as categorical variables. A categorical variable is a generalization of the binary variable in that it can take on more than two states [1]. For any single category, however, the actual response is binary (i.e., it records one of two possible conditions or outcomes). It does not matter whether the variables are numeric or alphabetic. A yes/no variable could be entered as "YES":"NO" or 1:2 or "Y":"N" [4].

Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, …, M. Notice that such integers are used just for data handling and do not represent any specific ordering [1]. Examples are crops (wheat, barley, and peas), irrigation methods (flood, furrow, and dry land), and map color, a categorical variable that may have, say, five states: red, yellow, green, pink, and blue [1][4].

Categorical variables can be encoded by asymmetric binary variables by creating a new binary variable for each of the M states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the categorical variable map color, a binary variable can be created for each of the five colors; Table I shows the binary code for the color yellow. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0 [1].

IV. BINOMIAL PROBABILITY DISTRIBUTION

Binomial probability typically deals with the probability of several successive decisions, each of which has two possible outcomes. The probability of an event can be expressed as a binomial probability if its outcomes can be broken down into two probabilities p and q, where p and q are complementary (i.e., p + q = 1) [2].

A binomial experiment, which consists of repeated Bernoulli trials, is a statistical experiment that has the following properties [3]:
1. The experiment consists of n repeated trials.
2. Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other a failure.
3. The probability of success, denoted by p, is the same on every trial.
4. The trials are independent; that is, the outcome of one trial does not affect the outcome of other trials.

In this paper, the probability of success is constant at 0.5 on every trial.

The binomial mass function [3] is:

$P(x) = \binom{n}{x} p^{x} q^{n-x}$

where
n is the number of records (objects) being compared,
p is the probability of success,
q = 1 - p is the probability of failure,
x is the discrete random variable counting successes; each record contributes 1 for success or 0 for failure.

A. Categorical Variable Comparison Based on Binomial Probability Distribution

Consider the following table:

TABLE I

OBJECT ID    MAP COLOR (CATEGORICAL VARIABLE)    COLOR CODE
A            YELLOW                              1
B            RED                                 0
C            GREEN                               0
D            YELLOW                              1
…            …                                   ...

From Table I, consider the categorical attribute and set the code to 1 for YELLOW and 0 for the other colors. This makes them asymmetric binary variables [1]. The similarity or dissimilarity for this form of data can be calculated using the binomial probability mass function, as in our published paper [9].

Case 1 (Two records)

Measurement (Color of A=1 & B=0) = 0.5     (a)
Measurement (Color of A=1 & C=0) = 0.5     (b)
Measurement (Color of A=1 & D=1) = 0.25    (c)
Measurement (Color of B=0 & C=0) = 0.25    (d)
Measurement (Color of B=0 & D=1) = 0.5     (e)
Measurement (Color of C=0 & D=1) = 0.5     (f)

Compare each result with $1/2^{n}$, where n is the number of records being compared (here n = 2, so $1/2^{n} = 0.25$). If a result equals $1/2^{n}$, the records in that pair have the same value. Results (c) and (d) equal $1/2^{n}$, so the records in each of these pairs have the same value. Because the input value for success is 1 (YELLOW), result (c) is the one that identifies similarity in the color YELLOW.

Case 2 (Three records)

Measurement (Color of A=1, B=0 and C=0: one 1 and two 0s) = 3/8    (A)
Measurement (Color of A=1, B=0 and D=1: two 1s and one 0) = 3/8    (B)
Measurement (Color of A=1, C=0 and D=1: two 1s and one 0) = 3/8    (C)
Measurement (Color of B=0, C=0 and D=1: one 1 and two 0s) = 3/8    (D)

Compare each result with $1/2^{n}$ (here n = 3, so $1/2^{n} = 1/8$). None of the results (A) to (D) equals $1/2^{n}$. Therefore, no group of three records has the same value.

Similarly, we can apply this method to N records to compute similarity or dissimilarity in categorical variables. The general form for computing similarity in categorical variables is:

Measurement (n) = $1/2^{n}$ indicates similarity among the n records,

where n is the number of records being compared and n = 2, …, N.

V. CONCLUSION

The paper shows, with an example, that the binomial probability distribution can be applied to find similarity or dissimilarity in categorical variables. An advantage of the method is that it is applicable to any number N of records without constraints.
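As an illustration of the procedure in Section IV-A, the measurements for groups of two and three records can be reproduced with a short script. The following is a minimal sketch in Python, assuming the coding of Table I (YELLOW = 1, other colors = 0) and p = 0.5. The names `MAP_COLOR`, `encode`, `measurement`, and `similar_groups` are hypothetical and are not part of the paper. The extra check that all codes equal 1 is an assumption made here to separate groups that share the target color from groups that share code 0.

```python
from itertools import combinations
from math import comb

# Table I data, held here as a plain dict (an assumption of this sketch).
MAP_COLOR = {"A": "YELLOW", "B": "RED", "C": "GREEN", "D": "YELLOW"}


def encode(colors, target):
    """Code each object 1 if its color equals `target`, otherwise 0."""
    return {obj: int(color == target) for obj, color in colors.items()}


def measurement(codes, p=0.5):
    """Binomial mass function P(x) for a group of binary codes.

    n is the group size and x is the number of 1s (successes).
    """
    n = len(codes)
    x = sum(codes)
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def similar_groups(colors, target, n):
    """Return the groups of n objects that all carry the target state.

    A group is kept when its measurement equals 1/2**n and every code
    in it is 1. Exact float comparison is safe here only because
    p = 0.5 gives powers of two, which binary floats represent exactly.
    """
    codes = encode(colors, target)
    threshold = 1 / 2**n
    groups = []
    for group in combinations(sorted(codes), n):
        values = [codes[obj] for obj in group]
        if measurement(values) == threshold and all(values):
            groups.append(group)
    return groups


if __name__ == "__main__":
    codes = encode(MAP_COLOR, "YELLOW")
    # Print the measurement for every pair of records.
    for pair in combinations(sorted(codes), 2):
        print(pair, measurement([codes[obj] for obj in pair]))
    print(similar_groups(MAP_COLOR, "YELLOW", 2))
    print(similar_groups(MAP_COLOR, "YELLOW", 3))
```

With the data of Table I, the only pair kept for YELLOW is (A, D), and no group of three records is kept, since every three-record measurement is 3/8 rather than 1/8.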
REFERENCES

[1] Jiawei Han and Micheline Kamber, “Data Mining Concepts and Techniques - Second Edition”, Elsevier, Morgan Kaufmann Publishers (pages: 383, 386, 392, 393)
[2] http://en.wikipedia.org/wiki/Binomial_probability
[3] http://stattrek.com/probability-distributions/binomial.aspx
[4] http://www.uiweb.uidaho.edu/ag/statprog/sas/workshops/catmod/handout1_cat.pdf
[5] http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm
[6] http://www.oswego.edu/~srp/stats/variable_types.htm
[7] http://www.psychstat.missouristate.edu/multibook/mlt08m.html
[8] http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm
[9] Parag M. Moteria, Dr. Y. R. Ghodasara, “Application of Binomial Distribution in Cluster Analysis of Similar Binary Variables”, International Journal of Emerging Technology & Advanced Engineering, Vol. 2, Issue 4, April 2012 (pages: 84 to 86)