International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 6, June 2012)
Application of Binomial Probability Distribution in Cluster
Analysis of Similar Categorical Variables
Parag M. Moteria¹, Dr. Y. R. Ghodasara²
¹Assistant Professor, MCA Department, ISTAR College, V.V.Nagar, Gujarat, India
²Associate Professor, College of AIT, Anand Agriculture University, Anand, Gujarat, India
Abstract – A cluster is a collection of data objects that are
similar to one another within the same cluster and
dissimilar to the objects in other clusters. One approach to
measuring similarity or dissimilarity in categorical variables
uses a contingency table and the ratio of mismatches. This paper
discusses a novel approach that measures similarity or
dissimilarity in categorical variables using the Binomial
probability distribution.
III. CATEGORICAL VARIABLES
Variables that record a response as one of a set of
categories are termed categorical. Such variables fall into
three classifications: Nominal, Ordinal, and Interval [4].
Nominal variables have categories with no natural
order. Nominal variables are also known as
categorical variables. A categorical variable is a
generalization of the binary variable in that it can take on
more than two states [1]. Within any one category, however,
the actual response is binary (i.e. it records one of two
possible conditions or outcomes). It does not matter
whether the values are numeric or alphabetic: a
yes/no variable could be entered as "YES":"NO",
1:2, or "Y":"N" [4]. Let the number of states of a
categorical variable be M. The states can be denoted by
letters, symbols, or a set of integers, such as 1, 2, …, M.
Notice that such integers are used just for data handling
and do not represent any specific ordering [1]. Examples
include different crops (wheat, barley, and peas),
different irrigation methods (flood, furrow, and dry land),
and map color, a categorical variable that may have, say,
five states: red, yellow, green, pink, and blue [1][4].
Categorical variables can be encoded by asymmetric
binary variables by creating a new binary variable for
each of the M states. For an object with a given state
value, the binary variable representing that state is set to
1, while the remaining binary variables are set to 0. For
example, to encode the categorical variable map color, a
binary variable can be created for each of the five colors
listed above. For an object having the color yellow,
the yellow variable is set to 1, while the remaining four
variables are set to 0 [1].
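The encoding described above can be sketched in a few lines of Python. This is only an illustration: the names MAP_COLORS and encode_state are hypothetical and do not come from the paper or from [1].

```python
# Hypothetical list of states, taken from the map-color example in the text.
MAP_COLORS = ["red", "yellow", "green", "pink", "blue"]


def encode_state(value, states):
    """Create one binary variable per state: 1 for the matching state, 0 otherwise."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return {state: int(state == value) for state in states}


yellow = encode_state("yellow", MAP_COLORS)
print(yellow)
# {'red': 0, 'yellow': 1, 'green': 0, 'pink': 0, 'blue': 0}
```

Exactly one of the M binary variables is 1 for any object, which is what makes the encoded variables asymmetric: a shared 1 is informative, while a shared 0 only says that both objects lack that state.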
Keywords – Categorical variables, Nominal variables,
Binomial probability distribution, Binomial mass function,
Asymmetric binary variables, and Cluster analysis.
I. INTRODUCTION
The process of grouping a set of physical or abstract
objects into classes of similar objects is called clustering
[1]. A cluster is a collection of data objects that are
similar to one another within the same cluster and are
dissimilar to the objects in other clusters. A cluster of
data objects can be treated collectively as one group and
so may be considered as a form of data compression.
Although classification is an effective means for
distinguishing groups or classes of objects, it requires the
often costly collection and labelling of a large set of
training tuples or patterns, which the classifier uses to
model each group. It is often more desirable to proceed in
the reverse direction: First partition the set of data into
groups based on data similarity (e.g., using clustering),
and then assign labels to the relatively small number of
groups. Additional advantages of such a clustering-based
process are that it is adaptable to changes and helps
single out useful features that distinguish different
groups.
II. TYPES OF DATA IN CLUSTER ANALYSIS
Following are different types of data/variables:
1. Interval-Scaled Variables
2. Binary Variables
3. Categorical, Ordinal, and Ratio-Scaled Variables
4. Variables of Mixed Types
5. Vector Objects
IV. BINOMIAL PROBABILITY DISTRIBUTION
Binomial probability typically deals with
the probability of several successive decisions, each of
which has two possible outcomes. The probability of an
event can be expressed as a binomial probability if its
outcomes can be broken down into two
probabilities p and q, where p and q are complementary
(i.e. p + q = 1) [2].
Measurement (Color of A=1 & D=1) = 0.25 … … …(c)
Measurement (Color of B=0 & C=0) = 0.25 … … …(d)
Measurement (Color of B=0 & D=1) = 0.5 … … …(e)
Measurement (Color of C=0 & D=1) = 0.5 … … …(f)
The above results are compared with $1/2^n$ (n = number of
records compared; here n = 2, so $1/2^n$ = 0.25). If a measurement
equals $1/2^n$, the records in the pair have the same value.
Equations (c) and (d) equal $1/2^n$, i.e. the pair in each of
these equations has the same value. Because our input value for
success is 1 (YELLOW), equation (c)
is the result when searching for similarity in the color YELLOW.
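As a check on the two-record case, the following Python sketch recomputes the six pairwise measurements from the color codes of Table I (A=1, B=0, C=0, D=1). It assumes the standard binomial mass function with p = 0.5; the names COLOR_CODE and measurement are illustrative and not part of the paper.

```python
from itertools import combinations
from math import comb, isclose

# Color codes from Table I: 1 for YELLOW, 0 for any other color.
COLOR_CODE = {"A": 1, "B": 0, "C": 0, "D": 1}


def measurement(codes, p=0.5):
    """Binomial probability of the observed number of 1s among the given codes."""
    n, x = len(codes), sum(codes)
    return comb(n, x) * p**x * (1 - p) ** (n - x)


results = {}
for pair in combinations(COLOR_CODE, 2):
    codes = [COLOR_CODE[obj] for obj in pair]
    value = measurement(codes)
    # With p = 0.5 the measurement equals 1/2^n only when all codes agree.
    same = isclose(value, 0.5 ** len(codes))
    # The pair matches on YELLOW only if the shared value is 1.
    yellow = same and all(codes)
    results[pair] = (value, same, yellow)
    print(pair, value, same, yellow)
```

Only the pairs (A, D) and (B, C) reach 0.25, and only (A, D) shares the value 1, which corresponds to equation (c).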
Case – 2 (Three records)
Measurement (Color of A=1, B=0 and C=0 – One 1 and
Two 0) = 3/8 … … … (A)
Measurement (Color of A=1, B=0 and D=1 – Two 1 and
One 0) = 3/8 … … … (B)
Measurement (Color of A=1, C=0 and D=1 – Two 1 and
One 0) = 3/8 … … … (C)
Measurement (Color of B=0, C=0 and D=1 – One 1 and
Two 0) = 3/8 … … … (D)
The above results are compared with $1/2^n$ (n = number of
records compared; here n = 3, so $1/2^n$ = 1/8). If a measurement
equals $1/2^n$, all records in the group have the same value.
None of the equations (A) to (D) equals $1/2^n$.
Therefore, no group of three records has the same value
in all of its records.
Similarly, we can apply this method to N records to
compute similarity or dissimilarity in categorical
variables.
The general form for computing similarity in
categorical variables is:
Measurement (n) = $1/2^n$, which indicates that all n
records compared have the same value.
Where,
n represents the number of records compared and n =
2,…,N
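The general form can be sketched as a small Python function that scans every group of n records and keeps those whose measurement equals $1/2^n$. This is a minimal sketch assuming p = 0.5 and the standard binomial mass function; the function name similar_groups and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations
from math import comb, isclose


def similar_groups(codes_by_id, n):
    """Return the groups of n records whose measurement equals 1/2^n (p = 0.5)."""
    threshold = 0.5**n
    matches = []
    for group in combinations(sorted(codes_by_id), n):
        x = sum(codes_by_id[obj] for obj in group)  # number of 1s in the group
        value = comb(n, x) * threshold  # C(n, x) * (1/2)^x * (1/2)^(n - x)
        if isclose(value, threshold):
            matches.append(group)
    return matches


codes = {"A": 1, "B": 0, "C": 0, "D": 1}
print(similar_groups(codes, 2))  # [('A', 'D'), ('B', 'C')]
print(similar_groups(codes, 3))  # []
```

Because the number of groups grows combinatorially with N, this exhaustive scan is only practical for small record sets; it is meant to mirror the worked cases, not to serve as a clustering algorithm.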
A binomial experiment (a sequence of Bernoulli
trials) is a statistical experiment that has the following
properties [3]:
• The experiment consists of n repeated trials.
• Each trial can result in just two possible
outcomes. We call one of these outcomes a
success and the other, a failure.
• The probability of success, denoted by p, is the
same on every trial.
• The trials are independent; that is, the outcome
on one trial does not affect the outcome on other
trials.
In this paper, the probability of success is fixed at
p = 0.5 on every trial.
The binomial mass function [3] is:
$P(x) = \binom{n}{x} p^{x} q^{n-x}$
Where,
n represents the number of records (objects)
compared
p represents the probability of success
q = 1 - p represents the probability of failure
x represents the discrete random variable: the number of
successes among the n records, where each record contributes
1 for success or 0 for failure.
With p = q = 1/2 this reduces to $P(x) = \binom{n}{x} / 2^{n}$.
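For reference, the standard binomial mass function, $P(x) = \binom{n}{x} p^x q^{n-x}$, can be evaluated with Python's standard library. The function name binomial_pmf is an assumption for illustration; the default p = 0.5 follows the fixed success probability used in this paper.

```python
from math import comb


def binomial_pmf(n, x, p=0.5):
    """Probability of exactly x successes in n independent trials."""
    if not 0 <= x <= n:
        raise ValueError("x must lie between 0 and n")
    q = 1 - p  # probability of failure
    return comb(n, x) * p**x * q ** (n - x)


# All outcomes for n = 2 records.
for x in range(3):
    print(x, binomial_pmf(2, x))
# 0 0.25
# 1 0.5
# 2 0.25
```

For n = 2 the extreme outcomes x = 0 and x = 2 both give $1/2^2$ = 0.25, while the mixed outcome x = 1 gives 0.5.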
A. Categorical Variable Comparison Based on
Binomial Probability Distribution
Consider the following table:
TABLE I
OBJECT ID    MAP COLOR (CATEGORICAL VARIABLE)    COLOR CODE
A            YELLOW                              1
B            RED                                 0
C            GREEN                               0
D            YELLOW                              1
…            …                                   ...
In Table I, consider the categorical attribute and
assign code 1 to YELLOW and 0 to the other colors. This makes
it an asymmetric binary variable [1].
The similarity or dissimilarity for this form of data can
be calculated using the binomial probability mass function, as
in our published paper [9].
Case – 1 (Two records)
Measurement (Color of A=1 & B=0) = 0.5 … … … (a)
Measurement (Color of A=1 & C=0) = 0.5 … … …(b)
V. CONCLUSION
The paper shows by example that the binomial
probability distribution can be applied to find
similarity or dissimilarity in categorical variables. The
advantage of the method is that it is also applicable to any
number N of records without any constraints.
REFERENCES
[1] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Second Edition, Elsevier, Morgan Kaufmann Publishers (pages: 383, 386, 392, 393)
[2] http://en.wikipedia.org/wiki/Binomial_probability
[3] http://stattrek.com/probability-distributions/binomial.aspx
[4] http://www.uiweb.uidaho.edu/ag/statprog/sas/workshops/catmod/handout1_cat.pdf
[5] http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm
[6] http://www.oswego.edu/~srp/stats/variable_types.htm
[7] http://www.psychstat.missouristate.edu/multibook/mlt08m.html
[8] http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm
[9] Parag M. Moteria, Dr. Y. R. Ghodasara, “Application of Binomial Distribution in Cluster Analysis of Similar Binary Variables”, International Journal of Emerging Technology and Advanced Engineering, Vol. 2, Issue 4, April 2012 (pages: 84 to 86)