STUDY MATERIAL
Subject  : Data Mining
Staff    : Usha.P
Class    : III B.Sc (CS-B)
Semester : VI
UNIT-II: Visualizing and Exploring Data: Summarizing Data – Tools for Displaying Single
Variables – Tools for Displaying Relationships between Two Variables – Tools for Displaying
Relationships between More than Two Variables – Principal Components Analysis – Multi-dimensional
Scaling. Data Analysis and Uncertainty: Dealing with Uncertainty – Random Variables and
their Relationships – Samples and Statistical Inference – Estimation – Hypothesis Testing –
Sampling Methods.
1. Visualizing and Exploring Data
Visual Methods for finding Structures in data:
• Power of human eye/brain to detect structures
– Product of eons of evolution
• Display data in ways that capitalize on human pattern processing abilities
• Can find unexpected relationships
– Limitation: very large data sets
Exploratory Data Analysis
• Explore the data without any clear ideas of what we are looking for
• EDA techniques are
– Interactive
– Visual
• Many graphical methods for low-dimensional data
• For higher dimensions -- Principal Components Analysis
1. Summarizing the data
• Mean
– Centrality (in the sense that it minimizes the sum of squared differences between it and
the data values)
• Minimizes sum of squared errors to all samples
• If there are n data values, the mean is the value such that the sum of n copies of the
mean equals the sum of the data values
– Measures of Location
• Mean is a measure of location
• Median (value with equal no of points above/ below)
• First Quartile (value greater than a quarter of data points)
• Third Quartile (value greater than three quarters)
• Mode
– Most Common Value of Data
• Multimodal
– E.g., if ten data points take the value 3, ten take the value 7, and all other values occur fewer than ten times, the data have two modes (3 and 7)
The sample mean is defined as

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

The variance is defined as the average of the squared differences between the mean and the
individual data values:

$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$

The standard deviation is the square root of the variance:

$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
The interquartile range, common in some applications, is the difference between the third
and first quartile. The range is the difference between the largest and smallest data
point.
Skewness measures whether or not a distribution has a single long tail and is commonly
defined as the third standardized moment

$\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$
For example, the distribution of people's incomes typically shows the vast majority of
people earning small to moderate amounts, and just a few people earning large sums, tailing
off to the very few who earn astronomically large sums—the Bill Gateses of the world. A
distribution is said to be right-skewed if the long tail extends in the direction of increasing
values and left-skewed otherwise. Right-skewed distributions are more common. Symmetric
distributions have zero skewness.
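As a quick illustration of these summary statistics, here is a minimal Python sketch; the data values are made up for illustration and are not from the text.

```python
import numpy as np

# A small, made-up right-skewed sample (e.g., incomes in thousands).
x = np.array([12, 15, 18, 20, 21, 22, 25, 30, 45, 120], dtype=float)

n = len(x)
mean = x.sum() / n                        # sample mean
median = np.median(x)                     # equal number of points above/below
q1, q3 = np.percentile(x, [25, 75])       # first and third quartiles
variance = ((x - mean) ** 2).sum() / n    # average squared deviation from the mean
std_dev = np.sqrt(variance)               # standard deviation
iqr = q3 - q1                             # interquartile range
data_range = x.max() - x.min()            # range
skewness = ((x - mean) ** 3).mean() / std_dev ** 3   # third standardized moment

print(mean, median, q1, q3, variance, std_dev, iqr, data_range, skewness)
```

Because of the single very large value (120), the skewness comes out positive, i.e., the sample is right-skewed.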
2. Tools for Displaying Single Variables
• Basic display for univariate data is the histogram
– Number of values of the variable that lie in consecutive intervals
Histogram
Histogram of Diastolic blood pressure of individuals
(UCI ML archive)
Disadvantages of Histograms
• Random Fluctuations in values
• Alternative choices for the ends of intervals give very different diagrams
• Apparent multimodality can arise then vanish for different choices of intervals or for
different small samples
• Effects diminish with increasing size of data set
The disadvantages of histograms have also been tackled by smoothing estimates. One
of the most widely used types is the kernel estimate. Denoting the kernel function by K
and its width (or bandwidth) by h, the estimated density at any point x is

$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K(x - x_i, h)$

A common form for K is the Normal (Gaussian) curve, with h as its spread parameter
(standard deviation), i.e.,

$K(t, h) = C e^{-t^2 / 2h^2}$

where C is a normalizing constant.
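A minimal sketch of such a kernel estimate, assuming a Gaussian kernel and a hand-picked bandwidth h; the data values and the bandwidth are illustrative, not from the text.

```python
import numpy as np

def kernel_density(grid, data, h):
    """Gaussian kernel density estimate evaluated at each point of `grid`."""
    # K(t, h) = C * exp(-t^2 / (2 h^2)), with C chosen so that K integrates to 1.
    c = 1.0 / (h * np.sqrt(2.0 * np.pi))
    t = grid[:, None] - data[None, :]                 # pairwise differences
    # Average the kernel contributions of all data points at every grid point.
    return (c * np.exp(-t ** 2 / (2.0 * h ** 2))).mean(axis=1)

# Illustrative blood-pressure-like values (made up).
data = np.array([62., 70., 72., 74., 76., 80., 82., 88., 90., 104.])
grid = np.linspace(50, 120, 200)
density = kernel_density(grid, data, h=5.0)           # h controls the smoothness
```

A small h gives a wiggly estimate close to the raw data; a large h smooths away detail, so the bandwidth plays the role that the interval width plays for histograms.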
3. Tools for Displaying Relationships between Two Variables
• Box Plots
• Scatter Plots
• Contour Plots
• Time as one of the two variables
Boxplots on Four Different Variables From the Pima Indians Diabetes Data Set.
For Each Variable, a Separate Boxplot is Produced for the Healthy Subjects (Labeled 1) and
The Diabetic Subjects (Labeled 2). The Upper and Lower Boundaries of Each Box Represent
the Upper and Lower Quartiles of the Data Respectively. The Horizontal Line within Each
Box Represents the Median of the Data. The Whiskers Extend 1.5 Times the Interquartile
Range From the End of Each Box. All Data Points outside the Whiskers are Plotted
Individually (Although some Overplotting is Present, e.g., for Values of 0).
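A minimal matplotlib sketch of this kind of display, using two made-up groups in place of the healthy and diabetic subjects; the numbers are invented, and matplotlib's default whisker length is already 1.5 times the interquartile range.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up stand-ins for the two groups of subjects.
healthy = rng.normal(loc=110, scale=15, size=200)
diabetic = rng.normal(loc=140, scale=20, size=200)

# Box = quartiles, line = median, whiskers = 1.5 x IQR,
# points beyond the whiskers are plotted individually.
plt.boxplot([healthy, diabetic], whis=1.5)
plt.xticks([1, 2], ["1 (healthy)", "2 (diabetic)"])
plt.ylabel("measurement (illustrative units)")
plt.show()
```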
The scatter plot is a standard tool for displaying two variables at a time. Figure 3.6
shows the relationship between two variables describing credit card repayment patterns
(the details are confidential). It is clear from this diagram that the variables are
strongly correlated—when one variable has a high (low) value, the other variable is likely to
have a high (low) value. However, a significant number of people depart from this
pattern, showing high values on one of the variables and low values on the other. It might
be worth investigating these individuals to find out why they are unusual.
A Standard Scatter plot for Two Banking Variables
Scatter plot Disadvantages
1. With a large number of data points it reveals little structure
2. Can conceal overprinting, which can be significant for multimodal data
Contour plot
1. Overcomes some scatter plot problems
2. Requires a 2-D density estimate to be constructed with a 2-D kernel
A Contour Plot of the Data from Figure 3.7.
Display when one of the variables is time
A Plot of the Number of Credit Cards in Circulation in the United Kingdom, By Year.
Patterns of Change over Time in the Number of Miles Flown by UK Airlines in the 1960s.
Weight Changes Over Time in a Group of 10,000 School Children in the 1930s
4. Tools for Displaying More Than Two Variables
• Scatter plots for all pairs of variables
• Trellis Plot
• Parallel Coordinates Plot
• Sheets of Paper and Computer screens are fine for two variables
• Need projections from higher-dimensional data to 2-D plane
• Methods
– Examine all pairs of variables
• Scatterplot matrix
• Trellis plot
• Icons
Scatter Plot Matrix
A Scatterplot Matrix for the Computer CPU Data
A scatterplot matrix for characteristics, performance measures, and relative
performance measures of 209 computer CPUs dating from over 10 years ago. The variables
are cycle time, minimum memory (kb), maximum memory (kb), cache size (kb), minimum
channels, maximum channels, relative performance, and estimated relative performance
(relative to an IBM 370/158-3). While some pairs of variables appear to be unrelated, others
are strongly related.
Brushing allows us to highlight points in a scatterplot matrix in such a way that the
points corresponding to the same objects in each scatter plot are highlighted. This is
particularly useful in interactive exploration of data.
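A minimal sketch of a scatterplot matrix built directly with matplotlib, using made-up variables in place of the CPU data; libraries such as pandas also provide this via pandas.plotting.scatter_matrix.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Made-up stand-ins for a few CPU-like variables.
cycle_time = rng.uniform(50, 500, 100)
max_memory = rng.uniform(64, 64000, 100)
performance = 2e5 / cycle_time + max_memory / 500 + rng.normal(0, 20, 100)
data = {"cycle time": cycle_time, "max memory": max_memory, "performance": performance}

names = list(data)
k = len(names)
fig, axes = plt.subplots(k, k, figsize=(8, 8))
for i, yname in enumerate(names):          # rows
    for j, xname in enumerate(names):      # columns
        ax = axes[i, j]
        if i == j:
            ax.hist(data[xname], bins=15)  # histogram on the diagonal
        else:
            ax.scatter(data[xname], data[yname], s=5)
        if i == k - 1:
            ax.set_xlabel(xname)
        if j == 0:
            ax.set_ylabel(yname)
plt.tight_layout()
plt.show()
```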
Disadvantages of Scatter Plot Matrices
• Scatter Plot Matrices are multiple bivariate solutions
• Not a multivariate solution
• Such projections sacrifice information
Trellis Plot
• Rather than displaying a scatter plot for each pair of variables
• Fix a particular pair of variables and produce a series of scatter plots, histograms, time series
plots, contour plots, etc., conditioned on the values of one or more other variables
(Example: a trellis plot showing the same plot in separate panels for Male and Female subjects.)
Icon Plot
• Star Plot: each direction corresponds to a variable, and the length in that direction corresponds to the variable's value
(Example: star plots of 53 samples of minerals described by 12 chemical properties.)
5. Principal Components Analysis
—PCA is a way of expressing a high-dimensional data set via an alternative set of dimensions
that are
– Orthogonal to each other
– Easy to rank by how “representative” they are
– Not always easy to interpret
—It is useful because it allows visualization of the data in the most representative dimensions
—It may (or may not) provide useful predictions for clustering and phenotype prediction
—Rotate the data so that the new coordinates are based on the correlation structure of the
data
—Select a subset of the new coordinates with high variability, and use it for data visualization,
summarization and clustering
—E.g., clustering tumour samples using PCA on expression profiles
Singular Value Decomposition:
—SVD is identical to PCA if the data is already centered
—Otherwise the idea is the same, but the rotation is driven by mean information as well as
covariance structure
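A minimal sketch of PCA computed through the SVD of the centred data matrix; the data are made up, and in practice a library routine such as sklearn.decomposition.PCA would normally be used.

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up data: 100 objects described by 5 correlated variables.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(100, 5))

# 1. Centre each variable (this is where PCA and plain SVD coincide).
Xc = X - X.mean(axis=0)

# 2. SVD of the centred matrix: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Variance carried by each component, used to rank the new dimensions.
explained = s ** 2 / (len(X) - 1)
explained_ratio = explained / explained.sum()

# 4. Project onto the first two components for visualization / clustering.
scores = Xc @ Vt[:2].T          # equivalently U[:, :2] * s[:2]

print(explained_ratio)          # most of the variance sits in the first two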
5. Multidimensional Scaling:
—PCA requires a vector representation
—It can be interesting to start from pairwise distances between n points
—How can we do dimension reduction and visualization in this case?
—Find coordinates for the points in d-dimensional space such that the distances are preserved as well as
possible
—Multi-Dimensional Scaling (MDS) is a general technique for displaying n-dimensional
data in 2D
—It preserves the notion of “nearness”, and therefore clusters of items in n dimensions still
look like clusters on a plot
Motivations
MDS attempts to
—Identify abstract variables which have generated the inter-object similarity measures
—Reduce the dimension of the data in a non-linear fashion
—Reproduce non-linear higher-dimensional structures on a lower-dimensional display
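A minimal sketch of classical (metric) MDS, assuming we are handed only a matrix of pairwise Euclidean distances: double-centring the squared distances and taking the top eigenvectors recovers 2-D coordinates that approximately preserve those distances. The data are made up; non-metric MDS variants instead minimise a stress criterion iteratively.

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up points in 5-D; in practice only their pairwise distances are given.
points = rng.normal(size=(30, 5))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

n = D.shape[0]
# Double-centre the squared distances: B = -1/2 * J D^2 J, with J = I - (1/n) 11'.
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

# Eigendecomposition of B; the largest eigenvalues give the best 2-D layout.
eigvals, eigvecs = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
top = order[:2]
coords_2d = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
# coords_2d can now be plotted in the plane; nearby objects stay nearby.
```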
6. Data Analysis and Uncertainty
Reasons for Uncertainty:
1. Data may only be a sample of the population to be studied
– Uncertain about the extent to which samples differ from each other
2. Interest is in making a prediction about tomorrow based on data we have today
3. Cannot observe some values and need to make a guess
Dealing with Uncertainty:
• Several Conceptual bases
1. Probability
2. Fuzzy Sets
3. Rough Sets
• Probability Theory vs Probability Calculus
• Probability Calculus is well-developed
• Generally accepted axioms and derivations
• Probability Theory has scope for perspectives
– Mapping the real world to what probability is
• Alternatives such as fuzzy sets and rough sets lack the theoretical backbone and the wide
acceptance of probability
Frequentist vs. Bayesian
• Frequentist
• Probability is objective
• It is the limiting proportion of times an event occurs in identical situations
– An idealization, since (for example) all customers are not identical
• Bayesian
• Subjective probability
• Explicit characterization of all uncertainty, including any parameters estimated from the data
• The two approaches frequently yield the same results.
7. Random Variable:
• A mapping from a property of objects to a variable that can take a set of possible values, via a
process that appears to the observer to have an element of unpredictability
• The set of possible values of a random variable is its domain
• Examples
• Coin toss (domain is the set {heads, tails})
• Number of times a coin has to be tossed to get a head
– Domain is the positive integers
• Flying time of a paper aeroplane in seconds
– Domain is set of positive real numbers
7.1 Properties of a Univariate (Single) Random Variable:
• X is a random variable and x is its value
• Domain is finite:
• Probability mass function p(x)
• Domain is real line:
• Probability density function p(x)
• Expectation of X
• E[X]=∫ x p(x) dx
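A small sketch of both cases, with made-up distributions: a probability mass function over a finite domain, and a numerically integrated density over the real line.

```python
import numpy as np

# Finite domain: the expectation is the probability-weighted sum of the values.
values = np.array([1, 2, 3, 4, 5, 6])            # e.g., a fair die
pmf = np.full(6, 1.0 / 6.0)
expectation_discrete = (values * pmf).sum()      # 3.5

# Real-valued domain: E[X] = integral of x p(x) dx, approximated here on a grid
# for an exponential density p(x) = exp(-x), x >= 0 (its expectation is 1).
x = np.linspace(0.0, 50.0, 100_000)
pdf = np.exp(-x)
expectation_continuous = (x * pdf).sum() * (x[1] - x[0])   # simple Riemann sum
```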
7.2 Multivariate Random Variable:
• Set of several random variables
• A d-dimensional vector x = (x1, ..., xd)
• The density function of x is the joint density function p(x1, ..., xd)
• The density function of a single variable (or subset) is called a marginal density
• Derived by integrating (or summing) over the variables not included, e.g., p(x1) = ∫∫ p(x1, x2, x3) dx2 dx3
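For a discrete example, a small sketch with a made-up joint probability table: the marginal of x1 is obtained by summing out x2 (for continuous variables the sum becomes the integral shown above).

```python
import numpy as np

# Made-up joint probabilities p(x1, x2): x1 has 2 values (rows), x2 has 3 (columns).
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])
assert np.isclose(joint.sum(), 1.0)

p_x1 = joint.sum(axis=1)   # marginal of x1: [0.40, 0.60]
p_x2 = joint.sum(axis=0)   # marginal of x2: [0.35, 0.35, 0.30]
```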
8. Samples and Statistical Inference:
Statistical inference is the process of drawing conclusions from data that are subject
to random variation, for example, observational errors or sampling variation. More
substantially, the terms statistical inference, statistical induction and inferential statistics
are used to describe systems of procedures that can be used to draw conclusions from datasets
arising from systems affected by random variation. Initial requirements of such a system of
procedures for inference and induction are that the system should produce reasonable answers
when applied to well-defined situations and that it should be general enough to be applied
across a range of situations.
9. Hypothesis Testing
In many situations we want to see whether the data support some idea about the
value of a parameter. For example, we might want to know if a new treatment has an
effect greater than that of the standard treatment, or if two variables are related in a
population. Since we are often unable to measure these for an entire population, we must base
our conclusions on samples. Statistical tools for exploring such hypotheses are
called hypothesis tests.
Classical Hypothesis Testing:
The basic principle of hypothesis tests is as follows. We begin by defining
two complementary hypotheses: the null hypothesis and the alternative hypothesis. Often
the null hypothesis is some point value (e.g., that the effect in question has value zero—that
there is no treatment difference or regression slope) and the alternative hypothesis is simply
the complement of the null hypothesis. Suppose, for example, that we are trying to draw
conclusions about a parameter θ. The null hypothesis, denoted by H0, might state that θ = θ0,
and the alternative hypothesis (H1) might state that θ ≠ θ0. Using the observed data, we
calculate a statistic (what form of statistic is best depends on the nature of the hypothesis
being tested; examples are given below). The statistic would vary from sample to sample—it
would be a random variable. If we assume that the null hypothesis is correct, then we can
determine the expected distribution for the chosen statistic, and the observed value of the
statistic would be one point from that distribution. If the observed value were way out in the
tail of the distribution, we would have to conclude either that an unlikely event had occurred
or that the null hypothesis was not, in fact, true. The more extreme the observed value, the less
confidence we would have in the null hypothesis.
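A minimal sketch of this procedure for a one-sample test of H0: θ = θ0 about a population mean, using a t statistic and made-up data; the sample and θ0 are illustrative, and scipy.stats.ttest_1samp would give the same answer in a single call.

```python
import numpy as np
from scipy import stats

# Made-up sample; H0 states that the population mean is theta0 = 100.
sample = np.array([103., 98., 110., 105., 99., 107., 112., 101., 104., 109.])
theta0 = 100.0

n = len(sample)
se = sample.std(ddof=1) / np.sqrt(n)        # estimated standard error of the mean
t_stat = (sample.mean() - theta0) / se      # how far out in the tail are we?

# Two-sided p-value: probability, under H0, of a statistic at least this extreme,
# taken from the t distribution with n - 1 degrees of freedom.
p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 1)

print(t_stat, p_value)   # a small p-value is evidence against H0
```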
Hypothesis Testing in Context:
In data mining, however, analyses can become more complicated. Firstly, because data
mining involves large data sets, we should expect to obtain statistical significance: even slight
departures from the hypothesized model form will be identified as significant, even though
they may be of no practical importance. (If they are of practical importance, of course, then
well and good.) Worse, slight departures from the model arising from contamination or data
distortion will show up as significant. We have already remarked on the inevitability of this
problem.
Secondly, sequential model fitting processes are common. Beginning in chapter 8
we will describe various stepwise model fitting procedures, which gradually refine a model by
adding or deleting terms. Running separate tests on each model, as if it were de novo, leads to
incorrect probabilities. Formal sequential testing procedures have been developed, but they
can be quite complex. Moreover, they may be weak because of the multiple testing going on.
Thirdly, the fact that data mining is essentially an exploratory process has various
implications. One is that many models will be examined. Suppose we test m true (though
we will not know this) null hypotheses at the 5% level, each based on its own subset of
the data, independent of the other tests. For each hypothesis separately, there is a probability
of 0.05 of incorrectly rejecting the hypothesis. Since the tests are independent, the probability
of incorrectly rejecting at least one is p = 1 - (1 - 0.05)^m. When m = 1 we have p = 0.05,
which is fine. But when m = 10 we obtain p = 0.4013, and when m = 100 we obtain p =
0.9941. Thus, if we test as few as even 100 true null hypotheses, we are almost certain to
incorrectly reject at least one.
Alternatively, we could control the overall family error rate, setting the probability of
incorrectly rejecting one or more of the m true null hypotheses to 0.05. In this case we use
0.05 = 1 - (1 - α)^m for each given m to obtain the level α at which each of the separate null
hypotheses is tested. With m = 10 we obtain α = 0.0051, and with m = 100 we obtain α =
0.0005. This means that we have a very small probability of incorrectly rejecting any of the
separate component hypotheses.
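The arithmetic in these two paragraphs is easy to reproduce; the sketch below prints both the family-wise error rates and the per-test levels implied by the formula above (sometimes called the Šidák correction).

```python
# Probability of incorrectly rejecting at least one of m independent true
# null hypotheses when each is tested at the 5% level.
for m in (1, 10, 100):
    p = 1 - (1 - 0.05) ** m
    print(m, round(p, 4))          # 0.05, 0.4013, 0.9941

# Per-test level alpha that keeps the overall family error rate at 0.05,
# obtained by solving 0.05 = 1 - (1 - alpha)^m.
for m in (10, 100):
    alpha = 1 - (1 - 0.05) ** (1 / m)
    print(m, round(alpha, 4))      # 0.0051, 0.0005
```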
10. Sampling Methods
Sampling is the process of selecting a small number of elements from a larger defined
target group of elements such that the information gathered from the small group will allow
judgments to be made about the larger group.
Basics of Sampling Theory:
Population
Element
Defined target population
Sampling unit
Sampling frame
Sampling Methods:
Types of Sampling Methods:
Probability
Simple random sampling
Systematic random sampling
Stratified random sampling
Cluster sampling
Simple random sampling:
is a method of probability sampling in which every unit has an equal,
nonzero chance of being selected.
Systematic random sampling:
is a method of probability sampling in which the defined target
population is ordered and the sample is selected according to position using a skip interval.
Steps in Drawing a Systematic Random Sample:
Obtain a list of units that contains an acceptable frame of the target population
Determine the number of units in the list and the desired sample size
Compute the skip interval
Determine a random start point
Beginning at the start point, select the units by choosing each unit that corresponds to
the skip interval.
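A minimal sketch of these steps in Python, with a made-up sampling frame of 1,000 units and a desired sample of 50.

```python
import random

# Steps 1-2: an illustrative ordered sampling frame and the desired sample size.
frame = list(range(1, 1001))        # 1,000 population units
sample_size = 50

# Step 3: compute the skip interval.
skip = len(frame) // sample_size    # 20

# Step 4: pick a random start point within the first interval.
start = random.randrange(skip)

# Step 5: beginning at the start point, take every skip-th unit.
sample = frame[start::skip][:sample_size]
```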
Stratified random sampling:
is a method of probability sampling in which the population is
divided into different subgroups and samples are selected from each.
Steps in Drawing a Stratified Random Sample:
Divide the target population into homogeneous subgroups or strata
Draw random samples from each stratum
Combine the samples from each stratum into a single sample of the target population.
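A minimal sketch of these steps, with made-up strata and a proportional allocation of the sample across strata (proportional allocation is an assumption here; equal allocation per stratum is also common).

```python
import random

# Step 1: illustrative strata (homogeneous subgroups of the target population).
strata = {
    "urban":    list(range(0, 600)),     # 600 units
    "suburban": list(range(600, 900)),   # 300 units
    "rural":    list(range(900, 1000)),  # 100 units
}
total = sum(len(units) for units in strata.values())
sample_size = 100

# Steps 2-3: draw a random sample from each stratum and combine them.
sample = []
for name, units in strata.items():
    k = round(sample_size * len(units) / total)   # proportional allocation
    sample.extend(random.sample(units, k))
```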
Cluster sampling:
method by which the population is divided into groups (clusters), any
of which can be considered a representative sample. These clusters are mini-populations and
therefore are heterogeneous. Once clusters are established, a random draw is done to select
one (or more) clusters to represent the population. Area and systematic sampling (discussed
earlier) are two common methods.
Nonprobability
Convenience sampling
Judgment sampling
Quota sampling
Snowball sampling
Convenience sampling:
samples drawn at the convenience of the interviewer. People tend to
make the selection at familiar locations and to choose respondents who are like themselves.
Judgment sampling:
samples that require a judgment or an “educated guess” on the part of
the interviewer as to who should represent the population.
Also, “judges” (informed
individuals) may be asked to suggest who should be in the sample.
Quota sampling:
samples that set a specific number of certain types of individuals to be
interviewed.
• Often used to ensure that convenience samples will have the desired proportion of
different respondent classes.
Snowball sampling (Referral sample):
samples which require respondents to provide the names of additional
respondents.
• Members of the population who are less well known, disliked, or whose opinions
conflict with the respondent have a low probability of being selected.