Learning from Observations – The ultimate intelligent agent.
There are 3 main types of learning:
o Supervised learning – used in environments where an action is followed by immediate
feedback
o Reinforcement learning – used in environments where feedback on actions is not
immediate
o Unsupervised learning – used where there isn’t any feedback on actions!
A learning agent can be divided into 4 conceptual components (according to the book)
o Performance element – takes in percepts and decides on actions; up to this point this is
the only thing we’ve been concerned with!
o Learning element – responsible for making improvements to the performance element
o A critic – provides feedback on how the agent is doing. The critic’s performance
measure is fixed. Otherwise, the agent could adjust its performance measure instead of
its actions to “maximize” its performance measure. Kind of like patting oneself on the
back.
o Problem generator – suggests alternate, possibly sub-optimal, courses of action which
could help the agent learn about better courses of action
Theoretically, a learning agent could learn about every component of the performance element.
It could learn which percepts are important and their relationship to the utility function, what
its actions accomplish, and which actions one should take given a set of percepts.
Inductive Learning – Supervised Learning
Gather a set of input-output examples from some application: Training Set
e.g., stock forecasting
Train the learning model (decision tree, neural network, etc.) on the training set until “done”
The Goal is to “generalize” on novel data not yet seen
Gather a further set of input-output examples from the same application: Test Set in order to
validate what the system is doing
Use the learning system on actual data
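As a concrete illustration of this workflow, here is a minimal sketch using scikit-learn; the dataset and the decision-tree model are assumptions chosen for illustration, not prescribed by these notes.

```python
# A minimal sketch of the train/test workflow above, using scikit-learn.
# The iris dataset and DecisionTreeClassifier are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Training set for fitting; test set of novel data for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)                           # train until "done"
print("test accuracy:", model.score(X_test, y_test))  # generalization check
```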
Formally, given
a function f
a set of examples (x, f(x))
produce h such that h approximates f
Inductive learning is defined to be the process of learning from pre-classified examples.
Given a training set T = {e1, e2, . . ., en}, where each ei = (a, o) = (a1 a2 . . . am, o) pairs a
vector of m attribute values with an output:
1. Choose h such that Σ_{e∈T} (e.o − h(e.a))² is minimized.
2. Hypothesis “goodness” = Σ_{e∈V} (e.o − h(e.a))², measured over a separate validation set V.
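This sum-of-squared-errors measure is simple to compute; below is a sketch in which the hypothesis h and the example sets are hypothetical.

```python
# Sketch of the sum-of-squared-errors "goodness" measure defined above.
def goodness(h, examples):
    """examples is a list of (a, o) pairs; lower scores are better."""
    return sum((o - h(a)) ** 2 for a, o in examples)

h = lambda a: 2 * a                      # some learned hypothesis (illustrative)
V = [(1, 2.1), (2, 3.9), (3, 6.2)]       # validation examples (x, f(x))
print(goodness(h, V))                    # goodness on novel data
```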
Why Induction?
Cost and errors in programming a solution by hand
Domain knowledge may be limited (e.g., financial forecasting)
Encoding/extracting domain knowledge may be expensive (e.g., speech recognition)
Augment existing domain knowledge
Adaptability
General, easy-to use mechanism for a large set of applications
Do better than current approaches
Examples of Real World Problems
Medical diagnosis
Gene discovery
Stock market prediction
Loan underwriting
Mushroom identification
Voting records
Speech recognition
Learn This!
[Three plots of sample data points, y-axis from 0 to 1.4, x-axis from 0 to 7.]
What explanations do we prefer? Common Biases in learning:
Minimize Error on known examples
Information gain
Ockham’s Razor – Prefer the simplest hypothesis that describes the data (minimum message length, minimum description length).
Without bias, you cannot learn!
Bias influences what you will learn. Sometimes these biases are inherent in the basic
learning algorithm you choose, sometimes they are implicit in the error function you are
using.
So which biases are the “good” biases?
The conservation law of generalization performance [Schaffer, 1994] proves that no
learning bias can outperform any other bias over the space of all possible learning tasks.
Basic Approach to Inductive Learning
Select Application
Select Input features from the application (training set)
Train with selected learning model - Small is Good (Overfitting vs. Generalization)
Test learned hypothesis on novel data (test set)
Use on actual data
Decision Trees
Binary output. Takes a set of attributes for a given situation or object and outputs a yes/no
answer.
Typically, each internal node in the tree corresponds to a test on a single attribute. However,
you could have more complicated tests than that! And there are models that split on more than
just a single attribute. The test need only split the data reaching the node in some way.
The branches emanating from a decision node are labeled with the possible values of the test
for that branch.
Leaf nodes are where classification takes place. Leaf nodes are labeled with the Boolean
value that should be returned if the node is reached (yes/no).
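A tree of this shape can be represented directly in code; below is a minimal sketch, with class and field names that are illustrative rather than from any particular library.

```python
# One possible in-memory representation of the decision tree described above.
class Leaf:
    def __init__(self, label):
        self.label = label             # Boolean value returned (yes/no)

class Node:
    def __init__(self, attribute, branches):
        self.attribute = attribute     # attribute tested at this node
        self.branches = branches       # dict: attribute value -> subtree

def classify(tree, example):
    """Follow the branch labeled with the example's value for each
    tested attribute until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.attribute]]
    return tree.label
```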
Decision Tree Expressiveness
Any Boolean function can be represented by a decision tree.
In the worst case, the size of the tree needs to be exponential in the number of variables – i.e. a
full-on representation of the truth table. Example – parity.
Can we do better than that? Is there a representation for Boolean functions that has a worst
case performance better than a decision tree? Wouldn’t it be great if there was?
The hypothesis space for decision trees
The problem is, there are a whole lot of Boolean functions.
o A truth table with n variables has 2^n rows, 1 row for each possible truth setting of the
variables.
o For each row, the output of the function can be either a 0 or a 1.
o So, we have 2^n bits that can each be set to either 0 or 1.
o That means there are 2^(2^n) possible Boolean functions on n variables.
o This is the size of the hypothesis search space that we’re dealing with for propositional
languages, and there are more complex logics than propositional logic, and more
complicated search spaces.
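A few lines of Python make the double-exponential growth vivid:

```python
# The double exponential makes exhaustive search over hypotheses
# hopeless almost immediately:
for n in range(1, 6):
    print(n, 2 ** (2 ** n))   # number of Boolean functions on n variables
# prints: 1 4, 2 16, 3 256, 4 65536, 5 4294967296
```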
ID3/C4.5
Top down induction of decision trees
Non-incremental (incremental versions, such as ID5, are extensions)
Highly used and successful
Attribute features are discrete; the output is also discrete
Search for smallest tree is too complex
Use greedy iterative approach
ID3 Learning Approach
C is a set of examples
A test on attribute A partitions C into {C1, C2, ..., Cw} where w is the number of states of A
Start with TS as C and first find good A for root - Attribute which is "most important"
Continue recursively until sets unambiguously classified or there are no more "relevant"
features
- C4.5 actually expands and then prunes back statistically irrelevant partitions afterwards
Information Gain
Twenty Questions - what are good questions, ones which when asked decrease information
remaining
Assumes regularity in the data
What would be a good attribute test?
Information of a message in bits: I(m) = -log2(pm)
if there are 16 equiprobable messages, the information of one is -log2(1/16) = 4 bits
If a set S contains messages of only w types,
then the information of one message of type m is still I = -log2(pm), where pm is the
proportion of messages of type m
Gain Criterion
Info(S) = Entropy(S) = −Σ_i pi·log2(pi), where pi is the proportion of examples in S
belonging to class i
0 <= Info(S) <= log2(|output classes|)
Info(S) is the average amount of information needed to identify the class of an example in S
Expected Information after partitioning using A:
E(A) = InfoA(S) = Σ_{i=1..v} (|Si|/|S|)·Info(Si), where v = |A| is the number of values of A
and Si contains the examples taking the i-th value
Gain(A) = Info(S) - InfoA(S)
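These definitions translate directly into code. Below is a sketch; representing each example as a dict with a "class" key is an assumption made for illustration.

```python
import math
from collections import Counter

def info(S):
    """Info(S): entropy of the class distribution in S, in bits."""
    counts = Counter(e["class"] for e in S)
    return -sum((c / len(S)) * math.log2(c / len(S))
                for c in counts.values())

def info_after(S, A):
    """InfoA(S): expected information after partitioning S on attribute A."""
    return sum((n / len(S)) * info([e for e in S if e[A] == v])
               for v, n in Counter(e[A] for e in S).items())

def gain(S, A):
    return info(S) - info_after(S, A)
```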
ID3 Learning Algorithm
1. S = Training Set
2. Calculate gain for each remaining attribute
3. Select highest and create a new node for each partition
4. For each partition
- if one class, then end (label the leaf with that class)
- else if > 1 class, then go to step 2 with the remaining attributes
- else if empty, label with the most common class of the parent (or set as null)
- if attributes are exhausted (this will only happen for an inconsistent S), label with the
majority class
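A recursive sketch of steps 1–4, reusing the gain() function and the Leaf/Node classes from the earlier sketches (all names illustrative):

```python
from collections import Counter

def id3(S, attributes):
    classes = Counter(e["class"] for e in S)
    if len(classes) == 1:                     # unambiguously classified
        return Leaf(next(iter(classes)))
    if not attributes:                        # inconsistent S: majority class
        return Leaf(classes.most_common(1)[0][0])
    A = max(attributes, key=lambda a: gain(S, a))   # "most important"
    rest = [a for a in attributes if a != A]
    # Empty partitions cannot arise here, since we branch only on values
    # that actually occur in S.
    return Node(A, {v: id3([e for e in S if e[A] == v], rest)
                    for v in {e[A] for e in S}})
```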
ID3 Example and Discussion
Attributes which best discriminate between classes are chosen
If the same ratios are found in partitioned set, then gain is 0
Complexity: O(|TS| * |attributes| * |nodes|) - Polynomial
ID3 Noise Handling
Noise can cause an inability to converge 100% or a more complex decision tree (overfitting).
Thus, allow leaves with multiple classes.
If an attribute is truly irrelevant, the class proportions in the partitioned sets should be similar
to those in the initial set, and thus the gain is small. One could allow only attributes with gains
exceeding some threshold in order to sift out noise. However, this empirically tends to
disallow relevant attribute tests.
Use a chi-square test to decide confidence in whether an attribute is irrelevant. This gives the
best ID3 results. (It takes the amount of data into account, which the threshold approach
above does not.)
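One way to realize this test, sketched here with scipy; the 0.05 significance threshold is an assumed convention, and the example representation matches the earlier sketches.

```python
from scipy.stats import chi2_contingency

def attribute_is_relevant(S, A, alpha=0.05):
    values = sorted({e[A] for e in S})
    classes = sorted({e["class"] for e in S})
    # Contingency table: rows are values of A, columns are classes.
    table = [[sum(1 for e in S if e[A] == v and e["class"] == c)
              for c in classes] for v in values]
    _, p, _, _ = chi2_contingency(table)
    return p < alpha    # small p: this split is unlikely if A were irrelevant
```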
When you decide to not test with any more attributes, label node with either most common, or
with probability of most common (good for distribution vs function)
C4.5 handles noise by first growing the complete tree and then pruning any nodes where the
expected error could statistically decrease (the number of cases at the node is critical). One
could also use a test set for pruning.
ID3 - Missing Attribute Values - Learning
Throw out data with missing attributes – but missing values are too common, the discarded
examples could be important, and the learner is then not prepared to generalize over examples
with missing attributes
Set the attribute to its most probable value
Set the attribute to its most probable value given the example’s class – similar performance
Use a learning scheme (ID3, etc) to fill in attribute class where TS is made up of complete
examples and the initial output class is just another attribute. Better, but not always
empirically convincing
Let unknown be just another attribute value – for ID3 this has the anomaly of apparently
higher gain due to the extra attribute value; see the gain ratio criterion
Gain(A) = p(A is known) * (Info(S) - InfoA(S)) - Best empirical results, IV(A) also modified
to add additional unknown class
ID3 - Missing Attribute Values - Execution
When arriving at an attribute test for which the attribute is missing during execution
Each branch has a probability of being taken based on what percentage of TS examples went
down each branch
Take all branches, but carry a weight representing the probability. Weights could be further
modified (multiplied) by other missing attributes in current test example as they continue
down the tree.
Results in multiple active leaf nodes. Set output as leaf with highest weight, or sum weights
for each output class, and output the class with the largest sum
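A sketch of this weighted traversal, assuming each Node additionally stores branch_weights (the fraction of training examples that went down each branch); all names are illustrative.

```python
from collections import defaultdict

def classify_weighted(tree, example, weight=1.0, totals=None):
    totals = totals if totals is not None else defaultdict(float)
    if isinstance(tree, Leaf):
        totals[tree.label] += weight           # an active leaf node
        return totals
    value = example.get(tree.attribute)
    if value is not None:                      # attribute known: one branch
        classify_weighted(tree.branches[value], example, weight, totals)
    else:                                      # missing: take all branches
        for v, subtree in tree.branches.items():
            classify_weighted(subtree, example,
                              weight * tree.branch_weights[v], totals)
    return totals

# Output the class with the largest summed weight:
# max(classify_weighted(tree, x).items(), key=lambda kv: kv[1])[0]
```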
ID3/C4.5 Miscellaneous
Continuous data handled by testing all n-1 possible binary thresholds to see which gives best
gain ratio. The highest then becomes the binary threshold for the attribute at that level. Is
binary sufficient?
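A sketch of the threshold search, reusing info() from the gain sketch above; it scores splits with plain information gain for simplicity, where the text uses the gain ratio.

```python
def best_threshold(S, A):
    """Try the midpoints between consecutive distinct values of A and
    return the binary threshold with the highest information gain."""
    values = sorted({e[A] for e in S})
    best_t, best_g = None, -1.0
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                 # candidate binary threshold
        below = [e for e in S if e[A] <= t]
        above = [e for e in S if e[A] > t]
        g = info(S) - (len(below) / len(S) * info(below)
                       + len(above) / len(S) * info(above))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g
```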
Intelligibility of DT - critical? - C4.5 rules - transforms tree into prioritized rule list with
default (most common output for examples not covered by rules). Does simplification of
superfluous attributes by greedy elimination strategy (based on statistical error confidence as
in error pruning). Prunes less productive rules within rule classes
ID5 - Incremental ID3, enough information is stored at nodes to allow direct update with new
examples
ID3 - Conclusions
Good Empirical Results
Comparable application robustness and accuracy to neural networks, with faster learning
(though NNs are better with continuous data, both input and output)
Most used and well known of current systems - used widely to aid in creating rules for expert
systems
Barriers to inductive learning:
o Finding a good hypothesis
o Obtaining data
o Social response
No algorithm outperforms random on all problems