Chapter 18
|




Chapter will cover learning methods for propositional logic.
input - x
output - f(x)
example - A pair (x, f(x))
h - Hypothesis
H - Hypothesis space, for example a set of polynomials to consider in fitting a function to data points.
Inductive learning - Given a collection of examples of f, return a function h that approximates f.



h = linear hypothesis, does not agree with f but simple and good prediction in most cases

h = nonlinear hypothesis, better agreement with f and good prediction in most cases

h = polynomial hypothesis, exact agreement with f but what about prediction

h = random hypothesis, precise agreement with f but poor prediction

Example - decide whether to wait on a table at a restaurant
- Example - example number
- Alt - is there an alternative restaurant
- Bar - bar area
- Fri - true on Fridays
- Hun - hungry
- Pat - how many patrons
- Price - price
- Rain - is it raining
- Res - do you have a reservation
- Type - type of food
- Est - estimated wait time interval
- Target - will wait decision tree output

Decision tree induction terms

Decision tree induction operation
- Reaches decision by performing sequence of tests on internal nodes of tree.
- Test value of one attribute at internal nodes with branches labeled with possible values tested.
- Leaf nodes specify decision value returned if that node is reached.
What is the will wait decision for the following attributes:
- Patrons some?
- Raining yes, Hungry yes, Alternate yes, Patrons Full, Wait estimate 10-30?
- Raining yes, Hungry yes, Alternate yes, Patrons Full, Wait estimate 30-60?
Represent 2 in propositional logic.
Example internal nodes
- Alt - is there an alternative restaurant
- Bar - bar area
- Fri - true on Fridays
- Hun - hungry
- Pat - how many patrons
- Price - price
- Rain - is it raining
- Res - do you have a reservation
- Type - type of food
- Est - estimated wait time interval
Leaf nodes
- Target - will wait decision tree output

Each row corresponds to a path in the tree.
While in propositional logic:
A xor B Û (A Ù ØB) Ú (ØA Ù B)
corresponds to only those paths from root to a leaf with a positive outcome.
For the AND operation:
- Give the truth table.
- How many other ways could the truth table be given?
- How many attributes?
- How many different decisions (leaf nodes) are there?
- Give the decision tree.
- Can it be represented more compactly (fewer nodes) and still reach the same decisions?
- Can XOR be represented more compactly (fewer nodes) and still reach the same decisions?

Can a program simply memorize all possible examples and use to answer questions? How many different restaurant examples are needed to provide one for all possible decisions?
Can a decision tree provide a prediction when there is no value given to select a path at an internal node?
Can a decision tree be constructed that needs only a subset of possible decision paths?


Best to choose an attribute that provides greatest amount of classification information.
Patrons gives classification information for values None (negative) and Some (positive).
Type gives no classification information.



Information in a message
I(P(v1),...P(vn)) = -P(vi)log2(P(vi)) = number of bits to represent the answer
I(fair coin) = -p(heads)log2(p(heads)) - p(tails)log2(p(tails))
= -1/2 log2(1/2) - 1/2 log2(1/2)
= -1/2 log22-1 - 1/2 log22-1
= -1/2(-1) -1/2(-1)
= 1Assuming equal probability of restaurant Will Wait examples:
p(yes) = 6/12, p(no) = 6/12
I(p(yes),p(no)) = I(1/2, 1/2) =
= -1/2 log2(1/2) -1/2 log2(1/2)
= -1/2 * -1 -1/2 *-1
= 1 bit
Choosing the right attribute
Choose attribute that gives most information. Determine amount of information by calculating remainder information after testing an attribute.
Usually no attribute will give complete information (e.g. 1 bit) but will give some.
Any attribute A divides training examples E into subset according to values of attribute A (Patrons=2 None, 4 Some, 6 Full).
Each subset Ei has pi positive and ni negative examples so need additional I(pi/pi+ni,ni/pi+ni) bits of information
Random example has ith value of A with probability (pi+ni/p+n) so on average need, after testing A, the remaining bits of information:
Remainder(A) = S (pi+ni/p+n) I(pi/pi+ni,ni/pi+ni)
Gain is difference between number of bits originally required and the remaining bit requirement.
Gain(A) = I(pi/pi+ni,ni/pi+ni) - Remainder(A)
Patrons
Will Wait None Some Full Yes 0 4 2 No 2 0 4 Remainder(Patrons) =
[2/12*I(0,1)+4/12*I(1,0)+6/12*I(2/6,4/6)] = 0.46 bitGain(Patrons) = 1 - Remainder(Patrons) = 1 - 0.46 = 0.54 bit
Price
Will Wait $ $$ $$$ Yes 3 2 1 No 4 0 2 Remainder(Price) =
[7/12*I(3/7, 4/7)+2/12*I(2,0)+3/12*I(1/3,2/3)] = 0.81 bitGain(Price) = 1 - Remainder(Price) = 1 - 0.81 = 0.19 bit
Original decision tree

Results from Decision Tree Learning
Gain Patrons greatest: 0.54 bits
After classifying on Patrons
Hungry, Price, Rain, Type, Fri/Sat tie with: 0.25 bits
After classifying on Hungry
Type: 0.5 bits
After classifying on Type
Fri: 1.0 bits
Decision learning tree results
Test Patrons 0.540852082973 Patrons = NONE ==> RESULT = No Patrons = FULL ==> Test Hungry 0.251629167388 Hungry = Yes ==> Test Type 0.5 Type = Burger ==> RESULT = Yes Type = Thai ==> Test Fri 1.0 Fri = Yes ==> RESULT = Yes Fri = No ==> RESULT = No Type = Italian ==> RESULT = No Hungry = No ==> RESULT = No Patrons = SOME ==> RESULT = YesPrediction:
Hungry:Yes, Patrons:FULL, Type:Thai, Fri:Yes => Yes
Program computation of attribute information in bits
Round 1 2 Patrons: Full 3 Hungry:Yes 4 Type: Thai Alternate 0.0
Bar 0.0
Fri 0.020
Hungry 0.195
Patrons 0.54
Price 0.195
Rain 0.02
Reservation 0.02
Type 0.0
Estimate 0.207Alternate 0.109
Bar 0.0
Fri 0.109
Hungry 0.251
Price 0.251
Rain 0.044
Reservation 0.251
Type 0.251
Estimate 0.251Alternate 0.0
Bar 0.0
Fri 0.311
Price 0.311
Rain 0.311
Reservation 0.311
Type 0.5
Estimate 0.0Alternate 0.0
Bar 0.0
Fri 1.0
Price 0.0
Rain 1.0
Reservation 0.0
Estimate 1.0
What is the will wait decision for the following attributes:
- Patrons Some, Raining yes, Hungry yes, Alternate yes, Type Italian, Wait estimate 10-30?
- Raining yes, Hungry yes, Alternate yes, Type Thai, Patrons Full, Fri yes, Wait estimate 10-30?
Example (Download program and example)
Credit History Debt Collateral Income Risk 1 BAD HIGH NONE $0 to $15K HIGH 2 UNKNOWN HIGH NONE $15 to $35K HIGH 3 UNKNOWN LOW NONE $15 to $35K MODERATE 4 UNKNOWN LOW NONE $0 to $15K HIGH 5 UNKNOWN LOW NONE over $35K LOW 6 UNKNOWN LOW ADEQUATE over $35K LOW 7 BAD LOW NONE $0 to $15K HIGH 8 BAD LOW ADEQUATE over $35K MODERATE 9 GOOD LOW NONE over $35K LOW 10 GOOD HIGH ADEQUATE over $35K LOW 11 GOOD HIGH NONE $0 to $15K HIGH 12 GOOD HIGH NONE $15 to $35K MODERATE 13 GOOD HIGH NONE over $35K LOW 14 BAD HIGH NONE $15 to $35K HIGH
Decision learning tree data
H,M,L='HIGH','MODERATE','LOW' B,U,G='BAD','UNKNOWN','GOOD' A,NONE='ADEQUATE','NONE' creditAttributes = ['Credit History', 'Debt', 'Collateral', 'Income'] creditExamples = [ ([U,H,NONE,'$15 to $35K'],H), ([U,L,NONE,'$15 to $35K'],M), ([U,L,NONE,'$0 to $15K'], H), ([U,L,NONE,'over $35K'], L), ([U,L,A, 'over $35K'], L), ([B,H,NONE,'$0 to $15K'], H), ([B,L,NONE,'$0 to $15K'], H), ([B,L,A, 'over $35K'], M), ([B,H,NONE,'$15 to $35K'],H), ([G,L,NONE,'over $35K'], L), ([G,H,A, 'over $35K'], L), ([G,H,NONE,'$0 to $15K'], H), ([G,H,NONE,'$15 to $35K'],M), ([G,H,NONE,'over $35K'] ,L) ] creditDecisionTree = DECISION_TREE_LEARNING(creditExamples,creditAttributes, None) creditDecisionTree.display() # print decision tree question = {'Credit History':U, 'Debt':H, 'Collateral':NONE, 'Income':'$15 to $35K'} print 'Question: ', question print 'Prediction: Risk ', creditDecisionTree.predict(question) # print prediction
Decision Learning Tree results
Test Income 0.966323671285
Income = over $35K -> Test Credit History 0.650022421648
Credit History = UNKNOWN -> RESULT = LOW
Credit History = BAD => RESULT = MODERATE
Credit History = GOOD -> RESULT = LOW
Income = $0 to $15K -> RESULT = HIGH
Income = $15 to $35K -> Test Credit History 0.5
Credit History = UNKNOWN -> Test Debt 1.0
Debt = HIGH -> RESULT = HIGH
Debt = LOW -> RESULT = MODERATE
Credit History = BAD -> RESULT = HIGH
Credit History = GOOD -> RESULT = MODERATEPrediction
Collateral: NONE, Debt: HIGH, Credit History: UNKNOWN,
Income: $15 to $35K -> Risk: HIGH
Program computation of attribute information in bits
Round 1 2 Income: $15 to $35K 3 Income: Over $35K Credit History 0.265
Debt 0.062
Collateral 0.206
Income 0.966
Credit History 0.5
Debt 0.311
Collateral 0.0
Credit History: Unknown Debt 1.0
Collateral 0.0Credit History 0.65
Debt 0.109
Collateral 0.19
Credit Risk Assessment
What is the Risk decision prediction for the following attributes:
- Income $0 to $15K?
- Income $15 to $35K, Credit History Unknown, Debt Low?
Does Collateral attribute contribute any information gain?
Information Theoretic Test Selection
Credit History Debt Collateral Income Risk 1 BAD HIGH NONE $0 to $15K HIGH 2 UNKNOWN HIGH NONE $15 to $35K HIGH 3 UNKNOWN LOW NONE $15 to $35K MODERATE 4 UNKNOWN LOW NONE $0 to $15K HIGH 5 UNKNOWN LOW NONE over $35K LOW 6 UNKNOWN LOW ADEQUATE over $35K LOW 7 BAD LOW NONE $0 to $15K HIGH 8 BAD LOW ADEQUATE over $35K MODERATE 9 GOOD LOW NONE over $35K LOW 10 GOOD HIGH ADEQUATE over $35K LOW 11 GOOD HIGH NONE $0 to $15K HIGH 12 GOOD HIGH NONE $15 to $35K MODERATE 13 GOOD HIGH NONE over $35K LOW 14 BAD HIGH NONE $15 to $35K HIGH Information in a message
I(P(v1),...P(vn)) = -P(vi)log2(P(vi)) = number of bits to represent the message
Assuming equal probability of Risk examples:
p(HIGH RISK) = 6/14, p(MODERATE RISK) = 3/14, p(LOW RISK) = 5/14
I(Risk) = -6/14 log2(6/14) -3/14 log2(3/14) -5/14 log2(5/14)
= -6/14 * -1.222 -3/14 *-2.222 -5/14 * -1.485
= 1.531 bitsRemainder(A) = S (pi+ni/p+n) I(pi/pi+ni,ni/pi+ni)
Gain(A) = I(pi/pi+ni,ni/pi+ni) - Remainder(A)
Income
Risk $0 to $15K $15K to $35K over $35K High 4 2 0 Medium 0 2 1 Low 0 0 5 Remainder(Income) =
[4/14*I(4/4,0,0)+4/14*I(2/4,2/4,0)+6/14*I(1/6,5/6)] = 0.564 bitGain(Income) = 1.531 - Remainder(Income) = 1.531 - 0.564 = 0.966 bit
The gain on selecting Credit History first is about 0.265. Give the formula to compute the information gain on Credit History.


