Chapter 13






or P(die < 4) = P(1)+P(2)+P(3) = 1/2
What is the probability of the event of drawing an Ace from a deck of cards, P(Ace)?

for a die numbers 1, 3 and 5.
Example
Lucky(Ace) = true
P(Lucky=true) = 1/52 + 1/52 + 1/52 + 1/52 = 4/52 = 1/13

For Boolean random variables
Odd and LT4 (i.e. numbers less than 4)
What is:
- LT4(5)
- ¬LT4
- P(¬LT4)
- Odd ∨ LT4
- P(Odd ∨ LT4)

P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
P(A ∧ B) is subtracted because the intersection of A and B is otherwise counted twice.
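The addition rule can be checked by brute force on the die example. The sketch below is only an illustration: it assumes a fair six-sided die, and the helper names (`prob`, `odd`, `lt4`) are made up for this example.

```python
from fractions import Fraction

OUTCOMES = range(1, 7)  # faces of a fair six-sided die (assumption)

def prob(event):
    """Probability of an event given as a predicate over die faces."""
    favourable = sum(1 for face in OUTCOMES if event(face))
    return Fraction(favourable, len(OUTCOMES))

def odd(face):
    return face % 2 == 1

def lt4(face):
    return face < 4

# Count the outcomes in Odd or LT4 directly ...
p_or_direct = prob(lambda face: odd(face) or lt4(face))
# ... and compare with P(Odd) + P(LT4) - P(Odd and LT4).
p_and = prob(lambda face: odd(face) and lt4(face))
p_or_rule = prob(odd) + prob(lt4) - p_and

print(p_or_direct, p_or_rule)
```

Both values agree because the faces 1 and 3, which are both odd and less than 4, are counted once in each of P(Odd) and P(LT4).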


Example
The prior or unconditional probability of the event of drawing the Ace of spades from a deck of cards is P(Ace of spades) = 1/52.
After drawing a card, P(Ace of spades) = 0 or 1.
Example - Joint probability table P(A,B):

| | a | ¬a |
|---|---|---|
| b | 0.11 | 0.09 |
| ¬b | 0.63 | 0.17 |

- P(a) = Σ P(a,b) = 0.11 + 0.63 (sum out b)
- P(¬a) = Σ P(¬a,b) = 0.09 + 0.17 (sum out b)
- P(a ∧ b) = 0.11
- P(a ∨ b) = P(a) + P(b) - P(a ∧ b) = (0.11+0.63) + (0.11+0.09) - 0.11
- What is the prior probability P(b)?
- Of ¬a ∨ b?




Example: P(A,B)

| | a | ¬a |
|---|---|---|
| b | 0.11 | 0.09 |
| ¬b | 0.63 | 0.17 |

- P(a) = 0.11 + 0.63 = 0.74
- P(¬a) = 0.09 + 0.17
- P(a ∧ b) = 0.11
- P(a ∨ b) = P(a) + P(b) - P(a ∧ b) = (0.11+0.63) + (0.11+0.09) - 0.11
- P(b|a) = P(b ∧ a) / P(a) = 0.11 / 0.74 ≈ 0.15, the conditional probability of proposition b given proposition a
- What is the conditional probability P(a|b)?
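Summing out a variable and conditioning can both be read straight off the joint table. The following is a minimal sketch using the four entries of the example; the dictionary layout and function names are assumptions made for illustration.

```python
# Joint distribution P(A, B) from the example, keyed by (a, b) truth values.
joint = {
    (True, True): 0.11,    # P(a, b)
    (False, True): 0.09,   # P(not a, b)
    (True, False): 0.63,   # P(a, not b)
    (False, False): 0.17,  # P(not a, not b)
}

def marginal_a(a):
    """P(A = a), obtained by summing out B."""
    return sum(p for (a_val, _), p in joint.items() if a_val == a)

def marginal_b(b):
    """P(B = b), obtained by summing out A."""
    return sum(p for (_, b_val), p in joint.items() if b_val == b)

def b_given_a(b, a):
    """P(B = b | A = a) = P(a, b) / P(a)."""
    return joint[(a, b)] / marginal_a(a)

print(round(marginal_a(True), 2))       # P(a)
print(round(b_given_a(True, True), 2))  # P(b | a)
```

The same pattern, with the roles of A and B swapped, gives the remaining conditional probabilities.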

Full joint distribution: P(Toothache, Cavity, Catch) represented in a 2x2x2 table.

What is P(cavity), the probability that the proposition cavity is true?

What is P(cavity ∧ ¬toothache)?

What is P(cavity | toothache)?

Normalization - the process whereby unnormalized posterior (conditional) probabilities are multiplied by a constant α (equivalently, divided by a fixed value) to ensure that they sum to 1. It is a useful shortcut in probability calculations because P(A) = 1 - P(¬A).
α of P(cavity, toothache) = 1 / P(toothache) = 1 / (0.012 + 0.108 + 0.016 + 0.064) = 1 / 0.2 = 5
α of P(toothache, catch)
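The normalization constant (α) can be illustrated with the four toothache entries used in the calculation above. This is a sketch only: the assignment of each number to a (cavity, catch) cell is an assumption based on the usual layout of this textbook table, since the table itself is not reproduced in these notes.

```python
# Joint entries with toothache = true.
# ASSUMPTION: which value belongs to which cell follows the standard table.
toothache_entries = {
    ("cavity", "catch"): 0.108,
    ("cavity", "no_catch"): 0.012,
    ("no_cavity", "catch"): 0.016,
    ("no_cavity", "no_catch"): 0.064,
}

# Unnormalized values P(Cavity, toothache), obtained by summing out Catch.
unnormalized = {}
for (cavity, _catch), p in toothache_entries.items():
    unnormalized[cavity] = unnormalized.get(cavity, 0.0) + p

# alpha = 1 / P(toothache); multiplying by it makes the values sum to 1.
alpha = 1.0 / sum(unnormalized.values())
posterior = {cavity: alpha * p for cavity, p in unnormalized.items()}

print(round(alpha, 3), {k: round(v, 3) for k, v in posterior.items()})
```

Note that P(toothache) is never looked up separately; it falls out as the sum of the unnormalized values.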

Inference by enumeration works, but why is it often not practical?

- B conditionally dependent on A
P(A ∧ B) = P(B|A) * P(A)
- Independent A and B
P(A ∧ B) = P(B) * P(A)
- Toothache depends on cavity
- Catch depends on cavity
- Weather is independent
P(Toothache|Cavity)
P(Catch|Cavity)



- The full joint distribution has size 8; conditional independence reduces the size to 5. Use the definition of conditional probability to demonstrate.
- Suppose we had 32 Boolean variables, what is the size of the full joint distribution?
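The table sizes quoted above can be reproduced with a short calculation. This is a sketch under the assumption that "size" means the number of table entries for Boolean variables, and that the factored form stores P(Cavity), P(Toothache|Cavity) and P(Catch|Cavity); the function name is hypothetical.

```python
def full_joint_size(num_boolean_vars):
    """Number of entries in a full joint table over Boolean variables."""
    return 2 ** num_boolean_vars

# Factored form: one number for P(cavity), two for P(toothache | Cavity)
# (one per value of Cavity) and two for P(catch | Cavity).
factored_size = 1 + 2 + 2

print(full_joint_size(3), factored_size)
print(full_joint_size(32))
```

The exponential growth in the first function is the reason inference by enumeration over the full joint becomes impractical as variables are added.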


What does the above say about P(m|s)? Calculate P(s|m): given that one has meningitis, the probability that they have a stiff neck.
Example

| | A | ¬A |
|---|---|---|
| B | 0.11 | 0.09 |
| ¬B | 0.63 | 0.17 |

- P(A) = 0.11 + 0.63 = 0.74
- P(¬A) = 0.09 + 0.17
- P(A ∧ B) = 0.11
- P(A ∨ B) = P(A) + P(B) - P(A ∧ B) = (0.11+0.63) + (0.11+0.09) - 0.11
- P(B|A) = P(B ∧ A) / P(A) = 0.11 / 0.74 ≈ 0.15, the probability of B given A
- P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A) = P(B ∧ A)
Bayes' Rule
Dividing P(A|B)P(B) = P(B|A)P(A) by P(B) gives P(A|B) = P(B|A)P(A) / P(B)
Example: Medical diagnosis

| | HT | ¬HT |
|---|---|---|
| C | 0.00008 | 0.00002 |
| ¬C | 0.00092 | 0.99918 |

How is P(HT|C) calculated from the joint probability table? Verify P(HT|C)=0.80
What is P(¬HT|¬C)?
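One way to verify P(HT|C) = 0.80 is to apply Bayes' rule to quantities read off the joint table. The sketch below uses only the three table entries that this check needs; the variable names are illustrative.

```python
# Entries of the medical-diagnosis joint table used in this check.
p_ht_and_c = 0.00008       # P(HT, C)
p_not_ht_and_c = 0.00002   # P(not HT, C)
p_ht_and_not_c = 0.00092   # P(HT, not C)

# Marginals obtained by summing out the other variable.
p_c = p_ht_and_c + p_not_ht_and_c    # P(C)
p_ht = p_ht_and_c + p_ht_and_not_c   # P(HT)

# Conditional in the "other" direction, then Bayes' rule.
p_c_given_ht = p_ht_and_c / p_ht               # P(C | HT)
p_ht_given_c = p_c_given_ht * p_ht / p_c       # P(HT | C) by Bayes' rule

print(round(p_c_given_ht, 2), round(p_ht_given_c, 2))
```

Here Bayes' rule gives the same value as dividing P(HT, C) by P(C) directly; the rule becomes useful when only P(C|HT), P(HT) and P(C) are known.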

A Bayesian belief network is represented as a directed acyclic graph.
- Nodes represent evidence or hypotheses
- Arc represents a dependence between two nodes.
Independent A and B
P(B ∧ A) = P(A) * P(B)
and
P(B|A) = P(B)
that is, the likelihood of B is unaffected by whether or not A occurs.
Dependent B on A
P(B ∧ A) = P(A) * P(B|A)
P(B|A) ≠ P(B)
Connection types:
- Serial: P(A,V,B) = P(A)*P(V|A)*P(B|V)
- Diverging: P(V,A,B) = P(V)*P(A|V)*P(B|V)
- Converging: P(A,B,V) = P(A)*P(B)*P(V|A,B)
Example
- Bayesian belief network
- Nodes represent evidence or hypotheses
- Arc represents a dependence between two nodes.
- P(A) = 0.1
- P(B) = 0.7
- P(C|A) = 0.2
- P(C|¬A) = 0.4
- P(D|A∧B) = 0.5
- P(D|A∧¬B) = 0.4
- P(D|¬A∧B) = 0.2
- P(D|¬A∧¬B) = 0.0001
- P(E|B) = 0.2
- P(E|¬B) = 0.1
A has only a prior probability since it is independent.
B has only a prior probability since it is independent.
C is dependent on A: 2 cases, A and ¬A.
D is dependent on A and B: 4 cases.
E is dependent on B: 2 cases, B and ¬B.
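Expressed as conditional probability tables:

P(A) = 0.1
P(B) = 0.7

| A | P(C) |
|---|---|
| true | 0.2 |
| false | 0.4 |

| B | P(E) |
|---|---|
| true | 0.2 |
| false | 0.1 |

| A | B | P(D) |
|---|---|---|
| true | true | 0.5 |
| true | false | 0.4 |
| false | true | 0.2 |
| false | false | 0.0001 |

Given A and B are true, P(D) = 0.5
etc.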
Joint probability using the definition of conditional probability
- P(B|A) = P(B ∧ A) / P(A)
Hence
P(A,B,C,D,E) = P(E|A,B,C,D)*P(A,B,C,D)
applying this rule recursively:
P(A,B,C,D,E) = P(E|A,B,C,D)*P(D|A,B,C)*P(C|A,B)*P(B|A)*P(A)
Observing that:
E is not dependent on A, C or D
P(E|A,B,C,D) = P(E|B)
C is dependent only on A
P(C|A,B) = P(C|A)
D is dependent only on A and B (A and B are independent)
P(D|A,B,C) = P(D|A,B) = P(A,B|D)*P(D) / P(A ∧ B)
B is independent of A so
P(B|A) = P(B)
can reduce
P(A,B,C,D,E) = P(E|A,B,C,D)*P(D|A,B,C)*P(C|A,B)*P(B|A)*P(A)
= P(E|B)*P(D|A,B)*P(C|A)*P(B)*P(A)
Note that to calculate the joint probability, the nodes must be ordered such that if a node X is dependent on node Y, Y appears before X. Either of the following would also work:
P(B,A,C,D,E)
P(A,C,B,D,E)
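The reduced product P(E|B)*P(D|A,B)*P(C|A)*P(B)*P(A) can be evaluated mechanically from the conditional probability values of the example network. The following is a minimal sketch; the dictionary encoding and the helper `pick` are assumptions made for illustration.

```python
from itertools import product

P_A = 0.1
P_B = 0.7
P_C_GIVEN_A = {True: 0.2, False: 0.4}
P_E_GIVEN_B = {True: 0.2, False: 0.1}
P_D_GIVEN_AB = {
    (True, True): 0.5, (True, False): 0.4,
    (False, True): 0.2, (False, False): 0.0001,
}

def pick(p_true, value):
    """Return P(X = value) given P(X = true)."""
    return p_true if value else 1.0 - p_true

def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) * P(B) * P(C|A) * P(D|A,B) * P(E|B)."""
    return (pick(P_A, a) * pick(P_B, b) * pick(P_C_GIVEN_A[a], c)
            * pick(P_D_GIVEN_AB[(a, b)], d) * pick(P_E_GIVEN_B[b], e))

all_true = joint(True, True, True, True, True)
# Sanity check: the 32 joint entries should sum to 1.
total = sum(joint(*values) for values in product([True, False], repeat=5))

print(round(all_true, 6), round(total, 6))
```

Only 10 numbers are stored here, yet any of the 32 entries of the full joint distribution can be recovered on demand.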
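Example

P(C) = 0.2

| C | P(S) |
|---|---|
| true | 0.8 |
| false | 0.2 |

| C | P(P) |
|---|---|
| true | 0.6 |
| false | 0.5 |

| S | P | P(E) |
|---|---|---|
| true | true | 0.6 |
| true | false | 0.9 |
| false | true | 0.1 |
| false | false | 0.2 |

| P | P(F) |
|---|---|
| true | 0.9 |
| false | 0.7 |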
P(¬C) = 1 - P(C)
P(F|P) = 0.9 since P is true
P(E|S∧P) = 0.6 since S and P are true
What is the probability of:
- having fun given that you party?
- passing exams if you don't study and don't party?
P(C=true, S=true, P=false, E=true, F=false) or P(C,S,¬P,E,¬F)
the probability that you will go to college, study, not party, pass exams and not have fun!
P(C,S,¬P,E,¬F) = P(C)*P(S|C)*P(¬P|C)*P(E|S∧¬P)*P(¬F|¬P)
= .2*.8*.4*.9*.3
= 0.01728
In a Bayesian belief network, because there is no direct connection between C and E, E is independent of C given S and P.
P(E|C∧S∧P) = P(E|S∧P) = 0.6
Compute the probability of going to college, studying, partying, passing exams and having fun.
More complex conditional probabilities can also be calculated, for example passing exams given that you have fun, go to college, study and do not party:

P(E|F∧C∧S∧¬P) = P(E|S ∧ ¬P) = 0.9
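Both results in this example can be reproduced from the conditional probability tables. The sketch below encodes the tables as dictionaries (an assumed layout) and obtains the conditional probability by enumeration, without relying on the independence shortcut.

```python
P_C = 0.2
P_S_GIVEN_C = {True: 0.8, False: 0.2}
P_P_GIVEN_C = {True: 0.6, False: 0.5}
P_E_GIVEN_SP = {
    (True, True): 0.6, (True, False): 0.9,
    (False, True): 0.1, (False, False): 0.2,
}
P_F_GIVEN_P = {True: 0.9, False: 0.7}

def pick(p_true, value):
    """Return P(X = value) given P(X = true)."""
    return p_true if value else 1.0 - p_true

def joint(c, s, p, e, f):
    """P(C,S,P,E,F) = P(C) * P(S|C) * P(P|C) * P(E|S,P) * P(F|P)."""
    return (pick(P_C, c) * pick(P_S_GIVEN_C[c], s) * pick(P_P_GIVEN_C[c], p)
            * pick(P_E_GIVEN_SP[(s, p)], e) * pick(P_F_GIVEN_P[p], f))

# College, study, no party, pass exams, no fun.
p_example = joint(True, True, False, True, False)

# P(E | F, C, S, not P) by enumeration: fix the evidence, sum over E.
numerator = joint(True, True, False, True, True)
denominator = numerator + joint(True, True, False, False, True)
p_e_given_evidence = numerator / denominator

print(round(p_example, 5), round(p_e_given_evidence, 3))
```

The enumerated conditional matches the table entry for P(E | S ∧ ¬P), which is what the independence argument predicts.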
Diagnoses
Can make diagnoses by determining posterior probabilities.
- C go to college
- F have fun
- S study
- E passed exams
- P partied
We want to determine whether you partied or not, given C, S, F and E.
P(C∧S∧F∧E∧P) = P(C)*P(S|C)*P(P|C)*P(E|S∧P)*P(F|P)
= .2*.8*.6*.6*.9 = .05184
P(C∧S∧F∧E∧¬P) = P(C)*P(S|C)*P(¬P|C)*P(E|S∧¬P)*P(F|¬P)
= .2*.8*.4*.9*.7 = .04032
so it is more likely that you partied.
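The two joint probabilities can be normalized to give an actual posterior for the diagnosis. This sketch simply reuses the two products computed above; the variable names are illustrative.

```python
# Joint probabilities of the evidence (C, S, F, E) with and without partying.
p_with_party = 0.2 * 0.8 * 0.6 * 0.6 * 0.9      # P(C, S, F, E, P)
p_without_party = 0.2 * 0.8 * 0.4 * 0.9 * 0.7   # P(C, S, F, E, not P)

# Normalize so that the two posteriors sum to 1.
alpha = 1.0 / (p_with_party + p_without_party)
posterior_party = alpha * p_with_party
posterior_no_party = alpha * p_without_party

print(round(posterior_party, 4), round(posterior_no_party, 4))
```

Comparing the unnormalized values is enough to pick the more likely diagnosis; normalizing additionally shows how strong the preference is.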
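Training Example

| x | y | z | Classification |
|---|---|---|---|
| 2 | 3 | 2 | A |
| 4 | 1 | 4 | B |
| 1 | 3 | 2 | A |
| 2 | 4 | 3 | A |
| 4 | 2 | 4 | B |
| 2 | 1 | 3 | C |
| 1 | 2 | 4 | A |
| 2 | 3 | 3 | B |
| 2 | 2 | 4 | A |
| 3 | 3 | 3 | C |
| 3 | 2 | 1 | A |
| 1 | 2 | 1 | B |
| 2 | 1 | 4 | A |
| 4 | 3 | 4 | C |
| 2 | 2 | 4 | A |

Summary classification table
A's - 8
B's - 4
C's - 3
15 total

Each table below counts, for one classification, the training examples in which an attribute takes a given value.

A

| value | x | y | z |
|---|---|---|---|
| 1 | 2 | 1 | 1 |
| 2 | 5 | 4 | 2 |
| 3 | 1 | 2 | 1 |
| 4 | 0 | 1 | 4 |

B

| value | x | y | z |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 2 | 0 | 2 |

C

| value | x | y | z |
|---|---|---|---|
| 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 0 |
| 3 | 1 | 2 | 2 |
| 4 | 1 | 0 | 1 |

Classify: (x=2, y=3, z=4)
Use P(ci) * Π P(dj|ci) to compute the posterior probability of ci
P(A)*P(x=2|A)*P(y=3|A)*P(z=4|A) = 8/15 * 5/8 * 2/8 * 4/8 = 0.0417 (maximum)
P(B)*P(x=2|B)*P(y=3|B)*P(z=4|B) = 4/15 * 1/4 * 1/4 * 2/4 = 0.0083
P(C)*P(x=2|C)*P(y=3|C)*P(z=4|C) = 3/15 * 1/3 * 2/3 * 1/3 = 0.015
Classify as A since it has the maximum posterior probability.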
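The counting and scoring steps of this classifier can be sketched in a few lines. The code below is an illustration rather than a reference implementation: it copies the 15 training examples from the table and uses plain relative frequencies; all function and variable names are assumptions.

```python
from collections import Counter, defaultdict

# Training examples (x, y, z, classification) copied from the training table.
TRAINING = [
    (2, 3, 2, "A"), (4, 1, 4, "B"), (1, 3, 2, "A"), (2, 4, 3, "A"),
    (4, 2, 4, "B"), (2, 1, 3, "C"), (1, 2, 4, "A"), (2, 3, 3, "B"),
    (2, 2, 4, "A"), (3, 3, 3, "C"), (3, 2, 1, "A"), (1, 2, 1, "B"),
    (2, 1, 4, "A"), (4, 3, 4, "C"), (2, 2, 4, "A"),
]

# Number of examples per class, and per (attribute index, value) within a class.
class_counts = Counter(row[-1] for row in TRAINING)
value_counts = defaultdict(Counter)
for *attributes, label in TRAINING:
    for index, value in enumerate(attributes):
        value_counts[label][(index, value)] += 1

def score(label, example):
    """P(c) times the product of P(d_j | c), from relative frequencies."""
    result = class_counts[label] / len(TRAINING)
    for index, value in enumerate(example):
        result *= value_counts[label][(index, value)] / class_counts[label]
    return result

def classify(example):
    """Return the best class and the score of every class."""
    scores = {label: score(label, example) for label in class_counts}
    return max(scores, key=scores.get), scores

best, scores = classify((2, 3, 4))
print(best, {label: round(s, 4) for label, s in scores.items()})
```

Because the scores are plain relative frequencies, any attribute value never seen with a class drives that class's score to zero.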
Problems - when there is no training data from which to calculate a probability
Classify: (x=1, y=2, z=2)
P(A)*P(x=1|A)*P(y=2|A)*P(z=2|A) = 8/15 * 2/8 * 4/8 * 2/8 = 0.0167
P(B)*P(x=1|B)*P(y=2|B)*P(z=2|B) = 4/15 * 1/4 * 2/4 * 0/4 = 0
P(C)*P(x=1|C)*P(y=2|C)*P(z=2|C) = 3/15 * 0/3 * 0/3 * 0/3 = 0
m-estimate - an estimate of the probability of a specific attribute value given a specific classification:
(a + m*p) / (b + m)
a = number of training examples that match the attribute value (for P(x=1|C), a is the number of training examples where x=1 that are categorized as C. In the example, a=0)
b = total number of training examples categorized as C. In the example, b=3
p = prior estimate of the probability being sought. Usually assume each attribute value is equally likely; in the example with four values x=1,2,3 or 4, for P(x=1|C), p=1/4; for P(x=2|C), p=1/4, etc.
m = constant known as the equivalent sample size.
Example
Calculate the m-estimate for x=1 given a classification of C; P(x=1|C).
Pick m = 5.
(a + m*p) / (b + m) = (0 + 5*1/4) / (3 + 5) = 0.156
Classify: (x=1, y=2, z=2) using m-estimates (m = 5, p = 1/4) in P(ci) * Π P(dj|ci)
Category A
- x=1: (2 + 5/4) / (8 + 5) = 0.25
- y=2: (4 + 5/4) / (8 + 5) = 0.404
- z=2: (2 + 5/4) / (8 + 5) = 0.25
P(A)*P(x=1|A)*P(y=2|A)*P(z=2|A) = 8/15 * 0.25 * 0.404 * 0.25 = 0.0135
Category B
- x=1: (1 + 5/4) / (4 + 5) = 0.25
- y=2: (2 + 5/4) / (4 + 5) = 0.36
- z=2: (0 + 5/4) / (4 + 5) = 0.139
P(B)*P(x=1|B)*P(y=2|B)*P(z=2|B) = 4/15 * 0.25 * 0.36 * 0.139 = 0.0033
Category C
- x=1: (0 + 5/4) / (3 + 5) = 0.156
- y=2: (0 + 5/4) / (3 + 5) = 0.156
- z=2: (0 + 5/4) / (3 + 5) = 0.156
P(C)*P(x=1|C)*P(y=2|C)*P(z=2|C) = 3/15 * 0.156 * 0.156 * 0.156 = 0.0008
A was the correct classification after all.
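The m-estimate itself is a one-line formula. The sketch below wraps it in a hypothetical helper and evaluates it for two of the cases from the example, both of which have a zero count in the training data.

```python
def m_estimate(a, b, p, m):
    """Smoothed estimate (a + m*p) / (b + m) of P(attribute value | class).

    a: training examples of the class with this attribute value
    b: training examples of the class
    p: prior estimate of the probability (1/4 for four equally likely values)
    m: equivalent sample size
    """
    return (a + m * p) / (b + m)

# P(x=1 | C): no C example has x=1, and there are 3 C examples.
p_x1_given_c = m_estimate(a=0, b=3, p=1 / 4, m=5)
# P(z=2 | B): no B example has z=2, and there are 4 B examples.
p_z2_given_b = m_estimate(a=0, b=4, p=1 / 4, m=5)

print(round(p_x1_given_c, 3), round(p_z2_given_b, 3))
```

With m = 0 the formula reduces to the raw frequency a/b; larger values of m pull the estimate towards the prior p, which is what keeps a zero count from zeroing out the whole product.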






Consider 3 cases below:
P1,3 ∧ P2,2 ∧ P3,1
P1,3 ∧ P2,2 ∧ ¬P3,1
P1,3 ∧ ¬P2,2 ∧ P3,1
Do not consider the following because it is not possible given what is known:
P1,3 ∧ ¬P2,2 ∧ ¬P3,1


P(P1,3=true|known,b) = 0.31
P(P1,3=false|known,b) = 0.69
Recall that summing over the fringe (P2,2 and P3,1) is analogous to summing over a (hyper-dimensional) row of a joint distribution table, in this case the row where known is true and P1,3 is false or true.
P(P1,3=true)=0.2
- P(P2,2=true)=0.2 P(P3,1=true)=0.2
- P(P2,2=true)=0.2 P(P3,1=false)=0.8
- P(P2,2=false)=0.8 P(P3,1=true)=0.2
P(P1,3=false)=0.8
- P(P2,2=true)=0.2 P(P3,1=true)=0.2
- P(P2,2=true)=0.2 P(P3,1=false)=0.8
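The values 0.31 and 0.69 follow from summing the listed fringe cases and normalizing. The sketch below hard-codes the consistent (P2,2, P3,1) combinations exactly as listed above instead of deriving them from the breeze observations, and assumes a prior pit probability of 0.2 for every square.

```python
PIT_PRIOR = 0.2  # prior probability that any given square contains a pit

def prior(is_pit):
    return PIT_PRIOR if is_pit else 1.0 - PIT_PRIOR

# Fringe combinations (P2,2, P3,1) consistent with what is known,
# listed separately for P1,3 = true and P1,3 = false.
consistent_fringe = {
    True: [(True, True), (True, False), (False, True)],
    False: [(True, True), (True, False)],
}

# Prior of P1,3 times the summed probability of its consistent fringe cases.
unnormalized = {
    p13: prior(p13) * sum(prior(p22) * prior(p31) for p22, p31 in cases)
    for p13, cases in consistent_fringe.items()
}

alpha = 1.0 / sum(unnormalized.values())
posterior = {p13: alpha * value for p13, value in unnormalized.items()}

print(round(posterior[True], 2), round(posterior[False], 2))
```

Squares outside the fringe do not appear anywhere in this calculation, which is why the sum stays small even though the full joint distribution is large.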


Why are only P2,2 and P3,1 considered for P1,3=false? Why can OTHER be excluded from the calculation of conditional probability for P1,3?
