[[
wikihub
]]
Search
⌘K
Explore
People
For Agents
Sign in
Explore
People
For Agents
Sign in
@jemoka / Jemoka Knowledge Base / wiki/concepts/naive_bayes.md
Suggest edit
Cancel
Submit suggestion
Title
Name
Note
--- title: "Bayesian Network Naive Bayes" type: concept related: [Inference, Naive Bayes, Baysian Network, Bayes Theorem, Probability] source: https://www.jemoka.com/posts/kbhnaive_bayes/ confidence: high status: active --- Naive Bayes is a special class of Baysian Network inference problem which follows a specific structure used to solve classification problems. The Naive Bayes classifier is a Baysian Network of the shape: (Why is this backwards(ish)? Though we typically think about models as a function M(obs) = cls, the real world is almost kind of opposite; it kinda works like World(thing happening) = things we observe. Therefore, the observations are a RESULT of the class happening.) We consider, naively, \(o_{1:n}\) are all conditionally independent on \(c\). From this graph, we can therefore use the probability chain rule + conditional probability to write that: \begin{equation} P(c, o_{1:n}) = P( c) \prod_{i=1}^{n} P(o_{i} | c) \end{equation} so, to actually compute this, we don’t want to bother going over all the multiplications because of underflow, we write: \begin{equation} \hat{y} = \arg\max_{y} \log \hat{P}(y) + \sum_{i=1}^{m} \log \hat{P}(x|y) \end{equation} brute-force Bayes classifier \begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} \frac{P(x|y)P(y)}{P(x)} \end{equation} but because we are taking argmax, we can not normalize: \begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} P(x|y)P(y) \end{equation} this only works if \(x\) is a single value (i.e. you have a one-feature classifier This system has 6 parameters; they can be MLE for Bernouli from data, but you can also use Baysian Parameter Learning Method y = 0 y = 1 x1 = 0 theta0 theta2 x1 = 1 theta1 theta3 y = 0 y = 0 theta4 y = 1 theta5 (=1-theta4) to perform estiimation with MAP \begin{equation} p(X=1| Y=0) = \frac{\text{examples where X=1, Y=0}}{\text{examples where Y=0}} \end{equation} whith MLE with a Laplace prior: \begin{equation} p(X=1| Y=0) = \frac{\text{(examples where X=1, Y=0)}+1}{\text{(examples where Y=0)}+\text{(nclass = 2)}} \end{equation} We can keep going; for instance, if you wave \(x_1, x_2\) two diffferent features: \begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} P(x_1, x_2|y)P(y) \end{equation} but this requires us to have \(2^{2}\) and ultimately \(2^{n}\) parameters, which is exponential blowup. Hence, we need to treat the variables as—naivly—independent so we can multiply them. Hence: Naive Bayes assumption we assume independence between the input features. That is, we assume: \begin{equation} P(x_1, \dots, x_{n}|y) = \prod_{i=1}^{n} P(X_{i}|y) \end{equation} inference with Naive Bayes Recall the definition of inference, for our case here: given observations \(o_{1:n}\), we desire to know what’s the probability of \(c\) happening. That is, from conditional probability: \begin{equation} P(c | o_{1:n}) = \frac{P(c, o_{1:n})}{P(o_{1:n})} \end{equation} Now, from above we have \(P(c, o_{1:n})\) already. To get the denominator, we invoke law of total probability to add up the probability of all observations occurring given all classes. That is: \begin{equation} P(o_{1:n}) = \sum_{c \in C} P(c, o_{1:n}) \end{equation} You will note that this value \(P(o_{1:n})\) is actually constant as long as the network structure does not change. Therefore, we tend to write: \begin{align} P(c | o_{1:n}) &= \frac{P(c, o_{1:n})}{P(o_{1:n})} \\ &= \kappa P(c, o_{1:n}) \end{align} or, that: \begin{equation} P(c|o_{1:n}) \propto P(c, o_{1:n}) \end{equation} “the probability of a class occurring given the inputs is proportional to the probability of that class occurring along with the inputs” Multiple believes \begin{equation} P(A=a | R_1) \propto P(R_1 | A=a) \cdot P(A=a) \end{equation} But now Motivation: Bayes rule This will give us: However, what if we don’t want to use the law of total probability to add up \(P(FB’)\)? We can actually write a relation that essentially reminds us that the fact that \(P(FB’)\) as not dependent on \(TSF\), so we can write: \begin{equation} P(TSF^{1}|FB^{1}) \porpto P(TSF^{1})P(FB^{1} | TSF^{1}) \end{equation}