
Naive Bayes

Cornell University

From Lecture: Bayes Classifier and Naive Bayes

Generative vs. Discriminative Algorithm

Introduction - Estimating Probability Directly From Data

Something Generative

Our training data consists of the set $D=\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$ drawn from some unknown distribution $P(X,Y)$. Because all pairs are sampled i.i.d., we obtain

$$P(D)=P((\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n))=\prod_{\alpha=1}^n P(\mathbf{x}_\alpha,y_\alpha).$$

If we had enough data, we could estimate $P(X,Y)$ simply through counting:

$$\hat P(\mathbf{x},y) =\frac{\text{\# of times } (\mathbf{x},y) \text{ appears in our data}}{\text{\# of total data points}} = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n}$$

where $I(\mathbf{x}_i=\mathbf{x} \wedge y_i=y)=1$ if $\mathbf{x}_i=\mathbf{x}$ and $y_i=y$, and 0 otherwise.
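As a quick illustration, here is a minimal sketch of this counting estimator on a hypothetical toy dataset (the data and names are mine, purely for illustration):

```python
from collections import Counter

def estimate_joint(X, y):
    """Estimate P_hat(x, y) by counting how often each (x, y) pair appears."""
    n = len(X)
    counts = Counter(zip(X, y))                      # co-occurrence counts
    return {pair: c / n for pair, c in counts.items()}

# Toy example with two binary features:
X = [(1, 0), (1, 0), (0, 1), (1, 1)]
y = ["spam", "ham", "ham", "spam"]
print(estimate_joint(X, y)[((1, 0), "spam")])        # 0.25
```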

Something Discriminative

We are also interested in using a similar technique to estimate $P(Y \mid X)$: if we have $P(Y \mid X)$, we can use the Bayes Optimal Classifier to make predictions. We therefore estimate $\hat{P}(y \mid \mathbf{x})$ from the dataset, assume that $\hat{P}(y \mid \mathbf{x})$ describes the true distribution $P(Y \mid X)$ well, and use $\hat{P}(y \mid \mathbf{x})$ in place of $P(y \mid \mathbf{x})$ in the Bayes Optimal Classifier.

So how can we estimate $\hat{P}(y \mid \mathbf{x})$? Previously we derived that $\hat P(y)=\frac{\sum_{i=1}^n I(y_i=y)}{n}$. Similarly, $\hat P(\mathbf{x})=\frac{\sum_{i=1}^n I(\mathbf{x}_i=\mathbf{x})}{n}$ and $\hat P(y,\mathbf{x})=\frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n}$. Putting these estimates together gives

$$\hat{P}(y \mid \mathbf{x}) = \frac{\hat{P}(y,\mathbf{x})}{\hat{P}(\mathbf{x})} = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x})}$$
[Figure: Venn diagram with $B$ = training points whose features equal $\mathbf{x}$ and $C$ = those points in $B$ that additionally have label $y$.]

The Venn diagram illustrates that the MLE method estimates $\hat{P}(y \mid \mathbf{x})$ as

$$\hat{P}(y \mid \mathbf{x}) = \frac{|C|}{|B|}$$

Problem: The estimate is only good if there are many training points with exactly the same features as $\mathbf{x}$! In high dimensional spaces (or with continuous $\mathbf{x}$), this never happens! So $|B| \rightarrow 0$ and $|C| \rightarrow 0$.
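To see the problem concretely, here is a small sketch (with randomly generated data, purely for illustration) of the counting estimator $\hat P(y \mid \mathbf{x}) = |C|/|B|$; with even moderately many binary features, the set $B$ of exact feature matches almost always contains only the query point itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_conditional(X, y, x_query, y_query):
    """MLE estimate P_hat(y | x) = |C| / |B| via exact feature matches."""
    in_B = np.all(X == x_query, axis=1)        # B: samples with identical features
    in_C = in_B & (y == y_query)               # C: those that also have label y
    return in_C.sum() / in_B.sum() if in_B.sum() > 0 else float("nan")

# d = 50 binary features, n = 10,000 samples: an exact match of all 50 features
# essentially never happens for a second point, so |B| is (almost) always 1 or 0.
d, n = 50, 10_000
X = rng.integers(0, 2, size=(n, d))
y = rng.integers(0, 2, size=n)
print(np.all(X == X[0], axis=1).sum())         # typically 1: only the point itself
```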

Naive Bayes

Naive Bayes offers a solution. It is a generative learning algorithm.

So we change gears a bit and turn to estimating $P(\mathbf{x} \mid y)$ and $P(y)$ in Bayes' rule:

$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})}.$$

Estimating $P(y)$ is easy (assuming there are not too many classes in this classification problem). We simply need to count how many times we observe each class. Define the fraction of times we see label $c$ as $\hat\pi_c$:

$$\hat P(y = c) = \frac{\sum_{i=1}^{n} I(y_i = c)}{n} = \hat\pi_c$$

Estimating $P(\mathbf{x} \mid y)$, however, is not easy! Here we have to make a very bold additional assumption: the Naive Bayes assumption.

Naive Bayes Assumption: The feature values are mutually independent given the label.

$$\begin{align} P(\mathbf{x} \mid y) &= P([x_1, x_2, \cdots, x_d] \mid y) \\ &= \prod_{\alpha = 1}^{d} P([\mathbf{x}]_\alpha \mid y) && \text{(the entries are mutually independent given $y$)} \end{align}$$

The Naive Bayes assumption decomposes a single $d$-dimensional estimation problem into $d$ one-dimensional problems.

This assumption is bold: if, for example, we use the counts of various words to decide whether an email is spam, the occurrences of different words are clearly correlated, so they cannot truly be independent.

[Figure] Illustration behind the Naive Bayes algorithm: suppose $\dim(\mathbf{x})=2$, with $[\mathbf{x}]_1$ on the vertical axis and $[\mathbf{x}]_2$ on the horizontal axis. We actually have four quantities to represent: $[\mathbf{x}]_1$, $[\mathbf{x}]_2$, $y$, and $P$. The label $y$ is shown by color, so the picture should really be three-dimensional, with the probability $P$ on a height axis; since only two dimensions are drawn, we have to imagine that the plotted curves live along that height axis.

We then define our Bayes Classifier as:

$$\begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_y P(y \mid \mathbf{x}) \\ &= \operatorname*{argmax}_y \; \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})} \\ &= \operatorname*{argmax}_y \; P(\mathbf{x} \mid y) P(y) && \text{($P(\mathbf{x})$ does not depend on $y$)} \\ &= \operatorname*{argmax}_y \; P(y) \prod_{\alpha=1}^{d} P(x_\alpha \mid y) && \text{(by the naive Bayes assumption)}\\ &= \operatorname*{argmax}_y \; \hat\pi_y \prod_{\alpha=1}^{d} P(x_\alpha \mid y) && \text{(previous definition of $\hat\pi_y$)} \\ &= \operatorname*{argmax}_y \; \log(\hat\pi_y) + \sum_{\alpha = 1}^{d} \log(P(x_\alpha \mid y)) && \text{(as $\log$ is a monotonic function)} \end{align}$$

Estimating $\log(P(x_\alpha \mid y))$ is easy, as we only need to consider one dimension at a time.

Whether or not we take the log, this formula defines the Naive Bayes classifier:

$$\boxed{h(\mathbf{x}) = \operatorname*{argmax}_y \; \hat\pi_y \prod_{\alpha=1}^{d} P(x_\alpha \mid y)}$$
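In code, the log-space version of this decision rule might look like the following sketch (the data structures `log_prior` and `log_likelihood_fns` are hypothetical placeholders; the next sections show how to estimate the per-feature likelihoods):

```python
def naive_bayes_predict(log_prior, log_likelihood_fns, x):
    """Predict argmax_y log(pi_hat_y) + sum_alpha log P(x_alpha | y).

    log_prior:          dict mapping class c -> log(pi_hat_c)
    log_likelihood_fns: dict mapping class c -> list of d functions, one per
                        feature, each returning log P(x_alpha | y = c)
    x:                  feature vector of length d
    """
    scores = {}
    for c, log_pi in log_prior.items():
        scores[c] = log_pi + sum(f(x_a) for f, x_a in zip(log_likelihood_fns[c], x))
    return max(scores, key=scores.get)  # class with the highest log-score
```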

Estimating $P([\mathbf{x}]_\alpha \mid y)$ - 3 Notable Cases

We will discuss 3 cases in which we use Naive Bayes to make predictions. In each case, we make explicit assumptions (refer back to our first chapter for what an “assumption” means) about the distribution of our samples.

Case #1: Categorical features

For $d$-dimensional data, think of it as having $d$ independent dice. Each die corresponds to one feature of our data point and has its own set of possible values. We roll each die exactly once and record the result in the corresponding entry of the feature vector.


Features:

$$[\mathbf{x}]_\alpha \in \{f_1, f_2, \cdots, f_{K_\alpha}\}$$

Each feature $\alpha$ falls into one of $K_\alpha$ categories. (So if we have binary features at entry $\alpha$, we would have $K_\alpha = 2$.)

Model $P(x_\alpha \mid y)$:

For each dimension $\alpha$ of $\mathbf{x}$, we have a $K_\alpha \times |C|$ matrix that stores the probability of $x_\alpha$ taking value $j$ given the label $y=c$.

$$P(x_{\alpha} = j \mid y=c) = [\theta_{jc}]_{\alpha} \quad \text{and} \quad \sum_{j=1}^{K_\alpha} [\theta_{jc}]_{\alpha} = 1$$

where $[\theta_{jc}]_{\alpha}$ is the probability of feature $\alpha$ having the value $j$, given that the label is $c$. The constraint indicates that $x_{\alpha}$ must take one of the categories $\{1, \dots, K_\alpha\}$.

Parameter estimation:

$$[\hat\theta_{jc}]_{\alpha} = \frac{\sum_{i=1}^{n} I(y_i = c)\, I([\mathbf{x}_i]_\alpha = j) + l}{\sum_{i=1}^{n} I(y_i = c) + lK_\alpha}$$

where $l$ is a smoothing parameter. Setting $l=0$ gives the MLE estimator; $l>0$ leads to a MAP estimator, and $l=1$ gives Laplace smoothing. (Here we make the implicit assumption that each category is a priori equally likely, so we add $lK_\alpha$ imaginary draws in which each value $j$ of $[\mathbf{x}_i]_\alpha$ occurs $l$ times.) For $l=0$ this becomes

$$[\hat\theta_{jc}]_{\alpha} = \frac{\text{\# of samples with label } c \text{ that have feature } \alpha \text{ with value } j}{\text{\# of samples with label } c}$$
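A minimal sketch of this smoothed estimator (the function name and array layout are my own choices, not from the lecture):

```python
import numpy as np

def estimate_theta_categorical(X, y, num_classes, K, l=1):
    """Estimate [theta_hat_{jc}]_alpha with smoothing parameter l (l=1: Laplace).

    X: (n, d) integer array with X[i, a] in {0, ..., K[a]-1}
    y: (n,) integer labels in {0, ..., num_classes-1}
    Returns a list of d arrays; theta[a][j, c] = P_hat(x_a = j | y = c).
    """
    n, d = X.shape
    theta = []
    for a in range(d):
        t = np.zeros((K[a], num_classes))
        for c in range(num_classes):
            in_class = (y == c)
            counts = np.bincount(X[in_class, a], minlength=K[a])  # value counts in class c
            t[:, c] = (counts + l) / (in_class.sum() + l * K[a])
        theta.append(t)
    return theta
```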

Prediction:

$$\begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_c \; P(y = c \mid \mathbf{x}) \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} P(x_\alpha \mid y = c) && \text{(previous definition)} \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha = 1}^{d} [\hat\theta_{x_\alpha c}]_\alpha \end{align}$$

This prediction requires only $d \times K_\alpha \times |C|$ values in total; if we instead predicted directly from an estimate of $P(\mathbf{x},y)$, we would need $K_\alpha^{d} \times |C|$ values, so this is a huge improvement.
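Putting the pieces together, a hypothetical prediction routine for the categorical case (reusing the `theta` layout from the sketch above, plus a vector of log class priors) could look like this:

```python
import numpy as np

def predict_categorical(x, theta, log_prior):
    """Predict argmax_c pi_hat_c * prod_alpha [theta_hat_{x_alpha, c}]_alpha, in log space.

    x:         length-d integer feature vector
    theta:     list of d arrays, theta[a][j, c] = P_hat(x_a = j | y = c)
    log_prior: (num_classes,) array of log(pi_hat_c)
    """
    scores = log_prior.copy()
    for a, x_a in enumerate(x):
        scores = scores + np.log(theta[a][x_a, :])   # add log [theta_hat_{x_a, c}]_a
    return int(np.argmax(scores))
```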

Case #2: Multinomial features

We have one die with $d$ faces. We roll it multiple times and count how many times each face appears, recording that count in the entry corresponding to that face. Therefore, the value of the $\alpha^{th}$ feature shows how many times face $\alpha$ appeared.


Continuing with the example of detecting spam from word frequencies: with multinomial features, $[\mathbf{x}]_\alpha = j$ means that word $\alpha$ appears $j$ times in the email.

Features:

$$x_\alpha \in \{0, 1, 2, \dots, m\} \quad \text{and} \quad m = \sum_{\alpha = 1}^d x_\alpha$$

where $m$ is the length of the email (its total word count) and $d$ is the size of the vocabulary.

Model $P(\mathbf{x} \mid y)$:

Although each word occurrence is independent of the others in this setting, the length of the email is fixed, so we estimate the count vector as a whole rather than a separate $P(x_\alpha \mid y)$ per dimension as before. This situation can be modeled by the multinomial distribution

$$P(\mathbf{x} \mid m, y=c) = \frac{m!}{x_1! \cdot x_2! \cdots x_d!} \prod_{\alpha = 1}^d \left(\theta_{\alpha c}\right)^{x_\alpha}$$

where $\theta_{\alpha c}$ is the probability of selecting word $\alpha$ and $\sum_{\alpha = 1}^d \theta_{\alpha c} = 1$ (you have to select one of the $d$ words). This equation describes the probability of observing such an email (i.e., these particular word counts) given that it is spam (or not).

So, we can use this model to generate a spam email, i.e., a document $\mathbf{x}$ of class $y = \text{spam}$, by picking $m$ words independently at random from the vocabulary of $d$ words using $P(\mathbf{x} \mid y = \text{spam})$.

Parameter estimation:

$$\hat\theta_{\alpha c} = \frac{\sum_{i = 1}^{n} I(y_i = c)\, x_{i\alpha} + l}{\sum_{i=1}^{n} I(y_i = c)\, m_i + l \cdot d}$$

where $m_i=\sum_{\beta = 1}^{d} x_{i\beta}$ denotes the number of words in document $i$ and $l$ is the smoothing parameter. For $l=0$ and $c = \text{spam}$ this becomes

$$\hat\theta_{\alpha c} = \frac{\text{\# of times word } \alpha \text{ appears in all spam emails}}{\text{\# of words in all spam emails combined}}.$$
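A sketch of this estimator for a document-term count matrix (variable names are mine, not from the lecture):

```python
import numpy as np

def estimate_theta_multinomial(X, y, num_classes, l=1):
    """Estimate theta_hat_{alpha c} from word counts, with smoothing parameter l.

    X: (n, d) array of word counts, y: (n,) integer labels.
    Returns theta of shape (d, num_classes); each column sums to 1.
    """
    n, d = X.shape
    theta = np.zeros((d, num_classes))
    for c in range(num_classes):
        word_counts = X[y == c].sum(axis=0)   # # of times each word appears in class c
        total_words = word_counts.sum()       # sum_i I(y_i = c) * m_i
        theta[:, c] = (word_counts + l) / (total_words + l * d)
    return theta
```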

Prediction:

Look back at our Bayes classifier definition: $\operatorname*{argmax}_y \; \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})}$.

The factorial terms appear both in the factor $P(\mathbf{x} \mid y)$ and in the denominator $P(\mathbf{x})$, so they cancel out. All we need to care about are the $\theta$'s and the $\pi$'s:

$$\begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_c \; P(y = c \mid \mathbf{x}) \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} P(x_\alpha \mid y = c) \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha = 1}^d \hat\theta_{\alpha c}^{x_\alpha} \end{align}$$
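In log space this prediction is a single dot product between the count vector and the log-parameters, e.g. (a sketch, assuming `theta` comes from the estimator above):

```python
import numpy as np

def predict_multinomial(x, theta, log_prior):
    """Predict argmax_c log(pi_hat_c) + sum_alpha x_alpha * log(theta_hat_{alpha c}).

    x: (d,) word counts, theta: (d, num_classes), log_prior: (num_classes,).
    """
    scores = log_prior + x @ np.log(theta)   # vectorized sum over the vocabulary
    return int(np.argmax(scores))
```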

Case #3: Continuous Normal Features (Gaussian Naive Bayes)

In this case, we assume that each class-conditional feature distribution $P(x_\alpha \mid y)$ is Gaussian with its own mean $\mu_{\alpha,y}$ and variance $\sigma_{\alpha,y}^2$, i.e. $P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha c}\right)$. Because of the Naive Bayes assumption, these class-conditional feature distributions are independent of each other.

Features:

$$x_\alpha \in \mathbb{R} \quad \text{(each feature takes on a real value)}$$

Model $P(x_\alpha \mid y)$:

Since it is a Gaussian distribution,

$$P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha c}\right) = \frac{1}{\sqrt{2 \pi}\, \sigma_{\alpha c}} e^{-\frac{1}{2} \left(\frac{x_\alpha - \mu_{\alpha c}}{\sigma_{\alpha c}}\right)^2}$$

To compute $\mu_{\alpha c}$ and $\sigma^{2}_{\alpha c}$: the mean $\mu_{\alpha c}$ of dimension $\alpha$ given label $c$ is just the average value of feature $\alpha$ over all samples with label $c$, and the variance $\sigma^{2}_{\alpha c}$ is the empirical variance of those same feature values.

$$\begin{align} \mu_{\alpha c} &\leftarrow \frac{1}{n_c} \sum_{i = 1}^{n} I(y_i = c)\, x_{i\alpha} && \text{where $n_c = \sum_{i=1}^{n} I(y_i = c)$} \\ \sigma_{\alpha c}^2 &\leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)(x_{i\alpha} - \mu_{\alpha c})^2 \end{align}$$

The full conditional distribution is $P(\mathbf{x} \mid y)\sim \mathcal{N}(\boldsymbol{\mu}_y,\Sigma_y)$, where $\Sigma_y$ is a diagonal covariance matrix with $[\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha,y}$. It is diagonal because the Naive Bayes assumption makes each dimension independent of the others.
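A small sketch of these per-class, per-feature estimates (the array shapes are my own convention):

```python
import numpy as np

def estimate_gaussian_params(X, y, num_classes):
    """Per-class mean and variance of each feature, following the update rules above.

    X: (n, d) real-valued features, y: (n,) integer labels.
    Returns mu, var, both of shape (num_classes, d).
    """
    n, d = X.shape
    mu = np.zeros((num_classes, d))
    var = np.zeros((num_classes, d))
    for c in range(num_classes):
        Xc = X[y == c]
        mu[c] = Xc.mean(axis=0)   # mu_{alpha c}
        var[c] = Xc.var(axis=0)   # sigma^2_{alpha c} (MLE: divides by n_c)
    return mu, var
```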

Prediction:

According to

$$\operatorname*{argmax}_c \; P(y = c \mid \mathbf{x}) = \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} P(x_\alpha \mid y = c),$$

we only have to plug in the Gaussian densities $P(x_\alpha \mid y=c)$ and multiply them together.
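For example, a small sketch of the resulting prediction rule, plugging the Gaussian log-densities into the usual argmax (using `mu` and `var` from the estimator above):

```python
import numpy as np

def predict_gaussian(x, mu, var, log_prior):
    """Predict argmax_c log(pi_hat_c) + sum_alpha log N(x_alpha; mu_{alpha c}, sigma^2_{alpha c}).

    x: (d,), mu and var: (num_classes, d), log_prior: (num_classes,).
    """
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(log_prior + log_lik))
```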

Naive Bayes is a linear classifier

A linear classifier makes predictions of the form $\hat y = \textrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$.

Naive Bayes leads to a linear decision boundary in many common cases (including multinomial and continuous normal).

Multinomial Case

Suppose that $y_i \in \{-1, +1\}$ and the features are multinomial.

We can show that $\exists\, \mathbf{w}, b$ such that

$$h(\mathbf{x}) = \operatorname*{argmax}_y \; P(y) \prod_{\alpha = 1}^d P(x_\alpha \mid y) = \textrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

That is, our Naive Bayes classifier always gives the same classification as this linear classifier:

$$h(\mathbf{x}) = +1 \Longleftrightarrow \mathbf{w}^\top \mathbf{x} + b > 0$$

As we showed before, $P(x_\alpha \mid y=+1)\propto\theta_{\alpha+}^{x_\alpha}$ and $P(Y=+1)=\pi_+$. We define

$$\begin{align} [\mathbf{w}]_\alpha &= \log(\theta_{\alpha +}) - \log(\theta_{\alpha -}) \\ b &= \log(\pi_+) - \log(\pi_-) \end{align}$$

If we run a linear classifier with this $\mathbf{w}$ and $b$, we arrive at our Naive Bayes classifier $h(\mathbf{x})$ by going through the steps below.

$$
\begin{align} \mathbf{w}^\top \mathbf{x} + b > 0 &\Longleftrightarrow \sum_{\alpha = 1}^{d} [\mathbf{x}]_\alpha \overbrace{(\log(\theta_{\alpha +}) - \log(\theta_{\alpha -}))}^{[\mathbf{w}]_\alpha} + \overbrace{\log(\pi_+) - \log(\pi_-)}^b > 0 && \text{(plugging in the definition of $\mathbf{w},b$)}\\ &\Longleftrightarrow \exp\left(\sum_{\alpha = 1}^{d} [\mathbf{x}]_\alpha (\log(\theta_{\alpha +}) - \log(\theta_{\alpha -})) + \log(\pi_+) - \log(\pi_-) \right)> 1 && \text{(exponentiating both sides)}\\ &\Longleftrightarrow \prod_{\alpha = 1}^{d} \frac{\exp\left( \log\theta_{\alpha +}^{[\mathbf{x}]_\alpha} + \log(\pi_+)\right)}{\exp\left(\log\theta_{\alpha -}^{[\mathbf{x}]_\alpha} + \log(\pi_-)\right)} > 1 && \text{(because $a\log(b)=\log(b^a)$ and $\exp(a-b)=\frac{e^a}{e^b}$)}\\ &\Longleftrightarrow \prod_{\alpha = 1}^{d} \frac{\theta_{\alpha +}^{[\mathbf{x}]_\alpha} \pi_+}{\theta_{\alpha -}^{[\mathbf{x}]_\alpha} \pi_-} > 1 && \text{(because $\exp(\log(a))=a$ and $e^{a+b}=e^ae^b$)}\\ &\Longleftrightarrow \frac{\prod_{\alpha = 1}^{d} P([\mathbf{x}]_\alpha \mid Y = +1)\,\pi_+}{\prod_{\alpha =1}^{d}P([\mathbf{x}]_\alpha \mid Y = -1)\,\pi_-} > 1 && \text{(because $P([\mathbf{x}]_\alpha \mid Y = \pm 1)=\theta^{[\mathbf{x}]_\alpha}_{\alpha\pm}$)}\\ &\Longleftrightarrow \frac{P(\mathbf{x} \mid Y = +1)\,\pi_+}{P(\mathbf{x} \mid Y = -1)\,\pi_-} > 1 && \text{(by the naive Bayes assumption)}\\ &\Longleftrightarrow \frac{P(Y = +1 \mid\mathbf{x})}{P(Y = -1\mid\mathbf{x})}>1 && \text{(by Bayes' rule; the denominator $P(\mathbf{x})$ cancels out, and $\pi_+=P(Y=+1)$)} \\ &\Longleftrightarrow P(Y = +1 \mid \mathbf{x}) > P(Y = -1 \mid \mathbf{x}) \\ &\Longleftrightarrow \operatorname*{argmax}_y P(Y=y\mid\mathbf{x})=+1 && \text{(i.e., the point $\mathbf{x}$ lies on the positive side of the hyperplane iff Naive Bayes predicts $+1$)} \end{align}
$$
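To make the equivalence concrete, here is a small numerical check with made-up multinomial parameters (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical multinomial parameters for a 3-word vocabulary and classes {+1, -1}.
theta_pos = np.array([0.5, 0.3, 0.2])   # theta_{alpha +}
theta_neg = np.array([0.2, 0.3, 0.5])   # theta_{alpha -}
pi_pos, pi_neg = 0.4, 0.6

w = np.log(theta_pos) - np.log(theta_neg)   # [w]_alpha
b = np.log(pi_pos) - np.log(pi_neg)         # b

x = np.array([3, 1, 0])                      # word counts of a test document
nb_pos = pi_pos * np.prod(theta_pos ** x)    # pi_+ * prod_alpha theta_{alpha+}^{x_alpha}
nb_neg = pi_neg * np.prod(theta_neg ** x)
print(np.sign(w @ x + b) == (1 if nb_pos > nb_neg else -1))   # True: same decision
```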

Continuous Features Case

We can show that

$$P(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y (\mathbf{w}^\top \mathbf{x} +b)}}.$$

This model is also known as logistic regression. Naive Bayes and Logistic Regression produce asymptotically the same model if the Naive Bayes assumption holds.
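One way to see where this logistic form comes from (a short derivation, assuming the class-conditional log-odds are linear in $\mathbf{x}$, as in the multinomial derivation above): from $\frac{P(Y=+1\mid\mathbf{x})}{P(Y=-1\mid\mathbf{x})} = e^{\mathbf{w}^\top\mathbf{x}+b}$ and $P(Y=+1\mid\mathbf{x}) + P(Y=-1\mid\mathbf{x}) = 1$, we get

$$P(Y=+1 \mid \mathbf{x}) = \frac{e^{\mathbf{w}^\top \mathbf{x} + b}}{1 + e^{\mathbf{w}^\top \mathbf{x} + b}} = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}, \qquad P(Y=-1 \mid \mathbf{x}) = \frac{1}{1 + e^{+(\mathbf{w}^\top \mathbf{x} + b)}},$$

which is exactly $\frac{1}{1+e^{-y(\mathbf{w}^\top\mathbf{x}+b)}}$ for $y \in \{-1,+1\}$.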