
Naive Bayes

Cornell University

From Lecture: Bayes Classifier and Naive Bayes

Generative vs. Discriminative Algorithm

Introduction - Estimating Probability Directly From Data

Something Generative

Our training data consists of the set $D=\{(\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n)\}$ drawn from some unknown distribution $P(X,Y)$. Because all pairs are sampled i.i.d., we obtain

$$P(D)=P((\mathbf{x}_1,y_1),\dots,(\mathbf{x}_n,y_n))=\prod_{\alpha=1}^n P(\mathbf{x}_\alpha,y_\alpha).$$

If we had enough data, we could estimate $P(X,Y)$ simply through counting:

$$\hat P(\mathbf{x},y) =\frac{\text{\# of times } (\mathbf{x},y) \text{ appears in our data}}{\text{\# of total data points}} = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n}$$

where $I(\mathbf{x}_i=\mathbf{x} \wedge y_i=y)=1$ if $\mathbf{x}_i=\mathbf{x}$ and $y_i=y$, and 0 otherwise.
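As a quick illustration, here is a minimal sketch of this counting estimator on a hypothetical toy dataset (the data and names are mine, purely for illustration):

```python
from collections import Counter

def estimate_joint(X, y):
    """Estimate P_hat(x, y) by counting how often each (x, y) pair appears."""
    n = len(X)
    counts = Counter(zip(X, y))                      # co-occurrence counts
    return {pair: c / n for pair, c in counts.items()}

# Toy example with two binary features:
X = [(1, 0), (1, 0), (0, 1), (1, 1)]
y = ["spam", "ham", "ham", "spam"]
print(estimate_joint(X, y)[((1, 0), "spam")])        # 0.25
```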

Something Discriminative

We are also interested in using a similar technique to estimate $P(Y \mid X)$: if we have $P(Y \mid X)$, we can use the Bayes Optimal Classifier to make predictions. We therefore estimate $\hat{P}(y \mid \mathbf{x})$ from the dataset, assume that $\hat{P}(y \mid \mathbf{x})$ describes the true distribution $P(Y \mid X)$ well, and use $\hat{P}(y \mid \mathbf{x})$ in place of $P(y \mid \mathbf{x})$ in the Bayes Optimal Classifier.

So how can we estimate $\hat{P}(y \mid \mathbf{x})$? Previously we derived that $\hat P(y)=\frac{\sum_{i=1}^n I(y_i=y)}{n}$. Similarly, $\hat P(\mathbf{x})=\frac{\sum_{i=1}^n I(\mathbf{x}_i=\mathbf{x})}{n}$ and $\hat P(y,\mathbf{x})=\frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n}$. Putting these estimates together gives

$$\hat{P}(y \mid \mathbf{x}) = \frac{\hat{P}(y,\mathbf{x})}{\hat{P}(\mathbf{x})} = \frac{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{\sum_{i=1}^{n} I(\mathbf{x}_i = \mathbf{x})}$$
[Figure: Venn diagram with $B$ = training points whose features equal $\mathbf{x}$ and $C$ = those points in $B$ that additionally have label $y$.]

The Venn diagram illustrates that the MLE method estimates $\hat{P}(y \mid \mathbf{x})$ as

$$\hat{P}(y \mid \mathbf{x}) = \frac{|C|}{|B|}$$

Problem: The estimate is only good if there are many training points with exactly the same features as $\mathbf{x}$! In high dimensional spaces (or with continuous $\mathbf{x}$), this never happens! So $|B| \rightarrow 0$ and $|C| \rightarrow 0$.
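To see the problem concretely, here is a small sketch (with randomly generated data, purely for illustration) of the counting estimator $\hat P(y \mid \mathbf{x}) = |C|/|B|$; with even moderately many binary features, the set $B$ of exact feature matches almost always contains only the query point itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_conditional(X, y, x_query, y_query):
    """MLE estimate P_hat(y | x) = |C| / |B| via exact feature matches."""
    in_B = np.all(X == x_query, axis=1)        # B: samples with identical features
    in_C = in_B & (y == y_query)               # C: those that also have label y
    return in_C.sum() / in_B.sum() if in_B.sum() > 0 else float("nan")

# d = 50 binary features, n = 10,000 samples: an exact match of all 50 features
# essentially never happens for a second point, so |B| is (almost) always 1 or 0.
d, n = 50, 10_000
X = rng.integers(0, 2, size=(n, d))
y = rng.integers(0, 2, size=n)
print(np.all(X == X[0], axis=1).sum())         # typically 1: only the point itself
```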

Naive Bayes

Naive Bayes offers a solution. It is a generative learning algorithm.

So we change gears a bit and turn to estimating $P(\mathbf{x} \mid y)$ and $P(y)$ in Bayes' rule:

$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})}.$$

Estimating $P(y)$ is easy (assuming there are not too many classes in this classification problem). We simply need to count how many times we observe each class. Define the fraction of times we see label $c$ as $\hat\pi_c$:

$$\hat P(y = c) = \frac{\sum_{i=1}^{n} I(y_i = c)}{n} = \hat\pi_c$$

Estimating $P(\mathbf{x} \mid y)$, however, is not easy! Here we have to make a very bold additional assumption: the Naive Bayes assumption.

Naive Bayes Assumption: The feature values are mutually independent given the label.

$$\begin{align} P(\mathbf{x} \mid y) &= P([x_1, x_2, \cdots, x_d] \mid y) \\ &= \prod_{\alpha = 1}^{d} P([\mathbf{x}]_\alpha \mid y) && \text{(the entries are mutually independent given $y$)} \end{align}$$

The Naive Bayes assumption decomposes a single $d$-dimensional estimation problem into $d$ one-dimensional problems.

This assumption is bold: if, for example, we use the counts of various words to decide whether an email is spam, the occurrences of different words are clearly correlated, so they cannot truly be independent.

[Figure] Illustration behind the Naive Bayes algorithm: suppose $\dim(\mathbf{x})=2$, with $[\mathbf{x}]_1$ on the vertical axis and $[\mathbf{x}]_2$ on the horizontal axis. We actually have four quantities to represent: $[\mathbf{x}]_1$, $[\mathbf{x}]_2$, $y$, and $P$. The label $y$ is shown by color, so the picture should really be three-dimensional, with the probability $P$ on a height axis; since only two dimensions are drawn, we have to imagine that the plotted curves live along that height axis.

We then define our Bayes Classifier as:

$$\begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_y P(y \mid \mathbf{x}) \\ &= \operatorname*{argmax}_y \; \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})} \\ &= \operatorname*{argmax}_y \; P(\mathbf{x} \mid y) P(y) && \text{($P(\mathbf{x})$ does not depend on $y$)} \\ &= \operatorname*{argmax}_y \; P(y) \prod_{\alpha=1}^{d} P(x_\alpha \mid y) && \text{(by the naive Bayes assumption)}\\ &= \operatorname*{argmax}_y \; \hat\pi_y \prod_{\alpha=1}^{d} P(x_\alpha \mid y) && \text{(previous definition of $\hat\pi_y$)} \\ &= \operatorname*{argmax}_y \; \log(\hat\pi_y) + \sum_{\alpha = 1}^{d} \log(P(x_\alpha \mid y)) && \text{(as $\log$ is a monotonic function)} \end{align}$$

Estimating $\log(P(x_\alpha \mid y))$ is easy, as we only need to consider one dimension at a time.

Whether or not we take the log, this formula defines the Naive Bayes classifier:

$$\boxed{h(\mathbf{x}) = \operatorname*{argmax}_y \; \hat\pi_y \prod_{\alpha=1}^{d} P(x_\alpha \mid y)}$$
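In code, the log-space version of this decision rule might look like the following sketch (the data structures `log_prior` and `log_likelihood_fns` are hypothetical placeholders; the next sections show how to estimate the per-feature likelihoods):

```python
def naive_bayes_predict(log_prior, log_likelihood_fns, x):
    """Predict argmax_y log(pi_hat_y) + sum_alpha log P(x_alpha | y).

    log_prior:          dict mapping class c -> log(pi_hat_c)
    log_likelihood_fns: dict mapping class c -> list of d functions, one per
                        feature, each returning log P(x_alpha | y = c)
    x:                  feature vector of length d
    """
    scores = {}
    for c, log_pi in log_prior.items():
        scores[c] = log_pi + sum(f(x_a) for f, x_a in zip(log_likelihood_fns[c], x))
    return max(scores, key=scores.get)  # class with the highest log-score
```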

Estimating $P([\mathbf{x}]_\alpha \mid y)$ - 3 Notable Cases

We will discuss 3 cases in which we use Naive Bayes to make predictions. In each case, we make explicit assumptions (refer back to our first chapter for what an “assumption” means) about the distribution of our samples.

Case #1: Categorical features

For $d$-dimensional data, think of it as having $d$ independent dice. Each die corresponds to one feature of our data point and has its own set of possible values. We roll each die exactly once and record the result in the corresponding entry of the feature vector.


Features:

$$[\mathbf{x}]_\alpha \in \{f_1, f_2, \cdots, f_{K_\alpha}\}$$

Each feature $\alpha$ falls into one of $K_\alpha$ categories. (So if we have binary features at entry $\alpha$, we would have $K_\alpha = 2$.)

Model $P(x_\alpha \mid y)$:

For each dimension $\alpha$ of $\mathbf{x}$, we have a $K_\alpha \times |C|$ matrix that stores the probability of $x_\alpha$ taking value $j$ given the label $y=c$.

$$P(x_{\alpha} = j \mid y=c) = [\theta_{jc}]_{\alpha} \quad \text{and} \quad \sum_{j=1}^{K_\alpha} [\theta_{jc}]_{\alpha} = 1$$

where $[\theta_{jc}]_{\alpha}$ is the probability of feature $\alpha$ having the value $j$, given that the label is $c$. The constraint indicates that $x_{\alpha}$ must take one of the categories $\{1, \dots, K_\alpha\}$.

Parameter estimation:

$$[\hat\theta_{jc}]_{\alpha} = \frac{\sum_{i=1}^{n} I(y_i = c)\, I([\mathbf{x}_i]_\alpha = j) + l}{\sum_{i=1}^{n} I(y_i = c) + lK_\alpha}$$

where $l$ is a smoothing parameter. Setting $l=0$ gives the MLE estimator; $l>0$ leads to a MAP estimator, and $l=1$ gives Laplace smoothing. (Here we make the implicit assumption that each category is a priori equally likely, so we add $lK_\alpha$ imaginary draws in which each value $j$ of $[\mathbf{x}_i]_\alpha$ occurs $l$ times.) For $l=0$ this becomes

$$[\hat\theta_{jc}]_{\alpha} = \frac{\text{\# of samples with label } c \text{ that have feature } \alpha \text{ with value } j}{\text{\# of samples with label } c}$$
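A minimal sketch of this smoothed estimator (the function name and array layout are my own choices, not from the lecture):

```python
import numpy as np

def estimate_theta_categorical(X, y, num_classes, K, l=1):
    """Estimate [theta_hat_{jc}]_alpha with smoothing parameter l (l=1: Laplace).

    X: (n, d) integer array with X[i, a] in {0, ..., K[a]-1}
    y: (n,) integer labels in {0, ..., num_classes-1}
    Returns a list of d arrays; theta[a][j, c] = P_hat(x_a = j | y = c).
    """
    n, d = X.shape
    theta = []
    for a in range(d):
        t = np.zeros((K[a], num_classes))
        for c in range(num_classes):
            in_class = (y == c)
            counts = np.bincount(X[in_class, a], minlength=K[a])  # value counts in class c
            t[:, c] = (counts + l) / (in_class.sum() + l * K[a])
        theta.append(t)
    return theta
```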

Prediction:

$$\begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_c \; P(y = c \mid \mathbf{x}) \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} P(x_\alpha \mid y = c) && \text{(previous definition)} \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha = 1}^{d} [\hat\theta_{x_\alpha c}]_\alpha \end{align}$$

This prediction requires only $d \times K_\alpha \times |C|$ values in total; if we instead predicted directly from an estimate of $P(\mathbf{x},y)$, we would need $K_\alpha^{d} \times |C|$ values, so this is a huge improvement.
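Putting the pieces together, a hypothetical prediction routine for the categorical case (reusing the `theta` layout from the sketch above, plus a vector of log class priors) could look like this:

```python
import numpy as np

def predict_categorical(x, theta, log_prior):
    """Predict argmax_c pi_hat_c * prod_alpha [theta_hat_{x_alpha, c}]_alpha, in log space.

    x:         length-d integer feature vector
    theta:     list of d arrays, theta[a][j, c] = P_hat(x_a = j | y = c)
    log_prior: (num_classes,) array of log(pi_hat_c)
    """
    scores = log_prior.copy()
    for a, x_a in enumerate(x):
        scores = scores + np.log(theta[a][x_a, :])   # add log [theta_hat_{x_a, c}]_a
    return int(np.argmax(scores))
```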

Case #2: Multinomial features

We have one die with $d$ faces. We roll it multiple times and count how many times each face appears, recording that count in the entry corresponding to that face. Therefore, the value of the $\alpha^{th}$ feature shows how many times face $\alpha$ appeared.


Continuing with the example of detecting spam from word frequencies: with multinomial features, $[\mathbf{x}]_\alpha = j$ means that word $\alpha$ appears $j$ times in the email.

Features:

$$x_\alpha \in \{0, 1, 2, \dots, m\} \quad \text{and} \quad m = \sum_{\alpha = 1}^d x_\alpha$$

where $m$ is the length of the email (its total word count) and $d$ is the size of the vocabulary.

Model $P(\mathbf{x} \mid y)$:

Although each word occurrence is independent of the others in this setting, the length of the email is fixed, so we estimate the count vector as a whole rather than a separate $P(x_\alpha \mid y)$ per dimension as before. This situation can be modeled by the multinomial distribution

$$P(\mathbf{x} \mid m, y=c) = \frac{m!}{x_1! \cdot x_2! \cdots x_d!} \prod_{\alpha = 1}^d \left(\theta_{\alpha c}\right)^{x_\alpha}$$

where $\theta_{\alpha c}$ is the probability of selecting word $\alpha$ and $\sum_{\alpha = 1}^d \theta_{\alpha c} = 1$ (you have to select one of the $d$ words). This equation describes the probability of observing such an email (i.e., these particular word counts) given that it is spam (or not).

So, we can use this model to generate a spam email, i.e., a document $\mathbf{x}$ of class $y = \text{spam}$, by picking $m$ words independently at random from the vocabulary of $d$ words using $P(\mathbf{x} \mid y = \text{spam})$.

Parameter estimation:

$$\hat\theta_{\alpha c} = \frac{\sum_{i = 1}^{n} I(y_i = c)\, x_{i\alpha} + l}{\sum_{i=1}^{n} I(y_i = c)\, m_i + l \cdot d}$$

where $m_i=\sum_{\beta = 1}^{d} x_{i\beta}$ denotes the number of words in document $i$ and $l$ is the smoothing parameter. For $l=0$ and $c = \text{spam}$ this becomes

$$\hat\theta_{\alpha c} = \frac{\text{\# of times word } \alpha \text{ appears in all spam emails}}{\text{\# of words in all spam emails combined}}.$$
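A sketch of this estimator for a document-term count matrix (variable names are mine, not from the lecture):

```python
import numpy as np

def estimate_theta_multinomial(X, y, num_classes, l=1):
    """Estimate theta_hat_{alpha c} from word counts, with smoothing parameter l.

    X: (n, d) array of word counts, y: (n,) integer labels.
    Returns theta of shape (d, num_classes); each column sums to 1.
    """
    n, d = X.shape
    theta = np.zeros((d, num_classes))
    for c in range(num_classes):
        word_counts = X[y == c].sum(axis=0)   # # of times each word appears in class c
        total_words = word_counts.sum()       # sum_i I(y_i = c) * m_i
        theta[:, c] = (word_counts + l) / (total_words + l * d)
    return theta
```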

Prediction:

Look back at our Bayes classifier definition: $\operatorname*{argmax}_y \; \frac{P(\mathbf{x} \mid y)P(y)}{P(\mathbf{x})}$.

The factorial terms appear both in the factor $P(\mathbf{x} \mid y)$ and in the denominator $P(\mathbf{x})$, so they cancel out. All we need to care about are the $\theta$'s and the $\pi$'s:

$$\begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_c \; P(y = c \mid \mathbf{x}) \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} P(x_\alpha \mid y = c) \\ &= \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha = 1}^d \hat\theta_{\alpha c}^{x_\alpha} \end{align}$$
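In log space this prediction is a single dot product between the count vector and the log-parameters, e.g. (a sketch, assuming `theta` comes from the estimator above):

```python
import numpy as np

def predict_multinomial(x, theta, log_prior):
    """Predict argmax_c log(pi_hat_c) + sum_alpha x_alpha * log(theta_hat_{alpha c}).

    x: (d,) word counts, theta: (d, num_classes), log_prior: (num_classes,).
    """
    scores = log_prior + x @ np.log(theta)   # vectorized sum over the vocabulary
    return int(np.argmax(scores))
```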

Case #3: Continuous Normal Features (Gaussian Naive Bayes)

In this case, we assume that each class-conditional feature distribution $P(x_\alpha \mid y)$ is Gaussian with its own mean $\mu_{\alpha,y}$ and variance $\sigma_{\alpha,y}^2$, i.e. $P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha c}\right)$. Because of the Naive Bayes assumption, these class-conditional feature distributions are independent of each other.

Features:

$$x_\alpha \in \mathbb{R} \quad \text{(each feature takes on a real value)}$$

Model $P(x_\alpha \mid y)$:

Since it is a Gaussian distribution,

$$P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha c}\right) = \frac{1}{\sqrt{2 \pi}\, \sigma_{\alpha c}} e^{-\frac{1}{2} \left(\frac{x_\alpha - \mu_{\alpha c}}{\sigma_{\alpha c}}\right)^2}$$

To compute $\mu_{\alpha c}$ and $\sigma^{2}_{\alpha c}$: the mean $\mu_{\alpha c}$ of dimension $\alpha$ given label $c$ is just the average value of feature $\alpha$ over all samples with label $c$, and the variance $\sigma^{2}_{\alpha c}$ is the empirical variance of those same feature values.

$$\begin{align} \mu_{\alpha c} &\leftarrow \frac{1}{n_c} \sum_{i = 1}^{n} I(y_i = c)\, x_{i\alpha} && \text{where $n_c = \sum_{i=1}^{n} I(y_i = c)$} \\ \sigma_{\alpha c}^2 &\leftarrow \frac{1}{n_c} \sum_{i=1}^{n} I(y_i = c)(x_{i\alpha} - \mu_{\alpha c})^2 \end{align}$$

The full conditional distribution is $P(\mathbf{x} \mid y)\sim \mathcal{N}(\boldsymbol{\mu}_y,\Sigma_y)$, where $\Sigma_y$ is a diagonal covariance matrix with $[\Sigma_y]_{\alpha,\alpha}=\sigma^2_{\alpha,y}$. It is diagonal because the Naive Bayes assumption makes each dimension independent of the others.
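A small sketch of these per-class, per-feature estimates (the array shapes are my own convention):

```python
import numpy as np

def estimate_gaussian_params(X, y, num_classes):
    """Per-class mean and variance of each feature, following the update rules above.

    X: (n, d) real-valued features, y: (n,) integer labels.
    Returns mu, var, both of shape (num_classes, d).
    """
    n, d = X.shape
    mu = np.zeros((num_classes, d))
    var = np.zeros((num_classes, d))
    for c in range(num_classes):
        Xc = X[y == c]
        mu[c] = Xc.mean(axis=0)   # mu_{alpha c}
        var[c] = Xc.var(axis=0)   # sigma^2_{alpha c} (MLE: divides by n_c)
    return mu, var
```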

Prediction:

According to

$$\operatorname*{argmax}_c \; P(y = c \mid \mathbf{x}) = \operatorname*{argmax}_c \; \hat\pi_c \prod_{\alpha=1}^{d} P(x_\alpha \mid y = c),$$

we only have to plug in the Gaussian densities $P(x_\alpha \mid y=c)$ and multiply them together.
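For example, a small sketch of the resulting prediction rule, plugging the Gaussian log-densities into the usual argmax (using `mu` and `var` from the estimator above):

```python
import numpy as np

def predict_gaussian(x, mu, var, log_prior):
    """Predict argmax_c log(pi_hat_c) + sum_alpha log N(x_alpha; mu_{alpha c}, sigma^2_{alpha c}).

    x: (d,), mu and var: (num_classes, d), log_prior: (num_classes,).
    """
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(log_prior + log_lik))
```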

Naive Bayes is a linear classifier

A linear classifier makes predictions of the form $\hat y = \textrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$.

Naive Bayes leads to a linear decision boundary in many common cases (including multinomial and continuous normal).

Multinomial Case

Suppose that $y_i \in \{-1, +1\}$ and the features are multinomial.

We can show that $\exists\, \mathbf{w}, b$ such that

$$h(\mathbf{x}) = \operatorname*{argmax}_y \; P(y) \prod_{\alpha = 1}^d P(x_\alpha \mid y) = \textrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

That is, our Naive Bayes classifier always gives the same classification as this linear classifier:

$$h(\mathbf{x}) = +1 \Longleftrightarrow \mathbf{w}^\top \mathbf{x} + b > 0$$

As we showed before, $P(x_\alpha \mid y=+1)\propto\theta_{\alpha+}^{x_\alpha}$ and $P(Y=+1)=\pi_+$. We define

$$\begin{align} [\mathbf{w}]_\alpha &= \log(\theta_{\alpha +}) - \log(\theta_{\alpha -}) \\ b &= \log(\pi_+) - \log(\pi_-) \end{align}$$

If we run a linear classifier with this $\mathbf{w}$ and $b$, we arrive at our Naive Bayes classifier $h(\mathbf{x})$ by going through the steps below.

$$
\begin{align} \mathbf{w}^\top \mathbf{x} + b > 0 &\Longleftrightarrow \sum_{\alpha = 1}^{d} [\mathbf{x}]_\alpha \overbrace{(\log(\theta_{\alpha +}) - \log(\theta_{\alpha -}))}^{[\mathbf{w}]_\alpha} + \overbrace{\log(\pi_+) - \log(\pi_-)}^b > 0 && \text{(plugging in the definition of $\mathbf{w},b$)}\\ &\Longleftrightarrow \exp\left(\sum_{\alpha = 1}^{d} [\mathbf{x}]_\alpha (\log(\theta_{\alpha +}) - \log(\theta_{\alpha -})) + \log(\pi_+) - \log(\pi_-) \right)> 1 && \text{(exponentiating both sides)}\\ &\Longleftrightarrow \prod_{\alpha = 1}^{d} \frac{\exp\left( \log\theta_{\alpha +}^{[\mathbf{x}]_\alpha} + \log(\pi_+)\right)}{\exp\left(\log\theta_{\alpha -}^{[\mathbf{x}]_\alpha} + \log(\pi_-)\right)} > 1 && \text{(because $a\log(b)=\log(b^a)$ and $\exp(a-b)=\frac{e^a}{e^b}$)}\\ &\Longleftrightarrow \prod_{\alpha = 1}^{d} \frac{\theta_{\alpha +}^{[\mathbf{x}]_\alpha} \pi_+}{\theta_{\alpha -}^{[\mathbf{x}]_\alpha} \pi_-} > 1 && \text{(because $\exp(\log(a))=a$ and $e^{a+b}=e^ae^b$)}\\ &\Longleftrightarrow \frac{\prod_{\alpha = 1}^{d} P([\mathbf{x}]_\alpha \mid Y = +1)\,\pi_+}{\prod_{\alpha =1}^{d}P([\mathbf{x}]_\alpha \mid Y = -1)\,\pi_-} > 1 && \text{(because $P([\mathbf{x}]_\alpha \mid Y = \pm 1)=\theta^{[\mathbf{x}]_\alpha}_{\alpha\pm}$)}\\ &\Longleftrightarrow \frac{P(\mathbf{x} \mid Y = +1)\,\pi_+}{P(\mathbf{x} \mid Y = -1)\,\pi_-} > 1 && \text{(by the naive Bayes assumption)}\\ &\Longleftrightarrow \frac{P(Y = +1 \mid\mathbf{x})}{P(Y = -1\mid\mathbf{x})}>1 && \text{(by Bayes' rule; the denominator $P(\mathbf{x})$ cancels out, and $\pi_+=P(Y=+1)$)} \\ &\Longleftrightarrow P(Y = +1 \mid \mathbf{x}) > P(Y = -1 \mid \mathbf{x}) \\ &\Longleftrightarrow \operatorname*{argmax}_y P(Y=y\mid\mathbf{x})=+1 && \text{(i.e., the point $\mathbf{x}$ lies on the positive side of the hyperplane iff Naive Bayes predicts $+1$)} \end{align}
$$
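To make the equivalence concrete, here is a small numerical check with made-up multinomial parameters (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical multinomial parameters for a 3-word vocabulary and classes {+1, -1}.
theta_pos = np.array([0.5, 0.3, 0.2])   # theta_{alpha +}
theta_neg = np.array([0.2, 0.3, 0.5])   # theta_{alpha -}
pi_pos, pi_neg = 0.4, 0.6

w = np.log(theta_pos) - np.log(theta_neg)   # [w]_alpha
b = np.log(pi_pos) - np.log(pi_neg)         # b

x = np.array([3, 1, 0])                      # word counts of a test document
nb_pos = pi_pos * np.prod(theta_pos ** x)    # pi_+ * prod_alpha theta_{alpha+}^{x_alpha}
nb_neg = pi_neg * np.prod(theta_neg ** x)
print(np.sign(w @ x + b) == (1 if nb_pos > nb_neg else -1))   # True: same decision
```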

Continuous Features Case

We can show that

$$P(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y (\mathbf{w}^\top \mathbf{x} +b)}}.$$

This model is also known as logistic regression. Naive Bayes and Logistic Regression produce asymptotically the same model if the Naive Bayes assumption holds.
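One way to see where this logistic form comes from (a short derivation, assuming the class-conditional log-odds are linear in $\mathbf{x}$, as in the multinomial derivation above): from $\frac{P(Y=+1\mid\mathbf{x})}{P(Y=-1\mid\mathbf{x})} = e^{\mathbf{w}^\top\mathbf{x}+b}$ and $P(Y=+1\mid\mathbf{x}) + P(Y=-1\mid\mathbf{x}) = 1$, we get

$$P(Y=+1 \mid \mathbf{x}) = \frac{e^{\mathbf{w}^\top \mathbf{x} + b}}{1 + e^{\mathbf{w}^\top \mathbf{x} + b}} = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}, \qquad P(Y=-1 \mid \mathbf{x}) = \frac{1}{1 + e^{+(\mathbf{w}^\top \mathbf{x} + b)}},$$

which is exactly $\frac{1}{1+e^{-y(\mathbf{w}^\top\mathbf{x}+b)}}$ for $y \in \{-1,+1\}$.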