
The Perceptron

Cornell University

Assumptions

  1. Binary classification (i.e. $y_i \in \{-1, +1\}$)
  2. Data is linearly separable

Classifier

$$h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i + b)$$
img

$\mathbf{w}$ is the normal vector of the hyperplane; it defines the hyperplane's orientation. $b$ is the bias term (without the bias term, the hyperplane that $\mathbf{w}$ defines would always have to go through the origin). Dealing with $b$ separately can be a pain, so we ‘absorb’ it into $\mathbf{w}$ by adding one additional constant dimension to the feature vector. Under this convention,

$$\mathbf{x}_i \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix} \hspace{0.3in} \text{and} \hspace{0.3in} \mathbf{w} \hspace{0.1in} \text{becomes} \hspace{0.1in} \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix}$$

We can verify that

$$\begin{bmatrix} \mathbf{x}_i \\ 1 \end{bmatrix}^\top \begin{bmatrix} \mathbf{w} \\ b \end{bmatrix} = \mathbf{w}^\top \mathbf{x}_i + b$$

So we have ‘gotten rid of’ $b$ by hiding it inside a new dimension. Using this, we can simplify the above formulation of $h(\mathbf{x}_i)$ to

$$h(\mathbf{x}_i) = \textrm{sign}(\mathbf{w}^\top \mathbf{x}_i)$$
img
Here we see two examples: one where the original data is one-dimensional, and one where it is two-dimensional.
(Left:) Without the added dimension, we cannot find a hyperplane through the origin that separates the two classes.
(Right:) But once we add one more constant dimension ($x_i \to \begin{bmatrix} x_i \\ 1 \end{bmatrix}$), a separating hyperplane through the origin exists.
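In code, the ‘absorb the bias’ trick is just appending a constant feature. Here is a minimal sketch (assuming NumPy; the array names are illustrative, not from the notes):

```python
import numpy as np

def absorb_bias(X):
    """Append a constant 1 to every feature vector so that
    w^T x + b can be written as a single inner product."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([X, ones])          # shape (n, d+1)

# Example: a weight vector w and bias b become one augmented vector [w; b]
X = np.array([[2.0, -1.0], [0.5, 3.0]])
w, b = np.array([1.0, 2.0]), -0.5
X_aug = absorb_bias(X)
w_aug = np.append(w, b)

# Both formulations give the same scores
assert np.allclose(X @ w + b, X_aug @ w_aug)
```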

Observation: Note that

$$y_i(\mathbf{w}^\top \mathbf{x}_i) > 0 \iff \mathbf{w}^\top \mathbf{x}_i \text{ and } y_i \text{ have the same sign} \iff \mathbf{x}_i \text{ is classified correctly}$$

where “classified correctly” means that $\mathbf{x}_i$ is on the correct side of the hyperplane defined by $\mathbf{w}$. Also, note that the left side depends on $y_i \in \{-1, +1\}$ (it wouldn't work if, for example, $y_i \in \{0, +1\}$).

Perceptron Algorithm

img
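The algorithm itself appears above only as a figure. The following is a minimal Python sketch of the loop it describes (assuming labels in $\{-1, +1\}$ and the bias already absorbed as above; `max_epochs` is an added safeguard, not part of the original algorithm):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron: X has shape (n, d) with the constant dimension
    already appended, y contains labels in {-1, +1}."""
    w = np.zeros(X.shape[1])                 # initialize w = 0
    for _ in range(max_epochs):              # cap in case the data is not separable
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:         # x_i is misclassified
                w = w + y_i * x_i            # update: w <- w + y x
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes: converged
            return w
    return w
```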

An Intuitive Example

img
(Left:) The hyperplane defined by $\mathbf{w}_t$ misclassifies one red (-1) and one blue (+1) point.
(Middle:) The red point $\mathbf{x}$ is chosen and used for an update. Because its label is -1 we need to subtract $\mathbf{x}$ from $\mathbf{w}_t$.
(Right:) The updated hyperplane $\mathbf{w}_{t+1} = \mathbf{w}_t - \mathbf{x}$ separates the two classes and the Perceptron algorithm has converged.

How did we update $\mathbf{w}$?

Why do we want to update $\mathbf{w}$ by setting $\mathbf{w}' = \mathbf{w} + y\mathbf{x}$?

Note that we updated $\mathbf{w}$ based on the point $\mathbf{x}$ because we classified $\mathbf{x}$ incorrectly, so we want to move the classifier in a direction that is more correct for $\mathbf{x}$. Let's look at the signed classification score of $\mathbf{x}$ after such an update: $y(\mathbf{w}'^\top \mathbf{x}) = y(\mathbf{w} + y\mathbf{x})^\top \mathbf{x} = y(\mathbf{w}^\top \mathbf{x}) + y^2(\mathbf{x}^\top \mathbf{x}) = y(\mathbf{w}^\top \mathbf{x}) + \|\mathbf{x}\|^2 \ge y(\mathbf{w}^\top \mathbf{x})$. We still do not know whether $\mathbf{x}$ is now classified correctly, but at least we know the score is somewhat better, because it moved toward the positive (correct) side.
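A tiny made-up numerical check of this argument (the values are chosen only for illustration):

```python
import numpy as np

w = np.array([1.0, 0.0])
x = np.array([0.6, 0.8])                 # ||x||^2 = 1.0
y = -1                                   # true label

print(y * (w @ x))                       # -0.6 -> misclassified (not > 0)
w_new = w + y * x                        # Perceptron update: w' = w + y x
print(y * (w_new @ x))                   # 0.4 = y(w^T x) + ||x||^2, strictly larger
```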

Proving Perceptron Converges

The Perceptron was arguably the first algorithm with a strong formal guarantee.

Claim

If a data set is linearly separable, the Perceptron will find a separating hyperplane in a finite number of updates. (If the data is not linearly separable, it will loop forever.)

Setup

The argument goes as follows: Suppose a separating hyperplane exists, i.e. $\exists \mathbf{w}^*$ such that $\forall (\mathbf{x}_i, y_i) \in D$, $y_i(\mathbf{x}_i^\top \mathbf{w}^*) > 0$.

Now, suppose that we rescale each data point and $\mathbf{w}^*$ such that all data points lie within the unit hypersphere and $\mathbf{w}^*$, the normal vector of the hyperplane, lies exactly on the unit sphere:

$$\|\mathbf{w}^*\| = 1 \hspace{0.3in} \text{and} \hspace{0.3in} \forall \mathbf{x}_i \in D, \hspace{0.1in} \|\mathbf{x}_i\| \le 1$$

Let us define the margin $\gamma$ of the hyperplane $\mathbf{w}^*$ as $\gamma = \min_{(\mathbf{x}_i, y_i) \in D} |\mathbf{x}_i^\top \mathbf{w}^*|$, i.e. the distance from the hyperplane to the data point closest to it.

Observe: for every $(\mathbf{x}, y) \in D$ we have $y(\mathbf{x}^\top \mathbf{w}^*) = |\mathbf{x}^\top \mathbf{w}^*| \ge \gamma$. Because $\mathbf{w}^*$ is a perfect classifier, all training points $(\mathbf{x}, y)$ lie on the “correct” side of the hyperplane and therefore $y = \textrm{sign}(\mathbf{x}^\top \mathbf{w}^*)$, which gives the first equality. The second inequality follows directly from the definition of the margin $\gamma$.
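The rescaling and the margin from this setup can also be written down directly in code (a small sketch; `X`, `y`, and a perfect separator `w_star` are assumed inputs, and the names are illustrative):

```python
import numpy as np

def margin(X, y, w_star):
    """Rescale as in the setup above and return the margin gamma."""
    w_star = w_star / np.linalg.norm(w_star)      # ||w*|| = 1
    X = X / np.max(np.linalg.norm(X, axis=1))     # all ||x_i|| <= 1
    assert np.all(y * (X @ w_star) > 0)           # w* separates the data
    return np.min(np.abs(X @ w_star))             # gamma = min |x_i^T w*|
```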

img

To summarize our setup:

  • A separating hyperplane $\mathbf{w}^*$ exists, with $\|\mathbf{w}^*\| = 1$.
  • All data points satisfy $\|\mathbf{x}_i\| \le 1$.
  • The margin of $\mathbf{w}^*$ is $\gamma = \min_{(\mathbf{x}_i, y_i) \in D} |\mathbf{x}_i^\top \mathbf{w}^*|$.

WTS

If all of the above holds, then the Perceptron algorithm makes at most $1/\gamma^2$ mistakes.

Strategy

Keeping what we defined above, consider the effect of an update ($\mathbf{w}$ becomes $\mathbf{w} + y\mathbf{x}$) on the two terms $\mathbf{w}^\top \mathbf{w}^*$ and $\mathbf{w}^\top \mathbf{w}$.

Proof

We will use two facts:

  • $y(\mathbf{x}^\top \mathbf{w}) \le 0$: this holds because $\mathbf{x}$ was misclassified by $\mathbf{w}$, otherwise we wouldn't make an update.
  • $y(\mathbf{x}^\top \mathbf{w}^*) > 0$: this holds because $\mathbf{w}^*$ is a perfect classifier.

Call the updated weight vector $\mathbf{w}' = \mathbf{w} + y\mathbf{x}$.

  1. Consider the effect of an update on $\mathbf{w}^\top \mathbf{w}^*$:

    $$\mathbf{w}'^\top \mathbf{w}^* = (\mathbf{w} + y\mathbf{x})^\top \mathbf{w}^* = \mathbf{w}^\top \mathbf{w}^* + y(\mathbf{x}^\top \mathbf{w}^*) \ge \mathbf{w}^\top \mathbf{w}^* + \gamma$$

    The inequality follows from the fact that the distance from the hyperplane defined by $\mathbf{w}^*$ to $\mathbf{x}$ must be at least $\gamma$ (i.e. $y(\mathbf{x}^\top \mathbf{w}^*) = |\mathbf{x}^\top \mathbf{w}^*| \ge \gamma$).

    This means that for each update, $\mathbf{w}^\top \mathbf{w}^*$ grows by at least $\gamma$.

  2. Consider the effect of an update on $\mathbf{w}^\top \mathbf{w}$:

    $$\mathbf{w}'^\top \mathbf{w}' = (\mathbf{w} + y\mathbf{x})^\top (\mathbf{w} + y\mathbf{x}) = \mathbf{w}^\top \mathbf{w} + \underbrace{2y(\mathbf{w}^\top \mathbf{x})}_{\le 0} + \underbrace{y^2(\mathbf{x}^\top \mathbf{x})}_{0 \le \,\cdot\, \le 1} \le \mathbf{w}^\top \mathbf{w} + 1$$

    The inequality follows from the fact that

    • $2y(\mathbf{w}^\top \mathbf{x}) \le 0$, because we had to make an update, meaning $\mathbf{x}$ was misclassified by $\mathbf{w}$ (i.e. $y(\mathbf{w}^\top \mathbf{x}) \le 0$)
    • $0 \le y^2(\mathbf{x}^\top \mathbf{x}) \le 1$, because $y^2 = 1$ and $\mathbf{x}^\top \mathbf{x} \le 1$ for every data point (since $\|\mathbf{x}\| \le 1$).

    This means that for each update, $\mathbf{w}^\top \mathbf{w}$ grows by at most 1.

  3. Now remember from the Perceptron algorithm that we initialize $\mathbf{w} = \mathbf{0}$. Hence, initially $\mathbf{w}^\top \mathbf{w} = 0$ and $\mathbf{w}^\top \mathbf{w}^* = 0$, and after $M$ updates the following two inequalities must hold:

    • $\mathbf{w}^\top \mathbf{w}^* \ge M\gamma$
    • $\mathbf{w}^\top \mathbf{w} \le M$

    We can then complete the proof:

    $$\begin{align}
    M\gamma &\le \mathbf{w}^\top \mathbf{w}^* && \text{by (1)} \\
    &= \|\mathbf{w}\|\,\|\mathbf{w}^*\|\cos(\theta) && \text{by the definition of the inner product, where $\theta$ is the angle between $\mathbf{w}$ and $\mathbf{w}^*$} \\
    &\le \|\mathbf{w}\| && \text{because $\|\mathbf{w}^*\| = 1$ and $\cos(\theta) \le 1$} \\
    &= \sqrt{\mathbf{w}^\top \mathbf{w}} && \text{by the definition of $\|\mathbf{w}\|$} \\
    &\le \sqrt{M} && \text{by (2)}
    \end{align}$$

    Therefore $M\gamma \le \sqrt{M}$, so $M^2\gamma^2 \le M$, and hence $M \le \frac{1}{\gamma^2}$: the number of updates $M$ is bounded from above by a constant.
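As an informal sanity check of this bound (not part of the proof), one can generate linearly separable data, rescale it as in the setup, count the Perceptron's updates, and compare against $1/\gamma^2$. The data, margin threshold, and variable names below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Separable toy data: label by the sign of a fixed w*, drop points too close to the hyperplane
w_star = np.array([1.0, -2.0, 0.5])
w_star = w_star / np.linalg.norm(w_star)              # ||w*|| = 1
X = rng.normal(size=(500, 3))
X = X / np.max(np.linalg.norm(X, axis=1))             # all ||x_i|| <= 1
y = np.sign(X @ w_star)
keep = np.abs(X @ w_star) > 0.05                      # enforce a visible margin
X, y = X[keep], y[keep]

gamma = np.min(np.abs(X @ w_star))

# Run the Perceptron and count updates
w, updates = np.zeros(X.shape[1]), 0
converged = False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w += y_i * x_i
            updates += 1
            converged = False

print(updates, "updates  <=  bound", 1 / gamma**2)
```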

Problems and History

  1. The Perceptron suffers from the limitations of a linear model: if no separating hyperplane exists, it will NEVER converge. There will always be some points on the wrong side, and it will keep updating forever (see the small demonstration after this list).

    Counter Example
  2. It does not generalize well either. Even though it classifies all training points correctly, the decision boundary it finds is almost arbitrary among the many possible separating hyperplanes, so a new test point may well end up on the wrong side.
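A small demonstration of the first problem, using the classic XOR configuration as a non-separable data set (the cap on epochs is only there so the loop terminates; without it, the updates would continue forever):

```python
import numpy as np

# XOR-style data (with the constant dimension appended): no separating hyperplane exists
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

w = np.zeros(3)
for epoch in range(1000):                 # cap the loop; it would otherwise never stop
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w += y_i * x_i
            mistakes += 1
    if mistakes == 0:
        break
print(mistakes)                           # never 0: the Perceptron keeps updating
```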

Largely as a result of the first problem, the AI winter followed.