Machine Learning Basics - Incomplete CS Notes @ Cornell

From Lecture: ML Setup

Basic Definitions

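Labeled Data $D = \{(\vec{x_1}, y_1), ..., (\vec{x_n}, y_n)\} \sim P^n$ : We often say "take any n samples", but even an arbitrary draw has to come from somewhere. $P$ is the unknown distribution that our samples follow: we draw n samples from $P$, and each sample has features $\vec{x_i}$ (written as a vector because there are multiple features) and a label $y_i$ .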

Machine Learning

Train/Validate/Test Split

Train your model on the Training set.
Run the Test set only once. Its purpose is to measure how well your model generalizes to data representative of the whole distribution. If you tune your model to score well on the Test set, that score no longer tells you anything about generalization.
To get a sense of your model's generalizability during development, hold out another set, the Validate set. Train and tune your model until it reaches the score you expect on the Validate set, and only then move on to the Test set.
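The three-way split described above can be sketched in a few lines. This is a minimal illustration that assumes the data is i.i.d., so a uniform random shuffle is appropriate; the function name `train_validate_test_split`, the default fractions, and the seed are assumptions made for this example, not something from the lecture.

```python
import random

def train_validate_test_split(data, validate_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut the data into train / validate / test parts."""
    shuffled = list(data)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_validate = int(len(shuffled) * validate_fraction)
    test = shuffled[:n_test]
    validate = shuffled[n_test:n_test + n_validate]
    train = shuffled[n_test + n_validate:]
    return train, validate, test

if __name__ == "__main__":
    train, validate, test = train_validate_test_split(range(1000))
    print(len(train), len(validate), len(test))
```

The Validate part can be scored as often as needed while tuning; the Test part is meant to be scored once, at the very end.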

Splitting the Data

i.i.d. data: uniformly at random
temporal data: by time (We want to train on the past and predict/test the future)
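specific case: by patient / instance. If each patient has multiple medical records, we cannot assign records at random. Instead we randomly assign patients, together with all of their records, so that no patient has some records in train and others in test. Although a patient's records differ from one another, they most likely describe the same condition, so letting the same person appear in both train and test amounts to cheating.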
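The temporal and per-patient strategies above can be sketched as follows. The helper names `split_by_time` and `split_by_group`, the tuple layout of the toy records, and the test fractions are all hypothetical choices for illustration only.

```python
import random
from collections import defaultdict

def split_by_time(records, time_key, test_fraction=0.2):
    """Train on the earliest records, test on the most recent ones."""
    ordered = sorted(records, key=time_key)
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]

def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Assign whole groups (e.g. patients) to one side, never individual records."""
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record)].append(record)
    group_ids = sorted(groups)             # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])
    train = [r for gid in group_ids if gid not in test_ids for r in groups[gid]]
    test = [r for gid in group_ids if gid in test_ids for r in groups[gid]]
    return train, test

if __name__ == "__main__":
    # Toy records: (patient id, visit day). Purely made-up values.
    visits = [("p1", 1), ("p1", 5), ("p2", 2), ("p3", 3), ("p3", 4), ("p4", 6)]
    print(split_by_time(visits, time_key=lambda r: r[1]))
    print(split_by_group(visits, group_key=lambda r: r[0]))
```

Because `split_by_group` shuffles patient ids rather than records, every record of a given patient ends up on the same side of the split.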

No Free Lunch and Making Assumptions

There is no single ML algorithm that works for all settings. Every model works only under specific assumptions, so you must make assumptions in order to learn. "Assumptions" and "settings" mean the same thing here.

"Assumptions" just means our belief about how our data is distributed. Most of the time, choosing a model means making assumptions about the distribution our data was drawn from. For example, when we try to determine whether an email is spam, we count the number of times each word appears in the email, so we are assuming our features $\mathbf{x}$ are drawn from a multinomial distribution. When we use logistic regression (with labels $y \in \{-1, +1\}$), we assume the conditional distribution of $y$ given $\mathbf{x}_i$ has the form $P(y|\mathbf{x}_i)=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}_i+b)}}$ . Both are bold assumptions that can easily fail to hold. When our assumptions fail, the model predicts poorly. So when an algorithm does not perform well, a likely reason is that its assumptions do not hold.
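To make the two assumptions concrete, here is a small sketch that builds word-count features for an email and evaluates the logistic form of $P(y|\mathbf{x})$ for $y \in \{-1, +1\}$. The vocabulary, the weights `w`, and the bias `b` are made-up values for illustration, not learned parameters.

```python
import math
from collections import Counter

def word_counts(email_text, vocabulary):
    """Count how often each vocabulary word appears (whitespace tokens, lowercased)."""
    counts = Counter(email_text.lower().split())
    return [counts[word] for word in vocabulary]

def logistic_probability(y, x, w, b):
    """P(y | x) = 1 / (1 + exp(-y (w^T x + b))), with y in {-1, +1}."""
    score = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-y * score))

if __name__ == "__main__":
    vocabulary = ["free", "money", "hello"]   # hypothetical vocabulary
    w, b = [0.5, 1.0, -1.0], -1.0             # hypothetical parameters
    x = word_counts("Free money free", vocabulary)
    p_spam = logistic_probability(+1, x, w, b)
    p_ham = logistic_probability(-1, x, w, b)
    print(x, p_spam, p_ham)
```

Because $\frac{1}{1+e^{-s}} + \frac{1}{1+e^{s}} = 1$, the two probabilities always sum to one, which is what makes this form a valid conditional distribution over the two labels.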
