
INFO1998 Intro to Machine Learning

Cornell University

The goal of this course is to provide you with high-level exposure to a wide range of data science techniques and machine learning models. From setting up your Jupyter environment, to manipulating and visualizing data, to building supervised and unsupervised models, this class aims to give you the base intuition and skill set to continue developing and working on ML projects. We hope you exit the course with an understanding of how models and optimization techniques work, as well as the confidence and tools to solve future problems on your own.

Lec2 Data Manipulation

Introduction to Pandas

Dealing with missing data
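
A minimal sketch of the usual pandas tools for missing data; the DataFrame contents here are made up for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({'cost': [100.0, np.nan, 250.0], 'rating': [4.5, 3.0, np.nan]})
df.isnull().sum()     # count missing values in each column
df.dropna()           # drop every row that contains a missing value
df.fillna(df.mean())  # or fill missing values with each column's mean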

Lec3 Data Visualization

Types of Graphs
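
A minimal sketch of the common matplotlib graph types, drawn on one figure with made-up data:

import numpy as np
from matplotlib import pyplot as plt

x = np.arange(10)
y = np.random.rand(10)

fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(x, y)       # line graph
axes[0, 1].scatter(x, y)    # scatter plot
axes[1, 0].bar(x, y)        # bar chart
axes[1, 1].hist(y, bins=5)  # histogram
plt.show()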

Coloring Graphs

plt.scatter(Longitude, Latitude, c=Temp.values.ravel(), cmap=plt.cm.OrRd) colors a scatter plot based on the values of Temp using the OrRd color scheme. More color schemes are listed in the matplotlib manual.
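
A self-contained version of the call above, with made-up longitude, latitude, and temperature arrays:

import numpy as np
from matplotlib import pyplot as plt

lon = np.random.uniform(-100, -70, 50)   # hypothetical longitudes
lat = np.random.uniform(30, 45, 50)      # hypothetical latitudes
temp = np.random.uniform(0, 35, 50)      # hypothetical temperature at each point

plt.scatter(lon, lat, c=temp, cmap=plt.cm.OrRd)  # warmer points drawn in deeper red
plt.colorbar(label='Temperature')
plt.show()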

Lec4 Linear Regression

Preparing Data

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# X must be a table (in case there are multiple x in y = a1*x1 + a2*x2 + ... + k)
X = data[['cost','compl_4']] 
# Y must be one column
Y = data['median_earnings'] 

# test is 20% of all data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

Predicting and Fitting

# creates Linear Regression model 
LR = LinearRegression()
# LR is an object; calling fit() estimates all of its coefficients from the training data
LR.fit(x_train, y_train)
# predict() returns the predicted value
y_predicted = LR.predict(x_test)
# score(x, y) first computes predictions from x using our model, then compares them with y (returns the R^2 score)
score = LR.score(x_test,y_test)

Describing the Model

# The fitted model describes Y = a1*x1 + a2*x2 + ... + k; in Jupyter, LR? shows the object's documentation
LR?

# coefficients of x (a1, a2, ...)
LR.coef_

# intercept k
LR.intercept_

Lec5 Measuring Model’s Accuracy

When determining accuracy, we usually want to compare our model against a baseline. A simple baseline predicts the mean y value for every observation; if our model's mean squared error is not better than the baseline's, the model has learned little.

import numpy as np
from sklearn.metrics import mean_squared_error

celcius_MSE = mean_squared_error(y_test, celcius_predictions)

# baseline: predict the mean of y_test for every observation
test_goal_mean = y_test.mean()
baseline = np.full((len(celcius_predictions),), test_goal_mean)
# compare the baseline's predictions against the true values, not against our model's predictions
baseline_MSE = mean_squared_error(y_test, baseline)

Lec6 Classifiers

Linear regression is used to predict the value of a continuous variable. Classifiers are used to predict categorical or binary variables.

KNN

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2)

k = 10
model = KNeighborsClassifier(k) # specify k nearest elements
model.fit(x_train,y_train)
predictions = model.predict(x_test)

Lec7 Other Supervised Learning Models

Decision Trees

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
model = tree.DecisionTreeClassifier(max_depth=5)

How to reduce overfitting? Constrain the tree: cap max_depth, require a minimum number of samples per leaf (min_samples_leaf), or prune branches, and check the choice against held-out data, as in the sketch below.
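
One way to see and control overfitting is to sweep max_depth and compare training accuracy against test accuracy; a minimal sketch, assuming x_train, x_test, y_train, y_test from an earlier train_test_split (a widening gap between the two scores signals overfitting):

from sklearn.tree import DecisionTreeClassifier

for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth)
    model.fit(x_train, y_train)
    # score() returns mean accuracy; compare on-training vs. held-out performance
    print(depth, model.score(x_train, y_train), model.score(x_test, y_test))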

Logistic Regression

The output is always between 0 and 1 and can be read as a probability. Accept (classify as positive) if the value is above a threshold, reject if it is below.
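
A minimal sketch with scikit-learn's LogisticRegression, assuming the same x_train, x_test, y_train, y_test split as in the KNN example:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)
predictions = model.predict(x_test)           # applies a 0.5 threshold by default
probabilities = model.predict_proba(x_test)   # the raw values between 0 and 1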

K-fold Cross Validation

Rather than doing a train-test split only once, we do it k times: first separate the sample into k pieces, then on each iteration take one piece as the test set and the rest as the training set. Use from sklearn.model_selection import KFold to achieve this. Calculate a score for each split and take the average as the final score. This averaged score is usually a better estimate of the model's true error.

from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, accuracy_score

incX = inc_data[['education.num']]
incY = inc_data['income']

kf = KFold(n_splits = 5)
accuracy = 0
for train_index, test_index in kf.split(incX):
    X_train = incX.iloc[train_index]
    Y_train = incY.iloc[train_index]
    X_test = incX.iloc[test_index]
    Y_test = incY.iloc[test_index]
    
    # best_depth is the decision-tree depth that gave the highest score in the previous exercise
    model = tree.DecisionTreeClassifier(max_depth=best_depth)
    model.fit(X_train, Y_train)
    pred_test = model.predict(X_test)
    accuracy += accuracy_score(Y_test, pred_test)
    
accuracy /= 5
print(accuracy)

Lec9 Unsupervised Learning

Hierarchical Clustering

Hierarchical clustering groups observations into multiple levels of sets; the top-level set includes all of the data, and the bottom-level sets contain individual observations. The levels in between contain sets of observations with similar features.

from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# Standardize features by removing the mean and scaling to unit variance
data = StandardScaler().fit_transform(data)
# build our model from data
clust = linkage(data) 
# draw the dendrogram visualization
dendrogram(clust)
plt.show()

K-Means Clustering

We want to cluster the data into k groups. First, randomly choose k points from the dataset as initial group centers. Then assign every data point to the center it is closest to. After all points are assigned, recompute each group's center as the mean of the points in that group. Repeat this process until no point changes its group assignment in an iteration.

from sklearn import cluster
k = 3
kmeans = cluster.KMeans(n_clusters = k) #cluster into k groups
kmeans.fit(data)
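
After fitting, the cluster assignments and centers are available as attributes of the model:

labels = kmeans.labels_            # cluster index (0..k-1) assigned to each data point
centers = kmeans.cluster_centers_  # coordinates of the k group means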