INFO1998 Intro to Machine Learning
The goal of this course is to provide you with a high-level exposure to a wide range of Data Science techniques and Machine Learning models. From the basics of setting up your Jupyter environment, to manipulating and visualizing data, to building supervised and unsupervised models, this class aims to give you the base intuition and skillset to continue developing and working on ML projects. We hope you exit the course with an understanding of how models and optimization techniques work, as well as the confidence and tools to solve future problems on your own.
Lec2 Data Manipulation
Introduction to Pandas
- Series: one-dimensional array
- DataFrame: 2-D table
- Filtering DataFrames: loc
- Cleaning-Up DataFrames: df.dropna(), df[df['Open'].notnull()] (these two methods both return a new DataFrame instead of modifying the existing one)
- View DataFrames: head, tail, ...
- Summary Statistics: mean, median, ..., describe
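A minimal sketch of these operations, assuming a small made-up DataFrame (the 'Open' column here is just an illustrative name):

import pandas as pd
import numpy as np

# Series: one-dimensional labeled array
s = pd.Series([1.0, 2.5, 3.7])

# DataFrame: 2-D table
df = pd.DataFrame({'Open': [10.0, np.nan, 12.3], 'Close': [10.5, 11.0, 12.1]})

# filtering with loc: keep rows where Close is above 10.8
high = df.loc[df['Close'] > 10.8]

# cleaning up: both return a new DataFrame; df itself is unchanged
cleaned = df.dropna()
also_cleaned = df[df['Open'].notnull()]

# viewing and summary statistics
df.head()        # first few rows
df['Close'].mean(), df['Close'].median()
df.describe()    # count, mean, std, quartiles for each numeric column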
Dealing with missing data
Fill in a value of our choice:
# if there is no record of which cabin a passenger is in, assume the Top Deck
df['Cabin'] = df['Cabin'].fillna('Top Deck')
Using summary statistics: fill missing entries with the median or mean; works well with small datasets (see the sketch after this list)
Use regression and clustering: will be covered later
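A minimal sketch of the summary-statistics approach, assuming a hypothetical numeric 'Age' column with missing entries:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan, 58.0]})

# fill missing ages with the median of the observed ages
df['Age'] = df['Age'].fillna(df['Age'].median())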
Lec3 Data Visualization
Types of Graphs
- Heatmap
- Correlation Plots
Coloring Graphs
plt.scatter(Longitude, Latitude, c=Temp.values.ravel(), cmap=plt.cm.OrRd)
colors a scatter plot by the values of Temp using the color scheme cm.OrRd. More color schemes are listed in the Matplotlib manual.
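The line above assumes Longitude, Latitude, and Temp already exist in the notebook. A self-contained sketch with made-up coordinates and temperatures:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
longitude = rng.uniform(-80, -70, 50)   # made-up coordinates
latitude = rng.uniform(40, 45, 50)
temp = rng.uniform(10, 35, 50)          # made-up temperature values

# c= maps each point's value onto the colormap; OrRd runs light orange to dark red
plt.scatter(longitude, latitude, c=temp, cmap=plt.cm.OrRd)
plt.colorbar(label='Temperature')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()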
Lec4 Linear Regression
Preparing Data
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# X must be a table (in case there are multiple x in y = a1*x1 + a2*x2 + ... + k)
X = data[['cost','compl_4']]
# Y must be one column
Y = data['median_earnings']
# test is 20% of all data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
Predicting and Fitting
# creates Linear Regression model
LR = LinearRegression()
# note LR is an object; calling fit sets all of its coefficients
LR.fit(x_train, y_train)
# predict() returns the predicted value
y_predicted = LR.predict(x_test)
# score(x, y') first computes predicted values y from x with our model, then compares them with y' (for regression this is the R^2 score)
score = LR.score(x_test,y_test)
Describing the Model
# the fitted model represents Y = a1*x1 + a2*x2 + ... + k;
# in Jupyter, LR? displays the model's documentation and parameters
LR?
# coefficients of x (a1, a2, ...)
LR.coef_
# intercept k
LR.intercept_
Lec5 Measuring Model’s Accuracy
When determining accuracy, we usually want to compare our model to a baseline. A simple baseline predicts the mean y value for every observation; our model is only useful if its error is lower than the baseline's.
import numpy as np
from sklearn.metrics import mean_squared_error

# MSE of our model's predictions
celcius_MSE = mean_squared_error(y_test, celcius_predictions)

# baseline: predict the mean of y_test for every observation
test_goal_mean = y_test.mean()
baseline = np.full((len(celcius_predictions),), test_goal_mean)
baseline_MSE = mean_squared_error(y_test, baseline)
- overfitting: the model is too specific to the training data, so it fails to predict any other data
- underfitting: the model gives roughly the same curve no matter what data it is trained on, so it has no predictive power either, because it captures no pattern in the data
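A minimal sketch contrasting the two, using polynomial degree as a stand-in for model complexity on made-up noisy quadratic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60).reshape(-1, 1)
y = x.ravel() ** 2 + rng.normal(0, 1, 60)   # noisy quadratic data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# degree 1 tends to underfit, degree 2 fits, degree 15 tends to overfit
for degree in [1, 2, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    print(degree, model.score(x_train, y_train), model.score(x_test, y_test))

Typically the degree-15 model scores near 1 on the training set but much worse on the test set, while the degree-1 model scores poorly on both.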
Lec6 Classifiers
Linear regression is used to predict the value of a continuous variable. Classifiers are used to predict categorical or binary variables.
KNN
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size=0.2)
k = 10
model = KNeighborsClassifier(n_neighbors=k)  # classify by majority vote of the k nearest neighbors
model.fit(x_train,y_train)
predictions = model.predict(x_test)
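To see how well the classifier does on the held-out data, a brief sketch continuing from the split above:

from sklearn.metrics import accuracy_score

# fraction of test points labeled correctly
print(accuracy_score(y_test, predictions))

# equivalently, score() predicts on x_test internally and returns the accuracy
print(model.score(x_test, y_test))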
Lec7 Other Supervised Learning Models
Decision Trees
from sklearn import tree
model = tree.DecisionTreeClassifier(max_depth=5)
How to reduce overfitting?
- Reduce the depth of the tree
- Train multiple decision trees (e.g. each on a different random subset of the training data) and take the average of their outputs as the final result, as sketched below
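The second idea is what a random forest does. A minimal sketch, assuming the x_train/x_test split from the KNN section:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each fit on a bootstrap sample of the training data;
# the forest predicts by majority vote over the trees
forest = RandomForestClassifier(n_estimators=100, max_depth=5)
forest.fit(x_train, y_train)
print(forest.score(x_test, y_test))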
Logistic Regression
The output is always between 0 and 1 and can be read as a probability: accept if the value is higher than a chosen threshold, reject if lower.
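The lecture gives no code here; a minimal sklearn sketch, assuming a binary target Y and the same train-test split pattern as before:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

model = LogisticRegression()
model.fit(x_train, y_train)

# predict_proba gives the probability of each class;
# predict applies the default 0.5 threshold to those probabilities
probabilities = model.predict_proba(x_test)
predictions = model.predict(x_test)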
K-fold Cross Validation
Rather than doing the train-test split only once, we do it k times: first separate the sample into k pieces, then in each round take one piece as the test set and the others as the training set. Use from sklearn.model_selection import KFold to achieve this. Compute a score for each split and take the average as the final score; this average is usually closer to the real error.
from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, accuracy_score

incX = inc_data[['education.num']]
incY = inc_data['income']
kf = KFold(n_splits=5)
accuracy = 0
for train_index, test_index in kf.split(incX):
    X_train = incX.iloc[train_index]
    Y_train = incY.iloc[train_index]
    X_test = incX.iloc[test_index]
    Y_test = incY.iloc[test_index]
    # best_depth is the decision-tree depth that gave the highest score in the previous exercise
    model = tree.DecisionTreeClassifier(max_depth=best_depth)
    model.fit(X_train, Y_train)
    pred_test = model.predict(X_test)
    accuracy += accuracy_score(Y_test, pred_test)
accuracy /= 5
print(accuracy)
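The same average can be computed in one call with sklearn's cross_val_score; a brief sketch under the same setup:

from sklearn.model_selection import cross_val_score

model = tree.DecisionTreeClassifier(max_depth=best_depth)
scores = cross_val_score(model, incX, incY, cv=5)  # one accuracy score per fold
print(scores.mean())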
Lec9 Unsupervised Learning
- Supervised Learning: the desired solution (target) is also included in the dataset
- Unsupervised Learning: the training data is unlabeled and the algorithm tries to learn by itself
Hierarchical Clustering
Hierarchical clustering groups observations into multiple levels of sets; the top-level set includes all of the data, and the bottom-level sets contain individual observations. The levels in between contain sets of observations with similar features.
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt
# Standardize features by removing the mean and scaling to unit variance
data = StandardScaler().fit_transform(data)
# build our model from data
clust = linkage(data)
# draw the dendrogram visualization
dendrogram(clust)
plt.show()
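The dendrogram only visualizes the hierarchy. To cut it into a fixed number of flat clusters, scipy's fcluster can be used; a brief sketch continuing from clust above:

from scipy.cluster.hierarchy import fcluster

# cut the hierarchy so at most 3 flat clusters remain
labels = fcluster(clust, t=3, criterion='maxclust')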
K-Means Clustering
We want to cluster the data into k groups. First randomly choose k points from the dataset as initial centers. Then assign every other data point to the group whose center it is closest to. After all points are assigned, recompute the center of each group as the mean of the points in that group. Repeat until no points change group assignment in an iteration.
from sklearn import cluster
k = 3
kmeans = cluster.KMeans(n_clusters=k)  # cluster into k groups
kmeans.fit(data)
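After fitting, the group assignments and centers live on the model object; a brief sketch with made-up 2-D points:

import numpy as np
from sklearn import cluster

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 2))   # made-up 2-D points

kmeans = cluster.KMeans(n_clusters=3)
kmeans.fit(data)

print(kmeans.labels_)            # group index assigned to each point
print(kmeans.cluster_centers_)   # the k group centers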