Yao Lirong's Blog

INFO1998 Intro to Machine Learning (sklearn, pandas)

2020/10/02

The goal of this course is to provide you with a high-level exposure to a wide range of Data Science techniques and Machine Learning models. From the basics of getting your Jupyter environment set up, to manipulating and visualizing data, to building supervised and unsupervised models, this class aims to give you the base intuition and skillset to continue developing and working on ML projects. We hope you exit the course with an understanding of how models and optimization techniques work, as well as the confidence and tools to solve future problems on your own.

Lec2 Data Manipulation

Introduction to Pandas

  • Series: one-dimensional array
  • DataFrame: 2-D table
    • Filtering DataFrames: loc
    • Cleaning Up DataFrames: df.dropna(), df[df['Open'].notnull()] (both return a new DataFrame instead of modifying the existing one)
    • Viewing DataFrames: head, tail, …
    • Summary Statistics: mean, median, …, describe (see the sketch below)
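
A minimal sketch of these operations on a hypothetical stock-price DataFrame (the 'Open' column mirrors the notnull example above; all values are made up):

import pandas as pd
import numpy as np

# hypothetical DataFrame with a missing value in 'Open'
df = pd.DataFrame({'Open': [100.0, np.nan, 102.5],
                   'Close': [101.0, 99.5, 103.0]})

s = df['Open']                     # a single column is a Series
rows = df.loc[df['Close'] > 100]   # filter rows with loc
clean = df.dropna()                # returns a new DataFrame; df is unchanged
clean2 = df[df['Open'].notnull()]  # same effect via boolean indexing
df.head()                          # view the first rows
df.describe()                      # summary statistics (count, mean, std, ...)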

Dealing with missing data

  • Fill in some value of our choice:

    # if there is no record of which cabin a passenger is in, we assume they are on the Top Deck
    df['Cabin'] = df['Cabin'].fillna('Top Deck')

  • Use summary statistics: fill missing entries with the median or mean

    works well with small datasets (see the sketch after this list)

  • Use regression and clustering: will be covered later
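
A minimal sketch of the median strategy, assuming a hypothetical 'Age' column with missing entries:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan, 58.0]})

# median() skips NaN by default, so it is computed from the observed values only
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)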

Lec3 Data Visualization

Types of Graphs

  • Heatmap
  • Correlation Plots

Coloring Graphs

plt.scatter(Longitude, Latitude, c=Temp.values.ravel(), cmap=plt.cm.OrRd) colors a scatter plot based on the values of Temp with the color scheme cm.OrRd. More color schemes can be found in the matplotlib manual.
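
A self-contained sketch with made-up longitude/latitude/temperature arrays (the variable names mirror the snippet above and are purely illustrative):

import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
longitude = rng.uniform(-120, -70, 100)
latitude = rng.uniform(25, 50, 100)
temp = rng.uniform(0, 35, 100)

# color each point by its temperature; OrRd maps low values to light orange, high to dark red
plt.scatter(longitude, latitude, c=temp, cmap=plt.cm.OrRd)
plt.colorbar(label='Temperature')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()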

Lec4 Linear Regression

Preparing Data

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# X must be a table (in case there are multiple x in y = a1*x1 + a2*x2 + ... + k)
X = data[['cost','compl_4']]
# Y must be one column
Y = data['median_earnings']

# the test set is 20% of all data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

Predicting and Fitting

# create a Linear Regression model
LR = LinearRegression()
# LR is an object; calling fit sets all of its coefficients
LR.fit(x_train, y_train)
# predict() returns the predicted values
y_predicted = LR.predict(x_test)
# score(x, y') first computes the predicted values y based on x and our model,
# then compares them with y' (for LinearRegression this is the R^2 score)
score = LR.score(x_test, y_test)

Describing the Model

# gives a comprehensive view of Y = a1*x1 + a2*x2 + ... + k (IPython/Jupyter help)
LR?

# coefficients of x (a1, a2, ...)
LR.coef_

# intercept k
LR.intercept_

Lec5 Measuring Model’s Accuracy

When determining accuracy, we usually want to compare our model to a baseline. Therefore, instead of only checking our model's predictions against each specific y value, we also compare it with a baseline "model" that always predicts the mean y value.

import numpy as np
from sklearn.metrics import mean_squared_error

celcius_MSE = mean_squared_error(y_test, celcius_predictions)

# baseline "model": always predict the mean of the test targets
test_goal_mean = y_test.mean()
baseline = np.full((len(celcius_predictions),), test_goal_mean)
baseline_MSE = mean_squared_error(y_test, baseline)

  • overfitting: the model is too specific to the training data, so it cannot predict any other data
  • underfitting: the model is too simple; no matter what data you use to train it, it gives roughly the same curve, so it has no predictive power either because it doesn't capture any pattern in the data

Lec6 Classifiers

Linear regression is used to predict the value of a continuous variable. Classifiers are used to predict categorical or binary variables.

KNN

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

k = 10
model = KNeighborsClassifier(n_neighbors=k)  # classify by the k nearest neighbors
model.fit(x_train, y_train)
predictions = model.predict(x_test)
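
To check how the classifier does on the held-out set, compare its predictions with the true labels; a short sketch with accuracy_score:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions))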

Lec7 Other Supervised Learning Models

Decision Trees

from sklearn import tree

# limit the tree to 5 levels
model = tree.DecisionTreeClassifier(max_depth=5)
model.fit(x_train, y_train)
predictions = model.predict(x_test)

How to reduce overfitting?

  • Reduce the number of levels in the tree
  • Train multiple decision trees (each on a different random subset of the training data) and take the average of their predictions as the final result (see the sketch below)
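
This averaging idea is essentially what a random forest does; a minimal sketch with sklearn's RandomForestClassifier, reusing the x_train/y_train split from above:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each fit on a random sample of the data;
# the forest aggregates the trees' votes into one prediction
model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(x_train, y_train)
predictions = model.predict(x_test)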

Logistic Regression

The output is always between 0 and 1 and can be read as a probability. Accept (predict the positive class) if the value is higher than a threshold, reject if lower.
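
A minimal sketch with sklearn's LogisticRegression, again assuming the x_train/y_train split from above (predict applies the default 0.5 threshold):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)

# predict_proba returns class probabilities, always between 0 and 1
probabilities = model.predict_proba(x_test)
# predict applies the threshold and returns class labels directly
predictions = model.predict(x_test)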

K-fold Cross Validation

Rather than doing the test-train split only once, we do it k times: first separate our sample into k pieces, and each time take one of them as the test set and the others as the training set. Use from sklearn.model_selection import KFold to achieve this. Calculate a score for each split and take the average as the final score. This score is usually closer to the real error.

from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

incX = inc_data[['education.num']]
incY = inc_data['income']

kf = KFold(n_splits=5)
accuracy = 0
for train_index, test_index in kf.split(incX):
    X_train = incX.iloc[train_index]
    Y_train = incY.iloc[train_index]
    X_test = incX.iloc[test_index]
    Y_test = incY.iloc[test_index]

    # best_depth is the depth level of the decision tree that scored highest in the previous problem
    model = tree.DecisionTreeClassifier(max_depth=best_depth)
    model.fit(X_train, Y_train)
    pred_test = model.predict(X_test)
    accuracy += accuracy_score(Y_test, pred_test)

accuracy /= 5
print(accuracy)
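
sklearn can also run this split-fit-score loop in a single call; a sketch with cross_val_score (note that for classifiers cv=5 uses stratified splits, so the number can differ slightly from the manual KFold loop above):

from sklearn.model_selection import cross_val_score

model = tree.DecisionTreeClassifier(max_depth=best_depth)
scores = cross_val_score(model, incX, incY, cv=5)
print(scores.mean())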

Lec9 Unsupervised Learning

  • Supervised Learning: the desired solution (target) is also included in the dataset
  • Unsupervised Learning: the training data is unlabeled and the algorithm tries to learn by itself

Hierarchical Clustering

Hierarchical clustering groups observations into multiple levels of sets; the top-level set includes all of the data, and the bottom-level sets contain individual observations. The levels in between contain sets of observations with similar features.

from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# standardize features by removing the mean and scaling to unit variance
data = StandardScaler().fit_transform(data)
# build our model from the data
clust = linkage(data)
# draw the dendrogram visualization
dendrogram(clust)
plt.show()

K-Means Clustering

We want to cluster the data into k groups. We first randomly choose k points in the dataset as initial centers. Then we assign every data point to the center it is closest to. After assigning all data points to a group, we recompute the center of each group by taking the mean of all points in that group. Repeat this process until no point changes group assignment after an iteration.

from sklearn import cluster

k = 3
kmeans = cluster.KMeans(n_clusters=k)  # cluster into k groups
kmeans.fit(data)
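
After fitting, the group assignment of each point and the final group centers are available as attributes:

print(kmeans.labels_)           # group index assigned to each data point
print(kmeans.cluster_centers_)  # coordinates of each group's center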