Yao Lirong's Blog

INFO1998 Intro to Machine Learning (sklearn, pandas)

2020/10/02

The goal of this course is to provide you with a high-level exposure to a wide range of Data Science techniques and Machine Learning models. From the basics of getting your Jupyter environment set up, to manipulating and visualizing data, to building supervised and unsupervised models, this class aims to give you the base intuition and skillset to continue developing and working on ML projects. We hope you exit the course with an understanding of how models and optimization techniques work, as well as the confidence and tools to solve future problems on your own.

Lec2 Data Manipulation

Introduction to Pandas

  • Series: one-dimensional array
  • DataFrame: 2-D table
    • Filtering DataFrames: loc
    • Cleaning Up DataFrames: df.dropna(), df[df['Open'].notnull()] (both return a new DataFrame instead of modifying the existing one)
    • Viewing DataFrames: head, tail, …
    • Summary Statistics: mean, median, …, describe (see the sketch below)
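
A minimal sketch of these operations on a hypothetical stock-price DataFrame (the 'Open' column mirrors the notnull example above; all values are made up):

import pandas as pd
import numpy as np

# hypothetical DataFrame with a missing value in 'Open'
df = pd.DataFrame({'Open': [100.0, np.nan, 102.5],
                   'Close': [101.0, 99.5, 103.0]})

s = df['Open']                     # a single column is a Series
rows = df.loc[df['Close'] > 100]   # filter rows with loc
clean = df.dropna()                # returns a new DataFrame; df is unchanged
clean2 = df[df['Open'].notnull()]  # same effect via boolean indexing
df.head()                          # view the first rows
df.describe()                      # summary statistics (count, mean, std, ...)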

Dealing with missing data

  • Fill in some value of our choice:

    # if there is no record of which cabin a passenger is in, we assume they are on the Top Deck
    df['Cabin'] = df['Cabin'].fillna('Top Deck')

  • Use summary statistics: fill missing entries with the median or mean

    works well with small datasets (see the sketch after this list)

  • Use regression and clustering: will be covered later
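
A minimal sketch of the median strategy, assuming a hypothetical 'Age' column with missing entries:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan, 58.0]})

# median() skips NaN by default, so it is computed from the observed values only
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)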

Lec3 Data Visualization

Types of Graphs

  • Heatmap
  • Correlation Plots

Coloring Graphs

plt.scatter(Longitude, Latitude, c=Temp.values.ravel(), cmap=plt.cm.OrRd) colors a scatter plot based on the values of Temp with the color scheme cm.OrRd. More color schemes can be found in the matplotlib manual.
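
A self-contained sketch with made-up longitude/latitude/temperature arrays (the variable names mirror the snippet above and are purely illustrative):

import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
longitude = rng.uniform(-120, -70, 100)
latitude = rng.uniform(25, 50, 100)
temp = rng.uniform(0, 35, 100)

# color each point by its temperature; OrRd maps low values to light orange, high to dark red
plt.scatter(longitude, latitude, c=temp, cmap=plt.cm.OrRd)
plt.colorbar(label='Temperature')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()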

Lec4 Linear Regression

Preparing Data

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# X must be a table (in case there are multiple x in y = a1*x1 + a2*x2 + ... + k)
X = data[['cost','compl_4']]
# Y must be one column
Y = data['median_earnings']

# the test set is 20% of all data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

Predicting and Fitting

# create a Linear Regression model
LR = LinearRegression()
# LR is an object; calling fit sets all of its coefficients
LR.fit(x_train, y_train)
# predict() returns the predicted values
y_predicted = LR.predict(x_test)
# score(x, y') first computes the predicted values y based on x and our model,
# then compares them with y' (for LinearRegression this is the R^2 score)
score = LR.score(x_test, y_test)

Describing the Model

# gives a comprehensive view of Y = a1*x1 + a2*x2 + ... + k (IPython/Jupyter help)
LR?

# coefficients of x (a1, a2, ...)
LR.coef_

# intercept k
LR.intercept_

Lec5 Measuring Model’s Accuracy

When determining accuracy, we usually want to compare our model to a baseline. Therefore, instead of only checking our model's predictions against each specific y value, we also compare it with a baseline "model" that always predicts the mean y value.

import numpy as np
from sklearn.metrics import mean_squared_error

celcius_MSE = mean_squared_error(y_test, celcius_predictions)

# baseline "model": always predict the mean of the test targets
test_goal_mean = y_test.mean()
baseline = np.full((len(celcius_predictions),), test_goal_mean)
baseline_MSE = mean_squared_error(y_test, baseline)

  • overfitting: the model is too specific to the training data, so it cannot predict any other data
  • underfitting: the model is too simple; no matter what data you use to train it, it gives roughly the same curve, so it has no predictive power either because it doesn't capture any pattern in the data

Lec6 Classifiers

Linear regression is used to predict the value of a continuous variable. Classifiers are used to predict categorical or binary variables.

KNN

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

k = 10
model = KNeighborsClassifier(n_neighbors=k)  # classify by the k nearest neighbors
model.fit(x_train, y_train)
predictions = model.predict(x_test)
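
To check how the classifier does on the held-out set, compare its predictions with the true labels; a short sketch with accuracy_score:

from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, predictions))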

Lec7 Other Supervised Learning Models

Decision Trees

from sklearn import tree

# limit the tree to 5 levels
model = tree.DecisionTreeClassifier(max_depth=5)
model.fit(x_train, y_train)
predictions = model.predict(x_test)

How to reduce overfitting?

  • Reduce the number of levels in the tree
  • Train multiple decision trees (each on a different random subset of the training data) and take the average of their predictions as the final result (see the sketch below)
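
This averaging idea is essentially what a random forest does; a minimal sketch with sklearn's RandomForestClassifier, reusing the x_train/y_train split from above:

from sklearn.ensemble import RandomForestClassifier

# 100 trees, each fit on a random sample of the data;
# the forest aggregates the trees' votes into one prediction
model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(x_train, y_train)
predictions = model.predict(x_test)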

Logistic Regression

The output is always between 0 and 1 and can be read as a probability. Accept (predict the positive class) if the value is higher than a threshold, reject if lower.
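
A minimal sketch with sklearn's LogisticRegression, again assuming the x_train/y_train split from above (predict applies the default 0.5 threshold):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)

# predict_proba returns class probabilities, always between 0 and 1
probabilities = model.predict_proba(x_test)
# predict applies the threshold and returns class labels directly
predictions = model.predict(x_test)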

K-fold Cross Validation

Rather than doing the test-train split only once, we do it k times: first separate our sample into k pieces, and each time take one of them as the test set and the others as the training set. Use from sklearn.model_selection import KFold to achieve this. Calculate a score for each split and take the average as the final score. This score is usually closer to the real error.

from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

incX = inc_data[['education.num']]
incY = inc_data['income']

kf = KFold(n_splits=5)
accuracy = 0
for train_index, test_index in kf.split(incX):
    X_train = incX.iloc[train_index]
    Y_train = incY.iloc[train_index]
    X_test = incX.iloc[test_index]
    Y_test = incY.iloc[test_index]

    # best_depth is the depth level of the decision tree that scored highest in the previous problem
    model = tree.DecisionTreeClassifier(max_depth=best_depth)
    model.fit(X_train, Y_train)
    pred_test = model.predict(X_test)
    accuracy += accuracy_score(Y_test, pred_test)

accuracy /= 5
print(accuracy)
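
sklearn can also run this split-fit-score loop in a single call; a sketch with cross_val_score (note that for classifiers cv=5 uses stratified splits, so the number can differ slightly from the manual KFold loop above):

from sklearn.model_selection import cross_val_score

model = tree.DecisionTreeClassifier(max_depth=best_depth)
scores = cross_val_score(model, incX, incY, cv=5)
print(scores.mean())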

Lec9 Unsupervised Learning

  • Supervised Learning: the desired solution (target) is also included in the dataset
  • Unsupervised Learning: the training data is unlabeled and the algorithm tries to learn by itself

Hierarchical Clustering

Hierarchical clustering groups observations into multiple levels of sets; the top-level set includes all of the data, and the bottom-level sets contain individual observations. The levels in between contain sets of observations with similar features.

from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from matplotlib import pyplot as plt

# standardize features by removing the mean and scaling to unit variance
data = StandardScaler().fit_transform(data)
# build our model from the data
clust = linkage(data)
# draw the dendrogram visualization
dendrogram(clust)
plt.show()

K-Means Clustering

We want to cluster the data into k groups. We first randomly choose k points in the dataset as initial centers. Then we assign every data point to the center it is closest to. After assigning all data points to a group, we recompute the center of each group by taking the mean of all points in that group. Repeat this process until no point changes group assignment after an iteration.

from sklearn import cluster

k = 3
kmeans = cluster.KMeans(n_clusters=k)  # cluster into k groups
kmeans.fit(data)
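
After fitting, the group assignment of each point and the final group centers are available as attributes:

print(kmeans.labels_)           # group index assigned to each data point
print(kmeans.cluster_centers_)  # coordinates of each group's center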