The goal of this course is to give you high-level exposure to a wide range of Data Science techniques and Machine Learning models. From the basics of getting your Jupyter environment set up, to manipulating and visualizing data, to building supervised and unsupervised models, this class aims to give you the base intuition and skillset to continue developing and working on ML projects. We hope you exit the course with an understanding of how models and optimization techniques work, as well as the confidence and tools to solve future problems on your own.
Lec2 Data Manipulation
Introduction to Pandas
- Series: one-dimensional array
- DataFrame: 2-D table
- Filtering DataFrames: loc
- Cleaning-Up DataFrames: df.dropna(), df[df['Open'].notnull()] (these two methods both return a new DataFrame instead of modifying the existing one)
- View DataFrames: head, tail, …
- Summary Statistics: mean, median, …, describe
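The operations listed above can be sketched on a tiny table. This is a minimal sketch, not the lecture's own code: the data values and the 'Close' column are made up for illustration, and only the 'Open' column name comes from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the 'Open' column name appears in the notes.
s = pd.Series([1.5, 2.5, 3.5], name="price")   # Series: one-dimensional array
df = pd.DataFrame({                             # DataFrame: 2-D table
    "Open": [10.0, np.nan, 12.0, 13.0],
    "Close": [10.5, 11.5, np.nan, 13.5],
})

# Filtering with loc: rows where Open is above 10, keeping only the 'Open' column
high_open = df.loc[df["Open"] > 10, ["Open"]]

# Cleaning up: both calls return a new DataFrame and leave df unchanged
no_missing = df.dropna()                  # drops every row that has any NaN
open_known = df[df["Open"].notnull()]     # keeps rows where 'Open' is present

# Viewing and summarizing
print(df.head(2))
print(df["Open"].mean(), df["Open"].median())
print(df.describe())
```

Note that df.dropna() removes a row if any column is missing, while the notnull() filter only looks at the 'Open' column.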
Dealing with missing data
- Fill in a placeholder value of our choice:
    # if there is no record of which cabin a passenger is in, we assume they are on the Top Deck
    df['Cabin'] = df['Cabin'].fillna('Top Deck')
- Use summary statistics: fill missing entries with the median or mean (works well with small datasets)
- Use regression and clustering: will be covered later
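The first two strategies can be sketched together. This is a minimal sketch assuming a small made-up passenger table; the 'Age' column and all values are hypothetical, and only 'Cabin' and 'Top Deck' come from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical passenger data with gaps in both columns
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Cabin": ["C85", None, None, "E46"],
})

# Strategy 1: fill with a value of our choice
df["Cabin"] = df["Cabin"].fillna("Top Deck")

# Strategy 2: fill with a summary statistic (here the median of the known ages)
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

print(df)
```

The median is less sensitive to outliers than the mean, which is one reason to prefer it for skewed columns.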
Lec3 Data Visualization
Types of Graphs
- Heatmap
- Correlation Plots
Coloring Graphs
plt.scatter(Longitude, Latitude, c=Temp.values.ravel(), cmap=plt.cm.OrRd)
colors a scatter plot based on the values of Temp, using the color scheme cm.OrRd. Find more color schemes in the matplotlib manual.
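A self-contained version of the call above might look like the following. It is a sketch under assumptions: Longitude, Latitude and Temp are filled with synthetic values here, since the notes do not show how they were loaded.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins for the lecture's data (assumed shapes and ranges)
rng = np.random.default_rng(0)
Longitude = pd.Series(rng.uniform(-120, -70, 50))
Latitude = pd.Series(rng.uniform(25, 50, 50))
Temp = pd.DataFrame({"Temp": 100 - 1.5 * Latitude})  # warmer further south

fig, ax = plt.subplots()
# c= takes one value per point; cmap= maps those values to colors
points = ax.scatter(Longitude, Latitude, c=Temp.values.ravel(), cmap=plt.cm.OrRd)
fig.colorbar(points, ax=ax, label="Temp")  # legend for the color scale
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

In a notebook, drop the matplotlib.use("Agg") line and call plt.show() to display the figure.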
Lec4 Linear Regression
Preparing Data
from sklearn.linear_model import LinearRegression
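A sketch of what data preparation can look like, assuming a synthetic dataset: the column names x1, x2, y and the 80/20 split are illustrative choices, not taken from the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression  # imported here, used when fitting
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on two features plus a little noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 3.0 * df["x1"] - 2.0 * df["x2"] + 5.0 + rng.normal(scale=0.1, size=100)

# Separate the features (X) from the target (y)
X = df[["x1", "x2"]]
y = df["y"]

# Hold out 20% of the rows for evaluating the model later
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```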
Predicting and Fitting
# creates Linear Regression model
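A minimal sketch of fitting and predicting, assuming synthetic data and a simple first-80 / last-20 row split chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed): two features, linear target with small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# creates Linear Regression model
model = LinearRegression()

# fit: learn the coefficients from the training rows
model.fit(X[:80], y[:80])

# predict: apply the learned coefficients to unseen rows
predictions = model.predict(X[80:])
print(predictions[:5])
```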
Describing the Model
# Gives a comprehensive view of Y = a1*x1 + a2*x2 + ... + k
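One way to read the fitted equation off a scikit-learn model is through its coef_ and intercept_ attributes. The sketch below assumes synthetic data and hypothetical feature names x1 and x2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from Y = 3*x1 - 2*x2 + 5 plus small noise (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# coef_ holds a1, a2, ...; intercept_ holds k
feature_names = ["x1", "x2"]  # hypothetical names for illustration
terms = [f"{a:.2f}*{name}" for a, name in zip(model.coef_, feature_names)]
equation = "Y = " + " + ".join(terms) + f" + {model.intercept_:.2f}"
print(equation)
```

The recovered coefficients should land close to the values used to generate the data, which is a quick sanity check on the fit.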
Lec5 Measuring Model’s Accuracy
When determining accuracy, we usually want to compare our model to a baseline. Therefore, rather than only measuring how far our model's predictions are from each specific y value, we also compare the model against a baseline that always predicts the mean y value.
from sklearn.metrics import mean_squared_error
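The baseline comparison can be sketched as follows. The data is synthetic, and expressing the comparison as a ratio (the R² score) is one common way to do it, not necessarily the one used in the lecture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumed): one feature, linear target with noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# Error of our model against each actual y value
model_mse = mean_squared_error(y, y_pred)

# Error of the baseline that always predicts the mean y value
baseline_pred = np.full_like(y, y.mean())
baseline_mse = mean_squared_error(y, baseline_pred)

# R^2 compares the two: 1 is perfect, 0 is no better than the baseline
r2 = 1 - model_mse / baseline_mse
print(model_mse, baseline_mse, r2, r2_score(y, y_pred))
```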
- overfitting: the model is too specific to the data it was trained on, so it predicts poorly on any other data
- underfitting: the model is too simple; no matter what data you use to train it, it gives roughly the same curve, so it has no predictive power either because it fails to capture the pattern in the data.
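Both failure modes can be illustrated by fitting polynomials of different degrees to noisy data. This is a sketch under assumptions: the sine-shaped data and the degrees 0, 3 and 9 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples from one period of a sine wave (assumed example data)
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with these coefficients
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (0, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])
```

Degree 0 is a flat line that ignores the pattern (underfitting). A high degree drives the training error down, and a test error that is much larger than the training error is the usual sign of overfitting.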
Lec6 Classifiers
Linear regression is used to predict the value of a continuous variable. Classifiers are used to predict categorical or binary variables.
KNN
from sklearn.model_selection import train_test_split
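A minimal KNN sketch, assuming scikit-learn's bundled iris dataset and k = 5; neither choice comes from the lecture.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled example dataset (assumed): 150 flowers, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each test point is labeled by a majority vote of its 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)  # fraction of test points classified correctly
print(accuracy)
```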
Lec7 Other Supervised Learning Models
Decision Trees
from sklearn import tree
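A minimal decision tree sketch, assuming the bundled iris dataset as a stand-in for the lecture's data.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled example dataset (assumed)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fit a tree with no depth limit; it keeps splitting until leaves are pure
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)

# Print the learned if/else rules as text
print(tree.export_text(clf, feature_names=list(iris.feature_names)))
```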
How to reduce overfitting?
- Reduce the number of levels (depth) of the tree
- Train multiple decision trees (for example, one per subset of the training data) and take the average of their predictions as the final result
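Both ideas can be sketched with scikit-learn. The first uses the max_depth parameter. For the second, this sketch swaps in a random forest, which trains many trees on resampled data and averages them; the notes do not say which ensemble method the lecture used. The dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy classification data (assumed); flip_y mislabels some points
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: memorizes the training set, noise included
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Option 1: limit the number of levels
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Option 2: train many trees on resampled data and average their predictions
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

for name, model in [("deep", deep_tree), ("shallow", shallow_tree), ("forest", forest)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```

A large gap between the training score and the test score is the sign of overfitting to look for in the printed output.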
Logistic Regression
The output value is always between 0 and 1. Accept if the value is higher than a chosen threshold, reject if it is lower.
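The thresholding step can be sketched with scikit-learn's LogisticRegression. The synthetic dataset and the threshold values 0.5 and 0.8 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumed)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression().fit(X, y)

# Probability of the positive class: always between 0 and 1
proba = model.predict_proba(X)[:, 1]

# Accept when the value is above the threshold, reject otherwise
threshold = 0.5
accepted = proba > threshold

# A stricter threshold accepts fewer (or the same number of) points
accepted_strict = proba > 0.8
print(accepted.sum(), accepted_strict.sum())
```

Raising the threshold trades missed positives for fewer false accepts, so the right value depends on which error is more costly.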
K-fold Cross Validation
Rather than doing the train-test split only once, we do it k times: first separate the sample into k pieces, then each time take one piece as the test set and the others as the training set. Use KFold from sklearn.model_selection to achieve this. Calculate a score for each split and take the average as the final score. This averaged score is usually a closer estimate of the real error than the score from a single split.
from sklearn.model_selection import KFold
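The procedure can be sketched as a loop over the folds. This sketch assumes the bundled iris dataset, a KNN classifier and k = 5; all three are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split the sample into k = 5 pieces; shuffle because iris is sorted by class
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # Each piece serves as the test set once; the rest is the training set
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# The final score is the average over the k splits
final_score = float(np.mean(scores))
print(scores, final_score)
```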
Lec9 Unsupervised Learning
- Supervised Learning: The desired solution (target) is included in the dataset
- Unsupervised Learning: The training data is unlabeled and the algorithm tries to learn structure by itself
Hierarchical Clustering
Hierarchical clustering groups observations into multiple levels of sets; the top-level set includes all of the data, and the bottom-level sets contain individual observations. The levels in between contain sets of observations with similar features.
from sklearn.preprocessing import StandardScaler
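A minimal sketch of hierarchical clustering, assuming scikit-learn's AgglomerativeClustering with Ward linkage (the notes only show the StandardScaler import, so the clustering class is an assumption) and synthetic data with features on very different scales.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Two synthetic groups (assumed); the second feature is on a much larger scale
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0, 0], scale=[0.3, 30], size=(20, 2))
group_b = rng.normal(loc=[5, 500], scale=[0.3, 30], size=(20, 2))
X = np.vstack([group_a, group_b])

# Standardize so that no feature dominates the distance computation
X_scaled = StandardScaler().fit_transform(X)

# Start with each observation as its own set and merge the closest sets
# bottom-up until only n_clusters sets remain
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X_scaled)
print(labels)
```

Cutting the hierarchy at a different level, by changing n_clusters, gives coarser or finer groupings of the same data.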
K-Means Clustering
We want to cluster the data into k groups. We first randomly choose k points in the dataset as the initial group centers. Then we assign every data point to the group whose center it is closest to. After assigning all data points to some group, we recompute the center of each group by taking the mean of all points in that group. Repeat the assign and recompute steps until no point changes group assignment in an iteration.
from sklearn import cluster
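The steps described above can be sketched directly in NumPy and set beside scikit-learn's cluster.KMeans. The kmeans helper below is a hypothetical illustration, not the library's implementation, and the three-blob dataset is synthetic.

```python
import numpy as np
from sklearn import cluster

# Three synthetic, well-separated blobs (assumed example data)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])
])

def kmeans(X, k, rng, max_iter=100):
    # Randomly choose k points of the dataset as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assign each point to the group whose center is closest
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when no point changed group during this iteration
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each center as the mean of its points
        # (an empty group keeps its previous center)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

labels, centers = kmeans(X, k=3, rng=rng)

# The library version; n_init=10 reruns with different starts and keeps the best
sk_model = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(centers)
print(sk_model.cluster_centers_)
```

Because the starting centers are random, a single run can settle on a poor grouping, which is why the library reruns the algorithm several times.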