Introduction to Data Science
Goals/Anti-goals
Goal: Provide a tour of several different libraries
Goal: Demonstrate the synergy of the “Python Universe”
Goal: Demonstrate the paradigm “get an idea, then translate it into code”
Goal: Demonstrate the paradigm “write one line, test it, then continue”
Anti-goal: Teach specific functions and specific arguments
Anti-goal: Confuse you
Roadmap
Obtain dataset from the internet
Prepare dataset for analysis with pandas
Cluster dataset with sklearn
Dimensionality reduction with UMAP, visualization with matplotlib
A) Obtain dataset from the internet
Description:
23 species
Coincidentally, also 23 columns (1 class label + 22 features)
8124 samples (rows)
One column (labeled 11 once the file is loaded below) contains missing data
Image of dataset:
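The notebook below reads the file from a hard-coded local path. As a hedged sketch (the UCI download URL is an assumption, not something this notebook uses), one way to obtain the raw file is:
# sketch: download the raw data file (the URL is assumed to be the classic UCI mirror path)
import urllib.request
uci_url='https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data'
urllib.request.urlretrieve(uci_url,'../data/agaricus-lepiota.csv')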
B) Prepare dataset for analysis using pandas
pandas
a Python library
well suited to matrix-like (tabular) data of mixed types (numbers, strings, etc.), as sketched below
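A toy illustration (not part of the mushroom analysis; the column names here are made up) of how a single DataFrame holds mixed types:
# toy example (hypothetical data): strings, floats, and booleans side by side in one DataFrame
import pandas as pd
toy=pd.DataFrame({'species':['fly agaric','porcini'],'cap_diameter_cm':[12.0,8.5],'edible':[False,True]})
print(toy.dtypes)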
Overall Strategy
Get dataset into python
Deal with missing data (drop column)
Transform letters into numbers (one-hot encoding)
0) Get dataset into python
get pandas library
get a hard coded address for the file
try a simple usage of read_csv
check our work
update our usage of read_csv and check work again
[26]:
# get library
import pandas as pd
[27]:
# get a hard coded address for the file
mushroom_dataset_address='../data/agaricus-lepiota.csv'
[28]:
# try a simple usage of read_csv
my_Panda=pd.read_csv(mushroom_dataset_address)
[29]:
# check our work
my_Panda
[29]:
| | p | x | s | n | t | p.1 | f | c | n.1 | k | ... | s.2 | w | w.1 | p.2 | w.2 | o | p.3 | k.1 | s.3 | u |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
1 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
2 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
3 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
4 | e | x | y | y | t | a | f | c | b | n | ... | s | w | w | p | w | o | p | k | n | g |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8118 | e | k | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | b | c | l |
8119 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | n | o | p | b | v | l |
8120 | e | f | s | n | f | n | a | c | b | n | ... | s | o | o | p | o | o | p | b | c | l |
8121 | p | k | y | n | f | y | f | c | n | b | ... | k | w | w | p | w | o | e | w | v | l |
8122 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | o | c | l |
8123 rows × 23 columns
[30]:
# update our usage of read_csv and check work again
my_Panda=pd.read_csv(mushroom_dataset_address,header=None)
my_Panda
[30]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8119 | e | k | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | b | c | l |
8120 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | n | o | p | b | v | l |
8121 | e | f | s | n | f | n | a | c | b | n | ... | s | o | o | p | o | o | p | b | c | l |
8122 | p | k | y | n | f | y | f | c | n | b | ... | k | w | w | p | w | o | e | w | v | l |
8123 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | o | c | l |
8124 rows × 23 columns
1) Deal with missing data (drop column)
We need a fast and clear approach
use the function DataFrame.drop
check our work
[31]:
# use the function DataFrame.drop
## labels indicates the "name" of the column to drop (here the integer label 11)
## axis indicates whether we are dropping from columns or rows (the label 11 exists as both a row index and a column label)
## note that we do not use inplace=True; drop returns a modified copy, which we store in a new variable
my_Panda_column_dropped=my_Panda.drop(labels=11,axis='columns')
[32]:
# check our work
my_Panda_column_dropped
[32]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8119 | e | k | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | b | c | l |
8120 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | n | o | p | b | v | l |
8121 | e | f | s | n | f | n | a | c | b | n | ... | s | o | o | p | o | o | p | b | c | l |
8122 | p | k | y | n | f | y | f | c | n | b | ... | k | w | w | p | w | o | e | w | v | l |
8123 | e | x | s | n | f | n | a | c | b | y | ... | s | o | o | p | o | o | p | o | c | l |
8124 rows × 22 columns
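As a quick sanity check (an added sketch, not a cell from the original notebook), one could confirm that column 11 was indeed the one holding the '?' placeholders and that none remain after the drop:
# sketch: count '?' entries per column before and after the drop
print((my_Panda == '?').sum())                        # we expect column 11 to be the only nonzero count
print((my_Panda_column_dropped == '?').sum().sum())   # we expect 0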
2) Transform letters into numbers
Why?
Background: clustering
8,124 samples -> which samples come from the same species?
Clustering: Group datapoints based on location
Image
source: https://miro.medium.com/max/1200/1*rw8IUza1dbffBhiA4i0GNQ.png
One-hot encoding
To cluster, we need to put data points in spatial locations
Right now each datapoint is a row of 22 letters (after dropping the column with missing data)
We use the strategy of “one-hot encoding”
image
source: https://miro.medium.com/max/875/1*ggtP4a5YaRx6l09KQaYOnw.png
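A toy illustration of the idea (hypothetical column name, separate from the mushroom data): pd.get_dummies turns one letter-coded column into one indicator column per letter.
# toy example: one-hot encode a single letter-coded column
import pandas as pd
toy=pd.DataFrame({'cap_color':['n','y','w','n']})
print(pd.get_dummies(toy,columns=['cap_color']))   # yields cap_color_n, cap_color_w, cap_color_y indicator columns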
Code
[33]:
# one hot encoding
## this time, we get a new copy of the dataset back instead of operating on my_Panda_column_dropped directly
## we specify what to encode using 'data'
## we provide the list of columns using my_Panda_column_dropped.columns - we could also have typed out the full list of column labels by hand
my_Panda_dummies=pd.get_dummies(data=my_Panda_column_dropped,columns=my_Panda_column_dropped.columns)
[34]:
# check our work
my_Panda_dummies
[34]:
| | 0_e | 0_p | 1_b | 1_c | 1_f | 1_k | 1_s | 1_x | 2_f | 2_g | ... | 21_s | 21_v | 21_y | 22_d | 22_g | 22_l | 22_m | 22_p | 22_u | 22_w |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8119 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8120 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8121 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8122 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8123 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
8124 rows × 114 columns
C) Cluster the data with sklearn
Background:
In the above image, the data had 2 dimensions, so we could visually assess the clustering
For the n-dimensional case, we turn to a clustering algorithm.
Within sklearn, we choose the K-means algorithm.
Because of time constraints, we treat KMeans as a black box
Image: Black box
Black box: give input, get output, don't ask how
Step 1) Parameters: arguments that our chosen algorithm requires
Step 2) Input: my_Panda_dummies; Output: a list of cluster labels
Giving parameters to K-means black box
KMeans requires the number of clusters (23)
[35]:
from sklearn.cluster import KMeans
[36]:
#declare our black box by calling KMeans
##we give the black box/calculator the name my_KMeans_tool
my_KMeans_tool=KMeans(n_clusters=23)
[37]:
print(my_KMeans_tool)
KMeans(n_clusters=23)
[39]:
#we haven't told my_KMeans_tool about our data yet, so we expect this to fail
print(my_KMeans_tool.labels_)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-9d53e9cea4d7> in <module>
1 #we haven't told my_KMeans_tool about our data yet, so we expect this to fail
----> 2 print(my_KMeans_tool.labels_)
AttributeError: 'KMeans' object has no attribute 'labels_'
Black Box Operation
sklearn (often) relies on the function “fit_transform” to connect the data with the algorithm
[45]:
#tell my_KMeans_tool about our data
## we ignore the output
## notice how sklearn's KMeans flawlessly accepts a pandas DataFrame as input
my_KMeans_tool.fit_transform(X=my_Panda_dummies)
[45]:
array([[3.77102947, 4.54519275, 4.26060768, ..., 3.7859389 , 3.64273399,
5.07033858],
[3.58657502, 5.01968512, 3.73980095, ..., 3.05505046, 3.88277768,
4.85197554],
[3.44952673, 5.08565682, 3.87835876, ..., 3.41565026, 3.96295279,
5.11940752],
...,
[4.51655119, 5.20279094, 4.51386752, ..., 4.65474668, 4.1459457 ,
5.16801058],
[4.4688548 , 2.70655571, 4.72434593, ..., 4.61880215, 3.59594233,
4.76532615],
[4.59105719, 5.20279094, 4.44565956, ..., 4.72581563, 4.1672875 ,
5.26387057]])
[46]:
#now we see that we have the labels that we expected
print(my_KMeans_tool.labels_)
print(len(my_KMeans_tool.labels_))
[ 4 11 11 ... 10 8 10]
8124
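As an added sketch (not a cell from the original notebook), one could also peek at how many mushrooms landed in each of the 23 clusters:
# sketch: count how many samples fall into each KMeans cluster
import numpy as np
cluster_ids,cluster_sizes=np.unique(my_KMeans_tool.labels_,return_counts=True)
print(dict(zip(cluster_ids.tolist(),cluster_sizes.tolist())))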
Done?
We got a label for every data point.
Philosophical viewpoint: we did our part perfectly; the algorithm worked exactly as intended.
Are our clusters meaningful?
Hard to say. We could try to relate our clusters to some external dataset. We don't have that here.
Let's try a visualization.
In our case, we are going to:
* Take our 114-dimensional dataset (114 columns)
* Reduce the dimensionality to 2
* Plot all datapoints, getting (x, y) from the dimensionality reduction and colors from the clustering labels
If each color of datapoints is well separated from every other color, then we have some assurance that our dataset is “robustly clusterable”. That's it.
Dimensionality Reduction and Visualization
Naive approach:
choose two columns and drop all of the rest
Which algorithm?
PCA, PLS-DA, t-SNE, UMAP
We want to demonstrate the breadth/synergy of the Python ecosystem, so we choose UMAP (not part of sklearn)
Even though UMAP is not in sklearn, the authors of UMAP wrote it in such a way that it “works like” sklearn functions.
Strategy
create our UMAP black box
operate that black box on our dataset
obtain a 2-d coordinate pair for every datapoint
plot those pairs, with the color of the datapoint reflecting the clustering label from KMeans
Code
[47]:
# create our black box
## import our library
import umap
## declare a couple of parameters that we won't get into
n_neighbors=10
min_dist=0.1
n_components=2
metric='euclidean'
##just like last time, we declare a "UMAP calculator" and send a couple of parameters
my_UMAP=umap.UMAP(n_neighbors=n_neighbors,min_dist=min_dist,n_components=n_components,metric=metric)
[48]:
print(my_UMAP)
UMAP(dens_frac=0.0, dens_lambda=0.0, n_neighbors=10)
[49]:
# again, we use fit_transform to inform our calculator about our dataset
## again, notice how it flawlessly accepts a pandas DataFrame
## this time, we actually want the result that gets handed back
my_UMAP.fit_transform(my_Panda_dummies)
[49]:
array([[ -7.6440673 , -12.353603 ],
[ -0.26427558, 8.624089 ],
[ -0.47456995, 7.971237 ],
...,
[ 11.546015 , -6.2244453 ],
[ 19.653965 , 8.905327 ],
[ 11.451685 , -6.2179003 ]], dtype=float32)
We rerun fit_transform, this time capturing the 2-D coordinate list
[50]:
# steps 2) and 3) of the strategy: run the black box and keep the 2-D coordinates
my_numpy_2d=my_UMAP.fit_transform(my_Panda_dummies)
print(len(my_numpy_2d))
8124
[51]:
import matplotlib.pyplot as plt
[52]:
# step 4) of the strategy
# scatter plot takes:
# a list of x coordinates
# a list of y coordinates
plt.scatter(
my_numpy_2d[:,0],
my_numpy_2d[:,1]
)
plt.show()
[53]:
# step 4) redone, with arguments to color the points according to the KMeans cluster labels,
# plus an opacity parameter
plt.scatter(
my_numpy_2d[:,0],
my_numpy_2d[:,1],
#a label list
c=my_KMeans_tool.labels_,
#a color map (literally a rainbow)
cmap='gist_rainbow',
#make the points mildly transparent so we can see generalities
alpha=0.2
)
plt.show()
Observations
Our dimensionality reduction technique independently produced 23 visually discernible clusters, but those clusters only partially agree with the KMeans clustering
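One way to put a number on "partially agree" (an added sketch, not part of the original analysis): run a second KMeans on the 2-D UMAP coordinates and compare the two label sets with sklearn's adjusted Rand index, which is near 1 for matching clusterings and near 0 for unrelated ones.
# sketch: quantify agreement between the KMeans-on-dummies and KMeans-on-UMAP labelings
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
labels_on_umap=KMeans(n_clusters=23).fit_predict(my_numpy_2d)
print(adjusted_rand_score(my_KMeans_tool.labels_,labels_on_umap))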
Bonus: Dim Reduction with PCA
[54]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# fit PCA on the one-hot-encoded data and get every sample's coordinates in PCA space
my_PCA=PCA()
my_PCAd_coordinates=my_PCA.fit_transform(my_Panda_dummies)
# scree plot: explained variance ratio of each principal component
plt.scatter(range(len(my_PCA.explained_variance_ratio_)),my_PCA.explained_variance_ratio_)
plt.show()
# shape checks: 8124 samples by 114 components, then a single column of that matrix
print(my_PCAd_coordinates.shape)
print(my_PCAd_coordinates[:,0].shape)
# scatter the first two principal components, colored by the KMeans cluster labels
plt.scatter(my_PCAd_coordinates[:,0],my_PCAd_coordinates[:,1],c=my_KMeans_tool.labels_,cmap='gist_rainbow',alpha=0.2)
plt.show()
(8124, 114)
(8124,)
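As a small follow-up sketch (an addition, not part of the original cell), the scree plot can be summarized by asking how much of the total variance the two plotted components actually capture:
# sketch: fraction of total variance captured by the first two principal components
print(my_PCA.explained_variance_ratio_[:2].sum())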