In this tutorial, you will apply data science skills to see how well you can classify a mushroom by its edibility. We will be using mushroom data provided by UC Irvine to train our machine learning algorithms, relying heavily on the Python toolkit scikit-learn. The tutorial walks through the process of creating decision trees and support vector machines for a binary classification problem whose features are all categorical, which often makes analysis slightly more difficult.
While mushrooms are not often life-threatening or life-changing, these supervised learning algorithms are applicable to almost any classification problem. And, of course, this isn't to say that mushrooms aren't ever life-changing.
Inside this tutorial: setting up the environment, cleaning and encoding the data, exploratory data analysis, decision trees, and support vector machines.
Before you begin, you will need to install a couple of things. First, make sure you have Anaconda installed on your computer. Then, use Anaconda's package manager to install 'pandas', 'numpy', 'matplotlib', 'graphviz', and 'sklearn'.
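For reference, one way to do this from an Anaconda prompt looks roughly like the line below (exact package names can vary with your setup; conda distributes sklearn as scikit-learn and the Python bindings for graphviz as python-graphviz):
conda install pandas numpy matplotlib python-graphviz scikit-learn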
Also, download the mushroom dataset here.
After setting up the environment and dataset, create a Jupyter Notebook and import the following packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from graphviz import Source
from operator import itemgetter
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import zero_one_loss
from sklearn import svm
from os import system
1) Import the "mushrooms.csv" file and create a dataframe.
A cursory glance at the dataset will tell you that each feature value is a single-character string encoding that feature's category. The first column is what you are trying to predict: edibility, where e = edible and p = poisonous. The dataset source page on Kaggle enumerates the possible values for each attribute.
We're lucky in this situation because the CSV file we want to read from is already pre-curated. We don't need to make numerous API requests, scrape HTML table data from a webpage, or travel the globe smelling mushrooms and taking notes: UC Irvine has done all of this for us.
pd.set_option('display.max_columns', None)
shrooms_df = pd.read_csv('mushrooms.csv')
shrooms_df.head(5)
2) Check to make sure the data is clean.
One way to do this is to define the classifications you expect from the dataset's description and then check that all of the values from the .csv file are valid according to those classifications. You should also look for any missing values and deal with them accordingly.
Missing Data and Imputation
Here, we deal with missing or unusual values by simply setting them to NaN. When we re-encode our data, this will be equivalent to the mushroom not belonging to any category for that attribute. In general, we would try to impute missing data, but closer investigation shows that the values left as '?' in the table all correspond to a single column, stalk_root. This is reason to believe that these data are Missing Not At Random (MNAR), so it would be inappropriate to try to impute their values from the rest of the dataset. Perhaps these mushrooms possess some quality that makes it difficult or impossible for their stalk root to be measured, which may affect other attributes as well.
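If you'd like to verify that claim yourself, here is a small sketch (using the shrooms_df loaded above, before any cleaning) that counts the '?' placeholders in each column; only the stalk root column should show a non-zero count:
# Count '?' placeholders per column and show only the non-zero counts
question_marks = (shrooms_df == '?').sum()
print(question_marks[question_marks > 0])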
poison = ['p', 'e']
cap_shape = ['b', 'c', 'x', 'f', 'k', 's']
cap_surface = ['f', 'g', 'y', 's']
cap_color = ['n', 'b', 'c', 'g', 'r', 'p', 'u', 'e', 'w', 'y']
bruises = ['t', 'f']
odor = ['a', 'l', 'c', 'y', 'f', 'm', 'n', 'p', 's']
gill_attachment = ['a', 'd', 'f', 'n']
gill_spacing = ['c', 'w', 'd']
gill_size = ['b', 'n']
gill_color = ['k', 'n', 'b', 'h', 'g', 'r', 'o', 'p', 'u', 'e', 'w', 'y']
stalk_shape = ['e', 't']
stalk_root = ['b', 'c', 'u', 'e', 'z', 'r']
stalk_surface_above_ring = ['f', 'y', 'k', 's']
stalk_surface_below_ring = ['f', 'y', 'k', 's']
stalk_color_above_ring = ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y']
stalk_color_below_ring = ['n', 'b', 'c', 'g', 'o', 'p', 'e', 'w', 'y']
veil_type = ['p', 'u']
veil_color = ['n', 'o', 'w', 'y']
ring_number = ['n', 'o', 't']
ring_type = ['c', 'e', 'f', 'l', 'n', 'p', 's', 'z']
spore_print_color = ['k', 'n', 'b', 'h', 'r', 'o', 'u', 'w', 'y']
population = ['a', 'c', 'n', 's', 'v', 'y']
habitat = ['g', 'l', 'm', 'p', 'u', 'w', 'd']
classifications = [poison, cap_shape, cap_surface, cap_color, bruises,
                   odor, gill_attachment, gill_spacing, gill_size,
                   gill_color, stalk_shape, stalk_root, stalk_surface_above_ring,
                   stalk_surface_below_ring, stalk_color_above_ring,
                   stalk_color_below_ring, veil_type, veil_color,
                   ring_number, ring_type, spore_print_color, population,
                   habitat]
# Replace any value that is not in the expected classifications with NaN
for index, row in shrooms_df.iterrows():
    for i in range(len(classifications)):
        if row.iloc[i] not in classifications[i]:
            shrooms_df.at[index, shrooms_df.columns[i]] = np.nan
3) Split the dataset into training and test sets.
This will be useful for the machine learning part of this tutorial. It doesn't hurt to do it now.
train_df, test_df = train_test_split(shrooms_df, test_size = 1/3, random_state = 1)
4) Re-code the data for later analysis.
Unfortunately, the decision trees implemented in scikit-learn only accept numerical features (details here), and our features are categorical. Apply a one-hot encoding to the dataset to convert each categorical feature into a set of binary features, one per class. This is extremely useful in classification problems because, unlike other numerical encoding systems that keep an attribute and simply replace its categorical classes with numbers (for example, for anonymity), a one-hot encoding does not impose an inherent ordering on the feature's classes. Decision trees and random forests in R can handle categorical features directly, but as we're working in Python, we need to be wary of introducing spurious ordinality.
encoded_train = pd.get_dummies(train_df, \
columns = list(train_df.columns.values)[1:])
encoded_train.head(5)
Exploratory data analysis (EDA) is a way of getting a feel for a dataset before we actually begin to analyze or run machine learning algorithms on it. Earlier, when we cleaned up the data, we were able to do so without even seeing it because we already had an expectation of what it should contain. Much like the neatness of the CSV file we read from, this is unusual and very convenient.
However, we can still learn from this data set before we jump into decision trees. For instance, we can first look at the shape of the data:
print("Shrooms DF:\t\t", shrooms_df.shape)
print("Encoded Training Data:\t", encoded_train.shape)
From the above output, we can see that the original data frame contained 23 columns (really 22 features, because the first column, e or p, is what we hope to predict). After applying a one-hot encoding to the training set, we have 115 features, each with binary values.
shrooms_df.describe()
Unfortunately, this doesn't tell us much. We might be more interested in determining which of the attributes are the best predictors of edibility. Before we train a decision tree, let's look at the information gain associated with each attribute, so we have a good idea of what the most important splits in the tree will be.
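As a quick reminder of the definitions implemented below: the entropy of a labeled set S is H(S) = -p_e·log2(p_e) - p_p·log2(p_p), where p_e and p_p are the fractions of edible and poisonous mushrooms in S, and the information gain from splitting on an attribute A is Gain(S, A) = H(S) - Σ_v (|S_v| / |S|)·H(S_v), where S_v is the subset of S whose mushrooms take value v for A.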
def entropy(df):
    num_rows = len(df)
    class_counts = df['class'].value_counts()  # Counts for poisonous/edible classifications in df
    if len(class_counts) < 2:  # No entropy if either class is empty
        return 0
    p_ratio = class_counts['p'] / num_rows
    e_ratio = class_counts['e'] / num_rows
    return -p_ratio * math.log(p_ratio, 2) - e_ratio * math.log(e_ratio, 2)
entropy(train_df)
num_rows = len(train_df)
gain = {}
full_entropy = entropy(train_df)
# Calculate information gain for each attribute
for attribute in train_df.columns.values[1:]:
    value_counts = train_df[attribute].value_counts()
    attr_entropy = 0
    # Sum up entropy contributions of each value taken by the attribute
    for val in value_counts.index:
        val_ent = entropy(train_df[train_df[attribute] == val])
        attr_entropy = attr_entropy + (value_counts[val] / num_rows) * val_ent
    # Store information gain from splitting on the attribute
    gain[attribute] = full_entropy - attr_entropy
# Sort gain and turn into lists for plotting
sorted_gain = sorted(gain.items(), key=itemgetter(1), reverse = True)
list_gains = list(map(list, zip(*sorted_gain)))
attributes = list_gains[0]
y_pos = np.arange(len(attributes))
gains = list_gains[1]
# Plot chart
fig = plt.figure(figsize=(17,8))
plt.bar(y_pos, gains, align='center', alpha=0.5)
plt.xticks(y_pos, attributes, rotation = 'vertical')
plt.ylabel('Gain')
plt.xlabel('Attribute')
plt.title('Attribute Gain Values')
plt.show()
plt.close()
From this chart, we can see that odor will be a very important split and is a good indicator of edibility. The higher the information gain of an attribute, the more important it will be to our classifiers. One of many ways we can visualize this trend is by plotting counts of edible and poisonous mushrooms broken down by their attribute categories:
# Create new dataframes to hold counts
counts_odor_df = pd.DataFrame(index=odor, columns=['edible', 'poison'])
counts_color_df = pd.DataFrame(index=cap_color, columns=['edible', 'poison'])
# Populate counts from original dataframe
for c in odor:
    counts_odor_df.loc[c, 'edible'] = len(shrooms_df[(shrooms_df['class'] == 'e') & (shrooms_df['odor'] == c)])
    counts_odor_df.loc[c, 'poison'] = len(shrooms_df[(shrooms_df['class'] == 'p') & (shrooms_df['odor'] == c)])
for c in cap_color:
    counts_color_df.loc[c, 'edible'] = len(shrooms_df[(shrooms_df['class'] == 'e') & (shrooms_df['cap-color'] == c)])
    counts_color_df.loc[c, 'poison'] = len(shrooms_df[(shrooms_df['class'] == 'p') & (shrooms_df['cap-color'] == c)])
# Plot bar graph comparison for odor
ax = counts_odor_df.plot(kind='bar', title ="Edible and Poisonous Mushrooms by Odor Classification", \
figsize=(17, 6), legend=True, fontsize=12)
ax.set_xlabel("Odor Classification", fontsize=12)
ax.set_ylabel("Counts", fontsize=12)
plt.show()
plt.close()
# Plot bar graph comparison for cap color
ax = counts_color_df.plot(kind='bar', title ="Edible and Poisonous Mushrooms by Cap Color", \
figsize=(17, 6), legend=True, fontsize=12)
ax.set_xlabel("Cap Color", fontsize=12)
ax.set_ylabel("Counts", fontsize=12)
plt.show()
plt.close()
These somewhat-awkward grouped bar graphs illustrate how neatly mushrooms can be classified as 'edible' or 'poisonous' by their odor (most odor categories correspond to only edible or only poisonous mushrooms), and conversely how cap color is almost never a certain predictor of edibility (most cap color categories can apply to a comparable number of edible and poisonous mushrooms).
The first classifier we're going to look at is a decision tree.
As the scikit-learn documentation describes them, decision trees are "a non-parametric supervised learning method used for classification".
As an example, let's initialize sklearn's DecisionTreeClassifier with a max_depth of 3. After initialization, we fit it to our training data and export the resulting decision tree to an image.
y_train = encoded_train['class']
x_train = encoded_train[list(encoded_train.columns[1:])]
decision_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)  # optionally: min_samples_split=20, random_state=99
decision_tree.fit(x_train, y_train)
with open("assets/dt.dot", 'w') as f:
export_graphviz(decision_tree, out_file = f, feature_names=encoded_train.columns[1:])
f.close()
with open("assets/dt.dot", 'r') as f:
temp = f.read()
s = Source(temp, filename="assets/dt", format="png")
s.render(cleanup = True) # Save as png file
f.close()

As we can see, each node of the decision tree shows how many samples reach the node before branching, along with other information such as the node's entropy and the feature it splits on.
One tricky thing about decision trees is determining the depth of the tree. If we let the tree grow too deep, we may end up overfitting our data. Overfitting may cause our decision tree classifier to do poorly when it comes to predicting test data, because the classifier loses generality.
One way to search for an optimal depth is to do a k-fold cross-validation over a range of max_depth values. In a k-fold cross-validation, the data is split into k subsets; in each of k rounds, k-1 subsets are used for training and the remaining subset is used for validation.
By using GridSearchCV, we can do both a k-fold cross validation and also test different depth parameters.
def k_fold_cv(X, Y, classifier, param_grid, cv):
    grid_search = GridSearchCV(classifier, param_grid=param_grid, cv=cv)
    grid_search.fit(X, Y)
    return grid_search

def get_top_n_scores(cv_results, n):
    # Pair each mean cross-validated score with its parameter setting and sort, best first
    top_n_scores = sorted(zip(cv_results['mean_test_score'], cv_results['params']),
                          key=itemgetter(0), reverse=True)[:n]
    for i, (score, params) in enumerate(top_n_scores):
        print("Rank: " + str(i + 1))
        print("Mean Test score: " + str(score))
        print("Max-Depth: " + str(params) + "\n")

param_grid = {"max_depth": np.array(range(1, len(train_df.columns[1:])))}
dt_classifier = DecisionTreeClassifier()
grid_search = k_fold_cv(x_train, y_train, dt_classifier, param_grid, cv=10)
get_top_n_scores(grid_search.cv_results_, 5)
After running our grid search, we can see that a max_depth of 8 produced the most accurate predictions, with an accuracy of 0.99815. Below, we can see our decision tree with max_depth set to 8.
decision_tree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 8)
decision_tree.fit(x_train, y_train)
with open("assets/optimal.dot", 'w') as f:
export_graphviz(decision_tree, out_file = f, feature_names=encoded_train.columns[1:])
f.close()
with open("assets/optimal.dot", 'r') as f:
temp = f.read()
s = Source(temp, filename="assets/optimal", format="png")
s.render(cleanup = True) # Save as png file
f.close()

# Retrieve training data again
y_train = encoded_train['class']
x_train = encoded_train[list(encoded_train.columns[1:])]
# Encode and split test set
encoded_test = pd.get_dummies(test_df, \
columns = list(test_df.columns.values)[1:])
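# Note (a defensive addition, not in the original tutorial): get_dummies builds its
# columns from whichever categories appear in each split, so if a rare category is
# missing from the test set, encoded_test can end up with fewer or reordered columns
# than encoded_train. Realigning the test frame to the training columns guards against that:
encoded_test = encoded_test.reindex(columns=encoded_train.columns, fill_value=0)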
y_test = encoded_test['class']
x_test = encoded_test[list(encoded_test.columns[1:])]
# Dictionaries to store loss for model on training and test data
loss = {'training': {}, 'test': {}}
# Get training and test loss for each max depth hyperparameter
for i in np.array(range(1, len(train_df.columns[1:20]))):
    decision_tree = DecisionTreeClassifier(criterion='entropy', max_depth=i)
    decision_tree.fit(x_train, y_train)
    train_predict = decision_tree.predict(x_train)
    test_predict = decision_tree.predict(x_test)
    loss['training'][i] = zero_one_loss(y_train, train_predict)
    loss['test'][i] = zero_one_loss(y_test, test_predict)
# Plot loss as a function of max depth
fig = plt.figure(figsize=(17,8))
y_pos = np.array(range(1, len(train_df.columns[1:20])))
train_line, = plt.plot(list(loss['training'].keys()), list(loss['training'].values()), label = "Training")
test_line, = plt.plot(list(loss['test'].keys()), list(loss['test'].values()), label = "Test")
plt.xticks(y_pos, y_pos)
plt.ylabel('Loss')
plt.xlabel('Maximum Depth Hyperparameter')
plt.title('Normalized Zero-One Loss vs Max Depth')
plt.legend(handles=[train_line, test_line])
plt.show()
By varying the maximum depth of the decision tree classifier as our model complexity hyperparameter, we observe the expected behavior of training loss decreasing as complexity increases. However, the model appears to be robust to the perils of overfitting: the test loss also decreases to a negligible value with increasing complexity and does not increase afterward.
According to the rankings of the optimal depths obtained above, it would appear that each of the trials generated with a max depth parameter greater than or equal to 8 resulted in identical trees, indicating that an optimal depth, with no further necessary splits, was reached within 8 steps. Those 8 splits appear to generalize very well, as practically no loss is incurred on the test set either.
Now, let's generate a support vector classification for edibility of mushrooms. This is very simple now that we've already separated our data into training and testing sets.
Support Vector Machines are also supervised learning models which, given an input, classify it into discrete categories. SVMs are particularly useful when performing a binary classification, as we are here (p/e). They work by finding the hyperplane that splits the two classes with the largest margin. We will be using scikit-learn's svm.SVC class.
First, initialize your classification and fit the training data:
# Initialize Support Vector Classification
svc = svm.SVC(kernel='linear')
# Fit the support vector machine model according to the training set
svc.fit(x_train, y_train)
scikit-learn does all of the work. If you just want to view the results, you can feed the test data straight into svc's predict function and receive an array of p/e classifications.
# Perform classifications on sample
print(svc.predict(x_test))
As you might expect, this class will also calculate the accuracy of the classification given the results of test data:
# Return the mean accuracy on given test data and labels
score = svc.score(x_test, y_test)
print("SVC accuracy: ", score*100)
The support vector machine we have created has an accuracy of 100% for our test data, which isn't much of a surprise after the results of our decision tree and given that we know at least one of the features — odor — is a very good predictor of edibility. Unfortunately, because we have split our features using the one-hot encoding, and because the features themselves were categorical to begin with, there isn't an especially useful way to visualize this split (we certainly can't display the hyperplane in all of its dimensions).
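That said, because we used a linear kernel, one small thing we can still do (a rough sketch, not part of the original analysis) is inspect the weights the SVC learned, which gives a loose sense of which one-hot features push the decision most strongly:
# Rank one-hot features by the absolute weight the linear SVC assigned to them
coef = svc.coef_.ravel()
feature_weights = sorted(zip(x_train.columns, coef), key=lambda t: abs(t[1]), reverse=True)
for name, weight in feature_weights[:10]:
    print(name, round(weight, 3))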
Based on our test results, we can see that our machine learning classifiers can determine a mushroom's edibility quite accurately. We're excited to continue learning about data science and we hope to solve more real-world problems.
Thank you, Professor John Dickerson, for teaching us data science and igniting new interests.