Decision Trees - Never feel stumped again!
"Decisions - they're hard."
Rawan Hassunah
Decision trees allow you to follow a path that, given a set of requirements or features, will lead you to the correct answer or prediction. Note: can probably be used to determine whether or not you should text him.
Unlike parametric models such as linear regression and logistic regression, which are made up of a fixed and finite number of parameters, decision trees are an example of non-parametric models. Um no, non-parametric does not mean that the model contains no parameters. It actually means that the number of parameters isn't fixed in advance; it can grow with the size and complexity of the data.
Method.
Classification trees:
At each node, ask a question, follow the path to the right answer, repeat.
Answers are categorical: this or that.
A leaf node is classified as a certain class by taking the majority vote.
Regression trees:
At each node, ask a question, define a threshold to split on, follow the path to the right answer, repeat.
Answers depend on a threshold: below this or above that.
A leaf node's prediction is the average of the responses of the training points that land in it (a quick sketch of both tree types follows below).
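Here's a minimal sketch of both flavors, assuming scikit-learn is available; the tiny datasets are made up purely for illustration.

```python
# A minimal sketch of both tree types, assuming scikit-learn is installed.
# The tiny datasets here are made up purely for illustration.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a category (0 or 1) from two features.
X_clf = [[0, 0], [1, 1], [1, 0], [0, 1]]
y_clf = [0, 1, 1, 0]
clf = DecisionTreeClassifier(criterion="gini")   # or criterion="entropy"
clf.fit(X_clf, y_clf)
print(clf.predict([[1, 1]]))   # -> the majority class of the leaf it lands in

# Regression: predict a number by averaging the responses in each leaf.
X_reg = [[1], [2], [3], [10], [11], [12]]
y_reg = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
reg = DecisionTreeRegressor()
reg.fit(X_reg, y_reg)
print(reg.predict([[2.5]]))    # -> the average response of the matching leaf
```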
Data.
Decision trees can take in and output any type of data (continuous and discrete). Referred to in our Galvanize lecture as AIAO: anything in, anything out (insert inappropriate joke here).
Terminology.
Root Node: The node at the top of the tree.
Parent Node: Each node (except the root node) has a parent node, which is the node directly above it.
Child Node: Each node (except the leaf nodes) has child nodes, which are the nodes directly below it.
Leaf Node: A node with no children (nodes that determine the answer).
Branch/Edge: Line that connects the nodes.
Pros and Cons.
Like pretty much anything, there are pros and cons for this algorithm.
Let's start with the bad news:
Decision trees are very easy to overfit.
What does that mean? Overfitting happens when the algorithm tries to decrease the training set error (the error observed when the model is tested on data it has seen before) and in turn ends up increasing the testing error (the error observed when the model is tested on data that it has never seen before). This happens when the model interprets noise as a signal.
How do we fix it? There's a fix called pruning (lol @ literal name), which is a method that reduces the size of the tree.
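As a rough sketch of both the problem and the fix: scikit-learn supports cost-complexity pruning through its ccp_alpha parameter (in recent versions). The dataset and the alpha value below are just placeholders; in practice you'd pick alpha by cross-validation.

```python
# Sketch: a fully grown tree overfits; a pruned tree trades a little training
# accuracy for better performance on unseen data. Dataset and alpha are
# placeholders for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: near-perfect on training data, worse on unseen data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("full tree   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))

# Pruned tree: slightly worse on training data, usually better on unseen data.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)
print("pruned tree train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```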
Decision trees can take significant resources to train, and they're built with a greedy algorithm.
What does that mean? Greedy algorithms make locally optimal decisions in the hope that they add up to a globally optimal decision. Much of the time the result isn't actually globally optimal, but it's usually a pretty good approximation.
Now the good:
Decision trees are easily interpretable (just follow the path to the answer) and cheap to use once built.
Decision trees can handle various types of data, as well as missing data and outliers.
How Are Decision Trees Built?
A tree is built on the basis that at each split, the optimal feature (and threshold) to split on is chosen. This is done by minimizing some cost function, and the split that gives the lowest cost wins! You can also say that the split with the lowest cost produces the purest child nodes. A completely pure node contains data points that all belong to the same class.
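As a rough illustration of "lowest cost wins", here's a toy split search over a single feature, using the weighted Gini impurity of the children (defined in the next section) as the cost. The helper names and the tiny dataset are made up for illustration; real implementations are far more optimized.

```python
# Toy illustration of picking the best split on a single feature by
# minimizing the weighted Gini impurity of the two child nodes.
def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Try a threshold at each observed value; the lowest weighted cost wins."""
    best = (None, float("inf"))
    for threshold in sorted(set(values)):
        left  = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        if not left or not right:
            continue
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Feature values and their classes: the cleanest split is between 3 and 8.
print(best_split([1, 2, 3, 8, 9, 10], ["a", "a", "a", "b", "b", "b"]))  # -> (3, 0.0)
```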
Information Gain (Entropy) & Gini Impurity.
There are two popular methods to determine the purity of a split in the case of classification trees and one method in the case of regression trees.
Note: p_i below is the proportion of data points belonging to class i at a node.
Classification Trees: Method 1 - Entropy
Entropy is a measure of disorder. Basically: high bad, low good.
We can then calculate the information gain, which is the reduction in uncertainty from a specific split. It is calculated by subtracting the weighted average entropy of the child nodes from the entropy of the parent node (how much information have we gained by making this split?).
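For reference, entropy at a node is -Σ p_i log2(p_i), and information gain is the parent's entropy minus the weighted average entropy of the children. A small worked example (the counts are made up for illustration):

```python
# Entropy and information gain, worked by hand on made-up class counts.
from math import log2

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = [5, 5]                # 5 of class A, 5 of class B -> maximum disorder
left, right = [4, 0], [1, 5]   # a candidate split of those 10 points

weighted_children = (sum(left) * entropy(left) + sum(right) * entropy(right)) / sum(parent)
info_gain = entropy(parent) - weighted_children
print(entropy(parent))   # 1.0 (as messy as a two-class node can get)
print(info_gain)         # ~0.61, the uncertainty removed by this split
```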
Classification Trees: Method 2 - Gini
Gini impurity measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the distribution of classes at the node. Same idea: high bad, low good.
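The usual formula is Gini = 1 - Σ p_i², which is 0 for a perfectly pure node and largest when the classes are evenly mixed. A quick check with made-up proportions:

```python
# Gini impurity = 1 - sum(p_i^2), where p_i is the proportion of class i.
def gini_impurity(proportions):
    return 1 - sum(p ** 2 for p in proportions)

print(gini_impurity([1.0, 0.0]))   # 0.0  -> perfectly pure node
print(gini_impurity([0.9, 0.1]))   # 0.18 -> mostly pure
print(gini_impurity([0.5, 0.5]))   # 0.5  -> as mixed as two classes can be
```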
Regression Trees: Method 1 - Residual Sum of Squares (RSS)
RSS is the sum of squared errors, specifically the squared differences between each response and the average response at its leaf node (t).
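In symbols, the RSS for a leaf t is Σ (y_i - ȳ_t)², summed over the points that fall into that leaf (and over all leaves for the whole tree). A tiny made-up example:

```python
# RSS at a leaf: sum of squared differences from the leaf's average response.
def rss(responses):
    mean = sum(responses) / len(responses)
    return sum((y - mean) ** 2 for y in responses)

leaf_t = [2.0, 3.0, 4.0]   # responses that ended up in one leaf
print(rss(leaf_t))         # 2.0 -> (2-3)^2 + (3-3)^2 + (4-3)^2
```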
To Be Continued.
In my next post about decision trees I will discuss tree variants and uses, as well as pruning methods.