Decision Tree Classifier

Introduction

This type of learning typically involves the use of Decision Trees, which work much like a flowchart. Models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Certain conditions need to be satisfied at each node in order to advance towards the leaf nodes. The first node of the tree is known as its root node, and all other nodes branch out from it. In short, Decision Trees are a type of Supervised Machine Learning algorithm where the data is continuously split according to certain parameters. The tree can be explained by two entities: decision nodes, where the data is split, and leaves, which are the decisions or final outcomes. From now on I will be referring to a Decision Tree Classifier as DTC. Based on the input, a decision is made (this corresponds to a decision node of the DTC); then, based on the outcome of that decision, the information travels to the next decision node, where another condition is evaluated.


This process continues until we reach a final node of the DTC called the leaf node, which we will refer to as the output node. Based on the features of the input data, the model has now successfully classified the observation into a category. A DTC is a relatively simple algorithm in the ML pipeline.
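To make the flowchart analogy concrete, here is a minimal hand-written sketch of how a tiny tree might route one observation through its decision nodes to an output node. The features and thresholds here are purely hypothetical, chosen only for illustration:

# Each if/else below plays the role of a decision node,
# and each return is a leaf (output) node.
# The 150 g threshold is a made-up example value.
def classify(weight, texture_smooth):
    # Decision node 1: split on weight
    if weight < 150:
        # Decision node 2: split on texture
        if texture_smooth:
            return "Apple"    # leaf / output node
        else:
            return "Orange"   # leaf / output node
    else:
        return "Orange"       # leaf / output node

print(classify(140, True))    # -> Apple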

To understand this better, here is a picture of a DTC trained on the Iris dataset, which we discussed in the Classification part of this blog series.

[Figure: A Decision Tree Classifier trained on the Iris dataset]

Once trained, the DTC knows which features to use along with the decision boundaries it needs. Any new observation now travels through the entire tree, answering the questions and conditions imposed by the decision nodes. Once all of them have been evaluated, the observation lands on a certain output node and has thereby been classified into its respective category. This model is relatively easy to learn and is considered one of the simplest in the entire ML pipeline.
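As a quick sketch of this in code (assuming scikit-learn is installed), we can fit a DTC to the Iris dataset and push a new observation through the tree; the sample measurements below are illustrative values:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)   # the tree learns its decision boundaries here

# A new observation travels through the decision nodes down to a leaf
sample = [[5.1, 3.5, 1.4, 0.2]]   # sepal/petal measurements (illustrative)
print(iris.target_names[clf.predict(sample)])   # e.g. ['setosa']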

Further information on Decision Trees

Advantages:

  1. Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical or non-ML background. No statistical knowledge is required to read and interpret it, and its graphical representation is very intuitive, so users can easily relate it to their hypotheses.
  2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. For example, if we are working on a problem where information is available in hundreds of variables, a decision tree will help identify the most significant ones (see the sketch after this list).
  3. Less data cleaning required: It requires less data cleaning compared to some other modeling techniques, and it is, to a fair degree, not influenced by outliers and missing values.
  4. Data type is not a constraint: It can handle both numerical and categorical variables.
  5. Non-parametric method: A decision tree is considered a non-parametric method, meaning it makes no assumptions about the space distribution or the classifier structure.
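As one illustration of point 2, scikit-learn exposes a fitted tree's feature_importances_ attribute, which scores how much each variable contributed to the splits. A minimal sketch, using the Iris data again as a stand-in for a dataset with many variables:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Higher values indicate more significant variables for this tree's splits
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")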

Disadvantages:

 

  • Overfitting: Overfitting is one of the most practical difficulties for decision tree models. It can be mitigated by setting constraints on the model parameters and by pruning (see the sketch after this list).
  • Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it bins the variables into different categories.
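As a minimal sketch of those constraints in scikit-learn, hyperparameters such as max_depth and min_samples_leaf limit how far the tree can grow, and ccp_alpha applies cost-complexity pruning after fitting; the values below are illustrative, not tuned:

from sklearn.tree import DecisionTreeClassifier

# A constrained tree: shallower and with larger leaves, so it overfits less
clf = DecisionTreeClassifier(
    max_depth=3,          # cap the number of decision levels
    min_samples_leaf=5,   # every leaf must cover at least 5 samples
    ccp_alpha=0.01,       # cost-complexity pruning strength
)
# clf.fit(X, y)  # X, y: your training data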

 

Terminology:

 

  • Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
  • Splitting: The process of dividing a node into two or more sub-nodes.
  • Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  • Leaf/Terminal/Output Node: Nodes that do not split are called leaf or terminal nodes.
  • Pruning: Removing sub-nodes of a decision node; you can think of it as the opposite of splitting.
  • Branch/Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
  • Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

Are tree-based models better than linear models?

Well, the answer to that question is that it depends on the type of problem we want to solve. Here are a few key factors that will help you in choosing the right model:

  1. If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.
  2. If there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method.
  3. If you need to build a model that is easy to explain to people, a decision tree model will often do better than a linear model, since decision trees are even simpler to interpret than linear regression.

Now that we have understood the rough intuition of a Decision Tree, we will try to solve a problem in Python. For this we will make up our very own problem. Imagine that we have an unknown fruit X in a container, and we are told the weight of the fruit as well as its texture (smooth or bumpy). We are also given one more clue: the fruit in the container can only be either an apple or an orange. How would you model this? Let's find out. Let us label our outputs as Fruit {0: "Apple", 1: "Orange"} and our inputs as Weight and Texture {0: Bumpy, 1: Smooth}. We will use a DTC and get our results. Here we will be simulating what an ML researcher does: creating our own data through observations. We have noted the weights and textures of 6 fruits (3 apples and 3 oranges) in the variable X and labeled the fruits accordingly {0: "Apple", 1: "Orange"} in the variable y.

Code

Code for our DTC
This is the Python code that we need to write in order to create a DTC and train it on our custom dataset.
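Since the original code screenshot is not reproduced here, below is a minimal sketch of what that code might look like with scikit-learn. The exact weights are illustrative assumptions, chosen so that the apples are lighter and smooth and the oranges are heavier and bumpy:

from sklearn.tree import DecisionTreeClassifier

# Features: [weight in grams, texture] with Texture {0: Bumpy, 1: Smooth}
# The gram values are illustrative, not the blog's original measurements
X = [[135, 1], [140, 1], [150, 1],   # 3 apples (light, smooth)
     [160, 0], [170, 0], [180, 0]]   # 3 oranges (heavy, bumpy)
y = [0, 0, 0, 1, 1, 1]               # Fruit {0: "Apple", 1: "Orange"}

clf = DecisionTreeClassifier()
clf.fit(X, y)                        # train the DTC on our custom dataset

# Classify the unknown fruit X: say it weighs 165 g and is bumpy
print(clf.predict([[165, 0]]))       # e.g. [1] -> "Orange"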

Output

The output given by our DTC that we previously coded.
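With the illustrative data in the sketch above, the tree learns a clean split between the two classes, so for a 165 g bumpy fruit it would print [1], i.e. the unknown fruit is classified as an Orange.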

I hope you enjoyed this simple intuition, explanation and tutorial on the DecisionTreeClassifier. In the next blog we will discuss an unsupervised form of Machine Learning: Clustering using the kMeans Algorithm. Until then have a nice day and enjoy Machine Learning! 🙂

-MANAS HEJMADI

 

 
