Every one of us has, at some point, faced a situation where we had to choose what to do; making that choice well means weighing different conditions and deciding smartly.
As a child you made simple decisions, like what to wear, what to eat, or whether to go to school. After growing up, you have to make decisions on a more serious note, as they affect you directly or indirectly, and the most complex situations arise when decision making moves into a business perspective.
For example, a marketing manager wants to identify the customers most likely to purchase more products, and a loan manager wants to flag risky loan applications in order to achieve a lower loan-failure rate.
A decision tree is a predictive modeling technique and a decision-support tool. It uses a tree-like representation of decisions and their possible consequences to draw inferences.
The basic goal of a decision tree is to build a model that predicts the value of a target variable by taking relevant attributes into account and making decisions accordingly.
The decisions follow if-else conditional rules: the deeper the tree, the more complex the decision rules and the closer the model fits the training data. It is one of the most widely used methods in supervised learning and has a broad range of applications. It has a flowchart-like structure that is constructed through algorithmic approaches to identify how splitting should be done under different conditions.
Looking at the structure of this flowchart: it contains a root node, from which model building starts; internal nodes, which represent tests on features; branches, which show the outcomes of those tests; and leaf nodes, which hold groups of the same values, created after decisions on all the related attributes have been taken.
Decision trees are widely used in regression and classification problems. They build automated predictive models with many applications in machine learning, data science, data mining, and statistics. Tree-based models deliver high accuracy and stability and are highly interpretable, which is why they are easy to understand. Unlike linear models, they capture non-linear relationships quite well.
A decision tree is undoubtedly fast compared to other techniques. The main thing that limits it is overfitting, which arises when the tree grows complex and dense. To overcome overfitting we can use a random forest, which is nothing but a group of decision trees, each making decisions on a sub-part of the dataset; this reduces the chance of overfitting while remaining fast.
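The random-forest idea can be sketched in a few lines: train several one-split trees ("stumps") on resamples of the data and combine their votes. Everything here — the toy dataset, the hand-written resamples, and the stump learner — is an illustrative assumption, not a production implementation:

```python
from collections import Counter

# Toy dataset: (feature value, label). Small x values are "a", large are "b".
DATA = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]

def train_stump(sample):
    """Return (threshold, left_label, right_label) minimizing errors."""
    best = None
    for threshold in sorted({x for x, _ in sample})[:-1]:
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + \
                 sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]

def forest_predict(x, stumps):
    # each stump votes; the forest returns the majority vote
    votes = [l if x <= t else r for t, l, r in stumps]
    return Counter(votes).most_common(1)[0][0]

# three bootstrap-style resamples, written out by hand for determinism
resamples = [
    [(1, "a"), (2, "a"), (2, "a"), (7, "b"), (9, "b"), (9, "b")],
    [(3, "a"), (3, "a"), (1, "a"), (8, "b"), (7, "b"), (8, "b")],
    [(2, "a"), (3, "a"), (9, "b"), (9, "b"), (7, "b"), (1, "a")],
]
stumps = [train_stump(s) for s in resamples]
print(forest_predict(2, stumps))  # -> a
print(forest_predict(8, stumps))  # -> b
```

Because each tree sees a slightly different sub-part of the data, no single noisy point dominates the final vote, which is exactly why the ensemble overfits less than one deep tree.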
How does a decision tree work?
In machine learning terms, decision trees are supervised learning algorithms applied mostly to classification and regression problems, and they work with both continuous and categorical variables. The method divides the entire population or sample (dataset) into a number of subpopulation sets on the basis of different attributes.
Decision trees use various algorithms to recognize the most significant variables and the best possible split values, each split producing further subpopulation sets.
The image below represents the workflow of a decision tree: data is divided into training and test datasets, the decision tree algorithm is applied, and model performance is evaluated afterwards.
Workflow structure of the Decision Tree
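The workflow in the chart can be sketched with scikit-learn (assuming it is installed; the dataset, split ratio, and random seeds are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 1. divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# 2. apply the decision-tree algorithm to the training set
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# 3. evaluate model performance on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Keeping the test set out of training, as the chart shows, is what lets the final accuracy number estimate performance on unseen data.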
Let me explain with a very simple example: say we want to check whether a person is fit or not (the root of the tree). There are some parameters, or features (internal nodes), on which decisions are taken, producing the branches of the tree; suppose the first parameter is whether the person is under 30 years of age.
The first split is done on that parameter. Further splits require other parameters, such as whether the person eats a lot of food, whether they exercise in the morning, and so on. At the end we get the results: the leaves of the tree, everything that is neither root nor branch.
Leaves are the decisions and do not split further; at a leaf the tree has decided whether the person is fit or not. This can be clearly understood from the working chart of the example below.
Example of the Decision Tree
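The example above can be written out as the nested if-else rules the tree encodes (the branching order and the outcomes at the leaves are illustrative assumptions):

```python
def is_fit(person):
    if person["age"] < 30:                 # root node: first test
        if person["eats_lots_of_food"]:    # internal node
            return "unfit"                 # leaf
        return "fit"                       # leaf
    if person["exercises_in_morning"]:     # internal node on the other branch
        return "fit"                       # leaf
    return "unfit"                         # leaf

print(is_fit({"age": 25, "eats_lots_of_food": False,
              "exercises_in_morning": True}))   # -> fit
print(is_fit({"age": 45, "eats_lots_of_food": True,
              "exercises_in_morning": False}))  # -> unfit
```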
The decision tree has its own representation of the problem; as mentioned above, it contains a root, branches, internal nodes, and leaves. The steps to build the tree representation are as follows:
Select the best attribute and place it at the root of the tree.
Divide the dataset into subsets using that attribute, making sure each subset contains data with the same value for the attribute.
Repeat steps 1 and 2 until you reach the leaf nodes on all branches of the tree.
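Step 1, finding the "best" attribute, is usually done with an impurity measure; a minimal sketch using entropy and information gain on a made-up toy dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target="label"):
    """Entropy reduction achieved by splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# toy data: "outlook" separates the labels perfectly, "windy" does not
rows = [
    {"outlook": "sunny", "windy": True,  "label": "play"},
    {"outlook": "sunny", "windy": False, "label": "play"},
    {"outlook": "rainy", "windy": True,  "label": "stay"},
    {"outlook": "rainy", "windy": False, "label": "stay"},
]
print(information_gain(rows, "outlook"))  # -> 1.0 (perfect split)
print(information_gain(rows, "windy"))    # -> 0.0 (no information)
```

The attribute with the highest gain ("outlook" here) goes at the root, and the same computation is repeated on each subset for the internal nodes.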
Analysis of Decision tree
Decision tree as a classification tree or regression tree
The loan-manager example mentioned above is a simple case of classifying loan applications as safe or risky on the basis of some attributes; here, the attributes are possible or real-time events on which the decision depends. This is where the decision tree acts as a classification tree.
Classification is basically a two-step process: first the learning step, in which a model is built from a given set of training data, and second the prediction step, in which that model is used to predict the response for new data.
Sometimes a decision must be made on continuous data, where the target variable is a real number: for example, predicting the price of a product from the cost of the raw material used to manufacture it, or estimating a customer's salary from their consumption, job, home location, and other information provided on an application form. Here the target variable is a real value, or part of the dataset is continuous data used to predict the target variable.
This is where the decision tree acts as a regression tree. A regression tree takes observations about the various features of an object and trains a model to predict a meaningful continuous output.
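A minimal sketch of how a regression split works, restricted to a single split for brevity: try each threshold, keep the one with the lowest squared error, and let each side predict the mean of its training targets. The raw-material-cost-to-price numbers are made up:

```python
from statistics import mean

# (raw material cost, product price) pairs, sorted by cost
data = [(1.0, 10.0), (2.0, 12.0), (3.0, 14.0), (8.0, 40.0), (9.0, 44.0)]

def best_split(points):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for t, _ in points[:-1]:  # candidate thresholds (skip the max x)
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        sse = sum((y - mean(left)) ** 2 for y in left) \
            + sum((y - mean(right)) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, mean(left), mean(right))
    return best[1:]

t, left_mean, right_mean = best_split(data)

def predict(x):
    return left_mean if x <= t else right_mean

print(t, predict(2.5), predict(8.5))  # -> 3.0 12.0 42.0
```

A real regression tree applies this search recursively to each side until a stopping criterion is met, but each leaf still predicts a mean, which is what makes the output continuous.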
Decision trees need careful data preprocessing: a dataset may contain many attributes we do not even need, and in an algorithm like a decision tree every attribute contributes to the final result, so it is highly important to clean and prepare the data in a way that leaves no route to unwanted results.
Similarities and Differences between the Regression and Classification tree
Let’s discuss the primary differences and similarities between regression trees and classification trees. Regression trees are used when the target variable is a continuous variable, and the value assigned at a terminal node is the mean response of all the training observations lying in that region.
When an unseen observation falls into that region, the mean value is used as its prediction. In contrast, classification trees are required when the target variable is a categorical variable, and the value assigned at a terminal node is the mode of all the training observations lying in that region; an unseen observation falling there is predicted with that mode value.
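The terminal-node rule just described fits in two lines of Python; the leaf contents below are made-up values:

```python
from statistics import mean, mode

regression_leaf = [10.0, 12.0, 14.0]             # continuous targets in one region
classification_leaf = ["safe", "safe", "risky"]  # categorical targets in one region

# a regression leaf predicts the mean, a classification leaf the mode
print(mean(regression_leaf))      # -> 12.0
print(mode(classification_leaf))  # -> safe
```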
As for similarities: both trees split the space of independent variables into distinct, non-overlapping regions using a recursive binary approach, i.e. splitting starts from the top of the tree with all observations lying in one single region and divides the space into two new branches. The splitting process continues until a user-defined stopping criterion is fulfilled, resulting in a fully developed tree.
Decision trees have some advantages and disadvantages
Implementing decision trees in machine learning has several advantages:
As we have seen above, they can work with both categorical and continuous data and can generate multiple outputs.
Decision trees are among the easiest models to interpret and understand; even someone from a non-technical background can follow a hypothesis from the tree's pictorial representation.
The model produces interpretable results, and a tree's reliability can be tested and quantified.
Decision trees require less time for data preparation, as they don't require dummy variables, data normalization, replacement of missing values, etc.
They also take very little time for data exploration: finding the most important variables and their relationships with other variables, and creating new features that strengthen the target variable.
Decision trees are helpful in data cleaning, taking much less time than other modeling techniques because they are not affected by outliers and missing values up to a certain point.
Decision trees are considered a non-parametric method: they make no assumptions about the spatial distribution of the data or the structure of the classifier.
Even non-linear relationships between features do not degrade the performance and efficiency of trees.
Let’s look at some disadvantages of decision trees:
When dealing with categorical data that has many levels, the information gain is biased in favor of the attributes with the most levels.
When a dataset has attributes with many interlinked levels, calculations become more complex.
Decision trees often get stuck on the problem of overfitting: an over-complex tree may fail to generalize the data well. This problem can be minimized by constraining the number of samples at a leaf node or setting the maximum depth of the tree.
Very small changes in the data can produce a completely different tree; this is termed variance, which is why decision trees are considered unstable. Concepts such as bagging and boosting were introduced to address it.
Compared with some other modeling techniques, a single tree can produce low prediction accuracy.
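The overfitting point above can be seen in practice by capping a tree's depth; a small sketch assuming scikit-learn is installed (the synthetic dataset, noise level, and seeds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# noisy synthetic data: flip_y=0.2 mislabels ~20% of samples
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# the unconstrained tree memorizes the noisy training set;
# the depth-capped tree cannot, which usually helps it generalize
print("deep:    train %.2f, test %.2f"
      % (deep.score(X_tr, y_tr), deep.score(X_te, y_te)))
print("shallow: train %.2f, test %.2f"
      % (shallow.score(X_tr, y_tr), shallow.score(X_te, y_te)))
```

The same instability is why bagging-style ensembles such as random forests, which average many trees grown on resampled data, are the standard remedy for the variance problem.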
I hope you now have an idea of the basics of decision trees, and that this blog inspires you to study them more deeply. You have seen that decision trees belong to the class of supervised learning, alongside other popular methods for solving data science problems such as random forests and gradient boosting. Decision trees are mainly used for solving regression and classification problems. For more blogs on analytics and new technologies, do read Analytics Steps.