Everyone has, at some point, faced a situation where they had to choose what to do. That “what to do” is settled by making smart decisions under different conditions.
As a child, you made decisions such as what to wear, what to eat, and whether to go to school. As an adult, your decisions become more serious because they directly or indirectly affect your profitability, and the situations become more complex still when decision-making has to take business considerations into account.
The basic goal of a decision tree is to build a model that predicts the value of a target variable by taking a set of attributes into account and making decisions accordingly.
The decisions generally take the form of if-else conditional statements. The deeper the tree, the more complex the rules and the more closely the model fits the training data, although a very deep tree can overfit. Decision trees are among the most widely used methods in supervised learning and have a wide range of applications. A tree has a flowchart-like structure, built by an algorithm that decides how the data should be split under different conditions.
The flowchart contains a root node, where model building starts; internal nodes, each representing a test on a feature; branches, which show the outcomes of a test; and leaf nodes, each holding a group of observations with the same value, reached after decisions on all relevant attributes have been made.
Decision trees are widely used for regression and classification problems. They build automated predictive models that have many applications in machine learning, data science, data mining, and statistics. Tree-based methods can deliver high accuracy and, particularly when several trees are combined, good stability; a single tree is also highly interpretable, which makes it easy to understand. Unlike linear models, trees map non-linear relationships quite well.
In machine learning terms, decision trees are supervised learning algorithms applied mostly to classification and regression problems, and they work with both continuous and categorical variables. In this method, we divide the entire population or sample (the dataset) into a number of subpopulations on the basis of different attributes.
Decision tree algorithms identify the most significant variable, and the value of that variable at which to split, so that each split produces further subpopulations.
The image below shows the workflow of a decision tree: the data is divided into training and test datasets, a decision tree algorithm is applied to the training data, and the model’s performance is then evaluated. You can learn about how to analyze data here.
Let me explain with a very simple example. Say we want to decide whether a person is fit or not; this question is where the tree starts (the root). There are some parameters, or features (the internal nodes), on which decisions are taken, and each decision produces branches (the lines of the tree). Suppose the first parameter checks whether the person is younger than 30.
The first split is made on that parameter. Further splits need other parameters, such as whether the person eats a lot of food or whether they exercise in the morning, and so on, until we reach the results (the leaves of the tree, that is, everything that is not a root or a branch).
Leaves are the final decisions and are not split any further; at a leaf, the tree has decided whether the person is fit or not. This is easy to follow in the working chart of the example.
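Written as code, the fitness example is just a set of nested if-else tests. The sketch below is only illustrative: the age threshold of 30 and the three features come from the example, but which feature is checked on which branch, and the outcome at each leaf, are assumptions, and the function name `is_fit` is hypothetical.

```python
def is_fit(age: int, eats_lots_of_food: bool, exercises_in_morning: bool) -> str:
    """Walk a tiny hand-written decision tree and return the leaf label."""
    # First test: age below 30 (threshold taken from the example).
    if age < 30:
        # Assumed branch: younger people are judged by how much they eat.
        return "unfit" if eats_lots_of_food else "fit"
    # Assumed branch: older people are judged by morning exercise.
    return "fit" if exercises_in_morning else "unfit"


print(is_fit(25, eats_lots_of_food=False, exercises_in_morning=False))  # fit
print(is_fit(45, eats_lots_of_food=False, exercises_in_morning=False))  # unfit
```

Each `return` is a leaf, and each `if` is an internal node with two branches.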
A decision tree has its own representation for solving a problem; as mentioned above, it consists of a root, branches, internal nodes, and leaves. The steps to build the tree are as follows:
1. Select the best attribute and place it at the root of the tree.
2. Divide the dataset into subsets using that attribute, so that every subset contains only rows with the same value of the attribute.
3. Repeat step 1 and step 2 on each subset until you reach leaf nodes on all branches of the tree.
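The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production algorithm: the steps do not say how the "best" attribute is chosen, so the sketch assumes Gini impurity as the criterion, and the toy fitness rows, feature names, and helper names (`gini`, `split_by`, `build_tree`, `predict`) are all made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_by(rows, attr):
    """Step 2: group rows into subsets sharing the same value of `attr`."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row)
    return subsets


def weighted_gini(rows, attr, target):
    """Average impurity of the subsets produced by splitting on `attr`."""
    return sum(
        len(s) / len(rows) * gini([r[target] for r in s])
        for s in split_by(rows, attr).values()
    )


def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    # Leaf: every row agrees, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 1: pick the attribute whose split is the least impure (assumed criterion).
    best = min(attrs, key=lambda a: weighted_gini(rows, a, target))
    remaining = [a for a in attrs if a != best]
    # Step 3: repeat on every subset.
    return {best: {value: build_tree(subset, remaining, target)
                   for value, subset in split_by(rows, best).items()}}


def predict(tree, row):
    """Follow the branches matching `row` until a leaf label is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


FEATURES = ["young", "eats_a_lot", "exercises"]
raw = [  # made-up data: (young, eats_a_lot, exercises, label)
    (True, True, True, "unfit"), (True, True, False, "unfit"),
    (True, False, True, "fit"), (True, False, False, "fit"),
    (False, True, True, "fit"), (False, True, False, "unfit"),
    (False, False, True, "fit"), (False, False, False, "unfit"),
]
rows = [dict(zip(FEATURES + ["label"], r)) for r in raw]
tree = build_tree(rows, FEATURES, "label")
print(predict(tree, {"young": True, "eats_a_lot": False, "exercises": False}))
```

The tree is stored as nested dictionaries, with one key per internal node and plain labels at the leaves, which keeps the recursion easy to follow.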
Take the example of a loan manager who has to classify loan applications as safe or risky on the basis of some attributes; here, the attributes are possible or real-time events on which the decision depends. This is where the decision tree is used for classification.
Classification is a two-step process. The first is the learning step, in which a model is built from a given set of training data; the second is the prediction step, in which the model is used to predict the response for new data.
Sometimes the decision has to be made on continuous data, that is, the target variable is a real number. Examples include predicting the price of a product from the cost of the raw materials used to manufacture it, or estimating a customer’s salary from their spending, job, home location, and other information provided in an application form. Here, either the target variable is a real value or continuous data in the dataset is used to predict the target variable.
This is where the decision tree is used for regression. A regression tree takes observations of the various features of an object and trains a model to predict a meaningful continuous output.
Let’s discuss the main differences and similarities between regression trees and classification trees. Regression trees are used when the target variable is continuous; the value assigned to a terminal node is the mean response of all the training observations that fall in that region.
When a new observation falls in that region, the prediction for it is that mean value. In contrast, classification trees are used when the target variable is categorical; the value assigned to a terminal node is the mode (the most common class) of all the training observations that fall in that region. When a new observation falls in that region, the prediction for it is that mode.
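The difference between the two kinds of leaf is small enough to show directly. In this sketch the function names and the sample values are hypothetical; the only point is that a regression leaf predicts the mean of its training targets and a classification leaf predicts the mode.

```python
from statistics import mean, mode


def regression_leaf(targets):
    """Prediction of a regression leaf: mean of the training targets in it."""
    return mean(targets)


def classification_leaf(labels):
    """Prediction of a classification leaf: most common training label in it."""
    return mode(labels)


prices_in_leaf = [10.0, 12.0, 14.0]          # made-up product prices
labels_in_leaf = ["safe", "risky", "safe"]   # made-up loan labels

print(regression_leaf(prices_in_leaf))       # 12.0
print(classification_leaf(labels_in_leaf))   # safe
```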
As for similarities, both kinds of tree divide the space of independent variables into distinct, non-overlapping regions, and both use a recursive binary splitting approach: splitting starts at the top of the tree, with all observations in a single region, and each step divides a region into two new branches. Splitting continues until a user-defined stopping criterion is met, which yields the fully grown tree.
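Recursive binary splitting can be sketched for a regression tree with a single numeric feature. The assumptions here are mine, not the blog's: the split is chosen to minimize the sum of squared errors in the two halves, candidate thresholds are midpoints between neighbouring values, and the stopping criteria are a hypothetical `max_depth` and `min_samples`.

```python
from statistics import mean


def sse(ys):
    """Sum of squared errors around the mean of ys."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys):
    """Return (threshold, cost) for the best split `x < threshold`, or None."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between neighbouring distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


def grow(xs, ys, depth=0, max_depth=2, min_samples=2):
    """Split recursively until a stopping criterion is met."""
    can_split = depth < max_depth and len(ys) >= min_samples
    split = best_split(xs, ys) if can_split else None
    if split is None:
        return mean(ys)  # leaf: mean response of the region
    t = split[0]
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return {
        "threshold": t,
        "left": grow([x for x, _ in left], [y for _, y in left],
                     depth + 1, max_depth, min_samples),
        "right": grow([x for x, _ in right], [y for _, y in right],
                      depth + 1, max_depth, min_samples),
    }


xs = [1, 2, 3, 10, 11, 12]               # made-up feature values
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]   # made-up targets
tree = grow(xs, ys, max_depth=1)
print(tree)  # {'threshold': 6.5, 'left': 6.0, 'right': 21.0}
```

Raising `max_depth` lets the tree keep splitting each region in two; lowering it stops the growth earlier.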
Implementing decision trees in machine learning has several advantages:
As seen above, they work with both categorical and continuous data and can handle problems with multiple outputs.
Decision trees are among the easiest models to read and understand; even someone from a non-technical background can follow a prediction using the tree’s pictorial representation.
The model’s results are easy to interpret, and its reliability can be measured and quantified.
Decision trees require less time for data preparation; depending on the implementation, they may not need dummy variables, data normalization, or replacement of missing values.
They also shorten data exploration, since a tree quickly shows which variables are most important and how they relate to one another, which helps when creating new features that better predict the target variable.
Decision trees need less data cleaning than many other modeling techniques, because up to a point they are not affected by outliers and missing values.
Decision trees are considered a non-parametric method: they make no assumptions about the distribution of the data or the structure of the classifier.
Non-linear relationships between features do not hurt the performance or efficiency of trees.
Let’s look at some disadvantages of decision trees:
For categorical variables with many levels, information gain is biased in favor of the attributes with the most levels.
When a dataset has many variables with many levels that are related to one another, the calculations become more complex.
Decision trees often suffer from overfitting: an over-complex tree may not generalize well to new data. Setting a minimum number of samples at a leaf node or limiting the maximum depth of the tree reduces this problem.
Very small changes in the data can produce a completely different tree. This is called variance, and it is why decision trees are considered unstable. Techniques such as bagging and boosting were introduced to address it.
Compared with some other modeling techniques, a single decision tree often gives lower prediction accuracy.
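The information-gain bias in the first disadvantage above can be seen in a small experiment. This is a sketch on made-up loan data, with hypothetical column names: an `id` column that is unique per row says nothing useful about new applicants, yet it gets the highest possible information gain because every subset it creates is trivially pure.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in rows]) - remainder


rows = [  # made-up loan applications
    {"id": 1, "income": "high", "label": "safe"},
    {"id": 2, "income": "high", "label": "safe"},
    {"id": 3, "income": "low", "label": "risky"},
    {"id": 4, "income": "high", "label": "risky"},
]

gain_id = information_gain(rows, "id", "label")
gain_income = information_gain(rows, "income", "label")
print(f"id: {gain_id:.3f}, income: {gain_income:.3f}")
```

Here the `id` column scores a gain of 1 bit, well above `income`, even though a split on `id` cannot generalize to any unseen row.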
I hope you now have an idea of the basics of decision trees, and that this blog inspires you to study them in more depth. You have seen that decision trees belong to the class of supervised learning and are used mainly for solving regression and classification problems. Other methods such as random forest and gradient boosting are also popular for solving data science problems; you will learn about them in an upcoming blog. For more blogs on analytics and new technologies, do read Analytics Steps.