Massive amounts of data are available in the world, and processing and analyzing them is a routine task for data scientists. When working manually through any data-rich document, it becomes quite hard to tell the important factors in the dataset from the unimportant ones.
It is almost impossible to go through every minute detail, as doing so not only wastes employees' time but also reduces efficiency and costs the business money. To solve this problem and tackle high-dimensional big data more efficiently, dimensionality reduction techniques are used.
In this article, we look at what dimensionality reduction is and at ten commonly used techniques for it.
What is Dimensionality Reduction?
As we know, tremendous amounts of data are produced every day, and to analyze them we need to reduce high-dimensional data to a more approachable format.
Dimensionality reduction techniques help us do that. We can use them to reduce the number of features in a dataset while retaining most of the important or relevant information. This not only reduces our work but can also improve the performance of our model.
Visualization of data is very important in the process of understanding it. There are many techniques of data visualization, like charts, graphs, etc. Let us consider this example explained by Analytics Vidhya.
Two variables describing people are considered, Age and Height, and a scatter plot is drawn to visualize the relationship between the two.
Plotting two variables is easy, but if we have a large number of variables, there will be a huge number of plots. In those cases, we have to select a subset of the variables that captures as much information as the original set of variables.
For example, if we have the weights of similar objects in different units, like kilograms (here denoted by X1) and pounds (here denoted by X2), it would be convenient to use only one of the two variables, as they both convey the same information. So a 2-dimensional graph of X1 against X2 can be converted into a 1-dimensional graph.
Hence data with p dimensions can be reduced to k dimensions, where k < p. This process is called dimensionality reduction.
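The kilograms and pounds example can be checked numerically. The following is a minimal sketch with made-up weights (the values and variable names are assumptions for illustration): because one column is an exact rescaling of the other, their correlation is 1 and dropping one column loses nothing.

```python
import numpy as np

# Hypothetical weights of five objects, recorded in two units.
weight_kg = np.array([50.0, 62.5, 71.0, 80.0, 95.5])   # X1
weight_lb = weight_kg * 2.20462                          # X2: a pure rescaling of X1

# Stack the two measurements as a 2-column feature matrix (p = 2).
X = np.column_stack([weight_kg, weight_lb])

# A Pearson correlation of 1 means one column is fully determined by the other.
corr = np.corrcoef(X, rowvar=False)[0, 1]

# Keep only the first column: the data is now 1-dimensional (k = 1).
X_reduced = X[:, [0]]

print(X.shape, X_reduced.shape, round(corr, 6))
```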
Dimensionality reduction is very important as it not only saves storage space but also reduces the computational time, resulting in the efficient working of algorithms.
It can be accomplished in two ways: by retaining just the most important variables from the original dataset (known as feature selection), or by identifying a smaller collection of new variables, each of which is a combination of the input variables and contains essentially the same information as the input variables (commonly known as feature extraction).
There are many techniques used for dimensionality reduction that are discussed further in the article.
Top 10 Dimensionality reduction techniques:
Principal Component Analysis (PCA):
Principal Component Analysis (PCA) is one of the most popular methods of dimensionality reduction, as it is used for both data analysis and predictive modeling. It is a statistical method that uses an orthogonal transformation to turn observations of correlated features into a set of linearly uncorrelated variables.
These new variables are the principal components, and they are extracted in such a way that the first principal component explains the maximum variance in the dataset.
The second principal component, which is uncorrelated with the first, attempts to explain the remaining variation in the dataset.
The third principal component attempts to explain the variation that the previous two principal components do not explain, and so on.
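To make the ordering of components concrete, here is a minimal sketch of PCA computed with NumPy's singular value decomposition. The synthetic data, the mixing matrix, and the choice of k = 2 are assumptions made for illustration; in practice a library implementation would usually be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 correlated features built from 2 latent signals.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, 0.5]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# 1. Center each feature so the components describe variance around the mean.
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the orthogonal principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Variance explained by each component (singular values come sorted, largest first).
explained_variance = S**2 / (X.shape[0] - 1)
explained_ratio = explained_variance / explained_variance.sum()

# 4. Project onto the first k principal components.
k = 2
X_pca = X_centered @ Vt[:k].T

print(explained_ratio)
```

The first entry of `explained_ratio` belongs to the first principal component, the second to the second, and so on, and the projected columns in `X_pca` are uncorrelated with each other.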
Missing Value Ratio:
If a variable has too many missing values, we remove it, since it is unlikely to provide much relevant information. To do this, we define a threshold level, and if the fraction of missing values in a variable exceeds that threshold, we drop it. The lower the threshold, the more variables are dropped and the more aggressive the reduction.
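A minimal sketch of the missing value ratio filter with pandas follows. The small table and the 50% threshold are assumptions chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; "height" is missing in 3 of 5 rows.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],
    "height": [170, np.nan, np.nan, np.nan, 165],
    "weight": [68, 75, 80, 72, 64],
})

# Fraction of missing values per column.
missing_ratio = df.isnull().mean()

# Keep only the columns whose missing ratio does not exceed the threshold.
threshold = 0.5  # assumed value; tune it for the dataset at hand
kept_columns = missing_ratio[missing_ratio <= threshold].index.tolist()
df_reduced = df[kept_columns]

print(missing_ratio)
print(kept_columns)
```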
Forward Feature Selection:
The method of forward feature selection is the converse of the procedure of backward elimination. This implies that with this method we do not delete features; instead, we look for the features that result in the greatest gain in the model's performance.
To apply this technique, we start with a single feature and progressively add features, one at a time. In the first step, we train the model on each feature independently.
The feature with the highest performance is chosen. We then add each remaining feature in turn to the chosen set and keep the one that improves performance the most. The procedure is repeated until adding a feature no longer gives a significant improvement in the model's performance.
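The greedy loop described above can be sketched as follows. This is an illustrative implementation, not a reference one: it assumes a linear regression model fitted by least squares, a hold-out validation split scored by R^2, and a hypothetical `min_gain` stopping rule. The helper names `holdout_r2` and `forward_selection` are made up for this example.

```python
import numpy as np

def holdout_r2(X_train, y_train, X_val, y_val):
    """Fit least squares on the training split and return R^2 on the validation split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_val - A_val @ coef
    return 1.0 - (residual @ residual) / np.sum((y_val - y_val.mean()) ** 2)

def forward_selection(X_train, y_train, X_val, y_val, min_gain=0.01):
    """Add one feature at a time until the best candidate improves R^2 by less than min_gain."""
    selected, best_score = [], -np.inf
    remaining = list(range(X_train.shape[1]))
    while remaining:
        # Score every candidate feature together with the already selected ones.
        scores = {j: holdout_r2(X_train[:, selected + [j]], y_train,
                                X_val[:, selected + [j]], y_val)
                  for j in remaining}
        best_j = max(scores, key=scores.get)
        if scores[best_j] - best_score < min_gain:
            break  # no significant improvement, so stop
        selected.append(best_j)
        remaining.remove(best_j)
        best_score = scores[best_j]
    return selected, best_score

# Synthetic data (an assumption): only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)

selected, best_score = forward_selection(X[:200], y[:200], X[200:], y[200:])
print(selected, round(best_score, 3))
```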
Backward Feature Elimination:
The backward feature elimination approach is most commonly employed when creating a Linear Regression or Logistic Regression model. In this method, we can determine a suitable number of features for a machine learning algorithm by choosing the best model performance and the maximum tolerable error rate.
To begin, all n variables from the provided dataset are used to train the model, and its performance is evaluated. Next, we remove one feature at a time and train the model on the remaining n-1 features, doing so n times (once per removed feature) and calculating the model's performance each time.
We look for the variable whose removal has made the least or no difference in the model's performance and remove it, leaving us with n-1 features. The entire procedure is repeated until no further feature can be dropped.
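A minimal sketch of this elimination loop is shown below, under the same kind of assumptions as before: a least-squares linear model, a hold-out split scored by R^2, and a hypothetical `max_drop` tolerance that decides when a feature can no longer be removed. The function names are made up for this example.

```python
import numpy as np

def holdout_r2(X_train, y_train, X_val, y_val):
    """Fit least squares on the training split and return R^2 on the validation split."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_val = np.column_stack([np.ones(len(X_val)), X_val])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    residual = y_val - A_val @ coef
    return 1.0 - (residual @ residual) / np.sum((y_val - y_val.mean()) ** 2)

def backward_elimination(X_train, y_train, X_val, y_val, max_drop=0.01):
    """Start from all features and repeatedly remove the one whose removal hurts least."""
    selected = list(range(X_train.shape[1]))
    current = holdout_r2(X_train, y_train, X_val, y_val)
    while len(selected) > 1:
        # Retrain once per candidate removal, each time on len(selected) - 1 features.
        scores = {}
        for j in selected:
            rest = [c for c in selected if c != j]
            scores[j] = holdout_r2(X_train[:, rest], y_train, X_val[:, rest], y_val)
        drop_j = max(scores, key=scores.get)
        if current - scores[drop_j] > max_drop:
            break  # every remaining feature matters, so stop
        selected.remove(drop_j)
        current = scores[drop_j]
    return selected, current

# Synthetic data (an assumption): only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)

selected, final_score = backward_elimination(X[:200], y[:200], X[200:], y[200:])
print(selected, round(final_score, 3))
```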
Random Forest:
Random Forest is one of the most commonly used algorithms for feature selection. It comes with built-in feature importance, so we don't have to program it separately. This allows us to choose a smaller subset of features.
In this method, we construct a large number of trees against the target variable and then find the subset of features using the usage statistics of each attribute. Because many random forest implementations accept only numerical variables, we must use one-hot encoding to transform categorical input data into numeric data.
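As an illustration, the sketch below uses scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute, with `pandas.get_dummies` for one-hot encoding. The dataset is synthetic and the column names (`signal`, `noise`, `color`) are assumptions; the target is constructed to depend only on `signal`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Hypothetical data: one informative feature, one noise feature, one categorical feature.
df = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "color": rng.choice(["red", "green", "blue"], size=n),
})
y = (df["signal"] > 0).astype(int)  # the target depends only on "signal"

# One-hot encode the categorical column so every input is numeric.
X = pd.get_dummies(df, columns=["color"], dtype=float)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Built-in importances, one per input column, sorted from most to least important.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A smaller feature set can then be chosen by keeping only the columns at the top of `importances`; where to cut is a judgment call for the dataset at hand.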
Factor Analysis:
Factor analysis is a technique in which variables are grouped based on their correlations with other variables: all variables in one group have a high correlation among themselves but a low correlation with variables in other groups.
Each group is referred to as a factor. These factors are few in comparison to the original dimensions of the data. The factors, however, are difficult to observe directly.
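The grouping idea can be illustrated with scikit-learn's `FactorAnalysis`. In this sketch the data is synthetic: six observed variables are generated from two hidden factors, with the loading matrix and noise level chosen arbitrarily for the example.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 500

# Two hidden factors; variables 0-2 load on the first, variables 3-5 on the second.
factors = rng.normal(size=(n, 2))
loadings = np.array([[1.0, 0.9, 0.8, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0, 0.9, 0.8]])
X = factors @ loadings + 0.3 * rng.normal(size=(n, 6))

# Variables in the same group are strongly correlated, across groups they are not.
corr = np.corrcoef(X, rowvar=False)
within_group = corr[0, 1]
across_groups = corr[0, 3]

# Estimate two latent factors and express each sample in terms of them.
fa = FactorAnalysis(n_components=2, random_state=0)
X_factors = fa.fit_transform(X)

print(round(within_group, 2), round(across_groups, 2), X_factors.shape)
```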
Independent Component Analysis:
Independent Component Analysis (ICA) is an information-theory-based dimensionality reduction approach and one of the most commonly used.
The primary distinction between PCA and ICA is that PCA seeks uncorrelated components whereas ICA seeks independent components. If two variables are uncorrelated, they have no linear relationship. They are independent if the value of one gives no information about the value of the other, which is a stronger condition.
In this technique, the observed variables are assumed to be linear mixtures of some unknown latent variables. The latent variables are also assumed to be mutually independent, i.e., not dependent on one another, and hence are referred to as the independent components of the observed data.
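The mixing assumption can be sketched with scikit-learn's `FastICA`. The two non-Gaussian sources, the mixing matrix `A`, and the sample size here are assumptions for illustration; ICA recovers the sources only up to order, sign, and scale, so the check below compares absolute correlations.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 2000

# Two independent, non-Gaussian latent sources (hypothetical).
S = np.column_stack([rng.uniform(-1, 1, size=n),
                     rng.laplace(size=n)])

# The observed variables are linear mixtures of the sources.
A = np.array([[1.0, 1.0],
              [0.5, 2.0]])
X = S @ A.T

# Estimate the independent components from the mixtures alone.
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)

# Absolute correlation between each true source (rows) and each estimate (columns).
cross_corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(np.round(cross_corr, 2))
```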
Low Variance Filter:
Data columns whose values change very little carry little information, much as columns with many missing values do in the missing value ratio approach. As a result, we compute the variance of each variable and eliminate any data columns with a variance below a certain threshold, because low-variance features are unlikely to affect the target variable. Since variance depends on the scale of a variable, the variables should be brought to a comparable scale before their variances are compared.
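A minimal sketch of the low variance filter with pandas is given below. The table, the column names, and the threshold of 0.1 are assumptions; the constant column stands in for a feature that carries no information.

```python
import pandas as pd

# Hypothetical dataset; "has_roof" is constant, so its variance is 0.
df = pd.DataFrame({
    "rooms": [2, 3, 4, 3, 5],
    "floor_area": [50.0, 72.0, 95.0, 70.0, 120.0],
    "has_roof": [1, 1, 1, 1, 1],
})

# Sample variance of each column.
variances = df.var()

# Keep only the columns whose variance exceeds the threshold.
threshold = 0.1  # assumed value; in practice, compare variances on a common scale
kept_columns = variances[variances > threshold].index.tolist()
df_reduced = df[kept_columns]

print(variances)
print(kept_columns)
```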
High Correlation Filter:
When two variables convey roughly identical information, they are said to be highly correlated. Keeping both may hurt the model's performance. To detect this, we compute the correlation coefficient between each pair of independent numerical variables.
If this value is above a threshold, one of the two variables can be removed from the dataset. When deciding which one to keep, we can prefer the variable that has the stronger association with the target variable.
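The filter can be sketched with pandas as follows. The data is synthetic, the column names are assumptions, and the 0.9 threshold is arbitrary. For simplicity this sketch always drops the later column of a correlated pair rather than comparing each one against a target variable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: height recorded twice in different units, plus an unrelated column.
height_cm = rng.normal(170, 10, size=n)
df = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54 + rng.normal(0, 0.1, size=n),  # near-duplicate
    "age": rng.normal(40, 12, size=n),
})

# Absolute pairwise correlations, keeping only the upper triangle so each pair appears once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier column.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df.drop(columns=to_drop)

print(to_drop)
print(list(df_reduced.columns))
```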
Uniform Manifold Approximation and Projection (UMAP):
t-distributed Stochastic Neighbor Embedding (t-SNE) works well for visualizing many datasets, but it has drawbacks, including loss of large-scale (global) information, long computation time, and difficulty in properly representing very large datasets.
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction approach with a quicker runtime that can maintain as much of the local data structure as t-SNE and more of the global structure. It is capable of handling big datasets and high-dimensional data with ease, and it combines the power of visualization with the capacity to reduce the data's dimensionality.
UMAP maps nearby points on the manifold to nearby points in the low-dimensional representation, and distant points to distant points.
Features aren’t going anywhere, and we must understand the features present in whatever technical field we work in. Be it a beginner who has just started with machine learning and data science or a professional who deals with different problems every day, there are some things that everyone needs to keep in mind.
Dimensionality reduction techniques are one of them. Working with very large numbers of features is a necessary skill for every data scientist. The quantity of data we generate each day is unprecedented, and we need to figure out how to use it in novel ways. Dimensionality reduction is a highly helpful method that has done wonders for many people, both professionally and in machine learning hackathons.