20 Data Science Interview Questions

  • Bhumika Dutta
  • Aug 25, 2021
  • Machine Learning



Data science is one of the most popular career options in technology. The field is growing rapidly and is taking the reins in almost every industry. Let us start by understanding what data science is.


In simple words, data science is the process of extracting useful information from raw data. Huge amounts of data are collected every day, and data scientists analyze this data, clean it, and extract the useful parts for the benefit of their company.


Without data scientists, the valuable information hidden within raw data could easily be missed, so data scientists are in demand at many companies. To become one, you must pass the interviews that companies conduct during the selection process, and there are many data science topics to study beforehand.


In this article, we list 20 of the most common topics and questions asked in data science interviews.


(Must read: Why Choose Data Science for Your Career?)


Listing Data Science Interview Questions:


  1. What do you mean by Linear Regression?


Linear regression is a supervised learning algorithm that models the linear relationship between variables. In this relationship, the predictor is called the independent variable and the response is called the dependent variable.


It determines how the dependent variable changes with respect to the independent variables.


Simple linear regression is used when there is only one independent variable, while multiple linear regression is used when there are several independent variables.
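
As an illustration, simple linear regression with one independent variable can be fitted with the ordinary least squares formulas. The sketch below is a minimal example, not a production implementation; the function name and the toy data are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data that lies exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

Multiple linear regression follows the same idea with several independent variables, and is usually solved with a linear algebra library rather than by hand.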



  2. What are the assumptions related to Linear Regression?


The assumptions related to linear regression are:


  • The relationship between the dependent variable and the independent variables must be linear.

  • The features (independent variables) must be independent of each other, that is, not strongly correlated with one another.

  • Homoscedasticity - The variance of the errors must be constant across all values of the input data.

  • The errors (residuals) of the dependent variable around the fitted line should follow a Normal distribution.


(Also read: Statistical data distribution models)



  3. What is the major difference between supervised and unsupervised learning?


Supervised learning uses known, labeled data as input and has a feedback mechanism, whereas unsupervised learning uses unlabeled data as input and has no feedback mechanism.


  • Supervised learning algorithms - decision trees, logistic regression, and support vector machines.
  • Unsupervised learning algorithms - k-means clustering, hierarchical clustering, and the apriori algorithm.



  4. What is Logistic Regression?


Logistic regression is a binary classification technique. It models the connection between the dependent variable and one or more independent variables by estimating a probability with its underlying logistic (sigmoid) function. A threshold, commonly 0.5, is then applied to that probability to provide 0 or 1 as output.


Given below is a diagram that shows the working of logistic regression:


Working of logistic regression (source)
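
The prediction step of logistic regression can be sketched in a few lines. This is only an illustration under assumed values: the weights, bias, and the 0.5 threshold below are hypothetical, and a real model would learn the weights from training data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, bias, threshold=0.5):
    """Return the estimated probability and the resulting 0/1 label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0
    return probability, label


# Hypothetical one-feature model: weight 1.5, bias -1.0.
prob_a, label_a = predict([2.0], weights=[1.5], bias=-1.0)  # z = 2.0
prob_b, label_b = predict([0.0], weights=[1.5], bias=-1.0)  # z = -1.0
print(prob_a, label_a)
print(prob_b, label_b)
```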

  5. What is the difference between regression and classification?


Regression and classification are both supervised learning tasks, and the main property that differentiates the two is the kind of output they produce. Classification predicts discrete labels, whereas regression produces a continuous quantitative value.


Regression predicts a quantity. Its input data may be discrete or continuous. When the input data is ordered by time, the task is known as time series forecasting.


Classification, in contrast, assigns inputs to classes. Classifying into two classes is called binary classification, while multi-class classification and multi-label classification handle more than two classes or more than one label per input. In classification, we place a greater emphasis on accuracy, but in regression, we place a greater emphasis on the error term.


(Also read: A Classification and Regression Tree (CART) Algorithm)



  6. How to build a decision tree?


The following steps are required in making a decision tree:


  • First, the entire data set is taken as input.

  • Then, the entropy of the target variable is calculated, along with the entropy of the predictor attributes.

  • The information gain of each attribute is calculated.

  • The attribute with the highest information gain is chosen as the root node. 

  • The same procedure is repeated on every branch until the decision node of each branch is finalized. 
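
The root-selection step above can be sketched as follows. This is a minimal illustration that only picks the root node, assuming categorical attributes; the toy dataset and attribute names (`outlook`, `windy`, `play`) are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting."""
    parent_entropy = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent_entropy - weighted


# Hypothetical toy dataset: "outlook" fully determines "play".
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
print(gains, root)
```

A full tree builder would apply the same selection recursively to the subset of rows in each branch.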



  7. What is a confusion matrix and how does it help in evaluating the performance of any model?


The confusion matrix is a table used to evaluate a classification model's performance. For binary classification, it is a 2 x 2 matrix with one side representing predicted values and the other representing actual values.


Several performance measures can be derived from a confusion matrix, including accuracy, recall, precision, F1 score, and specificity.
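
As a rough sketch, the four cells of a binary confusion matrix and the metrics derived from them can be computed like this. The labels below are invented for illustration, and the code assumes no denominator is zero.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Derive common metrics; assumes every denominator is non-zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical labels: 4 actual positives, 6 actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
scores = metrics(*counts)
print(counts, scores)
```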



  8. Define Bias in Data science.


Bias is a type of error that arises in a data science model when the algorithm used is not powerful enough to capture the underlying patterns or trends in the data.


To put it another way, this error occurs when the data is too complex for the algorithm, which ends up building a model based on overly simple assumptions. The resulting underfitting hurts accuracy. Simple algorithms such as linear regression and logistic regression can show high bias when the data contains non-linear patterns.



  9. Define Variance in Data science.


Variance is a type of error that arises when a data science model becomes too complex and learns the noise in the training data along with its genuine patterns.


This type of error can arise even when the underlying patterns and trends are relatively simple, if the algorithm used to train the model is overly complex.


As a result, the model is extremely sensitive: it performs well on the training dataset but poorly on the testing dataset and on any data it has not seen before. High variance leads to overfitting and poor testing accuracy in most cases.
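
To make bias and variance concrete, the sketch below compares two deliberately extreme models on noisy data. Both models are stand-ins chosen for illustration, not methods from this article: a constant model that always predicts the training mean (high bias) and a 1-nearest-neighbour model that memorises the training set (high variance). The sine-shaped data and the noise level are also assumptions.

```python
import math
import random


def make_data(n, rng):
    """Noisy samples of y = sin(x); the data itself is hypothetical."""
    points = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0 * math.pi)
        points.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return points


def mse(model, points):
    """Mean squared error of a model over (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)
train = make_data(30, rng)
test = make_data(30, rng)

mean_y = sum(y for _, y in train) / len(train)


def constant_model(x):
    """High bias: ignores x and always predicts the training mean."""
    return mean_y


def nearest_neighbour_model(x):
    """High variance: returns the y of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


for name, model in [("constant", constant_model),
                    ("1-nearest-neighbour", nearest_neighbour_model)]:
    print(name, "train MSE:", mse(model, train), "test MSE:", mse(model, test))
```

The memorising model has zero error on the training set because it reproduces the noise exactly, yet it makes errors on the test set, while the constant model is about equally inaccurate on both.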



  10. Why is Sampling important?


For huge datasets, we often cannot analyse the entire volume at once. Instead, we gather data samples that reflect the entire population.


When creating a sample, we should select data that is a genuine representation of the entire data set. There are two types of sampling techniques: probability sampling and non-probability sampling.
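
A minimal sketch of simple random sampling, one form of probability sampling, is shown below. The population is synthetic and the sizes are arbitrary; the point is only that a statistic computed on a random sample approximates the same statistic on the full population.

```python
import random

rng = random.Random(42)

# Hypothetical population of 100,000 values (mean 50, standard deviation 10).
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Simple random sample: every element has the same chance of being chosen.
sample = rng.sample(population, 1_000)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(population_mean, sample_mean)
```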



  11. What is Normalisation? State the difference between Normalisation and Standardization.


Normalisation is the process of rescaling the values of features to a common range so that the model can perform effectively and not get biased toward any one characteristic.


Normalisation and standardization are both feature scaling techniques, but they convert the data differently. After normalisation, the data ranges from 0 to 1. Standardization, on the other hand, scales the data so that the mean is zero and the standard deviation is one.
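
The two conversions can be sketched as follows. This is an illustrative example using min-max scaling for normalisation and z-scores (with the population standard deviation) for standardization; the function names and sample values are hypothetical.

```python
import statistics


def min_max_normalise(values):
    """Rescale values linearly so they span the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical feature values, for example heights in centimetres.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
normalised = min_max_normalise(heights)
standardised = standardise(heights)
print(normalised)
print(standardised)
```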



  12. What is the Decision Tree algorithm?


A decision tree is a supervised machine learning algorithm. It builds a model from past data whose outputs are already known, learning a hierarchy of decision rules from it. It then uses these rules to recognise patterns and predict classes or output variables for new data.



  13. What do you mean by pruning and entropy in a Decision Tree Algorithm?


Pruning a decision tree means eliminating portions of the tree that are superfluous or contribute little to its predictions. Pruning results in a smaller decision tree that is faster and often more accurate on unseen data.


Entropy is a measure of impurity or unpredictability in a decision tree algorithm. The entropy of a dataset indicates how pure or impure the dataset's values are. In basic terms, it tells us how mixed the class labels in the dataset are.



  14. How does a recommender system work?


Many consumer-facing, content-driven websites use a recommender system to produce recommendations for consumers from a library of available material. These systems provide suggestions based on what they know about the users' preferences from their platform activity.


Consider a movie-streaming service comparable to Netflix or Amazon Prime. If a user has previously viewed and loved films in the action and horror genres, it is reasonable to assume that the user likes these genres. In such a scenario, it makes sense to suggest similar films to this person. These suggestions might also be based on what other people with similar tastes enjoy watching.
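
The idea of recommending what similar users enjoy can be sketched with a small user-based collaborative filter. This is only one possible approach, shown under made-up assumptions: the users, film identifiers, and ratings are hypothetical, and similarity is measured with cosine similarity over rating vectors (unrated films count as 0).

```python
import math

# Hypothetical ratings on a 1-5 scale; names and titles are invented.
ratings = {
    "ana": {"action_a": 5, "horror_a": 4, "romance_a": 1},
    "ben": {"action_a": 4, "horror_a": 5, "horror_b": 5},
    "cleo": {"romance_a": 5, "romance_b": 4, "action_a": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    dot = sum(a[item] * b[item] for item in a if item in b)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, all_ratings):
    """Rank films the user has not rated by similarity-weighted ratings."""
    seen = all_ratings[user]
    scores = {}
    for other, their_ratings in all_ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(seen, their_ratings)
        for item, rating in their_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)


suggestions = recommend("ana", ratings)
print(suggestions)
```

Because "ana" rates films much like "ben" does, the film that "ben" loved and "ana" has not seen is ranked first.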



  15. What is the Naive Bayes Classifier?


The Naive Bayes Classifier is a probabilistic model based on Bayes' Theorem. It is called naive because it assumes that the features are independent of each other given the class. The accuracy of Naive Bayes can sometimes be improved by combining it with kernel functions.
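
A minimal sketch of how Naive Bayes combines probabilities is given below. The class priors and per-word likelihoods are invented numbers for a toy spam filter; a real classifier would estimate them from training data.

```python
import math

# Hypothetical prior probabilities of each class.
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical likelihoods P(word appears | class).
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}


def posterior(words):
    """Apply Bayes' Theorem, treating words as independent given the class."""
    unnormalised = {
        cls: priors[cls] * math.prod(likelihoods[cls][w] for w in words)
        for cls in priors
    }
    total = sum(unnormalised.values())
    return {cls: value / total for cls, value in unnormalised.items()}


result = posterior(["free"])
predicted_class = max(result, key=result.get)
print(result, predicted_class)
```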



  16. What do you mean by Recurrent Neural Network (RNN)?


A recurrent neural network, or RNN for short, is a type of artificial neural network. RNNs are used to detect patterns in sequential data, such as time series of stock prices or temperature readings.


Unlike a plain feedforward network, in which data only flows from one layer to the next, an RNN also feeds the output of its hidden layer back into itself at the next step. Each node still conducts mathematical operations on its input.


RNNs hold contextual information about past calculations in the network, so these operations are temporal. The network is termed recurrent because it repeats the same operation on every element of the sequence. However, depending on previous computations, the result might be different.
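
The recurrence can be illustrated with a single-unit RNN whose hidden state is updated as h_t = tanh(w_x * x_t + w_h * h_(t-1) + b). This is a simplified sketch: the weights are arbitrary assumptions rather than trained values, and real RNNs use vectors and weight matrices instead of single numbers.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state: no context yet
    states = []
    for x in inputs:
        # The same operation is applied at each step, but it also
        # depends on the previous hidden state h.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# The same input value is fed twice, yet the outputs differ
# because the second step remembers the first.
states = rnn_forward([1.0, 1.0])
print(states)
```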



  17. What is a ROC curve?


ROC is the abbreviation for Receiver Operating Characteristic. It is essentially a plot of the true positive rate against the false positive rate, which helps us determine the best tradeoff between the two for various probability thresholds applied to the predicted values.


The closer the curve is to the top left corner, the better the model. To put it another way, the model whose curve has the most area under it is the superior one. This may be seen in the graph below:


ROC curve (source)
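
As a rough sketch, the points of a ROC curve and the area under it (AUC) can be computed by sweeping the threshold over the predicted scores. The labels and scores below are hypothetical, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step, from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical true labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points, auc)
```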

  18. State the difference between Data modeling and Database design.


Data modelling is the initial stage in the process of designing a database. It constructs a conceptual model based on the interrelationships between the data. The process then moves from the conceptual stage through the logical model to the physical schema, and it entails a methodical application of data modelling techniques.


Database design, on the other hand, is the process of designing the database itself. Its output is a comprehensive data model of the database.


Database design, strictly speaking, refers to the comprehensive logical model of a database, but it may also refer to physical design decisions and storage factors.



  19. What is P-value and how is it used?


The p-value measures the statistical significance of an observation. It is the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. The p-value is computed from the test statistic, and it helps us decide whether to reject or fail to reject the null hypothesis.


A small p-value (commonly below 0.05) suggests that the observed effect is unlikely to be due to chance alone, so the null hypothesis is rejected.
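
A small worked example may help. The sketch below computes an exact one-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The experiment (60 heads in 100 flips) and the 0.05 significance level are assumptions chosen for illustration.

```python
from math import comb


def one_sided_p_value(heads, flips):
    """P(at least `heads` heads in `flips` flips), assuming a fair coin."""
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips


# Hypothetical experiment: 60 heads observed in 100 flips.
p_value = one_sided_p_value(60, 100)
alpha = 0.05  # conventional significance level
reject_null = p_value < alpha
print(p_value, reject_null)
```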


(Must read: Hypothesis testing and p-value)



  20. What is the relation between Data science and machine learning?


Data science and machine learning both deal with data, but there are a few fundamental differences between them.


Data science is a wide field that works with enormous amounts of data and helps us extract meaning from it. The complete data science process covers a number of steps involved in extracting insights from the given data, including data collection, data analysis, data modification, and data visualisation.


Machine learning, on the other hand, may be regarded as a sub-field of data science. It likewise deals with data, but here we are only interested in learning how to turn the processed data into a functional model that maps inputs to outputs, such as a model that takes a picture as input and tells us whether it includes a flower.


(Suggested reading: Data science vs machine learning)




Data science is a vast topic and includes subfields like Machine Learning, Statistics, Probability, Mathematics, etc. 


(Similar read: Top Machine Learning Interview Questions)


Anyone planning to work as a data scientist must be knowledgeable in all of these fields and have a good grasp of the concepts.