Science, where can we not find it? It is present in each and every aspect of our lives. We can see it in the computers around us, in the simplest of codes, in the games we play every day, even while we are doing nothing and just sitting on a chair.
Science is never out of the picture. Sometimes it is clearly visible, and sometimes it stays hidden until someone discovers it. Remember how Newton discovered gravity? It was there all along, but it needed the light of discovery.
All of us have dealt with data at some point in our lives, be it the data of our monthly expenses or our savings, a bank passbook or an electricity bill. Data has always been a part of our lives. But what if someone asks us to define data?
We can define it in one line: "data is a collection of information, most often collected in numeric form through different observations." It is a set of information of any kind, from words to numbers and simple figures, stored at a specific place.
Data in computer science means the same thing; the only difference is that it is stored in computer memory. From the observations written in our diaries to the ones stored in an Excel sheet, everything is data.
As there is science in everything, data has its own kind of science too, one that combines computer technology with data and statistics. Let us take a closer look at what data science is.
Data science is a field of study that combines programming skills with the mathematics of data and statistics to get meaningful insights from data. Data scientists are the people who deal with data at this larger scale.
One example of data science in action is what we see in daily news headlines. The daily data of COVID cases in India is studied by data scientists using various methods, and from it they draw conclusions such as when cases are likely to peak and when they will start to decline.
Data science deals with storing data in a clear way and performing analytical studies on it through different known methods. This handling of data, combined with various other fields to get meaningful insights, is called data science, or the science of data.
When we were talking about data science, one more term popped up, “Statistics”.
Statistics, as we know, is a branch of mathematics that deals with numeric data. It is the study that collects, analyses, interprets, and presents data in order to extract meaningful information from it.
For example, we might have seen a pie chart or a bar graph describing the poverty ratio or sex ratio of an area. Here, data from different sources is collected and then studied to reach a meaningful outcome, such as "there are 990 girls per 1000 boys" or "30 percent of people in a specific area are below the poverty line". These are the outcomes we get after statistical processes are applied to the collected data.
Among many, three statistical processes are MEAN, MEDIAN, and MODE. We will look at each one of them one by one. But before going there, we need to know about the arrangement of data.
Data can be arranged in two forms
Raw form
Tabular form
Under the raw form, the data is kept together without any kind of arrangement. For example, if we are given two numbers, say, x1 and x2, then x1, x2 is the raw form of the data.
Under the tabular form, there is a table with the data itself (x) and the frequency of each data value (f).
For example:
f | x
10 | 20
20 | 30
30 | 50
40 | 100
Another example of the tabular form, where the data is given in intervals, is:
x | f
10-20 | 1
20-30 | 2
30-40 | 3
40-50 | 4
Ever calculated the average of something in a math class? That is exactly what the arithmetic mean is. It is all the observations added up and then divided by the number of observations.
Let us have a look at an example:
Suppose a student scores 25, 20, 15, 24, and 23 in 5 subjects, each out of 25 marks. What is the student's mean score?
To calculate this, we first add up all the scores, i.e. 25+20+15+24+23, which gives a sum of 107.
Now, to calculate the mean, we divide the sum by the number of observations, i.e. 5.
So the mean will be, 107/5 which equals 21.4.
So the mean score of the student comes out to be 21.4, which can also be described as the student's average score per subject. The formula for the mean is
Mean = Sum of observations / Number of observations
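The worked example above can be reproduced in a few lines of Python. This is only a minimal sketch; the function name `mean` and the variable names are chosen here for illustration and are not part of the original example.

```python
# Scores from the worked example: five subjects, each out of 25 marks.
scores = [25, 20, 15, 24, 23]


def mean(values):
    """Arithmetic mean: sum of observations / number of observations."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


total = sum(scores)        # 25 + 20 + 15 + 24 + 23 = 107
mean_score = mean(scores)  # 107 / 5 = 21.4

print("Sum of observations:", total)
print("Mean score:", mean_score)
```

Python's standard library also provides `statistics.mean`, which gives the same result for this data.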
Now, when the data is given in the simple tabular form, how do we calculate the mean?
In that case, we multiply each data value (x) by its frequency (f), add up the products (fx), and then divide by the sum of the frequencies.
The formula is:
\bar{x} = \frac{\sum f x}{\sum f}
where \bar{x} (the x with the bar on top) stands for "the mean of x", and each sum runs over all rows of the table.
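As an illustration, the same rule can be applied to the first frequency table shown earlier (f = 10, 20, 30, 40 against x = 20, 30, 50, 100). The sketch below is a hedged example; the helper name `frequency_mean` is made up for this illustration.

```python
def frequency_mean(values, freqs):
    """Mean of tabular data: sum of f*x divided by sum of f."""
    if len(values) != len(freqs):
        raise ValueError("values and freqs must have the same length")
    total_f = sum(freqs)
    if total_f == 0:
        raise ValueError("total frequency must be positive")
    total_fx = sum(f * x for x, f in zip(values, freqs))
    return total_fx / total_f


# First frequency table from the article.
x = [20, 30, 50, 100]
f = [10, 20, 30, 40]

# sum(f*x) = 200 + 600 + 1500 + 4000 = 6300, sum(f) = 100
table_mean = frequency_mean(x, f)
print("Mean of the tabular data:", table_mean)  # 63.0
```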
What if the data in the table is given in intervals? What do we do then?
For example, here is the table:
data | frequency
10-20 | 1
20-30 | 2
30-40 | 3
40-50 | 4
Here, for x, we calculate a "class mark", which is the sum of the upper and lower limits of the interval divided by 2. For the interval 10-20, for example, the class mark is (10+20)/2 = 15.
If we calculate the class mark (x) for every interval, the new table will be:
data | Frequency (f) | Class mark (x)
10-20 | 1 | 15
20-30 | 2 | 25
30-40 | 3 | 35
40-50 | 4 | 45
Now, as we got both f and x, we will follow the same process as we did before to find the mean of the given data.
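A short sketch of this calculation for the interval table above follows. The representation of each interval as a `(lower, upper)` tuple and the variable names are assumptions made for this example.

```python
# Interval table from the article: (lower limit, upper limit) and frequency.
intervals = [(10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [1, 2, 3, 4]

# Class mark = (lower limit + upper limit) / 2 for each interval.
class_marks = [(lower + upper) / 2 for lower, upper in intervals]

# Same formula as for simple tabular data: sum(f*x) / sum(f).
total_fx = sum(f * x for x, f in zip(class_marks, freqs))  # 15 + 50 + 105 + 180
total_f = sum(freqs)                                        # 10
grouped_mean = total_fx / total_f

print("Class marks:", class_marks)
print("Mean of the grouped data:", grouped_mean)  # 35.0
```

Note that the result is an estimate: using class marks assumes the observations in each interval are centred on its midpoint.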
The median of the data is the value of the middlemost observation obtained after arranging the data in ascending order.
For example, suppose we get the set of numbers
1, 2, 3, 4, 5, 6
We arrange them in ascending order and then divide the number of observations by 2. Here that gives 3.
Since the number of observations is even, we add the 3rd and 4th terms and divide the sum by 2. The outcome is our median: (3 + 4)/2 = 3.5.
This is the case for an even number of observations.
If the number of observations is odd, we simply add 1 to the number of observations and divide by two. If the result is 4, for example (as it is for 7 observations), then the 4th term of the ordered set is our median.
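Both cases can be sketched in one small function. This is an illustrative implementation under the rules described above, not the only way to compute a median; the seven-value data set is a hypothetical example added to show the odd case.

```python
def median(values):
    """Median of raw data, following the even/odd rules described above."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)  # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        # Even n: average of the (n/2)-th and (n/2 + 1)-th terms (1-based).
        return (ordered[mid - 1] + ordered[mid]) / 2
    # Odd n: the ((n + 1)/2)-th term (1-based), i.e. index n // 2 (0-based).
    return ordered[mid]


even_median = median([1, 2, 3, 4, 5, 6])    # (3 + 4) / 2 = 3.5
odd_median = median([1, 2, 3, 4, 5, 6, 7])  # 4th term = 4

print(even_median, odd_median)
```

The standard library function `statistics.median` follows the same rules.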
Let us take this grouped data into consideration:
x | f | c
10-20 | 2 | 2
20-30 | 10 | 2+10 = 12
30-40 | 33 | 12+33 = 45
Here c is the cumulative frequency, i.e. the running total of the frequencies up to and including each class. It is calculated in order to find the median.
When the data is continuous and in the form of a frequency distribution like the one above, the median is found in two steps.
Step 1: Determine the median class. Let n be the total number of observations. The median class is the class in which the value n/2 lies, i.e. the first class whose cumulative frequency is greater than or equal to n/2.
Step 2: Use the following formula to find the median:
Median = l + [(n/2 - c)/f] × h
Where,
l = the lower limit of the median class
c = the cumulative frequency of the class preceding the median class
f = the frequency of the median class
h = the width (size) of the median class interval
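The two steps can be sketched for the grouped table above (classes 10-20, 20-30, 30-40 with frequencies 2, 10, 33). The function name `grouped_median` and the tuple representation of the classes are assumptions made for this illustration.

```python
def grouped_median(classes, freqs):
    """Median of grouped data: l + ((n/2 - c) / f) * h."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    half = n / 2
    cumulative = 0  # cumulative frequency of the classes seen so far
    for (lower, upper), f in zip(classes, freqs):
        # Step 1: the median class is the first one whose cumulative
        # frequency reaches n/2.
        if cumulative + f >= half:
            h = upper - lower  # class width
            # Step 2: apply the formula; `cumulative` is c, the cumulative
            # frequency of the class before the median class.
            return lower + ((half - cumulative) / f) * h
        cumulative += f
    raise ValueError("median class not found")


classes = [(10, 20), (20, 30), (30, 40)]
freqs = [2, 10, 33]

# n = 45, n/2 = 22.5, median class = 30-40, l = 30, c = 12, f = 33, h = 10
result = grouped_median(classes, freqs)
print("Median of the grouped data:", round(result, 2))  # about 33.18
```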
Now, as we have cleared the concept of the median, let us move to Mode.
Mode is the simplest of the three measures. It is nothing but the observation with the highest frequency among all the observations, i.e. the one observation that appears the greatest number of times.
For example here is a data set
1, 2, 3, 1, 5, 2, 4, 2
In it, 2 appears 3 times which is the highest frequency, therefore, 2 is the mode of this dataset.
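Counting frequencies like this is easy to sketch with `collections.Counter` from the Python standard library. The helper name `modes` is made up for this example; it returns a list because a data set can have more than one value tied for the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    if not values:
        raise ValueError("mode of an empty data set is undefined")
    counts = Counter(values)          # value -> number of occurrences
    highest = max(counts.values())    # the highest frequency
    return [value for value, count in counts.items() if count == highest]


data = [1, 2, 3, 1, 5, 2, 4, 2]
counts = Counter(data)

print("Frequency of 2:", counts[2])  # 3
print("Mode(s):", modes(data))       # [2]
```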
For grouped data given in intervals, we first take the interval with the highest frequency and call it the modal class. Then we apply the following formula:
Mode = L + [(f_{m} − f_{m-1}) / ((f_{m} − f_{m-1}) + (f_{m} − f_{m+1}))] × w
where:
L is the lower class boundary of the modal group
f_{m-1} is the frequency of the group before the modal group
f_{m} is the frequency of the modal group
f_{m+1} is the frequency of the group after the modal group
w is the group width
This is how we calculate the Mode of grouped data.
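A hedged sketch of the grouped-mode formula follows. The function name `grouped_mode` and the first frequency table in the code are hypothetical, chosen so that the modal class has a neighbour on both sides. When the modal class is the first or last class, this sketch treats the missing neighbour's frequency as 0, which is a common convention but an assumption here.

```python
def grouped_mode(classes, freqs):
    """Mode of grouped data:
    L + (f_m - f_prev) / ((f_m - f_prev) + (f_m - f_next)) * w
    """
    if not freqs or len(classes) != len(freqs):
        raise ValueError("classes and freqs must be non-empty and equal length")
    # Index of the modal class (first class with the highest frequency).
    m = max(range(len(freqs)), key=lambda i: freqs[i])
    f_m = freqs[m]
    # Assumption: a missing neighbour counts as frequency 0.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    lower, upper = classes[m]
    w = upper - lower  # group width
    denominator = (f_m - f_prev) + (f_m - f_next)
    if denominator == 0:
        raise ValueError("formula is undefined when all three frequencies are equal")
    return lower + (f_m - f_prev) / denominator * w


# Hypothetical table: modal class 20-30 with f_m = 8, f_prev = 3, f_next = 5.
example_mode = grouped_mode([(10, 20), (20, 30), (30, 40)], [3, 8, 5])
print(example_mode)  # 20 + 5 / (5 + 3) * 10 = 26.25

# Interval table from the article: the modal class 40-50 is the last class.
article_mode = grouped_mode([(10, 20), (20, 30), (30, 40), (40, 50)], [1, 2, 3, 4])
print(round(article_mode, 2))  # 40 + 1 / (1 + 4) * 10, about 42
```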
After looking at these processes, one can get an idea of how complex other statistical processes, and the study and handling of data in general, can be.
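Now that we have gone through these processes, let us conclude with a simple relationship between mean, median, and mode. The ordering depends on the shape of the data: for a positively (right-)skewed distribution we typically have mean > median > mode, for a negatively skewed one the order is reversed, and for a perfectly symmetric distribution with a single peak all three coincide. For moderately skewed data, an often-quoted empirical relationship is Mode ≈ 3 × Median − 2 × Mean.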