When it comes to summarizing, presenting, and describing data in the simplest possible way, descriptive statistics help. They are often called the first important step in statistical analysis. Most analyses start when an analyst digs into the data and presents some descriptive statistics for the user. Descriptive statistics let us understand the data at a glance and, if done properly, make a good starting point for deeper or more advanced statistical analysis.
In this article, we will see how descriptive statistics can be computed in R, with hands-on examples.
When we start an analysis, the basic summary statistics that describe the data with single values are key. They let us understand the data more precisely, each through one representative value. Such summary statistics are called descriptive statistics. They include the minimum value, maximum value, range, mean, median, quartiles, interquartile range, standard deviation, variance, and more. In this article, we will discuss a few of them.
In this article, we use a dataset named ‘Orange’. You can see what it looks like by typing its name in the R console, as shown below:
How to load the dataset into R
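Since Orange ships with R's built-in datasets package, a minimal sketch of loading and viewing it could look like the following. The calls to head() and dim() are our own additions to keep the printout short; typing Orange alone prints all rows.

```r
# Orange is part of the built-in 'datasets' package, so nothing has to be downloaded
data("Orange")

# Show only the first six rows; typing `Orange` would print every row
head(Orange)

# Number of rows and columns in the data
dim(Orange)
```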
The data consists of 35 observations and three variables: Tree, age, and circumference. “Tree” is an ordered factor with five levels, labelled 1 to 5, that identifies the tree on which the measurements were made. “age” stores the age of the tree in days since 1968/12/31. Finally, “circumference” holds the circumference of the trunk at breast height, in mm.
You can always check the structure of your data using the str() function as shown below:
Example code for str() function with output
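As a small sketch of the structure check described above, the call below prints the class of the object, its dimensions, and the type of every column. The exact printout can vary slightly between R versions, so it is not reproduced here.

```r
# Compact overview of the data: classes, number of observations,
# and the type plus first few values of each variable
str(Orange)

# The same information can be checked column by column
class(Orange$Tree)           # an ordered factor
class(Orange$circumference)  # numeric
```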
To find the minimum and maximum values of any variable in a dataset, R provides the functions min() and max(). The minimum and maximum values are useful because they give you a rough idea of the spread of the data.
Let us find the minimum and maximum of the “circumference” variable.
Example code for min() and max() with output
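A minimal version of this step might look as follows. The variable names min_circ and max_circ are our own choice for illustration.

```r
# Smallest and largest circumference in the Orange data
min_circ <- min(Orange$circumference)
max_circ <- max(Orange$circumference)

min_circ  # 30
max_circ  # 214
```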
We can see that the minimum and maximum circumference values among the thirty-five measurements are 30 mm and 214 mm respectively.
You can also get the minimum and maximum values together using the range() function in R.
Example code with output for the range() function
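The following sketch stores the result in an object named “rg”, as in the text, and then picks out its two elements by position.

```r
# range() returns a vector of length two: c(minimum, maximum)
rg <- range(Orange$circumference)
rg     # 30 214

rg[1]  # first element: the minimum, 30
rg[2]  # second element: the maximum, 214
```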
If you notice, the range() function doesn’t actually return the range; instead, it returns a vector holding the minimum and maximum values. Since “rg” is the object that holds them, you can access each value by indexing it.
Not familiar with functions in R programming? Read our article on Functions in R for a better understanding.
The range in statistics is the difference between the maximum and the minimum value. It gives you a quick picture of the spread of the data.
Example code for finding out the range with output
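One way to sketch this computation is to subtract the two elements returned by range(). The name data_range is ours; diff() applied to the same vector gives an equivalent result.

```r
# Minimum and maximum as a two-element vector
rg <- range(Orange$circumference)

# Statistical range = maximum - minimum
data_range <- rg[2] - rg[1]
data_range  # 184

# Equivalent one-liner
diff(rg)    # 184
```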
Unfortunately, R has no dedicated function that returns the range as a single number. However, we are free to write one of our own. See an example below:
Example code for creating a function that computes the range
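A possible sketch of such a helper is given below. The function name get_range and its na.rm argument are hypothetical, chosen only for this illustration; they are not part of base R.

```r
# Hypothetical helper: returns max - min as a single number.
# na.rm is passed on to range() so missing values can be ignored on request.
get_range <- function(x, na.rm = FALSE) {
  rg <- range(x, na.rm = na.rm)
  rg[2] - rg[1]
}

get_range(Orange$circumference)        # 184
get_range(c(1, NA, 5), na.rm = TRUE)   # 4
```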
In statistical terms, the mean or average is the sum of all elements divided by the total number of elements. In R, the function mean() computes the mean of a given set of values.
Let us find out the mean value for circumference under the Orange dataset.
Example code with output for the mean() function
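A minimal sketch of the call, with mean_circ as our own variable name:

```r
# Arithmetic mean of all 35 circumference measurements
mean_circ <- mean(Orange$circumference)
mean_circ  # 115.8571

# The same value computed from the definition: sum divided by count
sum(Orange$circumference) / length(Orange$circumference)
```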
We can say that the average circumference across the measurements in our data is 115.8571 mm.
Remember that if any value in the variable is missing, mean() returns “NA” unless you pass the argument na.rm = TRUE, which drops the missing values before computing.
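To illustrate the behaviour with missing values, here is a small sketch on a made-up vector. The vector x is invented for this example and is not taken from the Orange data.

```r
# Made-up vector with one missing value
x <- c(30, 58, NA, 115)

mean(x)                # NA, because one element is missing
mean(x, na.rm = TRUE)  # 67.66667, the mean of 30, 58 and 115
```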
The median is the value at the center of your sorted data; it divides the data in half. That is, half of the observations are below this value and half are above it. In R, the function median() does the work for us.
Example code with output for the median() function
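A short sketch of the call, where median_circ is our own variable name. With 35 values, the median is the 18th value once the data is sorted.

```r
# Middle value of the sorted circumference measurements
median_circ <- median(Orange$circumference)
median_circ  # 115

# Same result by hand: the 18th of 35 sorted values
sort(Orange$circumference)[18]
```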
Quartiles are the three cut points that divide your sorted data into four equal parts, each part holding a quarter of the observations. The first quartile is the value below which 25% of the data lies, the second quartile is the value below which 50% lies (which is also the median), and the third quartile is the value below which 75% lies.
The function quantile() lets us compute the first, second, and third quartiles. We just specify the second argument as 0.25, 0.5, and 0.75 respectively to get them.
Example code with output for the quantile() function
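The sketch below asks for all three quartiles in one call by passing a vector as the second argument (named probs). It relies on quantile()'s default interpolation method; other settings of its type argument can give slightly different values.

```r
# First, second and third quartile in a single call
quartiles <- quantile(Orange$circumference, probs = c(0.25, 0.5, 0.75))
quartiles
#   25%   50%   75%
#  65.5 115.0 161.5

# A single quartile, e.g. the first one
quantile(Orange$circumference, 0.25)
```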
The difference between the third and the first quartile is known as the interquartile range in statistics. We can compute it with the same quantile() function as shown below, or use the dedicated function IQR(), which returns the interquartile range of a given variable directly.
Example code that computes Inter Quartile Range
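Both routes can be sketched side by side as follows. The names q, iqr_manual and iqr_builtin are our own.

```r
# Route 1: third quartile minus first quartile
q <- quantile(Orange$circumference, probs = c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])  # unname() drops the "75%" label
iqr_manual   # 96

# Route 2: the built-in IQR() function
iqr_builtin <- IQR(Orange$circumference)
iqr_builtin  # 96
```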
The standard deviation is a measure of how far, on average, the values in a group deviate from their mean.
We can compute it using the sd() function in R.
The variance is the square of the standard deviation; equivalently, the standard deviation is the square root of the variance.
The var() function computes the variance for a given set of values.
One thing to note: both functions always treat the given data as a sample, that is, they divide by n - 1 rather than n. Base R has no built-in function that computes the variance and standard deviation for a population.
Finding standard deviation and variance in R
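A sketch of both calls is given below, together with one common way to obtain the population versions by rescaling the sample variance with (n - 1) / n. The rescaling step and the names pop_var and pop_sd are our own illustration, not a built-in R feature.

```r
x <- Orange$circumference
n <- length(x)

# Sample versions (denominator n - 1), as computed by base R
sample_sd  <- sd(x)
sample_var <- var(x)

# sd() is the square root of var()
all.equal(sample_sd, sqrt(sample_var))  # TRUE

# Population versions (denominator n), obtained by rescaling
pop_var <- sample_var * (n - 1) / n
pop_sd  <- sqrt(pop_var)
```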
Now, what if I tell you that most of the descriptive statistics we have computed above can be generated using a single function in R? Would you believe it? That is the beauty here. The function summary() gives us the minimum, first quartile, median, mean, third quartile, and maximum. It does not report the range, the standard deviation, or the variance.
Example code with output for the summary() function
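As a sketch, calling summary() on the circumference variable returns six named values. The printed values are rounded to four significant digits by default, which is why the mean appears as 115.9.

```r
# Six-number overview of a single numeric variable
s <- summary(Orange$circumference)
s
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#    30.0    65.5   115.0   115.9   161.5   214.0

# Individual entries can be picked out by name
s[["Median"]]  # 115
```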
Finding out the descriptive statistics for given data is a first step towards statistical analysis.
The min() and max() functions help us to get the minimum and the maximum values for a group of values.
The range() function returns the minimum and maximum values together, and those can be extracted by indexing the resulting object.
The mean() and the median() functions compute the mean and the median for us in R.
The quantile() function can be used to compute the quartiles as well as percentiles in R.
The sd() and the var() functions allow us to get the sample standard deviation and the sample variance for given data.
The summary() function generates the minimum, maximum, mean, median, and first as well as third quartile in R.
That is it for this article. In the next one, we will cover another interesting topic in R programming. Also, have a look at our previous article, Dates in R. Until we meet again, stay safe! Keep enhancing! :)