Descriptive Statistics in R

  • Lalit Salunkhe
  • Sep 02, 2020
  • R Programming

When it comes to summarizing, presenting, and describing data in the simplest possible way, descriptive statistics help. They are often called the first and most important step in statistical analysis. Most analyses begin when an analyst digs into the data and presents some descriptive statistics to the user. Descriptive statistics let us understand the data at a glance and, done properly, give a solid starting point for deeper or more advanced statistical analysis.

 

In this chapter, we will see how to compute descriptive statistics in R with hands-on examples.


 

What are they?

 

When we start an analysis, the basic summary statistics that describe the data with single values are key. They let us understand the data through a single representative value. Such summary statistics are called descriptive statistics. They include the minimum, maximum, range, mean, median, quartiles, interquartile range, standard deviation, variance, and more. In this article, we discuss a few of them.

 

 

The Data to be Used

 

In this article, we use a dataset named ‘Orange’, which ships with R. You can see what it looks like by typing its name in the R console, as shown below:


How to load the dataset into R
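As a minimal sketch of this step: Orange is part of the datasets package that R attaches by default, so no import is needed. The calls to head() and dim() are additions here, used only to keep the printed output short.

```r
# Orange is built into R (datasets package), so it is available right away.
# Typing `Orange` at the console prints all rows; head() shows the first six.
print(head(Orange))

# Number of rows (observations) and columns (variables).
orange_dim <- dim(Orange)
print(orange_dim)
```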


The data consists of 35 observations and three variables: Tree, age, and circumference. “Tree” is an ordered factor with five levels, labelled 1 to 5, that identifies the tree on which each measurement was made. “age” stores the age of the tree in days since 1968/12/31. Finally, “circumference” is the circumference of the trunk at breast height, in mm.


 

You can always check the structure of your data using the str() function as shown below:


Example code for str() function with output
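A short sketch of the structure check, assuming the built-in Orange dataset. The extra class() and levels() calls are not part of the original example; they simply confirm what str() reports about the Tree variable.

```r
# str() prints one line per variable: its type and the first few values.
str(Orange)

# Tree is an ordered factor; its five levels are the tree labels.
tree_is_ordered <- is.ordered(Orange$Tree)
tree_levels <- levels(Orange$Tree)
print(tree_is_ordered)
print(tree_levels)
```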


Finding out Minimum and Maximum

 

To find the minimum and maximum values of any variable in a dataset, R provides the min() and max() functions. The minimum and maximum values are crucial as they give you a rough idea of the spread of the data.

Let us find the minimum and maximum of the “circumference” variable.


Example code for min() and max() with output
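The computation can be sketched as follows. The variable names min_circ and max_circ are chosen here for illustration only.

```r
# Smallest and largest circumference (in mm) across all measurements.
min_circ <- min(Orange$circumference)
max_circ <- max(Orange$circumference)

print(min_circ)  # 30
print(max_circ)  # 214
```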


We can see that the minimum and maximum circumference values among the 35 measurements are 30 mm and 214 mm respectively.

You can also get the minimum and maximum values together using the range() function in R.


Example code with output for the range() function
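A minimal sketch of range() on the circumference variable. The result is stored in an object named rg, and its two elements are then picked out by position.

```r
# range() returns a vector of length two: c(minimum, maximum).
rg <- range(Orange$circumference)
print(rg)     # 30 214

# Index the vector to get each value separately.
print(rg[1])  # 30, the minimum
print(rg[2])  # 214, the maximum
```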


Notice that the range() function doesn’t actually return the range; instead, it returns the minimum and maximum values as a two-element vector. You can access these values by indexing, since “rg” is the object that holds them.

 

Not familiar with functions in R programming? Read our article on Functions in R for a better understanding.

 

Finding out the Range

 

The range in statistics is the difference between the maximum and the minimum value. It gives you a quick picture of the spread of the data.


Example code for finding out the range with output
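One way to sketch this is to subtract the minimum from the maximum directly. The name circ_range is used here for illustration only.

```r
# Range as a single number: maximum minus minimum.
circ_range <- max(Orange$circumference) - min(Orange$circumference)
print(circ_range)  # 184
```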


Unfortunately, R has no dedicated function that returns the range as a single number. However, we are free to write one of our own. See an example below:


Example code for creating a function that computes the range
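A possible version of such a helper is sketched below. The name get_range is hypothetical and the original screenshot may have used a different one. Combining diff() with range() is an equivalent alternative built from base R functions.

```r
# Hypothetical helper: returns the range of a numeric vector as one number.
get_range <- function(x) {
  max(x) - min(x)
}

range_custom <- get_range(Orange$circumference)
print(range_custom)  # 184

# Same result without a custom function: difference of the two range() values.
range_diff <- diff(range(Orange$circumference))
print(range_diff)    # 184
```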



The Mean or Average

 

In statistical terms, the mean or average is the sum of all elements divided by the total number of elements. R has a function named mean() that computes the mean of a given set of values.

 

Let us find the mean circumference in the Orange dataset.


Example code with output for the mean() function
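A short sketch of the call, followed by the same value computed by hand from the definition (sum divided by count) as a cross-check. The variable names are illustrative.

```r
# Average circumference using the built-in function.
mean_circ <- mean(Orange$circumference)
print(mean_circ)  # 115.8571

# The same value from the definition: sum of values / number of values.
mean_manual <- sum(Orange$circumference) / length(Orange$circumference)
print(mean_manual)
```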


We can say that the average circumference across the measurements in our data is 115.8571 mm.

 

Remember that if any value in the variable is missing, this function returns NA unless you pass na.rm = TRUE.
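To illustrate the missing-value behaviour, here is a small sketch on a made-up vector. Orange itself has no missing values, so the numbers below are invented for the example. The na.rm argument of mean() drops missing values before averaging.

```r
# Made-up vector with one missing value (not taken from Orange).
with_na <- c(30, 58, NA, 115)

# A single NA makes the result NA.
mean_default <- mean(with_na)
print(mean_default)

# na.rm = TRUE removes missing values first, then averages the rest.
mean_no_na <- mean(with_na, na.rm = TRUE)
print(mean_no_na)
```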

 

The Median

 

The median is the value at the center of your sorted data and divides it in half: half of the observations are below this value and half are above it. In R, the median() function does the work for us.


Example code with output for the median() function
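A minimal sketch of the call on the circumference variable. The name median_circ is illustrative.

```r
# Middle value of the sorted circumference measurements.
median_circ <- median(Orange$circumference)
print(median_circ)  # 115
```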


The Quartiles

 

Quartiles are the three cut points that divide your sorted data into four equal parts, each holding a quarter of the observations. The first quartile is the value below which 25% of the data fall, the second quartile is the value below which 50% fall (which is also the median), and the third quartile is the value below which 75% fall.

 

The quantile() function lets us compute the first, second, and third quartiles. We just specify the second argument as 0.25, 0.5, and 0.75 respectively.


Example code with output for the quantile() function
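The three calls can be sketched as below, using quantile() with its default settings. The names q1, q2, and q3 are illustrative. Passing a vector of probabilities returns all three quartiles in one call.

```r
# One quartile per call; the second argument is the probability.
q1 <- quantile(Orange$circumference, 0.25)
q2 <- quantile(Orange$circumference, 0.50)
q3 <- quantile(Orange$circumference, 0.75)
print(q1)  # 65.5
print(q2)  # 115
print(q3)  # 161.5

# All three at once by passing a vector of probabilities.
quartiles <- quantile(Orange$circumference, c(0.25, 0.50, 0.75))
print(quartiles)
```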


The Interquartile Range

 

The difference between the third and first quartiles is known as the interquartile range. We can use the same quantile() function to get it as shown below, or use the IQR() function, which computes the interquartile range for a given variable directly.


Example code that computes Inter Quartile Range
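Both approaches can be sketched side by side. The names iqr_manual and iqr_builtin are illustrative, and unname() is added only to drop the "75%" label that quantile() attaches to its result.

```r
# Approach 1: third quartile minus first quartile.
iqr_manual <- unname(quantile(Orange$circumference, 0.75) -
                     quantile(Orange$circumference, 0.25))
print(iqr_manual)   # 96

# Approach 2: the dedicated IQR() function.
iqr_builtin <- IQR(Orange$circumference)
print(iqr_builtin)  # 96
```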


The Standard Deviation and the Variance

 

The standard deviation measures how far the values in a group typically deviate from their mean.

We can compute it using the sd() function in R.

 

The variance is the square of the standard deviation; equivalently, the standard deviation is the square root of the variance.

The var() function computes the variance of a given set of values.

Note that these two functions always treat the given data as a sample, dividing by n - 1. Base R has no built-in function that computes the population variance or standard deviation.


Finding standard deviation and variance in R
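A sketch of both calls is given below. The last step is an addition, not part of the original example: it rescales the sample variance by (n - 1) / n to get a population variance by hand, which is one common workaround when that version is needed.

```r
x <- Orange$circumference

# Sample standard deviation and sample variance (denominator n - 1).
sd_circ <- sd(x)
var_circ <- var(x)
print(sd_circ)
print(var_circ)

# The variance is the square of the standard deviation.
print(all.equal(sd_circ^2, var_circ))

# Population variance computed by hand: rescale by (n - 1) / n.
n <- length(x)
pop_var <- var_circ * (n - 1) / n
pop_sd <- sqrt(pop_var)
print(pop_var)
print(pop_sd)
```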


The summary() Function

 

Now, what if I told you that most of the descriptive statistics computed above can be generated with a single function in R? Would you believe it? That is the beauty here. The summary() function gives us the minimum, maximum, mean, median, and first and third quartiles.


Example code with output for the summary() function
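A minimal sketch of summary() on a single variable and on the whole data frame. The name circ_summary is illustrative, and the original screenshot may have shown only one of the two calls.

```r
# Six-number summary of one variable:
# Min., 1st Qu., Median, Mean, 3rd Qu., Max.
circ_summary <- summary(Orange$circumference)
print(circ_summary)

# Applied to the whole data frame, summary() reports every column;
# for the factor Tree it shows counts per level instead.
print(summary(Orange))
```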


Summary

 

  • Computing descriptive statistics for the given data is the first step towards statistical analysis.

  • The min() and max() functions help us to get the minimum and the maximum values for a group of values.

  • The range() function returns the minimum and maximum values together, and each can be extracted by indexing the resulting object.

  • The mean() and the median() functions compute the mean and the median for us in R.

  • The quantile() function can be used to compute the quartiles as well as percentiles in R.

  • The sd() and var() functions give us the sample standard deviation and the sample variance of the given data.

  • The summary() function generates the minimum, maximum, mean, median, and first and third quartiles in R.

 

That is it for this article. In the next one, we will cover another interesting topic in R programming. Also, have a look at our previous article about dates in R, Dates in R. Until we meet again, stay safe! Keep enhancing! :)
