
Data Structures in R : Part 2

  • Lalit Salunkhe
  • Jul 06, 2020

In my last article, Data Structures in R : Part 1, we discussed the first three of the six data structures in R programming: Vectors, Lists, and Matrices. In this article, we deal with the remaining three: Data Frames, Arrays, and Factors. If you have not yet gone through the previous article in this series, follow the link Data Structures in R : Part 1 for a detailed overview of the first three data structures. Let's discuss Data Frames, Arrays, and Factors in depth.


 

Data Frames in R Programming

 

Well, why data frames? That was the first question that used to pop up in my mind before I started paying serious attention to the R programming language. The reasons are quite simple. In the day-to-day life of a data analyst or data scientist, data usually arrives in tabular form (a two-dimensional structure of rows and columns). Such a table may consist of multiple columns with different data types.


 

Let us consider an example of student data from a school in any part of the world. What columns will that data hold? Student ID, Student Name, Gender, Standard/Class, Subjects, and so on. These columns may well be of different data types. Vectors and matrices cannot handle such data well, because they hold a single data type and coerce mixed values to a common one. Lists can hold mixed types, but they are not tabular, two-dimensional structures. This is where I believe the creators decided to be more creative and specify a data structure that holds columns of different data types together in a two-dimensional structure, without the columns losing their original properties and without coercion.


 

Therefore, in general, we can say that data frames are a data structure in R that allows users to store data of different types in a tabular form (with rows and columns). However, note that even though different types can be stored together, all the columns of a data frame must have the same length. That is, they must all contain an equal number of elements.
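
To see why a dedicated structure is needed, here is a small sketch of what happens when mixed types are placed in a vector or a matrix. The names mixed_vector and mixed_matrix and the values in them are made up for illustration.

```r
# Assumption: illustrative values only, not data from the article.

# A vector holds one type, so mixed values are coerced to character.
mixed_vector <- c(1, "Asha", TRUE)
vector_class <- class(mixed_vector)

# A matrix behaves the same way: the numeric ids become text.
mixed_matrix <- cbind(id = c(1, 2), name = c("Asha", "Ben"))
matrix_type <- typeof(mixed_matrix)

print(vector_class)
print(matrix_type)
```

In both cases the numeric values end up stored as text, which is the coercion a data frame avoids by keeping each column as its own vector.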

 

Let’s see an example through the screenshot below:


Code Illustration 1



In the image above, we have created four vectors of equal size that represent the student ID, student name, gender, and subject, respectively. These four vectors are passed as arguments to the built-in data.frame() function, and the resulting data frame is stored in the variable df1.


 

Once we print the result, we can see that each column keeps its own data type and no coercion across columns takes place.

This is how a data frame is created under R. 
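
For readers who cannot see the screenshot, the steps described above can be sketched roughly as follows. The vector names and student values are assumptions made for illustration; only data.frame() and the variable name df1 come from the article.

```r
# Assumption: hypothetical student records, chosen only for illustration.
student_id   <- c(101L, 102L, 103L, 104L)
student_name <- c("Asha", "Ben", "Chen", "Dina")
gender       <- c("F", "M", "M", "F")
subject      <- c("Maths", "Physics", "Biology", "History")

# stringsAsFactors = FALSE keeps text columns as character vectors.
# It is the default from R 4.0.0 onward; older versions default to TRUE.
df1 <- data.frame(student_id, student_name, gender, subject,
                  stringsAsFactors = FALSE)

print(df1)

# Inspect the type held by each column.
col_types <- sapply(df1, class)
print(col_types)
```

Setting stringsAsFactors explicitly is a design choice here so that the sketch behaves the same on older and newer R versions.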

A data frame has certain characteristics; let’s discuss them here for a better understanding.

 

Characteristics of a data frame

 

  • Data frames can store data of different types (any type valid in R) in different columns. However, within each column, the data must be of a single, homogeneous type. Remember the rule of thumb - Within Homogeneous, Between Heterogeneous!

  • Each column must have a header (column name).

  • Every column must contain the same number of entries (rows/elements).
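
These characteristics can be checked with a short sketch. The data frame scores and its values are hypothetical; the lengths 3 and 2 are chosen on purpose, because data.frame() recycles a shorter column when the longer length is an exact multiple of it.

```r
# Assumption: illustrative data, not taken from the article.
scores <- data.frame(id = 1:3,
                     name = c("Asha", "Ben", "Chen"),
                     passed = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)

# Each column is homogeneous, while the columns differ from one another.
column_classes <- sapply(scores, class)
print(column_classes)

# Every column has a header.
print(names(scores))

# Columns whose lengths cannot be reconciled are rejected with an error.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  data.frame(id = 1:3, name = c("Asha", "Ben")),
  error = function(e) conditionMessage(e)
)
print(result)
```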

 

Let’s move towards the next data structure in R.

 

Arrays in R Programming

 

Until now, we have seen one-dimensional data structures (vectors and lists) as well as two-dimensional ones (matrices and data frames, with rows and columns). Apart from these, there is a separate data structure that allows us to store data in more than two dimensions: the array.

 

Arrays are data structures in R that allow a user to store data in more than two dimensions. An array organizes its data into rows, columns, and matrices. For example, an array with dimensions (2, 2, 4) consists of four matrices, each with two rows and two columns. R provides a dedicated built-in function, array(), for creating arrays.

 

Syntax for an array in R is as shown below:

 

name_of_array <- array(data, dim = c(row_size, column_size, matrices), dimnames = dimnames)

 

Where,

data - a vector (or several vectors combined with c(), typically numeric) whose elements fill the array.

dim - a vector giving the size of each dimension, here c(row_size, column_size, matrices).

row_size - the number of rows in each matrix of the array.

column_size - the number of columns in each matrix of the array.

matrices - the number of matrices in the array.

dimnames - an optional list that sets the names for each dimension (rows, columns, and matrices).
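
As a minimal sketch of these arguments, assuming the illustrative values 1 to 16 and made-up dimension names, note that dim is given as a vector built with c() and dimnames is passed as a separate list argument:

```r
# Assumption: values and dimension names are invented for illustration.
values <- 1:16

arr <- array(values,
             dim = c(2, 2, 4),                       # 2 rows, 2 columns, 4 matrices
             dimnames = list(c("r1", "r2"),          # row names
                             c("c1", "c2"),          # column names
                             c("m1", "m2", "m3", "m4")))  # matrix names

print(dim(arr))

# Elements are filled column by column within each matrix,
# so row "r1", column "c2" of matrix "m1" holds the third value.
element <- arr["r1", "c2", "m1"]
print(element)
```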

 

Let’s see how to create an array in R with an example.


Code Illustration 2



Here, in this code illustration, we have used two vectors, x and y, of length 4 and 8 respectively. These vectors are then passed to the array() function to create an array of four matrices, each with two rows and two columns.
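
A rough reconstruction of this illustration is sketched below. The contents of x and y and the name arr2 are assumptions, as is combining the two vectors with c(); only the lengths and the dimensions come from the description above. Under these assumptions the vectors supply 12 values for 16 cells, so array() recycles the data from the start to fill the last matrix.

```r
# Assumption: the actual values in the screenshot are unknown.
x <- c(1, 2, 3, 4)                   # length 4
y <- c(5, 6, 7, 8, 9, 10, 11, 12)    # length 8

# Combine both vectors and arrange them as 4 matrices of 2 x 2.
arr2 <- array(c(x, y), dim = c(2, 2, 4))
print(arr2)

# 12 values for 16 cells: the fourth matrix reuses the first four values.
first_matrix  <- arr2[, , 1]
fourth_matrix <- arr2[, , 4]
```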

 

Now it is time to move on to the final data structure in R programming, i.e. Factors. Let’s see how factors work in R.


 

Factors in R Programming

 

When we work with categorical data, we use factors, which store the data together with its levels, that is, the distinct categories the values can take. For example, when we purchase a product, we give a star rating that specifies how satisfied we are with that product. One star means unsatisfied and five stars means fully satisfied. Whenever we work on a product rating data set from any consumer site, every rating falls into one of a predefined set of star values. Therefore, we can say that product rating data is a factor with five levels (one star, two star, three star, four star, and five star).

 

A factor can be created using the built-in R function factor().

 

Let us see an example where we try to find out the levels within a series of repeated text values.


Code Illustration 3



In this example, we have created a vector ‘d’ with a series of character values. After that, we checked whether the vector is a factor using the function is.factor(). The function returns a boolean output: TRUE means the object is a factor, and FALSE means it is not.

 

We then used the factor() function to convert the data into a factor, and after printing the output, you can see that the data now has five levels. Note that, by default, the levels printed in the output are sorted alphabetically. If you are using numeric values, the levels are sorted in ascending numerical order.
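
The behaviour described above can be sketched as follows. The vector name d comes from the article, while its contents are assumed here, reusing the star ratings from the earlier example; the names f, rating, and num_levels are hypothetical.

```r
# Assumption: the values in the screenshot are unknown; these are illustrative.
d <- c("three star", "one star", "five star", "two star",
       "four star", "one star", "five star", "three star")

before <- is.factor(d)   # d is still a plain character vector
f <- factor(d)
after <- is.factor(f)    # f is now a factor

# Default levels are sorted alphabetically.
print(levels(f))

# Passing levels explicitly keeps the natural rating order instead.
rating <- factor(d, levels = c("one star", "two star", "three star",
                               "four star", "five star"))
print(table(rating))

# Numeric input: levels come out in ascending numerical order.
num_levels <- levels(factor(c(3, 1, 2, 1)))
print(num_levels)
```

Supplying the levels argument is optional, but it is a handy choice whenever the alphabetical order does not match the meaning of the categories.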

 

 

Conclusion


We have learned about data frames, which are data structures in R that allow us to store data of multiple types within a two-dimensional table, without the columns losing their original properties and without coercion. We have seen arrays, which are multi-dimensional data structures that allow us to represent data in more dimensions than the conventional two. Finally, we have walked through factors, which are data structures that allow us to store categorical data as a fixed set of levels. We will stop this article here and get back to you with a new article in this series soon. Until then, stay safe! 😊
