Data visualization has been in use for roughly 250 years, and likely longer. Building a visualization on a computer from thousands of rows of data can take many working hours, but the result is worth the effort.
Dedicated data visualization tools such as Tableau and PowerBI have appeared in recent years and have cut down the hours spent at the desktop considerably. Even so, many users still prefer programming languages because of the cost of those tools.
In this article, we will discuss the basic data visualization techniques available in R, with hands-on examples.
What is Data Visualization?
Data visualization is the technique of representing data as a graph or in another pictorial format. It helps management make decisions accurately without having to go through the entire table. Graphs, which are a central part of data visualization, make decision-making easier because they show the ups and downs, the patterns, and the trends in the data in a single, simple chart.
Four basic plots are used in R Programming: bar plots, histograms, box plots, and scatter plots.
We will discuss these charts one by one in detail, along with a few additional options for each.
Bar plots are among the few data visualizations that most of us studied back in high school.
Whenever we have a variable that holds categorical values, or a numeric variable with only a few distinct values, we can use a bar chart to present it visually. The chart draws one bar for each distinct value of the variable on the X-axis and plots its frequency on the Y-axis. We are using the mtcars data to create a barplot here. See the code below:
#Creating a barplot visualization
barplot(table(mtcars$gear))
Here, the gear variable from mtcars represents the number of forward gears a car has. This variable takes the values 3, 4, and 5. We have used the table() function to obtain the frequency of each gear value. See the visualization below:
Barplot visualization for the gear variable from the mtcars data
Here, we can easily see that most of the cars in our data have a three-gear system. This is the power of data visualization. You don't always need to go through the entire data to get a meaningful insight.
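To double-check what the bars are showing, the counts behind them can be printed directly. The snippet below is a minimal sketch; the names gear_counts, gear_share, and most_common are chosen here for illustration only.

```r
# Frequency table that barplot() turns into bars
gear_counts <- table(mtcars$gear)
print(gear_counts)

# Share of cars for each gear value, as percentages
gear_share <- round(100 * prop.table(gear_counts), 1)
print(gear_share)

# The gear value with the highest frequency (the tallest bar)
most_common <- names(gear_counts)[which.max(gear_counts)]
print(most_common)
```

The tallest bar in the chart corresponds to the value stored in most_common.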
The basic bar chart looks plain, though, as it has no labels for the X-axis and Y-axis, no title, no colours, and so on. We can customize the barplot function with several additional arguments.
#Creating a barplot visualization with additional arguments
barplot(table(mtcars$gear), xlab = "Number of Gears",
ylab = "Frequencies", main = "Cars with number of Gears",
col = "navyblue", border = "red")
In this code, we have used xlab, ylab, main, col, and border as additional arguments. They specify, respectively, the label for the x-axis, the label for the y-axis, the main title of the graph, the colour of the bars, and the colour of the bar borders. Let's see how this changes our graphical layout.
The output of the bar chart with additional arguments
Now, the graph looks more complete than the previous one, as it has a title, axis labels, and colours for the bars and borders, which make it visually appealing.
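As a further optional touch, the frequencies can be written on top of the bars. The sketch below relies on the fact that barplot() returns the x positions of the bar midpoints; the names gear_counts and bar_mid, and the extra headroom set through ylim, are choices made here for illustration.

```r
# Frequencies of each gear value
gear_counts <- table(mtcars$gear)

# barplot() returns the midpoints of the bars on the x-axis
bar_mid <- barplot(gear_counts,
                   xlab = "Number of Gears",
                   ylab = "Frequencies",
                   main = "Cars with number of Gears",
                   col = "navyblue", border = "red",
                   ylim = c(0, max(gear_counts) * 1.2)) # headroom for labels

# Write each count just above its bar
text(x = bar_mid, y = as.vector(gear_counts),
     labels = as.vector(gear_counts), pos = 3)
```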
When you want to represent a single numeric variable so that its distribution becomes visible, the histogram is the preferred graphical representation. In R, the hist() function does the task for us. Here, we will use airquality, a built-in dataset in R, to draw the histogram.
#Creating a histogram visualization
hist(airquality$Temp)
This simple one-line code does the task, and it lets you see the distribution of a numeric variable. In this example, we have used the Temp variable from the airquality data, which holds the temperature values in the dataset. Let us see what the output of this code snippet looks like.
Histogram for the Temp variable from the airquality dataset
By default, the graph already comes with labels for the x-axis and the y-axis along with a main title. We can customize these, as well as the colour of the histogram bars and their borders. The code below shows how to customize the histogram.
hist(airquality$Temp,
xlab = "Temp in Fahrenheit",
ylab = "Frequency Numbers",
main = "Histogram for Temp from airquality data",
breaks = 10,
col = "lightseagreen",
border = "red")
Here, we have used the same additional arguments as in the barplot to customize the labels for both axes, the main title, and the colour and border of the bars. The breaks argument gives the number of cells for the histogram; R treats a single number as a suggestion and picks nearby round break points. The graph shows the frequency distribution of the Temp variable from the airquality data. See the output graph below:
Customized histogram with additional arguments
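The numbers behind a histogram can also be inspected without drawing it. The following is a minimal sketch: hist() with plot = FALSE returns the break points and the count in each cell, and freq = FALSE switches the y-axis from counts to densities so that a density curve can be overlaid. The names temp and h are chosen here for illustration.

```r
temp <- airquality$Temp

# Compute the histogram without drawing it
h <- hist(temp, breaks = 10, plot = FALSE)
print(h$breaks) # cell boundaries actually chosen by R
print(h$counts) # number of observations in each cell

# Draw the histogram on the density scale and overlay a density curve
hist(temp, breaks = 10, freq = FALSE,
     xlab = "Temp in Fahrenheit",
     main = "Density-scaled histogram of Temp",
     col = "lightseagreen", border = "red")
lines(density(temp), lwd = 2)
```

On the density scale the bar areas add up to one, which is what makes the picture comparable to a probability distribution.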
Sometimes you need more information than the measures of central tendency (mean, median, mode) can give. A box plot helps here, because it also shows the spread of the data you are working on. In R, the boxplot() function comes as a part of base R. Let us see how to create a basic boxplot in R.
#Creating a boxplot visualization
boxplot(airquality$Wind)
This code creates a boxplot for the Wind variable from the airquality dataset. The result is shown in the screenshot below:
Boxplot for the Wind variable from air quality data
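The values that a boxplot draws can be obtained as numbers too. The sketch below uses boxplot.stats(), which returns the five statistics behind the box and whiskers along with any points that would be drawn as outliers; the name wind_stats is chosen here for illustration.

```r
# Statistics used to draw the box, the whiskers, and the outliers
wind_stats <- boxplot.stats(airquality$Wind)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(wind_stats$stats)

# Points lying beyond the whiskers (drawn as separate symbols)
print(wind_stats$out)

# The usual numeric summary, for comparison
print(summary(airquality$Wind))
```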
We can also look at how the distribution of one variable changes across the values of another variable, all in a single visual. See the code below:
#Creating a boxplot for one variable vs other
boxplot(Wind ~ Month, data = airquality)
Here, the formula Wind ~ Month tells the boxplot function to draw a separate box of Wind values for each value of the Month variable. See the output as shown in the image below:
Boxplot in comparison with two variables
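The thick line inside each box is the median of that group. As a rough cross-check of the plot, the monthly medians can be computed with tapply(); the name wind_by_month is chosen here for illustration.

```r
# Median wind speed for each month in the airquality data
wind_by_month <- tapply(airquality$Wind, airquality$Month, median)
print(wind_by_month)

# Number of observations behind each box
print(table(airquality$Month))
```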
We can customize the boxplot the same way we customized the barplot and the histogram. Here we will use the same xlab and ylab arguments to add labels to the x-axis and the y-axis respectively. Additionally, we are using the pch and cex arguments to change the shape and size of the outlier points (the hollow circles drawn beyond the whiskers) so that they stand out more. See the code as shown below:
#Creating a customized boxplot for one variable vs other
boxplot(Wind ~ Month, data = airquality,
xlab = "Months",
ylab = "Mile/hr. Avg. Wind Speed",
main = "Months vs Mile/hr. Avg. Wind Speed",
pch = 20,
cex = 2,
col = "turquoise",
border = "red")
Here, the box plot is updated with labels for both axes, a main title, larger solid outlier points, a fill colour for the boxes, and a colour for the box outlines and whiskers.
Scatterplots are important when we want to examine the relationship, if any, between two numeric variables. They give a glimpse of what sort of relationship the two variables could have (for example, a negative relationship, where an increase in one variable comes with a decrease in the other). They are very useful in the day-to-day life of a data scientist.
To generate a scatterplot, we have the plot() function in R, which does the work for us. See the code below for creating scatter plots.
#Creating a scatterplot
plot(speed ~ dist, data = cars)
Here, the formula speed ~ dist puts the speed variable on the y-axis and the dist variable on the x-axis, both taken from the cars dataset. Each point represents one observation, and all the points together make up the scatterplot. See the image below for a better understanding.
Scatterplot for speed vs dist
Here, if you look for a pattern among the points scattered on the plane, you can see that as the speed increases, the distance increases as well, which means that the values of both variables move in the same direction. This is an important piece of information: it tells us that the two variables share a positive relationship.
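The visual impression of a positive relationship can be backed up with a number. The sketch below computes the Pearson correlation coefficient with cor(); this is one common way to quantify the relationship and is an addition here, not something the plot itself reports. The name r is chosen for illustration.

```r
# Pearson correlation between speed and stopping distance
r <- cor(cars$speed, cars$dist)
print(r)

# A value above zero means the two variables tend to rise together
if (r > 0) {
  print("speed and dist are positively related")
}
```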
Let us customize the scatterplot the same way we customized our previous plots.
#Creating a customized scatterplot
plot(speed ~ dist, data = cars,
xlab = "distance in ft",
ylab = "Speed in Miles/hr",
main = "Scatterplot of distance vs speed of cars",
pch = 20,
cex = 2,
col = "blue")
Let us see how this code changes our scatterplot visually.
Customized scatter plot
This is how the basic four charts in R Programming work. Let me summarize this article for you.
Data visualization has gained a lot of popularity in recent years because it lets the user look at the data in a different way: instead of running through tables and values, the user reads the entire data from visuals.
There are four basic plots in R Programming, namely bar plots, histograms, box plots, and scatter plots.
Bar plots are mainly used to represent variables that have either qualitative values or a small number of distinct numeric values.
When we have to figure out the distribution pattern of a numeric variable on a continuous scale, we usually use histograms.
Boxplots are useful for showing more summary statistics than just the central tendency in the same visual. They are also useful for exploring the relationship between a numeric variable and a categorical or finite numeric variable.
Scatterplots are useful for getting an idea of the relationship between two numeric variables, where a change in the values of one variable is accompanied by a change in the values of the other.
This is it from this article. In our next article, we will discuss some advanced visualizations from R programming. Until then, stay safe!