How to Use SQL for Data Analysis?

Bhumika Dutta
Aug 17, 2021

Introduction

If someone is interested in data science or has a basic idea about it, they have quite possibly heard about SQL by now. SQL, or Structured Query Language, is a programming language designed mainly for managing data in relational databases. It was first developed in the 1970s and today is the most widely used programming language for accessing data in databases.

So, it is a must-learn for data science professionals or enthusiasts as it is hard to carve out a career in data science or data analytics without learning SQL. By now, one must be wondering, what exactly makes SQL so important?

The production of data has skyrocketed, and huge amounts of data are generated every day. Because much of this data comes from real-world sources, a large share of it ends up being unstructured.

To analyze this data and make use of it, businesses hire business analysts or data scientists for purposes like data analysis, data mining, and getting general insights from the available data.

For them, SQL has proven itself to be one of the most useful tools. A data analyst can use SQL to access, read, manipulate, and analyze the data stored in a database and produce insights that help one make better decisions. Since we are talking about databases, let us first learn what a database is.

What is a database?

The formal definition of a database according to Wikipedia is, “A database in computing is an organized collection of data stored and accessed electronically from a computer system”.

A list of students from any class, a grocery shopping list, a list of employees of a company: all of them are practical examples of a database. Digitally, there can be numerous ways to access and organize different databases that might be designed for different purposes.

If one is well versed with Excel, or even has a basic idea about it, they will understand the concept of tables and of sorting data by some factor, such as names sorted alphabetically or years sorted from oldest to newest. A table in a relational database works in much the same way.

SQL for data analysis

As mentioned earlier, SQL is one of the most popular languages used for data analytics. SQL queries often supply the data behind user-friendly dashboards that show data in a number of ways. SQL is also widely used for building and querying data warehouses because of its ease of use and clear organization.

Many people use SQL for analytics by embedding it directly into other frameworks, which gain extra functionality and communication capabilities without having to be built from the ground up.

SQL analytics can be used alongside languages like Python and Scala and on platforms such as Hadoop, which are among the most popular tools for data science and big data management at the moment. Because it connects directly to databases, SQL can act as an intermediary between end users and a more complicated data storage system that would otherwise be accessible mainly to professionals and data scientists.

In this article, we are going to discuss some SQL techniques for data analysis, and for that, we are going to use an example of a database from Analytics Vidhya.

Dataset of 10 consumers, image source: Analytics Vidhya

In the example given above, the names of ten consumers of a retail store are listed along with their localities, the total amount of money each of them has spent, and the industry to which each consumer belongs. Using this table, we are going to discuss the different SQL techniques.

Counting Rows and Items:

Count Function:

To count the number of rows in a table, a function named COUNT() is available. This is the first and simplest operation of any data analysis.

To do this, we have to type the command:

 mysql> select count(*) from ConsumerDetails;

Distinct Function:

The DISTINCT keyword returns only the unique values of a column when the dataset contains duplicates, so each value is printed once without repetition.

It can also be combined with the COUNT function to count the number of unique values in a column.
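As a rough sketch, both uses might look like the following. The column name Locality is an assumption here, since the table is only shown as an image; ConsumerDetails is the table name used earlier.

```sql
-- Assumed column name: Locality (not confirmed by the source table).
-- List each locality once, without repetition.
SELECT DISTINCT Locality FROM ConsumerDetails;

-- Count how many different localities appear in the table.
SELECT COUNT(DISTINCT Locality) FROM ConsumerDetails;
```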

Aggregation Functions:

These functions are the base of any data analysis and offer us an overview of the database.

Sum Function:

The SUM() function calculates the sum of any numerical column in a table. In this case, it is used to calculate the total amount of money spent by all the consumers combined.
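A minimal sketch of such a query, assuming the amount column is called Total_amt_spend (a hypothetical name, as the actual column header is not reproduced in the text):

```sql
-- Assumed column name: Total_amt_spend.
-- Add up the amount spent across every row of the table.
SELECT SUM(Total_amt_spend) FROM ConsumerDetails;
```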

Average Function:

The AVG() function is used to calculate the average of a numerical column. The average expenditure per customer can be found using this function.
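For illustration, the average could be computed along these lines, again assuming a hypothetical amount column named Total_amt_spend:

```sql
-- Assumed column name: Total_amt_spend.
-- Average amount spent per consumer (NULL values, if any, are ignored by AVG).
SELECT AVG(Total_amt_spend) FROM ConsumerDetails;
```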

Standard Deviation Function:

The standard deviation of a numerical column is calculated using the STDDEV() function. It measures how widely the values are spread around the average, which the AVG() function alone cannot tell us. To find the standard deviation of the total amount spent by the consumers, we apply STDDEV() to that column.
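One possible form of the query is sketched below, with Total_amt_spend as an assumed column name. In MySQL, STDDEV() returns the population standard deviation; STDDEV_SAMP() gives the sample version if that is what the analysis calls for.

```sql
-- Assumed column name: Total_amt_spend.
-- Population standard deviation of the amount spent (MySQL semantics).
SELECT STDDEV(Total_amt_spend) FROM ConsumerDetails;

-- Sample standard deviation, for comparison.
SELECT STDDEV_SAMP(Total_amt_spend) FROM ConsumerDetails;
```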

Extreme Value Identification:

To search and identify the extreme values in any dataset, the following functions are used:

Max function:

The MAX() function can identify the maximum number in any numeric column.

Min function:

The MIN() function is used to identify the minimum number in any numeric column.
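Both extremes can be fetched in a single query. The sketch below assumes a hypothetical amount column named Total_amt_spend; the aliases are only there to label the output.

```sql
-- Assumed column name: Total_amt_spend.
-- Largest and smallest amount spent by any consumer.
SELECT MAX(Total_amt_spend) AS max_spend,
       MIN(Total_amt_spend) AS min_spend
FROM ConsumerDetails;
```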

Slicing Data:

Slicing is a very important procedure in data analysis. It serves as the foundation for more advanced queries and allows users to access data based on certain criteria.

From the given table, if someone wants to find all the customers who come from the localities of Shakti Nagar and Shanti Vihar, they can filter on the locality column with a WHERE clause.
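A sketch of this filter, assuming the locality column is named Locality (the column name is a guess; the locality values come from the text):

```sql
-- Assumed column name: Locality.
-- Keep only the consumers living in either of the two localities.
SELECT *
FROM ConsumerDetails
WHERE Locality IN ('Shakti Nagar', 'Shanti Vihar');
```

IN is shorthand for chaining the two comparisons with OR.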

The WHERE clause filters the table down to the consumers that satisfy the condition. As another example, to find the consumers who come from the above localities and have also spent more than Rs. 2000, we combine the two conditions with AND.

Here, two people named Shantanu and Natasha meet this criterion.
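The combined condition might be written as follows. Locality and Total_amt_spend are assumed column names, not taken from the source table.

```sql
-- Assumed column names: Locality, Total_amt_spend.
-- Consumers from the two localities who spent more than Rs. 2000.
SELECT *
FROM ConsumerDetails
WHERE Locality IN ('Shakti Nagar', 'Shanti Vihar')
  AND Total_amt_spend > 2000;
```

Using IN rather than a bare OR avoids a precedence pitfall: AND binds more tightly than OR, so an unparenthesized `a OR b AND c` would apply the amount condition to only one of the localities.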

Limiting Data:

Limit clause:

Running a plain SELECT statement on a large table prints the entire table to the screen, which can be a slow and computationally expensive process. Instead, we can use the LIMIT clause to restrict the output to a specified number of rows.
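For instance, a quick preview of the table could look like this; the row count of 5 is an arbitrary choice for illustration.

```sql
-- Return at most 5 rows instead of the whole table.
SELECT * FROM ConsumerDetails LIMIT 5;
```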

Offset clause:

The OFFSET clause is used to skip a specified number of rows before any rows are returned. It is used together with LIMIT when we only want to work on a selected range of rows.
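A small sketch, with arbitrary numbers chosen only for illustration. In MySQL, OFFSET has to be paired with LIMIT.

```sql
-- Skip the first 2 rows, then return the next 5.
SELECT * FROM ConsumerDetails LIMIT 5 OFFSET 2;
```

Without an ORDER BY clause the row order is not guaranteed, so which rows count as "the first 2" can vary between runs.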

Sorting Data:

The data can be sorted using the ORDER BY keyword, in either ascending or descending order. By default, this keyword sorts the data in ascending order. Let us sort the given table according to the total amount of money spent by the consumers, from the lowest value to the highest.

To sort the data in descending order, we add the DESC keyword after the column name.
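The ascending sort might be written like this, assuming the amount column is named Total_amt_spend (a hypothetical name):

```sql
-- Assumed column name: Total_amt_spend.
-- Sort from the lowest spender to the highest (ASC is the default).
SELECT *
FROM ConsumerDetails
ORDER BY Total_amt_spend;
```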
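A sketch of the descending version, with the same assumed column name Total_amt_spend:

```sql
-- Assumed column name: Total_amt_spend.
-- Sort from the highest spender to the lowest.
SELECT *
FROM ConsumerDetails
ORDER BY Total_amt_spend DESC;
```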

Filtering Pattern:

We can filter the rows of the data set according to specific patterns in a column's values. Let us talk about the operators and characters used for this.

LIKE operator:

The LIKE operator is used within a WHERE clause to search and identify any specified pattern in a column.

Wildcard Characters:

A wildcard character is a character that may be used to replace one or more characters in a string. Wildcards are used together with the LIKE operator.

The following are the two most common wildcard characters:

[ % ]- This character represents 0 or more characters
[ _ ]- This character represents a single character

Let us try to search for all the localities in our dataset that end with ‘Nagar’. For that, we need to check the names of all the localities; whatever characters come before the word Nagar do not matter and can be ignored. The “%” wildcard can be used for that purpose.

Let us take another example. This time, we will try to search for the names of all the customers whose names have “a” as the second character. In this case, the first character can be ignored as we only have to check the second character, so it can be replaced by the “_” wildcard.

After the second character, there can be any number of characters, therefore we use the “%” wildcard to stand for them. The finished pattern is ‘_a%’.
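One way to express this search, assuming the locality column is named Locality (the name is an assumption):

```sql
-- Assumed column name: Locality.
-- '%' matches any run of characters, so this keeps values ending in 'Nagar'.
SELECT DISTINCT Locality
FROM ConsumerDetails
WHERE Locality LIKE '%Nagar';
```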
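Put together, the query could look like the sketch below. The column name Name is an assumption about how the consumer names are stored.

```sql
-- Assumed column name: Name.
-- '_' matches exactly one character, 'a' is the second character,
-- and '%' matches whatever follows (including nothing).
SELECT Name
FROM ConsumerDetails
WHERE Name LIKE '_a%';
```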

Groups, Rolling Up Data and Filtering in Groups:

The GROUP BY statement is used for the grouping of data. This statement is particularly useful for determining the distribution of categorical data. This is accomplished by combining the GROUP BY statement with aggregate functions such as COUNT, SUM, AVG, and so forth.

For example, if we need to find the number of customers corresponding to the industries they belong to, we can write:
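A sketch of the counting query, assuming the industry column is named Industry (a hypothetical name); the alias is only a label for the output.

```sql
-- Assumed column name: Industry.
-- One output row per industry, with the number of consumers in it.
SELECT Industry, COUNT(*) AS consumer_count
FROM ConsumerDetails
GROUP BY Industry;
```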

If we have to find the sum of expenditure of the consumers grouped by the industry they belong to, we can write:
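The grouped total might be written as follows. Industry and Total_amt_spend are assumed column names.

```sql
-- Assumed column names: Industry, Total_amt_spend.
-- Total amount spent by the consumers of each industry.
SELECT Industry, SUM(Total_amt_spend) AS total_spend
FROM ConsumerDetails
GROUP BY Industry;
```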

The customers from the manufacturing industry spend the most money.

HAVING clause:

Suppose we have to find which of the industries have a total expenditure above Rs. 2500. The HAVING clause is similar to the WHERE clause, but it filters groups rather than individual rows, so its condition can use aggregate values such as a sum. We always have to remember that it comes after the GROUP BY clause.
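As a sketch under the same naming assumptions (Industry and Total_amt_spend are hypothetical column names), the filter on groups could be written like this:

```sql
-- Assumed column names: Industry, Total_amt_spend.
-- Keep only the industries whose combined spend exceeds Rs. 2500.
SELECT Industry, SUM(Total_amt_spend) AS total_spend
FROM ConsumerDetails
GROUP BY Industry
HAVING SUM(Total_amt_spend) > 2500;
```

A WHERE clause would not work for this condition, because WHERE is evaluated on individual rows before the groups and their sums exist.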

Conclusion

So, in this article, we have learned about eight SQL techniques that can be used to analyze any given dataset. Data analysis is very important as it makes our lives easier and more organized. Once data is organized in a database, whatever its size, it becomes much easier to process it and make smart decisions with it.

With the above techniques, one can sort, filter, arrange, group, and find data in any database and analyze it.
