Category
>SQL

SQL for Data Scientists–Part 1

Amar Kumar
Aug 19, 2020

Introduction to SQL

SQL stands for Structured Query Language that is considered as a very critical tool for Data Scientist. A query language is a type of programming language which is designed to facilitate the information retrieval from the databases as per requirement. We can also call, in layman terms, SQL a language for databases.

Most of the companies store their data in some forms of databases, it might be MySQL, MS SQL Server or PostgreSQL, but, the underlying concept behind querying remains more or less same and if you are aware of the basic notions behind SQL, you will be able to work with any of them.

Even if you are planning to do your analysis with Python, R or any other language, chances are there, you would be required to get the data you need from the company's database and having a good foundation of SQL would just ease the process of analysis. (As talking about R programming language, visit the section “R Programming” to explore more its function and characters.)

What is RDBMS?

RDBMS stands for Relation Database Management System that stores related information across multiple tables in a structured manner. And, one can retrieve information from more than one table at the same time based on some common column present across multiple tables of interest.

Below is the sample table that is found in RDBMS databases:

Dummy table in the RDBMS database

Here in the table, each row represents the information related to a country and each column represents a different attribute of that country, like which continent it belongs to, what is the area, population and GDP of the country.

Now, after the basic understanding of the tables, it’s time to learn how to retrieve a piece of specific information from the database using SQL.

SELECT, COUNT & DISTINCT COUNT, LIMIT

To get an overall idea of the data present in a table, we can have a look at some of the entries present and for that, we can use LIMIT with SELECT.

SELECT fetches the data whereas LIMIT makes sure we are able to see a particular number of rows only.

We will be using the above table for illustration purpose and we can assume the table name to be the world.

select * from world limit 5;

Result in applying “select” command

The above query selects 5 rows from the world table, if we want to look at all entries at once we can simply remove limit 5 from the above query.

If we want to check on which column “world table” is unique at, we can achieve this by counting all the rows and counting the distinct entries of the particular column which we feel the table should be unique at. If both the counts are equal, we can say that the table is unique at that column.

We can assume here the world table is unique at name column which is basically the country name, and our assumptions would be correct if we get the same count for all the rows and distinct countries.

Implementation of Count command and its result

COUNT(*) gives the count of all the entries in world table whereas COUNT(DISTINCT name) gives a count of the distinct name (country name) in the world table. We can assign names to these counts by using AS syntax as we have done in the above query.

We found that the total entries and count of unique countries are same, hence our assumption was correct that the world table is unique at the name column, or in other words, we have only one row for each country.

WHERE & AGGREGATE FUNCTIONS (COUNT, SUM, MAX, MIN, AVG)

WHERE is used to filter the table on some specific conditions whereas aggregate functions like count, sum (for total), min (for Minimum), max (for Maximum) and avg (for Average) are used to find aggregate outcomes.

Aggregate Functions

Result of the above query is below:

Result of Aggregate Functions

With the help of aggregate functions, we are able to obtain a total number of countries, total, minimum, maximum and average population of the world.

Let’s say we want to compare world’s population numbers with Asia continent, that’s when WHERE comes into the picture and we can simply add Asia in the where the condition of the above query to obtain required results.

WHERE query

The obtained result for Asia continent is below:

The result after applying WHERE query

Few of the insights from the above analysis are:

There are a total of 195 countries in the world table out of which 47 belongs to Asia continent
The average population of the Asia (98 M) is higher than the World average population (39 M), this can be due to the fact that two of the most populous countries China and India belong to Asia

GROUP BY & HAVING

From the above analysis, we compared world population numbers with Asia continent, what if we have to compare population or GDP numbers across different continents. For a single comparison above, we wrote two queries and if we have to compare 7 continents based on population or GDP, we would have to write many queries to obtain numbers. This is when GROUP BY comes to the rescue.

Group by aggregates the point of interest columns in our case population or GDP, based on the columns we want it to aggregate which is continent in our case. We can perform “group by” on the basis of single or multiple columns but one thing to keep in mind is that we would have to include those columns in the select statement as well.

Above we discussed how the WHERE statement is used to filter out the table, in the same way HAVING statement is used to filter out the results of a GROUP BY query. Enough said, let’s understand these concepts through example.

Group by query

From the above query, you can see that we have applied the same aggregate functions used earlier but a simple inclusion of GROUP BY makes a whole lot of difference. Here, the population has been group by continent so we have included continent in the select statement as well. Below is the result of the above query:

Group by queries result

You can now observe the power of the group by, a single query is able to obtain so much information and we can compare population statistics across all continents from this single result.

Few of the insights that this powerful query gave are:

Africa continent has the highest countries among all continents that are 53, whereas Asia and Europe come at second and third place with the numbers 47 and 44 respectively
Asia’s total and the average population is highest among all continents, whereas Africa comes at the second place
The highest populous country is present in Asia which can be seen from max_population column whereas the lowest populous country is present in Europe continent which can be seen from min_population column.

What if we are told to compare the results of only those continents which has at least 40 countries in it, from the above result we can see that our final output should contain Africa, Asia and Europe but how do we get this?

This is when HAVING is used, it is used to filter out the results of the GROUP BY statement just like WHERE was used to filter out the base table (in our case world table).

HAVING query

Result of the above query is below:

Result of HAVING query

We got the results we wanted by simply including HAVING statement at the end in the same query. HAVING is executed after the GROUP BY statement and slices the result of GROUP BY query based on the condition provided in the HAVING clause.

In the above query, the requirement was to get the population statistics of those continents only which has at least 40 countries in it, and if you see;

we have used total_countries>=40 in the HAVING clause, which is nothing but the count(name) which is aliased total_countries with the AS clause and count(name) is same as counting the number of countries.
And, this count will be obtained continent-wise because we have GROUP BY (or aggregated) on the continent, so in this way, you understand the flow of query execution and final information retrieval as per our requirement.

WHERE VS HAVING

WHERE is used to filter/slice the base table/tables, in our case, which is world table whereas HAVING is used to slice/filter the results of GROUP BY statements. The base table is not equivalent to GROUP BY query results and that is why HAVING should always be preceded by GROUP BY meaning if there is no GROUP BY, there is no HAVING.

If in the above query if we would use WHERE instead of HAVING, we would get an error. Let’s see this.

Applying WHERE query

If we use this query, we would get below error:

WHERE can be used along with GROUP BY but it should be used before GROUP BY statement not after it as we have used in the above query which gave an error.

For example, if the requirement is to get the population statistics at continent level but we should take into consideration only the countries which has gdp>=10 Billion, then we can use WHERE along with GROUP BY

Below is the query and results as per the new requirement:

Applying both WHERE and GROUP BY queries

Here, you can see that WHERE has been used along with GROUP BY but it is used before it not after and hence the query is valid and it will run in the following sequence:

From world table, this query will filter out those countries which have gdp<10 Billion
Now, our base table will become small and hence the population statistics and countries in each continent will be low because some of the countries from each continent may have less than 10 B gdp.
After it, the GROUP BY will be executed on the reduced data and population statistics will be calculated at the continent level

Below is the result of the above query:

Result of both the WHERE and GROUP BY queries combinedly

We can see from the total_countries column that the countries in each continent have reduced. Africa continent used to have 53 countries but now it has only 29 countries, which implies 24 (53-29) countries in the Africa continent have GDP less than 10 Billion. (References: https://sqlzoo.net/ for dataset and code window)