Learn SQL in Everyday Language
Updated: Dec 5, 2020
What is SQL After All?
SQL stands for "Structured Query Language", so it is a programming language just like python, java or R. What it differs from most common language is that it is a type of declarative language, which means it tells computers what to do instead of how to do it. As its name suggests, SQL is the language used to communicate with the database for the purpose of requesting and extracting the data we want. Let's first understand what is database? The easiest way to understand "database" is to see it as a huge collection of spreadsheets or tables that are intertwined and associated with each other.
Many people may confused SQL with some buzz words that we frequently encountered, such as MySQL, MS SQL server, or SQLite. First of all, the fundamental difference is that SQL is a language, not a software or a system. However, MySQL and SQL server are software that utilises SQL. Hopefully, the table below helps you to differentiate these concepts.
Basic Database Terminology
When it comes to describing a database, there are some commonly used words: "entity", "attribute" and "instance". Simply put, they are just the nerdy version of "table", "column" and "row".
Entity: it refers to the table in database. Each entity gathers a group of similar objects or individuals. It can be tangile or intangible, e.g. cars, employees, customers, or transactions, accounts etc.
Instance: each row in the table is called an instance. It represents an individual in the group, e.g. one transaction, a customer, a product etc.
Attribute: it is used to describe a dimension or a characteristic of the instances. So basically, each column of the table represents an attribute.
Write the Simplest SQL Statement
As mentioned earlier, SQL is basically a language built to communicate with the database and to ask questions like "can I have all the transactions from last month?"or "where are the majority of my audience coming from?". Then SQL is a structured way to ask those questions so that it is easier for the database to understand.
The simplest statement to ask a question would be:
SELECT <what_we_want> FROM <where_to_find_what_we_want>;
This statement is composed of three parts:
1. Where to find the information we want?
This refers to the entity or the table that stores the data. Imagine that you want to get information from a folder with hundreds of spreadsheets, you first need to know what is the name of the table. FROM statement allows you to tell the database what tables you are looking for. For example, if we want to know the age distribution of the members, and all customer related data are stored in the table "Customer_Data". Then we write the FROM statement as:
FROM Customer_Data
2. What information do you want to know?
We use SELECT statement to get the information that we want. Sometimes, we want all the information from one table, then in this case, we use * to represents all columns. For example, "SELECT * FROM Customer_Data;" would give you all the contents in the table Customer_Data. Keep in mind that SELECT statement is a column-wise manipulation, which means that it controls the columns that we select in a table rather than the rows.
When we are presented with too many columns of information, we want to limit the number of attributes in the view. Using SELECT statement is just like the "Hide" function in Excel which helps to clean up the spreadsheet and filter down to the attributes required for further analysis.
For example, to analyze the customer demographics, then we potential need to select these attributes "age", "gender", "country" using the following SELECT statement:
SELECT age, gender, country
3. The semi-colon
Depends on the SQL development environment you are using, some will give you an error if you forget to put the ";" at the end of the statement, and you end up getting stuck for minutes or even hours only to find out that it is this simple symbol that caused the pain.
Overall, put all of three essential components together, we can write down the simplest SQL statement:
SELECT age, gender, country
FROM Customer_Data;
SQL is Just Like Excel
Once we have finished the basic SELECT statement and FROM statement, we can apply additional functions and clauses to get more interesting and specific answers from the query.
1. WHERE Statement
WHERE statement operates just like the "filter" function in excel. The purpose is to perform row-wise manipulation to include the instances that are concerned.
For example, if only interested in the customers with age between 25 and 30. Then in excel, it is a simple step of selecting the "age" columns and apply filter and use number filter to limit the number to be between this range.
Similarly, in SQL, we just need to apply this condition using WHERE statement.
WHERE age BETWEEN 25 AND 30
As you can see, it is exactly the same procedure. Firstly, specify the field that we want to filter on, which is "age" in this case. Secondly, write the required logic condition.
Here is the syntax of some frequently encountered logic conditions:
equal to: WHERE <attribute> = <value> e.g. WHERE age = 22
not equal to: WHERE <attribute> != <value> e.g. WHERE age != 22
greater than: WHERE <attribute> > <value> e.g. WHERE age > 22
less than: WHERE <attribute> < <value> e.g. WHERE age < 22
greater than or equal to: WHERE <attribute> >= <value> e.g. WHERE age >= 22
less than or equal to: WHERE <attribute> <= <value> e.g. WHERE age <= 22
between: WHERE <attribute> BETWEEN <min_value> AND <max_value> e.g. WHERE age BETWEEN 25 and 30
To combine several conditions together:
and: WHERE <condition1> AND <condition2> e.g. WHERE age >22 and age <25
or: WHERE <condition1> OR <condition2> e.g. WHERE age <= 22 or age >= 25
2. ORDER BY and LIMIT
If you are familiar with the sort function in excel, then ORDER BY function would be hard to understand at all.
In excel, for numbers, we are able to sort from smallest to greatest value and vice versa; for strings, we are able to sort from A-Z and from Z-A. Likewise, we use the word “ORDER BY” in SQL to indicate that we want to sort an attribute either ascendingly or descendingly, which follows this syntax:
ORDER BY <sorted_column> ASC|DESC
Ascending order refers to smallest to largest for numbers; or alphabetical order (A-Z) for strings; or from the earliest to latest for date. It is the default order which can be implemented implicitly using “ORDER BY <column_name>”. If we want to sort by descending order, we have to explicitly specify it using keyword DESC. For example, if we want to see the largest "sales" value first, then we want to sort the "sales" attribute by descending order, then we simply put down
ORDER BY sales DESC
LIMIT function allows us to only include a certain number of rows in the output. It is usually combined with ORDER BY function to find out the top N or last N values. If we only need to know the top 10 sales then we use “ORDER BY sales DESC LIMIT 10” to limit the number of output displayed.
Now Let's Combine Everything
SELECT <columns_name>
FROM <table_name>
WHERE <condition>
ORDER BY <sorted_column> ASC|DESC
LIMIT <number_of_rows>;
This is the skeleton of a basic SQL statement that combines what we have gone through so far. Let's use a simple example to put some flesh on the bones.
"To gain customer insights and provide tailored customer relationship management, an e-commerce store is interested in knowing the gender and location of their top 100 customers who generate the most sales and are born between 1970 and 2000. All information is given in the table below"
Now let's break down these criteria into steps, and fill in the skeleton of the SQL statement.
1. What is the entity that they are interested in? Hence where is the information stored?
They are interested in customer data which is stored in "Customer_Data" table, then we can fill in the statement:
FROM Customer_Data
2. What are the attributes in this entity that we would like to know?
They want to know the gender and location of the customers, therefore this gives us the information for the SELECT statement
SELECT gender, location
3. What conditions should be applied to filter the instances?
The customers are born between 1970 and 2000, therefore the attribute "year_of_birth" should be restricted to be between 1970 and 2000.
WHERE year_of_birth BETWEEN 1970 AND 2000
4. Do we need to view the values in a certain order?
We want to know the information of the customers that generate the largest sales, therefore, we want the sales to be ordered from the greatest to the smallest and set the limit to100. As a result, we can directly find the customers with the sales value in top 100.
ORDER BY sales DESC
LIMIT 100
5. Lastly, don't forget the semi-colon ";".
So, the query has now been solved using the following statement:
SELECT gender, location
FROM Customer_Data
WHERE year_of_birth BETWEEN 1970 AND 2000
ORDER BY sales DESC
LIMIT 100;
Take-Home Message
In this article, we learn about some simple SQL clauses, including SELECT statement, FROM statement, WHERE statement, ORDER BY and LIMIT. If you are having trouble taking the first step into SQL, use excel or any kinds of spreadsheets as an analogy to make the transition easier. The aim of this article is to let anyone with an interest in analysis to learn SQL at ease.