Get Started with SQL JOINs
To perform advanced analytical processing and data discovery, one table is often not enough to bring valuable insights, hence combining multiple tables together is unavoidable. SQL, as a tool to communicate with relational database, provides the functionality to build relationships among tables. This article introduces how to use SQL to link tables together. If you want to learn more about the basics of SQL, I suggest have a read of my first article about learning SQL in everyday language. It gives a comprehensive SQL introduction for absolute beginners.
Why we need to learn SQL JOIN?
Maybe you haven’t even realized, we frequently come across joining in Excel as well. This is achieved by VLOOKUP function as shown below. VLOOKUP function allows us to perform matching of one column and return the corresponding reference values from another column. In this case, we are able to find the country_name related to each criterion by matching the country_code to the fips_code in Country Code table.
We may encounter many situations like this when the information in one table cannot suffice. SQL join uses the exact same logic, yet it is more powerful since it is matching and merging two or more tables together rather than two columns. In the following section, we will dive deeper into how to use SQL to address this problem.
Another very practical and realistic reason for learning SQL join is INTERVIEW! That’s right, SQL join is almost an inevitable question in any data scientist or data analyst interview. Some questions are more theoretical and explicit, “Can you tell me the difference between LEFT JOIN and INNER JOIN?”; others are more practical and implicit “Can you write down a SQL statement to find the name of this employee’s manager?” So it gives you another reason to learn and distinguish the implementation of each kinds of join.
How to Create JOIN?
SELECT <attributeName>, <attributeName>, <attributeName>, ...
FROM <TableA>
<LEFT|INNER|FULL OUTER|...> JOIN <TableB>
ON <TableA.commonAttribute> = <TableB.commonAttribute>
SQL join follows this syntax and I break it down into three components: 1) select the attributes and tables; 2) determine the join condition; 3) choose the appropriate join type.
Attributes and Tables
Just like other SQL statements, it is necessary to specify the attribute and table name in the form of SELECT <attribute> FROM <table>. But the difference is that more than one table is required to join. If same attribute names exist in more than one table, then simply referring to the attribute by name will be ambiguous since the database is uncertain about which table you are selecting this attribute from. To solve this problem, we need to use the table name as the attribute prefix.
For example, if “name” attribute exists in both Customer and Country tables, and we only select the name of customers. Then we refer to the attribute as “Customer.name”.
JOIN Condition
Tables are joined together by at least one common attribute. This common attribute is often referred to as a foreign key. As demonstrated in the excel example, VLOOKUP function also takes advantage of this shared attribute. We use this shared attribute (foreign key) as the matching point where we can find corresponding information of each row in another table. This common attribute needs to be explicitly indicated as the join condition in the SQL statement.
Let’s continue with this country code example. The aim of this exercise is to find the country name of each criterion in the “Google_Ads_GeoTargets” table. The datasets are from Google Public Dataset. Have an exploration and try to implement it in the BigQuery if you are interested.
In the left table, we have already got the country_code for each criterion and the Country_Code table provides us with the country name of fips_code. Therefore, the logic is to match the country_code in the GeoTarget table to the fips_code in the Country_Code table and find the corresponding name. country_code and fips_code are the common attributes that build the relationship between two tables.
We write down the following statement to indicate the join condition. Notice that it is better to use the table name as the attribute prefix and don’t forget the keyword “ON”.
ON Google_Ads_GeoTargets.country_code = Country_Code.fips_code
JOIN Type
There are more five major join types: left join, inner join, full join, self join, and cross join. In order to communicate to the database, we need to explicitly or implicitly indicate which join type in the statement. This is achieved by using keywords “LEFT JOIN”, “INNER JOIN” or “FULL OUTER JOIN” etc. Each type has its distinct use cases. Hopefully, the comparison below will help you to distinguish their subtle differences.
1. Left Join
Left join is the most similar to VLOOKUP in excel. The table on the right can be seen as a reference table or a dictionary from which we are extending the existing knowledge stored in the left table. Therefore, left join is adopted to return all records in the left table and reference the corresponding values from the right table. Left join can also provide us with more insights on why some values failed to find a match. E.g. Is it a result of incorrect records or typos in the left table or is it because of inexhaustive data in the right table?
We use this statement to generate LEFT JOIN result shown in the picture:
SELECT criteria_id, country_code, country_name, fips_code
FROM Google_Ads_GeoTargets gt
LEFT JOIN Country_Code cc
ON gt.country_code = cc.fips_code;
All rows from the Google_Ads_Geo_Targets are returned as the result even though some failed to match the Country_Code table (consequently, null is returned).
You may wonder why it is not necessary to use right join most of the time. This is because right join can be achieved swapping the directions of table and perform left join.
Besides, right table is often regarded as a reference book. In this instance, the Country_Code table is seen as a reference or a dictionary. There is no point returning the matching for all fips_codes. Just think of the situation we come across a new word. We would find the meaning of this word in a dictionary rather than go through each word in dictionary to match the words we enountered. The scope would be too broad and vague.
2. Inner Join
Inner join is useful when we are interested in the intersection of tables. Since it only shows the records that exist in both tables, inner join usually has the least number of rows returned. As shown in the result below, all null values are filtered out. We indicate this join type using the INNER JOIN keyword:
SELECT criteria_id, country_code, country_name, fips_code
FROM Google_Ads_GeoTargets gt
INNER JOIN Country_Code cc
ON gt.country_code = cc.fips_code;
3. Full Outer Join
Full outer join covers every row from both tables regardless if a match has been found. It is used to have a thorough grasp of information stored in both tables and pick up any mismatches. In this example, the first row shows as null for country_name and fips_code because there is no match of fips_code = “AZ” in the Country_Code table. On the flip side, the last row has no criteria_id and country_code because there is no criteria with country_code = “ZA” in the Google_Ads_GeoTargets table.
To write down a FULL OUTER JOIN statement:
SELECT criteria_id, country_code, country_name, fips_code
FROM Google_Ads_GeoTargets gt
FULL OUTER JOIN Country_Code cc
ON gt.country_code = cc.fips_code;
4. Self Join
Self join means linking the table to itself and we apply it to address unary relationship. It is powerful to create hierarchical relationships, for example, employee and manager, category and subcategory etc. Eventually, it still can be seen as joining two tables together.
I am using an Employee example to explain self join. In this example, if we want to find the manager name of each employee, we need to use a self join to find the name of each employee's manager, and we use managerID to derive the name. managerID is indicated as the foreign key in the ERD diagram because it is borrowed from the employeeID.
SELECT e.employeeID AS employee ID, e.name AS name, m.name as manager
FROM Employee e
LEFT JOIN Employee m
ON e.managerID = m.employeeID
As demonstrated in the ERD, We can think of Manager table as a subset of Employee thus this self join is equivalent to left join Employee Table to its subset Manager. The join condition is that the manager ID in Employee table needs to be equal to employeeID in the Manager table, hence e.managerID = m.employeeID. For self join, we still need to consider what is the join type. In this case, we want to show all employee records even though they don't have a manager (e.g. employee 3 and 6), thus a left join is needed. It is worth noting that renaming the tables is required to clearly indicate which attributes from which table you are referring to. Also, we can use the AS keyword to rename attributes into something readable.
5. Cross Join
I would say that cross join is the most irrational join, because it doesn’t have any joining condition.
SELECT criteria_id, country_code, country_name, fips_code
FROM Google_Ads_GeoTargets gt
CROSS JOIN Country_Code cc;
Cross Join matches each row in one table to all rows in the other table, just as in the example below. The Geo Target criteria with criteria_id = “1000998” matches to all records in the Country_Code table. The table below is not exhaustive but we can calculate that there are 42 rows in total (7 rows of Google Ads criteria x 6 rows of country code). We may accidentally create a cross join when there are duplicated rows in tables. Therefore, it is essential to eliminate duplicated rows while preprocessing the data. I will explain this in more detail in the upcoming article, stay tuned if interested :)
Cross join is high in computation cost and should be avoided when possible. The only use case I can think of is to populate combinations of all records in two tables, for example, size x color. Please comment down below if you have any suggestions on cross join use cases.
Take-Home Message
The main take away from this article is to boil down SQL join into three steps:
1. select the table and attributes
2. determine the join condition
3. choose the appropriate join type from left join, inner join, self join, and full join
Hope that this article helps you to upgrade your basic SQL skills to more advanced analysis through combining tables.
Comments