In the previous article, I briefed about the major components of Databases and their related terminologies and mechanism. Furthermore, in the article, we will dive into the basics of Structured Query Language (SQL) to discover and manage any data repository at hand.
Data Definition Language vs Data Manipulation Language
Most Data enthusiasts, especially those transitioning from a no technical knowledge background have had a hard time understanding the difference between Data Definition Language and Data Manipulation Language. Fortunately, the differences between them are not that difficult to understand, as will be explained briefly below
In simple terms, Data Definition Language is the part of SQL which deals with the structure of the database and to define, change, or drop data.
Additionally, Data Manipulation Language relates to read, query, and modify data. These statements are used to explore the data, by Data Analysts to get familiar with the data at hand so that it becomes easier for them to extract insights from it to answer business questions from the clients or employers.
Create Table Statement
Create Table statement is one of the first DDL statements which the related technical professionals use to start developing the structure for the table.
For example: If we want to create for the table titled Author, then we would query the data with the following structure in which the first part with the brackets will be dedicated to the columns/attributes of the table. As you can notice that SQL is very easy to understand as it is written in plain English.
The format for Create Table will be executed this way:
CREATE TABLE (COLUMN 1: DATA TYPE, COLUMN2: DATA TYPE, COLUMN3: DATA TYPE);
Alternatively, if you want to drop the table then simply mention the table/entity title along with the command drop.
The format for Dropping Table altogether would be in this pattern:
The main purpose of a database management system, is not just to store the data but also facilitate retrieval of the data. So, after creating a relational database table and inserting data into the table, we want to see the data. To see the data, we use the SELECT statement. The SELECT statement is a data manipulation language statement. Data Manipulation Language statements or DML statements are used to read and modify data. The SELECT statement is called a query, and the output we get from executing this query is called a result set or a result table.
The format for the Select statement is:
Select COLUMN1, COLUMN2, … from TABLE1;
To retrieve all columns from the table we could use “*” instead of specifying individual column names:
Select * from TABLE1;
The WHERE clause can be added to your query to filter results or get specific rows of data. To retrieve data for all rows in the table where the ID is less than 5:
Select * from TABLE where ID <5;
In the case of character-based columns the values of the predicates in the where clause need to be enclosed in single quotes. For example, to retrieve the data for the country with country code “CA” we would issue:
Select * from TABLE where CCODE = ‘CA’;
Expressions to be used along with Select statement
We’ll briefly present a few useful expressions that are used with select statements.
The first one is COUNT, COUNT is a built-in database function that retrieves the number of rows that match the query criteria. For example, get the total number of rows in a given table,
select COUNT(*) from tablename.
Let’s say you create a table called MEDALS which has a column called COUNTRY, and you want to retrieve the number of rows where the medal recipient is from Canada. You can issue a query like this:
Select COUNT(COUNTRY) from MEDALS where COUNTRY=’CANADA.’
The second expression is DISTINCT. DISTINCT is used to remove duplicate values from a result set. Example, to retrieve unique values in a column, select DISTINCT columnname from tablename. In the MEDALS table mentioned earlier, a country may have received a gold medal multiple time. For example, retrieve the list of unique countries that received gold medals. That is, removing all duplicate values of the same country.
Select DISTINCT COUNTRY from MEDALS where MEDALTYPE = ‘GOLD’.
The third expression is LIMIT, LIMIT is used for restricting the number of rows retrieved from the database. Example, retrieve just the first 10 rows in a table. Select * from tablename LIMIT 10. This can be very useful to examine the results set by looking at just a few rows instead of retrieving the entire result set which may be very large. For example, retrieve just a few rows in the MEDALS table for a particular year.
Select * from MEDALS where YEAR = 2018 LIMIT 5;
This statement is used to populating a relational database table. After the table is created, the table needs to be populated with data. To insert data into a table, we use the INSERT statement. The INSERT statement is used to add new rows to a table. The INSERT statement is one of the data manipulation language statements. Data manipulation language statements or DML statements are used to read and modify data.
The syntax of the INSERT statement looks like this, insert into table name, column name, values. In this statement, the table name identifies the table, the column name list identifies each column in the table, and the values clause specifies the data values to be added to the columns in the table.
Insert Into TABLE1 ( COLUMN1, COLUMN2) Values (VALUE1, VALUE2);
It is important that the number of values provided in the values clause is equal to the number of column names specified in the column name list. This ensures that each column has a value. Tables do not need to be populated one row at a time, multiple rows can be inserted by specifying each row in the values clause. In the values clause, each row is separated by a comma.
The syntax of Inserting multiple rows of data into a table is:
Insert into TABLE1 (COLUMN1, COLUMN2) Values (VALUE1, VALUE2), (VALUE3, VALUE4);
UPDATE and DELETE Statements
Instead of deleting a table in a database altogether, most of the time we need to update or delete rows of data from any table. For such situations, we can execute UPDATE or DELETE statements.
The syntax of the UPDATE statement is:
Update from TABLE set [[Column Name]=[Value]] Where [Condition];
To delete row(s) of data, the syntax for the Delete Statement will be like this:
Delete from TABLE where [Condition]
Usage of Patterns, Strings, and Ranges within the Database
As we know that data is not only to store data in tables, but the main function of them is to retrieve data from it. For these purposes, patterns, strings, and ranges are used within SQL statements to extract filtered data.
- String Patterns
If a Data Analyst has just started to explore the database, then he had to go through the technical documentation to become familiar with the data and structure of the table. Instead of doing these tedious tasks, he can filter the data even further by specifying the letters of the data as well.
The syntax for the data pattern is:
Select * from TABLE where COLUMN1 like ‘<condition>’;
If you are unable to understand the syntax, just consider that you need to retrieve data of the applicants from the Column First_Name from the table BIO, whose first name starts with ‘FA’. Then the syntax of the SQL statement will be:
Select FIRST_NAME from BIO where FIRST_NAME like ‘FA%’;
- String Ranges
Suppose, if we want to specify the range of data that we need to extract on the basis of the ID of the row, the syntax of the statement will be:
Select * from TABLE1 where ID >= 5 and ID <=150
Alternatively, you can specify it as:
Select * from TABLE1 where ID between 5 and 150
Sorting Result Set
While exploring the data, there will be many instances where you would need to sort the data in ascending or descending order.
To accomplish this task, one can easily include the command ORDER BY at the end of the Select statement
The syntax for Sorting is:
Select * from TABLE1 order by ascending;
Select * from TABLE1 order by descending;
Grouping Result sets
At times, a select statement result set can contain duplicate values. To group the values of results, group by clause helps to accomplish this task.
Its syntax is simple enough to understand:
Select * from COLUMN1, COUNT(COLUMN1) from TABLE1 group by COLUMN1;
If you want to further filter the results based on the count of data values within each group, you just have to include having a clause. Its syntax is below:
Select * from COLUMN1, COUNT(COLUMN1) from TABLE1 group by COLUMN1 having count (COLUMN1) > 4;
Built-in Database Functions
While it is very much possible to first get your data from a database and then perform operations on it from your application and notebooks, most databases come with Built-in Functions. These functions can be included in SQL statements, allowing you to perform operations on data right within the database itself. Using database functions can significantly reduce the amount of data that needs to be retrieved from the database. That is, they reduce both network traffic and the use of the bandwidth. When working with large data sets, it may be faster to use built-in functions, rather than first retrieving the data into your application and then executing functions on the retrieved data. Please note, that it’s also possible to create your own functions, known as User-Defined Functions in the database; but that is a more advanced topic.
So, what are aggregate and column functions? An aggregate function takes a collection of light values, such as all of the values in a column, as input, and returns a single value or null. Examples of aggregate functions include sum, minimum, maximum, and average. Aggregate functions can also be applied to a subset of data instead of an entire column. Please note that we can perform mathematical operations between columns, and then apply aggregate functions on them.
Now let’s look at Scalar and String functions. Scalar functions perform operations on individual values. For example, round up or down every value in the sale price column to the nearest integer, is shown as:
select ROUND (SALEPRICE) from PETSALE;
There is a class of scalar functions called string functions, that can be used for operations on strings. That is char and varchar values. For example, retrieve the length of each value in the animal column, is shown as:
select LENGTH (ANIMAL) from PETSALE;
Uppercase and lowercase functions can be used to return uppercase or lowercase values of strings. For example: retrieve animal values in uppercase is shown as:
select UPPERCASE (ANIMAL) from PETSALE;
Scalar functions can be used in the where clause. For example, get lowercase values of the animal column for cat, is shown as:
select star from PETSALE where LOWERCASE(ANIMAL) equals cat
This type of statement is useful for matching values in the where clause if you’re not sure whether the values are stored in upper, lower, or mixed case in the table. You can also have one function operate on the output of another function. For example, get unique values for the animal column and uppercase is shown as:
select DISTINCT (UPPERCASE(ANIMAL)) from PETSALE
Sub-Queries and Nested Selects
Sub-queries or sub-selects are like regular queries but are placed within parentheses and nested inside another query. This allows you to form more powerful queries than would have been otherwise possible. An example of a nested query is shown. In this example, the sub-query is inside the where clause of another query.
Let’s say, we want to retrieve the list of employees who earn more than the average salary. To do so, you could try this code.
Select * from employees, where salary > AVG(salary).
However, running this query will result in an error like the one shown. Indicating an invalid use of the aggregate function. One of the limitations of built-in aggregate functions, like the average function, is that they cannot always be evaluated in the WHERE clause. So, to evaluate a function like average in the WHERE clause, we can make use of a sub-select expression like the one shown here.
Select EMP_ID, F_NAME, L_NAME, SALARY from employees where SALARY < (select AVG (SALARY) from employees).
Another option is to make the sub-query be part of the FROM clause. Sub-queries like these are sometimes called derived tables or table expressions. Because the outer query uses the results of the sub-query as a data source. Let’s look at an example to create a table expression that contains non-sensitive employee information.
Select * from (select EMP_ID, F_NAME, L_NAME, DEP_ID from employees) AS EMP4ALL
The derived table in a sub-query does not include sensitive fields like date of birth or salary. This example is a trivial one, and we could just as easily have included the columns in the outer query. However, such derived tables can prove to be powerful in more complex situations such as when working with multiple tables and doing joins.