Ever since the importance of Data Science in organizations has been recognized by adapting to it in their core systems, there is an ongoing discussion about whether Data Analysts and Scientists should be proficient in SQL or not. On one hand, there is an opinion that SQL is not needed to be learned by the Data Scientists as SQL Administrators are there within the organization to extract data for them. However, in reality, not many organizations have that hierarchical setup in place. It is very much appreciated by the employers and for the Data Analysts themselves for not being dependent on them and have the power to explore relevant data independently for their data models.
According to Stack Overflow’s 2018 Dev Survey, SQL topping the list of the languages being used by the Data Scientists further increases the importance of this language to be considered compulsory for being successful in the world of Data.
For data analysts who knows at least one programming language, having a basic knowledge of databases and SQL is easier for them than for the one who is from a non-technical educational background. I hope the first part of this article about knowledge of Databases (with the second part being about the basic syntax of Structured Query Language) will be beneficial in the longer run.
What is a Database
A database is a repository of data. It is a program that stores data. A database also provides the functionality for adding, modifying, and querying that data. There are different kinds of databases with different requirements. The data can be stored in various forms. When data is stored in tabular form, the data is organized in tables like in a spreadsheet, which is columns and rows. That’s a relational database. The columns contain properties about the item such as the last name, first name, email address, city.
What is a Table
A table is a collection of any data like a list of employees or a list of book authors. In a relational database, you can form relationships between tables. So, a database is a repository of data.
Relational Database Management Systems (RDMS)
The terms database, database server, database system, data server, and database management systems are often used interchangeably. For relational databases, it’s called a relational database management system or RDBMS.
RDBMS is a set of software tools that controls the data such as access, organization, and storage, and it also serves as the backbone of applications in many industries like banking, transportation, health, and so on. Examples of relational database management systems are MySQL, Oracle Database, DB2 Warehouse on Cloud, and DB2 Express C.
Information Model vs Logical Model
The figure below illustrates the relationship between an information model and a data model.
The Informational model is the logical or conceptual model to better understand the system before it is transferred to the implementation phase. It includes all of the entities, relationships, properties, and the operations that will be performed on them.
After the informational model is approved, it is transferred to the Data modeling phase. Data Model is a concrete / detailed model for the respective Database developers to extract and transform the structure into a more ‘technical’ model. It is more or less like a blueprint for the database. Data is stored in simple data structure tables. This provides logical data independence, physical data independence, and physical storage independence.
Entity Relationship Model
An entity-relationship data model or ER data model is an alternative to a relational data model.
It proposes thinking of a database as a collection of entities, rather than being used as a model on its own. The ER model is used as a tool to design relational databases.
Before progressing further, it is imperative to know the basic concepts of the Relational Model for understanding the ER Model at a larger scale
- Attribute: Each column in a Table. Attributes are the properties that define a relation. e.g., Student_Rollno, NAME, etc.
- Tables: Inthe Relational model, the relations are saved in the table format. It is stored along with its entities. A table has two properties rows and columns. Rows represent records and columns represent attributes.
- Tuple: It is nothing but a single row of a table, which contains a single record.
- Relation Schema: A relation schema represents the name of the relation with its attributes.
- Degree: The total number of attributes in the relation is called the degree of relation.
- Cardinality: Total number of rows present in the Table.
- Column: The column represents the set of values for a specific attribute.
- Relation instances: A relation instance is a finite set of tuples in the RDBMS system. Relation instances never have duplicate tuples.
- Relation key: Every row has one, two, or multiple attributes, which is called a relation key.
- Attribute domain: Every attribute has some pre-defined value and scope which is known as attribute domain
In the ER model, entities (tables) are objects that exist independently of any other entities in the database. It is simple to convert an ER diagram into a collection of tables.
The building blocks of an ER diagram are entities and attributes. Entities have attributes (columns in a table), which are the data elements that characterize the entity. Attributes tell us more about the entity. In an ER diagram, an entity is drawn as a rectangle, and attributes are drawn as ovals. Entities can be a noun, person, place, or thing.
Using a simplified library as an example, a book is an example of an entity. Attributes are certain properties of characteristics of an entity and tell us more about the entity. The Entity book has attributes such as the book title, the addition of the book, the year the book was written, etc. Attributes are connected to exactly one entity. The entity book becomes a table in the database, and the attributes become the columns in a table. Continuing the simplified library example, books are written by authors, book is an entity, an author is an entity. In the same way, we can create entities and attributes for the library management system’s Entity Relationship Diagram.
This is the final ER Diagram. Each entity becomes a table in the database. You can now describe the difference between an information model and a data model, explain the advantage of the relational model, and describe the difference between an entity and an attribute.
Types of Relationship
It is imperative to have basic knowledge about the different types of relationships between the tables. The building blocks of a relationship are
- Entities sets: Denoted by a rectangle
- Relationship sets: Denoted by a diamond
- Crows foot notations: Denoted by a greater than, less than and a vertical line.
This is applied when one instance of a table is related to only one instance of another table in a database. For example, a person can only have one ID card.
This relationship is applicable when one instance of a table is associated with more than one instance of another table. For example, A person can own more than one car at the same time. However, a car can be owned by only one person.
The final relationship type is applied when more than one instance of a table is related to more than one instance of another table. For example, there is a many-to-many relationship between the table of author and books because a book can have more than one author and similarly, an author can write more than one book at the same time.
Relational Model Constraints
Within any business, data must adhere to certain constraints or rules that complement the business operations. These constraints must be present for a valid relation between entities in a database.
To understand the overview of these constraints, one must understand Primary Key, Foreign Key, Parent Table, and Dependent Table as well.
Primary Key: Primary key is a row that is unique in a table and it uniquely identifies a set of rows in a table
Foreign Key: Foreign key is a row that is dependent on Primary keys or another table and its primary function is to fetch the relevant set of records from its table under the matching primary key to form a relationship.
Parent Table: A table containing a primary key that is related to at least one foreign key is called a parent table. In the example, the author entity is the parent table. The book entity is also a parent table.
Dependent Table: A table containing one or more foreign keys is called a dependent table. It might also be referred to as a child table. In the example, the author underscores list entity has foreign keys that refer to different parent tables, author, and book.
After understanding the basic concepts about relational constraints, let’s understand the six constraints in a bit detail.
Entity Integrity Constraint: As we already know that a primary key should we unique and shouldn’t contain null or duplicate values. It should represent a set or records or tuples to be identified uniquely. These rules that are related to primary key comes under Entity Integrity Constraint.
In the snapshot of the table above, Author_ID is a primary of the author table. As you can notice that there are a couple of NULL values in the first attribute of the author entity. As we know that null values violate the rules of the Entity Integrity Constraint, so it must be fixed. One way of easily implementing the constraint is through indexing.
Referential Integrity Constraint: Referential integrity constraint defines relationships between tables and ensures that these relationships remain valid. The validity of the data is enforced using a combination of primary keys and foreign keys.
Semantic Integrity Constraint: It relates to correctness of the data with the type of data needed in the corresponding tuple or a column. For example, In the City column, invalid data such as %FYG#@^ will be considered as dirty and invalid because there is no city with such name.
Domain Constraint: A domain constraint is related to specifying the permissible values for each attribute in an entity. For example, in the relation author, the attribute country must contain a two-letter country code such as CA for Canada or IN for India. If a number value of 34 is entered for the country attribute instead of a two-letter country code, the value 34 does not have any meaning. As we saw previously, the entity integrity constraint states that no attribute participating in the primary key is allowed to accept null values. The null constraint specifies that attribute values cannot be null.
NULL Constraint: As the title of the constraint suggest, this rule is related to null values not being permissible in a tuple.
Check Constraint: This constraint is related to enforcing the correctness of data in a column. For example: In the book table below, Check constraint is enforced in the Year column to evaluate that the year when the book was published.
JOIN Operations in Database Management Systems
A simple SELECT statement retrieves data from a table. The next level of complexity is retrieving data from two or more tables. This leads to multiple possibilities of how the result set is generated. To combine data from two tables, we use the JOIN operator.
A JOIN combines the rows from two or more tables based on a relationship between certain columns in these tables. With the JOIN operator, you are combining data from more than one table based on a relationship between certain columns in these tables.
If we consider the same library management system, then it will help us understand how the relationships between tables are established. For tables to have relationships between them, we need to assign primary and foreign keys in between the tables. Additionally, we should be careful in assigning the keys to the tables and think twice before establishing which table should have which type of key because carelessness in this step will surely lead to a database not being able to display the data which was initially intended.
Following are the JOIN operators which we need to fully understand for the correct implementation of a database to suit our needs.
Inner Join: This is the most common type of join which is being used quite often by the SQL Administrators. It revolves around displaying only those rows which matches two or more tables.
Left Outer Join: Left Outer Join revolves around selecting all of the rows from the left table and only the matching rows from the right table.
Right Outer Join: When you want to select only the matching rows from the left table and all the rows from the table, a query of Right outer join is selected.
Full Outer Join: In Full Outer Join, all of the rows are displayed from the left as well as from the right table. As expected, you will be displayed with a lot of data.