While the amount of the data being generated is increasing at an exponential rate, so are the problems and issues related to it. One of the problems associated in that not all data is incomplete. In a world full of forms and manually entered data collections, incomplete observations and unavoidable. Database developers are aware of this problem and are seen trying to overcome this problem. The most common remedy for this problem is marking them as NULL values.
A NULL value is a special marker, or a placeholder used in SQL to indicate that a data value does not exist in that database which is used to query and extract data from. NULL values can be confusing and cumbersome to handle by the analysts, but it is imperative for them to better know how to handle them. For this reason, value tips and tricks can make lives of the data professionals much, much better.
Arithmetic and Comparison operations with a NULL value produce NULL Values.
What will be the answer to NULL divided by 100? Is NULL bigger than 10? How many decimal places does NULL value have? The answer to all those questions is easy, NULL!
Think like this: If you have a NULL value in front of you, then all the arithmetic and comparison expressions associated with it will always NULL.
LEN (0) will catch both NULL and “”
In practice, encoding unknown texts can be accomplished by using NULL or as empty strings. In some cases, you might even find data sources that use mix of both. This becomes problematic if you want to count how many values are NULL. Commands like IS NULL will ignore the empty spaces. Luckily, LEN (0) can help with covering up all those problems.
NULL, NA and “” can be casted into string values.
Unfortunately, it is quite common that if you import data from one platform to another, NULL values and other special markers (NaN, NA) might be casted into strings and it makes distinguishing them or categorizing them as NULL values becomes problematic, even after using commands like IS NULL and LEN (0). So, you should be skeptical and inspect your data thoroughly for handling them.
Checking out Data with a Histogram
To avoid being smacked with an error notification for invalid values, people generally enter arbitrary values (e.g., 99, 100, 0, -1) directly into the database. If the value appears suspiciously often, then it might be a prime candidate for NULL values being entered repeatedly. The reason for entering these values is to avoid being forced to enter the correct ones. That is why, it is quite normal to find values like -99 in the age column.
Although I highly recommend ETL tools like Power Query or Tableau Prep Builder for data wrangling, but you can always build a histogram to find out the distribution of values with any column. This practice enables you to find out the values which are NULL and should be dealt with.
NULL should not always be considered NULL Values.
Removing all NULL values might not always be a wise decision. Especially if there is a systematic reasoning behind its existence in the database. In statistics, there is a distinction between data:
- Missing completely at random (MCAR).
- Missing at random (MAR).
- Missing not at random (MNAR).
Suppose you have a dataset where the salary value for the people under the age of 18 is marked as NULL. Technically, their existence is justified as children below the age of 18 are not legally allowed to work. So, removing these NULL values should not be practiced and should be dealt in a much better way possible.
Counting the Rows
Sometimes you are concerned with the number of rows and sometimes you might care about the number of rows that are not NULL. For those cases, you can add COUNT (*) and COUNT (1) respectively to your SQL queries.
Missing Rows are not NULL Values.
This might sound obvious yet confusing to others that missing rows are NOT equal to NULL values in a database.
Imagine you have a dataset of a retail store that never opens on Sundays. When you analyze the transaction data, there will not be any data present for any Sunday (i.e., rows that are not present in the data). By the end of the year, you have transactions for roughly 313 days instead of the usual 365 days in a year.
- The distinct count of days in your fact table will be equal to the number of days with purchases and it will NOT be equal to the number of days that have passed since your business opened.
- Using commands like IFNULL or ISNULL will not have any effect on missing rows. Those only affect existing rows with NULL values.
The only way to identify these missing values is to create a table consisting of all the possible dates in a year and left join with the fact table (the table with the data) for identification of them.
Thank you for reading the tips and tricks which I have learned and experienced so far in my quest for transition towards Data Analytics. If you have any query, comment of anything in between, please contact me via the LinkedIn profile which is mentioned below this article.