Data science is a mesmerizing field that turns dry numerical and categorical data into meaningful knowledge, revealing exciting and interesting insights. Although there is a great deal to learn in this field, a core set of fundamental concepts remains essential for any student of Data Science.
Some of these ideas are highlighted here; they are worth reviewing when preparing for a job interview or simply to refresh the basics.
Just as the name implies, Data Science is the branch of science that applies statistical and mathematical methods to data, with the ultimate goal of studying the relationships within it and uncovering insights that help corporations with their business decisions. Data is therefore the key component of everything related to Data Science. A dataset is a collection of relevant data, in structured or unstructured form, which may consist of numerical, textual, image, voice, or video data. A dataset can also be static or generated at runtime. For people at the beginning of their Data Science careers, data is usually available in CSV format.
Realistically, the data uncovered by Data Analysts is never clean and ready to be analyzed. In fact, most of the time it is cluttered with wrong, illogical, or dirty entries, and it is more likely to live in a file, a database, or documents such as web pages, tweets, or PDFs. So, to make the data presentable for analysis, it should go through Data Wrangling: the process of converting raw data into a tidy form fit for analysis. Data Wrangling is a critical step in data preprocessing and includes several tasks such as data importing, data cleaning, data structuring, string processing, HTML parsing, handling dates and times, handling missing data, and text mining.
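A minimal data-wrangling sketch with pandas illustrates a few of these tasks; the messy records below are invented for illustration:

```python
import pandas as pd

# Raw, "dirty" records: inconsistent casing, a non-numeric placeholder,
# dates stored as strings, and a duplicate row.
raw = pd.DataFrame({
    "city": [" New York", "new york", "Boston", "Boston"],
    "temp": ["21.5", "21.5", "n/a", "18.0"],
    "date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-03"],
})

tidy = (
    raw.assign(
        city=raw["city"].str.strip().str.title(),          # normalize strings
        temp=pd.to_numeric(raw["temp"], errors="coerce"),  # "n/a" becomes NaN
        date=pd.to_datetime(raw["date"]),                  # handle dates and times
    )
    .drop_duplicates()                                     # remove repeated rows
)
```

After cleaning, the two "New York" rows collapse into one, the temperatures are numeric, and the dates are proper timestamps.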
Data Visualization is one of the most important branches of Data Science. It involves using the libraries available in R and Python, along with paid and open-source software such as Tableau, Power BI, and QlikView, to analyze and study the relationships between variables with the help of scatter plots, line graphs, histograms, box plots, pair plots, heat maps, and so on. Data Visualization is also used in machine learning for data preprocessing and analysis, feature selection, model building, model testing, and model evaluation. Many Data Science professionals consider Data Visualization more of an art than a science.
Outliers are data points that differ markedly from the rest of the dataset. Outliers are often bad data, generated by equipment malfunction, human error, or inefficiencies in data preprocessing, and they appear regularly in real-world datasets. Although there are several ways to detect outliers, box plots are the most widely used. On the other hand, removing outliers that represent genuine data can make a model overly optimistic and unrealistic. More advanced methods for dealing with outliers include RANSAC.
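The rule a box plot applies can be sketched directly: any point beyond 1.5 interquartile ranges from the quartiles is flagged. The sample values here are invented for illustration:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# Interquartile range (IQR): the box in a box plot
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside the "whiskers" are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
```

Here only the value 95 falls outside the whisker bounds.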
Most datasets contain missing values. One way to handle them is to enter the correct values manually, but naturally that takes a lot of time, especially with thousands or millions of rows of data. To counter this problem, one can apply mean imputation: taking the mean of the observed values in the corresponding column and filling in the missing entries automatically. Similarly, one can use the median or the mode of the column's values instead.
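A small mean-imputation sketch with pandas; the column values are invented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})

# Fill each missing value with the mean of the observed values in the column.
df["age"] = df["age"].fillna(df["age"].mean())
```

Swapping `.mean()` for `.median()` or `.mode()[0]` gives median or mode imputation in the same way.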
Data scaling in machine learning can be compared to making a delicious glass of mixed fruit juice. To make a true 'mixed' fruit juice, we combine the fruits not by their size but in the right proportions. We just need to remember that apples and strawberries are not the same unless we put them in a common context in which their attributes can be compared.
Similarly, in almost every ML algorithm, scaling all the numerical variables is essential for the algorithm to behave correctly. For example, 10 grams of salt and 10 dollars are numerically equal, yet they represent entirely different things, and some transformation is needed before they can be compared. The "Weight" cannot be meaningfully compared with the "Price" in raw units, but the algorithm implicitly assumes that since "Weight" > "Price," "Weight" must be more important than "Price." These larger numbers then play a disproportionately decisive role while training the model.
Instead of going too deep, this simple table helps illustrate how data scaling is used.
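A scaling sketch with scikit-learn, reusing the weight-versus-price example from the text (the numbers are invented): after min-max scaling, both columns live on the same 0-to-1 range, so neither dominates simply because its raw values are larger.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: weight in grams, price in dollars
X = np.array([[10.0, 10.0],
              [20.0, 50.0],
              [30.0, 100.0]])

# Rescale each column independently to the [0, 1] range
scaled = MinMaxScaler().fit_transform(X)
```

`StandardScaler` (zero mean, unit variance) is the other common choice, and which one to use depends on the algorithm and the data.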
Principal Component Analysis (PCA)
Large datasets with hundreds or thousands of features often contain redundancy, because many of those features are highly correlated with one another. Training a model on such a high-dimensional dataset can lead to overfitting (the model captures both real and random effects), and it makes the dataset very hard to interpret, analyze, and use for prediction.
One way of solving this, and making life easier for data analysts, is a careful selection of features through a technique known as PCA (Principal Component Analysis). This statistical method reduces the number of features in the final model by focusing only on the components that account for the majority of the variance in the dataset. Additionally, it removes the correlation between features.
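A minimal PCA sketch with scikit-learn: two strongly correlated synthetic features are reduced to the single component that captures most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: x2 is almost a linear function of x1 (highly correlated)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

pca = PCA(n_components=1)            # keep only the top component
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_[0]
```

Because the two features are nearly redundant, the first component alone explains almost all of the variance.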
Linear Discriminant Analysis (LDA)
The goal of LDA is to find the feature subspace that optimizes class separability while reducing dimensionality (see figure below). Because it makes use of the class labels, LDA is a supervised algorithm.
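A brief LDA sketch with scikit-learn: unlike PCA, LDA uses the class labels to find the projection that best separates the classes. The toy points below are invented:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two well-separated classes in two dimensions
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # class 0
              [8.0, 8.5], [8.2, 8.0], [7.8, 9.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)  # project onto the single discriminant axis
```

With two classes there is at most one discriminant axis, and on this toy data the projection separates the classes perfectly.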
Generally, to implement a machine learning algorithm, the dataset is partitioned into two parts: a training dataset and a testing dataset. The training dataset is used to build the model, while the testing dataset, which the model has never seen, is used to evaluate its performance.
In scikit-learn, the train_test_split function can be used to split the dataset as follows:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Here, X is the feature matrix and y is the target variable. In this case, the testing dataset is set to 30% of the data.
As the name suggests, supervised learning includes machine learning algorithms that are guided, or supervised, by labeled data, i.e., data of known structure. There are many supervised learning algorithms, and some of them are:
- Linear Regression
- KNeighbors Regression
- Support Vector Regression
- Perceptron Classifier
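A short supervised-learning sketch using Linear Regression, the first algorithm in the list above; the labeled targets y are what "supervise" the fit (the training pairs are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # feature
y = np.array([2.0, 4.0, 6.0, 8.0])          # label: y = 2x

model = LinearRegression().fit(X, y)        # learn from labeled examples
prediction = model.predict([[5.0]])[0]      # predict on unseen input
```

Having learned the relationship y = 2x from the labels, the model predicts 10 for an input of 5.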
Unsupervised learning contains a collection of machine learning algorithms that revolve around unlabeled data. Using unsupervised learning techniques, we can explore the structure of our data and extract meaningful information without the guidance of a known outcome variable or reward function.
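An unsupervised-learning sketch: K-Means clustering groups the invented points below into two clusters with no labels to guide it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points, but no labels are provided
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [9.0, 9.0], [8.8, 9.2], [9.1, 8.9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_  # cluster assignment discovered from structure alone
```

The algorithm recovers the two groups purely from the structure of the data.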
In reinforcement learning, the goal is to develop a system that improves its performance by interacting with the environment.
Statistics and Probability Concepts
Although there are many concepts related to statistics and probability, one should know these as a start:
- Mean, Median, Mode
- Standard deviation and variance
- Correlation coefficient and the covariance matrix
- Probability distributions (Binomial, Poisson, Normal)
- p-value
- Bayes Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)
- Central Limit Theorem
- R² score, Mean Squared Error (MSE)
- A/B Testing
- Monte Carlo Simulation
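A few of the listed statistics can be computed directly with NumPy; the sample values are invented:

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = sample.mean()        # arithmetic mean
median = np.median(sample)  # middle value of the sorted sample
std = sample.std()          # population standard deviation

# Correlation coefficient between two perfectly linearly related variables
other = 3 * sample + 1
corr = np.corrcoef(sample, other)[0, 1]
```

For this sample the mean is 5.0, the median 4.5, the standard deviation 2.0, and the correlation with any exact linear transform of the sample is 1.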
Model Parameters and Hyperparameters
In machine learning, a data scientist deals with two sets of parameters:
- Model Parameters: the parameters in the model that are determined (fitted) from the training dataset.
- Hyperparameters: adjustable parameters that must be tuned to obtain a model with optimal performance.
Overfitting and Underfitting
In Data Science, the bias-variance tradeoff is about balancing the bias and the variance of a model so that it neither underfits nor overfits.
Bias is the error that comes from overly simplistic or erroneous assumptions in the model, causing it to miss relevant patterns and perform poorly even on the training dataset (underfitting).
Variance is the error that comes from sensitivity to small fluctuations in the training set: the model fits the noise in the training data and fails to generalize to new data (overfitting).
Cross-validation is the process of analyzing the performance of a model using different splits of the data into training and testing sets. For example, in K-Fold cross-validation the whole dataset is partitioned into k folds; each fold serves once as the testing set while the remaining k − 1 folds form the training set, and the k scores are averaged to gauge the model's performance.
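A K-Fold cross-validation sketch with scikit-learn (k = 5, synthetic data): each of the five folds serves once as the test set, and the five scores are averaged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, nearly linear data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.05, size=50)

# One R² score per fold; their mean gauges overall performance
scores = cross_val_score(LinearRegression(), X, y, cv=KFold(n_splits=5))
mean_score = scores.mean()
```

The five per-fold scores, rather than a single train/test split, give a more stable estimate of how the model will perform on unseen data.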
Knowledge of productivity tools and programming languages is as important as the theory itself, so it is always advisable to have hands-on knowledge of Jupyter Notebook, Python, SQL, Tableau, Power BI, and Power Query.