Getting hired for a Data Analytics or Data Science job is the dream of many Computer Science enthusiasts. To land one of these coveted roles, you should be familiar with the basic questions you may be asked during your interview. Here are some of the most commonly asked Data Science/Data Analytics interview questions.
First 10 Important Data Analytics Interview Questions
1. What do you mean by sampling?
Sampling is a statistical technique for understanding a dataset in a decision-making or analytics process. Various sampling approaches can be applied, each of which examines a subset of the dataset rather than the whole. Sampling gives a useful perspective on a large dataset without processing every record: it lets you draw first-hand conclusions by observing a proportion of the data population, and it is cost-effective.
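As a sketch, simple random sampling can be done with Python's standard library alone (the population and sample size here are invented for illustration):

```python
import random

random.seed(42)
population = list(range(1, 1001))          # a "dataset" of 1,000 values

# Simple random sampling: every record has an equal chance of selection.
sample = random.sample(population, k=50)   # a 5% sample, without replacement

# The sample mean approximates the population mean at a fraction of the cost.
pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(pop_mean, round(sample_mean, 1))
```

With a reasonable sample size, the sample mean lands close to the population mean of 500.5, which is the whole point of sampling.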
2. What are Eigenvectors?
- Eigenvectors come from linear algebra
- They are used to understand the linear transformations applied to a dataset
- In analytics (e.g. PCA), eigenvectors are calculated for the covariance or correlation matrix of the data
- Mathematically, they can be represented as follows:

For any square matrix M, a non-zero vector e is called an ‘eigenvector’ iff M*e is a scalar multiple of e:

M*e = ß*e

Here ß is a scalar called the ‘eigenvalue’. The trace and determinant of a matrix can be used to help determine its eigenvalues and eigenvectors.
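A minimal NumPy check of the defining property above (the matrix M is an arbitrary example):

```python
import numpy as np

M = np.array([[2.0, 0.0],
              [0.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the column eigenvectors of M.
eigenvalues, eigenvectors = np.linalg.eig(M)

# Verify the defining property M*e = ß*e for each eigenpair.
for beta, e in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(M @ e, beta * e)

print(eigenvalues)
```

For this diagonal matrix the eigenvalues are simply the diagonal entries, 2 and 3.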
3. What is Normal Distribution?
- Normal distribution is a probability distribution that describes how data is spread around its mean
- It is also called Gaussian distribution
- In a normal distribution, most data points are concentrated around the mean (the center), and extreme values are extremely unlikely
- A normal distribution’s probability graph is symmetrical on both sides
- A normal distribution curve is a bell-shaped curve
- It is highly useful in exploratory analytics.
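A quick sketch with Python's standard library: drawing samples from a normal distribution and checking the empirical rule that roughly 68% of values fall within one standard deviation of the mean (the mean and standard deviation chosen here are arbitrary):

```python
import random
import statistics

random.seed(0)
# Draw 10,000 samples from a normal distribution with mean 50, std dev 5.
data = [random.gauss(50, 5) for _ in range(10_000)]

mean = statistics.mean(data)
std = statistics.stdev(data)

# Empirical (68-95-99.7) rule: ~68% of values lie within one std dev of the mean.
within_one_std = sum(abs(x - mean) <= std for x in data) / len(data)
print(round(mean, 1), round(within_one_std, 2))
```

The sample mean comes out close to 50, and the within-one-std fraction close to 0.68, as the bell-shaped curve predicts.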
4. What is an outlier and how to handle it?
Outliers are data points that do not follow the pattern exhibited by the majority of the dataset; in other words, they deviate from previously identified patterns. Outliers can often be found by plotting the dataset on a scatterplot or box plot. Eliminating outliers can improve the performance of an analytics model, but when outliers are present it is good practice to investigate their origin first, to ensure that nothing is wrong with the data itself.
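One common way to flag outliers is the 1.5×IQR rule; a minimal sketch on invented data:

```python
import statistics

data = [10, 12, 11, 13, 12, 11, 14, 13, 12, 98]   # 98 breaks the pattern

q1, _, q3 = statistics.quantiles(data, n=4)        # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # the usual 1.5*IQR fences

outliers = [x for x in data if x < lower or x > upper]
cleaned = [x for x in data if lower <= x <= upper]
print(outliers)
```

Before simply dropping the flagged value, check whether it is a data-entry error or a genuine (and possibly interesting) extreme observation.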
5. What is data cleaning and what are its uses?
Data cleaning tunes a dataset by altering it for higher accuracy: incorrect data is modified or deleted, irrelevant data is removed, and the remaining data is formatted consistently. It ensures that the answers we derive from the data are accurate and reliable enough to support correct decisions.
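A small illustration of those steps — fixing formatting, dropping invalid values, removing duplicates — on invented records:

```python
raw_rows = [
    {"name": "  Alice ", "age": "34"},
    {"name": "Bob",      "age": "n/a"},   # incorrect value, to be dropped
    {"name": "  Alice ", "age": "34"},    # duplicate once formatting is fixed
    {"name": "Carol",    "age": "29"},
]

cleaned, seen = [], set()
for row in raw_rows:
    name = row["name"].strip()            # fix formatting
    if not row["age"].isdigit():          # delete incorrect data
        continue
    key = (name, row["age"])
    if key in seen:                       # remove duplicates
        continue
    seen.add(key)
    cleaned.append({"name": name, "age": int(row["age"])})

print(cleaned)
```

Four messy rows reduce to two clean, well-typed records ready for analysis.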
6. What is Time Series Analytics?
It is an analytical technique for data recorded at successive time intervals. A time series is a sequence of successive measurements or changes over a period of time; examples include ocean waves and stock market prices. Time series analytics can be used to understand past outcomes by analyzing the data, predict future outcomes, make policy suggestions, and so on.
Time series forecasting is the prediction of future values of a time series based on its previous values. Some of the key areas where time series analysis is used are census analytics and economic forecasting.
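As a simple illustration, a moving average smooths a series and provides a naive one-step forecast (the monthly sales figures are invented):

```python
# Monthly sales: a time series of successive measurements over time.
sales = [100, 102, 101, 105, 107, 110, 112, 111, 115, 118]

window = 3
# A simple moving average smooths short-term noise in the series.
moving_avg = [
    sum(sales[i - window:i]) / window
    for i in range(window, len(sales) + 1)
]
forecast_next = moving_avg[-1]   # naive forecast: average of the last 3 points
print(round(forecast_next, 1))
```

Real forecasting models (ARIMA, exponential smoothing, etc.) are more sophisticated, but they build on this same idea of projecting forward from past values.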
7. What is the difference between Type I and Type II errors?
- A Type I error is a false positive: rejecting a null hypothesis that is actually true
- A Type II error is a false negative: failing to reject a null hypothesis that is actually false.
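These definitions can be checked by counting outcomes on a small set of labeled test results (the labels here are made up):

```python
# Type I error  = false positive  (test declares an effect that isn't there)
# Type II error = false negative  (test misses an effect that is there)
actual    = [0, 0, 1, 1, 1, 0, 1, 0]   # 1 = effect truly present
predicted = [1, 0, 1, 0, 1, 0, 1, 1]   # 1 = test declared an effect

type_1 = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
type_2 = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives
print(type_1, type_2)
```

Here the test raises two false alarms (Type I) and misses one real effect (Type II).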
8. What is Data Wrangling?
Data wrangling is the technique of cleaning, formatting and structuring raw data into a format that aligns with the requirements of a decision-making process. It must be done before performing analytics, and it requires deep knowledge of the available data. After the wrangling is done, a check for consistency and quality should be performed as the final step.
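A minimal sketch of a wrangling step: turning raw comma-separated lines into consistent, structured records, with a consistency check at the end (the log lines are invented):

```python
# Raw log lines -> structured records: the core of a wrangling step.
raw = [
    "2023-01-05,NYC, 120 ",
    "2023-01-06,nyc,130",
    "2023-01-07,BOS,95",
]

records = []
for line in raw:
    date, city, units = (field.strip() for field in line.split(","))
    records.append({"date": date, "city": city.upper(), "units": int(units)})

# Final consistency and quality check after wrangling.
assert all(r["units"] >= 0 for r in records)
print(records)
```

Note how inconsistent casing ("nyc" vs "NYC") and stray whitespace are normalized so downstream analytics sees one clean shape.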
9. What is Data profiling?
Data profiling is the process of examining data to discover and understand its inconsistencies and irregularities. It is used to enhance the quality of data, and therefore the quality of the decisions made from it. Profiling helps organize data by verifying the information it contains and by uncovering any inconsistencies present. It ensures that the data satisfies the statistical measures and business rules for its intended use.
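A tiny profiling sketch: summarizing a single column's missing values, distinct count and value range (the values are invented):

```python
# Profile one column: row count, missing values, distinct values, and range.
ages = [34, None, 29, 41, 34, None, 52]

non_null = [a for a in ages if a is not None]
profile = {
    "rows": len(ages),
    "missing": ages.count(None),      # irregularities to investigate
    "distinct": len(set(non_null)),
    "min": min(non_null),
    "max": max(non_null),
}
print(profile)
```

Dedicated profiling tools produce much richer reports, but they answer the same basic questions about completeness, uniqueness and range.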
10. What is the use of ROC curve?
The ROC curve, or Receiver Operating Characteristic curve, shows the performance of a classification model across classification thresholds. Its two axes are the true positive (TP) rate and the false positive (FP) rate, so the curve represents the trade-off between them. The area under the ROC curve (AUC) summarizes how well the model separates the classes: the larger the area, the better the model's discriminative power.
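As a sketch, the AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, which can be computed directly (the labels and scores are invented; libraries like scikit-learn provide this as `roc_auc_score`):

```python
# Hand-rolled ROC-AUC: probability that a random positive outranks a
# random negative (ties count as half).
y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]

auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
print(auc)
```

An AUC of 0.5 means the model is no better than random guessing; 1.0 means perfect separation of the classes.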
These are the first 10 Data Science/Data Analytics interview questions. Read the next post for another set of questions on this topic.