Dear Readers, welcome to Data Mining Interview Questions and Answers. These questions have been designed to acquaint you with the kind of questions you may encounter during a job interview on the subject of data mining. They are important for campus placement tests and job interviews. In my experience, good interviewers rarely plan to ask any particular question during an interview, but model questions like these are asked in the online technical tests and interviews of many IT companies.
Data warehousing is essentially the process of extracting data from different sources, cleaning it and storing it in a warehouse, whereas data mining aims to examine or explore the data using queries. These queries can be run against the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns, etc.
E.g. a company's data warehouse stores all the relevant information about projects and employees. Using data mining, one can use this data to generate different reports, such as profits generated.
The process of cleaning out junk data is termed data purging. Purging data means getting rid of unnecessary NULL values in columns. It is usually done when the database grows too large.
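One way to picture purging is the small Python sketch below. It is only an illustration under stated assumptions: the records, the column names and the purge_nulls helper are hypothetical, and None stands in for a SQL NULL. It drops columns that are NULL in every row, then drops rows that still contain a NULL.

```python
# Hypothetical employee records; None stands in for a SQL NULL.
records = [
    {"id": 1, "name": "Asha", "fax": None},
    {"id": 2, "name": None, "fax": None},
    {"id": 3, "name": "Ravi", "fax": None},
]

def purge_nulls(rows):
    """Drop columns that are NULL in every row, then rows that still hold a NULL."""
    columns = list(rows[0].keys()) if rows else []
    # Keep a column only if at least one row has a real value in it.
    kept = [c for c in columns if any(r[c] is not None for r in rows)]
    trimmed = [{c: r[c] for c in kept} for r in rows]
    # Keep a row only if none of its remaining values is NULL.
    return [r for r in trimmed if all(v is not None for v in r.values())]

purged = purge_nulls(records)
print(purged)
```

Whether a NULL is really "unnecessary" depends on the data, so a real purge would normally be driven by business rules rather than a blanket filter like this one.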
A data cube stores data in a summarized form, which allows faster analysis. The data is stored in a way that makes reporting easy.
E.g. using a data cube, a user may want to analyze the weekly or monthly performance of an employee. Here, month and week could be considered dimensions of the cube.
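The summarizing idea can be sketched in a few lines of Python. This is a toy illustration, not how any particular OLAP product stores cubes: the fact rows, the tasks-completed measure and the build_cube helper are all assumptions made for the example. Totals are pre-computed for every combination of employee, month and week, with "*" meaning "all values" of a dimension.

```python
from collections import defaultdict

# Hypothetical fact rows: (employee, month, week, tasks_completed).
facts = [
    ("Asha", "Jan", 1, 5),
    ("Asha", "Jan", 2, 7),
    ("Asha", "Feb", 5, 4),
    ("Ravi", "Jan", 1, 3),
    ("Ravi", "Feb", 5, 6),
]

def build_cube(rows):
    """Pre-aggregate the measure for every combination of dimension values.

    A dimension set to "*" means "all values", so ("Asha", "Jan", "*")
    holds a monthly total and ("Asha", "*", "*") an overall total.
    """
    cube = defaultdict(int)
    for employee, month, week, tasks in rows:
        for e in (employee, "*"):
            for m in (month, "*"):
                for w in (week, "*"):
                    cube[(e, m, w)] += tasks
    return cube

cube = build_cube(facts)
print(cube[("Asha", "Jan", "*")])  # monthly total for Asha in January
print(cube[("Asha", "Jan", 2)])    # weekly total for Asha in week 2
print(cube[("*", "Feb", "*")])     # all employees in February
```

Because every total is computed once up front, a report only needs a dictionary lookup, which is the reason summarized storage makes analysis faster.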
An IT system can be divided into analytical processing and transactional processing.
OLTP – characterized by a large number of short online transactions. The emphasis is on fast query processing and on maintaining data integrity in multi-access environments.
OLAP – characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. Response time is the measure of effectiveness, and OLAP is widely used in data mining techniques.
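The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, assuming a hypothetical orders table; SQLite is used only because it needs no server, and real OLTP and OLAP systems are usually separate databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many short transactions, each touching a few rows.
for region, amount in [("north", 120.0), ("south", 80.0), ("north", 50.0)]:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

# OLAP-style work: one read-only query that aggregates over many rows.
totals = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```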
• Data mining helps analysts make faster business decisions, which increases revenue at lower cost.
• Data mining helps to understand, explore and identify patterns in data.
• Data mining automates the process of finding predictive information in large databases.
• It helps to identify previously hidden patterns.
Exploration: This stage involves preparing and collecting data. It also involves data cleaning and transformation. Depending on the size of the data, different analysis tools may be required. This stage helps to identify the variables in the data and determine their behavior.
Model building and validation: This stage involves choosing the best model based on predictive performance. Candidate models are applied to different data sets and compared to find the best performer. This stage is also called pattern identification. It is somewhat complex because it involves choosing the pattern that allows the easiest predictions.
Deployment: The model selected in the previous stage is applied to the data sets to generate predictions or estimates of the expected outcome.
Discrete data can be considered defined or finite data, e.g. mobile numbers, gender. Continuous data can be considered data that changes continuously and in an ordered fashion, e.g. age.
Models in data mining help the different algorithms with decision making or pattern matching. The second stage of data mining involves considering various models and choosing the best one based on predictive performance.
Data warehousing can be used to analyze business needs by storing data in a meaningful form. Using data mining, one can forecast business needs. The data warehouse can act as the source for this forecasting.
A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs a decision. All paths from the root node to a leaf node are reached using AND, OR, or both. The tree is constructed using the regularities of the data. The decision tree is not affected by Automatic Data Preparation.
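The node structure can be sketched as nested Python dictionaries. The tree below is hand-written and hypothetical (the attributes, values and decisions are made up for illustration); it shows only how an object is routed from the root to a leaf, not how a tree is learned from data.

```python
# Hypothetical tree: decision nodes are dicts, leaf nodes are plain strings.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "stay in", "normal": "play"},
        },
        "overcast": "play",
        "rainy": "stay in",
    },
}

def decide(node, obj):
    """Walk from the root to a leaf, following the branch that matches obj."""
    while isinstance(node, dict):       # still at a decision node
        value = obj[node["attribute"]]  # test this node's attribute
        node = node["branches"][value]  # follow the matching branch
    return node                         # a leaf node holds the decision

print(decide(tree, {"outlook": "sunny", "humidity": "normal"}))
print(decide(tree, {"outlook": "rainy", "humidity": "high"}))
```

Each root-to-leaf path reads as a conjunction of tests (outlook is sunny AND humidity is normal), while the different paths that end in the same decision act as alternatives (OR).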
The Naïve Bayes algorithm is used to generate mining models. These models help to identify relationships between the input columns and the predictable columns. The algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column, given each possible state of the predictable column. After the model is built, the results can be used for exploration and for making predictions.
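The probability calculation can be sketched in plain Python. This is a simplified illustration, not the SQL Server implementation: the training cases, the "buys" predictable column and the add-one smoothing are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

# Hypothetical training cases: two input columns plus one predictable column.
cases = [
    {"age": "young", "income": "high", "buys": "no"},
    {"age": "young", "income": "low", "buys": "yes"},
    {"age": "old", "income": "high", "buys": "yes"},
    {"age": "old", "income": "low", "buys": "yes"},
    {"age": "young", "income": "high", "buys": "no"},
    {"age": "old", "income": "high", "buys": "no"},
]
TARGET = "buys"

def train(rows):
    """Count class frequencies and, per class, the frequency of each input state."""
    class_counts = Counter(r[TARGET] for r in rows)
    state_counts = defaultdict(Counter)  # (column, class) -> Counter of states
    for r in rows:
        for col, state in r.items():
            if col != TARGET:
                state_counts[(col, r[TARGET])][state] += 1
    return class_counts, state_counts

def predict(model, inputs):
    """Score each class as prior * product of P(state | class); return the best."""
    class_counts, state_counts = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total
        for col, state in inputs.items():
            # Add-one smoothing; the "+ 2" assumes two states per input column,
            # which holds for this toy data only.
            score *= (state_counts[(col, cls)][state] + 1) / (n + 2)
        scores[cls] = score
    return max(scores, key=scores.get), scores

model = train(cases)
label, scores = predict(model, {"age": "young", "income": "high"})
print(label, scores)
```

The "naïve" part is the multiplication step: each input column is treated as independent of the others once the class is known.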
A clustering algorithm is used to group sets of data with similar characteristics; these groups are called clusters. Clusters help in making faster decisions and in exploring data. The algorithm first identifies relationships in a dataset and then generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm repeatedly redefines the groupings to create clusters that better represent the data.
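The iterative regrouping can be illustrated with k-means, one common clustering technique. The text does not say which method a given product uses, so this is a stand-in sketch; the customer ages and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Tiny k-means on numbers: assign to nearest center, recompute, repeat."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # A center with no members keeps its old position.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # groupings stopped changing
            break
        centers = new_centers
    return centers, clusters

# Hypothetical customer ages and starting centers.
ages = [21, 23, 25, 41, 43, 45]
centers, clusters = kmeans_1d(ages, centers=[20, 30])
print(centers, clusters)
```

Each pass through the loop is one redefinition of the groupings; the loop ends when a pass leaves the centers unchanged.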
A time series algorithm can be used to predict continuous values of data. Once the algorithm has been trained to predict one series of data, it can predict the outcome of other series. The algorithm generates a model that can predict trends based only on the original dataset. New data can also be added, and it automatically becomes part of the trend analysis.
E.g. the performance of one employee can influence, or help forecast, the profit.
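As a rough illustration of trend-based forecasting, the sketch below fits a straight line to a series by least squares and extends it. This is a deliberately simple stand-in, not the algorithm any particular product uses; the monthly profit figures and the function names are hypothetical, and the series must contain at least two points.

```python
def fit_trend(series):
    """Least-squares line through (0, y0), (1, y1), ...; returns (slope, intercept)."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def forecast(series, steps):
    """Extend the fitted line by the requested number of future points."""
    slope, intercept = fit_trend(series)
    n = len(series)
    return [intercept + slope * (n + k) for k in range(steps)]

monthly_profit = [10.0, 12.0, 14.0, 16.0]  # hypothetical figures
print(forecast(monthly_profit, 2))

# New observations simply join the series and are used by the next forecast.
monthly_profit.append(20.0)
print(forecast(monthly_profit, 1))
```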
The association algorithm is used for recommendation engines based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a case is called an item set. The algorithm traverses the data set to find items that appear together in a case. The MINIMUM_SUPPORT parameter sets how many cases must contain a group of associated items before it is treated as an item set.
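The role of a minimum support threshold can be sketched in Python as follows. This is a brute-force illustration under stated assumptions: the cases are hypothetical, the minimum_support argument is treated as an absolute count of cases, and item sets are limited to two items to keep the example small.

```python
from collections import Counter
from itertools import combinations

# Hypothetical cases: each case has an identifier and the items it contains.
cases = {
    1: {"bread", "milk"},
    2: {"bread", "butter", "milk"},
    3: {"bread", "butter"},
    4: {"milk"},
}

def frequent_itemsets(cases, minimum_support, max_size=2):
    """Count every item set up to max_size and keep those seen in enough cases."""
    counts = Counter()
    for items in cases.values():
        for size in range(1, max_size + 1):
            # Sorting gives each item set one canonical tuple form.
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    return {s: n for s, n in counts.items() if n >= minimum_support}

frequent = frequent_itemsets(cases, minimum_support=2)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

Here ("butter", "milk") appears in only one case, so it falls below the threshold and is dropped, while ("bread", "milk") and ("bread", "butter") survive and could feed a recommendation.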
A sequence clustering algorithm collects similar or related paths, that is, sequences of data containing events. The data represents a series of events or transitions between states in a dataset, such as a series of web clicks. The algorithm examines all transition probabilities and measures the differences, or distances, between all the possible sequences in the data set. This helps it determine which sequences are the best input for clustering.
E.g. a sequence clustering algorithm may help find the path for storing products of a “similar” nature in a retail warehouse.
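The transition-probability step can be sketched in Python. The snippet covers only that first step, estimating how likely each state is to follow another; it does not perform the clustering itself, and the click paths are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical click paths through a web site.
sequences = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["home", "product", "cart"],
]

def transition_probabilities(paths):
    """Estimate P(next state | current state) from observed consecutive pairs."""
    counts = defaultdict(Counter)
    for path in paths:
        for current, nxt in zip(path, path[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }

probs = transition_probabilities(sequences)
print(probs["home"])
print(probs["search"])
```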
Data mining is used to examine or explore data using queries. These queries can be run against the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns, etc. It is most commonly used to transform large amounts of data into a meaningful form. Data here can be facts, numbers or any real-time information such as sales figures, costs, metadata, etc. Information is the set of patterns and relationships within the data that give it meaning.
SQL Server data mining offers Data Mining Add-ins for Office 2007 that allow discovering patterns and relationships in the data. This also helps with enhanced analysis. The add-in called Data Mining Client for Excel is used to prepare data and then build, evaluate, manage and predict results.
Data Mining Extensions (DMX) is based on the syntax of SQL. It is built on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation. Data definition statements are used to define or create new models and structures.
Example:
CREATE MINING STRUCTURE
CREATE MINING MODEL
Data manipulation statements are used to manage existing models and structures.
Example:
INSERT INTO
SELECT FROM <model>.CONTENT (DMX)
Data Mining Extensions can be used to slice the data in the source cube in the order discovered by data mining. When a cube is mined, the case table is a dimension.
There are several ways of moving data or databases between servers and databases. One can use any of the following options:
- BACKUP/RESTORE,
- Detaching/attaching databases,
- Replication,
- DTS,
- BCP,
- log shipping,
- INSERT...SELECT,
- SELECT...INTO,
- creating INSERT scripts to generate data.
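Of the options above, INSERT...SELECT is the easiest to show in a few lines. The sketch below uses Python's built-in sqlite3 module with two hypothetical tables in one in-memory database; it illustrates only the statement shape, since SQL Server features such as BACKUP/RESTORE, replication or BCP have no counterpart here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_employees (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE target_employees (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO source_employees (id, name) VALUES (1, 'Asha'), (2, 'Ravi');
""")

# INSERT...SELECT copies rows from one existing table into another.
with conn:
    conn.execute(
        "INSERT INTO target_employees (id, name) "
        "SELECT id, name FROM source_employees"
    )

copied = conn.execute("SELECT id, name FROM target_employees ORDER BY id").fetchall()
print(copied)
```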
Data Transformation Services (DTS) is a set of tools available in SQL Server that helps to extract, transform and consolidate data. The data can come from different sources and go to a single destination or multiple destinations, depending on DTS connectivity. Depending on the business needs, a DTS package is created. This package contains a list of tasks that define the work to be performed and the transformations to be applied to the data objects.
Import or export data: DTS can import data from a text file or an OLE DB data source into SQL Server, or vice versa.
Transform data: The DTS Designer interface allows you to select data from a data source connection, map the columns of data to a set of transformations, and send the transformed data to a destination connection. For parameterized queries and mapping purposes, the Data Driven Query task can be used from the DTS Designer.
Consolidate data: The DTS Designer can also be used to transfer indexes, views, logins, triggers and user-defined data. Scripts can also be generated for the same. To perform these tasks, valid connections must be established to the source and destination data and to any additional data sources, such as lookup tables.
What is a built-in function? Explain its types, i.e. rowset, aggregate and scalar.
A built-in function in SQL is used for performing calculations. These are standard functions provided by SQL.
Aggregate functions: These functions perform a calculation on a column and return a single value. Example: AVG(), SUM(), MIN(), MAX()
Scalar functions: These functions perform a calculation on an input value and return a single value. Example: ROUND(), MID(), LCASE(), UCASE()
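The difference between the two kinds of function can be seen by running both against the same table. The sketch below uses Python's built-in sqlite3 module and a hypothetical salaries table. SQLite does not provide MID(), LCASE() or UCASE(), so UPPER() stands in for UCASE() here; function names vary between database products.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [("asha", 1000.46), ("ravi", 2000.0), ("meera", 3000.0)],
)

# Aggregate functions: many rows in, one value out per function.
aggregates = conn.execute(
    "SELECT AVG(amount), SUM(amount), MIN(amount), MAX(amount) FROM salaries"
).fetchone()

# Scalar functions: evaluated once per input value, so one result per row.
scalars = conn.execute(
    "SELECT UPPER(name), ROUND(amount, 1) FROM salaries ORDER BY name"
).fetchall()

print(aggregates)
print(scalars)
```

The aggregate query returns a single row for the whole table, while the scalar query returns as many rows as the table holds.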
a. They can be used in more places, with fewer restrictions, than stored procedures.
b. Code can be made less complex and easier to write.
c. Parameters can be passed to the function.
d. They can be used to create joins and can also be used in a SELECT, WHERE or CASE statement.
e. They are simpler to invoke.