Frequently asked data warehouse and data mining interview questions, with detailed answers and examples, plus tips and tricks for cracking a data warehouse interview. Data mining refers to extracting or "mining" knowledge from data. Data warehousing supplies the raw material for management's decision support system, including activities such as data mining.
Data mining is the process of analyzing data from different dimensions or perspectives and summarizing it into useful information. A data warehouse is the electronic storage of an organization's data.
Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users. In computer science, data modeling is the process of creating a data model by applying a data model theory to create a data model instance. A data model theory is a formal data model description. When data modeling, we are structuring and organizing data. These data structures are then typically implemented in a database. Data modeling will also impose, implicitly or explicitly, constraints or limitations on the data placed within the structure. Data models describe structured data for storage in data management systems such as relational databases. They typically do not describe unstructured data, such as word processing documents, email messages, pictures, digital audio, and video.
The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. It is based on the Structured Query Language (SQL). Query languages of this kind are designed to support ad hoc and interactive data mining. DMQL provides commands for specifying data mining primitives. We can use it to work with databases and data warehouses, and to define data mining tasks; in particular, it can be used to define data warehouses and data marts. Its syntax also allows users to specify how discovered patterns are displayed, in one or more forms.
Some view data mining as an essential step in the process of knowledge discovery. In the data transformation step of that process, data is transformed into forms appropriate for mining, for example by performing summary or aggregation operations. The ETL process extracts, transforms, and loads transaction data into the data warehouse system.

Two main phases of work on classification can be identified within the statistical community. Many statistical techniques attempt to provide an estimate of the joint distribution of the features within each class, which can in turn provide a classification rule. Statistical procedures are generally characterized by having an explicit underlying probability model, which provides a probability of being in each class rather than just a classification. It is also usually assumed that the techniques will be used by statisticians, and hence some human involvement is assumed with regard to variable selection.

Machine learning generally covers automatic computing procedures, based on logical or binary operations, that learn a task from a series of examples. Here we focus on decision-tree approaches, in which classification results from a sequence of logical steps. In principle, these approaches can deal with more general types of data, including cases where the number and type of attributes may vary.

The ID3 algorithm starts with the original set as the root node. On every iteration, it goes through every unused attribute of the set and calculates the entropy of that attribute. It then chooses the attribute with the smallest entropy value.

Metadata is simply defined as data about data.

The decision tree is not affected by Automatic Data Preparation.

What Is Naive Bayes Algorithm?
The Naive Bayes algorithm is used to generate mining models. These models help to identify relationships between the input columns and the predictable columns. The algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column, given each possible state of the predictable column.
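The conditional-probability calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular product: the column names (`age`, `owns_home`, `buys`), the sample rows, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Count class frequencies and, per class, the states of each input column."""
    class_counts = Counter(row[target] for row in rows)
    state_counts = defaultdict(Counter)  # (column, class) -> Counter of states
    states = defaultdict(set)            # column -> set of observed states
    for row in rows:
        for column, state in row.items():
            if column != target:
                state_counts[(column, row[target])][state] += 1
                states[column].add(state)
    return class_counts, state_counts, states

def predict(model, inputs):
    """Return the normalized probability of each class for the given inputs."""
    class_counts, state_counts, states = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for column, state in inputs.items():
            count = state_counts[(column, cls)][state]
            # Add-one smoothing (an assumption) avoids zero probabilities.
            score *= (count + 1) / (n + len(states[column]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical training data: "buys" is the predictable column.
rows = [
    {"age": "young", "owns_home": "no", "buys": "no"},
    {"age": "young", "owns_home": "yes", "buys": "yes"},
    {"age": "old", "owns_home": "yes", "buys": "yes"},
    {"age": "old", "owns_home": "yes", "buys": "yes"},
    {"age": "old", "owns_home": "no", "buys": "no"},
    {"age": "young", "owns_home": "no", "buys": "no"},
]
model = train_naive_bayes(rows, target="buys")
probs = predict(model, {"age": "old", "owns_home": "yes"})
print(probs)
```

The "naive" part is the multiplication inside `predict`: each input column is treated as independent of the others once the class is known.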
After the model is built, the results can be used for exploration and for making predictions.

Explain Clustering Algorithm?
A clustering algorithm is used to group sets of data with similar characteristics; these groups are called clusters. Clusters help in making faster decisions and in exploring data. The algorithm first identifies relationships in a dataset and then generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm redefines the groupings to create clusters that better represent the data.

A time series algorithm can be used to predict continuous values of data. Once the algorithm is trained to predict a series of data, it can predict the outcome of other series.
The algorithm generates a model that can predict trends based only on the original dataset. New data can also be added, and it automatically becomes part of the trend analysis.

An association algorithm is used for recommendation engines based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a dataset containing identifiers, both for individual cases and for the items that the cases contain. A group of items in a case is called an itemset. The algorithm traverses the dataset to find items that appear together in a case.

What Is Sequence Clustering Algorithm?
A sequence clustering algorithm collects similar or related paths: sequences of data containing events.
The data represents a series of events or transitions between states in a dataset, such as a series of web clicks. The algorithm examines all transition probabilities and measures the differences, or distances, between all the possible sequences in the dataset. This helps it determine which sequences are best to use as input for clustering.

Data mining is used to examine or explore data using queries.
Data here can be facts, numbers, or any real-time information such as sales figures, costs, and metadata. Information is the patterns and relationships among the data. SQL Server data mining offers Data Mining Add-ins for Office, which allow users to discover the patterns and relationships in the data and support enhanced analysis. The add-in called Data Mining Client for Excel is used to prepare data and to build, evaluate, manage, and predict results from models.

Data Mining Extensions (DMX) is based on the syntax of SQL. It is based on relational concepts and is mainly used to create and manage data mining models. DMX comprises two types of statements: data definition and data manipulation.
Data definition statements are used to define or create new models and structures. A data mining extension can be used to slice the data in the source cube in the order discovered by data mining. When a cube is mined, the case table is a dimension. There are several ways of doing this. One can use any of the following options:

Compared to stored procedures, functions can be used in a number of places without restrictions. Code can be made less complex and easier to write. Parameters can be passed to the function. Functions can be used to create joins and can also be used in a SELECT, WHERE, or CASE statement. They are also simpler to invoke.

Define Pre Pruning?
In pre-pruning, a tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples.
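Pre-pruning is easiest to see inside a small decision-tree builder. The sketch below chooses, at each node, the attribute with the smallest weighted entropy (in the spirit of ID3) and halts early once a depth limit is reached, turning the node into a leaf that holds the most frequent class. The `max_depth` stopping rule, the attribute names, and the sample data are assumptions chosen for illustration; other halting criteria are possible.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def split_entropy(rows, labels, attr):
    """Weighted entropy of the labels after splitting the rows on attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

def build_tree(rows, labels, attrs, max_depth, depth=0):
    majority = Counter(labels).most_common(1)[0][0]
    # Normal termination: the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Pre-pruning: halt construction early; the node becomes a leaf
    # holding the most frequent class among its samples.
    if depth >= max_depth:
        return majority
    best = min(attrs, key=lambda a: split_entropy(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        children[value] = build_tree(sub_rows, sub_labels, remaining, max_depth, depth + 1)
    return {"attr": best, "children": children, "default": majority}

def classify(tree, row):
    """Walk the tree until a leaf (a plain class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Hypothetical training data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "stay", "play"]

full_tree = build_tree(rows, labels, ["outlook", "windy"], max_depth=3)
pruned_tree = build_tree(rows, labels, ["outlook", "windy"], max_depth=1)
print(full_tree)
print(pruned_tree)
```

With `max_depth=1` the "rainy" branch is never split on `windy`; it becomes a leaf labeled with the majority class of its three samples, which is the trade-off pre-pruning makes between tree size and fit to the training data.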
What Are Interval Scaled Variables?
Interval-scaled variables are continuous measurements on a linear scale. Examples include height and weight, weather temperature, and coordinates for any cluster. The distance between objects described by such measurements can be calculated using the Euclidean distance or the Minkowski distance.

What Is STING?
In the STING method, all the objects are contained in rectangular cells. These cells are kept at various levels of resolution, and the levels are arranged in a hierarchical structure.
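The hierarchical grid idea behind STING can be sketched as follows. This is only an illustration under simplifying assumptions: points lie in the unit square, level k splits the square into 2^k by 2^k cells, and each cell stores just a point count, whereas a full grid method would typically keep richer per-cell statistics. The function names are hypothetical.

```python
from collections import Counter

def grid_levels(points, levels):
    """Count points per rectangular cell at several resolutions.

    Level k divides the unit square into 2**k x 2**k cells.
    """
    hierarchy = []
    for k in range(levels):
        cells_per_side = 2 ** k
        counts = Counter()
        for x, y in points:
            # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
            col = min(int(x * cells_per_side), cells_per_side - 1)
            row = min(int(y * cells_per_side), cells_per_side - 1)
            counts[(row, col)] += 1
        hierarchy.append(counts)
    return hierarchy

def children(cell):
    """The four cells at the next finer level that make up the given cell."""
    row, col = cell
    return [(2 * row + dr, 2 * col + dc) for dr in (0, 1) for dc in (0, 1)]

points = [(0.1, 0.1), (0.15, 0.2), (0.8, 0.9), (0.85, 0.95), (0.6, 0.1)]
hierarchy = grid_levels(points, levels=3)
for level, counts in enumerate(hierarchy):
    print(level, dict(counts))
```

Because every cell's count equals the sum of its four children's counts, a query can start at a coarse level and descend only into cells that look relevant.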
What Is DBSCAN?
DBSCAN is a density-based clustering method that turns high-density regions of objects into clusters of arbitrary shapes and sizes.

Define Density Based Method?
A density-based method deals with arbitrarily shaped clusters. In a density-based method, clusters are formed on the basis of regions where the density of objects is high.
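A compact, unoptimized sketch of DBSCAN shows how density drives cluster formation. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense point), the Euclidean distance, and the sample points are assumptions for illustration; production implementations use spatial indexes instead of the brute-force neighbor search shown here.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indexes of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # dense point: keep growing the cluster
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point at (5, 5) has no dense neighborhood, so it is labeled as noise rather than forced into a cluster.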
Define Chameleon Method?
Chameleon is another hierarchical clustering method, one that uses dynamic modeling. It was introduced to overcome the drawbacks of the CURE method. In this method, two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within a cluster.

In the partitioning method, a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Each partition represents a cluster. The two types of partitioning method are k-means and k-medoids.

Define Genetic Algorithm?
A genetic algorithm enables us to locate an optimal binary string by processing an initial random population of binary strings, performing operations such as artificial mutation, crossover, and selection.
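The three operations named above can be sketched on a toy problem. Everything specific here is an assumption made for the example: the objective (count the 1 bits, often called OneMax), tournament selection of size two, single-point crossover, the mutation rate, and keeping the best individual each generation (elitism).

```python
import random

def fitness(bits):
    """Hypothetical objective: the number of 1 bits in the string."""
    return sum(bits)

def select(population, rng):
    """Tournament selection: pick two individuals, keep the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(parent1, parent2, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:]

def mutate(bits, rate, rng):
    """Flip each bit independently with probability rate."""
    return [1 - b if rng.random() < rate else b for b in bits]

def evolve(length=16, pop_size=20, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        next_population = [best]  # elitism: the best string always survives
        while len(next_population) < pop_size:
            child = crossover(select(population, rng), select(population, rng), rng)
            next_population.append(mutate(child, rate, rng))
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve()
print(best, history[0], history[-1])
```

Because the best string is copied unchanged into each new generation, the best fitness recorded in `history` can never decrease, although reaching the true optimum is not guaranteed.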
An operational data store (ODS) is a collection of operational or base data that is extracted from operational databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to ensure that summarized and derived data is calculated properly. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as their operational databases.
What Is Spatial Data Mining?
Spatial data mining is the application of data mining methods to spatial data. It follows the same functions as data mining in general, with the end objective of finding patterns in geography. So far, data mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions, and approaches to visualization and data analysis.
Particularly, most contemporary GIS have only very basic spatial analysis functionality. The immense explosion in geographically referenced data occasioned by developments in IT, digital mapping, remote sensing, and the global diffusion of GIS emphasises the importance of developing data driven inductive approaches to geographical analysis and modeling.
Data mining, which is the partially automated search for hidden patterns in large databases, offers great potential benefits for applied GIS-based decision-making. Recently, the task of integrating these two technologies has become critical, especially as various public and private sector organizations possessing huge databases with thematic and geographically referenced data begin to realise the huge potential of the information hidden there.
Among those organizations are:

What Is Smoothing?
Smoothing is an approach used to remove nonsystematic behaviors found in time series. It usually takes the form of finding moving averages of attribute values. It is used to filter out noise and outliers.

Data mining is used for estimating the future. Traditional approaches use simple algorithms for estimating the future, but they do not give accurate results when compared to data mining.

What Is Model Based Method?
Model-based methods are used to optimize the fit between a given data set and a mathematical model. These methods assume that the data are generated by probability distributions. There are two basic approaches: 1. Statistical approach 2. Neural network approach.

What Is An Index?
Indexes in SQL Server are similar to the indexes in books. They help SQL Server retrieve data more quickly. Indexes are of two types: clustered indexes and non-clustered indexes. Rows in the table are stored in the order of the clustered index key, so there can be only one clustered index per table. Non-clustered indexes have their own storage, separate from the table data storage. They are stored as B-tree structures, with leaf-level nodes holding the index key and its row locator.
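The effect of a secondary index can be demonstrated with Python's built-in `sqlite3` module. Note the substitution: this uses SQLite, not SQL Server, so it is only a loose analogy. In SQLite a table with an `INTEGER PRIMARY KEY` is stored in a B-tree ordered by that key, and an index created with `CREATE INDEX` lives in its own B-tree whose entries hold the indexed value plus a reference back to the row. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 50, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"

def query_plan(sql, params):
    """Return SQLite's description of how it will execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in rows)  # the last column is the detail text

plan_before = query_plan(QUERY, (7,))  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(QUERY, (7,))   # the separate index structure is searched

matching_rows = conn.execute(QUERY, (7,)).fetchall()
print(plan_before)
print(plan_after)
print(len(matching_rows))
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than relying on a fixed string.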
Define Binary Variables?
Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. A binary variable is symmetric if both of its states are equally valuable and carry the same weight. It is asymmetric if its states are not equally important.

Preparing the data for classification and prediction:

What Are Non-additive Facts?
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

What Is Meteorological Data?
Meteorology is the interdisciplinary scientific study of the atmosphere. It observes changes in temperature, air pressure, moisture, and wind direction. Usually, temperature, pressure, wind, and humidity are measured by a thermometer, barometer, anemometer, and hygrometer, respectively. There are many methods of collecting data; radar, lidar, and satellites are some of them. Weather forecasts are made by collecting quantitative data about the current state of the atmosphere. The main issue that arises in this prediction is that it involves high-dimensional characteristics. To overcome this issue, it is necessary to first analyze and simplify the data before proceeding with other analysis.
Some data mining techniques are appropriate in this context.

Define Descriptive Model?
A descriptive model is used to determine the patterns and relationships in sample data.

What Is A Star Schema?
A star schema is a way of organizing tables such that results can be retrieved from the database easily and quickly in a warehouse environment. Usually a star schema consists of one or more dimension tables arranged around a fact table, which looks like a star; this is how it got its name.

What Is A Lookup Table?
A lookup table is one that is used when updating a warehouse.

What Is Attribute Selection Measure?
The information gain measure is used to select the test attribute at each node in the decision tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split.

Define Wave Cluster?
WaveCluster is a grid-based, multi-resolution clustering method. In this method, all the objects are represented by a multidimensional grid structure, and a wavelet transformation is applied to find the dense regions. Each grid cell contains summary information about the group of objects that map into that cell. A wavelet transformation is a signal processing technique that decomposes a signal into different frequency sub-bands.
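To make the idea of frequency sub-bands concrete, here is one level of an unnormalized Haar wavelet transform applied to a one-dimensional row of grid-cell counts. The choice of the Haar wavelet, the one-dimensional input, and the sample counts are assumptions for illustration; a grid-based clustering method would apply a transform across every dimension of the grid.

```python
def haar_step(signal):
    """One level of the (unnormalized) Haar wavelet transform.

    Returns the low-frequency band (pairwise averages) and the
    high-frequency band (pairwise half-differences).
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail

def inverse_haar_step(approx, detail):
    """Rebuild the original signal from the two sub-bands."""
    signal = []
    for s, d in zip(approx, detail):
        signal.extend([s + d, s - d])
    return signal

# Hypothetical number of objects falling into each of eight grid cells.
cell_counts = [0, 1, 9, 8, 1, 0, 0, 7]
approx, detail = haar_step(cell_counts)
restored = inverse_haar_step(approx, detail)
print(approx)
print(detail)
```

The low-frequency band is a half-resolution summary in which the dense stretch of cells still stands out, while the high-frequency band captures the sharp changes at its boundaries; together they reconstruct the input exactly.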
What Is Time Series Analysis?