DWH Two Marks Q & A


    Unit I

1. Define data mining.

Data mining refers to extracting or mining knowledge from large amounts of data. It is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

2. Give some alternative terms for data mining.

Knowledge mining
Knowledge extraction
Data/pattern analysis
Data archaeology
Data dredging

3. What is KDD?

KDD stands for Knowledge Discovery in Databases.

4. What are the steps involved in the KDD process?

Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation

5. What is the use of the knowledge base?

The knowledge base is domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies used to organize attributes or attribute values into different levels of abstraction.

6. Mention some of the data mining techniques.

Statistics
Machine learning
Decision trees
Hidden Markov models
Artificial intelligence
Genetic algorithms
Meta learning

7. Give a few statistical techniques.

Point estimation
Data summarization
Bayesian techniques
Hypothesis testing
Correlation
Regression

8. What is the purpose of a data mining technique?

It provides a way to carry out the various data mining tasks.

9. Define predictive model.

A predictive model is used to predict the values of data by making use of known results from a different set of sample data.

10. Which data mining tasks belong to the predictive model?

Classification
Regression
Time series analysis

11. Define descriptive model.

A descriptive model is used to determine the patterns and relationships in a sample of data. Data mining tasks that belong to the descriptive model:
Clustering
Summarization
Association rules
Sequence discovery

12. Define the term summarization.

Summarization is the condensation of a large chunk of data contained in a web page or a document.


Summarization = characterization = generalization

13. List out the advanced database systems.

Extended-relational databases
Object-oriented databases
Deductive databases
Spatial databases
Temporal databases
Multimedia databases
Active databases
Scientific databases
Knowledge databases

14. Define cluster analysis.

Cluster analysis analyzes data objects without consulting a known class label. The class labels are not present in the training data simply because they are not known to begin with.

15. Classifications of data mining systems.

Based on the kinds of databases mined:
o According to data model
_ Relational mining systems
_ Transactional mining systems
_ Object-oriented mining systems
_ Object-relational mining systems
_ Data warehouse mining systems
o According to types of data
_ Spatial data mining systems
_ Time-series data mining systems
_ Text data mining systems
_ Multimedia data mining systems

Based on the kinds of knowledge mined:
o According to functionalities
_ Characterization
_ Discrimination
_ Association
_ Classification
_ Clustering
_ Outlier analysis
_ Evolution analysis
o According to levels of abstraction of the knowledge mined
_ Generalized knowledge (high level of abstraction)
_ Primitive-level knowledge (raw data level)
o According to whether data regularities or data irregularities are mined

Based on the kinds of techniques utilized:
o According to user interaction
_ Autonomous systems
_ Interactive exploratory systems
_ Query-driven systems
o According to methods of data analysis
_ Database-oriented
_ Data warehouse-oriented
_ Machine learning
_ Statistics
_ Visualization
_ Pattern recognition
_ Neural networks

Based on the applications adopted:
o Finance
o Telecommunication
o DNA
o Stock markets
o E-mail, and so on

16. Describe challenges to data mining regarding data mining methodology and user interaction issues.

Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
Presentation and visualization of data mining results
Handling noisy or incomplete data
Pattern evaluation


17. Describe challenges to data mining regarding performance issues.

Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms

18. Describe issues relating to the diversity of database types.

Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems

19. What is meant by pattern?

A pattern represents knowledge if it is easily understood by humans; valid on test data with some degree of certainty; and potentially useful, novel, or validates a hunch about which the user was curious. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process.

20. How is a data warehouse different from a database?

A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision-making. A database consists of a collection of interrelated data.

21. What are the uses of statistics in data mining?

Statistics is used to
* estimate the complexity of a data mining problem;
* suggest which data mining techniques are most likely to be successful; and
* identify data fields that contain the most surface information.

22. What is the main goal of statistics?

The basic goal of statistics is to extend knowledge about a subset of a collection to the entire collection.

23. What are the factors to be considered while selecting the sample in statistics?

The sample should be
* large enough to be representative of the population,
* small enough to be manageable,
* accessible to the sampler, and
* free of bias.

24. Name some advanced database systems.

Object-oriented databases, object-relational databases.

25. Name some specific application-oriented databases.

Spatial databases, time-series databases, text databases, and multimedia databases.

26. Define relational databases.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

27. Define transactional databases.

A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

    28.Define Spatial Databases.


Spatial databases contain spatial-related information. Such databases include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.

29. What is a temporal database?

A temporal database stores time-related data. It usually stores relational data that include time-related attributes. These attributes may involve several time stamps, each having different semantics.

30. What is a time-series database?

A time-series database stores sequences of values that change with time, such as data collected regarding the stock exchange.

31. What is a legacy database?

A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.

32. What are the steps in the data mining process?

a. Data cleaning
b. Data integration
c. Data selection
d. Data transformation
e. Data mining
f. Pattern evaluation
g. Knowledge representation

33. Define data cleaning.

Data cleaning means removing inconsistent data or noise and collecting the necessary information.

34. Define data mining.

Data mining is a process of extracting or mining knowledge from huge amounts of data.

35. Define pattern evaluation.

Pattern evaluation is used to identify the truly interesting patterns representing knowledge, based on some interestingness measures.

36. Define knowledge representation.

Knowledge representation techniques are used to present the mined knowledge to the user.

37. What is visualization?

Visualisation is the depiction of data, used to gain intuition about the data being observed. It assists analysts in selecting display formats, viewer perspectives, and data representation schemas.

38. Name some conventional visualization techniques.

Histogram
Relationship tree
Bar charts
Pie charts
Tables, etc.

39. Give the features included in modern visualisation techniques.

a. Morphing
b. Animation
c. Multiple simultaneous data views
d. Drill-down
e. Hyperlinks to related data sources

40. Define conventional visualisation.

Conventional visualisation depicts information about a population and not the population data itself.


41. Define spatial visualisation.

Spatial visualisation depicts actual members of the population in their feature space.

42. What are descriptive and predictive data mining?

Descriptive data mining describes the data set in a concise and summarizing manner and presents interesting general properties of the data. Predictive data mining analyzes the data in order to construct one or a set of models and attempts to predict the behavior of new data sets.

43. Merits of a data warehouse.

* Ability to make effective decisions from the database
* Better analysis of data and decision support
* Discovery of trends and correlations that benefit business
* Handling of huge amounts of data

44. What are the characteristics of a data warehouse?

* Separate
* Available
* Integrated
* Subject oriented
* Not dynamic
* Consistency
* Iterative development
* Aggregation performance

45. List some of the data warehouse tools.

* OLAP (OnLine Analytic Processing)
* ROLAP (Relational OLAP)
* End-user data access tools
* Ad hoc query tools
* Data transformation services
* Replication

46. Explain OLAP.

OLAP is the general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors. The OLAP vendors' technology is nonrelational and is almost always based on an explicit multidimensional cube of data. OLAP databases are also known as multidimensional cube databases.

47. Explain ROLAP.

ROLAP stands for Relational OnLine Analytic Processing. It is a set of user interfaces and applications that give a relational database a dimensional flavour.

    UNIT-II

1. Define data warehouse.

A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site to facilitate management decision making. (or) A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

2. What are operational databases?

The large databases that organizations maintain and that are updated by daily transactions are called operational databases.

3. Define OLTP.

If an on-line operational database system is used for efficient retrieval, efficient storage, and management of large amounts of data, then the system is said to be an on-line transaction processing (OLTP) system.


4. Define OLAP.

Data warehouse systems serve users (or) knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats. These systems are known as on-line analytical processing (OLAP) systems.

5. How is a database design represented in OLTP systems?

Entity-relationship model

6. How is a database design represented in OLAP systems?

Star schema
Snowflake schema
Fact constellation schema

7. List out the steps of the data warehouse design process.

_ Choose a business process to model.
_ Choose the grain of the business process.
_ Choose the dimensions that will apply to each fact table record.
_ Choose the measures that will populate each fact table record.

8. What is an enterprise warehouse?

An enterprise warehouse collects all the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one (or) more operational systems (or) external information providers. It contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, (or) beyond. An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers, (or) parallel architecture platforms. It requires business modeling and may take years to design and build.

9. What is a data mart?

A data mart is a database that contains a subset of the data present in a data warehouse. Data marts are created to structure the data in a data warehouse according to issues such as hardware platforms and access control strategies. We can divide a data warehouse into data marts after the data warehouse has been created. Data marts are usually implemented on low-cost departmental servers that are UNIX (or) Windows/NT based. The implementation cycle of a data mart is likely to be measured in weeks rather than months (or) years.

10. What are dependent and independent data marts?

Dependent data marts are sourced directly from enterprise data warehouses. Independent data marts contain data captured from one (or) more operational systems (or) external information providers, (or) data generated locally within a particular department (or) geographic area.

11. Define indexing.

Indexing is a technique used for efficient data retrieval, (or) accessing data in a faster manner. When a table grows in volume, the indexes also increase in size, requiring more storage.

12. What are the types of indexing?

_ B-tree indexing
_ Bit map indexing
_ Join indexing

13. Define metadata.

Metadata used in a data warehouse describes data about data, i.e. metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse.

14. Define VLDB.

VLDB stands for Very Large Data Base. If the size of a database is greater than 100 GB, then the database is said to be a very large database.

15. What is data cleaning?

Data cleaning routines remove incomplete, noisy, and inconsistent data by
- filling in missing values,
- smoothing out noise,
- identifying outliers, and
- correcting inconsistencies in the data.

16. Mention the categories of data that may be encountered in mining.

The data used in the analysis by data mining techniques may fall under the following categories:
Incomplete data - data lacking attribute values or certain attributes of interest.
Noisy data - data containing errors or outlier values that deviate from the expected. Noise is defined as a random error or variance in a measured variable.
Inconsistent data - there may be inconsistencies in the data recorded in some transactions, inconsistencies due to data integration (where a given attribute may have different names in different databases), or inconsistencies due to data redundancy.

17. What are the various data smoothing techniques to remove noise?

The various data smoothing techniques are
Binning
Clustering
Combined computer and human inspection
Regression

18. What is binning?

Binning is used to smooth data values by consulting the neighbourhood of values. The sorted values are distributed into a number of buckets or bins; the data are first sorted and then partitioned into equidepth bins. There are three types of binning:
Smoothing by bin means - each value is replaced by the mean value of the bin.
Smoothing by bin medians - each bin value is replaced by the bin median.
Smoothing by bin boundaries - the maximum and minimum values in the bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
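A minimal Python sketch of equidepth binning with smoothing by bin means; the data values and bin count below are invented for illustration:

```python
# Equidepth binning with smoothing by bin means (illustrative sketch).
def smooth_by_bin_means(values, num_bins):
    # The data are first sorted, as the definition above requires.
    sorted_vals = sorted(values)
    bin_size = len(sorted_vals) // num_bins
    smoothed = []
    for i in range(num_bins):
        start = i * bin_size
        # The last bin absorbs any leftover values.
        end = (i + 1) * bin_size if i < num_bins - 1 else len(sorted_vals)
        bin_vals = sorted_vals[start:end]
        mean = sum(bin_vals) / len(bin_vals)
        # Every value in the bin is replaced by the bin mean.
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Example: nine prices partitioned into 3 equidepth bins.
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```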

19. What is data integration? What are the issues to be considered while integrating data?

Data integration combines data from multiple sources into a coherent data store. Issues to be considered are
a) the entity identification problem,
b) correlation analysis, and
c) detection and resolution of data value conflicts.

20. What is data transformation? What are the various methods of transforming data?

Data transformation transforms and consolidates data into forms appropriate for mining. The following are various methods of transforming data:
i. Smoothing
ii. Aggregation
iii. Generalization
iv. Normalization
v. Attribute construction
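As an illustration of the normalization step, a small sketch of min-max normalization to the range [0, 1]; the income values and range are invented for the example:

```python
# Min-max normalization: rescale values into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumes the values are not all identical
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

# Example: incomes rescaled so that they fall between 0 and 1.
print(min_max_normalize([12000, 35000, 54000, 98000]))
```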

    UNIT III

1. Define the concept of classification.

Classification is a two-step process:
1. A model is built describing a predefined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes.
2. The model is used for classification.

2. What is a decision tree?

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The top-most node in a tree is the root node.

3. What is tree pruning?

Tree pruning attempts to identify and remove branches that reflect noise or outliers in the training data, with the goal of improving classification accuracy on unseen data.

4. What is an attribute selection measure?

The information gain measure is used to select the test attribute at each node in the decision tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split.

5. Describe tree pruning methods.

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Approaches:
Pre-pruning
Post-pruning

6. Define pre-pruning.

A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples.

7. Define post-pruning.

Post-pruning removes branches from a fully grown tree. A tree node is pruned by removing its branches.
E.g.: the cost complexity pruning algorithm

8. Define information gain.

The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node.
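A small sketch of how information gain could be computed for a candidate split, assuming plain Python lists of class labels; the labels and the split below are purely illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    # Expected information (entropy) of a set of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    # Gain = entropy(parent) - weighted entropy of the children after the split.
    total = len(parent_labels)
    remainder = sum(len(ch) / total * entropy(ch) for ch in child_label_lists)
    return entropy(parent_labels) - remainder

# Example: 9 "safe" / 5 "risky" tuples split by a hypothetical attribute.
parent = ["safe"] * 9 + ["risky"] * 5
children = [["safe"] * 6 + ["risky"] * 1, ["safe"] * 3 + ["risky"] * 4]
print(information_gain(parent, children))
```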

9. How does tree pruning work?

There are two approaches to tree pruning:
a. In the prepruning approach, a tree is pruned by halting its construction early, e.g. by deciding not to further split the training samples at a given node. Upon halting, the node becomes a leaf node.
b. In the postpruning approach, branches are removed from a fully grown tree. The lowest pruned node becomes a leaf and is labeled by the most frequent class.

10. How are classification rules extracted from a decision tree?

The knowledge represented in a decision tree can be extracted and represented in the form of classification IF-THEN rules. One rule is created for each path from the root to a leaf node.
E.g. IF age =


UNIT IV

... different objects into meaningful and descriptive objects.

4. What are the fields in which clustering techniques are used?

Clustering is used in biology to develop new plant and animal taxonomies.
Clustering is used in business to enable marketers to develop new distinct groups of their customers and characterize each customer group on the basis of purchasing.
Clustering is used in the identification of groups of automobile insurance policy customers.
Clustering is used in the identification of groups of houses in a city on the basis of house type, cost, and geographical location.
Clustering is used to classify documents on the web for information discovery.

5. What are the requirements of cluster analysis?

The basic requirements of cluster analysis are:
Dealing with different types of attributes
Dealing with noisy data
Constraints on clustering
Dealing with arbitrary shapes
High dimensionality
Ordering of input data
Interpretability and usability
Determining input parameters
Scalability

6. What are the different types of data used for cluster analysis?

The different types of data used for cluster analysis are interval-scaled, binary, nominal, ordinal, and ratio-scaled data.

7. What are interval-scaled variables?

Interval-scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or coordinates of any cluster. Distances between such measurements can be calculated using the Euclidean distance or the Minkowski distance.
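A short sketch of the Euclidean and Minkowski distance measures mentioned above for interval-scaled attribute vectors; the sample points are arbitrary:

```python
def minkowski(x, y, p):
    # Minkowski distance of order p between two numeric vectors.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    # The Euclidean distance is the Minkowski distance with p = 2.
    return minkowski(x, y, 2)

# Example: distance between two objects described by height (cm) and weight (kg).
print(euclidean([170, 65], [160, 72]))
print(minkowski([170, 65], [160, 72], 1))  # p = 1 gives the Manhattan distance
```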

8. Define binary variables. What are the two types of binary variables?

Binary variables have two states, 0 and 1; when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those whose states have the same values and weights. Asymmetric binary variables are those whose states do not have the same values and weights.

9. Define nominal, ordinal, and ratio-scaled variables.

A nominal variable is a generalization of the binary variable. A nominal variable has more than two states; for example, a nominal variable color may consist of four states: red, green, yellow, or black. For nominal variables the total number of states is N, and the states are denoted by letters, symbols, or integers.
An ordinal variable also has more than two states, but all these states are ordered in a meaningful sequence.
A ratio-scaled variable makes positive measurements on a non-linear scale, such as an exponential scale, using the formula Ae^(Bt) or Ae^(-Bt), where A and B are constants.

10. What do you mean by the partitioning method?

In the partitioning method, a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Each partition represents a cluster. The two types of partitioning methods are k-means and k-medoids.

11. Define CLARA and CLARANS.

Clustering in LARge Applications is called CLARA. The efficiency of CLARA depends upon the size of the representative data set. CLARA does not work properly if any representative data set from the selected representative data sets does not find the best k-medoids.
To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized search (CLARANS), was introduced. CLARANS works like CLARA; the only difference between CLARA and CLARANS is the clustering process that is done after selecting the representative data sets.

12. What is the hierarchical method?

The hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on bottom-up or top-down approaches.

13. Differentiate agglomerative and divisive hierarchical clustering.

Agglomerative hierarchical clustering works on the bottom-up approach. In the agglomerative hierarchical method, each object initially creates its own cluster. The single clusters are merged to make larger clusters, and the process of merging continues until all the singular clusters are merged into one big cluster that consists of all the objects.
Divisive hierarchical clustering works on the top-down approach. In this method all the objects are arranged within one big singular cluster, and the large cluster is continuously divided into smaller clusters until each cluster has a single object.

14. What is CURE?

Clustering Using REpresentatives is called CURE. Clustering algorithms generally work on spherical and similar-size clusters. CURE overcomes the problem of spherical and similar-size clusters and is more robust with respect to outliers.

15. Define the Chameleon method.

Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to overcome the drawbacks of the CURE method. In this method two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within each cluster.

16. Define association rule mining.

Association rule mining searches for interesting relationships among items in a given data set. Rule support and confidence are the two measures of rule interestingness.

17. What is the occurrence frequency of an itemset?

The occurrence frequency of an itemset is the number of transactions that contain the itemset. It is also known as the frequency, support count, or count of the itemset.

18. What are the two steps in mining association rules?

Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.

19. How are association rules classified?

Association rules are classified as follows:
Based on the types of values handled in the rule
Based on the dimensions of data involved in the rule
Based on the levels of abstraction involved in the rule
Based on the various extensions to association mining

20. What is a quantitative association rule?

If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, the quantitative values for items or attributes are partitioned into intervals.

21. What is a Boolean association rule?

If a rule concerns the association between the presence or absence of an item, it is a Boolean association rule.

22. What are single-dimensional and multidimensional association rules?

If the items or attributes in an association rule reference only one dimension of a data cube, it is called a single-dimensional association rule.
If the items or attributes in an association rule reference more than one dimension of a data cube, it is called a multidimensional association rule.

23. What is a multilevel association rule?

If an association rule refers to a dimension at multiple levels of abstraction, it is called a multilevel association rule.
If an association rule does not refer to a dimension at multiple levels of abstraction, it is called a single-level association rule.

24. Define the Apriori algorithm.

Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

25. What is a cuboid?

Data cubes created for varying levels of abstraction are referred to as cuboids. A data cube consists of a lattice of cuboids. Each higher level of abstraction reduces the data size.

26. When can we say the association rules are interesting?

Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.

27. Explain association rules in mathematical notation.

Let I = {i1, i2, ..., im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. The rule A => B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.

28. Define support and confidence in association rule mining.

Support s is the percentage of transactions in D that contain A ∪ B.
Confidence c is the percentage of transactions in D containing A that also contain B.
Support (A => B) = P(A ∪ B)
Confidence (A => B) = P(B | A)

Support: support is the ratio of the number of transactions that include all items in the antecedent and consequent parts of the rule to the total number of transactions. Support is an association rule interestingness measure.
Confidence: confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent to the number of transactions that include all items in the antecedent. Confidence is an association rule interestingness measure.
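These definitions can be checked with a small sketch that counts support and confidence over a toy transaction list; the transactions and items are invented for illustration:

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset (A union B).
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # P(B | A): support of A union B divided by support of A.
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Example transactions, each a set of purchased items.
T = [{"bread", "milk"}, {"bread", "diaper", "beer"},
     {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}]
print(support(T, {"bread", "milk"}))       # support of {bread, milk}
print(confidence(T, {"bread"}, {"milk"}))  # confidence of bread => milk
```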

29. How are association rules mined from large databases?

Step 1: Find all frequent itemsets.
Step 2: Generate strong association rules from the frequent itemsets.

30. Describe the different classifications of association rule mining.

Based on the types of values handled in the rule:
i. Boolean association rules
ii. Quantitative association rules
Based on the dimensions of data involved:
i. Single-dimensional association rules
ii. Multidimensional association rules
Based on the levels of abstraction involved:
i. Multilevel association rules
ii. Single-level association rules
Based on various extensions:
i. Correlation analysis
ii. Mining max patterns

31. What are the two main steps in the Apriori algorithm?

1) The join step
2) The prune step
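A compact sketch of the two steps for one level of the search, assuming frequent (k-1)-itemsets are represented as sorted tuples; the item names are invented for illustration:

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    # Join step: merge pairs of frequent (k-1)-itemsets that share
    # their first k-2 items, producing candidate k-itemsets.
    prev = sorted(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            if a < b and a[:-1] == b[:-1]:
                candidates.add(a + b[-1:])
    # Prune step: drop any candidate with a (k-1)-subset that is not
    # frequent (the anti-monotone property from question 33).
    frequent_set = set(frequent_prev)
    return {c for c in candidates
            if all(sub in frequent_set for sub in combinations(c, len(c) - 1))}

# Example: frequent 2-itemsets -> candidate 3-itemsets.
L2 = [("beer", "diaper"), ("beer", "milk"), ("diaper", "milk"), ("bread", "milk")]
print(apriori_gen(L2))  # -> {('beer', 'diaper', 'milk')}
```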

32. What is the purpose of the Apriori algorithm?

The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

    33. Define anti-monotone property.

    If a set cannot pass a test, all of its supersetswill fail the same test as well.

34. How are association rules generated from frequent itemsets?

Association rules can be generated as follows:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s => (l - s) if
support_count(l) / support_count(s) >= min_conf,
where min_conf is the minimum confidence threshold.
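A minimal sketch of this rule-generation step, assuming the support counts of the itemset and all of its subsets are available in a dictionary keyed by frozenset; the counts below are invented:

```python
from itertools import combinations

def rules_from_itemset(itemset, support_count, min_conf):
    # For every nonempty proper subset s of the frequent itemset l,
    # output s => (l - s) if support_count(l) / support_count(s) >= min_conf.
    l = frozenset(itemset)
    rules = []
    for r in range(1, len(l)):
        for subset in combinations(l, r):
            s = frozenset(subset)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Illustrative support counts for {milk, bread, butter} and its subsets.
counts = {frozenset(["milk", "bread", "butter"]): 2,
          frozenset(["milk", "bread"]): 4, frozenset(["milk", "butter"]): 3,
          frozenset(["bread", "butter"]): 3, frozenset(["milk"]): 6,
          frozenset(["bread"]): 7, frozenset(["butter"]): 4}
print(rules_from_itemset(["milk", "bread", "butter"], counts, min_conf=0.6))
```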


35. Give a few techniques to improve the efficiency of the Apriori algorithm.

Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting

36. What are the factors affecting the performance of the Apriori candidate generation technique?

The need to generate a huge number of candidate sets.
The need to repeatedly scan the database and check a large set of candidates by pattern matching.

37. Describe the method of generating frequent itemsets without candidate generation.

Frequent-pattern growth (or FP-growth) adopts a divide-and-conquer strategy.
Steps:
-> Compress the database representing frequent items into a frequent-pattern tree, or FP-tree.
-> Divide the compressed database into a set of conditional databases.
-> Mine each conditional database separately.

38. Define iceberg query.

An iceberg query computes an aggregate function over an attribute or set of attributes in order to find aggregate values above some specified threshold. Given a relation R with attributes a1, a2, ..., an and b, and an aggregate function agg_f, an iceberg query is of the form

select R.a1, R.a2, ..., R.an, agg_f(R.b)
from relation R
group by R.a1, R.a2, ..., R.an
having agg_f(R.b) >= threshold

39. What are hybrid-dimension association rules?

Multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates, are called hybrid-dimension association rules.
E.g. age(X, "20...29") AND buys(X, laptop) => buys(X, b/w printer)

40. Mention a few approaches to mining multilevel association rules.

Uniform minimum support for all levels (uniform support)
Reduced minimum support at lower levels (reduced support)
Level-by-level independent
Level-cross filtering by single item
Level-cross filtering by k-itemset

41. What are multidimensional association rules?

Association rules that involve two or more dimensions or predicates.
Interdimension association rule: a multidimensional association rule with no repeated predicate or dimension.
Hybrid-dimension association rule: a multidimensional association rule with multiple occurrences of some predicates or dimensions.

42. Define constraint-based association mining.

Mining is performed under the guidance of various kinds of constraints provided by the user. The constraints include the following:
Knowledge type constraints
Data constraints
Dimension/level constraints
Interestingness constraints
Rule constraints

43. What is a strong association rule?

Association rules that satisfy both a user-specified minimum confidence threshold and a user-specified minimum support threshold are referred to as strong association rules.

44. What are the various factors used to determine the interestingness measure?

1) Simplicity - the pattern should be simple overall for human comprehension.
2) Certainty - this is the validity or trustworthiness of the pattern.
3) Utility - this is the potential usefulness of the pattern.
4) Novelty - novel patterns provide new information or increase the performance of the pattern.

45. Explain the various OLAP operations.

a) Roll-up: the roll-up operation performs aggregation on a data cube, for example by climbing up a concept hierarchy for a dimension.
b) Drill-down: the reverse of roll-up; it navigates from less detailed data to more detailed data.
c) Slice: performs a selection on one dimension of the given cube, resulting in a subcube.

46. Discuss the concepts of frequent itemset, support, and confidence.

A set of items is referred to as an itemset. An itemset that contains k items is called a k-itemset. An itemset that satisfies minimum support is referred to as a frequent itemset.
Support is the ratio of the number of transactions that include all items in the antecedent and consequent parts of the rule to the total number of transactions.
Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent to the number of transactions that include all items in the antecedent.

47. What is the use of regression?

Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; in actuality, regression takes a set of data and fits the data to a formula.

48. What are the reasons for not using the linear regression model to estimate the output data?

There are many reasons. One is that the data do not fit a linear model. It is possible, however, that the data generally do represent a linear model, but the linear model generated is poor because noise or outliers exist in the data. Noise is erroneous data, and outliers are data values that are exceptions to the usual and expected data.

49. What are the two approaches used by regression to perform classification?

Regression can be used to perform classification using the following approaches:
1. Division: the data are divided into regions based on class.
2. Prediction: formulas are generated to predict the output class value.

50. What is linear regression?

In linear regression, data are modeled using a straight line. Linear regression is the simplest form of regression. Bivariate linear regression models a random variable Y, called the response variable, as a linear function of another random variable X, called the predictor variable:
Y = a + bX
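A small sketch that fits the coefficients a and b of Y = a + bX by the method of least squares; the sample data (years of experience versus salary) are invented:

```python
def fit_linear(xs, ys):
    # Least-squares estimates for Y = a + b*X.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_linear([3, 8, 9, 13, 6], [30, 57, 64, 72, 43])
print(a, b)        # fitted intercept and slope
print(a + b * 10)  # predicted value of Y for X = 10
```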

51. What is classification?

A bank loan officer wants to analyze which loan applicants are safe and which are risky for the bank. A marketing manager needs data analysis to help guess whether a customer with a given profile will buy a new computer. In the above examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels such as "safe" or "risky" for the loan application data.

52. What is prediction?

Suppose the marketing manager would like to predict how much a given customer will spend during a sale. This data analysis task is an example of numeric prediction. The term prediction is used to refer to numeric prediction.

53. How does classification work? (Or) Explain the steps involved in data classification.

Data classification is a two-step process:
Step 1: A classifier is built describing a predetermined set of data classes or concepts. This is the learning step (training phase), where a classification algorithm builds the classifier by analyzing or "learning from" a training set made up of database tuples and their associated class labels. (Learning: training data are analyzed by a classification algorithm.)
Step 2: The model is used for classification. A test set is used, made up of tuples and their associated class labels; the test data are used to estimate the accuracy of the classification rules. A new tuple such as (John Henry, middle aged, low income) can then be classified, e.g. loan_decision = risky.

54. What is supervised learning?

The class label of each training tuple is provided in supervised learning (i.e. the learning of the classifier is "supervised" in that it is told to which class each training tuple belongs).
E.g. Learning: training data are analyzed by a classification algorithm.


Training data:

Name           Age           Income   loan_decision
Sandy Jones    young         low      risky
Caroline       middle aged   high     safe
Susan Lake     senior        low      safe

In the above table the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules, e.g.
IF age = young THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle aged AND income = low THEN loan_decision = risky

55. What is unsupervised learning?

In unsupervised learning (or clustering), the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.
For example, if we did not have the loan_decision data available for the training set, we could use clustering to try to determine groups of like tuples, which may correspond to risk groups within the loan application data.

56. What are the preprocessing steps of the classification and prediction process?

The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
Data cleaning: this refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques) and the treatment of missing values.
Relevance analysis: many of the attributes in the data may be redundant. A strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. Attribute subset selection can be used to find a reduced set of attributes. Relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to delete attributes that do not contribute to the classification or prediction task.
Data transformation and reduction: the data may be transformed by normalization. Data can also be transformed by generalizing it to higher-level concepts, e.g. the attribute income can be generalized to discrete ranges such as low, medium, and high.

57. What are the criteria used in comparing classification and prediction methods?

Accuracy: the accuracy of a classifier refers to the ability of the classifier to correctly predict the class label of new or previously unseen data (i.e. tuples without class label information). The accuracy of a predictor refers to how well the given predictor can guess the value of the predicted attribute for new or unseen data.
Speed: this refers to the computational cost involved in generating and using the given classifier or predictor.
Robustness: the ability to make correct predictions from given noisy data or data with missing values.
Scalability: the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability: this refers to the level of understanding and insight that is provided by the classifier or predictor.

58. What are Bayesian classifiers?

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.

59. Define Bayes' theorem.

Let X be a data tuple. In Bayesian terms, X is considered evidence. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. P(H|X) is the posterior probability of H conditioned on X. Suppose X is a 35-year-old customer with an income of $40,000, and H is the hypothesis that X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability of H: the probability that any given customer will buy a computer, regardless of age, income, or any other information. P(X|H) is the posterior probability of X conditioned on H, i.e. the probability that a customer X is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X: the probability that a person from our set of customers is 35 years old and earns $40,000.
How are the probabilities estimated? P(H), P(X|H), and P(X) may be estimated from the given data. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X|H), and P(X). Bayes' theorem is
P(H|X) = [ P(X|H) P(H) ] / P(X)
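A tiny sketch that applies the theorem to the customer example above; the three probabilities are made up purely for illustration:

```python
def posterior(p_x_given_h, p_h, p_x):
    # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
    return p_x_given_h * p_h / p_x

# Illustrative values of P(X|H), P(H), and P(X) for the
# "customer will buy a computer" hypothesis.
print(posterior(p_x_given_h=0.30, p_h=0.40, p_x=0.25))  # -> 0.48
```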

60. What are Bayesian belief networks? Give an example.

Bayesian belief networks specify joint conditional probability distributions. They provide a graphical model of causal relationships, on which learning can be performed. Bayesian belief networks can be used for classification.
A belief network is defined by two components: a directed acyclic graph (DAG) and a set of conditional probability tables. Each node in the DAG represents a random variable. The variables may correspond to actual attributes given in the data, or to hidden variables believed to form a relationship (e.g. in the case of medical data, a hidden variable may indicate a syndrome, representing a number of symptoms that together characterize a specific disease). Each arc represents a probabilistic dependence: if an arc is drawn from node Y to node Z, then Y is the parent or immediate predecessor of Z, and Z is the descendant of Y.

(a) A simple Bayesian belief network, with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea.


(b) The conditional probability table (CPT) for the variable LungCancer (LC), showing each possible combination of the values of its parents, FamilyHistory (FH) and Smoker (S):

        FH,S    FH,~S   ~FH,S   ~FH,~S
LC      0.8     0.5     0.7     0.1
~LC     0.2     0.5     0.3     0.9

A belief network has one conditional probability table (CPT) for each variable. The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y.

61. What is rule-based classification?

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion
E.g.:
R1: IF age = youth AND student = yes THEN buys_computer = yes
Explanation: the IF part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN part (or right-hand side) is the rule consequent. The rule R1 can also be written as
R1: (age = youth) AND (student = yes) => (buys_computer = yes)
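The rule R1 can be read as a simple predicate over a tuple of attribute values; a minimal sketch (the attribute names follow the example above, and the test tuple is invented):

```python
def rule_r1(tuple_):
    # R1: IF age = youth AND student = yes THEN buys_computer = yes
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "buys_computer = yes"
    return None  # the rule is not triggered for this tuple

print(rule_r1({"age": "youth", "student": "yes", "income": "medium"}))
```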

62. What is the sequential covering algorithm? How is it different from decision tree induction?

IF-THEN rules can be extracted directly from the training data (i.e. without having to generate a decision tree first) using a sequential covering algorithm. Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules. Popular sequential covering algorithms are AQ, CN2, and the more recent RIPPER. The general strategy is as follows: rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This sequential learning of rules is in contrast to decision tree induction, where the path to each leaf in a decision tree corresponds to a rule.

63. What is back propagation?

Back propagation is a neural network learning algorithm. A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class labels of the input tuples. Back propagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual target value.

64. What is associative classification?

In associative classification, association rules are generated and analyzed for use in classification. The general idea is that we can search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Decision tree induction considers only one attribute at a time, whereas association rules explore highly confident associations among multiple attributes.
Various associative classification methods are:
CBA (Classification Based on Associations): CBA uses an iterative approach to frequent itemset mining.
CMAR (Classification based on Multiple Association Rules): it differs from CBA in its strategy for frequent itemset mining and its construction of the classifier.


65. What are k-nearest-neighbour classifiers?

Nearest-neighbour classifiers are based on learning by analogy, i.e. by comparing a given tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
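A minimal k-nearest-neighbour sketch over numeric attribute vectors, using Euclidean distance and a majority vote among the k closest training tuples; the training data and query are invented:

```python
from collections import Counter

def knn_classify(training, query, k):
    # training: list of (attribute_vector, class_label) pairs.
    dist = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    neighbours = sorted(training, key=lambda t: dist(t[0], query))[:k]
    # Majority class among the k nearest training tuples.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [([170, 65], "safe"), ([160, 72], "risky"),
         ([175, 80], "safe"), ([158, 59], "risky")]
print(knn_classify(train, [168, 63], k=3))
```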

66. What is regression analysis?

Regression analysis can be used to model the relationship between one or more independent or predictor variables and a dependent or response variable (which is continuous-valued). The predictor variables are the attributes of interest describing the tuple (i.e. making up the attribute vector). In general, the values of the predictor variables are known. The response variable is what we want to predict. Given a tuple described by predictor variables, we want to predict the associated value of the response variable. Many problems can be solved by linear regression, and several packages exist to solve regression problems, for example SAS, SPSS, and S-Plus.

67. What is non-linear regression?

If a given response variable and predictor variable have a relationship that may be modeled by a polynomial function, it is called non-linear regression or polynomial regression. It can be modeled by adding polynomial terms to the basic linear model.

68. Explain clustering by k-means partitioning.

The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster (the cluster centroid or center of gravity).
How does the k-means algorithm work? The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as
E = Σ(i=1..k) Σ(p ∈ Ci) |p − mi|²
where E is the sum of the square errors for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.
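A short sketch of the k-means loop described above, for one-dimensional points; the data, the value of k, and the simple initialization are illustrative only (real implementations use random initialization and a convergence test):

```python
def k_means(points, k, iterations=10):
    # Start with the first k points as initial cluster means (for illustration).
    means = points[:k]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest cluster mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Update step: recompute each cluster mean.
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means, clusters

means, clusters = k_means([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2)
print(means, clusters)
```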

69. What is the k-medoids method?

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. The absolute-error criterion is defined as
E = Σ(j=1..k) Σ(p ∈ Cj) |p − oj|
where E is the sum of the absolute errors for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
The algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.

70. What is outlier detection and analysis?

One person's noise could be another person's signal. Outliers are data objects that do not comply with the general behavior or the model of the data. Outliers can be caused by measurement or execution errors. Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. However, outliers may be of particular interest, such as in the case of fraud detection.

71. What is outlier mining?

Outlier detection and analysis is an interesting data mining task referred to as outlier mining. Outlier mining has wide applications; it can be used in fraud detection (detecting unusual usage of credit cards, etc.). Outlier mining can be described as follows: given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considered the most dissimilar, exceptional, or inconsistent with respect to the remaining data.

72. Explain in brief the different outlier detection approaches.

Computer-based methods for outlier detection fall into four approaches:
The statistical approach
The distance-based approach
The density-based local outlier approach
The deviation-based approach

    UNIT V

1. What are the areas in which data warehouses are used at present and in the future?

The potential subject areas in which data warehouses may be developed at present and also in the future are:
(i) Census data: The Registrar General and Census Commissioner of India decennially compiles information on all individuals, villages, population groups, etc. This information is wide-ranging, such as the individual slip, a compilation of information on individual households, of which a database of a 5% sample is maintained for analysis. A data warehouse can be built from this database, upon which OLAP techniques can be applied; data mining can also be performed for analysis and knowledge discovery.
(ii) Prices of essential commodities: The Ministry of Food and Civil Supplies, Government of India compiles daily data for about 300 observation centers in the entire country on the prices of essential commodities such as rice, edible oil, etc. A data warehouse can be built for this data, and OLAP techniques can be applied for its analysis.

2. What are the other areas for data warehousing and data mining?

Agriculture
Rural development
Health
Planning
Education
Commerce and trade

3. Specify some of the sectors in which data warehousing and data mining are used.

Tourism
Program implementation
Revenue
Economic affairs
Audit and accounts

4. Describe the use of DBMiner.

DBMiner is used to perform data mining functions, including characterization, association, classification, prediction, and clustering.

5. Applications of DBMiner.

The DBMiner system can be used as a general-purpose online analytical mining system for both OLAP and data mining in relational databases and data warehouses. It is used in medium to large relational databases with fast response times.

6. Give some data mining tools.

DBMiner
GeoMiner
MultimediaMiner
WeblogMiner

7. Mention some of the application areas of data mining.

DNA analysis
Financial data analysis
Retail industry
Telecommunication industry
Market analysis
Banking industry
Health care analysis

8. Differentiate data query and knowledge query.

A data query finds concrete data stored in a database and corresponds to a basic retrieval statement in a database system.
A knowledge query finds rules, patterns, and other kinds of knowledge in a database and corresponds to querying database knowledge, including deduction rules, integrity constraints, generalized rules, frequent patterns, and other regularities.

9. Differentiate direct query answering and intelligent query answering.

Direct query answering means that a query is answered by returning exactly what is being asked.
Intelligent query answering consists of analyzing the intent of the query and providing generalized, neighborhood, or associated information relevant to the query.

10. Define visual data mining.

Visual data mining discovers implicit and useful knowledge from large data sets using data and/or knowledge visualization techniques. It is the integration of data visualization and data mining.

11. What does audio data mining mean?

Audio data mining uses audio signals to indicate patterns of data or the features of data mining results. Patterns are transformed into sound and music, so that interesting or unusual patterns can be identified by listening to pitch, rhythm, tune, and melody.

Steps involved in DNA analysis:
Semantic integration of heterogeneous, distributed genome databases
Similarity search and comparison among DNA sequences
Association analysis: identification of co-occurring gene sequences
Path analysis: linking genes to different stages of disease development
Visualization tools and genetic data analysis

12. What are the factors involved while choosing a data mining system?

Data types
System issues
Data sources
Data mining functions and methodologies
Coupling data mining with database and/or data warehouse systems
Scalability
Visualization tools
Data mining query language and graphical user interface

    13. Define DMQL

Data Mining Query Language. It specifies clauses and syntax for performing different types of data mining tasks, for example data classification, data clustering and mining association rules. It also uses SQL-like syntax to mine databases.

    14. Define text mining

Extraction of meaningful information from large amounts of free-format textual data. Useful in artificial intelligence and pattern matching. Also known as text data mining, knowledge discovery from text, or content analysis.

15. What does web mining mean?

Technique to process information available on the web and search for useful data. It is used to discover web pages, text documents, multimedia files, images, and other types of resources from the web. Used in several fields such as e-commerce, information filtering, fraud detection, education and research.

    16.Define spatial data mining.

Extracting undiscovered and implied spatial information. Spatial data: data that is associated with a location. Used in several fields such as geography, geology, medical imaging, etc.

    17. Explain multimedia data mining.

Mines large multimedia databases. It does not retrieve any specific information from multimedia databases; it derives new relationships, trends, and patterns from stored multimedia data. Used in medical diagnosis, stock markets, the animation industry, the airline industry, traffic management systems, surveillance systems, etc.

    18. What is Time Series Analysis?

A time series is a set of attribute values over a period of time. Time series analysis may be viewed as finding patterns in the data and predicting future values.

19. What are the various detected patterns?

Detected patterns may include:
Trends: may be viewed as systematic non-repetitive changes to the values over time.
Cycles: the observed behavior is cyclic.
Seasonal: the detected patterns may be based on time of year, month or day.
Outliers: to assist in pattern detection, techniques may be needed to remove or reduce the impact of outliers.

    20.What is a spatial database?

A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data.

    21.Define spatial data mining?

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in the spatial database. Such mining demands an integration of data mining with spatial databases.

    22. What is spatial data warehouse?

A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision making processes.

23. What are the different dimensions in a spatial data cube?

There are three dimensions in a spatial data cube:
a. Nonspatial dimension
b. Spatial-to-nonspatial dimension
c. Spatial-to-spatial dimension

    24. What is progressive refinement?

Progressive refinement is an optimization method for mining spatial association rules from a spatial database. This method first mines large data sets roughly using a fast algorithm and then improves the quality of mining in a pruned data set using a more expensive algorithm. A filter-and-refine sketch of this two-pass idea is given below.
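A minimal, illustrative sketch of the filter-and-refine (progressive refinement) idea in Python, assuming simple 2-D points: a cheap bounding-box test prunes the data roughly, and an exact Euclidean-distance test refines the result. All names and data here are hypothetical, not part of any particular spatial mining system.

import math

def coarse_filter(points, center, radius):
    # Cheap pass: keep points whose bounding box could lie within the radius.
    cx, cy = center
    return [p for p in points
            if abs(p[0] - cx) <= radius and abs(p[1] - cy) <= radius]

def refine(candidates, center, radius):
    # Expensive pass: exact Euclidean distance on the pruned candidate set.
    cx, cy = center
    return [p for p in candidates
            if math.hypot(p[0] - cx, p[1] - cy) <= radius]

points = [(1, 1), (2, 9), (3, 2), (8, 8)]
candidates = coarse_filter(points, center=(2, 2), radius=2.5)   # rough mining
result = refine(candidates, center=(2, 2), radius=2.5)          # refined result
print(result)   # [(1, 1), (3, 2)]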

    25. What is spatial classification?

Spatial classification analyzes spatial objects to derive classification schemes in relevance to certain spatial properties such as the neighborhood of a district, highway, river, etc.

26. What are the two multimedia indexing and retrieval systems?

1) Description-based retrieval systems: build indices and perform object retrieval based on image descriptions such as keywords, time of creation, size, etc.
2) Content-based retrieval systems: support retrieval based on the image content, such as color histogram, texture, shape, etc.

27. What are the retrieval methods (based on signature) proposed for similarity-based retrieval in image databases?

a. Color histogram-based signature
b. Multifeature composed signature
c. Wavelet-based signature
d. Wavelet-based signature with region-based granularity

    28. What is feature descriptor?

A feature descriptor is a set of vectors for each visual characteristic. The main vectors are the color vector, an MFO (Most Frequent Orientation) vector, and an MFC (Most Frequent Color) vector.

    29. What is a time series database?

A time series database consists of a sequence of values or events that change with time. It is a sequence database in which the values are measured at equal intervals.

    30. What is a sequence database?

A sequence database contains sequences of ordered events, with or without a concrete notion of time.


31. What are the two popular data-independent transformations?

1) Discrete Fourier Transform (DFT)
2) Discrete Wavelet Transform (DWT)
A small DFT-based similarity sketch is given below.
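A minimal sketch, assuming NumPy is installed, of how a DFT can reduce a time series to a few leading coefficients that serve as a compact signature for similarity search; the series values and function names are illustrative only.

import numpy as np

def dft_signature(series, k=4):
    # Keep only the first k Fourier coefficients as a compact signature.
    coeffs = np.fft.fft(np.asarray(series, dtype=float))
    return coeffs[:k]

def signature_distance(a, b, k=4):
    # Euclidean distance between the truncated DFT signatures.
    return np.linalg.norm(dft_signature(a, k) - dft_signature(b, k))

s1 = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0]
s2 = [1.1, 2.1, 2.9, 2.0, 1.0, 2.1, 3.0, 1.9]
print(signature_distance(s1, s2))   # small value => similar series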

32. What are similarity searches that handle gaps and differences in offsets and amplitudes?

The searches that handle gaps and amplitudes are:
Atomic matching
Window stitching
Subsequence ordering

33. What are the parameters that affect the result of sequential pattern mining?

    o Duration

    o Event folding window

    o Interval

34. What is a serial episode and a parallel episode?

A serial episode is a set of events that occurs in a total order, whereas a parallel episode is a set of events whose occurrence ordering is trivial.

35. What is periodicity analysis? What are the problems in periodicity analysis?

Periodicity analysis is the mining of periodic patterns, i.e. the search for recurring patterns in a time series database. The following are the problems in periodicity analysis:
a. Mining full periodic patterns
b. Mining partial periodic patterns
c. Mining cyclic or periodic association rules
A simple period-detection sketch is given below.
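A minimal sketch, assuming NumPy, of one simple way to spot a candidate period in a series: compute the autocorrelation at each lag and pick the strongest lag. The data and function name are illustrative only.

import numpy as np

def best_period(series, max_lag=None):
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    max_lag = max_lag or len(x) // 2
    # Autocorrelation at each candidate lag; the strongest lag is a candidate period.
    scores = [np.dot(x[:-lag], x[lag:]) for lag in range(1, max_lag + 1)]
    return int(np.argmax(scores)) + 1

series = [1, 5, 2, 1, 5, 2, 1, 5, 2, 1, 5, 2]
print(best_period(series))   # 3, the length of the repeating cycle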

    36. What is text database?

A text database consists of a large collection of documents from various sources such as news articles, research papers, books, digital libraries, e-mail messages, and web pages. Data is stored in semi-structured form.

37. What is Information Retrieval (IR)?

Information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents.

38. What are the basic measures for assessing the quality of text retrieval?

1) Precision: this is the percentage of retrieved documents that are in fact relevant to the query.
2) Recall: this is the percentage of documents that are relevant to the query and were, in fact, retrieved.
A small sketch computing both measures is given below.
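A minimal sketch computing precision and recall from two sets of document identifiers (the retrieved set and the relevant set); the document IDs are made up for illustration.

def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...)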

39. Write short notes on the multidimensional data model.

Data warehouses and OLAP tools are based on a multidimensional data model. This model is used for the design of corporate data warehouses and departmental data marts. This model contains the star schema, snowflake schema and fact constellation schemas. The core of the multidimensional model is the data cube.

    40.Define data cube?

It consists of a large set of facts (or) measures and a number of dimensions.

41. What are facts?

Facts are numerical measures. Facts can also be considered as quantities by which we can analyze the relationship between dimensions.

    42.What are dimensions?


Dimensions are the entities (or) perspectives with respect to an organization for keeping records and are hierarchical in nature.

43. Define dimension table?

A dimension table is used for describing the dimension. (e.g.) A dimension table for item may contain the attributes item_name, brand and type.

    44.Define fact table?

A fact table contains the names of facts (or) measures as well as keys to each of the related dimension tables. A tiny sketch of a fact table with keys into a dimension table is given below.
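A minimal sketch, using made-up sales data, of how a fact table holds numerical measures plus keys into dimension tables; plain Python dictionaries stand in for the tables here.

# Dimension table: describes the 'item' dimension.
item_dim = {
    1: {"item_name": "TV", "brand": "Acme", "type": "electronics"},
    2: {"item_name": "Shirt", "brand": "Wear", "type": "clothing"},
}

# Fact table: numerical measures plus foreign keys into the dimension tables.
sales_facts = [
    {"item_key": 1, "time_key": 202401, "units_sold": 5, "dollars_sold": 2500.0},
    {"item_key": 2, "time_key": 202401, "units_sold": 10, "dollars_sold": 300.0},
]

# Join a fact row to its item dimension to analyze a measure by brand.
for fact in sales_facts:
    brand = item_dim[fact["item_key"]]["brand"]
    print(brand, fact["dollars_sold"])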

45. What is a lattice of cuboids?

In the data warehousing research literature, a cube can also be called a cuboid. For different sets of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization. The lattice of cuboids is also referred to as a data cube.

    46.What is apex cuboid?

The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is typically denoted by 'all'. A small sketch enumerating the lattice of cuboids is given below.
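A minimal sketch enumerating the 2^n cuboids of the lattice for three illustrative dimensions; the empty combination corresponds to the apex cuboid and the full combination to the base cuboid.

from itertools import combinations

dimensions = ["time", "item", "location"]   # illustrative dimensions

cuboids = []
for k in range(len(dimensions) + 1):
    for combo in combinations(dimensions, k):
        cuboids.append(combo)

print(len(cuboids))   # 8 cuboids = 2^3
print(cuboids[0])     # () -> the apex (0-D) cuboid, i.e. 'all'
print(cuboids[-1])    # ('time', 'item', 'location') -> the base cuboid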

47. List out the components of star schema?

A multidimensional data model can exist in the form of a star schema. It consists of:
a) A large central table (fact table) containing the bulk of data with no redundancy.
b) A set of smaller attendant tables (dimension tables), one for each dimension.

    48.What is snowflake schema?

The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the tables into additional tables.

49. List out the components of fact constellation schema?

This requires multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars and hence it is known as a galaxy schema (or) fact constellation schema.

50. Point out the major difference between the star schema and the snowflake schema?

The dimension tables of the snowflake schema model may be kept in normalized form to reduce redundancies. Such a table is easy to maintain and saves storage space.

51. Which is popular in the data warehouse design, star schema model (or) snowflake schema model?

Star schema model, because the snowflake structure can reduce the effectiveness of browsing and more joins will be needed to execute a query.

52. Define concept hierarchy?

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts.

    53.Define total order?

If the attributes of a dimension form a concept hierarchy such as street < city < province_or_state < country, they are said to form a total order, since the attributes are arranged in a fully ordered sequence from the lowest to the highest conceptual level.


Design and construction of data warehouses based on the benefits of data mining
Multidimensional analysis of sales, customers, products, time and region
Analysis of the effectiveness of sales campaigns
Customer retention: analysis of customer loyalty
Purchase recommendation and cross-reference of items

66. Name some of the data mining applications

Data mining for biomedical and DNA data analysis
Data mining for financial data analysis
Data mining for the retail industry
Data mining for the telecommunication industry

67. What are the features of object-relational and object-oriented databases?

Both kinds of systems deal with the efficient storage and access of vast amounts of disk-based complex structured objects. They organize a large set of data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with:
a. an object identifier
b. a set of attributes
c. a set of methods that specify the computational routines or rules associated with each object class

68. How is data mining performed on complex data types?

Vast amounts of data are stored in various complex forms. The complex data types include objects, spatial data, multimedia data, text data and web data. Multidimensional analysis and data mining can be performed by:
a. class-based generalization of complex objects, including set-valued, list-valued, class-subclass hierarchies, and class composition hierarchies
b. constructing an object data cube
c. performing generalization-based mining

69. Give an example of a star schema of a spatial data warehouse.

There are 3,000 weather probes distributed in British Columbia (BC), Canada, each recording daily temperature and precipitation for a designated small area and transmitting signals to a provincial weather station. With a spatial data warehouse that supports spatial OLAP, a user can view weather patterns on a map by month, by region, etc.

70. How is a spatial data warehouse constructed?

As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of both spatial and nonspatial data.

    71. What are spatial association rules?

Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form A => B, where A and B are sets of spatial or nonspatial predicates.


There are a tremendous number of online documents available. Automated document classification is an important text mining task, as a need exists to automatically organize documents into classes to facilitate document retrieval and subsequent analysis.
A general procedure for automated document classification: first, a set of pre-classified documents is taken as a training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. The so-derived classification scheme can be used for classification of other online documents.
A few typical classification methods used in text classification are:
a. Nearest-neighbour classification
b. Feature selection methods
c. Bayesian classification

76. Explain briefly some data classification methods.

a. Nearest-neighbor classification: k-nearest-neighbor classification is based on the intuition that similar documents are expected to be assigned the same class label.
i) We can simply index the training documents and associate each with a class label.
ii) The class label of a test document can be determined based on the class label distribution of its k nearest neighbors.
By tuning k and incorporating refinements, this kind of classification can achieve accuracy close to the best classifiers.
b. Feature selection: a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels.
c. Bayesian classification: first trains the model by calculating a generative document distribution P(d|c) for each class c and document d, and then tests which class is most likely to generate the test document.
A minimal k-nearest-neighbor sketch is given below.
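A minimal sketch of k-nearest-neighbor text classification, assuming scikit-learn is installed; the tiny training corpus and class labels are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Pre-classified training documents and their class labels.
train_docs = ["stock prices rise", "market falls sharply",
              "team wins the match", "player scores a goal"]
train_labels = ["finance", "finance", "sports", "sports"]

# Index the training documents as TF-IDF vectors.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Classify a new document by the labels of its k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["the match ends with a late goal"])
print(knn.predict(X_test))   # expected: ['sports']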

77. What are the different methods of document clustering?

Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner (the class labels are not known beforehand).
a. Spectral clustering method: first performs spectral embedding (dimensionality reduction) on the original data, and then applies a traditional clustering algorithm (e.g. k-means) on the reduced document space.
b. The mixture model clustering method: models the text data with a mixture model (involving multinomial component models). Clustering involves two steps: (1) estimating the model parameters based on the text data and any additional prior knowledge, and (2) inferring the clusters based on the estimated model parameters.
c. The latent semantic indexing (LSI) and locality preserving indexing (LPI) methods: these are linear dimensionality reduction methods. We can acquire transformation vectors or an embedding function through which we embed all of the data into a lower-dimensional space.
A minimal dimensionality-reduction-plus-k-means sketch is given below.
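A minimal sketch, assuming scikit-learn, of document clustering that first applies an LSI-style linear reduction (truncated SVD on TF-IDF vectors) and then runs k-means on the reduced document space; the documents are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["stock prices rise", "market falls sharply",
        "team wins the match", "player scores a goal"]

# TF-IDF term space, then LSI-style reduction to 2 dimensions.
X = TfidfVectorizer().fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# Traditional clustering (k-means) on the reduced document space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)   # two clusters, e.g. [0 0 1 1]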

78. What is a time series database?

A time series database consists of sequences of values or events obtained over repeated measurements of time (hourly, daily, weekly). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, workload projections, process and quality control, natural phenomena (such as atmosphere, temperature, wind, earthquakes), scientific and engineering experiments and medical treatments. The amount of time-series data is increasing rapidly (gigabytes per day, such as in stock trading) or even per minute (such as in NASA space programs). A need exists to find correlation relationships within time series data, as well as to analyze huge numbers of regular patterns, trends, bursts (such as sudden sharp changes) and outliers, with fast or even real-time response.

79. What is trend analysis?

Time series data involving a variable Y, representing, say, the closing price of a share in a stock market, can be viewed as a function of time t, that is, Y = f(t). Such a function can be illustrated as a time-series graph.
How can we study the time series data? There are two goals:
(1) Modelling time series (to gain insight into the mechanisms or underlying forces that generate the time series).
(2) Forecasting time series (to predict the future values of the time series variables).
Trend analysis consists of the following 4 major components:
1) Trend or long-term movements: displayed by a trend curve or a trend line. A moving-average sketch for extracting such a trend is given below.
2) Cyclic movements or cyclic variations: refer to cycles, the long-term oscillations about a trend line or curve.
3) Seasonal movements or variations: these are systematic or calendar related. Eg: events that recur annually, such as a sudden increase in sales of items before Christmas, or the observed increase in water consumption during summer.
4) Irregular or random movements. Eg: floods, personnel changes within companies.

Fig: Time series data of stock price. The dashed curve shows the trend.
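A minimal sketch of extracting the long-term trend component with a simple moving average; the window size and closing prices are illustrative only.

def moving_average(series, window=3):
    # Each output value is the mean of `window` consecutive observations,
    # smoothing short-term fluctuations to reveal the long-term trend.
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

closing_prices = [10, 12, 11, 13, 15, 14, 16, 18]
print(moving_average(closing_prices, window=3))
# [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]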

80. What are the basic measures for text retrieval?

a. Precision: this is the percentage of retrieved documents that are relevant to the query (i.e. correct responses).
precision = |{relevant} ∩ {retrieved}| / |{retrieved}|
b. Recall: this is the percentage of documents that are relevant to the query and were retrieved.
recall = |{relevant} ∩ {retrieved}| / |{relevant}|

    81. What is an object cube?


In an object database, data generalization and multidimensional analysis are not applied to individual objects but to classes of objects. The attribute-oriented induction method developed for mining characteristics of relational databases can be extended to mine data characteristics in object databases. The generalization of multidimensional attributes of a complex object class can be performed by examining each attribute (or dimension), generalizing each attribute to simple-valued data, and constructing a multidimensional data cube, called an object cube. Once an object cube is constructed, multidimensional analysis and data mining can be performed on it in a manner similar to that for relational data cubes.

82. What are the challenges faced in web data mining?

1) The web seems to be too large for effective data warehousing and data mining.
2) The complexity of the web is far greater than that of any text document collection. Web pages lack a unifying structure. They contain far more authoring style and content variations.
3) The web is a highly dynamic information source. News, stock market, weather, etc. are updated regularly on the web.
4) The web serves a broad diversity of user communities. The internet currently connects more than 100 million workstations. Users can easily get lost when groping in the "darkness" of the network.
5) Only a small portion of the information on the web is truly relevant or useful.

83. What are the data mining applications?

1) Intrusion detection
2) Association & correlation analysis
3) Analysis of stream data
4) Distributed data mining
5) Visualizing & querying tools

84. What are the recent trends in data mining?

1) Application exploration: in financial analysis, telecommunications, biomedicine, countering terrorism, etc.
2) Scalable and interactive data mining methods: constraint-based mining
3) Integration of DM systems with DB, DW, and web DB systems
4) Standardization of data mining language
5) Visual data mining
6) New methods for mining complex types of data
7) Biological data mining: mining DNA and protein sequences, etc.
8) DM applications in software engineering
9) Web mining
10) Distributed data mining
11) Real-time or time-critical DM
12) Graph mining
13) Privacy protection and information security in DM
14) Multi-relational & multi-database DM

    85. What is web usage mining?

Besides mining web contents and web structures, another important task for web mining is web usage mining, which mines weblog records to discover user access patterns of web pages. This helps to identify high-potential customers for electronic commerce, improve web server performance, etc. A web server usually registers a weblog entry for every access of a web page. It includes the URL requested, the IP address from which the request originated, and a timestamp. A small sketch of parsing such entries is given below.
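A minimal sketch of pulling the IP address, timestamp and requested URL out of weblog entries in a common-log-format style and counting accesses per page; the log lines are made up for illustration.

import re
from collections import Counter

log_lines = [
    '192.168.1.5 - - [08/Apr/2018:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '192.168.1.7 - - [08/Apr/2018:10:00:09 +0000] "GET /catalog.html HTTP/1.1" 200 734',
    '192.168.1.5 - - [08/Apr/2018:10:01:15 +0000] "GET /index.html HTTP/1.1" 200 512',
]

pattern = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+)')

page_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:
        ip, timestamp, url = match.groups()   # fields available per entry
        page_counts[url] += 1                 # access pattern: hits per page

print(page_counts.most_common())   # [('/index.html', 2), ('/catalog.html', 1)]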

86. What are similarity-based retrieval methods in image databases?


a. Description-based retrieval systems: build indices and perform object retrieval based on image descriptions such as keywords, captions, size and time of creation.
b. Content-based retrieval systems: support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.

87. What are the approaches used for similarity-based retrieval in image databases?

1) Color histogram-based signature: it is based on the color composition of the image. It does not contain any information about shape, image topology or texture. A small sketch of such a signature is given below.
2) Multifeature composed signature: the signature of the image includes multiple features: color histogram, shape, image topology and texture. The extracted features are stored as metadata and images are indexed based on the metadata.
3) Wavelet-based signature: this approach uses the dominant wavelet coefficients of an image as its signature. Wavelets capture shape, texture, and image topology information in a single unified framework.
4) Wavelet-based signature with region-based granularity: the computation and comparison of signatures are at the granularity of regions.
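A minimal sketch, assuming NumPy, of a color-histogram signature for an image stored as an array of RGB pixels, compared with another image by a simple histogram distance; the tiny 2x2 'images' are made up for illustration.

import numpy as np

def color_histogram_signature(image, bins=4):
    # Histogram each RGB channel into `bins` buckets and normalize, giving a
    # compact signature of the image's color composition (no shape or texture).
    channels = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
                for c in range(3)]
    signature = np.concatenate(channels).astype(float)
    return signature / signature.sum()

def histogram_distance(img_a, img_b):
    return np.abs(color_histogram_signature(img_a)
                  - color_histogram_signature(img_b)).sum()

img1 = np.array([[[250, 10, 10], [240, 20, 15]],
                 [[245, 5, 12], [238, 18, 20]]])    # mostly red pixels
img2 = np.array([[[10, 10, 250], [15, 20, 240]],
                 [[12, 5, 245], [20, 18, 238]]])    # mostly blue pixels
print(histogram_distance(img1, img2))   # large value => dissimilar color content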