DWH Two Marks Q & A


    Unit I

1. Define data mining.

Data mining refers to extracting or mining knowledge from large amounts of data. It is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

2. Give some alternative terms for data mining.

Knowledge mining
Knowledge extraction
Data/pattern analysis
Data archaeology
Data dredging

3. What is KDD?

KDD stands for Knowledge Discovery in Databases.

4. What are the steps involved in the KDD process?

Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation

5. What is the use of the knowledge base?

The knowledge base is domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies used to organize attributes or attribute values into different levels of abstraction.

6. Mention some of the data mining techniques.

Statistics
Machine learning
Decision trees
Hidden Markov models
Artificial intelligence
Genetic algorithms
Meta learning

7. Give a few statistical techniques.

Point estimation
Data summarization
Bayesian techniques
Hypothesis testing
Correlation
Regression

8. What is the purpose of a data mining technique?

It provides a way to carry out the various data mining tasks.

9. Define predictive model.

A predictive model is used to predict the values of data by making use of known results from a different set of sample data.

10. Which data mining tasks belong to the predictive model?

Classification
Regression
Time series analysis

11. Define descriptive model.

A descriptive model is used to determine the patterns and relationships in a sample of data. Data mining tasks that belong to the descriptive model:
Clustering
Summarization
Association rules
Sequence discovery

12. Define the term summarization.

Summarization is the condensation of a large chunk of data contained in a web page or a document.


Summarization = characterization = generalization

13. List out the advanced database systems.

Extended-relational databases
Object-oriented databases
Deductive databases
Spatial databases
Temporal databases
Multimedia databases
Active databases
Scientific databases
Knowledge databases

14. Define cluster analysis.

Cluster analysis analyzes data objects without consulting a known class label. The class labels are not present in the training data simply because they are not known to begin with.

15. Classifications of data mining systems.

Based on the kinds of databases mined:
o According to data model
_ Relational mining systems
_ Transactional mining systems
_ Object-oriented mining systems
_ Object-relational mining systems
_ Data warehouse mining systems
o According to types of data
_ Spatial data mining systems
_ Time-series data mining systems
_ Text data mining systems
_ Multimedia data mining systems

Based on the kinds of knowledge mined:
o According to functionalities
_ Characterization
_ Discrimination
_ Association
_ Classification
_ Clustering
_ Outlier analysis
_ Evolution analysis
o According to levels of abstraction of the knowledge mined
_ Generalized knowledge (high level of abstraction)
_ Primitive-level knowledge (raw data level)
o According to whether data regularities or data irregularities are mined

Based on the kinds of techniques utilized:
o According to user interaction
_ Autonomous systems
_ Interactive exploratory systems
_ Query-driven systems
o According to methods of data analysis
_ Database-oriented
_ Data warehouse-oriented
_ Machine learning
_ Statistics
_ Visualization
_ Pattern recognition
_ Neural networks

Based on the applications adopted:
o Finance
o Telecommunication
o DNA
o Stock markets
o E-mail, and so on

16. Describe challenges to data mining regarding data mining methodology and user interaction issues.

Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
Presentation and visualization of data mining results
Handling noisy or incomplete data
Pattern evaluation


17. Describe challenges to data mining regarding performance issues.

Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms

18. Describe issues relating to the diversity of database types.

Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems

19. What is meant by pattern?

A pattern represents knowledge if it is easily understood by humans; valid on test data with some degree of certainty; and potentially useful, novel, or validates a hunch about which the user was curious. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process.

20. How is a data warehouse different from a database?

A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision-making. A database consists of a collection of interrelated data.

21. What are the uses of statistics in data mining?

Statistics is used to
* estimate the complexity of a data mining problem;
* suggest which data mining techniques are most likely to be successful; and
* identify data fields that contain the most surface information.

22. What is the main goal of statistics?

The basic goal of statistics is to extend knowledge about a subset of a collection to the entire collection.

23. What are the factors to be considered while selecting the sample in statistics?

The sample should be
* large enough to be representative of the population,
* small enough to be manageable,
* accessible to the sampler, and
* free of bias.

24. Name some advanced database systems.

Object-oriented databases, object-relational databases.

25. Name some specific application-oriented databases.

Spatial databases, time-series databases, text databases, and multimedia databases.

26. Define relational databases.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

27. Define transactional databases.

A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

    28.Define Spatial Databases.


Spatial databases contain spatial-related information. Such databases include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.

29. What is a temporal database?

A temporal database stores time-related data. It usually stores relational data that include time-related attributes. These attributes may involve several time stamps, each having different semantics.

30. What is a time-series database?

A time-series database stores sequences of values that change with time, such as data collected regarding the stock exchange.

31. What is a legacy database?

A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems.

32. What are the steps in the data mining process?

a. Data cleaning
b. Data integration
c. Data selection
d. Data transformation
e. Data mining
f. Pattern evaluation
g. Knowledge representation

33. Define data cleaning.

Data cleaning means removing inconsistent data or noise and collecting the necessary information.

34. Define data mining.

Data mining is a process of extracting or mining knowledge from huge amounts of data.

35. Define pattern evaluation.

Pattern evaluation is used to identify the truly interesting patterns representing knowledge, based on some interestingness measures.

36. Define knowledge representation.

Knowledge representation techniques are used to present the mined knowledge to the user.

37. What is visualization?

Visualisation is the depiction of data, used to gain intuition about the data being observed. It assists analysts in selecting display formats, viewer perspectives, and data representation schemas.

38. Name some conventional visualization techniques.

Histogram
Relationship tree
Bar charts
Pie charts
Tables, etc.

39. Give the features included in modern visualisation techniques.

a. Morphing
b. Animation
c. Multiple simultaneous data views
d. Drill-down
e. Hyperlinks to related data sources

40. Define conventional visualisation.

Conventional visualisation depicts information about a population and not the population data itself.


41. Define spatial visualisation.

Spatial visualisation depicts actual members of the population in their feature space.

42. What are descriptive and predictive data mining?

Descriptive data mining describes the data set in a concise and summarizing manner and presents interesting general properties of the data. Predictive data mining analyzes the data in order to construct one or a set of models and attempts to predict the behavior of new data sets.

43. Merits of a data warehouse.

* Ability to make effective decisions from the database
* Better analysis of data and decision support
* Discovery of trends and correlations that benefit business
* Handling of huge amounts of data

44. What are the characteristics of a data warehouse?

* Separate
* Available
* Integrated
* Subject oriented
* Not dynamic
* Consistency
* Iterative development
* Aggregation performance

45. List some of the data warehouse tools.

* OLAP (OnLine Analytic Processing)
* ROLAP (Relational OLAP)
* End-user data access tools
* Ad hoc query tools
* Data transformation services
* Replication

46. Explain OLAP.

OLAP is the general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting that is exemplified by a number of OLAP vendors. The OLAP vendors' technology is nonrelational and is almost always based on an explicit multidimensional cube of data. OLAP databases are also known as multidimensional cube databases.

47. Explain ROLAP.

ROLAP stands for Relational OnLine Analytic Processing. It is a set of user interfaces and applications that give a relational database a dimensional flavour.

    UNIT-II

1. Define data warehouse.

A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site to facilitate management decision making. (or) A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

2. What are operational databases?

The large databases that organizations maintain and that are updated by daily transactions are called operational databases.

3. Define OLTP.

If an on-line operational database system is used for efficient retrieval, efficient storage, and management of large amounts of data, then the system is said to be an on-line transaction processing (OLTP) system.


4. Define OLAP.

Data warehouse systems serve users (or) knowledge workers in the role of data analysis and decision making. Such systems can organize and present data in various formats. These systems are known as on-line analytical processing (OLAP) systems.

5. How is a database design represented in OLTP systems?

Entity-relationship model

6. How is a database design represented in OLAP systems?

Star schema
Snowflake schema
Fact constellation schema

7. List out the steps of the data warehouse design process.

_ Choose a business process to model.
_ Choose the grain of the business process.
_ Choose the dimensions that will apply to each fact table record.
_ Choose the measures that will populate each fact table record.

8. What is an enterprise warehouse?

An enterprise warehouse collects all the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one (or) more operational systems (or) external information providers. It contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, (or) beyond. An enterprise data warehouse may be implemented on traditional mainframes, UNIX super servers, (or) parallel architecture platforms. It requires business modeling and may take years to design and build.

9. What is a data mart?

A data mart is a database that contains a subset of the data present in a data warehouse. Data marts are created to structure the data in a data warehouse according to issues such as hardware platforms and access control strategies. We can divide a data warehouse into data marts after the data warehouse has been created. Data marts are usually implemented on low-cost departmental servers that are UNIX (or) Windows/NT based. The implementation cycle of a data mart is likely to be measured in weeks rather than months (or) years.

10. What are dependent and independent data marts?

Dependent data marts are sourced directly from enterprise data warehouses. Independent data marts contain data captured from one (or) more operational systems (or) external information providers, (or) data generated locally within a particular department (or) geographic area.

11. Define indexing.

Indexing is a technique used for efficient data retrieval, (or) accessing data in a faster manner. When a table grows in volume, the indexes also increase in size, requiring more storage.

12. What are the types of indexing?

_ B-tree indexing
_ Bit map indexing
_ Join indexing

13. Define metadata.

Metadata used in a data warehouse describes data about data, i.e. metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse.

14. Define VLDB.

VLDB stands for Very Large Data Base. If the size of a database is greater than 100 GB, then the database is said to be a very large database.

15. What is data cleaning?

Data cleaning routines remove incomplete, noisy, and inconsistent data by
- filling in missing values,
- smoothing out noise,
- identifying outliers, and
- correcting inconsistencies in the data.

16. Mention the categories of data that may be encountered in mining.

The data used in the analysis by data mining techniques may fall under the following categories:
Incomplete data - data lacking attribute values or certain attributes of interest.
Noisy data - data containing errors or outlier values that deviate from the expected. Noise is defined as a random error or variance in a measured variable.
Inconsistent data - there may be inconsistencies in the data recorded in some transactions, inconsistencies due to data integration (where a given attribute may have different names in different databases), or inconsistencies due to data redundancy.

17. What are the various data smoothing techniques to remove noise?

The various data smoothing techniques are
Binning
Clustering
Combined computer and human inspection
Regression

18. What is binning?

Binning is used to smooth data values by consulting the neighbourhood of values. The sorted values are distributed into a number of buckets or bins; the data are first sorted and then partitioned into equidepth bins. There are three types of binning:
Smoothing by bin means - each value is replaced by the mean value of the bin.
Smoothing by bin medians - each bin value is replaced by the bin median.
Smoothing by bin boundaries - the maximum and minimum values in the bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.
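A minimal Python sketch of equidepth binning with smoothing by bin means; the data values and bin count below are invented for illustration:

```python
# Equidepth binning with smoothing by bin means (illustrative sketch).
def smooth_by_bin_means(values, num_bins):
    # The data are first sorted, as the definition above requires.
    sorted_vals = sorted(values)
    bin_size = len(sorted_vals) // num_bins
    smoothed = []
    for i in range(num_bins):
        start = i * bin_size
        # The last bin absorbs any leftover values.
        end = (i + 1) * bin_size if i < num_bins - 1 else len(sorted_vals)
        bin_vals = sorted_vals[start:end]
        mean = sum(bin_vals) / len(bin_vals)
        # Every value in the bin is replaced by the bin mean.
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

# Example: nine prices partitioned into 3 equidepth bins.
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```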

19. What is data integration? What are the issues to be considered while integrating data?

Data integration combines data from multiple sources into a coherent data store. Issues to be considered are
a) the entity identification problem,
b) correlation analysis, and
c) detection and resolution of data value conflicts.

20. What is data transformation? What are the various methods of transforming data?

Data transformation transforms and consolidates data into forms appropriate for mining. The following are various methods of transforming data:
i. Smoothing
ii. Aggregation
iii. Generalization
iv. Normalization
v. Attribute construction
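As an illustration of the normalization step, a small sketch of min-max normalization to the range [0, 1]; the income values and range are invented for the example:

```python
# Min-max normalization: rescale values into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min  # assumes the values are not all identical
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

# Example: incomes rescaled so that they fall between 0 and 1.
print(min_max_normalize([12000, 35000, 54000, 98000]))
```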

    UNIT III

1. Define the concept of classification.

Classification is a two-step process:
1. A model is built describing a predefined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes.
2. The model is used for classification.

2. What is a decision tree?

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The top-most node in a tree is the root node.

3. What is tree pruning?

Tree pruning attempts to identify and remove branches that reflect noise or outliers in the training data, with the goal of improving classification accuracy on unseen data.

4. What is an attribute selection measure?

The information gain measure is used to select the test attribute at each node in the decision tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split.

5. Describe tree pruning methods.

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Approaches:
Pre-pruning
Post-pruning

6. Define pre-pruning.

A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples.

7. Define post-pruning.

Post-pruning removes branches from a fully grown tree. A tree node is pruned by removing its branches.
E.g.: the cost complexity pruning algorithm

8. Define information gain.

The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node.
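A small sketch of how information gain could be computed for a candidate split, assuming plain Python lists of class labels; the labels and the split below are purely illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    # Expected information (entropy) of a set of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    # Gain = entropy(parent) - weighted entropy of the children after the split.
    total = len(parent_labels)
    remainder = sum(len(ch) / total * entropy(ch) for ch in child_label_lists)
    return entropy(parent_labels) - remainder

# Example: 9 "safe" / 5 "risky" tuples split by a hypothetical attribute.
parent = ["safe"] * 9 + ["risky"] * 5
children = [["safe"] * 6 + ["risky"] * 1, ["safe"] * 3 + ["risky"] * 4]
print(information_gain(parent, children))
```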

9. How does tree pruning work?

There are two approaches to tree pruning:
a. In the prepruning approach, a tree is pruned by halting its construction early, e.g. by deciding not to further split the training samples at a given node. Upon halting, the node becomes a leaf node.
b. In the postpruning approach, branches are removed from a fully grown tree. The lowest pruned node becomes a leaf and is labeled by the most frequent class.

10. How are classification rules extracted from a decision tree?

The knowledge represented in a decision tree can be extracted and represented in the form of classification IF-THEN rules. One rule is created for each path from the root to a leaf node.
E.g. IF age =


UNIT IV

... different objects into meaningful and descriptive objects.

4. What are the fields in which clustering techniques are used?

Clustering is used in biology to develop new plant and animal taxonomies.
Clustering is used in business to enable marketers to develop new distinct groups of their customers and characterize each customer group on the basis of purchasing.
Clustering is used in the identification of groups of automobile insurance policy customers.
Clustering is used in the identification of groups of houses in a city on the basis of house type, cost, and geographical location.
Clustering is used to classify documents on the web for information discovery.

5. What are the requirements of cluster analysis?

The basic requirements of cluster analysis are:
Dealing with different types of attributes
Dealing with noisy data
Constraints on clustering
Dealing with arbitrary shapes
High dimensionality
Ordering of input data
Interpretability and usability
Determining input parameters
Scalability

6. What are the different types of data used for cluster analysis?

The different types of data used for cluster analysis are interval-scaled, binary, nominal, ordinal, and ratio-scaled data.

7. What are interval-scaled variables?

Interval-scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or coordinates of any cluster. Distances between such measurements can be calculated using the Euclidean distance or the Minkowski distance.
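A short sketch of the Euclidean and Minkowski distance measures mentioned above for interval-scaled attribute vectors; the sample points are arbitrary:

```python
def minkowski(x, y, p):
    # Minkowski distance of order p between two numeric vectors.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    # The Euclidean distance is the Minkowski distance with p = 2.
    return minkowski(x, y, 2)

# Example: distance between two objects described by height (cm) and weight (kg).
print(euclidean([170, 65], [160, 72]))
print(minkowski([170, 65], [160, 72], 1))  # p = 1 gives the Manhattan distance
```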

8. Define binary variables. What are the two types of binary variables?

Binary variables have two states, 0 and 1; when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those whose states have the same values and weights. Asymmetric binary variables are those whose states do not have the same values and weights.

9. Define nominal, ordinal, and ratio-scaled variables.

A nominal variable is a generalization of the binary variable. A nominal variable has more than two states; for example, a nominal variable color may consist of four states: red, green, yellow, or black. For nominal variables the total number of states is N, and the states are denoted by letters, symbols, or integers.
An ordinal variable also has more than two states, but all these states are ordered in a meaningful sequence.
A ratio-scaled variable makes positive measurements on a non-linear scale, such as an exponential scale, using the formula Ae^(Bt) or Ae^(-Bt), where A and B are constants.

10. What do you mean by the partitioning method?

In the partitioning method, a partitioning algorithm arranges all the objects into various partitions, where the total number of partitions is less than the total number of objects. Each partition represents a cluster. The two types of partitioning methods are k-means and k-medoids.

11. Define CLARA and CLARANS.

Clustering in LARge Applications is called CLARA. The efficiency of CLARA depends upon the size of the representative data set. CLARA does not work properly if any representative data set from the selected representative data sets does not find the best k-medoids.
To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized search (CLARANS), was introduced. CLARANS works like CLARA; the only difference between CLARA and CLARANS is the clustering process that is done after selecting the representative data sets.

12. What is the hierarchical method?

The hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on bottom-up or top-down approaches.

13. Differentiate agglomerative and divisive hierarchical clustering.

Agglomerative hierarchical clustering works on the bottom-up approach. In the agglomerative hierarchical method, each object initially creates its own cluster. The single clusters are merged to make larger clusters, and the process of merging continues until all the singular clusters are merged into one big cluster that consists of all the objects.
Divisive hierarchical clustering works on the top-down approach. In this method all the objects are arranged within one big singular cluster, and the large cluster is continuously divided into smaller clusters until each cluster has a single object.

14. What is CURE?

Clustering Using REpresentatives is called CURE. Clustering algorithms generally work on spherical and similar-size clusters. CURE overcomes the problem of spherical and similar-size clusters and is more robust with respect to outliers.

15. Define the Chameleon method.

Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to overcome the drawbacks of the CURE method. In this method two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within each cluster.

16. Define association rule mining.

Association rule mining searches for interesting relationships among items in a given data set. Rule support and confidence are the two measures of rule interestingness.

17. What is the occurrence frequency of an itemset?

The occurrence frequency of an itemset is the number of transactions that contain the itemset. It is also known as the frequency, support count, or count of the itemset.

18. What are the two steps in mining association rules?

Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.

19. How are association rules classified?

Association rules are classified as follows:
Based on the types of values handled in the rule
Based on the dimensions of data involved in the rule
Based on the levels of abstraction involved in the rule
Based on the various extensions to association mining

20. What is a quantitative association rule?

If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, the quantitative values for items or attributes are partitioned into intervals.

21. What is a Boolean association rule?

If a rule concerns the association between the presence or absence of an item, it is a Boolean association rule.

22. What are single-dimensional and multidimensional association rules?

If the items or attributes in an association rule reference only one dimension of a data cube, it is called a single-dimensional association rule.
If the items or attributes in an association rule reference more than one dimension of a data cube, it is called a multidimensional association rule.

23. What is a multilevel association rule?

If an association rule refers to a dimension at multiple levels of abstraction, it is called a multilevel association rule.
If an association rule does not refer to a dimension at multiple levels of abstraction, it is called a single-level association rule.

24. Define the Apriori algorithm.

Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

25. What is a cuboid?

Data cubes created for varying levels of abstraction are referred to as cuboids. A data cube consists of a lattice of cuboids. Each higher level of abstraction reduces the data size.

26. When can we say the association rules are interesting?

Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.

27. Explain association rules in mathematical notation.

Let I = {i1, i2, ..., im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅. The rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. The rule A => B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.

28. Define support and confidence in association rule mining.

Support s is the percentage of transactions in D that contain A ∪ B.
Confidence c is the percentage of transactions in D containing A that also contain B.
Support (A => B) = P(A ∪ B)
Confidence (A => B) = P(B | A)

Support: support is the ratio of the number of transactions that include all items in the antecedent and consequent parts of the rule to the total number of transactions. Support is an association rule interestingness measure.
Confidence: confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent to the number of transactions that include all items in the antecedent. Confidence is an association rule interestingness measure.
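These definitions can be checked with a small sketch that counts support and confidence over a toy transaction list; the transactions and items are invented for illustration:

```python
def support(transactions, itemset):
    # Fraction of transactions that contain every item in the itemset (A union B).
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # P(B | A): support of A union B divided by support of A.
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

# Example transactions, each a set of purchased items.
T = [{"bread", "milk"}, {"bread", "diaper", "beer"},
     {"milk", "diaper", "beer"}, {"bread", "milk", "diaper"}]
print(support(T, {"bread", "milk"}))       # support of {bread, milk}
print(confidence(T, {"bread"}, {"milk"}))  # confidence of bread => milk
```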

29. How are association rules mined from large databases?

Step 1: Find all frequent itemsets.
Step 2: Generate strong association rules from the frequent itemsets.

30. Describe the different classifications of association rule mining.

Based on the types of values handled in the rule:
i. Boolean association rules
ii. Quantitative association rules
Based on the dimensions of data involved:
i. Single-dimensional association rules
ii. Multidimensional association rules
Based on the levels of abstraction involved:
i. Multilevel association rules
ii. Single-level association rules
Based on various extensions:
i. Correlation analysis
ii. Mining max patterns

31. What are the two main steps in the Apriori algorithm?

1) The join step
2) The prune step
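A compact sketch of the two steps for one level of the search, assuming frequent (k-1)-itemsets are represented as sorted tuples; the item names are invented for illustration:

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    # Join step: merge pairs of frequent (k-1)-itemsets that share
    # their first k-2 items, producing candidate k-itemsets.
    prev = sorted(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            if a < b and a[:-1] == b[:-1]:
                candidates.add(a + b[-1:])
    # Prune step: drop any candidate with a (k-1)-subset that is not
    # frequent (the anti-monotone property from question 33).
    frequent_set = set(frequent_prev)
    return {c for c in candidates
            if all(sub in frequent_set for sub in combinations(c, len(c) - 1))}

# Example: frequent 2-itemsets -> candidate 3-itemsets.
L2 = [("beer", "diaper"), ("beer", "milk"), ("diaper", "milk"), ("bread", "milk")]
print(apriori_gen(L2))  # -> {('beer', 'diaper', 'milk')}
```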

32. What is the purpose of the Apriori algorithm?

The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

    33. Define anti-monotone property.

    If a set cannot pass a test, all of its supersetswill fail the same test as well.

34. How are association rules generated from frequent itemsets?

Association rules can be generated as follows:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s => (l - s) if
support_count(l) / support_count(s) >= min_conf,
where min_conf is the minimum confidence threshold.
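A minimal sketch of this rule-generation step, assuming the support counts of the itemset and all of its subsets are available in a dictionary keyed by frozenset; the counts below are invented:

```python
from itertools import combinations

def rules_from_itemset(itemset, support_count, min_conf):
    # For every nonempty proper subset s of the frequent itemset l,
    # output s => (l - s) if support_count(l) / support_count(s) >= min_conf.
    l = frozenset(itemset)
    rules = []
    for r in range(1, len(l)):
        for subset in combinations(l, r):
            s = frozenset(subset)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Illustrative support counts for {milk, bread, butter} and its subsets.
counts = {frozenset(["milk", "bread", "butter"]): 2,
          frozenset(["milk", "bread"]): 4, frozenset(["milk", "butter"]): 3,
          frozenset(["bread", "butter"]): 3, frozenset(["milk"]): 6,
          frozenset(["bread"]): 7, frozenset(["butter"]): 4}
print(rules_from_itemset(["milk", "bread", "butter"], counts, min_conf=0.6))
```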


35. Give a few techniques to improve the efficiency of the Apriori algorithm.

Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting

36. What are the factors affecting the performance of the Apriori candidate generation technique?

The need to generate a huge number of candidate sets.
The need to repeatedly scan the database and check a large set of candidates by pattern matching.

37. Describe the method of generating frequent itemsets without candidate generation.

Frequent-pattern growth (or FP-growth) adopts a divide-and-conquer strategy.
Steps:
-> Compress the database representing frequent items into a frequent-pattern tree, or FP-tree.
-> Divide the compressed database into a set of conditional databases.
-> Mine each conditional database separately.

38. Define iceberg query.

An iceberg query computes an aggregate function over an attribute or set of attributes in order to find aggregate values above some specified threshold. Given a relation R with attributes a1, a2, ..., an and b, and an aggregate function agg_f, an iceberg query is of the form

select R.a1, R.a2, ..., R.an, agg_f(R.b)
from relation R
group by R.a1, R.a2, ..., R.an
having agg_f(R.b) >= threshold

39. What are hybrid-dimension association rules?

Multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates, are called hybrid-dimension association rules.
E.g. age(X, "20...29") AND buys(X, laptop) => buys(X, b/w printer)

40. Mention a few approaches to mining multilevel association rules.

Uniform minimum support for all levels (uniform support)
Reduced minimum support at lower levels (reduced support)
Level-by-level independent
Level-cross filtering by single item
Level-cross filtering by k-itemset

41. What are multidimensional association rules?

Association rules that involve two or more dimensions or predicates.
Interdimension association rule: a multidimensional association rule with no repeated predicate or dimension.
Hybrid-dimension association rule: a multidimensional association rule with multiple occurrences of some predicates or dimensions.

42. Define constraint-based association mining.

Mining is performed under the guidance of various kinds of constraints provided by the user. The constraints include the following:
Knowledge type constraints
Data constraints
Dimension/level constraints
Interestingness constraints
Rule constraints

43. What is a strong association rule?

Association rules that satisfy both a user-specified minimum confidence threshold and a user-specified minimum support threshold are referred to as strong association rules.

44. What are the various factors used to determine the interestingness measure?

1) Simplicity - the pattern should be simple overall for human comprehension.
2) Certainty - this is the validity or trustworthiness of the pattern.
3) Utility - this is the potential usefulness of the pattern.
4) Novelty - novel patterns provide new information or increase the performance of the pattern.

45. Explain the various OLAP operations.

a) Roll-up: the roll-up operation performs aggregation on a data cube, for example by climbing up a concept hierarchy for a dimension.
b) Drill-down: the reverse of roll-up; it navigates from less detailed data to more detailed data.
c) Slice: performs a selection on one dimension of the given cube, resulting in a subcube.

46. Discuss the concepts of frequent itemset, support, and confidence.

A set of items is referred to as an itemset. An itemset that contains k items is called a k-itemset. An itemset that satisfies minimum support is referred to as a frequent itemset.
Support is the ratio of the number of transactions that include all items in the antecedent and consequent parts of the rule to the total number of transactions.
Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent to the number of transactions that include all items in the antecedent.

47. What is the use of regression?

Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; in actuality, regression takes a set of data and fits the data to a formula.

48. What are the reasons for not using the linear regression model to estimate the output data?

There are many reasons. One is that the data do not fit a linear model. It is possible, however, that the data generally do represent a linear model, but the linear model generated is poor because noise or outliers exist in the data. Noise is erroneous data, and outliers are data values that are exceptions to the usual and expected data.

49. What are the two approaches used by regression to perform classification?

Regression can be used to perform classification using the following approaches:
1. Division: the data are divided into regions based on class.
2. Prediction: formulas are generated to predict the output class value.

50. What is linear regression?

In linear regression, data are modeled using a straight line. Linear regression is the simplest form of regression. Bivariate linear regression models a random variable Y, called the response variable, as a linear function of another random variable X, called the predictor variable:
Y = a + bX
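A small sketch that fits the coefficients a and b of Y = a + bX by the method of least squares; the sample data (years of experience versus salary) are invented:

```python
def fit_linear(xs, ys):
    # Least-squares estimates for Y = a + b*X.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_linear([3, 8, 9, 13, 6], [30, 57, 64, 72, 43])
print(a, b)        # fitted intercept and slope
print(a + b * 10)  # predicted value of Y for X = 10
```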

51. What is classification?

A bank loan officer wants to analyze which loan applicants are safe and which are risky for the bank. A marketing manager needs data analysis to help guess whether a customer with a given profile will buy a new computer. In the above examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels such as "safe" or "risky" for the loan application data.

52. What is prediction?

Suppose the marketing manager would like to predict how much a given customer will spend during a sale. This data analysis task is an example of numeric prediction. The term prediction is used to refer to numeric prediction.

53. How does classification work? (Or) Explain the steps involved in data classification.

Data classification is a two-step process:
Step 1: A classifier is built describing a predetermined set of data classes or concepts. This is the learning step (training phase), where a classification algorithm builds the classifier by analyzing or "learning from" a training set made up of database tuples and their associated class labels. (Learning: training data are analyzed by a classification algorithm.)
Step 2: The model is used for classification. A test set is used, made up of tuples and their associated class labels; the test data are used to estimate the accuracy of the classification rules. A new tuple such as (John Henry, middle aged, low income) can then be classified, e.g. loan_decision = risky.

54. What is supervised learning?

The class label of each training tuple is provided in supervised learning (i.e. the learning of the classifier is "supervised" in that it is told to which class each training tuple belongs).
E.g. Learning: training data are analyzed by a classification algorithm.


Training data:

Name           Age           Income   loan_decision
Sandy Jones    young         low      risky
Caroline       middle aged   high     safe
Susan Lake     senior        low      safe

In the above table the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules, e.g.
IF age = young THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle aged AND income = low THEN loan_decision = risky

55. What is unsupervised learning?

In unsupervised learning (or clustering), the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.
For example, if we did not have the loan_decision data available for the training set, we could use clustering to try to determine groups of like tuples, which may correspond to risk groups within the loan application data.

56. What are the preprocessing steps of the classification and prediction process?

The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
Data cleaning: this refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques) and the treatment of missing values.
Relevance analysis: many of the attributes in the data may be redundant. A strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. Attribute subset selection can be used to find a reduced set of attributes. Relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to delete attributes that do not contribute to the classification or prediction task.
Data transformation and reduction: the data may be transformed by normalization. Data can also be transformed by generalizing it to higher-level concepts, e.g. the attribute income can be generalized to discrete ranges such as low, medium, and high.

57. What are the criteria used in comparing classification and prediction methods?

Accuracy: the accuracy of a classifier refers to the ability of the classifier to correctly predict the class label of new or previously unseen data (i.e. tuples without class label information). The accuracy of a predictor refers to how well the given predictor can guess the value of the predicted attribute for new or unseen data.
Speed: this refers to the computational cost involved in generating and using the given classifier or predictor.
Robustness: the ability to make correct predictions from given noisy data or data with missing values.
Scalability: the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability: this refers to the level of understanding and insight that is provided by the classifier or predictor.

58. What are Bayesian classifiers?

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.

59. Define Bayes' theorem.

Let X be a data tuple. In Bayesian terms, X is considered evidence. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. P(H|X) is the posterior probability of H conditioned on X. Suppose X is a 35-year-old customer with an income of $40,000, and H is the hypothesis that X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability of H: the probability that any given customer will buy a computer, regardless of age, income, or any other information. P(X|H) is the posterior probability of X conditioned on H, i.e. the probability that a customer X is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X: the probability that a person from our set of customers is 35 years old and earns $40,000.
How are the probabilities estimated? P(H), P(X|H), and P(X) may be estimated from the given data. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X|H), and P(X). Bayes' theorem is
P(H|X) = [ P(X|H) P(H) ] / P(X)
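A tiny sketch that applies the theorem to the customer example above; the three probabilities are made up purely for illustration:

```python
def posterior(p_x_given_h, p_h, p_x):
    # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
    return p_x_given_h * p_h / p_x

# Illustrative values of P(X|H), P(H), and P(X) for the
# "customer will buy a computer" hypothesis.
print(posterior(p_x_given_h=0.30, p_h=0.40, p_x=0.25))  # -> 0.48
```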

60. What are Bayesian belief networks? Give an example.

Bayesian belief networks specify joint conditional probability distributions. They provide a graphical model of causal relationships, on which learning can be performed. Bayesian belief networks can be used for classification.
A belief network is defined by two components: a directed acyclic graph (DAG) and a set of conditional probability tables. Each node in the DAG represents a random variable. The variables may correspond to actual attributes given in the data, or to hidden variables believed to form a relationship (e.g. in the case of medical data, a hidden variable may indicate a syndrome, representing a number of symptoms that together characterize a specific disease). Each arc represents a probabilistic dependence: if an arc is drawn from node Y to node Z, then Y is the parent or immediate predecessor of Z, and Z is the descendant of Y.

(a) A simple Bayesian belief network, with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea.


(b) The conditional probability table (CPT) for the variable LungCancer (LC), showing each possible combination of the values of its parents, FamilyHistory (FH) and Smoker (S):

        FH,S    FH,~S   ~FH,S   ~FH,~S
LC      0.8     0.5     0.7     0.1
~LC     0.2     0.5     0.3     0.9

A belief network has one conditional probability table (CPT) for each variable. The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y.

61. What is rule-based classification?

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion
E.g.:
R1: IF age = youth AND student = yes THEN buys_computer = yes
Explanation: the IF part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN part (or right-hand side) is the rule consequent. The rule R1 can also be written as
R1: (age = youth) AND (student = yes) => (buys_computer = yes)
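The rule R1 can be read as a simple predicate over a tuple of attribute values; a minimal sketch (the attribute names follow the example above, and the test tuple is invented):

```python
def rule_r1(tuple_):
    # R1: IF age = youth AND student = yes THEN buys_computer = yes
    if tuple_["age"] == "youth" and tuple_["student"] == "yes":
        return "buys_computer = yes"
    return None  # the rule is not triggered for this tuple

print(rule_r1({"age": "youth", "student": "yes", "income": "medium"}))
```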

62. What is the sequential covering algorithm? How is it different from decision tree induction?

IF-THEN rules can be extracted directly from the training data (i.e. without having to generate a decision tree first) using a sequential covering algorithm. Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules. Popular sequential covering algorithms are AQ, CN2, and the more recent RIPPER. The general strategy is as follows: rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This sequential learning of rules is in contrast to decision tree induction, where the path to each leaf in a decision tree corresponds to a rule.

63. What is back propagation?

Back propagation is a neural network learning algorithm. A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class labels of the input tuples. Back propagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual target value.

64. What is associative classification?

In associative classification, association rules are generated and analyzed for use in classification. The general idea is that we can search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Decision tree induction considers only one attribute at a time, whereas association rules explore highly confident associations among multiple attributes.
Various associative classification methods are:
CBA (Classification Based on Associations): CBA uses an iterative approach to frequent itemset mining.
CMAR (Classification based on Multiple Association Rules): it differs from CBA in its strategy for frequent itemset mining and its construction of the classifier.


65. What are k-nearest-neighbour classifiers?

Nearest-neighbour classifiers are based on learning by analogy, i.e. by comparing a given tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
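A minimal k-nearest-neighbour sketch over numeric attribute vectors, using Euclidean distance and a majority vote among the k closest training tuples; the training data and query are invented:

```python
from collections import Counter

def knn_classify(training, query, k):
    # training: list of (attribute_vector, class_label) pairs.
    dist = lambda x, y: sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    neighbours = sorted(training, key=lambda t: dist(t[0], query))[:k]
    # Majority class among the k nearest training tuples.
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [([170, 65], "safe"), ([160, 72], "risky"),
         ([175, 80], "safe"), ([158, 59], "risky")]
print(knn_classify(train, [168, 63], k=3))
```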

66. What is regression analysis?

Regression analysis can be used to model the relationship between one or more independent or predictor variables and a dependent or response variable (which is continuous-valued). The predictor variables are the attributes of interest describing the tuple (i.e. making up the attribute vector). In general, the values of the predictor variables are known. The response variable is what we want to predict. Given a tuple described by predictor variables, we want to predict the associated value of the response variable. Many problems can be solved by linear regression, and several packages exist to solve regression problems, for example SAS, SPSS, and S-Plus.

67. What is non-linear regression?

If a given response variable and predictor variable have a relationship that may be modeled by a polynomial function, it is called non-linear regression or polynomial regression. It can be modeled by adding polynomial terms to the basic linear model.

68. Explain clustering by k-means partitioning.

The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster (the cluster centroid or center of gravity).
How does the k-means algorithm work? The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as
E = Σ(i=1..k) Σ(p ∈ Ci) |p − mi|²
where E is the sum of the square errors for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.
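A short sketch of the k-means loop described above, for one-dimensional points; the data, the value of k, and the simple initialization are illustrative only (real implementations use random initialization and a convergence test):

```python
def k_means(points, k, iterations=10):
    # Start with the first k points as initial cluster means (for illustration).
    means = points[:k]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest cluster mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Update step: recompute each cluster mean.
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means, clusters

means, clusters = k_means([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2)
print(means, clusters)
```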

69. What is the k-medoids method?

The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. The absolute-error criterion is defined as
E = Σ(j=1..k) Σ(p ∈ Cj) |p − oj|
where E is the sum of the absolute errors for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
The algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.

70. What is outlier detection and analysis?

One person's noise could be another person's signal. Outliers are data objects that do not comply with the general behavior or the model of the data. Outliers can be caused by measurement or execution errors. Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. However, outliers may be of particular interest, such as in the case of fraud detection.

71. What is outlier mining?

Outlier detection and analysis is an interesting data mining task referred to as outlier mining. Outlier mining has wide applications; it can be used in fraud detection (detecting unusual usage of credit cards, etc.). Outlier mining can be described as follows: given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considered the most dissimilar, exceptional, or inconsistent with respect to the remaining data.

72. Explain in brief the different outlier detection approaches.

Computer-based methods for outlier detection fall into four approaches:
The statistical approach
The distance-based approach
The density-based local outlier approach
The deviation-based approach

    UNIT V

1. What are the areas in which data warehouses are used at present and in the future?

The potential subject areas in which data warehouses may be developed at present and also in the future are:
(i) Census data: The Registrar General and Census Commissioner of India decennially compiles information on all individuals, villages, population groups, etc. This information is wide-ranging, such as the individual slip, a compilation of information on individual households, of which a database of a 5% sample is maintained for analysis. A data warehouse can be built from this database, upon which OLAP techniques can be applied; data mining can also be performed for analysis and knowledge discovery.
(ii) Prices of essential commodities: The Ministry of Food and Civil Supplies, Government of India compiles daily data for about 300 observation centers in the entire country on the prices of essential commodities such as rice, edible oil, etc. A data warehouse can be built for this data, and OLAP techniques can be applied for its analysis.

2. What are the other areas for data warehousing and data mining?

Agriculture
Rural development
Health
Planning
Education
Commerce and trade

3. Specify some of the sectors in which data warehousing and data mining are used.

Tourism
Program implementation
Revenue
Economic affairs
Audit and accounts

4. Describe the use of DBMiner.

DBMiner is used to perform data mining functions, including characterization, association, classification, prediction, and clustering.

5. Applications of DBMiner.

The DBMiner system can be used as a general-purpose online analytical mining system for both OLAP and data mining in relational databases and data warehouses. It is used in medium to large relational databases with fast response times.

6. Give some data mining tools.

DBMiner
GeoMiner
MultimediaMiner
WeblogMiner

7. Mention some of the application areas of data mining.

DNA analysis
Financial data analysis
Retail industry
Telecommunication industry
Market analysis
Banking industry
Health care analysis

8. Differentiate data query and knowledge query.

A data query finds concrete data stored in a database and corresponds to a basic retrieval statement in a database system.
A knowledge query finds rules, patterns, and other kinds of knowledge in a database and corresponds to querying database knowledge, including deduction rules, integrity constraints, generalized rules, frequent patterns, and other regularities.

9. Differentiate direct query answering and intelligent query answering.

Direct query answering means that a query is answered by returning exactly what is being asked.
Intelligent query answering consists of analyzing the intent of the query and providing generalized, neighborhood, or associated information relevant to the query.

10. Define visual data mining.

Visual data mining discovers implicit and useful knowledge from large data sets using data and/or knowledge visualization techniques. It is the integration of data visualization and data mining.

11. What does audio data mining mean?

Audio data mining uses audio signals to indicate patterns of data or the features of data mining results. Patterns are transformed into sound and music, so that interesting or unusual patterns can be identified by listening to pitch, rhythm, tune, and melody.

Steps involved in DNA analysis:
Semantic integration of heterogeneous, distributed genome databases
Similarity search and comparison among DNA sequences
Association analysis: identification of co-occurring gene sequences
Path analysis: linking genes to different stages of disease development
Visualization tools and genetic data analysis

12. What are the factors involved while choosing a data mining system?

Data types
System issues
Data sources
Data mining functions and methodologies
Coupling data mining with database and/or data warehouse systems
Scalability
Visualization tools
Data mining query language and graphical user interface

    13. Define DMQL

Data Mining Query Language. It specifies clauses and syntax for performing different types of data mining tasks, for example data classification, data clustering and mining association rules. It also uses SQL-like syntax to mine databases.

    14. Define text mining

Extraction of meaningful information from large amounts of free-format textual data. Useful in artificial intelligence and pattern matching. Also known as text data mining, knowledge discovery from text, or content analysis.

15. What does web mining mean?

Technique to process information available on the web and search for useful data. It is used to discover web pages, text documents, multimedia files, images, and other types of resources from the web. Used in several fields such as e-commerce, information filtering, fraud detection, education and research.

    16.Define spatial data mining.

Extracting undiscovered and implied spatial information. Spatial data: data that is associated with a location. Used in several fields such as geography, geology, medical imaging, etc.

    17. Explain multimedia data mining.

Mines large multimedia databases. It does not retrieve any specific information from multimedia databases; it derives new relationships, trends, and patterns from stored multimedia data. Used in medical diagnosis, stock markets, the animation industry, the airline industry, traffic management systems, surveillance systems, etc.

    18. What is Time Series Analysis?

A time series is a set of attribute values over a period of time. Time series analysis may be viewed as finding patterns in the data and predicting future values.

19. What are the various detected patterns?

Detected patterns may include:
Trends: may be viewed as systematic non-repetitive changes to the values over time.
Cycles: the observed behavior is cyclic.
Seasonal: the detected patterns may be based on time of year, month or day.
Outliers: to assist in pattern detection, techniques may be needed to remove or reduce the impact of outliers.

    20.What is a spatial database?

A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data.

    21.Define spatial data mining?

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in the spatial database. Such mining demands an integration of data mining with spatial databases.

    22. What is spatial data warehouse?

A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision making processes.

23. What are the different dimensions in a spatial data cube?

There are three dimensions in a spatial data cube:
a. Nonspatial dimension
b. Spatial-to-nonspatial dimension
c. Spatial-to-spatial dimension

    24. What is progressive refinement?

Progressive refinement is an optimization method for mining spatial association rules from a spatial database. This method first mines large data sets roughly using a fast algorithm and then improves the quality of mining in a pruned data set using a more expensive algorithm. A filter-and-refine sketch of this two-pass idea is given below.
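A minimal, illustrative sketch of the filter-and-refine (progressive refinement) idea in Python, assuming simple 2-D points: a cheap bounding-box test prunes the data roughly, and an exact Euclidean-distance test refines the result. All names and data here are hypothetical, not part of any particular spatial mining system.

import math

def coarse_filter(points, center, radius):
    # Cheap pass: keep points whose bounding box could lie within the radius.
    cx, cy = center
    return [p for p in points
            if abs(p[0] - cx) <= radius and abs(p[1] - cy) <= radius]

def refine(candidates, center, radius):
    # Expensive pass: exact Euclidean distance on the pruned candidate set.
    cx, cy = center
    return [p for p in candidates
            if math.hypot(p[0] - cx, p[1] - cy) <= radius]

points = [(1, 1), (2, 9), (3, 2), (8, 8)]
candidates = coarse_filter(points, center=(2, 2), radius=2.5)   # rough mining
result = refine(candidates, center=(2, 2), radius=2.5)          # refined result
print(result)   # [(1, 1), (3, 2)]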

    25. What is spatial classification?

Spatial classification analyzes spatial objects to derive classification schemes in relevance to certain spatial properties such as the neighborhood of a district, highway, river, etc.

26. What are the two multimedia indexing and retrieval systems?

1) Description-based retrieval systems: build indices and perform object retrieval based on image descriptions such as keywords, time of creation, size, etc.
2) Content-based retrieval systems: support retrieval based on the image content, such as color histogram, texture, shape, etc.

27. What are the retrieval methods (based on signature) proposed for similarity-based retrieval in image databases?

a. Color histogram-based signature
b. Multifeature composed signature
c. Wavelet-based signature
d. Wavelet-based signature with region-based granularity

    28. What is feature descriptor?

A feature descriptor is a set of vectors for each visual characteristic. The main vectors are the color vector, an MFO (Most Frequent Orientation) vector, and an MFC (Most Frequent Color) vector.

    29. What is a time series database?

A time series database consists of a sequence of values or events that change with time. It is a sequence database in which the values are measured at equal intervals.

    30. What is a sequence database?

A sequence database contains sequences of ordered events, with or without a concrete notion of time.


31. What are the two popular data-independent transformations?

1) Discrete Fourier Transform (DFT)
2) Discrete Wavelet Transform (DWT)
A small DFT-based similarity sketch is given below.
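A minimal sketch, assuming NumPy is installed, of how a DFT can reduce a time series to a few leading coefficients that serve as a compact signature for similarity search; the series values and function names are illustrative only.

import numpy as np

def dft_signature(series, k=4):
    # Keep only the first k Fourier coefficients as a compact signature.
    coeffs = np.fft.fft(np.asarray(series, dtype=float))
    return coeffs[:k]

def signature_distance(a, b, k=4):
    # Euclidean distance between the truncated DFT signatures.
    return np.linalg.norm(dft_signature(a, k) - dft_signature(b, k))

s1 = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0]
s2 = [1.1, 2.1, 2.9, 2.0, 1.0, 2.1, 3.0, 1.9]
print(signature_distance(s1, s2))   # small value => similar series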

32. What are similarity searches that handle gaps and differences in offsets and amplitudes?

The searches that handle gaps and amplitudes are:
Atomic matching
Window stitching
Subsequence ordering

33. What are the parameters that affect the result of sequential pattern mining?

    o Duration

    o Event folding window

    o Interval

34. What is a serial episode and a parallel episode?

A serial episode is a set of events that occurs in a total order, whereas a parallel episode is a set of events whose occurrence ordering is trivial.

35. What is periodicity analysis? What are the problems in periodicity analysis?

Periodicity analysis is the mining of periodic patterns, i.e. the search for recurring patterns in a time series database. The following are the problems in periodicity analysis:
a. Mining full periodic patterns
b. Mining partial periodic patterns
c. Mining cyclic or periodic association rules
A simple period-detection sketch is given below.
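A minimal sketch, assuming NumPy, of one simple way to spot a candidate period in a series: compute the autocorrelation at each lag and pick the strongest lag. The data and function name are illustrative only.

import numpy as np

def best_period(series, max_lag=None):
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    max_lag = max_lag or len(x) // 2
    # Autocorrelation at each candidate lag; the strongest lag is a candidate period.
    scores = [np.dot(x[:-lag], x[lag:]) for lag in range(1, max_lag + 1)]
    return int(np.argmax(scores)) + 1

series = [1, 5, 2, 1, 5, 2, 1, 5, 2, 1, 5, 2]
print(best_period(series))   # 3, the length of the repeating cycle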

    36. What is text database?

A text database consists of a large collection of documents from various sources such as news articles, research papers, books, digital libraries, e-mail messages, and web pages. Data is stored in semi-structured form.

37. What is Information Retrieval (IR)?

Information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents.

38. What are the basic measures for assessing the quality of text retrieval?

1) Precision: this is the percentage of retrieved documents that are in fact relevant to the query.
2) Recall: this is the percentage of documents that are relevant to the query and were, in fact, retrieved.
A small sketch computing both measures is given below.
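A minimal sketch computing precision and recall from two sets of document identifiers (the retrieved set and the relevant set); the document IDs are made up for illustration.

def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...)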

39. Write short notes on the multidimensional data model.

Data warehouses and OLAP tools are based on a multidimensional data model. This model is used for the design of corporate data warehouses and departmental data marts. This model contains the star schema, snowflake schema and fact constellation schemas. The core of the multidimensional model is the data cube.

    40.Define data cube?

It consists of a large set of facts (or) measures and a number of dimensions.

41. What are facts?

Facts are numerical measures. Facts can also be considered as quantities by which we can analyze the relationship between dimensions.

    42.What are dimensions?


Dimensions are the entities (or) perspectives with respect to an organization for keeping records and are hierarchical in nature.

43. Define dimension table?

A dimension table is used for describing the dimension. (e.g.) A dimension table for item may contain the attributes item_name, brand and type.

    44.Define fact table?

A fact table contains the names of facts (or) measures as well as keys to each of the related dimension tables. A tiny sketch of a fact table with keys into a dimension table is given below.
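A minimal sketch, using made-up sales data, of how a fact table holds numerical measures plus keys into dimension tables; plain Python dictionaries stand in for the tables here.

# Dimension table: describes the 'item' dimension.
item_dim = {
    1: {"item_name": "TV", "brand": "Acme", "type": "electronics"},
    2: {"item_name": "Shirt", "brand": "Wear", "type": "clothing"},
}

# Fact table: numerical measures plus foreign keys into the dimension tables.
sales_facts = [
    {"item_key": 1, "time_key": 202401, "units_sold": 5, "dollars_sold": 2500.0},
    {"item_key": 2, "time_key": 202401, "units_sold": 10, "dollars_sold": 300.0},
]

# Join a fact row to its item dimension to analyze a measure by brand.
for fact in sales_facts:
    brand = item_dim[fact["item_key"]]["brand"]
    print(brand, fact["dollars_sold"])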

45. What is a lattice of cuboids?

In the data warehousing research literature, a cube can also be called a cuboid. For different sets of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization. The lattice of cuboids is also referred to as a data cube.

    46.What is apex cuboid?

The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is typically denoted by 'all'. A small sketch enumerating the lattice of cuboids is given below.
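A minimal sketch enumerating the 2^n cuboids of the lattice for three illustrative dimensions; the empty combination corresponds to the apex cuboid and the full combination to the base cuboid.

from itertools import combinations

dimensions = ["time", "item", "location"]   # illustrative dimensions

cuboids = []
for k in range(len(dimensions) + 1):
    for combo in combinations(dimensions, k):
        cuboids.append(combo)

print(len(cuboids))   # 8 cuboids = 2^3
print(cuboids[0])     # () -> the apex (0-D) cuboid, i.e. 'all'
print(cuboids[-1])    # ('time', 'item', 'location') -> the base cuboid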

47. List out the components of star schema?

A multidimensional data model can exist in the form of a star schema. It consists of:
a) A large central table (fact table) containing the bulk of data with no redundancy.
b) A set of smaller attendant tables (dimension tables), one for each dimension.

    48.What is snowflake schema?

The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the tables into additional tables.

49. List out the components of fact constellation schema?

This requires multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars and hence it is known as a galaxy schema (or) fact constellation schema.

50. Point out the major difference between the star schema and the snowflake schema?

The dimension tables of the snowflake schema model may be kept in normalized form to reduce redundancies. Such a table is easy to maintain and saves storage space.

51. Which is popular in the data warehouse design, star schema model (or) snowflake schema model?

Star schema model, because the snowflake structure can reduce the effectiveness of browsing and more joins will be needed to execute a query.

52. Define concept hierarchy?

A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts.

    53.Define total order?

If the attributes of a dimension form a concept hierarchy such as street < city < province_or_state < country, they are said to form a total order, since the attributes are arranged in a fully ordered sequence from the lowest to the highest conceptual level.


Design and construction of data warehouses based on the benefits of data mining
Multidimensional analysis of sales, customers, products, time and region
Analysis of the effectiveness of sales campaigns
Customer retention: analysis of customer loyalty
Purchase recommendation and cross-reference of items

66. Name some of the data mining applications

Data mining for biomedical and DNA data analysis
Data mining for financial data analysis
Data mining for the retail industry
Data mining for the telecommunication industry

67. What are the features of object-relational and object-oriented databases?

Both kinds of systems deal with the efficient storage and access of vast amounts of disk-based complex structured objects. They organize a large set of data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with:
a. an object identifier
b. a set of attributes
c. a set of methods that specify the computational routines or rules associated with each object class

68. How is data mining performed on complex data types?

Vast amounts of data are stored in various complex forms. The complex data types include objects, spatial data, multimedia data, text data and web data. Multidimensional analysis and data mining can be performed by:
a. class-based generalization of complex objects, including set-valued, list-valued, class-subclass hierarchies, and class composition hierarchies
b. constructing an object data cube
c. performing generalization-based mining

69. Give an example of a star schema of a spatial data warehouse.

There are 3,000 weather probes distributed in British Columbia (BC), Canada, each recording daily temperature and precipitation for a designated small area and transmitting signals to a provincial weather station. With a spatial data warehouse that supports spatial OLAP, a user can view weather patterns on a map by month, by region, etc.

70. How is a spatial data warehouse constructed?

As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of both spatial and nonspatial data.

    71. What are spatial association rules?

Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form A => B, where A and B are sets of spatial or nonspatial predicates.


There are a tremendous number of online documents available. Automated document classification is an important text mining task, as a need exists to automatically organize documents into classes to facilitate document retrieval and subsequent analysis.
A general procedure for automated document classification: first, a set of pre-classified documents is taken as a training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. The so-derived classification scheme can be used for classification of other online documents.
A few typical classification methods used in text classification are:
a. Nearest-neighbour classification
b. Feature selection methods
c. Bayesian classification

76. Explain briefly some data classification methods.

a. Nearest-neighbor classification: k-nearest-neighbor classification is based on the intuition that similar documents are expected to be assigned the same class label.
i) We can simply index the training documents and associate each with a class label.
ii) The class label of a test document can be determined based on the class label distribution of its k nearest neighbors.
By tuning k and incorporating refinements, this kind of classification can achieve accuracy close to the best classifiers.
b. Feature selection: a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels.
c. Bayesian classification: first trains the model by calculating a generative document distribution P(d|c) for each class c and document d, and then tests which class is most likely to generate the test document.
A minimal k-nearest-neighbor sketch is given below.
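A minimal sketch of k-nearest-neighbor text classification, assuming scikit-learn is installed; the tiny training corpus and class labels are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Pre-classified training documents and their class labels.
train_docs = ["stock prices rise", "market falls sharply",
              "team wins the match", "player scores a goal"]
train_labels = ["finance", "finance", "sports", "sports"]

# Index the training documents as TF-IDF vectors.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Classify a new document by the labels of its k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["the match ends with a late goal"])
print(knn.predict(X_test))   # expected: ['sports']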

77. What are the different methods of document clustering?

Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner (the class labels are not known beforehand).
a. Spectral clustering method: first performs spectral embedding (dimensionality reduction) on the original data, and then applies a traditional clustering algorithm (e.g. k-means) on the reduced document space.
b. The mixture model clustering method: models the text data with a mixture model (involving multinomial component models). Clustering involves two steps: (1) estimating the model parameters based on the text data and any additional prior knowledge, and (2) inferring the clusters based on the estimated model parameters.
c. The latent semantic indexing (LSI) and locality preserving indexing (LPI) methods: these are linear dimensionality reduction methods. We can acquire transformation vectors or an embedding function through which we embed all of the data into a lower-dimensional space.
A minimal dimensionality-reduction-plus-k-means sketch is given below.
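A minimal sketch, assuming scikit-learn, of document clustering that first applies an LSI-style linear reduction (truncated SVD on TF-IDF vectors) and then runs k-means on the reduced document space; the documents are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["stock prices rise", "market falls sharply",
        "team wins the match", "player scores a goal"]

# TF-IDF term space, then LSI-style reduction to 2 dimensions.
X = TfidfVectorizer().fit_transform(docs)
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)

# Traditional clustering (k-means) on the reduced document space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)   # two clusters, e.g. [0 0 1 1]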

78. What is a time series database?

A time series database consists of sequences of values or events obtained over repeated measurements of time (hourly, daily, weekly). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, workload projections, process and quality control, natural phenomena (such as atmosphere, temperature, wind, earthquakes), scientific and engineering experiments and medical treatments. The amount of time-series data is increasing rapidly (gigabytes per day, such as in stock trading) or even per minute (such as in NASA space programs). A need exists to find correlation relationships within time series data, as well as to analyze huge numbers of regular patterns, trends, bursts (such as sudden sharp changes) and outliers, with fast or even real-time response.

79. What is trend analysis?

Time series data involving a variable Y, representing, say, the closing price of a share in a stock market, can be viewed as a function of time t, that is, Y = f(t). Such a function can be illustrated as a time-series graph.
How can we study the time series data? There are two goals:
(1) Modelling time series (to gain insight into the mechanisms or underlying forces that generate the time series).
(2) Forecasting time series (to predict the future values of the time series variables).
Trend analysis consists of the following 4 major components:
1) Trend or long-term movements: displayed by a trend curve or a trend line. A moving-average sketch for extracting such a trend is given below.
2) Cyclic movements or cyclic variations: refer to cycles, the long-term oscillations about a trend line or curve.
3) Seasonal movements or variations: these are systematic or calendar related. Eg: events that recur annually, such as a sudden increase in sales of items before Christmas, or the observed increase in water consumption during summer.
4) Irregular or random movements. Eg: floods, personnel changes within companies.

Fig: Time series data of stock price. The dashed curve shows the trend.
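A minimal sketch of extracting the long-term trend component with a simple moving average; the window size and closing prices are illustrative only.

def moving_average(series, window=3):
    # Each output value is the mean of `window` consecutive observations,
    # smoothing short-term fluctuations to reveal the long-term trend.
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

closing_prices = [10, 12, 11, 13, 15, 14, 16, 18]
print(moving_average(closing_prices, window=3))
# [11.0, 12.0, 13.0, 14.0, 15.0, 16.0]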

80. What are the basic measures for text retrieval?

a. Precision: this is the percentage of retrieved documents that are relevant to the query (i.e. correct responses).
precision = |{relevant} ∩ {retrieved}| / |{retrieved}|
b. Recall: this is the percentage of documents that are relevant to the query and were retrieved.
recall = |{relevant} ∩ {retrieved}| / |{relevant}|

    81. What is an object cube?


In an object database, data generalization and multidimensional analysis are not applied to individual objects but to classes of objects. The attribute-oriented induction method developed for mining characteristics of relational databases can be extended to mine data characteristics in object databases. The generalization of multidimensional attributes of a complex object class can be performed by examining each attribute (or dimension), generalizing each attribute to simple-valued data, and constructing a multidimensional data cube, called an object cube. Once an object cube is constructed, multidimensional analysis and data mining can be performed on it in a manner similar to that for relational data cubes.

82. What are the challenges faced in web data mining?

1) The web seems to be too large for effective data warehousing and data mining.
2) The complexity of the web is far greater than that of any text document collection. Web pages lack a unifying structure. They contain far more authoring style and content variations.
3) The web is a highly dynamic information source. News, stock market, weather, etc. are updated regularly on the web.
4) The web serves a broad diversity of user communities. The internet currently connects more than 100 million workstations. Users can easily get lost when groping in the "darkness" of the network.
5) Only a small portion of the information on the web is truly relevant or useful.

83. What are the data mining applications?

1) Intrusion detection
2) Association & correlation analysis
3) Analysis of stream data
4) Distributed data mining
5) Visualizing & querying tools

84. What are the recent trends in data mining?

1) Application exploration: in financial analysis, telecommunications, biomedicine, countering terrorism, etc.
2) Scalable and interactive data mining methods: constraint-based mining
3) Integration of DM systems with DB, DW, and web DB systems
4) Standardization of data mining language
5) Visual data mining
6) New methods for mining complex types of data
7) Biological data mining: mining DNA and protein sequences, etc.
8) DM applications in software engineering
9) Web mining
10) Distributed data mining
11) Real-time or time-critical DM
12) Graph mining
13) Privacy protection and information security in DM
14) Multi-relational & multi-database DM

    85. What is web usage mining?

Besides mining web contents and web structures, another important task for web mining is web usage mining, which mines weblog records to discover user access patterns of web pages. This helps to identify high-potential customers for electronic commerce, improve web server performance, etc. A web server usually registers a weblog entry for every access of a web page. It includes the URL requested, the IP address from which the request originated, and a timestamp. A small sketch of parsing such entries is given below.
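A minimal sketch of pulling the IP address, timestamp and requested URL out of weblog entries in a common-log-format style and counting accesses per page; the log lines are made up for illustration.

import re
from collections import Counter

log_lines = [
    '192.168.1.5 - - [08/Apr/2018:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512',
    '192.168.1.7 - - [08/Apr/2018:10:00:09 +0000] "GET /catalog.html HTTP/1.1" 200 734',
    '192.168.1.5 - - [08/Apr/2018:10:01:15 +0000] "GET /index.html HTTP/1.1" 200 512',
]

pattern = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "GET (\S+)')

page_counts = Counter()
for line in log_lines:
    match = pattern.match(line)
    if match:
        ip, timestamp, url = match.groups()   # fields available per entry
        page_counts[url] += 1                 # access pattern: hits per page

print(page_counts.most_common())   # [('/index.html', 2), ('/catalog.html', 1)]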

86. What are similarity-based retrieval methods in image databases?


a. Description-based retrieval systems: build indices and perform object retrieval based on image descriptions such as keywords, captions, size and time of creation.
b. Content-based retrieval systems: support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts and locations within the image.

87. What are the approaches used for similarity-based retrieval in image databases?

1) Color histogram-based signature: it is based on the color composition of the image. It does not contain any information about shape, image topology or texture. A small sketch of such a signature is given below.
2) Multifeature composed signature: the signature of the image includes multiple features: color histogram, shape, image topology and texture. The extracted features are stored as metadata and images are indexed based on the metadata.
3) Wavelet-based signature: this approach uses the dominant wavelet coefficients of an image as its signature. Wavelets capture shape, texture, and image topology information in a single unified framework.
4) Wavelet-based signature with region-based granularity: the computation and comparison of signatures are at the granularity of regions.
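A minimal sketch, assuming NumPy, of a color-histogram signature for an image stored as an array of RGB pixels, compared with another image by a simple histogram distance; the tiny 2x2 'images' are made up for illustration.

import numpy as np

def color_histogram_signature(image, bins=4):
    # Histogram each RGB channel into `bins` buckets and normalize, giving a
    # compact signature of the image's color composition (no shape or texture).
    channels = [np.histogram(image[..., c], bins=bins, range=(0, 256))[0]
                for c in range(3)]
    signature = np.concatenate(channels).astype(float)
    return signature / signature.sum()

def histogram_distance(img_a, img_b):
    return np.abs(color_histogram_signature(img_a)
                  - color_histogram_signature(img_b)).sum()

img1 = np.array([[[250, 10, 10], [240, 20, 15]],
                 [[245, 5, 12], [238, 18, 20]]])    # mostly red pixels
img2 = np.array([[[10, 10, 250], [15, 20, 240]],
                 [[12, 5, 245], [20, 18, 238]]])    # mostly blue pixels
print(histogram_distance(img1, img2))   # large value => dissimilar color content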