CSE-634 Data Mining Concepts and Techniques
Spring 2007

Data Warehousing and OLAP Technology, Part I

By Group 2:
Anuradha T P - 106019423
Karthik Bhade - 105840048
Maduri Narasimhan - 105791690
Sumit Chopra - 105959878

Guidance: Prof. Anita Wasilewska
Department of Computer Science, SUNY Stony Brook

Jan 27, 2015


References

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Book Slides
[3] Sections 3.1, 3.2, and 3.3
[4] http://www.daneil-lemire.com
[5] http://www.kalmstrom.nu

Knowledge is the antidote to fear.

- Ralph Waldo Emerson

What is a Data Warehouse?

o Defined in many different ways.
   - A decision support database that is maintained separately from the organization’s operational database.
   - Supports information processing by providing a solid platform of consolidated, historical data for analysis.
o “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” - W. H. Inmon
o Data warehousing:
   - The process of constructing and using data warehouses.

Data Warehouse - Subject Oriented

o Organized around major subjects, such as customer, product, sales.
o Focused on the modeling and analysis of data for decision makers, not on daily operations.
o Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Mining Concepts and Techniques - Book Slides

Data Warehouse - Integrated

o Constructed by integrating multiple, heterogeneous data sources.
   - Relational databases, flat files, on-line transaction records.
o Data cleaning and data integration techniques are applied.
   - Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
   - When data is moved to the warehouse, it is converted.
o E.g., sales data may be in a relational database (RDB), customer information in flat files.

Data Warehouse - Time Variant

o The time horizon for the data warehouse is significantly longer than that of operational database systems.
   - Operational database: current value data.
   - Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
o Every key structure in the data warehouse
   - Contains an element of time, explicitly or implicitly.
   - But the key of operational data may or may not contain a “time element”.

Data Warehouse - Nonvolatile

o A physically separate store of data, transformed from the operational environment.
o Operational update of data does not occur in the data warehouse environment.
   - Does not require transaction processing, recovery, and concurrency control mechanisms.
   - Requires only two operations in data accessing: initial loading of data and access of data.


Heterogeneous Databases

o Consists of a set of interconnected, autonomous databases.

o Objects in one database may differ from objects in other databases.

o Information exchange across such databases is difficult.

Data Warehouse vs. Heterogeneous DBMS

o Heterogeneous DBMS: a query-driven approach
   - Build wrappers/mediators on top of heterogeneous databases.
   - A meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites.
   - The results are integrated into a global answer set.
   - This approach involves complex information filtering.
   - Inefficient and potentially expensive.
o Data warehouse: update-driven, high performance
   - Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis.

Operational DBMS

o They consist of tables with a set of attributes and store a large set of tuples.
o They use the Entity-Relationship (ER) data model.
o They are used to store transactional data.
o They contain the most current information.
o Thus they are known as Online Transaction Processing (OLTP) systems.

Data Warehouse vs. Operational DBMS

o User and system orientation: customer vs. market
o Data contents: current, detailed vs. historical, consolidated
o Database design: ER + application vs. star + subject
o View: current, local vs. evolutionary, integrated
o Access patterns: update vs. read-only but complex queries

OLTP vs. OLAP

                    OLTP                                                        OLAP
users               clerk, IT professional                                      knowledge worker
function            day to day operations                                       decision support
DB design           application-oriented                                        subject-oriented
data                current, up-to-date, detailed, flat relational, isolated    historical, summarized, multidimensional, integrated, consolidated
usage               repetitive                                                  ad-hoc
access              read/write, index/hash on prim. key                         lots of scans
unit of work        short, simple transaction                                   complex query
# records accessed  tens                                                        millions
# users             thousands                                                   hundreds
DB size             100MB-GB                                                    100GB-TB
metric              transaction throughput                                      query throughput, response

Why Separate Data Warehouse?

o High performance for both systems
   - DBMS: tuned for Online Transaction Processing systems.
   - Warehouse: tuned for Online Analytical Processing systems involving complex OLAP queries.
   - Processing OLAP queries would degrade DBMS performance of operational tasks.
o Decision support requires historical data, which operational databases do not typically maintain.
o Decision support requires consolidation of data from heterogeneous sources.
o Solution
   - Maintain separate database systems that support the special primitives and structures suitable to store, access and process OLAP-specific data.

Multidimensional Data Model

o A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
o A data cube models n-D data and is defined by dimensions and facts.
   - Dimensions: entities with respect to which an organization wants to keep records, such as item (item_name).
   - Facts: the subject of decision-oriented analysis, such as dollars_sold or units_sold.
      - Facts are numerical measures.
      - They are the quantities by which we want to analyze relationships between dimensions.
      - The fact table contains keys to each of the related dimension tables.
o A multidimensional data model is typically organized around a central theme, like sales, and is represented by a fact table.
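As a rough illustration of dimensions and facts, the sketch below aggregates a tiny fact table over a chosen subset of dimensions. The rows, the `DIMENSIONS` tuple and the `build_cuboid` helper are hypothetical and are not taken from the slides or the textbook.

```python
from collections import defaultdict

# Hypothetical fact rows: (item, location, quarter) are dimension values;
# dollars_sold and units_sold are the numerical measures (facts).
FACT_ROWS = [
    ("TV", "U.S.A.", "Q1", 1200.0, 3),
    ("TV", "U.S.A.", "Q2", 800.0, 2),
    ("TV", "Canada", "Q1", 400.0, 1),
    ("PC", "U.S.A.", "Q1", 3000.0, 2),
]
DIMENSIONS = ("item", "location", "quarter")


def build_cuboid(rows, group_by):
    """Sum both measures for every combination of the chosen dimensions."""
    idx = [DIMENSIONS.index(d) for d in group_by]
    cells = defaultdict(lambda: [0.0, 0])
    for row in rows:
        key = tuple(row[i] for i in idx)
        cells[key][0] += row[3]  # dollars_sold
        cells[key][1] += row[4]  # units_sold
    return {key: tuple(totals) for key, totals in cells.items()}


by_item_location = build_cuboid(FACT_ROWS, ("item", "location"))
grand_total = build_cuboid(FACT_ROWS, ())

print(by_item_location[("TV", "U.S.A.")])  # (2000.0, 5)
print(grand_total[()])                     # (5400.0, 8)
```

Each key of the returned dictionary plays the role of one cube cell, and an empty `group_by` collapses everything into a single total.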

Page 16: CSE-634 Data Mining Concepts and Techniques Spring 2007

Data Mining Concepts and Techniques-Book Slides

Sales volume as a function of product, Date, Country

DatePro

duct

Cou

ntr

y

sum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

• Total annual sales

• of TV in U.S.A.Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month

Office Week

Day

Data Mining Concepts and Techniques - Book Slides

Cube: A Lattice of Cuboids

0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
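The lattice can be enumerated mechanically: every subset of the dimension set is one cuboid, so n dimensions give 2^n cuboids when no concept hierarchies are considered. A minimal sketch follows; the function name `cuboid_lattice` is an illustrative choice.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Return {k: [all cuboids with exactly k dimensions]}."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


lattice = cuboid_lattice(["time", "item", "location", "supplier"])
for k, cuboids in lattice.items():
    print(f"{k}-D cuboids ({len(cuboids)}):", cuboids)

total = sum(len(cuboids) for cuboids in lattice.values())
print(total)  # 16
```

The 0-D entry is the empty tuple (the apex cuboid, "all") and the 4-D entry is the base cuboid.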

Data Mining Concepts and Techniques - Book Slides

Schemas for Multidimensional Databases

The multidimensional model exists in the form of:

o Star schema: A fact table in the middle connected to a set of dimension tables.

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, state_or_province, country
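To make the star layout concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It keeps only the item and location dimensions for brevity, and all rows are invented sample data; the time and branch tables would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
    type TEXT, supplier_type TEXT);
CREATE TABLE location (
    location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
    state_or_province TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the measures
CREATE TABLE sales (
    item_key INTEGER REFERENCES item(item_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    dollars_sold REAL);

-- Hypothetical sample rows
INSERT INTO item VALUES
    (1, 'TV-100', 'Acme', 'TV', 'wholesale'),
    (2, 'PC-200', 'Acme', 'PC', 'retail');
INSERT INTO location VALUES
    (1, '1 Main St', 'Chicago', 'Illinois', 'USA'),
    (2, '2 King St', 'Toronto', 'Ontario', 'Canada');
INSERT INTO sales VALUES (1, 1, 3, 1200.0), (1, 2, 1, 400.0), (2, 1, 2, 3000.0);
""")

# A typical star join: fact table joined to each dimension, then aggregated.
rows = conn.execute("""
    SELECT l.country, i.type, SUM(s.dollars_sold)
    FROM sales AS s
    JOIN item AS i ON i.item_key = s.item_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY l.country, i.type
    ORDER BY l.country, i.type
""").fetchall()
print(rows)
```

Every analytical query has the same shape: join the fact table to the dimensions it needs and group by dimension attributes.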

Data Mining Concepts and Techniques - Book Slides

o Snowflake schema: A refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, state_or_province, country

Data Mining Concepts and Techniques - Book Slides

o Fact constellation: Multiple fact tables share dimension tables. It can be viewed as a collection of stars, and is therefore called a galaxy schema or fact constellation.

Sales Fact Table: time_key, item_key, branch_key, location_key, units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_state, country
shipper: shipper_key, shipper_name, location_key, shipper_type

Cube Definition Syntax in DMQL

o Cube definition (fact definition)
   define cube (cube_name) [dimension_list]: (measure_list)
   Example:
   define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollar
o Dimension definition (dimension table)
   define dimension (dimension_name) as ((attribute_or_subdimension_list))
   Example:
   define dimension branch as (branch_key, branch_name, branch_type)
o Special case (shared dimension table, as in a fact constellation)
   define dimension (dimension_name) as (dimension_in_first_cube) in cube (first_cube_name)

Defining Star Schema in DMQL

Example

define cube sales_star [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

Defining Snowflake Schema in DMQL

Example

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city(city_key, province_or_state, country))

Defining Fact Constellation in DMQL

Example

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollars_cost = sum(cost_in_dollars), units_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

Data Mining Concepts and Techniques - Sec 3.2.4

Measures of Data Cubes

o Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning.
   - E.g., count(), sum(), min(), max()
o Algebraic: it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
   - E.g., avg(), standard_deviation()
o Holistic: there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments that characterizes the computation.
   - E.g., median(), mode(), rank()
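The three categories can be checked on a small partitioned data set. The partitions below are made-up numbers, and the sketch only illustrates the definitions; it is not how a cube engine computes them.

```python
from statistics import median

# Hypothetical data split into three partitions (e.g., three sub-cubes).
partitions = [[1, 2, 3], [10, 20], [100]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)

# Algebraic: avg is derived from two distributive aggregates, sum and count.
partial_counts = [len(p) for p in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: combining partial medians does not, in general, give the median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(total == sum(all_values))        # True
print(median_of_medians, true_median)  # 15.0 6.5
```

For `sum` and `avg` a fixed number of values per partition is enough, while the exact median needs the underlying values themselves.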

Data Mining Concepts and Techniques - Fig 3.7

Concept Hierarchies

all:      all
country:  Canada; USA
state:    British Columbia ... Ontario (Canada); New York ... Illinois (USA)
city:     Vancouver ... Victoria (British Columbia); Toronto (Ontario); Buffalo ... New York (New York); Chicago (Illinois)
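A concept hierarchy like the one above can be encoded as parent lookups, and climbing it is then a re-keyed sum. In this sketch the sales figures and the `roll_up` helper are hypothetical; only the city, state and country names come from the slide.

```python
from collections import Counter

# city -> state and state -> country, following the location hierarchy above.
CITY_TO_STATE = {
    "Vancouver": "British Columbia", "Victoria": "British Columbia",
    "Toronto": "Ontario",
    "Buffalo": "New York", "New York": "New York",
    "Chicago": "Illinois",
}
STATE_TO_COUNTRY = {
    "British Columbia": "Canada", "Ontario": "Canada",
    "New York": "USA", "Illinois": "USA",
}


def roll_up(sales_by_city, level):
    """Aggregate city-level totals to 'city', 'state', 'country' or 'all'."""
    totals = Counter()
    for city, amount in sales_by_city.items():
        state = CITY_TO_STATE[city]
        key = {
            "city": city,
            "state": state,
            "country": STATE_TO_COUNTRY[state],
            "all": "all",
        }[level]
        totals[key] += amount
    return dict(totals)


# Hypothetical city-level sales figures
sales_by_city = {"Vancouver": 10, "Victoria": 5, "Toronto": 20, "Buffalo": 7, "Chicago": 8}
print(roll_up(sales_by_city, "country"))  # {'Canada': 35, 'USA': 15}
```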

Typical OLAP Operations

o Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
   - Roll up may be performed by removing one or more dimensions.
o Drill down (roll down): the reverse of roll-up, from higher-level summary to lower-level summary or detailed data.
   - Drill down may be performed by introducing new dimensions.
o Slice and dice: project and select.
   - Slice: selection on one dimension.
   - Dice: selection on two or more dimensions.
o Pivot (rotate): rotates the data axes.
   - Reorient the cube, visualization, 3D to a series of 2D planes.
o Other operations
   - Drill across: involving (across) more than one fact table.
   - Ranking the top N or bottom N items in lists.
   - Computing moving averages, growth rates, etc.

An OLAP engine is a powerful data analysis tool.
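The operations listed above can be sketched on a cube stored as a plain dictionary. The cell values and the helper names (`roll_up`, `dice`, `slice_`, `pivot`) are illustrative assumptions; a real OLAP server would work on precomputed cuboids rather than a Python dict.

```python
from collections import defaultdict

DIMS = ("item", "location", "quarter")
# Hypothetical base cuboid: (item, location, quarter) -> dollars_sold
cube = {
    ("TV", "USA", "Q1"): 100, ("TV", "USA", "Q2"): 150,
    ("TV", "Canada", "Q1"): 40, ("PC", "USA", "Q1"): 300,
    ("PC", "Canada", "Q2"): 60,
}


def roll_up(base, drop):
    """Roll up by dimension reduction: sum out the dimension named `drop`.

    Only valid on the base cuboid, whose keys follow the order in DIMS.
    """
    i = DIMS.index(drop)
    out = defaultdict(int)
    for key, value in base.items():
        out[key[:i] + key[i + 1:]] += value
    return dict(out)


def dice(base, **allowed):
    """Dice: keep cells whose value on each named dimension is in the allowed set."""
    return {
        key: value for key, value in base.items()
        if all(key[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def slice_(base, dim, value):
    """Slice: a selection on a single dimension."""
    return dice(base, **{dim: {value}})


def pivot(table_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in table_2d.items()}


by_item_location = roll_up(cube, "quarter")
q1 = slice_(cube, "quarter", "Q1")
usa_tv = dice(cube, item={"TV"}, location={"USA"})
rotated = pivot(by_item_location)
print(by_item_location[("TV", "USA")], rotated[("USA", "TV")])  # 250 250
```

Drill down is the inverse direction: it needs the more detailed cuboid, which is why the base data is kept.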


Data Mining Concepts and Techniques - Fig 3.11

A Starnet Query Model

[Figure: a starnet with one radial line per dimension and a footprint for each abstraction level.
 location: continent, country, province or state, city, street
 items: name, brand, category, type
 time: day, month, quarter, year]

- Lines represent a concept hierarchy for a dimension
- Each abstraction level is called a footprint

Starnet forms the basis of querying a multi-D model

Data Warehouse Architecture

o Design and Construction of Data Warehouse
o Three-tier architecture
o Warehouse servers for OLAP Processing


Design – A Business Analysis Framework

Why data warehouse for business analysts?

Competitive advantage – relevant information to measure performance and make critical adjustments.

Business Productivity – quickly and efficiently gather information that accurately describes the organization.

Customer relationship management – consistent view of customers and items across all lines of business, departments and all markets.

Cost reduction – tracking trends, patterns and exceptions over long period of time in a consistent and reliable manner.


Views for Design

o Top down View

Allows the selection of the relevant information necessary for the data warehouse.

The information matches the current and coming business needs.

o Data source View

Exposes information being captured, stored and managed by operational systems.

It is documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables.

Data sources are modeled using Entity-relationship model or CASE (Computer Aided Software Engineering) tools.

Views for Design (contd.)

o Data Warehouse View
   - It represents information that is stored inside the data warehouse, including pre-calculated totals and counts, as well as information regarding the source, date and time of origin, added to provide historical context.
o Business Query View
   - It is the view of the data in the data warehouse from the perspective of the end user.


Skill Sets

o Business Skills

o Technology Skills

o Program Management Skills

Design Process

o Top-down approach
   - Starts with overall design.
   - Technology is mature.
   - Business problems are clear and well understood.
o Bottom-up approach
   - Starts with experiments and prototypes.
   - Early stage of business modeling and technology development.
o Combined approach
   - Planned and strategic nature of the top-down approach.
   - Rapid implementation and opportunistic application of the bottom-up approach.

Software Engineering View of Design Process

o Steps in design and construction
   - Planning
   - Requirements study
   - Problem analysis
   - Warehouse design
   - Data integration and testing
   - Deployment of data warehouse

Software Engineering View of Design Process (contd.)

Development Methods

o Waterfall method
   - Performs structured and systematic analysis at each step before proceeding to the next.
o Spiral method
   - Involves rapid generation of functional systems with short intervals between releases.

The spiral model is a good choice for data warehouse development, especially for data marts.

General Steps in Warehouse Design Process

o Choose a business process to model.
o Choose the grain of the business process, e.g., individual transactions, snapshot.
o Choose the dimensions that will apply to each fact table record, e.g., time, item, customer, supplier, status.
o Choose the measures that will populate each fact table record, e.g., dollars_sold, units_sold.


Data Warehouse: A Multi-Tiered Architecture

[Figure: multi-tier architecture. Labels: Data Sources (Operational DBs, Other sources); Extract, Transform, Load, Refresh; Monitor & Integrator; Metadata; Data Storage (Data Warehouse, Data Marts); Serve; OLAP Engine (OLAP Server); Front-End Tools (Analysis, Query, Reports, Data mining).]


Data Warehouse Models

o Enterprise Warehouse

o Data Mart

o Virtual Warehouse

Enterprise Warehouse

o Collects all of the information about subjects spanning the entire organization.
o Provides corporate-wide data integration, from one or more operational systems or external information providers, and is cross-functional in scope.
o Can range in size from a few gigabytes to hundreds of gigabytes, terabytes or beyond.
o Implemented on traditional mainframes, UNIX super servers, or parallel architecture platforms.
o Requires extensive business modeling and may take years to design and build.

Data Mart

o Contains a subset of corporate-wide data that is of value to a specific group of users.
o The data in data marts tend to be summarized.
o Implemented on low-cost departmental servers that are UNIX or Windows/NT based.
o It may involve complex integration in the long run if its design and planning were not enterprise-wide.
o Depending on the source of data, data marts are either:
   - Independent data marts: data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographical area.
   - Dependent data marts: sourced directly from enterprise data warehouses.


Virtual Warehouse

o It is a set of views over operational databases.

o For efficient query processing, only some of the possible summary views may be materialized.

o It is easy to build but requires excess capacity on operational database servers.
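The difference between a plain view and a materialized summary can be sketched with `sqlite3`. SQLite has no materialized views, so a snapshot table stands in for one here; the `orders` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical operational table
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'East', 10.0), (2, 'East', 15.0), (3, 'West', 7.5);

-- Virtual warehouse: a summary view computed on demand from operational data
CREATE VIEW sales_by_region AS
    SELECT region, SUM(amount) AS total FROM orders GROUP BY region;

-- Stand-in for a materialized summary view: a stored snapshot of the view
CREATE TABLE sales_by_region_mat AS SELECT * FROM sales_by_region;
""")

# A new operational transaction arrives after the snapshot was taken.
conn.execute("INSERT INTO orders VALUES (4, 'West', 2.5)")

live = dict(conn.execute("SELECT region, total FROM sales_by_region"))
snapshot = dict(conn.execute("SELECT region, total FROM sales_by_region_mat"))
print(live)      # {'East': 25.0, 'West': 10.0}
print(snapshot)  # {'East': 25.0, 'West': 7.5}
```

The view is always current but re-runs the aggregation against the operational tables on every query, which is the extra load on operational servers that the slide mentions. The snapshot is cheap to read but must be refreshed.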

Data Warehouse Development: A Recommended Approach

[Figure: development path with the boxes "Define a high-level corporate data model", "Data Mart" (twice), "Distributed Data Marts", "Enterprise Data Warehouse" and "Multi-Tier Data Warehouse", connected by arrows labeled "Model refinement".]


Types of OLAP Servers

o Relational OLAP (ROLAP) servers
   - Intermediate servers standing in between a relational backend server and client front-end tools.
   - They use a relational or extended-relational DBMS to store and manage warehouse data.
   - They also include optimization for each DBMS backend and implementation of aggregation navigation logic.
o Multidimensional OLAP (MOLAP) servers
   - Support multidimensional views of data through array-based multidimensional storage engines.
   - They map multidimensional views to data cube array structures.
   - Data cubes allow fast indexing to precomputed summarized data.
   - The storage utilization may be low if the data is sparse.
   - Dense subcubes are identified and stored as array structures.
   - Sparse subcubes employ compression technology for efficient storage utilization.
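The storage trade-off described for MOLAP can be illustrated by holding the same cells in a dense flat array and in a sparse map. The cube shape and cell values are invented for the example, and the dictionary is only a stand-in for whatever compression a real server applies to sparse subcubes.

```python
# Hypothetical 3-D cube of shape (items, locations, quarters) with few non-empty cells.
SHAPE = (100, 50, 4)
cells = {(0, 0, 0): 12.0, (3, 7, 1): 5.5, (99, 49, 3): 1.0}


def dense_index(coord, shape=SHAPE):
    """Row-major offset of a cell in a flat array (array-based storage)."""
    i, j, k = coord
    return (i * shape[1] + j) * shape[2] + k


# Dense storage: one slot per possible cell, addressed by direct arithmetic.
dense = [0.0] * (SHAPE[0] * SHAPE[1] * SHAPE[2])
for coord, value in cells.items():
    dense[dense_index(coord)] = value

# Sparse storage: only the non-empty cells are kept.
sparse = dict(cells)

utilization = len(sparse) / len(dense)
print(len(dense), len(sparse))  # 20000 3
print(dense[dense_index((3, 7, 1))], sparse[(3, 7, 1)])  # 5.5 5.5
```

The dense array gives constant-time addressing but wastes almost all of its slots here, which is why only dense subcubes are worth storing as arrays.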

Types of OLAP Servers (contd.)

o Hybrid OLAP (HOLAP) servers
   - Combine ROLAP’s scalability and MOLAP’s fast computation.
   - HOLAP may allow large volumes of detail data to be stored in a relational database.
   - Aggregations are kept in a separate MOLAP store.
   - Microsoft SQL Server 7.0 supports a hybrid OLAP server.
o Specialized SQL servers
   - Provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.


OLAP Reporting tool for Excel

Cited from www.kalmstrom.nu Kalmstrom.nu Outlook Solutions

[Screenshot of OLAP Reporting Tool, with the following callouts]

- This list contains the saved report views. The views contain only layout options, no data.
- To the right you see the current data displayed in the format defined in the report view.
- The graph part of OLAP Reporting Tool works like an Excel chart.
- Select which information you want to see.
- The pivot part of OLAP Reporting Tool. It works very much like an Excel pivot table.
- Saves the current graph as a .gif file.

Anywhere the dropdown symbol is displayed you can filter the information. By simply clicking the dropdown and selecting one or more checkboxes you can change what information is being displayed. In the example above it is possible to filter all of the fields in the red circles. For example, I could do a filter to only show the items sold in Zacatecas and Veracruz with four clicks:
1. De-select the All checkbox
2. Select the Mexico Central checkbox (all three regions within Mexico Central will be selected)
3. De-select the DF region
4. Press OK

You can very easily drill down to find data on lower levels. Both the areas circled in red can be used to see the sales figure per type of promotion in the Sunday Paper as in the example here. Another very common example of drilldown is to see the values per month from a per quarter view. To do drilldown in the pivot view, simply click the + signs. In the graph you will need to right click on the category you want to expand or drill into. (Only possible with Excel 2002 or later.)

These are the basic steps for creating a multi-graph.

[Screenshots: steps 1 to 3]

A new area is shown. Drag fields into it to create a multi-graph.


The multi-graph feature is quite unique and is easy to create in OLAP Reporting. To do it in Excel is more complicated.

Technical Paper

Analyzing Large Collections of Electronic Text Using OLAP

Steven Keith, Owen Kaser
University of New Brunswick

July 11, 2005

- Maduri Rajan Narasimhan

WoW

Creation of user-driven tools to interface with a (Data) Warehouse of Words (WoW) is needed.

A WoW is built by an Extraction, Transformation, and Loading (ETL) procedure, which processes the text and aggregates data from different sources.

A WoW stores its data in data cubes.

A data cube can be abstracted as a k-dimensional array with several predefined operations such as slicing, dicing, rolling up and drilling down.

These operations allow the user to focus on just some subset of the data, at the desired granularity.

On-Line Analytical Processing (OLAP) provides near constant-time answers to queries over large multidimensional data sets.


OLAP

OLAP is especially applicable when many aggregate queries such as sum and average are of interest.

Thus, data warehouses and OLAP have been used widely in business applications.

The main advantage a user-driven OLAP tool would provide is flexibility.

While IR and Artificial Intelligence tools are well suited to their single function, a user-driven tool gives a wide variety of users the freedom to pursue their individual research.

A simple user-driven application is the most reasonable solution for those users not already accustomed to writing their own MDX or SQL queries.


Practical Applications

User-driven analytical tools are used in the humanities for author attribution, lexical analysis, and stylometric analysis.

Author attribution is determining the authorship of an anonymous piece of writing through various stylistic and statistical methods.

Lexical analysis includes many measurements of vocabulary usage such as Type-Token Ratio, Number of Different Words and Mean Word Frequency.

Stylometric analysis not only considers the words in use but also accounts for other statistical elements of style such as word length, sentence length, use of punctuation and many other features.

Analogies of the form "A is to B as C is to D" can be characterized by co-occurrences: two words connected by a joining word such as "has", "on", and "with" (64 joining words were initially proposed).
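The lexical measures named on this slide are simple to compute once a text is tokenized. The sketch below assumes the common definitions (type-token ratio = distinct words / total words, mean word frequency = total words / distinct words) and a deliberately crude tokenizer. Neither is taken from the paper.

```python
import re
from collections import Counter


def lexical_stats(text):
    """Compute basic vocabulary measures for a non-empty text."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    return {
        "tokens": n_tokens,
        "number_of_different_words": n_types,
        "type_token_ratio": n_types / n_tokens,
        "mean_word_frequency": n_tokens / n_types,
    }


stats = lexical_stats("The cat sat on the mat. The dog sat too.")
print(stats["tokens"], stats["number_of_different_words"])  # 10 7
print(stats["type_token_ratio"])  # 0.7
```

In a warehouse setting these values would be precomputed per chapter or book during the transformation stage rather than on demand.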

WoW Creation

Creation involves the three stages of ETL.

Extraction: The extraction involves the plain text and XML documents of Project Gutenberg, a large corpus of literary works that is not in a suitable form for immediate analysis.

Transformation: The transformation phase will involve the calculation of all data that will be stored in the WoW such as word frequency, punctuation frequency, and sentence lengths.

Loading: The loading phase will involve the actual creation and storage of the data cubes containing the calculated items.

Issues to be handled: At times data, such as the author’s nationality, is missing and must be handled.

Also, new books are added to corpora daily, and a means for loading these new books into the WoW must be created.

WoW Schema

The main strength of an OLAP application is its efficient evaluation of aggregate queries across several dimensions and at different levels of granularity.

The “book” hierarchy maintains its finest detail at the level of chapters.

WoW Schema (contd.)

The year of publication may be generalized to a literary era (e.g., Victorian); alternatively, the year may be generalized to decade and then to century.

Several natural generalizations may help word studies.

Alternatively, words can be grouped according to their final suffix.
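The year generalizations can be written as a small mapping function. The century convention (1801 to 1900 is the 19th century) and the single era entry are assumptions made for this sketch; a real WoW would need a fuller, curated era table.

```python
# Hypothetical era table: name -> (first year, last year), inclusive.
ERAS = {"Victorian": (1837, 1901)}


def generalize_year(year):
    """Map a publication year to its decade, century and (if known) literary era."""
    decade = year // 10 * 10
    century = (year - 1) // 100 + 1  # assumes 1801-1900 counts as the 19th century
    era = next(
        (name for name, (start, end) in ERAS.items() if start <= year <= end),
        None,
    )
    return {"year": year, "decade": f"{decade}s", "century": century, "era": era}


print(generalize_year(1859))
# {'year': 1859, 'decade': '1850s', 'century': 19, 'era': 'Victorian'}
```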

WoW Schema (contd.)

Finally, tools such as Signature allow user-specified word lists. Given a set of “interesting” word stems, a stemmed word can be classified as belonging to one of the user’s lists or belonging to no list.

These hierarchies allow for rollup queries (essentially generalizations) to be evaluated.

Instead of finding the frequent words used in a chapter or book, one might be interested in the frequent words used by an author or used in a time period.

To support the initial stylometric, analogy, and phrase-use queries, the WoW contains several cubes.

Sentence Style (Book × Word × WordCount × CommaCount × ColonSemicolonCount × StopwordCount → OccurrenceCount).

Conclusion

Each “Count” is an integer, and the Word dimension represents the first word in a sentence.

Short Phrase (Book × Word × Word × Word × Word → OccurrenceCount).

The cube records all sequences of 4 words, and it could be used to explore common (or rare) phrases by authors or time periods.
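A Short Phrase cube of this shape can be sketched as a counter keyed by (book, word, word, word, word). The whitespace tokenizer, the sample sentence and the function name are assumptions for illustration; the paper's actual ETL pipeline is not reproduced here.

```python
from collections import Counter


def short_phrase_cube(books):
    """Map (book, w1, w2, w3, w4) -> occurrence count for every 4-word sequence."""
    cube = Counter()
    for book, text in books.items():
        words = text.lower().split()  # crude tokenizer, assumed for the sketch
        for i in range(len(words) - 3):
            cube[(book, *words[i:i + 4])] += 1
    return cube


# Hypothetical one-book corpus
books = {"book_a": "it was the best of times it was the worst of times"}
cube = short_phrase_cube(books)

# Roll up over the last Word dimension: counts of 3-word prefixes per book.
prefix = Counter()
for (book, w1, w2, w3, _w4), n in cube.items():
    prefix[(book, w1, w2, w3)] += n

print(cube[("book_a", "it", "was", "the", "best")])  # 1
print(prefix[("book_a", "it", "was", "the")])        # 2
```

Replacing the book key with an author or a decade would give the roll-ups by author or time period mentioned above.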

These cubes will allow for many queries to be evaluated and would aid in all of the practical applications as well as a variety of other studies.


Thank you !