1. Web Usage Mining
Modelling: Session analysis, OLAP,
frequent-pattern mining I (association rules)
- Prof. Dr. Bettina Berendt
- Humboldt Univ. Berlin, Germany
2. Please note
- These slides use and/or refer to a lot of material available on
the Internet. To reduce clutter, credits and hyperlinks are given
in the following ways:
- Slides adapted from other people's materials: at bottom of
slide
- Pictures, screenshots etc.: URL visible in screenshot or given
in PPT Comments field
- Literature, software: On the accompanying Web
site http://vasarely.wiwi.hu-berlin.de/WebMining07/
- Thanks to the Internet community!
- You are invited to re-use these materials, but please give the
proper credit.
3. Stages of knowledge discovery discussed in this lecture
Application understanding
4. Agenda
- Different levels of analysis
- Session analysis & OLAP: Case study
- Frequent itemsets & association rules: Method
- Association rules: Case study
5. Basic Framework for E-Commerce Data Analysis
[Diagram: Web Usage and E-Business Analytics. Components: Operational
Database (customers, orders, products); Content Analysis Module;
Web/Application Server Logs; Data Cleaning / Sessionization Module;
Site Map; Site Dictionary; Integrated Sessionized Data; Data
Integration Module; E-Commerce Data Mart; Data Mining Engine; OLAP
Tools; Session Analysis / Static Aggregation; Pattern Analysis; OLAP
Analysis; Site Content; Data Cube]
6. Different levels of analysis
- Static Aggregation and Statistics
7. Session Analysis
- Simplest form of analysis: examine individual or groups of
server sessions and e-commerce data.
- Gain insight into typical customer behaviors.
- Trace specific problems with the site.
8. Static Aggregation (Reports)
- Most common form of analysis.
- Data aggregated by predetermined units such as days or
sessions.
- Generally gives most bang for the buck.
- Gives quick overview of how a site is being used.
- Minimal disk space or processing power required.
- No ability to dig deeper into the data.
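Static aggregation by predetermined units can be illustrated with a few lines of code. This is a minimal sketch, assuming cleaned log records of the form (session ID, ISO timestamp, URL); the records, field layout and function names are hypothetical and not taken from the lecture.

```python
from collections import Counter
from datetime import datetime

# Hypothetical cleaned log records: (session_id, ISO timestamp, URL).
requests = [
    ("s1", "2002-03-01T10:00:05", "/home"),
    ("s1", "2002-03-01T10:01:10", "/catalog"),
    ("s2", "2002-03-01T23:59:00", "/home"),
    ("s2", "2002-03-02T00:00:30", "/product/42"),
    ("s3", "2002-03-02T09:15:00", "/home"),
]

def requests_per_day(records):
    """Aggregate page requests by calendar day (the predetermined unit)."""
    return Counter(
        datetime.fromisoformat(ts).date().isoformat() for _, ts, _ in records
    )

def requests_per_session(records):
    """Aggregate page requests by session."""
    return Counter(session_id for session_id, _, _ in records)

per_day = requests_per_day(requests)
per_session = requests_per_session(requests)
```

The report is cheap to compute, but once the counts are stored the individual requests are gone, which is the "no ability to dig deeper" limitation.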
9. Data Mining: Going deeper
- Prediction of next event: Sequence mining, Markov chains
- Discovery of associated events or application objects:
Association rules
- Discovery of visitor groups with common properties and
interests: Clustering
- Discovery of visitor groups with common behaviour: Session
Clustering
- Characterization of visitors with respect to a set of
predefined classes: Classification
- Card fraud detection
10. KDD Techniques for Web Applications: Examples (1)
- Calibration of a Web server:
- Prediction of the next page invocation over a group of
concurrent Web users under certain constraints
- Sequence mining, Markov chains
- Cross-selling of products:
- Mapping of Web pages/objects to products
- Discovery of associated products
- Association rules, Sequence Mining
- Placement of associated products on the same page
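Next-page prediction with a Markov chain can be sketched as follows. This is a minimal first-order illustration, assuming sessions are already available as lists of page names; the sample sessions and function names are hypothetical, and a real server calibration would need more data and probably higher-order models.

```python
from collections import Counter, defaultdict

# Hypothetical sessions, each a sequence of page names.
sessions = [
    ["home", "catalog", "product", "cart"],
    ["home", "catalog", "product", "catalog"],
    ["home", "search", "product", "cart"],
]

def fit_transitions(sessions):
    """Count transitions page -> next page over all sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts

def transition_probabilities(counts, page):
    """Estimate P(next page | current page) from the transition counts."""
    row = counts.get(page, Counter())
    total = sum(row.values())
    return {nxt: n / total for nxt, n in row.items()} if total else {}

def predict_next(counts, page):
    """Return the most frequent successor of a page, or None if unseen."""
    row = counts.get(page)
    return row.most_common(1)[0][0] if row else None

counts = fit_transitions(sessions)
probs = transition_probabilities(counts, "product")
```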
11. KDD Techniques for Web Applications: Examples (2)
- Sophisticated cross-selling and up-selling of products:
- Mapping of pages/objects to products of different price
groups
- Identification of Customer Groups
- Clustering, Classification
- Discovery of associated products of the same/different price
categories
- Association rules, Sequence Mining
- Formulation of recommendations to the end-user
- Suggestions on associated products
- Suggestions based on the preferences of similar users
12. Agenda
- Different levels of analysis
- Session analysis & OLAP: Case study
- Frequent itemsets & association rules: Method
- Association rules: Case study
13. Worldwide usability example
14. The application context
15. Motivation & Application understanding
- Use of the Internet as an information source
- Ease of finding information
- Personal and situational variables
16. Motivation & Application understanding
- Use of the Internet as an information source
- Ease of finding information
- Personal and situational variables
- International eHealth website
17. Motivation & Application understanding
- Use of the Internet as an information source
- Ease of finding information
- Personal and situational variables
- International eHealth website
How does the user's background affect information-seeking
behaviour?
18. Outline of the KDD process
- Session IDs; usual data cleaning steps
- Linking of sessions & questionnaire information
(anonymized)
- Modelling / pattern discovery:
- Session analysis, static aggregation (non-hierarchical) OLAP;
frequent subgraph mining (hierarchical) [this is not shown
today]
- Evaluation: Correlation analysis, significance tests;
interesting patterns
- Appl. underst.: search behaviour, linguistic & expertise
theory
- Web server sessions, questionnaire
- Data understanding main step:
- modelling the semantics of the site in terms of a hierarchy of
service concepts
19. First step: What do people do there?
20. A sample session (sequence of URLs only)
/doia/mainmenu.asp?zugr=d&lang=e
/doia/dbrowser.asp?zugr=d&lang=e&benr=A
/doia/dbrowser.asp?zugr=d&lang=e&benr=A_6
/doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
/doia/image.asp?zugr=d&lang=e&cd=4&nr=66&diagnr=695890
/doia/image.asp?zugr=d&lang=e&cd=3&nr=95&diagnr=690030
/doia/dbrowser.asp?zugr=d&lang=e&benr=A_6_4
/doia/image.asp?zugr=d&lang=e&cd=7&nr=40&diagnr=287000
/doia/image.asp?zugr=d&lang=e&cd=5&nr=11&diagnr=690040
/doia/image.asp?zugr=d&lang=e&cd=4&nr=68&diagnr=695200
/doia/image.asp?zugr=d&lang=e&cd=5&nr=85&diagnr=287000
/doia/diagnose.asp?zugr=d&lang=e&diagnr=287000&topic=dd
/doia/diagnose.asp?lang=e&zugr=d&diagnr=693010
/doia/diagnose.asp?lang=e&zugr=d&diagnr=710022
/doia/diagnose.asp?zugr=d&lang=e&diagnr=710022&topic=i
/doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
/doia/image.asp?zugr=d&lang=e&cd=6&nr=96&diagnr=287000
/doia/image.asp?zugr=d&lang=e&cd=27&nr=99&diagnr=287000
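A session like this one can be turned into a concept sequence by classifying each URL. The following is a minimal sketch, assuming rules inferred only from the example mappings shown under Transformation I (script name plus the `topic` parameter); the function name and the `OTHER` fallback are hypothetical, and the mapping used in the actual study may have been richer.

```python
from urllib.parse import urlsplit, parse_qs

def url_to_concept(url):
    """Map a request URL to a page concept (rules inferred from the examples)."""
    parts = urlsplit(url)
    script = parts.path.rsplit("/", 1)[-1]
    topic = parse_qs(parts.query).get("topic", [""])[0]
    if script == "mainmenu.asp":
        return "START"
    if script == "dbrowser.asp":
        return "SUCHE"       # search
    if script == "image.asp":
        return "D_BILD"      # picture of a diagnosis
    if script == "diagnose.asp":
        # topic=dd -> differential diagnosis, topic=i -> info, else plain text
        return {"dd": "D_DD", "i": "D_INFO"}.get(topic, "D_TEXT")
    return "OTHER"           # hypothetical fallback, not a concept from the deck

session = [
    "/doia/mainmenu.asp?zugr=d&lang=e",
    "/doia/dbrowser.asp?zugr=d&lang=e&benr=A",
    "/doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000",
    "/doia/diagnose.asp?zugr=d&lang=e&diagnr=287000&topic=dd",
    "/doia/diagnose.asp?lang=e&zugr=d&diagnr=693010",
    "/doia/diagnose.asp?zugr=d&lang=e&diagnr=710022&topic=i",
]
concept_sequence = [url_to_concept(u) for u in session]
```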
21. Session --> URL sequence --> URL graph (individualised site/web map)
Key for reading individualised site maps / individualised Web maps
22. Transformation I: Mapping the URLs into a concept hierarchy
governed by media types (SUCHE = search; D_ ... = sub-concepts of
information on diagnoses; BILD = picture, DD = differential
diagnosis)
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=6&nr=96&diagnr=287000
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
D_INFO --> /doia/diagnose.asp?zugr=d&lang=e&diagnr=710022&topic=i
D_TEXT --> /doia/diagnose.asp?lang=e&zugr=d&diagnr=710022
D_TEXT --> /doia/diagnose.asp?lang=e&zugr=d&diagnr=693010
D_DD --> /doia/diagnose.asp?zugr=d&lang=e&diagnr=287000&topic=dd
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=85&diagnr=287000
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=4&nr=68&diagnr=695200
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=11&diagnr=690040
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=7&nr=40&diagnr=287000
SUCHE --> /doia/dbrowser.asp?zugr=d&lang=e&benr=A_6_4
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=3&nr=95&diagnr=690030
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=4&nr=66&diagnr=695890
D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
SUCHE --> /doia/dbrowser.asp?zugr=d&lang=e&benr=A_6
SUCHE --> /doia/dbrowser.asp?zugr=d&lang=e&benr=A
START --> /doia/mainmenu.asp?zugr=d&lang=e
23. Session --> concept sequence --> concept graph (here: focus on
main modality of requested page)
24. Transformation II: Mapping URLs to concepts in an ontology
(based on a standard medical ontology + standard search behaviour
classifications)
[Figure labels: Alphabetical search; Diagnosis 21002; Diagnosis info;
TOP; Search]
25. Session --> concept sequence --> concept graph (here: focus on
navigation structure)
26. Session --> concept sequence --> concept graph (here: focus on
content)
27. The impact of
language (a simplified view)
- Native speakers (L1): lower cognitive effort; higher preference
for ACTIVE LANGUAGE use (compared to L2 users)
- Non-native speakers (L2): higher cognitive effort; higher
preference for PASSIVE LANGUAGE use (compared to L1 users)
Different search options correspond, to varying degrees, to these
preferences, e.g. H2: Native speakers prefer search engines more
than non-native speakers. * For further hypotheses see Kralisch
& Berendt CATAC04
28. The impact of medical knowledge (a simplified view)
Different search options correspond, to varying degrees, to a
patient's or physician's knowledge and search goals, e.g.
- Physicians and patients differ:
- in their knowledge of medical terms
- in their perceptions/differentiation of disease symptoms
- in other aspects of knowledge about diseases (e.g. Where does
the disease occur?)
H3: Physicians prefer search engines more than patients. * For
further hypotheses see Kralisch & Berendt 2005
29. Operationalisation and the resulting OLAP cube (in this study:
no drill-down)
- 1. Operationalisation of search preference
- Which search option was (not) used?
- In combination with which other search options was the search
option used?
- In which order were the search options used?
- Number of page requests prior to access of search option.
- Frequency of use of search option.
- Frequency of use of search option per page request.
- Factor analysis applied for item reduction.
- 2. Operationalisation of culture
- Cultural index scores by Hofstede
- Control through 5 questions regarding cultural items
- 3. Operationalisation of language
- Native speakers vs. non-native speakers according to answers in
questionnaire
- Control of proficiency level in non-native language
- 4. Operationalisation of medical knowledge
- Physicians vs. patients according to answers in
questionnaire
OLAP cube dimensions: language, expertise, search option use
30. Information Seeking Behaviour: Types (results) and
characteristics (background knowledge)
Characteristics of 2 main types of information seeking behaviour
- name of disease required (predominantly goal oriented)
- little context information
- predominantly exploratory search behaviour
- highest amount of context information
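The OLAP cube from the operationalisation slide, with the dimensions language, expertise and search option use, can be sketched as a table of counts with roll-up and slice operations. This is a minimal illustration only: the facts below are invented, and the dimension values and function names are assumptions rather than the study's actual coding.

```python
from collections import Counter

DIMS = ("language", "expertise", "search_option")

# Hypothetical facts, one per session: (language, expertise, search option used).
facts = [
    ("L1", "physician", "search engine"),
    ("L1", "patient", "search engine"),
    ("L1", "patient", "alphabetical"),
    ("L2", "patient", "content-organized"),
    ("L2", "patient", "content-organized"),
    ("L2", "physician", "search engine"),
]

# The cube: one cell per combination of dimension values, measure = count.
cube = Counter(facts)

def roll_up(cube, keep):
    """Aggregate away every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = Counter()
    for cell, n in cube.items():
        out[tuple(cell[i] for i in idx)] += n
    return out

def slice_cube(cube, dim, value):
    """Keep only the cells where dimension `dim` has the given value."""
    i = DIMS.index(dim)
    return Counter({cell: n for cell, n in cube.items() if cell[i] == value})

by_language_and_option = roll_up(cube, ("language", "search_option"))
physicians_only = slice_cube(cube, "expertise", "physician")
```

Because the study used no drill-down, the sketch has no dimension hierarchies; each dimension is a flat list of values.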
31. The impact of language and domain knowledge on search option
choice
- 2 studies on the use of search options in the eHealth
site:
- Webserver log: 3 928 235 requests / 277 809 sessions from 188
countries
- 83.2% first-language users, 16.8% second-language users
- Webserver log + Questionnaire: 165 (106) people from 34
countries
- 84.9% first-language users, 15.1% second-language users
- 10.4% physicians, 89.6% patients
- Search engine, alphabetical search: in particular
first-language users, physicians
- Content-organized search: in particular second-language
patients
- Domain knowledge compensates for limited language
knowledge.
[Kralisch & Berendt, New Review of Hypermedia and
Multimedia, 2005]
32. Evaluation: Recommendations for website
design
- Patients and Physicians have different preferences in their
information seeking behaviour
- Native speakers and non-native speakers have different
preferences in their information seeking behaviour
- Native language information is more important for patients than
for physicians
- In some regions, offering information in their native language
is more important than in others
- Realisation in website design through:
- e.g. separate website areas for patients and physicians
- e.g. highlighting of more appropriate search options
- e.g. terminological support for non-native speakers, especially
patients
- e.g. adapted versions for groups of countries
33. Agenda
- Different levels of analysis
- Session analysis & OLAP: Case study
- Data mining: going deeper
- Association rules: Case study
34. 2. The structural/algorithmic part of knowledge discovery
(modelling in CRISP-DM): Patterns, data mining tasks, methods
(ex.)
- K-means, EM, hierarchical clustering, ...
- Link patterns (e.g., citation analysis à la Google)
- Bayes techniques, Decision trees, Support Vector Machines,
...
- Frequent itemsets, sequences, subgraphs
- Apriori and methods derived from it
- Cliques (Web Communities)
35. Global patterns, local patterns
outlook=sunny temperature=hot 2 ==> humidity=high 2
Global: Describes all instances
Local: Describes some instances
36.
37. Agenda
- Different levels of analysis
- Session analysis & OLAP: Case study
- Frequent itemsets & association rules: Method
- Association rules: Case study
38. For an excellent introduction,
see ...
- Coenen, F. (2003). Association rule mining and its wider
context. AI2003 Association Rule Mining Tutorial, Cambridge,
December 2003.
- http://www.csc.liv.ac.uk/~frans/KDD/Tutorials/tutorialAI2003.ppt
- What is an association rule?
- What are interestingness measures for association rules?
- support, confidence, lift (there are also further
measures)
- cf. the performance measures recall, precision, etc. for
classifiers
- How is association-rule mining performed?
- the basic Apriori algorithm
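The three questions above can be made concrete with a small sketch of the basic Apriori idea and the measures support, confidence and lift. It assumes the usual textbook definitions: support(X) is the fraction of transactions containing X, confidence(X ==> Y) = support(X and Y) / support(X), and lift = confidence / support(Y). The transactions and function names are hypothetical, and the code favours readability over efficiency.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        return sum(itemset <= t for t in tsets) / n

    level = {frozenset([item]) for t in tsets for item in t}
    frequent = {}
    k = 1
    while level:
        current = {c: support(c) for c in level}
        current = {c: s for c, s in current.items() if s >= min_support}
        frequent.update(current)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent

def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    found = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = supp / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    found.append((antecedent, consequent, supp, confidence, lift))
    return found

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.9)
```

With these four transactions the only rule that passes the confidence threshold is butter ==> bread; its lift above 1 indicates that butter buyers take bread more often than the average customer.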
39. Agenda
- Different levels of analysis
- Session analysis & OLAP: Case study
- Frequent itemsets & association rules: Method
- Association rules: Case study
40. CRM questions example: Why go to
a shop ...
- ... if everything is available on the Internet?
41. A multi-channel retailer, its business goals, and analysis
questions
- General goals: Standard e-tailer goals: attract users/shoppers
and convert them into customers
- Specific goals: assess the success of the Web site in relation
to other distribution channels
- Questions of the evaluation:
- What business metrics can be calculated from Web usage data,
transaction and demographic data for determining online
success?
- Are there cross-channel effects between a company's e-shop and its
physical stores?
Background: Internet market shares [BCG 2002]
42. The site
43. Outline of the KDD process
- Session IDs; usual data cleaning steps
- Linking of sessions & transaction information
(anonymized)
- Modelling / pattern discovery:
- Web metrics, cluster analysis, association rules, sequence
mining + correlation analysis, questionnaire study, qualitative
market analysis
- Evaluation: Interesting patterns
- Business underst.: customer buying process
- Web server sessions, transaction info.
- Data understanding main step:
- modelling the semantics of the site in terms of a hierarchy of
service concepts
44. Agenda Case Study
- Business Understanding
- Data understanding and preparation
- Pattern discovery + evaluation: Success metrics
- Pattern disc. + eval.: Behavioural patterns
- Pattern disc. + eval.: User types
- Pattern disc. + eval.: Behaviour & demographics
45. Agenda Case Study
- Business Understanding
- Data understanding and preparation
- Pattern discovery + evaluation: Success metrics
- Pattern disc. + eval.: Behavioural patterns
- Pattern disc. + eval.: User types
- Pattern disc. + eval.: Behaviour & demographics
46. Description of the site and its services
- The retailer operates an e-shop and more than 5000 retail shops
in over 10 European countries
- It sells a wide range of consumer electronics
- Online customers can pay, pick-up/deliver and return both
online and offline
- Web pages provide for all tasks in the customer buying
process
47. Purchase Phases (Page Concepts) at Large MC Retailers
- 1. Acquisition (home): All Web pages that are semantically related to
the initial acquisition of a visitor
[Diagram: Home (Acquisition)]
48. Purchase Phases (Page Concepts) at Large MC Retailers
- 2. Catalogue information: pages providing an overview
of product categories.
[Diagram: Home (Acquisition), Product Impression]
49. Purchase Phases (Page Concepts) at Large MC Retailers
- 3. Information product (infprod): pages displaying
information about a specific product
[Diagram: Home (Acquisition), Product Impression, Product
Click-Through]
50. Purchase Phases (Page Concepts) at Large MC Retailers
- 4. Offline information (offinfo): All pages related to
any offline information: store locator (pages for finding physical
stores in one's neighbourhood), information about offline services,
offline referrers etc.
[Diagram: Home (Acquisition), Product Impression, Product
Click-Through, Offline info]
51. Purchase Phases (Page Concepts) at Large MC Retailers
- 5. Transaction (transact): steps
before an actual purchase, starting with a customer entering the
order process: check-out, input of customer data, payment and
delivery preferences (online or offline), etc.
[Diagram: Home (Acquisition), Product Impression, Product
Click-Through, Offline info, Transaction]
52. Purchase Phases (Page Concepts) at Large MC Retailers
- 6. Purchase: indicates whether a visitor completed the
transaction process and bought a product, e.g. invocation of an
order confirmation page.
[Diagram: Home (Acquisition), Product Impression, Product
Click-Through, Offline info, Transaction, Purchase]
53. Agenda Case Study
- Business Understanding
- Data understanding and preparation
- Pattern disc. + eval.: Behavioural patterns
54. Data and data preparation
- 92,467 sessions from the company's Web logs from 21 days in
2002
- anonymized transaction information of 13,653 customers who
bought online over a period of 8 months in 2001/02.
- 621 transaction records (21 days) were linked to Web-usage
records
- Sessions were determined by session IDs
- Robot visits eliminated, usual data cleaning steps
- Each URL request mapped to a service concept from {c_1, ..., c_n}
- Session representation: s = [w_1, ..., w_n], with w_i = weight of
c_i, indicating whether or not the concept was visited (1/0), or how
often it was visited
- Customer record: feature vector incl. session and transaction
data
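The session representation s = [w_1, ..., w_n] can be sketched in a few lines. This is a minimal illustration, assuming each session has already been mapped to a sequence of concept names; the concept abbreviations are the ones used later in the deck, while their order, the sample session and the function names are assumptions.

```python
# Concept abbreviations as used in the deck; the column order is an assumption.
CONCEPTS = ["home", "infcat", "infprod", "offinfo", "transact", "service", "purch"]

def weighted_vector(concept_sequence, concepts=CONCEPTS):
    """w_i = how often concept c_i was visited in the session."""
    return [concept_sequence.count(c) for c in concepts]

def dichotomized_vector(concept_sequence, concepts=CONCEPTS):
    """w_i = 1 if concept c_i was visited at least once, else 0."""
    return [1 if c in concept_sequence else 0 for c in concepts]

# Hypothetical session, already mapped from URLs to concepts.
session = ["home", "infcat", "infprod", "infprod", "transact", "purch"]

weighted = weighted_vector(session)
dichotomized = dichotomized_vector(session)
```

The dichotomized vector discards visit frequency, which makes sessions of very different length comparable at the cost of losing intensity information.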
55. Site semantics: A service concept hierarchy
760,535 page requests were mapped onto the concepts from this
hierarchy:
[Figure: service concept hierarchy rooted at "Any". Node labels:
Information, Transaction, Services, Information Product,
Fulfillment/Service, Customer Data, Shopping Cart, Payment, Company
Infos, Registration, Other, Acquisition, Offline Referrer, Advertiser,
Other, Store Locator, Information Catalog, Home, Game, Offline Service
and Support. Legend: marked nodes = Multi-Channel Concept]
56. Types of patterns
- Conversion rates (~ confidence of content-specified sequential
association rules) for assessing business success
- Association rule and sequence analysis for understanding
online/offline preferences and their temporal development
- Cluster analysis for customer segmentation
- Correlation analysis for investigating the relationship between
demographic indicators and online/offline preferences
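The reading of a conversion rate as the confidence of a sequential rule can be sketched as follows. This is one plausible interpretation, not the study's exact metric: among the sessions that visit a source concept, it takes the fraction that visit the target concept afterwards. The sessions and the function name are hypothetical.

```python
def conversion_rate(sessions, source, target):
    """Confidence of the sequential rule source --> target over sessions."""
    reached_source = 0
    converted = 0
    for sequence in sessions:
        if source in sequence:
            reached_source += 1
            first = sequence.index(source)
            # The target only counts if it occurs after the first source visit.
            if target in sequence[first + 1:]:
                converted += 1
    return converted / reached_source if reached_source else 0.0

# Hypothetical concept sequences, one per session.
sessions = [
    ["home", "infprod", "transact", "purch"],
    ["home", "infprod"],
    ["infprod", "offinfo"],
    ["home", "infcat"],
]

product_to_purchase = conversion_rate(sessions, "infprod", "purch")
home_to_product = conversion_rate(sessions, "home", "infprod")
```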
57. Session representation
- Each session represented as a feature vector on the
multi-channel concepts
- Two methods used for definition of new conversion metrics:
- weighted-concept method (number of visits to a concept)
- dichotomized concept method (whether or not concept was
visited)
[Table, weighted-concept method: rows Session A, B, ...; columns home,
infcat, infprod, offinfo, transact, service, purch.; extracted cell
values, alignment not recoverable: 2 0 0 0 5 3 1 | 1 7 3 0 0 2 4]
[Table, dichotomized concept method: same rows and columns; extracted
cell values, alignment not recoverable: 1 0 0 0 1 1 1 | 1 1 1 0 0 1 1]
58. Agenda Case Study
- Business Understanding
- Data understanding and preparation
- Pattern disc. + eval.: Behavioural patterns
59. Internal consistency of preferences: payment and delivery
preferences
- Online payment --> Direct delivery (s=0.27, c=0.97): < 1/3
traditional online users!
- Online payment --> In-store pickup (s=0.02, c=0.03)
- Cash on delivery --> Direct delivery (s=0.02, c=0.03)
- In-store payment --> In-store pickup (s=0.69, c=0.94)
- Site is primarily used to collect information.
s: support, c: confidence of the sequence
60. Internal consistency of preferences: return preferences
- Return --> In-store (s=0.06, c=0.87)
- Return --> Mail-in (s=0.04, c=0.13)
- Customers may wish personal assistance.
- (a result supported by the service mix analysis of different
multi-channel retailers and by questionnaire results)
61. Development of preferences over time
- Direct delivery --> In-store pickup in 1 following transaction
(s=0.001, c=0.15)
- Direct delivery --> Direct delivery in all following transactions
(s=0.003, c=0.85)
- In-store pickup --> Direct delivery in 1 foll. transaction (s=0.001,
c=0.10) (*)
- In-store pickup --> In-store pickup in all foll. transactions
(s=0.004, c=0.90)
- Results for payment migration are similar.
- 90% of repeat customers did not change transaction preferences
at all.
- Rule (*) as an indicator of the development of trust?!
62. Thank you!