1. Web Usage Mining. Modelling: Session analysis, OLAP, frequent-pattern mining I (association rules)

  • Prof. Dr. Bettina Berendt
  • Humboldt Univ. Berlin, Germany
  • www.berendt.de
2. Please note

  • These slides use and/or refer to a lot of material available on the Internet. To reduce clutter, credits and hyperlinks are given in the following ways:
    • Slides adapted from other people's materials: at bottom of slide
    • Pictures, screenshots etc.: URL visible in screenshot or given in PPT Comments field
    • Literature, software: On the accompanying Web site http://vasarely.wiwi.hu-berlin.de/WebMining07/
  • Thanks to the Internet community!
  • You are invited to re-use these materials, but please give the proper credit.

3. Stages of knowledge discovery discussed in this lecture

  • Application understanding

4. Agenda

  • Different levels of analysis
  • Session analysis & OLAP: Case study
  • Frequent itemsets & association rules: Method
  • Association rules: Case study

5. Basic Framework for E-Commerce Data Analysis

[Diagram "Web Usage and E-Business Analytics"; labels: Operational Database (customers, orders, products); Web/Application Server Logs; Site Content; Site Map; Site Dictionary; Content Analysis Module; Data Cleaning / Sessionization Module; Integrated Sessionized Data; Data Integration Module; E-Commerce Data Mart; Data Cube; Data Mining Engine; OLAP Tools; Session Analysis / Static Aggregation; Pattern Analysis; OLAP Analysis]

6. Different levels of analysis

    • Session Analysis
    • Static Aggregation and Statistics
    • OLAP
    • Data Mining

7. Session Analysis

  • Simplest form of analysis: examine individual or groups of server sessions and e-commerce data.
  • Advantages:
    • Gain insight into typical customer behaviors.
    • Trace specific problems with the site.
  • Drawbacks:
    • LOTS of data.
    • Difficult to generalize.

8. Static Aggregation (Reports)

  • Most common form of analysis.
  • Data aggregated by predetermined units such as days or sessions.
  • Generally gives most bang for the buck.
  • Advantages:
    • Gives quick overview of how a site is being used.
    • Minimal disk space or processing power required.
  • Drawbacks:
    • No ability to dig deeper into the data.

9. Data Mining: Going deeper

  • Sequence mining, Markov chains: Prediction of next event
  • Association rules: Discovery of associated events or application objects
  • Clustering: Discovery of visitor groups with common properties and interests
  • Session Clustering: Discovery of visitor groups with common behaviour
  • Classification: Characterization of visitors with respect to a set of predefined classes; Card fraud detection

10. KDD Techniques for Web Applications: Examples (1)

  • Calibration of a Web server:
    • Prediction of the next page invocation over a group of concurrent Web users under certain constraints
      • Sequence mining, Markov chains
  • Cross-selling of products:
    • Mapping of Web pages/objects to products
    • Discovery of associated products
      • Association rules, Sequence Mining
    • Placement of associated products on the same page
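
As an illustration of the "prediction of the next page invocation" task, here is a minimal sketch of a first-order Markov chain estimated from sessions. The page names and sessions are hypothetical, and the function names `fit_first_order_markov` and `predict_next` are introduced only for this example; a real calibration setting would add the constraints mentioned above.

```python
from collections import Counter, defaultdict


def fit_first_order_markov(sessions):
    """Count page-to-page transitions over all sessions."""
    transitions = defaultdict(Counter)
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            transitions[current_page][next_page] += 1
    return transitions


def predict_next(transitions, current_page):
    """Return (most frequent next page, estimated probability), or None if unseen."""
    counts = transitions.get(current_page)
    if not counts:
        return None
    page, count = counts.most_common(1)[0]
    return page, count / sum(counts.values())


# Hypothetical sessions (sequences of page names), for illustration only.
sessions = [
    ["home", "catalogue", "product", "cart"],
    ["home", "catalogue", "product", "product"],
    ["home", "search", "product", "cart"],
]

model = fit_first_order_markov(sessions)
prediction = predict_next(model, "product")
print(prediction)
```

In this toy data, two of the three requests following "product" go to "cart", so "cart" is predicted with an estimated probability of 2/3.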

11. KDD Techniques for Web Applications: Examples (2)

  • Sophisticated cross-selling and up-selling of products:
    • Mapping of pages/objects to products of different price groups
    • Identification of Customer Groups
      • Clustering, Classification
    • Discovery of associated products of the same/different price categories
      • Association rules, Sequence Mining
    • Formulation of recommendations to the end-user
      • Suggestions on associated products
      • Suggestions based on the preferences of similar users

12. Agenda

  • Different levels of analysis
  • Session analysis & OLAP: Case study
  • Frequent itemsets & association rules: Method
  • Association rules: Case study

13. Worldwide usability example

14. The application context

15. Motivation & Application understanding

  • Use of the Internet as an information source
  • Ease of finding information
  • Personal and situational variables

16. Motivation & Application understanding

  • Use of the Internet as an information source
  • Ease of finding information
  • Personal and situational variables

  • International eHealth-website
  • Visitors of different
    • CULTURAL BACKGROUNDS
    • LINGUISTIC BACKGROUNDS
    • MEDICAL KNOWLEDGE LEVELS

17. Motivation & Application understanding

  • Use of the Internet as an information source
  • Ease of finding information
  • Personal and situational variables

  • International eHealth-website
  • Visitors of different
    • CULTURAL BACKGROUNDS
    • LINGUISTIC BACKGROUNDS
    • MEDICAL KNOWLEDGE

How does the user's background affect information-seeking behaviour?

18. Outline of the KDD process

  • Data preparation:
    • Session IDs; usual data cleaning steps
    • Linking of sessions & questionnaire information (anonymized)
  • Modelling / pattern discovery:
    • Session analysis, static aggregation (non-hierarchical) OLAP; frequent subgraph mining (hierarchical) [this is not shown today]
  • Evaluation: Correlation analysis, significance tests; interesting patterns
  • Application understanding: search behaviour, linguistic & expertise theory
  • Data:
    • Web server sessions, questionnaire
  • Data understanding main step:
    • modelling the semantics of the site in terms of a hierarchy of service concepts

19. First step: What do people do there?

20. A sample session (sequence of URLs only)

  • /doia/mainmenu.asp?zugr=d&lang=e
  • /doia/dbrowser.asp?zugr=d&lang=e&benr=A
  • /doia/dbrowser.asp?zugr=d&lang=e&benr=A_6
  • /doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
  • /doia/image.asp?zugr=d&lang=e&cd=4&nr=66&diagnr=695890
  • /doia/image.asp?zugr=d&lang=e&cd=3&nr=95&diagnr=690030
  • /doia/dbrowser.asp?zugr=d&lang=e&benr=A_6_4
  • /doia/image.asp?zugr=d&lang=e&cd=7&nr=40&diagnr=287000
  • /doia/image.asp?zugr=d&lang=e&cd=5&nr=11&diagnr=690040
  • /doia/image.asp?zugr=d&lang=e&cd=4&nr=68&diagnr=695200
  • /doia/image.asp?zugr=d&lang=e&cd=5&nr=85&diagnr=287000
  • /doia/diagnose.asp?zugr=d&lang=e&diagnr=287000&topic=dd
  • /doia/diagnose.asp?lang=e&zugr=d&diagnr=693010
  • /doia/diagnose.asp?lang=e&zugr=d&diagnr=710022
  • /doia/diagnose.asp?zugr=d&lang=e&diagnr=710022&topic=i
  • /doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
  • /doia/image.asp?zugr=d&lang=e&cd=6&nr=96&diagnr=287000
  • /doia/image.asp?zugr=d&lang=e&cd=27&nr=99&diagnr=287000
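
The mapping of such URLs to concepts (Transformation I, slide 22) might be sketched as follows. The rules are inferred only from the URL/concept pairs listed on that slide (page name plus the `topic` parameter); the function name `url_to_concept` and the `OTHER` fallback are assumptions made for this example, not part of the original study.

```python
from urllib.parse import parse_qs, urlsplit


def url_to_concept(url):
    """Map a request URL to a concept label, following the slide's examples."""
    parts = urlsplit(url)
    page = parts.path.rsplit("/", 1)[-1]
    topic = parse_qs(parts.query).get("topic", [None])[0]

    if page == "mainmenu.asp":
        return "START"
    if page == "dbrowser.asp":
        return "SUCHE"       # search
    if page == "image.asp":
        return "D_BILD"      # picture of a diagnosis
    if page == "diagnose.asp":
        if topic == "dd":
            return "D_DD"    # differential diagnosis
        if topic == "i":
            return "D_INFO"
        return "D_TEXT"
    return "OTHER"           # hypothetical fallback, not in the slides


session = [
    "/doia/mainmenu.asp?zugr=d&lang=e",
    "/doia/dbrowser.asp?zugr=d&lang=e&benr=A",
    "/doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000",
    "/doia/diagnose.asp?zugr=d&lang=e&diagnr=287000&topic=dd",
    "/doia/diagnose.asp?lang=e&zugr=d&diagnr=693010",
    "/doia/diagnose.asp?zugr=d&lang=e&diagnr=710022&topic=i",
]

concept_sequence = [url_to_concept(url) for url in session]
print(concept_sequence)
```

Replacing each URL by its concept turns the URL sequence into a concept sequence, which is the input for the concept graphs shown on the following slides.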

21. Session → URL sequence → URL graph (individualised site/Web map)

Key for reading individualised site maps / individualised Web maps

22. Transformation I: Mapping the URLs into a concept hierarchy governed by media types

(SUCHE = search; D_... = sub-concepts of information on diagnoses; BILD = picture; DD = differential diagnosis)

  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=6&nr=96&diagnr=287000
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
  • D_INFO --> /doia/diagnose.asp?zugr=d&lang=e&diagnr=710022&topic=i
  • D_TEXT --> /doia/diagnose.asp?lang=e&zugr=d&diagnr=710022
  • D_TEXT --> /doia/diagnose.asp?lang=e&zugr=d&diagnr=693010
  • D_DD --> /doia/diagnose.asp?zugr=d&lang=e&diagnr=287000&topic=dd
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=85&diagnr=287000
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=4&nr=68&diagnr=695200
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=11&diagnr=690040
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=7&nr=40&diagnr=287000
  • SUCHE --> /doia/dbrowser.asp?zugr=d&lang=e&benr=A_6_4
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=3&nr=95&diagnr=690030
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=4&nr=66&diagnr=695890
  • D_BILD --> /doia/image.asp?zugr=d&lang=e&cd=5&nr=83&diagnr=287000
  • SUCHE --> /doia/dbrowser.asp?zugr=d&lang=e&benr=A_6
  • SUCHE --> /doia/dbrowser.asp?zugr=d&lang=e&benr=A
  • START --> /doia/mainmenu.asp?zugr=d&lang=e

23. Session → concept sequence → concept graph (here: focus on main modality of requested page)

24. Transformation II: Mapping URLs to concepts in an ontology (based on a standard medical ontology + standard search behaviour classifications)

[Diagram labels: Alphabetical search; Diagnosis 21002; Diagnosis info; TOP; Search]

25. Session → concept sequence → concept graph (here: focus on navigation structure)

26. Session → concept sequence → concept graph (here: focus on content)

27. The impact of language (a simplified view)

  • Native speakers (L1): lower cognitive effort; higher preference of ACTIVE LANGUAGE use (compared to L2 users)
  • Non-native speakers (L2): higher cognitive effort; higher preference of PASSIVE LANGUAGE use (compared to L1 users)

Different search options correspond, to varying degrees, to these preferences, e.g.

H2: Native speakers prefer search engines more than non-native speakers.

* For further hypotheses see Kralisch & Berendt CATAC04

28. The impact of medical knowledge (a simplified view)

Different search options correspond, to varying degrees, to a patient's or physician's knowledge and search goals, e.g.

  • Physicians and patients differ
    • in their knowledge of medical terms
    • in their perceptions/differentiation of disease symptoms
    • in other aspects of knowledge about diseases (e.g. Where does the disease occur?)
    • in their search goals

H3: Physicians prefer search engines more than patients.

* For further hypotheses see Kralisch & Berendt 2005

29. Operationalisation

  • 1. Operationalisation of search preference
    • Which search option was (not) used?
    • In combination with which other search options was the search option used?
    • In which order were the search options used?
    • Number of page requests prior to access of search option.
    • Frequency of use of search option.
    • Frequency of use of search option per page request.
    • Factor analysis applied for item reduction.
  • 2. Operationalisation of culture
    • Cultural index scores by Hofstede
    • Control through 5 questions regarding cultural items
  • 3. Operationalisation of language
    • Native speakers vs. non-native speakers according to answers in questionnaire
    • Control of proficiency level in non-native language
  • 4. Operationalisation of medical knowledge
    • Physicians vs. patients according to answers in questionnaire
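
To make the idea of a cube over language, expertise and search option use concrete, here is a minimal sketch with plain counters. The records are hypothetical and not taken from the study; `roll_up` and `slice_cube` are names chosen for this example only.

```python
from collections import Counter

DIMENSIONS = ("language", "expertise", "search_option")

# Hypothetical observations: one (language, expertise, search option) triple
# per use of a search option. Not data from the study.
records = [
    ("L1", "physician", "search engine"),
    ("L1", "patient", "search engine"),
    ("L1", "patient", "alphabetical search"),
    ("L2", "patient", "search by body parts"),
    ("L2", "patient", "search by body parts"),
    ("L2", "physician", "alphabetical search"),
]

# The cube: one count per cell (combination of dimension values).
cube = Counter(records)


def roll_up(cube, keep):
    """Aggregate the cube over all dimensions not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    result = Counter()
    for cell, count in cube.items():
        result[tuple(cell[i] for i in idx)] += count
    return result


def slice_cube(cube, dimension, value):
    """Keep only the cells where `dimension` has the given value."""
    i = DIMENSIONS.index(dimension)
    return Counter({cell: n for cell, n in cube.items() if cell[i] == value})


by_language_and_option = roll_up(cube, ("language", "search_option"))
l2_only = slice_cube(cube, "language", "L2")
print(by_language_and_option)
print(l2_only)
```

Since the study used no drill-down, roll-up and slice are the only operations sketched here; a dimension hierarchy would be needed for drill-down.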

Operationalisation and the resulting OLAP cube (in this study: no drill-down)

[Cube dimensions: language; expertise; search option use]

30. Information Seeking Behaviour: Types (results) and characteristics (background knowledge)

2 main types of information seeking behaviour and their characteristics:

  • 1. Use of alphabetical search & search engine
    • active vocabulary use
    • name of disease required (predominantly goal oriented)
    • fast information access
    • little context information
  • 2. Use of search by body parts
    • passive vocabulary use
    • predominantly exploratory search behaviour
    • more time-consuming
    • highest amount of context information

31. The impact of language and domain knowledge on search option choice

  • 2 studies on the use of search options in the eHealth site:
    • Webserver log: 3 928 235 requests / 277 809 sessions from 188 countries
      • 83.2 % first-language users, 16.8% second-language users
    • Webserver log + Questionnaire: 165 (106) people from 34 countries
      • 84.9% first-language users, 15.1% second-language users
      • 10.4% physicians, 89.6% patients
    • Results:
      • Search engine, alphabetical search: in particular first-language users, physicians
      • Content-organized search: in particular second-language patients
    • Domain knowledge compensates for limited language knowledge.

[Kralisch & Berendt, New Review of Hypermedia and Multimedia, 2005]

32. Evaluation: Recommendations for website design

  • Patients and Physicians have different preferences in their information seeking behaviour
  • Native speakers and non-native speakers have different preferences in their information seeking behaviour
  • Native language information is more important for patients than for physicians
  • In some regions, offering information in their native language is more important than in others
  • Realisation in website design through:
    • e.g. separate website areas for patients and physicians
    • e.g. highlighting of more appropriate search options
    • e.g. terminological support for non-native speakers, especially patients
    • e.g. adapted versions for groups of countries

33. Agenda

  • Different levels of analysis
  • Session analysis & OLAP: Case study
  • Data mining: going deeper
  • Association rules: Case study

34. 2. The structural/algorithmic part of knowledge discovery (modelling in CRISP-DM): Patterns, data mining tasks, methods (ex.)

    • Global patterns
      • Description
        • Clustering
          • K-means, EM, hierarchical clustering, ...
        • Hidden Markov Models
        • Link patterns (e.g., citation analysis à la Google)
      • Prediction
        • Classification
          • Bayes techniques, Decision trees, Support Vector Machines, ...
        • Regression
        • Time series analysis
    • Local patterns
      • Frequent itemsets, sequences, subgraphs
        • Apriori and methods derived from it
      • Association rules
      • Cliques (Web Communities)

35. Global patterns / local patterns

  • Global: Describes all instances
  • Local: Describes some instances
  • Example of a local pattern: outlook=sunny temperature=hot 2 ==> humidity=high 2

36.

  • Demonstration of WEKA
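
The rule on slide 35 can be checked by simple counting. The sketch below assumes that the rule stems from the 14-instance nominal weather sample data distributed with WEKA, and that the two numbers are the counts of instances matching the antecedent and the whole rule; both are assumptions, since the slides do not name the data set.

```python
# Nominal weather sample data (outlook, temperature, humidity, windy, play),
# assumed here to be the data behind the rule on slide 35.
ROWS = """
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
ATTRIBUTES = ("outlook", "temperature", "humidity", "windy", "play")
instances = [dict(zip(ATTRIBUTES, line.split(","))) for line in ROWS.strip().splitlines()]


def count_matching(instances, conditions):
    """Number of instances that satisfy all attribute=value conditions."""
    return sum(all(inst[a] == v for a, v in conditions.items()) for inst in instances)


antecedent = {"outlook": "sunny", "temperature": "hot"}
whole_rule = dict(antecedent, humidity="high")

n_antecedent = count_matching(instances, antecedent)
n_rule = count_matching(instances, whole_rule)
confidence = n_rule / n_antecedent
print(n_antecedent, n_rule, confidence)
```

The rule is local in the sense of slide 35: it says something about the 2 instances matching its antecedent and nothing about the other 12.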

37. Agenda

  • Different levels of analysis
  • Session analysis & OLAP: Case study
  • Frequent itemsets & association rules: Method
  • Association rules: Case study

38. For an excellent introduction, see ...

  • Coenen, F. (2003). Association rule mining and its wider context. AI2003 Association Rule Mining Tutorial, Cambridge, December 2003.
  • http://www.csc.liv.ac.uk/~frans/KDD/Tutorials/tutorialAI2003.ppt
  • pp. 5-20, covering
  • What is an association rule?
  • What are interestingness measures for association rules?
    • support, confidence, lift (there are also further measures)
    • cf. the performance measures recall, precision, etc. for classifiers
  • How is association-rule mining performed?
    • the basic apriori algorithm
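
As a companion to the tutorial, the following is a minimal sketch of the basic Apriori idea (level-wise candidate generation with pruning) together with support, confidence and lift. It is an unoptimised illustration under the usual definitions, not the tutorial's or WEKA's implementation; the baskets and the function names `apriori` and `association_rules` are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        # Count support of the current candidates and keep the frequent ones.
        current = {}
        for candidate in level:
            support = sum(candidate <= t for t in transactions) / n
            if support >= min_support:
                current[candidate] = support
        frequent.update(current)
        # Join step: build (k+1)-itemsets from frequent k-itemsets.
        k += 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules (antecedent, consequent, support, confidence, lift)."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    rules.append((antecedent, consequent, support, confidence, lift))
    return rules


# Hypothetical shopping baskets, for illustration only.
baskets = [
    {"camera", "memory card", "tripod"},
    {"camera", "memory card"},
    {"camera", "battery"},
    {"memory card", "battery"},
    {"camera", "memory card", "battery"},
]

frequent = apriori(baskets, min_support=0.3)
rules = association_rules(frequent, min_confidence=0.7)
for antecedent, consequent, s, c, lift in rules:
    print(sorted(antecedent), "==>", sorted(consequent), round(s, 2), round(c, 2), round(lift, 2))
```

Here support is the share of baskets containing all items of the rule, confidence is that support divided by the support of the antecedent, and lift divides confidence by the support of the consequent, so a lift below 1 means the items co-occur less often than expected under independence.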

39. Agenda

  • Different levels of analysis
  • Session analysis & OLAP: Case study
  • Frequent itemsets & association rules: Method
  • Association rules: Case study

40. CRM questions example: Why go to a shop ...

  • ... if everything is available on the Internet?

41. A multi-channel retailer, its business goals, and analysis questions

  • General goals: standard e-tailer goals, i.e. attract users/shoppers and convert them into customers
  • Specific goals: assess the success of the Web site in relation to other distribution channels
  • Questions of the evaluation:
    • What business metrics can be calculated from Web usage data, transaction and demographic data for determining online success?
    • Are there cross-channel effects between a company's e-shop and its physical stores?

Background: Internet market shares [BCG 2002]

42. The site

43. Outline of the KDD process

  • Data preparation:
    • Session IDs; usual data cleaning steps
    • Linking of sessions & transaction information (anonymized)
  • Modelling / pattern discovery:
    • Web metrics, cluster analysis, association rules, sequence mining + correlation analysis, questionnaire study, qualitative market analysis
  • Evaluation: Interesting patterns
  • Business understanding: customer buying process
  • Data:
    • Web server sessions, transaction info.
  • Data understanding main step:
    • modelling the semantics of the site in terms of a hierarchy of service concepts

44. Agenda Case Study

  • Business Understanding
  • Data understanding and preparation
  • Pattern discovery + evaluation: Success metrics
  • Pattern disc. + eval.: Behavioural patterns
  • Pattern disc. + eval.: User types
  • Pattern disc. + eval.: Behaviour & demographics

45. Agenda Case Study

46. Description of the site and its services

    • The retailer operates an e-shop and more than 5000 retail shops in over 10 European countries
    • It sells a wide range of consumer electronics
    • Online customers can pay, pick-up/deliver and return both online and offline
    • Web pages provide for all tasks in the customer buying process

47. Purchase Phases (Page Concepts) at Large MC Retailers

  • 1. Acquisition (home): all Web pages that are semantically related to the initial acquisition of a visitor

[Diagram labels: Home (Acquisition)]

48. Purchase Phases (Page Concepts) at Large MC Retailers

  • 2. Catalogue information: pages providing an overview of product categories

[Diagram labels: Home (Acquisition); Product Impression]

49. Purchase Phases (Page Concepts) at Large MC Retailers

  • 3. Information product (infprod): pages displaying information about a specific product

[Diagram labels: Home (Acquisition); Product Impression; Product Click-Through]

50. Purchase Phases (Page Concepts) at Large MC Retailers

  • 4. Offline information (offinfo): all pages related to any offline information: store locator (pages for finding physical stores in one's neighbourhood), information about offline services, offline referrers etc.

[Diagram labels: Home (Acquisition); Product Impression; Product Click-Through; Offline info]

51. Purchase Phases (Page Concepts) at Large MC Retailers

  • 5. Transaction (transact): steps before an actual purchase, starting with a customer entering the order process: check-out, input of customer data, payment and delivery preferences (online or offline), etc.

[Diagram labels: Home (Acquisition); Product Impression; Product Click-Through; Offline info; Transaction]

52. Purchase Phases (Page Concepts) at Large MC Retailers

  • 6. Purchase: indicates if a visitor completed the transaction process and bought a product, e.g. invocation of an order confirmation page

[Diagram labels: Home (Acquisition); Product Impression; Product Click-Through; Offline info; Transaction; Purchase]

53. Agenda Case Study

  • Business Understanding
  • Data understanding and preparation
  • Pattern disc. + eval.: Behavioural patterns

54. Data and data preparation

  • Data sources and sample:
    • 92,467 sessions from the company's Web logs from 21 days in 2002
    • anonymized transaction information of 13,653 customers who bought online over a period of 8 months in 2001/02.
    • 621 transaction records (21 days) were linked to Web-usage records
  • Data preparation:
    • Sessions were determined by session IDs
    • Robot visits eliminated, usual data cleaning steps
    • Each URL request mapped to a service concept from $\{c_1, \dots, c_n\}$
    • Session representation: $s = [w_1, \dots, w_n]$, with $w_i$ = weight of $c_i$, indicating whether or not the concept was visited (1/0), or how often it was visited
    • Customer record: feature vector incl. session and transaction data
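
The two session representations can be sketched as follows. The concept names are the column labels from slide 57 (home, infcat, infprod, offinfo, transact, service, purch.); their order in `CONCEPTS` and the example session are assumptions made for illustration.

```python
from collections import Counter

# Concept labels as they appear on slide 57; the order is an assumption.
CONCEPTS = ("home", "infcat", "infprod", "offinfo", "transact", "service", "purch")


def weighted_vector(concept_sequence):
    """w_i = number of requests in the session that were mapped to concept c_i."""
    counts = Counter(concept_sequence)
    return [counts[c] for c in CONCEPTS]


def dichotomized_vector(concept_sequence):
    """w_i = 1 if concept c_i was visited at least once in the session, else 0."""
    visited = set(concept_sequence)
    return [1 if c in visited else 0 for c in CONCEPTS]


# Hypothetical session, already mapped from URLs to concepts.
session = ["home", "infcat", "infprod", "infprod", "infcat", "infprod", "transact", "purch"]

weighted = weighted_vector(session)
dichotomized = dichotomized_vector(session)
print(weighted)
print(dichotomized)
```

The weighted form keeps the intensity of interest in a concept, while the dichotomized form only records whether a phase of the buying process was reached.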

55. Site semantics: A service concept hierarchy

760,535 page requests were mapped onto the concepts from this hierarchy:

[Hierarchy diagram; labels: Any; Information; Transaction; Services; Information Product; Fulfillment/Service; Customer Data; Shopping Cart; Payment; Company Infos; Registration; Other; Acquisition; Offline Referrer; Advertiser; Other; Store Locator; Information Catalog; Home; Game; Offline Service and Support. Legend: marked concepts = Multi-Channel Concept]

56. Types of patterns

    • Conversion rates (~ confidence of content-specified sequential association rules) for assessing business success
    • Association rule and sequence analysis for understanding online/offline preferences and their temporal development
    • Cluster analysis for customer segmentation
    • Correlation analysis for investigating the relationship between demographic indicators and online/offline preferences
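
The reading of a conversion rate as the confidence of a sequential rule can be sketched like this: among the sessions that reach one concept, count the share that reach another concept afterwards. The sessions are hypothetical, and `conversion_rate` is one plausible definition chosen for this example, not necessarily the exact metric used in the case study.

```python
def conversion_rate(sessions, from_concept, to_concept):
    """Share of sessions visiting `from_concept` that visit `to_concept` afterwards.

    This equals the confidence of the sequential rule from_concept -> to_concept.
    """
    reached = 0
    converted = 0
    for session in sessions:
        if from_concept in session:
            reached += 1
            first = session.index(from_concept)
            if to_concept in session[first + 1:]:
                converted += 1
    return converted / reached if reached else 0.0


# Hypothetical concept sequences, for illustration only.
sessions = [
    ["home", "infcat", "infprod", "transact", "purch"],
    ["home", "infprod", "offinfo"],
    ["infcat", "infprod", "transact"],
    ["home", "infcat"],
]

look_to_buy = conversion_rate(sessions, "infprod", "purch")
look_to_order = conversion_rate(sessions, "infprod", "transact")
order_to_buy = conversion_rate(sessions, "transact", "purch")
print(look_to_buy, look_to_order, order_to_buy)
```

Chaining such rates along the purchase phases shows at which phase visitors drop out of the buying process.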

57. Session representation

  • Each session represented as a feature vector on the multi-channel concepts
  • Two methods used for definition of new conversion metrics:
  • weighted-concept method (number of visits to a concept)
  • dichotomized concept method (whether or not concept was visited)

[Tables: example sessions A, B, ... as vectors over the concepts home, infcat, infprod, offinfo, transact, service, purch.; once with visit counts (weighted-concept method) and once with 1/0 entries (dichotomized concept method)]

58. Agenda Case Study

  • Business Understanding
  • Data understanding and preparation
  • Pattern disc. + eval.: Behavioural patterns

59. Internal consistency of preferences: payment and delivery preferences

  • Online payment → Direct delivery (s=0.27, c=0.97): < 1/3 traditional online users!
  • Online payment → In-store pickup (s=0.02, c=0.03)
  • Cash on delivery → Direct delivery (s=0.02, c=0.03)
  • In-store payment → In-store pickup (s=0.69, c=0.94)
  • Site is primarily used to collect information.

s: support, c: confidence of the sequence

60. Internal consistency of preferences: return preferences

  • Return → In-store (s=0.06, c=0.87)
  • Return → Mail-in (s=0.04, c=0.13)
  • Customers may wish personal assistance.
  • (a result supported by the service mix analysis of different multi-channel retailers and by questionnaire results)

61. Development of preferences over time

  • Direct delivery → In-store pickup in 1 following transaction (s=0.001, c=0.15)
  • Direct delivery → Direct delivery in all following transactions (s=0.003, c=0.85)
  • In-store pickup → Direct delivery in 1 following transaction (s=0.001, c=0.10) (*)
  • In-store pickup → In-store pickup in all following transactions (s=0.004, c=0.90)
  • Results for payment migration are similar.
  • 90% of repeat customers did not change transaction preferences at all.
  • Rule (*) as an indicator of the development of trust?!
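
How such migration rules could be evaluated is sketched below for repeat customers. The histories are hypothetical, and the sketch assumes that "switched" means the other option occurs in at least one following transaction, which is an interpretation of the rules above rather than something the slides state.

```python
def migration_confidences(histories, start, other):
    """Confidences of 'start -> other later' and 'start -> start in all later ones'.

    `histories` holds one time-ordered list of choices per customer.
    Only repeat customers whose first choice is `start` are considered.
    """
    starters = [h for h in histories if len(h) > 1 and h[0] == start]
    if not starters:
        return None
    switched = sum(other in h[1:] for h in starters)
    stayed = sum(all(choice == start for choice in h[1:]) for h in starters)
    return {"switched": switched / len(starters), "stayed": stayed / len(starters)}


# Hypothetical delivery choices per customer, in transaction order.
histories = [
    ["direct", "direct"],
    ["direct", "pickup"],
    ["direct", "direct", "direct"],
    ["pickup", "pickup"],
    ["direct", "direct", "pickup", "direct"],
    ["pickup"],  # single transaction: not a repeat customer, ignored
]

from_direct = migration_confidences(histories, "direct", "pickup")
from_pickup = migration_confidences(histories, "pickup", "direct")
print(from_direct, from_pickup)
```

With only two options, the two confidences for a given starting preference add up to 1, which matches the pairs of confidence values reported above (0.15 and 0.85; 0.10 and 0.90).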

62. Thank you!