Course Content
• Introduction
• Internet and WWW
• Protocols
• HTML and beyond
• Animation & WWW
• CGI & HTML Forms
• Javascript
• Databases & WWW
• Dynamic Pages
• Perl & Cookies
• SGML / XML
• CORBA & SOAP
• Web Services
• Search Engines
• Recommender Systems
• Web Mining
• Security Issues
• Selected Topics
Web Mining
• Web mining is the application of data mining techniques and other knowledge-extraction methods to integrate information gathered over the World Wide Web in all its forms: content, structure, or usage. The integrated information is useful for:
– Understanding on-line user behaviour;
– Retrieving/consolidating relevant knowledge/resources;
– Evaluating the effectiveness of particular web sites or web-based applications.
• Web mining research integrates research from databases, data mining, information retrieval, machine learning, natural language processing, software agents, etc.
Web Structure Mining
Using Links
• HyPursuit (Weiss et al., 1996)
• PageRank (Brin et al., 1998)
• CLEVER (Chakrabarti et al., 1998)
Use interconnections between web pages to give weight to pages.
Using Generalization
• MLDB (1994), VWV (1998)
Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
General Access Pattern Tracking
• Knowledge from web-page navigation (Shahabi et al., 1997)
• WebLogMining (Zaïane, Xin and Han, 1998)
• SpeedTracer (Wu, Yu, Ballman, 1998)
• WUM (Spiliopoulou, Faulstich, 1998)
• WebSIFT (Cooley, Tan, Srivastava, 1999)
Uses KDD techniques to understand general access patterns and trends. Can shed light on better structuring and grouping of resource providers, as well as network and caching improvements.
• Adaptive Sites (Perkowitz & Etzioni, 1997)
Analyzes the access patterns of each individual user. The web site restructures itself automatically by learning from user access patterns.
• Personalization (SiteHelper: Ngu & Wu, 1997; WebWatcher: Joachims et al., 1997; Mobasher et al., 1999)
Provide recommendations to web users.
Web Content Mining: a huge field with many applications
• Data/information extraction: Extraction of structured data from Web pages, such as products and search results. Extracting such data allows one to provide value-added services. Two main types of techniques exist: machine learning and automatic extraction.
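As a minimal sketch of the wrapper style of extraction, the snippet below pulls structured (name, price) records out of HTML. The page template it matches is a hypothetical one invented for illustration; real wrappers are learned or induced from many example pages.

```python
import re

# A minimal wrapper-style extractor: the regular expression encodes an
# assumed, made-up page template in which each product sits in a
# <div class="product"> with a name span and a price span.
PRODUCT_PATTERN = re.compile(
    r'<div class="product">\s*<span class="name">(?P<name>[^<]+)</span>'
    r'\s*<span class="price">\$(?P<price>[\d.]+)</span>\s*</div>'
)

def extract_products(html):
    """Return structured (name, price) records from a page matching the template."""
    return [(m.group("name"), float(m.group("price")))
            for m in PRODUCT_PATTERN.finditer(html)]

page = ('<div class="product"><span class="name">Widget</span>'
        '<span class="price">$9.99</span></div>')
print(extract_products(page))  # [('Widget', 9.99)]
```

The machine-learning alternative replaces the hand-written pattern with one induced from labeled example pages.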
• Web information integration and schema matching: Although the Web contains a huge amount of data, each web site (or even page) represents similar information differently. How to identify or match semantically similar data is a very important problem with many practical applications.
• Opinion extraction from online sources: There are many online opinion sources, e.g., customer reviews of products, forums, blogs and chat rooms. Mining opinions (especially consumer opinions) is of great importance for marketing intelligence and product benchmarking.
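The simplest opinion-mining baseline counts sentiment-bearing words against a lexicon. A minimal sketch, assuming tiny illustrative word lists (a real system would use a full sentiment lexicon and handle negation):

```python
# Lexicon-based opinion scoring: count positive vs. negative words.
# These tiny word sets are illustrative assumptions, not a real lexicon.
POSITIVE = {"great", "excellent", "love", "good", "fast"}
NEGATIVE = {"bad", "poor", "hate", "slow", "broken"}

def opinion_score(review):
    """Return (#positive, #negative) word counts for one review."""
    words = review.lower().split()
    pos = sum(w.strip(".,!?") in POSITIVE for w in words)
    neg = sum(w.strip(".,!?") in NEGATIVE for w in words)
    return pos, neg

print(opinion_score("Great camera, but the battery life is poor."))  # (1, 1)
```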
• Knowledge synthesis: Concept hierarchies or ontologies are useful in many applications. However, generating them manually is very time consuming. A few methods exist that exploit the information redundancy of the Web. The main application is to synthesize and organize the pieces of information on the Web to give the user a coherent picture of the topic domain.
• Segmenting Web pages and detecting noise: In many Web applications, one only wants the main content of the Web page, without advertisements, navigation links, or copyright notices. Automatically segmenting Web pages to extract their main content is an interesting problem. A number of interesting techniques have been proposed in the past few years.
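One family of segmentation heuristics keeps blocks whose text is mostly plain prose and drops blocks dominated by links (navigation bars, ad strips). A rough sketch, where the block-splitting rule and the 0.5 link-density threshold are assumptions for illustration, not a published algorithm:

```python
import re

def link_density(block):
    """Fraction of a block's text that sits inside anchor tags."""
    anchor_text = "".join(re.findall(r"<a [^>]*>([^<]*)</a>", block))
    plain_text = re.sub(r"<[^>]+>", "", block)
    return len(anchor_text) / max(len(plain_text), 1)

def main_content(html, threshold=0.5):
    """Keep blocks (split on <div> boundaries) whose link density is low."""
    blocks = re.split(r"</?div[^>]*>", html)
    kept = [re.sub(r"<[^>]+>", "", b).strip()
            for b in blocks if b.strip() and link_density(b) < threshold]
    return [b for b in kept if b]

page = ("<div><a href='/home'>Home</a> <a href='/shop'>Shop</a></div>"
        "<div>The quick brown fox article body with real content.</div>")
print(main_content(page))  # ['The quick brown fox article body with real content.']
```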
Citation Analysis in Information Retrieval
• Pinski and Narin (1976) proposed a significant variation on the notion of impact factor, based on the observation that not all citations are equally important.
– A journal is influential if, recursively, it is heavily cited by other influential journals.
– Influence weight: the influence of a journal j is equal to the sum of the influence of all journals citing j, with the sum weighted by the amount that each cites j.
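The influence-weight definition is a fixed-point equation, so it can be computed by simple iteration. A minimal sketch, assuming a small made-up citation matrix C in which C[i][j] counts how often journal i cites journal j:

```python
def influence_weights(C, iterations=100):
    """Iterate w_j = sum_i w_i * C[i][j] / (citations given by i) to a fixed point."""
    n = len(C)
    out = [sum(row) or 1 for row in C]  # total citations given by each journal
    w = [1.0 / n] * n                   # start with uniform influence
    for _ in range(iterations):
        new_w = [sum(w[i] * C[i][j] / out[i] for i in range(n)) for j in range(n)]
        total = sum(new_w) or 1
        w = [x / total for x in new_w]  # normalise so weights sum to 1
    return w

# Made-up example: C[i][j] = number of times journal i cites journal j
C = [[0, 2, 1],
     [1, 0, 1],
     [2, 2, 0]]
print([round(x, 3) for x in influence_weights(C)])
```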
A good authority is a page pointed to by many good hubs, while a good hub is a page that points to many good authorities.
This mutually reinforcing relationship between hubs and authorities serves as the central theme in our exploration of link-based methods for search and the automated compilation of high-quality web resources.
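The mutual reinforcement above can be computed by alternating the two score updates until they stabilize, as in the HITS algorithm. A sketch on a tiny made-up link graph (edges[u] lists the pages u links to):

```python
def hits(edges, iterations=50):
    pages = sorted(edges)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages pointing to it
        auth = {p: sum(hub[q] for q in pages if p in edges[q]) for p in pages}
        # hub score: sum of authority scores of the pages it points to
        hub = {p: sum(auth[q] for q in edges[p]) for p in pages}
        # normalise both vectors to keep the values bounded
        for d in (auth, hub):
            norm = sum(v * v for v in d.values()) ** 0.5 or 1
            for p in d:
                d[p] /= norm
    return hub, auth

edges = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(edges)
# "c" ends up the top authority: both hubs "a" and "b" point to it
```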
Further Enhancement for Finding Authoritative Pages in WWW
• The CLEVER system (Chakrabarti et al., 1998-1999)
– builds on the hub/authority algorithmic framework, with extensions based on both content and link information.
• Extension 1: mini-hub pagelets
– prevent "topic drift" on large hub pages with many links, based on the observation that contiguous sets of links on a hub page are more focused on a single topic than the entire page.
• Extension 2: anchor text
– make use of the text that surrounds hyperlink definitions (href's) in Web pages, often referred to as anchor text
– boost the weights of links that occur near instances of the query terms
• Google assigns initial rankings and retains them independently of any query. This makes it faster.
• CLEVER and the Connectivity Server assemble a different root set for each search term and prioritize those pages in the context of the particular query.
• Google works in the forward direction, from link to link.
• CLEVER looks in both the forward and backward directions.
• Both the PageRank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
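The query-independent ranking Google pre-computes is PageRank. A minimal sketch with the usual 0.85 damping factor; the 4-page link graph is made up for illustration:

```python
def pagerank(links, damping=0.85, iterations=100):
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # rank flows into p from every page q that links to it,
            # split evenly among q's outgoing links
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
# "c" receives links from three pages, so it accumulates the highest rank
```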
• HyperClass (Chakrabarti, 1998) uses the content and links of exemplary pages to focus crawling on the relevant web space.
Existing Web Log Analysis Tools• There are many commercially available applications.
– Many of them are slow and make assumptions to reduce the size of the log file to analyse.
• Frequently used, pre-defined reports:
– Summary report of hits and bytes transferred
– List of top requested URLs
– List of top referrers
– List of most common browsers
– Hits per hour/day/week/month reports
– Hits per Internet domain
– Error report
– Directory tree report, etc.
• Tools are limited in their performance, comprehensiveness, and depth of analysis.
Basic summarization:
– Get frequency of individual actions by user, domain and session.
– Group actions into activities, e.g. reading messages in a conference.
– Get frequency of different errors.
Questions answerable by such a summary:
– Which components or features are the most/least used?
– Which events are most frequent?
– What is the user distribution over different domain areas?
– Are there differences in access from different domain areas or geographic areas, and if so, what are they?
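Several of the standard reports above reduce to simple counting over parsed log entries. A sketch that parses Common Log Format lines and produces top URLs, total hits and bytes, and an error breakdown; the two sample log lines are fabricated:

```python
import re
from collections import Counter

# host ident user [timestamp] "METHOD url protocol" status bytes
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)')

def summarize(lines):
    urls, hosts, status = Counter(), Counter(), Counter()
    total_bytes = 0
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip malformed entries
        host, _ts, _method, url, code, size = m.groups()
        urls[url] += 1
        hosts[host] += 1
        status[code] += 1
        total_bytes += 0 if size == "-" else int(size)
    return {"top_urls": urls.most_common(3),
            "hits": sum(urls.values()),
            "bytes": total_bytes,
            "errors": {c: n for c, n in status.items() if c.startswith(("4", "5"))}}

log = [
    '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '1.2.3.4 - - [10/Oct/2000:13:55:40 -0700] "GET /missing HTTP/1.0" 404 512',
]
print(summarize(log))
```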
What Is Web Access Log Mining?
• Web servers register a log entry for every single access they get.
• A huge number of accesses (hits) are registered and collected in an ever-growing web log.
• Applications of web access log mining:
– Enhance server performance
– Improve web site navigation
– Improve system design of web applications
– Target customers for electronic commerce
– Identify potential prime advertisement locations
More on Log Files
• Information NOT contained in the log files:
– use of browser functions, e.g. backtracking
– within-page navigation, e.g. scrolling up and down
– requests for pages stored in the cache
– requests for pages stored in the proxy server
– etc.
• Special problems with dynamic pages:
– different user actions call the same cgi script
– the same user action at different times may call different cgi scripts
– one user using more than one browser at a time
– etc.
• Personalization
• Adaptive sites
• Banner targeting
• User behaviour analysis
• Web site structure evaluation
• Improve server performance (caching, mirroring…)
• Pages visited in the same session constitute a transaction. Transactions relate pages that are often referenced together, regardless of the order in which they are accessed; such pages need not even be hyperlinked.
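Building such transactions means first grouping raw hits into sessions, then counting page pairs that co-occur within a session regardless of access order. A sketch assuming a 30-minute session timeout and made-up (host, timestamp, url) hit tuples:

```python
from collections import defaultdict
from itertools import combinations

TIMEOUT = 30 * 60  # assumed session timeout, in seconds

def sessions(hits):
    """Group hits (host, unix_time, url), assumed sorted by time, into sessions."""
    current, last_seen, out = defaultdict(list), {}, []
    for host, t, url in hits:
        if host in last_seen and t - last_seen[host] > TIMEOUT:
            out.append(current.pop(host))   # gap too long: close the session
        last_seen[host] = t
        current[host].append(url)
    out.extend(current.values())            # flush still-open sessions
    return out

def cooccurrence(transactions):
    """Count unordered page pairs appearing in the same transaction."""
    pairs = defaultdict(int)
    for pages in transactions:
        for a, b in combinations(sorted(set(pages)), 2):
            pairs[(a, b)] += 1              # order of access is ignored
    return dict(pairs)

hits = [("h1", 0, "/a"), ("h1", 60, "/b"), ("h1", 5000, "/a"), ("h1", 5060, "/c")]
print(cooccurrence(sessions(hits)))  # {('/a', '/b'): 1, ('/a', '/c'): 1}
```

Pairs with high co-occurrence counts are candidates for association rules even when no hyperlink connects the two pages.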