General Chairman: Veljko Milutinovic, School of Electrical Engineering, University of Belgrade
Deputy General Chairman: Frédéric Patricelli, Telecom Italia Learning Services (Head of International Education)
Conference Managers: Miodrag Stefanovic, Cesira Verticchio
Conference Staff: Renato Ciampa, Veronica Ferrucci, Maria Rosaria Fiori, Maria Grazia Guidone, Natasa Kukulj, Bratislav Milic, Zaharije Radivojevic, Milan Savic
These pages are optimized for Internet Explorer 4+ or Netscape Navigator v4+ and a resolution of 1024x768 pixels in high color. Designed by SSGRR.
SSGRR-2002s - Papers
1. .NET All New?
Jürgen Sellentin, Jochen Rütschlin
2. A center for Knowledge Factory Network Services (KoFNet) as a support to e-business
Giuseppe Visaggio, Piernicola Fiore
3. A concept-oriented math teaching and diagnosis system
Wei-Chang Shann, Peng-Chang Chen
4. A contradiction-free proof procedure with visualization for extended logic programs
Susumu Yamasaki, Mariko Sasakura
5. A Framework For Developing Emerging Information Technologies Strategic Plan
Amran Rasli
6. A Generic Approach to the Design of Linear Output Feedback Controllers
Yazdan Bavafa-Toosi, Ali Khaki-Sedigh
7. A Knowledge Management Framework for Integrated design
Niek du Preez, Bernard Katz
8. A Method Component Programming Tool with Object Databases
Masayoshi Aritsugi, Hidehisa Takamizawa, Yusuke Yoshida and Yoshinari Kanamori
9. A Model for Business Process Supporting Web Applications
Niko Kleiner, Joachim Herbst
10. A Natural Language Processor for Querying Cindi
Niculae Stratica, Leila Kosseim, Bipin C. Desai
11. A New Approach to the Construction of Parallel File Systems for Clusters
Felix Garcia, Alejandro Calderón, Jesús Carretero, Javier Fernández, Jose M. Perez
12. A New model of On-Line Learning
Marjan Gusev, Ljupco N. Antovski, Vangel V. Ajanovski
13. A New Paradigm for Network Management: Business Driven Device Management
John Strassner
14. A Prototype of a Retail Internet Banking for Thai Customers
Rawin Raviwongse, Pornpriya Koedrabruen
15. A Reuse-Oriented Approach for the Construction of Hypermedia Applications
Naoufel Kraiem
16. A Scientific Paradigm On Image Processing's Lecture
Sar Sardy
17. A Theory of Programming for e-Science and Software Engineering
Juris Reinfelds
18. A video based laboratory on the Internet, and the experiences obtained with high-school teachers
Fernando Gamboa Rodríguez, J.L. Pérez Silva, F. Lara Rosano, A. Miranda Vitela, F. Cabiedes Contreras
19. Web Engineering: Methods and Tools for Education
George E. Cormack, G. Griffiths, B. D. Hebbron, M. A. Lockyer, B. J. Oates
20. Adding Security to Quality of Service Architectures
Stefan Lindskog, Erland Jonsson
21. Advanced Mobile Multipoint Real-Time Military Conferencing System (AMMCS)
R. Sureswaran, A. Osman, M. S. Mushardin, M. Yusof, B. Husain
22. Advanced Optical Infrastructure for the Emerging Optical Internet Services
Marian Marciniak, Marian Kowalewski, Miroslaw Klinkowski
23. Agent-based Intelligent Clinical Information System
Il Kon Kim, Ji Hyun Yun, Sang Wook Lee, Hang Chan Kim
24. An approach for implementing Object Persistence in C++ using Broker
Kulathu Sarma
25. Digital Learning: Infrastructure and Web Culture
Alexei L. Semenov
26. An Efficient and Adaptive Method for Reservation of Multiple Multicast Trees
58. Economic Decision-making in a Technological Age
James R. Forcier
59. Complexity and the Emergent Web
Sorin Solomon, Eran Shir
60. E-Diagnosis Using GeneChip Technologies
Zhao Lue-Ping, S. Gilbert, C. Defty
61. e-DOCPROS: An e-Business Document Processing System
Zhenfu Cheng, Xuhong Li
62. Effects of Changing the Pedagogical Concept of a Part-time Bachelor of Science in Accounting from Traditional Lectures into an IT-supported Asynchronous and Flexible Teaching & Learning Concept
Lars Kiertzner, Maya Dole, Tage Rasmussen
63. e-Infrastructure in a complex environment
Julian Smith
64. E-learning at ENSAIT: a case study
Pierre Douillet, S. Pessé, A. M. Jolly
65. E-Learning Content Creation with MPEG-4
Michael Stepping
66. E-Learning of Spanish with Interactive Video and Blackboard Technologies for Elementary School Children
Julia Coll
67. EMERGENCY! Medicine and Modern Education Technology
Dag K.J. E. von Lubitz, Benjamin Carrasco, Francesco Gabbrielli, Frederic Patricelli, Tymoty Pletcher, Caleb Poirier, Simon Richir
68. e-Medicine Utilization: Socio-cultural issues
Robert Doktor, David Bangert
69. Emerging market mechanisms in Business-to-Business E-Commerce: A framework
B. Mahadevan
70. Enhanced Security Watermarking and Authentication based on Watermark Semantics
Dimitrios Koukopoulos, Y. C. Stamatiou
71. Environment for Teaching Support in the Medical Area
Rosa Maria Vicari, Cecilia Dias Flores, Louise Seixas, André Silvestre
72. Epidemic Communication Mechanisms in Distributed Computing
Oznur Ozkasap
73. Evaluating Java Applets for Teaching on the Internet
Michael R. Healy, Dale E. Berger, Victoria L. Romero, Amanda Saw
159. Storage Technologies for an Efficient e-Infrastructure
Satish Rege
160. Structured Metadata Analysis
Steve Probets
161. Superscalar in City-1: An Educational Guide to the next step beyond Pipelining
Ryuichi Takahashi, Noriyoshi Yoshida
162. Supervision of Electrical Utility Works Based on Internet
Felipe Alaniz, Pablo R. de Buen
163. Teaching Novices Programming Skills Efficiently: What, When and How?
Yuh-Huei Shyu
164. Teaching, Technology and Teamwork
Elaine Carbone, Shaun Stemmler, Jon Beal
165. Software solutions for Science e-Education: A case study from the VISIT Project
Yichun Xie
166. Technologies for Student-Generated Work in a Peer-Led, Peer-Review Instructional Environment
Brian P. Coppola, Ian C. Stewart
167. TEN WAYS TO IMPACT THE WEB WITHOUT A WEB MEISTER
Ken McNaughton
168. The Architecture of Knowledge: Representation and Theorization of Violence on the Internet
Lily Alexander
169. The emergence of web-mediated genres: the home page
Anne Ellerup Nielsen
170. The Emerging Autosophy Internet
Klaus Holtz, Eric Holtz
171. The Future of Education
Lalita Rajasingham
172. The impact of internet technologies on the financial markets
Ross A. Lumley
173. The Mathematical Structure model of a Word-unit-based Program
Hamid Fujita, Osamu Arai
174. The Role of XML in E-Business
Betty Harvey
175. Think before you click: customers' challenges in e-commerce
Zita Zoltay Paprika
176. Topological Design of Multiple VPNs over MPLS Network
Anotai Srikitja, David Tipper
177. Toward logical-probabilistic modeling of complex systems
Taisuke Sato, Yoichi Motomura
178. Towards to e-transport
Miroslav Svítek, Mirko Novák
179. Emergence and Evolution of Microturbine Generators [MTGs] to Provide Infrastructure for E-Related Applications
Stephanie L. Hamilton
180. Using Building Blocks to Implement a Business-to-Supplier Portal
Shannon Fowler
181. Using CORBA Interceptors to Implement a Security Wrapper
Luigi Romano, D. Cotroneo, A. Mazzeo, S. Russo
182. Using Internet and Database Technology to Enable Collaboration between Researchers and Teachers Developing Educational Websites Featuring Endangered Species Research and Conservation
Mary A. Overby, Mark MacAllister, Jeffrey Hoffman, Chris Bulla
183. Using the Quick Look Methodology to Plan and Implement Complex Information Technology Transformations
Richard C. Staats
184. Verifying and Leveraging Software Frameworks
Trent Larson
185. Virtual Communities for Service Delivery: Transferring the Notion of Pro-Social Behavior from "Place" to "Space"
Ko de Ruyter, Caroline Wiertz, Sandra Streukens
186. Visualizing Molecules Helps Students 'See' Chemistry in a New Light
Abstract—We are studying techniques that allow even ordinary end users to make efficient use of the Internet. We previously proposed an algorithm that uses link information to determine the degree of similarity between web sites, in order to find sites that are mirrors of each other as well as sites that are not mirrors but have similar content and can be used as substitutes for each other. In verifying the basic effectiveness of that algorithm, we found that when searching for sites similar to site-A, besides sites with almost 100% similarity to site-A, there were also sites that were thoroughly adequate as substitutes for site-A even though their degree of similarity was 50% or less. For practical use of the algorithm, it is therefore essential to be able to automatically judge whether web sites inferred to have some kind of similarity are actually mirror sites or similar sites that can be used as substitutes. To solve this problem, in this paper we propose an automatic judgment methodology and evaluate its basic effectiveness; focusing on its operation, we also propose a methodology for effectively finding candidate similar sites by using a user's Internet access history.

Index Terms—Internet, Mirror site, Access history, Link information
I. INTRODUCTION
Due to the rapid expansion of the Internet, it has become possible for ordinary end users to obtain many kinds of information easily. However, it is still difficult for them to use the network effectively. For example, although mirror servers and cache servers have been provided in order to improve scalability and response times, it is difficult for users to identify the optimal server.
To solve this problem we have already proposed a “URL Resolver” framework, which allows users to select the optimal server from multiple servers that provide various kinds of services via data storage facilities such as caches or mirror servers [1]. To enable users to select one of the servers, it is first necessary to gather information such as a list of servers that might be useful to the user. Initially, we focused on information related to mirror sites or
similar sites, and have already proposed a basic algorithm for detecting similar web sites by focusing on the link information embedded in web pages [2]. As a result of verifying the basic effectiveness of that algorithm, we then found that there are some sites that are thoroughly adequate for use as substitutes yet have a degree of similarity of no more than 50%. But to use this detection method in practice, it is necessary to employ a mechanism for automatically judging whether sites for which a low degree of similarity has been detected are mirrors or similar sites that can actually be used instead of mirrors. So, in this paper, we propose an automatic determination algorithm, in which web pages are divided into hub-type and content-type and the judgment is done based on the results of judgment algorithms specifically tailored to each type of web page. Initial trials of this approach have yielded favorable detection results. We also examine the operation of this detection methodology and propose an algorithm for effectively finding candidates for similar web sites by using the user’s access history to the Internet.
Section 2 reviews our previously proposed algorithm for detecting similar web sites based on link information and describes our new automatic similar web site detection method. Section 3 discusses a similar-web-site candidate finding method.
II. USING LINK INFORMATION TO FIND SIMILAR WEB SITES
A. HOW TO FIND A SIMILAR WEB SITE?

We define a mirror site as follows: if the link structure of site-A is very similar to that of site-B, then sites-A and -B are mirrors of each other.

This definition is based on the observation that mirror sites, or sites that hold information so similar that they closely resemble mirror sites, should more or less match in terms of the number and types of links embedded within them, even if there are slight differences such as inserted advertising banners (this definition is based on [3]).
The detection method we proposed in reference [2] is as follows. Assume a starting site-A and a mirror candidate site-B. The degree of similarity between these two sites is calculated as follows. The total number of embedded inward links that can be gathered when tracing the links of web pages to a depth of N levels from the top web page of site-A is referred to as url(Ain)N, and the total number of embedded outward links is referred to as url(Aout)N. Here, an inward link is one in which the host part of the link's destination URL is the same as that of the current host, and an outward link is one in which the host part of the link destination is different. The corresponding properties of site-B are similarly expressed as url(Bin)N and url(Bout)N. Then, the total number of inward links in url(Bin)N that are also included in url(Ain)N is expressed as url(Ain ∩ Bin)N, while the corresponding property of the outward links is expressed as url(Aout ∩ Bout)N. At this point, when determining the value of url(Ain ∩ Bin)N, the comparisons are made after replacing the host parts of site-A and -B with the same arbitrary text string. The degree of similarity between site-A and -B when links are followed to a depth of N levels is then denoted by the symbol α, which is given by
α = [url(Ain ∩ Bin)N + url(Aout ∩ Bout)N] / [url(Ain)N + url(Aout)N] × 100 (%).
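The gathering and comparison steps just described can be sketched as follows. The page-graph input and function names are our own illustrative assumptions; a real implementation would fetch pages over HTTP and, as the paper describes, replace the host parts of both sites with a common string before comparing inward links.

```python
from urllib.parse import urlparse

HOST_PLACEHOLDER = "host"  # hosts are replaced by a common string before comparison

def gather_links(pages, start, depth):
    """Trace links breadth-first to `depth` levels from `start`, collecting
    inward links (same host as the current page) and outward links
    (different host). `pages` maps URL -> list of embedded link URLs."""
    inward, outward = set(), set()
    frontier, seen = [start], {start}
    for _ in range(depth):
        nxt = []
        for url in frontier:
            host = urlparse(url).netloc
            for link in pages.get(url, []):
                if urlparse(link).netloc == host:
                    # store with the host replaced so A's and B's inward
                    # links become directly comparable
                    inward.add(link.replace(host, HOST_PLACEHOLDER, 1))
                    if link not in seen:
                        seen.add(link)
                        nxt.append(link)  # only inward links are traversed
                else:
                    outward.add(link)
        frontier = nxt
    return inward, outward

def similarity_alpha(a, b):
    """The paper's alpha: shared inward+outward links over A's totals, in %."""
    a_in, a_out = a
    b_in, b_out = b
    total = len(a_in) + len(a_out)
    if total == 0:
        return 0.0
    return 100.0 * (len(a_in & b_in) + len(a_out & b_out)) / total

# Toy graph: two one-page sites with the same structure.
pages = {
    "http://a.example/": ["http://a.example/news", "http://x.example/"],
    "http://b.example/": ["http://b.example/news", "http://x.example/"],
}
a = gather_links(pages, "http://a.example/", 1)
b = gather_links(pages, "http://b.example/", 1)
print(similarity_alpha(a, b))  # → 100.0
```

Approximating the link multisets with Python sets ignores duplicate links on a page, which is a simplification of our own.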
Since this procedure only compares the link structures, it does not perform a text-level comparison of every character in every word on the web pages. The reasons for adopting this approach are as follows:
(1) As mentioned above, we think it is possible to judge the similarity of hypertext documents such as web pages by comparing only their link structures. The practicality of focusing on the link structure of web pages is also highlighted in other studies [4] and [5].
(2) In this procedure, although a lot of processing time is taken up by gathering web pages, the amount of text to be compared also increases substantially when links are followed to a depth of several levels, and much more time is required for comparing text than for comparing just the links. Besides, the gathering work can be sped up by increasing the network bandwidth, so we decided that a text-level comparison was not worth its processing cost.
(3) We are planning to use this similar-web-site finding method even in environments with limited processing resources, such as users’ notebook PCs. Therefore, considering the storage of information obtained when calculating the degree of similarity, performing a text-level comparison would require all the text information to be stored, which would be a waste of resources. The link information takes up considerably less space than the text information, which is another
reason why we decided not to perform a text-level comparison.
Next, we discuss the way to find candidates for mirror sites. Using a web robot to recursively access suitable sites indiscriminately and compare them with site-A to find a candidate for site-B would be far too inefficient, so instead we adopted the following strategy: since web sites that employ mirror servers do so with the aim of dispersing the load, they will probably want to provide users accessing the site with information about these mirrors. In other words, it is reasonable to assume that the site will make some mention of where its mirrors can be found. Accordingly, it is highly likely that site-B can be found by gathering and analyzing the content accessible within a certain number of link levels from the top page of site-A. It is also highly likely that the web site will contain links to sites of a similar nature, so there should be a high likelihood of being able to find similar sites by checking the link structure.

For the actual trials, we extracted 1000 URL entries from the access log stored in a proxy server used by our organization of about 200 people, and applied our similar-site detection program to each one (see α in Fig. 1). Similar site candidates were detected for 65% of these. It is interesting to note that many sites can be used as substitutes even though their degree of similarity was less than 50% (see Table 1). So, if a way can be found to automatically judge whether or not they are actually capable of being used as substitutes, then it should be possible to present a greater number of sites to the users, in addition to the sites having a high degree of similarity, for which judgment is unnecessary.

Fig. 1: Detection results
Degree of similarity | Strength of relationship between two sites
0%–10%               | Probably unrelated
10%–60%              | May include some sites of a similar nature
60%–90%              | Either a mirror site or a site that is highly similar
90%–100%             | Almost certainly a mirror site

Table 1: Results of classifying detected sites
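The bands of Table 1 can be expressed as a small lookup. Note that the table's band edges overlap (10% ends one band and starts the next); assigning each edge to the higher band is our own choice, not specified in the paper.

```python
def classify_similarity(alpha):
    """Map a degree-of-similarity value alpha (in %) to the empirical
    relationship bands of Table 1."""
    if alpha < 10:
        return "probably unrelated"
    if alpha < 60:
        return "may include some sites of a similar nature"
    if alpha < 90:
        return "either a mirror site or a highly similar site"
    return "almost certainly a mirror site"

print(classify_similarity(95))  # → almost certainly a mirror site
print(classify_similarity(40))  # → may include some sites of a similar nature
```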
B. SITES CAPABLE OF BEING USED AS SUBSTITUTES
Thus, we propose the following detection method. First, we divide web pages into the following two broad categories according to the style of user access.
• Web pages that are accessed as a starting point for net surfing are called hub-type sites.
• Web pages that are accessed in order to view the content on the page itself are called content-type sites.
Then, by considering the conditions of sites that can be considered as substitutes for hub- and content-type sites, respectively, we propose the following degree-of-similarity calculation methods.
1) METHODOLOGY FOR JUDGING A HUB-TYPE SITE
For example, consider hub-type sites-A and -B. If site-A has many embedded external links that are the same as those in site-B, then it is highly likely that the user will be able to use both sites equally.
That is, site-A and the possible substitute candidate site-B are deemed to have a greater degree of similarity with respect to their outward links if they satisfy the condition
α < β , (1)
where α is as defined in Section 2.1 and
β = url(Aout ∩ Bout)N / url(Aout)N × 100 (%)
is the degree of similarity related to outward links only, from which it can be inferred that site-B is highly likely to be suitable for use as a substitute hub-type site.

2) METHODOLOGY FOR JUDGING A CONTENT-TYPE SITE
On the other hand, in a content-type site we think that there may be some differences in the page structure, such as the way links are embedded, even if a site is capable of being used as a substitute. To deal with this, in addition to the links, we also use the label strings of links as important elements expressing the attributes of the links, and add them to the calculation of the degree of similarity as follows. We use the labels corresponding to the text enclosed within the links; e.g., the text string "XXXXXX" in the link <A href="url">XXXXXX</A>, and the text string "YYYYY" in the link <A href="url"><IMG src="url" alt="YYYYY"></A>.
We decided to rate these links by scoring them according to the length of the text strings “XXXXXX” and “YYYYY” embedded in their labels when a match is found between a pair of labels. Note that the calculation is performed using only the text string “XXXXXX” for links where the text strings “XXXXXX” and “YYYYY” match.
In content-type sites like a news site, the headline of the article is usually used as a link label, and the string length of the headline is usually longer than that of a link label whose reference address is another Web site.
An example of a link in which the alt option is set is as follows: <a href="http://www.apple.com/store/"><img src="http://a772.g.ak···/2.gif" width="84" height="42" alt="The Apple Store." border="0"></a>
In the algorithm in [2], the degree of similarity was calculated using only the text string of the URL part “http://www.apple.com/store”, but here in addition to this, the link label “The Apple Store.” is also included in the calculation. The labels are scored according to the following rules:
(1) When the URL parts and label parts both match, the link is awarded a score corresponding to the number of characters in the label.
(2) When the URL parts match but the label parts are different, the link is awarded a score of 70% of the number of characters in the starting link label.
(3) When the label parts match but the URL parts are different, the link is awarded a score of 50% of the number of characters in the starting link label.
(4) When both parts are different, the link is awarded no score.
These scoring settings are based on experience, and further study is required to investigate their validity. Also, for rule (3), since we focus on the similarity of the link structure, it could conceivably be wrong to consider cases where the URL parts are different. However, in the case of content-type sites, since we concentrate on the label parts, we decided to include cases where the URL parts are different in the detection, by reducing the score awarded. In the above example of "The Apple Store.", if a link whose URL part and label part are both identical is detected when matching the link with the mirror candidate site, then this link is awarded 16 points (the number of characters in "The Apple Store."). In calculating the number of characters in a label, all single-byte and double-byte characters (including English and Japanese characters, spaces, and so on) are each counted as one character.
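The four scoring rules can be written as a small function. The interface below is our own sketch; the paper applies the rules while matching link pairs, and the 70%/50% weights are its experience-based settings.

```python
def label_score(url_match, label_match, start_label):
    """Score one link of the starting site against a candidate link,
    following rules (1)-(4). Every character (single- or double-byte)
    counts as one, so len() on a Python str matches the paper's counting."""
    n = len(start_label)
    if url_match and label_match:
        return n          # rule (1): both match -> full score
    if url_match:
        return 0.7 * n    # rule (2): URL matches, label differs
    if label_match:
        return 0.5 * n    # rule (3): label matches, URL differs
    return 0              # rule (4): neither matches -> no score

print(label_score(True, True, "The Apple Store."))  # → 16
```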
The value of urllabel(Ain)N for the starting site is given by the sum of the number of characters in the labels added to each link in url(Ain)N, and is awarded the maximum possible score when matching is performed with an identical mirror site. The value of urllabel(Bin)N for a mirror candidate site-B is defined in the same way. Furthermore, the value of urllabel(Ain ∩ Bin)N is given by the sum of the scores awarded for matching combinations of the abovementioned URL parts and label parts in each respective label. If the degree of similarity between site-A and -B in terms of inward links is given by
γ = urllabel(Ain ∩ Bin)N / urllabel(Ain)N × 100 (%),

then if

α < γ , (2)
it is judged to be likely that the mirror candidate can be used as a substitute for a content-type site.
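Both judgment conditions can be sketched together. The function boundaries below are our own framing of Equations (1) and (2); in the paper, β is computed from the outward-link sets and γ from the accumulated label scores.

```python
def similarity_beta(a_out, b_out):
    """The paper's beta: outward-link-only similarity, relative to site-A."""
    if not a_out:
        return 0.0
    return 100.0 * len(a_out & b_out) / len(a_out)

def similarity_gamma(awarded, max_possible):
    """The paper's gamma: awarded label scores over the maximum possible
    score urllabel(Ain)N (the score an identical mirror would attain), in %."""
    if max_possible == 0:
        return 0.0
    return 100.0 * awarded / max_possible

def judge_substitute(alpha, beta, gamma):
    """Equations (1) and (2): a candidate is judged a likely substitute for
    a hub-type site if alpha < beta, and for a content-type site if
    alpha < gamma."""
    return {"hub": alpha < beta, "content": alpha < gamma}

a_out = {"http://x.example/", "http://y.example/", "http://z.example/"}
b_out = {"http://x.example/", "http://y.example/", "http://w.example/"}
beta = similarity_beta(a_out, b_out)       # 2 of A's 3 outward links shared
gamma = similarity_gamma(80, 160)          # half the maximum label score
print(judge_substitute(40.0, beta, gamma)) # → {'hub': True, 'content': True}
```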
When the number of links used in the denominator of α, β, or γ when judging Equations (1) and (2) is small (in the current version, less than 10), the degree of similarity is recalculated by following the links to a greater depth. Unfortunately, this procedure is unable to detect sites that have a mirror relationship but differ in both the link structure and the labels added to the links. However, it is doubtful whether many sites of this sort actually exist.
At present, the following simple techniques are used to select a URL from the list of URLs that are possibly mirrors: (1) select the site with the highest likelihood, and (2) in the case of multiple candidates from phase (1), select the candidate with the highest transfer rate at the time of retrieval. In the future, we also plan to make use of the transfer rate at the time of user access, feedback data from users, etc. Here we note that, in terms of user access, the process of indicating the most suitable URL requires real-time properties, unlike the mirror search process. In relation to the above, we are investigating a technique that can flexibly select optimal strategies for selecting a URL at any time, through an algorithm that executes multiple strategies in parallel [9]. For example, when a user is the first to access a certain URL, there is no time available for measuring the transfer rate, and the most suitable URL is selected on the basis of information from the mirror information managing agent. On the other hand, there may be a small amount of time available up until the user clicks an anchor within that URL; if such is the case, it might be possible to select the most suitable URL according to new transfer rate information. If this can be accomplished, access will be forcibly changed to the most suitable URL when the user clicks the anchor.

C. INITIAL TRIALS
As mentioned in Section 2.1, Fig. 1 shows the results of calculating β and γ for the 1000 URLs. About 18% of the URLs were classified as hub-type sites with a degree of similarity of 30% or more, and about 6% of them were classified as content-type sites. Both of these include URLs that were detected as hub-type and content-type pairs, respectively. We then manually checked 50 sites, for which either β or γ was 30% or more, from among the sites classified as hub- or content-type sites, and found that all of them were indeed suitable for use as substitutes for these hub- and content-type sites. In the future we plan to perform verifications with a greater number of access logs and to investigate the reliability of the degree-of-similarity calculations.
III. USING THE USER'S ACCESS HISTORY
With the procedure in Section 2.1, we were able to detect similar sites by using an access log as a starting point. However, it will be more effective if it is possible to detect similar sites that are significant for each individual end user. Of course, facilities such as proxy servers contain access logs that reflect the character of the community that uses them, and can themselves be thought of as candidates for similar web sites that may match the users’ preferences. However, in the current version, we can only find mirror or similar sites that are limited to the range of sites traced from the starting host. That is, we do not evaluate the degree of similarity between different hosts in the access log. This is because it would lead to a combinatorial explosion and we judge it to be inefficient. However, it is clear that it is highly effective to detect similar sites including hosts that cannot be reached from the detection origin host, not just the hosts recorded in the access log. Therefore, we propose a method for retrieving mirror or similar sites by using the users’ access history to filter similar site candidates.
Figure 2 shows excerpts from web pages related to the same content (new digital camera products) in four web sites. On finding an article about a new product on one web site, many people (the authors certainly do) habitually browse through other related web sites and look for articles related to the same content. This is because the content of the articles changes slightly from one site to the next. In the example shown in Fig. 2, an article related to resolution and the number of pictures that can be taken was mentioned only at www.nikkeibp.co.jp, while an article relating to the manufacturer's business strategy was mentioned only at www.watch.impress.co.jp, and detailed specifications were mentioned only at www.zdnet.co.jp.
Here, when a user browses through some content zzz at a certain site-A, if sites-B, -C, and -D which are highly likely to contain similar articles related to content zzz—i.e., sites that have a high degree of similarity to site-A—have been detected, then the user will be more likely to obtain a greater amount of information if these sites are recommended to him/her. Next, we discuss how to efficiently extract sites-A, -B, -C, and -D.
1. First, we acquire the user's access history: every time the user clicks on a link, we extract site-R in which the link is embedded, site-T which is the destination site of the link, and label-L which is the text string of the link's label. Moreover, label-L is subjected to morphological analysis (using widely used general-purpose morphological analysis software, like [7]) to extract several noun parts n1 ⋅⋅⋅ nN, and {R, T, n1 ⋅⋅⋅ nN} triplets are recorded. (In the morphological analysis, a noun is classified as a proper noun, a general noun, or an unknown word; we use proper nouns and unknown words to express the character of a link.) In the example shown in Fig. 2, the following lists are recorded:

Fig. 2: Examples of the same content in different web sites.
2. When there are pages carrying the same content in different sites, it is highly likely that the destinations of the outward links embedded within these pages will be the same, so there is a higher possibility that access histories such as {Ra, T, N} and {Rb, T, N}, where only site-R is different, can be extracted to evaluate similarity. In Fig. 2, a link to the manufacturer's site "Kodak" is embedded in all the sites.
3. Then, by extracting from the access history the sites-R1...n for which the site-T and N terms are the same and only the R terms are different, we can obtain a list of sites where the same content appears, and the degree of similarity between these sites is calculated. Of course, a different site list could be accumulated from all the sites where only the noun parts N are the same, but in practice the content will have a lower likelihood of being related.
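The three steps above can be sketched as follows. The regex-based noun extraction is a crude stand-in for the morphological analyzer of [7], and all names and URLs in the demo are illustrative.

```python
import re
from collections import defaultdict

def extract_nouns(label):
    """Stand-in for the morphological analyzer the paper uses: treat
    capitalized tokens as proper-noun candidates. Illustrative only."""
    return {w for w in re.findall(r"[A-Za-z]+", label) if w[0].isupper()}

history = []  # recorded {R, T, n1...nN} triplets

def record_click(site_r, site_t, label):
    """Step 1: on each click, record the embedding site R, the destination
    site T, and the noun parts of the link label."""
    history.append((site_r, site_t, extract_nouns(label)))

def candidate_site_lists():
    """Step 3: group triplets whose T and noun part are the same but whose
    R differs; each group is a list of sites where the same content appears,
    ready for a degree-of-similarity calculation."""
    groups = defaultdict(set)
    for r, t, nouns in history:
        for n in nouns:
            groups[(t, n)].add(r)
    return {k: rs for k, rs in groups.items() if len(rs) > 1}

# A "Kodak" link embedded in two different news sites (as in Fig. 2):
record_click("http://www.watch.impress.co.jp/", "http://www.kodak.com/", "Kodak")
record_click("http://www.zdnet.co.jp/", "http://www.kodak.com/", "Kodak")
print(candidate_site_lists())
```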
This procedure is very general-purpose because it can learn {R, T, N} triplets as soon as the user first accesses sites, even if they have not yet been registered in the access log.
As for ways to recommend detected sites to the users, several methodologies can be considered. When the user has accessed any one of these sites, he/she is recommended to browse the other sites, with priority given to those having a higher degree of similarity. Moreover, when the user has accessed a site-T that has already been recorded in {R1...n, T, N}, he/she is recommended to browse other sites, with priority given to pages in the individual Rm of R1...n whose content has been updated recently. Another effective measure is to pre-fetch the contents of similar sites related to sites accessed by the user and to display a compiled list of these sites.
If there is a similarity relationship between site-A and site-B, but site-B is a competitor of site-A, it may be difficult to find the similar web site-B from site-A by using the strategy proposed in Section 2.1. However, by using the user's access history, it may still be possible to find the similarity relationship between site-A and site-B.
To verify the basic efficiency of this methodology, we investigated how many {R, T, N}s having the same noun part N and the same destination-site URL T could be found in two similar sites. Figure 3 shows the results of this investigation. First, we extracted the link information from www.watch.impress.co.jp and www.zdnet.co.jp, which are hub-type sites concerning new products in the computer or office automation fields, and from www.asahi.com and www.yomiuri.co.jp, which are web sites of newspapers. Specifically, we extracted the noun parts and the destination sites of the outward links to a depth of 3 levels (N=3). Second, we searched for the following types of {R, T, N}s in this link information:

Case-I: {R, T, N} in which noun part N and destination site T are both the same.
Case-II: {R, T, N} in which only noun part N is the same.
Case-III: {R, T, N} in which only destination site T is the same.

Finally, for the {R, T, N}s found in Case-I, we checked whether each {R, T, N} did indeed express the character of the site T. The results show that even though many {R, T, N}s were found in Case-II and Case-III for every combination of the sites, in Case-I, {R, T, N}s were mainly extracted only from combinations of the similar sites ("watch vs. zdnet" and "asahi vs. yomiuri"). And although several {R, T, N}s were also extracted from the combination "watch vs. asahi", which have no relationship between them, we could find only one {R, T, N} that indeed expressed the character of both sites (see the following partial lists of extracted {T, N}s).

watch vs. zdnet
{http://www.minolta.co.jp/, MINOLTA} (The company name)
{http://www.newtech.co.jp/, NEWTECH} (The company name)
{http://www.melcoinc.co.jp/, MELCO} (The company name)
{http://www.tsutaya.co.jp/, TSUTAYA} (The company name)
{http://www.sony.co.jp/sd/, SONY} (The company name)
                       All   Case-I   Case-II   Case-III   Case-I with N expressing
                                                           the character of T
watch vs. zdnet      24482       22     23966        494                         20
asahi vs. yomiuri     2288       15      2198         76                         14
watch vs. asahi       4389        8      4281        100                          1
zdnet vs. asahi      15879        1     14789         89                          0
watch vs. yomiuri     1869        1      1845         23                          0
zdnet vs. yomiuri     6350        3      6289         58                          0

(Case-I: both N and T are the same; Case-II: only N is the same; Case-III: only T is the same)
(watch: www.watch.impress.co.jp, zdnet: www.zdnet.co.jp, asahi: www.asahi.com, yomiuri: www.yomiuri.co.jp)
Fig. 3: Number of extracted {R, T, N}s
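The Case-I/II/III classification used for Figure 3 can be sketched as a pairwise comparison of link triples. This is a minimal illustration under assumptions, not the authors' implementation; the tuple layout (source page R, destination site T, noun part N) and the function name are ours:

```python
from itertools import product

def classify_links(site_a_links, site_b_links):
    """Count Case-I/II/III matches between the {R, T, N} link triples
    of two sites.  Each triple is (source_page R, destination_site T,
    noun_part N); only T and N matter for the comparison."""
    counts = {"case_i": 0, "case_ii": 0, "case_iii": 0}
    for (_, t_a, n_a), (_, t_b, n_b) in product(site_a_links, site_b_links):
        if n_a == n_b and t_a == t_b:
            counts["case_i"] += 1    # both noun part and destination match
        elif n_a == n_b:
            counts["case_ii"] += 1   # only the noun part matches
        elif t_a == t_b:
            counts["case_iii"] += 1  # only the destination site matches
    return counts
```

For example, two sites that both link to http://www.sony.co.jp/sd/ with the noun part "SONY" contribute one Case-I pair.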
watch vs. asahi
{http://www.microsoft.com/japan/misc/cpyright.htm, Microsoft} (The company name)
{http://www.microsoft.com/japan/misc/cpyright.htm, Corporation.}
{http://www.microsoft.com/japan/misc/cpyright.htm, ALL}
{http://www.microsoft.com/japan/misc/cpyright.htm, rights}
{http://www.microsoft.com/japan/misc/cpyright.htm, reserved}
Therefore, from this initial investigation, we can infer that if two sites have links with the same noun parts and the same destination sites, those two sites are strong candidates for having a similarity relationship. Of course, this procedure is still at the stage of initial trials, and we plan to verify its effectiveness through full-scale verification trials.
IV. CONCLUDING REMARKS
In this study, only link information was used to detect similarity, on the grounds that in hypertext environments such as the WWW, links express the most information regarding the characteristics of content. On the other hand, a considerable amount of research in the field of natural language processing addresses procedures that determine the degree of similarity by analyzing the text content itself. Reference [8] describes one example of a study in which such a procedure is applied to the WWW. However, it has been concluded that this sort of conventional text-based procedure does not function effectively in hypertext environments such as the WWW [5], [6].
Studies of ways to detect mirror sites by focusing on the link structure include references [3] and [4]. However, those methods aim to detect only complete mirror sites, and besides the calculated degree of similarity (corresponding to α in this study), they also use other information such as link connection relationships and information from a DNS. Our procedure differs in that it regards some sites as usable substitutes for each other even though they have a low α value, and aims to detect these sites as well. To do this, we broadly divide web sites into hub-type and content-type sites, and calculate the degree of similarity using methods tailored to each type. By comparing the degrees of similarity thus obtained, it is possible to automatically judge whether or not web pages can be used as substitutes for each other.
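The substitutability judgment described above can be illustrated with a small sketch. The precise definition of α is given earlier in the paper; here a Jaccard overlap of destination-site sets stands in as an assumed proxy, and the threshold value is hypothetical:

```python
def jaccard(links_a, links_b):
    """Overlap of two sites' destination-site sets (a stand-in for alpha)."""
    a, b = set(links_a), set(links_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def substitutable(links_a, links_b, threshold=0.2):
    # Hypothetical decision rule: sites whose destination sets overlap
    # above a threshold are treated as substitution candidates, even
    # when the overlap is low compared with a full mirror.
    return jaccard(links_a, links_b) >= threshold
```

A full mirror would score close to 1.0; the point of the method is that candidates well below that can still be useful substitutes.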
In this paper, we focused on the operation of this similar-web-site detection method and proposed an effective procedure for finding candidate similar web sites that match the user's preferences. The procedure stores the connection relationships and label parts of links in sites accessed by the user, and extracts similar-site candidates by starting from sites where the noun parts of the labels are the same.
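The candidate-extraction step described above can be sketched roughly as follows. The class name, storage layout, and method names are assumptions for illustration; only the idea of indexing the user's browsing history by link noun part comes from the text:

```python
from collections import defaultdict

class LinkHistory:
    """Store {R, T, N} triples observed while the user browses, and
    propose similar-site candidates that share a link noun part."""

    def __init__(self):
        # noun part N -> set of (source site R, destination site T)
        self.by_noun = defaultdict(set)

    def record(self, source_site, dest_site, noun):
        self.by_noun[noun].add((source_site, dest_site))

    def candidates_for(self, site):
        """Other sites whose links carry the same noun part as some link of `site`."""
        result = set()
        for entries in self.by_noun.values():
            sources = {r for r, _ in entries}
            if site in sources:
                result |= sources - {site}
        return result
```

For example, after recording that both watch and zdnet link somewhere with the noun part "SONY", `candidates_for("watch")` would return zdnet as a similar-site candidate.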
ACKNOWLEDGEMENTS

We thank our executive manager, Dr. Keiichi Koyanagi of NTT Network Innovation Labs, and the researchers of the Computer Networking Principles Research Group.
REFERENCES

[1] Satoshi Kurihara, Toshio Hirotsu, Toshihiro Takada, and Toshiharu Sugawara: ARESAIN - Alternative Resource Access Information Navigator, Thirteenth IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS 2001), 2001.
[2] Satoshi Kurihara, Toshio Hirotsu, Toshihiro Takada, and Toshiharu Sugawara: Mirror Site Navigator using Link Information, Proceedings of World Multiconference on Systemics, Cybernetics and Informatics (SCI2000), pp. 283–290, 2000.
[3] Krishna Bharat, Andrei Z. Broder, Jeffrey Dean, and Monika Rauch Henzinger: A Comparison of Techniques to Find Mirrored Hosts on the WWW, Journal of the American Society for Information Science (JASIS), Vol. 51, No. 12, Nov. 2000, pp. 1114–1122.
[4] Narayanan Shivakumar and Hector Garcia-Molina: Finding near-replicas of documents on the web, International Workshop on the World Wide Web and Databases (WebDB ’98), 1998.
[5] O. Zamir and O. Etzioni: Grouper: A Dynamic Clustering Interface to Web Search Results, The Eighth International WWW Conference, 1999.
[6] L. Page, S. Brin, R. Motwani, and T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web, Work in progress. http://google.stanford.edu/~backrub/pageranksub.ps.
[7] http://chasen.aist-nara.ac.jp/
[8] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, The Seventh International WWW Conference, pp. 65–74, 1998.
[9] S. Kurihara, S. Aoyagi, R. Onai, and T. Sugawara: Adaptive Selection of Reactive/Deliberate Planning for the Dynamic Environment, Robotics and Autonomous Systems, Vol. 24, No. 3–4, pp. 183–195, 1998.