Fuzzy Clustering: An Approach for Mining Usage Profiles from Web

Ms. Archana N. Boob, Prof. D. M. Dakhane
Sipna’s COET, Amravati; Asst. Prof., Sipna’s COET, Amravati
[email protected]

Abstract

Web usage mining is the application of data mining technology to the data in Web server log files. It can discover the browsing patterns of users and correlations between Web pages. Web usage mining supports Web site design, personalization services, and business decision making. Web mining applies data mining, artificial intelligence, chart technology and similar techniques to Web data, traces users' visiting characteristics, and then extracts their usage patterns. In this paper, we present an approach to cluster Web site users into different groups. By using fuzzy clustering, we enable the generation of overlapping clusters that can capture the uncertainty in Web users' navigation behaviour.

Key-Words: Web mining, clustering, fuzzy clustering, personalization.

I. Introduction

Recent years have been characterized by exponential growth both in the number of Web applications available online and in the number of their users. This growth has generated huge quantities of data related to user interactions with Web sites, stored by the servers in user access log files. At the same time, the degree of personalization that a Web site is able to offer in presenting its services to users is an important attribute contributing to the site's success. Hence, the need for a Web site that understands the interests of its users is becoming a fundamental issue. If properly exploited, log files can reveal useful information about user preferences. Therefore data mining, understood as the process of knowledge discovery from large databases, has naturally found application on Web data, leading to so-called Web Mining [1], [2], [3], [4].
Three principal areas can be identified in Web Mining:
- Web Content Mining, which focuses on the information available in Web pages.
- Web Structure Mining, which searches for information resources in the structure of Web sites.
- Web Usage Mining, which deals with knowledge extraction from server log files in order to derive useful patterns of user access.

Recently, several research activities have investigated Web Usage Mining techniques, and many works have been published on these topics [4], [5], [6], [7], [8], [9], [10]. A variety of traditional machine learning methods have been used for pattern discovery in Web Usage Mining [11], [12]. Among these, unsupervised methods, especially clustering, seem to be the most appropriate for grouping users with common browsing behaviour. A wide range of applications can benefit from the knowledge discovered by the clustering process, from real-time content personalization to dynamic link suggestion. In the choice of the clustering method for Web Usage Mining, one important constraint to be considered is the possibility of obtaining overlapping clusters, so that a user can belong to more than one group. To deal with the ambiguity and uncertainty underlying Web interaction data, fuzzy reasoning appears to be an effective tool. In this paper, we use fuzzy clustering to categorize user sessions in order to derive, from Web log data, groups of users who exhibit similar access patterns. The obtained clusters can be exploited to implement different personalization functions, such as the dynamic suggestion of links to Web pages deemed interesting for the user.

II. Related Work

Clustering is the process of discovering groups of objects such that the objects belonging to the same group are similar in a certain manner, and the objects belonging to different groups are dissimilar. There are many algorithms in the literature that deal with the problem of clustering a large number of objects.
The different algorithms can be classified according to several aspects. One key issue, which also determines other features of an algorithm, is its basic clustering approach. The aim of partition-based algorithms is to decompose the set of objects into a set of disjoint clusters, where the number of resulting clusters is predefined by the user. The algorithm works iteratively: based on a distance measure, it updates the cluster of each object until no further changes can be made. The most representative partition-based clustering
algorithms are k-means and k-medoid. The advantage of partition-based algorithms is that they create the clusters iteratively, but the drawback is that the number of clusters has to be determined in advance and only spherical shapes can be detected as clusters.
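As an illustration of the partition-based approach only (it is not part of the proposed method), the following minimal Python sketch implements plain k-means. The function name `kmeans`, the sample points, and the initial centres are hypothetical choices made for this example.

```python
from math import dist

def kmeans(points, centers, max_iter=100):
    """Plain k-means; k is fixed in advance by the number of initial centres."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centre (Euclidean distance).
        new_labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object changed cluster: stop iterating
        labels = new_labels
        # Move each centre to the mean of the points assigned to it.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers

# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels, centers = kmeans(points, centers=[(0, 0), (5, 5)])
```

Each point receives exactly one label, which is the hard (disjoint) assignment that the fuzzy approach discussed later relaxes.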

Hierarchical algorithms provide a hierarchical grouping of the objects. There exist two approaches, bottom-up and top-down. In the bottom-up approach, at the beginning of the algorithm each object represents a different cluster, and at the end all objects belong to the same cluster. In the top-down approach, at the start of the algorithm all objects belong to the same cluster, which is split until each object constitutes a different cluster. A key aspect of this kind of algorithm is the definition of the distance measures between objects and between clusters. The drawback of hierarchical algorithms is that once an object has been assigned to a given cluster, the assignment cannot be modified later. Furthermore, as in the partition-based case, only spherical clusters can be obtained. The advantage of hierarchical algorithms is that the validation indices (correlation, inconsistency measure) which can be defined on the clusters can be used to determine the number of clusters.

Density-based algorithms start by searching for core objects and grow the clusters from these cores by searching for objects that lie in a neighbourhood within a given radius of an object. The advantage of this type of algorithm is that it can detect clusters of arbitrary shape and can filter out noise.

Grid-based algorithms use a hierarchical grid structure to decompose the object space into a finite number of cells. For each cell, statistical information about the objects is stored, and the clustering is performed on these cells. The advantage of this approach is its fast processing time, which is in general independent of the number of data objects.

Fuzzy clustering algorithms suppose that no hard clusters exist on the set of objects, so that one object can be assigned to more than one cluster. The best-known fuzzy clustering algorithm is FCM (fuzzy c-means).
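For reference, the standard FCM formulation can be written as follows; it is not spelled out in this paper, so the notation is an assumption made here. The $x_i$ are the $N$ objects, the $v_j$ are the $C$ cluster centres, $u_{ij} \in [0, 1]$ is the membership degree of object $i$ in cluster $j$, and $m > 1$ is the fuzzification parameter. FCM minimizes the objective $J_m$ by alternating the two update rules for $u_{ij}$ and $v_j$ until the memberships stabilize.

```latex
\[
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - v_j \rVert^{2},
\qquad \sum_{j=1}^{C} u_{ij} = 1 \quad \text{for every } i,
\]
\[
u_{ij} = \frac{1}{\displaystyle \sum_{k=1}^{C}
  \left( \frac{\lVert x_i - v_j \rVert}{\lVert x_i - v_k \rVert} \right)^{2/(m-1)}},
\qquad
v_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}.
\]
```

Because the memberships of an object sum to one but need not be 0 or 1, an object lying between two centres is shared between both clusters, which is what makes the resulting clusters overlap.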

III. Analysis of Problem

With the explosive growth of information sources available on the World Wide Web and the rapidly increasing pace of adoption of Internet commerce, the Internet has evolved into a gold mine that contains or dynamically generates information that is beneficial to e-businesses. A Web site is the most direct link a company has to its current and potential customers. Companies can study visitors' activities through Web analysis and find patterns in visitors' behaviour. Web usage patterns could be directly applied to efficiently manage activities related to e-Business, e-CRM, e-Services, e-Education, e-Newspapers, and e-Government. With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the Web has become an important research area.

Applications such as Customer Relationship Management (CRM) can use data from within and outside an organization to build an understanding of its customers on an individual basis or on a group basis, for example by forming customer profiles. Improved knowledge of customers' preferences and needs forms the basis for effective CRM. For a business, it is important to retain the loyalty of existing customers and to attract new ones. Automated data mining or knowledge discovery techniques can be used to discover Web user profiles. These mass user profiles can be extracted automatically, as frequent access patterns, from the history of previous user clickstreams stored in Web log files. Although there have been considerable advances in Web usage mining, there have been no detailed studies presenting a fully integrated approach to mining a real Web site with characteristics such as evolving profiles, dynamic content, and the availability of a taxonomy or database in addition to Web logs.

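IV. Proposed Work

The general scheme of the proposed approach for mining usage profiles using fuzzy clustering consists of the following stages: Web log data collection, usage pre-processing, and session categorization by fuzzy clustering.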

Web Log Data

When users visit a Web site, the Web server stores information about their accesses in a log file. Each record of a log file represents a page request issued by a Web user. In particular, it typically contains the following information: the user's IP address, the date and time of the access, the URL of the requested page, the request protocol, and a code indicating the status of the request.
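The paper does not state which log format is used. As a minimal sketch, assuming records in the widely used NCSA Common Log Format, the fields listed above could be extracted as follows. The pattern, the function name `parse_log_line`, and the sample record are illustrative assumptions.

```python
import re
from datetime import datetime

# Assumed layout (Common Log Format):
# ip ident user [time] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) \S+'
)

def parse_log_line(line):
    """Return the fields of one log record as a dict, or None if malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp and status code into typed values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

# Hypothetical sample record (documentation IP address).
sample = '192.0.2.1 - - [10/Oct/2011:13:55:36 +0530] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
```

Returning `None` for malformed lines lets the later cleaning step skip damaged records instead of failing on them.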

Usage Pre-processing

The aim of the pre-processing step is to identify user sessions starting from the information contained in the access log file. Data pre-processing involves two main steps: data cleaning and user session identification.

Archana N Boob et al ,Int.J.Computer Technology & Applications,Vol 3 (1),329-331

IJCTA | JAN-FEB 2012 Available [email protected]

330

ISSN:2229-6093


Data cleaning

The first step of log data pre-processing consists of removing useless requests from the log files. In particular, data cleaning removes redundant references such as images, sound files, multiple frames, and dynamic pages that have the same template. We eliminate the irrelevant items by checking the suffix of the requested URL. Hence, all log entries with filename suffixes such as gif, jpeg, jpg, and map are removed. These operations not only remove uninteresting sessions but also simplify the mining task that follows.
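A minimal sketch of this suffix check is given below. It assumes that each request is a dict with a "url" key; the names `IGNORED_SUFFIXES`, `is_relevant`, and `clean` are hypothetical, and the suffix list contains only the examples named in the text.

```python
from urllib.parse import urlparse

# Suffixes named in the text; a real deployment would likely extend this list.
IGNORED_SUFFIXES = (".gif", ".jpeg", ".jpg", ".map")

def is_relevant(url):
    """True if the requested URL does not point to an ignored file type."""
    # Look only at the path, so query strings do not hide the suffix.
    path = urlparse(url).path.lower()
    return not path.endswith(IGNORED_SUFFIXES)

def clean(requests):
    """Keep only the log entries whose URL passes the suffix check."""
    return [r for r in requests if is_relevant(r["url"])]

# Hypothetical sample requests.
requests = [{"url": "/index.html"}, {"url": "/logo.GIF"}, {"url": "/img/a.jpg?x=1"}]
cleaned = clean(requests)
```

Lower-casing the path makes the check case-insensitive, so `/logo.GIF` is filtered out like `/logo.gif`.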

User session identification

- User identification

User identification is the process of associating the page requests recorded in the Web logs with the users who made them. Visitors are distinguished according to their IP address and user agent. Because of caches, proxy servers (including Internet cafes, etc.) and firewalls, this step can be very complicated and time-consuming. Researchers have put forward some heuristic rules to identify users: (1) Different IP addresses represent different users. (2) When the IP address is the same, different operating systems or browsers represent different users. (3) When the IP address, the operating system, and the browser are all the same, check whether there is a direct link between the requested page and the pages visited previously; if so, the request belongs to the same user, and if not, to a different user.
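The three rules can be sketched as follows. This is only one possible reading of the heuristics: the request fields ("ip", "agent", "url"), the `links` map describing the site topology, and the choice to treat a repeated request for an already visited page as the same user are all assumptions made for this example.

```python
def identify_users(requests, links):
    """Assign a user id to each request using the three heuristic rules.

    requests: list of dicts with "ip", "agent", and "url", in time order.
    links: dict mapping a page URL to the set of URLs it links to
           (hypothetical site topology, needed for rule 3).
    """
    users = []      # one entry per user: {"key": (ip, agent), "pages": [...]}
    assigned = []   # user id chosen for each request
    for r in requests:
        key = (r["ip"], r["agent"])
        match = None
        for uid, user in enumerate(users):
            if user["key"] != key:
                continue  # rules 1 and 2: different IP or agent, different user
            # Rule 3: the page must be reachable from a page this user visited.
            if any(r["url"] == p or r["url"] in links.get(p, ())
                   for p in user["pages"]):
                match = uid
                break
        if match is None:
            users.append({"key": key, "pages": []})
            match = len(users) - 1
        users[match]["pages"].append(r["url"])
        assigned.append(match)
    return assigned

# Hypothetical topology and requests: "F" and "C" stand for two user agents.
links = {"/a": {"/b"}, "/x": {"/y"}}
requests = [
    {"ip": "1", "agent": "F", "url": "/a"},
    {"ip": "1", "agent": "F", "url": "/b"},
    {"ip": "1", "agent": "F", "url": "/x"},
    {"ip": "1", "agent": "C", "url": "/b"},
    {"ip": "2", "agent": "F", "url": "/b"},
    {"ip": "1", "agent": "F", "url": "/y"},
]
assigned = identify_users(requests, links)
```

In this example the third request shares IP address and agent with the first two, but `/x` is not linked from `/a` or `/b`, so rule 3 attributes it to a new user.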

- Session identification

A session is the series of activities from when a user first accesses the Web site to when the user leaves it. The goal of session identification is to obtain a meaningful sequence of visits within a specific period of time.
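The paper does not specify how the end of a session is detected. A common heuristic, assumed here, is to start a new session whenever the gap between two consecutive requests of the same user exceeds a timeout. The 30-minute default and the name `split_sessions` are illustrative assumptions.

```python
from datetime import datetime, timedelta

def split_sessions(visits, timeout=timedelta(minutes=30)):
    """Split one user's visits into sessions by inactivity gaps.

    visits: list of (timestamp, url) pairs for a single user, sorted by time.
    """
    sessions = []
    for ts, url in visits:
        # Continue the current session if the previous request is recent enough.
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return sessions

# Hypothetical visits: the third request comes 40 minutes after the second.
t0 = datetime(2012, 1, 1, 10, 0)
visits = [
    (t0, "/a"),
    (t0 + timedelta(minutes=10), "/b"),
    (t0 + timedelta(minutes=50), "/c"),
]
sessions = split_sessions(visits)
```

The timeout is measured from the previous request rather than from the start of the session, so a long but continuously active visit stays in one session.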

Session categorization by fuzzy clustering

Once user sessions have been identified, a clustering process is applied in order to group similar sessions into the same category. Each session category includes users exhibiting common browsing behaviour and hence similar interests. Different types of clustering algorithms can be applied to Web data. One important criterion to consider in the choice of the clustering method is the possibility of creating overlapping clusters. This is a fundamental aspect of Web personalization, where the ambiguity of navigational data requires that a user may belong to more than one category or profile. Fuzzy clustering turns out to be a good candidate method for handling ambiguity in the data, since it enables the creation of overlapping clusters and introduces a degree of membership of each item in each cluster.
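To make the idea of membership degrees concrete, the following sketch runs the standard fuzzy c-means (FCM) iteration on a handful of sessions. It is an illustration under stated assumptions, not the authors' implementation: sessions are encoded as binary page-visit vectors, the distance is Euclidean, the fuzzification parameter is m = 2, and the function name `fuzzy_c_means` and the sample data are hypothetical.

```python
from math import dist

def fuzzy_c_means(data, centers, m=2.0, max_iter=100, tol=1e-6):
    """Standard FCM; `centers` holds the initial cluster centres."""
    centers = [tuple(c) for c in centers]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(max_iter):
        # Membership update: closer centres receive higher degrees.
        memberships = []
        for x in data:
            d = [dist(x, v) for v in centers]
            if 0.0 in d:
                # x coincides with a centre: give it full membership there.
                hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
                row = [h / sum(hits) for h in hits]
            else:
                row = [1.0 / sum((dj / dk) ** exponent for dk in d) for dj in d]
            memberships.append(row)
        # Centre update: mean of the data weighted by membership ** m.
        new_centers = []
        for j, old in enumerate(centers):
            w = [row[j] ** m for row in memberships]
            total = sum(w)
            if total == 0.0:
                new_centers.append(old)  # empty cluster: keep the old centre
                continue
            new_centers.append(tuple(
                sum(wi * x[k] for wi, x in zip(w, data)) / total
                for k in range(len(old))
            ))
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return memberships, centers

# Hypothetical sessions over four pages (1 = page visited in the session).
sessions = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (0, 0, 1, 1), (1, 1, 1, 1)]
memberships, centres = fuzzy_c_means(sessions, centers=[(1, 1, 0, 0), (0, 0, 1, 1)])
```

The last session visits all four pages, so it lies midway between the two profiles and is shared about equally between both clusters, while the other sessions belong predominantly to one cluster. A hard partition would have to force that ambiguous session into a single category.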

V. Desired Implications

Web usage mining involves mining the usage characteristics of the users of Web applications. The extracted information can then be used in a variety of ways, such as enhancing the quality of electronic commerce services, personalizing Web portals, and improving applications.

The method proposed above can be implemented to mine the usage characteristics of the users of Web applications. The number of similar URLs visited by users in a particular session gives an understanding of user behaviour. Comparing the number of URLs visited with the user sessions gives a clear understanding of the user's evolution.

VI. References

[1] R. Kosala, H. Blockeel, Web Mining Research: A Survey, SIGKDD Explorations, Vol. 2, No. 1, 2000, pp. 1-15.
[2] F. M. Facca, P. L. Lanzi, Mining interesting knowledge from weblogs: a survey, Data & Knowledge Engineering, 53, 2005, pp. 225-241.
[3] B. Mobasher, R. Cooley, J. Srivastava, Automatic personalization based on Web usage mining, TR-99010, Department of Computer Science, DePaul University, 1999.
[4] F. Masseglia, P. Poncelet, R. Cicchetti, An efficient algorithm for web usage mining, J. Networking Inf. Syst. (NIS), 2(5-6), 1999, pp. 571-603.
[5] S. Araya, M. Silva, R. Weber, A methodology for web usage mining and its application to target group identification, Fuzzy Sets and Systems, 148, 2004, pp. 139-152.
[6] D. Pierrakos, G. Paliouras, C. Papatheodorou, C. D. Spyropoulos, Web usage mining as a tool for personalization: a survey, User Modeling and User-Adapted Interaction, Vol. 13, No. 4, 2003, pp. 311-372.
S. K. Pal, V. Talwar, P. Mitra, Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions, IEEE Transactions on Neural Networks, Vol. 13, No. 5, 2002, pp. 1163-1177.
[7] Y. H. Cho, J. K. Kim, Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce, Expert Systems with Applications, 26, 2004, pp. 233-246.
[8] A. Abraham, Business Intelligence from Web Usage Mining, Journal of Information & Knowledge Management, Vol. 2, No. 4, 2003, pp. 375-390.
[9] M. Kitsuregawa, M. Toyoda, I. Pramudiono, Web community mining and web log mining: commodity cluster based execution, In Proceedings of the 13th Australasian Database Conference (ADC'02), Melbourne, Australia, 5, 2002, pp. 3-10.
[10] M. N. Garofalakis, R. Rastogi, S. Seshadri, K. Shim, Data mining and the web: past, present and future, In Proc. of the second international workshop on web information and data management, ACM, 1999.
[11] O. Nasraoui, R. Krishnapuram, A. Joshi, Relational clustering based on a new robust estimator with applications to web mining, In Proc. of the International Conf. North American Fuzzy Info. Proc. Society (NAFIPS 99), New York, 1999, pp. 705-709.
[12] A. Vakali, J. Pokorný, T. Dalamagas, An Overview of Web Data Clustering Practices, EDBT Workshops, 2004, pp. 597-606.
