Fuzzy Clustering: An Approach for Mining Usage Profiles from Web Documents

algorithms are k-means and k-medoid. The advantage of the partition-based algorithms is that they use an iterative way to create the clusters, but the drawback is that the number of clusters has to be determined in advance and only spherical shapes can be determined as clusters.
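The iterative scheme of a partition-based method can be illustrated with a minimal k-means sketch in plain Python. The function name `kmeans`, the use of the first k points as initial centres, and the sample data are assumptions made for this example only.

```python
import math

def kmeans(points, k, iterations=100):
    """Minimal k-means sketch; the first k points seed the centres (an assumption)."""
    centres = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # no centre moved, so the partition is stable
        centres = new_centres
    return centres, labels

# Hypothetical 2-D sample data with two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centres, labels = kmeans(points, k=2)
print(labels)
```

Note that `k` must be supplied by the caller, which is the drawback named above.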

Hierarchical algorithms provide a hierarchical grouping of the objects. There exist two approaches, the bottom-up and the top-down approach. In the bottom-up approach, each object represents a different cluster at the beginning of the algorithm, and clusters are merged until all objects belong to the same cluster. In the top-down approach, all objects belong to the same cluster at the start of the algorithm, and this cluster is split until each object constitutes a different cluster. A key aspect of this kind of algorithm is the definition of the distance measurements between the objects and between the clusters. The drawback of the hierarchical algorithms is that once an object is assigned to a given cluster, the assignment cannot be modified later. Furthermore, as in the partition-based case, only spherical clusters can be obtained. The advantage of the hierarchical algorithms is that the validation indices (correlation, inconsistency measure), which can be defined on the clusters, can be used for determining the number of clusters.
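The bottom-up approach can be sketched as follows. The single-linkage distance between clusters, the stopping rule (`target_clusters`), and the sample points are assumptions chosen for illustration; the text does not prescribe a particular distance definition.

```python
import math

def single_linkage(a, b):
    """Distance between two clusters: the distance of their closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, target_clusters):
    """Bottom-up sketch: start with one cluster per object, repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # A merge is final: members are never reassigned afterwards.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical sample data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, 3)
print(result)
```

Swapping `single_linkage` for another cluster distance changes which clusters merge first, which is why the distance definition is a key design choice.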

Density-based algorithms start by searching for core objects, and they grow the clusters from these cores by searching for objects that lie in a neighbourhood within a given radius of an object. The advantage of this type of algorithm is that it can detect clusters of arbitrary form and can filter out noise.
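A minimal sketch of this idea, in the style of the well-known DBSCAN algorithm, is given below. The parameter names `eps` (radius) and `min_pts` (minimum neighbourhood size of a core object) and the sample data are assumptions for this example.

```python
import math

def density_cluster(points, eps, min_pts):
    """DBSCAN-style sketch: returns one label per point; -1 marks noise."""
    def neighbours(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core object; noise unless a cluster reaches it later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border object previously marked as noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is itself a core object, so keep growing
                queue.extend(nbrs)
    return labels

# Hypothetical sample data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = density_cluster(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point receives the label -1, which is how noise is filtered out without fixing the number of clusters in advance.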

Grid-based algorithms use a hierarchical grid structure to decompose the object space into a finite number of cells. For each cell, statistical information about the objects is stored, and the clustering is performed on these cells. The advantage of this approach is the fast processing time, which is in general independent of the number of data objects.
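The cell statistics can be sketched as follows. For brevity this example uses a single-level grid rather than a hierarchical one, and the names `grid_statistics`, `dense_cells`, `cell_size` and `min_count` are hypothetical.

```python
from collections import defaultdict

def grid_statistics(points, cell_size):
    """Map each 2-D point to a grid cell and keep a count and coordinate sums per cell."""
    cells = defaultdict(lambda: {"count": 0, "sum_x": 0.0, "sum_y": 0.0})
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key]["count"] += 1
        cells[key]["sum_x"] += x
        cells[key]["sum_y"] += y
    return dict(cells)

def dense_cells(cells, min_count):
    """Select cells by their stored statistics only; the raw points are not revisited."""
    return sorted(key for key, stats in cells.items() if stats["count"] >= min_count)

# Hypothetical sample data.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (3.1, 3.4), (3.6, 3.2), (7.5, 0.5)]
cells = grid_statistics(points, cell_size=1.0)
print(dense_cells(cells, min_count=2))
```

After the single pass that fills the cells, every later step iterates over cells, so its cost depends on the grid resolution rather than on the number of data objects.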

Fuzzy clustering supposes that no hard clusters exist on the set of objects, but that one object can be assigned to more than one cluster. The best-known fuzzy clustering algorithm is FCM (fuzzy c-means).
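For reference, the standard formulation of FCM can be written as follows. The symbols are introduced here and are not taken from the text: $x_i$ are the $N$ objects, $c_j$ the $C$ cluster centres, $u_{ij}$ the membership of object $i$ in cluster $j$, and $m > 1$ the fuzzifier.

```latex
% Objective minimised by FCM, subject to \sum_{j=1}^{C} u_{ij} = 1 for every i
J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2}

% Alternating update rules
u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)}},
\qquad
c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}
```

Because the memberships of each object sum to 1 across clusters, an object lying between two centres receives a substantial membership in both.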

III. Analysis of Problem

With the explosive growth of information sources available on the World Wide Web and the rapidly increasing pace of adoption of internet commerce, the internet has evolved into a gold mine that contains or dynamically generates information that is beneficial to e-businesses. A web site is the most direct link a company has to its current and potential customers. Companies can study visitors' activities through web analysis and find patterns in the visitors' behavior. Web usage patterns could be directly applied to efficiently manage activities related to e-Business, e-CRM, e-Services, e-Education, e-Newspapers, and e-Government. With the large number of companies using the internet to distribute and collect information, knowledge discovery on the web has become an important research area.

Applications like Customer Relationship Management (CRM) can use data from within and outside an organization to allow an understanding of its customers on an individual basis or on a group basis, such as by forming customer profiles. An improved knowledge of the customers' preferences and needs forms the basis for effective CRM. For a business, it is important to keep the loyalty of its old customers and to attract new customers. Automated data mining or knowledge discovery techniques can be used to discover web user profiles. These techniques can automatically extract mass user profiles, in the form of frequent access patterns, from the history of previous user click streams stored in web log files. Although there have been considerable advances in web usage mining, there have been no detailed studies presenting a fully integrated approach to mining a real web site that takes into account aspects such as evolving profiles, dynamic content, and the availability of a taxonomy or database in addition to web logs.

IV. Proposed Work

The general scheme of the proposed approach for mining usage profiles using fuzzy clustering consists of the stages described below.

Web Log Data

When users visit a Web site, the Web server stores the information about their accesses in a log file. Each record of a log file represents a page request made by a Web user. In particular, it typically contains the following information: the user's IP address, the date and time of the access, the URL of the requested page, the request protocol, and a code indicating the status of the request.
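As an illustration, the fields listed above can be extracted from one log record as sketched below. The sketch assumes the widely used Common Log Format; the actual format of the log file is not specified in the text, and the sample record, the pattern `LOG_PATTERN` and the function `parse_record` are hypothetical.

```python
import re
from datetime import datetime

# Pattern for one record in Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" (?P<status>\d{3}) \S+'
)

def parse_record(line):
    """Return the fields of one log record as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Month names are parsed with the default (English) locale.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

# Hypothetical sample record.
sample = '192.0.2.10 - - [10/Jan/2012:13:55:36 +0000] "GET /products/index.html HTTP/1.1" 200 2326'
record = parse_record(sample)
print(record["ip"], record["url"], record["status"])
```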

Usage Pre-processing

The aim of the pre-processing step is to

identify user sessions starting from the information

contained in the access log file.

Data pre-processing involves two main steps: data

cleaning and user session identification.

Archana N Boob et al ,Int.J.Computer Technology & Applications,Vol 3 (1),329-331

IJCTA | JAN-FEB 2012 Available [email protected]

330

ISSN:2229-6093

Data cleaning

The first step of log data pre-processing consists of removing useless requests from the log files. In particular, data cleaning removes redundant references such as images, sound files, multiple frames, and dynamic pages that have the same template. We eliminate the irrelevant items by checking the suffix of the URL requests. Hence, all log entries with filename suffixes such as gif, jpeg, jpg, and map are removed. These operations not only remove uninteresting sessions but also simplify the mining task that follows.
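The suffix check can be sketched as follows. The suffix list contains only the examples named above, and the function names and sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Suffixes named in the text; a real site would likely extend this set (assumption).
IRRELEVANT_SUFFIXES = (".gif", ".jpeg", ".jpg", ".map")

def is_relevant(url):
    """True when the requested path does not end in one of the filtered suffixes."""
    path = urlparse(url).path.lower()  # ignore the query string and letter case
    return not path.endswith(IRRELEVANT_SUFFIXES)

def clean(urls):
    """Keep only the requests that survive the suffix check."""
    return [u for u in urls if is_relevant(u)]

# Hypothetical requested URLs.
requests = ["/index.html", "/img/logo.GIF", "/photo.jpg?size=large", "/catalog/item?id=7", "/nav.map"]
cleaned = clean(requests)
print(cleaned)
```

Checking the parsed path rather than the raw URL means that a query string after the filename does not hide the suffix.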

User session identification

- User identification

User identification refers to the process of associating the page requests recorded in the web logs with the users who made them. Visitors are distinguished according to their IP address and user agent. Due to the existence of caches, proxy servers (including internet cafes, etc.) and firewalls, this step can be very complicated and time-consuming. Scholars have put forward some heuristic rules to identify users: (1) Different IP addresses represent different users. (2) When the IP address is the same, different operating systems or browsers represent different users. (3) When the IP address, operating system and browser are all the same, check whether there is a direct link between the requested page and the pages visited previously; if so, the requests belong to one user, and if not, to different users.
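Rules (1) and (2) can be sketched as follows; rule (3) is omitted because it needs the link structure of the site, which is not available here. The record layout (keys `ip`, `agent`, `url`) and the sample records are assumptions for illustration.

```python
def identify_users(records):
    """Assign a user id to each distinct (IP address, user agent) pair, in order of appearance."""
    user_ids = {}
    labelled = []
    for record in records:
        key = (record["ip"], record["agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids)  # first time this pair is seen: new user
        labelled.append({**record, "user": user_ids[key]})
    return labelled

# Hypothetical records; the user agent string stands in for operating system and browser.
records = [
    {"ip": "192.0.2.10", "agent": "Firefox/Windows", "url": "/index.html"},
    {"ip": "192.0.2.10", "agent": "Safari/MacOS", "url": "/index.html"},
    {"ip": "192.0.2.10", "agent": "Firefox/Windows", "url": "/products.html"},
    {"ip": "192.0.2.77", "agent": "Firefox/Windows", "url": "/index.html"},
]
users = [r["user"] for r in identify_users(records)]
print(users)
```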

- Session identification

A session refers to the series of activities from when a user first logs into the website to when the user leaves it. The goal of session identification is to obtain a meaningful sequence of visits within a specific period of time.
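The text does not say how the end of a session is detected. A common heuristic, assumed in the sketch below, is an inactivity timeout: a new session starts when the gap between two consecutive requests of the same user exceeds a threshold. The 30-minute default, the function name and the sample visits are assumptions.

```python
from datetime import datetime, timedelta

def split_sessions(visits, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, url) visits wherever the gap exceeds timeout."""
    sessions = []
    for timestamp, url in visits:
        # Compare with the timestamp of the last visit in the current session.
        if sessions and timestamp - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((timestamp, url))
        else:
            sessions.append([(timestamp, url)])
    return [[url for _, url in session] for session in sessions]

# Hypothetical visits of a single user.
start = datetime(2012, 1, 10, 9, 0)
visits = [
    (start, "/index.html"),
    (start + timedelta(minutes=5), "/products.html"),
    (start + timedelta(minutes=50), "/index.html"),
    (start + timedelta(minutes=55), "/contact.html"),
]
sessions = split_sessions(visits)
print(sessions)
```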

Session categorization by fuzzy clustering

Once user sessions have been identified, a clustering process is applied in order to group similar sessions in the same category. Each session category includes users exhibiting a common browsing behaviour and hence similar interests. Different types of clustering algorithms can be applied to Web data. One important criterion to be considered in the choice of the clustering method is the possibility of creating overlapping clusters. This is a fundamental facet of Web personalization, where the ambiguity of the navigational data requires that a user may belong to more than one category or profile. Fuzzy clustering turns out to be a good candidate method to handle ambiguity in the data, since it enables the creation of overlapping clusters and introduces a degree of membership of each item in each cluster.
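A minimal sketch of fuzzy c-means applied to sessions is given below. It assumes that each session is encoded as a binary vector over the pages of the site (1 if the page was visited), that the initial centres are two of the sessions, and that the fuzzifier is m = 2; none of these choices is prescribed by the text.

```python
import math

def fuzzy_c_means(data, centres, m=2.0, iterations=50):
    """Fuzzy c-means sketch: returns (centres, memberships), one membership row per item."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: closer centres receive a larger share; each row sums to 1.
        memberships = []
        for x in data:
            dists = [max(math.dist(x, c), 1e-12) for c in centres]  # avoid division by zero
            memberships.append(
                [1.0 / sum((d_j / d_k) ** exponent for d_k in dists) for d_j in dists]
            )
        # Centre update: mean of all items, weighted by membership ** m.
        centres = [
            [
                sum(u[j] ** m * x[d] for u, x in zip(memberships, data))
                / sum(u[j] ** m for u in memberships)
                for d in range(len(data[0]))
            ]
            for j in range(len(centres))
        ]
    return centres, memberships

# Hypothetical sessions over four pages; the last one visits every page.
sessions = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1]]
centres, memberships = fuzzy_c_means(sessions, centres=[sessions[0], sessions[2]])
for row in memberships:
    print([round(u, 2) for u in row])
```

The last session lies between the two groups and receives a comparable membership in both categories, which is the overlapping behaviour that motivates the choice of fuzzy clustering.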

V. Desired Implications

Web usage mining involves mining the usage characteristics of the users of Web applications. The extracted information can then be used in a variety of ways, such as to enhance the quality of electronic commerce services, to personalize web portals, and to improve the applications.

The proposed method can be implemented for mining the usage characteristics of the users of Web applications. The number of similar URLs visited by users in a particular session gives an understanding of user behavior. Comparing the number of URLs visited with the user sessions gives an understanding of how users' behavior evolves.

