Web Usage Mining: An Implementation View
Sathya Babu Korra, Saroj Kumar Panigrahy, and Sanjay Kumar Jena
Department of Computer Science and Engineering
National Institute of Technology Rourkela, 769 008, Odisha, India
{ksathyababu,panigrahys,skjena}@nitrkl.ac.in
Abstract. This paper describes the implementation of Web usage mining for the DSpace server of NIT Rourkela. The DSpace log files have been preprocessed to convert the data stored in them into a structured format. Thereafter, the general procedures for bot-removal and session-identification from a Web log file have been applied with certain modifications pertaining to the DSpace log files. Furthermore, analysis of these log files using a subjective interpretation of the recently proposed algorithm EIN-WUM has also been conducted.
Keywords: Data mining, Web data, Web usage mining, DSpace.
1 Introduction
Web usage mining (WUM) is the process of applying data mining techniques to the discovery of usage patterns from Web data, targeted towards various applications [1]. WUM involves mining the usage characteristics of the users of Web applications. This extracted information can then be used in a variety of ways, such as improvement of the application or detection of fraudulent elements. The major problem with Web mining in general, and WUM in particular, is the nature of the data involved. With the upsurge of the Internet in this millennium, Web data has grown enormous, with transactions and accesses occurring every second. Apart from its volume, the data is not completely structured; it is in a semi-structured format and needs considerable preprocessing and parsing before the required information can be extracted. This paper describes work that takes up a small part of the WUM process: preprocessing, user identification, bot-removal, and analysis of the log files of the DSpace Web server at NIT Rourkela.
2 Data for Web Usage Mining
In Web Mining, data can be collected at the server-side, client-side, proxy servers,
or obtained from an organization’s database (which contains business data or
consolidated Web data for business intelligence [2]). Each type of data collection
differs not only in terms of the location of the data source, but also the kinds
of data available, the segment of population from which the data was collected,
and its method of implementation.
Web Data: The various kinds of data that can be used in Web mining are: Content, which usually consists of multimedia content such as text, graphics, etc.; Structure, which describes the organization of the content, i.e., HTML or XML tags, hyperlinks, etc.; Usage, the pattern of usage of webpages, such as IPs, page references, and the date and time of access; and User Profile, demographic information about users of the website, such as registration data and customer profile information [1].
Data Sources: The data sources may include Web data repositories: Web Server Logs, i.e., a history of page requests [3, 4]; Proxy Server Logs, i.e., proxy traces that may reveal the actual HTTP requests from multiple clients to multiple Web servers and serve as a data source for characterizing the browsing behavior of a group of anonymous users sharing a common proxy server; and Browser Logs, i.e., client-side data collection done by using a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mozilla) to enhance its data collection capabilities [1].
Abstract Data: The information obtained from the data sources described above can be used to identify various abstract data: number of hits, number of visitors, visitor referring website, visitor referral website, time and duration, path analysis, browser type, cookies, and platform [5].
Possible Actions: The data collected can be analysed and the following possible actions can be taken: shortening paths of high-visit pages, eliminating or combining low-visit pages, redesigning pages to help user navigation, and helping to evaluate the effectiveness of advertising campaigns [5].
3 Web Usage Mining
There are three main tasks for performing WUM: preprocessing, pattern discovery and pattern analysis [1]. These are briefly explained as follows.
Preprocessing: Commonly used as a preliminary data mining practice, data preprocessing transforms the data into a format that can be processed more easily and effectively for the purpose of the user. The different types of preprocessing in WUM are usage, content, and structure preprocessing.
Pattern Discovery: WUM can be used to uncover patterns in server logs but is often carried out only on samples of data. The mining process will be ineffective if the samples are not a good representation of the larger body of data. The various pattern discovery methods are Statistical Analysis, Association Rules, Clustering, Classification, Sequential Patterns, and Dependency Modeling.
Pattern Analysis: The need behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
4 Implementation Details
This section describes the various operations performed to find the web usage patterns of the DSpace server of NIT Rourkela. Different web server log analyzers like Web Expert Lite 6.1 and Analog 6.0 have been used to analyze various sample web server logs obtained. The key information obtained was: hits (total hits, visitor hits, average hits per day, average hits per visitor, failed requests); page views (total page views, average page views per day, average page views per visitor); visitors (total visitors, average visitors per day, total unique IPs); bandwidth (total bandwidth, visitor bandwidth, average bandwidth per day, average bandwidth per hit, average bandwidth per visitor); and access data such as files, images, referrers, user agents, etc.
4.1 Collection of DSpace Log Files
The DSpace server log files were collected and the features found are shown below. The Common log contains the requested resource and a few other pieces of information, but does not contain referral, user agent, or cookie information. The information is contained in a single file. The following example shows these fields populated with values in a common log file record:
host log user date:time GMToffset request status bytes
125.125.125.125 - - [10/Oct/1999:21:15:05 +0500] “GET /index.html HTTP/1.0” 200 1043
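A record in this layout can be split into its fields with a regular expression. The following is a minimal Java sketch, with one capture group per field of the format above; the class name, method name, and regex are illustrative assumptions, not the paper's code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommonLogParser {
    // One capture group per field of the record layout shown above:
    // host, logname, user, date:time with GMT offset, request line, status, bytes.
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    // Returns {host, date:time, request, status, bytes},
    // or null if the line does not match the Common Log Format.
    static String[] parse(String line) {
        Matcher m = CLF.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(4), m.group(5),
                              m.group(6), m.group(7) };
    }

    public static void main(String[] args) {
        String[] f = parse("125.125.125.125 - - [10/Oct/1999:21:15:05 +0500] "
            + "\"GET /index.html HTTP/1.0\" 200 1043");
        // prints: 125.125.125.125 requested "GET /index.html HTTP/1.0" -> status 200
        System.out.println(f[0] + " requested \"" + f[2] + "\" -> status " + f[3]);
    }
}
```

Parsing each line this way yields exactly the structured fields that the preprocessing step described below stores in its arrays.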
4.2 Analysis of Web Server Logs
The first part of the analysis was preprocessing. Preprocessing segregated all the details provided in the log file into a structured form; Java was used for this. The data structures used are linear arrays: ip, time, content, httpmethod, httpstatus, bandwidth, browser, etc.
4.3 Key Constraints and Solutions
Not much Variation in IP: As we are considering the DSpace log files, which are specific to NIT Rourkela, it is observed that there is not much variation in the IP addresses recorded in the log file.
Usernames and Aliases not Provided: The second and third entries in the common log format are the username and alias, which are mainly recorded on a login-based website. This information is not present in the DSpace log files.
Web Crawlers: The various types of crawlers found in the DSpace log files are MSN bots, Yahoo slurps, Google bots, Baidu spiders, etc.
Bot Identification: After much analysis of bot identification and removal, a method specific to the DSpace log files has been used for this purpose. The pseudocode for the method is as follows:
BotId()
{
    while (!EOF)
    {
        readLine();
        // check for keywords (bot, slurp, spider) in the browser[] array
        if (the browser[] array contains a keyword)
        { botflag = true; botcounter++; }
        else
            botflag = false;
    }
}
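A minimal Java sketch of this keyword check follows; the isBot method and BOT_KEYWORDS array are illustrative names, not the paper's implementation:

```java
import java.util.List;

public class BotId {
    // Keywords that mark crawler entries in the user-agent field,
    // as in the pseudocode above.
    private static final String[] BOT_KEYWORDS = {"bot", "slurp", "spider"};

    // Returns true if the user-agent string of a log entry matches a crawler keyword.
    public static boolean isBot(String userAgent) {
        String ua = userAgent.toLowerCase();
        for (String kw : BOT_KEYWORDS) {
            if (ua.contains(kw)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> agents = List.of(
            "Mozilla/5.0 (compatible; Googlebot/2.1)",
            "Mozilla/5.0 (Windows NT 10.0) Firefox/70.0",
            "Mozilla/5.0 (compatible; Yahoo! Slurp)");
        // count the crawler hits, mirroring botcounter in the pseudocode
        long botCounter = agents.stream().filter(BotId::isBot).count();
        System.out.println("bot hits: " + botCounter); // prints "bot hits: 2"
    }
}
```

Lowercasing before the substring test makes the check case-insensitive, which matters because user-agent strings such as "Googlebot" capitalize the keyword.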
Identification of User Sessions: A user session in WUM generally refers to the usage or access of any content of the website from a fixed IP over a fixed period of time. The period of time is subjective to the analyzer. Considering the above requirements, a method specific to the DSpace log file has been used to identify user sessions. The pseudocode for the method is as follows:
SessionId()
{
    i = 1;
    add the first non-bot entry to session i;
    while (!EOF)
    {
        read the next entry;
        if (entry != bot)
        {
            if (IP == previous IP && time[this entry] - time[this entry - 1] < x)
                add entry to session i;
            else
            { i++; add entry to session i; }
        }
    }
}
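The session heuristic above can be sketched in Java as follows; the Entry record, the sessionize method, and the gapSec parameter (the pseudocode's threshold x) are assumed names, not the paper's code:

```java
import java.util.ArrayList;
import java.util.List;

public class SessionId {
    // A simplified log entry: only the fields the session heuristic needs.
    record Entry(String ip, long timeSec, boolean isBot) {}

    // Assigns a session number to each entry: a new session starts when the IP
    // changes or the gap to the previous non-bot entry reaches gapSec.
    // Bot entries are marked -1 and skipped.
    static List<Integer> sessionize(List<Entry> entries, long gapSec) {
        List<Integer> sessions = new ArrayList<>();
        int session = 0;
        Entry prev = null;
        for (Entry e : entries) {
            if (e.isBot()) { sessions.add(-1); continue; } // skip crawler hits
            if (prev == null || !e.ip().equals(prev.ip())
                    || e.timeSec() - prev.timeSec() >= gapSec) {
                session++; // start a new session
            }
            sessions.add(session);
            prev = e;
        }
        return sessions;
    }

    public static void main(String[] args) {
        List<Entry> log = List.of(
            new Entry("10.0.0.1", 0, false),
            new Entry("10.0.0.1", 100, false),
            new Entry("10.0.0.1", 5000, false),  // long gap: new session
            new Entry("10.0.0.2", 5100, false)); // new IP: new session
        System.out.println(sessionize(log, 1800)); // prints "[1, 1, 2, 3]"
    }
}
```

A timeout of 1800 seconds (30 minutes) is a common choice for x, but as noted above the threshold is subjective to the analyzer.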
4.4 Using EIN-WUM Algorithm
After preprocessing, bot identification and removal, and session identification, the EIN-WUM (Enhanced Immune Network Web Usage Mining) algorithm [6] is used. Our interpretation of the algorithm, subject to the DSpace website of NIT Rourkela, is as follows:
- Limit the number of antibodies to 6 (based on the categories from the DSpace website).
- We define the category of each entry in the server log by assigning it a number (0 through 6). The numbers signify: 0 - default value, 1 - content searched by title, 2 - content searched by author, 3 - content searched by date, 4 - content searched by subject, 5 - content accessed by handle, and 6 - content accessed by bitstream.
- The antibodies are initialized from the first 10 sessions. For each session, each entry goes to the antibody corresponding to its category number, so each antibody contains only one category of server log entry.
- For each incoming session, compare it with each existing antibody. If the similarity of an antibody exceeds the threshold, replace the old session with the new session; otherwise, update the antibody with the most similarity.
- Put a limit on the size of each antibody. If an antibody crosses the limit, delete its old entries.
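The category assignment described above can be sketched as a simple URL classifier. The substring patterns below are assumptions about DSpace URL conventions (the label of category 4 is garbled in the source, so browse-subject is an assumed mapping), not the paper's implementation:

```java
public class Categorizer {
    // Maps a requested DSpace path to the category numbers defined above.
    // The substring patterns are assumed DSpace URL conventions, not the
    // paper's code; category 4's meaning is unclear in the source text.
    static int category(String path) {
        if (path.contains("browse-title"))   return 1; // searched by title
        if (path.contains("browse-author"))  return 2; // searched by author
        if (path.contains("browse-date"))    return 3; // searched by date
        if (path.contains("browse-subject")) return 4; // assumed: by subject
        if (path.contains("/handle/"))       return 5; // accessed by handle
        if (path.contains("/bitstream/"))    return 6; // accessed by bitstream
        return 0; // default value
    }

    public static void main(String[] args) {
        System.out.println(category("/dspace/browse-author?starts_with=C")); // 2
        System.out.println(category("/dspace/handle/2080/905"));             // 5
        System.out.println(category("/dspace/index.jsp"));                   // 0
    }
}
```

With entries labelled this way, each antibody accumulates the log entries of exactly one category, as the initialization step above requires.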
The various utilities of the above interpretation are found to be:
a. At the end of the program, the ten most interesting antibodies will remain.
b. The contents accessed in the antibodies will be the most frequently accessed
contents in the whole website.
c. Based on (b) the following changes can be brought to the concerned site:
i. improvements on frequently accessed pages.
ii. deletion or merging of unused pages.
iii. improvement of content.
iv. improvement of interaction with referral sites.
4.5 Results
The results obtained from the analysis are given below.
Preprocessed Information from Log Files: The preprocessing program collected the details in the appropriate data structures and also identified whether an entry is a bot entry or a valid user entry (shown below).
1 true 203.129.199.129 10/Jan/2010:04:04:26 GET 200
17013B 0 /dspace/browse-title?top=2080%2F905
2 true 203.129.199.129 10/Jan/2010:04:04:29 GET 200
14295B 0 /dspace/browse-author?bottom=Misra%2C+M
Summary of the Log File and Sessions: The summary of the log file giving overall details, the sessions, and the different log file entries that constitute the sessions are shown below.
****************Summary****************
number of hits = 14274
number of visitor hits= 7923
number of spider hits= 6351
Number of days= 5
Average hits per day = 2854
Total Bandwidth used = 1494419892 Bytes
Average Bandwidth= 298883978 Bytes
***************************************
session 1 182
session 1 183
session 1 191
session 2 193
session 2 194
session 2 195
session 2 196
session 2 197
session 2 198
Usage Patterns: The different frequently accessed contents in the DSpace website are shown below.
16 0 1 /dspace/browse-author?starts_with=Das%2C+Atanau
92 0 1 /dspace/browse-author?bottom=Wai%2C+P+K+A
181 0 1 /dspace/browse-author?starts_with=Verghese%2C+L
227 1 1 /dspace/browse-author?top=Joshi%2C+Avinash
364 1 1 /dspace/browse-author?top=Joshi%2C+Avinash
527 5 1 /dspace/browse-author
530 5 1 /dspace/browse-author?starts_with=C
532 5 1 /dspace/browse-author?top=Chatterjee%2C+Saurav
536 5 1 /dspace/browse-author?starts_with=S
569 7 1 /dspace/browse-author?starts_with=S
571 7 1 /dspace/browse-author?top=Chatterjee%2C+Saurav
715 8 1 /dspace/browse-author?top=Bal%2C+S
748 8 1 /dspace/browse-author?starts_with=Das%2C+B+M
831 8 1 /dspace/browse-author?starts_with=Karanam%2C+U+M+R
5 Conclusions
The proposed methods were successfully tested on the log files for bot removal and user session identification. The results obtained after the analysis were satisfactory and contained valuable information about the log files. The methodology and implementation presented in this paper are purely DSpace-website specific. Analysis of the information obtained above shows WUM to be a powerful technique for website management and improvement. Moreover, this subjective interpretation of the EIN-WUM algorithm offers considerable scope for extension to other problem domains.
References
1. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explor. Newsl. 1(2), 12–23 (2000), http://portal.acm.org/citation.cfm?id=846188
2. Abraham, A.: Business intelligence from web usage mining. Journal of Information & Knowledge Management, iKMS & World Scientific Publishing Co. 2(4), 375–390 (2003), http://www.worldscinet.com/jikm/02/0204/S0219649203000565.html
3. W3C: Logging control in W3C httpd, http://www.w3.org/Daemon/User/Config/Logging.html
4. W3C: Extended log file format. W3C working draft WD-logfile-960323, http://www.w3.org/TR/WD-logfile.html
5. Gupta, G.K.: Introduction to Data Mining with Case Studies. PHI Learning, 1st edn. (2008)
6. Rahmani, A.T., Helmi, B.H.: EIN-WUM: an AIS-based algorithm for web usage mining. In: Ryan, C., Keijzer, M. (eds.) GECCO, pp. 291–292. ACM (2008), http://doi.acm.org/10.1145/1389095.1389144