Sunday, January 21, 2007

Web mining for web personalization

Magdalini Eirinaki and Michalis Vazirgiannis
Athens University of Economics and Business

ACM Transactions on Internet Technology, Vol. 3, No. 1, Feb. 2003, pp. 1-27

Introduction

Web personalization is defined as any action that adapts the information or services provided by a Web site to the needs of a particular user or a set of users, taking advantage of the knowledge gained from the user’s navigational behavior and individual interests, in combination with the content and the structure of the web site.

Objective:

The objective of a web personalization system is to provide users with the information they want or need, without expecting them to ask for it explicitly.

Content management is the process of classifying the content of a web site into semantic categories in order to make information retrieval and presentation easier for the users. Content management is very important for web sites whose content grows on a daily basis, such as news sites or portals.

Web personalization

… the analysis of the collected data, and the determination of the actions that should be performed. The methods employed to analyze the collected data include content-based filtering, collaborative filtering, rule-based filtering and Web usage mining.

Content-based filtering systems are based solely on individual users’ preferences. The system tracks each user’s behavior and recommends items that are similar to items the user liked in the past.
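As a rough illustration of content-based filtering, the sketch below builds a profile from the keyword weights of pages a user liked and ranks the remaining pages by cosine similarity to that profile. The page names, keyword weights and the function `recommend_by_content` are hypothetical and are not taken from the paper; cosine similarity is only one of several possible similarity measures.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend_by_content(liked, catalog, top_n=2):
    """Rank pages the user has not liked yet by similarity to the liked ones."""
    # The user profile is the sum of the feature vectors of the liked pages.
    profile = {}
    for item in liked:
        for term, weight in catalog[item].items():
            profile[term] = profile.get(term, 0.0) + weight
    candidates = [i for i in catalog if i not in liked]
    ranked = sorted(candidates, key=lambda i: cosine(profile, catalog[i]), reverse=True)
    return ranked[:top_n]

# Hypothetical pages described by keyword weights (assumed data).
catalog = {
    "football-news": {"sports": 1.0, "football": 1.0},
    "tennis-news": {"sports": 1.0, "tennis": 1.0},
    "stock-report": {"finance": 1.0, "stocks": 1.0},
    "transfer-rumours": {"sports": 1.0, "football": 1.0, "transfers": 1.0},
}

print(recommend_by_content({"football-news"}, catalog))
# ['transfer-rumours', 'tennis-news']
```

Only the user's own history enters the computation, which is what distinguishes this approach from collaborative filtering.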

Collaborative filtering systems invite users to rate objects or divulge their preferences and interests, and then return information that is predicted to be of interest to them. This is based on the assumption that users with similar behavior have analogous interests.

The data mining methods that are employed are: association rule mining, sequential pattern discovery, clustering and classification. This knowledge is then used by the system in order to personalize the site according to each user’s behavior and profile.
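The assumption that similar users have analogous interests can be sketched as follows: each user is represented by the set of pages they liked, other users are weighted by their Jaccard similarity to the target user, and unseen pages are scored by the summed weights. The data, the choice of Jaccard similarity and the function names are assumptions made for this example only.

```python
def jaccard(a, b):
    """Share of items two users have in common among all items either liked."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_collaboratively(user, likes, top_n=2):
    """Score pages liked by other users, weighted by their similarity to `user`."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        weight = jaccard(likes[user], items)
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0.0) + weight
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

# Hypothetical preference data (assumed).
likes = {
    "ann": {"p1", "p2"},
    "bob": {"p1", "p2", "p3"},
    "cat": {"p4", "p5"},
    "dan": {"p2", "p3", "p4"},
}

print(recommend_collaboratively("ann", likes))
# ['p3', 'p4']
```

Here "p3" ranks first because it is liked by the two users whose tastes overlap most with those of "ann".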

User profiling

In order to personalize a web site, the system should be able to distinguish between different users or groups of users. This process is called user profiling, and its objective is the creation of an information base that contains the preferences, characteristics, and activities of the users.

Log analysis and web usage mining:

By applying statistical and data mining methods to the web log data, interesting patterns concerning the users’ navigational behavior can be identified, such as user and page clusters, as well as possible correlations between web pages and user groups.

The web usage mining process can be regarded as a three-phase process, consisting of the data preparation, pattern discovery, and pattern analysis phases. In the first phase, log data are preprocessed in order to identify users’ sessions, page views and so on. In the second phase, statistical methods, as well as data mining methods (such as association rules, sequential pattern discovery, clustering and classification), are applied in order to detect interesting patterns.

Most important of all is the user identification issue. More accurate approaches for a priori identification of unique visitors are the use of cookies or similar mechanisms, or the requirement for user registration. A potential problem with such methods, however, might be the reluctance of users to share personal information.
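The first phase can be illustrated with a minimal session-identification sketch. It assumes that the log has already been parsed into (visitor, timestamp in seconds, URL) tuples and that a gap of more than 30 minutes starts a new session; both the record layout and the timeout are assumptions for this example, not details given in the paper.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed, commonly used heuristic

def sessionize(log, timeout=SESSION_TIMEOUT):
    """Split each visitor's requests into sessions at gaps longer than `timeout`."""
    by_user = defaultdict(list)
    for user, ts, url in log:
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each visitor's requests by time
        current, last_ts = [], None
        for ts, url in hits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions

# Hypothetical parsed log entries; visitors are identified by IP address here.
log = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.1", 60, "/news"),
    ("10.0.0.2", 90, "/home"),
    ("10.0.0.1", 4000, "/sports"),
]

for user, pages in sessionize(log):
    print(user, pages)
# 10.0.0.1 ['/home', '/news']
# 10.0.0.1 ['/sports']
# 10.0.0.2 ['/home']
```

The resulting sessions are the input that the pattern discovery phase works on.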

Web usage mining

More advanced data mining methods and algorithms, tailored appropriately for use in the Web domain, include association rules, sequential pattern discovery, clustering and classification. Association rule mining is used to reveal correlations between pages accessed together during a server session. It can also reveal associations between groups of users with specific interests.
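A small sketch of association rule mining over sessions is given below. It is restricted to rules between two pages and uses the usual support and confidence measures; the thresholds, the session data and the function `pair_rules` are assumptions for illustration, and a real system would use a full algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) between pairs of pages."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = sorted(set(session))  # each page counts once per session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # fraction of sessions containing both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_count[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical sessions (assumed data).
sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news"],
    ["/home", "/weather"],
    ["/news", "/sports"],
]

print(pair_rules(sessions))
# [('/sports', '/news', 0.5, 1.0)]
```

The single rule found says that every session containing "/sports" also contained "/news", a hint that the two pages could be linked more directly.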

Sequential pattern discovery is an extension of association rule mining in that it reveals patterns of co-occurrence incorporating the notion of time sequence. Clustering is used to group together items that have similar characteristics. In the context of web mining, we can distinguish two cases: user clusters and page clusters.
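To show how sequential pattern discovery differs from association rules, the sketch below counts ordered page pairs, so that visiting "/home" before "/sports" is not the same pattern as the reverse. It is limited to sequences of length two and uses hypothetical sessions and an assumed support threshold.

```python
from collections import Counter

def frequent_sequences(sessions, min_support=0.5):
    """Find ordered page pairs (a visited before b) that occur in enough sessions."""
    counts = Counter()
    for session in sessions:
        seen_pairs = set()  # count each ordered pair once per session
        for i, a in enumerate(session):
            for b in session[i + 1:]:
                if a != b:
                    seen_pairs.add((a, b))
        counts.update(seen_pairs)
    n = len(sessions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical sessions, each listed in visiting order (assumed data).
sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/sports"],
    ["/sports", "/home"],
    ["/home", "/news"],
]

for pair, support in sorted(frequent_sequences(sessions).items()):
    print(pair, support)
# ('/home', '/news') 0.5
# ('/home', '/sports') 0.5
```

The pair ("/sports", "/home") occurs in only one session and is therefore dropped, although an unordered association rule would have counted it together with ("/home", "/sports").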

Page clustering identifies groups of pages that seem to be conceptually related according to the users’ perception. User clustering results in groups of users that seem to behave similarly when navigating through a Web site.
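User clustering can be sketched with a very simple greedy ("leader") scheme: each user joins the first cluster whose first member visited a sufficiently similar set of pages. This is a deliberately simple stand-in chosen for brevity, not the method used in the paper; the threshold, the data and the function names are assumptions.

```python
def jaccard(a, b):
    """Overlap between two sets of visited pages."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def leader_clusters(visits, threshold=0.5):
    """Assign each user to the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_pages, [user, ...])
    for user, pages in visits.items():
        for leader_pages, members in clusters:
            if jaccard(pages, leader_pages) >= threshold:
                members.append(user)
                break
        else:
            # No existing cluster is close enough: this user starts a new one.
            clusters.append((pages, [user]))
    return [members for _, members in clusters]

# Hypothetical sets of pages visited by each user (assumed data).
visits = {
    "u1": {"/sports", "/football", "/tennis"},
    "u2": {"/sports", "/football"},
    "u3": {"/finance", "/stocks"},
    "u4": {"/finance", "/stocks", "/bonds"},
}

print(leader_clusters(visits))
# [['u1', 'u2'], ['u3', 'u4']]
```

Swapping the roles of users and pages (clustering pages by the set of users who visited them) gives a comparable sketch of page clustering.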

Classification is a process that maps a data item into one of several predetermined classes. In the web domain, classes usually represent different user profiles, and classification is performed using selected features that describe each user’s category. The most common classification algorithms are decision trees, naïve Bayesian classifiers, neural networks, and so on.

After discovering patterns from usage data, a further analysis has to be conducted. The exact methodology that should be followed depends on the technique previously used. The most common ways of analyzing such patterns are either by using a query mechanism on a database where the results are stored, or by loading the results into a data cube and then performing OLAP operations. Additionally, visualization techniques are used for an easier interpretation of the results. Using these results in association with content and structure information concerning the web site, useful knowledge can be extracted for modifying the site according to the correlation between user and content groups.
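As an example of classification into predetermined user profiles, the following sketch implements a small naïve Bayes classifier with add-one smoothing, using the pages a user visited as features. The profile labels, the training data and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train(samples):
    """Collect class and per-class page counts from (pages, label) samples."""
    class_count = Counter()
    page_count = defaultdict(Counter)
    vocab = set()
    for pages, label in samples:
        class_count[label] += 1
        page_count[label].update(pages)
        vocab.update(pages)
    return class_count, page_count, vocab

def classify(pages, model):
    """Return the label with the highest log-probability under naive Bayes."""
    class_count, page_count, vocab = model
    total = sum(class_count.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_count.items():
        # Add-one (Laplace) smoothing over the known vocabulary of pages.
        denom = sum(page_count[label].values()) + len(vocab)
        score = log(n / total)
        for page in pages:
            if page in vocab:  # pages never seen in training are ignored
                score += log((page_count[label][page] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled sessions (assumed data).
samples = [
    (["/sports", "/football"], "sports-fan"),
    (["/sports", "/tennis"], "sports-fan"),
    (["/finance", "/stocks"], "investor"),
    (["/finance", "/bonds"], "investor"),
]
model = train(samples)

print(classify(["/stocks", "/finance"], model))
# investor
```

A decision tree or neural network could be substituted for the classifier without changing the surrounding idea: features describing a user go in, a profile label comes out.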
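The data cube idea can be sketched without any OLAP software: facts carry several dimensions plus a measure, and a roll-up aggregates the measure over whichever dimensions are kept. The dimension names (`day`, `section`, `group`), the `hits` measure and the data are assumptions for this example.

```python
from collections import Counter

def roll_up(facts, dims):
    """Aggregate hit counts over the chosen dimensions of the cube."""
    cube = Counter()
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        cube[key] += fact["hits"]
    return dict(cube)

# Hypothetical fact table: one row per (day, site section, user group).
facts = [
    {"day": "Mon", "section": "news", "group": "students", "hits": 12},
    {"day": "Mon", "section": "sports", "group": "students", "hits": 7},
    {"day": "Tue", "section": "news", "group": "staff", "hits": 5},
    {"day": "Tue", "section": "news", "group": "students", "hits": 9},
]

print(roll_up(facts, ["section"]))
# {('news',): 26, ('sports',): 7}
print(roll_up(facts, ["day", "group"]))
# {('Mon', 'students'): 19, ('Tue', 'staff'): 5, ('Tue', 'students'): 9}
```

Choosing a different list of dimensions corresponds to viewing the same usage data at a coarser or finer level of detail.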

Research initiatives

Most of the efforts focus on extracting useful patterns and rules using data mining techniques in order to understand the users’ navigational behavior, so that decisions concerning site restructuring or modification can then be made by humans. In several cases, a recommendation engine helps the user navigate through a site.

A different approach is adopted by Zaiane et al. The authors combine OLAP and data mining techniques with a multidimensional data cube to interactively extract implicit knowledge. Their WebLogMiner system, after filtering the data contained in the web log, transforms them into a relational database. In the next phase a data cube is built, each dimension representing a field with all possible values described by attributes. OLAP technology is then used in combination with data mining techniques for prediction, classification and time-series analysis of web log data.

Pattern discovery is accomplished through the use of general statistical algorithms and data mining techniques such as association rules, sequential pattern analysis, clustering and classification. The results are then analyzed through a simple knowledge query mechanism, a visualization tool, or the information filter, which makes use of the preprocessed content and structure information to automatically filter the results of the knowledge discovery algorithms.
