WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.

Zenglin Xu, Irwin King and Michael R. Lyu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
{zlxu, king, lyu}@cse.cuhk.edu.hk

[Figure: framework diagram with boxes Web Pages; Feature Extraction (Content Features, Structure Features, Links); Similarity Representation (Content-based Similarities, Structure-based Similarities, Neighborhood-based Similarities); Fused Similarity; Prediction Model]

• For web page classification, there are many available data sources, such as the text, the title, the meta data, the anchor text, etc.
• Simply putting them together would not greatly enhance the classification performance.
• Data sources of different dimensions and types can be represented in a common format: the kernel matrix.
• A kernel learning approach is thus proposed to integrate multiple data sources.
• A systematic way of integrating multiple data sources.
• Better classification accuracy.

• Dataset: DMOZ
• AT: Anchor Text
• LT: Link Text
• MT: Meta Data
• TI: Title
• PT: Plain Text
• UW: Universally Weighted sources
• KC: sources combined by Kernel Combination
• Mi-F1: Micro-F1
• Ma-F1: Macro-F1

• 1. Feature Extraction.
• 2. Similarity Representation. Each data source is represented as a kernel matrix (K_i).
• 3. Similarity Combination.
• 4. Classification.
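Steps 2 and 3 above can be illustrated with a minimal sketch. It is not the poster's method: the linear kernel, the trace normalization, the equal weights, and the toy feature vectors are all assumptions chosen for illustration, whereas the poster's approach learns the combination. The point is only that sources with different dimensionality become same-sized kernel matrices that can be summed.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def linear_kernel(features: Sequence[Sequence[float]]) -> Matrix:
    """Gram matrix K[a][b] = <x_a, x_b> for one data source (assumed kernel)."""
    return [[sum(p * q for p, q in zip(xa, xb)) for xb in features]
            for xa in features]


def trace_normalize(kernel: Matrix) -> Matrix:
    """Rescale a kernel so its trace equals the number of pages (assumed step)."""
    n = len(kernel)
    trace = sum(kernel[a][a] for a in range(n))
    if trace == 0:
        raise ValueError("kernel has zero trace")
    scale = n / trace
    return [[scale * value for value in row] for row in kernel]


def combine_kernels(kernels: Sequence[Matrix],
                    weights: Sequence[float]) -> Matrix:
    """Weighted sum K = sum_i w_i * K_i of same-sized kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    n = len(kernels[0])
    return [[sum(w * k[a][b] for w, k in zip(weights, kernels))
             for b in range(n)]
            for a in range(n)]


# Hypothetical toy data: three pages, two sources of different dimensionality.
title_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
anchor_features = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

# Step 2: one kernel matrix per data source.
kernels = [trace_normalize(linear_kernel(f))
           for f in (title_features, anchor_features)]

# Step 3: fuse with equal weights (a placeholder for learned weights).
equal_weights = [1.0 / len(kernels)] * len(kernels)
fused = combine_kernels(kernels, equal_weights)
```

For step 4, a fused matrix like this could be passed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.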
• Substitute the combined kernel K into the dual SVM.
• We have the following QCQP problem: [equation not preserved]
  where α is the parameter of the dual SVM, δ is a constant, and t is the trace vector.
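The substitution step can be written out using the standard soft-margin dual SVM. This is a sketch under stated assumptions and does not reproduce the poster's QCQP: the combination weights μ_i, the number of sources m, the labels y_j, and the box constant C are notation introduced here, and K is assumed to be a nonnegative linear combination of the source kernels K_i.

```latex
% Standard dual SVM with kernel K (alpha are the dual variables):
\max_{\alpha} \; \sum_{j} \alpha_j
  - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k \, y_j y_k \, K(x_j, x_k)
\quad \text{s.t.} \quad \sum_{j} \alpha_j y_j = 0, \qquad 0 \le \alpha_j \le C .

% Assumed form of the combined kernel:
K = \sum_{i=1}^{m} \mu_i K_i, \qquad \mu_i \ge 0 .

% After substitution the quadratic term splits into one term per source:
\max_{\alpha} \; \sum_{j} \alpha_j
  - \frac{1}{2} \sum_{i=1}^{m} \mu_i
    \sum_{j,k} \alpha_j \alpha_k \, y_j y_k \, K_i(x_j, x_k) .
```

Each per-source term is quadratic in α, which is why optimizing over the weights as well leads to a problem with quadratic constraints.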