Improve the Performance of the Webpage Content Extraction using Webpage Segmentation Algorithm

Fu Lei, Meng Yao, Yu Hao Fujitsu R&D Center CO., LTD, Beijing, China, 100025

[email protected], [email protected], [email protected]

ABSTRACT: In this paper, we present a method that uses a webpage segmentation algorithm to improve the performance of webpage content extraction. Traditional methods often depend on parsing the DOM tree of the webpage and judging each node of the DOM tree to determine whether it is a text node. This kind of method has a potential problem: because of its local judgment strategy, it sometimes throws part of the content away. Our method, which is based on the VIPS (Vision-based Page Segmentation) algorithm, solves this problem satisfactorily: it extracts the content according to the coordinate information of each block and helps the traditional method recall the lost part of the content.

KEYWORDS: Webpage Segmentation; Webpage Content Extraction; DOM tree analysis; VIPS

I. INTRODUCTION

With the explosion of the World Wide Web, a large amount of data on many different subjects has become available on-line, which has opened the opportunity for users to benefit from the available data in many interesting ways. Usually, users retrieve web data by browsing and keyword searching, which are intuitive forms of accessing data on the web. However, these search strategies have several limitations. Browsing is not suitable for locating particular items of data, because following links is tedious and it is easy to get lost. Keyword searching is sometimes more efficient than browsing, but it often returns large amounts of data, far beyond what the user can handle. As a result, in spite of being publicly and readily available, web data can hardly be properly queried or manipulated. Researchers have therefore begun to consider how to extract the content of a webpage for further processing.

The traditional approaches for extracting data from webpages can be classified as follows. The first is the method based on wrappers [1-5]. Wrappers are specialized programs that identify data of interest and map it to some suitable format. This method has a well-known shortcoming: wrappers are usually developed manually, which is very time-consuming, and they are very difficult to debug. Although many researchers have introduced machine learning to optimize the process, it is still not powerful enough to handle the variety of web pages; a wrapper usually only works on a set of similar web pages, not on most on-line web pages. The second is the method based on HTML DOM tree analysis [6-8, 10, 11], on which much recent work focuses. The main idea of this method is to judge, for each node of the DOM tree, whether it is a text node. Although many researchers have tried to improve it in many detailed aspects, such as the tag-tree method [11], the ontology method [12], and so on, it still has problems. One main problem is that this method often throws away some sentences of the main body content: because it is based on local judgments over the DOM tree, it cannot get a whole view of the page. Moreover, the DOM tree was originally introduced for presentation in the browser rather than for describing the semantic structure of the webpage, so the semantic relations between different sentences cannot be obtained directly; it is no wonder that this method sometimes loses part of the content.
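To make the local judgment strategy concrete, the following is a minimal sketch (not the authors' code; the threshold and function names are illustrative) of how a DOM-tree-based extractor accepts or rejects each node in isolation, which is exactly why short body sentences can be discarded along with the noise:

```python
# A minimal sketch of per-node DOM judgment; names and thresholds are
# illustrative assumptions, not taken from any cited system.
from xml.etree import ElementTree

MIN_TEXT_LEN = 20  # hypothetical threshold: shorter nodes are judged "non-text"

def extract_text_nodes(xhtml: str) -> list[str]:
    """Return the text of nodes judged to be content, one node at a time."""
    root = ElementTree.fromstring(xhtml)
    kept = []
    for node in root.iter():
        text = (node.text or "").strip()
        # Local judgment: each node is accepted or rejected in isolation,
        # with no view of the page as a whole.
        if len(text) >= MIN_TEXT_LEN:
            kept.append(text)
    return kept

page = ("<div><p>This long paragraph is clearly part of the main body text.</p>"
        "<p>Short tail.</p></div>")
print(extract_text_nodes(page))  # the short body sentence is lost
```

Here the second paragraph belongs to the main body but falls below the length threshold, illustrating how a purely local rule throws content away.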

In fact, most research shows that when a page is presented to the user, spatial and visual cues play a very important role: they help the user to unconsciously divide the webpage into several semantic parts. If we can make use of this information, it will help us extract the body content of the page much more precisely. Detecting the semantic content structure of a webpage could therefore improve the performance of webpage content extraction. The VIPS [9] algorithm can do this work well: it divides the webpage into independent semantic blocks, and we can also obtain the coordinate information of each block to assist webpage content extraction. Based on the VIPS algorithm, we can easily recall the lost sentences.

The rest of the paper is organized as follows. Section II provides an overview of the VIPS algorithm. In Section III, we introduce our method, i.e., how to use VIPS to improve the performance of webpage content extraction; the results are also shown in this section. Finally, we give concluding remarks in Section IV.

II. OVERVIEW OF VIPS ALGORITHM

The VIPS algorithm makes full use of the webpage layout features. First, it extracts all the suitable blocks based on the HTML DOM tree structure; then it tries to find the separators between these extracted blocks. Here, separators denote the horizontal or vertical lines in a webpage that visually cross no blocks. Finally, based on these separators, the semantic structure of the webpage is constructed and the webpage is divided into independent blocks. The VIPS algorithm employs a top-down approach, which is very effective.
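The separator-detection step can be sketched as follows. This is our reading of the idea rather than the original implementation: treating each block as a vertical extent, a horizontal separator is simply a y-range crossed by no block.

```python
# A rough sketch of horizontal-separator detection: y-intervals of the page
# that no block covers. Data shapes are assumed for illustration.
def horizontal_separators(blocks, page_height):
    """blocks: list of (top, bottom) y-extents; returns uncovered y-intervals."""
    separators, y = [], 0
    for top, bottom in sorted(blocks):
        if top > y:                 # the gap before this block is a separator
            separators.append((y, top))
        y = max(y, bottom)
    if y < page_height:             # trailing gap below the last block
        separators.append((y, page_height))
    return separators

print(horizontal_separators([(0, 100), (120, 300), (320, 580)], 600))
# gaps between the stacked blocks: [(100, 120), (300, 320), (580, 600)]
```

Vertical separators would be found symmetrically over x-extents; VIPS additionally weights separators by their visibility.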

The basic model of VIPS is described as follows. A web page W is represented as a triple:

W = (B, S, δ)

2009 International Forum on Computer Science-Technology and Applications. 978-0-7695-3930-0/09 $26.00 © 2009 IEEE. DOI 10.1109/IFCSTA.2009.84


B = {B1, B2, ..., BN} is a finite set of blocks. These blocks must not overlap, and each block can be recursively viewed as a sub-web-page associated with a sub-structure induced from the whole page structure. S = {S1, S2, ..., SN} is a finite set of separators, including horizontal separators and vertical separators. Every separator has a weight indicating its visibility, and all the separators in the same S have the same weight. δ is the relationship between every two blocks in B and can be expressed as:

δ : B × B → S ∪ {NULL}.

For example, suppose Bi and Bj are two objects in B. δ(Bi, Bj) ≠ NULL indicates that Bi and Bj are exactly separated by the separator δ(Bi, Bj), i.e., the two objects are adjacent to each other; otherwise, there are other objects between the two blocks Bi and Bj.
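The triple W = (B, S, δ) maps naturally onto plain data structures. The sketch below is our own rendering of the model (class and field names are assumptions, not from the paper): blocks carry coordinate information, separators carry a visibility weight, and δ is a partial map from block pairs to separators.

```python
# A minimal sketch of the VIPS page model W = (B, S, delta) as plain data
# structures; the class and field names are our own, not from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    name: str
    x: int          # coordinate information of the block
    y: int
    width: int
    height: int

@dataclass(frozen=True)
class Separator:
    name: str
    weight: int     # visibility weight

b1 = Block("B1", 0, 0, 800, 100)
b2 = Block("B2", 0, 120, 800, 300)
s1 = Separator("S1", weight=10)

# delta maps a pair of blocks to the separator between them; a missing
# pair (delta = NULL) means the two blocks are not adjacent.
delta = {(b1, b2): s1}
assert delta.get((b1, b2)) is not None   # B1 and B2 are adjacent
```

In the real algorithm each block can recursively contain a sub-model of the same shape, which the flat sketch above omits.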

The VIPS algorithm can thus divide the webpage into independent blocks, where every two adjacent blocks are separated by a separator in S, and we can also obtain the coordinate information of each block, as illustrated in Figure 1 and Figure 2 below.

Figure 1. Source Webpage

Figure 2. Blocks and Separators in Source Webpage

III. INTRODUCTION OF OUR METHOD

Our method is based on the VIPS algorithm and can overcome the shortcoming of the method based on DOM tree analysis. The traditional method throws away some sentences of the content mainly because of its local analysis strategy; an example is shown in Figure 3 below.

Figure 3. Lost Sentences Example


In this example, the text in the blue circle contains a hyperlink. Because traditional DOM tree methods usually use the text-link ratio (the ratio between the length of the text in the node and the length of the hyperlink text in the node) to judge whether a node is a text node, the text in the blue circle will always be treated as a meaningless pure link and wrongly thrown away as well.
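The text-link ratio heuristic can be sketched as follows; the threshold and function names are illustrative assumptions, not taken from any particular system:

```python
# A sketch of the text-link ratio heuristic described above; the threshold
# and names are illustrative, not from any cited system.
def link_text_ratio(node_text: str, link_text: str) -> float:
    """Fraction of a node's text that is hyperlink (anchor) text."""
    return len(link_text) / max(len(node_text), 1)

def looks_like_pure_link(node_text: str, link_text: str,
                         threshold: float = 0.8) -> bool:
    # Nodes whose text is mostly anchor text are judged navigational and
    # discarded -- even when, as in Figure 3, they are body content.
    return link_text_ratio(node_text, link_text) >= threshold

sentence = "see the report"          # a body sentence that is one big link
assert looks_like_pure_link(sentence, sentence)   # wrongly discarded
```

A body sentence wrapped entirely in a hyperlink has ratio 1.0 and is indistinguishable, by this local measure alone, from a navigation link.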

In our method, we use the VIPS algorithm to overcome this problem and improve the performance of webpage content extraction. Because VIPS can divide the webpage into semantic blocks, it obtains a whole view of the webpage together with the position information of each block. In order to recall the sentences that are thrown away, we keep the DOM tree node tags while using the traditional method to extract the content. The steps are as follows.

1. Use VIPS to divide the webpage into several blocks, and keep the coordinate information of each block and the node tags in each block.

2. Use the traditional method to extract the content of the webpage, and keep the HTML tag information of each content node.

3. Use the coordinate information of each block to determine which blocks should be content blocks.

4. Map the extracted content node tag sequence to the content blocks according to the node tags and the content itself. If some node tags in a content block do not appear in the extracted content node tag sequence, we recall those nodes and the text in them.
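The recall step (step 4 above) can be sketched as follows. The data shapes are our assumptions for illustration: VIPS supplies, per content block, the ordered (tag, text) nodes it contains, and the traditional extractor supplies the set of (tag, text) pairs it kept.

```python
# A sketch of step 4: recall DOM nodes that fall inside a content block
# but were missed by the traditional extractor. Data shapes are assumed.
def recall_lost_nodes(content_blocks, extracted_tags):
    """content_blocks: {block_id: [(tag, text), ...]} from VIPS;
    extracted_tags: set of (tag, text) pairs kept by the traditional method."""
    recalled = []
    for nodes in content_blocks.values():
        for tag, text in nodes:
            # Any node inside a content block that the traditional method
            # threw away is recalled together with its text.
            if (tag, text) not in extracted_tags:
                recalled.append((tag, text))
    return recalled

blocks = {"B3": [("p", "kept sentence"), ("a", "lost linked sentence")]}
kept = {("p", "kept sentence")}
print(recall_lost_nodes(blocks, kept))  # [('a', 'lost linked sentence')]
```

The linked sentence that the text-link ratio would discard is recovered because its block has already been judged, globally, to be content.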

With our method, the VIPS algorithm gives, to some extent, a whole view of the webpage, and we can make full use of this information to supervise the process of content extraction and recall the lost sentences; to a certain extent, the content is extracted as a whole.

From the experiments we conducted, we can also see that almost all the sentences lost by the traditional method are recalled using our method.

IV. CONCLUSION

In this paper, we present a method that uses the VIPS algorithm to improve the performance of webpage content extraction. Our method overcomes the shortcoming of the traditional method based on DOM tree analysis and can extract the content of the page, to some extent, from a global view of the page rather than a local one. It makes full use of the webpage layout information to guide the process of content extraction. By recalling the sentences that the traditional method throws away, it greatly improves the performance of the traditional method.

REFERENCES

[1] Adelberg, B., NoDoSE: A tool for semiautomatically extracting structured and semistructured data from text documents, In Proceedings of ACM SIGMOD Conference on Management of Data, 1998, pp. 283-294.

[2] Ashish, N. and Knoblock, C. A., Semi-Automatic Wrapper Generation for Internet Information Sources, In Proceedings of the Conference on Cooperative Information Systems, 1997, pp. 160-169.

[3] Ashish, N. and Knoblock, C. A., Wrapper Generation for Semi-structured Internet Sources, SIGMOD Record, Vol. 26, No. 4, 1997, pp. 8-15.

[4] Embley, D. W., Jiang, Y., and Ng, Y.-K., Record-boundary discovery in Web documents, In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, Philadelphia PA, 1999, pp. 467-478.

[5] Valter Crescenzi, Giansalvatore Mecca: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proceedings of the 26th International Conference on Very Large Data Bases, 2001, pp. 109-118.

[6] Chakrabarti, S., Integrating the Document Object Model with hyperlinks for enhanced topic distillation and information extraction, In the 10th International World Wide Web Conference, 2001.

[7] Shian-Hua Lin, Jan-Ming Ho: Discovering informative content blocks from Web documents. KDD 2002: 588-593.

[8] Suhit Gupta, Gail E. Kaiser, David Neistadt, Peter Grimm: DOM-based content extraction of HTML documents. WWW 2003: 207-214.

[9] Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma: Extracting Content Structure for Web Pages Based on Visual Representation. APWeb 2003: 406-417

[10] Li Xiaodong, Gu Yuqing. DOM-based Web information extraction. Chinese Journal of Computers, 2002, 25(5): 128.

[11] Chang Yuhong, Jiang Zhe, Zhu Xiaoyan. Page structure analysis based on a tag-tree representation. Computer Engineering and Applications, 2004(16): 129-132.

[12] Gao Jun, Wang Tengjiao, et al. An Ontology-based two-phase semi-automatic Web content extraction method. Chinese Journal of Computers, 2004, 27(3): 310-317.
