Extraction and Classification of Unstructured Data in WebPages for Structured Multimedia Database via XML

Siti Z. Z. Abidin, Noorazida Mohd Idris and Azizul H. Husain
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA,
40450 Shah Alam, Selangor, Malaysia.
{sitizaleha533, noorazida}@salam.uitm.edu.my, [email protected]
Abstract: Nowadays, a vast amount of information is available on the Internet, and useful data must be captured and stored for future use. One of the major unsolved problems in the information technology (IT) industry is the management of unstructured data. Unstructured data such as multimedia files, documents, spreadsheets, news, emails, memoranda, reports and web pages are difficult to capture and store in common database storage. The underlying reason is that the tools and techniques that proved so successful in transforming structured data into business intelligence and actionable information simply do not work with unstructured data. As a result, new approaches are necessary. Several researchers have attempted to deal with unstructured data but, so far, it is hard to find a tool that can extract, classify and store unstructured data in a structured database system. This paper presents our research on the identification, extraction and classification of unstructured data in web pages, which is then transformed into a structured format in an Extensible Markup Language (XML) document and later stored in a multimedia database. The contribution of this research is the approach for capturing unstructured data and the efficiency of a multimedia database in handling this kind of data. The stored data can benefit various communities, such as students, lecturers, researchers and IT managers, because it can be used for planning, decision making, day-to-day operations and other future purposes.
Keywords: Unstructured data; Webpage; Data extraction; Data classification; XML; Multimedia database
I. INTRODUCTION

Today, people's lives are greatly influenced by information technology due to the pervasiveness of the Internet, the falling cost of disk storage and the overwhelming amounts of information stored by today's business industries. While a vast amount of data is available, what is sorely missing are tools and methods to manage this unstructured ocean of facts and turn it into usable information. Unstructured data includes documents, spreadsheets, presentations, multimedia files, memos, news, user groups, chats, reports, emails and web pages [1]. Merrill Lynch [2] estimated that more than 85 percent of all business information exists as unstructured data [3]. This data forms a significant part of an organization's knowledge base and needs to be properly managed for long-term use. With three quarters of all data being unstructured, it represents the single largest opportunity for positive economic returns: through effective unstructured data management, revenue, profitability and opportunity can go up, while risks and costs may go down [3].
This paper presents an exploratory study of various tools for data extraction and classification, and the design of a prototype tool that is able to extract and classify unstructured data in any web page. The classified data is structured in Extensible Markup Language (XML) before all the useful data is stored in an Oracle multimedia database. Based on an analysis of the currently available tools, the prototype is designed and implemented in C# by selecting the most significant methods among the existing tools. In transforming the unstructured data into a structured form, the image data type is converted into a specific format through a double conversion for fast image retrieval from the multimedia database.
This paper is organized as follows: Section II describes related work on data extraction and classification techniques, Section III explains the research methodology in detail, Section IV demonstrates the results of this research, and Section V draws the conclusion.
II. RELATED WORK

The World Wide Web is a growing database in which a great amount of information is available. There are three types of web pages: unstructured, semi-structured and structured. Unstructured web pages are those in which the information of interest lies within free text from which no common pattern can be induced. Semi-structured pages are typically generated using a template and a set of data, so that one or more patterns can be inferred and used to extract data from the pages. A structured web page presents information in HTML for human browsing while also offering structured data that can be processed automatically by machines; such data is easily integrated into business processes. Unfortunately, querying and accessing this data with software agents is not a simple task, since it is represented in a human-friendly format. Although this format makes it easier for humans to understand and browse the Web, it makes the incorporation and use of the data by automated processes very difficult. Possible solutions are the semantic web [4], which is still a vision, and web services [5], which lack a complete specification. Information extractors, or wrappers [6], may fill the gap and help transform the web into completely
structured data that is usable by automated processes. Many extraction algorithms exist, but unfortunately none of them can be considered a perfect solution. They are usually designed and built with distinct interfaces, which complicates the task of integrating these algorithms into enterprise applications. Several methods have been introduced by researchers to convert web page data from either a semi-structured or an unstructured format into a structured, machine-understandable design, and XML has been found to be the most popular target format.
A. Data Extraction

Data extraction is the process of retrieving and capturing data from one medium into another. The medium can be web pages, documents, databases, repositories, stacks or anything that contains information. According to the evText website [7], data retrieval is a process of locating and linking data points in a user-supplied document with corresponding data points in the data retrieval structure. A wrapper accesses HTML documents and exports the relevant text to a structured format, normally XML [8]. In order to extract data from a webpage, two tasks need to be considered: defining the input and defining the extraction target. The input can be an unstructured, semi-structured or structured page. The extraction target can be a relation of k-tuples, where k is the number of attributes in a record, or it can be a complex object with hierarchically organized data [6]. Moreover, information extraction can become complicated when various permutations of attributes or typographical errors occur in the input documents.
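To make the notion of an extraction target concrete, the following minimal sketch (not the authors' implementation) extracts 2-tuples of (link URL, anchor text) from raw HTML with a regular expression, one of the techniques the prototype also uses, and exports them as XML:

// Wrapper sketch: extract k-tuples (here k = 2: URL and anchor text)
// from raw HTML and export them in a structured XML format.
using System;
using System.Text.RegularExpressions;
using System.Xml;

class WrapperSketch
{
    static void Main()
    {
        string html = "<a href=\"news.html\">News</a> <a href=\"about.html\">About</a>";

        // One 2-tuple per anchor; a production wrapper would use a real parser.
        var anchor = new Regex("<a\\s+href=\"(?<url>[^\"]+)\"[^>]*>(?<text>[^<]*)</a>",
                               RegexOptions.IgnoreCase);

        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("links");
        doc.AppendChild(root);

        foreach (Match m in anchor.Matches(html))
        {
            XmlElement link = doc.CreateElement("link");
            link.SetAttribute("url", m.Groups["url"].Value);
            link.InnerText = m.Groups["text"].Value;
            root.AppendChild(link);
        }
        Console.WriteLine(doc.OuterXml);
    }
}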
There are various classifications of wrappers. For example, Hsu and Dung [9] classified wrappers into four distinct categories, including hand-crafted wrappers with heuristic-based and induction approaches. Chang [6] followed this taxonomy and added annotation-free and semi-supervised systems. Muslea [10] focused on extraction patterns for free text using syntactic or semantic constraints. However, the most complete categorization was made by Laender [11], who proposed a taxonomy of languages for wrapper development consisting of HTML-aware tools, NLP-based tools, wrapper induction tools, modeling-based tools and ontology-based tools.
B. Data Classification

Data classification categorizes data according to required needs. The goal of classification is to build a set of models that can correctly predict the class of different objects. There are many algorithms for data classification used in data mining and machine learning, and they also serve as the base algorithms in some data extraction systems. For example, the k-Nearest Neighbor (KNN) algorithm [12] is mostly used to determine data (in terms of distance) through its similarity with respect to its neighbors. The Naive Bayesian (NB) algorithm [13] and the Concept Vector-based (CB) algorithm [14] are mostly used for classifying words in documents. Other methods for classifying data use the Classification and Regression Trees (CART) algorithm [15] and the PageRank algorithm [16]. The CART algorithm is implemented using decision trees, while PageRank uses a search ranking algorithm based on hyperlinks on the Web.
Classifying data into several categories is important because
the raw data has to be matched with the corresponding data classes
specified in the database.
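As a brief illustration of the distance-based idea behind KNN (an illustrative sketch, not part of the authors' prototype), the following classifies a query point by majority vote among its k nearest labeled neighbors:

// KNN sketch: classify a feature vector by majority vote among the
// k training samples closest to it in Euclidean distance.
using System;
using System.Collections.Generic;
using System.Linq;

class KnnSketch
{
    record Sample(double[] Features, string Label);

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    static string Classify(List<Sample> training, double[] query, int k) =>
        training.OrderBy(s => Distance(s.Features, query))
                .Take(k)
                .GroupBy(s => s.Label)
                .OrderByDescending(g => g.Count())
                .First().Key;

    static void Main()
    {
        var training = new List<Sample>
        {
            new Sample(new[] { 1.0, 1.0 }, "text"),
            new Sample(new[] { 1.2, 0.8 }, "text"),
            new Sample(new[] { 8.0, 9.0 }, "image"),
            new Sample(new[] { 7.5, 9.5 }, "image"),
        };
        Console.WriteLine(Classify(training, new[] { 7.9, 9.2 }, 3)); // prints "image"
    }
}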
C. Tools for Data Extraction

Several tools are compared with respect to their page type, class, feature extraction, extraction rule type and learning algorithm. Table I depicts the results, which helped in the design and implementation of the prototype tool produced by this research. For the page type, the structure of the input documents is compared.
As proposed by Laender [11], tools for data extraction are developed based on several approaches. HTML-aware tools are used for HTML documents and require the document to be represented as a parse tree. The tree reflects the HTML tag hierarchy and is generated either semi-automatically or automatically. Examples of such tools are RoadRunner [17] and Lixto [18].

RAPIER [19], SRV [20] and WHISK [21] use natural language processing (NLP) techniques such as filtering, part-of-speech tagging and lexical semantic tagging to build relationships between sentence elements and phrases. These techniques derive extraction rules based on syntactic and semantic constraints that help to identify the relevant information within a document.

There are also several wrapper induction tools, which generate delimiter-based extraction rules derived from a given set of training examples. The main difference between NLP-based tools and these tools is that the latter do not rely on linguistic constraints, but rather on formatting features that implicitly delineate the structure of the pieces of data found. They are therefore more suitable for HTML documents than the previous ones. An example of this type of tool is STALKER [10]. For modeling-based tools, a target structure is provided according to a set of modeling primitives that conform to an underlying data model. Examples of tools that use this approach are NoDoSE [22] and DEByE [11].
Finally, there are ontology-based wrappers, which locate constants present in the page and construct objects from them. This approach differs from all the previous ones in that it does not rely on the presentation structure of the data within a document to generate extraction rules or patterns; instead, extraction is accomplished by relying directly on the data. An example of an ontology-based tool is the one developed by the Brigham Young University Data Extraction Group [23].
TABLE I. COMPARATIVE STUDY ON EXTRACTION TOOLS
III. SYSTEM DESIGN

The prototype tool, built using C#, is designed and developed based on the framework shown in Fig. 1. The framework consists of the User, Interface, Source, XML and Multimedia Database layers. Each layer in the framework communicates with the others in order to retrieve or pass data from one layer to another.

Fig. 1. Research Framework

Each layer has its functionality as follows:
Layer 1 - This layer represents the user who will be using the implemented system.

Layer 2 - The interface is the interaction medium between the user and the source location; it allows the user to manipulate data in the webpage. Examples of interface technologies are programming languages that support network environments, such as Java and C#.

Layer 3 - The source layer consists of a huge amount of useful data in web pages, in structured, semi-structured or unstructured form. The user identifies the useful data to be extracted from the source, which is later stored in a storage location that can handle the various types of multimedia elements. Before data can be placed in the storage, it needs to be classified to determine where it should be allocated. This classification is based on the type of data: text, image, audio or video.

Layer 4 - The result of the classification process is placed into a structured XML document. The structured data is then transmitted to the storage layer, located at layer 5.

Layer 5 - The storage used in this layer is a multimedia database, which is needed to handle the huge amount of data consisting of various multimedia element types such as text, audio, image and video. An example of a multimedia database that can be used for this purpose is Oracle 11g, which is able to support any type of data, especially for business and commercial purposes.
A. Extraction and Classification

The classification of data patterns is important, especially for data extraction from the webpage. Fig. 2 illustrates the process of data extraction and classification. Four classes of data have been identified: text, image, video and audio. For each class there are several sub-classes, which represent the detailed categories of the particular data. Media data such as audio, video and image is identified when the parser finds the keyword src= in the data structure during the extraction process; src is the keyword for a source reference, so the parser knows where to locate the source data.
Fig. 2. Data Classes
After the location of the source is detected, the parser identifies its data type and classifies it into a class. For a text or label type, no keyword is required for references,
since the type can be identified within the HTML tag structure. Table II shows the classes of data types and their classification.
TABLE II. CLASSIFICATION OF THE FOUR MAIN DATA TYPE CLASSES

Content Type   Description
Text           Strings, numbers and symbols
Image          Various image formats
Video          Various video formats
Sound          Various sound formats
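The paper gives no code for this step; as a sketch under the assumption that classification keys on the file extension of each src= reference (the extension lists below are illustrative guesses, not the prototype's actual rules), the parser's behavior could look like this:

// Sketch of src=-based media classification into the classes of Table II.
using System;
using System.Text.RegularExpressions;

class MediaClassifierSketch
{
    // Assumed extension lists; the prototype's real rules are not given.
    static readonly string[] ImageExts = { "jpg", "jpeg", "gif", "png", "bmp" };
    static readonly string[] VideoExts = { "avi", "mpg", "mpeg", "wmv" };
    static readonly string[] SoundExts = { "mp3", "wav", "mid" };

    static string Classify(string url)
    {
        string ext = url.Substring(url.LastIndexOf('.') + 1).ToLowerInvariant();
        if (Array.IndexOf(ImageExts, ext) >= 0) return "Image";
        if (Array.IndexOf(VideoExts, ext) >= 0) return "Video";
        if (Array.IndexOf(SoundExts, ext) >= 0) return "Sound";
        return "Text";
    }

    static void Main()
    {
        string html = "<img src=\"logo.png\"> <embed src=\"intro.mp3\">";
        // Every src= attribute points at an external media source to classify.
        foreach (Match m in Regex.Matches(html, "src=\"(?<url>[^\"]+)\""))
        {
            string url = m.Groups["url"].Value;
            Console.WriteLine(url + " -> " + Classify(url));
        }
    }
}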
In the classification process, a Document Object Model (DOM) tree technique is applied to find the correct data in the HTML document. During web page navigation, the DOM is used in the data extraction because it allows DOM events to be processed simultaneously. An example of a DOM tree structure is shown in Fig. 3.

Fig. 3: Example of DOM tree

In the DOM tree, some unnecessary nodes, such as script, style or other customized nodes, need to be filtered out. The content of interest appears under the body node, i.e. within the body tag. The advantage of the DOM is that it carries a great deal of information; however, some of the unnecessary information cannot be eliminated completely. With pattern classification, the unnecessary information can be minimized during the extraction process.
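The paper does not name the DOM parser used by the prototype. As a sketch, assuming the open-source HtmlAgilityPack library, the node-filtering step could be written as:

// DOM-tree filtering sketch with HtmlAgilityPack (an assumed library
// choice; the paper does not say which parser the prototype uses).
using System;
using HtmlAgilityPack;

class DomFilterSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><script>x()</script><p>Useful text</p></body></html>");

        // Filter out unnecessary nodes such as script and style.
        var junk = doc.DocumentNode.SelectNodes("//script|//style");
        if (junk != null)
            foreach (var node in junk)
                node.Remove();

        // Only the content under the body node remains of interest.
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//body").InnerText);
    }
}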
B. Implementation

The architecture of the prototype system consists of six important components: the web, the generator, the user, the converter, the XML document and the multimedia database. Fig. 4 depicts a high-level view of the system architecture, showing the flow of data extraction from the web page into the multimedia database.

Fig. 4: The Prototype Architecture
Web - The Web is a collection of information on the World Wide Web (WWW). Websites contain much structured, semi-structured and unstructured data that needs to be captured for many purposes in different areas, such as financial statements, weather reports, stock exchanges, travel information and advertisements.

The Generator - The generator supports the user during the wrapper design phase. It is used to request the HTTP web service from the target web and to retrieve data from the web. The generator consists of three components: a visual interface, a program generator and temporary storage.

o Visual Interface - Defines the data that should be extracted from web pages and maps it into a structured format such as an HTML form. The visual interface comprises several windows: a window displaying the HTML structure of the web page, a parse tree window showing the categories of data to be extracted from the current webpage, and a table for the results of data classification. Moreover, a control window allows the user to view and control the overall progress of the wrapper design process and to input textual data.

o Program Generator - A program window that displays the constructed wrapper program and allows the user to make further adjustments or corrections to it. This sub-module interprets the user's actions on the web pages and successively generates the wrapper. It specifies the URLs of web pages that are structurally similar to the target web pages, or navigates to such pages. In the latter case, the navigation path is recorded and can be automatically reproduced.

o Temporary Storage - A temporary storage location that stores the results of data extraction from the web. It holds the four data categories (text, image, audio and video) in separate locations.
The User - This component specifies the input data for the generator and categorizes the results of the extraction process to be stored in the multimedia database.

Converter - Consists of three types of converters for data conversion, either from the XML document to the multimedia database or from the generator to the XML document.

o Bitmap converter - Converts various image formats into Bitmap and vice versa. This converter is used for images only.

o Base64 converter - Converts a Bitmap into Base64 format and vice versa. This converter is used for images only.

o String converter - Converts all format types into string format.
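Read together with Section I's remark on double conversion, the Bitmap and Base64 converters suggest a pipeline from image file to Bitmap to Base64 text that can travel inside the XML document. A sketch of this reading (not the prototype's exact code) follows:

// Double-conversion sketch: image file -> Bitmap -> Base64 string for
// the XML document, and back again.
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

class ImageConverterSketch
{
    // Bitmap converter: any supported image format -> Bitmap.
    static Bitmap ToBitmap(string path) => new Bitmap(Image.FromFile(path));

    // Base64 converter: Bitmap -> Base64 text, and Base64 text -> Bitmap.
    static string ToBase64(Bitmap bmp)
    {
        using (var ms = new MemoryStream())
        {
            bmp.Save(ms, ImageFormat.Bmp);
            return Convert.ToBase64String(ms.ToArray());
        }
    }

    static Bitmap FromBase64(string data) =>
        new Bitmap(new MemoryStream(Convert.FromBase64String(data)));

    static void Main()
    {
        string encoded = ToBase64(ToBitmap("logo.png")); // hypothetical input file
        Console.WriteLine(encoded.Length + " Base64 characters ready for XML");
    }
}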
XML - A structured storage for data classification and a medium for data transmission from the web into the multimedia database. An XML document holds various types of data such as text, audio, video and image.
IV. RESULTS

The user interface of the prototype tool, illustrated in Fig. 5, allows users to store the multimedia elements of any specified web page. A progress bar at the bottom of the screen shows users the percentage of the processing work done by the engine. Several items in the main menu, such as Headers, URLs, URLs title, Emails and Phone, assist in dealing with the information of interest.
Fig. 5: Screenshot of the prototype system

Using the provided interface, a user can extract useful multimedia data residing in the webpage specified in the URL column. The tool extracts useful information by searching all the possible links associated with the webpage. Fig. 6 shows part of the links found for a given URL as an example; it illustrates various links to other web pages as well as the data that resides in the webpage.
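A minimal sketch of this link harvesting (assumed to be regex-based; the paper does not show this code) is:

// Link-harvesting sketch: download a page and list every href so each
// linked resource can be visited and classified in turn.
using System;
using System.Net;
using System.Text.RegularExpressions;

class LinkHarvestSketch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://www.example.com/"); // placeholder URL
            foreach (Match m in Regex.Matches(html, "href=\"(?<url>[^\"]+)\"",
                                              RegexOptions.IgnoreCase))
                Console.WriteLine(m.Groups["url"].Value);
        }
    }
}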
Fig. 6: All related links associated with the given webpage

The useful multimedia data is classified using regular expressions and a DOM tree path learning algorithm. It is then stored in a temporary XML file with a specific format. All types of multimedia data are stored according to their types; however, images are converted into bitmap form for fast processing and retrieval. Fig. 7 depicts an example of the XML file.
Fig. 7: An example of the XML document for four types of data
From the XML format, all the possible valuable data can be mapped into a permanent multimedia database for later use. In this case, an Oracle 11g database is used as the storage. Fig. 8 illustrates the classification output for the image type with its value and link; outputs for the other data types are presented in the same manner.
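The paper does not give the database schema. A sketch of this final mapping step using ODP.NET, with a hypothetical MEDIA(name, content) table and connection string, could be:

// Sketch of storing one classified image into Oracle 11g via ODP.NET.
// The MEDIA table and the connection string are hypothetical.
using System;
using System.IO;
using Oracle.DataAccess.Client;

class OracleStoreSketch
{
    static void Main()
    {
        using (var conn = new OracleConnection(
                   "User Id=scott;Password=tiger;Data Source=orcl"))
        {
            conn.Open();
            using (var cmd = new OracleCommand(
                       "INSERT INTO media (name, content) VALUES (:n, :c)", conn))
            {
                cmd.Parameters.Add("n", OracleDbType.Varchar2).Value = "logo.bmp";
                cmd.Parameters.Add("c", OracleDbType.Blob).Value =
                    File.ReadAllBytes("logo.bmp"); // bitmap bytes from the converter
                cmd.ExecuteNonQuery();
            }
        }
    }
}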
Fig. 8: Classification Output for Image

The user interface of this prototype helps users work conveniently with any web page. The menu and command buttons allow easy access to the unstructured webpage and the multimedia database. Thus, the prototype system can be viewed as a tool for extracting and gathering multimedia data from unstructured information for systematic data management.
V. CONCLUSION

This paper presents a prototype tool that extracts data from any web page and stores the necessary multimedia data in a multimedia database using XML. The transformation from unstructured information into structured data has been successfully performed using various methods, including regular expressions and a DOM parse tree. The prototype helps end users get useful multimedia data (text, image, audio and video) stored for future retrieval and use. This research also performed a comparative analysis of eight extraction tools, namely DeLa, EXALG, RAPIER, RoadRunner, STALKER, SRV, WebOQL and WHISK, to ensure that the best data extraction methods could be adapted for the implementation of the prototype tool. The tools were compared based on page type, class of tool, feature extraction, extraction rule type and learning algorithm. This research also contributes automated capture of unstructured data for structured storage of multimedia data.

The prototype tool could be further enhanced by presenting the data stored in the multimedia database in manageable forms such as reports, documentation and statistics.
REFERENCES

[1] C. W. Smullen, S. R. Tarapore and S. Gurumurthi, "A Benchmark Suite for Unstructured Data Processing," International Workshop on Storage Network Architecture and Parallel I/Os, pp. 79-83, Sept. 2007.
[2] Merrill Lynch & Co., Inc., http://www.ml.com/index.asp?id=7695_1512, 2010.
[3] R. Blumberg and S. Atre, DM Review, retrieved from http://www.soquelgroup.com/Articles/dmreview_0203_problem.pdf, 2003.
[4] T. Berners-Lee, J. Hendler and O. Lassila, "The Semantic Web," Scientific American, 284(5), pp. 34-43, May 2001.
[5] G. Alonso, F. Casati, H. Kuno and V. Machiraju, Web Services: Concepts, Architectures and Applications, Springer-Verlag, 2004.
[6] C. H. Chang, H. Siek, J. J. Lu, C. N. Hsu and J. J. Chiou, "Reconfigurable Web Wrapper Agents," IEEE Intelligent Systems, 18(5), pp. 34-40, Sept. 2003.
[7] evText, Inc., https://www.evtext.com, 2008.
[8] G. Fiumara, "Automated Information Extraction from Web Sources: a Survey," Between Ontologies and Folksonomies Workshop, 3rd International Conference on Communities and Technologies, 2007.
[9] C. Hsu and M. Dung, "Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web," Information Systems, 23(8), 1998.
[10] I. Muslea, S. Minton and C. Knoblock, "A Hierarchical Approach to Wrapper Induction," Proceedings of the Third International Conference on Autonomous Agents (AA-99), 1999.
[11] A. H. F. Laender, B. Ribeiro-Neto and A. S. da Silva, "DEByE - Data Extraction by Example," Data and Knowledge Engineering, 40(2), pp. 121-154, 2002.
[12] K. Teknomo, "K-Nearest Neighbors Tutorial," http://people.revoledu.com/kardi/tutorial/KNN/, 2004.
[13] W. Ding, S. Yu, Q. Wang, J. Yu and Q. Guo, "A Novel Naive Bayesian Text Classifier," International Symposiums on Information Processing, 2008.
[14] R. Zhang and Z. Zhang, "Image Database Classification based on Concept Vector Model," IEEE International Conference on Multimedia and Expo, 2005.
[15] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
[16] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, pp. 107-117, Apr. 1998.
[17] V. Crescenzi, G. Mecca and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites," Proceedings of the 27th VLDB Conference, 2001.
[18] R. Baumgartner, S. Flesca and G. Gottlob, "Visual Web Information Extraction with Lixto," Proceedings of the 27th VLDB Conference, 2001.
[19] M. E. Califf, "Relational Learning Techniques for Natural Language Information Extraction," Ph.D. thesis, Department of Computer Sciences, University of Texas, Austin, TX; also Artificial Intelligence Laboratory Technical Report AI 98-276, 1998.
[20] D. Freitag, "Information Extraction from HTML: Application of a General Learning Approach," Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[21] S. Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text," Machine Learning, 34(1-3), pp. 233-272, 1999.
[22] B. Adelberg, "NoDoSE: A Tool for Semi-Automatically Extracting Structured and Semi-Structured Data from Text Documents," SIGMOD Record, 27(2), pp. 283-294, 1998.
[23] T. Chartrand, "Ontology-Based Extraction of RDF Data from the World Wide Web," Brigham Young University, 2003.
[24] J. Wang, "Information Discovery, Extraction and Integration for the Hidden Web," University of Science and Technology, 2004.
[25] A. Arasu and H. Garcia-Molina, "Extracting Structured Data from Web Pages," Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, California, pp. 337-348, 2003.
[26] G. Arocena and A. Mendelzon, "WebOQL: Restructuring Documents, Databases, and Webs," Proceedings of the 14th International Conference on Data Engineering, Orlando, Florida, pp. 24-33, 1998.