Extraction and Classification of Unstructured Data in WebPages for Structured Multimedia Database via XML

Siti Z. Z. Abidin, Noorazida Mohd Idris and Azizul H. Husain
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA,
40450 Shah Alam, Selangor, Malaysia.
{sitizaleha533, noorazida}@salam.uitm.edu.my, [email protected]
Abstract: Nowadays, a vast amount of information is available on the Internet, and useful data must be captured and stored for future use. One of the major unsolved problems in the information technology (IT) industry is the management of unstructured data. Unstructured data such as multimedia files, documents, spreadsheets, news, emails, memoranda, reports and web pages are difficult to capture and store in common database storage. The underlying reason is that the tools and techniques that proved so successful in transforming structured data into business intelligence and actionable information simply do not work with unstructured data. As a result, new approaches are necessary. Several researchers have attempted to deal with unstructured data but, so far, it is hard to find a tool that can extract, classify and store unstructured data in a structured database system. This paper presents our research on the identification, extraction and classification of unstructured data in web pages, which is then transformed into a structured format in an Extensible Markup Language (XML) document and later stored in a multimedia database. The contribution of this research is the approach for capturing unstructured data and the efficiency of a multimedia database in handling this kind of data. The stored data can benefit various communities, such as students, lecturers, researchers and IT managers, because it can be used for planning, decision making, day-to-day operations and other future purposes.
Keywords: Unstructured data; Webpage; Data extraction; Data classification; XML; Multimedia database
I. INTRODUCTION

Today, people's lives are greatly influenced by information technology due to the pervasiveness of the Internet, the falling cost of disk storage and the overwhelming amounts of information stored by today's business industries. While a vast amount of data is available, what is sorely missing are tools and methods to manage this unstructured ocean of facts and turn it into usable information. Unstructured data includes documents, spreadsheets, presentations, multimedia files, memos, news, user groups, chats, reports, emails and web pages [1]. Merrill Lynch [2] estimated that more than 85 percent of all business information exists as unstructured data [3]. This data forms a significant part of an organization's knowledge base and needs to be properly managed for long-term use. With three quarters of all data being unstructured, it represents the single largest opportunity for positive economic returns: through effective unstructured data management, revenue, profitability and opportunity can go up, while risks and costs may go down [3].
This paper presents an exploratory study of various tools for data extraction and classification, and the design of a prototype tool that is able to extract and classify unstructured data in any web page. The classified data is structured in Extensible Markup Language (XML) before all the useful data is stored in an Oracle multimedia database. Based on an analysis of the currently available tools, the prototype is designed and implemented in C# by selecting the most significant methods among the existing tools. In transforming the unstructured data into a structured form, the image data type is converted into a specific format through a double conversion for fast image retrieval from the multimedia database.
This paper is organized as follows: Section II describes related work on data extraction and classification techniques, Section III explains the research methodology in detail, Section IV demonstrates the results of this research, and Section V draws the conclusion.
II. RELATED WORK

The World Wide Web is a growing database in which a great amount of information is available. There are three types of web pages: unstructured, semi-structured and structured. Unstructured web pages are those in which the information of interest lies within free text from which no common pattern can be induced. Semi-structured pages are typically generated using a template and a set of data, so that one or more patterns can be inferred and used to extract data from the pages. A structured web page presents information in HTML for human browsing while also offering structured data that can be processed automatically by machines; such data is easily integrated into business processes. Unfortunately, querying and accessing this data with software agents is not a simple task, since it is represented in a human-friendly format. Although this format makes it easier for humans to understand and browse the Web, it makes the incorporation and use of the data by automated processes very difficult. Possible solutions are the semantic web [4], which is still a vision, and web services [5], which lack a complete specification. Information extractors, or wrappers [6], may fill the gap and help transform the web into completely
structured data that is usable by automated processes. Many extraction algorithms exist, but unfortunately none of them can be considered a perfect solution. They are usually designed and built with distinct interfaces, which complicates the task of integrating these algorithms into enterprise applications. Several methods have been introduced by researchers to convert web page data from either a semi-structured or an unstructured format into a structured, machine-understandable design, and XML has been found to be the most popular target format.
A. Data Extraction

Data extraction is the process of retrieving and capturing data from one medium into another. The medium can be web pages, documents, databases, repositories, stacks or anything that contains information. According to the evText website [7], data retrieval is a process of locating and linking data points in a user-supplied document with corresponding data points in the data retrieval structure. A wrapper accesses HTML documents and exports the relevant text to a structured format, normally XML [8]. In order to extract data from a webpage, two tasks need to be considered: defining the input and defining the extraction target. The input can be an unstructured, semi-structured or structured page. The extraction target can be a relation of k-tuples, where k is the number of attributes in a record, or it can be a complex object with hierarchically organized data [6]. Moreover, information extraction can become complicated when various permutations of attributes or typographical errors occur in the input documents.
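To make the notion of an extraction target concrete, the following minimal sketch (not the authors' implementation) extracts 2-tuples of (link URL, anchor text) from raw HTML with a regular expression, one of the techniques the prototype also uses, and exports them as XML:

// Wrapper sketch: extract k-tuples (here k = 2: URL and anchor text)
// from raw HTML and export them in a structured XML format.
using System;
using System.Text.RegularExpressions;
using System.Xml;

class WrapperSketch
{
    static void Main()
    {
        string html = "<a href=\"news.html\">News</a> <a href=\"about.html\">About</a>";

        // One 2-tuple per anchor; a production wrapper would use a real parser.
        var anchor = new Regex("<a\\s+href=\"(?<url>[^\"]+)\"[^>]*>(?<text>[^<]*)</a>",
                               RegexOptions.IgnoreCase);

        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("links");
        doc.AppendChild(root);

        foreach (Match m in anchor.Matches(html))
        {
            XmlElement link = doc.CreateElement("link");
            link.SetAttribute("url", m.Groups["url"].Value);
            link.InnerText = m.Groups["text"].Value;
            root.AppendChild(link);
        }
        Console.WriteLine(doc.OuterXml);
    }
}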
There are various classifications of wrappers. For example, Hsu and Dung [9] classified wrappers into four distinct categories, including hand-crafted wrappers with heuristic-based and induction approaches. Chang [6] followed this taxonomy and added annotation-free and semi-supervised systems. Muslea [10] focused on extraction patterns for free text using syntactic or semantic constraints. However, the most complete categorization was made by Laender [11], who proposed a taxonomy of languages for wrapper development consisting of HTML-aware tools, NLP-based tools, wrapper induction tools, modeling-based tools and ontology-based tools.
B. Data Classification

Data classification categorizes data according to required needs. The goal of classification is to build a set of models that can correctly predict the class of different objects. There are many algorithms for data classification used in data mining and machine learning, and they also serve as the base algorithms in some data extraction systems. For example, the k-Nearest Neighbor (KNN) algorithm [12] is mostly used to determine data (in terms of distance) through its similarity with respect to its neighbors. The Naive Bayesian (NB) algorithm [13] and the Concept Vector-based (CB) algorithm [14] are mostly used for classifying words in documents. Other methods for classifying data use the Classification and Regression Trees (CART) algorithm [15] and the PageRank algorithm [16]. The CART algorithm is implemented using decision trees, while PageRank uses a search ranking algorithm based on hyperlinks on the Web.
Classifying data into several categories is important because
the raw data has to be matched with the corresponding data classes
specified in the database.
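As a brief illustration of the distance-based idea behind KNN (an illustrative sketch, not part of the authors' prototype), the following classifies a query point by majority vote among its k nearest labeled neighbors:

// KNN sketch: classify a feature vector by majority vote among the
// k training samples closest to it in Euclidean distance.
using System;
using System.Collections.Generic;
using System.Linq;

class KnnSketch
{
    record Sample(double[] Features, string Label);

    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    static string Classify(List<Sample> training, double[] query, int k) =>
        training.OrderBy(s => Distance(s.Features, query))
                .Take(k)
                .GroupBy(s => s.Label)
                .OrderByDescending(g => g.Count())
                .First().Key;

    static void Main()
    {
        var training = new List<Sample>
        {
            new Sample(new[] { 1.0, 1.0 }, "text"),
            new Sample(new[] { 1.2, 0.8 }, "text"),
            new Sample(new[] { 8.0, 9.0 }, "image"),
            new Sample(new[] { 7.5, 9.5 }, "image"),
        };
        Console.WriteLine(Classify(training, new[] { 7.9, 9.2 }, 3)); // prints "image"
    }
}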
C. Tools for Data Extraction

Several tools are compared with respect to their page type, class, feature extraction, extraction rule type and learning algorithm. Table I depicts the results, which helped in the design and implementation of the prototype tool produced by this research. For the page type, the structure of the input documents is compared.
As proposed by Laender [11], tools for data extraction are developed based on several approaches. HTML-aware tools are used for HTML documents and require the document to be represented as a parse tree. The tree reflects the HTML tag hierarchy and is generated either semi-automatically or automatically. Examples of such tools are RoadRunner [17] and Lixto [18].

RAPIER [19], SRV [20] and WHISK [21] use natural language processing (NLP) techniques such as filtering, part-of-speech tagging and lexical semantic tagging to build relationships between sentence elements and phrases. These techniques derive extraction rules based on syntactic and semantic constraints that help to identify the relevant information within a document.

There are also several wrapper induction tools, which generate delimiter-based extraction rules derived from a given set of training examples. The main difference between NLP-based tools and these tools is that the latter do not rely on linguistic constraints, but rather on formatting features that implicitly delineate the structure of the pieces of data found. They are therefore more suitable for HTML documents than the previous ones. An example of this type of tool is STALKER [10]. For modeling-based tools, a target structure is provided according to a set of modeling primitives that conform to an underlying data model. Examples of tools that use this approach are NoDoSE [22] and DEByE [11].
Finally, there are ontology-based wrappers, which locate constants present in the page and construct objects from them. This approach differs from all the previous ones in that it does not rely on the presentation structure of the data within a document to generate extraction rules or patterns; instead, extraction is accomplished by relying directly on the data. An example of an ontology-based tool is the one developed by the Brigham Young University Data Extraction Group [23].
TABLE I. COMPARATIVE STUDY ON EXTRACTION TOOLS
III. SYSTEM DESIGN

The prototype tool, built using C#, is designed and developed based on the framework shown in Fig. 1. The framework consists of the User, Interface, Source, XML and Multimedia Database layers. Each layer in the framework communicates with the others in order to retrieve or pass data from one layer to another.

Fig. 1. Research Framework

Each layer has its functionality as follows:
Layer 1 - This layer represents the user who will be using the implemented system.

Layer 2 - The interface is the interaction medium between the user and the source location; it allows the user to manipulate data in the webpage. Examples of interface technologies are programming languages that support network environments, such as Java and C#.

Layer 3 - The source layer consists of a huge amount of useful data in web pages, in structured, semi-structured or unstructured form. The user identifies the useful data to be extracted from the source, which is later stored in a storage location that can handle the various types of multimedia elements. Before data can be placed in the storage, it needs to be classified to determine where it should be allocated. This classification is based on the type of data: text, image, audio or video.

Layer 4 - The result of the classification process is placed into a structured XML document. The structured data is then transmitted to the storage layer, located at layer 5.

Layer 5 - The storage used in this layer is a multimedia database, which is needed to handle the huge amount of data consisting of various multimedia element types such as text, audio, image and video. An example of a multimedia database that can be used for this purpose is Oracle 11g, which is able to support any type of data, especially for business and commercial purposes.
A. Extraction and Classification

The classification of data patterns is important, especially for data extraction from the webpage. Fig. 2 illustrates the process of data extraction and classification. Four classes of data have been identified: text, image, video and audio. For each class there are several sub-classes, which represent the detailed categories of the particular data. Media data such as audio, video and image is identified when the parser finds the keyword src= in the data structure during the extraction process; src is the keyword for a source reference, so the parser knows where to locate the source data.
Fig. 2. Data Classes
After the location of the source is detected, the parser identifies its data type and classifies it into a class. For a text or label type, no keyword is required for references,
since the type can be identified within the HTML tag structure. Table II shows the classes of data types and their classification.
TABLE II. CLASSIFICATION OF THE FOUR MAIN DATA TYPE CLASSES

Content Type   Description
Text           Strings, numbers and symbols
Image          Various image formats
Video          Various video formats
Sound          Various sound formats
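The paper gives no code for this step; as a sketch under the assumption that classification keys on the file extension of each src= reference (the extension lists below are illustrative guesses, not the prototype's actual rules), the parser's behavior could look like this:

// Sketch of src=-based media classification into the classes of Table II.
using System;
using System.Text.RegularExpressions;

class MediaClassifierSketch
{
    // Assumed extension lists; the prototype's real rules are not given.
    static readonly string[] ImageExts = { "jpg", "jpeg", "gif", "png", "bmp" };
    static readonly string[] VideoExts = { "avi", "mpg", "mpeg", "wmv" };
    static readonly string[] SoundExts = { "mp3", "wav", "mid" };

    static string Classify(string url)
    {
        string ext = url.Substring(url.LastIndexOf('.') + 1).ToLowerInvariant();
        if (Array.IndexOf(ImageExts, ext) >= 0) return "Image";
        if (Array.IndexOf(VideoExts, ext) >= 0) return "Video";
        if (Array.IndexOf(SoundExts, ext) >= 0) return "Sound";
        return "Text";
    }

    static void Main()
    {
        string html = "<img src=\"logo.png\"> <embed src=\"intro.mp3\">";
        // Every src= attribute points at an external media source to classify.
        foreach (Match m in Regex.Matches(html, "src=\"(?<url>[^\"]+)\""))
        {
            string url = m.Groups["url"].Value;
            Console.WriteLine(url + " -> " + Classify(url));
        }
    }
}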
In the classification process, a Document Object Model (DOM) tree technique is applied to find the correct data in the HTML document. During web page navigation, the DOM is used in the data extraction because it allows DOM events to be processed simultaneously. An example of a DOM tree structure is shown in Fig. 3.

Fig. 3: Example of DOM tree

In the DOM tree, some unnecessary nodes, such as script, style or other customized nodes, need to be filtered out. The content of interest appears under the body node, i.e. within the body tag. The advantage of the DOM is that it carries a great deal of information; however, some of the unnecessary information cannot be eliminated completely. With pattern classification, the unnecessary information can be minimized during the extraction process.
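The paper does not name the DOM parser used by the prototype. As a sketch, assuming the open-source HtmlAgilityPack library, the node-filtering step could be written as:

// DOM-tree filtering sketch with HtmlAgilityPack (an assumed library
// choice; the paper does not say which parser the prototype uses).
using System;
using HtmlAgilityPack;

class DomFilterSketch
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><script>x()</script><p>Useful text</p></body></html>");

        // Filter out unnecessary nodes such as script and style.
        var junk = doc.DocumentNode.SelectNodes("//script|//style");
        if (junk != null)
            foreach (var node in junk)
                node.Remove();

        // Only the content under the body node remains of interest.
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//body").InnerText);
    }
}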
B. Implementation

The architecture of the prototype system consists of six important components: the web, the generator, the user, the converter, the XML document and the multimedia database. Fig. 4 depicts a high-level view of the system architecture, showing the flow of data extraction from the web page into the multimedia database.

Fig. 4: The Prototype Architecture
Web - The Web is a collection of information on the World Wide Web (WWW). Websites contain much structured, semi-structured and unstructured data that needs to be captured for many purposes in different areas, such as financial statements, weather reports, stock exchanges, travel information and advertisements.

The Generator - The generator supports the user during the wrapper design phase. It is used to request the HTTP web service from the target web and to retrieve data from the web. The generator consists of three components: a visual interface, a program generator and temporary storage.

o Visual Interface - Defines the data that should be extracted from web pages and maps it into a structured format such as an HTML form. The visual interface comprises several windows: a window displaying the HTML structure of the web page, a parse tree window showing the categories of data to be extracted from the current webpage, and a table for the results of data classification. Moreover, a control window allows the user to view and control the overall progress of the wrapper design process and to input textual data.

o Program Generator - A program window that displays the constructed wrapper program and allows the user to make further adjustments or corrections to it. This sub-module interprets the user's actions on the web pages and successively generates the wrapper. It specifies the URLs of web pages that are structurally similar to the target web pages, or navigates to such pages. In the latter case, the navigation path is recorded and can be automatically reproduced.

o Temporary Storage - A temporary storage location that stores the results of data extraction from the web. It holds the four data categories (text, image, audio and video) in separate locations.
The User - This component specifies the input data for the generator and categorizes the results of the extraction process to be stored in the multimedia database.

Converter - Consists of three types of converters for data conversion, either from the XML document to the multimedia database or from the generator to the XML document.

o Bitmap converter - Converts various image formats into Bitmap and vice versa. This converter is used for images only.

o Base64 converter - Converts a Bitmap into Base64 format and vice versa. This converter is used for images only.

o String converter - Converts all format types into string format.
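Read together with Section I's remark on double conversion, the Bitmap and Base64 converters suggest a pipeline from image file to Bitmap to Base64 text that can travel inside the XML document. A sketch of this reading (not the prototype's exact code) follows:

// Double-conversion sketch: image file -> Bitmap -> Base64 string for
// the XML document, and back again.
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

class ImageConverterSketch
{
    // Bitmap converter: any supported image format -> Bitmap.
    static Bitmap ToBitmap(string path) => new Bitmap(Image.FromFile(path));

    // Base64 converter: Bitmap -> Base64 text, and Base64 text -> Bitmap.
    static string ToBase64(Bitmap bmp)
    {
        using (var ms = new MemoryStream())
        {
            bmp.Save(ms, ImageFormat.Bmp);
            return Convert.ToBase64String(ms.ToArray());
        }
    }

    static Bitmap FromBase64(string data) =>
        new Bitmap(new MemoryStream(Convert.FromBase64String(data)));

    static void Main()
    {
        string encoded = ToBase64(ToBitmap("logo.png")); // hypothetical input file
        Console.WriteLine(encoded.Length + " Base64 characters ready for XML");
    }
}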
XML - A structured storage for data classification and a medium for data transmission from the web into the multimedia database. An XML document holds various types of data such as text, audio, video and image.
IV. RESULTS

The user interface of the prototype tool, illustrated in Fig. 5, allows users to store the multimedia elements of any specified web page. A progress bar at the bottom of the screen shows users the percentage of the processing work done by the engine. Several items in the main menu, such as Headers, URLs, URLs title, Emails and Phone, assist in dealing with the information of interest.
Fig. 5: Screenshot of the prototype system

Using the provided interface, a user can extract useful multimedia data residing in the webpage specified in the URL column. The tool extracts useful information by searching all the possible links associated with the webpage. Fig. 6 shows part of the links found for a given URL as an example; it illustrates various links to other web pages as well as the data that resides in the webpage.
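A minimal sketch of this link harvesting (assumed to be regex-based; the paper does not show this code) is:

// Link-harvesting sketch: download a page and list every href so each
// linked resource can be visited and classified in turn.
using System;
using System.Net;
using System.Text.RegularExpressions;

class LinkHarvestSketch
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            string html = client.DownloadString("http://www.example.com/"); // placeholder URL
            foreach (Match m in Regex.Matches(html, "href=\"(?<url>[^\"]+)\"",
                                              RegexOptions.IgnoreCase))
                Console.WriteLine(m.Groups["url"].Value);
        }
    }
}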
Fig. 6: All related links associated with the given webpage

The useful multimedia data is classified using regular expressions and a DOM tree path learning algorithm. It is then stored in a temporary XML file with a specific format. All types of multimedia data are stored according to their types; however, images are converted into bitmap form for fast processing and retrieval. Fig. 7 depicts an example of the XML file.
Fig. 7: An example of the XML document for four types of data
From the XML format, all the possible valuable data can be mapped into a permanent multimedia database for later use. In this case, an Oracle 11g database is used as the storage. Fig. 8 illustrates the classification output for the image type with its value and link; outputs for the other data types are presented in the same manner.
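The paper does not give the database schema. A sketch of this final mapping step using ODP.NET, with a hypothetical MEDIA(name, content) table and connection string, could be:

// Sketch of storing one classified image into Oracle 11g via ODP.NET.
// The MEDIA table and the connection string are hypothetical.
using System;
using System.IO;
using Oracle.DataAccess.Client;

class OracleStoreSketch
{
    static void Main()
    {
        using (var conn = new OracleConnection(
                   "User Id=scott;Password=tiger;Data Source=orcl"))
        {
            conn.Open();
            using (var cmd = new OracleCommand(
                       "INSERT INTO media (name, content) VALUES (:n, :c)", conn))
            {
                cmd.Parameters.Add("n", OracleDbType.Varchar2).Value = "logo.bmp";
                cmd.Parameters.Add("c", OracleDbType.Blob).Value =
                    File.ReadAllBytes("logo.bmp"); // bitmap bytes from the converter
                cmd.ExecuteNonQuery();
            }
        }
    }
}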
Fig. 8: Classification Output for Image

The user interface of this prototype helps users work conveniently with any web page. The menu and command buttons allow easy access to the unstructured webpage and the multimedia database. Thus, the prototype system can be viewed as a tool for extracting and gathering multimedia data from unstructured information for systematic data management.
V. CONCLUSION

This paper presents a prototype tool that extracts data from any web page and stores the necessary multimedia data in a multimedia database using XML. The transformation from unstructured information into structured data has been successfully performed using various methods, including regular expressions and a DOM parse tree. The prototype helps end users get useful multimedia data (text, image, audio and video) stored for future retrieval and use. This research also performed a comparative analysis of eight extraction tools, namely DeLa, EXALG, RAPIER, RoadRunner, STALKER, SRV, WebOQL and WHISK, to ensure that the best data extraction methods could be adapted for the implementation of the prototype tool. The tools were compared based on page type, class of tool, feature extraction, extraction rule type and learning algorithm. This research also contributes automated capture of unstructured data for structured storage of multimedia data.

The prototype tool could be further enhanced by presenting the data stored in the multimedia database in manageable forms such as reports, documentation and statistics.
REFERENCES

[1] C. W. Smullen, S. R. Tarapore and S. Gurumurthi, "A Benchmark Suite for Unstructured Data Processing," International Workshop on Storage Network Architecture and Parallel I/Os, pp. 79-83, Sept. 2007.
[2] Merrill Lynch & Co., Inc., http://www.ml.com/index.asp?id=7695_1512, 2010.
[3] R. Blumberg and S. Atre, DM Review, retrieved from http://www.soquelgroup.com/Articles/dmreview_0203_problem.pdf, 2003.
[4] T. Berners-Lee, J. Hendler and O. Lassila, "The Semantic Web," Scientific American, 284(5), pp. 34-43, May 2001.
[5] G. Alonso, F. Casati, H. Kuno and V. Machiraju, Web Services: Concepts, Architectures and Applications, Springer-Verlag, 2004.
[6] C. H. Chang, H. Siek, J. J. Lu, C. N. Hsu and J. J. Chiou, "Reconfigurable Web Wrapper Agents," IEEE Intelligent Systems, 18(5), pp. 34-40, Sept. 2003.
[7] evText, Inc., https://www.evtext.com, 2008.
[8] G. Fiumara, "Automated Information Extraction from Web Sources: a Survey," Between Ontologies and Folksonomies Workshop, 3rd International Conference on Communities and Technologies, 2007.
[9] C. Hsu and M. Dung, "Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web," Information Systems, 23(8), 1998.
[10] I. Muslea, S. Minton and C. Knoblock, "A Hierarchical Approach to Wrapper Induction," Proceedings of the Third International Conference on Autonomous Agents (AA-99), 1999.
[11] A. H. F. Laender, B. Ribeiro-Neto and A. S. da Silva, "DEByE - Data Extraction by Example," Data and Knowledge Engineering, 40(2), pp. 121-154, 2002.
[12] K. Teknomo, "K-Nearest Neighbors Tutorial," http://people.revoledu.com/kardi/tutorial/KNN/, 2004.
[13] W. Ding, S. Yu, Q. Wang, J. Yu and Q. Guo, "A Novel Naive Bayesian Text Classifier," International Symposiums on Information Processing, 2008.
[14] R. Zhang and Z. Zhang, "Image Database Classification based on Concept Vector Model," IEEE International Conference on Multimedia and Expo, 2005.
[15] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
[16] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, pp. 107-117, Apr. 1998.
[17] V. Crescenzi, G. Mecca and P. Merialdo, "RoadRunner: Towards Automatic Data Extraction from Large Web Sites," Proceedings of the 27th VLDB Conference, 2001.
[18] R. Baumgartner, S. Flesca and G. Gottlob, "Visual Web Information Extraction with Lixto," Proceedings of the 27th VLDB Conference, 2001.
[19] M. E. Califf, "Relational Learning Techniques for Natural Language Information Extraction," Ph.D. thesis, Department of Computer Sciences, University of Texas, Austin, TX; also Artificial Intelligence Laboratory Technical Report AI 98-276, 1998.
[20] D. Freitag, "Information Extraction from HTML: Application of a General Learning Approach," Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[21] S. Soderland, "Learning Information Extraction Rules for Semi-Structured and Free Text," Machine Learning, 34(1-3), pp. 233-272, 1999.
[22] B. Adelberg, "NoDoSE: A Tool for Semi-Automatically Extracting Structured and Semi-Structured Data from Text Documents," SIGMOD Record, 27(2), pp. 283-294, 1998.
[23] T. Chartrand, "Ontology-Based Extraction of RDF Data from the World Wide Web," Brigham Young University, 2003.
[24] J. Wang, "Information Discovery, Extraction and Integration for the Hidden Web," University of Science and Technology, 2004.
[25] A. Arasu and H. Garcia-Molina, "Extracting Structured Data from Web Pages," Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, California, pp. 337-348, 2003.
[26] G. Arocena and A. Mendelzon, "WebOQL: Restructuring Documents, Databases, and Webs," Proceedings of the 14th International Conference on Data Engineering, Orlando, Florida, pp. 24-33, 1998.