A Fast and Accurate Approach for Main Content Extraction based on Character Encoding
Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert and Gholamreza Nakhaeizadeh
Ph.D. Candidate, Institute of Applied Information Processing, University of Ulm, Germany
[email protected]
TIR 2011, August 2011, Toulouse, France
Outline
1 Motivation
2 UTF-8 Encoding Form
3 DANA
    Algorithm, named DANA
    Data Sets, Evaluation
4 Results
5 Conclusion
Figure: an example web page with the main content (MC) selected
What are the right-to-left (R2L) languages?
  Languages that are written from right to left, such as Arabic, Persian, Urdu and Pashto.
What are our goals?
  From a technical point of view, most main content extraction (MCE) approaches use HTML tags to separate the main content from the extraneous items.
  This implies the need to employ a parser for the entire web page. Consequently, the computation costs of these MCE approaches increase.
  Thus, the main goal of the proposed algorithm is to increase both the accuracy and the effectiveness of MCE algorithms dealing with R2L languages.
In UTF-8, ASCII characters need only one byte, with a value guaranteed to be less than 128.
In UTF-8, all letters of the right-to-left languages (Persian, Arabic, Pashto and Urdu) take exactly 2 bytes, each with a value greater than 127.
By using a simple condition, we are able to separate ASCII characters from Non-ASCII characters:
  If the value of a byte is < 128 ⇒ this byte belongs to an ASCII character.
  Otherwise, it belongs to a Non-ASCII character.
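The byte test above can be sketched in a few lines. This is only an illustration: Python is assumed here (the slides do not name an implementation language), and `split_bytes` is a hypothetical helper name, not part of DANA.

```python
def split_bytes(text: str) -> tuple[int, int]:
    """Return (ascii_bytes, non_ascii_bytes) for the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    ascii_bytes = sum(1 for b in data if b < 128)  # one-byte ASCII characters
    non_ascii_bytes = len(data) - ascii_bytes      # bytes of multi-byte characters
    return ascii_bytes, non_ascii_bytes


# The tags "<p>" and "</p>" contribute 7 ASCII bytes. The Persian word
# written with escapes below has four letters, each encoded as two bytes
# with values above 127, so it contributes 8 Non-ASCII bytes.
print(split_bytes("<p>\u0633\u0644\u0627\u0645</p>"))  # prints (7, 8)
```

Note that the test operates on bytes, so a two-byte letter is counted twice; dividing the Non-ASCII count by two would give a letter count under the two-byte assumption stated above.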
Algorithm, named DANA
The Phases of DANA
In the first phase, we count the number of ASCII and Non-ASCII characters in each line of the HTML file and save them in two one-dimensional arrays, T1 and T2.
In the second phase, we look for the areas comprising the MC in the HTML file, using the arrays T1 and T2.
Finally, we feed all HTML lines determined in the previous phase as input to a parser. The output of the parser is the main content.
Phase One : Counting ASCII and Non-ASCII characters of each line
DANA reads an HTML file line by line and counts the number of ASCII and Non-ASCII characters in each line.
  The number of Non-ASCII characters in each line is saved in the one-dimensional array T1.
  The number of ASCII characters in each line is saved in the one-dimensional array T2.
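A minimal sketch of phase one, under some assumptions: the input is the raw UTF-8 bytes of the HTML file, counting is done per byte rather than per character, and the names `count_per_line`, `ascii_counts` and `non_ascii_counts` are hypothetical stand-ins for the two arrays.

```python
def count_per_line(html_bytes: bytes) -> tuple[list[int], list[int]]:
    """Return per-line (ascii_counts, non_ascii_counts) for a UTF-8 HTML file."""
    ascii_counts, non_ascii_counts = [], []
    for line in html_bytes.splitlines():
        n_ascii = sum(1 for b in line if b < 128)     # bytes below 128 are ASCII
        ascii_counts.append(n_ascii)
        non_ascii_counts.append(len(line) - n_ascii)  # everything else is Non-ASCII
    return ascii_counts, non_ascii_counts


# Two lines: pure markup, then a paragraph holding a four-letter Persian word.
html = '<div class="menu">Home</div>\n<p>\u0633\u0644\u0627\u0645</p>\n'.encode("utf-8")
ascii_counts, non_ascii_counts = count_per_line(html)
print(non_ascii_counts)  # prints [0, 8]
```

No HTML parser is involved at this stage; the file is only scanned byte by byte.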
Phase Two : Finding areas Comprising MC
Recognizing areas in the HTML file in which:
  Non-ASCII characters have high density
  ASCII characters have low density
To illustrate our approach, we depict two diagrams. In the first diagram, for each line of the HTML file:
  For the Non-ASCII characters, a vertical line whose length equals their count, as stored in T1, is drawn above the x-axis.
  Similarly, for the ASCII characters, a vertical line whose length equals their count, as stored in T2, is drawn below the x-axis.
Different types of regions:
  Regions with low or near-zero density of columns above the x-axis and high density of columns below the x-axis, labelled A
  Regions with high density of columns above the x-axis and low density of columns below the x-axis, labelled B
  Regions with only a slight difference between the density of the columns above and below the x-axis, labelled C
Now the problem of finding the MC in the HTML file becomes the problem of finding region B. To find region B we follow 3 steps:
1) Draw a smoothed diagram. For all columns we calculate diff_i and draw a new diagram:
$$\mathit{diff}_i = (T1_i - T2_i) + (T1_{i+1} - T2_{i+1}) + (T1_{i-1} - T2_{i-1}) \qquad (1)$$
  If diff_i > 0, we draw a line of length diff_i above the x-axis.
  Otherwise, we draw a line of length |diff_i| below the x-axis.
2) Find the column with the longest length above the x-axis.
3) Find the boundaries of the MC region.
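Steps 1 and 2 can be sketched as follows. The sketch follows the diagram convention in which Non-ASCII counts point above the x-axis, so each term is a Non-ASCII count minus an ASCII count. Treating missing neighbours at the first and last line as 0 is an assumption, since the slides do not say how the borders are handled; the function names are hypothetical.

```python
def smoothed_diff(non_ascii_counts, ascii_counts):
    """Sum (Non-ASCII - ASCII) over lines i-1, i and i+1 for every line i."""
    raw = [n - a for n, a in zip(non_ascii_counts, ascii_counts)]
    diff = []
    for i in range(len(raw)):
        left = raw[i - 1] if i > 0 else 0              # assumed 0 before the first line
        right = raw[i + 1] if i + 1 < len(raw) else 0  # assumed 0 after the last line
        diff.append(left + raw[i] + right)
    return diff


def peak_line(diff):
    """Index of the longest column above the x-axis, or None if none is positive."""
    if not diff:
        return None
    best = max(range(len(diff)), key=diff.__getitem__)
    return best if diff[best] > 0 else None


non_ascii = [0, 0, 40, 60, 30, 0]
ascii_ = [50, 30, 5, 4, 6, 40]
diff = smoothed_diff(non_ascii, ascii_)
print(diff)             # prints [-80, -45, 61, 115, 40, -16]
print(peak_line(diff))  # prints 3
```

Summing three neighbouring lines smooths out isolated lines, so a single markup-heavy line inside a text block does not split region B.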
Figure: smoothed diagram (labels: diff calculated by formula (1); number of lines in the HTML file; main content)
Finding the boundaries of the MC region, but how?
  After recognizing the longest column above the x-axis, the algorithm moves up and down in the HTML file to find all paragraphs belonging to the MC. But where do these movements end?
  The number of lines we may traverse to find the next MC paragraph is defined as a parameter P, initialized to 20.
  Using this parameter, we move up and down, respectively, until we cannot find a line containing Non-ASCII characters within P lines.
  At that point, all the lines we have found make up our MC.
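One possible reading of the boundary search is sketched below. It assumes that the scan in each direction stops once P consecutive lines without Non-ASCII characters have been passed (or the file ends), and that the region runs from the outermost line found above the peak to the outermost line found below it. The name `expand_region` and this stopping rule are assumptions, not taken from the slides.

```python
def expand_region(non_ascii_counts, peak, p=20):
    """Return (first, last) line indices of the MC region around peak."""
    def scan(step):
        edge = peak  # outermost MC line found so far in this direction
        i = peak + step
        # Continue while inside the file and at most p lines beyond the edge.
        while 0 <= i < len(non_ascii_counts) and abs(i - edge) <= p:
            if non_ascii_counts[i] > 0:
                edge = i  # another MC line: the window of p lines restarts here
            i += step
        return edge

    return scan(-1), scan(+1)


counts = [0, 4, 0, 0, 10, 12, 0, 6, 0, 0, 0, 0, 2]
print(expand_region(counts, peak=5, p=2))  # prints (4, 7)
print(expand_region(counts, peak=5, p=3))  # prints (1, 7)
```

The two calls show the effect of P: a larger value bridges longer gaps of markup-only lines, at the risk of pulling in unrelated text.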
Phase Three : Extracting MC
In the final phase, we feed all HTML lines determined in the previous phase as input to a parser.
Following our hypothesis, the output of the parser is exactly the main content.
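The slides do not say which parser is used. As a sketch, Python's standard `html.parser.HTMLParser` can strip the markup from the selected lines; the class `TextCollector`, the function `extract_text` and the decision to skip `script` and `style` contents are illustrative assumptions.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, ignoring tags and the contents of script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(html_lines, first, last):
    """Parse only lines first..last (inclusive) and return their text."""
    parser = TextCollector()
    parser.feed("\n".join(html_lines[first:last + 1]))
    parser.close()
    return "\n".join(parser.parts)


lines = [
    "<div id='nav'>Home</div>",
    "<p>\u0633\u0644\u0627\u0645 <b>\u062f\u0646\u06cc\u0627</b></p>",
    "<script>var x = 1;</script>",
    "<div>footer</div>",
]
print(extract_text(lines, 1, 2))
```

Only the selected lines reach the parser, which is where the saving over parsing the entire page comes from.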
Data Sets, Evaluation
Data Sets
Web site                    Num. of Pages   Language
BBC                         598             Farsi
Hamshahri                   375             Farsi
Jame Jam                    136             Farsi
Ahram                       188             Arabic
Reuters                     116             Arabic
Embassy of Germany, Iran    31              Farsi
BBC                         234             Urdu
BBC                         203             Pashto
BBC                         252             Arabic
Wiki                        33              Farsi

Languages: Arabic, Farsi, Pashto and Urdu.
We have collected 2166 web pages from different web sites.
Evaluation
$$r = \frac{\mathrm{length}(k)}{\mathrm{length}(g)}, \qquad p = \frac{\mathrm{length}(k)}{\mathrm{length}(m)}, \qquad F1 = 2 \cdot \frac{p \cdot r}{p + r}$$
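The slide does not define k, g and m. The sketch below assumes a convention common in content extraction evaluation: g is the gold-standard main content, m is the extracted content, and k is their longest common subsequence, computed here over whitespace-separated tokens. These definitions and the function names are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def evaluate(gold, extracted):
    """Return (recall, precision, F1) over whitespace-separated tokens."""
    g, m = gold.split(), extracted.split()
    k = lcs_length(g, m)
    r = k / len(g) if g else 0.0
    p = k / len(m) if m else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f1


# The extraction contains the whole gold text plus two extra tokens:
# recall is 1.0, precision is 4/6, and F1 is 0.8 (up to rounding).
print(evaluate("a b c d", "a b c d e f"))
```

Recall penalises missing main content, precision penalises extra text such as menus or footers, and F1 balances the two.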
Conclusion and Future Work
The results show that DANA produces satisfactory MC, with an F1-measure > 0.935.
We do not need a parser in the first phase; using a parser is time-consuming.
Future work:
  Generalizing DANA by differentiating between tags and text/content, to propose a new language-independent content extraction approach
  Extending DANA for Wikipedia web pages to obtain better results
  Trying to improve DANA into a parameter-free algorithm