Finding More Bilingual Webpages with High Credibility via Link Analysis Chengzhi Zhang , Nanjing University of Science and Technology Xuchen Yao , Johns Hopkins University Chunyu Kit , City University of Hong Kong 8 August 2013 BUCC2013, Sofia, Bulgaria

Finding More Bilingual Webpages

with High Credibility

via Link Analysis

Chengzhi Zhang, Nanjing University of Science and Technology

Xuchen Yao, Johns Hopkins University

Chunyu Kit, City University of Hong Kong

8 August 2013

BUCC2013, Sofia, Bulgaria

3 ideas

• Bilingual URL Pattern Detection

• Deep Webpage Recovery

• Incremental Bilingual Website Exploration

Bilingual URL Pattern Detection

• a URL pattern: <en, zh> (Kit and Ng, 2007)
  – www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm
  – www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm
• Improvement:
  – pairing cost drops from O(|U|^2) to O(|U|)
    • U is the set of all URLs within a website
    • approach: inverted index for URLs
  – token-based pair -> char-based pair
    • weak pairs: <1e, 1c>, <2e, 2c>, ...
      – http://.../1e/i.html <-> http://.../1c/i.html
    • enhanced: <e, c>
  – supports multiple languages
    • better mining of multilingual websites such as EU and UN
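The two improvements above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `char_diff` and `pair_urls` are hypothetical, and a plain hash set stands in for the inverted index named on the slide. `char_diff` strips the longest common prefix and suffix of two URLs, so the token pair <1e, 1c> reduces to the character pair <e, c>. `pair_urls` pairs URLs for one known key pair with a single pass over U, assuming URL length is bounded.

```python
from typing import Iterable, List, Tuple


def char_diff(u1: str, u2: str) -> Tuple[str, str]:
    """Strip the longest common prefix and suffix; return what differs."""
    limit = min(len(u1), len(u2))
    i = 0
    while i < limit and u1[i] == u2[i]:
        i += 1
    j = 0
    # The suffix scan must not run back into the shared prefix.
    while j < limit - i and u1[len(u1) - 1 - j] == u2[len(u2) - 1 - j]:
        j += 1
    return u1[i:len(u1) - j], u2[i:len(u2) - j]


def pair_urls(urls: Iterable[str], src_key: str, tgt_key: str) -> List[Tuple[str, str]]:
    """Pair each URL containing src_key with its tgt_key counterpart, if present."""
    url_list = list(urls)
    known = set(url_list)  # assumption: a hash set replaces the inverted index
    pairs = []
    for u in url_list:
        start = u.find(src_key)
        while start != -1:
            candidate = u[:start] + tgt_key + u[start + len(src_key):]
            if candidate != u and candidate in known:
                pairs.append((u, candidate))
                break
            start = u.find(src_key, start + 1)
    return pairs


urls = [
    "www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm",
    "www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm",
    "www.legco.gov.hk/yr99-00/en/only.htm",  # hypothetical URL with no counterpart
]
pairs = pair_urls(urls, "/en/", "/zh/")
```

Each URL costs one substitution and one set lookup per occurrence of the key, which is where the linear behaviour comes from; comparing every URL with every other URL would be quadratic.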

Top 20 Keys

Number of matched URLs (top 10)

Keys in domain “gov.hk” (rescue locally weak keys if they are globally strong)

Deep Webpage Recovery

• deep webpage: a page that no other static page links to (so it is not searchable) until it is created dynamically
  – mostly triggered by JavaScript or Flash actions
• http://www.fehd.gov.hk/tc_chi/cagenda 20070904.htm
  – we have discovered patterns <tc_chi, english>, <tc_chi, en>, <tc_chi, eng>, ..., then try:
    – wget http://www.fehd.gov.hk/english/cagenda 20070904.htm
    – wget http://www.fehd.gov.hk/en/cagenda 20070904.htm
    – wget http://www.fehd.gov.hk/eng/cagenda 20070904.htm
    – ...
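The recovery step above amounts to rewriting a known URL with every discovered pattern and probing each result. A minimal Python sketch follows, under stated assumptions: `candidate_urls` and `recover` are hypothetical names, the `exists` callback is a placeholder for whatever fetcher is used (the slide uses wget), and the example.org URLs are invented for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def candidate_urls(url: str, patterns: Iterable[Tuple[str, str]]) -> List[str]:
    """Rewrite the first path segment equal to a source key with its target key."""
    out = []
    for src, tgt in patterns:
        segment = "/" + src + "/"
        if segment in url:
            out.append(url.replace(segment, "/" + tgt + "/", 1))
    return out


def recover(url: str,
            patterns: Iterable[Tuple[str, str]],
            exists: Callable[[str], bool]) -> List[str]:
    """Keep only the candidates that the (assumed) fetcher reports as present."""
    return [c for c in candidate_urls(url, patterns) if exists(c)]


patterns = [("tc_chi", "english"), ("tc_chi", "en"), ("tc_chi", "eng")]
seed = "http://example.org/tc_chi/page.htm"  # hypothetical URL
candidates = candidate_urls(seed, patterns)

# A stand-in fetcher: pretend only the /english/ version exists on the server.
fake_server = {"http://example.org/english/page.htm"}
recovered = recover(seed, patterns, lambda u: u in fake_server)
```

Passing the existence check in as a function keeps the URL rewriting testable without network access; a real run would replace the lambda with an HTTP request.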

various structures of websites with deep pages

Incremental Bilingual Website Exploration

• Intuition: bilingual websites tend to link to other bilingual websites.

• Measures:
  – Linkout(w): total number of outgoing links from website w
  – PageRank(w): (Brin and Page, 1998)
  – WeightedPageRank(w): weighted by "how bilingual" w is (the more bilingual URLs, the more "bilingual" w is)
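The three measures can be sketched in Python as below. The slide does not give the WeightedPageRank formula, so the weighting here is an assumption: the teleport distribution of a standard power-iteration PageRank is made proportional to each site's bilingual URL count. The names `linkout` and `pagerank`, the damping factor 0.85, and the toy graph are all illustrative choices, not taken from the talk.

```python
from typing import Dict, List, Optional


def linkout(links: Dict[str, List[str]], w: str) -> int:
    """Total number of outgoing links from website w."""
    return len(links.get(w, []))


def pagerank(links: Dict[str, List[str]],
             weights: Optional[Dict[str, float]] = None,
             damping: float = 0.85,
             iters: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank; `weights` (assumed: bilingual URL counts)
    biases the teleport step. With weights=None this is plain PageRank."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    total = sum(weights.get(w, 0.0) for w in nodes) if weights else 0.0
    if total > 0:
        tele = {w: weights.get(w, 0.0) / total for w in nodes}
    else:
        tele = {w: 1.0 / len(nodes) for w in nodes}
    rank = dict(tele)
    for _ in range(iters):
        new = {w: (1.0 - damping) * tele[w] for w in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # Dangling site: spread its rank by the teleport distribution.
                for v in nodes:
                    new[v] += damping * rank[u] * tele[v]
        rank = new
    return rank


# Toy website graph (hypothetical): site -> sites it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
plain = pagerank(links)
# Assumed bilingual URL counts per site; "d" is the most bilingual one.
weighted = pagerank(links, weights={"a": 1, "b": 1, "c": 1, "d": 10})
```

In this toy graph the weighting lifts site "d", which has no incoming links and would otherwise only receive the uniform teleport share.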

Discovering related websites from seed websites (select the top K most related websites)

[Linkout, PageRank, WeightedPageRank]
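A minimal Python sketch of the seed-expansion loop, assuming each candidate site already has a relatedness score from one of the three measures. The names `expand` and `explore`, the `rounds` parameter, and the toy data are hypothetical; the slide only states that the top K most related websites are selected.

```python
import heapq
from typing import Dict, Iterable, List


def expand(known: Iterable[str],
           links: Dict[str, List[str]],
           score: Dict[str, float],
           k: int) -> List[str]:
    """Return the k best-scoring sites linked from `known` but not yet in it."""
    seen = set(known)
    candidates = {v for s in seen for v in links.get(s, []) if v not in seen}
    # The site name is part of the key so that ties break deterministically.
    return heapq.nlargest(k, candidates, key=lambda w: (score.get(w, 0.0), w))


def explore(seeds: Iterable[str],
            links: Dict[str, List[str]],
            score: Dict[str, float],
            k: int,
            rounds: int) -> List[str]:
    """Grow the site list incrementally, adding at most k sites per round."""
    found = list(seeds)
    for _ in range(rounds):
        new_sites = expand(found, links, score, k)
        if not new_sites:
            break
        found.extend(new_sites)
    return found


# Toy data (hypothetical): link graph and precomputed scores.
links = {"a": ["b", "c", "d"], "b": [], "c": ["e"]}
score = {"b": 0.2, "c": 0.5, "d": 0.1, "e": 0.9}
sites = explore(["a"], links, score, k=2, rounds=2)
```

Site "d" is passed over in the first round but stays reachable, so it can still be picked up in a later round once stronger candidates are exhausted.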

Evaluation on number of URL pairs found and precision

total websites: 12,800

Conclusion

• Unsupervised bilingual pair detection (no heuristics)
  – http://mega.ctl.cityu.edu.hk/~czhang22/pupsniffer-eval/Data/Pattern_Credibility_LargeThan100.txt

• A large collection of English-Chinese webpages