Finding More Bilingual Webpages with High Credibility via Link Analysis Chengzhi Zhang , Nanjing University of Science and Technology Xuchen Yao , Johns Hopkins University Chunyu Kit , City University of Hong Kong 8 August 2013 BUCC2013, Sofia, Bulgaria
Finding More Bilingual Webpages
with High Credibility
via Link Analysis
Chengzhi Zhang , Nanjing University of Science and Technology
Xuchen Yao , Johns Hopkins University
Chunyu Kit , City University of Hong Kong
8 August 2013
BUCC2013, Sofia, Bulgaria
3 ideas
• Bilingual URL Pattern Detection
• Deep Webpage Recovery
• Incremental Bilingual Website Exploration
Bilingual URL Pattern Detection
• a URL pattern: <en, zh> (Kit and Ng, 2007)
  – www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm
  – www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm
• Improvement:
  – pairing cost drops from O(|U|²) to O(|U|)
    • U is the set of all URLs within a website
    • approach: inverted index for URLs
  – supports multiple languages
    • better mining of multilingual websites such as EU and UN
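A minimal sketch of how an inverted index over URLs can pair candidates in a single pass, instead of comparing every URL with every other. This is an illustration under stated assumptions, not the authors' implementation: the key table `LANG_KEYS`, the delimiter set, the `{LANG}` placeholder, and the function names `index_urls` and `candidate_pairs` are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed language keys (hypothetical table); values are normalized language codes.
LANG_KEYS = {"en": "en", "english": "en", "zh": "zh", "tc_chi": "zh"}

# Split on common URL delimiters, keeping them so the URL can be re-joined.
TOKEN_SPLIT = re.compile(r"([/.\-?=&]+)")


def index_urls(urls):
    """Map each language-neutral URL skeleton to {language: url}."""
    index = defaultdict(dict)
    for url in urls:  # one pass over U
        tokens = TOKEN_SPLIT.split(url)
        for i, tok in enumerate(tokens):
            lang = LANG_KEYS.get(tok.lower())
            if lang is None:
                continue
            # Replace the language key with a placeholder to form the index key.
            skeleton = "".join(tokens[:i] + ["{LANG}"] + tokens[i + 1:])
            index[skeleton][lang] = url
    return index


def candidate_pairs(index, a="en", b="zh"):
    """Return (url_a, url_b) for every skeleton seen in both languages."""
    return [(e[a], e[b]) for e in index.values() if a in e and b in e]


urls = [
    "www.legco.gov.hk/yr99-00/en/fc/esc/e0.htm",
    "www.legco.gov.hk/yr99-00/zh/fc/esc/e0.htm",
    "www.legco.gov.hk/yr99-00/en/other.htm",
]
pairs = candidate_pairs(index_urls(urls))
```

Because each skeleton bucket can hold more than two languages, the same index would serve multilingual sites by calling `candidate_pairs` with other language codes.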
Top 20 Keys
Number of matched URLs (top 10)
Keys in domain “gov.hk” (rescue local weak keys if they are globally strong)
3 ideas
• Bilingual URL Pattern Detection
• Deep Webpage Recovery
• Incremental Bilingual Website Exploration
Deep Webpage Recovery
• deep webpage: pages that are not linked by any other static pages (not searchable) until created dynamically
  – mostly triggered by JavaScript or Flash actions
• http://www.fehd.gov.hk/tc_chi/cagenda 20070904.htm
  – we have discovered patterns <tc_chi, english>, <tc_chi, en>,