Text documents are everywhere around us (e.g. 189 million Web sites¹), all containing lots of valuable information.
Semantic Web as a vision to annotate information with its meaning.
Only 12% of Web sites make use of any semantic annotation such as RDFa, Microformats or Microdata [Mühleisen12]. Most content remains incomprehensible to machines.
Tools are required that automatically identify specific information in text.
Addresses consist of different attributes. Extracted data is only valuable if all attributes have been identified correctly. Sequentiality can be exploited.
Business addresses are highly volatile, so they need to be tracked automatically.
Business address data is of interest in various domains
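To illustrate how the sequential order of address attributes can be exploited, here is a minimal sketch. It is not the approach presented in these slides: the pattern, the attribute names (`street`, `number`, `postal_code`, `city`) and the function `extract_addresses` are assumptions made for illustration, and the pattern only covers a simple German-style layout (street and house number, then a five-digit postal code and a city).

```python
import re

# Hypothetical pattern (an assumption, not the authors' method): the attributes
# are expected in a fixed order, so one match yields all of them together.
ADDRESS_PATTERN = re.compile(
    r"(?P<street>[A-ZÄÖÜ][\w.\- ]*?)\s+"   # street name, starts with a capital letter
    r"(?P<number>\d+[a-zA-Z]?)"            # house number, optional letter suffix
    r"\s*[,\n]\s*"                         # comma or line break between the two parts
    r"(?P<postal_code>\d{5})\s+"           # five-digit German postal code
    r"(?P<city>[A-ZÄÖÜ][\w\-]*(?: [A-ZÄÖÜ][\w\-]*)*)"  # one or more capitalised words
)


def extract_addresses(text):
    """Return one dict per match; a match only exists if every attribute was found."""
    return [match.groupdict() for match in ADDRESS_PATTERN.finditer(text)]
```

Called on a made-up snippet such as `"Musterfirma GmbH\nBeispielstraße 12\n12345 Musterstadt"`, the function returns a single dictionary holding all four attributes. Real legal notes vary far more than this pattern allows, which is why it should be read as a sketch only.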
1. Motivation: Business Address Data
KOM – Multimedia Communications Lab 5
Semantic Web!
Web sites aggregating existing content often rely on addresses given on other Web sites, e.g. restaurant recommendations, job search engines, product search engines.
Address repositories can be created automatically.
Location-based services Can gain from population of geographical
Evaluation with legal notes (“Impressum”) from German company Web sites. 1576 documents containing one or more addresses. Each Web site annotated with the address of the owner of the Web site (our Gold Standard).
Recall: fraction of addresses from the Gold Standard found. An address counts as found only if all of its attributes were completely correct.
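The strict criterion behind this recall measure can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not list the attributes, so the fields of `Address` and the function name `strict_recall` are hypothetical, and matching is taken to mean exact equality of every attribute.

```python
from typing import Iterable, NamedTuple


class Address(NamedTuple):
    # Attribute set is an assumption; the slides do not enumerate the attributes.
    company: str
    street: str
    postal_code: str
    city: str


def strict_recall(gold: Iterable[Address], extracted: Iterable[Address]) -> float:
    """Fraction of gold-standard addresses that were extracted with every attribute correct."""
    gold_list = list(gold)
    if not gold_list:
        return 0.0
    extracted_set = set(extracted)
    # Tuple equality compares all attributes, so a partly correct address does not count.
    found = sum(1 for address in gold_list if address in extracted_set)
    return found / len(gold_list)
```

For example, if the Gold Standard holds two addresses and one of them is extracted with a shortened company name, only the other one counts and `strict_recall` returns 0.5.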
Structure of company names is often very unusual, which leads to partly correct detection. E.g. “oberüber Agentur für digitale Wertschöpfung” has been detected as “Agentur für digitale Wertschöpfung”.
Several company names on the Web site: the wrong company is assigned to an address.
Transformation from HTML code to text introduces errors
4. Evaluation: Challenges
1. Motivation Why Business Address Data? Application Scenarios
A new approach for identification of address data: outperforming existing approaches; no usage of commercial databases; adaptable to other languages / countries; tailored for identification of business addresses.
Next steps: adapt patterns to other languages / countries; evaluate in other languages / countries.
7. Conclusion & Future Work
Questions & Contact
Source: http://www.dreifragezeichen.de/
[Ahlers08] D. Ahlers and S. Boll. Retrieving Address-based Locations from the Web. In Proceedings of the 2nd international workshop on Geographic information retrieval, GIR ’08, pages 27–34, New York, NY, USA, 2008. ACM.
[Asadi08] S. Asadi, G. Yang, X. Zhou, Y. Shi, B. Zhai, and W.-R. Jiang. Pattern-Based Extraction of Addresses from Web Page Content. In Y. Zhang, G. Yu, E. Bertino, and G. Xu, editors, Progress in WWW Research and Development, volume 4976 of Lecture Notes in Computer Science, pages 407–418. Springer Berlin Heidelberg, 2008.
[Cai05] W. Cai, S. Wang, and Q. Jiang. Address extraction: Extraction of location-based information from the web. In Y. Zhang, K. Tanaka, J. Yu, S. Wang, and M. Li, editors, Web Technologies Research and Development - APWeb 2005, volume 3399 of Lecture Notes in Computer Science, pages 925–937. Springer Berlin Heidelberg, 2005.
[Loos08] B. Loos and C. Biemann. Supporting Web-based Address Extraction with Unsupervised Tagging. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, and R. Decker, editors, Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, pages 577–584. Springer Berlin Heidelberg, 2008.
[Mühleisen12] H. Mühleisen and C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora. In Proceedings of the 5th Workshop on Linked Data on the Web, 2012.