Mar 30, 2015

Page 1: Extraction of Address Data from Unstructured Text using Free Knowledge Resources

Sebastian Schmidt, M.Sc.

Prof. Dr.-Ing. Ralf Steinmetz, KOM - Multimedia Communications Lab

06-September-2013

i-know_Address Extraction_SebS___2013.08.20.pptx

© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide.

Image source: http://upload.wikimedia.org/wikipedia/en/7/7f/World_Map_flat_Mercator.png, http://www.frdc.at/hp_frdc_pictures/frdc_dart_pfeil.jpg.gif

Page 2: Outline

1. Motivation
   - Why Business Address Data?
   - Application Scenarios
2. Structure of German Addresses
3. Solution
4. Evaluation
   - Methodology
   - Results
   - Challenges
5. Related Work
6. Adaptation
7. Conclusion and Future Work

Image source: http://www.yourshiningredthread.com/wp-content/uploads/2012/08/WavyThreadImage.jpg

Page 3: 1. Motivation: General

- Text documents are everywhere around us (e.g. 189 million Web sites [1])
  - All of them contain lots of valuable information
- Semantic Web as a vision to annotate information with its meaning
- Only 12% of Web sites make use of any semantic annotation such as RDFa, microformats or Microdata [Mühleisen12]
  - Most content remains incomprehensible to machines
- Tools are required that allow automatic identification of certain information in text

[1] http://news.netcraft.com/archives/2013/08/09/august-2013-web-server-survey.html

Image source: http://www.netresearch.de/blog/wp-content/uploads/2009/04/semantic_web_day.jpg

Page 4: 1. Motivation: Business Address Data

- Addresses consist of different attributes
  - Extracted data is only valuable if all attributes have been identified correctly
  - Sequentiality can be exploited
- Business addresses have a high volatility
  - Need to track them automatically
- Business address data is of interest in various domains

Page 5: 1. Motivation: Application Scenario

- Semantic Web!
- Web sites aggregating existing content
  - Often relying on addresses given on Web sites
  - E.g. restaurant recommendations, job search engines, product search engines
- Address repositories
  - Can be created automatically
- Location-based services
  - Can gain from population of geographical repositories with business information

Image source: http://www.thedigitalbus.com/wp-content/uploads/2011/09/Location-Based-Services.jpg

Page 7: 2. Structure of German Addresses

<Company Name>
<Street> <Street Number>
<Postal Code> <City>

- <Company Name>
  - No common pattern
  - Variable length
  - Type of business entity can be part of the name
- <Street>
  - A number of common suffixes, but many exceptions
  - Spelling varies a lot (abbreviations)
  - Variable length
- <Street Number>
  - Single digit or number
  - Can be suffixed by a character
- <Postal Code>
  - Five digits
  - Might be prefixed by "D-"
- <City>
  - No common structure
  - Some suffixed indicators, but not for all cities
  - Different naming schemes for a single city, e.g. "Frankfurt", "Frankfurt/Main", "Ffm", …
- A general structure exists, but with many exceptions
  - Fragmented by other attributes, e.g. the name of a company is not mentioned next to the address but somewhere else on a Web site
  - All attributes within one line
  - …

Page 9: 3. Solution: Overview

Approach:

1. Pre-processing
2. Identification of single attributes, with some dependencies defined by patterns
3. Afterwards, aggregation of the results into complete addresses

[Figure: pipeline diagram. Pre-Processing, then Identification of Company Names, Street Names, Street Numbers, Postal Codes and Cities, then Aggregation.]

Page 10: 3. Solution: Steps

- Pre-processing
  - Stripping of HTML markup
  - Data cleaning
  - Line splitting
  - Tokenization
  - Part-of-Speech (POS) tagging
- Identification of single attributes
  - Independent of previous identifications
    - Only some dependencies, to improve precision
  - Leads to a large number of candidates for each attribute
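The pre-processing steps listed on this slide could be sketched roughly as follows. This is only an illustration using the Python standard library, not the authors' implementation: the class `_TextExtractor`, the function `preprocess`, the choice of block-level tags and the whitespace tokenizer are all assumptions, and POS tagging is left out because it would require an external tagger.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and inserts line breaks at block-level tags."""

    # Assumed set of tags that start a new line in the rendered page.
    BLOCK_TAGS = {"br", "p", "div", "li", "tr", "h1", "h2", "h3", "address"}
    # Content of these tags is never visible text.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(html):
    """Strip markup, clean whitespace, split into lines, tokenize each line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)

    lines = []
    for raw in text.splitlines():
        # Data cleaning: collapse runs of whitespace (including &nbsp;).
        cleaned = re.sub(r"\s+", " ", raw).strip()
        if cleaned:
            # Naive whitespace tokenization; punctuation stays attached.
            lines.append(cleaned.split(" "))
    return lines


if __name__ == "__main__":
    page = "<p>Muster GmbH<br>Hauptstraße 12a</p><div>64283&nbsp;Darmstadt</div>"
    for tokens in preprocess(page):
        print(tokens)
```

Keeping the line structure matters here because the later identification steps work on token positions within and across lines.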

Page 11: 3. Solution: Steps

- Identification of postal codes
  - Regular expression
- Identification of cities
  1. Terms within a certain distance (3 tokens) of a postal code candidate that exist in the gazetteer
     - Gazetteer assembled from OpenStreetMap
     - 28,087 entries
  2. Terms that are directly preceded by a postal code candidate
     - Capitalized
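A minimal sketch of the two identification rules on this slide is given below. The regular expression (five digits with an optional "D-" prefix, as described on the address-structure slide), the three-entry stand-in for the OpenStreetMap gazetteer, and the function names are assumptions made for illustration; multi-token city names are ignored for brevity.

```python
import re

# Assumed pattern: five digits, optionally prefixed by "D-".
POSTAL_CODE_RE = re.compile(r"(?:D-)?(\d{5})")

# Tiny stand-in for the gazetteer assembled from OpenStreetMap.
CITY_GAZETTEER = {"Darmstadt", "Frankfurt", "Berlin"}

MAX_DISTANCE = 3  # window size in tokens, taken from the slide


def find_postal_codes(tokens):
    """Return (index, five-digit code) for every postal code candidate."""
    hits = []
    for i, tok in enumerate(tokens):
        m = POSTAL_CODE_RE.fullmatch(tok)
        if m:
            hits.append((i, m.group(1)))
    return hits


def find_cities(tokens, postal_hits):
    """Return sorted (index, token) city candidates near postal codes."""
    candidates = set()
    for idx, _code in postal_hits:
        # Rule 1: gazetteer terms within MAX_DISTANCE tokens of the code.
        lo = max(0, idx - MAX_DISTANCE)
        hi = min(len(tokens), idx + MAX_DISTANCE + 1)
        for j in range(lo, hi):
            if j != idx and tokens[j] in CITY_GAZETTEER:
                candidates.add((j, tokens[j]))
        # Rule 2: a capitalized term directly after the code.
        nxt = idx + 1
        if nxt < len(tokens) and tokens[nxt][:1].isupper():
            candidates.add((nxt, tokens[nxt]))
    return sorted(candidates)


if __name__ == "__main__":
    tokens = ["Hauptstraße", "12a", "D-64283", "Darmstadt"]
    codes = find_postal_codes(tokens)
    print(codes, find_cities(tokens, codes))
```

Note that any five-digit token matches, so phone prefixes and similar numbers also become candidates; this is consistent with the slide's remark that independent identification yields many candidates per attribute.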

Page 12: 3. Solution: Steps

- Identification of street numbers
  - Regular expression
  - Also for ranges of street numbers
- Identification of street names
  1. Token chains ending with an indicator term
     - Gazetteer of indicators assembled from OpenStreetMap
     - Containing the 30 most common endings of German street names
     - Covering 70% of German street names
  2. Token chains that follow a certain POS pattern
     - Out of 6 manually defined patterns
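The street number pattern and the indicator-based street name rule might look roughly like this. The regular expression, the short list of endings (a hypothetical subset, not the 30-entry gazetteer from the slides) and the function names are assumptions; the POS-pattern rule is not covered because the six patterns are not given.

```python
import re

# Assumed pattern: 1-4 digits, optional letter suffix, optional range ("12-14").
STREET_NUMBER_RE = re.compile(r"\d{1,4}[a-zA-Z]?(?:-\d{1,4}[a-zA-Z]?)?")

# Hypothetical subset of common German street name endings.
STREET_INDICATORS = ("straße", "strasse", "str.", "weg", "platz", "allee", "gasse", "ring")


def is_street_number(token):
    """True if the token looks like a street number or a number range."""
    return STREET_NUMBER_RE.fullmatch(token) is not None


def find_street_names(tokens):
    """Return (start, end, name) for token chains ending in an indicator term."""
    hits = []
    for end, tok in enumerate(tokens):
        lower = tok.lower()
        if lower in STREET_INDICATORS:
            # Indicator written as its own word, e.g. "Darmstädter Straße".
            if end == 0:
                continue
            start = end - 1
        elif lower.endswith(STREET_INDICATORS):
            # Indicator fused to the name, e.g. "Hauptstraße" or "Hauptstr.".
            start = end
        else:
            continue
        hits.append((start, end + 1, " ".join(tokens[start:end + 1])))
    return hits


if __name__ == "__main__":
    tokens = ["Muster", "GmbH", "Darmstädter", "Straße", "12-14"]
    print(find_street_names(tokens), is_street_number(tokens[-1]))
```

Suffix matching of this kind also fires on ordinary words that happen to end in an indicator, which is why the results are treated as candidates only.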

Page 13: 3. Solution: Steps

- Identification of company names
  1. Token chains ending with an indicator term
     - List of terms from a Wikipedia page on types of business entities
     - 29 indicator terms
  2. Token chains preceding a street name
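The two company name rules could be approximated as in the sketch below. The slides do not say where a token chain starts, so this example assumes it starts at the beginning of the line, and it interprets "preceding a street name" as "the line directly above a street line". The legal-form list is a hypothetical subset of the 29 indicator terms, and `find_company_names` is an illustrative name.

```python
# Hypothetical subset of German business entity types used as indicators.
LEGAL_FORMS = {"GmbH", "AG", "KG", "OHG", "UG", "GbR", "e.V."}


def find_company_names(lines, street_lines):
    """
    lines:        list of token lists, one per text line
    street_lines: indices of lines that contain a street name candidate
    Returns a list of (line_index, name) candidates without duplicates.
    """
    hits = []

    def add(candidate):
        if candidate not in hits:
            hits.append(candidate)

    # Rule 1: chain from the start of the line up to an indicator term.
    for i, tokens in enumerate(lines):
        for end, tok in enumerate(tokens):
            if tok in LEGAL_FORMS and end > 0:
                add((i, " ".join(tokens[:end + 1])))

    # Rule 2: the line directly preceding a street name candidate.
    for s in street_lines:
        if s > 0 and lines[s - 1]:
            add((s - 1, " ".join(lines[s - 1])))

    return hits


if __name__ == "__main__":
    lines = [["Muster", "Software", "GmbH"], ["Hauptstraße", "12a"], ["64283", "Darmstadt"]]
    print(find_company_names(lines, street_lines=[1]))
```

The second rule is what lets names without a legal form be picked up, at the cost of more false candidates.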

Page 14: 3. Solution: Steps

- Aggregation
  1. Company candidates as seed
  2. Search for the closest combination of street name and street number candidates
  3. Search for the closest combination of postal code and city candidates
  4. If all elements are found for a company candidate: complete address

Image source: http://d3sdoylwcs36el.cloudfront.net/online_content_distribution_strategies_aggregation_getty_images.jpg/
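The aggregation steps can be sketched as a nearest-candidate search around each company seed. The slides do not define "closest", so this example assumes the absolute difference of positions (for instance line numbers) and assumes that street name plus number, and postal code plus city, have already been paired into single candidates. The function name and the tuple layout are hypothetical.

```python
def aggregate(companies, streets, places):
    """
    companies: list of (position, company name)
    streets:   list of (position, "street name and number")
    places:    list of (position, "postal code and city")
    Returns one address dict per company seed for which all parts exist.
    """
    addresses = []
    for c_pos, name in companies:
        # Step 4 can only succeed if both kinds of candidates exist.
        if not streets or not places:
            continue
        # Steps 2 and 3: pick the candidates closest to the seed position.
        _, street = min(streets, key=lambda s: abs(s[0] - c_pos))
        _, place = min(places, key=lambda p: abs(p[0] - c_pos))
        addresses.append({"company": name, "street": street, "place": place})
    return addresses


if __name__ == "__main__":
    companies = [(0, "Muster GmbH"), (10, "Beispiel AG")]
    streets = [(1, "Hauptstraße 12a"), (11, "Musterweg 5")]
    places = [(2, "64283 Darmstadt"), (12, "10115 Berlin")]
    for address in aggregate(companies, streets, places):
        print(address)
```

With several company names on one page, a purely distance-based choice like this can attach the wrong company to an address.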

Page 16: 4. Evaluation: Methodology

- Evaluation with legal notices ("Impressum") from German company Web sites
  - 1576 documents containing one or more addresses
  - Each Web site annotated with the address of the owner of the Web site (our Gold Standard)
- Recall
  - Fraction of addresses from the Gold Standard that were found
  - An address as a whole was considered correct only if all of its attributes were completely correct
- Precision
  - Fraction of found addresses that are correct
- F1-Measure

Image source: http://wisesyracuse.wordpress.com/2012/05/23/how-to-measure-the-effectiveness-of-your-social-media-efforts/
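The strict, all-attributes-must-match scoring described on this slide can be illustrated as follows. Addresses are represented as tuples so that two addresses are equal only if every attribute is equal; the tuple layout and the function name `evaluate` are assumptions for this sketch.

```python
def evaluate(extracted, gold):
    """
    extracted, gold: iterables of address tuples
                     (company, street, number, postal_code, city).
    An extracted address counts as correct only on an exact tuple match.
    Returns (precision, recall, f1).
    """
    extracted, gold = set(extracted), set(gold)
    correct = extracted & gold

    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = [
        ("Muster GmbH", "Hauptstraße", "12a", "64283", "Darmstadt"),
        ("Beispiel AG", "Musterweg", "5", "10115", "Berlin"),
    ]
    extracted = [
        ("Muster GmbH", "Hauptstraße", "12a", "64283", "Darmstadt"),
        # One wrong attribute (street number) makes the whole address wrong.
        ("Beispiel AG", "Musterweg", "6", "10115", "Berlin"),
    ]
    print(evaluate(extracted, gold))
```

In this example one of two extracted addresses matches one of two gold addresses, so precision, recall and F1 are all 0.5.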

Page 17: 4. Evaluation: Results

[Figure: bar chart of Precision, Recall and F1-Measure (y-axis from 0.5 to 1) for five categories: complete address w/o company name, complete address with company name, company name, street, place.]

Page 18: 4. Evaluation: Challenges

- The structure of company names is often very unusual
  - Leads to partly correct detection
  - E.g. "oberüber Agentur für digitale Wertschöpfung" has been detected as "Agentur für digitale Wertschöpfung"
- Several company names on the Web site
  - Wrong company is assigned to an address
- Transformation from HTML code to text introduces errors

Page 20: 5. Related Work

- [Loos08]
  - Usage of Conditional Random Fields
  - Small annotated dataset for bootstrapping
  - Result of an unsupervised tagger as an additional feature
- [Asadi08]
  - Manually defined patterns for address extraction with confidence scores
  - Usage of some geographic information from an unknown source
- [Cai05]
  - Exploiting graph-based similarity to a template graph
  - Usage of a commercial GIS database
- [Ahlers08]
  - Relying on a complete database of street names, postal codes and cities
  - Matching of text to valid combinations of those attributes
- Existing approaches
  - Rely on manual effort and/or extensive proprietary data sources
  - Do not identify business addresses

Page 21: 5. Related Work: Results

Comparison to Related Work (restricted to addresses without company name)

Approach      Precision  Recall  F1-Measure  Language
[Loos08]      0.89       0.64    0.74        de
[Asadi08]     0.97       0.73    0.83        en
[Cai05]       0.75       0.73    0.74        en
[Ahlers08]    Not given  ~0.95   Not given   de
Our approach  0.93       0.95    0.94        de
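As a quick consistency check, the F1 values in the table can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The numbers below are copied from the table; the row for [Ahlers08] is omitted because precision and F1 are not given. The names `REPORTED` and `f1` are just for this sketch.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the comparison table.
REPORTED = {
    "[Loos08]": (0.89, 0.64, 0.74),
    "[Asadi08]": (0.97, 0.73, 0.83),
    "[Cai05]": (0.75, 0.73, 0.74),
    "Our approach": (0.93, 0.95, 0.94),
}

if __name__ == "__main__":
    for name, (p, r, reported) in REPORTED.items():
        print(f"{name}: recomputed {f1(p, r):.2f}, reported {reported:.2f}")
```

For all four rows the recomputed value, rounded to two decimals, agrees with the reported F1.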

Page 23: 6. Adaptation to other Country/Language

- Define overall pattern (order of attributes)
- Adapt identification of single attributes
- Re-create gazetteers
  - Cities
  - Street name indicators
  - Business entity types
- OpenStreetMap and Wikipedia exist in most countries/languages
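One way to picture the adaptation steps is to bundle everything that is country-specific into a single profile object, so that moving to another country means supplying a different profile. This is a hypothetical design, not something described in the slides; the class `CountryProfile`, its field names and the sample values are assumptions, and the gazetteers shown are tiny placeholders.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CountryProfile:
    """Everything the extractor would need to know about one country."""
    attribute_order: tuple            # overall pattern (order of attributes)
    postal_code_pattern: str          # regular expression for postal codes
    cities: frozenset                 # city gazetteer
    street_indicators: frozenset      # street name indicator gazetteer
    business_entity_types: frozenset  # company name indicator gazetteer


# Placeholder profile for Germany, following the structure on the earlier slides.
GERMANY = CountryProfile(
    attribute_order=("company", "street", "street_number", "postal_code", "city"),
    postal_code_pattern=r"(?:D-)?\d{5}",
    cities=frozenset({"Darmstadt", "Frankfurt", "Berlin"}),
    street_indicators=frozenset({"straße", "weg", "platz"}),
    business_entity_types=frozenset({"GmbH", "AG", "KG"}),
)

if __name__ == "__main__":
    print(GERMANY.attribute_order)
    print(bool(re.fullmatch(GERMANY.postal_code_pattern, "D-64283")))
```

A profile for another country would change the attribute order, the postal code pattern and the three gazetteers, which corresponds to the three adaptation steps on this slide.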

Page 25: 7. Conclusion & Future Work

- A new approach for identification of address data
  - Outperforming existing approaches
  - No usage of commercial databases
  - Adaptable to other languages / countries
  - Tailored for identification of business addresses
- Next steps:
  - Adapt patterns to other languages / countries
  - Evaluate in other languages / countries

Page 26: Questions & Contact

Source: http://www.dreifragezeichen.de/

Page 27: References

[Ahlers08] D. Ahlers and S. Boll. Retrieving Address-based Locations from the Web. In Proceedings of the 2nd International Workshop on Geographic Information Retrieval, GIR '08, pages 27–34, New York, NY, USA, 2008. ACM.

[Asadi08] S. Asadi, G. Yang, X. Zhou, Y. Shi, B. Zhai, and W.-R. Jiang. Pattern-Based Extraction of Addresses from Web Page Content. In Y. Zhang, G. Yu, E. Bertino, and G. Xu, editors, Progress in WWW Research and Development, volume 4976 of Lecture Notes in Computer Science, pages 407–418. Springer Berlin Heidelberg, 2008.

[Cai05] W. Cai, S. Wang, and Q. Jiang. Address extraction: Extraction of location-based information from the web. In Y. Zhang, K. Tanaka, J. Yu, S. Wang, and M. Li, editors, Web Technologies Research and Development - APWeb 2005, volume 3399 of Lecture Notes in Computer Science, pages 925–937. Springer Berlin Heidelberg, 2005.

[Loos08] B. Loos and C. Biemann. Supporting Web-based Address Extraction with Unsupervised Tagging. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, and R. Decker, editors, Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, pages 577–584. Springer Berlin Heidelberg, 2008.

[Mühleisen12] H. Mühleisen and C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora. In Proceedings of the 5th Workshop on Linked Data on the Web, 2012.