French NERC (first version and results). ICDC, CROSSMARC Project IST-2000-25366, Third meeting, Rome, November 2001.

Jan 06, 2018


Theodora Poole

Transcript
Page 1:

French NERC (first version and results)

CROSSMARC Project IST-2000-25366

Third meeting, Rome, November 2001

Page 2:

Summary

• Complete experiment on French corpus
– French mono-product corpus
– Detailed extraction performances
– Examples of limits

• French NERC overview
– XML DTD for named-entity extraction
– Architecture & components description
– Development & maintenance

Page 3:

French Corpus

• 56 mono-product description pages
• 7 manufacturers: SONY, ASUS, DELL…
• 17 models: VAIO, INSPIRON, L8400…
• 6 processors: PENTIUM III, CELERON…
• 5 OS: WIN MILLENIUM, WIN 98…
• Wide ranges of WEIGHTS, PRICES...

Page 4:

Example of extraction

Page 5:

Detailed extraction performances [OK, KO]

• MANUF [56, 0]: small number of cases (7)
• MODEL [56, 0]: great number of configurations (VAIO FX 101, 105, 201, 203, 205, 209, 808, PCG, QR10…)
• PROCESSOR [55, 1]: most of the cases are PENTIUM III & CELERON
• SOFT_OS [51, 5]: small number of cases (WIN XX)
• PRICE [35, 21]: some limits, ambiguities due to component prices
• RESOLUTION [39, 17]: some limits
• SPEED [41, 15]: some limits, ambiguities due to component speed
• CAPACITY [52, 4]: ambiguities due to component capacities
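The OK/KO pairs above can be turned into per-feature success rates. The sketch below assumes that OK counts pages with a correct extraction and KO counts pages with a failed one, which the slide does not state explicitly; the function name `ok_rate` is made up for illustration.

```python
# OK/KO counts per feature, copied from the slide above.
counts = {
    "MANUF": (56, 0),
    "MODEL": (56, 0),
    "PROCESSOR": (55, 1),
    "SOFT_OS": (51, 5),
    "PRICE": (35, 21),
    "RESOLUTION": (39, 17),
    "SPEED": (41, 15),
    "CAPACITY": (52, 4),
}


def ok_rate(ok, ko):
    """Share of OK cases among all cases, in percent (assumed reading of [OK, KO])."""
    return 100.0 * ok / (ok + ko)


rates = {name: ok_rate(ok, ko) for name, (ok, ko) in counts.items()}

# Print the features from weakest to strongest.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:<11}{rate:6.1f}%")
```

Every pair sums to 56, the number of pages in the corpus, so the rates are directly comparable across features.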

Page 6:

(1a) Limits: Information does not exist

• No weight

Page 7:

(1b) Limits: Information does not exist

• No Soft_OS

Page 8:

(2) Limits: Information inside an image

<big><big><font face="Arial" color="#000080"><strong>13990.00</strong></font></big></big><img src="img/francb.gif">
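A minimal sketch of why this case is a limit for text-based extraction, using Python's standard `html.parser` on the snippet above. It assumes that the image `img/francb.gif` carries the currency, which the slide suggests but does not state; the class name `TextAndImages` is made up for illustration.

```python
from html.parser import HTMLParser


class TextAndImages(HTMLParser):
    """Collects visible text and the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.images = []

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs).get("src"))


snippet = (
    '<big><big><font face="Arial" color="#000080"><strong>13990.00</strong>'
    '</font></big></big><img src="img/francb.gif">'
)

parser = TextAndImages()
parser.feed(snippet)
parser.close()

print(parser.text)    # the amount is available as text
print(parser.images)  # the rest of the information is only an image file name
```

A pattern that expects a currency word or symbol next to the amount finds nothing to match here, because the text layer contains only the number.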

Page 9:

(3) Limits: One description for several products

Page 10:

(4) Limits: Information outside of the page

Page 11:

(5) Limits: Information contains an error

Soft_OS = windows 200

Page 12:

Perspectives

• Ambiguities will be managed by the Fact Extractor Module

• Limits should be discussed by the Consortium
– Information does not exist
– Information inside an image
– One description for several products
– Information outside of the page
– Information contains an error

Page 13:

French NERC Overview

[Figure: architecture diagram. Components: laptops.xml, nerc.dtd, xml2nerc, nerc-laptops.pl, Nerc.pm, product.html, extraction.html. Phases: static step, dynamic step. Arrow labels: refers to, is processed by, generates. Legend: XML, Perl, HTML, XHTML.]

Page 14:

nerc.dtd

<?xml version="1.0" encoding="iso-8859-1"?>
<!-- DTD French NERC -->
<!-- Informatique CDC 2001 -->
<!-- Project CROSSMARC -->
<!ELEMENT nerc (feature)+>
<!ATTLIST nerc
    domain CDATA #REQUIRED>
<!ELEMENT feature (element)+>
<!ATTLIST feature
    no CDATA #REQUIRED
    name CDATA #REQUIRED
    type (STRING|INTEGER|DECIMAL|DOUBLE-INTEGER) #REQUIRED
    if CDATA #REQUIRED
    weak CDATA #IMPLIED>
<!ELEMENT element (form)+>
<!ATTLIST element
    norm CDATA #REQUIRED
    weak CDATA #IMPLIED>
<!ELEMENT form (#PCDATA)>

• DTD File
• Domain-independent rulebase metadescription

• nerc: main
– domain

• feature: of a product (e.g., SPEED)
– no
– name
– type
– if
– weak

• element: of a feature (e.g., MHz)
– norm
– weak

• form: string or regex of an element (e.g., "[Mm][Hh][Zz]")
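To make the constraints of nerc.dtd concrete, here is a small sketch that checks a rulebase instance against the required attributes, the `type` enumeration and the nesting given in the DTD. It is a hand-written partial check, not a full DTD validator, and the instance itself is hypothetical: the attribute values (`no="1"`, `if="1"`, `domain="laptops"`) are placeholders, since the slides do not show the real contents of laptops.xml or explain the meaning of `if` and `weak`.

```python
import xml.etree.ElementTree as ET

# Required attributes, allowed type values and nesting, copied from nerc.dtd.
REQUIRED = {
    "nerc": ["domain"],
    "feature": ["no", "name", "type", "if"],
    "element": ["norm"],
}
TYPES = {"STRING", "INTEGER", "DECIMAL", "DOUBLE-INTEGER"}
CHILD = {"nerc": "feature", "feature": "element", "element": "form"}


def check(node, errors=None):
    """Collect violations of the DTD's attribute and nesting constraints."""
    errors = [] if errors is None else errors
    for attr in REQUIRED.get(node.tag, []):
        if attr not in node.attrib:
            errors.append(f"<{node.tag}> lacks required attribute '{attr}'")
    if node.tag == "feature" and node.get("type") not in TYPES:
        errors.append(f"<feature> has invalid type {node.get('type')!r}")
    expected = CHILD.get(node.tag)
    if expected is not None:
        children = list(node)
        if not children:
            errors.append(f"<{node.tag}> needs at least one <{expected}>")
        for child in children:
            if child.tag != expected:
                errors.append(f"<{node.tag}> may not contain <{child.tag}>")
            check(child, errors)
    return errors


# Hypothetical instance; only SPEED, MHz and the form come from the slide.
GOOD = (
    '<nerc domain="laptops">'
    '<feature no="1" name="SPEED" type="INTEGER" if="1">'
    '<element norm="MHz"><form>[Mm][Hh][Zz]</form></element>'
    "</feature></nerc>"
)
BAD = GOOD.replace('type="INTEGER" if="1"', 'type="FLOAT"')

print(check(ET.fromstring(GOOD)))
print(check(ET.fromstring(BAD)))
```

The second instance drops the required `if` attribute and uses a type outside the enumeration, so it produces two messages while the first produces none.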

Page 15:

laptops.xml (1)

• XML File
• Domain-dependent matching rulebase description

Page 16:

laptops.xml (2)

• Domain-independent disambiguation

Page 17:

xml2nerc

• Perl Program
• Domain-independent XML to Perl translator
• Refers to nerc.dtd: elements, attributes, pcdata
• Refers to Nerc.pm: main, matching and disambiguation algorithms
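The translator idea, turning a declarative XML rulebase into source code, can be sketched as follows. This is not xml2nerc: it is written in Python rather than Perl, it emits Python rather than Perl, and the rulebase fragment, the `translate` function and the `RULES` layout are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rulebase fragment; attribute values are placeholders.
RULEBASE = """<nerc domain="laptops">
  <feature no="1" name="SPEED" type="INTEGER" if="1">
    <element norm="MHz"><form>[Mm][Hh][Zz]</form></element>
  </feature>
</nerc>"""


def translate(xml_text):
    """Return Python source defining RULES as (feature, norm, regex) tuples."""
    root = ET.fromstring(xml_text)
    lines = [
        "import re",
        "",
        f"# Generated for domain {root.get('domain')!r}",
        "RULES = [",
    ]
    for feature in root.iter("feature"):
        for element in feature.iter("element"):
            for form in element.iter("form"):
                lines.append(
                    f"    ({feature.get('name')!r}, {element.get('norm')!r}, "
                    f"re.compile({form.text!r})),"
                )
    lines.append("]")
    return "\n".join(lines) + "\n"


generated = translate(RULEBASE)
print(generated)

# Load the generated source to confirm that it is valid Python.
namespace = {}
exec(generated, namespace)
```

Generating code once, in a static step, keeps the domain knowledge in the XML file while the program that runs on each page contains only compiled patterns.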

Page 18:

Nerc.pm

• Perl Module
• Domain-independent pattern matching
• Domain-independent disambiguation

Page 19:

nerc-laptops.pl

• Generated domain-dependent Perl Program
• Applies pattern matching and disambiguation
• Generates the named entities that are recognized
• Refers to Nerc.pm: matching and disambiguation algorithms
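The pattern-matching half of this step might look like the sketch below. It is written in Python rather than Perl, it leaves out disambiguation entirely, and the rules, the `recognize` function and the sample text are hypothetical; in particular, the numeric prefix added to the MHz form is an assumption, not something shown on the slides.

```python
import re

# Hypothetical rules: (feature name, normalized value, pattern).
RULES = [
    ("SPEED", "MHz", re.compile(r"\d+\s*[Mm][Hh][Zz]")),
    ("MANUF", "SONY", re.compile(r"[Ss][Oo][Nn][Yy]")),
]


def recognize(text):
    """Return (start, end, feature, norm, matched text) tuples in reading order."""
    entities = []
    for feature, norm, pattern in RULES:
        for match in pattern.finditer(text):
            entities.append(
                (match.start(), match.end(), feature, norm, match.group())
            )
    return sorted(entities)


page = "Sony VAIO, Pentium III 700 MHz"
for start, end, feature, norm, surface in recognize(page):
    print(f"{feature:<6} norm={norm:<5} [{start}:{end}] {surface!r}")
```

Because every match is kept, a page that lists several speeds or prices would yield several candidates for the same feature, which is the kind of ambiguity the slides assign to disambiguation and to the Fact Extractor Module.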

Page 20:

FNERC Development & Maintenance

[Figure: maintenance levels placed across three columns (nerc.dtd, xml2nerc / Nerc.pm, laptops.xml) on an axis running between domain dependent and domain independent.]

• Level 0: New PCDATA string
• Level 1: Attributes value
• Level 2: New PCDATA regex
• Level 3: New attribute value
• Level 4: New attribute enum.
• Level 5: New attribute

Page 21:

Perspectives

• WP1: Experimenting with the NERC as a better evaluation function for the topic spider
• WP2: Improving the FNERC
• WP3: Implementing disambiguation techniques for the Fact Extractor Module