Rome November 2001 CROSSMARC Third meeting ICDC: French NERC (first version and results), CROSSMARC Project IST-2000-25366, Third meeting, Rome, November 2001
Jan 06, 2018
Rome November 2001 CROSSMARC Third meeting ICDC
French NERC (first version and results)
CROSSMARC Project IST-2000-25366
Third meeting, Rome, November 2001
Summary
• Complete experiment on French corpus
– French mono-product corpus
– Detailed extraction performances
– Examples of limits
• French NERC overview
– XML DTD for named-entity extraction
– Architecture & components description
– Development & maintenance
French Corpus
• 56 mono-product description pages
• 7 manufacturers: SONY, ASUS, DELL…
• 17 models: VAIO, INSPIRON, L8400…
• 6 processors: PENTIUM III, CELERON…
• 5 OS: WIN MILLENIUM, WIN 98…
• Wide ranges of WEIGHTS, PRICES...
Example of extraction
Detailed extraction performances [OK, KO]
• MANUF [56, 0]: small number of cases (7)
• MODEL [56, 0]: great number of configurations (VAIO FX 101, 105, 201, 203, 205, 209, 808, PCG, QR10…)
• PROCESSOR [55, 1]: most of the cases are PENTIUM III & CELERON
• SOFT_OS [51, 5]: small number of cases (WIN XX)
• PRICE [35, 21]: some limits, ambiguities due to component prices
• RESOLUTION [39, 17]: some limits
• SPEED [41, 15]: some limits, ambiguities due to component speed
• CAPACITY [52, 4]: ambiguities due to component capacities
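The [OK, KO] pairs can be turned into per-feature success rates. The sketch below assumes that OK and KO count the pages on which a feature was extracted correctly and incorrectly, a reading the slide does not spell out; the name `success_rate` is ours.

```python
# [OK, KO] counts copied from the slide; every pair sums to the 56 corpus pages.
results = {
    "MANUF": (56, 0),
    "MODEL": (56, 0),
    "PROCESSOR": (55, 1),
    "SOFT_OS": (51, 5),
    "PRICE": (35, 21),
    "RESOLUTION": (39, 17),
    "SPEED": (41, 15),
    "CAPACITY": (52, 4),
}


def success_rate(ok, ko):
    """Share of pages counted as OK (assumed meaning: correct extraction)."""
    return ok / (ok + ko)


rates = {name: success_rate(ok, ko) for name, (ok, ko) in results.items()}

# List the features from weakest to strongest.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:<11}{rate:6.1%}")
```

Under that reading, PRICE is the weakest feature (35 of 56 pages) and MANUF and MODEL are extracted on every page.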
(1a) Limits: Information does not exist
• No weight
(1b) Limits: Information does not exist
• No Soft_OS
(2) Limits: Information inside an image
<big><big><font face="Arial" color="#000080"><strong>13990.00</strong></font></big></big><img src="img/francb.gif">
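A minimal sketch of why this case is a limit for a text-based recognizer: parsing the snippet yields the amount as text, while whatever the image shows (presumably the currency, judging only from the file name) is reachable solely as an image reference. The class name `TextAndImages` is hypothetical and is not part of the French NERC.

```python
from html.parser import HTMLParser


class TextAndImages(HTMLParser):
    """Collect visible text and image sources separately."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.images = []

    def handle_data(self, data):
        # Keep only non-empty text nodes.
        if data.strip():
            self.text.append(data.strip())

    def handle_starttag(self, tag, attrs):
        # An image contributes a file reference, never matchable text.
        if tag == "img":
            self.images.append(dict(attrs).get("src"))


snippet = (
    '<big><big><font face="Arial" color="#000080"><strong>13990.00</strong>'
    '</font></big></big><img src="img/francb.gif">'
)

parser = TextAndImages()
parser.feed(snippet)
parser.close()

print(parser.text)
print(parser.images)
```

A pattern matcher working on `parser.text` sees only the number; the content of the image would need a separate treatment, which the slides leave open.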
(3) Limits: One description for several products
(4) Limits: Information outside of the page
(5) Limits: Information contains an error
Soft_OS = windows 200
Perspectives
• Ambiguities will be managed by the Fact Extractor Module
• Limits should be discussed by the Consortium
– Information does not exist
– Information inside an image
– One description for several products
– Information outside of the page
– Information contains an error
French NERC Overview
[Architecture diagram. Components: laptops.xml, nerc.dtd, xml2nerc, nerc-laptops.pl, Nerc.pm, product.html, extraction.html. Phases: static step, dynamic step. Arrow labels: refers to, is processed by, generates. Legend: XML, Perl, HTML, XHTML.]
nerc.dtd
<?xml version="1.0" encoding="iso-8859-1"?>
<!-- DTD French NERC -->
<!-- Informatique CDC 2001 -->
<!-- Project CROSSMARC -->
<!ELEMENT nerc (feature)+>
<!ATTLIST nerc domain CDATA #REQUIRED>
<!ELEMENT feature (element)+>
<!ATTLIST feature
    no   CDATA #REQUIRED
    name CDATA #REQUIRED
    type (STRING|INTEGER|DECIMAL|DOUBLE-INTEGER) #REQUIRED
    if   CDATA #REQUIRED
    weak CDATA #IMPLIED>
<!ELEMENT element (form)+>
<!ATTLIST element
    norm CDATA #REQUIRED
    weak CDATA #IMPLIED>
<!ELEMENT form (#PCDATA)>
• DTD file
• Domain-independent rulebase meta-description
• nerc: main element
– domain
• feature: of a product (e.g., SPEED)
– no
– name
– type
– if
– weak
• element: of a feature (e.g., MHz)
– norm
– weak
• form: string or regex of an element (e.g., "[Mm][Hh][Zz]")
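To make the nesting concrete, here is a hedged sketch of a tiny rulebase shaped like nerc.dtd, loaded with Python's standard library. Only the element and attribute names, the SPEED / MHz example and the "[Mm][Hh][Zz]" form come from the slides; the attribute values (`domain="laptops"`, `no="1"`, `type="INTEGER"`) are illustrative guesses, and `if` is left empty because its meaning is not described. `xml.etree.ElementTree` does not validate against a DTD, so the constraints are checked by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical rulebase: attribute values are assumptions, not project data.
SAMPLE_RULEBASE = """\
<nerc domain="laptops">
  <feature no="1" name="SPEED" type="INTEGER" if="">
    <element norm="MHz">
      <form>[Mm][Hh][Zz]</form>
    </element>
  </feature>
</nerc>
"""

FEATURE_TYPES = {"STRING", "INTEGER", "DECIMAL", "DOUBLE-INTEGER"}


def require(condition, message):
    """Raise ValueError when a constraint from nerc.dtd is violated."""
    if not condition:
        raise ValueError(message)


def load_rulebase(xml_text):
    """Return (feature name, norm, [forms]) triples after manual checks."""
    root = ET.fromstring(xml_text)
    require(root.tag == "nerc", "root element must be <nerc>")
    require("domain" in root.attrib, "<nerc> needs a domain attribute")
    features = root.findall("feature")
    require(features, "<nerc> needs at least one <feature>")
    rules = []
    for feature in features:
        for attr in ("no", "name", "type", "if"):
            require(attr in feature.attrib, f"<feature> needs {attr}")
        require(feature.get("type") in FEATURE_TYPES, "unknown feature type")
        elements = feature.findall("element")
        require(elements, "<feature> needs at least one <element>")
        for element in elements:
            require("norm" in element.attrib, "<element> needs norm")
            forms = [form.text for form in element.findall("form")]
            require(forms, "<element> needs at least one <form>")
            rules.append((feature.get("name"), element.get("norm"), forms))
    return rules


rules = load_rulebase(SAMPLE_RULEBASE)
print(rules)
```

The optional `weak` attribute is simply omitted here, which the `#IMPLIED` declaration allows.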
laptops.xml (1)
• XML file
• Domain-dependent matching rulebase description
laptops.xml (2)
• Domain-independent disambiguation
xml2nerc
• Perl program
• Domain-independent XML-to-Perl translator
• Refers to nerc.dtd: elements, attributes, PCDATA
• Refers to Nerc.pm: main, matching and disambiguation algorithms
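The translation step can be pictured with a small Python analogue. The slides do not show what the generated Perl looks like, so this is only a sketch of the idea (rulebase in, domain-specific source text out); `translate`, `RULES` and the sample rulebase values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rulebase following the structure declared in nerc.dtd.
RULEBASE = """\
<nerc domain="laptops">
  <feature no="1" name="SPEED" type="INTEGER" if="">
    <element norm="MHz"><form>[Mm][Hh][Zz]</form></element>
  </feature>
</nerc>
"""


def translate(xml_text):
    """Emit the source text of a small domain-specific rule table."""
    root = ET.fromstring(xml_text)
    rules = [
        (feature.get("name"), element.get("norm"), form.text)
        for feature in root.findall("feature")
        for element in feature.findall("element")
        for form in element.findall("form")
    ]
    lines = [f"# Generated from the {root.get('domain')} rulebase", "RULES = ["]
    lines += [f"    {rule!r}," for rule in rules]
    lines.append("]")
    return "\n".join(lines) + "\n"


generated_source = translate(RULEBASE)
print(generated_source)

# The generated text is itself a program: executing it defines RULES.
namespace = {}
exec(generated_source, namespace)
```

Keeping the translator generic means that a new domain only needs a new rulebase file, which matches the domain-independent role the slide gives to xml2nerc.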
Nerc.pm
• Perl module
• Domain-independent pattern matching
• Domain-independent disambiguation
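A minimal sketch of domain-independent pattern matching, in Python rather than the project's Perl. It assumes that a numeric value directly precedes the unit form, which the slides do not state; the CAPACITY rule and the sample sentence are hypothetical, and disambiguation is not attempted because its algorithm is not described.

```python
import re

# (feature, norm, form) triples. Only the MHz form comes from the slides;
# the CAPACITY rule is a made-up illustration.
RULES = [
    ("SPEED", "MHz", r"[Mm][Hh][Zz]"),
    ("CAPACITY", "GB", r"[Gg][Bb]|[Gg][Oo]"),
]


def match_entities(text, rules):
    """Return (feature, value, norm) triples in order of appearance."""
    found = []
    for feature, norm, form in rules:
        # Assumption: a number, optional whitespace, then the unit form.
        pattern = re.compile(r"(\d+(?:[.,]\d+)?)\s*(?:" + form + r")")
        for match in pattern.finditer(text):
            found.append((match.start(), feature, match.group(1), norm))
    # Sort by position in the text, then drop the position.
    return [(feature, value, norm) for _, feature, value, norm in sorted(found)]


entities = match_entities("Pentium III 850 MHz, disque dur 20 Go", RULES)
print(entities)
```

The matcher itself knows nothing about laptops: all domain knowledge sits in `RULES`, so the same function could serve another product domain with a different rule table.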
nerc-laptops.pl
• Generated domain-dependent Perl program
• Applies pattern matching and disambiguation
• Generates the named entities that are recognized
• Refers to Nerc.pm: matching and disambiguation algorithms
FNERC Development & Maintenance
[Diagram: maintenance levels placed across nerc.dtd, xml2nerc / Nerc.pm and laptops.xml, along an axis running from domain dependent to domain independent]
• Level 0: New PCDATA string
• Level 1: Attributes value
• Level 2: New PCDATA regex
• Level 3: New attribute value
• Level 4: New attribute enum.
• Level 5: New attribute
Perspectives
• WP1: Experimenting with the NERC as a better evaluation function for the topic spider
• WP2: Improving the FNERC
• WP3: Implementing disambiguation techniques for the Fact Extractor Module