Ph.D. Research Proposal

RegLocator – A Regulation Management System Enhanced By Domain Knowledge

Haoyi Wang

March 16, 2005

Committee Members: Prof. Kincho Law

Prof. Gio Wiederhold

Prof. Eduardo Miranda

3/16/2005 Engineering Informatics Group

Topics

Problems Introduction

Project Objectives

Related Work

System Architecture

Expected Contributions


Introduction to U.S. Statutes

Two major parts: federal and state.

Three types of law: constitutions, codes, and regulations.

Regulations are rules made by government agencies.

At the federal level there are 50 titles, while regulations in different states have diverse structures.

[Figure: U.S. Statutes hierarchy. Federal branch: Constitution, U.S. Codes, Regulations, with Title 1 … Title 50. States branch (AL … WY): Constitution, Codes, Regulations, with Title 1 … Title N.]


Structures Inside Regulations

Internal hierarchy (Title, Division, Chapter, Article, Section, etc.).

Listed in subject order (Administration, Food, Business, etc.).

[Figure: regulation tree. Regulation → Title 1, Title 2, … Title N → Division 1, Division 2, … Division N.]

The following definitions shall apply to the regulations contained in this chapter …

Source: Code of California Regulation, http://ccr.oal.ca.gov/





Problems In Regulation Informatics

Many parties need access to regulations.

The complexities of federal and state regulations:

• Distributed resources;

• Diverse document formats (PDF, Word, HTML, etc.);

• Semi-structured information: traditional information retrieval approaches do not consider such information.

Resulting problems:

• Make access to regulations inefficient and ineffective;

• Increase the risk of companies failing to comply with regulations;

• Hinder public understanding of the government.


Distributed Regulation Resources – External


Distributed Regulation Resources – Internal

A specific topic may span multiple regulation sections;

The size of the regulation tree grows as $O(n_c^d)$, where $n_c$ is the average number of children and $d$ is the depth of the tree;

• Manually searching for one topic becomes hard as $d$ grows;

Example: Environmental Compliance Assistance Platform (ENVCAP)

• An application for environmental regulations;

• Its environmental directory of related regulations is built manually.

Source: http://www.envcap.org
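The growth rate above can be made concrete with a small calculation. This is a minimal sketch that assumes an idealized tree in which every element has exactly the same number of children; `tree_size` is a hypothetical helper, not part of RegLocator.

```python
def tree_size(nc, d):
    """Number of nodes in a full tree with branching factor nc and depth d.

    The root is at depth 0, so there are nc**level nodes at each level.
    """
    return sum(nc ** level for level in range(d + 1))


# With 10 children per element, two extra levels multiply the size by about 100.
shallow = tree_size(10, 3)
deep = tree_size(10, 5)
print(shallow, deep)
```

Real regulation trees are unbalanced, so the figure is only an order-of-magnitude guide.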


Distributed Regulation Resources – Internal (Cont.)

Limitations of ENVCAP

• The regulations are distributed across the states’ web sites;

• Building the directory is labor-intensive, and the resulting directory is shallow:

• A single subject, such as mercury, appears in multiple locations;

• Some document formats, such as PDF, cannot be divided into parts.

General Subjects


Content and Structural Query

A query may include structural restrictions:

Structural Restriction


Content and Structural Query (Cont.)

Search for “Environment AND "waste water"” in the CFR:

• Irrelevant results;

• Result sets of unreadable size.


Content Only Query

Search only on content, but regulations are categorized.

[Figure labels: Tax Payer; Waste Water; Subjects]


Research Objectives

Develop a universal, centralized platform that handles distributed, diverse regulation files and explores their structural information;

Define mechanisms that interpret a user’s query using the domain knowledge within regulations, such as hierarchies and categories;

Improve the traditional relevance algorithm so that it ranks search results with the features of semi-structured files taken into account.


General Web Search Engine

Behind the scenes, ranking uses more than the content of web pages:

• Link analysis: the popularity of a web page is decided by other pages;

• Metadata, which can be more useful than content:

• Domain and URL path;

• HTML meta tags;

• Titles, captions, etc.

• Tweaks to the ranking of results:

• By geographic location, e.g., local search;

• By commercial considerations.


Information Systems on Regulations

NaviLex (E. Pietrosanti and B. Graziadio, 1999)

• Searches and navigates Italian banking legislation;

• Models the structural, conceptual, and functional dimensions of legal documents;

• Supports queries like “which regulation part includes a concept C and C is the subject of an obligation relation?”

Schema: Obligation

Subject <subject of the Obligation>

Action <activity performed by the subject>

Object <object of the action>

Context <relevant concepts>

Schema: Definition

Definiendum <concept to be defined>

Definiens <defining concepts>

Definition context <relevant concepts>


Information Systems on Legal Cases

Several systems for representing legal cases:

• SMILE, extracts factors from cases in trade secret law (Brüninghaus & Ashley, 2001)

• EMBRACE, a framework for reasoning about refugee law (Yearwood & Stranieri, 1999)

• SPIRE & INQUERY, a CBR+IR system on bankruptcy law (Daniels & Rissland, 1997)

Common approach:

• Experts define a set of features to represent one type of case;

• Given a new case, the values of these features are identified;

• Further processing of the cases can then be performed with IR or case-based reasoning techniques.


Information Systems on XML Files

XML is a popular standard for representing and exchanging knowledge.

XPath – a standard for addressing parts of an XML document by its structure

• Example: in “book.xml”, /book/chapter or /book/chapter/section.

XQuery – a standard query language for retrieving elements from XML files.

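[Figure: tree of an example XML document. A book element has author (John Smith), title (XML Retrieval), and two chapter elements with heading and section children; the headings include "Introduction" and "XML Query Language XQL", with text such as "This …" and "We describe syntax of XQL".]

Source: Fuhr et al., 2001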
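The two path expressions from the XPath example can be tried on a small document. This is a sketch only: the XML below is a hypothetical stand-in for "book.xml" (the section heading "Syntax" is invented for illustration), and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, where paths are written relative to the root element, so /book/chapter becomes "chapter" when evaluated from the book element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for "book.xml".
BOOK_XML = """<book>
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter>
    <heading>Introduction</heading>
  </chapter>
  <chapter>
    <heading>XML Query Language XQL</heading>
    <section>
      <heading>Syntax</heading>
    </section>
  </chapter>
</book>"""

root = ET.fromstring(BOOK_XML)  # root is the <book> element

# /book/chapter: chapter elements directly under the root.
chapters = root.findall("chapter")
# /book/chapter/section: section elements directly under a chapter.
sections = root.findall("chapter/section")

chapter_headings = [c.findtext("heading") for c in chapters]
print(chapter_headings)
print(len(sections))
```

The point for regulations is that such paths select by element type, which presupposes that the user knows the tree structure.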


Information Systems on XML Files (Cont.)

XML Systems

• Data-centric:

• Structural mapping (O2SQL);

• Model mapping (XRel, Tequyla-TX).

• Document-centric:

• DB approach (IRQL, PowerDB, Timber, XQueryIR);

• IR approach (TIX, HyRex, XPres, XXL).

[Figure axis label: Complexity & Capability]


Distinguishing Characteristics of RegLocator

Document-centric XML repository for regulations

Challenges of RegLocator

• The hierarchy structure is defined by titles, not by element types;

• Users care only about the content and have no knowledge of the tree structure;

• A query is a bag of words: how can structural information be found within it?

• The structural information in a query must be matched with the underlying content;

• Other domain knowledge in regulations, such as categories and references, should be utilized.




Centralized Regulation Database


Mining Domain Knowledge

A complete formal representation is hard to build

• First-order logic (FOL), knowledge bases, ontologies, etc.

• No automatic building process exists yet. Knowledge engineering is a reasonable approach.

• Use available information extraction and text mining methods to identify partial domain knowledge at a shallow level (features).

• Title hierarchies, concepts, references, etc.;

• Relationships between concepts and categories.

TITLE 17. Public Health Division 3. Air Resources Chapter 1. Air Resources Board Subchapter 2.6 Air Pollution Control District Rules (CCR)

http://ccr.oal.ca.gov/cgi-bin/om_isapi.dll?clientID=110848&hitsperheading=on&infobase=ccr&record=%7B49A3F%7D&softpage=Document42


Main Processing Pipeline

[Figure: kernel processing pipeline with the components Internet, Regulation Crawler, Reg. DB, Hierarchy Identifier, Entity Extractor, Content Analyzer, Content Indexer, Feature Indexer, Content DB, Feature DB, and Search Engine; legend labels: New, Old.]


Major Components of RegLocator

Render Engine

1. Search Box

2. Subject Directory

3. Results Rendering

Index Engine

1. Content

2. Domain features

Content Engine

1. Web Crawler

2. Shallow Parser

3. Feature Extractor

4. Content Analyzer

Search Engine

1. Basic Score

2. Feature Score

3. Structural Score

Test Engine

1. Functional Test

2. Performance Test


Content Engine – Web Crawler

Structure of the web: hyperlinked online documents.

Crawling the web sites

• General processor + configurations;

• Start from a control center;

• Find links in a downloaded page;

• Follow the links to get more pages;

• Avoid loops.

outputDir = HI
startTOC = http://www.hawaii.gov/dlnr/AdminRulesIdx.htm
maxDepth=2

linkPattern1 = ^Final.*Rules
matchLink1 = false
filePattern1 = .*/dlnr/.*\\.pdf
indexPattern1 = .*

linkPattern2 = .*
filePattern2 = .*/dlnr/.*\\.pdf
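The crawling steps and the configuration keys above suggest a breadth-first traversal with a depth limit and a visited set. The sketch below is one possible reading, not the RegLocator crawler: it assumes that a link pattern is matched against the link text and a file pattern against the target URL, and it replaces HTTP requests with a toy in-memory link graph (`SITE`, with hypothetical URLs) so that it runs offline.

```python
import re
from collections import deque


def crawl(start, fetch_links, link_pattern, file_pattern, max_depth):
    """Breadth-first crawl that returns the file URLs matching file_pattern."""
    link_re = re.compile(link_pattern)
    file_re = re.compile(file_pattern)
    visited = {start}              # URLs already seen, to avoid loops
    queue = deque([(start, 0)])
    files = []
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue               # do not expand pages at the depth limit
        for text, target in fetch_links(url):
            if target in visited:
                continue
            visited.add(target)
            if file_re.fullmatch(target):
                files.append(target)               # a regulation file to download
            elif link_re.search(text):
                queue.append((target, depth + 1))  # a page worth following
    return files


# Toy link graph: page -> list of (link text, target URL). All names are hypothetical.
SITE = {
    "toc": [("Final Administrative Rules", "rules"), ("Contact", "contact")],
    "rules": [("Chapter 1", "x/dlnr/ch1.pdf"), ("Back to index", "toc")],
    "contact": [("Hidden", "x/dlnr/hidden.pdf")],
}

found = crawl("toc", lambda url: SITE.get(url, []),
              r"^Final.*Rules", r".*/dlnr/.*\.pdf", max_depth=2)
print(found)
```

The "Contact" page is never expanded because its link text does not match the link pattern, and the "Back to index" link is skipped by the visited set.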


Content Engine – Shallow Parser

[Figure: WORD, HTML, and PDF files → TextConverter → TEXT → StructuralConverter → XML]

Configurations

Type patterns (junk, hierarchy, table, etc.)

Sample patterns for content filtering:

# Remove the title of the pages
/TITLE 18. ENVIRONMENTAL QUALITY//

# Remove the table of contents
/<center>Supp\.(.*?)ARTICLE 1\.(.*?)ARTICLE 1\./<p>ARTICLE 1\./s

Sample patterns for hierarchy recognition:

0@^<p>(CHAPTER (\d+))\. (.*?)$@<p><a NAME="$1" LEVEL="2" TITLE="$3">$3<\/a>@0
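To show what the hierarchy recognition pattern does, here is a Python rendering of the CHAPTER rule. It is a sketch under assumptions: the `$1`/`$3` group references are translated to Python's `\1`/`\3`, the meaning of the leading and trailing `0` fields in the original rule is not modeled, and the sample heading text is hypothetical.

```python
import re

# Python rendering of the CHAPTER rule; $1 and $3 become \1 and \3.
CHAPTER = re.compile(r"^<p>(CHAPTER (\d+))\. (.*?)$", re.MULTILINE)
REPLACEMENT = r'<p><a NAME="\1" LEVEL="2" TITLE="\3">\3</a>'


def tag_chapters(html):
    """Wrap each chapter heading in an anchor that records its level and title."""
    return CHAPTER.sub(REPLACEMENT, html)


sample = "<p>CHAPTER 2. WATER QUALITY STANDARDS\n<p>Some body text."
tagged = tag_chapters(sample)
print(tagged)
```

After this step the hierarchy level of a heading is explicit in the markup, so a later converter can build the tree without re-parsing the text.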


Content Engine – Feature Extractor

All salt, table salt, iodized salt, or iodized table salt in packages intended for retail sale shipped in interstate commerce 18 months after the date of publication of this statement of policy in the FEDERAL REGISTER, shall be labeled as prescribed by this section; and if not so labeled, the Food and Drug Administration will regard them as misbranded within the meaning of sections 403 (a) and (f) of the Federal Food, Drug, and Cosmetic Act. (21.CFR.100.155)

Concept and Reference

Source: Stanford PCFG parser


Content Engine – Significant Concept

Relationship between concepts and categories, found by corpora comparison:

• Assume a topic-related concept must have a significant distribution in the corpus related to that topic;

• Decide whether a concept is related to a topic by comparing its distribution in this corpus with its distributions in others;

Distribution of a concept in a corpus:

• View a corpus as a list of words;

• Divide this list into many fixed-length regions;

• A sample point is the number of occurrences of the concept within a region.

Example: given a short corpus, build a sample for “food” when each region is 20 words long.

• The sample for “food” in this corpus is (0,0,1,1).
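The sampling procedure above can be sketched in a few lines. The function name `region_sample` and the eight-word corpus are hypothetical; the sketch handles single-word concepts only and lets the last region be shorter when the corpus length is not a multiple of the region length.

```python
def region_sample(words, concept, region_len):
    """Count occurrences of `concept` in each fixed-length region of `words`."""
    return [
        words[start:start + region_len].count(concept)
        for start in range(0, len(words), region_len)
    ]


# Toy corpus of 8 words split into regions of 4 words each.
corpus = "salt label food drug act food food rule".split()
sample = region_sample(corpus, "food", 4)
print(sample)
```

Each entry of the returned list is one sample point, so a corpus of 2M words with 20K-word regions would yield a much longer list than this toy case.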


Content Engine – Significant Concept(Cont.)

Sample: compare 21CFR (Food and Drugs) and 40CFR (Environmental Protection)

• Find the significant concepts in 21CFR;

• List length: both corpora have more than 2M words;

• Each region has 20K words, about 10 points in each sample.

Concept types, p-values, and examples:

• Specific to 21CFR with high frequency (p-value close to 0): “alcohol”, “beef”, “cheese”, “cream”, “cup”;

• Specific to 21CFR with low frequency (p-value smaller than 0.5 but larger than 0.05): “bakery”, “cigarette”, “chocolate”, “cookie”, “wax”;

• Common to 21CFR and 40CFR (p-value close to 0.5): “abstract”, “act”, “fact”, “paragraph”, “section”;

• More specific to 40CFR (p-value close to 1.0): “crop”, “dispose”, “field”, “waste”, “water”.
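The slides do not name the statistical test behind these p-values. As an illustration only, the sketch below swaps in a one-sided permutation test on the difference of sample means; it reproduces the qualitative behavior (a small p-value when the concept is much denser in the first corpus, a value near 1 in the reverse case) but is not claimed to be the test used here. The sample values are hypothetical region counts.

```python
import random


def one_sided_p_value(sample_a, sample_b, trials=2000, seed=0):
    """Estimate P(mean(a) - mean(b) >= observed) under random relabelling."""
    rng = random.Random(seed)
    n_a = len(sample_a)
    observed = sum(sample_a) / n_a - sum(sample_b) / len(sample_b)
    pooled = list(sample_a) + list(sample_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        a, b = pooled[:n_a], pooled[n_a:]
        if sum(a) / len(a) - sum(b) / len(b) >= observed:
            hits += 1
    return hits / trials


# Hypothetical per-region counts of one concept in two corpora.
counts_21cfr = [9, 8, 10, 9, 11]
counts_40cfr = [0, 1, 0, 2, 1]

p_specific = one_sided_p_value(counts_21cfr, counts_40cfr)
p_reverse = one_sided_p_value(counts_40cfr, counts_21cfr)
print(p_specific, p_reverse)
```

A rank-based test would be less sensitive to a few very dense regions; the choice between the two is left open here.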


Content Engine – Regulation in XML Form

<!ELEMENT regulation (regElement+)>
<!ATTLIST regulation id ID #REQUIRED name CDATA #REQUIRED type CDATA #REQUIRED>
<!ELEMENT regElement (concept*, regText?, regElement*, reference*)>
<!ATTLIST regElement id ID #REQUIRED name CDATA #REQUIRED>
<!ELEMENT concept EMPTY>
<!ATTLIST concept name CDATA #REQUIRED times CDATA #REQUIRED>
<!ELEMENT reference EMPTY>
<!ATTLIST reference id CDATA #REQUIRED times CDATA #REQUIRED>
<!ELEMENT regText (#PCDATA | paragraph)*>
<!ELEMENT paragraph (#PCDATA | pre | img)*>
<!ELEMENT pre (#PCDATA)>
<!ELEMENT img (#PCDATA)>

<regulation id="40.cfr.1" name="STATEMENT OF ORGANIZATION AND GENERAL INFORMATION" type="federal">
  <regElement id="40.cfr.1.A" name="-- Introduction">
    <regElement id="40.cfr.1.1" name="Creation and authority.">
      <concept name="executive branch" times="1" />
      <concept name="environmental protection agency" times="1" />
      <regText>
        <paragraph>Reorganization Plan 3 of 1970, established the U.S. Environmental Protection Agency (EPA) in the Executive branch as an independent Agency, effective December 2, 1970.</paragraph>
      </regText>
    </regElement>
    …
  </regElement>
</regulation>
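A regulation stored in this form can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened copy of the 40 CFR sample (the paragraph text is abbreviated and the elided content is dropped so that the string is well-formed); `concepts_by_element` is a hypothetical helper that collects the concept counts attached directly to each regElement.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the sample regulation.
SAMPLE = """<regulation id="40.cfr.1" name="STATEMENT OF ORGANIZATION AND GENERAL INFORMATION" type="federal">
  <regElement id="40.cfr.1.A" name="-- Introduction">
    <regElement id="40.cfr.1.1" name="Creation and authority.">
      <concept name="executive branch" times="1" />
      <concept name="environmental protection agency" times="1" />
      <regText><paragraph>Reorganization Plan 3 of 1970 ...</paragraph></regText>
    </regElement>
  </regElement>
</regulation>"""


def concepts_by_element(xml_text):
    """Map each regElement id to its own {concept name: times} dictionary."""
    root = ET.fromstring(xml_text)
    table = {}
    for element in root.iter("regElement"):
        # findall("concept") returns direct children only, not nested ones.
        table[element.get("id")] = {
            c.get("name"): int(c.get("times")) for c in element.findall("concept")
        }
    return table


concepts = concepts_by_element(SAMPLE)
print(concepts)
```

Keeping the concepts at the element where they occur, rather than aggregating them upward, is what lets later stages rank individual elements instead of whole documents.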


Index Engine – Inverted Index

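Terms (dictionary), each with pointers to its postings:

Term        N docs   Tot Freq   Postings (Doc #, Freq)
ambitious   1        1          (2, 1)
be          1        1          (2, 1)
brutus      2        2          (1, 1) (2, 1)
capitol     1        1          (1, 1)
caesar      2        3          (1, 1) (2, 2)
did         1        1          (1, 1)
enact       1        1          (1, 1)
hath        1        1          (2, 1)
I           1        2          (1, 2)
i'          1        1          (1, 1)
it          1        1          (2, 1)
julius      1        1          (1, 1)
killed      1        2          (1, 2)
let         1        1          (2, 1)
me          1        1          (1, 1)
noble       1        1          (2, 1)
so          1        1          (2, 1)
the         2        2          (1, 1) (2, 1)
told        1        1          (2, 1)
you         1        1          (2, 1)
was         2        2          (1, 1) (2, 1)
with        1        1          (2, 1)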

Source: CS276a, Fall, 2002
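As a minimal sketch of how such an index is built, the code below maps each term to a postings list of (doc #, frequency) pairs. The two documents are hypothetical, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):          # keeps postings ordered by doc id
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)


docs = {
    1: "waste water discharge permit",
    2: "water quality and water permit",
}
index = build_inverted_index(docs)

# Dictionary entry for one term: number of documents and total frequency.
n_docs = len(index["water"])
total_freq = sum(freq for _, freq in index["water"])
print(index["water"], n_docs, total_freq)
```

The dictionary columns (N docs, Tot Freq) are derived from the postings, so only the postings need to be stored.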


Index Engine – Domain Knowledge

Tables storing relationships among regulation elements and features

• Title hierarchies;

• Significant concepts;

• Tree structure;

TITLE 17. Public Health Division 3. Air Resources Chapter 1. Air Resources Board Subchapter 2.6 Air Pollution Control District Rules

Element ID1 Feature 1 Feature 2 … Parent Child 1 …




Search Engine – Three Stages

First stage:

• Identify the candidate documents by query words;

• Rank the candidate elements by tf.idf.

Second stage:

• Re-rank the elements by mapping their domain features to the structural information in the user’s query.

Third stage:

• Tune the ranking by other internal structures.


Search Engine – First Stage

Rank the candidate elements by tf.idf:

Represent both the candidate document and the query as vectors.

Each element in a vector is the weight for a term:

• Term Frequency (tf) measures the term density in a document;

• Inverse Document Frequency (idf) measures the term’s informativeness;

• The weight for term i in document d:

$$w_{i,d} = tf_{i,d} \times \log\frac{n}{df_i}$$

where $n$ is the number of documents and $df_i$ is the number of documents that contain term i.

Relevance Score

• Similarity is measured by the cosine of the angle between two vectors:

$$sim(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}|\,|\vec{d_j}|} = \frac{\sum_{k} w_{i,k}\, w_{j,k}}{\sqrt{\sum_{k} w_{i,k}^2}\,\sqrt{\sum_{k} w_{j,k}^2}}$$

where $w_{i,k}$ is the weight of term k in vector $d_i$ and the sums run over all terms k.
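A compact sketch of the first-stage scoring, assuming whitespace tokenization and the plain tf × log(n/df) weight; the three short documents are hypothetical and sparse vectors are stored as dictionaries.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with w = tf * log(n / df)."""
    n = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]
    df = Counter(term for c in counts for term in c)   # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["waste water discharge", "waste water treatment", "food labeling rules"]
v0, v1, v2 = tfidf_vectors(docs)
print(cosine(v0, v1), cosine(v0, v2))
```

A query can be scored the same way by treating it as one more short document; how its terms should be weighted is a design choice the formulas leave open.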


Search Engine – Second Stage

Map the domain features in candidate elements to the structural information in the user’s query:

• Keywords in the title hierarchy of an element:

• E.g., for the query “environment, waste water”, suppose an element has “waste water” in its content and “environment” in its titles;

• This element is a better candidate for the query.

• Keywords that are significant concepts of a category:

• The elements from this category are more relevant.

The ranking scores of candidates should be adjusted according to these relationships.


Search Engine – Second Stage (Cont.)

score_d_s = sum_l (w_l * sum_t (tf_q * tf_t) / norm_d_l)
where:
  score_d_s : score for document d on title hierarchy
  sum_l : sum over all title levels l
  w_l : weight for level l
  sum_t : sum over all terms t
  tf_q : the square root of the frequency of t in the query
  tf_t : the square root of the frequency of t in the title of d at level l
  norm_d_l : normalization denominator for d at title level l

score_d_c = sum_c (w_c * tf_q * tf_c / norm_d_c)
where:
  score_d_c : score for document d on concept significance
  sum_c : sum over all concepts c
  w_c : weight for concept c
  tf_c : the square root of the frequency of c in d
  norm_d_c : normalization denominator for d on concepts

score_new = p_d * score_d + p_s * score_d_s + p_c * score_d_c
where:
  score_new : updated ranking score from the second stage
  p_d, p_s, p_c : tuning parameters
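The three formulas translate directly into code. The sketch below assumes that score_d is the first-stage tf.idf score and that tf_q in the concept formula is the square root of the concept's frequency in the query; every weight, norm, and frequency in the example is a hypothetical value chosen for easy arithmetic.

```python
import math


def title_score(query_tf, title_tf_by_level, level_weights, norms):
    """score_d_s: weighted match between query terms and titles at each level."""
    total = 0.0
    for level, title_tf in title_tf_by_level.items():
        overlap = sum(
            math.sqrt(q_freq) * math.sqrt(title_tf.get(term, 0))
            for term, q_freq in query_tf.items()
        )
        total += level_weights[level] * overlap / norms[level]
    return total


def concept_score(query_tf, concept_tf, concept_weights, norm):
    """score_d_c: weighted match between query terms and significant concepts."""
    return sum(
        concept_weights.get(c, 0.0) * math.sqrt(query_tf.get(c, 0)) * math.sqrt(freq)
        for c, freq in concept_tf.items()
    ) / norm


def combined_score(score_d, score_d_s, score_d_c, p_d, p_s, p_c):
    """score_new: linear combination with tuning parameters."""
    return p_d * score_d + p_s * score_d_s + p_c * score_d_c


# Query "environment, waste water" against one hypothetical element.
query_tf = {"environment": 1, "waste water": 1}
s = title_score(query_tf, {1: {"environment": 1}, 2: {}}, {1: 0.5, 2: 1.0}, {1: 1.0, 2: 1.0})
c = concept_score(query_tf, {"waste water": 4}, {"waste water": 0.5}, norm=2.0)
new_score = combined_score(0.2, s, c, p_d=1.0, p_s=0.5, p_c=0.5)
print(s, c, new_score)
```

In this example the element gains from both routes: "environment" matches a title at level 1 and "waste water" is a weighted concept in its content.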


Search Engine – Third Stage

Tune the relevance scoring by other internal structures:

• Tree: rank elements with their granularity in mind.

• References.

[Figure: tree of regulation elements S1 … S10]


Render Engine

Provides GUIs for the search box and/or the subject directory

Browsing formats for retrieved results

• Traditional list interface;

• Tree interface.

Source: HCIL Space Tree


Test Engine

Functionality test

• E.g., query format, special index, relevance scoring.

System performance

• Speed and cost to accomplish a query.

Result quality

• Precision – fraction of retrieved docs that are relevant.

• Recall – fraction of relevant docs that are retrieved.

• Coverage – results are not too specific or too general.
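Precision and recall as defined above reduce to set arithmetic. A minimal sketch follows; the element ids imitate the id format used in the XML repository but are hypothetical, as is the relevance judgment.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["40.cfr.1.1", "40.cfr.1.3", "21.cfr.100.155", "40.cfr.2.1"]
relevant = ["40.cfr.1.1", "40.cfr.1.3", "40.cfr.1.5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Coverage has no equally standard formula, so it is left out of the sketch.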


Expected Applications of RegLocator

Crawling different state web sites and collecting regulations from related resources;

Identifying the structural information within regulations and storing it in an XML repository;

Mapping the structural information from a query to the content of the regulations;

Ranking the relevant regulation elements with domain features taken into account;

Building a regulation subject directory automatically with an acceptable accuracy rate.


Expected Contributions of RegLocator

Management kernel for the U.S. regulation system;

Framework for managing semi-structured documents;

Approaches to knowledge engineering and information retrieval on regulatory information;

Relevance algorithms for searching regulation documents.


Acknowledgements

Committee Members:

• Prof. Gio Wiederhold

• Prof. Eduardo Miranda

• Prof. Kincho Law

Research Colleagues:

• Gloria Lau, Shawn Kerrigan, Jun Peng, Chuck Han, Xianshan Pan, and Yang Wang.

NSF Grant No.: EIA-9983368


Q & A
