Ph.D. Research Proposal
RegLocator – A Regulation Management System Enhanced By Domain Knowledge
Haoyi Wang
March 16, 2005
Committee Members: Prof. Kincho Law
Prof. Gio Wiederhold
Prof. Eduardo Miranda
3/16/2005 Engineering Informatics Group 2
Topics
Problems Introduction
Project Objectives
Related Work
System Architecture
Expected Contributions
Introduction on U.S. Statutes
Two major parts: federal and state.
Three types of law: constitution, codes and regulations.
Regulations are rules made by government agencies.
Federal regulations are organized into 50 titles, while regulations in different states have diverse structures.
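[Diagram: U.S. Statutes divide into Federal and States (AL … WY). The federal branch contains the Constitution, U.S. Codes and Regulations (Title 1 … Title 50); each state branch contains a Constitution, Codes and Regulations (Title 1 … Title N).]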
Structures Inside Regulations
Internal hierarchy (Title, Division, Chapter, Article, Section, etc.).
Listed in subject order (Administration, Food, Business, etc.).
[Diagram: a Regulation branches into Title 1, Title 2, … Title N; a title branches into Division 1, Division 2, … Division N; a leaf holds text such as “The following definitions shall apply to the regulations contained in this chapter …”]
Source: Code of California Regulation, http://ccr.oal.ca.gov/
Problems In Regulation Informatics
Many parties need access to regulations.
The complexities of federal and state regulations:
• Distributed resources;
• Diverse document formats (PDF, Word, HTML, etc.);
• Semi-structured information, which traditional information retrieval approaches do not exploit.
Consequences:
• Access to regulations is inefficient and ineffective;
• Companies face a higher risk of failing to comply with regulations;
• Public understanding of the government is hindered.
Distributed Regulation Resources – External
Distributed Regulation Resources – Internal
A specific topic may span multiple regulation sections;
The size of the regulation tree grows as O(nc^d), where nc is the average number of children per node and d is the depth of the tree;
• Manually searching for one topic becomes hard as d grows.
Example: Environmental Compliance Assistance Platform (ENVCAP)
• An application on environmental regulations;
• Its environmental directory of related regulations is built manually.
Source: http://www.envcap.org
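The O(nc^d) growth estimate for the regulation tree can be made concrete with a small calculation. The sketch below assumes an idealized tree in which every node has exactly nc children, which real regulations only approximate; the function name and the sample values are illustrative.

```python
def tree_size(nc: int, d: int) -> int:
    """Number of nodes in a complete tree with branching factor nc and depth d.

    Level 0 is the root, so the total is 1 + nc + nc**2 + ... + nc**d,
    which is dominated by the last term nc**d.
    """
    return sum(nc ** level for level in range(d + 1))


# With 10 children per node, each extra level multiplies the size by about 10.
for depth in range(1, 6):
    print(depth, tree_size(10, depth))
```

Even a modest branching factor makes manual browsing impractical after a few levels, which is the motivation for automated search over the hierarchy.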
3/16/2005 Engineering Informatics Group 8
Distributed Regulation Resources – Internal (Cont.)
Limitations of ENVCAP
• The regulations are distributed across the states’ web sites;
• Building the directory is labor-intensive, and the resulting directory is shallow:
• Multiple locations for one subject, such as mercury;
• Indivisible document formats, such as PDF.
General Subjects
3/16/2005 Engineering Informatics Group 9
Content and Structural Query
A query may include structural restrictions:
Structural Restriction
3/16/2005 Engineering Informatics Group 10
Content and Structural Query (Cont.)
Search “Environment AND “waste water”” in CFR:
Irrelevant result
Unreadable size
3/16/2005 Engineering Informatics Group 11
Content Only Query
The search considers only content, although regulations are already categorized by subject.
[Figure labels: Tax Payer, Waste Water, Subjects]
3/16/2005 Engineering Informatics Group 12
Research Objectives
Develop a universal, centralized platform that handles distributed, diverse regulation files and exploits their structural information;
Define mechanisms to interpret a user’s query using the domain knowledge within regulations, such as hierarchies and categories;
Improve the traditional relevance algorithm so that the ranking of search results accounts for the features of semi-structured files.
3/16/2005 Engineering Informatics Group 13
General Web Search Engine
Behind the scenes, ranking relies on more than the content of web pages:
• Link analysis: the popularity of a web page is decided by other pages;
• Metadata, often more useful than content:
• Domain and URL path;
• HTML meta tags;
• Titles, captions, etc.
• Tweaks to the ranking of results:
• By geographic location, e.g., local search;
• By commercial considerations.
3/16/2005 Engineering Informatics Group 14
Information Systems on Regulations
NaviLex (E. Pietrosanti and B. Graziadio, 1999)
• Searches and navigates Italian banking legislation;
• Models the structural, conceptual and functional dimensions of legal documents;
• Supports queries like “which regulation part includes a concept C and C is the subject of an obligation relation?”
Schema: Obligation
Subject <subject of the Obligation>
Action <activity performed by the subject>
Object <object of the action>
Context <relevant concepts>
Schema: Definition
Definiendum <concept to be defined>
Definiens <defining concepts>
Definition context <relevant concepts>
3/16/2005 Engineering Informatics Group 15
Information Systems on Legal Cases
Several systems represent legal cases:
• SMILE, extracts factors from cases in trade secret law (Brüninghaus & Ashley, 2001)
• EMBRACE, a framework for reasoning about refugee law (Yearwood & Stranieri, 1999)
• SPIRE & INQUERY, a CBR+IR system on bankruptcy law (Daniels & Rissland, 1997)
Common approach:
• Experts define a set of features to represent one type of case;
• Given a new case, the values of these features are identified;
• Further processing of the cases can then use IR techniques or case-based reasoning.
3/16/2005 Engineering Informatics Group 16
Information Systems on XML Files
XML is a popular standard for representing and exchanging knowledge.
XPath – a standard for describing paths into the structure of XML files
• Example: in “book.xml”, /book/chapter or /book/chapter/section.
XQuery – a standard query language for retrieving elements from XML files.
[Figure: XML tree for a book with author “John Smith”, title “XML Retrieval” and two chapters with headings “Introduction” (text “This …”) and “XML Query Language XQL” (section text “We describe syntax of XQL”).]
Source: Fuhr et al., 2001
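The path expressions on this slide can be tried with Python's standard library. This is a minimal sketch: the XML text is an assumed reconstruction of the book figure, and `xml.etree.ElementTree` supports only a subset of XPath, with paths evaluated relative to an element, so /book/chapter becomes ./chapter when applied to the root.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the "book.xml" example; element layout is illustrative.
BOOK_XML = """
<book>
  <author>John Smith</author>
  <title>XML Retrieval</title>
  <chapter>
    <heading>Introduction</heading>
    <section>This ...</section>
  </chapter>
  <chapter>
    <heading>XML Query Language XQL</heading>
    <section>We describe syntax of XQL</section>
  </chapter>
</book>
"""

root = ET.fromstring(BOOK_XML)                 # root is the <book> element
chapters = root.findall("./chapter")           # corresponds to /book/chapter
sections = root.findall("./chapter/section")   # corresponds to /book/chapter/section
headings = [chapter.findtext("heading") for chapter in chapters]

print(len(chapters), len(sections), headings)
```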
3/16/2005 Engineering Informatics Group 17
Information Systems on XML Files (Cont.)
XML Systems
Data-Centric Doc-Centric
Structural Mapping
(O2SQL)
Model Mapping
(XRel, Tequyla-TX)
DB Approach
(IRQL, PowerDB, Timber, XQueryIR)
IR Approach
(TIX, HyRex, XPres, XXL)
Complexity & Capability
3/16/2005 Engineering Informatics Group 18
Distinguishing Characteristics of RegLocator
Document-centric XML repository for regulations
Challenges of RegLocator
• The hierarchy structure is defined by titles, not by element types;
• Users care only about the content and have no knowledge of the tree structure;
• A query is a bag of words: how can structural information be found within it?
• The structural information in a query must be matched with the underlying content;
• Other domain knowledge in regulations, such as categories and references, should be utilized.
3/16/2005 Engineering Informatics Group 19
Topics
Problems IntroductionProblems Introduction
Project ObjectivesProject Objectives
Related WorkRelated Work
System ArchitectureSystem Architecture
Expected ContributionsExpected Contributions
3/16/2005 Engineering Informatics Group 20
Centralized Regulation Database
3/16/2005 Engineering Informatics Group 21
Mining Domain Knowledge
A complete formal representation is hard to build
• First order logic (FOL), knowledge base, ontology, etc.
• An automatic building process does not exist yet. Knowledge engineering is a reasonable approach.
• Use available information extraction and text mining methods to identify partial domain knowledge at the shallow level (features):
• Title hierarchy, concept, reference, etc.;
• Relationship between concepts and categories.
Example title hierarchy (CCR): TITLE 17. Public Health > Division 3. Air Resources > Chapter 1. Air Resources Board > Subchapter 2.6 Air Pollution Control District Rules
3/16/2005 Engineering Informatics Group 22
Main Processing Pipeline
[Diagram components: Internet, Regulation Crawler, Reg. DB, Kernel, Hierarchy Identifier, Entity Extractor, Content Analyzer, Content Indexer, Feature Indexer, Content DB, Feature DB, Search Engine. Legend: New, Old.]
3/16/2005 Engineering Informatics Group 23
Major Components of RegLocator
Render Engine
1. Search Box
2. Subject Directory
3. Results Rendering
Index Engine
1. Content
2. Domain features
Content Engine
1. Web Crawler
2. Shallow Parser
3. Feature Extractor
4. Content Analyzer
Search Engine
1. Basic Score
2. Feature Score
3. Structural Score
Test Engine
1. Functional Test
2. Performance Test
3/16/2005 Engineering Informatics Group 24
Content Engine – Web Crawler
Structure of the web: hyperlinked online documents.
Crawling the web sites
• General processor + configurations;
• Start from a control center;
• Find links in each downloaded page;
• Follow the links to get more;
• Avoid loops.
Sample configuration:
outputDir = HI
startTOC = http://www.hawaii.gov/dlnr/AdminRulesIdx.htm
maxDepth=2
linkPattern1 = ^Final.*Rules
matchLink1 = false
filePattern1 = .*/dlnr/.*\\.pdf
indexPattern1 = .*
linkPattern2 = .*
filePattern2 = .*/dlnr/.*\\.pdf
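The crawling steps on this slide (start from one page, find links, follow them, avoid loops, stop at a maximum depth) can be sketched as a breadth-first traversal. This is not the RegLocator crawler: the `crawl` function, its parameters and the in-memory `PAGES` site are hypothetical, and the per-level link patterns of the sample configuration are reduced to a single file pattern.

```python
import re
from collections import deque
from urllib.parse import urljoin

HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def crawl(start_url, fetch, file_pattern, max_depth=2):
    """Breadth-first crawl; return the URLs that match file_pattern.

    fetch(url) must return the HTML text of a page (supplied by the caller).
    """
    file_re = re.compile(file_pattern)
    visited = {start_url}              # remembering seen URLs avoids loops
    queue = deque([(start_url, 0)])
    found = []
    while queue:
        url, depth = queue.popleft()
        if file_re.fullmatch(url):
            found.append(url)          # a regulation file: collect, do not parse
            continue
        if depth >= max_depth:
            continue                   # do not follow links past the depth limit
        for href in HREF.findall(fetch(url)):
            link = urljoin(url, href)  # resolve relative links
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return found


# Hypothetical two-page site held in memory, so the sketch runs without a network.
PAGES = {
    "http://example.org/index.htm": '<a href="rules/a.htm">A</a> <a href="index.htm">home</a>',
    "http://example.org/rules/a.htm": '<a href="ch1.pdf">1</a> <a href="../index.htm">up</a> <a href="ch2.pdf">2</a>',
}
found = crawl("http://example.org/index.htm", PAGES.__getitem__, r".*\.pdf")
print(found)
```

Passing `fetch` as a parameter keeps the traversal logic separate from downloading, so the same loop could be driven by a real HTTP client.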
3/16/2005 Engineering Informatics Group 25
Content Engine – Shallow Parser
[Diagram components: WORD, HTML, TextConverter, TEXT, StructuralConverter, XML]
Configurations: Type Patterns (junk, hierarchy, table, etc.)
Sample patterns on content filter:
# Remove the title of the pages
/TITLE 18. ENVIRONMENTAL QUALITY//
# Remove the table of contents
/<center>Supp\.(.*?)ARTICLE 1\.(.*?)ARTICLE 1\./<p>ARTICLE 1\./s
Sample patterns on hierarchy recognition:
0@^<p>(CHAPTER (\d+))\. (.*?)$@<p><a NAME="$1" LEVEL="2" TITLE="$3">$3<\/a>@0
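The hierarchy-recognition pattern can be tried out in Python with only small syntax changes. This sketch assumes the pattern is applied line by line as a search-and-replace; the meaning of the leading and trailing `0` fields and of the `@` delimiters is not stated on the slide, so they are left out. The function name and the sample input are illustrative.

```python
import re

# Same regular expression as the slide's sample; Python writes group
# references as \1 and \3 where the slide uses $1 and $3.
CHAPTER = re.compile(r'^<p>(CHAPTER (\d+))\. (.*?)$', re.MULTILINE)


def mark_chapters(html: str) -> str:
    """Wrap each chapter heading in an anchor that carries its hierarchy level."""
    return CHAPTER.sub(r'<p><a NAME="\1" LEVEL="2" TITLE="\3">\3</a>', html)


sample = "<p>CHAPTER 3. Water Quality\n<p>Some text"
print(mark_chapters(sample))
```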
3/16/2005 Engineering Informatics Group 26
Content Engine – Feature Extractor
All salt, table salt, iodized salt, or iodized table salt in packages intended for retail sale shipped in interstate commerce 18 months after the date of publication of this statement of policy in the FEDERAL REGISTER, shall be labeled as prescribed by this section; and if not so labeled, the Food and Drug Administration will regard them as misbranded within the meaning of sections 403 (a) and (f) of the Federal Food, Drug, and Cosmetic Act. (21.CFR.100.155)
Concept and Reference
Source: Stanford PCFG parser
3/16/2005 Engineering Informatics Group 27
Content Engine – Significant Concept
Relationship between concepts and categories
Corpora Comparison:
• Assume a topic-related concept must have a significant distribution in the corpus related to that topic;
• Decide whether a concept is related to a topic by comparing its distribution in this corpus with its distributions in others.
Distribution of a concept in a corpus:
• View a corpus as a list of words;
• Divide this list into many fixed-length regions;
• A sample point is the number of occurrences of the concept within a region.
Example: given a short corpus, build a sample for “food” with regions 20 words long.
• The sample for “food” in this corpus is (0,0,1,1).
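The sampling procedure above can be sketched in a few lines. The function name is hypothetical, the sketch handles single-word concepts only, and it assumes a final shorter region is still counted; the 80-word corpus is made up so that it reproduces the (0,0,1,1) sample.

```python
def region_counts(words, concept, region_len):
    """Split the word list into fixed-length regions and count the concept in each."""
    counts = []
    for start in range(0, len(words), region_len):
        region = words[start:start + region_len]
        counts.append(sum(1 for w in region if w.lower() == concept))
    return counts


# Made-up corpus of 80 filler words with "food" in the third and fourth regions.
words = ["x"] * 80
words[45] = "food"
words[70] = "Food"
print(region_counts(words, "food", 20))  # four regions of 20 words each
```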
3/16/2005 Engineering Informatics Group 28
Content Engine – Significant Concept (Cont.)
Sample: compare 21CFR (Food and Drugs) and 40CFR (Environmental Protection)
• Find the significant concepts in 21CFR;
• List length: both corpora have more than 2M words;
• Each region has 20K words, about 10 points in each sample.
Concept type | p-value | Examples
Specific to 21CFR with high frequency | close to 0 | “alcohol”, “beef”, “cheese”, “cream”, “cup”
Specific to 21CFR with low frequency | smaller than 0.5 but larger than 0.05 | “bakery”, “cigarette”, “chocolate”, “cookie”, “wax”
Common to 21CFR and 40CFR | close to 0.5 | “abstract”, “act”, “fact”, “paragraph”, “section”
More specific to 40CFR | close to 1.0 | “crop”, “dispose”, “field”, “waste”, “water”
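The slide does not name the statistical test behind the p-values. As one possible illustration only, the sketch below uses an exact one-sided permutation test on the two samples, which yields values near 0 when the concept is concentrated in the first corpus and near 1 when it is concentrated in the second. The function name and the sample counts are made up.

```python
from itertools import combinations


def one_sided_p_value(sample_a, sample_b):
    """Exact permutation test: share of relabelings in which group A's total
    is at least as large as the observed total of sample_a."""
    pooled = list(sample_a) + list(sample_b)
    observed = sum(sample_a)
    at_least = 0
    n_splits = 0
    # Enumerate every way of choosing which pooled points belong to group A.
    for idx in combinations(range(len(pooled)), len(sample_a)):
        n_splits += 1
        if sum(pooled[i] for i in idx) >= observed:
            at_least += 1
    return at_least / n_splits


# Made-up region counts for one concept in two corpora.
print(one_sided_p_value([5, 6, 7], [0, 1, 0]))  # concept concentrated in the first corpus
print(one_sided_p_value([0, 1, 0], [5, 6, 7]))  # concept concentrated in the second corpus
```

With about 10 points per sample the enumeration covers 184,756 splits, which is still cheap; larger samples would call for random sampling of splits or a rank-based test instead.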
3/16/2005 Engineering Informatics Group 29
Content Engine – Regulation in XML Form
<!ELEMENT regulation (regElement+)>
<!ATTLIST regulation id ID #REQUIRED name CDATA #REQUIRED type CDATA #REQUIRED>
<!ELEMENT regElement (concept*, regText?, regElement*, reference*)>
<!ATTLIST regElement id ID #REQUIRED name CDATA #REQUIRED>
<!ELEMENT concept EMPTY>
<!ATTLIST concept name CDATA #REQUIRED times CDATA #REQUIRED>
<!ELEMENT reference EMPTY>
<!ATTLIST reference id CDATA #REQUIRED times CDATA #REQUIRED>
<!ELEMENT regText (#PCDATA | paragraph)*>
<!ELEMENT paragraph (#PCDATA | pre | img)*>
<!ELEMENT pre (#PCDATA)>
<!ELEMENT img (#PCDATA)>

<regulation id="40.cfr.1" name="STATEMENT OF ORGANIZATION AND GENERAL INFORMATION" type="federal">
  <regElement id="40.cfr.1.A" name="-- Introduction">
    <regElement id="40.cfr.1.1" name="Creation and authority.">
      <concept name="executive branch" times="1" />
      <concept name="environmental protection agency" times="1" />
      <regText>
        <paragraph>Reorganization Plan 3 of 1970, established the U.S. Environmental Protection Agency (EPA) in the Executive branch as an independent Agency, effective December 2, 1970.</paragraph>
      </regText>
    </regElement>
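A document in this form can be walked with Python's standard library to recover each element's title path and concepts. This is a sketch, not part of RegLocator: the `walk` helper is hypothetical, the paragraph text is shortened, and closing tags are added so that the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# Fragment modeled on the slide's sample; shortened and closed for parsing.
SAMPLE = """
<regulation id="40.cfr.1" name="STATEMENT OF ORGANIZATION AND GENERAL INFORMATION" type="federal">
  <regElement id="40.cfr.1.A" name="-- Introduction">
    <regElement id="40.cfr.1.1" name="Creation and authority.">
      <concept name="executive branch" times="1" />
      <concept name="environmental protection agency" times="1" />
      <regText><paragraph>Reorganization Plan 3 of 1970 ...</paragraph></regText>
    </regElement>
  </regElement>
</regulation>
"""


def walk(element, titles=()):
    """Yield (id, title path, concepts) for every regElement below element."""
    for child in element.findall("regElement"):
        path = titles + (child.get("name"),)
        concepts = {c.get("name"): int(c.get("times")) for c in child.findall("concept")}
        yield child.get("id"), path, concepts
        yield from walk(child, path)   # recurse into nested regElements


root = ET.fromstring(SAMPLE)
rows = list(walk(root, (root.get("name"),)))
for element_id, path, concepts in rows:
    print(element_id, " > ".join(path), concepts)
```

Carrying the title path down the recursion is what makes the hierarchy, which the DTD encodes only through nesting and `name` attributes, available for later indexing.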
3/16/2005 Engineering Informatics Group 30
Index Engine – Inverted Index
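Terms (dictionary) with pointers to their postings (Doc #:Freq)
Term | N docs | Tot Freq | Postings
ambitious | 1 | 1 | 2:1
be | 1 | 1 | 2:1
brutus | 2 | 2 | 1:1, 2:1
capitol | 1 | 1 | 1:1
caesar | 2 | 3 | 1:1, 2:2
did | 1 | 1 | 1:1
enact | 1 | 1 | 1:1
hath | 1 | 1 | 2:1
I | 1 | 2 | 1:2
i' | 1 | 1 | 1:1
it | 1 | 1 | 2:1
julius | 1 | 1 | 1:1
killed | 1 | 2 | 1:2
let | 1 | 1 | 2:1
me | 1 | 1 | 1:1
noble | 1 | 1 | 2:1
so | 1 | 1 | 2:1
the | 2 | 2 | 1:1, 2:1
told | 1 | 1 | 2:1
you | 1 | 1 | 2:1
was | 2 | 2 | 1:1, 2:1
with | 1 | 1 | 2:1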
Source: CS276a, Fall, 2002
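A dictionary and postings structure like the one on this slide can be built in a few lines. This is a minimal sketch: the two document texts are an assumption (short Julius Caesar passages that yield the same terms and counts as the table), the tokenizer is deliberately crude, and all terms are lowercased, so the table's “I” appears as “i”.

```python
import re
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to its postings {doc_id: frequency}, sorted by term."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        for term, freq in Counter(re.findall(r"[a-z']+", text.lower())).items():
            index[term][doc_id] = freq
    return dict(sorted(index.items()))


# Assumed document texts, chosen for illustration.
DOCS = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(DOCS)
for term, postings in index.items():
    # term, number of documents, total frequency, postings
    print(term, len(postings), sum(postings.values()), postings)
```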
3/16/2005 Engineering Informatics Group 31
Index Engine – Domain Knowledge
Tables store the relationships among regulation elements and features:
• Title hierarchies;
• Significant concepts;
• Tree structure.
Example title hierarchy: TITLE 17. Public Health > Division 3. Air Resources > Chapter 1. Air Resources Board > Subchapter 2.6 Air Pollution Control District Rules
Element ID1 | Feature 1 | Feature 2 | … | Parent | Child 1 | …
Element ID2 | Feature 1 | Feature 2 | … | Parent | Child 1 | …
Element ID3 | Feature 1 | Feature 2 | … | Parent | Child 1 | …
3/16/2005 Engineering Informatics Group 32
Search Engine – Three Stages
First stage:
• Identify the candidate documents by query words;
• Rank the candidate elements by tf.idf.
Second stage:
• Re-rank the elements by mapping their domain features to the structural information in the user’s query.
Third stage:
• Tune the ranking by other internal structures.
3/16/2005 Engineering Informatics Group 33
Search Engine – First Stage
Rank the candidate elements by tf.idf:
Represent both the candidate document and the query as vectors.
Each element in a vector is the weight for a term:
• Term Frequency (tf) measures the term density in a document;
• Inverse Document Frequency (idf) measures the term’s informativeness;
• The weight for term i in document d:
$$w_{i,d} = tf_{i,d} \times \log\frac{n}{df_i}$$
Relevance Score
• Similarity is measured by the cosine of the angle between two vectors:
$$sim(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{|\vec{d_i}|\,|\vec{d_j}|} = \frac{\sum_{k=1}^{n} w_{i,k}\, w_{j,k}}{\sqrt{\sum_{k=1}^{n} w_{i,k}^{2}}\,\sqrt{\sum_{k=1}^{n} w_{j,k}^{2}}}$$
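The tf.idf weight and the cosine similarity on this slide can be sketched with sparse dictionaries. The sketch assumes the natural logarithm, takes n as the number of documents and df as the number of documents containing a term, and uses made-up token lists; the function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Return one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Made-up documents; a query could be handled as one more short document.
docs = [
    ["waste", "water", "permit"],
    ["waste", "water", "discharge"],
    ["food", "label"],
]
v0, v1, v2 = tfidf_vectors(docs)
print(cosine(v0, v1), cosine(v0, v2))
```

Documents that share informative terms score above zero, while documents with no terms in common score exactly zero.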
3/16/2005 Engineering Informatics Group 34
Search Engine – Second Stage
Map the domain features in candidate elements to the structural information in the user’s query:
• Keywords appear in the title hierarchy of an element:
• E.g., for the query “environment, waste water”, suppose an element has “waste water” in its content and “environment” in its titles;
• This element is a better candidate for the query.
• Keywords are significant concepts of a category:
• The elements from this category are more relevant.
The ranking scores of candidates should be adjusted according to these relationships.
3/16/2005 Engineering Informatics Group 35
Search Engine – Second Stage (Cont.)
score_d_s = sum_l (w_l * sum_t (tf_q * tf_t) / norm_d_l)
where:
  score_d_s : score for document d on title hierarchy
  sum_l : sum for all title levels l
  w_l : weight for level l
  sum_t : sum for all terms t
  tf_q : the square root of the frequency of t in query
  tf_t : the square root of the frequency of t in title of d at level l
  norm_d_l : normalization denominator for d at title level l

score_d_c = sum_c (w_c * tf_q * tf_c / norm_d_c)
where:
  score_d_c : score for document d on concept’s significance
  sum_c : sum for all concepts c
  w_c : weight for concept c
  tf_c : the square root of the frequency of c in d
  norm_d_c : normalization denominator for d on concepts

score_new = p_d * score_d + p_s * score_d_s + p_c * score_d_c
where:
  score_new : updated ranking score from the second stage
  p_d, p_s, p_c : tuning parameters
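The three formulas on this slide translate almost directly into code. In this sketch the normalization denominators and all weights are passed in by the caller, because the slide does not say how they are computed; score_d is taken to be the score carried over from the first stage, which is an assumption. Function names and sample values are illustrative.

```python
import math
from collections import Counter


def sqrt_tf(terms):
    """Square root of each term's frequency, as used for tf_q, tf_t and tf_c."""
    return {t: math.sqrt(c) for t, c in Counter(terms).items()}


def title_score(query_terms, titles_by_level, level_weights, norms):
    """score_d_s = sum_l (w_l * sum_t (tf_q * tf_t) / norm_d_l)."""
    tf_q = sqrt_tf(query_terms)
    score = 0.0
    for level, title_terms in titles_by_level.items():
        tf_t = sqrt_tf(title_terms)
        overlap = sum(tf_q[t] * tf_t.get(t, 0.0) for t in tf_q)
        score += level_weights[level] * overlap / norms[level]
    return score


def concept_score(query_terms, concept_counts, concept_weights, norm):
    """score_d_c = sum_c (w_c * tf_q * tf_c / norm_d_c), over concepts in the query."""
    tf_q = sqrt_tf(query_terms)
    return sum(
        concept_weights.get(c, 0.0) * tf_q[c] * math.sqrt(count) / norm
        for c, count in concept_counts.items()
        if c in tf_q
    )


def combined_score(score_d, score_d_s, score_d_c, p_d, p_s, p_c):
    """score_new = p_d * score_d + p_s * score_d_s + p_c * score_d_c."""
    return p_d * score_d + p_s * score_d_s + p_c * score_d_c


# Made-up element: "environment" in a level-1 title, "water" in a level-2 title.
query = ["environment", "waste", "water"]
s_title = title_score(
    query,
    {1: ["environment", "protection"], 2: ["water", "quality"]},
    level_weights={1: 1.0, 2: 0.5},
    norms={1: 1.0, 2: 1.0},
)
s_concept = concept_score(query, {"water": 4}, {"water": 2.0}, norm=2.0)
print(s_title, s_concept, combined_score(1.0, s_title, s_concept, 0.5, 0.3, 0.2))
```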
3/16/2005 Engineering Informatics Group 36
Search Engine – Third Stage
Tune the relevance scoring by other internal structures:
• Tree: rank elements with attention to granularity;
• References.
[Figure: element tree with nodes S1 to S10]
3/16/2005 Engineering Informatics Group 37
Render Engine
Provides GUIs for a search box and/or a subject directory.
Browsing formats for retrieved results:
• Traditional list interface;
• Tree interface.
Source: HCIL Space Tree
3/16/2005 Engineering Informatics Group 38
Test Engine
Functionality test
• E.g., query format, special index, relevance scoring.
System performance
• Speed and cost to accomplish a query.
Result quality
• Precision – fraction of retrieved docs that are relevant.
• Recall – fraction of relevant docs that are retrieved.
• Coverage – results are not too specific or too general.
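Precision and recall as defined on this slide can be computed from two sets of document identifiers. This is a minimal sketch with made-up identifiers; returning 0.0 for an empty set is a convention chosen here, not something the slide specifies.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0  # relevant share of retrieved
    recall = hits / len(relevant) if relevant else 0.0       # retrieved share of relevant
    return precision, recall


# Four results returned, two of them among the three relevant documents.
print(precision_recall(["a", "b", "c", "d"], ["a", "c", "e"]))
```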
3/16/2005 Engineering Informatics Group 39
Expected Applications of RegLocator
Crawling different state web sites and collecting regulations from related resources;
Identifying the structural information within the regulation context and storing it in an XML repository;
Mapping the structural information from a query to the content of the regulations;
Ranking the relevant regulation elements with consideration of domain features;
Building a regulation subject directory automatically with an acceptable accuracy rate.
3/16/2005 Engineering Informatics Group 40
Expected Contributions of RegLocator
Management kernel for the U.S. regulation system;
Framework to manage semi-structured documents;
Approaches to knowledge engineering and information retrieval on regulatory information;
Relevance algorithms for searching regulation documents.
3/16/2005 Engineering Informatics Group 41
Acknowledgements
Committee Members:
• Prof. Gio Wiederhold
• Prof. Eduardo Miranda
• Prof. Kincho Law
Research Colleagues:
• Gloria Lau, Shawn Kerrigan, Jun Peng, Chuck Han, Xianshan Pan, and Yang Wang.
NSF Grant No.: EIA-9983368
3/16/2005 Engineering Informatics Group 42
Q & A