Mapping Regulations to Industry–Specific Taxonomies
Chin Pang Cheng, Gloria T. Lau, Kincho H. Law
Engineering Informatics Group, Stanford University
June 5, 2007
Jan 04, 2016
Motivating Problem
To Legal Practitioners:
  Hierarchical, well-structured
  Precise and concise
  Familiar with regulatory organization systems
To Industry Practitioners:
  Voluminous
  Not trained to read regulations
  More familiar with industry-specific terminology and classification structure
Mapping Regulations to Taxonomies
Possible Cases:
  One-Taxonomy-One-Regulation
  One-Taxonomy-N-Regulation
  N-Taxonomy-One-Regulation
  N-Taxonomy-N-Regulation
One-Taxonomy-One-Regulation
Simple keyword latching task
  Stemming (e.g. piling → pile, disabled → disable)
  Word interval (the words of a concept may be separated by other words in the regulation text)
    Concept: “fire alarm system”
    Regulation: “… fire alarm and detection system …”
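The stemming and word-interval ideas above can be sketched as follows. This is a minimal illustration, not the matcher used in the presented system: the suffix list in `stem`, the `max_gap` parameter, and the function names are all assumptions made for the example.

```python
import re

# Toy suffix list; an assumption for illustration, not a full stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Crude stemmer: strip one common suffix, then a trailing 'e'."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    return word[:-1] if word.endswith("e") and len(word) > 3 else word


def tokens(text):
    """Lowercase, split into alphabetic words, and stem each word."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def matches(concept, text, max_gap=2):
    """True if the concept words occur in order in the text, with at most
    max_gap other words between consecutive concept words (greedy search)."""
    c, t = tokens(concept), tokens(text)
    if not c:
        return False
    for start, tok in enumerate(t):
        if tok != c[0]:
            continue
        pos = start
        for want in c[1:]:
            window = t[pos + 1 : pos + 2 + max_gap]
            if want not in window:
                break
            pos += 1 + window.index(want)
        else:
            return True
    return False


print(matches("fire alarm system", "fire alarm and detection system"))
```

With `max_gap=2` the concept "fire alarm system" matches the regulation phrase even though "and detection" sits between "alarm" and "system"; with `max_gap=1` it does not.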
Each taxonomy concept is hyperlinked
“No Matched Sections” for non-matched OmniClass concepts
See other matched related concepts in that section
Inverted Regulations
One-Taxonomy-N-Regulation
Alabama (AL) regulation Arizona (AZ) regulation
One Regulation as the Base
[Figure labels: (AL), (AZ)]
Similarity Comparison on Sections
Core from Lau, Law and Wiederhold (2005):
  Feature extraction (e.g. concepts, measurements)
  Comparison of shared features
  Consideration of hierarchical and referential information
G. Lau, K. Law and G. Wiederhold. “Legal Information Retrieval and Application to E-Rulemaking,” in Proceedings of the 10th International Conference on Artificial Intelligence and Law (ICAIL 2005), Bologna, Italy, pp. 146-154, Jun 6-11, 2005.
[Figure: nodes in comparison A (AL regulation) and U (AZ regulation), with their parent, sibling and child nodes psc(A) and psc(U) and reference nodes ref(U). Score labels: f0, psc-psc, s-refs-psc.]
Inclusion of Regulation Hierarchy
Terminological differences: revealed by neighbor inclusion
[Figure: UFAS 4.13.9 Door Hardware (parent 4.13 Doors; siblings 4.13.1, 4.13.2, 4.13.3, ..., 4.13.12) compared with BS8300 12.5.4.2 Door Furniture (parent 12.5.4 Doors; sibling 12.5.4.1)]
Uniform Federal Accessibility Standards, 4.13.9 Door Hardware
  4.13 Doors
    4.13.1 General
    ...
    4.13.9 Door Hardware
      Handles, pulls, latches, locks, and other operating devices on accessible doors shall have a shape that is easy to grasp with one hand and does not require tight grasping ...
    ...
    4.13.12 Door Opening Force
British Standard 8300, 12.5.4.2 Door Furniture
  12.5.4 Doors
    12.5.4.1 Clear Widths of Door Openings
    12.5.4.2 Door Furniture
      Door handles on hinged and sliding doors in accessible bedrooms should be easy to grip and operate by a wheelchair user or ambulant disabled person ...
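To illustrate why neighbor inclusion helps with terminological differences such as "Door Hardware" versus "Door Furniture", the sketch below blends a section-to-section feature score with a score over neighboring (parent and sibling) sections. It is only a sketch under stated assumptions: the feature sets, the set-overlap measure, and the weight `alpha` are hypothetical and are not the scoring scheme of Lau, Law and Wiederhold (2005).

```python
from dataclasses import dataclass, field


def overlap(a, b):
    """Share of features two sets have in common (intersection over union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Section:
    features: set                                   # features of the section itself
    neighbors: list = field(default_factory=list)   # feature sets of parent/siblings


def neighbor_score(a, u):
    """Average, over the neighbors of a, of the best overlap with any neighbor of u."""
    if not a.neighbors or not u.neighbors:
        return 0.0
    best = [max(overlap(n, m) for m in u.neighbors) for n in a.neighbors]
    return sum(best) / len(best)


def combined_similarity(a, u, alpha=0.5):
    """Blend the base score f0 with the neighbor score (alpha is an assumed weight)."""
    f0 = overlap(a.features, u.features)
    return (1 - alpha) * f0 + alpha * neighbor_score(a, u)


# Hypothetical feature sets loosely based on the two excerpts.
ufas = Section({"door hardware", "handle", "grasp"},
               [{"door"}, {"door", "opening force"}])
bs8300 = Section({"door furniture", "handle", "grip"},
                 [{"door"}, {"door", "clear width"}])

print(overlap(ufas.features, bs8300.features))
print(combined_similarity(ufas, bs8300))
```

In this toy setup the two sections share only one feature directly, but their shared "Doors" context raises the combined score above the base score.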
N-Taxonomy-One-Regulation
Multiple taxonomies exist in a single industry
Translation is unavoidable
E.g. in the architectural, engineering and construction (AEC) industry:
  Industry Foundation Classes (IFC)
  CIMsteel Integration Standards (CIS/2)
  Automating Equipment Information Exchange (AEX)
  UniFormat™, MasterFormat™
  etc.
Possible solution: Merging taxonomy
Unfamiliar taxonomy
Proposed System
Proposed Methodology of Taxonomy Mapping
[F] 903.4.2 Alarms. Approved audible devices shall be connected to every automatic sprinkler system. Such sprinkler water-flow alarm devices shall be activated by water flow equivalent to the flow of a single sprinkler of the smallest orifice size installed in the system. Alarm devices shall be provided on the exterior of the building in an approved location. Where a fire alarm system is installed, actuation of the automatic sprinkler system shall actuate the building fire alarm system.
[Figure: taxonomy concepts matched in the section above (sprinkler system, orifice, fire alarm, water flow, fire alarm system), each labeled with its taxonomy, T1 or T2]
Taxonomy Mapping:
  Mainly done manually nowadays
  Usually by term matching (e.g. “fire” and “fire alarm”)
Demonstration in Construction Industry
[Figure: the knowledge corpus (International Building Code, IBC) placed between Taxonomy 1 (OmniClass) and Taxonomy 2 (ifcXML); example concepts shown: steel, IfcSlab]
Corpus: carefully selected (in the same domain)
Relatedness Analysis on Concepts
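Notations:
  a pool of m concepts for a taxonomy
  a corpus of N regulation sections
  frequency vector $c_i$ is an N-by-1 vector storing the occurrence frequencies of concept i among the N documents
  frequency matrix C is an N-by-m matrix in which the i-th column vector is $c_i$
Example: m = 4, N = 5

$$
C = \begin{bmatrix}
1 & 5 & 0 & 2 \\
0 & 1 & 0 & 0 \\
3 & 1 & 2 & 0 \\
0 & 0 & 3 & 0 \\
1 & 0 & 1 & 0
\end{bmatrix},
\qquad
c_3 = \begin{bmatrix} 0 \\ 0 \\ 2 \\ 3 \\ 1 \end{bmatrix}
$$

Rows are sections $sec_1$ to $sec_5$ (top to bottom); columns are concepts 1 to 4 (left to right).
Concept 3 is matched to Section 4 3 times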
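A frequency matrix of this kind can be built with a few lines of code. The sketch below is illustrative only: it counts plain substring occurrences (no stemming or word interval), the section texts and concept list are made up for the example, and the function names are assumptions. Note that the code uses 0-based indices, while the slides number concepts and sections from 1.

```python
def frequency_matrix(sections, concepts):
    """N-by-m matrix: entry [s][i] counts occurrences of concept i in section s."""
    return [[sec.lower().count(c.lower()) for c in concepts] for sec in sections]


def concept_vector(C, i):
    """Column i of the frequency matrix, i.e. the frequency vector of concept i."""
    return [row[i] for row in C]


# Hypothetical corpus of N = 3 sections and a pool of m = 3 concepts.
sections = [
    "Where a fire alarm system is installed, the sprinkler system "
    "shall actuate the fire alarm system.",
    "Approved audible devices shall be connected to every sprinkler system.",
    "Door hardware shall be easy to grasp.",
]
concepts = ["fire alarm system", "sprinkler system", "orifice"]

C = frequency_matrix(sections, concepts)
print(C)
print(concept_vector(C, 1))
```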
Cosine Similarity Measure
Common arithmetic measure of similarity to compare documents in text mining
Finding the angle between two frequency vectors in N dimensions
  $c_i$ and $c_j$ from Taxonomy 1 and 2 respectively
  Similarity score in [0, 1]
Represented using dot product and magnitude, the similarity score is given by:

$$
\mathrm{Sim}(i, j) = \frac{c_i \cdot c_j}{|c_i| \, |c_j|}
$$
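As a worked illustration of the cosine measure, the following sketch computes the score for two frequency vectors using only the standard library. The example vectors are illustrative, and returning 0 for an all-zero vector (a concept that is never matched) is an assumption, since the angle is undefined in that case.

```python
import math


def cosine_similarity(ci, cj):
    """Cosine of the angle between two frequency vectors of equal length."""
    dot = sum(a * b for a, b in zip(ci, cj))
    norm = math.sqrt(sum(a * a for a in ci)) * math.sqrt(sum(b * b for b in cj))
    # Assumption: a concept with no matches at all gets similarity 0.
    return dot / norm if norm else 0.0


# Illustrative frequency vectors over N = 5 sections.
c_i = [1, 0, 3, 0, 1]
c_j = [5, 1, 1, 0, 0]
print(cosine_similarity(c_i, c_j))
```

Because frequencies are non-negative, the score always falls in [0, 1]: 0 when the two concepts never share a section, 1 when their frequency vectors point in the same direction.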
Jaccard Similarity Coefficient
Statistical measure of the extent of overlap of two vectors in N dimensions
  $c_i$ and $c_j$ from Taxonomy 1 and 2
Defined as size of intersection divided by size of union of the vector dimension sets:

$$
\mathrm{Jaccard}(i, j) = \frac{|c_i \cap c_j|}{|c_i \cup c_j|}
$$

For concept relatedness analysis,

$$
\mathrm{Sim}(i, j) = \frac{N_{11}}{N_{11} + N_{10} + N_{01}}
$$

  $N_{11}$ = number of sections both concepts i and j are matched to
  $N_{10}$ = number of sections concept i is matched to but not concept j
  $N_{01}$ = number of sections concept j is matched to but not concept i
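The section counts can be read directly off two frequency vectors, treating any non-zero frequency as a match. The sketch below is a minimal illustration with made-up vectors; returning 0 when neither concept is matched anywhere is an assumption.

```python
def jaccard_similarity(ci, cj):
    """N11 / (N11 + N10 + N01) computed from two frequency vectors."""
    n11 = sum(1 for a, b in zip(ci, cj) if a > 0 and b > 0)
    n10 = sum(1 for a, b in zip(ci, cj) if a > 0 and b == 0)
    n01 = sum(1 for a, b in zip(ci, cj) if a == 0 and b > 0)
    denom = n11 + n10 + n01
    # Assumption: two concepts that are never matched get similarity 0.
    return n11 / denom if denom else 0.0


# Illustrative frequency vectors over N = 5 sections.
c_i = [1, 0, 3, 0, 1]
c_j = [5, 1, 1, 0, 0]
print(jaccard_similarity(c_i, c_j))  # N11 = 2, N10 = 1, N01 = 1
```

Unlike the cosine measure, only the presence of a match matters here; how often a concept occurs within a section is ignored.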
Market Basket Model
Probabilistic measure to find item-item correlation, used in data mining
Two main elements: (1) set of items; (2) set of baskets
Association rule $\{i_1, i_2, \ldots, i_k\} \rightarrow j$ means a basket containing all the items $i_1, \ldots, i_k$ is very likely to contain item j
Confidence of a rule = $\Pr(j \mid i_1, \ldots, i_k)$
Interest of a rule = $|\Pr(j \mid i_1, \ldots, i_k) - \Pr(j)|$
Example: Coca-Cola → Pepsi: low confidence but high interest
Market Basket Model (cont’d)
For concept relatedness analysis:
  $N_{11}$ = number of sections both concepts i and j are matched to
  $N_{10}$ = number of sections concept i is matched to but not concept j
  $N_{01}$ = number of sections concept j is matched to but not concept i
  $N_{00}$ = number of sections both concepts i and j are NOT matched to
Probability of concept j is

$$
\Pr(j) = \frac{N_{11} + N_{01}}{N_{11} + N_{10} + N_{01} + N_{00}}
$$

Confidence of association rule $i \rightarrow j$ is

$$
\mathrm{Conf}(i \rightarrow j) = \frac{N_{11}}{N_{11} + N_{10}}
$$

Forward similarity of concept i and j is the interest:

$$
\mathrm{Sim}(i, j) = \left| \frac{N_{11}}{N_{11} + N_{10}} - \frac{N_{11} + N_{01}}{N_{11} + N_{10} + N_{01} + N_{00}} \right|
$$
Asymmetry of Market Basket Model
Asymmetry of market basket model:
Forward similarity:

$$
\mathrm{Sim}(i, j) = |\mathrm{Conf}(i \rightarrow j) - \Pr(j)| = \left| \frac{N_{11}}{N_{11} + N_{10}} - \frac{N_{11} + N_{01}}{N_{11} + N_{10} + N_{01} + N_{00}} \right|
$$

Backward similarity:

$$
\mathrm{Sim}(j, i) = |\mathrm{Conf}(j \rightarrow i) - \Pr(i)| = \left| \frac{N_{11}}{N_{11} + N_{01}} - \frac{N_{11} + N_{10}}{N_{11} + N_{10} + N_{01} + N_{00}} \right|
$$

OmniClass concept i        ifcXML concept j         Sim(i, j)   Sim(j, i)
curtain walls              IfcCurtainWall           0.992849    0.992849
sound and signal devices   IfcSwitchingDeviceType   0.998808    0.998808
roof decking               IfcSlab                  0.802344    0.370313
speakers                   IfcAlarmType             0.883194    0.018024
gypsum board               IfcWallType              0.568832    0.029939
concrete                   IfcSlab                  0.119548    0.427615
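The asymmetry can be reproduced with a small sketch that derives the four section counts from two frequency vectors and takes the absolute difference between confidence and probability. The vectors are made up for illustration, the function names are assumptions, and returning 0 when concept i is never matched (so the confidence is undefined) is also an assumption.

```python
def contingency(ci, cj):
    """Return (N11, N10, N01, N00) for two frequency vectors of equal length."""
    n11 = sum(1 for a, b in zip(ci, cj) if a > 0 and b > 0)
    n10 = sum(1 for a, b in zip(ci, cj) if a > 0 and b == 0)
    n01 = sum(1 for a, b in zip(ci, cj) if a == 0 and b > 0)
    n00 = sum(1 for a, b in zip(ci, cj) if a == 0 and b == 0)
    return n11, n10, n01, n00


def forward_similarity(ci, cj):
    """Interest of the rule i -> j: |Conf(i -> j) - Pr(j)|."""
    n11, n10, n01, n00 = contingency(ci, cj)
    total = n11 + n10 + n01 + n00
    if total == 0 or n11 + n10 == 0:
        return 0.0  # assumption: undefined confidence is treated as no similarity
    return abs(n11 / (n11 + n10) - (n11 + n01) / total)


def backward_similarity(ci, cj):
    """Interest of the rule j -> i, obtained by swapping the roles of i and j."""
    return forward_similarity(cj, ci)


# Illustrative vectors over N = 10 sections: a specific concept i that appears
# in one section, and a more general concept j that appears in three.
c_i = [2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
c_j = [1, 3, 1, 0, 0, 0, 0, 0, 0, 0]
print(forward_similarity(c_i, c_j))   # |1/1 - 3/10|
print(backward_similarity(c_i, c_j))  # |1/3 - 1/10|
```

In this example every section matched to i is also matched to j, so the forward score is high, while most sections matched to j do not mention i, so the backward score is much lower.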
Evaluation of Accuracy
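Root Mean Square Error (RMSE): Difference between the true values and the predicted values
For Taxonomy 1 of m concepts and Taxonomy 2 of n concepts:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left( \mathrm{true}_{i,j} - \mathrm{predicted}_{i,j} \right)^2}
$$

Precision: Fraction of predictions that are correct

$$
\mathrm{Precision} = \frac{|\text{Predicted Related} \cap \text{Actually Related}|}{|\text{Predicted Related}|}
$$

Recall: Fraction of correct matches that are predicted

$$
\mathrm{Recall} = \frac{|\text{Predicted Related} \cap \text{Actually Related}|}{|\text{Actually Related}|}
$$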
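The three accuracy measures can be computed as in the sketch below. The slides do not state how a similarity score is turned into a "predicted related" decision, so the cutoff `threshold` is an assumption, as are the toy matrices and the convention of returning 0 when a denominator is empty.

```python
import math


def rmse(true, predicted):
    """Root mean square error over an m-by-n matrix of concept pairs."""
    m, n = len(true), len(true[0])
    squared = sum((true[i][j] - predicted[i][j]) ** 2
                  for i in range(m) for j in range(n))
    return math.sqrt(squared / (m * n))


def related_pairs(matrix, threshold):
    """Set of (i, j) pairs whose score reaches the assumed threshold."""
    return {(i, j) for i, row in enumerate(matrix)
            for j, value in enumerate(row) if value >= threshold}


def precision_recall(true, predicted, threshold=0.5):
    """Precision and recall of predicted related pairs against the true ones."""
    actual = related_pairs(true, threshold)
    guessed = related_pairs(predicted, threshold)
    hits = len(actual & guessed)
    precision = hits / len(guessed) if guessed else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Toy example with m = 2 and n = 2 concepts.
true = [[1, 0], [0, 1]]
predicted = [[0.9, 0.2], [0.6, 0.3]]
print(rmse(true, predicted))
print(precision_recall(true, predicted))
```

Here one of the two predicted pairs is correct and one of the two actually related pairs is found, so precision and recall are both 0.5.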
Evaluation Results
Cosine Similarity: Average performance among the three metrics
Jaccard Similarity: NOT preferred (unacceptably low recall, though high precision)
Market Basket Model: Preferred (lowest RMSE, highest recall)
            Cosine Similarity   Jaccard Similarity   Market Basket Model
RMSE        0.1000              0.1300               0.0825
Precision   0.9130              1.0000               0.7955
Recall      0.3559              0.1186               0.5932
20 concepts from OmniClass, 20 concepts from ifcXML
Conclusion
Mapping industry-specific taxonomy to regulation allows industry practitioners to retrieve regulations faster
Four cases:
  1-Taxonomy-1-Regulation: simple keyword latching
  1-Taxonomy-N-Regulation: hierarchy of regulation sections considered
  N-Taxonomy-1-Regulation: 3 similarity analysis metrics introduced (cosine similarity, Jaccard similarity, market basket model)
  N-Taxonomy-N-Regulation: future step
~ Thank You ~