
Automated identification of sensitive information


October 21, 1999: "Using Ultra-Structure for Automated Identification of Sensitive Information in Documents". Presented at the 20th annual conference of the American Society for Engineering Management. Paper published in conference proceedings.
Transcript
Page 1: Automated identification of sensitive information

Cover Page 

Uploaded July 1, 2011 

 

Using Ultra-Structure for Automated Identification of Sensitive Information in Documents

Author: Jeffrey G. Long ([email protected])

Date: October 21, 1999 

Forum: Talk presented at the 20th annual conference of the American Society for Engineering Management.

Contents 

Pages 1-5: Preprint of paper

Pages 6‐24: Slides (but no text) for presentation 

 

License 

This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.

Page 2: Automated identification of sensitive information

USING ULTRA-STRUCTURE FOR AUTOMATED IDENTIFICATION OF SENSITIVE INFORMATION IN DOCUMENTS [1]

Jeffrey G. Long, Sr. Knowledge Engineer, DynCorp NSP

[1] This work was funded by the U.S. Department of Energy under contract DE-AC01-98NN50049.

Abstract. The Government has a strong interest in protecting nuclear and national security information while maximizing the availability of information about its operations. Towards this end, federal agencies have developed tens of thousands of 'guidance rules' to help determine what is or is not classified, and at what level. Trained and certified document reviewers apply these rules. The Reviewer's Assistant System (RAS) at the Department of Energy is a new type of expert system that uses a relational database to store very large numbers of rules. It is being developed to automatically apply DOE guidance rules to text documents. This requires far more than a mere keyword search, as the ideas being distinguished can be quite subtle. The purpose of the system is not to 'understand' the documents it processes, but rather to simply check them for the existence of certain ideas or facts. This technology for 'concept-spotting' can be applied to other arenas, such as detecting junk email or searching web sites. This paper will discuss the features, goals and general methods of the Reviewer's Assistant System.

Background

The traditional approach to managing any engineering project is structured: it moves from general planning to requirements analysis, design, implementation, and then long-term maintenance, and has explicit criteria to determine whether and when to move to the next stage. This approach works quite well for creating most types of systems. If a system is simple and the requirements change, the traditional structured approach works because the system can be affordably modified. Alternatively, if a system is complex but the requirements never change, the system can be successfully built. Traditional structured approaches have proven to be better than completely unstructured approaches and have led to the development of many successful systems.

However, standard structured approaches have failed to satisfactorily address the problems involved in creating time-viable software systems, especially over long times. Instead, they have led to the frequent replacement of systems with wholly new ones, most often at great cost for both developers and users of software. Attempts to get better user requirements through rapid prototyping and better charting tools for notating work processes have helped somewhat, but they too have failed to make systems more flexible in the face of changing user requirements. Indeed, if both (a) the system being developed is complex and (b) the user requirements are subject to significant change over time, then the existing structured approaches usually do not work.

The greatest engineering accomplishments of the 20th Century are of the former type. Complex systems ranging from computers to the Space Shuttle have of course been successfully built – but only on the condition that their user requirements change little if at all after the design stage. Unfortunately for us, software systems are both complex and ever changing. As such, they require a different engineering management approach.

The theory of “Ultra-Structure” was developed in part because the application of traditional engineering approaches has failed when applied to designing software systems. This theory was described in (Long and Denning, 1995), and less technically in (Long, 1999), but will be briefly and nontechnically described here. This article primarily discusses the application of Ultra-Structure theory to a new area, namely the analysis of text documents for sensitive information. The U.S. Department of Energy (DOE) Declassification Productivity Initiative (DPI), which will be described later in this paper, has funded this work.

The use of Ultra-Structure theory has thus far allowed us to address several difficult problems that have limited and hindered previous efforts at Natural Language Processing (NLP) systems, expert systems, and large knowledgebases; in particular:

- the ability to manage large numbers of rules in a knowledgebase, numbering in the tens of thousands now and eventually in the hundreds of thousands
- the ability to give knowledge engineers a set of tools to help them visualize and manage large knowledgebases of rules (“rulebases”)
- the ability to manage and maintain both metadata and content information regarding large numbers of documents.

Ultra-Structure Fundamentals

Ultra-Structure theory is based on a different way of looking at the world – a different paradigm or

Page 3: Automated identification of sensitive information

worldview. The traditional Western, Aristotelian worldview sees the world as composed of objects having attributes and relationships to other objects. Ultra-Structure theory sees the world as being a process which, as a minor by-product, occasionally generates physical entities and new relationships among physical entities. The first task of an Ultra-Structure analysis is to understand these processes and to represent them very accurately. The development task of an Ultra-Structure analysis is to ensure that the representation models not just the processes that exist currently, but all logically possible processes within that family of systems.

Ultra-Structure theory suggests that any process can, in principle and in practice, be analyzed into or reduced to a set of If-Then rules. Seemingly simple processes usually follow just a few simple rules; and seemingly complex processes may follow either a few, or many, simple rules. One of the interesting discoveries in cellular automata studies has been that very simple rules can generate very complex behaviors. This has also been observed in the work done in the last 20 years in fractal geometry, where the recursive execution of a single remarkably simple formula, which is a type of rule, can specify very complex shapes. The kind of rules that humans usually work with may be thought of as complex rules. However, a good analyst can analyze complex rules into simple or atomic rules, according to Ultra-Structure theory.

Rules may be defined in terms of sets, with each set having a specified and limited possible domain or set of values. A particular rule in this definition is a particular ordered sequence of these values. Ultra-Structure theory suggests that a good analyst can always define an ordered sequence of domains (in Ultra-Structure terminology, universals) which will contain any particular instance of a rule. Such an ordered set of universals is called a ruleform. A ruleform is to a rule what ordinary algebra is to arithmetic: it is a more generalized way of specifying the essential structural ideas and relationships of a system. While ordinary algebra uses symbols to represent numbers and various arithmetical operations, Ultra-Structure uses universals to represent various domains whose contents may be numeric, alphabetical, or indeed the tokens of any notational system.

Exhibit 1 shows proposed terminology and distinctions for the different layers of structure of any system. Simply put, the surface structure is defined to be the physical manifestation of any system, consisting of its physical entities, relationships and processes. The middle structure is the set of all rules governing the system, which generate the surface structure. The deep structure is the set of ordered domains from which particular rules may be constructed. The sub-structure represents the set of all possible domains (universals).

Finally, the notational structure is the set of tokens used by the rules to represent various abstractions. For RAS, these tokens are numbers and letters; for (say) a music system they would be letters and other signs interpreted as musical “notes” or instructions.

Ultra-Structure theory specifies how these various layers can be represented on a computer as tables in a relational database. To implement a ruleform, we create a table where each row is a separate simple rule, and each column is a separate universal. Typically a complex rule will require several simple rules that are stored in different ruleforms; this group of ruleforms must be examined as a whole in order for the system to make decisions; it is referred to as a cluster. There are several types of ruleforms, as will be illustrated later.

Properly specifying these structures for a model of a system enables the users of the model to enter the rules of the system as data, which is easily changed as necessary when the rules change. Approximately 99% of the rules of the system are specified as data, so the model itself – its software and data structures – has no knowledge of the outside world. All the model itself knows is the order in which to read the ruleforms. This control logic, called animation procedures, is very small; typically just a few thousand lines of code even for a very complex system.
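To make the idea concrete, here is a minimal sketch (in Python with SQLite) of a hypothetical ruleform held as a relational table. The ruleform, its universals and its rules are invented for illustration; they are not the actual RAS schema.

```python
# Illustrative only: a hypothetical ruleform stored as a relational table.
# Each row is a simple rule; each column is a universal.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE shipping_ruleform (      -- hypothetical ruleform, not from RAS
        customer_class TEXT,              -- universal 1 (part of the key)
        order_weight   TEXT,              -- universal 2 (part of the key)
        carrier        TEXT,              -- universal 3 (non-key attribute read at run time)
        PRIMARY KEY (customer_class, order_weight)
    )
""")
db.executemany("INSERT INTO shipping_ruleform VALUES (?, ?, ?)", [
    ("retail",    "light", "parcel post"),
    ("retail",    "heavy", "freight"),
    ("wholesale", "heavy", "rail"),
])

def animate(customer_class, order_weight):
    """Tiny 'animation procedure': generic control logic that only knows which
    ruleform to read and in what order; all knowledge of the world is in the rows."""
    row = db.execute(
        "SELECT carrier FROM shipping_ruleform WHERE customer_class=? AND order_weight=?",
        (customer_class, order_weight),
    ).fetchone()
    return row[0] if row else None

print(animate("retail", "heavy"))  # freight
```

Changing the rules of this toy system means inserting or updating rows; the animation procedure itself never changes, which is the point of the approach described above.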

Natural Language: A New Application Area for Ultra-Structure

Ultra-Structure theory has been applied in a number of application areas, mostly in business. This has included the traditional application areas of order entry, inventory control, billing and cash application, and similar business functions. It has also been applied experimentally to other areas, as indicated by (Shostko, 1999) and (Oh and Scotti, 1999). Starting with part-time work in 1995, it has now been applied to the automated identification of sensitive information in text documents. The present application area is new for Ultra-Structure: attempting to specify the rules by which natural language can be at least partially understood, i.e. partially interpreted and assigned meaning.

The Government has a strong interest in protecting national security information (NSI), while facilitating government openness to public scrutiny and not spending significant amounts of money and time protecting information that does not truly need to be classified. President Clinton issued Executive Order 12958 (E.O.), the most recent codification of the Government’s intentions in this area, on April 14, 1995. It states, among other things, that all classified documents containing only NSI (not RD or FRD, discussed below) will be automatically declassified

Page 4: Automated identification of sensitive information

after 25 years unless one of nine specified conditions for exemption is met. The estimated volume of documents to be reviewed for declassification under the E.O. exceeds one billion, and the five-year grace period specified by the E.O. for reviews to identify exemptions from automatic declassification has been extended for an additional 18 months to help the Agencies meet the enormous workloads.

Under the E.O., any information that is covered under the Atomic Energy Act (AEA) is exempt from automatic declassification. Such information includes anything pertaining to the construction, design or use of nuclear weapons, nuclear propulsion systems, and other special nuclear materials. This information was exempted because the President does not have the authority to unilaterally change the AEA (a law), and also because it is generally recognized that even “old” nuclear design information would still be of current value to a would-be proliferant. It is simply not in the interests of the United States to provide such information.

To help identify this kind of information, called “Restricted Data” (RD) or “Formerly Restricted Data” (FRD), as well as other kinds of national security information (NSI), the Department of Energy has developed about 65,000 specific guidance topics. Their purpose is to help determine what is or is not classified as RD, FRD or NSI, and at what level (confidential, secret, or top secret). Trained and certified document reviewers apply these topics. Moreover, under the Freedom of Information Act (FOIA), the public is entitled to request documents and DOE must be prepared to justify any classification actions it takes before a federal judge. They must have a clear rationale tracing back to the 65,000 guidance topics and from there to either the AEA or to the latest E.O. pertaining to national security.

Document reviewing is a manually intensive process requiring years of education and training. Congress funded the Declassification Productivity Initiative (DPI) at the Department of Energy (DOE) in order to develop advanced tools to help reviewers in various ways. One of the primary tools we have been developing under DPI is called the “Reviewer’s Assistant System” (RAS), which was built using Ultra-Structure theory.

Reviewer’s Assistant System Functions

We are building RAS using Ultra-Structure theory because the number of rules is quite large and these rules are likely to change over time. RAS is designed to:

- rigorously apply DOE Guidance Rules to text documents
- highlight any segments of the text to which a Guidance Rule applies
- highlight for the user (i.e. a certified document reviewer) the specific Guidance Rule(s) that caused any particular sections of text to be selected.

The purpose of RAS is not to “understand” the documents it processes, but rather to detect the existence of any classified concepts or facts. While this is simpler than true document understanding, it nevertheless requires far more than mere keyword searching, where a system simply scans a document for the existence of one or more specified terms. It also requires more than a Boolean keyword search, where specific terms can be ANDed, ORed, etc. We are seeking specific concepts having specific relations to one another, which we refer to as ideas or propositions. The ideas being sought can be quite subtle.

Merging Databases and Text Markup Languages

Traditionally the task of defining the elements of document structure would be performed using a text markup language such as a derivative of the Standard Generalized Markup Language (SGML). In this kind of language, “tags” indicating different structural features of the document are inserted into the document at the beginning and end of each structural feature. Following Ultra-Structure theory, RAS represents the information in terms of rules, and the rules are stored as records of data in various tables. In RAS, therefore, all structural markup information is stored in a database. This kind of markup does not use in-line tags, but instead uses different fields in a table.

There are a number of advantages to storing structured text information in database tables rather than in a flat file with tags. Chief among these are the following general capabilities of relational databases over flat files (a brief illustrative sketch follows the list):

- control access to the data through a security system and audit trail
- enforce referential integrity, such that when a value changes in one part of the system it is immediately changed in all parts of the system
- permit use of complex queries using (e.g.) Structured Query Language (SQL)
- give users quick access to volumes of data through easy-to-use forms and reports
- store and retrieve various types of objects in addition to standard text (e.g. images, sounds).
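As a rough sketch of the contrast between in-line tags and database-style markup, the following uses invented field names; the actual Document Detail fields are described only generally in this paper.

```python
# Illustrative contrast between in-line markup and database-style markup.
# All field names and values here are hypothetical, not the actual RAS schema.

# In-line (SGML-style) markup embeds tags in the text itself:
sgml_fragment = "<sentence><word pos='1'>The</word> <word pos='2'>quick</word> <word pos='3'>fox</word></sentence>"

# Database-style markup keeps the text clean and stores structure as fields,
# one row per semantic entity:
document_detail_rows = [
    # (doc_id, word_no, sentence_no, paragraph_no, is_numeric, token)
    ("DOC-001", 1, 1, 1, False, "The"),
    ("DOC-001", 2, 1, 1, False, "quick"),
    ("DOC-001", 3, 1, 1, False, "fox"),
]

# Reconstructing the original word order is just a sorted pass over the rows.
original = " ".join(row[-1] for row in sorted(document_detail_rows, key=lambda r: r[1]))
print(original)  # The quick fox
```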

Merging Databases and Knowledgebases

Page 5: Automated identification of sensitive information

Ever since mankind first used an abacus about 5,000 years ago, and possibly since we first notched tallies on a stick 30,000 years ago, we have distinguished algorithms from data. This has been a useful distinction, but the veritable wall between the two began to break down when John von Neumann proposed in a memo in 1945 that not just data but also algorithms (as computer instructions) could be stored on a computer in a binary form. This insight – based on work done by him and others at the University of Pennsylvania, including John Mauchly and J. Presper Eckert – led to programmable (stored program) computers.

Although both parts are stored in the same way as binary digits (bits), computer applications are still viewed as consisting of two very different things: algorithms and data. An algorithm is a finite series of steps taken to compute an answer. Data is the values or parameters used by an algorithm to reach its conclusions, which data may have initial, intermediate and final values.

Database applications are generally viewed as applications that provide storage places and access methods for the safe storage and retrieval of persistent data, and the safe adding, changing and deleting of data following certain integrity rules regardless of whether the application software using the database enforces those rules or not. Under this paradigm, databases store and protect “facts” or “data,” and the algorithms that read and use these facts are stored in software programs, queries, stored procedures, job control language procedures, etc. Examples of such applications are order entry, inventory, purchasing, and accounting systems. This class of systems is concerned primarily with data storage, arithmetic and logical calculations, and information retrieval. For this class of systems, changing the rules of a business area requires changing the software – a frequently difficult task.

Expert systems are a different class of applications which consist of rules and an inference engine, and which are concerned primarily with applying reasoning to facts in order to simulate the behavior of human experts in a particular subject domain. The inference engine processes the rules, which are stored in a “knowledgebase” rather than a database. These rules may include executable code, or they may be mere data. The reasoning process may be similar to that of a human expert, or it may be completely different. The behavior of the system as a whole is intended to mimic, and hopefully outperform, a human expert. Examples of such applications include bank credit approval, medical diagnosis, and hardware configuration systems. These systems are usually intended to aid rather than replace human decision-makers. They offer the benefits of high speed, high consistency, and perfect attention to detail.

There have been a number of attempts in the last ten years to bridge the gap between these two classes of applications – to merge databases and knowledgebases and their associated technologies. There is a growing belief that modern database systems must evolve towards knowledgebase systems, and that more "inferencing" is necessary for better understanding and use of data. This could lead to applications involving hundreds of thousands of complex rules that make decisions that seem truly “intelligent.”

The Ultra-Structure paradigm does not make these conventional distinctions between algorithms and data. Rather it defines whatever is stored in a relational database table to be rules which have two different types of parts, called factors and considerations. Factors are primary keys in a table that determine under what general conditions a rule should be looked at; and following standard normalization rules it requires that there be unique keys (factors) for each record (rule). What is traditionally considered to be data (i.e., a fact) is usually stored as a consideration (a non-primary-key attribute) in the record, and this attribute serves merely to guide the execution of a rule cluster. In an inventory system, for example, the quantity-on-hand of a particular item is simply a consideration determining where the item may be sourced for an order. That and other rules in the cluster must all be examined in order for the inventory system to make an intelligent sourcing decision. The inference engine (called animation procedures) consists of just a few thousand lines of code. All knowledge of the external world lies in the rulebase, and none in the animation procedures.

RAS is an example of a new type of system that uses a relational database to store a very large number of rules as data. This perspective requires a new and broader understanding of the nature of rules. If we broaden our concept of rules from IF x THEN do y and z to IF x THEN consider y and z before deciding what to do, then y and z can serve the role traditionally reserved for data, that is they can represent the facts of the world. They do this as an integral part of a larger and more comprehensive cluster of rules, acting as considerations for the execution of individual rules. This means that all the business rules of an organization can be stored as data, and the only software that is necessary is the inference engine, which should never need to change. This puts all knowledge of the world and all the knowledge of rules in a format which is easy to update, easy to review, and can be managed easily by a standard relational database.
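A minimal sketch of this broader reading of a rule, using the inventory-sourcing example above; the items, warehouses, quantities and selection logic are all hypothetical.

```python
# "IF x THEN consider y and z before deciding what to do":
# the factors (keys) say when a rule applies; the considerations (non-key
# attributes) are the "facts" that guide the decision. Hypothetical example.
inventory_ruleform = {
    # factors: (item, warehouse)    considerations
    ("widget", "east"): {"qty_on_hand": 0,   "lead_time_days": 10},
    ("widget", "west"): {"qty_on_hand": 250, "lead_time_days": 3},
}

def source_order(item, qty):
    """Tiny 'animation procedure': examine the whole rule cluster for the item
    and let the considerations decide where the order should be sourced."""
    candidates = [
        (warehouse, cons) for (it, warehouse), cons in inventory_ruleform.items()
        if it == item and cons["qty_on_hand"] >= qty
    ]
    # Among warehouses that can fill the order, prefer the shortest lead time.
    return min(candidates, key=lambda c: c[1]["lead_time_days"])[0] if candidates else None

print(source_order("widget", 100))  # west
```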

Page 6: Automated identification of sensitive information

Rules in RAS

As used by RAS, Ultra-Structure defines several basic kinds of existential rules or types of entities:

- semantic entities, which can be letters, words, phrases, guide topics, or entire guides
- documents, which are the entities being analyzed by the system
- markings, which indicate what to do in the event that certain ideas are found in a text, e.g. mark the document as “confidential”
- users, which define the authorized users of the system.

These entities typically have complex relations to one another. If related to other entities of the same type they are called network rules. RAS has several kinds of network rule:

- entities network relates semantic entities to one another
- markings network relates markings to one another, and in particular indicates a hierarchy of markings
- documents network relates documents to one another, indicating (e.g.) that one document replaces another, or is a duplicate of another, etc.

If existential entities of one kind are related to entities of another kind, we represent that with an authorization rule. RAS has several kinds of authorization rules:

- document detail contains the results of the pre-analysis of a document, specifying the semantic entities and their characteristics and order in the document
- document analysis contains the results of the analysis of a document
- entity markings relates semantic entities (e.g. guide topics) and their associated markings (if any).

Note that each ruleform (table) may be interpreted as defining rules. For example, the Document Detail table may be interpreted as specifying rules for the (re-)construction of the original document. The Markings Network specifies how markings are ordered in a hierarchy, e.g. if a marking of “confidential” and a marking of “secret” both apply to the same document, then the overall classification of the document is “secret”.
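As an illustration of how a Markings Network rule might resolve an overall classification, here is a small sketch; the ranking values are invented for illustration and are not the actual DOE marking hierarchy table.

```python
# Illustrative only: a markings-network rule ordering markings in a hierarchy,
# so that the most restrictive marking found in a document wins.
MARKING_RANK = {            # hypothetical ranks, not the actual RAS rule data
    "UNCLASSIFIED": 0,
    "CONFIDENTIAL": 1,
    "SECRET": 2,
    "TOP SECRET": 3,
}

def overall_marking(markings_found):
    """If both 'CONFIDENTIAL' and 'SECRET' apply, the document is 'SECRET'."""
    if not markings_found:
        return "UNCLASSIFIED"
    return max(markings_found, key=MARKING_RANK.__getitem__)

print(overall_marking(["CONFIDENTIAL", "SECRET"]))  # SECRET
```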

There are other ruleforms (tables) in RAS, but these give the general idea of what the system contains.

Executing (Animating) the RAS Rules

In order to search for concepts in a text, the text must first be “pre-analyzed.” This involves the determination of various boundaries (e.g. sentence boundaries) and the determination of the nature of certain kinds of lexical entities, e.g. whether a specific entity is numeric or non-numeric, and whether a period is part of a number (a decimal point), is used as part of an acronym or abbreviation, or is indeed the end of a sentence.

Each word in a document is usually treated as a separate “semantic entity.” But since words in a phrase often have meanings very different than the same words outside the phrase (e.g. “A horse of a different color” has nothing to do with either horses or colors!), there is frequently a need to indicate that several words must always be treated as a single phrase, in which case the entire phrase becomes a single semantic entity. This is defined by a replacement rule in the Entities Network ruleform. Each semantic entity has a number of attributes such as character position in the document, word number, sentence number, paragraph number, whether it is numeric, etc. These and other attributes for each semantic entity are stored as a rule on a single record in the Document Detail ruleform, in lieu of using an SGML-type markup language. The system is thus generating new “rules” based on other rules, which facilitates subsequent analysis.

After performing an analysis it is necessary to indicate which portions of the text are considered classified or are otherwise marked, and what specific guidance topic(s) caused the text to be selected. These rules, also generated by the system based on other rules, are stored in the Document Analysis ruleform.

Performing the analysis itself requires looking for the tokens in the target documents, and applying the markings indicated. Since each guidance topic is translated into one or more propositions, and there are about 65,000 guidance topics, we anticipate that there will be about 100,000 propositions to be represented and searched for in each text. This number accounts for and excludes duplicate guidance topics. This number of rules alone would make RAS a very large expert system. As indicated in Exhibit 2, specifying a proposition (in the sense used here) means specifying usually two to four concepts which occur within a defined proximity of one another in a text, e.g. 5 sentences or 15 words. We have found thus far that even the most complex propositions require only six concepts. Specifying several concepts to search for is not by itself adequate: the computer must also know all the

Page 7: Automated identification of sensitive information

possible ways that each concept can be tokenized (i.e. lexically expressed) in any text document. This calls for a large tree of relationships specifying how concepts can be tokenized. This mapping requirement – essentially a large ontology of all areas of DOE activity – will probably add another 500,000+ additional rules. We keep these rules in the Entities Network ruleform, first defining all concepts and tokenizations in the Semantic Entities ruleform. Note that in many cases there is no need to specify all forms of an entity, e.g. singular, plural, possessive, etc.; using a wildcard before or after a word stem is sometimes adequate so long as the knowledge engineer is aware that use of stems may increase the false positive hit rate. Of course, not all concepts and tokenizations are equally related to one another. We represent this degree of closeness with a fuzzy fitness number from 0 to 1 to indicate the degree to which the two are related.

We then need to test this rulebase against a large set of documents, and to go back and correct the rules to minimize or eliminate false positives and false negatives, either by adding new entities to look for, changing the relationships of existing entities, or specifying tighter hit ranges. In the long term, we will need to be able to keep the rules up-to-date as the original guidance topics change over time and as we apply it to different corpora having variations in their lexical representations, e.g., using a different (former) name for a national laboratory. We are still in the early stages of creating and validating these rules, and we expect this to be the most difficult part of building the system due to the wide range of subject areas to be covered.
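A much-simplified sketch of this kind of proposition search follows; the concepts, tokenizations, fitness values and proximity window are all invented for illustration and do not reflect actual guidance content.

```python
# Illustrative proposition search: a proposition is a small set of concepts
# that must co-occur within a proximity window; each concept maps to several
# possible tokenizations, each with a fuzzy fitness value from 0 to 1.
# All names and numbers here are hypothetical.
CONCEPT_TOKENS = {
    "REACTOR":  {"reactor": 1.0, "pile": 0.6},
    "LOCATION": {"site x": 0.9, "building 4": 0.8},
}

def find_proposition(words, concepts, window=15):
    """Return True if some tokenization of every concept occurs within `window` words."""
    text = [w.lower() for w in words]
    hits = {}  # concept -> word positions where one of its tokenizations matched
    for i in range(len(text)):
        for concept, tokens in concepts.items():
            for tok, fitness in tokens.items():
                n = len(tok.split())
                if " ".join(text[i:i + n]) == tok and fitness >= 0.5:
                    hits.setdefault(concept, []).append(i)
    if len(hits) < len(concepts):
        return False  # at least one concept never appeared
    # Crude proximity check: compare the first occurrence of each concept.
    first = [min(positions) for positions in hits.values()]
    return max(first) - min(first) <= window

doc = "The old pile was moved to Building 4 last year".split()
print(find_proposition(doc, CONCEPT_TOKENS))  # True
```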

Results to Date

We have run RAS on several different corpora having very different characteristics, in order to see how it performs with these different corpora. Characteristics of interest include whether the documents are known or believed to be classified or unclassified; the size of each document in the corpus, ranging from a few sentences to hundreds of pages; the size of the total corpus, ranging so far up to about 3 million words; and whether the corpus was originally created electronically or whether it was OCRed and therefore has some number of OCR errors in it. The rulebase tested has over 700 guidance rules in it, and maps to about 20,000 tokens. We expect soon to greatly increase the number of guidance rules applied.

Results so far show a typical false positive hit rate of about ten percent of the documents reviewed, meaning that of 100 documents, RAS will incorrectly identify ten as “hot” when they are not. We hope of course to reduce this rate by broadening and deepening the rulebase, and

work is underway to automatically generate RAS rules from the written English guidance. Of these, about ten percent are what we call “good false positives”, i.e. items that are not in fact classified but which a reviewer would want to look at closely before making that determination. In terms of missing items that should have been identified as “hot” (i.e., false negatives), results are harder to determine since sometimes even human reviewers may disagree about what is sensitive. But results to date indicate that almost all missed items are readily accounted for as outside the domain of the rulebase, either pertaining to a different subject area or including tokens that the rulebase was unaware of. This low false negative and low false positive rate is in great contrast to other approaches which have often been found unusable based on 50%+ false positive hit rates, using that term as defined above.

Other Possible RAS Applications

The RAS technology for “concept-spotting” – reviewing documents for the existence of specific propositions – can theoretically be applied to other arenas, such as:

- identifying unsolicited commercial email (“spam”)
- searching web sites for certain ideas
- searching patents for certain ideas
- scanning computer source code for Y2K date issues.

We are still a long way from completing the RAS system. The results to date have been very promising: RAS is demonstrating the advantages of Ultra-Structure theory for concept detection and large knowledgebases. The Declassification Productivity Research Center (DPRC) at The George Washington University is carrying out other Ultra-Structure based research projects, which are also showing positive results (Oh and Scotti, 1999).

Summary and Conclusions

A million records is small by database system standards, but a million rules is essentially an impossible number for a traditional expert system to manage. We expect to be able to effectively handle very large numbers of rules, numbering in the hundreds of thousands, using the techniques being followed for RAS. Ultra-Structure theory may constitute a real merger of knowledgebase and database technologies. If so, it has the potential to usher in a new era of vastly larger expert systems for carrying out policies and procedures of extreme complexity.

References

Page 8: Automated identification of sensitive information

Long, Jeffrey G. and Denning, Dorothy E., “Ultra-Structure: A design theory for complex systems and processes,” Communications of the ACM 38(1), (1995) 103-120.

Long, Jeffrey G., “A new notation for representing business and other rules,” Semiotica 125-1/3, (1999) 215-228.

Oh, Youngsuck and Scotti, Richard, “Analysis and Design of a Database using Ultra-Structure Theory (UST) – Conversion of a Traditional Software System to One Based on UST,” Proceedings of the 20th Annual Conference, American Society for Engineering Management (1999).

Shostko, Alexander, “Design of an automatic course-scheduling system using Ultra-Structure,” Semiotica 125-1/3, (1999) 197-214.

About the Author

Mr. Long is Senior Knowledge Engineer on the DPI project. He is also Director of the Notational Engineering Laboratory, an effort to create a clearinghouse for people interested in problems of representation in any field of science, art or other activity. His experience includes 25 years of consulting on various kinds of applications software development, with a particular focus on studying complex systems and the problems of representing them.

Standard Terminology (if any) | Ultra-Structure Instance Name | Ultra-Structure Level Name | U-S Implementation
behavior, physical entities and relationships, processes | particular(s) | surface structure | system behavior
rules, laws, constraints, guidelines, rules of thumb | rule(s) | middle structure | data and some software (animation procedures)
(no standard or common term) | ruleform(s) | deep structure | tables
(no standard or common term) | universal(s) | sub-structure | attributes, fields
tokens, signs or symbols | token(s) | notational structure | character set

Exhibit 1: Layers of Structure in Any System, According to Ultra-Structure Theory

Page 9: Automated identification of sensitive information

Exhibit 2: RAS Breakdown of Topics to Tokens

Page 10: Automated identification of sensitive information

Using Ultra-Structure for Automated Identification of Sensitive Information in Documents

Jeffrey G. Long

Sr. Knowledge Engineer, DynMeridian

[email protected]

Page 11: Automated identification of sensitive information

Traditional Engineering Approaches Work Only Under Certain Conditions

Page 12: Automated identification of sensitive information

Unfortunately, Complex and Changing Needs Exist in Every Organization

[Diagram: organizational needs and the supporting software and databases (SW & DB), shown at time 1, time 2, time 3...]

Page 13: Automated identification of sensitive information

Ultra-Structure Theory Was Created to Support Complex and Changing Rules

New theory of systems design, developed 1985

Focuses on optimal computer representation of complex, conditional and changing rules

Based on a new abstraction called ruleforms

The breakthrough was to find the unchanging features of changing systems

Page 14: Automated identification of sensitive information

The Theory Offers a Different Way to Look at Complex Systems and Processes

[Diagram: observable behaviors (surface structure) are generated by rules (middle structure), whose form is constrained by the deep structure.]

Page 15: Automated identification of sensitive information

This Creates New Levels for Analysis and Representation

[Table: the same layers-of-structure table shown as Exhibit 1 above (standard terminology, Ultra-Structure instance name, level name, and U-S implementation).]

Page 16: Automated identification of sensitive information

The Ruleform Hypothesis

Complex system structures are created by not-necessarily-complex processes; and these processes are created by the animation of operating rules. Operating rules can be grouped into a small number of classes whose form is prescribed by "ruleforms". While the operating rules of a system change over time, the ruleforms remain constant. A well-designed collection of ruleforms can anticipate all logically possible operating rules that might apply to the system, and constitutes the deep structure of the system.

Page 17: Automated identification of sensitive information

The CoRE Hypothesis

There exist Competency Rule Engines, or CoREs, consisting of <50 ruleforms, that are sufficient to represent all rules found among systems sharing broad family resemblances, e.g. all corporations. Their definitive deep structure will be permanent, unchanging, and robust for all members of the family, whose differences in manifest structures and behaviors will be represented entirely as differences in operating rules. The animation procedures for each engine will be relatively simple compared to current applications, requiring less than 100,000 lines of code in a third generation language.

Page 18: Automated identification of sensitive information

DOE Reviewer’s Assistant System Requirements

650 guides defining 65,000 topics that are or may be classified

Extensive background knowledge required to interpret guidance

Guidance changes over time

Terminology in documents changes over time

Current backlog of 300+ million pages

Objective is concept spotting, not document understanding

Page 19: Automated identification of sensitive information

Normally This Would be Done Using an Expert System Shell

ES often have trouble with > 100 rules

DOE system will require about 500,000 rules

Key issue: maintainability of rules

Many benefits from using relational database to store rules as data:

Built-in referential integrity

Easy report-writing and queries

Simple user interface for KE and Reviewers

Page 20: Automated identification of sensitive information

RAS Defines Guidance Concepts and All Possible Lexical Expressions of Those Concepts

[Process diagram: Convert Guides → Define Interpretations → System Ready; Read Document → Apply Guidance → Document Reviewed.]

Page 21: Automated identification of sensitive information

Rules Specify Relations Between Concepts, Tokens and Markings

Page 22: Automated identification of sensitive information

Results to Date are Promising

In a corpus of 3,750 unclassified documents, the false positive rate was less than 10%

In another corpus of 16,500 unclassified documents, the false positive rate was 2.5%

In other approaches (e.g. keyword and statistical systems), false positive and false negative rates are often in excess of 50%

Page 23: Automated identification of sensitive information

The Ultra-Structure-Based RAS System Offers Substantial Benefits to Reviewers and Knowledge Engineers

System can provide precise and rigorous interpretation of DOE Classification Guidance

Rules can become more complex if necessary

Rules are easy to specify, change and review

Implications and consequences of changes can be better foreseen

Changes to rules do not require changing software or table structures – just data

Page 24: Automated identification of sensitive information

Next Steps for RAS Development

Work with subject experts to expand scope and improve quality and completeness of rulebase

Continue testing system against many types of documents

Improve design to minimize/eliminate false negatives and false positives

Work with end-users to improve user interface

Integrate into other systems

Improve design to increase speed: parallel processing, stored queries, etc.

Page 25: Automated identification of sensitive information

As the CoRE Hypothesis Promises, RAS Could be Used in Other Areas Also

Categorize documents by subject

Scan email for spam/UCE

Scan websites, e.g. for compliance to a standard

Categorize patents or scan them for specified concepts

Scan source code, e.g. Y2K

Scan any machine-readable corpus for specified ideas