Page 1: Teaching machines about a subject domain

Teaching machines about a subject domain

Paul H. Cleverley, Aug 2016

Department of Information Management
Aberdeen Business School
Robert Gordon University (RGU), United Kingdom

Page 2: Teaching machines about a subject domain

Introduction

• There may be opportunities to teach enterprise search engines about a subject in some detail using machine learning (AI) techniques. Applying certain algorithms to unstructured text allows the computer to 'learn' as more text is added to the system. In this way these techniques differ from classic 'expert systems', where people programmed in the domain relationships; using unstructured text, the machine can work out the relationships for itself.

• Word co-occurrence techniques can help build associative networks from unstructured text, whether that text comes from documents on your file system, in Microsoft SharePoint or in any other Electronic Document Management System (EDMS). Through 'counting', probabilities can be estimated between terms which represent their 'relatedness': the more often two terms occur together, the more related they are and the stronger the associative link. These mathematical representations can act like 'fingerprints' of contexts, helping identify similar concepts/contexts or highlighting discriminatory ones.

• Word association networks may exhibit 'small world' network characteristics, where some popular terms have many links to other terms (termed hubs), whilst most terms have a smaller number of links.

• In the word association games that people play, the mind has a tendency to recall associated terms that are hubs (high authority); Google's PageRank algorithm operates in a conceptually similar way. This presents an opportunity for search engines to suggest associative terms 'off the beaten track' of our minds.
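The co-occurrence counting described above can be sketched in a few lines of Python. This is a minimal illustration assuming a sentence-level co-occurrence window; the function names and windowing choice are illustrative, not the exact tooling used in the research:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered pair of distinct terms appears
    together in the same sentence. `sentences` is a list of token
    lists; pair counts approximate the 'relatedness' described above."""
    pairs = Counter()
    for tokens in sentences:
        # count each unordered pair of distinct terms once per sentence
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs

def associations(pairs, term, top=10):
    """Rank terms associated with `term` by co-occurrence frequency."""
    ranked = Counter()
    for (a, b), n in pairs.items():
        if term == a:
            ranked[b] += n
        elif term == b:
            ranked[a] += n
    return ranked.most_common(top)
```

In practice the window could equally be a fixed number of words, a paragraph or a whole document; the choice of window changes how 'tight' the resulting associations are.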

Page 3: Teaching machines about a subject domain

Research

• I conducted some research within the oil and gas industry using Python, freely available tools and data (100,000 reports) from the Society of Petroleum Engineers (SPE) and the Petroleum Lyell Collection from the Geological Society of London (GSL). The examples in this article relate to words associated with the term 'permeability'. Although this text is oil and gas related, the concept applies to other disciplines and industry sectors.

• One of the research questions was to understand how different disciplines view the same concept term (permeability) using industry literature as a surrogate for the discipline.

• For the query term ‘permeability’ Figure 1 below shows the associative term differences between the two collections (SPE for Petroleum Engineers, GSL for Geoscientists).

• The terms are colour coded using a derivation of the NASA SWEET Ontology. For example, properties are dark blue, realms are gold, human techniques are red, natural phenomena are light blue, materials are yellow.

Page 4: Teaching machines about a subject domain

Associated words

Figure 1: The 'fingerprints' are ordered by popularity: on the left, in the SPE collection, 'relative' is the most popular associated word, whereas in the GSL dataset 'porosity' is the most associated word. The first 20 associations are shown, but the underlying data is more extensive.

Page 5: Teaching machines about a subject domain

Discussion

• There are interesting patterns (that reflect an engineering rather than a geoscience perspective on the term 'permeability'). For example, the terms 'fault' and 'facies', which are highly associated with permeability in the GSL collection, simply do not appear in the SPE associative network. Conversely, the terms 'formation' and 'curves', which are highly associated in the SPE collection, do not appear in the GSL associative network. There are also differences of emphasis: for example, the term 'matrix' is the 9th most popular association (to permeability) in the SPE collection, whereas in the GSL collection it comes in at number 33.

• It is tempting to think of associative networks in a positivist fixed way. So for a term such as ‘permeability’, there is a ‘right answer’ out there, in terms of the words most associated with that term. Analyzing Wikipedia text may give us an answer. However, there are likely to be multiple realities based on the discipline and other contexts. An engineer may have different preferences for content with certain associations compared to a geoscientist.

Page 6: Teaching machines about a subject domain

Intent

• The meaning is the same; only the intent differs.

• Sometimes word associations are used to help disambiguate terms, for example between 'Jaguar' the big cat and 'Jaguar' the car. By using the words that appear around each different context within text, we can teach machines to understand the difference.

• However, the sections above describe a different phenomenon. This is not about disambiguation because the meaning of the term 'permeability' is exactly the same. It’s the subtle context of its use/user intent which is different. Normally these subtle discipline differences get smoothed out or ignored.
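The disambiguation idea mentioned above can be sketched as a context-overlap comparison. This is a deliberately crude illustration: the sense profiles here are hand-made, whereas a real system would learn them from corpus co-occurrence statistics.

```python
def disambiguate(context_words, sense_profiles):
    """Choose the sense whose profile of typical context words
    overlaps most with the words around the ambiguous term."""
    return max(
        sense_profiles,
        key=lambda s: len(set(context_words) & sense_profiles[s]),
    )

# illustrative (hand-made) sense profiles, not learned from a corpus
profiles = {
    "jaguar_cat": {"prey", "jungle", "spots", "predator"},
    "jaguar_car": {"engine", "drive", "sedan", "dealer"},
}
```

The permeability case is different precisely because both collections would produce overlapping profiles for the *same* sense; only the emphasis of the surrounding terms shifts.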

Page 7: Teaching machines about a subject domain

Influencing search results

• So what are the implications? Enterprise search & discovery engines, including 'recommender systems', could use this associative information, based on the role of the user, to 'understand' what is important and what may be of interest, even what is unusual. If an engineer searches for information related to permeability (for example), it may be possible to use associative data from industry datasets to train the search engine and influence what results and suggestions are returned, improving relevance. A geoscientist may get a different result, reflecting their role and slant on user intent.
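One simple way a search engine could act on role-specific associations is query expansion. The sketch below is hypothetical: the association lists and their counts are made up for illustration, not taken from the SPE/GSL datasets.

```python
def expand_query(query_terms, role_associations, per_term=2):
    """Expand a query with role-specific associated terms, so the same
    query retrieves differently slanted results for different roles."""
    expanded = list(query_terms)
    for t in query_terms:
        # add the top associates for this term, skipping duplicates
        for assoc, _count in role_associations.get(t, [])[:per_term]:
            if assoc not in expanded:
                expanded.append(assoc)
    return expanded

# illustrative (term, co-occurrence count) lists per role
engineer = {"permeability": [("relative", 900), ("curves", 400)]}
geoscientist = {"permeability": [("porosity", 800), ("fault", 350)]}
```

An engineer's query for 'permeability' would thus be biased towards 'relative' and 'curves', a geoscientist's towards 'porosity' and 'fault'.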

Page 8: Teaching machines about a subject domain

Pointwise Mutual Information Measure

• Another useful way to rank word associations is not by frequency of occurrence, but by the Pointwise Mutual Information (PMI) measure. This ranks associative terms by how often they occur with a given term, relative to how often they occur throughout the text collection as a whole. In other words, a term with a very high PMI (for example, to the term 'permeability') will almost always occur ONLY in association with the term 'permeability' (in a given text collection).
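A minimal sketch of PMI ranking, assuming the same sentence-level co-occurrence window as before (real deployments typically also apply frequency cut-offs, since rare terms get inflated PMI scores; see the Church and Hanks reference at the end):

```python
import math
from collections import Counter

def pmi_rankings(sentences, term):
    """Rank associates of `term` by Pointwise Mutual Information:

        PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )

    with probabilities estimated from sentence-level co-occurrence.
    A high PMI means the associate rarely appears except alongside
    `term`, even if the raw co-occurrence count is modest.
    """
    n = len(sentences)
    occurs = Counter()   # number of sentences each word appears in
    joint = Counter()    # sentences where a word co-occurs with `term`
    for tokens in sentences:
        words = set(tokens)
        occurs.update(words)
        if term in words:
            joint.update(words - {term})
    p_term = occurs[term] / n
    scores = {
        w: math.log2((c / n) / (p_term * (occurs[w] / n)))
        for w, c in joint.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```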

Page 9: Teaching machines about a subject domain

Pointwise Mutual Information Measure

Figure 2 shows word associations for permeability ranked by PMI for the SPE and GSL collections.

Page 10: Teaching machines about a subject domain

Pointwise Mutual Information Measure

• This technique is useful to identify 'clues' (for auto-categorization and auto-classification of text) that can be used to infer whether text is discussing a certain topic or concept, even if that term is not mentioned explicitly. So sentences, paragraphs or even 'documents' mentioning 'darcies', 'kr' or 'klinkenberg' are likely to be about 'permeability' even if that term is not mentioned.
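A crude sketch of such clue-based inference (the clue set below is hypothetical, echoing the examples above; a real categorizer would derive clues from PMI scores and weight them):

```python
def about_topic(text, clues, threshold=1):
    """Infer whether a passage is about a topic from 'clue' terms
    (e.g. high-PMI associates), even when the topic term is absent."""
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in clues)
    return hits >= threshold

# hypothetical clue set for 'permeability', per the examples above
permeability_clues = {"darcies", "kr", "klinkenberg"}
```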

Compound terms

• The associative networks can obviously be applied to compound terms; an example for a bigram is shown below using the SPE collection. Removing bigrams that mention the term you are looking at (permeability) helps remove taxonomic-type categorizations (semantic similarity) to focus on semantic relatedness. From Figure 3 it is clear that 'capillary pressure' is the most associated compound term with 'permeability' in the SPE collection, occurring 857 times.
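Bigram association counting, including the filtering of bigrams that contain the focus term, can be sketched as follows (the sample data in the test is invented; the 857 count cited above comes from the actual SPE collection):

```python
from collections import Counter

def bigram_associations(sentences, term, exclude_term=True):
    """Count bigrams (adjacent word pairs) in sentences that mention
    `term`; optionally drop bigrams containing `term` itself, to
    favour semantic relatedness over semantic similarity."""
    counts = Counter()
    for tokens in sentences:
        if term not in tokens:
            continue
        for a, b in zip(tokens, tokens[1:]):
            if exclude_term and term in (a, b):
                continue
            counts[(a, b)] += 1
    return counts.most_common()
```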

Page 11: Teaching machines about a subject domain

Compound terms

Figure 3 – Word association (bigram compound term) ranked by popularity for permeability in the SPE collection.

Page 12: Teaching machines about a subject domain

Discussion

• There are many more sophisticated methods using rules and ontologies that can help capture the meaning behind some term associations. Natural Language Processing (NLP) can help remove unwanted associations: for the sentence 'Zone 24B was not permeable', for example, you would probably not want the machine to count Zone 24B as associated with being permeable. Taxonomies can also help resolve acronyms, synonyms and hypernyms for more accurate statistical counting of associations.

• What you use depends on the purpose for which you are building the associative network and the effort you want to put in. Many organizations are embracing text analytics as part of their search deployment, but some are not.
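The simplest version of the negation handling mentioned above is to drop sentences containing a negation cue before counting. This is only a crude filter; proper NLP pipelines determine negation *scope* with parsing, so treat the sketch below as illustrative:

```python
import re

# crude negation cues; real systems resolve scope via parsing
NEGATIONS = re.compile(r"\b(not|no|never|non-?)\b", re.IGNORECASE)

def filter_negated(sentences):
    """Drop sentences containing a negation cue, so that
    'Zone 24B was not permeable' does not contribute an
    association between Zone 24B and permeability."""
    return [s for s in sentences if not NEGATIONS.search(s)]
```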

Page 13: Teaching machines about a subject domain

Summary

• Many organizations are sitting on a wealth of unstructured text. There are many open-source and free tools that can help build large-scale associative networks in either unsupervised or semi-supervised ways.

• With exponentially increasing volumes of information, much information is being ranked or suggested by popularity. That may effectively 'censor' some information through its obscurity. With an increasing need for search engines to 'show me something I don't already know', there appears to be a need to revisit 'relevance' algorithms. What is most popular is not necessarily what is most interesting.

• Search engines may increasingly be the way in which 'we come to know'. If a corpus is the starting point, allowing the user to explore associative networks in various ways (not just by popularity), as a series of clickable facets, may mitigate the issues presented by a classic search box, where a searcher's own existing knowledge of keywords may hamper them from finding out what they don't know. In other words, the agency of the searcher using traditional search engines may limit their ability to discover something they have no advance knowledge of. Exploiting associative networks may be a useful way of discovering new knowledge.

Page 14: Teaching machines about a subject domain

References

• Chuang, J., Manning, C.D., Heer, J. (2012). "Without the Clutter of Unimportant Words": Descriptive Keyphrases for Text Visualization. ACM Transactions on Computer-Human Interaction, 19(3).

• Church, K.W., Hanks, P. (1990). Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1), 22-29.

• Harris, Z. (1954). Distributional Structure. Word, 10(2-3), 146-162.

• Hillis, K., Petit, M., Jarrett, K. (2013). Google and the Culture of Search. UK: Routledge.

• Kruschwitz, U. (2014). Utilizing User Access Patterns in Enterprise Search. Real Artificial Intelligence. British Computer Society, October 10th 2014, London, UK. Online Article (http://www.bcs-sgai.org/realai2014/slides/kruschwitz.pdf, accessed January 2016).

• Smith, P. (2015). Woodside Petroleum searches for data value with IBM's Watson cognitive computing. Online Article (http://www.afr.com/technology/woodside-petroleum-searches-for-data-value-with-ibms-watson-cognitive-computing-20150521-gh6un7).

• Turney, P.D., Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, 37, 141-188.

Page 15: Teaching machines about a subject domain

Thank you for listening

Paul H. Cleverley

http://www.rgu.ac.uk/dmstaff/cleverley-paul

[email protected]