How structured data (Linked Data) help in Big Data Analysis --- Expand Patent Data with Linked Data Cloud Lishan Zhang Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-2013-96 http://www.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-96.html May 17, 2013
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Explanation of Results
What is different
Limitation of this approach
User Study
An endpoint is an association between a fully specified Interface Binding and a network
address, specified by a URI. It is used to communicate with an instance of a Web
Service. An endpoint indicates a specific location for accessing a Web Service using a
specific protocol and data format. [19] A SPARQL endpoint enables users to query a
knowledge base via the SPARQL language. Results are typically returned in one or
more machine-processable formats such as XML or JSON. For simplicity, we can say that a
SPARQL endpoint is the place where you send your SPARQL query and receive the result.
Commonly used SPARQL endpoints are listed below (SparqlEndpoints, 2013):
Data Source     Endpoint Address
DBpedia         http://dbpedia.org/sparql
U.S. Census     http://www.rdfabout.com/sparql
FactForge       http://factforge.net/sparql
data.gov.uk     http://data.gov.uk/sparql
In our project, we need to query the biographical information of the patent inventor
from DBpedia through a SPARQL endpoint. The information about a given person is the
same as what we see in Wikipedia, but in a different format. For example, for
our professor David A. Patterson, the Wikipedia page and the DBpedia page are shown
below. They represent the same information quite differently. In DBpedia, the data is
machine-readable: each property is listed on the left side with its value beside it.
We only need to select the properties we want in the SPARQL query to get the
corresponding values conveniently.
Fig 3: Screenshot of an example of Wikipedia
Fig 4: Screenshot of an example of DBpedia
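Selecting properties in a SPARQL query can be sketched as follows. This is a minimal Python illustration, not the prototype's own JavaScript. The function name `build_person_query` is hypothetical, and the property names (`dbo:abstract`, `dbo:almaMater`, `dbo:doctoralAdvisor`) and the lookup by `rdfs:label` are assumptions; the properties the prototype actually requested may differ, and a given name is not guaranteed to match a DBpedia label.

```python
def build_person_query(name, lang="en"):
    """Return a SPARQL query for a few DBpedia properties of a person.

    The person is looked up by an exact rdfs:label match (an assumption).
    """
    # Escape backslashes and double quotes so the name is a valid string literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return f"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?abstract ?almaMater ?advisor WHERE {{
  ?person rdfs:label "{safe}"@{lang} .
  OPTIONAL {{ ?person dbo:abstract ?abstract . FILTER (lang(?abstract) = "{lang}") }}
  OPTIONAL {{ ?person dbo:almaMater ?almaMater . }}
  OPTIONAL {{ ?person dbo:doctoralAdvisor ?advisor . }}
}}
LIMIT 1
""".strip()


query = build_person_query("David A. Patterson")
print(query)
```

The OPTIONAL clauses keep the query from returning nothing when a person lacks one of the properties.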
HTTP request
The Hypertext Transfer Protocol (HTTP) works as a request-response protocol
between a client and a server. An HTTP request consists of a request method, a
request URL, header fields and a body. The request methods include GET, HEAD, POST,
PUT, DELETE, OPTIONS and TRACE. [20] The two most commonly used HTTP request
methods are GET and POST. Although both can carry a query to the server, GET
requests data from a specified resource, while POST submits data to be
processed by a specified resource. We use the POST method here to avoid caching.
In our case, the client is the Search Interface, which uses JavaScript to submit an
HTTP request containing the SPARQL query to the server endpoint. The server then returns a
response to the client. The response contains status and content information about
the request. Because JavaScript does not handle RDF data conveniently, we set the
return format to JSON.
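The request described above can be sketched in Python as follows (the prototype itself used JavaScript). The helper name `build_sparql_request` is hypothetical. The sketch assumes the endpoint accepts a URL-encoded `query` parameter in a POST body and honors the `application/sparql-results+json` media type, as the SPARQL protocol describes. It only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"


def build_sparql_request(query, endpoint=DBPEDIA_ENDPOINT):
    """Build (but do not send) a POST request carrying a SPARQL query."""
    # The query travels in the body as a form field named "query".
    body = urlencode({"query": query}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={
            # Ask for JSON results instead of the default representation.
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


req = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 1")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without network access.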
User Interface design
The User Interface (UI) design for our prototype is simple and clean. It looks like a
simplified Wikipedia page. We query both the Patent Data and the Linked Data Cloud
and display the output in the interface. The interface places the patent information
in the center, surrounded by information about the inventor of the patent.
The screenshot is shown below:
Fig 5: Screenshot of Patent Search Interface
The left side contains the inventor's basic information, including his profile picture,
workplace, alma mater and doctoral advisor. The upper right side shows a biography of the
inventor, followed by his patent information retrieved from the relational database.
Clicking a link on the left side leads to the corresponding Wikipedia page for more
information. The UI design emphasizes the patent part and arranges the relevant
information around it.
The procedure
The procedure works as follows:
On the client side, when a user searches for a keyword, an HTTP request message is
sent to the DBpedia web server. We wrote a wrapper class, “SPARQLWrapper.js”, in
JavaScript that is similar to the SPARQL Endpoint interface to Python. [21]
The SPARQL endpoint address is http://dbpedia.org/sparql. We send the request with
the searched title and some properties, such as abstract and workplaces, to the server
endpoint. By default it returns an HTML page, which is not what we need, so we set the
Accept field in the request header to specify the return data type; here we ask
for JSON. We use the GET and POST methods to send the SPARQL query.
The web server then provides the resources and returns a response message to the
client. The response message is read by JavaScript, written into HTML and displayed in
the User Interface.
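Reading the response can be sketched as follows, again in Python rather than the prototype's JavaScript. The sketch assumes the response body follows the standard SPARQL JSON results layout (`results` containing a list of `bindings`). The function name `bindings_to_rows` and the sample document with its example.org values are hand-written placeholders, not real DBpedia output.

```python
import json


def bindings_to_rows(response_text):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    doc = json.loads(response_text)
    rows = []
    for binding in doc["results"]["bindings"]:
        # Each cell is an object like {"type": "uri", "value": "..."}.
        rows.append({var: cell["value"] for var, cell in binding.items()})
    return rows


# Hand-written placeholder response, for illustration only.
sample = json.dumps({
    "head": {"vars": ["person", "almaMater"]},
    "results": {"bindings": [{
        "person": {"type": "uri",
                   "value": "http://example.org/resource/Example_Inventor"},
        "almaMater": {"type": "uri",
                      "value": "http://example.org/resource/Example_University"},
    }]},
})

rows = bindings_to_rows(sample)
print(rows)
```

Variables left unbound by an OPTIONAL clause are simply absent from a binding, so callers should read the flattened rows with `dict.get`.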
For the Patent Data part, there are two main approaches. One approach is
to keep the Patent Data in a relational database and query it from the local
database. The other approach is to convert it to RDF format and store it in a triple
store, or even publish it on the web. The first approach is efficient because we only
need to look up the patent information matching the search keyword, which is
convenient with a relational database. The bottleneck is how to store the
data: the whole dataset could be saved locally or uploaded to Google Datastore.
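The first approach can be illustrated with a small relational lookup. The sketch below uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the local patent database. The table layout, the column names, the function `find_patents` and all rows are hypothetical placeholders, not the actual Patent Data schema.

```python
import sqlite3


def find_patents(conn, inventor_name):
    """Return (patent_number, title) rows for an exact inventor-name match."""
    cur = conn.execute(
        "SELECT patent_number, title FROM patents "
        "WHERE inventor = ? ORDER BY patent_number",
        (inventor_name,),  # parameter binding avoids SQL injection
    )
    return cur.fetchall()


# In-memory database with placeholder rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patents (patent_number TEXT PRIMARY KEY, inventor TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO patents VALUES (?, ?, ?)",
    [
        ("X0000002", "Example Inventor", "Placeholder title B"),
        ("X0000001", "Example Inventor", "Placeholder title A"),
        ("X0000003", "Another Inventor", "Placeholder title C"),
    ],
)

patents = find_patents(conn, "Example Inventor")
print(patents)
```

Because the lookup uses an exact match, a partial name such as "Example" returns no rows.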
The second approach is more complex because we need to pre-process the whole
dataset and convert it to RDF format. Since the Patent Data is quite large, many
existing tools like Google Refine cannot hold such a large amount of data. The
advantage of the second approach is that the Patent Data can be interlinked with other
Linked Data, which makes the Patent Data more widely available.
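To give a sense of what the conversion in the second approach involves, the sketch below turns one relational row into RDF triples in N-Triples syntax. It is only an illustration: the `http://example.org/` namespace, the `vocab/title` and `vocab/inventorName` predicates, the function names and the row itself are all hypothetical, and a real conversion would need an agreed vocabulary and links to existing Linked Data resources.

```python
def nt_literal(text):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def row_to_ntriples(row, base="http://example.org/"):
    """Convert one patent row (a dict) into a list of N-Triples lines.

    Assumes the patent number contains only characters that are safe in an IRI.
    """
    subject = f"<{base}patent/{row['patent_number']}>"
    return [
        f"{subject} <{base}vocab/title> {nt_literal(row['title'])} .",
        f"{subject} <{base}vocab/inventorName> {nt_literal(row['inventor'])} .",
    ]


triples = row_to_ntriples({
    "patent_number": "X0000001",
    "title": 'A "quoted" title',
    "inventor": "Example Inventor",
})
print("\n".join(triples))
```

Each row expands into one triple per column, which is part of why pre-processing a large dataset is costly.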
Since the large amount of data is always a problem, we begin with a small
subset and go from there. For example, we can start with the patents of Berkeley
professors.
Discussion
In this section, I will mainly discuss the use case in which we bring Linked Data into
Patent Data search. I will also discuss how Linked Data helps in patent search, what
its limitations are and how Linked Data can be used in a broader context. I also evaluate
the User Interface of the search engine and test it with real users.
Results
Explanation of Results
For our Capstone Project, we explore the potential use of Linked Data
to help Big Data Analysis, and we are building a patent search engine based on
these two concepts. Linked Data has many advantages: it is highly structured,
machine-readable and interlinked across different data sources. We take
advantage of the structured data format of Linked Data and use it to expand the
patent search result and add more value to it. Our prototype supports the
hypothesis that Linked Data works in this situation, which has many other
implications.
Here is our User Interface after searching for a certain patent:
Fig 6: Screenshot of Patent Search Result
The screenshot shows that association information is added to the patent search
result. Here we add some Wikipedia information about the inventor. In this way,
users can easily identify the exact inventor by looking at the
biography or related information such as workplace, alma mater and doctoral
advisor. This helps disambiguate patents, since many people share the same name
but work in different areas and hold totally different
patents.
In addition, users can search for the patents of coworkers by clicking their
names on the page. If users are interested in the workplace or alma mater,
they can click the link, which leads them to the Wikipedia page for that
item.
With the help of Open Linked Data, we have a new kind of patent association search
that disambiguates the patent search and provides a broader context of
patent-related information.
What is different
Our implementation differs from our initial ideas in several ways.
First, for the patent data, we retain its relational database format and query it with
SQL rather than converting it into RDF format. We did build a
small prototype to convert the data using Google Refine, but the conversion becomes very
complex with a large amount of data, and converting the data format is not necessary
in our use case. So we decided to query the relational database directly and
combine the result with inventor information from Linked Data.
We also decided to store the patent data locally and use PHP to query the relational
database and send the result back to the client side in JSON format. We found this to
be the most efficient approach at this stage. If time permits, we would move the data
to a cloud server so that we can run the search engine remotely.
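Combining the two result sets into one JSON payload might look like the sketch below. The prototype did this with PHP and JavaScript; this Python version only illustrates the shape of the idea. The field names (`inventor_profile`, `patents`, `number`, `title`), the function `build_search_result` and the placeholder row are assumptions.

```python
import json


def build_search_result(patents, profile):
    """Combine relational patent rows with an optional Linked Data profile.

    `patents` is a list of (number, title) tuples; `profile` is a dict of
    inventor properties, or None when no Linked Data entry was found.
    """
    return {
        "inventor_profile": profile,
        "patents": [{"number": n, "title": t} for n, t in patents],
    }


# Placeholder patent row with no matching profile.
payload = json.dumps(build_search_result([("X0000001", "Placeholder title A")], None))
print(payload)
```

Keeping the profile optional lets the patent list still be returned when the inventor has no Linked Data entry.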
Limitation of this approach
There are also some limitations of our patent search.
First, we assume that the inventor has a Wikipedia page so that we
can find the corresponding information in DBpedia. However, this is not always
the case. Although more and more people have their own Wikipedia page, not
everyone who holds a patent has one. In such cases, we cannot find their information
in the Linked Data Cloud, so the search result cannot be expanded with a profile.
Second, the user needs to type the full name of the inventor in order to match
the name in DBpedia with the inventor name in the patent database. Compared with
Google Patent Search, this is limiting, because Google can find a lot of
information based on its ranking even if we do not type the full name.
Third, we use the patent data in its original format and run two queries, one
against DBpedia and one against the relational database. This does not make the best
use of Linked Data, whose advantage over other formats is that all data share the same
format and different datasets can be interlinked. It would be better
to convert the patent data into RDF format and even publish it
in the Open Linked Data Cloud. The patent data would then be
interlinked with the other data sources in the cloud and would make better use of
the Linked Data concept.
Evaluation
In the evaluation part, I mainly discuss the User Interface we built for patent
search and the effectiveness and convenience of the search experience for real users.
User Study
We asked people from different fields to take part in usability experiments with
the search engine and made some changes based on their feedback.
Most of them think that the patent association search result is better than
that of the traditional approach. When searching for patents, they are often unsure
whether they have found the right one. With our prototype they can easily get
information about the inventor and therefore gain a correct and comprehensive
understanding of the information they retrieve.
They think that our patent search produces clear output with the associated information
and that it can also run related searches. They also point out a limitation of the
approach: we only have basic information about the patent itself. If users would
like to know more details about the patent, we cannot provide them because
that information is not in the Patent Database.
Heuristic Evaluation
We examine our User Interface with the well-known 10 Usability Heuristics introduced
by Jakob Nielsen. Heuristic evaluation is a usability engineering method for finding
usability problems in a user interface design. [22] A small set of evaluators examines
the interface against the recognized usability principles and rates each on a scale
from one to ten; we then combine the results of the evaluation.
We asked our users to go through a set of tasks we designed in our search interface,
explained the goals of the system to the evaluators and allowed them to carry out their
own tasks. After that, they filled out the Heuristic Evaluation sheet.
The Heuristic Evaluation Sheet is designed as follows:
Heuristic Evaluation principles    Points (1-10)    Comments
Visibility of system status
Match between system and the real world
User control and freedom
Consistency and standards
Error prevention
Recognition rather than recall
Flexibility and efficiency of use
Aesthetic and minimalist design
Help users recognize, diagnose, and recover from errors
Help and documentation
We analyzed the results provided by the real users and explain the evaluation results
below. A principle is rated Good if its average score is more than 6 out of 10;
otherwise it is rated Need to improve.
(1). Visibility of system status: Good (8.7)
Our interface has a clear layout, and its components do not overlap
when displayed. Users can easily see whether they have obtained the search result and
what the information looks like.
(2). Match between system and the real world: Good (8.2)
The interface resembles the Wikipedia format for showing the biographical information
and puts the patent at the front of the page so that it is easy to understand.
(3). User control and freedom: Good (7.1)
Users can search for a new patent by using the textbox in the upper left corner or by
simply clicking the information on the page.
(4). Consistency and standards: Need to improve (5.8)
The search textbox only supports searching by existing patent numbers and
some inventor information, so users may be confused about what they should enter
at first.
(5). Error prevention: Need to improve (5.0)
We did not build auto-completion or auto-correction, so users
need to type correctly in order to get a result.
(6). Recognition rather than recall: Good (7.5)
We have minimized the user’s memory load by making the objects and actions
visible. Users do not have to remember information; they can simply click on earlier
results.
(7). Flexibility and efficiency of use: Good (7.2)
The difference between novice and expert users is small because the search feature
requires no complicated actions.
(8). Aesthetic and minimalist design: Good (6.8)
The interface contains the most relevant and needed information and gives
extra information lower visibility.
(9). Help users recognize, diagnose, and recover from errors: Need to improve (5.7)
If users type a name that does not exist in Wikipedia or make a
typo, there is no error message to indicate the problem precisely.
(10). Help and documentation: Need to improve (5.5)
We did not implement documentation to help users understand the
functionality of the search engine. Normally people will understand it because the
interface looks like other search engines.
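The rating rule can be checked mechanically against the averages reported above. The short Python sketch below applies the "more than 6 out of 10" threshold to the ten reported scores; the variable and function names are illustrative.

```python
# Average scores as reported for the ten heuristics above.
SCORES = {
    "Visibility of system status": 8.7,
    "Match between system and the real world": 8.2,
    "User control and freedom": 7.1,
    "Consistency and standards": 5.8,
    "Error prevention": 5.0,
    "Recognition rather than recall": 7.5,
    "Flexibility and efficiency of use": 7.2,
    "Aesthetic and minimalist design": 6.8,
    "Help users recognize, diagnose, and recover from errors": 5.7,
    "Help and documentation": 5.5,
}

THRESHOLD = 6.0  # a principle is Good only if its average exceeds this value


def rate(average):
    """Apply the rating rule: strictly more than 6 out of 10 is Good."""
    return "Good" if average > THRESHOLD else "Need to improve"


ratings = {name: rate(avg) for name, avg in SCORES.items()}
needs_work = [name for name, label in ratings.items() if label == "Need to improve"]
print(needs_work)
```

Under this rule six principles are rated Good and four are rated Need to improve, matching the labels listed above.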
Future Work
Enriching the functionality of the Patent Search
So far we have focused only on how to combine Linked Data and relational data
to make the patent search more convenient, so we use only a limited amount of
information collected from a single source in the Open Linked Data Cloud. There is much
more we can do to enrich the functionality of the Patent Search. For example,
we can take the geographic information in the Patent Data and visualize it
using the GeoNames data from the Linked Data Cloud. We can also visualize a
patent search graph to show the relationships between different inventors and
their patents more explicitly.
Querying a Collection of Datasets in Linked Data
We query only DBpedia for this project. But since Linked Data is
interlinked, we may be able to query a collection of datasets through an existing
SPARQL endpoint that has access to copies of the relevant datasets. For example,
OpenLink Software hosts a majority of the datasets from the LOD cloud behind a SPARQL
endpoint. [23]
Applying the concept to other topics
Currently we combine the patent data with DBpedia in the Linked Data Cloud. There
are many other sources in the Linked Data Cloud that we could use, such as GeoNames
data, IMDB data, BBC Music and so on. We may make use of these sources to find other
applications. For example, we could search for a certain singer and get
the relevant biographical information along with their albums and songs from different
data sources.
Conclusions or Impact Statement
Our capstone project is a research project that explores the potential use of
Linked Data in Big Data analysis. We did some research on Big Data to learn the
existing approaches to analyzing Big Data and their strengths and weaknesses. We
concluded that highly structured Linked Data might be a potential solution for
unstructured Big Data analytics and could uncover more value in Big Data.
Based on that, we are building a search engine to demonstrate how Linked Data helps in
Big Data Analysis by expanding the Patent Data with the Open Linked Data Cloud. In
this way, we can find patent association information through the
Linked Data Cloud and combine it with the patent search to give a comprehensive
answer.
Although we have learned a lot about the mechanism of Linked Data and used it in
our prototype, much remains to be learned. For example, we query only
a single source in the Linked Data Cloud; we may explore multiple queries
against different sources, or directly convert the Patent Data into RDF format and
publish it in the Linked Data Cloud.
The strength of Linked Data is its structured and uniform format, which lets information
be shared among different datasets and read automatically by
computers. Yet we still need to address drawbacks such as the complicated
pre-processing procedures and the question of how to protect the data available on
the web.
Our prototype has shown that linked data has many advantages and can be used in
data analysis in different situations. We see a bright future for making better use
of linked data and the semantic web to help in big data analysis.
Bibliography
1. Ian Mitchell, Mark Wilson. Linked Data: Connecting and exploiting big data. London : Fujitsu UK, 2012.
2. Dumbill, Edd. What is big data? An introduction to the big data landscape. [Online] January 11, 2012. http://strata.oreilly.com/2012/01/what-is-big-data.html.
3. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers. Big Data: The next frontier for innovation, competition, and productivity. s.l. : McKinsey Global Institute, 2011. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
4. Roebuck, Kevin. Big Data: High-impact Strategies - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors. s.l. : Lightning Source Incorporated, 2011.
5. Richard Cyganiak, Anja Jentzsch. Linking Open Data cloud diagram. [Online] 2011. http://lod-cloud.net/.
6. Hopkins, Brian and Evelson, Boris. Expand Your Digital Horizon with Big Data. s.l. : Forrester, 2011.
7. Oreskovic, Alexei. YouTube, Google Inc's video website, is streaming 4 billion online videos every day, a 25 percent increase in the past eight months, according to the company. [Online] Jan. 23, 2012. [Cited: Nov. 30, 2012.] http://www.reuters.com/article/2012/01/23/us-google-youtube-idUSTRE80M0TS20120123.
8. twittersearch. The Engineering Behind Twitter’s New Search Experience. [Online] May 31, 2011. [Cited: Nov. 30, 2012.] http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html.
9. O'Brien, Kevin. Why Media Literacy? A Catholic Reflection. [Online] [Cited: Nov. 30, 2012.] http://www.medialit.org/reading-room/why-media-literacy-catholic-reflection.
10. IDC European Software Predictions. Woodward, Alys, et al. 2012, IDC.
11. IDC Worldwide Big Data Taxonomy. Woo, Benjamin, et al. 2011.
12. Jeffrey Dean and Sanjay Ghemawat. 2004, OSDI, p. 13.