USPTO Patent Data Source and Data Extraction Mandy Dang MIS 580 University of Arizona 02-06-2008
Dec 19, 2015

USPTO Patent Data Source and Data Extraction

Mandy Dang

MIS 580

University of Arizona

02-06-2008

Outline

• Patent

• USPTO

• Search USPTO Patents

• Data Extraction: Case Study of NSE Patents

Patent

• A "patent" usually refers to a right granted to anyone who invents or discovers any new and useful process, machine, article of manufacture, or composition of matter, or any new and useful improvement thereof.
  – A patent is not a right to practice or use the invention. Rather, it provides the right to exclude others from making, using, selling, or offering the invention for sale, usually for 20 years from the filing date.
  – It is a limited property right that the government offers to inventors in exchange for their agreement to share the details of their inventions with the public.

• A patent is a special type of technology document that records many important innovations and technology advances.

USPTO

• The United States Patent and Trademark Office (USPTO) is an agency in the United States Department of Commerce that provides patent protection to inventors and businesses for their inventions, and trademark registration for product and intellectual property identification.

• Each year, the USPTO issues thousands of patents to companies and individuals worldwide. As of March 2006, the USPTO had issued over 7 million patents, with 3,500 to 4,500 patents newly granted each week.

• The USPTO provides online full-text access to patents issued since 1976.

• URLs:
  – USPTO Official Website: http://www.uspto.gov/
  – USPTO Patent Search: http://www.uspto.gov/main/search.html

Search USPTO Patents

http://www.uspto.gov/main/search.html

Data Extraction: Case Study of NSE Patents

• Nanoscale Science and Engineering (NSE) field
  – Fundamental technology that is critical for a nation’s technological competence.
  – Revolutionizes a wide range of application domains.

• Nanotechnology
  – Is an applied science/technology field that is multidisciplinary and encompasses engineering and other work taking place at the nanoscale.
  – Critical for a nation’s technological competence.
  – R&D status attracts various communities’ interest.

Data Extraction Procedure

• The goal is to gather all the related patents from the USPTO Web site as free-text HTML pages, parse them into structured data, and store the results in a database.

• Procedure for extracting NSE patents from the USPTO:
  1. Spider search results (summary pages)
  2. Spider individual patent documents (detailed pages)
  3. Noise filtering
  4. Parsing

1. Spider search results (summary pages)

• A list of keywords can be used to search for patents related to the NSE domain. The keywords were provided by domain experts.

• A spider program written in Perl was used to spider the search result pages.

Keywords:
atomic force microscope
atomic force microscopic
atomic force microscopy
atomic-force-microscope
atomic-force-microscopy
atomistic simulation
biomotor
molecular device
molecular electronics
molecular modeling
molecular motor
molecular sensor
molecular simulation
nano*
quantum computing
quantum dot*
quantum effect*
scanning tunneling microscope
scanning tunneling microscopic
scanning tunneling microscopy
scanning-tunneling-microscope
scanning-tunneling-microscopy
self assembled
self assembling
self assembly
selfassembl*
self-assembled
self-assembling
self-assembly

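Example code

use HTML::TokeParser;
use LWP;
use URI::Escape;
use strict;

sub query
{
    … … … …

    # Get keywords: one search term per line, from the file named by the first argument
    open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    my @keywords = <$in>;
    close($in);

    … … … …

    # Download search pages: fetch result page $pno for keyword $kw,
    # restricted to patents issued between $start and $end (FIELD2=ISD)
    my $query_url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=$pno&u=%2Fnetahtml%2Fsearch-bool.html&r=0&f=S&l=50&TERM1=$kw&FIELD1=&co1=AND&TERM2=$start%3E$end&FIELD2=ISD&d=ptx";
    my $response = $browser->get($query_url);
    my $result = $response->content();

    # Save the raw HTML of the result page to its own file
    open(my $out, '>', "$fpage-$pno.html") or die "Cannot write $fpage-$pno.html: $!";
    print $out $result;
    close($out);
}

# Set up time range
query('1/1/2007', '12/31/2007');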
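The Perl example places the keyword and the date range straight into the query string, and the part that escapes them is elided. As a rough sketch of how such a URL might be assembled with the values escaped, here is a small Java helper. The class and method names (`PatftQuery`, `buildQueryUrl`) are made up for illustration, the parameter layout is copied from the slide, and the hyphens in `nph-Parser` and `search-bool.html` are assumed to have been lost in the transcript. The endpoint should be treated as historical; it may no longer respond.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class PatftQuery {
    // Base address as it appears on the slide (assumed spelling, see the note above).
    static final String BASE = "http://patft.uspto.gov/netacgi/nph-Parser";

    /** Builds the search URL for one keyword, one result page, and an issue-date range. */
    static String buildQueryUrl(String keyword, int page, String start, String end) {
        // URLEncoder turns spaces into '+', '/' into %2F, and leaves '*' untouched.
        String kw = URLEncoder.encode(keyword.trim(), StandardCharsets.UTF_8);
        String from = URLEncoder.encode(start, StandardCharsets.UTF_8);
        String to = URLEncoder.encode(end, StandardCharsets.UTF_8);
        return BASE
            + "?Sect1=PTO2&Sect2=HITOFF&p=" + page
            + "&u=%2Fnetahtml%2Fsearch-bool.html&r=0&f=S&l=50"
            + "&TERM1=" + kw + "&FIELD1=&co1=AND"
            // %3E is an escaped '>' separating the two dates, as in the slide's URL.
            + "&TERM2=" + from + "%3E" + to + "&FIELD2=ISD&d=ptx";
    }

    public static void main(String[] args) {
        System.out.println(buildQueryUrl("quantum dot*", 1, "1/1/2007", "12/31/2007"));
    }
}
```

Calling the helper once per keyword and per page number yields the list of summary pages to download.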

Search result page example (screenshot, with the patent IDs labeled)

2. Spider individual patent documents (detailed pages)

• In this step, we need to:
  – 1st, collect all the patent IDs;
  – 2nd, download all the patents based on the patent IDs by using proxies.

• The data set is often very large, so using proxies can save a lot of time.

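Download detailed patent documents

Create several files, each of which contains a fixed number of patent IDs (e.g., 300 patent IDs).

Server:

Send different patent ID files to different client threads.

… … … …

# Read the entries to hand out (one per line) from the file named by the first argument
open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
chomp(my @theids = <$in>);
close($in);

foreach my $theid (@theids) {
    # Wait for the next client to connect and read the line it sends
    my $new_sock = $sock->accept();
    my $buf = <$new_sock>;

    # Send this entry to the client, then log which client received it
    print $new_sock $theid . "\n";
    print $buf . " " . $theid . "\n";

    close $new_sock;
    … … … …

Client:

Use a proxy to download the patents whose IDs are in the file sent from the server.

… … … …

# Keep requesting the patent page until the download succeeds
do {
    $response = $browser->get($pat_url);
    if (!$response->is_success()) {
        # Report the failure, then pause briefly (1 to 7 seconds) before retrying
        select(STDOUT);
        print $response->status_line, "\n\n";
        sleep(rand(7) + 1);
    }
} while (!$response->is_success());

… … … …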
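The batching step described on this slide (fixed-size groups of patent IDs, e.g. 300 per file) can be sketched as follows. This is a minimal illustration rather than the code used in the study: `IdBatcher` and `split` are hypothetical names, the IDs in `main` are placeholders, and the batches are kept in memory instead of being written to files.

```java
import java.util.ArrayList;
import java.util.List;

class IdBatcher {
    /** Splits the ID list into consecutive batches of at most batchSize entries. */
    static List<List<String>> split(List<String> ids, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            int end = Math.min(i + batchSize, ids.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(ids.subList(i, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 650; i++) {
            ids.add("ID" + i); // placeholder IDs, not real patent numbers
        }
        for (List<String> batch : split(ids, 300)) {
            System.out.println(batch.size() + " IDs in this batch");
        }
    }
}
```

Each batch would then be written to its own file and handed to one client, so a failed client only affects its own batch.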

Patent document example

3. Noise filtering

• Some of the patents we gathered may match only noisy NSE keywords, and some may contain no NSE keywords at all.
  – Such patents need to be filtered out.

• Noise keywords include:
  – nanosecond
  – nanoliter
  – nano$
  – nano-second
  – nano-liter
  – nano.sub
  – nano [space]
  – nano2
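One way to apply this list is to look at every word that starts with "nano" and keep a patent only if at least one such word is not a noise term. The sketch below makes several assumptions that the slide does not confirm: the noise entries are treated as case-insensitive prefixes, "nano$" is taken literally, "nano [space]" is read as a bare "nano" followed by whitespace, and only the "nano*" keyword is checked (the other NSE keywords are ignored). `NoiseFilter` and its methods are hypothetical names.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoiseFilter {
    // Noise terms from the slide, used here as prefixes (an assumption).
    static final String[] NOISE_PREFIXES = {
        "nanosecond", "nanoliter", "nano$", "nano-second", "nano-liter", "nano.sub", "nano2"
    };

    // A hit is "nano" at a word start plus any non-space characters that follow it.
    static final Pattern NANO_HIT = Pattern.compile("\\bnano\\S*", Pattern.CASE_INSENSITIVE);

    static boolean isNoise(String hit) {
        String h = hit.toLowerCase(Locale.ROOT);
        if (h.equals("nano")) {
            return true; // bare "nano" followed by whitespace or end of text
        }
        for (String prefix : NOISE_PREFIXES) {
            if (h.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if the text contains at least one "nano" word that is not noise. */
    static boolean hasGenuineNanoTerm(String text) {
        Matcher m = NANO_HIT.matcher(text);
        while (m.find()) {
            if (!isNoise(m.group())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasGenuineNanoTerm("a carbon nanotube array"));
        System.out.println(hasGenuineNanoTerm("a pulse lasting 5 nanoseconds"));
    }
}
```

A patent whose text fails this check and matches none of the other keywords would be a candidate for removal.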

4. Parsing

• Extract the individual data fields from the HTML patent documents and load them into a database.

Database Design (USPTO)

USP_Patent
  PK: patentId
  Columns: issueDate, title, appSerialNumber, appDate, appType, attorneyAgent, primaryExaminer, assistantExaminer

USP_inventor
  PK: inventorId
  Columns: iLname, iMname, iFname, iCity, iState, iCoutnry

USP_Patent_Inventor
  PK: patentId, inventorId

USP_Assignee
  PK: AssigneeId
  Columns: aName, aCity, aState, aCountry

USP_Patent_Assignee
  PK: patentId, assgneeId

USP_OtherRef
  PK: refenceId
  Columns: CitingPatentId, reference

USP_usClass
  PK: patentId
  Columns: us_Class1, us_Class2, major

USP_Countryname
  PK: Country_lable
  Columns: Country_fullname

USP_Patent_Content
  PK: patentId
  Columns: Abstract, Title, Claim

USP_Patent_Citation
  PK: CitingPatentId, CitedPatentId
  Columns: CitedPatentDate

USP_Foreignref
  PK: CitingPatentId, CitedPatentID
  Columns: CitedPatentDate, CitedPatentSource

USP_Int_Class
  PK: patentId
  Columns: section, class, subclass, maingroup, subgroup

Parsing example: parsing assignee data

public static void processAssignees() throws IOException {
    … … … …

    String[] assignees = assigneeString.split("<BR>");
    for (int i = 0; i < assignees.length; i++) {
        currentassignee = assignees[i].trim();
        if (currentassignee.length() == 0)
            continue;
        currentassignee = currentassignee.replaceAll("\r\n", "");

        // Process assignee name: the text between <B> and </B>
        name = findBetween(currentassignee, 0, "<B>", "</B>");
        currPosition = currentassignee.indexOf("</B>") + "</B>".length();

        // Process assignee address: the parenthesized text after the name
        address = findBetween(currentassignee, currPosition, "(", ")");
        if (address == null) {
            System.err.println("wrong address: " + patentId);
            continue; // skip this entry instead of dereferencing a null address
        }
        int startIndex = 0, endIndex = 0;
        if ((endIndex = address.lastIndexOf(',')) >= 0) {
            // Everything before the last comma is the city part
            city = address.substring(0, endIndex);
            if (city.lastIndexOf(',') >= 0) {
                city = city.substring(city.lastIndexOf(',') + 1);
                // Strings are immutable, so the result has to be assigned back
                city = city.replaceAll("[^a-zA-Z]", "");
            }
            startIndex = endIndex + 1;
        } else
            city = "-";

        // What follows the last comma is either a US state or a country wrapped in <B>…</B>
        address = address.substring(startIndex);
        country = findBetween(address, 0, "<B>", "</B>");
        if (country == null) {
            country = "US";
            state = address.trim();
        } else
            state = "-";

        name = name.trim();
        city = city.trim();
        state = state.trim();

        // Keep the ranking order of assignees
        rank++;
    }
}
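The parsing code relies on a helper, `findBetween`, whose body is not shown on the slide. One plausible reading, inferred only from how it is called (a start offset, an opening delimiter, a closing delimiter, and a `null` result that the caller checks for), is sketched here. The class name `HtmlFields` and the sample string are invented for illustration; the original helper may behave differently.

```java
class HtmlFields {
    /**
     * Returns the text between the first occurrence of open at or after from
     * and the next occurrence of close, or null if either delimiter is missing.
     */
    static String findBetween(String text, int from, String open, String close) {
        int start = text.indexOf(open, from);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = text.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String sample = "<B>Acme Corp.</B> (Tucson, AZ)"; // made-up example entry
        String name = findBetween(sample, 0, "<B>", "</B>");
        int pos = sample.indexOf("</B>") + "</B>".length();
        String address = findBetween(sample, pos, "(", ")");
        System.out.println(name + " | " + address);
    }
}
```

Returning `null` instead of an empty string lets the caller tell a missing field apart from an empty one, which is what the address check in the parsing code depends on.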

Data Analysis Examples

• Bibliographic analysis
  – Top 50 countries

select c.countryName, count(distinct b.patentId)
from usp_assignee a, usp_patentAssignee b, usp_countryName c
where a.assigneeId = b.assigneeId
  and a.aCountry not in ('unknown', '')
  and a.aCountry = c.countryCode
group by c.countryName
order by count(distinct b.patentId) desc

Rank  Assignee Country             Number of Patents
1     United States                13,506
2     Japan                        2,653
3     Federal Republic of Germany  836
4     France                       534
5     China (Taiwan)               428
6     Republic of Korea            406
7     Canada                       333
8     Netherlands                  325
9     Australia                    276
10    United Kingdom               258
11    Switzerland                  193
12    Israel                       163
13    Sweden                       108
14    Belgium                      106
15    Italy                        82
16    Singapore                    70
17    China                        66
18    Denmark                      56
19    Finland                      51
20    India                        39
21    Hong Kong                    33
22    Bermuda                      28
23    Ireland                      26
24    Austria                      24
25    Norway                       23
26    Spain                        15
27    Liechtenstein                13
28    Barbados                     13
29    British Virgin Islands       7
30    New Zealand                  7

Citation Network Analysis

Developing software: Graphviz http://www.pixelglow.com/graphviz/download/
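Graphviz reads graphs written in its DOT language, so a list of (citing, cited) patent pairs can be turned into a drawable graph with very little code. The following is a minimal sketch, assuming the pairs have already been read from a table such as USP_Patent_Citation. The class name `CitationDot` and the IDs in `main` are placeholders, not real data, and this is not necessarily how the slide's figure was produced.

```java
import java.util.Arrays;
import java.util.List;

class CitationDot {
    /** Renders citing -> cited pairs as a Graphviz DOT directed graph. */
    static String toDot(List<String[]> citations) {
        StringBuilder sb = new StringBuilder("digraph citations {\n");
        for (String[] pair : citations) {
            // pair[0] is the citing patent, pair[1] the cited patent
            sb.append("  \"").append(pair[0]).append("\" -> \"")
              .append(pair[1]).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        List<String[]> sample = Arrays.asList(
            new String[] {"P3", "P1"},
            new String[] {"P3", "P2"},
            new String[] {"P2", "P1"});
        System.out.print(toDot(sample));
    }
}
```

The output can be saved to a file and rendered with the Graphviz command-line tool, for example `dot -Tpng citations.dot -o citations.png`.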

Content Map Analysis

Developing software: multi-level self-organizing map algorithm developed by AI Lab at the U of Arizona

Thanks!