Crawler-Based Search Engine Milestone IV By Ryan Caplet, Morris Wright and Bryan Chapman

Dec 20, 2015

Crawler-Based Search Engine Milestone IV

By Ryan Caplet, Morris Wright and Bryan Chapman

Topics Breakdown

• Updated Task Breakdown

• Parts of the Search Engine that are within the System

• Diagram

• Testing and Integration

Task Breakdown

• Bryan
– Crawler
– Keyword Generator

• Morris
– Database and Server Administrator
– Search Function

• Ryan
– Part of Crawler
– Search Function
– User Interface

• All
– Testing System Components

Breakdown of System Components

• Recursive wget

• Crawler / Indexer

• Keyword Generator

• Search Page

Recursive wget

• Run recursively against the UConn network

• More than 2,800 web pages were downloaded into the www folder

• About 3 GB in total

The Crawler – new_strip.pl

• Written in the Perl Programming Language

• Extracts the title and URL of each page and stores them in the Page Index Database

• Uses the File::Basename library to derive a title from the file name when a page has no title
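
The title extraction with a file-name fallback can be sketched as follows. This is only an illustration written in Python, not the project's Perl code in new_strip.pl: the names `TitleParser` and `page_title` are hypothetical, `os.path.basename` stands in for Perl's File::Basename, and dropping the file extension in the fallback is an assumption.

```python
import os
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(path, html):
    """Return the <title> text, or a file-name-based title if none is found."""
    parser = TitleParser()
    parser.feed(html)
    title = "".join(parser.parts).strip()
    if title:
        return title
    # Fallback (assumed behavior): the file name without its extension.
    return os.path.splitext(os.path.basename(path))[0]
```

Each (title, URL) pair returned this way would then be written to the Page Index Database; the database step is left out here because the slides do not describe its schema.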

Keyword Generator

• Uses the index built by the Crawler

• Applies a stemming algorithm to the words on each page

• PHP stems the words, while Perl interacts with the Keywords Database

• Filenames: process2.php, fileopen.php, stemming.php and processKeyword.pl

Side Topic: Stemming Algorithm

• The process of finding the root, or base form, of a word

• Example: “stemmer”, “stemming” and “stemmed” are all based on “stem”, so “stem” is the stem

• Here, the algorithm reduces each word variation to its stem
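
The example above can be reproduced with a toy suffix-stripping stemmer. This is a deliberately simplified sketch in Python, not the algorithm in stemming.php (the slides do not say which one is used); the function name `simple_stem` and the three-suffix list are assumptions chosen only to cover the slide's example.

```python
# Suffixes handled by this toy stemmer (an assumption; real stemmers use many more rules).
SUFFIXES = ("ing", "ed", "er")


def simple_stem(word):
    """Strip one known suffix and undo consonant doubling, e.g. 'stemming' -> 'stem'."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo the doubled consonant left behind, e.g. "stemm" -> "stem".
            if len(stem) >= 4 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


for w in ("stemmer", "stemming", "stemmed", "stem"):
    print(w, "->", simple_stem(w))
```

All four words map to the same stem, which is what lets a search for one form match pages containing the others.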

Keyword Generator Cont’d

• The Keyword Generator produces thousands of tables, one per word

• Each table contains the URLs where the word appears and the word's frequency at each URL

• Uses an md5 checksum

• These tables are what the search page queries
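
The per-word tables can be pictured with the following sketch. It is written in Python with in-memory dictionaries instead of the project's Perl and Keywords Database, and it rests on a loud assumption: the slides only say an md5 checksum is used, so deriving each table name from the md5 digest of the word (the hypothetical `table_name` function and its `kw_` prefix) is a guess at how that might work, not the authors' confirmed scheme.

```python
import hashlib
import re
from collections import Counter, defaultdict


def table_name(word):
    """Assumed naming scheme: a prefix plus the md5 hex digest of the word."""
    return "kw_" + hashlib.md5(word.encode("utf-8")).hexdigest()


def build_keyword_tables(pages):
    """Map each word's table name to {url: frequency}.

    `pages` is a dict of url -> page text. In the real system the words
    would already be stemmed before being counted.
    """
    tables = defaultdict(dict)
    for url, text in pages.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for word, freq in counts.items():
            tables[table_name(word)][url] = freq
    return dict(tables)
```

Looking up a word then means computing its table name and reading the URL-to-frequency rows stored under it.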

Search Page

• Written in HTML and PHP

• Filenames: index.html and results.php

• Accesses the database and searches the tables for the words specified

• Uses the Quicksort algorithm to sort results by frequency

• Uses an md5 checksum so that it searches only what was generated by the keyword script
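
Sorting results by frequency with Quicksort can be sketched as below. This is a Python illustration, not the code in results.php: the function name `quicksort_by_frequency` and the (URL, frequency) pair layout are assumptions, and this list-building variant may differ from whatever partitioning scheme the project uses.

```python
def quicksort_by_frequency(results):
    """Sort (url, frequency) pairs from highest to lowest frequency."""
    if len(results) <= 1:
        return list(results)
    pivot = results[len(results) // 2]
    # Partition around the pivot's frequency; higher frequencies come first.
    higher = [r for r in results if r[1] > pivot[1]]
    equal = [r for r in results if r[1] == pivot[1]]
    lower = [r for r in results if r[1] < pivot[1]]
    return quicksort_by_frequency(higher) + equal + quicksort_by_frequency(lower)
```

Pages where the searched word occurs most often end up at the top of the results list.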

Diagram

Testing Entry Criteria

• Each component must first work adequately for its creator.

• Once the creator confirms that it works, it is verified by a second party.

Integration Strategy Points

• All parts of the system are relatively separate.

• Yet the later parts depend on the earlier parts' output.

• Integration is done as shown in the diagram.

Exit Criteria

• For this system to be ready for beta testing:

– The search page must be tested thoroughly to make sure that it functions correctly, with security concerns addressed as they come up.

– The keyword tables must build properly and be accessible from the search page.

The End

• Any Questions, Concerns or Criticisms?