Crawler-Based Search Engine Milestone IV By Ryan Caplet, Morris Wright and Bryan Chapman.

Topics Breakdown

• Updated Task Breakdown

• Parts of the Search Engine that are within the System

• Diagram

• Testing and Integration

Task Breakdown

• Bryan: Crawler, Keyword Generator

• Morris: Database and Server Administrator, Search Function

• Ryan: Part of Crawler, Search Function, User Interface

• All: Testing System Components

Breakdown of System Components

• Recursive wget

• Crawler / Indexer

• Keyword Generator

• Search Page

Recursive wget

• Run recursively against the UConn network

• More than 2,800 web pages were downloaded into the www folder

• About 3 GB in total
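
As a rough illustration of the download step, the sketch below builds a recursive wget command line without running it. The start URL, domain, depth limit, and the helper name build_wget_command are placeholders chosen for this example; the slides do not state which options were actually used.

```python
import shlex


def build_wget_command(start_url, domain, dest_dir="www", depth=5):
    """Return the argument list for a recursive wget crawl (not executed here).

    All parameter values are illustrative assumptions, not the project's settings.
    """
    return [
        "wget",
        "--recursive",                     # follow links found in downloaded pages
        f"--level={depth}",                # limit how deep the recursion goes
        "--no-parent",                     # never climb above the start directory
        f"--domains={domain}",             # stay inside the given domain
        f"--directory-prefix={dest_dir}",  # save everything under dest_dir
        start_url,
    ]


cmd = build_wget_command("http://www.example.edu/", "example.edu")
print(shlex.join(cmd))
```

The list could be handed to subprocess.run to start the crawl; printing it first makes it easy to review the options before downloading several gigabytes.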

The Crawler – new_strip.pl

• Written in the Perl Programming Language

• Extracts the title and URL of each page and stores them in the Page Index Database

• Uses the File::Basename library to derive a title from the filename when a page has none.
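
The title extraction with a filename fallback can be sketched as follows. This is a Python stand-in, not the contents of new_strip.pl: the class TitleParser and the function page_title are hypothetical names, and os.path.basename plays the role that File::Basename plays in Perl.

```python
import os
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def page_title(html_text, path):
    """Return the page's title, or the file's base name if no title is found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace so multi-line titles become one line.
    title = " ".join("".join(parser.parts).split())
    if title:
        return title
    # Fallback: "www/docs/about.html" -> "about"
    return os.path.splitext(os.path.basename(path))[0]
```

For example, page_title("<title>Home</title>", "www/index.html") gives "Home", while a page with no title element stored at www/docs/about.html gives "about".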

Keyword Generator

• Uses the index built by the Crawler

• A stemming algorithm is applied to each word

• PHP is used to stem the words, but Perl is used to interact with the Keywords Database.

• Filenames: process2.php, fileopen.php, stemming.php and processKeyword.pl

Side Topic: Stemming Algorithm

• Process of finding the root or natural form of a word.

• Example: “stemmer”, “stemming”, “stemmed” are based on “stem”. “Stem” is the stem.

• Here the stemmer reduces each of those word variations to the same stem
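
To make the idea concrete, here is a deliberately small suffix-stripping sketch. It is not the algorithm in stemming.php (the slides do not say which stemmer is used) and it is far cruder than a full stemmer such as Porter's; the suffix list and the name simple_stem are assumptions made for this example.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "er", "s")


def simple_stem(word):
    """Strip one common suffix and undo consonant doubling (toy example)."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Require at least three letters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "stemm" -> "stem"; keep doubled l, s, z as in "fall" or "pass".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


for w in ("stemmer", "stemming", "stemmed", "stem"):
    print(w, "->", simple_stem(w))
```

All four inputs reduce to "stem", which is why a search for any one of them can match pages containing the others.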

Keyword Generator Cont’d

• Keyword Generator will produce thousands of tables, one for each word.

• Each table will contain the URLs where the word appears and the frequency of the word at each URL.

• Use of md5 checksum

• This is what we will be searching from!
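
One possible layout for the per-word tables is sketched below. It assumes the md5 checksum of the stemmed word is used to form the table name, which the slides suggest but do not spell out. An in-memory SQLite database stands in for the project's database, and the "kw_" prefix, column names, and function names are all hypothetical.

```python
import hashlib
import sqlite3
from collections import Counter


def table_name(stem):
    """Assumed naming scheme: a fixed prefix plus the md5 hex digest of the stem."""
    return "kw_" + hashlib.md5(stem.encode("utf-8")).hexdigest()


def index_page(conn, url, stems):
    """Record how often each stem occurs at the given URL."""
    for stem, freq in Counter(stems).items():
        name = table_name(stem)  # only [a-z0-9_], so safe to interpolate
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} "
            "(url TEXT PRIMARY KEY, frequency INTEGER)"
        )
        conn.execute(
            f"INSERT OR REPLACE INTO {name} (url, frequency) VALUES (?, ?)",
            (url, freq),
        )


def lookup(conn, stem):
    """Return (url, frequency) rows for a stem, or [] if no table was generated."""
    name = table_name(stem)
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    if not exists:
        return []
    return conn.execute(f"SELECT url, frequency FROM {name}").fetchall()


conn = sqlite3.connect(":memory:")
index_page(conn, "http://example.edu/a.html", ["stem", "search", "stem"])
index_page(conn, "http://example.edu/b.html", ["stem"])
print(lookup(conn, "stem"))
```

Hashing the word gives every table a fixed-length name made only of letters, digits, and underscores, and a search term can reach a table only if the keyword script created one for it.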

Search Page

• Written in HTML and PHP

• Filenames: index.html and results.php

• Will access the Database and search the tables for the words specified

• Uses Quicksort Algorithm to sort results by Frequency

• Use of md5 checksum to make it search only what was generated by keyword script.
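
The ranking step can be illustrated with a short quicksort over (URL, frequency) pairs, highest frequency first. This is a sketch in Python rather than the PHP in results.php, and the function name quicksort_by_frequency and the sample data are made up for the example.

```python
def quicksort_by_frequency(results):
    """Return (url, frequency) pairs sorted by frequency, highest first."""
    if len(results) <= 1:
        return list(results)
    # Use the middle element's frequency as the pivot.
    pivot = results[len(results) // 2][1]
    higher = [r for r in results if r[1] > pivot]
    equal = [r for r in results if r[1] == pivot]
    lower = [r for r in results if r[1] < pivot]
    # Higher frequencies come first, so recurse on them before the pivot group.
    return quicksort_by_frequency(higher) + equal + quicksort_by_frequency(lower)


hits = [("a.html", 2), ("b.html", 7), ("c.html", 1), ("d.html", 7)]
print(quicksort_by_frequency(hits))
```

Pages with equal frequency keep their original relative order in this version, because each partition is built with an order-preserving list comprehension.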

Diagram

Testing Entry Criteria

• Must work adequately for the creator.

• Once the creator confirms that a component works, a second team member verifies it.

Integration Strategy Points

• All parts of the system are relatively separate.

• However, the later parts depend on the output of the earlier parts.

• Integration is done as shown in the diagram.

Exit Criteria

• In order for this system to be ready for beta testing:

– The search page must be tested thoroughly to make sure that it functions correctly, with security concerns addressed as they come up.

– The keyword tables must build properly and be accessible by the search page.

The End

• Any Questions, Concerns or Criticisms?
