AN APPROACH TO CATEGORIZATION OF TEXT IN WEBSITES USING PARALLEL SEARCH BAKTAVATCHALAM.G (08MW03) MASTER OF ENGINEERING Branch: SOFTWARE ENGINEERING of Anna University May 2009 DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PSG COLLEGE OF TECHNOLOGY (Autonomous Institution) COIMBATORE – 641 004
AN APPROACH TO CATEGORIZATION OF TEXT IN WEBSITES USING PARALLEL SEARCH
BAKTAVATCHALAM.G (08MW03)
MASTER OF ENGINEERING
Branch: SOFTWARE ENGINEERING
of Anna University
May 2009
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PSG COLLEGE OF TECHNOLOGY
(Autonomous Institution)
COIMBATORE – 641 004
PSG COLLEGE OF TECHNOLOGY (Autonomous Institution)
COIMBATORE – 641 004
AN APPROACH TO CATEGORIZATION OF TEXT IN WEBSITES USING
PARALLEL SEARCH
Bona fide record of work done by
BAKTAVATCHALAM.G (08MW03)
MASTER OF ENGINEERING
Branch: COMPUTER SCIENCE AND ENGINEERING
of Anna University, Coimbatore.
May 2009
ACKNOWLEDGEMENT
We wish to express our sincere gratitude to our respected Principal, Dr. R. Rudramoorthy, for having given us the opportunity to undertake our project.
We also wish to express our sincere thanks to Dr. S. N. Sivanandam, Professor and Head of the Department of Computer Science and Engineering, for the encouragement and support he extended towards our project work.
We extend our sincere thanks to our internal guide, Mrs. D. Indumathi, Asst. Professor, Department of Computer Science and Engineering, for her guidance and help rendered for the successful completion of our project.
CONTENTS
CHAPTER
Synopsis
List of Figures
List of Tables
1. INTRODUCTION
   1.1 Problem Definition
   1.2 Objective of the Project
   1.3 Significance of the Project
   1.4 Outline of the Project
2. SYSTEM STUDY
   2.1 Proposed System
3. SYSTEM ANALYSIS
   3.1 Requirement Analysis
   3.2 Feasibility Study
5. SYSTEM IMPLEMENTATION
   5.1 Server Module
   5.2 Parser Module
6. TESTING
   6.1 Unit Testing
CHAPTER 5
IMPLEMENTATION
This phase is divided into two stages: development and implementation. During development, the individual system components are built, and the programs are written and tried out by users. During implementation, the components built during development are put into operational use.
In the development stage of our system, the following components were built:
• Server module
• Parser module
Both the Server and Parser modules are developed in Java.
5.1 Server Module
This module contains the following sub-modules:
• Load Details
• Categorizing
• Searching
5.1.1 Load Details
This sub-module loads three sets of data: categories with their related categories, documents with their assigned categories, and categories with their keys and the weight of each key.
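As an illustration of the third data set (categories with weighted keys), the following is a minimal sketch only. The class name CategoryKeyStore, the method names, and the "category,keyword,weight" line format are assumptions made for this example; the report does not state how the details are stored or loaded.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory store for the "categories and their keys with weights" data. */
final class CategoryKeyStore {
    // category -> (keyword -> weight)
    private final Map<String, Map<String, Double>> weights = new HashMap<>();

    /** Parses lines of the assumed form "category,keyword,weight". */
    static CategoryKeyStore load(List<String> lines) {
        CategoryKeyStore store = new CategoryKeyStore();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length != 3) {
                continue; // skip malformed rows instead of failing the whole load
            }
            String category = parts[0].trim();
            String keyword = parts[1].trim().toLowerCase(Locale.ROOT);
            double weight = Double.parseDouble(parts[2].trim());
            store.weights.computeIfAbsent(category, c -> new HashMap<>()).put(keyword, weight);
        }
        return store;
    }

    /** Returns the weight of a keyword within a category, or 0.0 if either is unknown. */
    double weightOf(String category, String keyword) {
        Map<String, Double> keys = weights.getOrDefault(category, Collections.emptyMap());
        return keys.getOrDefault(keyword.toLowerCase(Locale.ROOT), 0.0);
    }
}
```

Keywords are lower-cased on load and on lookup so that matching is case-insensitive; that is a design choice of this sketch, not something the report specifies.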
5.1.2 Categorizing
This sub-module categorizes a given document using the key set parsed from that document and the weights of those keys in each available category.
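One plausible reading of this step is to score each category by the sum of the weights of the document keys it knows, and to pick the highest score. The sketch below follows that reading; the scoring rule, the class name WeightedCategorizer, and the method signature are assumptions for illustration, not the project's confirmed algorithm.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical categorizer: the score of a category is the sum of matching key weights. */
final class WeightedCategorizer {
    /**
     * Returns the best-scoring category, or null when no document key
     * has a positive weight in any category.
     */
    static String categorize(Set<String> documentKeys,
                             Map<String, Map<String, Double>> categoryWeights) {
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Double>> entry : categoryWeights.entrySet()) {
            double score = 0.0;
            for (String key : documentKeys) {
                // Keys unknown to this category contribute nothing.
                score += entry.getValue().getOrDefault(key, 0.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Because only a strictly higher score replaces the current best, ties go to whichever category the map iterates first; a real system would need an explicit tie-breaking rule.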
5.1.3 Searching
This sub-module searches for documents and their categories using a given key set.
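A minimal sketch of such a lookup is given below. It assumes that a document matches only when it contains every key in the query, and that documents without a recorded category are reported as "uncategorized"; both rules, along with the name KeySetSearch, are assumptions made for this example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical search over loaded documents by key set. */
final class KeySetSearch {
    /** Returns document -> category for every document whose keys contain all query keys. */
    static Map<String, String> search(Set<String> queryKeys,
                                      Map<String, Set<String>> documentKeys,
                                      Map<String, String> documentCategory) {
        Map<String, String> hits = new TreeMap<>(); // sorted by document name for stable output
        for (Map.Entry<String, Set<String>> entry : documentKeys.entrySet()) {
            if (entry.getValue().containsAll(queryKeys)) {
                String category = documentCategory.getOrDefault(entry.getKey(), "uncategorized");
                hits.put(entry.getKey(), category);
            }
        }
        return hits;
    }
}
```

Note that an empty query key set matches every document under this rule.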
5.2 Parser Module
This module contains the following sub-modules:
• Load Module
• URL Content Grabber Module
5.2.1 Load Module
This sub-module loads the keywords from the server and then retrieves the URL at which to begin searching.
5.2.2 URL Content Grabber Module
Whenever a URL arrives from the server, the parser opens a connection to that URL, retrieves its contents, and then collects the key sets found on that site.
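The fetch-then-count flow described above might look roughly like the following sketch. The class and method names are hypothetical, the page is assumed to be UTF-8 text, and matching is assumed to be case-insensitive on whole words; the project's actual parser may differ on all of these points.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical content grabber: downloads a page and counts keyword occurrences. */
final class KeywordGrabber {
    /** Downloads the content behind the URL as text (assumes UTF-8). */
    static String fetch(String url) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(url).toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    /** Counts whole-word, case-insensitive occurrences of each keyword in the text. */
    static Map<String, Integer> countKeywords(String text, Collection<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String keyword : keywords) {
            counts.put(keyword.toLowerCase(Locale.ROOT), 0);
        }
        // Split on anything that is not a letter or digit, so punctuation separates words.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            Integer current = counts.get(token);
            if (current != null) {
                counts.put(token, current + 1);
            }
        }
        return counts;
    }
}
```

A typical call would be countKeywords(fetch(url), keywords). This sketch does not strip HTML markup, so tag and attribute names are tokenized along with the visible text.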
CHAPTER 6
TESTING
This chapter explains the various testing procedures conducted on the system.
Testing is the process of executing a program with the intent of finding errors. A successful test is one that uncovers an as-yet-undiscovered error. Testing cannot show the absence of defects; it can only show that software errors are present. It also checks that defined inputs produce actual results that agree with the required results. A good testing methodology should:
• Clearly define testing roles, responsibilities and procedures
• Establish a consistent testing process
• Streamline testing requirements
• Overcome the “requirements slow me down” mentality
• Take a common-sense process approach
• Reuse elements of the existing process
• Not attempt to replace, rewrite or redefine the process
• Find defects early and give developers enough time for bug fixes
• Bring an independent perspective to testing
The testing strategies used in this project are:
• Unit Testing
• Integration Testing
6.1 UNIT TESTING
Unit testing is a strategy in which the individual components that make up the system are tested first, to ensure that each works as intended. It focuses the verification effort on the smallest unit of the software design, i.e. the module. The various modules of the system are tested to see whether they perform their intended functions. Using the procedural design description, important control paths are tested to uncover errors within the boundary of the module. For example, the functions that accept a connection are unit tested within their respective modules. The unit test is normally a white-box test (a testing method in which the control structure of the procedural design is used to derive test cases).
6.1.1 Process Objectives
To test every unit of the software in isolation before integrating it with other units.
6.1.2 Definition of Unit
A unit is a module, as identified during the size estimation process, with a size estimate that does not exceed 1000 LOC.
For GUI applications, each screen is a unit.
If the size estimate for a unit exceeds 1000 LOC and it is not feasible to break it into smaller, logically independent units that can be tested in isolation, the project lead, in concurrence with the SQA, can decide to define it as a single unit.
6.1.3 Entry Criteria
The entry criteria for this process are the following:
• Unit completed
• Unit peer reviewed
6.1.4 Exit Criteria
The exit criteria for this process are the following:
• Unit test cases executed
• Any defects that are identified during unit testing and not fixed before the unit enters component testing are listed in the test report and verified
• 100% statement coverage
If a unit is to be tested before its code review, this must be identified in the project plan. In such projects the developer will self-review (desk check) the code before unit testing.
Where exception handling for error conditions that are difficult to generate makes it impossible to achieve 100% statement coverage, the code should be formally reviewed as an additional criterion.
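To make the statement-coverage criterion concrete, the sketch below shows a small white-box unit test in which every statement of the unit, including its exception handler, is executed by at least one test case. The unit parseWeight is a hypothetical helper invented for this example, not a function from the project, and the checks use plain Java rather than any particular test framework.

```java
/** Hypothetical unit plus a self-checking test that reaches every statement in it. */
final class WeightParserTest {
    /** Unit under test: parses a keyword weight, falling back to 0.0 on bad input. */
    static double parseWeight(String raw) {
        if (raw == null) {
            return 0.0;                      // covered by the null case
        }
        try {
            return Double.parseDouble(raw.trim()); // covered by the valid cases
        } catch (NumberFormatException e) {
            return 0.0;                      // covered by the malformed case
        }
    }

    /** Fails loudly when a test expectation does not hold. */
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        check(parseWeight("2.5") == 2.5, "valid decimal");
        check(parseWeight(" 3 ") == 3.0, "surrounding whitespace is trimmed");
        check(parseWeight(null) == 0.0, "null input falls back to 0.0");
        check(parseWeight("abc") == 0.0, "malformed input falls back to 0.0");
        System.out.println("all unit tests passed");
    }
}
```

Here the error branch is easy to trigger with malformed input; the formal review mentioned above applies to handlers for conditions that a test cannot readily provoke.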
6.2 INTEGRATION TESTING
Integration testing is a systematic technique for constructing the program structure while conducting tests to uncover errors associated with interfacing. The individual modules of the system are combined and tested to check whether they work properly as a whole. The objective is to take unit-tested modules and build the program structure dictated by the design. Integration testing can be either ‘incremental’ or ‘non-incremental’.
The integration testing process is also intended to help engineers plan and execute the component and integration testing for their respective projects.
Integration testing should have the following characteristics:
• Performed by the product group/Dev test team after feature complete
• Determines that all product components on a list of specific platforms function
successfully together (The List specified in Master test plan)
• Performed in a basic product / platform environment (Basic environment
specified in Master test plan)
• Tests the product functionality against the specification
• Tests functionality of fake languages with sample single and double byte
languages
• Tests scaling to an acceptable minimum level as called out in the master test
plan
• Tests performance, reliability to an acceptable level as called out in the master
test plan
• Final integration tests done after all components are integrated, with the build in
production format
The tasks of the project have been integrated, and the entire system has been subjected to a series of tests. All the modules were found to interoperate properly, and the integrated system as a whole was found to work satisfactorily.
6.3 SAMPLE TEST CASES
Some of the sample test cases employed, along with their results, are listed in the table below.
Table 6.1 Sample Test Cases
Test Description                                            Result
Is the server stable when running more than one key set?    OK
Does the parser return the results properly?                OK
Is searching done correctly?                                OK
Does the server use few resources?                          OK
Are the results obtained in a short time?                   OK
CHAPTER 7
SNAPSHOT
This chapter contains snapshots of the various forms in our system.
7.1 Finding Category of given document
7.2 Finding the Document & its Category using given keyword
CONCLUSION
Thus the analysis, design and implementation of text categorization and searching have been completed successfully. The user can search for a set of keywords across a list of websites and view the count of each keyword for a particular website. This kind of search is useful for crawling websites with a focus on specific content. Because the searches run concurrently rather than one site at a time, the system achieves higher performance.
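The concurrent search summarized here can be sketched with a fixed thread pool, where each site is searched by its own task. The sketch is illustrative only: the names ParallelSiteSearch and countAcrossSites, the fetcher parameter (a function from site address to page text) and the thread count are assumptions, and the project's own server and parser may coordinate the work differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical parallel search: one task per site, run on a fixed-size thread pool. */
final class ParallelSiteSearch {
    /** Returns site -> number of occurrences of the keyword, in the order the sites were given. */
    static Map<String, Integer> countAcrossSites(List<String> sites, String keyword,
                                                 Function<String, String> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every site first so that the downloads and counts overlap.
            Map<String, Future<Integer>> pending = new LinkedHashMap<>();
            for (String site : sites) {
                Callable<Integer> task = () -> countOccurrences(fetcher.apply(site), keyword);
                pending.put(site, pool.submit(task));
            }
            // Then collect the results, blocking until each task has finished.
            Map<String, Integer> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Integer>> entry : pending.entrySet()) {
                result.put(entry.getKey(), entry.getValue().get());
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("search interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("search task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    /** Counts whole-word, case-insensitive occurrences of the keyword in the text. */
    static int countOccurrences(String text, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }
}
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; in practice it would wrap whatever code downloads a page.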
FUTURE ENHANCEMENTS
Currently a flat classification scheme is used to find categories. In future it will be extended to a hierarchical, tree-structured classification to reduce time complexity and improve relevance. Currently we supply the set of websites to be classified; in future, classification will be done by automatic parsing of sites.
SERVER
/*
 * ServerGUI.java
 *
 * Created on November 2, 2008, 3:09 PM
 */
import java.io.*;
import java.util.*;
import javax.swing.*;

/**
 *
 * @author SuperStar
 */
interface ServerI {
    public void setErr(String err);
    public void setInfo(String info);
}

public class ServerGUI extends javax.swing.JFrame implements ServerI {
    String[] ip;
    int ipN=0,rN=0,jN=0,jT=0,kN=0;
    String[] jobs;
    String[] rank;
    String[] key;
    ServerManager SM;

    /** Creates new form ServerGUI */
    public ServerGUI() {
        initComponents();
        this.jTextArea2.setText("Err Stream:");
        this.jList1.removeAll();
        // this.jList2.removeAll();
        this.jList3.removeAll();
        (new MessageBox("welcome To SuperStar's Network!")).setVisible(true);
    }

    /** This method is called from within the constructor to
     * initialize the form.
     * WARNING: Do NOT modify this code. The content of this method is
     * always regenerated by the Form Editor.
     */
    // <editor-fold defaultstate="collapsed" desc="Generated Code">//GEN-BEGIN:initComponents
    private void initComponents() {
        jScrollPane1 = new javax.swing.JScrollPane();
        jList1 = new javax.swing.JList();
        jLabel1 = new javax.swing.JLabel();
        jButton1 = new javax.swing.JButton();
        jScrollPane3 = new javax.swing.JScrollPane();
        jList3 = new javax.swing.JList();
        jLabel2 = new javax.swing.JLabel();
        jScrollPane2 = new javax.swing.JScrollPane();
        jTextArea1 = new javax.swing.JTextArea();
        jButton3 = new javax.swing.JButton();
        jScrollPane4 = new javax.swing.JScrollPane();
        jTextArea2 = new javax.swing.JTextArea();
        jButton2 = new javax.swing.JButton();
        jScrollPane5 = new javax.swing.JScrollPane();
        jTextArea3 = new javax.swing.JTextArea();

        setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);
        setTitle("Server");

        jList1.setModel(new javax.swing.AbstractListModel() {
            String[] strings = { "Item 1", "Item 2", "Item 3", "Item 4", "Item 5" };
            public int getSize() { return strings.length; }
            public Object getElementAt(int i) { return strings[i]; }
        });
        jScrollPane1.setViewportView(jList1);

        jLabel1.setText("Clients IP :");

        jButton1.setText("Load Details");
        jButton1.addActionListener(new java.awt.event.ActionListener() {
            public void actionPerformed(java.awt.event.ActionEvent evt) {
                jButton1ActionPerformed(evt);
            }
        });

        jList3.setModel(new javax.swing.AbstractListModel() {
            String[] strings = { "Item 1", "Item 2", "Item 3", "Item 4", "Item 5" };
            public int getSize() { return strings.length; }
            public Object getElementAt(int i) { return strings[i]; }
        });
        jScrollPane3.setViewportView(jList3);

        jLabel2.setText("Clients Rank :");
}
//////////////
class ServerReadThread extends Thread {
    Socket S;
    ServerI SI=null;
    ServerIF SIF;

    public ServerReadThread(Socket s,ServerI si,ServerIF sif) {
        S=s;
        SIF=sif;
        SI=si;
        //SI.setInfo(s.toString());
        start();
    }

    public void run() {
        try {
            BufferedReader in=new BufferedReader(new InputStreamReader(S.getInputStream()));
            while(true) {
                //PrintWriter out=new PrintWriter(new BufferedWriter(new OutputStreamWriter(os.getOutputStream())),true);
                SIF.dataFC(in.readLine());
            }
        } catch(Exception e2) {
            SI.setErr(e2.getMessage());
        }
    }
}

/*
 * MessageBox.java
 *
 * Created on November 2, 2008, 9:15 PM
 */
/**
 *
 * @author SuperStar
 */
public class MessageBox extends javax.swing.JFrame {
    String MSG="SuperStar";

    /** Creates new form MessageBox */
    public MessageBox(String msg) {
        MSG=msg;
        initComponents();
        this.jTextArea1.setText(MSG);
    }

    /** This method is called from within the constructor to
     * initialize the form.
     * WARNING: Do NOT modify this code. The content of this method is