1. INTRODUCTION
1.1 Data Mining:
Data mining is the process of automatically discovering useful information in large data repositories. Data mining techniques are deployed to scour large databases in order to find novel and useful patterns that might otherwise remain unknown. They also provide capabilities to predict the outcome of future observations, such as predicting whether a newly arrived customer will spend more than $100 at a department store.
(Figure 1.1 Data Mining Flow Chart)
1.2 Structure of Data Mining:
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.
1.3 Data Mining Work:
While large-scale
information technology has been evolving separate transaction and
analytical systems, data mining provides the link between the two.
Data mining software analyzes relationships and patterns in stored
transaction data based on open-ended user queries. Several types of
analytical software are available: statistical, machine learning,
and neural networks. Generally, any of four types of relationships are sought:
- Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
- Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
- Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
Data mining consists of five major elements:
- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data by application software.
- Present the data in a useful format, such as a graph or table.
Different levels of analysis are available:
- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID). CART and CHAID are decision tree techniques used for classification of a dataset. They provide a set of rules that you can apply to a new (unclassified) dataset to predict which records will have a given outcome. CART segments a dataset by creating 2-way splits while CHAID segments using chi square tests to create multi-way splits. CART typically requires less data preparation than CHAID.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.
- Data visualization: The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
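To make the nearest neighbor method concrete, here is a minimal Java sketch (illustrative only; the training points, labels, and the choice of k are hypothetical) that classifies a query point by majority vote among its k closest records:

    import java.util.Arrays;
    import java.util.Comparator;

    public class NearestNeighbor {
        // Classifies 'query' by majority vote among the k closest training points.
        static int classify(double[][] train, int[] labels, double[] query, int k) {
            Integer[] idx = new Integer[train.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble(i -> dist(train[i], query)));
            int votes0 = 0, votes1 = 0;
            for (int i = 0; i < k; i++) {
                if (labels[idx[i]] == 0) votes0++; else votes1++;
            }
            return votes0 >= votes1 ? 0 : 1;
        }

        // Euclidean distance between two feature vectors.
        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        public static void main(String[] args) {
            double[][] train = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
            int[] labels = {0, 0, 1, 1};
            System.out.println(classify(train, labels, new double[]{2, 1}, 3)); // prints 0
        }
    }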
1.4 Characteristics of Data Mining:
- Large quantities of data: The volume of data is so great that it has to be analyzed by automated techniques, e.g., satellite information, credit card transactions, etc.
- Noisy, incomplete data: Imprecise data is characteristic of all data collection.
- Complex data structure: Conventional statistical analysis is not possible.
- Heterogeneous data stored in legacy systems.
1.5 Benefits of Data Mining:
1) It is one of the most
effective services available today. With the help of data mining, one can discover precious information about customers and their behavior for a specific set of products, and evaluate, analyze, store, mine, and load data related to them.
2) An analytical CRM model and strategic business-related decisions can be made with the help of data mining, as it helps in providing a complete synopsis of customers.
3) An endless number of organizations have installed data mining projects, and it has helped them see their own companies make an unprecedented improvement in their marketing strategies (campaigns).
4) Data mining is generally used by organizations with a solid customer focus. Because of its flexible applicability, it is used vehemently in applications to foresee crucial data, including industry analysis and consumer buying behaviors.
5) Fast-paced and prompt access to data along with economic processing techniques have made data mining one of the most suitable services that a company can seek.
1.6 Advantages of Data Mining:
1.6.1 Marketing / Retail:
Data mining
helps marketing companies build models based on historical data to
predict who will respond to the new marketing campaigns such as
direct mail, online marketing campaigns, etc. Through the results, marketers will have an appropriate approach to sell profitable products to targeted customers. Data mining brings a lot of benefits to retail companies in the same way as marketing. Through market basket analysis, a store can arrange products appropriately so that customers can conveniently buy products that are frequently bought together. In addition, it also helps the retail companies offer certain discounts for particular products that will attract more customers.
1.6.2 Finance / Banking:
Data mining gives financial institutions information about loans and credit reporting. By building a model from historical customer data, a bank or financial institution can distinguish good loans from bad ones. In addition, data mining helps banks detect fraudulent credit card transactions to protect credit card owners.
1.6.3 Manufacturing:
By applying data mining to operational engineering data, manufacturers can detect faulty equipment and determine optimal control parameters. For example, semiconductor manufacturers face the challenge that, even when the conditions of manufacturing environments at different wafer production plants are similar, wafer quality varies and some wafers have defects for unknown reasons. Data mining has been applied to determine the ranges of control parameters that lead to the production of golden wafers. Those optimal control parameters are then used to manufacture wafers with the desired quality.
1.6.4 Governments:
Data mining helps government agencies by digging into and analyzing records of financial transactions to build patterns that can detect money laundering or criminal activities.
1.6.5 Law Enforcement:
Data mining can aid law enforcers in
identifying criminal suspects as well as apprehending these
criminals by examining trends in location, crime type, habit, and
other patterns of behavior.
1.6.6 Researchers:
Data mining can assist researchers by speeding up their data analysis process, thus allowing them more time to work on other projects.
1.7 Network:
A network consists of two or more computers that are linked in order to share resources (such as printers and CDs), exchange files, or allow electronic communications. The computers on a network may be linked through cables, telephone lines, radio waves, satellites, or infrared light beams. Two very common types of networks are:
1.7.1 Local Area Network
1.7.2 Wide Area Network
You may also see references to a Metropolitan Area Network (MAN), a Wireless LAN (WLAN), or a Wireless WAN (WWAN).
1.7.1 Local Area Network:
A Local Area Network
(LAN) is a network that is confined to a relatively small area. It
is generally limited to a geographic area such as a writing lab,
school, or building.Computers connected to a network are broadly
categorized as servers or workstations. Servers are generally not
used by humans directly, but rather run continuously to provide
"services" to the other computers (and their human users) on the
network. Services provided can include printing and faxing,
software hosting, file storage and sharing, messaging, data storage
and retrieval, complete access control (security) for the network's
resources, and many others. Workstations are so called because they typically have a human user who interacts with the network through them. Workstations were traditionally considered a
desktop, consisting of a computer, keyboard, display, and mouse, or
a laptop, with integrated keyboard, display, and touchpad. With the
advent of the tablet computer, and the touch screen devices such as
iPad and iPhone, our definition of workstation is quickly evolving
to include those devices, because of their ability to interact with
the network and utilize network services. Servers tend to be more
powerful than workstations, although configurations are guided by
needs. For example, a group of servers might be located in a secure
area, away from humans, and only accessed through the network. In
such cases, it would be common for the servers to operate without a
dedicated display or keyboard. However, the size and speed of the
server's processor(s), hard drive, and main memory might add
dramatically to the cost of the system. On the other hand, a
workstation might not need as much storage or working memory, but
might require an expensive display to accommodate the needs of its
user. Every computer on a network should be appropriately
configured for its use. On a single LAN, computers and servers may
be connected by cables or wirelessly. Wireless access to a wired
network is made possible by wireless access points (WAPs). These
WAP devices provide a bridge between computers and networks. A
typical WAP might have the theoretical capacity to connect hundreds
or even thousands of wireless users to a network, although
practical capacity might be far less. Servers are nearly always connected to the network by cables, because the cable connections remain the fastest. Workstations which are stationary (desktops)
are also usually connected by a cable to the network, although the
cost of wireless adapters has dropped to the point that, when
installing workstations in an existing facility with inadequate
wiring, it can be easier and less expensive to use wireless for a
desktop. See the Topology, Cabling, and Hardware sections of this tutorial for more information on the configuration of a LAN.
1.7.2 Wide Area Network:
Wide Area Networks (WANs) connect networks in
larger geographic areas, such as Florida, the United States, or the
world. Dedicated transoceanic cabling or satellite uplinks may be
used to connect this type of global network. Using a WAN, schools in
Florida can communicate with places like Tokyo in a matter of
seconds, without paying enormous phone bills. Two users a half-world apart with workstations equipped with microphones and webcams might teleconference in real time. A WAN is complicated. It
uses multiplexers, bridges, and routers to connect local and
metropolitan networks to global communications networks like the
Internet. To users, however, a WAN will not appear to be much
different than a LAN.
1.8 Social Network:
A social network is a social structure made up of a set of social actors (such as individuals or organizations) and a set of the dyadic ties between these actors. The social network perspective provides a set of methods for analyzing the structure of whole social entities as well as a variety of theories explaining the patterns observed in these structures. The study of these structures uses social network analysis to identify local and global patterns, locate influential entities, and examine network dynamics. Social networks and their analysis form an inherently interdisciplinary academic field which emerged from social psychology, sociology, statistics, and graph theory. Georg Simmel authored early structural theories in sociology emphasizing the dynamics of triads and the "web of group affiliations." Jacob Moreno is credited with developing the first sociograms in the 1930s to study interpersonal relationships. These approaches were mathematically formalized in the 1950s, and theories and methods of social networks became pervasive in the social and behavioral sciences by the 1980s. Social network analysis is now one of the major paradigms in contemporary sociology, and is also employed in a number of other social and formal sciences. Together with other complex networks, it forms part of the nascent field of network science. Communication through social networks, such as Facebook and Twitter, is of increasing importance in our daily life. Since the information exchanged over social networks includes not only text but also URLs, images, and videos, they are challenging test beds for the study of data
mining. There is another type of information that is intentionally
or unintentionally exchanged over social networks: mentions. Here
we mean by mentions links to other users of the same social network
in the form of message-to, reply-to, retweet-of, or explicitly in
the text. One post may contain a number of mentions. Some users may
include mentions in their posts rarely; other users may be
mentioning their friends all the time. Some users (like
celebrities) may receive mentions every minute; for others, being
mentioned might be a rare occasion. In this sense, mention is like
a language with the number of words equal to the number of users in
a social network. We are interested in detecting emerging topics
from social network streams based on monitoring the mentioning
behavior of users. Our basic assumption is that a new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends.
Conventional approaches for topic detection have mainly been
concerned with the frequencies of (textual) words. A term frequency
based approach could suffer from the ambiguity caused by synonyms
or homonyms. It may also require complicated preprocessing (e.g.,
segmentation) depending on the target language. Moreover, it cannot
be applied when the contents of the messages are mostly non-textual
information. On the other hand, the words formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.
1.9 Anomaly Detection:
In data mining, anomaly detection (or outlier detection) is the identification of items, events, or observations which do not conform to an expected pattern or to other items in a dataset. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems, or finding errors in text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the microclusters formed by these patterns.
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least with the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the learned model.
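As a concrete illustration of the unsupervised case, the following minimal Java sketch (illustrative only; the data and the cutoff are hypothetical) flags values that lie more than a given number of standard deviations from the mean:

    import java.util.ArrayList;
    import java.util.List;

    public class ZScoreOutlierDetector {
        // Returns the indices of values lying more than 'cutoff' standard
        // deviations from the mean of the whole (unlabeled) data set.
        static List<Integer> findOutliers(double[] data, double cutoff) {
            double mean = 0;
            for (double x : data) mean += x;
            mean /= data.length;
            double var = 0;
            for (double x : data) var += (x - mean) * (x - mean);
            double std = Math.sqrt(var / data.length);
            List<Integer> outliers = new ArrayList<>();
            for (int i = 0; i < data.length; i++) {
                if (std > 0 && Math.abs(data[i] - mean) / std > cutoff) outliers.add(i);
            }
            return outliers;
        }

        public static void main(String[] args) {
            double[] scores = {1.0, 1.2, 0.9, 1.1, 9.5, 1.0}; // 9.5 is anomalous
            System.out.println(findOutliers(scores, 2.0));    // prints [4]
        }
    }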
2. Literature Survey
2.1 Topic Detection and Tracking Pilot Study, AUTHORS: J. Allan et al.
Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative
to investigate the state of the art in finding and following new
events in a stream of broadcast news stories. The TDT problem
consists of three major tasks: (1) segmenting a stream of data,
especially recognized speech, into distinct stories; (2)
identifying those news stories that are the first to discuss a new
event occurring in the news; and (3) given a small number of sample
news stories about an event, finding all following stories in the
stream. The TDT Pilot Study ran from September 1996 through October
1997. The primary participants were DARPA, Carnegie Mellon
University, Dragon Systems, and the University of Massachusetts at
Amherst. This report summarizes the findings of the pilot study.
The TDT work continues in a new project involving larger training
and test corpora, more active participants, and a more broadly
defined notion of "topic" than was used in the pilot study.
2.2 Bursty and Hierarchical Structure in Streams, AUTHORS: J. Kleinberg
A fundamental problem in text data mining is to extract
meaningful structure from document streams that arrive continuously
over time. E-mail and news articles are two natural examples of
such streams, each characterized by topics that appear, grow in
intensity for a period of time, and then fade away. The published
literature in a particular research field can be seen to exhibit
similar phenomena over a much longer time scale. Underlying much of
the text mining work in this area is the following intuitive
premise --- that the appearance of a topic in a document stream is
signaled by a "burst of activity," with certain features rising
sharply in frequency as the topic emerges. The goal of the present
work is to develop a formal approach for modeling such "bursts," in
such a way that they can be robustly and efficiently identified,
and can provide an organizational framework for analyzing the
underlying content. The approach is based on modeling the stream
using an infinite-state automaton, in which bursts appear naturally
as state transitions; in some ways, it can be viewed as drawing an
analogy with models from queuing theory for bursty network traffic.
The resulting algorithms are highly efficient, and yield a nested
representation of the set of bursts that imposes a hierarchical
structure on the overall stream. Experiments with e-mail and
research paper archives suggest that the resulting structures have
a natural meaning in terms of the content that gave rise to
them.
2.3 Real-Time Change-Point Detection Using Sequentially Discounting Normalized Maximum Likelihood Coding, AUTHORS: Y. Urabe, K. Yamanishi, R. Tomioka, and H. Iwai
We are concerned with the issue of real-time change-point detection in time series. This technology has recently received vast attention in the area of data mining since it can be applied to a wide variety of important
risk management issues such as the detection of failures of
computer devices from computer performance data, the detection of
masqueraders/malicious executables from computer access logs, etc.
In this paper we propose a new method of real-time change point
detection employing the sequentially discounting normalized maximum
likelihood coding (SDNML). Here the SDNML is a method for
sequential data compression of a sequence, which we newly develop
in this paper. It attains the least code length for the sequence
and the effect of past data is gradually discounted as time goes
on, hence the data compression can be done adaptively to
non-stationary data sources. In our method, the SDNML is used to
learn the mechanism of a time series, then a change-point score at
each time is measured in terms of the SDNML code-length. We
empirically demonstrate the significant superiority of our method
over existing methods, such as the predictive-coding method and the
hypothesis testing method, in terms of detection accuracy and
computational efficiency for artificial data sets. We further apply
our method to real security issues, namely malware detection. We
empirically demonstrate that our method is able to detect unseen
security incidents at significantly early stages.
2.4 Model Selection by Sequentially Normalized Least Squares, AUTHORS: J. Rissanen, T. Roos, and P. Myllymaki
Model selection by means of the predictive least squares (PLS) principle
selection by means of the predictive least squares (PLS) principle
has been thoroughly studied in the context of regression model
selection and autoregressive (AR) model order estimation. We
introduce a new criterion based on sequentially minimized squared
deviations, which are smaller than both the usual least squares and
the squared prediction errors used in PLS. We also prove that our
criterion has a probabilistic interpretation as a model which is
asymptotically optimal within the given class of distributions by
reaching the lower bound on the logarithmic prediction errors,
given by the so called stochastic complexity, and approximated by
BIC. This holds when the regressor (design) matrix is non-random or
determined by the observed data as in AR models. The advantages of
the criterion include the fact that it can be evaluated efficiently
and exactly, without asymptotic approximations, and importantly,
there are no adjustable hyper-parameters, which makes it applicable
to both small and large amounts of data.
2.5 Dynamic Syslog Mining for Network Failure Monitoring, AUTHORS: K. Yamanishi and Y. Maruyama
Syslog monitoring technologies have recently received vast attention in the areas of
network management and network monitoring. They are used to address
a wide range of important issues including network failure symptom
detection and event correlation discovery. Syslog data are intrinsically dynamic in the sense that they form a time series and that their behavior may change over time. This paper proposes a new
methodology of dynamic syslog mining in order to detect failure
symptoms with higher confidence and to discover sequential alarm
patterns among computer devices. The key ideas of dynamic syslog
mining are 1) to represent syslog behavior using a mixture of
Hidden Markov Models, 2) to adaptively learn the model using an
on-line discounting learning algorithm in combination with dynamic
selection of the optimal number of mixture components, and 3) to
give anomaly scores using universal test statistics with a
dynamically optimized threshold. Using real syslog data we
demonstrate the validity of our methodology in the scenarios of
failure symptom detection, emerging pattern identification, and
correlation discovery.
3. System Study
3.1 Feasibility Study:
A feasibility study, also known as a feasibility analysis, is an analysis of the viability of an idea. It describes a preliminary study undertaken to determine and document a project's viability. The results of this analysis are used in making the decision whether to proceed with the project or not. This analytical tool, used during the project planning phase, shows how a business would operate under a set of assumptions, such as the technology used, the facilities and equipment, the capital needs, and other financial aspects. The study is the first point in the project development process at which it is shown whether the project can yield a technically and economically feasible concept. As the study requires a strong financial and technical background, outside consultants conduct most studies. A feasible project is one where the project could generate an adequate amount of cash flow and profits, withstand the risks it will encounter, remain viable in the long term, and meet the goals of the business. The venture can be a start-up of a new business, a purchase of an existing business, or an expansion of a current business. The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
(Figure 3.1 Model of the systems development life cycle, highlighting the maintenance phase)
Three key considerations involved in the feasibility analysis are:
3.1.1 Economical Feasibility
3.1.2 Technical Feasibility
3.1.3 Social Feasibility
3.1.1 Economical Feasibility: Economic evaluation is a vital
part of investment appraisal, dealing with factors that can be
quantified, measured, and compared in monetary terms. The results
of an economic evaluation are considered with other aspects to make
the project investment decision as the proper investment appraisal
helps to ensure that the right project is undertaken in a manner
that gives it the best chances of success. Project investments
involve the expenditure of capital funds and other resources to
generate future benefits, whether in the form of profits, cost
savings, or social benefits. For an investment to be worthwhile,
the future benefits should compare favorably with the prior expenditure of resources needed to achieve them. This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. Thus the developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
3.1.2 Technical Feasibility:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or null changes are required for implementing this system.
3.1.3 Social Feasibility:
Social Impact Assessment (SIA) is a process that provides a framework for
prioritizing, gathering, analyzing, and incorporating
social information and participation into the design and delivery of projects. It ensures that infrastructure project development is:
- informed and takes into account the key relevant social issues, and
- incorporates a participation strategy for involving a wide range of stakeholders.
At the micro-level, SIA impacts on individuals; at the meso-level it impacts on collectives (e.g., groups of people, institutions, and organizations); and at the macro-level it impacts on social macro-systems (e.g., national and international political and legal systems). The stages in Social Impact Assessment are:
- Describe the relevant human environment / area of influence and baseline conditions.
- Develop an effective public plan to involve all potentially affected publics.
- Describe the proposed action or policy change and reasonable alternatives.
- Scoping to identify the full range of probable social impacts.
- Screening to determine the boundaries of the SIA.
- Predicting responses to impacts.
- Develop a monitoring plan and mitigation measures.
Ideally the SIA should be an integral part of other assessments, as shown below.
(Figure 3.2 SIA in relation to other assessments)
This aspect of the study is to check the level of acceptance of the system by the
user. This includes the process of training the user to use the
system efficiently. The user must not feel threatened by the
system, instead must accept it as a necessity. The level of
acceptance by the users solely depends on the methods that are
employed to educate the user about the system and to make him
familiar with it. His level of confidence must be raised so that he
is also able to make some constructive criticism, which is
welcomed, as he is the final user of the system.
4. System Analysis & System Requirements
After analyzing the requirements of the task to be performed, the next step is to analyze the problem and understand its context. The first activity in the phase is studying the existing system, and the other is understanding the requirements and domain of the new system. Both activities are equally important, but the first activity serves as a basis for giving the functional specifications and then the successful design of the proposed system. Understanding the properties and requirements of a new system is more difficult; understanding an existing, running system is also difficult, and an improper understanding of the present system can lead to diversion from the solution.
4.1 Analysis Model: SDLC Methodologies
This document plays a vital role in the software development life cycle (SDLC) as it describes the complete requirements of the system. It is meant for use by developers and will be the basis during the testing phase. Any changes made to the requirements in the future will have to go through a formal change approval process. The SPIRAL MODEL was defined by Barry Boehm in his 1988 article, "A Spiral Model of Software Development and Enhancement." This model was not the first model to discuss iterative development, but it was the first model to explain why the iteration matters. As originally envisioned, the iterations were typically 6 months to 2 years long. Each phase starts with a design goal and ends with a client reviewing the progress thus far. Analysis and engineering efforts are applied at each phase of the project, with an eye toward the end goal of the project.
The steps of the Spiral Model can be generalized as follows:
- The new system requirements are defined in as much detail as possible. This usually involves interviewing a number of users representing all the external or internal users and other aspects of the existing system.
- A preliminary design is created for the new system.
- A first prototype of the new system is constructed from the preliminary design. This is usually a scaled-down system, and represents an approximation of the characteristics of the final product.
- A second prototype is evolved by a fourfold procedure: 1) evaluating the first prototype in terms of its strengths and weaknesses; 2) defining the requirements of the second prototype; 3) planning and designing the second prototype; 4) constructing and testing the second prototype.
- At the customer's option, the entire project can be aborted if the risk is deemed too great. Risk factors might involve development cost overruns, operating-cost miscalculation, or any other factor that could, in the customer's judgment, result in a less-than-satisfactory final product.
- The existing prototype is evaluated in the same manner as the previous prototype, and if necessary, another prototype is developed from it according to the fourfold procedure outlined above.
- The preceding steps are iterated until the customer is satisfied that the refined prototype represents the final product desired.
- The final system is constructed, based on the refined prototype.
- The final system is thoroughly evaluated and tested. Routine maintenance is carried out on a continuing basis to prevent large-scale failures and to minimize downtime.
The following diagram shows how the spiral model works:
(Figure 4.1-Spiral Model)
For flexibility of use, the interface has been developed with a graphics concept in mind, accessed through a browser interface. The GUIs at the top level have been categorized as: 1) the administrative user interface, and 2) the operational or generic user interface. The administrative user interface concentrates on the consistent information that is practically part of the organizational activities and which needs proper authentication for data collection. These interfaces help the administrators with all the transactional states, like data insertion, data deletion, and data updating, along with extensive data search capabilities. The operational or generic user interface helps the users of the system in transactions through the existing data and required services. The operational user interface also helps ordinary users manage their own information in a customized manner as per the assisted flexibilities.
4.2 Existing System:
A new (emerging) topic is something people
feel like discussing, commenting, or forwarding the information
further to their friends. Conventional approaches for topic
detection have mainly been concerned with the frequencies of
(textual) words.
4.2.1 Disadvantages of Existing System:
A
term-frequency-based approach could suffer from the ambiguity
caused by synonyms or homonyms. It may also require complicated
preprocessing (e.g., segmentation) depending on the target
language. Moreover, it cannot be applied when the contents of the
messages are mostly nontextual information. On the other hand, the
words formed by mentions are unique, require little preprocessing
to obtain (the information is often separated from the contents),
and are available regardless of the nature of the contents.
4.3 Proposed System:
In this paper, we have proposed a new approach to detect the emergence of topics in a social network stream. The basic idea of our approach is to focus on the social aspect of the posts, reflected in the mentioning behavior of users, instead of the textual contents. We have proposed a probability model that captures both the number of mentions per post and the frequency of the mentioned users.
4.3.1 Advantages of Proposed System:
Since the proposed method does not rely on the textual contents of social network posts, it is robust to rephrasing, and it can be applied to cases where topics are concerned with information other than texts, such as images, video, audio, and so on. The proposed link-anomaly-based methods performed even better than the keyword-based methods on the NASA and BBC data sets.
4.4 Developer's Responsibilities Overview:
The developer is responsible for:
- Developing the system, which meets the SRS, and solving all the requirements of the system.
- Demonstrating the system and installing the system at the client's location after the acceptance testing is successful.
- Submitting the required user manual describing the system interfaces to work on it and also the documents of the system.
- Conducting any user training that might be needed for using the system.
- Maintaining the system for a period of one year after installation.
4.5 Functional Requirements:
Functional requirements refer to very important system requirements in a software engineering process (or, at the micro level, a sub-part of requirement engineering), such as technical specifications, system design parameters and guidelines, data manipulation, data processing, and calculation modules, etc. Functional requirements stand in contrast to other software design requirements referred to as non-functional requirements, which are primarily based on parameters of system performance, software quality attributes, reliability and security, cost, constraints in design/implementation, etc. The key goal of determining functional requirements in a software product design and implementation is to capture the required behavior of a software system in terms of functionality and the technology implementation of the business processes. The Functional Requirement document (also called Functional Specifications or Functional Requirement Specifications) defines the capabilities and functions that a system must be able to perform successfully. Functional requirements should include:
- Descriptions of data to be entered into the system.
- Descriptions of operations performed by each screen.
- Descriptions of work-flows performed by the system.
- Descriptions of system reports or other outputs.
- Who can enter the data into the system.
- How the system meets applicable regulatory requirements.
The functional specification is designed to be read by a general audience. Readers should understand the system, but no particular technical knowledge should be required to understand the document.
4.5.1 Examples of Functional Requirements:
Functional requirements should include functions performed by specific screens, outlines of work-flows performed by the system, and other business or compliance requirements the system must meet.
4.5.2 Interface Requirements:
- Field accepts numeric data entry.
- Field only accepts dates before the current date.
- Screen can print on-screen data to the printer.
4.5.3 Business Requirements:
- Data must be entered before a request can be approved.
- Clicking the Approve button moves the request to the Approval Workflow.
- All personnel using the system will be trained according to internal training strategies.
4.5.4 Regulatory/Compliance Requirements:
- The database will have a functional audit trail.
- The system will limit access to authorized users.
- The spreadsheet can secure data with electronic signatures.
4.5.5 Security Requirements:
- Members of the Data Entry group can enter requests but cannot approve or delete requests.
- Members of the Managers group can enter or approve a request, but cannot delete requests.
- Members of the Administrators group cannot enter or approve requests, but can delete requests.
The functional specification describes what the system must do; how the system does it is described in the Design Specification. If a User Requirement Specification was written, all requirements outlined in the User Requirement Specification should be addressed in the functional requirements.
4.6 Non-Functional Requirements:
All the other requirements which do not form a part of the above specification are categorized as non-functional requirements. For example, a system may be required to present the user with a display of the number of records in a database.
4.7 Hardware Requirements:
- System: Pentium IV 2.4 GHz
- Hard Disk: 40 GB
- Floppy Drive: 1.44 MB
- Monitor: 15" VGA Colour
- Mouse: Logitech
- RAM: 512 MB
4.8 Software Requirements:
- Operating System: Windows XP/7
- Coding Language: JAVA/J2EE
- IDE: NetBeans 7.4
- Database: MySQL
5. System Design
Systems design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. Systems design could be seen as the application of systems theory to product development. There is some overlap with the disciplines of systems analysis, systems architecture, and systems engineering.
5.1 System Architecture:
(Figure 5.1 Overall flow of the proposed method)
5.2 Block Diagram
(Figure 5.2 Block Diagram of Proposed System)
5.3 Data Flow Diagram:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, the various processing carried out on this data, and the output data generated by the system.
2. The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system processes, the data used by the processes, the external entities that interact with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction. A DFD may be partitioned into levels that represent increasing information flow and functional detail.
Level 0:
(Figure 5.3 Data Flow Diagram of Twitter Trends)
Level 1:
(Figure 5.4 Data Flow Diagram of Perform Training)
Level 2:
(Figure 5.5 Data Flow Diagram of Change Point)
Level 3:
(Figure 5.6 Data Flow Diagram of Key Based Detection)
5.4 UML Diagrams:
UML stands for Unified Modeling Language. UML is a
standardized general-purpose modeling language in the field of object-oriented software engineering. The standard is managed, and was created by, the Object Management Group. The goal is for UML to become a common language for creating models of object-oriented computer software. In its current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML. The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems, as well as for business modeling and other non-software systems. The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems. The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.
5.5 Goals:
The primary goals in the design of the UML are as follows:
1. Provide users a ready-to-use, expressive visual modeling language so that they can develop and exchange meaningful models.
2. Provide extendibility and specialization mechanisms to extend the core concepts.
3. Be independent of particular programming languages and development processes.
4. Provide a formal basis for understanding the modeling language.
5. Encourage the growth of the OO tools market.
6. Support higher-level development concepts such as collaborations, frameworks, patterns, and components.
7. Integrate best practices.
5.6 Use Case Diagram:A use case diagram in the Unified Modeling
Language (UML) is a type of behavioral diagram defined by and
created from a Use-case analysis. Its purpose is to present a
graphical overview of the functionality provided by a system in
terms of actors, their goals (represented as use cases), and any
dependencies between those use cases. The main purpose of a use
case diagram is to show what system functions are performed for
which actor. Roles of the actors in the system can be depicted.
(Figure 5.7 Actor & Use Case Diagram)
5.7 Class Diagram:
In
software engineering, a class diagram in the Unified Modeling
Language (UML) is a type of static structure diagram that describes
the structure of a system by showing the system's classes, their
attributes, operations (or methods), and the relationships among
the classes. It explains which class contains information.
(Figure 5.8 Class Diagram Representation )
5.8 Sequence Diagram:
A sequence diagram in Unified Modeling
Language (UML) is a kind of interaction diagram that shows how
processes operate with one another and in what order. It is a
construct of a Message Sequence Chart. Sequence diagrams are
sometimes called event diagrams, event scenarios, and timing
diagrams.
(Figure 5.9 Sequence Diagram Representing Sequence Activities)
5.9 Activity Diagram:
Activity diagrams are graphical
representations of workflows of stepwise activities and actions
with support for choice, iteration and concurrency. In the Unified
Modeling Language, activity diagrams can be used to describe the
business and operational step-by-step workflows of components in a
system. An activity diagram shows the overall flow of control.
(Figure 5.10 Activity Diagram Showing Sequence of
Activities)
5.10 Collaborative Diagram :
(Figure 5.11 Collaborative Diagram Showing Collaboration Between
All The Use Case )
5.11 Input Design:
The input design is the link between the information system and the user. It comprises developing the specifications and procedures for data preparation, and those steps necessary to put transaction data into a usable form for processing, which can be achieved by having the computer read data from a written or printed document or by having people key the data directly into the system. The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple. The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considers the following things:
- What data should be given as input?
- How should the data be arranged or coded?
- The dialog to guide the operating personnel in providing input.
- Methods for preparing input validations and steps to follow when errors occur.
5.12 Objectives:
1. Input design is the process of converting a user-oriented description of the input into a computer-based system. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for the data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed in such a way that all the data manipulations can be performed. It also provides record viewing facilities.
3. When the data is entered, it is checked for validity. Data can be entered with the help of screens. Appropriate messages are provided as and when needed so that the user is not left in a maze of confusion. Thus the objective of input design is to create an input layout that is easy to follow.
5.13 Output Design:
A quality output is one which meets
the requirements of the end user and presents the information
clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design, it is determined how the information is to be displayed for immediate need, and also the hard copy output. It is the most important and direct source of information to the user. Efficient and intelligent output design improves the system's relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well-thought-out manner; the right output must be developed while ensuring that each output element is designed so that people will find the system easy and effective to use. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following objectives:
- Convey information about past activities, current status, or projections of the future.
- Signal important events, opportunities, problems, or warnings.
- Trigger an action.
- Confirm an action.
5.14 Related Work:
Detection and tracking of topics have been studied extensively in the area of topic detection and tracking (TDT). In this context, the main task is to either classify a new document into one of the known topics (tracking) or to detect that it belongs to none of the known categories. Subsequently, the temporal structure of topics has been modeled and analyzed through dynamic model selection, temporal text mining, and factorial hidden Markov models. Another line of research is concerned with formalizing the notion of bursts in a stream of documents. In his seminal paper, Kleinberg modeled bursts using a time-varying Poisson process with a hidden discrete process that controls the firing rate. Recently, He and Parker developed a physics-inspired model of bursts based on the change in the momentum of topics. All the above-mentioned studies make use of the textual content of the documents, but not the social content of the documents. The social content (links) has been utilized in the study of citation networks. However, citation networks are often analyzed in a stationary setting. The novelty of the current paper lies in focusing on the social content of the documents (posts) and in combining this with a change-point analysis.
5.15 Proposed Method:
The overall flow of the proposed method is shown in Figure 5.1. We assume that the data arrives from a social network service in a sequential manner through some API. For each new post, we use the samples within the past T time interval for the corresponding user to train the mention model we propose below. We assign an anomaly score to each post based on the learned probability distribution. The score is then aggregated over users and further fed into a change-point analysis.
A. Probability Model
We characterize a post in a social network stream by the number of mentions k it contains, and the set V of names (IDs) of the users mentioned in the post. Formally, we consider the following joint probability distribution:

    P(k, V | π, {θ_v}) = P(k | π) ∏_{v ∈ V} θ_v ..... (1)

Here the joint distribution consists of two parts: the probability of the number of mentions k, and the probability of each mention given the number of mentions. The probability of the number of mentions P(k | π) is defined as a geometric distribution with parameter π as follows:

    P(k | π) = (1 − π) π^k ..... (2)

On the other hand, the probability of mentioning the users in V is defined as an independent, identical multinomial distribution with parameters θ_v. Suppose that we are given n training examples T = {(k_1, V_1), . . . , (k_n, V_n)}, from which we would like to learn the predictive distribution

    P(k, V | T) = P(k | T) ∏_{v ∈ V} P(v | T) ..... (3)

First we compute the predictive distribution with respect to the number of mentions, P(k | T). This can be obtained by assuming a beta distribution as a prior and integrating out the parameter π. The density function of the beta prior distribution is written as follows:

    p(π; α, β) = π^(α−1) (1 − π)^(β−1) / B(α, β)

where α and β are parameters of the beta distribution and B(α, β) is the beta function. By the Bayes rule, the predictive distribution can be obtained as follows:

    P(k | T) = ∫ (1 − π) π^k (1 − π)^n π^m p(π; α, β) dπ / ∫ (1 − π)^n π^m p(π; α, β) dπ

Both the integrals in the numerator and denominator can be obtained in closed form as beta functions, and the predictive distribution can be rewritten as follows:

    P(k | T) = B(α + m + k, β + n + 1) / B(α + m, β + n)

Using the relation between the beta function and the gamma function, we can further simplify the expression as follows:

    P(k | T) = (β + n) Γ(α + m + k) Γ(α + β + m + n) / ( Γ(α + m) Γ(α + β + m + n + k + 1) ) ..... (4)

where m = Σ_{i=1}^{n} k_i is the total number of mentions in the training set T.
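For illustration, the predictive distribution (4) as reconstructed above can be evaluated stably in log space without a gamma-function library, because the gamma ratios telescope into finite products. A minimal Java sketch, assuming a Beta(alpha, beta) prior with hypothetical parameter values:

    public class MentionCountModel {
        // Log predictive probability of seeing k mentions in a new post, given
        // n training posts containing m mentions in total, under a geometric
        // likelihood with a Beta(alpha, beta) prior (prior values are assumptions).
        static double logPredictive(int k, int n, int m, double alpha, double beta) {
            double a = alpha + m, b = beta + n;
            double logP = Math.log(b);
            for (int j = 0; j < k; j++) logP += Math.log(a + j);      // Gamma(a+k)/Gamma(a)
            for (int j = 0; j <= k; j++) logP -= Math.log(a + b + j); // Gamma(a+b)/Gamma(a+b+k+1)
            return logP;
        }

        public static void main(String[] args) {
            // A user with 10 past posts and 6 total mentions; alpha = beta = 1.
            for (int k = 0; k <= 3; k++) {
                System.out.printf("P(k=%d|T) = %.4f%n",
                        k, Math.exp(logPredictive(k, 10, 6, 1, 1)));
            }
        }
    }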
Next, we derive the predictive distribution P(v | T) of mentioning user v. The maximum likelihood (ML) estimator is given as P(v | T) = m_v / m, where m is the total number of mentions and m_v is the number of mentions to user v in the data set T. The ML estimator, however, cannot handle users that did not appear in the training set T; it would assign probability zero to all these users, which would appear infinitely anomalous in our framework. Instead we use the Chinese Restaurant Process (CRP; see [9]) based estimation. The CRP-based estimator assigns to each user v a probability proportional to the number of mentions m_v in the training set T; in addition, it keeps a probability proportional to γ for mentioning someone who was not mentioned in the training set T. Accordingly, the probability of a known user is given as follows:

    P(v | T) = m_v / (m + γ) ..... (5)

On the other hand, the probability of mentioning a new user is given as follows:

    P(new user | T) = γ / (m + γ) ..... (6)
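A minimal Java sketch of the CRP-based estimates (5) and (6); the concentration parameter gamma and the counts below are hypothetical values:

    public class CrpMentionModel {
        // mv: mentions of user v in the training set; m: total mentions;
        // gamma: CRP concentration parameter (a model assumption).
        static double probKnownUser(int mv, int m, double gamma) {
            return mv / (m + gamma);    // equation (5)
        }

        static double probNewUser(int m, double gamma) {
            return gamma / (m + gamma); // equation (6)
        }

        public static void main(String[] args) {
            int m = 20;          // total mentions in the training period
            double gamma = 0.5;  // hypothetical concentration parameter
            System.out.println(probKnownUser(8, m, gamma)); // user mentioned 8 times: ~0.39
            System.out.println(probNewUser(m, gamma));      // never-mentioned user: ~0.024
        }
    }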
B. Computing the Link-Anomaly Score
In order to compute the anomaly score of a new post x = (t, u, k, V) by user u at time t containing k mentions to users V, we compute the probability (3) with the training set T_u(t), which is the collection of posts by user u in the time period [t − T, t] (we use T = 30 days in this paper). Accordingly, the link-anomaly score is defined as follows:

    s(x) = −log P(k | T_u(t)) − Σ_{v ∈ V} log P(v | T_u(t)) ..... (7)

The two terms in the above equation can be computed via the predictive distribution of the number of mentions (4), and the predictive distribution of the mentionee (5)-(6), respectively.
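Combining the two predictive distributions, the link-anomaly score (7) of a post is its negative log-likelihood under the learned model. The following sketch reuses the two helper classes above; the counts in the example are hypothetical:

    public class LinkAnomalyScore {
        // Score of a post with k mentions, where mentionCounts[i] is the
        // training-set mention count of the i-th mentioned user (0 if unseen).
        static double score(int k, int[] mentionCounts, int n, int m,
                            double alpha, double beta, double gamma) {
            double s = -MentionCountModel.logPredictive(k, n, m, alpha, beta);
            for (int mv : mentionCounts) {
                double p = (mv > 0) ? CrpMentionModel.probKnownUser(mv, m, gamma)
                                    : CrpMentionModel.probNewUser(m, gamma);
                s -= Math.log(p);
            }
            return s;
        }

        public static void main(String[] args) {
            // A post with 2 mentions: one to a frequently mentioned user (8 past
            // mentions) and one to a user never mentioned before.
            System.out.println(score(2, new int[]{8, 0}, 10, 20, 1, 1, 0.5));
        }
    }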
C. Combining Anomaly Scores from Different Users
The anomaly score in (7) is computed for each user, depending on the current post of user u and his/her past behavior T_u(t). In order to measure the general trend of user behavior, we propose to aggregate the anomaly scores obtained for posts x_1, . . . , x_n using a discretization of window size τ > 0 as follows:

    s'_j = (1/τ) Σ_{i : t_i ∈ ((j−1)τ, jτ]} s(x_i) ..... (8)

where x_i = (t_i, u_i, k_i, V_i) is the post at time t_i by user u_i, including k_i mentions to users V_i.
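A minimal Java sketch of the aggregation step (8); the timestamps, scores, and the window size tau are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    public class ScoreAggregator {
        // Sums per-post anomaly scores over consecutive windows of width tau
        // (same time unit as the timestamps) and normalizes each sum by tau.
        static List<Double> aggregate(double[] times, double[] scores, double tau) {
            int windows = (int) Math.ceil(times[times.length - 1] / tau);
            double[] sums = new double[Math.max(windows, 1)];
            for (int i = 0; i < times.length; i++) {
                int w = Math.min((int) (times[i] / tau), sums.length - 1);
                sums[w] += scores[i];
            }
            List<Double> out = new ArrayList<>();
            for (double s : sums) out.add(s / tau);
            return out;
        }

        public static void main(String[] args) {
            double[] t = {0.5, 1.2, 1.8, 2.5}; // post timestamps (e.g., in days)
            double[] s = {1.0, 2.0, 0.5, 4.0}; // per-post link-anomaly scores
            System.out.println(aggregate(t, s, 1.0)); // [1.0, 2.5, 4.0]
        }
    }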
D. Change-Point Detection via Sequentially Discounting Normalized Maximum Likelihood Coding
Given an aggregated measure of anomaly (8), we apply a change-point detection technique based on the SDNML coding [3]. This technique detects a change in the statistical dependence structure of a time series by monitoring the compressibility of the new piece of data. The SDNML proposed in [3] is an approximation of the normalized maximum likelihood (NML) code length that can be computed sequentially and employs discounting in the learning of the AR models. Algorithmically, the change-point detection procedure can be outlined as follows. For convenience, we denote the aggregate anomaly score as x_j, instead of s'_j.
1. First-layer learning: Let x^(j−1) := {x_1, . . . , x_(j−1)} be the collection of aggregate anomaly scores from discrete time 1 to j − 1. Sequentially learn the SDNML density function p_SDNML(x_j | x^(j−1)) (j = 1, 2, . . .); see Appendix A for details.
2. First-layer scoring: Compute the intermediate change-point score by smoothing the log loss of the SDNML density function with window size κ as follows:

    y_j = (1/κ) Σ_{j' = j−κ+1}^{j} ( −log p_SDNML(x_(j') | x^(j'−1)) )

3. Second-layer learning: Let y^(j−1) := {y_1, . . . , y_(j−1)} be the collection of smoothed change-point scores obtained as above. Sequentially learn the second-layer SDNML density function p_SDNML(y_j | y^(j−1)) (j = 1, 2, . . .); see Appendix A.
4. Second-layer scoring: Compute the final change-point score by smoothing the log loss of the second-layer SDNML density function as follows:

    Score(j) = (1/κ) Σ_{j' = j−κ+1}^{j} ( −log p_SDNML(y_(j') | y^(j'−1)) )
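The two-layer structure above can be sketched in Java as follows. A full SDNML implementation is beyond the scope of this illustration, so the sketch substitutes a sequentially updated Gaussian predictive density for the SDNML density; only the structure (log-loss scoring in a first layer, smoothing, then scoring the smoothed sequence again in a second layer) mirrors the steps above.

    import java.util.ArrayList;
    import java.util.List;

    public class TwoLayerChangeScore {
        // Log loss of each point under a Gaussian fitted to all earlier points.
        // (A stand-in for the SDNML density, not the actual SDNML code length.)
        static double[] logLosses(List<Double> xs) {
            double[] out = new double[xs.size()];
            double sum = 0, sumSq = 0;
            int n = 0;
            for (int j = 0; j < xs.size(); j++) {
                double mean = (n > 0) ? sum / n : 0.0;
                double var = (n > 1) ? Math.max((sumSq - n * mean * mean) / (n - 1), 1e-6) : 1.0;
                double x = xs.get(j);
                out[j] = 0.5 * Math.log(2 * Math.PI * var) + (x - mean) * (x - mean) / (2 * var);
                sum += x; sumSq += x * x; n++;
            }
            return out;
        }

        // Smooths a score sequence with a trailing window of size kappa.
        static List<Double> smooth(double[] scores, int kappa) {
            List<Double> out = new ArrayList<>();
            for (int j = 0; j < scores.length; j++) {
                double s = 0; int c = 0;
                for (int i = Math.max(0, j - kappa + 1); i <= j; i++) { s += scores[i]; c++; }
                out.add(s / c);
            }
            return out;
        }

        public static void main(String[] args) {
            List<Double> agg = new ArrayList<>();
            for (int i = 0; i < 30; i++) agg.add(i < 20 ? 1.0 + 0.1 * (i % 3) : 5.0); // shift at j=20
            List<Double> firstLayer = smooth(logLosses(agg), 5);        // outlier scores
            List<Double> finalScore = smooth(logLosses(firstLayer), 5); // change-point scores
            System.out.println(finalScore);
        }
    }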
E. Dynamic Threshold Optimization (DTO)
We raise an alarm if the change-point score exceeds a threshold, which is determined adaptively using the method of dynamic threshold optimization (DTO). In DTO, we use a one-dimensional histogram to represent the score distribution, and we learn it in a sequential and discounting way. Then, for a specified value ρ, we determine the threshold to be the largest score value such that the tail probability beyond that value does not exceed ρ. We call ρ the threshold parameter. The details of DTO are summarized in Algorithm 1.
Algorithm 1: Dynamic Threshold Optimization (DTO)
Given: scores {Score_j}, N_H: total number of cells, ρ: threshold parameter, λ_H: estimation parameter, r_H: discounting parameter, M: data size.
Initialization: Let q_0 be a uniform distribution over the N_H cells.
For j = 1, . . . , M − 1 do:
- Threshold optimization: Let l be the least index such that the tail mass of q_(j−1) from the l-th cell upward does not exceed ρ; the threshold at time j is given as the score value at the boundary of the l-th cell.
- Alarm output: Raise an alarm if Score_j exceeds the threshold.
- Histogram update: q_j(h) = (1 − r_H) q_(j−1)(h) + r_H if Score_j falls into the h-th cell, and q_j(h) = (1 − r_H) q_(j−1)(h) otherwise.
End for.
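A minimal Java sketch of DTO with a discounting histogram; the cell layout, rho, and the discounting rate below are illustrative choices rather than the exact parameters of the original algorithm:

    public class DynamicThreshold {
        private final double[] q;      // histogram cell probabilities
        private final double lo, cell; // histogram covers [lo, lo + q.length * cell)
        private final double rho, r;   // tail mass for the threshold; discounting rate

        DynamicThreshold(int cells, double lo, double hi, double rho, double r) {
            this.q = new double[cells];
            java.util.Arrays.fill(q, 1.0 / cells); // initialize with a uniform distribution
            this.lo = lo; this.cell = (hi - lo) / cells; this.rho = rho; this.r = r;
        }

        // Largest cell boundary whose tail mass does not exceed rho.
        double threshold() {
            double tail = 0;
            for (int h = q.length - 1; h >= 0; h--) {
                tail += q[h];
                if (tail > rho) return lo + (h + 1) * cell;
            }
            return lo;
        }

        // Returns true if 'score' raises an alarm, then updates the histogram.
        boolean observe(double score) {
            boolean alarm = score >= threshold();
            int h = Math.min(q.length - 1, Math.max(0, (int) ((score - lo) / cell)));
            for (int i = 0; i < q.length; i++) q[i] *= (1 - r); // discount old mass
            q[h] += r;                                          // add mass to the observed cell
            return alarm;
        }

        public static void main(String[] args) {
            DynamicThreshold dto = new DynamicThreshold(20, 0.0, 10.0, 0.05, 0.02);
            double[] scores = {1.0, 1.2, 0.8, 1.1, 9.5, 1.0};
            for (double s : scores) System.out.println(s + " -> alarm=" + dto.observe(s));
        }
    }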
6. Implementation
6.1 Modules:
The modules are described below:
6.1.1 Training
6.1.2 Aggregate
6.1.3 Identify Individual Anomaly Score
6.1.4 Change Point Analysis and DTO
6.1.5 Burst Detection
6.1.1 Training:
In this section, we describe the probability
model that we used to capture the normal mentioning behavior of a
user and how to train the model. We characterize a post in a social
network stream by the number of mentions k it contains, and the set
V of names (IDs) of the mentionees (users who are mentioned in the
post). There are two types of infinity we have to take into account
here. The first is the number k of users mentioned in a post. Although in practice a user cannot mention hundreds of other users in a post, we would like to avoid putting an artificial limit on
the number of users mentioned in a post. Instead, we will assume a
geometric distribution and integrate out the parameter to avoid
even an implicit limitation through the parameter. The second type
of infinity is the number of users one can possibly mention. To avoid limiting the number of possible mentionees, we use the Chinese Restaurant Process (CRP) based estimation, which handles an effectively infinite vocabulary of users.
6.1.2 Aggregate:
In this subsection, we describe how to combine the anomaly scores from different users. The anomaly score is computed for each user depending on the current post of user u and his/her past behavior T_u(t). To measure the general trend of user behavior, we propose to aggregate the anomaly scores obtained for posts x_1, . . . , x_n using a discretization of window size τ > 0.
6.1.3 Identify individual Anomaly Score:In this subsection, we
describe how to compute the deviation of a users behavior from the
normal mentioning behavior modeled in the previous subsection.
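A minimal sketch of this aggregation step, assuming the per-post anomaly scores arrive in time order as (timestamp, score) pairs and that the window size tau is given in the same time units. Whether scores within a window are summed or averaged is a design choice; this sketch sums them.

import java.util.ArrayList;
import java.util.List;

// Aggregate per-post anomaly scores into a time series using a
// discretization of window size tau (assumed given in seconds here).
public final class ScoreAggregator {
    /** timestamps[i] is the arrival time of the post with score scores[i];
     *  timestamps are assumed sorted in ascending order. */
    public static List<Double> aggregate(double[] timestamps, double[] scores,
                                         double tau) {
        List<Double> aggregated = new ArrayList<>();
        if (timestamps.length == 0) return aggregated;
        double windowStart = timestamps[0];
        double sum = 0.0;
        for (int i = 0; i < timestamps.length; i++) {
            while (timestamps[i] >= windowStart + tau) {
                aggregated.add(sum);   // close the current window
                sum = 0.0;
                windowStart += tau;    // advance to the next window (may be empty)
            }
            sum += scores[i];
        }
        aggregated.add(sum);           // close the final window
        return aggregated;
    }
}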
6.1.4 Change Point Analysis and DTO: This technique is an extension of ChangeFinder, and it detects a change in the statistical dependence structure of a time series by monitoring the compressibility of a new piece of data. Urabe et al. proposed to use a sequential version of normalized maximum-likelihood (NML) coding called SDNML coding as a coding criterion instead of the plug-in predictive distribution used in ChangeFinder. Specifically, a change point is detected through two layers of scoring processes. The first layer detects outliers and the second layer detects change points. In each layer, the predictive loss based on the SDNML coding distribution for an autoregressive (AR) model is used as a criterion for scoring. Although the NML code length is known to be optimal, it is often hard to compute. The sequentially normalized maximum likelihood (SNML) code length is an approximation to the NML code length that can be computed in a sequential manner. SDNML further employs discounting in the learning of the AR models.
As a final step in our method, we need to convert the change-point scores into binary alarms by thresholding. Since the distribution of change-point scores may change over time, we need to dynamically adjust the threshold to analyze a sequence over a long period of time. In this subsection, we describe how to dynamically optimize the threshold using the method of dynamic threshold optimization (DTO) described above. In DTO, we use a one-dimensional histogram for the representation of the score distribution, and we learn it in a sequential and discounting way.
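To make the two-layer structure described above concrete, here is a minimal Java sketch of the scoring pipeline. The SdnmlDensity interface is a placeholder we introduce for the example; its implementation (the AR-model SDNML learner of Appendix A) is omitted.

import java.util.ArrayList;
import java.util.List;

// Two-layer change-point scoring: each layer turns a sequence into
// smoothed SDNML log losses. SdnmlDensity is an assumed placeholder for
// the sequentially learned SDNML density.
interface SdnmlDensity {
    /** Log loss -log p_SDNML(x_j | x^{j-1}); also updates the model with x_j. */
    double logLossAndUpdate(double x);
}

final class TwoLayerScorer {
    /** Smoothed log losses of one layer with window size k. */
    static List<Double> layerScores(List<Double> xs, SdnmlDensity density, int k) {
        List<Double> losses = new ArrayList<>();
        List<Double> smoothed = new ArrayList<>();
        for (double x : xs) {
            losses.add(density.logLossAndUpdate(x));
            int from = Math.max(0, losses.size() - k);
            double sum = 0.0;
            for (int i = from; i < losses.size(); i++) sum += losses.get(i);
            smoothed.add(sum / (losses.size() - from)); // average over the window
        }
        return smoothed;
    }

    /** Final change-point scores: the second layer applied to the
     *  smoothed output of the first layer. */
    static List<Double> changePointScores(List<Double> aggregated,
                                          SdnmlDensity layer1,
                                          SdnmlDensity layer2, int k) {
        return layerScores(layerScores(aggregated, layer1, k), layer2, k);
    }
}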
6.1.5 Burst Detection: In addition to the change-point detection based on SDNML followed by DTO described in previous sections, we also test the combination of our method with Kleinberg's burst-detection method. More specifically, we implemented a two-state version of Kleinberg's burst-detection model. We chose the two-state version because in this experiment we expect a nonhierarchical structure. The burst-detection method is based on a probabilistic automaton model with two states, a burst state and a non-burst state. Some events (e.g., the arrival of posts) are assumed to happen according to a time-varying Poisson process whose rate parameter depends on the current state.
6.2 Java Technology: Java technology is both a programming language and a platform.
6.3 The Java Programming Language: The Java programming language is a high-level language that can be characterized by all of the following buzzwords: simple, architecture neutral, object oriented, portable, distributed, high performance, interpreted, multithreaded, robust, dynamic, and secure.
With most programming languages, you either compile or interpret a program so that you can run it on your computer. The Java programming language is unusual in that a program is both compiled and interpreted. With the compiler, you first translate a program into an intermediate language called Java byte codes, the platform-independent codes interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java byte code instruction on the computer. Compilation happens just once; interpretation occurs each time the program is executed. The following figure illustrates how this works.
(Figure 6.1 Java structure)
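For example, a trivial program makes the two steps visible: the compiler (javac) produces HelloWorld.class containing byte codes, and the interpreter (java) then runs those byte codes on any Java VM.

// HelloWorld.java -- compile once with:  javac HelloWorld.java
// then run the resulting byte codes anywhere with:  java HelloWorld
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, world!");
    }
}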
You can think of Java byte codes as the machine code instructions for the Java Virtual Machine (Java VM). Every Java interpreter, whether it's a development tool or a Web browser that can run applets, is an implementation of the Java VM. Java byte codes help make "write once, run anywhere" possible. You can compile your program into byte codes on any platform that has a Java compiler. The byte codes can then be run on any implementation of the Java VM. That means that as long as a computer has a Java VM, the same program written in the Java programming language can run on Windows 2000, a Solaris workstation, or an iMac.
(Figure 6.2 Java Program & Compiler)
6.4 The Java Platform: A platform is the hardware or software environment in which a program runs. We've already mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and MacOS. Most platforms can be described as a combination of the operating system and hardware. The Java platform differs from most other platforms in that it's a software-only platform that runs on top of other hardware-based platforms. The Java platform has two components: the Java Virtual Machine (Java VM) and the Java Application Programming Interface (Java API).
You've already been introduced to the Java VM. It's the base for the Java platform and is ported onto various hardware-based platforms. The Java API is a large collection of ready-made software components that provide many useful capabilities, such as graphical user interface (GUI) widgets. The Java API is grouped into libraries of related classes and interfaces; these libraries are known as packages. The next section, What Can Java Technology Do?, highlights the functionality that some of the packages in the Java API provide. The following figure depicts a program that's running on the Java platform. As the figure shows, the Java API and the virtual machine insulate the program from the hardware.
(Figure 6.3 Java Platform)
Native code is code that, after compilation, runs on a specific hardware platform. As a platform-independent environment, the Java platform can be a bit slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code compilers can bring performance close to that of native code without threatening portability.
6.5 What Can Java Technology Do? The most common types
of programs written in the Java programming language are applets and applications. If you've surfed the Web, you're probably already familiar with applets. An applet is a program that adheres to certain conventions that allow it to run within a Java-enabled browser. However, the Java programming language is not just for writing cute, entertaining applets for the Web. The general-purpose, high-level Java programming language is also a powerful software platform. Using the generous API, you can write many types of programs. An application is a standalone program that runs directly on the Java platform. A special kind of application known as a server serves and supports clients on a network. Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another specialized program is a servlet. A servlet can almost be thought of as an applet that runs on the server side. Java Servlets are a popular choice for building interactive web applications, replacing the use of CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications. Instead of working in browsers, though, servlets run within Java Web servers, configuring or tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software components that provide a wide range of functionality. Every full implementation of the Java platform gives you the following features: The essentials: Objects, strings, threads, numbers, input and output, data structures, system properties, date and time, and so on. Applets: The set of conventions used by applets. Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) sockets, and IP (Internet Protocol) addresses. Internationalization: Help for writing programs that can be localized for users worldwide. Programs can automatically adapt to specific locales and be displayed in the appropriate language. Security: Both low level and high level, including electronic signatures, public and private key management, access control, and certificates. Software components: Known as JavaBeans™, these can plug into existing component architectures. Object serialization: Allows lightweight persistence and communication via Remote Method Invocation (RMI). Java Database Connectivity (JDBC™): Provides uniform access to a wide range of relational databases. The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration, telephony, speech, animation, and more. The following figure depicts what is included in the Java 2 SDK.
(Figure 6.4 Java IDE)
6.6 How Will Java Technology Change My Life? We can't promise you fame, fortune, or even a job if you learn the Java programming language. Still, it is likely to make your programs better, and it requires less effort than other languages. We believe that Java technology will help you do the following: Get started quickly: Although the Java programming language is a powerful object-oriented language, it's easy to learn, especially for programmers already familiar with C or C++. Write less code: Comparisons of program metrics (class counts, method counts, and so on) suggest that a program written in the Java programming language can be four times smaller than the same program in C++. Write better code: The Java programming language encourages good coding practices, and its garbage collection helps you avoid memory leaks. Its object orientation, its JavaBeans component architecture, and its wide-ranging, easily extendible API let you reuse other people's tested code and introduce fewer bugs. Develop programs more quickly: Your development time may be as little as half of what it would be in C++, because you write fewer lines of code and the language is simpler than C++. Avoid platform dependencies with 100% Pure Java: You can keep your program portable by avoiding the use of libraries written in other languages. The 100% Pure Java™ Product Certification Program has a repository of historical process manuals, white papers, brochures, and similar materials online. Write once, run anywhere: Because 100% Pure Java programs are compiled into machine-independent byte codes, they run consistently on any Java platform. Distribute software more easily: You can upgrade applets easily from a central server. Applets take advantage of the ability to load new classes on the fly, without recompiling the entire program.
6.7 ODBC: Microsoft Open Database Connectivity (ODBC) is a
standard programming interface for application developers and
database systems providers. Before ODBC became a de facto standard
for Windows programs to interface with database systems,
programmers had to use proprietary languages for each database they
wanted to connect to. Now, ODBC has made the choice of the database
system almost irrelevant from a coding perspective, which is as it
should be. Application developers have much more important things
to worry about than the syntax that is needed to port their program
from one database to another when business needs suddenly change.
Through the ODBC Administrator in Control Panel, you can specify
the particular database that is associated with a data source that
an ODBC application program is written to use. Think of an ODBC
data source as a door with a name on it. Each door will lead you to
a particular database. For example, the data source named Sales
Figures might be a SQL Server database, whereas the Accounts
Payable data source could refer to an Access database. The physical
database referred to by a data source can reside anywhere on the
LAN. The ODBC system files are not installed on your system by Windows 95. Rather, they are installed when you set up a separate database application, such as SQL Server Client or Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a
file called ODBCINST.DLL. It is also possible to administer your
ODBC data sources through a stand-alone program called ODBCADM.EXE.
There is a 16-bit and a 32-bit version of this program and each
maintains a separate list of ODBC data sources.
From a programming perspective, the beauty of ODBC is that the
application can be written to use the same set of function calls to
interface with any data source, regardless of the database vendor.
The source code of the application doesn't change whether it talks
to Oracle or SQL Server. We only mention these two as an example.
There are ODBC drivers available for several dozen popular database
systems. Even Excel spreadsheets and plain text files can be turned
into data sources. The operating system uses the Registry
information written by ODBC Administrator to determine which
low-level ODBC drivers are needed to talk to the data source (such
as the interface to Oracle or SQL Server). The loading of the ODBC
drivers is transparent to the ODBC application program. In a
client/server environment, the ODBC API even handles many of the
network issues for the application programmer. The advantages of
this scheme are so numerous that you are probably thinking there
must be some catch. The only disadvantage of ODBC is that it isn't as efficient as talking directly to the native database interface. Many detractors have charged that ODBC is too slow.
Microsoft has always claimed that the critical factor in
performance is the quality of the driver software that is used. In
our humble opinion, this is true. The availability of good ODBC
drivers has improved a great deal recently. And anyway, the
criticism about performance is somewhat analogous to those who said
that compilers would never match the speed of pure assembly
language. Maybe not, but the compiler (or ODBC) gives you the
opportunity to write cleaner programs, which means you finish
sooner. Meanwhile, computers get faster every year.
6.8 JDBC: In an effort to set an independent database standard API for Java, Sun Microsystems developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access mechanism that provides a consistent interface to a variety of RDBMSs. This consistent interface is achieved through the use of plug-in database connectivity modules, or drivers. If a database vendor wishes to have JDBC support, it must provide the driver for each platform that the database and Java run on. To gain wider acceptance of JDBC, Sun based JDBC's framework on ODBC. As you discovered earlier in this chapter, ODBC has widespread support on a variety of platforms. Basing JDBC on ODBC allowed vendors to bring JDBC drivers to market much faster than developing a completely new connectivity solution. JDBC was announced in March of 1996. It was released for a 90-day public review that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was released soon after. The remainder of this section covers enough information about JDBC for you to know what it is about and how to use it effectively. This is by no means a complete overview of JDBC; that would fill an entire book.
6.9 JDBC Goals: Few software packages are designed without goals in mind. JDBC is no exception: its many goals drove the development of the API. These goals, in conjunction with early reviewer feedback, have finalized the JDBC class library into a solid framework for building database applications in Java. The goals that were set for JDBC are important. They will give you some insight as to why certain classes and functionalities behave the way they do. The design goals for JDBC are as follows:
6.9.1 SQL Level API: The designers felt that their main goal was to define a SQL interface for Java. Although not the lowest database interface level possible, it is at a low enough level for higher-level tools and APIs to be created. Conversely, it is at a high enough level for application programmers to use it confidently. Attaining this goal allows future tool vendors to generate JDBC code and to hide many of JDBC's complexities from the end user.
6.9.2 SQL Conformance: SQL syntax varies as you move from database vendor to database vendor. In an effort to support a wide variety of vendors, JDBC allows any query statement to be passed through it to the underlying database driver. This allows the connectivity module to handle non-standard functionality in a manner that is suitable for its users.
1. JDBC must be implementable on top of common database interfaces: The JDBC SQL API must sit on top of other common SQL level APIs. This goal allows JDBC to use existing ODBC level drivers by the use of a software interface. This interface would translate JDBC calls to ODBC and vice versa.
2. Provide a Java interface that is consistent with the rest of the Java system: Because of Java's acceptance in the user community thus far, the designers felt that they should not stray from the current design of the core Java system.
3. Keep it simple: This goal probably appears in all software design goal listings. JDBC is no exception. Sun felt that the design of JDBC should be very simple, allowing for only one method of completing a task per mechanism. Allowing duplicate functionality only serves to confuse the users of the API.
4. Use strong, static typing wherever possible: Strong typing allows more error checking to be done at compile time, so fewer errors appear at runtime.
5. Keep the common cases simple: Because, more often than not, the usual SQL calls used by the programmer are simple SELECTs, INSERTs, DELETEs and UPDATEs, these queries should be simple to perform with JDBC. However, more complex SQL statements should also be possible. A simple query example is sketched below.
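As an illustration of the "keep the common cases simple" goal, a simple SELECT can be written as below. The connection URL, credentials, table and column names are hypothetical placeholders for this sketch; the jdbc:odbc: URL assumes the JDBC-ODBC bridge discussed above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Minimal JDBC usage sketch: the URL, credentials, table and column
// names below are hypothetical placeholders.
public class JdbcExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:odbc:SalesFigures"; // e.g. an ODBC data source name
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT name, amount FROM sales WHERE amount > ?")) {
            ps.setDouble(1, 100.0); // bind the parameter of the WHERE clause
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name")
                            + " : " + rs.getDouble("amount"));
                }
            }
        } // try-with-resources closes the statement and connection
    }
}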
6.10 Networking:
6.10.1 TCP/IP stack:
(Figure 6.6 Application & h/w interface)
The TCP/IP stack is shorter than the OSI one. TCP is a connection-oriented protocol; UDP (User Datagram Protocol) is a connectionless protocol.
6.10.2 IP datagrams: The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers. The IP layer supplies a checksum that includes its own header. The header includes the source and destination addresses. The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
6.10.3 UDP: UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents of the datagram and port numbers. These are used to give a client/server model - see later.
6.10.4 TCP: TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a virtual circuit that two processes can use to communicate.
6.10.5 Internet addresses: In order to use a service, you must be able to find it. The Internet uses an address scheme for machines so that they can be located. The address is a 32 bit integer which gives the IP address. This encodes a network ID and more addressing. The network ID falls into various classes according to the size of the network address.
6.10.6 Network address: Class A uses 8 bits for the network address with 24 bits left over for other addressing. Class B uses 16 bit network addressing. Class C uses 24 bit network addressing and class D uses all 32.
6.10.7 Subnet address: Internally, the UNIX network is divided into sub networks. Building 11 is currently on one sub network and uses 10-bit addressing, allowing 1024 different hosts.
6.10.8 Host address: 8 bits are finally used for host addresses within our subnet. This places a limit of 256 machines that can be on the subnet.
6.10.9 Total address:
(Figure 6.7 Total Address)
The 32 bit address is usually written as 4 integers separated by dots.
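For example, the following small sketch converts a 32-bit address into the dotted notation by extracting its four bytes, most significant first.

// Convert a 32-bit IP address into the usual dotted notation by taking
// each of its four bytes, most significant first.
public class DottedQuad {
    public static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "."
             + ((address >>> 16) & 0xFF) + "."
             + ((address >>> 8) & 0xFF) + "."
             + (address & 0xFF);
    }

    public static void main(String[] args) {
        // 0x7F000001 is the loopback address 127.0.0.1
        System.out.println(toDotted(0x7F000001));
    }
}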
6.10.10 Port addresses:A service exists on a host, and is
identified by its port. This is a 16 bit number. To send a message
to a server, you send it to the port for that service of the host
that it is running on. This is not location transparency! Certain
of these ports are "well known".
6.10.11 Sockets: A socket is a data structure maintained by the system to handle network connections. A socket is created using the call socket. It returns an integer that is like a file descriptor. In fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here "family" will be AF_INET for IP communications, protocol will be zero, and type will depend on whether TCP or UDP is used. Two processes wishing to communicate over a network create a socket each. These are similar to two ends of a pipe - but the actual pipe does not yet exist.
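In Java, these socket concepts are wrapped by the java.net package. The following minimal sketch connects to a server, sends one line, and reads one line of the reply; the host and port are placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Minimal TCP client sketch using java.net.Socket; the host and port
// are hypothetical placeholders.
public class TcpClient {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 8080);  // connect to server
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("hello");              // send one line to the server
            System.out.println(in.readLine()); // read one line of the reply
        }
    }
}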
6.10.12 JFreeChart: JFreeChart is a free 100% Java chart library that makes it
easy for developers to display professional quality charts in their
applications. JFreeChart's extensive feature set includes: A
consistent and well-documented API, supporting a wide range of
chart types; A flexible design that is easy to extend, and targets
both server-side and client-side applications; Support for many
output types, including Swing components, image files (including
PNG and JPEG), and vector graphics file formats (including PDF, EPS
and SVG); JFreeChart is "open source" or, more specifically, free
software. It is distributed under the terms of the GNU Lesser
General Public Licence (LGPL), which permits use in proprietary
applications. 1. Map Visualizations:Charts showing values that
relate to geographical areas. Some examples include: (a) population
density in each state of the United States, (b) income per capita
for each country in Europe, (c) life expectancy in each country of
the world. The tasks in this project include: Sourcing freely redistributable vector outlines for the countries of the world, states/provinces in particular countries (USA in particular, but also other areas); Creating an appropriate dataset interface (plus default implementation), a renderer, and integrating this with the
existing XYPlot class in JFreeChart; Testing, documenting, testing
some more, documenting some more. 2. Time Series Chart
Interactivity:Implement a new (to JFreeChart) feature for
interactive time series charts --- to display a separate control
that shows a small version of ALL the time series data, with a
sliding "view" rectangle that allows you to select the subset of
the time series data to display in the main chart.3.
Dashboards:There is currently a lot of interest in dashboard
displays. Create a flexible dashboard mechanism that supports a
subset of JFreeChart chart types (dials, pies, thermometers, bars,
and lines/time series) that can be delivered easily via both Java
Web Start and an applet.4. Property Editors:The property editor
mechanism in JFreeChart only handles a small subset of the
properties that can be set for charts. Extend (or reimplement) this
mechanism to provide greater end-user control over the appearance
of the charts.
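As a small usage sketch of the JFreeChart 1.0.x API described above, the following renders a time series (for instance, change-point scores) to a PNG file; the data values are made up for illustration.

import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.time.Minute;
import org.jfree.data.time.TimeSeries;
import org.jfree.data.time.TimeSeriesCollection;

// Sketch: render a time series to a PNG file with JFreeChart 1.0.x.
public class ScoreChart {
    public static void main(String[] args) throws Exception {
        TimeSeries series = new TimeSeries("Change-point score");
        Minute t = new Minute(0, 12, 1, 1, 2015); // minute 0, hour 12, Jan 1, 2015
        double[] scores = {0.1, 0.2, 0.15, 0.9, 0.8}; // made-up values
        for (double s : scores) {
            series.add(t, s);
            t = (Minute) t.next(); // advance one minute per observation
        }
        JFreeChart chart = ChartFactory.createTimeSeriesChart(
                "Change-point scores", "Time", "Score",
                new TimeSeriesCollection(series), false, false, false);
        ChartUtilities.saveChartAsPNG(new File("scores.png"), chart, 600, 400);
    }
}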
6.11 What is a Java Web Application? A Java web application generates interactive web pages containing various types of markup language (HTML, XML, and so on) and dynamic content. It is typically composed of web components such as JavaServer Pages (JSP), servlets and JavaBeans to modify and temporarily store data, interact with databases and web services, and render content in response to client requests. Because many of the tasks involved in web application development can be repetitive or require a surplus of boilerplate code, web frameworks can be applied to alleviate the overhead associated with common activities. For example, many frameworks, such as JavaServer Faces, provide libraries for templating pages and session management, and often promote code reuse.
6.12 What is Java EE? Java EE (Enterprise Edition) is a widely
used platform containing a set of coordinated technologies that
significantly reduce the cost and complexity of developing,
deploying, and managing multi-tier, server-centric applications.
Java EE builds upon the Java SE platform and provides a set of APIs
(application programming interfaces) for developing and running
portable, robust, scalable, reliable and secure server-side
applications.Some of the fundamental components of Java EE include:
Enterprise JavaBeans (EJB): a managed, server-side component
architecture used to encapsulate the business logic of an
application. EJB technology enables rapid and simplified
development of distributed, transactional, secure and portable
applications based on Java technology. Java Persistence API (JPA):
a framework that allows developers to manage data using
object-relational mapping (ORM) in applications built on the Java
Platform.
6.13 JavaScript and Ajax Development:JavaScript is an
object-oriented scripting language primarily used in client-side
interfaces for web applications. Ajax (Asynchronous JavaScript and
XML) is a Web 2.0 technique that allows changes to occur in a web
page without the need to perform a page refresh. JavaScript
toolkits can be leveraged to implement Ajax-enabled components and
functionality in web pages.
6.14 Web Server and Client: A web server is software that can process client requests and send responses back to the client. For example, Apache is one of the most widely used web servers. A web server runs on some physical machine and listens for client requests on a specific port. A web client is software that helps in communicating with the server. Some of the most widely used web clients are Firefox, Google Chrome, Safari etc. When we request something from the server (through a URL), the web client takes care of creating the request, sending it to the server, and then parsing the server's response and presenting it to the user.
6.15 HTML and HTTP: The web server and web client are two separate pieces of software, so there should be some common language for communication. HTML, which stands for Hypertext Markup Language, is the common language between server and client. The server and client also need a common communication protocol; HTTP (HyperText Transfer Protocol) is that protocol, and it runs on top of the TCP/IP communication protocol. Some of the important parts of an HTTP request are: HTTP Method: the action to be performed, usually GET, POST, PUT etc. URL: the page to access. Form Parameters: similar to arguments in a Java method, for example user and password details from a login page. Sample HTTP Request:
GET /FirstServletProject/jsps/hello.jsp HTTP/1.1
Host: localhost:8080
Cache-Control: no-cache
Some of the important parts of an HTTP response are: Status Code: an integer indicating whether the request was successful or not. Some well-known status codes are 200 for success, 404 for Not Found and 403 for Access Forbidden. Content Type: text, html, image, pdf etc.; also known as the MIME type. Content: the actual data that is rendered by the client and shown to the user.
MIME Type or Content Type: If you look at the sample HTTP response header above, it contains the tag Content-Type. It's also called the MIME type, and the server sends it to the client to let it know the kind of data being sent. It helps the client render the data for the user. Some of the most used MIME types are text/html, text/xml, application/xml etc.
6.16 Understanding URL: URL is an acronym for Uniform Resource Locator, and it is used to locate the server and resource. Every resource on the web has its own unique address. Let's look at the parts of a URL with an example.
http://localhost:8080/FirstServletProject/jsps/hello.jsp
http:// This is the first part of the URL and provides the communication protocol to be used in server-client communication.
localhost The unique address of the server; most of the time it is a hostname that maps to a unique IP address. Sometimes multiple hostnames point to the same IP address, and the web server's virtual host takes care of sending the request to the particular server instance.
8080 This is the port on which the server is listening; it is optional, and if we don't provide it in the URL then the request goes to the default port of the protocol. Port numbers 0 to 1023 are reserved for well known services, for example 80 for HTTP, 443 for HTTPS, 21 for FTP etc.
FirstServletProject/jsps/hello.jsp The resource requested from the server. It can be a static html or pdf file, a JSP, a servlet, PHP etc.
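These parts can also be pulled apart programmatically; here is a minimal sketch using java.net.URL on the example URL above.

import java.net.URL;

// Decompose the example URL from above into its parts using java.net.URL.
public class UrlParts {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/FirstServletProject/jsps/hello.jsp");
        System.out.println("protocol: " + url.getProtocol()); // http
        System.out.println("host:     " + url.getHost());     // localhost
        System.out.println("port:     " + url.getPort());     // 8080 (-1 if absent)
        System.out.println("path:     " + url.getPath());     // /FirstServletProject/jsps/hello.jsp
    }
}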
6.17 Why do we need Servlets and JSPs? Web servers are good for static content such as HTML pages, but they don't know how to generate dynamic content or how to save data into databases, so we need another tool that we can use to generate dynamic content. There are several programming languages for dynamic content, like PHP, Python, Ruby on Rails, Java Servlets and JSPs. Java Servlets and JSPs are server-side technologies that extend the capability of web servers by providing support for dynamic responses and data persistence.
6.18 Web Container: Tomcat is a web container. When a request is made from the client to the web server, the server passes the request to the web container, and it is the web container's job to find the correct resource (servlet or JSP) to handle the request and then use the response from the resource to generate the response and provide it to the web server.
The web server then sends the response back to the client. When the web container gets the request, if it is for a servlet, the container creates two objects, HttpServletRequest and HttpServletResponse. Then it finds the correct servlet based on the URL and creates a thread for the request. It then invokes the servlet's service() method, and based on the HTTP method, service() invokes the doGet() or doPost() method. The servlet method generates the dynamic page and writes it to the response. Once the servlet thread is complete, the container converts the response into an HTTP response and sends it back to the client; a minimal servlet sketch is given below.
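Here is that minimal servlet sketch: the container routes matching requests to service(), which dispatches GET requests to doGet(). The class name and URL mapping are our own illustrative choices.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal servlet: the container routes GET requests for /hello here,
// calling service(), which dispatches to doGet().
@WebServlet("/hello")
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");      // MIME type of the response
        try (PrintWriter out = response.getWriter()) {
            out.println("<html><body><h1>Hello from a servlet</h1></body></html>");
        }
    }
}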
Some of the important work done by the web container includes:
Communication Support: The container provides an easy way of communication between the web server and the servlets and JSPs. Because of the container, we don't need to build a server socket to listen for requests from the web server, parse the request, and generate the response. All these important and complex tasks are done by the container, and all we need to focus on is the business logic of our applications.
Lifecycle and Resource Management: The container takes care of managing the life cycle of a servlet. It takes care of loading servlets into memory, initializing them, invoking their methods and destroying them. The container also provides utilities such as JNDI for resource pooling and management.
Multithreading Support: The container creates a new thread for every request to the servlet, and the thread dies when the request is processed. So servlets are not initialized for each request, which saves time and memory.
JSP Support: JSPs don't look like normal Java classes, and the web container pro