Jan 01, 2016
CHAPTER-1
INTRODUCTION
1.1 Database
A database is an organized collection of data. The data is typically
organized to model relevant aspects of reality (for example, the availability of
rooms in hotels), in a way that supports processes requiring this information
(for example, finding a hotel with vacancies). Traditional databases are
organized by fields, records, and files. A field is a single piece of information;
a record is one complete set of fields; and a file is a collection of records. For
example, a telephone book is analogous to a file. It contains a list of records,
each of which consists of three fields: name, address, and telephone number.
1.2 Database Management System
To access information from a database, we need a database management
system (DBMS). This is a collection of programs that enable us to enter,
organize, and select data in a database.
1.3 Data Warehouse
A data warehouse is a relational database that is designed for query and
analysis rather than for transaction processing. It usually contains historical
data derived from transaction data, but it can include data from other sources. It
separates analysis workload from transaction workload and enables an
organization to consolidate data from several sources. Its data structure is
designed so that complex queries over large amounts of data can be run
relatively quickly and easily.
1.4 Data Mining
Generally, data mining (sometimes called data or knowledge discovery)
is the process of analyzing data from different perspectives and summarizing it
into useful information: information that can be used to increase speed or cut
costs. Data mining software is one of a number of analytical tools for analyzing
data. It allows users to analyze data from many different dimensions or angles,
categorize it, and summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns among dozens of fields
in large relational databases.
Fig - 1.1 Data mining of finger print converted into digital data.
Fig - 1.2 Data mining from data warehouse.
Example :
One Midwest grocery chain used the data mining capacity of
Oracle software to analyze local buying patterns. They discovered that when
men bought diapers on Thursdays and Saturdays, they also tended to buy beer.
Further analysis showed that these shoppers typically did their weekly grocery
shopping on Saturdays. On Thursdays, however, they only bought a few items.
The retailer concluded that they purchased the beer to have it available for the
upcoming weekend. The grocery chain could use this newly discovered
information in various ways to increase revenue. For example, they could move
the beer display closer to the diaper display. And, they could make sure beer
and diapers were sold at full price on Thursdays.
1.5 Dataset
A dataset (or data set) is a collection of data, usually presented in tabular
form. Each column represents a particular variable. Each row corresponds to a
given member of the dataset in question. It lists values for each of the variables,
such as height and weight of an object. Each value is known as a datum. The
dataset may comprise data for one or more members, corresponding to the
number of rows.
1.6 Structured Query Language (SQL)
SQL, which is an abbreviation for Structured Query Language, is a
language to request data from a database, to add, update, or remove data within
a database, or to manipulate the metadata of the database.
SQL is a declarative language in which the expected result or operation
is given without the specific details about how to accomplish the task. The
steps required to execute SQL statements are handled transparently by the SQL
database. Sometimes SQL is characterized as non-procedural because
procedural languages generally require the details of the operations to be
specified, such as opening and closing tables, loading and searching indexes, or
flushing buffers and writing data to file systems. Therefore, SQL is considered
to be designed at a higher conceptual level of operation than procedural
languages because the lower level logical and physical operations aren't
specified and are determined by the SQL engine or server process that executes
it.
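The contrast between declarative and procedural access can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module and a hypothetical table named hotel; neither appears in the original text. The declarative query states only the desired result, while the procedural loop spells out the scan, the test, and the sort.

```python
import sqlite3

# Hypothetical example table: hotels and their free rooms.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotel (name TEXT, free_rooms INTEGER)")
conn.executemany(
    "INSERT INTO hotel VALUES (?, ?)",
    [("Alpha", 0), ("Beta", 3), ("Gamma", 1)],
)

# Declarative: state the result wanted; the engine decides how to find it.
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM hotel WHERE free_rooms > 0 ORDER BY name"
    )
]

# Procedural: spell out the scan, the test, and the sort by hand.
procedural = []
for name, free_rooms in conn.execute("SELECT name, free_rooms FROM hotel"):
    if free_rooms > 0:
        procedural.append(name)
procedural.sort()
```

Both lists hold the same hotels; only the first leaves the access path to the SQL engine.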
1.7 Vertical Aggregation
The essential idea is to allow relevant sites to be overlaid on top of each
other by the end user to create a complete view of the information they are
looking for. It arranges the data set from the database vertically, according to
the query (such as a GROUP BY clause in SQL). In relational database systems,
aggregations generally produce this vertical layout.
1.8 Horizontal Aggregation
Here we introduce a new class of aggregations that have similar behavior to
SQL standard aggregations, but which produce tables with a horizontal layout.
In contrast, we call standard SQL aggregations vertical aggregations since they
produce tables with a vertical layout. Horizontal aggregations just require a
small syntax extension to aggregate functions called in a SELECT statement.
Alternatively, horizontal aggregations can be used to generate SQL code from a
data mining tool to build data sets for data mining analysis. We start by
explaining how to automatically generate SQL code.
CHAPTER-2
Review Literature
2.1 Analysis of the literature
Preparing a data set for analysis is generally the most time consuming
task in a data mining project, requiring many complex SQL queries, joining
tables, and aggregating columns[1]. Existing SQL aggregations have
limitations to prepare data sets because they return one column per aggregated
group. In general, a significant manual effort is required to build data sets,
where a horizontal layout is required. The literature proposes simple, yet
powerful, methods to generate SQL code that returns aggregated columns in a
horizontal tabular layout, returning a set of numbers instead of one number per
row. This new class of functions is called horizontal aggregations[2].
Horizontal aggregations build data sets with a horizontal de-normalized layout
(e.g., point-dimension, observation-variable, instance-feature), which is the
standard layout required by most data mining algorithms. There are three
fundamental methods to evaluate horizontal aggregations: CASE: exploiting the
programming CASE construct; SPJ: based on standard relational algebra operators
(SPJ queries); PIVOT: using the PIVOT operator, which is offered by some DBMSs.
Experiments with large tables compare the proposed query evaluation methods. The
CASE method has similar speed to the PIVOT operator and it is much faster than
the SPJ method. In general, the CASE and PIVOT methods exhibit linear
scalability, whereas the SPJ method does not.
2.2 Explanation of F, FV , and FH Table
2.2.1 F(Original Table) :
This table contains the data to be aggregated, first vertically and then
horizontally. It can contain nulls but must not contain blob (data type)
data.
K D1 D2 A
1 3 X 9
2 2 Y 6
3 1 Y 10
4 1 Y 0
5 2 X 1
6 1 X null
7 3 X 8
8 2 X 7
Table 2.1 Original Data Table
2.2.2 FV (Vertical Aggregated Table) :
The essential idea is to allow relevant sites to be overlaid on top
of each other by the end user to create a complete view of the
information they are looking for. It arranges the data set from the
database vertically, according to the query (such as a GROUP BY clause
in SQL). In relational database systems, aggregations generally
produce this vertical layout.
D1 D2 A
1 X null
1 Y 10
2 X 8
2 Y 6
3 X 17
Table 2.2 Vertical Table
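The step from Table 2.1 to Table 2.2 can be reproduced with a minimal sketch. It assumes Python's built-in sqlite3 module (not used in the original work) and sum() as the aggregation; the function name vertical_aggregation is hypothetical.

```python
import sqlite3

# Table F from Table 2.1 (None stands for SQL null).
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def vertical_aggregation(rows):
    """Return FV: sum(A) grouped by D1 and D2, one row per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE F (K INTEGER, D1 INTEGER, D2 TEXT, A INTEGER)")
    conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", rows)
    return conn.execute(
        "SELECT D1, D2, sum(A) FROM F GROUP BY D1, D2 ORDER BY D1, D2"
    ).fetchall()


fv = vertical_aggregation(F_ROWS)
```

Each (D1, D2) group yields one row, and a group whose only value of A is null keeps a null sum, as in the first row of Table 2.2.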
2.2.3 FH (Horizontal Aggregated Table) :
Here we introduce a new class of aggregations that have similar
behavior to SQL standard aggregations, but which produce tables with a
horizontal layout. In contrast, we call standard SQL aggregations
vertical aggregations since they produce tables with a vertical layout.
Horizontal aggregations just require a small syntax extension to
aggregate functions called in a SELECT statement. Alternatively,
horizontal aggregations can be used to generate SQL code from a data
mining tool to build data sets for data mining analysis. We start by
explaining how to automatically generate SQL code.
D1 D2X D2Y
1 null 10
2 8 6
3 17 null
Table 2.3 Horizontal Table
Fig- 2.1 Main steps of methods based on F (un-optimized): Select Distinct
R1.....Rk; then SPJ (d left joins), CASE (d sum(case) terms), or PIVOT
(d pivoting values); then Compute FH.
Fig- 2.2 Main steps of methods based on FV (optimized): Compute FV; Select
Distinct R1.....Rk; then SPJ (d left joins), CASE (d sum(case) terms), or
PIVOT (d pivoting values); then Compute FH.
2.3 SPJ method
The SPJ method is interesting from a theoretical point of view because it
is based on relational operators only. The basic idea is to create one table with a
vertical aggregation for each result column, and then join all those tables to
produce FH. We aggregate from F into d projected tables with d
Select-Project-Join-Aggregation queries (selection, projection, join,
aggregation). Each table FI corresponds to one subgrouping combination and has
{L1, ..., Lj} as primary key and an
aggregation on A as the only nonkey column. It is necessary to introduce an
additional table F0, that will be outer joined with projected tables to get a
complete result set. We propose two basic substrategies to compute FH. The
first one directly aggregates from F. The second one computes the equivalent
vertical aggregation in a temporary table FV grouping by {L1, ..., Lj} and
{R1, ..., Rk}.
Then horizontal aggregations can be instead computed from FV, which is
a compressed version of F, since standard aggregations are distributive [9]. We
now introduce the indirect aggregation based on the intermediate table FV, that
will be used for both the SPJ and the CASE method. Let FV be a table
containing the vertical aggregation, based on {L1, ..., Lj} and {R1, ..., Rk}. Let
V() represent the corresponding vertical aggregation for H(). The statement to
compute FV gets a cube:
INSERT INTO FV
SELECT L1, ..., Lj, R1, ..., Rk, V(A)
FROM F
GROUP BY L1, ..., Lj, R1, ..., Rk;
Then each table FI aggregates only those rows that correspond to the Ith
unique combination of R1, . . ., Rk, given by the WHERE clause. A possible
optimization is synchronizing table scans to compute the d tables in one pass.
Finally, to get FH we need d left outer joins with the d + 1 tables so that all
individual aggregations are properly assembled as a set of d dimensions for
each group. Outer joins set result columns to null for missing combinations for
the given group. In general, nulls should be the default value for groups with
missing combinations. We believe it would be incorrect to set the result to zero
or some other number by default if there are no qualifying rows. Such an approach
should be considered on a per-case basis.
INSERT INTO FH
SELECT
F0.L1, F0.L2, . . . , F0.Lj,
F1.A, F2.A, . . . , Fd.A
FROM F0
LEFT OUTER JOIN F1
ON F0.L1 = F1.L1 and . . . and F0.Lj = F1.Lj
LEFT OUTER JOIN F2
ON F0.L1 = F2.L1 and . . . and F0.Lj = F2.Lj
. . .
LEFT OUTER JOIN Fd
ON F0.L1 = Fd.L1 and . . . and F0.Lj = Fd.Lj;
This statement may look complex, but it is easy to see that each left
outer join is based on the same columns L1, . . . , Lj. To avoid ambiguity in
column references, L1, . . . , Lj are qualified with F0. Result column I is
qualified with table FI . Since F0 has n rows each left outer join produces a
partial table with n rows and one additional column. Then at the end, FH will
have n rows and d aggregation columns. The statement above is equivalent to
an update-based strategy. Table FH can be initialized inserting n rows with key
L1, . . . , Lj and nulls on the d dimension aggregation columns. Then FH is
iteratively updated from FI joining on L1, . . . , Lj. This strategy basically incurs
twice the I/O, doing updates instead of insertions. Reordering the d projected tables
to join cannot accelerate processing because each partial table has n rows.
Another claim is that it is not possible to correctly compute horizontal
aggregations without using outer joins. In other words, natural joins would
produce an incomplete result set.
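The SPJ steps can be sketched on the data of Table 2.1. This is a minimal illustration, assuming Python's built-in sqlite3 module, sum() as V(), D1 as the single grouping column L1, and D2 as the single subgrouping column R1; the function name spj_horizontal is hypothetical and the code is not the original implementation.

```python
import sqlite3

# Table F from Table 2.1 (None stands for SQL null).
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def spj_horizontal(rows):
    """Build F0, F1..Fd and assemble FH with d left outer joins."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE F (K INTEGER, D1 INTEGER, D2 TEXT, A INTEGER)")
    conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", rows)

    # F0 holds every existing value of the grouping column (the result keys).
    conn.execute("CREATE TABLE F0 (D1 INTEGER PRIMARY KEY)")
    conn.execute("INSERT INTO F0 SELECT DISTINCT D1 FROM F")

    # Distinct subgrouping values define the d result columns.
    values = [r[0] for r in conn.execute("SELECT DISTINCT D2 FROM F ORDER BY D2")]

    select_cols, joins = ["F0.D1"], []
    for i, value in enumerate(values, start=1):
        # FI aggregates only the rows selected by its WHERE clause.
        conn.execute(f"CREATE TABLE F{i} (D1 INTEGER PRIMARY KEY, A INTEGER)")
        conn.execute(
            f"INSERT INTO F{i} SELECT D1, sum(A) FROM F WHERE D2 = ? GROUP BY D1",
            (value,),
        )
        select_cols.append(f"F{i}.A")
        joins.append(f"LEFT OUTER JOIN F{i} ON F0.D1 = F{i}.D1")

    sql = f"SELECT {', '.join(select_cols)} FROM F0 {' '.join(joins)} ORDER BY F0.D1"
    return values, conn.execute(sql).fetchall()


values, fh = spj_horizontal(F_ROWS)
```

The group D1 = 3 has no row in the table built for Y, so the left outer join fills that cell with null; an inner join would drop the group entirely, which is the incompleteness described above.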
2.4 Case Method
This method uses the “case” programming construct available in SQL.
The case statement returns a value selected from a set of values based on
boolean expressions. From a relational database theory point of view this is
equivalent to doing a simple projection/aggregation query where each nonkey
value is given by a function that returns a number based on some conjunction
of conditions. We propose two basic substrategies to compute FH. In a similar
manner to SPJ, the first one directly aggregates from F and the second one
computes the vertical aggregation in a temporary table FV and then horizontal
aggregations are indirectly computed from FV.
We now present the direct aggregation method. Horizontal aggregation
queries can be evaluated by directly aggregating from F and transposing rows
at the same time to produce FH. First, we need to get the unique combinations
of R1, ..., Rk that define the matching boolean expression for result
columns. The SQL code to compute horizontal aggregations directly from F is
as follows. Observe that V() is a standard (vertical) SQL aggregation that has a
“case” statement as argument. Horizontal aggregations need to set the result to
null when there are no qualifying rows for the specific horizontal group to be
consistent with the SPJ method and also with the extended relational model [4].
SELECT DISTINCT R1, ..., Rk
FROM F;
INSERT INTO FH
SELECT L1, ..., Lj
,V(CASE WHEN R1 = v11 and ... and Rk = vk1
THEN A ELSE null END)
..
,V(CASE WHEN R1 = v1d and ... and Rk = vkd
THEN A ELSE null END)
FROM F
GROUP BY L1, L2, ..., Lj;
This statement computes aggregations in only one scan on F. The main
difficulty is that there must be a feedback process to produce the “case”
boolean expressions. We now consider an optimized version using FV . Based
on FV , we need to transpose rows to get groups based on L1, . . . , Lj. Query
evaluation needs to combine the desired aggregation with “CASE” statements
for each distinct combination of values of R1, . . .,Rk. As explained above,
horizontal aggregations must set the result to null when there are no qualifying
rows for the specific horizontal group. The boolean expression for each case
statement has a conjunction of k equality comparisons. The following
statements compute FH:
SELECT DISTINCT R1,. . .,Rk
FROM FV ;
INSERT INTO FH
SELECT L1,..,Lj
,sum(CASE WHEN R1 = v11 and .. and Rk = vk1
THEN A ELSE null END)
......
,sum(CASE WHEN R1 = v1d and .. and Rk = vkd
THEN A ELSE null END)
FROM FV
GROUP BY L1, L2, . . . , Lj;
As can be seen, the code is similar to the code presented before, the main
difference being that we have a call to sum() in each term, which preserves
whatever values were previously computed by the vertical aggregation. It has
the disadvantage of using two tables instead of one as required by the direct
computation from F. For very large tables F, computing FV first may be more
efficient than computing directly from F.
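The feedback process of the CASE method (first read the distinct subgrouping values, then generate one sum(CASE ...) term per value) can be sketched as follows. This is a minimal illustration on the data of Table 2.1, assuming Python's built-in sqlite3 module, sum() as V(), and direct aggregation from F; the function name case_horizontal is hypothetical.

```python
import sqlite3

# Table F from Table 2.1 (None stands for SQL null).
F_ROWS = [
    (1, 3, "X", 9), (2, 2, "Y", 6), (3, 1, "Y", 10), (4, 1, "Y", 0),
    (5, 2, "X", 1), (6, 1, "X", None), (7, 3, "X", 8), (8, 2, "X", 7),
]


def case_horizontal(rows):
    """Compute FH in one scan of F using generated sum(CASE ...) terms."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE F (K INTEGER, D1 INTEGER, D2 TEXT, A INTEGER)")
    conn.executemany("INSERT INTO F VALUES (?, ?, ?, ?)", rows)

    # Feedback step: the distinct values decide how many terms are generated.
    values = [r[0] for r in conn.execute("SELECT DISTINCT D2 FROM F ORDER BY D2")]

    # One term per distinct value; the values are bound as query parameters.
    terms = ", ".join(
        "sum(CASE WHEN D2 = ? THEN A ELSE NULL END)" for _ in values
    )
    sql = f"SELECT D1, {terms} FROM F GROUP BY D1 ORDER BY D1"
    return values, conn.execute(sql, values).fetchall()


values, fh = case_horizontal(F_ROWS)
```

Because the ELSE branch yields null, a group with no qualifying rows ends up with a null cell rather than zero, matching Table 2.3 and the SPJ result.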
2.5 PIVOT Method
Here we use the PIVOT operator, which is a built-in operator in a
commercial DBMS. Since this operator can perform transposition it can help
in evaluating horizontal aggregations. The PIVOT method internally needs to
determine how many columns are needed to store the transposed table and it
can be combined with the GROUP BY clause. The basic syntax to exploit the
PIVOT operator to compute a horizontal aggregation assuming one BY column
for the right key columns (i.e., k = 1) is as follows:
SELECT DISTINCT R1
FROM F;
SELECT L1, L2, ..., Lj
,v1,v2,………vd
INTO Ft
FROM F
PIVOT(
V(A) FOR R1 in (v1,v2……..vd)
)AS P;
SELECT L1, L2………….,Lj
,V(v1), V(v2)………. V(vd)
INTO FH
FROM Ft
GROUP BY L1, L2………….,Lj;
This set of queries may be inefficient because Ft can be a large intermediate
table. We introduce the following optimized set of queries which reduces the size of the
intermediate table:
SELECT DISTINCT R1
FROM F; /* produces v1, . . . , vd */
SELECT
L1, L2, . . . ,Lj
,v1, v2, . . . , vd
INTO FH
FROM (
SELECT L1, L2, . . . ,Lj, R1, A
FROM F) Ft
PIVOT(
V (A) FOR R1 in (v1, v2, . . . , vd)
) AS P;
Notice that in the optimized query the nested query trims F from
columns that are not later needed. That is, the nested query projects only
those columns that will participate in FH. Also, the first and second
queries can be computed from FV .
CHAPTER-3
Problem Structure Analysis
3.1 Problem of literature
3.1.1 Problem 1 :
The number of columns may exceed the maximum number of columns allowed by
the DBMS[1]. That means reaching the maximum number of columns in one
table and reaching the maximum column name length when columns are
automatically named.
To elaborate on this, a horizontal aggregation can return a table that
goes beyond the maximum number of columns in the DBMS when the set of
columns {R1,. . .,Rk} has a large number of distinct combinations of values, or
when there are multiple horizontal aggregations in the same query.
3.1.2 Problem 2 :
It is impossible to aggregate when data fields hold images or files (such as
blob data). If image data were converted to a column or attribute name, the
name would exceed the column name length defined by the DBMS.
This issue arises when automatically generating unique column names. If there are
many subgrouping columns {R1, . . .,Rk} or columns are of string data types,
this may lead to generate very long column names, which may exceed DBMS
limits. However, these are not important limitations because if there are many
dimensions that is likely to correspond to a sparse matrix (having many zeroes
or nulls) on which it will be difficult or impossible to compute a data mining
model. On the other hand, the large column name length can be solved as
explained below.
The problem of d going beyond the maximum number of columns can
be solved by vertically partitioning FH so that each partition table does not
exceed the maximum number of columns allowed by the DBMS. Evidently,
each partition table must have {L1,. . . , Lj } as its primary key. Alternatively,
the column name length issue can be solved by generating column identifiers
with integers and creating a “dimension” description table that maps identifiers
to full descriptions, but the meaning of each dimension is lost. An alternative is
the use of abbreviations, which may require manual input.
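The idea of integer column identifiers with a separate “dimension” description table can be sketched as follows. This is a minimal illustration in Python; the function name make_dimension_names, the identifier prefix c, and the sample values are hypothetical and not taken from the original work.

```python
def make_dimension_names(combinations):
    """Map short identifiers c1, c2, ... to full dimension descriptions.

    Each combination is a tuple of values of the subgrouping columns
    R1, ..., Rk. The short identifier is used as the column name, and the
    returned mapping plays the role of the "dimension" description table.
    """
    description = {}
    for i, combo in enumerate(combinations, start=1):
        description[f"c{i}"] = "_".join(str(value) for value in combo)
    return description


# Hypothetical subgrouping combinations with long string values.
names = make_dimension_names([
    ("X", "a_very_long_category_name"),
    ("Y", "another_long_category_name"),
])
```

The column names stay short whatever the length of the values, at the cost of one lookup to recover the meaning of each dimension.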
3.2 Introduction to Split-SPJ
When the number of columns exceeds the number allowed by the DBMS, the SPJ
method cannot be applied. The Split-SPJ method instead creates another table
whenever the DBMS column limit is exceeded. As long as the column limit is not
exceeded, Split-SPJ retains all properties of SPJ.
Column limit of different Database System :
Database                Maximum Permitted Column
Microsoft Access        255
Microsoft SQL Server    1024
MySql                   4096
Oracle                  Default 1000, but it can be increased by command.
Table 3.1 Different database permitted column
As the table shows, the lowest column limit is 255 (Microsoft
Access). So we choose 255 as the splitting point, applied sequentially.
Example :
If the attributes of a table are :
ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, . . . , VA255, VA256, VA257, . . . ,
VA270, VA271, VA272, VA273
(It is impossible to aggregate with the SPJ method)
The output of the Split-SPJ method :
Table-1
ID, VA1, VA2, VA3, VA4, VA5, VA6, VA7, . . . , VA255
Table-2
ID, VA256, VA257, . . . , VA270, VA271, VA272, VA273
CHAPTER-4
Experimental Description
4.1 Experimental data of this system
For the simulation we use data modelled on a popular present-day social
networking system. A large number of pictures are handled in Facebook within a
few seconds. We consider four users, named user1, user2, user3 and user4, and
25 pictures, named pic1 to pic25. Any user can comment on any picture at
random, with a comment of any character length. We use the horizontal
aggregation concept to find the total number of characters each user has
commented on each picture. If a user has not commented on a picture, the
field is NULL.
Following the process of the previous literature, each user is aggregated
with each picture. For the simulation we assume that the column limit of the
database is 20. So the columns are broken at 20, and the next 5 columns
create a new table. This was not possible in the previous work.
The time complexity of the proposed system is the same as that of the
previous SPJ method, but it is able to show the full horizontal aggregation. If
we index the columns by picture number, the character lengths commented by
all users for pic1 to pic20 are shown in the first table and those for the
last five pictures are shown in the next table.
Fig- 4.1 Experimental data (Original data table)
Fig- 4.2 Experimental data (Vertical table)
Fig- 4.3 Experimental data (Horizontal table)
4.2 Figure of Split-SPJ horizontal aggregation
Fig- 4.4 Split-SPJ horizontal aggregation
4.3 Comparison of SPJ with Split-SPJ
When aggregated column < 255
Fig 4.3.1 : SPJ curve when number of column is 100. (Horizontal axis: 0 to 100; vertical axis: Time (ms), 0 to 5.)
Fig 4.3.2 : Split-SPJ curve when number of column is 100. (Horizontal axis: 0 to 100; vertical axis: Time (ms), 0 to 5.)
When aggregated column > 255
Fig 4.3.3 : SPJ curve when number of column is 360. (Horizontal axis: No. of Column, 0 to 360, with 255 marked; vertical axis: Time (ms), 0 to 5.)
Fig 4.3.4 : Split-SPJ curve when number of column is 360. (Horizontal axis: No. of Column, 0 to 360, with 255 marked; vertical axis: Time (ms), 0 to 5.)
4.4 Code for the different methods
4.4.1 Code for vertical aggregation :
using System;
using System.Data;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class VerticalAggregationUI : Form
    {
        private DataManager dataManager = null;

        public VerticalAggregationUI()
        {
            InitializeComponent();
        }

        // Load the vertically aggregated table into the grid when the form opens.
        private void VerticalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            dgvVerticalAggregation.DataSource = dataManager.GetVerticalTable();
        }
    }
}

// Method of the DataManager class (called above as dataManager.GetVerticalTable()).
// Returns the vertical aggregation: total comment characters per user and picture.
public DataTable GetVerticalTable()
{
    dataExecuteClass = new DataExecuteClass();
    string queryString =
        "SELECT facebook_id, image_name, sum(comments_char) as [SUM] " +
        "from stdinfo group by facebook_id, image_name " +
        "order by facebook_id, image_name;";
    // Exceptions propagate to the caller with their original stack trace.
    dataSet = dataExecuteClass.getDataSet(queryString);
    return dataSet.Tables[0];
}
4.4.2 Code for horizontal aggregation :
using System;
using System.Data;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class HorizontalAggregationUI : Form
    {
        private DataManager dataManager = null;

        public HorizontalAggregationUI()
        {
            InitializeComponent();
        }

        // Load the horizontally aggregated table into the grid when the form opens.
        private void HorizontalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            dgvHA.DataSource = dataManager.GetHorizontalTable();
        }
    }
}

// Method of the DataManager class (called above as dataManager.GetHorizontalTable()).
// Reads the stored horizontal table, ordered by user.
public DataTable GetHorizontalTable()
{
    dataExecuteClass = new DataExecuteClass();
    string queryString = "SELECT * from horizontal order by facebook_id;";
    // Exceptions propagate to the caller with their original stack trace.
    dataSet = dataExecuteClass.getDataSet(queryString);
    return dataSet.Tables[0];
}
4.4.3 Main steps of Split-SPJ method based on FV :
Fig- 4.5 Main steps of Split-SPJ method based on FV: Compute FV; Select
Distinct R1.....Rk; Split-SPJ (d left joins); Compute FH.
From the experimental data table, the vertical aggregated table is produced
first and then the horizontal aggregated table. Data can be null but not
blob (such as image, file etc).
4.4.4 The Split-SPJ Algorithm (Proposed Algorithm):
Algorithm 4.1 : Split-SPJ (D, DV, DH, TRV, TCH, TEMP)
Let D be the experimental data table; from it the vertical aggregated table DV
is produced, and then the horizontal aggregated table DH. Data can be null but
not blob (such as image, file etc). The variables TRV and TCH denote,
respectively, the total rows of DV and the total columns of DH; TEMP holds an
intermediate query result.
1. [Create vertical aggregated table from experimental table.]
TEMP =: SELECT(D).
2. [Assigning vertical data.]
DV =: TEMP.
3. [Create horizontal aggregated table from vertical aggregated table.]
TEMP =: SELECT(DV).
4. [Assigning horizontal data.]
DH =: TEMP.
5. [Count column of horizontal data table.]
COUNTER =: COUNT(DH).
6. [Check condition.]
If COUNTER > 255 then :
Create table using 255 column.
COUNTER =: COUNTER – 255.
GoTo step 6.
Else :
Create table using total column.
End If
7. Exit.
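Steps 5 and 6 of the algorithm can be sketched as a short function. This is a minimal illustration in Python rather than the original implementation; the function name split_columns is hypothetical, and it assumes that the key column is repeated in every output table and is not counted against the limit, as in the Table-1 and Table-2 example of Section 3.2.

```python
def split_columns(key, columns, limit=255):
    """Split value columns into tables of at most `limit` columns each.

    Mirrors step 6 of Algorithm 4.1: while more than `limit` columns
    remain, emit a table with `limit` columns; then emit the remainder.
    """
    tables = []
    start = 0
    counter = len(columns)              # step 5: count the columns of DH
    while counter > limit:              # step 6: If COUNTER > limit
        tables.append([key] + columns[start:start + limit])
        start += limit
        counter -= limit
    tables.append([key] + columns[start:])   # Else: use the remaining columns
    return tables


# The example of Section 3.2: ID plus VA1 .. VA273 with a limit of 255.
tables = split_columns("ID", [f"VA{i}" for i in range(1, 274)])

# The simulated experiment: 25 picture columns with an assumed limit of 20.
small = split_columns("user", [f"pic{i}" for i in range(1, 26)], limit=20)
```

With 273 value columns the sketch yields two tables, the first ending at VA255 and the second running from VA256 to VA273; with 25 columns and a limit of 20 it yields tables of 20 and 5 value columns.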
4.4.5 Code for Split-SPJ horizontal aggregation :
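For Oracle (schematic: a subquery cannot serve as a select list, so the column
names from step 1 must be substituted into step 2, for example by dynamic SQL) :
-- Step 1: names of the first 255 columns of the table
SELECT column_name FROM user_tab_columns
WHERE table_name like 'table_name' and column_id <= 255
ORDER BY column_id;
-- Step 2: select those columns from the SPJ join
SELECT <columns from step 1>
FROM F0
LEFT OUTER JOIN F1
ON F0.L1 = F1.L1 and . . . and F0.Lj = F1.Lj
LEFT OUTER JOIN F2
ON F0.L1 = F2.L1 and . . . and F0.Lj = F2.Lj
. . . . . . . . .
LEFT OUTER JOIN Fd
ON F0.L1 = Fd.L1 and . . . and F0.Lj = Fd.Lj;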
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Data.OleDb;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HorizontalAggregation.App_Code;

namespace HorizontalAggregation.UI
{
    public partial class ProposedHorizontalAggregationUI : Form
    {
        public ProposedHorizontalAggregationUI()
        {
            InitializeComponent();
        }

        private DataManager dataManager = null;
        private DataExecuteClass dataExecuteClass = null;
        private DataGridView dataGridView = null;
        private string[] attributeName = (new DataManager()).GetAllAttributeOfAtable("stdinfo");
        // Column names of the horizontal table; filled in by CrieateHorizantalAgreateTable().
        private string[] col = new string[20];

        private void ProposedHorizontalAggregationUI_Load(object sender, EventArgs e)
        {
            dataManager = new DataManager();
            int maxColLength = int.Parse(dataManager.GetMaxColumnLength());
            if (maxColLength == 0)
            {
                // A limit of 0 is treated as "no limit": show the whole table in one grid.
                dataGridView = new DataGridView();
                dataGridView.Dock = DockStyle.Top;
                dataGridView.DataSource = CrieateHorizantalAgreateTable();
                this.Controls.Add(dataGridView);
            }
            else
            {
                // Split the horizontal table into grids of at most maxColLength columns each.
                int totalColumnLength = attributeName.Count() - 1;
                int fstSkipPoint = 0, lstSkipPoint = 0;
                int numOfDGV = (int)Math.Ceiling((float)totalColumnLength / maxColLength);
                for (int j = 0; j < numOfDGV; j++)
                {
                    // Columns fstSkipPoint..lstSkipPoint (1-based) are kept in grid j.
                    fstSkipPoint = j * maxColLength + 1;
                    lstSkipPoint = fstSkipPoint + maxColLength - 1;
                    DataTable dataTable = CrieateHorizantalAgreateTable();
                    for (int i = 1; i <= totalColumnLength; i++)
                    {
                        // Keep the columns of this grid and always keep the first column.
                        if ((i >= fstSkipPoint && i <= lstSkipPoint) || i == 1)
                        {
                            continue;
                        }
                        else
                        {
                            string column = col[i - 1];
                            dataTable.Columns.Remove(column);
                        }
                    }
                    dataGridView = new DataGridView();
                    dataGridView.DataSource = dataTable;
                    dataGridView.Dock = DockStyle.Top;
                    this.Controls.Add(dataGridView);
                }
            }
        }

        private DataTable CrieateHorizantalAgreateTable()
        {
            dataManager = new DataManager();
            dataExecuteClass = new DataExecuteClass();
            int i = 0;
            DataRow dr;
            string[] horizontalColumn = dataManager.SelectDistinctRowInaColumn("D2", "stdinfo");
            DataTable horizontalAggrigationTable = new DataTable();

            // Columns of the horizontal table
            string col1 = attributeName[1];
            col[0] = col1;
            string col2 = attributeName[2] + horizontalColumn[0];
            col[1] = col2;
            string col3 = attributeName[2] + horizontalColumn[1];
            col[2] = col3;
            horizontalAggrigationTable.Columns.Add(col1);
            horizontalAggrigationTable.Columns.Add(col2);
            horizontalAggrigationTable.Columns.Add(col3);

            // Create the rows of the horizontal table
            // Prepare 1st column: the distinct D1 values
            string[] data1 = dataManager.SelectDistinctRowInaColumn("D1", "stdinfo");
            string[] data2 = new string[data1.Count()];
            string[] data3 = new string[data1.Count()];

            // Prepare 2nd column: sum(A) per D1 where D2 = 'x'
            // ([SUM] is bracketed because SUM is also the name of the aggregate function)
            string query = "SELECT [SUM] FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo " +
                           "group by D1,D2 order by D1,D2) WHERE D2='x'";
            OleDbDataReader reader = dataExecuteClass.ExecuteReader(query);
            while (reader.Read())
            {
                data2[i] = reader["SUM"].ToString();
                i++;
            }

            // Prepare 3rd column: sum(A) per D1 where D2 = 'y'
            query = "SELECT [SUM] FROM (SELECT D1, D2, sum(A) as [SUM] from stdinfo " +
                    "group by D1,D2 order by D1,D2) WHERE D2='y'";
            reader = null; i = 0;
            reader = dataExecuteClass.ExecuteReader(query);
            while (reader.Read())
            {
                data3[i] = reader["SUM"].ToString();
                i++;
            }

            // Assemble one output row per distinct D1 value
            for (i = 0; i < data1.Count(); i++)
            {
                dr = horizontalAggrigationTable.NewRow();
                dr[col1] = data1[i];
                dr[col2] = data2[i];
                dr[col3] = data3[i];
                horizontalAggrigationTable.Rows.Add(dr);
            }
            return horizontalAggrigationTable;
        }
    }
}
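The column-splitting step of the listing above can also be looked at in isolation. The following is a minimal sketch, not the project's code: the class `ColumnSplitter` and its `Split` method are hypothetical names introduced only for illustration, and it assumes a simplified rule in which the limit counts aggregate columns only and the grouping column is repeated at the front of every partial table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, introduced only for this illustration.
public static class ColumnSplitter
{
    // Splits the aggregate columns into groups of at most maxColumns each.
    // The key column is repeated at the front of every group so that each
    // partial table can still be read on its own.
    public static List<List<string>> Split(string keyColumn,
                                           IReadOnlyList<string> aggregateColumns,
                                           int maxColumns)
    {
        if (maxColumns <= 0)
        {
            // Assumption: a non-positive limit means "no limit", so one group holds everything.
            var all = new List<string> { keyColumn };
            all.AddRange(aggregateColumns);
            return new List<List<string>> { all };
        }

        // Integer ceiling of aggregateColumns.Count / maxColumns.
        int groupCount = (aggregateColumns.Count + maxColumns - 1) / maxColumns;
        var groups = new List<List<string>>();
        for (int g = 0; g < groupCount; g++)
        {
            var group = new List<string> { keyColumn };
            group.AddRange(aggregateColumns.Skip(g * maxColumns).Take(maxColumns));
            groups.Add(group);
        }
        return groups;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up column names: five aggregate columns, limit of two per table.
        var columns = new[] { "Ax", "Ay", "Az", "Aw", "Av" };
        foreach (var group in ColumnSplitter.Split("D1", columns, 2))
        {
            Console.WriteLine(string.Join(", ", group));
        }
    }
}
```

With five aggregate columns and a limit of two, the sketch produces three groups: `D1, Ax, Ay`, then `D1, Az, Aw`, then `D1, Av`. Repeating the key column in every group is a design choice that lets the partial tables be joined back on that column later.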
CHAPTER-5
CONCLUSION AND FUTURE RESEARCH
5.1 Conclusion
We introduced a new method for extending aggregate functions, called
Split-SPJ horizontal aggregation, which helps prepare data sets for data mining.
Specifically, the method is useful for creating data sets with a horizontal
layout, as commonly required by data mining algorithms. A horizontal
aggregation returns a set of numbers, instead of a single number, for each group,
resembling a multidimensional vector. We proposed an abstract but minimal
extension to the standard SQL aggregate functions to compute horizontal
aggregations; the extension simply splits the result set whenever it reaches the
column limit of the underlying database. From a query optimization perspective,
we evaluated the aggregations with the SPJ (select-project-join) query
evaluation method.
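To make the idea of "one vector of numbers per group" concrete, the sketch below computes a horizontal aggregation over a small in-memory table F(D1, D2, A). It is an illustration under stated assumptions, not the Split-SPJ implementation: it uses LINQ instead of SQL, and the names `HorizontalAggregationSketch`, `Fact`, and `Aggregate`, as well as the sample rows, are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical names and data, introduced only for this illustration.
public static class HorizontalAggregationSketch
{
    // One input row of the vertical table F(D1, D2, A).
    public sealed class Fact
    {
        public string D1 { get; set; }
        public string D2 { get; set; }
        public double A { get; set; }
    }

    // Returns, for each D1 group, one sum of A per requested D2 value.
    // A missing (D1, D2) combination is reported as null, mirroring what a
    // LEFT OUTER JOIN against the list of groups would produce.
    public static Dictionary<string, double?[]> Aggregate(
        IEnumerable<Fact> rows, IReadOnlyList<string> d2Values)
    {
        return rows
            .GroupBy(f => f.D1)
            .ToDictionary(
                g => g.Key,
                g => d2Values
                    .Select(v =>
                    {
                        var matches = g.Where(f => f.D2 == v).ToList();
                        return matches.Count == 0 ? (double?)null : matches.Sum(f => f.A);
                    })
                    .ToArray());
    }

    public static void Main()
    {
        var rows = new List<Fact>
        {
            new Fact { D1 = "a", D2 = "x", A = 1 },
            new Fact { D1 = "a", D2 = "y", A = 2 },
            new Fact { D1 = "a", D2 = "x", A = 3 },
            new Fact { D1 = "b", D2 = "y", A = 5 },
        };
        var d2Values = new[] { "x", "y" };

        // Print one line per group, ordered by key for a stable display.
        foreach (var entry in Aggregate(rows, d2Values).OrderBy(e => e.Key))
        {
            var cells = entry.Value.Select(v => v.HasValue ? v.Value.ToString() : "NULL");
            Console.WriteLine(entry.Key + ": " + string.Join(", ", cells));
        }
    }
}
```

For the sample rows, group `a` yields the vector (4, 2) and group `b` yields (NULL, 5): each group is described by one value per distinct D2 value rather than by a single number.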
5.2 Future Research Work
It remains to be studied whether the Split-SPJ method of horizontal
aggregation can be applied to holistic functions (e.g., rank()). Optimizing a
workload of horizontal aggregation queries is another challenging problem.
If the length of an aggregate object exceeds the column length allowed by
the underlying database, an error occurs, which may be overcome by using an
alias method. In particular, aggregation becomes very complex when data
fields contain images or files (such as BLOB data).
~ 39 ~
REFERENCE
1. Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining
Analysis. [IEEE TRANSACTIONS ON KNOWLEDGE AND DATA
ENGINEERING, VOL. 24, NO. 4, APRIL 2012]
2. Vertical and Horizontal Percentage Aggregations. [Proc. ACM
SIGMOD Int’l Conf. Management of Data (SIGMOD ’04), pp. 866-871,
2004.]
3. Data Set Preprocessing and Transformation in a Database System.
[Intelligent Data Analysis, vol. 15, no. 4, pp. 613-631, 2011.]
4. Integrating K-Means Clustering with a Relational DBMS Using SQL.
[IEEE Trans. Knowledge and Data Eng., vol. 18, no. 2, pp. 188-201,
Feb. 2006.]
5. Data Cube A Relational Aggregation Operator [Proc. Int’l Conf. Data
Eng., pp. 152-159, 1996.]
6. Mining Low-Support Discriminative Patterns [IEEE TRANSACTIONS
ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 2,
FEBRUARY 2012]
7. Data Mining Techniques for Software Effort [IEEE TRANSACTIONS
ON SOFTWARE ENGINEERING, VOL. 38, NO. X, XXXXXXX
2012]
8. C. Galindo-Legaria and A. Rosenthal, “Outer Join Simplification and
Reordering for Query Optimization,” ACM Trans. Database Systems,
vol. 22, no. 1, pp. 43-73, 1997.
~ 40 ~
9. C. Ordonez, “Horizontal Aggregations for Building Tabular Data Sets,”
Proc. Ninth ACM SIGMOD Workshop Data Mining and Knowledge
Discovery (DMKD ’04), pp. 35-42, 2004.
10. H. Wang, C. Zaniolo, and C.R. Luo, “ATLAS: A Small But Complete
SQL Extension for Data Mining and Data Streams,” Proc. 29th Int’l
Conf. Very Large Data Bases (VLDB ’03), pp. 1113-1116, 2003.