BalanceLine4j Framework Overview
Revision: 01
Gilberto Augusto Holms
@gibaholms
http://gibaholms.wordpress.com/
Jun 29, 2015
About me...
Gilberto Augusto Holms
Java and SOA Architect
Expertise: Java, EAI, SOA, BPEL, BPM, Oracle Fusion Middleware
Interests: OpenSource, Artificial Intelligence, Innovation
Twitter: @gibaholms
Blog: http://gibaholms.wordpress.com/
SCJA, SCJP, SCWCD, SCBCD, SCDJWS, OCE WLP 10g
Balance Line Algorithm
What is “Balance Line”?
Balance Line is an algorithm: a computational technique for coordinating the processing of massive sequential data.
What is “Sequential Data”?
Sequential data are big data sets, from one or more data sources, that share a common key and are ordered by that key.
Why use it?
It improves processing performance
It saves computational resources
When to use it?
Data synchronization (like syncing an iPod)
Data loading (full or partial)
Data reconciliation
Case Study
Company “X” has in its database a big table containing the main information about all the banks and agencies in the country (number, address, contacts). Every day, the company receives from the Central Bank a huge text file containing the newest data about the agencies, in which the following conditions may occur:
Data update (changes to number, address, contacts and so on)
Agency no longer exists
New agency added
Our job is to develop software that keeps this table up to date by processing the file and synchronizing the record changes.
Dummy Solution
Begin
For each text file line:
    Check if the agency exists
    Exists? N: INSERT
    Exists? Y: Check if the agency changed data
        Data changed? Y: UPDATE
        Data changed? N: continue
    End of file? N: go to the next line
    End of file? Y: For each record that no longer exists, DELETE
End
Balance Line Algorithm Concepts
Master File: the main data set. It represents the final view of the data: the persistent copy, the reference, the origin.
Transaction File: the secondary data set. It represents the transactions made and contains the data that must be synchronized with the origin.
Key: a unique identifier that identifies one single record (it can be a single field, a combination of fields, a SHA-1 hash and so on).
[Diagram: one Master file and one or more Transaction files feed the BalanceLine process]
The big secret ...
SORTING BY KEY !
Balance Line Algorithm – Step by Step
1 – Identify one unique key
Master:      10, 5, 20, 17
Transaction: 3, 10, 18, 17
2 – Sort the data sources (ascending)
Master:      5, 10, 17, 20
Transaction: 3, 10, 17, 18
3 – Prepare two “pointers”
Master:      [5], 10, 17, 20   (pointer M starts at the first record)
Transaction: [3], 10, 17, 18   (pointer T starts at the first record)
4 – Begin key comparison
Master:      5, 10, 17, 20
Transaction: 3, 10, 17, 18
(KM = key at the Master pointer, KT = key at the Transaction pointer)

Step  KM   KT   Comparison   Action
1     5    3    KM > KT      INSERT, moves T
2     5    10   KM < KT      DELETE, moves M
3     10   10   KM = KT      UPDATE, moves M and T
4     17   17   KM = KT      UPDATE, moves M and T
5     20   18   KM > KT      INSERT, moves T
6     20   -    KM (no KT)   DELETE, moves M
5 – Final master file
Master: 3, 10, 17, 18
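The step-by-step walkthrough can be sketched in a few lines of Java. This is only an illustration of the comparison loop, not the BalanceLine4j API: the class name BalanceLineSketch and the method sync are hypothetical, keys are plain integers, and the operations are returned as strings instead of being applied to a real data source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the balance line loop; not the BalanceLine4j API. */
class BalanceLineSketch {

    /** Advances a pointer: returns the next key, or null when the source is exhausted. */
    private static Integer next(Iterator<Integer> source) {
        return source.hasNext() ? source.next() : null;
    }

    /** Walks two key lists, both sorted ascending, and returns the operations in order. */
    static List<String> sync(List<Integer> master, List<Integer> transaction) {
        List<String> ops = new ArrayList<>();
        Iterator<Integer> m = master.iterator();
        Iterator<Integer> t = transaction.iterator();
        Integer km = next(m); // key at pointer M
        Integer kt = next(t); // key at pointer T

        while (km != null || kt != null) {
            if (kt == null || (km != null && km < kt)) {
                // Master key has no matching transaction: the record no longer exists.
                ops.add("DELETE " + km);
                km = next(m);
            } else if (km == null || km > kt) {
                // Transaction key is missing from the master: it is a new record.
                ops.add("INSERT " + kt);
                kt = next(t);
            } else {
                // Same key on both sides: update and move both pointers.
                ops.add("UPDATE " + km);
                km = next(m);
                kt = next(t);
            }
        }
        return ops;
    }

    public static void main(String[] args) {
        // Same sorted keys as in the walkthrough.
        sync(Arrays.asList(5, 10, 17, 20), Arrays.asList(3, 10, 17, 18))
                .forEach(System.out::println);
    }
}
```

Each source is read exactly once and only the two current keys are held in memory, which is why both inputs must already be sorted by the key.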
BalanceLine4j Framework
Java implementation of the Balance Line algorithm
Focus on business rules and let the framework handle the algorithm
Provides an abstraction of Sequential Data Sources, which can be any sortable data set (Comparable<T>):
    Object Collections, Sets, Maps
    Text files (with a built-in text file sorter)
    Database ResultSets
    Custom (interface provided)
The algorithm runs by data streaming, with little memory consumption
Easy to use, simple API, no knowledge of the algorithm required
Easier to maintain and evolve, because it keeps the business rules isolated from the algorithm code
BalanceLine4j Framework – Additional Features
FileSorter.java
The framework provides a file sorter class that can safely sort large amounts of text data without running out of memory: it writes temporary chunks of data to the file system and then merge-sorts all the chunks.
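The chunk-and-merge idea can be sketched as follows. This is not the framework's FileSorter.java, whose actual API is not shown here; the class ExternalSortSketch, the method sort and the parameter maxLinesPerChunk are hypothetical names, and lines are compared as whole strings in their natural order.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical external merge sort sketch; not the framework's FileSorter.java. */
class ExternalSortSketch {

    /** One open chunk file plus the line it is currently positioned on. */
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }

        @Override
        public int compareTo(Cursor other) {
            return line.compareTo(other.line);
        }
    }

    static void sort(Path input, Path output, int maxLinesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        // Phase 1: read a bounded number of lines, sort them in memory, write a temp chunk.
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLinesPerChunk) {
                    chunks.add(writeChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeChunk(buffer));
            }
        }
        // Phase 2: k-way merge, always emitting the smallest current line among the chunks.
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                heap.add(new Cursor(Files.newBufferedReader(chunk, StandardCharsets.UTF_8)));
            }
            while (!heap.isEmpty()) {
                Cursor smallest = heap.poll();
                out.write(smallest.line);
                out.newLine();
                if (smallest.advance()) {
                    heap.add(smallest);
                } else {
                    smallest.reader.close();
                }
            }
        } finally {
            for (Cursor c : heap) {
                c.reader.close(); // only non-empty if the merge was interrupted
            }
            for (Path chunk : chunks) {
                Files.deleteIfExists(chunk);
            }
        }
    }

    /** Sorts one chunk in memory and writes it to a temporary file. */
    private static Path writeChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }
}
```

Memory use is bounded by maxLinesPerChunk during the split and by one line per chunk during the merge, regardless of the size of the input file.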
Back to Case Study
Master File: bank agencies database table (select * order by)
Transaction File: positional text file with the newest agency information (if not sorted, use the FileSorter class)
Key: string concatenation of bank number + agency number
Sync Mode: full (if the agency no longer exists, delete it)
Benchmark: Dummy Solution vs. Balance Line Solution
Dummy Solution
    1 random access for each transaction record
    33,218 lines x 1 query with “where” clause = 33,218 queries with “where” clause
    Same slow processing time in every sync
Balance Line Solution
    1 single sequential access
    1 query with “order by” clause
    Faster processing time in the first sync (70% up) and much faster in subsequent syncs (fewer changes = less processing time, because the keys advance faster)
BalanceLine4j Framework – Complementary Strategies
To further increase performance of the Balance Line processing algorithm, there are some complementary techniques that can be used:
Dump data from database to text, work at filesystem I/O level and then update the database (filesystem I/O is faster than networking I/O)
Sometimes using a hash code (MD5, SHA-1) to check whether a record has changed is faster than comparing field by field
Use a transaction code (insert, update, delete) to identify the type of transaction made per record in the transaction file
Buffer some records into memory to optimize the data streaming
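As an illustration of the hash code technique, the sketch below detects a changed record by comparing SHA-1 digests. It is not part of BalanceLine4j: the class RecordHashSketch and its methods are hypothetical, and it assumes the digest of each master record was stored beforehand (for example in an extra column), so only the incoming record has to be hashed.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of hash-based change detection; not part of BalanceLine4j. */
class RecordHashSketch {

    /** Returns the SHA-1 digest (20 bytes) of a record's raw text. */
    static byte[] sha1(String record) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(record.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    /** True when the incoming record's digest differs from the one stored for the master. */
    static boolean hasChanged(byte[] storedDigest, String incomingRecord) {
        return !MessageDigest.isEqual(storedDigest, sha1(incomingRecord));
    }
}
```

Whether this beats a field-by-field comparison depends on the record size and on whether the stored digest is already available; with a very wide record and a precomputed digest, one hash plus a 20-byte comparison replaces many field comparisons.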
Thanks!
More Information and Samples
Project Site: https://github.com/gibaholms/balanceline4j/
Author's Blog: http://gibaholms.wordpress.com/
Author's Twitter: @gibaholms