Software Connector Classification and Selection for Data-Intensive Systems Chris A. Mattmann, David Woollard, Nenad Medvidovic, Reza Mahjourian 2nd Intl. Workshop on Incorporating COTS Software into Software Systems
Dec 20, 2015
Software Connector Classification and Selection for Data-Intensive
Systems
Chris A. Mattmann, David Woollard, Nenad Medvidovic, Reza Mahjourian
2nd Intl. Workshop on Incorporating COTS Software into Software Systems (IWICSS 2007)
Agenda
• Research Problem and Importance
• Our Approach
  – Classification
  – Selection
  – Analysis
• Evaluation
  – Precision, Recall, Accuracy Measurements
• Related Work
• Conclusion & Future Work
Research Problem and Importance
• Content repositories are growing rapidly in size
• At the same time, we expect more immediate dissemination of this data
• How do we distribute it…
  – In a performant manner?
  – Fulfilling system requirements?
[Figure: NASA Planetary Data System Archive Volume Growth. x-axis: Year (1990 to 2008); y-axis: TBytes, accumulated (0 to 90).]
Software Architecture
• The definition of a system in the form of its canonical building blocks
  – Software Components: the computational units in the system
  – Software Connectors: the communications and interactions between software components
  – Software Configurations: arrangements of components and connectors and the rules that guide their composition
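The three building blocks can be made concrete with a small data-structure sketch. This is only an illustration of the definitions above: the class names, the `attach` method, and the single composition rule it enforces are assumptions made for the example, not part of any architecture description language.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Component:
    """A computational unit in the system."""
    name: str


@dataclass(frozen=True)
class Connector:
    """A communication/interaction path between two components."""
    name: str
    source: Component
    target: Component


@dataclass
class Configuration:
    """An arrangement of components and connectors plus a composition rule."""
    components: Set[Component] = field(default_factory=set)
    connectors: List[Connector] = field(default_factory=list)

    def attach(self, connector: Connector) -> None:
        # Hypothetical composition rule: a connector may only join
        # components that already belong to the configuration.
        known = self.components
        if connector.source not in known or connector.target not in known:
            raise ValueError(f"{connector.name} references an unknown component")
        self.connectors.append(connector)


producer = Component("DataProducer")
consumer = Component("DataConsumer")
config = Configuration(components={producer, consumer})
config.attach(Connector("FTP", producer, consumer))
```

Treating the connector as a first-class object, rather than as a detail hidden inside a component, is what allows it to be described and compared independently of the components it joins.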
Data Distribution Systems
[Figure: a Data Producer sends data to several Data Consumers through an unspecified mechanism (???); the mechanism is modeled as a Connector between two Components.]
Insight: Use Software Connectors to model data distribution technologies
Data Movement Technologies
• Wide array of available OTS “large-scale” connector technologies
  – GridFTP, Aspera software, HTTP/REST, RMI, CORBA, SOAP, XML-RPC, Bittorrent, JXTA, UFTP, FTP, SFTP, SCP, Siena, GLIDE/PRISM-MW, and more
• Which one is the best one?
• How do we compare them
  – Given our current architecture?
  – Given our distribution scenarios & requirements?
Research Question
• What types of software connectors are best suited for delivering vast amounts of data to users in these hugely distributed data systems, in a manner that satisfies the users' particular scenarios and is both performant and scalable?
Data Distribution Problem Space
Broad variety of distribution connector families
• P2P, Grid, Client/Server, and Event-based
• Though each connector family varies slightly in some form or fashion
  – They all share 3 common atomic connector constituents
    • Data Access, Stream, Distributor
    • Adapted from Mehta et al.’s Connector Taxonomy
Connector Tradeoff Space
• Surveyed properties of 13 representative distribution connectors across all 4 distribution connector families, and classified them
  – Client/Server
    • SOAP, RMI, CORBA, HTTP/REST, FTP, UFTP, SCP, Commercial UDP Technology
  – Peer to Peer
    • Bittorrent
  – Grid
    • GridFTP, bbFTP
  – Event-based
    • GLIDE, Siena
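The family grouping above maps naturally onto a lookup table. The sketch below simply restates the slide's classification as a Python dictionary and checks that it accounts for all 13 surveyed connectors; the names `CONNECTOR_FAMILIES` and `family_of` are chosen for this example only.

```python
from typing import Dict, List

# The 13 surveyed connectors, grouped by family as listed on the slide.
CONNECTOR_FAMILIES: Dict[str, List[str]] = {
    "Client/Server": [
        "SOAP", "RMI", "CORBA", "HTTP/REST",
        "FTP", "UFTP", "SCP", "Commercial UDP Technology",
    ],
    "Peer to Peer": ["Bittorrent"],
    "Grid": ["GridFTP", "bbFTP"],
    "Event-based": ["GLIDE", "Siena"],
}


def family_of(connector: str) -> str:
    """Return the family a connector was classified under."""
    for family, members in CONNECTOR_FAMILIES.items():
        if connector in members:
            return family
    raise KeyError(f"{connector} is not one of the surveyed connectors")


total_connectors = sum(len(members) for members in CONNECTOR_FAMILIES.values())
```

Here `total_connectors` evaluates to 13 and `family_of("bbFTP")` returns `"Grid"`. The survey is heavily weighted toward Client/Server, which contributes 8 of the 13 connectors.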
Large Heterogeneity in Connector Properties
[Figure: Procedure Call Connector Breakdown (5 connectors, 2 families). Bar chart; y-axis: Num Connectors (0 to 6).
Dimensions: proc_call_params_return_value, proc_call_cardinality_senders, proc_call_invocation_explicit, proc_call_params_invocation_record, proc_call_params_datatransfer, proc_call_accessibility, proc_call_semantics.
Values shown: HTTP Response, RMI message, GridFTP message, SOAP message, CORBA message, one sender, Method Call, Globus Log Layer, HTTP Server log, RMI Registry, CORBA Name Registry, Web Server, value, reference, public, protected, private, one receiver, keyword.]
[Figure: Data Access Connector Breakdown (8 connectors, 4 families). Bar chart; y-axis: Num Connectors (0 to 9).
Dimensions: data_access_locality, data_access_persistence, data_access_avail_transient, data_access_cardinality_receivers, data_access_accesses, data_access_cardinality_senders.
Values shown: Process, Global, Dynamic Data Exchange, Database Access, Repository Access, File I/O, Session-Based, Cache, Peer-Based, Many Receivers, One Receiver, Accessor, Mutator, Many Senders, One Sender.]
[Figure: Distributor Connector Breakdown (8 connectors, 4 families). Bar chart; y-axis: Num Connectors (0 to 9).
Dimensions: distributor_routing_membership, distributor_delivery_type, distributor_naming_type, distributor_naming_structures, distributor_routing_type, distributor_delivery_semantics, distributor_routing_path, distributor_delivery_mechanisms.
Values shown: ad-hoc, bounded, RMI Message, GridFTP Message, SOAP Message, Event, HTTP Message, Peer Pieces, registry-based, attribute-based, Hierarchical, Flat, content-based, tcp/ip, architecture configuration, tracker, Exactly Once, At least once, Best Effort, dynamic, cached, static, Unicast, Multicast, Broadcast.]
[Figure: Stream Connector Breakdown (8 connectors, 4 families). Bar chart; y-axis: Num Connectors (0 to 9).
Dimensions: stream_formats, stream_cardinality_senders, stream_localities, stream_deliveries, stream_throughput, stream_cardinality_receivers, stream_state, stream_identity, stream_bounds, stream_synchronicity, stream_buffering.
Values shown: Raw, Structured, Many Senders, One Sender, Remote, Local, Exactly Once, At least once, Best Effort, bps, Many Receivers, One Receiver, Stateful, Stateless, Named, Bounded, Asynchronous, Time Out Synchronous, Buffered.]
How do experts make these decisions?
• Performed survey of 33 “experts”
• Experts defined to be
  – Practitioners in industry, building data-intensive systems
  – Researchers in data distribution
  – Admitted architects of data distribution technologies
• General consensus?
  – They don’t know the how and the why about which connector(s) are appropriate
  – They rely on anecdotal evidence and “intuition”
Percentage Breakdown of Expert Responses
[Figure: pie chart. Slice values, in extracted order: 67%, 15%, 15%, 3%. Legend, in extracted order: No Response, Not Comfortable, No Time, Full Response.]
Expert Survey Demographic
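[Figure: pie chart of survey demographics. Slice values, in extracted order: 6%, 18%, 12%, 12%, 6%, 22%, 6%, 12%, 6%. Legend, in extracted order: Cancer Research, Planetary Science, Earth Science, Industry, Grid Computing, Professors, Web Technologies, Open Source, Students.]
45% of respondents claimed to be uncomfortable being addressed as a data distribution expert.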
Our Approach: DISCO
• Develop a software framework for:
  – Connector Classification
    • Build metadata profiles of connector technologies, describing their intrinsic properties (DCPs)
  – Connector Selection
    • Adaptable, extensible algorithm development framework for selecting the “right” connectors (and identifying wrong ones)
  – Connector Selection Analysis
    • Measurement of accuracy of results
  – Connector Performance Analysis
DISCO in a Nutshell
Building DCPs of all 13 connectors (Classification)
• Rely on Mehta et al. metadata to describe data distribution connectors
• Carefully select metadata to include/exclude
Develop complementary selection algorithms
Preliminary Evaluation
• We developed 13 connector profiles
  – Based on literature, expert reviews, and our own development experience
• 30 distribution scenarios
• 24 score functions (white box) and Bayesian domain profiles with 100 conditional probabilities (black box)
[Figure: evaluation workflow. Labels: Connector Profiles, Distribution Scenarios, DISCO, Score, Bayesian, Clustering (two boxes), Answer Key, Precision-Recall Analysis.]
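One way to picture the white-box approach is as a set of small functions, each awarding points to a connector profile for one scenario requirement, with connectors ranked by their summed points. The sketch below assumes that reading. The connector names, profile attributes, scenario keys, and point values are all hypothetical; they are not DISCO's 24 score functions or its actual profile contents.

```python
from typing import Any, Callable, Dict, List, Tuple

Profile = Dict[str, str]
Scenario = Dict[str, Any]
ScoreFunction = Callable[[Profile, Scenario], float]

# Hypothetical, heavily simplified connector profiles.
PROFILES: Dict[str, Profile] = {
    "connector_a": {"delivery_semantics": "exactly once", "delivery_mechanism": "unicast"},
    "connector_b": {"delivery_semantics": "best effort", "delivery_mechanism": "multicast"},
    "connector_c": {"delivery_semantics": "at least once", "delivery_mechanism": "multicast"},
}


def score_reliability(profile: Profile, scenario: Scenario) -> float:
    """Reward stronger delivery guarantees when the scenario needs them."""
    if not scenario.get("needs_reliable_delivery"):
        return 0.0
    points = {"exactly once": 2.0, "at least once": 1.0, "best effort": -1.0}
    return points.get(profile["delivery_semantics"], 0.0)


def score_fanout(profile: Profile, scenario: Scenario) -> float:
    """Reward multicast when one producer feeds several consumers."""
    many_consumers = scenario.get("num_consumers", 1) > 1
    if many_consumers and profile["delivery_mechanism"] == "multicast":
        return 0.5
    return 0.0


SCORE_FUNCTIONS: List[ScoreFunction] = [score_reliability, score_fanout]


def rank_connectors(profiles: Dict[str, Profile], scenario: Scenario) -> List[Tuple[str, float]]:
    """Sum every score function per connector and sort best-first."""
    totals = {
        name: sum(fn(profile, scenario) for fn in SCORE_FUNCTIONS)
        for name, profile in profiles.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


scenario = {"needs_reliable_delivery": True, "num_consumers": 4}
ranking = rank_connectors(PROFILES, scenario)
```

For this scenario the totals are 2.0, 1.5, and -0.5, so the ranking is connector_a, connector_c, connector_b. The approach is "white box" in the sense that every point awarded can be traced back to a named, human-written rule.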
Precision-Recall Results
• Error Rate
  – Probability of incorrectly labeling a connector as appropriate for a scenario
• Precision
  – The fraction of selected connectors appropriate for a scenario
• Recall
  – Probability of detecting a connector as appropriate for a scenario

                        Bayesian    Score-based
True Positive (TP)        101          63
False Positive (FP)        25         200
True Negative (TN)        245          67
False Negative (FN)        19          60

                        Bayesian    Score-based
Error Rate               11.28%      32.56%
Precision                80.16%      48.46%
Recall                   25.90%      16.15%
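The slide gives no formulas, so the sketch below works backward from the Bayesian counts. It assumes error rate is (FP + FN) / N and precision is TP / (TP + FP), where N = 390 is the number of connector-scenario pairs (13 connectors × 30 scenarios). Under those assumptions the reported 25.90% recall corresponds to TP / N; the conventional recall, TP / (TP + FN), is computed alongside for comparison. The function name and dictionary keys are chosen for this example.

```python
from typing import Dict


def rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute selection-quality rates from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,        # share of wrong labels
        "precision": tp / (tp + fp),            # correct among selected
        "tp_share": tp / total,                 # matches the slide's "Recall"
        "recall_conventional": tp / (tp + fn),  # correct among appropriate
    }


# Bayesian column of the table above.
bayesian = rates(tp=101, fp=25, tn=245, fn=19)

for name, value in bayesian.items():
    print(f"{name}: {value:.2%}")
```

This yields an error rate of 11.28%, a precision of 80.16%, and a TP share of 25.90%, all matching the Bayesian column, while conventional recall comes to 84.17%. Applying the same formulas to the score-based column as printed does not reproduce its reported rates; they are reproduced when its FP and TN counts are read as 67 and 200 respectively.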
Related Work
Conclusions & Future Work
• Conclusions
  – Domain experts (gurus) rely on tacit knowledge and often cannot explain design rationale
  – DISCO provides a quantification of & framework for understanding an ad hoc process
  – Bayesian algorithm has a higher precision rate
• Future Work
  – Explore the tradeoffs between white-box and black-box approaches
  – Investigate the role of architectural mismatch in connectors for data system architectures
Thank You!
Questions?