Meandre: Semantic-Driven Data-Intensive Flows in the Clouds. Xavier Llora, Bernie Acs, Loretta Auvil, Boris Capitanu, Michael Welge, David Goldberg. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. {xllora, acs1, lauvil, capitanu, mwelge, deg}@illinois.edu. The SEASR project and its Meandre infrastructure are sponsored by The Andrew W. Mellon Foundation.

SEASR eScience 2008

Nov 29, 2014



Loretta Auvil

Presentation of Meandre: Semantic-Driven Data-Intensive Flows in the Clouds at eScience 2008 by Bernie Acs

Data-intensive flow computing allows efficient processing of large volumes of data that would otherwise be unapproachable. This paper introduces a new semantic-driven data-intensive flow infrastructure which: (1) provides a robust and transparent scalable solution from a laptop to large-scale clusters, (2) creates a unified solution for batch and interactive tasks in high-performance computing environments, and (3) encourages reusing and sharing components. Banking on virtualization and cloud computing techniques, the Meandre infrastructure is able to create and dispose of Meandre clusters on demand, transparently to the final user. This paper also presents a prototype of such a clustered infrastructure and some results obtained using it.
Transcript
  • 1. SEASR: Meandre: Semantic-Driven Data-Intensive Flows in the Clouds. Xavier Llora, Bernie Acs, Loretta Auvil, Boris Capitanu, Michael Welge, David Goldberg. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. {xllora, acs1, lauvil, capitanu, mwelge, deg}@illinois.edu. The SEASR project and its Meandre infrastructure are sponsored by The Andrew W. Mellon Foundation.
  • 2. SEASR: The Project. SEASR: Software Environment for the Advancement of Scholarly Research. Funded by the Andrew W. Mellon Foundation to answer the humanities community's call for a research and development environment capable of powering leading-edge digital humanities initiatives. Fosters collaboration by empowering scholars to share data and research processes with an infrastructure and framework designed to support reusable, repeatable, and scalable services and processes. Designed to enable developers to rapidly design, build, and share software applications that support research and collaboration, using modular components that can be assembled to create reusable data-flows. Project web site: http://seasr.org
  • 3. SEASR: The High-Altitude Picture
  • 4. SEASR: @ Work DISCUS
  • 5. SEASR: @ Work NEMA
  • 6. SEASR: @ Work NESTER
  • 7. SEASR: @ Work MONK
  • 8. SEASR: @ Work Evolution Highway
  • 9. SEASR: A Quick Overview. Addresses the challenges of transforming information into knowledge: constructs software bridges to migrate unstructured and semi-structured data into structured data and/or metadata to enable analysis and accessibility. Aims: make digital collections more useful and flexible; provide access to analytic processes and visualizations; enable easy mash-up with other web-based services (SOA).
  • 10. SEASR: Knowledge Discovery. Predictable process. The process: Selection, Preparation, Transform, Processing, Interpret.
  • 11. SEASR: Knowledge Discovery. Predictable process across domains. Domains: Literature, History, Music, Art, Science.
  • 12. SEASR: Knowledge Discovery. Predictable process across domains and digital collections. Collection types: Text, Multimedia, Data.
  • 13. SEASR: Design Goals. Transparency: from a single laptop to an HPC cluster; not bound to a particular computation fabric; allow heterogeneous development. Intuitive programming paradigm: modular components, flows, and reusable. Foster collaboration and sharing: open source; service-oriented architecture (SOA).
  • 14. Meandre: Infrastructure. SEASR/Meandre infrastructure: dataflow execution paradigm; semantic-web driven; web oriented; supports publishing services; modular components; encapsulation and execution mechanism; promotes reuse, sharing, and collaboration.
  • 15. Meandre: Data Driven Execution. Execution paradigms: conventional programs perform computational tasks by executing a sequence of instructions. Data driven execution revolves around the idea of applying transformation operations to a flow or stream of data when it is available. Dataflow approach: a component may have zero to many inputs; may have zero to many outputs; performs a logical operation when data is available.
  • 16. Meandre: Dataflow Example. Dataflow addition example: the logical operation "+" requires two inputs (Value1, Value2) and produces one output (Sum). When two inputs are available, the logical operation can be performed and Sum is output. When output is produced, the component resets its internal values and waits for two new input values to become available.
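The firing rule described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `AddComponent`, the port names `value1` and `value2`, and the `push` method are hypothetical and are not part of the Meandre API.

```python
class AddComponent:
    """Toy dataflow component (hypothetical): fires only when both inputs hold a value."""

    def __init__(self):
        # None marks an input port that has not received data yet.
        self._inputs = {"value1": None, "value2": None}

    def push(self, port, value):
        """Deliver a value to an input port; return the sum if the component fires."""
        if port not in self._inputs:
            raise KeyError(f"unknown input port: {port}")
        self._inputs[port] = value
        if any(v is None for v in self._inputs.values()):
            return None  # still waiting for the other input
        result = self._inputs["value1"] + self._inputs["value2"]
        # Reset internal values and wait for two new inputs.
        self._inputs = {name: None for name in self._inputs}
        return result


adder = AddComponent()
first = adder.push("value1", 2)    # only one input present, nothing fires
second = adder.push("value2", 3)   # both inputs present, the sum is produced
third = adder.push("value1", 10)   # state was reset, so the component waits again
```

The point of the sketch is that execution is triggered by data arrival rather than by a fixed instruction sequence: the caller never asks for the sum, it only delivers values.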
  • 17. Meandre: The Dataflow Component. Data dictates component execution semantics. [Diagram: a component P with inputs and outputs, paired with a descriptor in RDF of its behavior and the component implementation.]
  • 18. Meandre: Component Metadata. Describes a component. Separates: component semantics (black box); component implementation. Provides a unified framework: basic building blocks or units (components); complex tasks (flows); standardized metadata.
  • 19. Meandre: Semantic Web Concepts. Relies on the Resource Description Framework (RDF), which uses a simple notation to express graph relations, usually written as XML, to provide a set of conventions and a common means to exchange information. Provides a common framework to share and reuse data across application, enterprise, and community boundaries. Focuses on common formats for integration and combination of data drawn from diverse sources. Pays special attention to the language used for recording how the data relates to real-world objects. Allows navigation to sets of data resources that are semantically connected.
  • 20. Meandre: Metadata Ontologies. Meandre's metadata relies on three ontologies. The RDF ontology serves as a base for defining Meandre descriptors. The Dublin Core Elements ontology provides basic publishing and descriptive capabilities in the description of Meandre descriptors. The Meandre ontology describes a set of relationships that model valid components, as understood by the Meandre execution engine architecture.
  • 21. Meandre: Components in RDF. Callout on the prefix declarations: existing standards.
      @prefix meandre: .
      @prefix xsd: .
      @prefix dc: .
      @prefix rdfs: .
      @prefix rdf: .
      @prefix : .
      meandre:name "Limited iterations"^^xsd:string ;
      rdf:type meandre:executable_component ;
      dc:creator "Xavier Llora"^^xsd:string ;
      dc:date "2007-11-17T00:32:35"^^xsd:date ;
      dc:description "Allows only a limited number of iterations"^^xsd:string ;
      dc:format "java/class"^^xsd:string ;
      dc:rights "University of Illinois/NCSA Open Source License"^^xsd:string ;
      meandre:execution_context , , , ,
  • 22. Meandre: Component Types. Components are the basic building block of any computational task. There are two kinds of Meandre components. Executable components: perform computational tasks that require no human interaction during runtime; processes are initialized during flow startup and are fired in accordance with the policies defined for them. Control components: used to pause the dataflow during user interaction cycles; the WebUI may be an HTML form, applet, or other user interface.
  • 23. Meandre: Component Assemblies. Defined by connecting outputs from one component to the inputs of another. Cyclical connections are supported. Components may have: zero to many inputs; zero to many outputs; properties that control runtime behavior. Described using RDF: enables storage, reuse, and sharing, like components; allows discovery and dynamic execution.
  • 24. Meandre: Flow (Complex Tasks). A flow is a collection of connected components. [Diagram: components Read, Get, Merge, Do, and Show connected in a dataflow execution.]
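A flow as described above can be pictured as a directed graph whose edges join an output port of one component instance to an input port of another. The Python sketch below is a hypothetical data structure, not Meandre's own representation (which is RDF); the class `Flow`, the port names, and the wiring between Read, Get, Merge, Do, and Show are assumptions made for illustration, since the slide diagram did not survive in the transcript.

```python
from collections import namedtuple

# One edge of the flow graph: an output port wired to an input port.
Connector = namedtuple("Connector", ["source", "source_port", "target", "target_port"])


class Flow:
    """Hypothetical in-memory flow: named component instances plus connectors."""

    def __init__(self, name):
        self.name = name
        self.instances = {}   # instance name -> component name
        self.connectors = []  # list of Connector edges

    def add_instance(self, instance, component):
        self.instances[instance] = component

    def connect(self, source, source_port, target, target_port):
        for instance in (source, target):
            if instance not in self.instances:
                raise KeyError(f"unknown instance: {instance}")
        self.connectors.append(Connector(source, source_port, target, target_port))

    def successors(self, instance):
        """Instances that receive data from the given instance, sorted by name."""
        return sorted({c.target for c in self.connectors if c.source == instance})


flow = Flow("example")
for name in ("read", "get", "merge", "do", "show"):
    flow.add_instance(name, name.capitalize())

# Assumed wiring, chosen only to exercise the structure.
flow.connect("read", "out", "merge", "left")
flow.connect("get", "out", "merge", "right")
flow.connect("merge", "out", "do", "in")
flow.connect("do", "out", "show", "in")
flow.connect("do", "retry", "merge", "left")  # cyclical connections are allowed
```

Storing port-level connectors rather than plain instance-to-instance edges mirrors the connector metadata the deck lists for flows (data port source and target, instance source and target).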
  • 25. Meandre: Create, Publish, & Share. Components and flows have RDF descriptors: easily shared, fostering sharing and reuse; allow machines to read and interpret; independent of the implementations; combine different implementations and platforms. Components: Java, Python, Lisp, Web Services. Execution: on a laptop or a high-performance cluster. A location is an RDF descriptor of one to many components, one to many flows, and their implementations.
  • 26. Meandre: Repository & Locations. Each location represents a set of components/flows. Users can: combine different locations together; create components; assemble flows; share components and flows. Repositories help: administrate complex environments; organize components and flows.
  • 27. Meandre: Metadata Properties. Components and flows share properties such as component name, creator, creation date, description, tags, and rights. Component-specific metadata describes the component's behavior: its location, type of implementation, firing policy, runnable, format, resource location, and execution context. Flow-specific metadata describes the directed graph of components: component instances, connectors, connector instance data port source, connector instance data port target, connector instance source, connector instance target, and instance name.
  • 28. Meandre: Programming Paradigm. The programming paradigm creates complex tasks by linking together a set of specialized components. Meandre's publishing mechanism allows components developed by third parties to be assembled in a new flow. There are two ways to develop flows: Meandre's Workbench visual programming tool; Meandre's ZigZag scripting language.
  • 29. Meandre: Workbench Existing Flow. Panels shown: Components, Flows, Locations.
  • 30. Meandre: Workbench Create Flow. Panels shown: Components, Flows, Locations.
  • 31. Meandre: Workbench Create Flow. Drag and drop the selected component into the workspace. Panels shown: Components, Flows, Locations.
  • 32. Meandre: Workbench Create Flow. Properties for the selected component are exposed. Panels shown: Components, Flows, Locations.
  • 33. Meandre: Workbench Create Flow. Description for the selected component is exposed. Panels shown: Components, Flows, Locations.
  • 34. Meandre: Workbench Create Flow. Drag and drop another component into the workspace. Panels shown: Components, Flows, Locations.
  • 35. Meandre: Workbench Create Flow. Connect the output of the first component to the input of the second: clicking the first port to connect highlights it with a color change (red). Panels shown: Components, Flows, Locations.
  • 36. Meandre: Workbench Create Flow. Connect the output of the first component to the input of the second: clicking the port to connect to causes a line to be displayed as a visual indicator. Panels shown: Components, Flows, Locations.
  • 37. Meandre: Workbench Create Flow. Repeat drag and drop to complete the assembly. Panels shown: Components, Flows, Locations.
  • 38. Meandre: ZigZag Script Language. ZigZag is a simple language for describing data-intensive flows, modeled on Python for simplicity. ZigZag is a declarative language for expressing the directed graphs that describe flows. Command-line tools allow ZigZag files to be compiled and executed. A compiler is provided to transform a ZigZag program (.zz) into a Meandre archive unit (.mau). MAUs can then be executed by a Meandre engine.
  • 39. Meandre: ZigZag Script Language. A flow diagram as an example: the flow pushes two strings that get concatenated and printed to the console.
  • 40. Meandre: ZigZag Script Language. ZigZag code that represents the example flow. Callout "Repository Location" on the import statement: defines the logical repository location where components in this flow can be found, similar to defining a location for the workbench, which would then display the available components located there.
      #
      # Imports the three required components and creates the component aliases
      #
      import
      alias as PUSH
      alias as CONCAT
      alias as PRINT
      #
      # Creates four instances for the flow
      #
      push_hello, push_world, concat, print = PUSH(), PUSH(), CONCAT(), PRINT()
      #
      # Sets up the properties of the instances
      #
      push_hello.message, push_world.message = "Hello ", "world!"
      #
      # Describes the data-intensive flow
      #
      @phres, @pwres = push_hello(), push_world()
      @cres = concat( string_one: phres.string; string_two: pwres.string )
      print( object: cres.concatenated_string )
      #
  • 41. Meandre: ZigZag Script Language. ZigZag code that represents the example flow (same listing as slide 40). Callout "Alias" on the three alias statements (alias as PUSH, alias as CONCAT, alias as PRINT): assigns a logical name reference for each component, making subsequent program calls easier to read and write.
  • 42. Meandre: ZigZag Script Language. ZigZag code that represents the example flow (same listing as slide 40). Callout "Implementation Instances" on the line push_hello, push_world, concat, print = PUSH(), PUSH(), CONCAT(), PRINT(): creates instances of the components using the alias references, similar to dragging components onto the workbench canvas.
  • 43. Meandre: ZigZag Script Language. ZigZag code that represents the example flow (same listing as slide 40). Callout "Set the Property Values" on the line push_hello.message, push_world.message = "Hello ", "world!": defines the property values for components, which is similar to filling in values in the workbench's properties panel.
  • 44. Meandre: ZigZag Script Language. ZigZag code that represents the example flow (same listing as slide 40). Callout "Describe Connections" on the three flow statements (@phres, @pwres = push_hello(), push_world(); @cres = concat( string_one: phres.string; string_two: pwres.string ); print( object: cres.concatenated_string )): defines the connections or relationships between the components in this flow, which is similar to drawing connection lines on the workbench canvas.
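For readers unfamiliar with ZigZag, the same wiring can be mirrored in plain Python. This is a rough analogue under stated assumptions, not how Meandre executes the flow: the functions `push`, `concat`, and `print_object` are hypothetical stand-ins for the PUSH, CONCAT, and PRINT components, each returning a dictionary keyed by output port name, and the two string values assume the quoting lost in the transcript was "Hello " and "world!".

```python
def push(message):
    """Stand-in for PUSH: emits its 'message' property on output port 'string'."""
    return {"string": message}


def concat(string_one, string_two):
    """Stand-in for CONCAT: joins its two inputs on port 'concatenated_string'."""
    return {"concatenated_string": string_one + string_two}


def print_object(obj, sink):
    """Stand-in for PRINT: records the object in 'sink' and echoes it to the console."""
    sink.append(str(obj))
    print(obj)


# Instances and properties (push_hello.message, push_world.message).
phres = push("Hello ")
pwres = push("world!")

# Connections: named outputs of one instance feed named inputs of the next.
cres = concat(string_one=phres["string"], string_two=pwres["string"])

console = []
print_object(cres["concatenated_string"], console)
```

The ZigZag script only declares this graph; in Meandre the engine decides when each component fires, whereas the Python version simply calls the functions in order.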
  • 45. Meandre: ZigZag Script Language. Automatic parallelization: multiple instances of a component can be run in parallel to boost throughput. A specialized operator is available in ZigZag scripting to cause multiple instances of a given component to be used. Consider the simple flow example shown in the diagram. The dataflow declaration would look like:
      #
      # Describes the data-intensive flow
      #
      @pu = push()
      @pt = pass( string:pu.string )
      print( object:pt.string )
  • 46. Meandre: ZigZag Script Language. Automatic parallelization: adding the operator [+AUTO] to the middle component.
      # Describes the data-intensive flow
      #
      @pu = push()
      @pt = pass( string:pu.string ) [+AUTO]
      print( object:pt.string )
    [+AUTO] tells the ZigZag compiler to parallelize the pass component instance by the number of cores available on the system. [+AUTO] may also be written [+N], where N is a numeric value, for example [+10].
  • 47. Meandre: ZigZag Script Language. Automatic parallelization: adding the operator [+4] would result in a directed graph.
      # Describes the data-intensive flow
      #
      @pu = push()
      @pt = pass( string:pu.string ) [+4]
      print( object:pt.string )
  • 48. Meandre: ZigZag Script Language. Automatic parallelization: ZigZag has created 4 parallel instances of the component. It has also introduced a mapper instance that is in charge of distributing the incoming data to each of the parallel instances. This is called unordered parallelization, since data may arrive at the print instance out of the original order in which they were generated by the push component instance.
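Unordered parallelization can be illustrated with Python's standard thread pool. This is a sketch of the idea only, not the code ZigZag generates: the pool plays the role of the mapper plus the four parallel instances, and `pass_through` and `run_unordered` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def pass_through(item):
    """Stand-in for the 'pass' component: forwards its input unchanged."""
    return item


def run_unordered(items, workers=4):
    """Fan items out to parallel workers and collect results as they finish."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Mapper role: hand each incoming item to one of the parallel workers.
        futures = [pool.submit(pass_through, item) for item in items]
        # Results are taken in completion order, which may differ from input order.
        for future in as_completed(futures):
            results.append(future.result())
    return results


lines = [f"line {i}" for i in range(20)]
out = run_unordered(lines)
```

Every input line appears exactly once in `out`, but nothing guarantees that its position matches the position in `lines`.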
  • 49. Meandre: ZigZag Script Language. Automatic parallelization: the operator [+AUTO] can be told to maintain data order with "!".
      # Describes the data-intensive flow
      #
      @pu = push()
      @pt = pass( string:pu.string ) [+AUTO!]
      print( object:pt.string )
    [+AUTO!] tells the ZigZag compiler to parallelize the pass component instance by the number of cores available on the system and to maintain the order of the data passing through.
  • 50. Meandre: ZigZag Script Language. Automatic parallelization: ZigZag has created 4 parallel instances of the component. It has also introduced a mapper instance that is in charge of distributing the incoming data to each of the parallel instances. It has also introduced a reducer instance that is in charge of collecting the data from the parallel instances and restoring its original order.
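One common way to keep order across parallel workers is to tag each item with a sequence number on the way in and to release results strictly by sequence number on the way out. The Python sketch below shows that technique; whether Meandre's mapper and reducer work this way internally is an assumption, and `pass_through` and `run_ordered` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def pass_through(item):
    """Stand-in for the 'pass' component: forwards its input unchanged."""
    return item


def run_ordered(items, workers=4):
    """Process items in parallel but emit results in the original input order."""
    pending = {}   # sequence number -> result that finished early
    next_seq = 0   # next sequence number the reducer may release
    ordered = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Mapper role: tag every item with its position before fanning out.
        futures = {pool.submit(pass_through, item): seq
                   for seq, item in enumerate(items)}
        for future in as_completed(futures):
            pending[futures[future]] = future.result()
            # Reducer role: release results only once all earlier ones are out.
            while next_seq in pending:
                ordered.append(pending.pop(next_seq))
                next_seq += 1
    return ordered


lines = [f"line {i}" for i in range(20)]
out = run_ordered(lines)
```

The cost of ordering is the buffer: a slow item holds back every later result until it completes, which is why the unordered variant can give better throughput when order does not matter.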
  • 51. Meandre: Flows to MAU. Flows can be executed using their RDF descriptors. Flows can be compiled into a MAU. A MAU is: a self-contained representation; ready for execution; portable; the base of flow execution in grid environments.
  • 52. Meandre: The Architecture. The design of the Meandre architecture follows three directives: provide a robust and transparent scalable solution from a laptop to large-scale clusters; create a unified solution for batch and interactive tasks; encourage reusing and sharing components. To meet these goals, the architecture relies on four stacked layers and builds on top of service-oriented architectures (SOA).
  • 53. Meandre: Basic Single Server
  • 54. Meandre MDX: Cloud Computing. Servers can be instantiated on demand and disposed of when done or on demand. A cluster is formed by at least one server. The Meandre Distributed Exchange (MDX) orchestrates operational integrity by managing cluster configuration and membership using a shared database resource.
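The idea of tracking cluster membership through a shared database can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption made for illustration: the `servers` table, its columns, the host names, the port number, and the `join`, `leave`, and `members` helpers are hypothetical and do not describe MDX's actual schema (the prototype in this deck used MySQL).

```python
import sqlite3


def open_registry():
    """Create an in-memory stand-in for the shared membership database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE servers (host TEXT, port INTEGER, PRIMARY KEY (host, port))"
    )
    return db


def join(db, host, port):
    """A server registers itself when it is instantiated."""
    db.execute("INSERT OR IGNORE INTO servers VALUES (?, ?)", (host, port))
    db.commit()


def leave(db, host, port):
    """A server removes itself when it is disposed of."""
    db.execute("DELETE FROM servers WHERE host = ? AND port = ?", (host, port))
    db.commit()


def members(db):
    """Any server can read the current cluster membership."""
    return db.execute("SELECT host, port FROM servers ORDER BY host, port").fetchall()


registry = open_registry()
join(registry, "node-a", 8080)
join(registry, "node-b", 8080)
leave(registry, "node-b", 8080)
current = members(registry)
```

Because the membership list lives in one shared store, servers can come and go on demand without any of them needing a static list of peers.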
  • 55. Meandre MDX: The Picture. MDX Backbone.
  • 56. Meandre MDX: The Architecture. Virtualization infrastructure: provides uniform access to the underlying execution environment. It relies on virtualization of machines and the usage of Java for hardware abstraction. IO standardization: a unified layer provides access to shared data stores, distributed file systems, specialized metadata stores, and other service-oriented architecture gateways. Data-intensive flow infrastructure: provides the basic Meandre execution engine for data-intensive flows, component repositories and discovery mechanisms, extensible plugins, and web user interfaces (webUIs). Interaction layer: can provide self-contained applications via webUIs, create plugins for third-party services, interact with the embedding application that relies on the Meandre engine, or provide services to the cloud.
  • 57. Meandre MDX: The Experiment. Experimental prototype: designed and built to validate the viability of an MDX cluster, using VMware Server 2.0 on three identical hosts with Windows Server 2003, each equipped with: two quad-core 2.8 GHz Xeon processors; 1600 MHz front-side bus; 32 GB of RAM; 4 TB of RAID 5 disk.
  • 58. Meandre MDX: The Experiment. Experimental prototype: 8 virtual machine instances were created on each host, with: 32-bit Ubuntu 8.04 Linux; 3 GB of RAM dedicated to each instance; 1 physical processor core assigned to each VM. VM instances were equipped to run a Meandre MDX server using Sun's Java 1.5 JVM. A third physical host supports 2 virtual machine instances, with: 32-bit Ubuntu 8.04 Linux; 3 GB of RAM dedicated to each instance; 1 physical processor core assigned to each VM; a highly available MySQL database and an HTTP load-balancing facility.
  • 59. Meandre MDX: The Experiment. We conducted three different experiments. All three were based on the same flow shown earlier in the ZigZag example, with a single change to turn the single line of text into 250,000 lines of text for each iteration of the flow. The first test was designed to test the scalability of a single Meandre server. [Figure: concurrent flows running on a standalone engine, on a log/log scale; each iteration of the flow pushed 250,000 lines of text.]
  • 60. Meandre MDX: The Experiment. We conducted three different experiments. All three were based on the same flow shown earlier in the ZigZag example, with a single change to turn the single line of text into 250,000 lines of text for each iteration of the flow. The second experiment was run against a virtual Meandre cluster consisting of 16 Meandre servers. [Figure: concurrent flows running on a standalone engine, on a log/log scale; each iteration of the flow pushed 1 line of text.]
  • 61. Meandre MDX: The Experiment. We conducted three different experiments. All three were based on the same flow shown earlier in the ZigZag example, with a single change to turn the single line of text into 250,000 lines of text for each iteration of the flow. The third experiment was run against a virtual Meandre cluster consisting of 16 Meandre servers. [Figure: concurrent flows running on a standalone engine, on a log/log scale; each iteration of the flow pushed 250,000 lines of text.]
  • 62. Meandre MDX: The Experiment. We conducted three different experiments. The first test clearly shows that the average time per flow increased linearly with the number of concurrent flows. The next experiments clearly show that cluster throughput grows linearly with the number of Meandre servers available.
  • 63. Upcoming Events. SEASR 2009 workshop: the workshop is organized to provide expanded opportunities for learning, knowledge sharing, and support, and is intended to provide sufficient introduction and support so that teams can implement a study using SEASR. The workshop is intended for institutional teams of scholars from the humanities. The workshop will include communication and work from a team's home campus as well as face-to-face meetings on the University of Illinois campus.
  • 64. SEASR: Meandre: Semantic-Driven Data-Intensive Flows in the Clouds. Xavier Llora, Bernie Acs, Loretta Auvil, Boris Capitanu, Michael Welge, David Goldberg. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. {xllora, acs1, lauvil, capitanu, mwelge, deg}@illinois.edu