Parallel Distributed Processing
Distributed Execution Engines
Distributed execution engine is a software systems which runs on a cluster of networked computers.Each machine in these networked systems is capable of handling data-dependent control flow.Distributed executed engines are attractive because they shield developersfrom distributed and parallel computing’s challenging aspects such as task scheduling, data synchronization, transferring data and dealing with failures in networks.
Distributed Data-Parallel Patterns and Execution Engines:
The are many distributed data-parallel patterns which are identified recently and provide the opportunities to facilitate the data-intensive applications and their workflow,which can be executed with the split data in parallel on various distributed computing nodes. The main advantages of using these patterns are:
1. Support the distribution of data and parallel processing of data with distributed data on multiple nodes/cores.
2. Provide a higher-level programming model in order to facilitate the user program parallelization.
3. Follow a principle of ”moving computations to data” that reduces the data movements overheads.
4. Have a good scalability and efficiency in performance when executing on distributed resources .
5. Support run time features such as load balancing, fault-tolerance, etc.
6. Simplify the difficulties for parallel programming as compared to the traditional programming interfaces MPI and openMP.
There are many execution engines which can be implemented over the distributed data-parallel patterns. MadReduce is such kind of example that has different DDP execution engine implementations. A most known MapReduce execution engine is Hadoop.
Main Characteristics of Distributed Execution Engines:
Parallel Distributed Processing
The design purpose of Distributed Execution Engines is to consider the fundamental characteristics which are as follows;
4.Transparent Fault Tolerance
Types of Distributed Exection Engines:
There are different types of DEEs such as Dryad, Nephele, MapReduce, Hyracks,Pregel, CIEL, etc. Dryad,Nephele, Hyracks, MapReduce are all use the Directed Acyclic Graph (DAG) system and in Pregel the Bulk Synchronous Parallel (BSP) system is implemented.
Dryad Execution Engine:
The Dryad distributed computing engine is designed by Microsoft to support the dataparallel computing and used in many important applications such as data-mining applications, image and stream processing and scientific computations.Dryad data-parallel applications combines the computational ”vertices” with the communication ”channels” to form a data flow graph. A Dryad runs the data-parallel applications on a set of available computers by executing the vertices of this graph, communicating through files, TCP pipes, and shared-memory FIFOs.
A Dryad job systems is represented as Directed Acyclic Graph (DAG) where each vertex is a program and edges are denoted by channels, and there are many vertices in the graph while executing the cores in the computing cluster.
MapReduce Execution Engine:
MapReduce is a programming model and associated implementation for processing and generating large scale data sets. Users need to specify a map function that processes a record (key/value pair) to generate a set of intermediate record (key/value pair), and a reduce function that merges all intermediate values associated with the same intermediate key.
The implementation of MapReduce runs on a large cluster of commodity machines and that is highly scalable such as a typical MapReduce computation processes of many terabytes of data on thousands of computers.
Hyracks Execution Engine:
Hyrack is a new software platform which is flexible, extensible, partition-parallel framework designed to run efficient data-intensive computations on large shared nothing clusters of commodity computers. Hyrack computational model is represented as DAG (Directed Acyclic Graph) of data operators and connectors. Operators run on partitions of input data and produce partitions of output data while connectors
repartitions the operators’ outputs to make the newly produced partitions available at the consuming operators. Hyracks has the significant promise as a next-generation platform for data-intensive applications.
Pregel Execution Engine:
There are many practical problems that mainly concern with large graph. The large scale of these graphs (billions of vertices and edges) poses challenges for their efficient processing. In this case, there is a need for a computational model for this task and that is Pregel computation model. In Pregel, programs are expressed as a sequence of iteration, in which each of a vertex receives messages sent by the previous iteration, send messages to other vertices and modify its own state that of its outgoing edges.
The vertex-centric approach is flexible enough to express a broad set of algorithms. The design purpose of this model is efficient processing, scalable and fault-tolerant implementation on clusters of thousands of commodity computers,and its synchronicity that makes reasoning about program easier. Distribution related details are hidden in system and kept behind an abstract API. The result is a framework for processing large-scale graphs which is expressive and easy to program.
This are execution engine for parallel distributed processing,which are very useful for solving bigdata problems.