Cloud Computing - Task Scheduling Based on Genetic Algorithms
...
The Map operation processes the logical records contained in the input and generates a set of intermediate (key, value) pairs that are sent as input to the Reduce operation.
The Reduce operation takes the set of intermediate values sharing the same key and merges them to produce the output.
To write a Map-Reduce application, the user has to define two operations, Map and Reduce; the tasks that run inside the cluster are based on these functions. The user can also specify auxiliary operations: a Combiner function and a Partitioning function.
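For illustration, here is a minimal sketch of the two user-defined operations for a word-count job; the function names, the generator style, and the implied shuffle between them are assumptions for the example, not the API of any particular framework.

    def map_fn(key, value):
        # key: document name (unused here); value: document contents.
        for word in value.split():
            yield (word, 1)            # one intermediate pair per occurrence

    def reduce_fn(key, values):
        # key: a word; values: every count emitted for that word.
        yield (key, sum(values))       # total number of occurrences

In a real cluster, the framework groups the intermediate pairs by key between the two calls; that shuffle step is implied here.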
The Combiner operation can be used to reduce the network delay caused by copying many repeated map outputs. This function does partial merging of the map output data before it is sent over the network. It is executed on the machine that runs the map task, and it typically has the same code as the reduce function. The only difference is that the combiner's output is written to an intermediate file, whereas the reduce function's output is written to the final output file.
The Partitioning operation is used to split the intermediate data among the R reduce tasks, and hence into R output files. A typical default hashes the key modulo R, which tends to result in fairly well-balanced partitions.
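As a rough sketch of both auxiliary operations (the names and the CRC-based hash are illustrative choices, not mandated by any framework):

    import zlib

    def partition_fn(key, R):
        # Hash-style default: a stable hash of the key modulo R, so equal
        # keys always go to the same reduce task.
        return zlib.crc32(key.encode()) % R

    def combiner_fn(key, values):
        # Partial merge on the map machine; same logic as reduce_fn above,
        # but its output goes to an intermediate file, not the final output.
        yield (key, sum(values))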
C. Fault tolerance
Map-Reduce is a very reliable framework. Since it is designed to process very large amounts of data using hundreds or thousands of machines, it must tolerate machine failures gracefully, and it does so through several built-in mechanisms.
Worker Failure
The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Completed map tasks are re-executed on a failure because their output is stored on the local disk of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed, since their output is stored in a global file system.
If a map task was executed by worker w1 (that failed) and then re-executed by worker w2, all the workers executing reduce tasks are notified of the re-execution. The reduce tasks that have not already read the data from worker w1 will read the data from worker w2.
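A toy sketch of the rule just described, assuming the master keeps one record per task (the dictionary layout is invented for the example):

    def handle_worker_failure(tasks, failed_worker):
        # Completed map tasks must be redone: their output lived on the
        # failed machine's local disk. Completed reduce tasks are safe:
        # their output is already in the global file system.
        for task in tasks:
            if task["worker"] != failed_worker:
                continue
            if task["state"] == "in-progress":
                task["state"] = "idle"
            elif task["state"] == "completed" and task["kind"] == "map":
                task["state"] = "idle"
            if task["state"] == "idle":
                task["worker"] = None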
Map-Reduce can handle large-scale worker failures, amounting to significant percentages of the cluster size.
Master Failure
The master keeps several data structures to track the work in the cluster. For each map and reduce task, it stores the state (idle, in-progress, or completed) and, for non-idle tasks, the identity of the worker machine executing them. The master writes periodic checkpoints of these data structures, so that if it crashes, a new copy can be started from the last checkpointed state.
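A minimal sketch of the checkpoint/recover cycle, using Python's pickle purely for illustration (the real system's checkpoint format is not described here):

    import pickle

    def checkpoint(task_table, path="master.ckpt"):
        # Periodically snapshot the master's task state.
        with open(path, "wb") as f:
            pickle.dump(task_table, f)

    def recover(path="master.ckpt"):
        # A new master copy resumes from the last snapshot.
        with open(path, "rb") as f:
            return pickle.load(f)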
Back-up (speculative) tasks
Another fault-tolerance mechanism is speculative (back-up) task execution. Near the end of a job, the remaining in-progress tasks can be launched on more than one machine, and a task is marked completed as soon as any copy of it finishes. This mitigates both worker failures and the “straggler” problem (a machine taking much longer than expected to finish its task).
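The idea can be sketched as a "first copy wins" race; the thread-pool version below is only a single-machine analogy for what the master does across machines:

    import concurrent.futures

    def run_speculatively(task_fn, n_copies=2):
        # Launch backup copies of a task and accept whichever finishes
        # first; the remaining copies' results are simply discarded.
        with concurrent.futures.ThreadPoolExecutor(n_copies) as pool:
            futures = [pool.submit(task_fn) for _ in range(n_copies)]
            done, _ = concurrent.futures.wait(
                futures, return_when=concurrent.futures.FIRST_COMPLETED)
            return next(iter(done)).result()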
D. Map-Reduce Applications
Here are a few simple examples of interesting programs that can be easily expressed as Map-Reduce computations.
Distributed Grep: The map function emits a line if it matches a supplied pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.
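A sketch of the two functions (the pattern is captured in a closure here; how the pattern reaches the map tasks is an implementation detail):

    import re

    def make_grep_map(pattern):
        regex = re.compile(pattern)
        def map_fn(filename, line):
            if regex.search(line):
                yield (line, "")       # emit matching lines as keys
        return map_fn

    def reduce_fn(key, values):
        yield (key, "")                # identity: pass matches through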
Count of URL Access Frequency: The map function processes logs of web page requests and outputs a (URL, 1) pair for each request. The reduce function adds together all values for the same URL and emits a (URL, total count) pair.
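A sketch, assuming the requested URL is the first whitespace-separated field of each log line:

    def map_fn(log_name, log_line):
        url = log_line.split()[0]      # assumed log format
        yield (url, 1)

    def reduce_fn(url, counts):
        yield (url, sum(counts))       # total accesses for this URL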
Reverse Web-Link Graph: The map function outputs a (target, source) pair for each link to a target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).
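A sketch; the regular expression is a crude stand-in for real HTML link extraction:

    import re

    def map_fn(source_url, page_html):
        for target in re.findall(r'href="([^"]+)"', page_html):
            yield (target, source_url)

    def reduce_fn(target, sources):
        yield (target, list(sources))  # every page that links to target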
Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs. The map function emits a (hostname, term vector) pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final (hostname, term vector) pair.
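A sketch; the frequency threshold min_count is an assumed parameter standing in for "throwing away infrequent terms":

    from collections import Counter
    from urllib.parse import urlparse

    def map_fn(url, document):
        host = urlparse(url).netloc              # hostname from the URL
        yield (host, Counter(document.split()))  # per-document term vector

    def reduce_fn(host, term_vectors, min_count=2):
        total = Counter()
        for tv in term_vectors:
            total += tv                          # add the vectors together
        total = Counter({t: c for t, c in total.items() if c >= min_count})
        yield (host, total)                      # final per-host term vector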
E. Genetic algorithms
Genetic Algorithms (GAs) [3] are a particular category of evolutionary algorithms, invented by John Holland, which aim at finding exact or approximate solutions to optimization problems. A GA is a stochastic technique based on the principles of natural evolution and genetics. GAs combine the exploitation of past results with the exploration of new areas of the search space. By using survival-of-the-fittest techniques combined with a structured yet randomized information exchange, a genetic algorithm often leads to optimized solutions faster than other techniques. GAs can search multiple peaks in parallel, so they are less likely to get stuck at a local optimum.
The robustness of GAs is due to their capacity to locate the global optimum in a multimodal landscape. Multimodal functions are plentiful in engineering problems: optimizing neural network structure, learning neural network weights, solving optimal control problems, designing structures, and solving flow problems are a few examples. It is for this reason that considerable attention has been paid to the design of GAs for optimizing multimodal functions.
A generation is a collection of artificial creatures (strings). In every new generation, a set of strings is created using information from the previous ones. Occasionally, a new part is tried for good measure. GAs are randomized, but they are not simple random walks: they efficiently exploit historical information to speculate on new search points with expected improvement [5], [9], [11], [15].
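To make the loop concrete, here is a minimal generational GA sketch. The encoding (bit strings), the operators (roulette-wheel selection, one-point crossover, bit-flip mutation), and every parameter value are illustrative assumptions; the fitness function is assumed to return positive scores to maximize.

    import random

    def evolve(fitness, n_bits=20, pop_size=50, generations=100,
               p_cross=0.7, p_mut=0.01):
        # Random initial population of bit strings.
        pop = [[random.randint(0, 1) for _ in range(n_bits)]
               for _ in range(pop_size)]
        for _ in range(generations):
            weights = [fitness(ind) for ind in pop]
            # Survival of the fittest: roulette-wheel parent selection.
            parents = random.choices(pop, weights=weights, k=pop_size)
            next_pop = []
            for a, b in zip(parents[0::2], parents[1::2]):
                a, b = a[:], b[:]                   # copy before editing
                if random.random() < p_cross:       # one-point crossover
                    cut = random.randrange(1, n_bits)
                    a[cut:], b[cut:] = b[cut:], a[cut:]
                for child in (a, b):                # bit-flip mutation
                    for i in range(n_bits):
                        if random.random() < p_mut:
                            child[i] ^= 1
                next_pop += [a, b]
            pop = next_pop
        return max(pop, key=fitness)

For example, evolve(lambda s: sum(s)) climbs toward the all-ones string (the OneMax toy problem).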
The majority of optimization methods move from a single point in the decision space to the next, using some transition rule to determine the next point. This point-to-point method is dangerous, as it can locate false peaks in multimodal (many-peaked) search spaces. By contrast, GAs work from a database of points simultaneously (a population of strings), climbing many peaks in parallel. The probability
...