Overview of Hadoop MapReduce – Sandeep Kanao
Hadoop MapReduce is a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
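As a concrete illustration (not part of the original post), a map task for a word-count job might look like the sketch below. It assumes the newer org.apache.hadoop.mapreduce Java API, and the class name WordCountMapper is illustrative. Each map task receives one input split and emits intermediate <word, 1> pairs, which the framework then sorts before the reduce phase.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: each map task processes one chunk (split)
// of the input and emits an intermediate <word, 1> pair per token.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // intermediate output, sorted by the framework
        }
    }
}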
The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. The Hadoop configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
A Hadoop job configuration specifies the map and reduce functions via implementations of the appropriate interfaces, along with other job parameters.
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker, which takes on the responsibility of distributing the software/configuration to the slaves, scheduling and monitoring tasks, and providing status and diagnostic information to the job client.
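As a hedged sketch of what such a configuration and submission can look like in code (again using the newer org.apache.hadoop.mapreduce API), the driver below bundles the map and reduce implementations with the job parameters and submits them to the cluster. WordCountMapper is the illustrative mapper above, and WordCountReducer is a matching reducer sketched later under Inputs and Outputs; both names are assumptions, not classes from the original post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: sets the map/reduce classes and job parameters,
// then submits the job and waits for completion.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional local aggregation
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and wait; status and progress are reported back to this client.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}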
Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications.
Inputs and Outputs - Overview of Hadoop MapReduce – Sandeep Kanao
The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
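For a custom key type, a minimal sketch of a WritableComparable implementation might look like the following; the YearTemperatureKey class and its fields are hypothetical and only meant to show the methods the framework relies on.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key: serializable via write/readFields (Writable)
// and orderable via compareTo (WritableComparable), which the framework
// uses when sorting map outputs.
public class YearTemperatureKey implements WritableComparable<YearTemperatureKey> {

    private int year;
    private int temperature;

    public YearTemperatureKey() { }          // no-arg constructor required for deserialization

    public YearTemperatureKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTemperatureKey other) {
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }

    @Override
    public int hashCode() {                  // used by the default partitioner
        return 31 * year + temperature;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearTemperatureKey)) return false;
        YearTemperatureKey k = (YearTemperatureKey) o;
        return year == k.year && temperature == k.temperature;
    }
}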
Input and Output types of a MapReduce job - Sandeep Kanao:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
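To make the type flow concrete, here is an illustrative word-count reducer matching the mapper sketched earlier: the intermediate <k2, v2> pairs are <Text, IntWritable> and the final <k3, v3> output pairs are the summed <Text, IntWritable> counts. The class name WordCountReducer is an assumption used throughout these sketches.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative reducer: receives each word (k2) with all of its counts (the v2 values)
// and emits the word with the total count (k3, v3).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result); // final <word, count> output pair
    }
}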