Is it possible to specify on which nodes to run specific mapper jobs?
I have data distributed across nodes and want to run the jobs on the nodes containing that data.
I'm not sure there's a reliable way to make sure a map task runs on a specific node. You can create a custom InputFormat whose splits override InputSplit.getLocations() to return only the host name of the node you want each split to run on. However, these locations are merely suggestions to the MR framework, and it can choose to ignore them.
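For illustration, here is a minimal sketch of that idea, assuming the new org.apache.hadoop.mapreduce API; the file-to-host mapping is a placeholder you would supply yourself:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Wraps the standard TextInputFormat splits but replaces each split's location
// hint so it advertises only the host we would like it to run on.
public class PinnedLocationInputFormat extends TextInputFormat {

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> pinned = new ArrayList<>();
        for (InputSplit split : super.getSplits(job)) {
            FileSplit fs = (FileSplit) split;
            String preferredHost = chooseHostFor(fs.getPath()); // your own file -> host mapping
            pinned.add(new FileSplit(fs.getPath(), fs.getStart(), fs.getLength(),
                                     new String[] { preferredHost }));
        }
        return pinned;
    }

    // Placeholder: decide which node should process a given file/split.
    private String chooseHostFor(Path path) {
        return "datanode-for-" + path.getName();
    }
}
```

Even with this, the scheduler only treats the advertised locations as a preference; on a busy cluster the task may still be assigned elsewhere.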
Apache Hadoop doesn't support this out of the box, but the feature is supported in the MapR distribution (1).
Related
We are building a data workflow with NiFi and want the final (custom) processor (which runs the deduplication logic) to run on only one of the NiFi cluster nodes (instead of running on all of them). I see that NiFi 1.7.0 (which is not yet released) has a @PrimaryNodeOnly annotation to enforce single-node execution behaviour. Is there a way or workaround to enforce such behaviour in NiFi 1.6.0?
NOTE: In addition to @PrimaryNodeOnly, it would be better if NiFi provided a way to run a processor on a single node only (i.e., some annotation like @SingleNodeOnly). This way the execution node need not necessarily be the primary node, which would reduce the load on the primary node. This is just an ask for the future and is not necessary to solve the problem mentioned above.
There is no specific workaround to enforce it in previous versions; it is on the data flow designer to mark the intended processor(s) to run on the Primary Node only. You could write a script to query the NiFi API for processors of certain types or names, then check/set their execution strategy to Primary Node Only.
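If you go the scripting route, a rough sketch of the query half might look like this (using Java 11's HttpClient; the NiFi address is an assumption, and you should verify against your NiFi version which field of the returned JSON carries the execution setting before PUTting an updated entity back):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Fetches a processor's current state from the NiFi REST API. To switch it to
// primary-node-only execution you would PUT the entity back with the component's
// execution setting changed (in recent NiFi versions the config field is named
// executionNode; check the exact name for your version).
public class ProcessorInspector {
    public static void main(String[] args) throws Exception {
        String nifi = "http://localhost:8080";   // assumed NiFi address
        String processorId = args[0];            // the processor to inspect

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest get = HttpRequest.newBuilder()
                .uri(URI.create(nifi + "/nifi-api/processors/" + processorId))
                .GET()
                .build();

        HttpResponse<String> response = client.send(get, HttpResponse.BodyHandlers.ofString());
        // The JSON body contains the revision (needed for any update) and the
        // component configuration, including its scheduling/execution settings.
        System.out.println(response.body());
    }
}
```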
In NiFi 1.6.0 it's already possible: in the processor's Scheduling settings, set Execution to Primary node.
I need to process some data in MR and load it into an external system that sits on the same physical machines as my MR nodes. Right now I run the job and read the output from HDFS and re-route individual records back out onto the desired nodes.
Is it possible to define some mapping such that records with key X always go straight to the desired node Y? Simply put, I want to control where hadoop routes post-sorted partitioned groups.
Not easily. The only way I know of to affect the physical location of a block of data on the fly is to implement a custom BlockPlacementPolicy. I'll just throw out some ideas for your use case.
A custom BlockPlacementPolicy can route blocks based on the file name
The file name of a partition can be modified using MultipleOutputs in MapReduce
Keys can be routed to specific partitions using a custom Partitioner (sketched below)
It seems like you can get the result you're looking for, but it won't be pretty.
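As a sketch of the Partitioner piece (the key-to-partition table below is made up for illustration; MultipleOutputs and the BlockPlacementPolicy would still be needed to turn partitions into specifically placed files):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys to fixed partition numbers so that all records destined for the
// same external node land in the same reduce partition (and therefore the same
// output file, whose name a custom BlockPlacementPolicy could then act on).
public class NodeAffinityPartitioner extends Partitioner<Text, Text> {

    private static final Map<String, Integer> KEY_TO_PARTITION = new HashMap<>();
    static {
        KEY_TO_PARTITION.put("X", 0); // e.g. key X always goes to partition 0 / node Y
        KEY_TO_PARTITION.put("Z", 1);
    }

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        Integer p = KEY_TO_PARTITION.get(key.toString());
        // Fall back to hash partitioning for keys we have no explicit rule for.
        return p != null ? p % numPartitions
                         : (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```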
I need to execute MapReduce on my Cassandra cluster, including data locality, i.e. each job queries only rows that belong to the local Cassandra node where the job runs.
Tutorials exist on how to set up Hadoop for MR on an older Cassandra version (0.7). I cannot find one for the current release.
What has changed since 0.7 in this regard?
What software modules are required for a minimal setup (Hadoop+HDFS+...)?
Do I need Cassandra Enterprise?
Cassandra contains a few classes which are sufficient to integrate with Hadoop:
ColumnFamilyInputFormat - This is the input for a Map function. It can read all rows from a single CF when using Cassandra's random partitioner, or it can read a row range when used with Cassandra's ordered partitioner. The Cassandra cluster has a ring form, where each part of the ring is responsible for a concrete key range. The main task of an InputFormat is to divide the Map input into data parts that can be processed in parallel - these are called InputSplits. In the Cassandra case this is simple - each ring range has one master node, which means the InputFormat creates one InputSplit for each ring element, and each results in one Map task. Now we would like to execute each Map task on the same host where its data is stored. Each InputSplit remembers the IP address of its ring part - the IP address of the Cassandra node responsible for that particular key range. The JobTracker creates Map tasks from the InputSplits and assigns them to TaskTrackers for execution. The JobTracker tries to find a TaskTracker with the same IP address as the InputSplit - basically we have to start a TaskTracker on each Cassandra host, and this will guarantee data locality.
ColumnFamilyOutputFormat - this configures the context for the Reduce function so that the results can be stored in Cassandra.
The results from all Map functions have to be combined before they can be passed to the Reduce function - this is called the shuffle. It uses the local file system - from the Cassandra perspective nothing has to be done here, we just need to configure the path to a local temp directory. There is also no need to replace this solution with something else (like persisting in Cassandra) - this data does not have to be replicated, since Map tasks are idempotent.
Basically, using the provided Hadoop integration gives us the possibility to execute Map jobs on the hosts where the data resides, and the Reduce function can store results back into Cassandra - which is all that I need.
There are two possibilities to execute Map-Reduce:
org.apache.hadoop.mapreduce.Job - this class simulates Hadoop in one process. It executes the Map-Reduce task and does not require any additional services/dependencies; it only needs access to a temp directory to store the results of the map phase for the shuffle. Basically we have to call a few setters on the Job class, which take things like the class names for the Map task, the Reduce task, the input format, and the Cassandra connection; when the setup is done, job.waitForCompletion(true) has to be called - it starts the Map-Reduce task and waits for the results (a minimal sketch follows after these two options). This solution can be used to get into the Hadoop world quickly, and for testing. It will not scale (single process), and it will fetch data over the network, but still - it is fine for a beginning.
Real Hadoop cluster - I have not set it up yet, but as I understand it, the Map-Reduce jobs from the previous example will work just fine. Additionally we need HDFS, which will be used to distribute the jars containing the Map-Reduce classes across the Hadoop cluster.
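A minimal sketch of the single-process option; MyMapper and MyReducer are placeholder class names, and the imports for the Cassandra classes are omitted because their package and the exact ConfigHelper setters depend on your Cassandra version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On older Hadoop releases use "new Job(conf, ...)" instead of Job.getInstance(...).
        Job job = Job.getInstance(conf, "cassandra-mr-sketch");

        job.setJarByClass(CassandraJobDriver.class);
        job.setMapperClass(MyMapper.class);        // placeholder Map task class
        job.setReducerClass(MyReducer.class);      // placeholder Reduce task class

        // Rows come from Cassandra and results go back to Cassandra.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

        // Cassandra connection details (keyspace, column family, seed host, partitioner)
        // go into the Configuration, typically via org.apache.cassandra.hadoop.ConfigHelper;
        // check the javadoc of your Cassandra version for the exact setter names.

        // Runs the whole Map-Reduce in this single process and blocks until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```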
Yes, I was looking for the same thing; it seems DataStax Enterprise has a simplified Hadoop integration.
Read this: http://wiki.apache.org/cassandra/HadoopSupport
I am new to Hadoop so I have some doubts. If the master node fails, what happens to the Hadoop cluster? Can we recover that node without any loss? Is it possible to keep a secondary master node that switches automatically to become the master when the current one fails?
We have a backup of the namenode (the secondary namenode), so we can restore the namenode from the secondary namenode when it fails. Similarly, how can we restore the data in a datanode when the datanode fails? The secondary namenode is a backup of the namenode only, not of the datanodes, right? If a node fails before completion of a job, so there is a job pending in the job tracker, does that job continue or restart from the beginning on a free node?
How can we restore the entire cluster data if anything happens?
And my final question, can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Thanks in advance
Although it is too late to answer your question, it may help others..
First of all, let me introduce you to the Secondary NameNode:
It contains the namespace image and a backup of the edit log files for the past hour (configurable). Its job is to merge the latest NameNode namespace image and edit log files and upload the result back to the NameNode as a replacement for the old one. Having a Secondary NN in a cluster is not mandatory.
Now coming to your concerns..
If the master node fails, what happens to the Hadoop cluster?
Supporting Frail's answer: yes, Hadoop has a single point of failure, so all of your currently running tasks, like Map-Reduce or anything else that is using the failed master node, will stop. The whole cluster, including clients, will stop working.
Can we recover that node without any loss?
That is hypothetical. Recovering without loss is unlikely, because all the data (block reports) sent by the data nodes to the name node after the last backup taken by the secondary name node will be lost. I said unlikely because if the name node fails just after a successful backup run by the secondary name node, then it is in a safe state.
Is it possible to keep a secondary master-node to switch automatically to the master when the current one fails?
It is straightforwardly possible for an administrator (user) to do manually. To switch automatically, you have to write code outside the cluster that monitors the cluster, configures the secondary name node appropriately, and restarts the cluster with the new name node address.
We have the backup of the namenode (secondary namenode), so we can restore the namenode from the secondary namenode when it fails. Similarly, how can we restore the data in a datanode when the datanode fails?
It is about the replication factor. We keep 3 (the default and best practice, configurable) replicas of each file block, all on different data nodes. So in case of a failure, for the time being we still have 2 backup data nodes. Later the name node will create one more replica of the data that the failed data node contained.
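For reference, the replication factor mentioned here is set cluster-wide in hdfs-site.xml (3 is the usual default); a typical entry looks like this:

```xml
<!-- hdfs-site.xml: default number of copies kept for each HDFS block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```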
The secondary namenode is a backup of the namenode only, not of the datanodes, right?
Right. It just contains the metadata about the data nodes, like each data node's address and properties, including its block report.
If a node fails before completion of a job, so there is a job pending in the job tracker, does that job continue or restart from the beginning on a free node?
Hadoop will try hard to continue the job. But again it depends on the replication factor, rack awareness, and other configuration made by the admin. If Hadoop's best practices for HDFS are followed, the job will not fail; the JobTracker will get the address of a node holding a replica and continue there.
How can we restore the entire cluster data if anything happens?
By restarting it.
And my final question, can we use C program in Mapreduce (For example, Bubble sort in mapreduce)?
Yes, you can use any programming language that supports standard file read/write operations.
I just gave it a try. Hope it helps you as well as others.
*Suggestions/Improvements are welcome.*
Currently a Hadoop cluster has a single point of failure, which is the namenode.
And about the secondary node issue (from the Apache wiki):
The term "secondary name-node" is somewhat misleading. It is not a
name-node in the sense that data-nodes cannot connect to the secondary
name-node, and in no event it can replace the primary name-node in
case of its failure.
The only purpose of the secondary name-node is to perform periodic
checkpoints. The secondary name-node periodically downloads current
name-node image and edits log files, joins them into new image and
uploads the new image back to the (primary and the only) name-node.
See User Guide.
So if the name-node fails and you can restart it on the same physical
node then there is no need to shutdown data-nodes, just the name-node
need to be restarted. If you cannot use the old node anymore you will
need to copy the latest image somewhere else. The latest image can be
found either on the node that used to be the primary before failure if
available; or on the secondary name-node. The latter will be the
latest checkpoint without subsequent edits logs, that is the most
recent name space modifications may be missing there. You will also
need to restart the whole cluster in this case.
There are tricky ways to overcome this single point of failure. If you are using the Cloudera distribution, one of the ways is explained here. The MapR distribution has a different way to handle this SPOF.
Finally, you can use virtually any programming language to write MapReduce jobs via Hadoop Streaming.
Although it is too late to answer your question, it may help others. First we will discuss the role of the Hadoop 1.X daemons and then your issues.
1. What is the role of the Secondary NameNode?
It is not exactly a backup node. It reads the edit logs and periodically creates an updated fsimage file for the NameNode. It gets metadata from the NameNode periodically, keeps it, and that copy is used when the NameNode fails.
2. What is the role of the NameNode?
It is the manager of all the daemons. It is the master JVM process which runs on the master node. It interacts with the data nodes.
3. What is the role of the JobTracker?
It accepts jobs and distributes them to TaskTrackers for processing at the data nodes. This is called the map process.
4. What is the role of the TaskTrackers?
They execute the program provided for processing on the data existing at the data node. That process is called map.
Limitations of Hadoop 1.X
Single point of failure
This is the NameNode, so we should maintain high-quality hardware for the NameNode. If the NameNode fails, everything becomes inaccessible.
Solutions
The solution to the single point of failure is Hadoop 2.X, which provides high availability.
High availability with Hadoop 2.X
Now to your questions...
How can we restore the entire cluster data if anything happens?
If the cluster fails, we can restart it.
If a node fails before completion of a job, so there is a job pending in the job tracker, does that job continue or restart from the beginning on a free node?
We have 3 replicas of the data (I mean blocks) by default to achieve high availability; it depends on the admin how many replicas have been set. So the JobTracker will continue with another copy of the data on another data node.
Can we use a C program in MapReduce (for example, bubble sort in MapReduce)?
Basically, MapReduce is an execution engine which solves or processes big data problems (storage plus processing) in a distributed manner. We are doing file handling and all other basic operations using MapReduce programming, so we can use any language in which we can handle files as per the requirements.
Hadoop 1.X architecture
Hadoop 1.X has 4 basic daemons.
I Just gave a try. Hope it will help you as well as others.
Suggestions/Improvements are welcome.
I want to know if I can compare two consecutive jobs in Hadoop. If not, I would appreciate it if anyone could tell me how to proceed with that. To be precise, I want to compare the jobs in terms of what exactly the two jobs did. The reason for doing this is to gather statistics about how many jobs executed on Hadoop were similar in terms of behavior, for example how many times the same sorting function was executed on the same input.
For example, the first job did something like SortList(A) and some other job did SortList(A) + Group(result(SortList(A))). Now I am wondering if Hadoop stores some mapping somewhere, like JobID X -> SortList(A).
So far I have thought of this problem as finding the entry point in Hadoop and trying to understand how a job is created and what information is kept with a job ID and in what form (code or some description), but I was not able to figure it out successfully.
Hadoop's Counters might be a good place to start. You can define your own counter names (like each counter name is a data set you are working on) and increment that counter each time you perform a sort on it. Finding which data set you are working on, however, may be the more difficult task.
Here's a tutorial I found:
http://philippeadjiman.com/blog/2010/01/07/hadoop-tutorial-series-issue-3-counters-in-action/
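Building on the counter idea above, a minimal sketch (the enum and mapper names are hypothetical; the essential part is the context.getCounter(...).increment(1) call, whose per-job totals remain visible in the job history):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortTrackingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // One counter per operation/data-set combination you want to track.
    public enum OpCounter { SORT_ON_DATASET_A }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... the actual processing of the record would happen here ...

        // Record that this job performed the operation; the framework aggregates
        // the per-task counts into a job-level total.
        context.getCounter(OpCounter.SORT_ON_DATASET_A).increment(1);
        context.write(new Text("dataset-A"), value);
    }
}
```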
No. Hadoop jobs are just programs. They can have any side effects; they can write ordinary files, HDFS files, or to a database. Nothing in Hadoop records all of their activities. All Hadoop does is manage the schedule and the flow of data.