A MapReduce job supposedly can't be debugged remotely on a distributed cluster because each map and reduce task spawns its own JVM. What exactly does that mean? Can't we attach a debugger to every process on every node in the cluster involved in the MapReduce job?
I've read many articles and proposed solutions, but I still don't understand what makes debugging a MapReduce job on a distributed cluster a problem. Any help would be appreciated.
Thanks
You can debug only a single task at any given time; no debugger that I know of can create multiple sessions at once. Specifically, each MapReduce task can't be individually configured with its own JVM debug port, so even if it were possible, you would have to know which NodeManager the tasks get started on and ensure there's no port overlap on the same host.
If you really need to remote debug, it seems like you have poor unit test coverage to begin with, and you probably shouldn't deploy that code into production anyway.
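As a rough illustration of the unit-testing point, here is a minimal sketch in plain Python (not the Hadoop API) that keeps the map and reduce logic in ordinary functions so they can be run and debugged in a single local process. The names `map_words`, `reduce_counts`, and `run_local` are hypothetical, and word counting is only a stand-in for the real job logic.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def run_local(lines):
    """Run map, shuffle, and reduce in one process; no cluster involved."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_words(line):
            grouped[word].append(count)  # stands in for the shuffle phase
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())
```

With the logic isolated like this, a normal debugger or test runner can step through `map_words` and `reduce_counts` without attaching to any task JVM.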
I am working on a Spark program that monitors each executor's performance, for example by recording when an executor starts working and when it finishes its job. I am considering two ways to do that:
First, write the program so that when an executor starts work it writes the current time to a file, and when it finishes it writes that time to the same file. In the end, the "log" files will be spread across the whole cluster, on every machine except the driver.
Second, since executors report to the driver periodically, have the driver record everything each time it receives a message from an executor containing "start" or "finish" information.
Is that possible?
There are many ways to monitor executor performance as well as application performance.
The best way is to monitor with the Spark Web UI and other open-source monitoring tools such as Ganglia.
You need to monitor whether your cluster is under-utilized and how many resources your application is using.
Monitoring can be done with various tools, e.g. Ganglia. From Ganglia you can find CPU, memory, and network usage. Based on the observed CPU and memory usage, you can get a better idea of what kind of tuning your application needs.
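For the second idea in the question (the driver records "start" and "finish" messages), a minimal sketch in plain Python might look like the following. It does not use Spark at all: `ExecutorTimeline` and its methods are hypothetical names, and how the messages reach the driver is left out. In a real Spark application, the `SparkListener` interface (with callbacks such as `onTaskStart` and `onTaskEnd`) delivers comparable events to the driver.

```python
import time


class ExecutorTimeline:
    """Driver-side record of when each executor started and finished work."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so the class can be tested
        self._events = {}     # executor id -> {"start": t, "finish": t}

    def report(self, executor_id, status):
        """Record a 'start' or 'finish' message received from an executor."""
        if status not in ("start", "finish"):
            raise ValueError(f"unknown status: {status!r}")
        self._events.setdefault(executor_id, {})[status] = self._clock()

    def durations(self):
        """Elapsed seconds for every executor that has both timestamps."""
        return {
            executor_id: times["finish"] - times["start"]
            for executor_id, times in self._events.items()
            if "start" in times and "finish" in times
        }
```

Keeping all timestamps on the driver avoids having to collect log files from every worker machine afterwards, which is the drawback of the first approach.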
Hope this helps!
I'm testing shutting down servers via a UPS while a Hadoop task is running, and I have two questions.
1) Can a running task be saved so that it continues the remaining work after rebooting (on all nodes)?
2) If 1) is not supported, is it safe to start the shutdown process while Hadoop tasks are running? Or is there anything I have to do to preserve the Hadoop system (the cluster)?
No, you can't "save" a task in an intermediate state. If you shut down Hadoop while some jobs are running, you could end up with intermediate data from abandoned jobs occupying space. Apart from that, it is fine to shut down the system while jobs are running.
It is not possible to save the state of running tasks with Hadoop as of now. It would be extremely difficult, since all resource allocation is based on the current load of the system. After restarting the entire cluster, the workload might be entirely different, so restoring the state does not make sense.
To answer your second question: Hadoop was designed to tolerate node failures, temporary problems accessing files, and network outages. Individual tasks might fail, and the system then restarts them on another node. Shutting down nodes is safe from the cluster's point of view; the only thing to keep in mind is that the job will ultimately fail, and you will need to resubmit it after bringing the cluster back to life. One problem with shutting down the cluster using the power switch is that temporary files do not get cleaned up. This is usually not a major problem.
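The "restart the task on another node" behaviour can be pictured with a small simulation in plain Python. This is only a sketch of the idea, not how Hadoop's scheduler is implemented: `run_with_retries` is a hypothetical helper, the round-robin choice of node is a simplification, and `max_attempts=4` is an illustrative default.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Try a task on successive nodes until one attempt succeeds.

    `task` is a callable that takes a node name and raises RuntimeError
    on failure. Returns the node that completed the task, or raises if
    every attempt fails (the job as a whole then fails).
    """
    failures = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]  # simplification: round-robin
        try:
            task(node)
            return node
        except RuntimeError as exc:
            failures.append((node, str(exc)))
    raise RuntimeError(f"task failed after {max_attempts} attempts: {failures}")


def flaky_task(node):
    """Hypothetical task that fails only on a node that has been shut down."""
    if node == "node-a":
        raise RuntimeError("node is down")
```

Calling `run_with_retries(flaky_task, ["node-a", "node-b"])` fails once on `node-a` and then succeeds on `node-b`. If every node is down, as in a full cluster shutdown, all attempts fail and the job has to be resubmitted.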
A couple of days ago Yahoo posted about the Storm-on-YARN project http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html, which makes it possible to run Storm on YARN.
That's a big improvement, but I have two questions about running tasks like Storm on YARN. Tasks like Storm have no limit on execution time. When you run Storm, you expect it to work for days or months, listening on a queue or whatever.
In other words, there is a set of tasks with no limit on execution time (I'd like to report 0% progress).
1) What about timeouts? A regular M/R job is killed when it hangs; how do I prevent that? I walked through the code but didn't find any special handling.
2) Also, MR1 has a queue where jobs wait for execution: when the cluster finishes one job, it picks up the next job from the queue. What about YARN? If I submit an endless Storm-like job A and then a job B, will job B be executed?
Sorry if my questions seem ridiculous; maybe I'm missing or misunderstanding something.
Hadoop's JobTracker was (and is) responsible for both cluster resources and the application lifecycle. YARN is responsible only for managing cluster resources; the application lifecycle is the responsibility of the application.
This change means that YARN can be used to manage any distributed paradigm. MR2 is of course the initial implementation (map/reduce over YARN), but you can see other implementations such as the Storm-on-YARN you mentioned, or Hortonworks' intention to integrate SQL into Hadoop.
You can take a look at a library called Weave from Continuuity, which provides a simple API for building distributed apps on YARN.
I'm thinking of learning Hadoop but I'm not sure it will solve my problem. Basically, I have a job with a queue and a bunch of workers. Each worker does a small amount of work and then either saves the results (if successful) or sends the item back to the queue for further processing. My problem is scalable, but it is limited by the network bandwidth (EC2), which will never keep up with multiple CPUs crunching the data.
I thought maybe I could run my jobs in Java on a Hadoop cluster and have Hadoop distribute the work via a queue. Would this be a better approach? Am I correct in assuming Hadoop can manage a queue and try to run jobs as locally as possible, to minimize bandwidth usage and maximize CPU usage?
My program is very CPU bound, but most of my recent performance problems are related to passing work over the network (I want to keep the work as local as possible). The difference between the Hadoop tutorials I see and my problem is that in the tutorials all the work is known in advance, while my program constantly generates new work for itself (until it's finally done). Would this work, and would it help me reduce the impact of passing messages over a network?
Sorry, I'm new to Hadoop and wanted to know if it could solve my problem.
Hadoop is all about running jobs in a batch-like mode over a large data set. It's hard to get it to have some sort of queue-like behavior, but not impossible. There is Apache ZooKeeper, which will give you synchronization to build a queue if you need it.
There are plenty of tools for the problem it looks like you are trying to solve. I suggest taking a look at RabbitMQ. If you use Python, Celery is quite fantastic.
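To make the pattern from the question concrete (workers pull an item, then either save a result or push follow-up work back onto the queue), here is a minimal single-machine sketch using only Python's standard library. It uses neither RabbitMQ nor Celery; `process`, `worker`, and `run` are hypothetical names, and the halving logic in `process` is just a placeholder for real work.

```python
import queue
import threading


def process(item):
    """Hypothetical unit of work: halve the number until it reaches 1.

    Returns (result, None) when finished, or (None, next_item) to requeue.
    """
    if item <= 1:
        return item, None
    return None, item // 2


def worker(work, results, lock):
    while True:
        item = work.get()
        if item is None:           # sentinel: shut this worker down
            work.task_done()
            return
        result, follow_up = process(item)
        if follow_up is None:
            with lock:
                results.append(result)
        else:
            work.put(follow_up)    # queue new work before marking this done
        work.task_done()


def run(initial_items, num_workers=4):
    work, results, lock = queue.Queue(), [], threading.Lock()
    for item in initial_items:
        work.put(item)
    threads = [threading.Thread(target=worker, args=(work, results, lock))
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    work.join()                    # returns once no queued work remains
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return results
```

Because follow-up work is queued before `task_done()` is called, `work.join()` cannot return while new work is still being generated. Threads are used only to keep the sketch short; CPU-bound work would need separate processes or machines, which is where a message broker comes in.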
When running Hadoop jobs, I noticed that sometimes the number of completed tasks decreases and the number of canceled tasks increases.
How is this possible? Why does this happen?
I've only experienced this when our cluster was in a strange state, so I'm not sure if this is the same issue. Basically, map tasks would complete, and then the reducers would start... and then mappers would be reprocessed.
I believe the problem is that mapper output stays on the data node that produced it, waiting for reducers to pick it up. If that node has issues or dies, the JobTracker decides that it needs to rerun the task, even though it had already completed. In our case, the system hosting our NameNode was having some non-Hadoop-related issues, and once those were resolved the problem seemed to go away.
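If that explanation is right, the bookkeeping looks roughly like the following plain-Python sketch. It is a toy model of the behaviour described above, not Hadoop code; `rerun_lost_maps` and the task and node names are hypothetical.

```python
def rerun_lost_maps(completed_maps, failed_node, reducers_done):
    """Return (still_completed, to_rerun) after a node holding map output fails.

    `completed_maps` maps a task id to the node that stores its output.
    If the reducers have not all fetched their input yet, completed map
    tasks whose output lived on the failed node have to run again.
    """
    if reducers_done:
        return dict(completed_maps), []
    still_completed = {
        task: node for task, node in completed_maps.items()
        if node != failed_node
    }
    to_rerun = sorted(
        task for task, node in completed_maps.items()
        if node == failed_node
    )
    return still_completed, to_rerun
```

With three completed map tasks, two of them on `node-a`, a failure of `node-a` before the reducers finish leaves one completed task and two to rerun, which would show up as the completed count going down.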
Sorry if my experience is not relevant to your issue. Could you post more details? Do you see any error messages? Is there anything weird in your JobTracker or NameNode logs?