How to add a GUI to a Hadoop program? - user-interface

I use Hadoop to write a MapReduce program that can be deployed to EC2 and to a local cluster. I am fine with using the command line to run the program, but is there any way to add an interface to it, so that users just need to click to run the program instead of using the command line? Thanks!

I'm not sure exactly what you want, but I think you're asking whether there is a UI for submitting MapReduce jobs to Hadoop? If so, you should try Hue: http://cloudera.github.com/hue/
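If a full web UI like Hue is more than you need, one low-tech option is to hide the command line behind a plain shell wrapper script users run directly (this is not Hue, just a sketch; the jar path, main class, and HDFS paths below are hypothetical):

    #!/usr/bin/env bash
    # Wrapper so users don't have to remember the full hadoop command line.
    # JAR, MAIN_CLASS, and the HDFS paths are hypothetical examples.
    JAR=/opt/myapp/wordcount.jar
    MAIN_CLASS=com.example.WordCount

    read -rp "Input path on HDFS: " INPUT
    read -rp "Output path on HDFS: " OUTPUT

    hadoop jar "$JAR" "$MAIN_CLASS" "$INPUT" "$OUTPUT"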

Related

Uploading scripts to run on a non-local cluster using Slurm?

I am new to running on multiple cluster systems and I am stuck. I have a bash shell script (myjob.sh) and a secondary script to be executed (stuff.R). Once I am logged on to the non-local cluster, I am unsure how to upload these files so they can be run. "scp" is usually my go-to for this sort of thing, but I cannot figure out how to move the files to the cluster. If I go into an interactive shell I can create the files using "nano", but I really need to figure this out without the interactive shell. I'm weirdly stuck in limbo. Any help would be much appreciated, thank you!
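For what it's worth, scp is run from your local machine rather than from inside the cluster. A minimal sketch, assuming a hypothetical login node name, username, and remote directory:

    # Run these on your LOCAL machine, not on the cluster.
    # Hostname, username, and paths are hypothetical.
    scp myjob.sh stuff.R user@login.cluster.example.edu:~/jobs/

    # Then log in and submit the batch script to Slurm.
    ssh user@login.cluster.example.edu
    cd ~/jobs && sbatch myjob.sh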

What's the difference between spark-shell and submitted sbt programs?

Spark-shell can be used to interact with data in distributed storage, so what is the essential difference between writing code in spark-shell and submitting a packaged, independent sbt application to the cluster? (One difference I found is that a job submitted with sbt can be seen in the cluster management interface, while the shell cannot.) After all, sbt is quite cumbersome and the shell is very convenient.
Thanks a lot!
Spark-shell gives you a bare, console-like interface in which you can run your code as individual commands. This can be very useful if you're still experimenting with packages or debugging your code.
As for "a job submitted with sbt can be seen in the cluster management interface, and the shell can not": actually, spark-shell also shows up in the job UI, as "Spark-Shell", and you can monitor the jobs you run through it there.
Building Spark applications with SBT gives your development process some structure and iterative compilation, which helps in day-to-day development, and it avoids a lot of manual work. If you have a fixed set of things you always run, you can simply submit the same package again instead of going through the trouble of re-running everything as individual commands. SBT takes some getting used to if you are new to the Java style of development, but it can help you maintain applications in the long run.
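For reference, the packaged route usually boils down to two commands. The class name, master URL, and jar path below are hypothetical examples:

    # Build the application jar with sbt (run in the project directory).
    sbt package

    # Submit it to the cluster; class, master URL, and jar name are made-up examples.
    spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      target/scala-2.12/myapp_2.12-0.1.jar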

How to run a Hadoop application automatically?

I know that a MapReduce program can be run once with the command line "hadoop jar *.jar".
But now the program needs to run once every hour, in the background.
Is there any way to have the MR program submitted to Hadoop automatically every hour?
Adding a sleep (600 sec) to the program might work, but then the application runs all the time.
Your saviour is Oozie. Happy learning!
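An Oozie coordinator is the fully featured option; for a single job, a plain cron entry on a machine with the Hadoop client configured is a simpler alternative. A minimal sketch (jar path, class, and arguments are hypothetical):

    # Added via "crontab -e"; runs at minute 0 of every hour.
    # Paths, class name, and arguments are hypothetical examples.
    0 * * * * /usr/bin/hadoop jar /opt/jobs/myjob.jar com.example.MyJob /input /output-$(date +\%Y\%m\%d\%H) >> /var/log/myjob.log 2>&1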

What is the exact difference between pseudo-distributed mode and standalone mode in Hadoop?

What is the exact difference between pseudo-distributed mode and standalone mode in Hadoop?
How can we tell which mode we are running in when working on our own laptop/desktop?
The differences are the ones described in the product documentation:
Standalone Operation: By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
Pseudo-Distributed Operation: Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Unless you want to debug Hadoop code itself, you should always run in pseudo-distributed mode.
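A quick way to tell which mode a laptop install is actually running in is to look for the Hadoop daemons; a small sketch:

    # List running Java processes: in standalone mode no Hadoop daemons run,
    # while in pseudo-distributed mode you should see NameNode, DataNode, etc.
    jps

    # Another hint: in standalone mode "hadoop fs -ls /" browses the local
    # filesystem, while in pseudo-distributed mode it talks to HDFS.
    hadoop fs -ls /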

Jenkins/Hudson - Run script on all slaves

I have a requirement to run a script on all available slave machines. Primarily this is so they get relevant Windows hotfixes and new 3rd-party tools before building.
The script I have can be run multiple times without undesirable side effects and is quite lightweight, so I'm happy for this to be brute force if necessary.
Can anybody give suggestions as to how to ensure that a slave is 'up-to-date' before it works on a job?
I'm happy with solutions that are driven by a job on the master, or ones which can inject the task (automatically) before normal slave job processing.
My shop does this as part of the slave launch process. We have the slaves configured to launch via execution of a command on the master; this command runs a shell script that rsyncs the latest tool files to the slave and then launches the slave process. When there is a tool update, all we need to do is restart the slaves or the master.
However, we use Linux, whereas it looks like you are on Windows, so I'm not sure what the equivalent solution would be for you.
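Roughly, the master-side launch command described above looks something like the sketch below (hostname, user, paths, and the agent jar location are all hypothetical):

    #!/usr/bin/env bash
    # Configured in Jenkins as "launch agent via execution of a command on the master".
    # Hostname, user, and paths are hypothetical.
    SLAVE=build01.example.com

    # Push the latest tool/hotfix payload to the slave first.
    rsync -az /opt/jenkins-tools/ jenkins@$SLAVE:/opt/jenkins-tools/

    # Then start the agent process; it talks to the master over this ssh channel.
    exec ssh jenkins@$SLAVE java -jar /opt/jenkins-tools/slave.jar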
To your title: either use Parameter Plugin or use matrix configuration and list your nodes in it.
To your question about ensuring a slave is reliable, we mark it with a 'testbox' label and try out a variety of jobs on it. You could also have a job that is deployed to all of them and have it take the machine offline if it fails, I imagine.
Using Windows for slaves is very obnoxious for us too :(
