Running a custom shell script with Distributed Shell on Apache YARN - shell

I have been going through the Apache Hadoop YARN book from Hortonworks, in which they explain two ways of running a YARN task.
My intent is to run a shell script (which compiles and runs various Java and Python scripts) against a set of folders. An easy metaphor: "unzipping 100 folders and logging their 'ls'".
Now say I want to parallelize the flow, such that each container handles 1-2 folders, and say I ask for 50 such containers.
How do I do that using Distributed Shell? I have seen examples of ls / whoami / uptime / hostname, but that is not what I want. I want to run a script that takes an argument path and iterates over it, and I want to run this in a distributed fashion on YARN. Any help?
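For reference, a minimal sketch of how the stock Distributed Shell client is typically invoked with a custom script; the jar location, the script name process_folders.sh, the argument path, and the resource settings are placeholders that vary by Hadoop version and distribution:

# Sketch only: adjust the jar path and options for your Hadoop version
DSHELL_JAR=$HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar
yarn jar $DSHELL_JAR org.apache.hadoop.yarn.applications.distributedshell.Client \
  -jar $DSHELL_JAR \
  -shell_script ./process_folders.sh \
  -shell_args "/data/folders" \
  -num_containers 50 \
  -container_memory 1024

Note that the stock client passes the same script and arguments to every container, so the script itself has to decide which folders it handles, or you submit several applications, each with its own folder list.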

Related

Running a Python script in parallel with Ansible

I am managing 6 or more machines at AWS with Ansible. Those machines must run a Python script that runs forever (the script has a while True).
I call the Python script via the command: python3 script.py
But only 5 machines run the script; the others don't. I can't figure out what I am doing wrong.
(Before the script call, everything works fine on all machines: echo, ping, etc.)
I already found the answer.
Ansible's forks setting limits it to 5 machines by default. You must set forks to a greater number in the configuration file, but the machine running Ansible must have the power to manage that.
I'll leave the question up because the answer was pretty hard for me to find.
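For example, the limit can be raised in ansible.cfg (the value 20 is only an illustration):

[defaults]
forks = 20

The same limit can also be raised for a single run with ansible-playbook's --forks option.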

In Oozie, how would I be able to use script output

I have to create a cron-like coordinator job and collect some logs.
/mydir/sample.sh >> /mydir/cron.log 2>&1
Can I use a simple Oozie workflow, as I would for any shell command?
I'm asking because I've seen that there are specific workflow actions to execute .sh scripts.
Sure, you can execute a Shell action (on any node in the YARN cluster) or use the SSH action if you'd like to target specific hosts. Keep in mind that the "/mydir/cron.log" file will be created on the host the action is executed on, and the generated file might not be available to other Oozie actions.
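One common way around that, sketched below, is to have the script copy its log to HDFS at the end so later actions can read it; the HDFS target path is just a placeholder:

/mydir/sample.sh >> /mydir/cron.log 2>&1
# Publish the local log to HDFS so other Oozie actions can read it
hdfs dfs -put -f /mydir/cron.log /user/oozie/logs/cron.log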

How do I write a script to start multiple services in CentOS?

I have a multi-node cluster of Hadoop, Kafka, ZooKeeper and Spark.
I am running the following commands to start each service:
$ ./Hadoop/sbin/start-all.sh
$ ./zookeeper/bin/zkServer.sh start
$ ./Kafka/Kafka-server-start.sh ./config/server-properties.sh
$ ./spark/sbin/start-all.sh
and so on..
Can anyone tell me how to write a script that automates this process instead of running each command individually?
Have you tried creating a simple shell script with all these commands and running that script instead? For example, the following is a simple bash script:
#!/bin/bash
# Start the Hadoop (HDFS/YARN) daemons
./Hadoop/sbin/start-all.sh
# Start ZooKeeper
./zookeeper/bin/zkServer.sh start
# Start the Kafka broker
./kafka/kafka-server-start.sh ./config/server-properties.sh
# Start the Spark daemons
./spark/sbin/start-all.sh
and so on ...
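Assuming the script is saved as start-services.sh (a placeholder name), make it executable and run it once:

chmod +x start-services.sh
./start-services.sh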

How is running a script using AWS EMR script-runner different from running it from bash?

I have used script-runner on AWS EMR. This may look like a very basic (and maybe stupid) question, but I have read many documents and none of them answers why we need a script runner in EMR, when all it does is execute a script on the master node.
Can the same script not be run using bash?
The script runner is needed when you want to simply execute a script but the entry point expects a jar. For example, submitting an EMR Step will execute a "hadoop jar blah ..." command. But if "blah" is a script, this will fail. Script runner becomes the jar that the Step expects and then uses its argument (the path to the script) to execute the shell script.
When you run your script directly in bash, you need to have the script locally, and you also need to set up all the configuration yourself for it to work as you expect.
With script-runner you have more options; for example, you can run it as part of your cluster launch command, as well as execute a script that is hosted remotely in S3. See the example from the EMR documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html
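For illustration, adding such a step from the CLI might look like this; the cluster ID, the S3 script path, and the region in the script-runner bucket name are placeholders:

aws emr add-steps --cluster-id j-XXXXXXXX --steps \
  Type=CUSTOM_JAR,Name=RunMyScript,ActionOnFailure=CONTINUE,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/scripts/my_script.sh]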

Perform a command on cluster computers

I'd like to perform some bash command on a set of computers in a YARN cluster. For example, print the last line of each log:
tail -n 1 `ls /data/pagerank/Logs/*`
There are too many computers in the cluster subset to manually log in to each one and run the command. Is there a way to automate the procedure?
I think you could use the Parallel SSH tool. You can find more at
https://code.google.com/p/parallel-ssh/
A basic tutorial on how to use it can be found at
http://noone.org/blog/English/Computer/Debian/CoolTools/SSH/parallel-ssh.html
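For instance, a minimal pssh invocation could look like this; hosts.txt, the user name, and the log path (taken from the question) are placeholders:

# hosts.txt lists one cluster hostname per line
pssh -h hosts.txt -l hadoop -i "tail -n 1 /data/pagerank/Logs/*"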
