read remote properties file over ssh oozie - hadoop

I want to run an Oozie workflow from a remote machine while my config file exists on the local machine. Could you please help me understand how I can achieve this?
I tried the approach below but it didn't work:
ssh user@remote_host "oozie job -run -config" < config.properties
It gives the error:
Invalid sub-command: Missing argument for option: config
use 'help [sub-command]' for help details

You can pass the properties from your config file with -D, such as:
ssh user@remote_host "oozie job -run -D oozie.wf.application.path=hdfs://hdf-example-cluster -D key=value"
Oozie will read specific values from system properties if you don't provide them.
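If you want to keep using a local properties file instead of flattening it into -D options, one possible workaround (a sketch, not from the original answer; /tmp/job.properties is an arbitrary path) is to copy the file to the remote host first and point -config at it there:
# copy the local properties file to the remote host, then reference it with -config
scp config.properties user@remote_host:/tmp/job.properties
ssh user@remote_host "oozie job -config /tmp/job.properties -run"
This assumes OOZIE_URL is set on the remote host; otherwise add the -oozie <oozie-server-url> option to the oozie command.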

Related

migration from Cloudera Hadoop to HDINSIGHT

I have HQL scripts that I used to run on Cloudera using hive -f scriptname.hql. Now I want to run these scripts on HDInsight (Hadoop cluster), but the hive command line is not available in HDInsight. Can someone guide me on how I can do that?
beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -i query.hql
Does anyone have experience using the above rather than
hive -f query.hql
I don't see any other way to execute the HQL files. You can refer to this document: https://learn.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-use-hive-beeline#run-a-hiveql-file
You can also use the ZooKeeper quorum to avoid query failures during head node failover:
beeline -u '<zookeeper quorum>' -i /path/query.hql
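For reference, a ZooKeeper-based Hive connection string generally takes the following shape; the zk* hostnames are hypothetical placeholders, and the real quorum can be copied from your cluster's Ambari configuration:
beeline -u 'jdbc:hive2://zk0-hdi.example.com:2181,zk1-hdi.example.com:2181,zk2-hdi.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2' -i /path/query.hql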
Create an environment variable:
export hivef="beeline -u 'jdbc:hive2://hn0-hdi-uk.witechmill.co.uk:10001/default;principal=hive/_HOST@witechmill.CO.UK;auth=kerberos;transportMode=http' -n umerrkhan "
witechmill is my cluster name.
Then call the script using the below:
$hivef scriptname.hql
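Since a command stored in a plain shell variable can run into quoting problems, a small wrapper function is an alternative (a sketch, reusing the headnodehost connection string from above; -f makes beeline execute the file and exit, whereas -i treats it as an initialization script):
# hivef <script.hql> runs the given file through beeline
hivef() {
    beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -f "$1"
}
hivef scriptname.hql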

unable to start a job using spark-submit via ssh (on EC2)

I set up Spark on a single EC2 machine and, when I am connected to it, I am able to use Spark either with Jupyter or with spark-submit without any issue. Unfortunately, though, I am not able to use spark-submit via ssh.
So, to recap:
This works:
ubuntu@ip-198-43-52-121:~$ spark-submit job.py
This does not work:
ssh -i file.pem ubuntu@blablablba.compute.amazon.com "spark-submit job.py"
Initially, I kept getting the following error message over and over:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
After having read many articles and posts about this issue, I thought that the problem was due to some variables not having been set properly, so I added the following lines to the machine's .bashrc file:
export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7 # (it's where I unzipped the Spark file)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=/usr/bin/python3
export PYSPARK_PYTHON=python3
(As the error message referenced python, I also tried adding the line "alias python=python3" to .bashrc, but nothing changed)
After all this, if I try to submit the spark job via ssh I get the following error message:
"command spark-submit not found".
As it looks like the system ignores all the environment variables when sending commands via SSH, I decided to source the machine's .bashrc file before trying to run the spark job. As I was not sure about the most appropriate way to send multiple commands via SSH, I tried all the following ways:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file"
ssh -i file.pem ubuntu@blabla.compute.amazon.com << HERE
source .bashrc
spark-submit job.file
HERE
ssh -i file.pem ubuntu@blabla.compute.amazon.com <<- HERE
source .bashrc
spark-submit job.file
HERE
(ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file")
All attempts worked with other commands like ls or mkdir, but not with source and spark-submit.
I have also tried providing the full path running the following line:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"
In this case too I get, once again, the following message:
'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
How can I tell spark which python to use if SSH seems to ignore all environment variables, no matter how many times I set them?
It's worth mentioning that I got into coding and data a bit more than a year ago, so I am really a newbie here and any help would be highly appreciated. The solution may be very simple, but I cannot get my head around it. Please help.
Thanks a lot in advance :)
The problem was indeed with the way I was expecting the shell to work (which was wrong).
My issue was solved by:
Setting my variables in .profile instead of .bashrc
Providing the full path to python
Now I can launch spark jobs via ssh.
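A minimal sketch of what that change looks like, assuming the same paths used in the question (not a verbatim copy of the poster's files):
# appended to ~/.profile on the EC2 machine, instead of ~/.bashrc
export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=/usr/bin/python3   # full path to the interpreter, not just "python3"
After that, the remote invocation works as originally intended:
ssh -i file.pem ubuntu@blabla.compute.amazon.com "spark-submit job.py"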
I found the solution in the answer @VinkoVrsalovic gave to this post:
Why does an SSH remote command get fewer environment variables then when run manually?
Cheers

How do I configure the aws instance ip during the user data configuration?

I have a question about passing a shell script to an instance with user data. What I need to configure here is: since my server is going to run on the instance, the shell script should configure the server.xml information (like the instance IP address, the database IP address, ...) before starting the instance/server.
But since the instance/server hasn't been created yet, is there any variable I can use to pass the localhost information into the shell script? Is there any way for the user to specify some custom variables while running the user data, before the instance gets created? (Before using AWS user data, I used to run this manually through the configure.sh file and the config.properties file after the instance was created.)
#!/bin/bash
# source the properties:
. ./config.properties
echo "Installation"
echo "Updating server.xml"
cd "Server/server/configuration/"
sed -i -s "s/SERVER_IP/"$LOCALHOST_IP"/g" server.xml
sed -i -s "s/DB_IP/"$DATABASE_IP"/g" server.xml
cd "../tomcat/bin"
sh startup.sh
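One way to fill in the instance's own address at boot time (a sketch, not from the original question) is to query the EC2 instance metadata service from inside the user-data script; the database IP shown is a hypothetical placeholder that still has to be supplied some other way, for example hard-coded into the user data or read from a tag or parameter store:
#!/bin/bash
# user-data sketch: the metadata service is reachable from inside the instance
LOCALHOST_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
DATABASE_IP="10.0.0.12"   # hypothetical value, replace with your database address
cd "Server/server/configuration/"
sed -i "s/SERVER_IP/${LOCALHOST_IP}/g" server.xml
sed -i "s/DB_IP/${DATABASE_IP}/g" server.xml
cd "../tomcat/bin"
sh startup.sh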

How to copy files from one machine to another machine

I want to copy /home/cmind012/m.sh from one system to another (both systems are Linux) using a shell script.
Command:
scp /home/cmind012/m.sh cmind013:/home/cmind013/tanu
I am getting the message:
ssh: cmind013: Name or service not known
lost connection
It seems that cmind013 is not being resolved. I would first try
nslookup cmind013
and see why it doesn't resolve.
It seems that you are missing the IP address/domain of the remote host. The format should be user@host:[directory]
You could do the following:
scp -r [directory/files] [remote host]:[destination directory]
ex: scp -r /var/www/html/* root@192.168.1.0:/var/www/html/
Try the following command:
scp /home/cmind012/m.sh denil@172.22.192.105:/home/denil/
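If you would rather keep using the short name cmind013, one option (a sketch; the IP address below is a hypothetical placeholder) is to add a Host entry to ~/.ssh/config on the source machine so scp can resolve the name without DNS:
# ~/.ssh/config
Host cmind013
    HostName 172.22.192.105   # hypothetical IP of the destination machine
    User cmind013
With that in place, the original command scp /home/cmind012/m.sh cmind013:/home/cmind013/tanu should work unchanged.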

Messed up sed syntactics in hadoop startup script after reinstalling JVM

I'm trying to run a 3-node Hadoop cluster on the Windows Azure cloud. I've gone through configuration and a test launch, and everything looked fine. However, as I had been using OpenJDK, which is not recommended as the VM for Hadoop according to what I've read, I decided to replace it with the Oracle Server JVM. I removed the old Java installation with yum, along with all Java folders in /usr/lib, installed the most recent version of the Oracle JVM, and updated the PATH and JAVA_HOME variables; however, now on launch I get the following messages:
sed: -e expression #1, char 6: unknown option to `s'
64-Bit: ssh: Could not resolve hostname 64-Bit: Name or service not known
HotSpot(TM): ssh: Could not resolve hostname HotSpot(TM): Name or service not known
Server: ssh: Could not resolve hostname Server: Name or service not known
VM: ssh: Could not resolve hostname VM: Name or service not known
etc. (in total, about 20-30 lines containing words that should have nothing in common with hostnames)
To me it looks like it is trying to pass part of some output as a hostname because of incorrect usage of sed in the startup script:
if [ "$HADOOP_SLAVE_NAMES" != '' ] ; then
SLAVE_NAMES=$HADOOP_SLAVE_NAMES
else
SLAVE_FILE=${HADOOP_SLAVES:-${HADOOP_CONF_DIR}/slaves}
SLAVE_NAMES=$(cat "$SLAVE_FILE" | sed 's/#.*$//;/^$/d')
fi
# start the daemons
for slave in $SLAVE_NAMES ; do
ssh $HADOOP_SSH_OPTS $slave $"${#// /\\ }" \
2>&1 | sed "s/^/$slave: /" &
if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
sleep $HADOOP_SLAVE_SLEEP
fi
done
This looks unchanged, so the question is: how could a change of JVM affect sed? And how can I fix it?
So I found an answer to this question: my guess was wrong, and everything with sed is fine. The problem, however, was in how the Oracle JVM works with external libraries compared to OpenJDK. It throws a warning where the script is not expecting it, and that ruins the whole sed input.
You can fix it by adding the following environment variables:
HADOOP_COMMON_LIB_NATIVE_DIR, which should point to the /lib/native folder of your Hadoop installation, and add -Djava.library.path=/opt/hadoop/lib to whatever options you already have in the HADOOP_OPTS variable (note that /opt/hadoop is my installation folder; you might need to change it for things to work properly).
I personally added the export commands to the hadoop-env.sh script, but adding them to a .bash file or to start-all.sh should work as well.
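A sketch of those exports, assuming the /opt/hadoop installation path mentioned above (adjust it to your own installation directory):
# e.g. appended to hadoop-env.sh
export HADOOP_COMMON_LIB_NATIVE_DIR=/opt/hadoop/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/opt/hadoop/lib"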
