Coursera Big Data grader and how to set the Hadoop Streaming number of reducers?

I'm trying to pass a course task on Coursera, but it fails at a unit test with the following error:
RES1_6 description: The first job should have more than 1 reducer or
shouldn't have them at all. Please set the appropriate number in -D mapreduce.job.reduces. It can be 0 or more than 1.
BUT, I use NUM_REDUCERS=4 in the following script!
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapreduce.job.name="somename" \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-files mapper.py,reducer.py,somesupplimentary.txt \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py" \
-input someinput.txt \
-output ${OUT_DIR} > /dev/null 2> $LOGS
And when I read the logs, I see the following:
Job Counters
Killed reduce tasks=1
Launched map tasks=2
Launched reduce tasks=9
Data-local map tasks=2
So I feel stupid and totally do not understand what the grader wants from me. The picture simply doesn't add up: it seems that I use MORE than one reducer, and the log appears to confirm it. Why does the unit test fail? Or am I missing some basic truth?

I needed to delete the "2> $LOGS" part. That was the grader's issue: it expects to read the job's logs from stderr itself, so I should not capture them into a file on my own.
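To illustrate why the redirection hides the counters, here is a minimal sketch (the function is a made-up stand-in for the streaming job, not the actual grader): Hadoop Streaming prints its Job Counters on stderr, and "2> $LOGS" diverts that stream away from whatever launched the script.

```shell
#!/bin/sh
# Hypothetical stand-in for the streaming job: counters go to stderr.
simulate_job() {
    echo "Launched reduce tasks=9" >&2
}

# With the redirection, the counters land in the log file only.
simulate_job 2> grader_logs.txt

# Without it, a wrapper (such as a grader) can capture them itself.
captured=$(simulate_job 2>&1)

cat grader_logs.txt
echo "$captured"
```

So if a grader wraps your script and parses stderr, redirecting stderr to a file means it sees nothing at all.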

Related

Not entering while loop in shell script

I was trying to implement PageRank in Hadoop. I created a shell script to run MapReduce iteratively, but the while loop just doesn't work. I have two MapReduce jobs: one finds the initial page rank and prints the adjacency list; the other takes the output of the first reducer as input to its mapper.
The shell script
#!/bin/sh
CONVERGE=1
ITER=1
rm W.txt W1.txt log*
$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave
hdfs dfs -rm -r /task-*
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.3.jar \
    -mapper "'$PWD/mapper.py'" \
    -reducer "'$PWD/reducer.py' '$PWD/W.txt'" \
    -input /assignment2/task2/web-Google.txt \
    -output /task-1-output
echo "HERE $CONVERGE"
while [ "$CONVERGE" -ne 0 ]
do
echo "############################# ITERATION $ITER #############################"
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.3.jar \
    -mapper "'$PWD/mapper2.py' '$PWD/W.txt' '$PWD/page_embeddings.json'" \
    -reducer "'$PWD/reducer2.py'" \
    -input task-1-output/part-00000 \
    -output /task-2-output
touch w1
hadoop dfs -cat /task-2-output/part-00000 > "$PWD/w1"
CONVERGE=$(python3 $PWD/check_conv.py $ITER>&1)
ITER=$((ITER+1))
hdfs dfs -rm -r /task-2-output/x
echo $CONVERGE
done
The first mapper runs perfectly fine and I am getting output for it. The while-loop condition [ "$CONVERGE" -ne 0 ] just evaluates to false, so the loop body that runs the second MapReduce job is never entered. I removed the quotes around $CONVERGE and tried again; it still doesn't work.
I defined CONVERGE at the beginning of the file, and it is updated inside the loop with the output of check_conv.py. The while loop just doesn't run.
What could I be doing wrong?
Self Answer:
I tried everything I could think of to correct the mistakes. Later I was told to install dos2unix, convert the script, and run it again. Surprisingly, it worked: the file was then read properly. (The script had Windows-style CRLF line endings, so variables like $CONVERGE carried a trailing carriage return and the numeric comparison failed.)
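For reference, the CRLF problem can be reproduced and fixed without dos2unix; a sketch using tr as a stand-in (the file names are made up):

```shell
#!/bin/sh
# A script saved with Windows (CRLF) line endings: note the \r.
printf 'CONVERGE=1\r\n' > script_crlf.sh

# Sourcing it leaves a carriage return in the value, so a numeric
# test like [ "$CONVERGE" -ne 0 ] fails ("integer expression expected").
. ./script_crlf.sh
echo "$CONVERGE" | od -c | head -n 1   # shows the trailing \r

# Stripping the \r (which is what dos2unix does) fixes the value.
tr -d '\r' < script_crlf.sh > script_unix.sh
. ./script_unix.sh
echo "$CONVERGE"   # → 1
```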

How to distribute executions over a cluster

I am doing research, and I often need to execute the same program with different inputs (each combination of inputs repeatedly) and store the results, for aggregation.
I would like to speed up the process by executing these experiments in parallel, over multiple machines. However, I would like to avoid the hassle of launching them manually. Furthermore, I would like my program to be implemented as a single thread and only add parallelization on top of it.
I work with Ubuntu machines, all reachable in the local network.
I know GNU Parallel can solve this problem, but I am not familiar with it. Can someone help me to setup an environment for my experiments?
Please note that this answer has been adapted from one of my scripts and is untested. If you find bugs, you are welcome to edit the answer.
First of all, to make the process completely batch, we need a non-interactive SSH login (that's what GNU Parallel uses to launch commands remotely).
To do this, first generate a pair of RSA keys (if you don't already have one) with:
ssh-keygen -t rsa
which will generate a pair of private and public keys, stored by default in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. It is important to use these locations, as openssh will look for them there. While openssh commands allow you to specify the private key file (passing it with -i PRIVATE_KEY_FILE_PATH), GNU Parallel does not have such an option.
Next, we need to copy the public key to all the remote machines we are going to use. For each of the machines of your cluster (I will call them "workers"), run this command on your local machine:
ssh-copy-id -i ~/.ssh/id_rsa.pub WORKER_USER@WORKER_HOST
This step is interactive, as you will need to log in to each of the workers with a user id and password.
From this moment on, login from your client to each of the workers is non-interactive. Next, let's set up a bash variable with a comma-separated list of your workers. We will use GNU Parallel's special syntax, which allows you to indicate how many CPUs to use on each worker:
WORKERS_PARALLEL="2/user1@192.168.0.10,user2@192.168.0.20,4/user3@10.0.111.69"
Here, I specified that on 192.168.0.10 I want only 2 parallel processes, while on 10.0.111.69 I want four. As for 192.168.0.20, since I did not specify any number, GNU Parallel will figure out how many CPUs (CPU cores, actually) the remote machine has and execute that many parallel processes.
Since I will also need the same list in a format that openssh can understand, I will create a second variable without the CPU information and with spaces instead of commas. I do this automatically with:
WORKERS=`echo $WORKERS_PARALLEL | sed 's/[0-9]*\///g' | sed 's/,/ /g'`
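As a quick sanity check, here is what that pipeline produces for the example list above (the user names and addresses are the made-up ones from before, written with the standard user@host syntax):

```shell
#!/bin/sh
WORKERS_PARALLEL="2/user1@192.168.0.10,user2@192.168.0.20,4/user3@10.0.111.69"
# Strip the "N/" CPU-count prefixes, then turn commas into spaces.
WORKERS=$(echo "$WORKERS_PARALLEL" | sed 's/[0-9]*\///g' | sed 's/,/ /g')
echo "$WORKERS"   # → user1@192.168.0.10 user2@192.168.0.20 user3@10.0.111.69
```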
Now it's time to set up my code. I assume that each of the workers is configured to run my code, so I will just need to copy it. On workers, I usually work in the /tmp folder, so what follows assumes that. The code will be copied through SSH and extracted remotely:
WORKING_DIR=/tmp/myexperiments
TAR_PATH=/tmp/code.tar.gz
# Clean from previous executions
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
# Copy the tar.gz file to the worker
parallel scp LOCAL_TAR_PATH {}:/tmp ::: `echo $WORKERS`
# Create the working directory on the worker
parallel --nonall -S $WORKERS mkdir -p $WORKING_DIR
# Extract the tar file in the working directory
parallel --nonall -S $WORKERS tar --warning=no-timestamp -xzf $TAR_PATH -C $WORKING_DIR
Notice that multiple executions on the same machine will use the same working directory. I assume only one version of the code will be run at a specific time; if this is not the case you will need to modify the commands to use different working directories.
I use the --warning=no-timestamp directive to avoid annoying warnings that could be issued if the clock of your machine is ahead of that of your workers.
We now need to create directories in the local machine for storing the results of the runs, one for each group of experiments (that is, multiple executions with the same parameters). Here, I am using two dummy parameters alpha and beta:
GROUP_DIRS="results/alpha=1,beta=1 results/alpha=0.5,beta=1 results/alpha=0.2,beta=0.5"
N_GROUPS=3
parallel --header : mkdir -p {DIR} ::: DIR $GROUP_DIRS
Notice that using parallel here is not necessary: a loop would have worked, but I find this more readable. I also stored the number of groups, which we will use in the next step.
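For completeness, the plain-loop equivalent would be (same dummy directories):

```shell
#!/bin/sh
GROUP_DIRS="results/alpha=1,beta=1 results/alpha=0.5,beta=1 results/alpha=0.2,beta=0.5"
# Word-splitting on spaces is intentional here: each entry is one directory.
for DIR in $GROUP_DIRS; do
    mkdir -p "$DIR"
done
ls results
```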
A final preparation step consists in creating a list of all the combinations of parameters that will be used in the experiments, each repeated as many times as necessary. Each repetition is coupled with an incremental number for identifying different runs.
ALPHAS="1.0 0.5 0.2"
BETAS="1.0 1.0 0.5"
REPETITIONS=1000
PARAMS_FILE=/tmp/params.txt
# Create header
echo REP GROUP_DIR ALPHA BETA > $PARAMS_FILE
# Populate
parallel \
--header : \
--xapply \
if [ ! -e {GROUP_DIR}"exp"{REP}".dat" ]';' then echo {REP} {GROUP_DIR} {ALPHA} {BETA} '>>' $PARAMS_FILE ';' fi \
::: REP $(for i in `seq $REPETITIONS`; do printf $i" %.0s" $(seq $N_GROUPS) ; done) \
::: GROUP_DIR $GROUP_DIRS \
::: ALPHA $ALPHAS \
::: BETA $BETAS
In this step I also implemented a control: if a .dat file already exists, I skip that set of parameters. This is something that comes out of practice: I often interrupt the execution of GNU Parallel and later decide to resume it by re-executing these commands. With this simple control I avoid running more experiments than necessary.
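The same "skip runs that already produced a .dat file" check can be written as a plain loop for clarity (the paths are illustrative, and unlike the concatenation in the parallel version a slash is inserted before the file name):

```shell
#!/bin/sh
mkdir -p "results/alpha=1,beta=1"
touch "results/alpha=1,beta=1/exp1.dat"   # pretend repetition 1 already ran

for REP in 1 2 3; do
    f="results/alpha=1,beta=1/exp${REP}.dat"
    # Only schedule repetitions whose output file does not exist yet.
    if [ ! -e "$f" ]; then
        echo "repetition $REP still to run"
    fi
done
```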
Now we can finally run the experiments. The algorithm in this example generates a file as specified in the parameter --save-data which I want to retrieve. I also want to save the stdout and stderr in a file, for debugging purposes.
cat $PARAMS_FILE | parallel \
--sshlogin $WORKERS_PARALLEL \
--workdir $WORKING_DIR \
--return {GROUP_DIR}"exp"{REP}".dat" \
--return {GROUP_DIR}"exp"{REP}".txt" \
--cleanup \
--xapply \
--header 1 \
--colsep " " \
mkdir -p {GROUP_DIR} ';' \
./myExperiment \
--random-seed {REP} \
--alpha {ALPHA} \
--beta {BETA} \
--save-data {GROUP_DIR}"exp"{REP}".dat" \
'&>' {GROUP_DIR}"exp"{REP}".txt"
A little bit of explanation about the parameters. --sshlogin, which could be abbreviated with -S, passes the list of workers that Parallel will use to distribute the computational load. --workdir sets the working dir of Parallel, which by default is ~. --return directives copy back the specified file after the execution is completed. --cleanup removes the files copied back. --xapply tells Parallel to interpret the parameters as tuples (rather than sets to multiply by cartesian product). --header 1 tells Parallel that the first line of the parameters file has to be interpreted as header (whose entries will be used as names for the columns). --colsep tells Parallel that columns in the parameters file are space-separated.
WARNING: Ubuntu's version of parallel is outdated (2013). In particular, there is a bug preventing the above code from running properly, which was fixed only a few days ago. To get the latest monthly snapshot, run (does not need root privileges):
(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash
Notice that the fix to the bug I mentioned above will only be included in the next snapshot, on September 22nd, 2015. If you are in a hurry you should perform a manual installation of the latest development version.
Finally, it is a good habit to clean our working environments:
rm $PARAMS_FILE
parallel --nonall -S $WORKERS rm -rf $WORKING_DIR $TAR_PATH
If you use this for research and publish a paper, remember to cite the original work by Ole Tange (see parallel --bibtex).

TEZ as execution at job level

How to selectively set TEZ as execution engine for PIG jobs?
We can set the execution engine in pig.properties, but that is at the cluster level and impacts all the jobs on the cluster.
It is possible if the jobs are submitted through Templeton.
Example of PowerShell usage
New-AzureHDInsightPigJobDefinition -Query $QueryString -StatusFolder $statusFolder -Arguments @("-x", "tez")
Example of CURL usage:
curl -s -d file=<file name> -d arg=-v -d arg=-x -d arg=tez 'https://<dnsname.azurehdinsight.net>/templeton/v1/pig?user.name=admin'
Source: http://blogs.msdn.com/b/tiny_bits/archive/2015/09/19/pig-tez-as-execution-at-job-level.aspx
You can pass the execution engine as a parameter as shown below; for MapReduce it is mr and for Tez it is tez.
pig -useHCatalog -Dexectype=mr -Dmapreduce.job.queuename=<queue name> -param_file dummy.param dummy.pig

how to pass parameter to EMR job

I have an EMR job that I run it as follows
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID --jar s3n://mybucket/example1.jar
Now I need to pass the parameter mapred.job.name="My job" to the job.
I have tried passing it via the -D flag, but that did not work:
/opt/emr/elastic-mapreduce --jobflow $JOB_FLOW_ID -Dmapred.job.name="My job" --jar s3n://mybucket/example1.jar
Any idea?

Difference between Hadoop jar command and job command

What is the difference between the two commands "jar" and "job"?
Below is my understanding:
The "jar" command could be used to run MR jobs locally.
The "hadoop job" command is deprecated and is used to submit a job to the cluster; the alternative to it is the mapred command.
Also, the jar command would run the MR job locally on the same node where we execute the command, and not anywhere else on the cluster. If we were to submit a job, it would run on some non-deterministic node on the cluster.
Let me know if my understanding is correct and, if not, what exactly the difference is.
Thanks
They are completely different and I don't think they are comparable. Both co-exist, have separate functions, and neither is deprecated AFAIK.
job isn't used to submit a job to the cluster; rather, it is used to get information on jobs that have already run or are running. It is also used to kill a running job or even a specific task.
jar, on the other hand, is simply used to execute a custom MapReduce jar, for example:
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
hadoop jar
Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command.
Usage: hadoop jar <jar> [mainClass] args...
hadoop job
Command to interact with Map Reduce Jobs.
Usage: hadoop job [GENERIC_OPTIONS] [-submit <job-file>] | [-status <job-id>] | [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] | [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <jobOutputDir>] | [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] | [-set-priority <job-id> <priority>]
For more info, read the Hadoop commands manual.