Not entering while loop in shell script - bash

I was trying to implement PageRank in Hadoop. I created a shell script to run map-reduce iteratively, but the while loop just doesn't work. I have two map-reduce jobs: one computes the initial page rank and prints the adjacency list, and the other takes the output of the first reducer as input to the second mapper.
The shell script
#!/bin/sh
CONVERGE=1
ITER=1
rm W.txt W1.txt log*
$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave
hdfs dfs -rm -r /task-*
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.3.jar \
    -mapper "'$PWD/mapper.py'" \
    -reducer "'$PWD/reducer.py' '$PWD/W.txt'" \
    -input /assignment2/task2/web-Google.txt \
    -output /task-1-output
echo "HERE $CONVERGE"
while [ "$CONVERGE" -ne 0 ]
do
echo "############################# ITERATION $ITER #############################"
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.3.jar \
    -mapper "'$PWD/mapper2.py' '$PWD/W.txt' '$PWD/page_embeddings.json'" \
    -reducer "'$PWD/reducer2.py'" \
    -input task-1-output/part-00000 \
    -output /task-2-output
touch w1
hadoop dfs -cat /task-2-output/part-00000 > "$PWD/w1"
CONVERGE=$(python3 $PWD/check_conv.py $ITER>&1)
ITER=$((ITER+1))
hdfs dfs -rm -r /task-2-output/x
echo $CONVERGE
done
The first map-reduce runs perfectly fine and I get output for it. The while loop condition [ "$CONVERGE" -ne 0 ] just evaluates to false, so the loop never runs the second map-reduce. I removed the quotes around $CONVERGE and tried again; it still doesn't work.
CONVERGE is defined at the beginning of the file and is updated inside the while loop with the output of check_conv.py. The while loop just doesn't run.
What could I be doing wrong?

Self Answer:
I tried everything I could think of to correct the mistakes, but in the end I was told to install dos2unix and run the script through it. Surprisingly, that worked and the file was then read properly. The likely explanation is that the script had Windows-style (CRLF) line endings, so the shell was reading a stray carriage return as part of each value, which broke the numeric comparison in the while condition.
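A quick way to confirm this, assuming the script is saved as pagerank.sh (the name here is just illustrative), is to look for the carriage returns and strip them:

cat -v pagerank.sh | head        # CRLF line endings show up as ^M at the end of each line
dos2unix pagerank.sh             # convert the file in place
sed -i 's/\r$//' pagerank.sh     # alternative if dos2unix is not installed (GNU sed)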

Related

Problem with submitting dependent job in slurm

I'm using the following .sh script to submit some Slurm jobs.
# Set hardware resource requests
nb_partitions=200 # number of distinct partitions to which you'll send sbatch statements
nb_threads=7 # number of CPUs requested per partition
# Make directory in which you'll store results
mkdir results
# Make a separate folder for each partition
for j in $(seq 1 $nb_partitions)
do
mkdir results/partition$j
done
# Copy MainFolder which has code for execution to each partition folder and change partition number stored in each input file to correspond to its number
for i in $(seq 1 $nb_partitions)
do
cp -r MainFolder results/partition$i
sed -i "1s/.*/${i[#]}/g" results/partition$i/MainFolder/partition_nb.txt
done
echo "Files and folders have been copied and created."
# Loop to submit simulations job for each partition
slurmids="" # storage of slurm job ids
for k in $(seq 1 $nb_partitions)
do
cd results/partition$k/MainFolder
# Write parameters to files within the main folder so they can be read into FORTRAN
echo "$nb_partitions" >> nb_partitions.txt
echo "$nb_threads" >> nb_threads.txt
slurmids="$slurmids:$(sbatch --parsable --cpus-per-task=$nb_threads run.sh)"
cd ..
cd ..
cd ..
done
echo "Jobs are now running."
# Submit job to combine files
ID=$(sbatch --dependency=afterok=$slurmids combine_results.sh)
The final line of this code is supposed to submit a batch job that will execute only after all of the other jobs submitted in the loop have finished. However, when I run this shell script, the combine_results job in the final line never runs; the Slurm manager simply says the job is waiting because its dependencies are not satisfied.
Any ideas on what I'm doing wrong here? Thanks!
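For reference, the sbatch documentation gives the afterok dependency as a colon-separated list (afterok:job_id[:job_id...]) rather than a second = sign; since slurmids is built with leading colons in the loop above, a minimal sketch of that form would be:

# slurmids expands to something like :123:124:125, giving afterok:123:124:125
ID=$(sbatch --parsable --dependency=afterok$slurmids combine_results.sh)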

hdfs test result only displayed after typing $?

How come when I run the command:
hdfs dfs -test -d /user/me/dirs
I have to type in $? to get the result:
sh: 0: command not found
Is there a way I could just get the result (0 or 1) itself and none of the other text?
Pretty terrible at bash so I might be missing something
In bash, $? gives you the exit code of the previous command. Simply typing $? will cause your shell to attempt to execute that return code as a command (which is not usually what you want). You could instead print the value using echo $?, or save it in a variable using retcode=$?.
Exit codes are not normally printed to your shell. The reason you may not be seeing any other output is because the hdfs command is likely not printing any text to your screen (either through stdout or stderr).
I suspect your best option might be hdfs dfs -test -d /user/me/dirs; echo $?, or some variant using a variable.
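For example, a small sketch of both options, using the path from the question:

hdfs dfs -test -d /user/me/dirs; echo $?   # print the exit code directly
# or keep it in a variable and branch on it
hdfs dfs -test -d /user/me/dirs
retcode=$?
if [ "$retcode" -eq 0 ]; then
    echo "directory exists"
else
    echo "directory does not exist (exit code $retcode)"
fi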
Exit codes
Understanding exit codes and how to use them in bash scripts

how to call Pig scripts from shell script sequentially

I have a sequence of Pig scripts in a file and I want to execute them sequentially from a shell script.
For example:
sh script.sh /it/provider/file_name PIGddl.txt
Suppose PIGddl.txt has Pig scripts such as:
Record count
Null validation, etc.
If all the Pig queries are in one file, how do I execute the Pig scripts from a shell script?
The idea below works, but if you want a conditional flow (for example: if script 1 succeeds, run script 2, otherwise run script 3), you may want to go with Oozie for running and scheduling the jobs.
#!/bin/sh
x=1
while [ $x -le 3 ]
do
echo "pig_dcnt$x.pig will be run"
pig -f /home/Scripts/PigScripts/pig_dcnt$x.pig --param timestamp=$timestamp1
x=$(( $x + 1 ))
done
I haven't tested this but I'm pretty sure this will work fine.
Let's assume you have two Pig files that you want to run using a shell script; you would then write a shell script file with the following:
#!/bin/bash
pig
exec pig_script_file1.pig
exec pig_script_file2.pig
so when you run this shell script, it will first execute the pig command and drop into the Grunt shell, where it will then execute your Pig files in the order you have mentioned.
Update:
The above solution doesn't work. Please refer to the one below, which has been tested.
Update your script file with the following so that it runs your Pig files in the order you have defined:
#!/bin/bash
pig pig_script_file1.pig
pig pig_script_file2.pig
Here is what you have to do:
1. Keep the xxx.pig file at some location.
2. To execute the Pig script from the shell, use the command below:
pig -p xx=date -p xyz=value -f /path/xxx.pig
-p passes a parameter into the script (use one -p per argument you need to pass); -f executes the Pig code from the .pig file.
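If the Pig scripts really are listed in a single text file, as in the PIGddl.txt example above, a minimal sketch of a driver script could look like the following; the parameter name input is only an assumption about what the scripts expect:

#!/bin/sh
# Usage: sh script.sh /it/provider/file_name PIGddl.txt
input_path=$1
script_list=$2
# run every Pig script listed in the file (one per line), stopping at the first failure
while IFS= read -r pig_script
do
    echo "Running $pig_script"
    pig -p input="$input_path" -f "$pig_script" || { echo "$pig_script failed"; exit 1; }
done < "$script_list"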

cron not working with hadoop command in shell script

I'm trying to schedule a cron job using crontab to execute a shell script that runs a list of hadoop commands sequentially, but when I look at the Hadoop folder, the folders are not being created or dropped. The Hadoop connectivity on our cluster is pretty slow, so these hadoop commands might take some time to execute due to the number of retries.
Cron expression
*/5 * * * * sh /test1/a/bin/ice.sh >> /test1/a/run.log
shell script
#!/bin/sh
if [ $# == 1 ]
then
TODAY=$1
else
TODAY=`/bin/date +%m%d%Y%H%M%S`
fi
# define seed folder here
#filelist = "ls /test1/a/seeds/"
#for file in $filelist
for file in `/bin/ls /test1/a/seeds/`
do
echo $file
echo $TODAY
INBOUND="hadoop fs -put /test1/a/seeds/$file /apps/hdmi-set/inbound/$file.$TODAY/$file"
echo $INBOUND
$INBOUND
SEEDDONE="hadoop fs -put /test1/a/seedDone /apps/hdmi-set/inbound/$file.$TODAY/seedDone"
echo $SEEDDONE
$SEEDDONE
done
echo "hadoop Inbound folders created for job1 ..."
Since no output has been captured that could be used for debugging, I can only speculate.
But from past experience, one of the common reasons Hadoop jobs fail when they are spawned through scripts is that HADOOP_HOME is not available when these commands are executed.
That is usually not the case when working directly from the terminal. Try adding the following to both ".bashrc" and ".bash_profile" or ".profile":
export HADOOP_HOME=/usr/lib/hadoop
You may have to change the path based on your specific installation.
And yes, as the comment says, don't redirect only standard output to the file; redirect standard error as well.
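A sketch of both suggestions combined; the HADOOP_HOME path is the example value from above and may differ on your installation:

# crontab entry: capture standard error as well as standard output
*/5 * * * * sh /test1/a/bin/ice.sh >> /test1/a/run.log 2>&1

# at the top of ice.sh: set the Hadoop environment explicitly,
# since cron does not read your interactive shell profile
export HADOOP_HOME=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_HOME/bin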

QSUB a process for every file in a directory?

I've been using
qsub -t 1-90000 do_stuff.sh
to submit my tasks on a Sun GridEngine cluster, but now find myself with data sets (super large ones, too) which are not so conveniently named. What's the best way to go about this? I could try to rename them all, but the names contain information which needs to be preserved, and this obviously introduces a host of problems. I could just preprocess everything into jsons, but if there's a way to just qsub -all_contents_of_directory, that would be ideal.
Am I SOL? Should I just go to the directory in question and find . -exec 'qsub setupscript.sh {}'?
Use another script to submit the job. Here's an example I used where I want the directory name in the job name; "run_openfoam" is the PBS script in the particular directory.
#!/bin/bash
cd $1
qsub -N $1 run_openfoam
You can adapt this script to suit your job and then run it through a loop on the command line, so rather than submitting a job array, you submit one job for each directory name passed as the first parameter to this script.
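For example, a hypothetical loop that submits one job per sub-directory, assuming the three-line wrapper above is saved as submit_dir.sh:

for d in */
do
    ./submit_dir.sh "${d%/}"   # strip the trailing slash so -N gets a clean job name
done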
I tend to use Makefiles to automate this stuff:
INPUTFILES=$(wildcard *.in)
OUTPUTFILES=$(patsubst %.in,%.out,$(INPUTFILES))
all : $(OUTPUTFILES)
%.out : %.in
	@echo "mycommand here < $< > $@" | qsub
Then type 'make', and all files will be submitted to qsub. Of course, this will submit everything all at once, which may do unfortunate things to your compute cluster and your sysadmin's blood pressure.
If you remove the "| qsub", the output of make is a list of commands to run. Feed that list into one or more qsub commands, and you'll get an increase in efficiency and a reduction in qsub jobs. I've been using GNU parallel for that, but it needs a qsub that blocks until the job is done. I wrote a wrapper that does that, but it calls qstat a lot, which means a lot of hitting on the system. I should modify it somehow, but there aren't a lot of computationally 'good' options here.
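A rough sketch of that two-step idea, assuming a Grid Engine installation whose qsub supports the -sync y flag (which makes qsub block until the job finishes):

make > commands.txt                                      # with "| qsub" removed, make just prints the commands
parallel -j 8 'echo {} | qsub -sync y' < commands.txt    # submit at most 8 at a time, each blocking its slot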
I cannot understand the "-t 1-90000" in your qsub command; my search of the qsub manual doesn't show such a "-t" option.
Create a file with a list of the datasets in it:
find . -print > ~/list_of_datasets
Then have the job script pick its dataset from that list by task ID:
#!/bin/bash
# each array task runs setupscript.sh on the line of the list matching its task id
exec ~/setupscript.sh $(sed -n -e "${SGE_TASK_ID}p" < ~/list_of_datasets)
Finally, submit one array task per line of the list:
qsub -t 1-$(wc -l < ~/list_of_datasets) job_script
