Moving a command in the makefile makes it not work - makefile

I wrote a makefile to run Hadoop on Ubuntu. When inputs follows run: as a prerequisite, it works. But if I move it down below hdfs dfs -rm -f -r $(EXAMPLE_DIR), it fails with the error message:
make: inputs: Command not found.
I am new to Ubuntu, and I could not fix the problem by searching because this error has too many possible causes. The makefile is shown below; I have marked the part that confuses me.
EXAMPLE_DIR = /user/$(USER)/matmult-dense/
INPUT_DIR = $(EXAMPLE_DIR)/input
OUTPUT_DIR = $(EXAMPLE_DIR)/output
OUTPUT_FILE = $(OUTPUT_DIR)/part-00000
HADOOP_VERSION = 2.6.0
# I normally use HADOOP_HOME; to avoid modifying the original makefile, I set HADOOP_PREFIX here
HADOOP_PREFIX = /usr/local/hadoop
TOOLLIBS_DIR=$(HADOOP_PREFIX)/share/hadoop/tools/lib/
# Hi, start here
run: inputs
	hdfs dfs -rm -f -r $(EXAMPLE_DIR)
# Hi, end here. If I swap these two lines (inputs and the rm command), the error appears
	hadoop jar $(TOOLLIBS_DIR)/hadoop-streaming-$(HADOOP_VERSION).jar \
	  -files ./map1.py,./reduce1.py \
	  -mapper ./map1.py \
	  -reducer ./reduce1.py \
	  -input $(INPUT_DIR) \
	  -output $(OUTPUT_DIR) \
	  -numReduceTasks 1 \
	  -jobconf stream.num.map.output.key.fields=5 \
	  -jobconf stream.map.output.field.separator='\t' \
	  -jobconf mapreduce.partition.keypartitioner.options=-k1,3
	hdfs dfs -rm $(INPUT_DIR)/file01
	hdfs dfs -mv $(OUTPUT_FILE) $(INPUT_DIR)/file01
	hdfs dfs -rm -f -r $(OUTPUT_DIR)
	hadoop jar $(TOOLLIBS_DIR)/hadoop-streaming-$(HADOOP_VERSION).jar \
	  -files ./map2.py,./reduce2.py \
	  -mapper ./map2.py \
	  -reducer ./reduce2.py \
	  -input $(INPUT_DIR) \
	  -output $(OUTPUT_DIR) \
	  -numReduceTasks 1 \
	  -jobconf stream.num.map.output.key.fields=2 \
	  -jobconf stream.map.output.field.separator='\t'
	hdfs dfs -cat $(OUTPUT_FILE)

directories:
	hdfs dfs -test -e $(EXAMPLE_DIR) || hdfs dfs -mkdir $(EXAMPLE_DIR)
	hdfs dfs -test -e $(INPUT_DIR) || hdfs dfs -mkdir $(INPUT_DIR)
	hdfs dfs -test -e $(OUTPUT_DIR) || hdfs dfs -mkdir $(OUTPUT_DIR)

inputs: directories
	hdfs dfs -test -e $(INPUT_DIR)/file01 \
	  || hdfs dfs -put matrices $(INPUT_DIR)/file01

clean:
	hdfs dfs -rm -f -r $(INPUT_DIR)
	hdfs dfs -rm -f -r $(OUTPUT_DIR)
	hdfs dfs -rm -r -f $(EXAMPLE_DIR)
	hdfs dfs -rm -f matrices
.PHONY: run

You seem to be badly confused about how makefiles work. You should start with simple ones before you attempt complex ones.
If a makefile contains a rule like this:
foo: bar
	kleb
then foo is a target (usually the name of a file that Make can build), bar is another target, and kleb is a command, which is to be executed by the shell. If you swap bar and kleb, you will probably get an error, because kleb is probably not a target that Make knows how to build, and bar is probably not a command that the shell knows how to execute.
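If the goal is to run the inputs target after the hdfs dfs -rm command, one option (a sketch, not from the original answer) is to invoke it as a sub-make inside the recipe, since every recipe line must be a shell command rather than a target name:

run:
	hdfs dfs -rm -f -r $(EXAMPLE_DIR)
	$(MAKE) inputs    # runs the inputs target (and its directories prerequisite) after the rm
	# ... rest of the original recipe (the two hadoop jar pipelines) unchanged ...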

Related

How to pass multiple input directories to a hadoop command using a loop

To run a script using hadoop streaming, I use a bash script that looks like this:
IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
-input $IP1 -input $IP2 \
-output $OP
How do I generalize this to a case where I have 20 input directories? One approach is to specify it as
-input $IP1 -input $IP2 -input $IP3 ... -input $IP20
But I would want to know if we can use the shell variables and loops/arrays to get it done like this:
IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
IP_CMD=$IP_CMD"-input $"$ip" "
done
IP_ARRAY=($IP_CMD)
hadoop jar $HADOOP_JAR_PATH \
-file $MAPPER_FILE -mapper "$PY $MAPPER_FILE" \
"${IP_ARRAY[#]}"
-output $OP
When I try this, I get an Input path does not exist: hdfs://... error.
FULL COMMAND THAT I AM USING AS IS...
IP1="/data/hdp/f1/part-*"
IP2="/data/hdp/f2/part-*"
OP="/data/hdp/op"
MAPPER_FILE="map_code.py"
REDUCER="reduce_code.py"
IP_LIST=${!IP*}
IP_CMD=''
for ip in $IP_LIST
do
IP_CMD=$IP_CMD"-input $"$ip" "
done
hadoop fs -rm -r -skipTrash $OP
cmd="hadoop jar $HADOOP_JAR_PATH \
-D mapred.reduce.tasks=00 \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=\
org.apache.hadoop.io.compress.GzipCodec \
-file $MAPPER_FILE\
-file $REDUCER \
-mapper $PY $MAPPER_FILE\
-reducer $PY $REDUCER\
-output $OP -cacheFile $DC#ref\
$IP_CMD"
eval $cmd
You could build the whole command as a string and, once it is finished, run it with the eval command.
In your example: append the rest of the command to IP_CMD and then use eval $IP_CMD.
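A minimal sketch of that approach, assuming the IP1/IP2/..., $HADOOP_JAR_PATH, $PY, $MAPPER_FILE and $OP variables are set as in the question; the ${!ip} indirect expansion is used here instead of the literal "$"$ip string trick:
IP_LIST=${!IP*}                    # names of all set variables starting with IP (IP1, IP2, ...)
IP_CMD=''
for ip in $IP_LIST; do
    IP_CMD="$IP_CMD -input ${!ip}" # append -input <value of IP1>, -input <value of IP2>, ...
done
cmd="hadoop jar $HADOOP_JAR_PATH \
    -file $MAPPER_FILE -mapper \"$PY $MAPPER_FILE\" \
    $IP_CMD \
    -output $OP"
eval "$cmd"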

Hadoop streaming flat-files to gzip

I've been trying to gzip files (pipe separated csv) in hadoop using the hadoop-streaming.jar. I've found the following thread on stackoverflow:
Hadoop: compress file in HDFS?
and I tried both solutions (cat/cut for the mapper). Although I end up with a gzipped file in HDFS, it now has a tab character at the end of each line. Any ideas on how to get rid of these tabs? The tab at the end is messing up my last column.
I've tried the following two commands (in lots of flavours):
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapred.reduce.tasks=0 \
-input <filename> \
-output <output-path> \
-mapper "cut -f 2"
and
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-Dmapred.reduce.tasks=0 \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input <filename> \
-output <output-path> \
-mapper /bin/cat \
-inputformat org.apache.hadoop.mapred.TextInputFormat \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
I know that mapreduce outputs a tab-separated key-value pair, but "cut -f 2" (I also tried "cut -f 2 -d,") should only return the value part, not the tab. So why does every line end with a tab?
I hope someone can enlighten me.
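A guess at the cause, not a confirmed fix: TextOutputFormat writes key, separator, value for every record, and in a map-only streaming job the part after the first tab is often empty, so the separator itself shows up as a trailing tab. Setting the text-output separator to an empty string may suppress it (property name assumed for Hadoop 2.x; older releases used mapred.textoutputformat.separator):
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -Dmapreduce.output.textoutputformat.separator= \
    -input <filename> \
    -output <output-path> \
    -mapper "cut -f 2"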

Why won't my shell script with hadoop work?

#!/usr/bin/env bash
echo textFile :"$1"
echo mapper : "$2"
echo reducer: "$3"
echo inputDir :"$4"
echo outputDir: "$5"
hdfs dfs -ls ~
hdfs dfs -rm ~/"$2"
hdfs dfs -rm ~/"$3"
hdfs dfs -copyFromLocal "$2" ~ # copies mapper.py file from argument to hdfs dir
hdfs dfs -copyFromLocal "$3" ~ # copies reducer.py file from argument to hdfs dir
hdfs dfs -test -d ~/"$5" #checks to see if hadoop output dir exists
if [ $? == '0' ]; then
hdfs dfs -rm -r ~/"$5"
else
echo "Output file doesn't exist and will be created when hadoop runs"
fi
hdfs dfs -test -d ~/"$4" #checks to see if hadoop input dir exists
if [ $? == 0 ]; then
hdfs dfs -rm -r ~/"$4"
echo "Hadoop input dir alread exists deleting it now and creating a new one..."
hdfs dfs -mkdir ~/"$4" # makes an input dir for text file to be put in
else
echo "Input file doesn't exist will be created now"
hdfs dfs -mkdir ~/"$4" # makes an input dir for text file to be put in
fi
hdfs dfs -copyFromLocal /home/hduser/"$1" ~/"$4" # sends textfile from local to hdfs folder
# runs the hadoop mapreduce program with given parameters
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.2.jar \
-input /home/hduser/"$4"/* \
-output /home/hduser/"$5" \
-file /home/hduser/"$2" \
-mapper /home/hduser/"$2" \
-file /home/hduser/"$3" \
-reducer /home/hduser/"$3"
I wanted to avoid typing out all the commands to run a simple mapreduce job every time I test my mapper and reducer files, so I wrote this script. I am new to shell scripting. I have attached the screenshots.
Two obvious details you should correct:
The operator for equality spells '=' not '==' (actually this is true for test expressions).
Your long command line for the hadoop call is spread across several lines; you need to concatenate these into a single (long) line or, better, indicate continuation by ending each line with a backslash "\".
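A sketch of the output-directory check with those fixes applied, using the same positional parameters as the script above:
hdfs dfs -test -d ~/"$5"     # checks whether the hadoop output dir exists
if [ "$?" = 0 ]; then        # '=' (not '==') for equality in test expressions
    hdfs dfs -rm -r ~/"$5"
else
    echo "Output dir doesn't exist and will be created when hadoop runs"
fi
# equivalently, the exit status can be tested directly:
# if hdfs dfs -test -d ~/"$5"; then hdfs dfs -rm -r ~/"$5"; fi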

How to copy the output of -text HDFS command into another file?

Is there any way we can copy the text content of an HDFS file into another file system using an HDFS command:
hadoop fs -text /user/dir1/abc.txt
Can I print the output of -text into another file by using -cat or any other method?
hadoop fs -cat /user/deepak/dir1/abc.txt
As described in the documentation, you can use hadoop fs -cp to copy files within HDFS. You can use hadoop fs -copyToLocal to copy files from HDFS to the local file system. If you want to copy files from one HDFS cluster to another, use the DistCp tool.
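For example (paths and NameNode addresses below are placeholders, not taken from the original answer):
hadoop fs -cp /user/dir1/abc.txt /user/dir2/abc.txt                 # copy within HDFS
hadoop fs -copyToLocal /user/dir1/abc.txt /tmp/abc.txt              # copy from HDFS to the local file system
hadoop distcp hdfs://nn1:8020/user/dir1 hdfs://nn2:8020/user/dir1   # copy between two HDFS clusters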
As a general command-line tip, you can pipe the output with | to another program, or redirect it with > or >> to a file, e.g.
# Will output to standard output (console) and the file /my/local/file
# this will overwrite the file, use ... tee -a ... to append
hdfs dfs -text /path/to/file | tee /my/local/file
# Will redirect output to some other command
hdfs dfs -text /path/to/file | some-other-command
# Will overwrite /my/local/file
hdfs dfs -text /path/to/file > /my/local/file
# Will append to /my/local/file
hdfs dfs -text /path/to/file >> /my/local/file
Thank you. I used the streaming jar example in the hadoop-home lib folder as follows:
hadoop jar hadoop-streaming.jar -input hdfs://namenode:port/path/to/sequencefile \
-output /path/to/newfile -mapper "/bin/cat" -reducer "/bin/cat" \
-file "/bin/cat" -file "/bin/cat" \
-inputformat SequenceFileAsTextInputFormat
you can use "/bin/wc" in case you would like to count the number of lines at the hdfs sequence file.
You can use the following:
copyToLocal
hadoop dfs -copyToLocal /HDFS/file /user/deepak/dir1/abc.txt
getmerge
hadoop dfs -getmerge /HDFS/file /user/deepak/dir1/abc.txt
get
hadoop dfs -get /HDFS/file /user/deepak/dir1/abc.txt

Hadoop Streaming Problems

I ran into these issues while using Hadoop Streaming. I'm writing the code in Python.
1) Aggregate library package
According to the hadoop streaming docs ( http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29 ), there is an inbuilt Aggregate class which can work both as a mapper and a reducer.
Here is the command:
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -combiner aggregate -reducer NONE -input input_files -output output_path
Executing this command fails the mapper with this error:
java.io.IOException: Cannot run program "aggregate": java.io.IOException: error=2, No such file or directory
However, if you run this command using aggregate as the reducer and not the combiner, the job works fine.
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -reducer aggregate -input input_files -output output_path
Does this mean I cannot use the aggregate class as the combiner?
2) Cannot use | as a separator for the generic options
This is an example command from the above link
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12
You cannot use | as an argument for map.output.key.field.separator. The error is
-D: command not found
11/08/03 10:48:02 ERROR streaming.StreamJob: Missing required options: input, output
(Update) You have to escape the | with a \ like this:
-D stream.map.output.field.separator=\|
3) Cannot specify the -D options at the end of the command, as in the example. The error is
-D: command not found
11/08/03 10:50:23 ERROR streaming.StreamJob: Unrecognized option: -D
Is the documentation flawed, or am I doing something wrong?
Any insight into what I'm doing wrong is appreciated. Thanks.
This question was asked 3 years ago, but I still ran into the problem with the -D option today, so I will add a little information for other people who have the same problem.
According to the manual of hadoop streaming:
bin/hadoop command [genericOptions] [commandOptions]
-D is a generic option, so you have to put it before any other options.
So in this case, the command should look like:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
