#!/usr/bin/env bash
echo textFile :"$1"
echo mapper : "$2"
echo reducer: "$3"
echo inputDir :"$4"
echo outputDir: "$5"
hdfs dfs -ls ~
hdfs dfs -rm ~/"$2"
hdfs dfs -rm ~/"$3"
hdfs dfs -copyFromLocal "$2" ~ # copies mapper.py file from argument to hdfs dir
hdfs dfs -copyFromLocal "$3" ~ # copies reducer.py file from argument to hdfs dir
hdfs dfs -test -d ~/"$5" #checks to see if hadoop output dir exists
if [ $? == '0' ]; then
hdfs dfs -rm -r ~/"$5"
else
echo "Output file doesn't exist and will be created when hadoop runs"
fi
hdfs dfs -test -d ~/"$4" #checks to see if hadoop input dir exists
if [ $? == 0 ]; then
hdfs dfs -rm -r ~/"$4"
echo "Hadoop input dir alread exists deleting it now and creating a new one..."
hdfs dfs -mkdir ~/"$4" # makes an input dir for text file to be put in
else
echo "Input file doesn't exist will be created now"
hdfs dfs -mkdir ~/"$4" # makes an input dir for text file to be put in
fi
hdfs dfs -copyFromLocal /home/hduser/"$1" ~/"$4" # sends textfile from local to hdfs folder
# runs the hadoop mapreduce program with given parameters
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.2.jar \
-input /home/hduser/"$4"/* \
-output /home/hduser/"$5" \
-file /home/hduser/"$2" \
-mapper /home/hduser/"$2" \
-file /home/hduser/"$3" \
-reducer /home/hduser/"$3"
I wanted to avoid typing out all the commands to run a simple MapReduce job every time I want to test my mapper and reducer files, so I wrote this script. I am new to shell scripting. I have attached the screenshots.
Two obvious details you should correct:
The equality operator in bash is spelled '=', not '=='
(actually, this applies to test expressions such as [ ... ]).
Your long command line for the hadoop call is spread across several lines;
you need to concatenate these into a single (long) line or, better, indicate
continuation by ending each line with a backslash "\".
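For example, the output-directory check from your script could be written like this (a minimal sketch that keeps your positional parameter $5):
# the test expression uses a single '=' for string equality ('-eq' also works for integers)
hdfs dfs -test -d ~/"$5"
if [ "$?" = 0 ]; then
    hdfs dfs -rm -r ~/"$5"
fi
# or, more directly, branch on the command's exit status:
if hdfs dfs -test -d ~/"$5"; then
    hdfs dfs -rm -r ~/"$5"
fi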
When I execute these commands, they are very slow and always take five hours or more to finish.
hdfs dfsadmin -fetchImage ${t_save_fsimage_path}
# get the full path of the downloaded fsimage file
t_fsimage_file=`ls ${t_save_fsimage_path}/fsimage*`
# convert the fsimage into a readable CSV file
hdfs oiv -i ${t_fsimage_file} -o ${t_save_fsimage_path}/fsimage.csv -p Delimited
# delete the header line of fsimage.csv
sed -i -e "1d" ${t_save_fsimage_path}/fsimage.csv
# create the data directory
hadoop fs -test -e ${t_save_fsimage_path}/fsimage || hdfs dfs -mkdir -p ${t_save_fsimage_path}/fsimage
# copy fsimage.csv to the target path
hdfs dfs -copyFromLocal -f ${t_save_fsimage_path}/fsimage.csv ${t_save_fsimage_path}/fsimage/
The below helped, especially when dealing with a large fsimage:
Setting the Java heap size:
export HADOOP_OPTS="-Xmx55G"
Using the -t or --temp option to use a temporary directory (instead of memory) to cache intermediate results, e.g.: hdfs oiv -i fsimage_example -o fsimage_example.csv -p Delimited -delimiter "|" --temp /tmp/fsimage_example
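Put together, a run over a large fsimage could look roughly like this (a sketch combining the two suggestions above; it reuses the variables from your script, and /tmp/fsimage_oiv is just an arbitrary local scratch directory):
export HADOOP_OPTS="-Xmx55G"
hdfs oiv -i ${t_fsimage_file} \
    -o ${t_save_fsimage_path}/fsimage.csv \
    -p Delimited -delimiter "|" \
    --temp /tmp/fsimage_oiv   # spill intermediate data to local disk instead of RAM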
You could also analyze the fsimage programmatically using the HFSA lib or the HFSA CLI tool (depending on your use case).
I wrote a makefile to run Hadoop on Ubuntu. When inputs follows run: (on the same line), it works. But if I move it down below hdfs dfs -rm -f -r $(EXAMPLE_DIR), it fails and shows the error message:
make: inputs: Command not found. I am new to Ubuntu, and I could not figure out how to fix the problem by searching, because this error has too many possible causes. The makefile is shown below; I have marked the part that confuses me.
EXAMPLE_DIR = /user/$(USER)/matmult-dense/
INPUT_DIR = $(EXAMPLE_DIR)/input
OUTPUT_DIR = $(EXAMPLE_DIR)/output
OUTPUT_FILE = $(OUTPUT_DIR)/part-00000
HADOOP_VERSION = 2.6.0
# generally I use HADOOP_HOME; to avoid modifying the original makefile, I set up HADOOP_PREFIX here
HADOOP_PREFIX = /usr/local/hadoop
TOOLLIBS_DIR=$(HADOOP_PREFIX)/share/hadoop/tools/lib/
//Hi, start here
run: inputs
	hdfs dfs -rm -f -r $(EXAMPLE_DIR)
//Hi, end here. If I swap them, the error appears.
	hadoop jar $(TOOLLIBS_DIR)/hadoop-streaming-$(HADOOP_VERSION).jar \
	-files ./map1.py,./reduce1.py \
	-mapper ./map1.py \
	-reducer ./reduce1.py \
	-input $(INPUT_DIR) \
	-output $(OUTPUT_DIR) \
	-numReduceTasks 1 \
	-jobconf stream.num.map.output.key.fields=5 \
	-jobconf stream.map.output.field.separator='\t' \
	-jobconf mapreduce.partition.keypartitioner.options=-k1,3
	hdfs dfs -rm $(INPUT_DIR)/file01
	hdfs dfs -mv $(OUTPUT_FILE) $(INPUT_DIR)/file01
	hdfs dfs -rm -f -r $(OUTPUT_DIR)
	hadoop jar $(TOOLLIBS_DIR)/hadoop-streaming-$(HADOOP_VERSION).jar \
	-files ./map2.py,./reduce2.py \
	-mapper ./map2.py \
	-reducer ./reduce2.py \
	-input $(INPUT_DIR) \
	-output $(OUTPUT_DIR) \
	-numReduceTasks 1 \
	-jobconf stream.num.map.output.key.fields=2 \
	-jobconf stream.map.output.field.separator='\t'
	hdfs dfs -cat $(OUTPUT_FILE)
directories:
	hdfs dfs -test -e $(EXAMPLE_DIR) || hdfs dfs -mkdir $(EXAMPLE_DIR)
	hdfs dfs -test -e $(INPUT_DIR) || hdfs dfs -mkdir $(INPUT_DIR)
	hdfs dfs -test -e $(OUTPUT_DIR) || hdfs dfs -mkdir $(OUTPUT_DIR)
inputs: directories
	hdfs dfs -test -e $(INPUT_DIR)/file01 \
	|| hdfs dfs -put matrices $(INPUT_DIR)/file01
clean:
	hdfs dfs -rm -f -r $(INPUT_DIR)
	hdfs dfs -rm -f -r $(OUTPUT_DIR)
	hdfs dfs -rm -r -f $(EXAMPLE_DIR)
	hdfs dfs -rm -f matrices
.PHONY: run
You seem to be badly confused about how makefiles work. You should start with simple ones before you attempt complex ones.
If a makefile contains a rule like this:
foo: bar
	kleb
then foo is a target (usually the name of a file that Make can build), bar is another target, and kleb is a command, which is to be executed by the shell. If you swap bar and kleb, you will probably get an error, because kleb is probably not a target that Make knows how to build, and bar is probably not a command that the shell knows how to execute.
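Applied to your makefile, the working shape of the rule is the following (a minimal sketch; recipe lines must start with a tab):
# 'inputs' is a prerequisite target, so it belongs after the colon on the rule line.
run: inputs
	hdfs dfs -rm -f -r $(EXAMPLE_DIR)
# Moving 'inputs' below the rule line turns it into a recipe line, so Make hands it
# to the shell, and the shell reports "inputs: Command not found".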
I have written a simple bash script. The exact code is here.
ideone.com/8XQCjH
#!/bin/bash
if ! bzip2 -t "$file"
then
printf '%s is corrupted\n' "$file"
rm -f "$file"
#echo "$file" "is corrupted" >> corrupted.log
else
tar -xjvf "$file" -C ./uncompressed
rm -f "$file"
fi
Basically, it reads a compressed file, tests it, uncompresses it, and moves it to another directory.
How do I modify this code so that it reads files from an HDFS input directory instead and outputs to another HDFS output directory?
I have seen some examples here, though they involve reading the contents of the file. In my case, I am not interested in reading any contents.
http://www.oraclealchemist.com/news/tf-idf-hadoop-streaming-bash-part-1/
If anyone could write a Hadoop command that unzips files in HDFS, or a similar example, that would greatly help me.
Edit:
Try 1:
hadoop fs -get /input/temp.tar.bz2 | tar -xjv | hadoop fs -put - /output
Not good, as it moves the file into the native file system, uncompresses it, and puts it back into the output directory in HDFS.
Try 2:
I wrote a script uncompress.sh with just one line of code:
uncompress.sh
tar -xjv
hadoop jar contrib/streaming/hadoop-streaming.jar \
-numReduceTasks 0 \
-file /home/hadoop/uncompress.sh \
-input /input/temp.tar.bz2 \
-output /output \
-mapper uncompress.sh \
-verbose
However this gave the below error.
INFO mapreduce.Job: Task Id : attempt_1409019525368_0015_m_000002_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
Thanks
From the man page of bzip2:
-t --test
Check integrity of the specified file(s), but don't decompress them. This really performs a trial decompression and throws away the result
This means that there is no way to check the file without reading it. Also, if you are going to decompress the archive anyway once it is deemed valid, you should probably just decompress it directly.
That said, you can use
hadoop fs -cat hdfs://my_file_name | bzip2 -ct
to test the file and
tmpdir=`mktemp -d`
hadoop fs -cat hdfs://my_file_name | tar jxv -C "$tmpdir"
hadoop fs -copyFromLocal "$tmpdir"/ hdfs://dest_dir
to decompress it. There is no way to have tar write the files directly into HDFS. Hadoop streaming is meant to be used as "download the stuff you need, perform the job in a temp directory, upload it back".
That said, are you using Hadoop to decompress a large number of files, or do you want to parallelize the decompression of one single giant file? In the second case you have to write an ad-hoc program to split the input into multiple parts and decompress them; Hadoop will not automatically parallelize tasks for you. In the first case, you can use a script like this as the mapper:
#!/bin/bash
# read one archive path per input line, unpack it locally, then copy the result back to HDFS
while IFS= read -r filename ; do
tmpdir=`mktemp -d`
hadoop fs -cat "hdfs:/$filename" | tar jxv -C "$tmpdir"
hadoop fs -copyFromLocal "$tmpdir"/ "hdfs:/$filename".dir/
rm -rf "$tmpdir"
done
and as input you instead use a file with the list of the tar.bz2 files to decompress:
...
/path/my_file.tar.bz2
/path2/other_file.tar.bz2
....
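To run it, save the mapper script (the name uncompress_list.sh below is just for illustration), upload the list file to HDFS, and launch it as a map-only streaming job along the lines of your earlier attempt:
hadoop fs -put filelist.txt /input/filelist.txt
hadoop jar contrib/streaming/hadoop-streaming.jar \
    -numReduceTasks 0 \
    -file /home/hadoop/uncompress_list.sh \
    -mapper uncompress_list.sh \
    -input /input/filelist.txt \
    -output /output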
Is there any way we can copy the text content of an HDFS file into another file system using an HDFS command such as:
hadoop fs -text /user/dir1/abc.txt
Can I print the output of -text into another file by using -cat or any other method? For example:
hadoop fs -cat /user/deepak/dir1/abc.txt
As written in the documentation, you can use hadoop fs -cp to copy files within HDFS. You can use hadoop fs -copyToLocal to copy files from HDFS to the local file system. If you want to copy files from one HDFS cluster to another, use the DistCp tool.
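For example (the paths and cluster addresses below are only placeholders):
# copy within the same HDFS
hadoop fs -cp /user/deepak/dir1/abc.txt /user/deepak/dir2/abc.txt
# copy from HDFS to the local file system
hadoop fs -copyToLocal /user/deepak/dir1/abc.txt /tmp/abc.txt
# copy between two HDFS clusters
hadoop distcp hdfs://namenode1:8020/user/deepak/dir1 hdfs://namenode2:8020/user/deepak/dir1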
As a general command-line tip, you can pipe (|) to another program, or redirect with > or >> to a file, e.g.
# Will output to standard output (console) and the file /my/local/file
# this will overwrite the file, use ... tee -a ... to append
hdfs dfs -text /path/to/file | tee /my/local/file
# Will redirect output to some other command
hdfs dfs -text /path/to/file | some-other-command
# Will overwrite /my/local/file
hdfs dfs -text /path/to/file > /my/local/file
# Will append to /my/local/file
hdfs dfs -text /path/to/file >> /my/local/file
Thank you. I used the streaming jar example from the hadoop-home lib folder as follows:
hadoop jar hadoop-streaming.jar -input hdfs://namenode:port/path/to/sequencefile \
-output /path/to/newfile -mapper "/bin/cat" -reducer "/bin/cat" \
-file "/bin/cat" -file "/bin/cat" \
-inputformat SequenceFileAsTextInputFormat
you can use "/bin/wc" in case you would like to count the number of lines at the hdfs sequence file.
You can use the following:
copyToLocal
hadoop dfs -copyToLocal /HDFS/file /user/deepak/dir1/abc.txt
getmerge
hadoop dfs -getmerge /HDFS/file /user/deepak/dir1/abc.txt
get
hadoop dfs -get /HDFS/file /user/deepak/dir1/abc.txt
Probably a noob question, but is there a way to read the contents of a file in HDFS besides copying it to the local file system and reading it through Unix?
So right now what I am doing is:
bin/hadoop dfs -copyToLocal hdfs/path local/path
nano local/path
I am wondering if I can open a file directly from HDFS rather than copying it locally and then opening it.
I believe hadoop fs -cat <file> should do the job.
If the file size is huge (which will be the case most of the time), doing a 'cat' will blow up your terminal by dumping the entire content of the file. Instead, use piping to get only a few lines of the file.
To get the first 10 lines of the file, hadoop fs -cat 'file path' | head -10
To get the last 5 lines of the file, hadoop fs -cat 'file path' | tail -5
If you are using Hadoop 2.x, you can use
hdfs dfs -cat <file>
hadoop dfs -cat <filename> or hadoop dfs -cat <outputDirectory>/*
SSH onto your EMR cluster: ssh hadoop@emrClusterIpAddress -i yourPrivateKey.ppk
Run this command /usr/lib/spark/bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://yourEmrClusterIpAddress:8020/eventLogging --class org.apache.spark.examples.SparkPi --master yarn --jars /usr/lib/spark/examples/jars/spark-examples_2.11-2.4.0.jar
List the contents of the directory we just created, which should now have a new log file from the run we just did:
[hadoop@ip-1-2-3-4 bin]$ hdfs dfs -ls /eventLogging
Found 1 items
-rwxrwx--- 1 hadoop hadoop 53409 2019-05-21 20:56 /eventLogging/application_1557435401803_0106
Now to view the file run hdfs dfs -cat /eventLogging/application_1557435401803_0106
Resources:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
I usually use
$ hdfs dfs -cat <filename> | less
This also helps me to search for words to find what I'm interested in while looking at the contents.
For narrower purposes, like checking whether a particular word exists in a file or counting word occurrences, I use:
$ hdfs dfs -cat <filename> | grep <search_word>
Note: grep also has a -C option for context, with -A and -B for lines after/before the match.
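For example, to show two lines of context around each match:
$ hdfs dfs -cat <filename> | grep -C 2 <search_word>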
I was trying to figure out the above commands and they didn't work for me to read the file.
But this did,
cat <filename>
For example,
cat data.txt