I want to count the lines of an HDFS file, so I use
hadoop fs -cat /path/part-00982.lzo_deflate | wc -l
and the result is 311424, while when I use
hadoop fs -text /path/part-00982.lzo_deflate | wc -l
the result is 2099305.
Why does the same file give two totally different results? Is there any difference between cat and text?
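My current understanding of the two commands, written as comments below, is this; please correct me if it is wrong:
# -cat streams the file's raw bytes, so here wc -l is counting newline
# bytes that happen to occur inside the compressed data
hadoop fs -cat /path/part-00982.lzo_deflate | wc -l     # 311424
# -text detects the compression codec and decompresses before printing,
# so this counts the actual text lines
hadoop fs -text /path/part-00982.lzo_deflate | wc -l    # 2099305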
I am copying a folder from one path to another, basically creating a backup.
The source (input) folder size is 5 TB. I use the following distcp command to copy:
hadoop distcp -m 150 <source_folder_path> <destination_folder_path>
After the copy finishes, I check the sizes and file counts with:
hadoop fs -du -s -h source_folder
hadoop fs -du -s -h destination_folder
hadoop fs -ls source_folder | wc -l
hadoop fs -ls destination_folder | wc -l
This is within the same cluster.
I am unable to understand why my input folder is 5 TB while the output folder is only 1 TB. The job completes successfully without any error.
I also see that the number of files is the same in the input and output.
I don't use compression or anything else in the process. Can someone point out to me why this is happening?
Hadoop version is 2.7
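A cross-check along these lines might help narrow it down (the paths are placeholders; -ls on its own only lists the top level, while hdfs dfs -count reports directory count, file count, and content size in one line):
# hdfs dfs -count prints: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
hdfs dfs -count source_folder
hdfs dfs -count destination_folder
# -ls -R recurses into subdirectories; filtering out lines that start
# with 'd' leaves only files, so this counts every file in the tree
hdfs dfs -ls -R source_folder      | grep -v '^d' | wc -l
hdfs dfs -ls -R destination_folder | grep -v '^d' | wc -l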
How do I put a file into HDFS with a timestamp appended to the file name?
hadoop fs -put topic_2018-12-15%2016:31:15.csv /user/file_structure/
You're just running a shell command, so you can use command substitution.
For example, to rename the file so it carries a yyyy-MM-dd date:
hadoop fs -put \
'topic_2018-12-15%2016:31:15.csv' \
"/user/file_structure/topic_$(date +%Y-%m-%d).csv"
#!/usr/bin/env bash
echo textFile :"$1"
echo mapper : "$2"
echo reducer: "$3"
echo inputDir :"$4"
echo outputDir: "$5"
hdfs dfs -ls ~      # show the current contents of the HDFS home dir
hdfs dfs -rm ~/"$2" # remove any previously uploaded mapper
hdfs dfs -rm ~/"$3" # remove any previously uploaded reducer
hdfs dfs -copyFromLocal "$2" ~ # copies mapper.py file from argument to hdfs dir
hdfs dfs -copyFromLocal "$3" ~ # copies reducer.py file from argument to hdfs dir
hdfs dfs -test -d ~/"$5" #checks to see if hadoop output dir exists
if [ $? == '0' ]; then
hdfs dfs -rm -r ~/"$5"
else
echo "Output file doesn't exist and will be created when hadoop runs"
fi
hdfs dfs -test -d ~/"$4" #checks to see if hadoop input dir exists
if [ $? == 0 ]; then
hdfs dfs -rm -r ~/"$4"
echo "Hadoop input dir alread exists deleting it now and creating a new one..."
hdfs dfs -mkdir ~/"$4" # makes an input dir for text file to be put in
else
echo "Input file doesn't exist will be created now"
hdfs dfs -mkdir ~/"$4" # makes an input dir for text file to be put in
fi
hdfs dfs -copyFromLocal /home/hduser/"$1" ~/"$4" # sends textfile from local to hdfs folder
# runs the hadoop mapreduce program with given parameters
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.2.jar \
-input /home/hduser/"$4"/* \
-output /home/hduser/"$5" \
-file /home/hduser/"$2" \
-mapper /home/hduser/"$2" \
-file /home/hduser/"$3" \
-reducer /home/hduser/"$3"
I wanted to avoid typing out all of these commands to run a simple MapReduce job every time I want to test my mapper and reducer files, so I wrote this script. I am new to shell scripting. I attached the screenshots.
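For reference, this is roughly how I call it, assuming the script is saved as run_streaming.sh (the file and directory names are placeholders):
# arguments: textFile  mapper  reducer  inputDir  outputDir
./run_streaming.sh input.txt mapper.py reducer.py streaming_input streaming_output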
Two obvious details you should correct:
The operator for equals in bash spells '=' not '==' (this applies to test expressions, i.e. inside [ ]).
Your long command line for the hadoop call is spread across several lines; you need to either concatenate these into a single (long) line or, better, indicate continuation by ending each line with a backslash "\".
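As a sketch of the first point, using the output-dir check from the script, the comparison can use a single '=', or the if can consume the exit status of hdfs dfs -test directly:
# variant 1: single '=' inside the test expression
hdfs dfs -test -d ~/"$5"
if [ "$?" = '0' ]; then
    hdfs dfs -rm -r ~/"$5"
fi
# variant 2: let 'if' use the exit status of -test directly
if hdfs dfs -test -d ~/"$5"; then
    hdfs dfs -rm -r ~/"$5"
fi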
Probably a noob question, but is there a way to read the contents of a file in HDFS besides copying it to local and reading it through Unix tools?
So right now what I am doing is:
bin/hadoop dfs -copyToLocal hdfs/path local/path
nano local/path
I am wondering if I can open a file directly from HDFS rather than copying it locally and then opening it.
I believe hadoop fs -cat <file> should do the job.
If the file size is huge (which will be the case most of the time), doing a 'cat' will flood your terminal with the entire content of the file. Instead, use a pipe and fetch only a few lines of the file.
To get the first 10 lines of the file: hadoop fs -cat 'file path' | head -10
To get the last 5 lines of the file: hadoop fs -cat 'file path' | tail -5
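If you do this a lot, a tiny wrapper function keeps it convenient (hpeek is just a made-up name, and the path in the usage line is a placeholder; the second argument is the number of lines, defaulting to 10):
hpeek() {
    # print only the first N lines of an HDFS file
    hadoop fs -cat "$1" | head -n "${2:-10}"
}
# usage
hpeek /user/data/part-00000 20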
If you are using Hadoop 2.x, you can use
hdfs dfs -cat <file>
hadoop dfs -cat <filename> or hadoop dfs -cat <outputDirectory>/*
SSH onto your EMR cluster: ssh hadoop@emrClusterIpAddress -i yourPrivateKey.ppk
Create a directory in HDFS for the event logs: hdfs dfs -mkdir /eventLogging
Run this command: /usr/lib/spark/bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://yourEmrClusterIpAddress:8020/eventLogging --class org.apache.spark.examples.SparkPi --master yarn --jars /usr/lib/spark/examples/jars/spark-examples_2.11-2.4.0.jar
List the contents of the directory we just created, which should now contain a new log file from the run above:
[hadoop@ip-1-2-3-4 bin]$ hdfs dfs -ls /eventLogging
Found 1 items
-rwxrwx--- 1 hadoop hadoop 53409 2019-05-21 20:56 /eventLogging/application_1557435401803_0106
Now, to view the file, run: hdfs dfs -cat /eventLogging/application_1557435401803_0106
Resources:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
I usually use
$ hdfs dfs -cat <filename> | less
This also lets me search for words in the contents to find what I'm interested in.
For purposes where context matters less, like checking whether a particular word exists in a file or counting word occurrences, I use:
$ hdfs dfs -cat <filename> | grep <search_word>
Note: grep also has a -C option for context, and -A / -B for showing lines after/before the match.
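For example, to show three lines of context around each match (the file name and search word are placeholders):
hdfs dfs -cat <filename> | grep -C 3 <search_word>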
I was trying the commands above and they didn't work for me to read the file.
But this did:
cat <filename>
For example,
cat data.txt