Hadoop Distcp: Input size is greater than output size

I am copying a folder from one path to another, basically creating a backup.
The source (input) folder size is 5 TB. I use the following distcp command to copy it:
hadoop distcp -m 150 <source_folder_path> <destination_folder_path>
To check the sizes and file counts, I use:
hadoop fs -du -s -h source_folder
hadoop fs -du -s -h destination_folder
hadoop fs -ls source_folder | wc -l
hadoop fs -ls destination_folder | wc -l
This is within the same cluster.
I cannot understand why my input folder is 5 TB while the output folder is only 1 TB. The job completes successfully without any error.
I also see that the number of files is the same in the input and output.
I don't use compression or anything else in the process. Can someone point out why this happens?
Hadoop version is 2.7
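One way to narrow this down (a hedged diagnostic sketch; the placeholder paths are the same as in the question) is to compare the logical content sizes and the largest per-file sizes on both sides, rather than only the rolled-up -du -s totals:
# DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME for each side
hadoop fs -count <source_folder_path>
hadoop fs -count <destination_folder_path>
# Largest files on each side, to spot which ones shrank
hadoop fs -du <source_folder_path> | sort -nr | head -20
hadoop fs -du <destination_folder_path> | sort -nr | head -20
If the per-file listings match, the copied data itself is intact and the discrepancy lies in how the two totals are accounted rather than in the files.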

Related

hadoop fs -cat and hadoop fs -text used to count the file length, but the results are not equal

I want to count the lines of an HDFS file, so I use
hadoop fs -cat /path/part-00982.lzo_deflate | wc -l, and the result is 311424,
while with
hadoop fs -text /path/part-00982.lzo_deflate | wc -l, the result is 2099305.
Why does the same file give two totally different results? Is there any difference between cat and text?
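A quick way to see what is happening (a hedged check that reuses only the commands above) is to compare byte counts instead of line counts; -cat streams the file's raw bytes, while -text tries to decode known formats (such as LZO, when the codec is installed) before printing, so the two pipelines are counting different data:
hadoop fs -cat /path/part-00982.lzo_deflate | wc -c
hadoop fs -text /path/part-00982.lzo_deflate | wc -c
The -text byte count should be noticeably larger, matching the decompressed content.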

HDFS Offline Image Viewer takes hours

When I execute these commands, they are very slow and always take five hours or more to finish.
hdfs dfsadmin -fetchImage ${t_save_fsimage_path}
# Get the full path of the downloaded fsimage file
t_fsimage_file=`ls ${t_save_fsimage_path}/fsimage*`
# Convert the fsimage into a readable CSV file
hdfs oiv -i ${t_fsimage_file} -o ${t_save_fsimage_path}/fsimage.csv -p Delimited
# Drop the header (first) line of fsimage.csv
sed -i -e "1d" ${t_save_fsimage_path}/fsimage.csv
# Create the data directory
hadoop fs -test -e ${t_save_fsimage_path}/fsimage || hdfs dfs -mkdir -p ${t_save_fsimage_path}/fsimage
# Copy fsimage.csv to the target path
hdfs dfs -copyFromLocal -f ${t_save_fsimage_path}/fsimage.csv ${t_save_fsimage_path}/fsimage/
The below helped, especially when dealing with a large fsimage:
Setting the Java heap size:
export HADOOP_OPTS="-Xmx55G"
Using the -t or --temp option to use a temporary directory (instead of memory) to cache intermediate results, e.g.: hdfs oiv -i fsimage_example -o fsimage_example.csv -p Delimited -delimiter "|" --temp /tmp/fsimage_example
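Putting the two tips together (a sketch only; the heap size and temp path are example values, and --temp requires a Hadoop release whose Delimited processor supports it):
export HADOOP_OPTS="-Xmx55G"
hdfs oiv -i ${t_fsimage_file} -o ${t_save_fsimage_path}/fsimage.csv -p Delimited -delimiter "|" --temp /tmp/fsimage_tmp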
You could also analyze the fsimage programmatically using the HFSA lib or the HFSA CLI tool (depending on your use case).

Rebuild Accumulo after namenode crash corrupts root block

Our development HDP cluster had a power outage that corrupted some HDFS system blocks used by Accumulo; now the cluster is in safe mode and Ambari won't restart.
Being a DEV box, HDFS has a replication factor of 1, so I can't restore the corrupted blocks.
What is the best way to cleanly rebuild Accumulo, restore the HDFS filesystem, and bring the HDP cluster back up? There is no user data in Accumulo to save, so a wipe and reinitialise would be fine in this case. I'm just not sure of the best way to do this.
Some corruption details:
hdfs fsck / | egrep -v '^\.+$' | grep -v replica | grep -v Replica| grep "^\/" | grep "CORRUPT" | sed 's/: CORRUPT.*//' | grep -v "^$"
output is:
Connecting to namenode via http://xyz.fakedomain.com:50070/fsck?ugi=andrew&path=%2F
/apps/accumulo/data/tables/!0/table_info/A000133q.rf
/apps/accumulo/data/tables/+r/root_tablet/A000133t.rf
/apps/accumulo/data/tables/1/default_tablet/F000133r.rf
/user/accumulo/.Trash/Current/apps/accumulo/data/tables/+r/root_tablet/delete+A000133t.rf+F000133s.rf
Cluster details are:
Hortonworks HDP-2.4.0.0-169
Accumulo 1.7.0.2.4
YARN 2.7.1.2.4
First find the bad blocks with:
hdfs fsck / | egrep -v '^\.+$' | grep -v eplica
then delete the file(s) containing the corrupt block(s) with:
hdfs dfs -rm -skipTrash /some/path/to/files
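A shorter way to get the same list of corrupt paths, assuming your Hadoop release supports the flag, is:
hdfs fsck / -list-corruptfileblocks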
As the HDFS user, run the following:
hdfs dfsadmin -safemode leave
hdfs dfs -rm -R -skipTrash hdfs://servername:8020/apps/accumulo
hadoop fs -mkdir -p /apps/accumulo
hadoop fs -chmod -R 700 /apps/accumulo
hadoop fs -chown -R accumulo:accumulo /apps/accumulo
From Ambari, restart Accumulo to initialise it, or run:
/usr/hdp/current/accumulo-client/bin/accumulo init
and then start with
/usr/hdp/current/accumulo-client/bin/start-all.sh
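As an optional sanity check afterwards (not part of the original answer, just a hedged sketch), confirm that HDFS no longer reports corruption and that the freshly initialised instance responds:
hdfs fsck /apps/accumulo
/usr/hdp/current/accumulo-client/bin/accumulo shell -u root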

How to copy the output of -text HDFS command into another file?

Is there any way to copy the text content of an HDFS file into another file system using an HDFS command:
hadoop fs -text /user/dir1/abc.txt
Can I send the output of -text into another file, using -cat or any other method?:
hadoop fs -cat /user/deepak/dir1/abc.txt
As described in the documentation, you can use hadoop fs -cp to copy files within HDFS. You can use hadoop fs -copyToLocal to copy files from HDFS to the local file system. If you want to copy files from one HDFS cluster to another, use the DistCp tool.
As a general command-line tip, you can pipe the output (|) to another program, or redirect it with > or >> to a file, e.g.
# Will output to standard output (console) and the file /my/local/file
# this will overwrite the file, use ... tee -a ... to append
hdfs dfs -text /path/to/file | tee /my/local/file
# Will redirect output to some other command
hdfs dfs -text /path/to/file | some-other-command
# Will overwrite /my/local/file
hdfs dfs -text /path/to/file > /my/local/file
# Will append to /my/local/file
hdfs dfs -text /path/to/file >> /my/local/file
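If the decompressed output is large, a hedged variant of the redirect above is to recompress it on the way to the local disk:
# Will write a gzip-compressed copy to /my/local/file.gz
hdfs dfs -text /path/to/file | gzip > /my/local/file.gz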
Thank you. I used the streaming jar example in the hadoop-home lib folder as follows:
hadoop jar hadoop-streaming.jar -input hdfs://namenode:port/path/to/sequencefile \
-output /path/to/newfile -mapper "/bin/cat" -reducer "/bin/cat" \
-file "/bin/cat" \
-inputformat SequenceFileAsTextInputFormat
You can use "/bin/wc" if you would like to count the number of lines in the HDFS sequence file.
You can use any of the following:
copyToLocal
hadoop dfs -copyToLocal /HDFS/file /user/deepak/dir1/abc.txt
getmerge
hadoop dfs -getmerge /HDFS/file /user/deepak/dir1/abc.txt
get
hadoop dfs -get /HDFS/file /user/deepak/dir1/abc.txt
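getmerge is particularly handy when the HDFS path is a directory of part files; a hedged example (paths assumed):
hadoop fs -getmerge /HDFS/output_dir merged_local.txt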

View contents of a file in HDFS

Probably a noob question, but is there a way to read the contents of a file in HDFS besides copying it to the local file system and reading it through Unix tools?
So right now what I am doing is:
bin/hadoop dfs -copyToLocal hdfs/path local/path
nano local/path
I am wondering if I can open a file directly from HDFS rather than copying it locally and then opening it.
I believe hadoop fs -cat <file> should do the job.
If the file is huge (which will usually be the case), doing a plain 'cat' will flood your terminal with the entire content of the file. Instead, pipe the output and look at only a few lines.
To get the first 10 lines of the file, hadoop fs -cat 'file path' | head -10
To get the last 5 lines of the file, hadoop fs -cat 'file path' | tail -5
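There is also a built-in tail subcommand, which prints the last kilobyte of the file without streaming the whole thing through the pipe:
hadoop fs -tail 'file path'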
If you are using Hadoop 2.x, you can use
hdfs dfs -cat <file>
hadoop dfs -cat <filename> or hadoop dfs -cat <outputDirectory>/*
SSH onto your EMR cluster: ssh hadoop@emrClusterIpAddress -i yourPrivateKey.ppk
Run this command: /usr/lib/spark/bin/spark-shell --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=hdfs://yourEmrClusterIpAddress:8020/eventLogging --class org.apache.spark.examples.SparkPi --master yarn --jars /usr/lib/spark/examples/jars/spark-examples_2.11-2.4.0.jar
List the contents of the eventLogging directory, which should now have a new log file from the run we just did:
[hadoop@ip-1-2-3-4 bin]$ hdfs dfs -ls /eventLogging
Found 1 items
-rwxrwx--- 1 hadoop hadoop 53409 2019-05-21 20:56 /eventLogging/application_1557435401803_0106
Now, to view the file, run: hdfs dfs -cat /eventLogging/application_1557435401803_0106
Resources:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html
I usually use
$ hdfs dfs -cat <filename> | less
This also helps me to search for words to find what I'm interested in while looking at the contents.
For purposes where context is less relevant, such as checking whether a particular word exists in a file or counting word occurrences, I use:
$ hdfs dfs -cat <filename> | grep <search_word>
Note: grep also has a -C option for context, along with -A and -B for lines after/before the match.
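For example, to show two lines of context around each match (a small illustration of the options above):
hdfs dfs -cat <filename> | grep -C 2 <search_word>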
I was trying the commands above and they didn't work for me to read the file. But this did:
cat <filename>
For example:
cat data.txt
