Where is the temp output data of map or reduce tasks - hadoop

With MapReduce v2, the output data produced by a map or reduce task is saved to the local disk or to HDFS only when all the tasks have finished.
Since tasks end at different times, I was expecting the data to be written as each task finishes. For example, task 0 finishes and its output is written while tasks 1 and 2 are still running; then task 2 finishes and its output is written while task 1 is still running; finally, task 1 finishes and the last output is written. But this does not happen: the outputs only appear on the local disk or in HDFS when all the tasks finish.
I want to access the task output as the data is being produced. Where is the output data before all the tasks finish?
Update
I have set these parameters in mapred-site.xml:
<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>
and these parameters in hdfs-site.xml:
<property> <name>dfs.name.dir</name> <value>/tmp/data/dfs/name/</value> </property>
<property> <name>dfs.data.dir</name> <value>/tmp/data/dfs/data/</value> </property>
and this value in core-site.xml:
<property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-temp</value> </property>
but I still can't find where the intermediate or final output is saved while the tasks are producing it.
I have listed all the directories with hdfs dfs -ls -R /, and in the tmp directory I only found the job configuration files.
drwx------ - root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
-rw-r--r-- 10 root supergroup 112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
-rw-r--r-- 10 root supergroup 6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
-rw-r--r-- 1 root supergroup 797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
-rw-r--r-- 1 root supergroup 88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
-rw-r--r-- 1 root supergroup 439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
-rw-r--r-- 1 root supergroup 105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml
Where is the output saved? I mean the output that is stored as it is being produced by the tasks, not the final output that appears when all the map or reduce tasks finish.

The output of a task is in <output dir>/_temporary/1/_temporary.
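For example, a minimal sketch assuming a hypothetical job output directory /user/root/out: while the job is still running, you can list the in-flight task attempt files with
hdfs dfs -ls -R /user/root/out/_temporary
In-progress attempts show up under paths like /user/root/out/_temporary/1/_temporary/attempt_1470912033891_0002_r_000000_0/part-r-00000 (the exact layout depends on the Hadoop version and the output committer); these files are only moved into the output directory itself when the job commits.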

The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files are automatically cleared out when the MapReduce job execution completes. If you delete these temporary files, it can affect currently running MapReduce jobs.

Answer from this stackoverflow link:
It's not a good practice to depend on temporary files, whose location and format can change anytime between releases.
Anyway, setting mapreduce.task.files.preserve.failedtasks to true will keep the temporary files for all failed tasks, and setting mapreduce.task.files.preserve.filepattern to a regex of the task ID will keep the temporary files matching that pattern irrespective of task success or failure.
There is some more information in the same post.
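If the file pattern is treated as a Java regular expression, a bare * (as in the configuration above) is not a valid pattern; a hedged sketch of a mapred-site.xml entry that matches every task would be:
<property><name>mapreduce.task.files.preserve.filepattern</name><value>.*</value></property>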

Related

Too many small files HDFS Sink Flume

agent.sinks=hpd
agent.sinks.hpd.type=hdfs
agent.sinks.hpd.channel=memoryChannel
agent.sinks.hpd.hdfs.path=hdfs://master:9000/user/hduser/gde
agent.sinks.hpd.hdfs.fileType=DataStream
agent.sinks.hpd.hdfs.writeFormat=Text
agent.sinks.hpd.hdfs.rollSize=0
agent.sinks.hpd.hdfs.batchSize=1000
agent.sinks.hpd.hdfs.fileSuffix=.i
agent.sinks.hpd.hdfs.rollCount=1000
agent.sinks.hpd.hdfs.rollInterval=0
I'm trying to use the HDFS sink to write events to HDFS. I have tried size-, count-, and time-based rolling, but none works as expected. It is generating too many small files in HDFS, like:
-rw-r--r-- 2 hduser supergroup 11617 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832879.i
-rw-r--r-- 2 hduser supergroup 1381 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832880.i
-rw-r--r-- 2 hduser supergroup 553 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832881.i
-rw-r--r-- 2 hduser supergroup 2212 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832882.i
-rw-r--r-- 2 hduser supergroup 1379 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832883.i
-rw-r--r-- 2 hduser supergroup 2762 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832884.i.tmp
Please help me resolve this problem. I'm using Flume 1.6.0.
~Thanks
The configuration I provided was correct. The reason behind this behavior was HDFS: I had 2 data nodes, one of which was down, so files were not reaching the minimum required replication. In the Flume logs you can also see the following warning message:
"Block Under-replication detected. Rotating file."
To fix this problem, you can choose either of the solutions below:
Bring the data node back up to achieve the required replication of blocks, or
Set the hdfs.minBlockReplicas property accordingly (see the sketch below).
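A minimal sketch of the second option, reusing the sink name from the question and assuming a single healthy data node:
agent.sinks.hpd.hdfs.minBlockReplicas=1
With hdfs.minBlockReplicas set to 1, the sink no longer rotates the file every time it sees fewer block replicas than the cluster's replication factor.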
~Thanks
You are currently rolling the files every 1000 events. You can try either of the two methods mentioned below.
Increase hdfs.rollCount to a much higher value; this value decides the number of events contained in each rolled file.
Remove hdfs.rollCount and set hdfs.rollInterval to the interval at which you want to roll your files, e.g. hdfs.rollInterval = 600 to roll a file every 10 minutes (see the sketch below).
For more information, refer to the Flume documentation.
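For example, a minimal sketch of the second method, reusing the agent and sink names from the question (the interval is an assumption to tune for your load):
agent.sinks.hpd.hdfs.rollCount=0
agent.sinks.hpd.hdfs.rollSize=0
agent.sinks.hpd.hdfs.rollInterval=600
With rollCount and rollSize set to 0, only the time-based trigger remains, so each file stays open for roughly 10 minutes before it is rolled.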

AvroStorage - output file name definition

I use AvroStorage to store a result set from Pig. Is there a way to store the data into one specific Avro file, e.g. OutputFileGen1? Pig is storing the data into a directory named OutputFileGen1 with the structure listed below:
ls -al OutputFileGen1/
total 20
drwxr-xr-x 2 root root 4096 2016-01-18 14:35 .
drwxr-xr-x 6 root root 4096 2016-01-19 10:27 ..
-rw-r--r-- 1 root root 4083 2016-01-18 14:35 part-m-00000.avro
-rw-r--r-- 1 root root 40 2016-01-18 14:35 .part-m-00000.avro.crc
-rw-r--r-- 1 root root 0 2016-01-18 14:35 _SUCCESS
-rw-r--r-- 1 root root 8 2016-01-18 14:35 ._SUCCESS.crc
Thank you
The number of part files in the Pig output directory depends on how many parallel tasks your job runs. Here you have only one file: part-m-00000.avro.
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features
But maybe you want a single file on purpose; in that case, I suggest using the hadoop fs -getmerge <src dir> <local dst> command to pull the merged file down to the local file system so you can use the data it contains.
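A minimal sketch with the directory name from the question (adjust the paths to your setup):
hadoop fs -getmerge OutputFileGen1 ./OutputFileGen1.avro
Since there is only one part file here, the merged local file is effectively a copy of part-m-00000.avro; note that concatenating several Avro part files this way would not produce a valid single Avro container.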

Two copies of each file being copied from local to HDFS

I am using fs.copyFromLocalFile(local path, Hdfs dest path) in my program.
I delete the destination path on HDFS every time before copying files from the local machine. But after copying the files from the local path and running MapReduce on them, two copies of each file show up, so the word count doubles.
To be clear, my local path is "Home/user/desktop/input/" and the HDFS destination path is "/input".
When I check the HDFS destination path, i.e. the folder on which MapReduce was applied, this is the result:
hduser#rallapalli-Lenovo-G580:~$ hdfs dfs -ls /input
14/03/30 08:30:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt~
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt~
When I provide the input as a single file, Home/user/desktop/input/1.txt, there is no problem and only that single file is copied. But specifying the directory creates a problem.
Manually placing each file in the HDFS destination through the command line also creates no problem.
I am not sure if I am missing some simple file-system logic, but it would be great if anyone could suggest where I am going wrong.
I am using Hadoop 2.2.0.
I have tried deleting the local temporary files and made sure the text files are not open. I am looking for a way to avoid copying the temporary files.
Thanks in advance.
The files /input/1.txt~ and /input/2.txt~ are temporary backup files created by the text editor you are using on your machine. You can press Ctrl + H to show all hidden files in your local directory and delete them.
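A minimal sketch of cleaning them up before the copy, assuming the local input directory from the question:
find ~/Desktop/input -name '*~' -delete
This deletes every editor backup file (names ending in ~) under that directory so they never reach HDFS.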

How to put a file to hdfs with secondary group?

I have a local file
-rw-r--r-- 1 me developers 102445154 Oct 22 10:02 file1.csv
which I'm attempting to put to hdfs:
/usr/bin/hdfs dfs -put ./file1.csv hdfs://000.00.00.00/user/me/
which works fine, but the group is wrong
-rw-r--r-- 3 me me 102445154 2013-10-22 10:23 hdfs://000.00.00.00/user/file1.csv
How do I get the group developers to come with?
Use the hdfs dfs -chgrp command on the file after uploading it.
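For example, with the destination from the put command in the question:
/usr/bin/hdfs dfs -chgrp developers hdfs://000.00.00.00/user/me/file1.csv
Add -R to change the group recursively. Alternatively, chgrp the parent directory (here /user/me) to developers; in HDFS, newly created files inherit the group of their parent directory, so future puts would get the right group automatically.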

Target already exists error in hadoop put command

I am trying my hand at Hadoop 1.0. I am getting a "Target already exists" error while copying a file from the local system into HDFS.
My hadoop command and its output are as follows:
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt .
Warning: $HADOOP_HOME is deprecated.
put: Target already exists
Looking at the output, there are two blank spaces between the words 'Target' and 'already'. I think something like /user/${user} should appear between those two words. If I give the destination path explicitly as /user/shekhar, then I get the following error:
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt
Warning: $HADOOP_HOME is deprecated.
put: java.io.FileNotFoundException: Parent path is not a directory: /user/shekhar
The output of the ls command is as follows:
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -lsr /
Warning: $HADOOP_HOME is deprecated.
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred
drwx------ - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred/system
-rw------- 1 shekhar supergroup 4 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred/system/jobtracker.info
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /user
-rw-r--r-- 1 shekhar supergroup 6541526 2012-02-21 19:56 /user/shekhar
Please help me copy the file into HDFS. If you need any other information, please let me know.
I am trying this on Ubuntu, which was installed using Wubi (the Windows installer for Ubuntu).
Thanks in Advance !
The problem in the put command is the trailing '.'. You need to specify the full path on HDFS where you want the file to go, for example:
hadoop fs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /whatever/20111201.txt
If the directory that you are putting the file in doesn't exist yet, you need to create it first:
hadoop fs -mkdir /whatever
The problem you are having when you specify the path explicitly is that on your system, /user/shekhar is a file, not a directory. You can see that because it has a non-zero size.
-rw-r--r-- 1 shekhar supergroup 6541526 2012-02-21 19:56 /user/shekhar
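A minimal sketch of the cleanup, assuming whatever was uploaded to /user/shekhar as a file is no longer needed:
hadoop dfs -rm /user/shekhar
hadoop dfs -mkdir /user/shekhar
hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt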
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt
You must create the target directory first:
hdfs dfs -mkdir /user/hadoop
hdfs dfs -put /home/bigdata/.password /user/hadoop/
