AvroStorage - output file name definition - hadoop

I use AvroStorage to store a result set from Pig. Is there a way to store the data into one specific Avro file, e.g. OutputFileGen1? Pig is storing the data into a directory named OutputFileGen1 with the structure listed below:
ls -al OutputFileGen1/
total 20
drwxr-xr-x 2 root root 4096 2016-01-18 14:35 .
drwxr-xr-x 6 root root 4096 2016-01-19 10:27 ..
-rw-r--r-- 1 root root 4083 2016-01-18 14:35 part-m-00000.avro
-rw-r--r-- 1 root root 40 2016-01-18 14:35 .part-m-00000.avro.crc
-rw-r--r-- 1 root root 0 2016-01-18 14:35 _SUCCESS
-rw-r--r-- 1 root root 8 2016-01-18 14:35 ._SUCCESS.crc
Thank you

The number of part files in the Pig output directory depends on how many parallel tasks your job runs. Here you have only one file: part-m-00000.avro.
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features
But maybe you want a single file on purpose. In that case I suggest using the hadoop fs -getmerge <src dir> <local dst> command to pull the merged file onto the local file system so you can use the data it contains.
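For example, a minimal sketch (the output directory name comes from the question; the local target file name merged.avro is an assumption):
hadoop fs -getmerge OutputFileGen1 ./merged.avro
Keep in mind that getmerge simply concatenates the part files, so with several Avro part files the result is generally not a valid single Avro container; with just one part-m-00000.avro it is effectively a copy of that file.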

Related

Where is the temp output data of map or reduce tasks

With MapReduce v2, the output data that comes out of a map or reduce task is saved to the local disk or to HDFS only when all the tasks have finished.
Since tasks end at different times, I was expecting the data to be written as each task finishes. For example, task 0 finishes, so its output is written while task 1 and task 2 are still running. Then task 2 finishes and its output is written while task 1 is still running. Finally, task 1 finishes and the last output is written. But this is not what happens: the outputs only appear on the local disk or in HDFS when all the tasks have finished.
I want to access the task output as the data is being produced. Where is the output data before all the tasks finish?
Update
After setting these parameters in mapred-site.xml
<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>
and these parameters in hdfs-site.xml
<property> <name>dfs.name.dir</name> <value>/tmp/data/dfs/name/</value> </property>
<property> <name>dfs.data.dir</name> <value>/tmp/data/dfs/data/</value> </property>
And this value in core-site.xml
<property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-temp</value> </property>
I still can't find where the intermediate output or the final output is saved while it is being produced by the tasks.
I have listed all the directories with hdfs dfs -ls -R /, and in the tmp directory I found only the job configuration files.
drwx------ - root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
-rw-r--r-- 10 root supergroup 112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
-rw-r--r-- 10 root supergroup 6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
-rw-r--r-- 1 root supergroup 797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
-rw-r--r-- 1 root supergroup 88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
-rw-r--r-- 1 root supergroup 439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
-rw-r--r-- 1 root supergroup 105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml
Where is the output saved? I am talking about the output that is stored as it is being produced by the tasks, not the final output that appears once all the map or reduce tasks have finished.
The output of a task is in <output dir>/_temporary/1/_temporary.
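For instance, while a job is still running you can usually list the in-progress attempt files there (the output path /user/root/wordcount-out is only an assumption for illustration):
hdfs dfs -ls -R /user/root/wordcount-out/_temporary/
The output committer moves these attempt files to the final output directory when the job commits, which is why the part files only show up once everything has finished.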
The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files are automatically cleared out when the MapReduce job execution completes. If you delete these temporary files, it can affect currently running MapReduce jobs.
Answer from this stackoverflow link:
It's not a good practice to depend on temporary files, whose location and format can change anytime between releases.
Anyway, setting mapreduce.task.files.preserve.failedtasks to true will keep the temporary files of all failed tasks, and setting mapreduce.task.files.preserve.filepattern to a regex matching a task ID will keep the temporary files of the matching tasks regardless of success or failure.
There is some more information in the same post.
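For illustration, the pattern can be narrowed to a single job's tasks by reusing the job ID from the listing above (this exact regex is an assumption, written in the same single-line property style as the question):
<property><name>mapreduce.task.files.preserve.filepattern</name><value>.*_1470912033891_0002.*</value></property>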

hive script file not found exception

I am running the command below. The file is in my local directory, but I am getting the following error while running it.
[hdfs#ip-xxx-xxx-xx-xx scripts]$ ls -lrt
total 28
-rwxrwxrwx. 1 root root 17 Apr 1 15:53 hive.hive
-rwxrwxrwx 1 hdfs hadoop 88 May 7 11:53 shell_fun
-rwxrwxrwx 1 hdfs hadoop 262 May 7 12:23 first_hive
-rwxrwxrwx 1 root root 88 May 7 16:59 311_cust_shell
-rwxrwxrwx 1 root root 822 May 8 20:29 script_1
-rw-r--r-- 1 hdfs hadoop 31 May 8 20:30 script_1.log
-rwxrwxrwx 1 hdfs hdfs 64 May 8 22:07 hql2.sql
[hdfs#ip-xxx-xxx-xx-xx scripts]$ hive -f hql2.sql
WARNING: Use "yarn jar" to launch YARN applications.
Logging initialized using configuration in file:/etc/hive/2.3.4.0-3485/0/hive-log4j.properties
Could not open input file for reading. (File file:/home/ec2-user/scripts/hive/scripts/hql2.sql does not exist)
[hdfs#ip-xxx-xxx-xx-xx scripts]$
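One quick check, not a definitive fix: the error shows Hive resolving the script under /home/ec2-user/scripts/hive/scripts/, which is not necessarily the directory the ls above was run from, so passing the script by absolute path may help (the use of $(pwd) assumes the shell is currently in the directory that contains hql2.sql):
hive -f "$(pwd)/hql2.sql"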

Switch a disk containing cloudera hadoop / hdfs / hbase data

We have a Cloudera 5 installation based on a single node on a single server. Before adding 2 additional nodes to the cluster, we want to increase the size of the partition using a fresh new disk.
We have the following services installed:
yarn with 1 NodeManager 1 JobHistory and 1 ResourceManager
hdfs with 1 datanode 1 primary node and 1 secondary node
hbase with 1 master and 1 regionserver
zookeeper with 1 server
All the data currently lives on a single partition, mounted on the folder /dfs. The amount of data being collected has increased, so we need to use another disk to store all the information.
The working partition is:
df -h
hadoop-dfs-partition
119G 9.8G 103G 9% /dfs
df -i
hadoop-dfs-partition
7872512 18098 7854414 1% /dfs
the content of this folder is the following:
drwxr-xr-x 11 root root 4096 May 8 2014 dfs
drwx------. 2 root root 16384 May 7 2014 lost+found
drwxr-xr-x 5 root root 4096 May 8 2014 yarn
under dfs there are these folders:
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 dn
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 dn1
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 dn2
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 nn
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 nn1
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 nn2
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 snn
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 snn1
drwx------ 3 hdfs hadoop 4096 Feb 23 18:14 snn2
under yarn there are these folders:
drwxr-xr-x 9 yarn hadoop 4096 Nov 9 15:46 nm
drwxr-xr-x 9 yarn hadoop 4096 Nov 9 15:46 nm1
drwxr-xr-x 9 yarn hadoop 4096 Nov 9 15:46 nm2
How can we achieve this? I have only found ways to migrate data between clusters with the distcp command, and no way to move the raw data.
Is stopping all services and shutting down the entire cluster, then running
cp -Rp /dfs/* /dfs-new/
a viable option? (/dfs-new is the folder where the fresh new ext4 partition of the new disk is mounted.)
Is there a better way of doing this?
Thank you in advance
I've resolved it this way:
stop all services but hdfs
export the data out of HDFS. In my case the interesting part was in HBase:
su - hdfs
hdfs dfs -ls /
this command shows me the following data:
drwxr-xr-x - hbase hbase 0 2015-02-26 20:40 /hbase
drwxr-xr-x - hdfs supergroup 0 2015-02-26 19:58 /tmp
drwxr-xr-x - hdfs supergroup 0 2015-02-26 19:38 /user
hdfs dfs -copyToLocal / /a_backup_folder/
to export all data from hdfs to a normal file system
press Ctrl-D to return to the root user
stop ALL services on Cloudera (hdfs included)
now you can unmount the "old" and the "new" partition.
mount the "new" partition at the path of the "old" one (in my case /dfs)
mount the "old" partition in a new place, in my case /dfs-old (remember to mkdir /dfs-old first); this way you can still check the old structure
make this change permanent by editing /etc/fstab. Check that everything is correct by repeating the mount checks (a sketch of the mount commands follows the df -h check below), then try
mount -a
df -h
to check if you have /dfs and /dfs-old mapped on the proper partitions (the "new" and the "old" one respectively)
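For reference, a minimal sketch of the unmount/remount sequence (the device names /dev/sdb1 for the new disk and /dev/sda3 for the old one are assumptions; use the devices reported by your own blkid or fdisk -l):
umount /dfs
mkdir /dfs-old
mount /dev/sda3 /dfs-old    # old partition, kept reachable for checking
mount /dev/sdb1 /dfs        # new, empty ext4 partition takes over the original path
then update the corresponding entries in /etc/fstab and re-run mount -a and df -h as above.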
format the namenode by going into
services > hdfs > namenode > action format namenode
in my case, after running
ls -l /dfs/dfs
I have:
drwx------ 4 hdfs hadoop 4096 Feb 26 20:39 nn
drwx------ 4 hdfs hadoop 4096 Feb 26 20:39 nn1
drwx------ 4 hdfs hadoop 4096 Feb 26 20:39 nn2
start the hdfs service on Cloudera
you should have new folders:
ls -l /dfs/dfs
I have:
drwx------ 3 hdfs hadoop 4096 Feb 26 20:39 dn
drwx------ 3 hdfs hadoop 4096 Feb 26 20:39 dn1
drwx------ 3 hdfs hadoop 4096 Feb 26 20:39 dn2
drwx------ 4 hdfs hadoop 4096 Feb 26 20:39 nn
drwx------ 4 hdfs hadoop 4096 Feb 26 20:39 nn1
drwx------ 4 hdfs hadoop 4096 Feb 26 20:39 nn2
drwx------ 3 hdfs hadoop 4096 Feb 26 20:39 snn
drwx------ 3 hdfs hadoop 4096 Feb 26 20:39 snn1
drwx------ 3 hdfs hadoop 4096 Feb 26 20:39 snn2
now copy the data back into the new partition:
hdfs dfs -copyFromLocal /a_backup_folder/user/* /user
hdfs dfs -copyFromLocal /a_backup_folder/tmp/* /tmp
hdfs dfs -copyFromLocal /a_backup_folder/hbase/* /hbase
The hbase folder needs to have the proper ownership, hbase:hbase as user:group:
hdfs dfs -chown -R hbase:hbase /hbase
If you forget this step you will get permission denied errors in the HBase log later.
check the result with
hdfs dfs -ls /hbase
you should see something like this:
drwxr-xr-x - hbase hbase 0 2015-02-26 20:40 /hbase/.tmp
drwxr-xr-x - hbase hbase 0 2015-02-26 20:40 /hbase/WALs
drwxr-xr-x - hbase hbase 0 2015-02-27 11:38 /hbase/archive
drwxr-xr-x - hbase hbase 0 2015-02-25 15:18 /hbase/corrupt
drwxr-xr-x - hbase hbase 0 2015-02-25 15:18 /hbase/data
-rw-r--r-- 3 hbase hbase 42 2015-02-25 15:18 /hbase/hbase.id
-rw-r--r-- 3 hbase hbase 7 2015-02-25 15:18 /hbase/hbase.version
drwxr-xr-x - hbase hbase 0 2015-02-27 11:42 /hbase/oldWALs
(the important part here is that files and folders have the proper user and group)
now start all services and check if hbase is working with
hbase shell
list
you should see all the tables you had before migration. Try with
count 'a_table_name'

How to put a file to hdfs with secondary group?

I have a local file
-rw-r--r-- 1 me developers 102445154 Oct 22 10:02 file1.csv
which I'm attempting to put to hdfs:
/usr/bin/hdfs dfs -put ./file1.csv hdfs://000.00.00.00/user/me/
which works fine, but the group is wrong
-rw-r--r-- 3 me me 102445154 2013-10-22 10:23 hdfs://000.00.00.00/user/file1.csv
How do I get the group developers to carry over?
Use the chgrp command on the file.
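For example, assuming the destination used in the -put above:
hdfs dfs -chgrp developers /user/me/file1.csv
Note that HDFS does not take the group from the local file on -put; a new file inherits the group of its parent directory (the BSD rule), so changing the group of /user/me itself would make future puts land in developers as well.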

unable to load hadoop fs

I have installed Hadoop on Ubuntu 4.4.3. I have followed all the steps written here. When I ran a hadoop fs -ls command, I got the following output.
hduser#ubuntu:/usr/local/hadoop/sbin$ hadoop fs -ls /
Found 26 items
drwx------ - root root 16384 2010-04-04 05:08 /lost+found
drwxr-xr-x - root root 4096 2012-08-25 09:12 /bin
drwxr-xr-x - root root 4096 2009-10-28 13:55 /srv
-rw-r--r-- 1 root root 7986235 2012-08-25 09:29 /initrd.img
dr-xr-xr-x - root root 0 2013-09-01 15:57 /proc
drwx------ - root root 4096 2013-09-01 11:04 /root
drwxrwxrwx - root root 4096 2012-08-26 05:12 /opt
drwxr-xr-x - root root 4096 2010-04-04 05:29 /mnt
drwxr-xr-x - root root 4096 2009-10-28 13:55 /usr
drwxr-xr-x - root root 4096 2010-04-04 05:09 /cdrom
drwxr-xr-x - root root 0 2013-09-01 15:57 /sys
drwxr-xr-x - hduser hadoop 4096 2013-08-25 03:47 /app
drwxr-xr-x - root root 4096 2010-11-24 10:50 /var
-rw-r--r-- 1 root root 4050496 2012-07-25 09:53 /vmlinuz
-rw-r--r-- 1 root root 3890400 2009-10-16 11:03 /vmlinuz.old
drwxr-xr-x - root root 4096 2010-11-27 08:37 /.cache
drwxr-xr-x - root root 4096 2013-09-01 22:26 /media
-rw-r--r-- 1 root root 7233695 2012-08-25 08:53 /initrd.img.old
drwxr-xr-x - root root 12288 2013-09-01 22:46 /etc
drwxr-xr-x - root root 4096 2013-08-25 03:30 /home
drwxr-xr-x - root root 3980 2013-09-01 15:57 /dev
drwxr-xr-x - root root 12288 2012-08-25 22:07 /lib
drwxrwxrwt - root root 4096 2013-09-01 23:53 /tmp
drwxr-xr-x - root root 4096 2012-08-25 09:29 /boot
drwxr-xr-x - root root 4096 2009-10-19 16:05 /selinux
drwxr-xr-x - root root 4096 2012-08-25 09:04 /sbin
When I run the same command in our office lab, I don't get this output.
Can anyone tell me where I am going wrong?
Try bin/hadoop fs -ls /. The scripts should be present inside the bin folder. Have you followed the link you have shown properly? I don't find sbin anywhere in it. Could you please point me to it, if I am wrong?
The reason is that you did not configure your core-site.xml correctly with the property fs.default.name. When you don't configure it, or configure it incorrectly, the file system defaults to your local file system, so the command is quite rightly listing the root of your local file system.
Please check your core-site.xml carefully; also note that you need to start the DFS before using HDFS.
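For reference, a minimal fs.default.name entry for a single-node setup might look like this (the localhost:9000 address is an assumption; use your NameNode's host and port):
<property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property>
On Hadoop 2.x you can check which file system the client resolves with:
hdfs getconf -confKey fs.default.name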
