Two copies of each file being copied from local to HDFS - hadoop

I am using fs.copyFromLocalFile(local path, HDFS dest path) in my program.
I delete the destination path on HDFS every time before copying the files from the local machine. But after copying the files from the local path and running MapReduce on them, two copies of each file appear in HDFS, so the word count doubles.
To be clear, my local path is "Home/user/desktop/input/" and the HDFS destination path is "/input".
When I check the HDFS destination path, i.e. the folder on which MapReduce was run, this is the result:
hduser#rallapalli-Lenovo-G580:~$ hdfs dfs -ls /input
14/03/30 08:30:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt
-rw-r--r-- 1 hduser supergroup 62 2014-03-30 08:28 /input/1.txt~
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt
-rw-r--r-- 1 hduser supergroup 21 2014-03-30 08:28 /input/2.txt~
When I provide a single file, Home/user/desktop/input/1.txt, as the input, there is no problem and only that file is copied. But specifying the directory causes the problem.
Placing each file in the HDFS destination manually through the command line also causes no problem.
I am not sure if I am missing some simple file-system logic, but it would be great if anyone could suggest where I am going wrong.
I am using Hadoop 2.2.0.
I have tried deleting the local temporary files and made sure the text files are not open. I am looking for a way to avoid copying the temporary files.
Thanks in advance.

The files /input/1.txt~ and /input/2.txt~ are temporary backup files created by the text editor you are using on your machine. Press Ctrl + H to see all hidden temporary files in your local directory and delete them.
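If you want to do the filtering in code instead of deleting the backups by hand, here is a minimal sketch (not your exact program; the local and HDFS paths are the ones from the question, adjusted to an absolute local path) that copies only regular files and skips anything ending in ~:

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWithoutBackups {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dest = new Path("/input");
        if (fs.exists(dest)) {
            fs.delete(dest, true);   // recreate the destination on every run
        }
        fs.mkdirs(dest);

        File localDir = new File("/home/user/Desktop/input");   // local path from the question
        File[] files = localDir.listFiles();
        if (files == null) {
            return;
        }
        for (File f : files) {
            if (f.isFile() && !f.getName().endsWith("~")) {   // skip editor backup files
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()), dest);
            }
        }
        fs.close();
    }
}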

Related

Where is the temp output data of map or reduce tasks

With MapReduce v2, the output data that comes out of a map or a reduce task is saved to the local disk or to HDFS when all the tasks finish.
Since tasks end at different times, I was expecting the data to be written as each task finishes. For example, task 0 finishes, so its output is written, while tasks 1 and 2 are still running. Then task 2 finishes and its output is written, while task 1 is still running. Finally, task 1 finishes and the last output is written. But this is not what happens: the outputs only appear on the local disk or in HDFS after all the tasks have finished.
I want to access the task output as the data is being produced. Where is the output data before all the tasks finish?
Update
After I have set these params in mapred-site.xml
<property><name>mapreduce.task.files.preserve.failedtasks</name><value>true</value></property>
<property><name>mapreduce.task.files.preserve.filepattern</name><value>*</value></property>
and these params in hdfs-site.xml
<property> <name>dfs.name.dir</name> <value>/tmp/data/dfs/name/</value> </property>
<property> <name>dfs.data.dir</name> <value>/tmp/data/dfs/data/</value> </property>
And this value in core-site.xml
<property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-temp</value> </property>
but I still can't find where the intermediate or final output is saved as it is produced by the tasks.
I have listed all directories with hdfs dfs -ls -R /, and in the tmp directory I have only found the job configuration files:
drwx------ - root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_STARTED
-rw-r--r-- 1 root supergroup 0 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/COMMIT_SUCCESS
-rw-r--r-- 10 root supergroup 112872 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.jar
-rw-r--r-- 10 root supergroup 6641 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.split
-rw-r--r-- 1 root supergroup 797 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.splitmetainfo
-rw-r--r-- 1 root supergroup 88675 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job.xml
-rw-r--r-- 1 root supergroup 439848 2016-08-11 16:17 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1.jhist
-rw-r--r-- 1 root supergroup 105176 2016-08-11 16:14 /tmp/hadoop-yarn/staging/root/.staging/job_1470912033891_0002/job_1470912033891_0002_1_conf.xml
Where is the output saved? I am talking about the output that is stored as it is being produced by the tasks, not the final output that appears when all map or reduce tasks finish.
The output of a task is in <output dir>/_temporary/1/_temporary.
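A minimal sketch, assuming the default FileOutputCommitter layout just described, that lists whatever task-attempt files currently exist under that _temporary directory while the job is still running (the output directory name below is a placeholder):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListInProgressOutput {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // e.g. /user/root/wordcount-out/_temporary/1/_temporary/attempt_.../part-m-00000
        Path tempRoot = new Path("/user/root/wordcount-out/_temporary");
        if (!fs.exists(tempRoot)) {
            System.out.println("No in-progress output yet (or the job already committed).");
            return;
        }
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(tempRoot, true);   // recursive
        while (it.hasNext()) {
            LocatedFileStatus status = it.next();
            System.out.println(status.getLen() + "\t" + status.getPath());
        }
        fs.close();
    }
}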
The HDFS /tmp directory is mainly used as temporary storage during MapReduce operations. MapReduce artifacts, intermediate data, etc. are kept under this directory. These files are cleared out automatically when the MapReduce job completes. If you delete these temporary files, you can affect currently running MapReduce jobs.
An answer from this Stack Overflow link:
It's not good practice to depend on temporary files, whose location and format can change at any time between releases.
Anyway, setting mapreduce.task.files.preserve.failedtasks to true will keep the temporary files for all failed tasks, and setting mapreduce.task.files.preserve.filepattern to a regex of the task ID will keep the temporary files matching that pattern regardless of task success or failure.
There is some more information in the same post.
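If you would rather set this per job than cluster-wide, a minimal sketch of doing the same thing in the job driver might look like this (the class and job names are placeholders, and .* is used as the keep-everything regex):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class PreserveTaskFilesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.task.files.preserve.failedtasks", true);
        // keep intermediate files for every task attempt, not just failed ones
        conf.set("mapreduce.task.files.preserve.filepattern", ".*");

        Job job = Job.getInstance(conf, "preserve-task-files-example");
        // ... set mapper, reducer, input and output paths as usual ...
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}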

Zero-length file in S3 folder possibly prevents accessing that folder with Hive?

I cannot access a folder on AWS S3 with Hive; presumably a zero-length file in that directory is the reason. A folder created in the AWS Management Console is a zero-byte object whose key ends with a slash, i.e. "folder_name/". I think Hive or Hadoop may have a bug in how they handle this folder scheme on S3.
Here is what I have done.
CREATE EXTERNAL TABLE is_data_original (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/logs/';
SELECT * FROM is_data_original LIMIT 10;
Failed with exception java.io.IOException:java.lang.NullPointerException
username#client:~$ hadoop fs -ls s3n://bucketname/logs/
Found 4 items
-rwxrwxrwx 1 0 2015-01-22 20:30 /logs/data
-rwxrwxrwx 1 8947 2015-02-27 18:57 /logs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-27 18:57 /logs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-27 18:57 /logs/data_2015-02-15.csv
hadoop fs -mkdir s3n://bucketname/copylogs/
hadoop fs -cp s3n://bucketname/logs/*.csv s3n://bucketname/copylogs/
username#client:~$ hadoop fs -ls s3n://bucketname/copylogs/
Found 3 items
-rwxrwxrwx 1 8947 2015-02-28 05:09 /copylogs/data_2015-02-13.csv
-rwxrwxrwx 1 7912 2015-02-28 05:09 /copylogs/data_2015-02-14.csv
-rwxrwxrwx 1 16786 2015-02-28 05:09 /copylogs/data_2015-02-15.csv
CREATE EXTERNAL TABLE is_data_copy (user_id STRING, action_name STRING, timestamp STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION 's3n://bucketname/copylogs/';
SELECT * FROM is_data_copy LIMIT 10;
The latter, after copying, works fine.
Both of the commands below work:
hadoop fs -cat s3n://bucketname/logs/data_2015-02-15.csv
hadoop fs -cat s3n://bucketname/copylogs/data_2015-02-15.csv
Versions: Hive 0.11.0 and Hadoop 1.0.3.
Is this some kind of bug? Is it related to AWS S3? Any ideas? I need to be able to read from the original location, because this is where the data keeps flowing.
I have no control over the processes that created the directory and placed the log files in there, so I cannot check anything on that end.
I carried out an experiment: I created a key/folder on S3 and placed a file in it in two different ways: using the AWS Management Console and using hadoop fs.
I can see a zero-byte file in the folder when I used the AWS Console, and I get a null-pointer exception when accessing it with Hive. With hadoop fs I don't have this problem. I assume the zero-byte file was supposed to be deleted, but it was not in the AWS Console case. I am sure that in my case the S3 folder was not created from the AWS Console, but possibly from Ruby or JavaScript.
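As a possible workaround while staying on Hive 0.11, here is a minimal sketch (the bucket name and path are placeholders, and the zero-length check is an assumption based on the experiment above) that finds, and can optionally delete, the zero-byte folder-marker objects before querying the location with Hive:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoveS3FolderMarkers {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey must be set in conf or core-site.xml
        FileSystem fs = FileSystem.get(URI.create("s3n://bucketname/"), conf);

        Path logs = new Path("s3n://bucketname/logs/");
        for (FileStatus status : fs.listStatus(logs)) {
            // A console-created "folder" shows up as a zero-length object in the listing.
            if (!status.isDir() && status.getLen() == 0) {   // isDir() keeps this Hadoop-1.x friendly
                System.out.println("Zero-byte marker: " + status.getPath());
                // fs.delete(status.getPath(), false);   // uncomment to remove it
            }
        }
        fs.close();
    }
}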
Seems like a Hive bug. Hive 0.12.0 does not have that problem.

Hadoop, Mapreduce - Cannot obtain block length for LocatedBlock

I have a file on HDFS at the path 'test/test.txt' which is 1.3 GB.
The output of the ls and du commands is:
hadoop fs -du test/test.txt -> 1379081672 test/test.txt
hadoop fs -ls test/test.txt ->
Found 1 items
-rw-r--r-- 3 testuser supergroup 1379081672 2014-05-06 20:27 test/test.txt
I want to run a MapReduce job on this file, but when I start the job it fails with the following error:
hadoop jar myjar.jar test.TestMapReduceDriver test output
14/05/29 16:42:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/05/29 16:42:03 INFO input.FileInputFormat: Total input paths to process : 1
14/05/29 16:42:03 INFO mapred.JobClient: Running job: job_201405271131_9661
14/05/29 16:42:04 INFO mapred.JobClient: map 0% reduce 0%
14/05/29 16:42:17 INFO mapred.JobClient: Task Id : attempt_201405271131_9661_m_000004_0, Status : FAILED
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-428948818-namenode-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode4:50010, datanode3:50010, datanode1:50010]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:319)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:263)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:205)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:198)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1117)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:249)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:746)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:83)
at org.apache.hadoop.mapred.Ma...
I tried the following commands:
hadoop fs -cat test/test.txt gives the following error:
cat: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode3:50010, datanode1:50010, datanode4:50010]}
Additionally, I can't copy the file; hadoop fs -cp test/test.txt tmp gives the same error:
cp: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode1:50010, datanode3:50010, datanode4:50010]}
Output of the hdfs fsck /user/testuser/test/test.txt command:
Connecting to namenode via `http://namenode:50070`
FSCK started by testuser (auth:SIMPLE) from /10.17.56.16 for path
/user/testuser/test/test.txt at Thu May 29 17:00:44 EEST 2014
Status: HEALTHY
Total size: 0 B (Total open files size: 1379081672 B)
Total dirs: 0
Total files: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 21)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu May 29 17:00:44 EEST 2014 in 0 milliseconds
The filesystem under path /user/testuser/test/test.txt is HEALTHY
By the way, I can see the content of the test.txt file from the web browser.
The Hadoop version is: Hadoop 2.0.0-cdh4.5.0
I ran into the same issue and fixed it with the following steps.
There were some files that had been opened by Flume but never closed (I am not sure about the cause in your case).
You need to find the names of the open files with the command:
hdfs fsck /directory/of/locked/files/ -files -openforwrite
You can try to recover the files with:
hdfs debug recoverLease -path <path-of-the-file> -retries 3
Or remove them with:
hdfs dfs -rmr <path-of-the-file>
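If you prefer to do the recovery programmatically, here is a minimal sketch of the same idea using DistributedFileSystem.recoverLease(), roughly what the recoverLease command above does (the file path is a placeholder):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverLease {
    public static void main(String[] args) throws IOException, InterruptedException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (!(fs instanceof DistributedFileSystem)) {
            throw new IllegalStateException("Default filesystem is not HDFS");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        Path stuck = new Path("/user/testuser/test/test.txt");
        for (int attempt = 1; attempt <= 3; attempt++) {
            boolean closed = dfs.recoverLease(stuck);   // ask the namenode to close the file
            System.out.println("Attempt " + attempt + ": lease recovered = " + closed);
            if (closed) {
                break;
            }
            Thread.sleep(5000);   // give the namenode time to finalize the last block
        }
        dfs.close();
    }
}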
I had the same error, but it was not due to a full-disk problem; I think it was rather the inverse: there were files and blocks referenced in the namenode that did not exist on any datanode.
Thus, hdfs dfs -ls shows the files, but any operation on them fails, e.g. hdfs dfs -copyToLocal.
In my case, the hard part was isolating which of the listed files were corrupted, as they lived in a tree with thousands of files. Oddly, hdfs fsck /path/to/files/ did not report any problems.
My solution was:
Isolate the location using copyToLocal which resulted in copyToLocal: Cannot obtain block length for LocatedBlock{BP-1918381527-10.74.2.77-1420822494740:blk_1120909039_47667041; getBlockSize()=1231; corrupt=false; offset=0; locs=[10.74.2.168:50010, 10.74.2.166:50010, 10.74.2.164:50010]} for several files
Get a list of the local directories using ls -1 > baddirs.out.
Get rid of the local files from the first copyToLocal.
Use for files in $(cat baddirs.out); do echo $files; hdfs dfs -copyToLocal $files; done. This will produce a list of directories checked, plus errors where bad files are found.
Get rid of the local files again, and now get a list of files from each affected subdirectory. Use that as input to a file-by-file copyToLocal, at which point you can echo each file as it is copied and see where the error occurs.
Use hdfs dfs -rm <file> for each such file.
Confirm you got them all by removing all local files again and running the original copyToLocal on the top-level directory where you had problems.
A simple two-hour process!
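A minimal sketch that automates the isolation step above, instead of the manual copyToLocal loop: it tries to open every file under a directory and reports the ones that fail with the block-length error (the root path is a placeholder; the delete is left commented out):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class FindUnreadableFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/path/to/files"), true);
        while (it.hasNext()) {
            Path p = it.next().getPath();
            try (FSDataInputStream in = fs.open(p)) {
                in.read();   // opening alone usually triggers the block-length check
            } catch (IOException e) {
                System.out.println("UNREADABLE: " + p + " -> " + e.getMessage());
                // fs.delete(p, false);   // uncomment once you are sure you want them gone
            }
        }
        fs.close();
    }
}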
You have some corrupted files with no blocks on the datanodes but entries in the namenode. It's best to follow this:
https://stackoverflow.com/a/19216037/812906
According to this, the error may be caused by a full-disk problem. I came across the same problem recently with an old file, and checking my server metrics, it was indeed a full-disk problem during the creation of that file. Most solutions just tell you to delete the file and pray it doesn't happen again.

Loading new files using Pig LOAD statement

I want to load data from HDFS into an HBase table using a Pig script.
I have an HDFS folder structure as below:
-rw-r--r-- 1 user supergroup 63 2014-05-15 20:28 dataparse/good/goodrec_051520142028
-rw-r--r-- 1 user supergroup 72 2014-05-15 20:30 dataparse/good/goodrec_051520142030
-rw-r--r-- 1 user supergroup 110 2014-05-15 20:32 dataparse/good/goodrec_051520142032
In the above, every filename has a timestamp appended.
Below is my Pig script to load from HDFS into HBase:
G = LOAD '/user/user/dataparse/good/' USING PigStorage(',') as (c1:chararray, c2:chararray,c3:chararray,c4:chararray,c5:chararray);
STORE G INTO 'hbase://test' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('t1:name t1:state t1:phone_no t1:gender');
The script works fine and the data from all 3 files is written to the HBase "test" table.
Suppose that after some time more files arrive in HDFS with the same structure. When I run the Pig script it will LOAD all the files in the "good" directory, including the files that have already been read. How can I load only the new files? Files that have already been loaded should not be loaded into my HBase table again.
How can I do this?
Thanks,
Sapthashree
I think you have a few options here.
Using globs
Using a shell script, pick up the "new" files and use the glob feature so that multiple files can be fed into the script. A related use case is here.
If the files have a date and timestamp in the filename, then you can use globs directly; look here for inspiration (see the sketch after this answer).
Using big guns
If globs are failing you, then you need to bring out the big guns: use a custom load function, put the logic to identify "new files" in it, and you should be good to go. Details here.
You need some scheduling mechanism through which the Pig job runs from time to time. In that process you can process only the files that were not processed earlier, by keeping track of the timestamps and file names (or any other field).
See here for more information: Execute Pig from within Java Application.
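A minimal sketch of the bookkeeping part of that idea: after a successful Pig run, move the inputs into an archive directory so the next scheduled run only sees unprocessed files (the directory names are placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveProcessedFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path incoming = new Path("/user/user/dataparse/good");
        Path processed = new Path("/user/user/dataparse/processed");
        fs.mkdirs(processed);

        // ... run the Pig script against 'incoming' here (e.g. from a scheduler) ...

        // once the job has succeeded, move each input file out of the way
        for (FileStatus status : fs.listStatus(incoming)) {
            if (!status.isDirectory()) {
                fs.rename(status.getPath(), new Path(processed, status.getPath().getName()));
            }
        }
        fs.close();
    }
}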

Target already exists error in hadoop put command

I am trying my hand at Hadoop 1.0. I am getting a "Target already exists" error while copying a file from the local system into HDFS.
My Hadoop command and its output are as follows:
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt .
Warning: $HADOOP_HOME is deprecated.
put: Target already exists
Looking at the output, you can see that there are two blank spaces between the words 'Target' and 'already'. I think there should be something like /user/${user} between those two words. If I give the destination path explicitly as /user/shekhar, then I get the following error:
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt
Warning: $HADOOP_HOME is deprecated.
put: java.io.FileNotFoundException: Parent path is not a directory: /user/shekhar
Output of the ls command is as follows:
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -lsr /
Warning: $HADOOP_HOME is deprecated.
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred
drwx------ - shekhar supergroup 0 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred/system
-rw------- 1 shekhar supergroup 4 2012-02-21 19:56 /tmp/hadoop-shekhar/mapred/system/jobtracker.info
drwxr-xr-x - shekhar supergroup 0 2012-02-21 19:56 /user
-rw-r--r-- 1 shekhar supergroup 6541526 2012-02-21 19:56 /user/shekhar
Please help me copy the file into HDFS. If you need any other information, please let me know.
I am trying this on Ubuntu, which was installed using WUBI (the Windows installer for Ubuntu).
Thanks in Advance !
The problem in the put command is the trailing dot (.). You need to specify the full path on HDFS where you want the file to go, for example:
hadoop fs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /whatever/20111201.txt
If the directory that you are putting the file into doesn't exist yet, you need to create it first:
hadoop fs -mkdir /whatever
The problem you are having when you specify the path explicitly is that on your system /user/shekhar is a file, not a directory. You can see that because it has a non-zero size.
-rw-r--r-- 1 shekhar supergroup 6541526 2012-02-21 19:56 /user/shekhar
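The same fix can be expressed with the Java FileSystem API. Here is a minimal sketch (using the paths from the question) that removes the file squatting on /user/shekhar, recreates it as a directory, and then copies the file in:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutIntoHomeDir {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path home = new Path("/user/shekhar");
        if (fs.exists(home) && !fs.getFileStatus(home).isDir()) {
            fs.delete(home, false);   // a file is occupying the home-directory path
        }
        fs.mkdirs(home);

        fs.copyFromLocalFile(
                new Path("/host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt"),
                new Path(home, "data.txt"));
        fs.close();
    }
}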
shekhar#ubuntu:/host/Shekhar/Softwares/hadoop-1.0.0/bin$ hadoop dfs -put /host/Users/Shekhar/Desktop/Downloads/201112/20111201.txt /user/shekhar/data.txt
You must make the directory first!
hdfs dfs -mkdir /user/hadoop
hdfs dfs -put /home/bigdata/.password /user/hadoop/

Resources