Hadoop WordCount Output - hadoop

I am new to hadoop and am running some of the examples to become more familiar with it. I ran wordcount and when I went to check the output hadoop fs -cat outt I got 3 directories instead of the usual one named outt/part-00000. Here are the directories I have:
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 15 2014-07-11 20:13 outt/part-r-00000
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/part-r-00001
When I do hadoop fs -cat outt/_SUCCESS and hadoop fs -cat outt/part-r-00001, nothing appears. However, when I do hadoop fs -cat outt/part-r-00000 I get: record_count 1.
My file just says "Hello World" so I am expecting the result: Hello 1 World 1.
Does anyone know how to get the correct output?

1.)The _success and part-r-00000/1 are not directories but files. Directory is more like a set of files and other directories
2.) _Success file is automatically created by hadoop if the submitted job is performed successfully by all the nodes and reducers and the result set is complete.
3.)If you are getting two part files it implies that you have two reducers in your job description. Check the code to find if there is any statement like job.setNumReduceTasks(2);. The part named 00000 is the output of first reducer and 00001 is the output of the second reducer. 'r' implies that the output is from reducer. If see 'm' instead of 'r' it means that you dont have a reducer and the job is map only job.

When you are doing hadoop fs -cat outt/part-r-00000 and getting output as : record_count 1
Which mean probably you are counting the number of lines in the input file.
Once you read a line, you need to tokenize the line and take each word (token) out of this.
Here is sample code:
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
You can find the full code here: WordCount
Here instead of StringTokenizer you can you split method of java API.

Related

How to know the exact block size of a file on a Hadoop node?

I have a 1 GB file that I've put on HDFS. So, it would be broken into blocks and sent to different nodes in the cluster.
Is there any command to identify the exact size of the block of the file on a particular node?
Thanks.
You should use hdfs fsck command:
hdfs fsck /tmp/test.txt -files -blocks
This command will print information about all the blocks of which file consists:
/tmp/test.tar.gz 151937000 bytes, 2 block(s): OK
0. BP-739546456-192.168.20.1-1455713910789:blk_1073742021_1197 len=134217728 Live_repl=3
1. BP-739546456-192.168.20.1-1455713910789:blk_1073742022_1198 len=17719272 Live_repl=3
As you can see here is shown (len field in every row) actual used capacities of blocks.
Also there are many another useful features of hdfs fsck which you can see at the official Hadoop documentation page.
You can try:
hdfs getconf -confKey dfs.blocksize
I do not have reputation to comment.
Have a look at documentation page to set various properties, which covers
dfs.blocksize
Apart from configuration change, you can view actual size of file with
hadoop fs -ls fileNameWithPath
e.g.
hadoop fs -ls /user/edureka
output:
-rwxrwxrwx 1 edureka supergroup 391355 2014-09-30 12:29 /user/edureka/cust

Hadoop, Mapreduce - Cannot obtain block length for LocatedBlock

I've a file on hdfs in the path 'test/test.txt' which is 1.3G
output of ls and du commands is:
hadoop fs -du test/test.txt -> 1379081672 test/test.txt
hadoop fs -ls test/test.txt ->
Found 1 items
-rw-r--r-- 3 testuser supergroup 1379081672 2014-05-06 20:27 test/test.txt
I want to run a mapreduce job on this file but when i start the mapreduce job on this file the job fails with the following error:
hadoop jar myjar.jar test.TestMapReduceDriver test output
14/05/29 16:42:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/05/29 16:42:03 INFO input.FileInputFormat: Total input paths to process : 1
14/05/29 16:42:03 INFO mapred.JobClient: Running job: job_201405271131_9661
14/05/29 16:42:04 INFO mapred.JobClient: map 0% reduce 0%
14/05/29 16:42:17 INFO mapred.JobClient: Task Id : attempt_201405271131_9661_m_000004_0, Status : FAILED
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-428948818-namenode-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode4:50010, datanode3:50010, datanode1:50010]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:319)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:263)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:205)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:198)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1117)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:249)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:746)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:83)
at org.apache.hadoop.mapred.Ma`
i tried the following commands:
hadoop fs -cat test/test.txt gives the following error
cat: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode3:50010, datanode1:50010, datanode4:50010]}
additionally i can't copy the file hadoop fs -cp test/test.txt tmp gives same error:
cp: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode1:50010, datanode3:50010, datanode4:50010]}
output of the hdfs fsck /user/testuser/test/test.txt command:
Connecting to namenode via `http://namenode:50070`
FSCK started by testuser (auth:SIMPLE) from /10.17.56.16 for path
/user/testuser/test/test.txt at Thu May 29 17:00:44 EEST 2014
Status: HEALTHY
Total size: 0 B (Total open files size: 1379081672 B)
Total dirs: 0
Total files: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 21)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu May 29 17:00:44 EEST 2014 in 0 milliseconds
The filesystem under path /user/testuser/test/test.txt is HEALTHY
by the way i can see the content of the test.txt file from the web browser.
hadoop version is: Hadoop 2.0.0-cdh4.5.0
I got the same issue with you and I fixed it by the following steps.
There are some files that opened by flume but never closed (I am not sure about your reason).
You need to find the name of the opened files by the command:
hdfs fsck /directory/of/locked/files/ -files -openforwrite
You can try to recover files as command:
hdfs debug recoverLease -path <path-of-the-file> -retries 3
Or removing them by the command:
hdfs dfs -rmr <path-of-the-file>
I had the same error, but it was not due to the full disk problem, and I think the inverse, where there were files and blocks referenced by in the namenode that did not exist on any datanodes.
Thus, hdfs dfs -ls shows the files, but any operation on them fails, e.g. hdfs dfs -copyToLocal.
In my case, the hard part was isolating which files were listed but corrupted, as they existed in a tree having thousands of files. Oddly, hdfs fsck /path/to/files/ did not report any problems.
My solution was:
Isolate the location using copyToLocal which resulted in copyToLocal: Cannot obtain block length for LocatedBlock{BP-1918381527-10.74.2.77-1420822494740:blk_1120909039_47667041; getBlockSize()=1231; corrupt=false; offset=0; locs=[10.74.2.168:50010, 10.74.2.166:50010, 10.74.2.164:50010]} for several files
Get a list of the local directories using ls -1 > baddirs.out
get rid of the local files from the first copyToLocal
use for files incat baddirs.out;do echo $files; hdfs dfs -copyToLocal $files This will produce a list of directories checks, and errors where files are found.
get rid of the local files again, and now get lists of files from each affected subdirectory. Use that as input to a file-by-file copyToLocal, at which point you can echo each file as it's copied, then see where the error occurs.
use hdfs dfs -rm <file> for each file.
Confirm you got 'em all be removing all local files again, and using the original copyToLocal on the top level directory where you had problems.
A simple two hour process!
You are having some corrupted files with no blocks on datanode but an entry in namenode. Best to follow this:
https://stackoverflow.com/a/19216037/812906
According to this this may be produced by a full disk problem. I came across the same problem recently with an old file and checking my servers metrics it effectively was a full disk problem during the creation of that file. Most solutions just claim to delete the file and prey for it not happening again.

Copy Table from Hive to HDFS

I would like to copy HIVE table from HIVE to HDFS. Please suggest the steps. Later I would like to use this HFDS file for Mahout Machine Learning.
I have created a HIVE table using data stored in the HDFS. Then I trasfromed the few variables in that data set and created a new table from that.
Now I would like to dump the HIVE table from HIVE to HDFS. So that it can be read by Mahout.
When I type this
hadoop fs -ls -R /user/hive/
I can able to see the list of table I have created ,
drwxr-xr-x - hdfs supergroup 0 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-25 17:00 /user/hive/warehouse/telecom.db/telecom_tr/000000_0
I tried to copy the file from Hive to HDFS,
hadoop fs -cp /user/hive/warehouse/telecom.db/telecom_tr/* /user/hdfs/tele_copy
Here I was excepting tele_copy should be a csv file, stored in hdfs.
But when I do hadoop fs -tail /user/hdfs/tele_copy I get the below result.
7.980.00.00.0-9.0-30.00.00.670.00.00.00.06.00.06.670.00.670.00.042.02.02.06.04.0198.032.030.00.03.00.01.01.00.00.00.01.00.01.01.00.00.00.01.00.00.00.00.00.00.06.00.040.09.990.01.01
32.64296.7544.990.016.00.0-6.75-27.844.672.3343.334.671.3331.4725.05.3386.6754.07.00.00.044.01.01.02.02.0498.038.00.00.07.01.00.00.00.01.00.00.01.00.00.00.00.00.01.01.01.00.01.00.00.03.00.010.029.991.01.01
30.52140.030.00.250.00.0-42.0-0.520.671.339.00.00.034.6210.677.3340.09.332.00.00.040.02.02.01.01.01214.056.050.01.05.00.00.00.00.00.00.01.00.01.01.00.00.01.01.00.00.01.00.00.00.06.00.001.00.00.01.01
60.68360.2549.990.991.250.038.75-10.692.331.6715.670.00.0134.576.00.0102.6729.674.00.00.3340.02.01.08.03.069.028.046.00.05.00.01.00.00.00.00.00.01.01.01.00.00.00.01.00.00.01.00.00.00.02.00.020.0129.990.01.01
Which is not comma separated.
Also received the same result I received after running this command.
INSERT OVERWRITE DIRECTORY '/user/hdfs/data/telecom' SELECT * FROM telecom_tr;
When I do a -ls
drwxr-xr-x - hdfs supergroup 0 2014-04-29 17:34 /user/hdfs/data/telecom
-rw-r--r-- 1 hdfs supergroup 5199062 2014-04-29 17:34 /user/hdfs/data/telecom/000000_0
When I do a cat the result is not a CSV
What you're really asking is to have Hive store the file as a CSV file. Try using ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' see Row Format, Storage Format, and SerDe.

hadoop map reduce -archives not unpacking archives

hope you can help me. I've got a head-scratching problem with hadoop map-reduce. I've been using the "-files" option successfully on a map-reduce, with hadoop version 1.0.3. However, when I use the "-archives" option, it copies the files, but does not uncompress them. What am I missing? The documentation says "Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes", but that's not what I'm seeing.
I have created 3 files - a text file "alice.txt", a zip file "bob.zip" (containing b1.txt and bdir/b2.txt), and a tar file "claire.tar" (containing c1.txt and cdir/c2.txt). I then invoke the hadoop job via
hadoop jar myJar myClass -files ./etc/alice.txt -archives ./etc/bob.zip,./etc/claire.tar <input_path> <output_path>
The files are indeed there and well-formed:
% ls -l etc/alice.txt etc/bob.zip etc/claire.tar
-rw-rw-r-- 1 hadoop hadoop 6 Aug 20 18:44 etc/alice.txt
-rw-rw-r-- 1 hadoop hadoop 282 Aug 20 18:44 etc/bob.zip
-rw-rw-r-- 1 hadoop hadoop 10240 Aug 20 18:44 etc/claire.tar
% tar tf etc/claire.tar
c1.txt
cdir/c2.txt
I then have my mapper test for the existence of the files in question, like so, where 'lineNumber' is the key passed into the mapper:
String key = Long.toString(lineNumber.get());
String [] files = {
"alice.txt",
"bob.zip",
"claire.tar",
"bdir",
"cdir",
"b1.txt",
"b2.txt",
"bdir/b2.txt",
"c1.txt",
"c2.txt",
"cdir/c2.txt"
};
String fName = files[ (int) (lineNumber.get() % files.length)];
String val = codeFile(fName);
output.collect(new Text(key), new Text(val));
The support routine 'codeFile' is:
private String codeFile(String fName) {
Vector<String> clauses = new Vector<String>();
clauses.add(fName);
File f = new File(fName);
if (!f.exists()) {
clauses.add("nonexistent");
} else {
if (f.canRead()) clauses.add("readable");
if (f.canWrite()) clauses.add("writable");
if (f.canExecute()) clauses.add("executable");
if (f.isDirectory()) clauses.add("dir");
if (f.isFile()) clauses.add("file");
}
return Joiner.on(',').join(clauses);
}
Using the Guava 'Joiner' class.
The output values from the mapper look like this:
alice.txt,readable,writable,executable,file
bob.zip,readable,writable,executable,dir
claire.tar,readable,writable,executable,dir
bdir,nonexistent
b1.txt,nonexistent
b2.txt,nonexistent
bdir/b2.txt,nonexistent
cdir,nonexistent
c1.txt,nonexistent
c2.txt,nonexistent
cdir/c2.txt,nonexistent
So you see the problem - the archive files are there, but they are not unpacked. What am I missing? I have also tried using DistributedCache.addCacheArchive() instead of using -archives, but the problem is still there.
the distributed cache doesn't unpack the archives files to the local working directory of your task - there's a location on each task tracker for job as a whole, and it's unpacked there.
You'll need to check the DistributedCache to find this location and look for the files there. The Javadocs for DistributedCache show an example mapper pulling this information.
You can use symbolic linking when defining the -files and -archives generic options and a symlink will be created in the local working directory of the map / reduce tasks making this easier:
hadoop jar myJar myClass -files ./etc/alice.txt#file1.txt \
-archives ./etc/bob.zip#bob,./etc/claire.tar#claire
And then you can use the fragment names in your mapper when trying to open files in the archive:
new File("bob").isDirectory() == true

Is there an apache pig equivalent of "SHOW TABLES"?

I have a Hadoop data store I'm accessing in Pig and not a lot of documentation on it, plus I'm new to Pig, so I am looking for the Pig equivalent of "SHOW TABLES". When I have a connection to a MySQL db I can do this and get a general sense of what data is in there; I have found several tutorials but nothing on point. If not, is there some other way to orient myself to a Hadoop data store I know nothing about?
ETA: This would be when running Pig in interactive mode, rather than loading a script. Probably obvious, but I thought I should mention it.
The closest thing I can see to 'show tables' is the 'history' command, which effectively lists all aliases created.
grunt> history
1 a = LOAD 'iris.csv' USING PigStorage (',') AS
(sl:double,sw:double,pl:double,pw:double,spec:int);
2 b = FILTER a BY spec==1;
3 c = GROUP b BY pw;
4 d = FOREACH c GENERATE COUNT(b);
Pig doesn't have a concept of tables. It can read any file that is on your HDFS filesystem and stores the parsed result in a relation.
Note that you can also run HDFS filesystem commands from the grunt shell
It's probably best you familiarise yourself with HDFS first and make sure you can comfortably navigate the filesystem first so you can find what data you want to process with Pig.
We had also came across similar situation and applied all solutions of stackoverflow but none had solved my issue . Now solution of these problem is that , you should use store command of pig and also provide dedicated folder for it .
Now the set up which we prefer is ,
grunt> fs -mkdir /user/hduser/AllPigTableStructures/
grunt> fs -chmod 777 /user/hduser/AllPigTableStructures/
Now we will store all table informations into these folder named "AllPigTableStructures".
Then you should use "store" function as below code,
grunt> store extract_details into '/user/hduser/AllPigTableStructures/SchemaTwit' using PigStorage('\t', '-schema');
the last line of these code should be
/*2017-09-18 02:13:56,566 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
*/
Now you should see a folder with named SchemaTwit like these,
grunt> fs -ls /user/hduser/AllPigTableStructures
Found 12 items
drwxr-xr-x - hduser supergroup 0 2017-09-18 02:13 /user/hduser/AllPigTableStructures/SchemaTwit
and at last if you will see content of SchemaTwit directory then it will contain your schema of your table and all details about your table below is command for it and part-m-xxx kind of file will contains your data part.
grunt> fs -ls /user/hduser/AllPigTableStructures/SchemaTwit
Found 4 items
-rw-r--r-- 2 hduser supergroup 8 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/.pig_header
-rw-r--r-- 2 hduser supergroup 239 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/.pig_schema
-rw-r--r-- 2 hduser supergroup 0 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/_SUCCESS
-rw-r--r-- 2 hduser supergroup 140 2017-09-18 02:26 /user/hduser/AllPigTableStructures/SchemaTwit/part-m-00000
Now you can use below cat command on schema file to see schema of your table of part-m-xxx for browsing your data part
grunt> fs -cat /user/hduser/AllPigTableStructures/SchemaTwit/.pig_schema
{"fields":[{"name":"id","type":50,"description":"autogenerated from Pig Field Schema","schema":null},{"name":"text","type":50,"description":"autogenerated from Pig Field Schema","schema":null}],"version":0,"sortKeys":[],"sortKeyOrders":[]}
Now for loading your table with schema these command help,
WithSchema = LOAD '/user/hduser/AllPigTableStructures/SchemaTwit';
PS: We are running our pig into mapreduce mode .
Looks like you have mistaken Pig. As #seedhead has specified, you handle files with Pig. Folks quite often mistake it as a a database(like Hbase) or a warehouse(like Hive), which it is not. And, as far as visualizing the data is concerned, you could list the files and directories through Pig shell. And if you need to see how many records(or lines) a particular files has, you could do something like this :
Records = LOAD '/path_of_the_file';
Records_Group= GROUP Records ALL;
Records_Count = FOREACH Records_Group GENERATE COUNT(Records);

Resources