Access Hadoop intermediate map output files

Is it possible to access or read the map intermediate output file, i.e. the file.out file?
I want to read file.out. I used the approach described in this link, http://hadooptutorial.info/hadoop-sequence-files-example/ ,
but it reports that file.out is not a sequence file.

If you want to read the output of the mapper, pass the following configuration argument when running the job:
-Dmapreduce.job.reduces=0
It sets the number of reducers to 0, so there is no partitioning or shuffle, and only the mapper output appears in the output directory.
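If you prefer to set this in the driver instead of on the command line, a minimal map-only driver sketch could look like the following (the class name and path handling are illustrative assumptions, not taken from the question):
// Illustrative map-only driver: equivalent to passing -Dmapreduce.job.reduces=0.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(Mapper.class);          // identity mapper, just for illustration
    job.setNumReduceTasks(0);                  // no reducers: map output goes straight to part-m-XXXXX files
    job.setOutputKeyClass(LongWritable.class); // TextInputFormat keys are byte offsets
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}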

Related

Having multiple reduce tasks assemble a single HDFS file as output

Is there any low-level API in Hadoop that allows multiple reduce tasks running on different machines to assemble a single HDFS file as the output of their computation?
Something like: a stub HDFS file is created at the beginning of the job, and each reducer then creates, as output, a variable number of data blocks and assigns them to this file in a certain order.
The answer is no, that would be an unnecessary complication for a rare use case.
What you should do
Option 1 - add some code at the end of your Hadoop driver
int result = job.waitForCompletion(true) ? 0 : 1;
if (result == 0) { // status code OK
// ls job output directory, collect part-r-XXXXX file names
// create HDFS readers for files
// merge them in a single file in whatever way you want
}
All of the required methods are available in the Hadoop FileSystem API.
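A rough sketch of that post-completion merge, assuming the default FileSystem from the job configuration and an illustrative class name (none of this is from the original answer):
// Illustrative sketch: merge all part-r-* files of a finished job into one HDFS file.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IOUtils;

public class MergeParts {
  public static void merge(Configuration conf, Path outputDir, Path mergedFile) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // collect the reducer output files (part-r-XXXXX)
    FileStatus[] parts = fs.listStatus(outputDir, new PathFilter() {
      public boolean accept(Path p) {
        return p.getName().startsWith("part-r-");
      }
    });
    // copy each part into a single merged file (sort 'parts' first if ordering matters)
    try (FSDataOutputStream out = fs.create(mergedFile)) {
      for (FileStatus part : parts) {
        try (FSDataInputStream in = fs.open(part.getPath())) {
          IOUtils.copyBytes(in, out, 4096, false); // false: streams are closed by try-with-resources
        }
      }
    }
  }
}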
Option 2 - add a job to merge the files
You can create a generic Hadoop job that accepts a directory name as input and passes everything as-is to a single reducer, which merges the results into one output file. Call this job in a pipeline after your main job.
This would work faster for big inputs.
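A hedged sketch of such a merge job (the class names and the NullWritable-key trick are my assumptions about how it could be done, not the original author's code):
// Illustrative "merge job": every record is funneled through a single reducer
// and written back out as plain lines, so the output directory contains one
// part-r-00000 file.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeJob {

  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), line); // same key for every line -> one reduce group
    }
  }

  public static class PassThroughReducer
      extends Reducer<NullWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        ctx.write(NullWritable.get(), line); // TextOutputFormat drops NullWritable keys
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge");
    job.setJarByClass(MergeJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setReducerClass(PassThroughReducer.class);
    job.setNumReduceTasks(1);                      // single reducer -> single output file
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // output dir of the main job
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // merged output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}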
If you want the merged output file on your local filesystem, you can use the hadoop fs -getmerge command to combine the multiple reduce task files into one local output file; the command is shown below.
hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt

How to insert a header file as the first line of a data file in HDFS without using getmerge (performance issue while copying to local)?

I am trying to insert header.txt as the first line of data.txt without using getmerge. getmerge copies to local and merges into a third file, but I want the result to stay in HDFS only.
Header.txt
Head1,Head2,Head3
Data.txt
100,John,28
101,Gill,25
102,James,29
I want the output in the Data.txt file only, like below:
Data.txt
Head1,Head2,Head3
100,John,28
101,Gill,25
102,James,29
Please suggest whether this can be implemented in HDFS only.
HDFS supports a concat (short for concatenate) operation in which two files are merged together into one without any data transfer. It will do exactly what you are looking for. Judging by the file system shell guide documentation, it is not currently supported from the command line, so you will need to implement this in Java:
// Obtain the HDFS FileSystem, e.g. from a Configuration
FileSystem fs = FileSystem.get(new Configuration());
Path data = new Path("Data.txt");
Path header = new Path("Header.txt");
Path dataWithHeader = new Path("DataWithHeader.txt");
// concat() takes the target path plus an array of source paths
fs.concat(dataWithHeader, new Path[] { header, data });
After this, Data.txt and Header.txt both cease to exist, replaced by DataWithHeader.txt.
Thanks for your reply.
I found another way:
hadoop fs -cat hdfs_path/header.txt hdfs_path/data.txt | hadoop fs -put - hdfs_path/Merged.txt
The drawback is that cat reads the complete data, which impacts performance.

Hadoop: How to generate custom reduce output file name?

Right now I use MultipleOutputs.
I would like to remove the suffix string "-00001" from the reducer's output filename, such as "xxxx-[r/m]-00001".
Is there any way to do this?
Thanks.
From the Hadoop javadoc for the write() method of MultipleOutputs:
Output path is a unique file generated for the namedOutput. For example, {namedOutput}-(m|r)-{part-number}
So you need to rename or merge these files on HDFS.
You can do this in the job driver: when the job completes, rename the files. You could also do it from the terminal with HDFS shell commands.
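For the driver-side approach, a hedged sketch using globStatus and rename might look like this (the output path and naming pattern are illustrative assumptions; watch out for name collisions if several part numbers share the same prefix):
// Illustrative sketch: strip the "-r-00000" / "-m-00000" suffix from files in the job output directory.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameOutputs {
  public static void stripSuffixes(Configuration conf, Path outputDir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // match files such as xxxx-r-00001 or xxxx-m-00001
    FileStatus[] files = fs.globStatus(new Path(outputDir, "*-[rm]-[0-9]*"));
    for (FileStatus file : files) {
      String name = file.getPath().getName();
      String stripped = name.replaceAll("-[rm]-\\d+$", ""); // drop the trailing "-r-00001" part
      fs.rename(file.getPath(), new Path(outputDir, stripped));
    }
  }
}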

Running Word Count or Pig Script on a Directory to produce result in separate files

I am new to Hadoop/Pig.
I have a directory which contains several files, and I need to run a word count on them. I can use the Hadoop wordcount example and run it on the directory to get the output, but the output will be in a single file. What should I do if I want the output for each input file to go into a different file?
I can use Pig too, and give the directory as input to Pig. But how can I read the file names inside the directory and then pass them to LOAD?
What I mean is:
Say I have a directory Test which has 5 files test1, test2, test3, test4, test5. Now I want the word count of each file separately, in a separate file. I know I can provide the individual names and do it, but that would take a lot of time.
Is it possible to read the filenames from the directory and provide them as input to Pig's LOAD?
If you're using Pig version 0.10.0 or later, you can take advantage of a combination of source tagging and MultiStorage to keep track of the files.
For example, if you had an input directory pigin with files and content as the following:
pigin
|-test1 => "hello"
|-test2 => "world"
|-test3 => "Apache"
|-test4 => "Hadoop"
|-test5 => "Pig"
The following script will read each file and write the contents of each file to a different directory.
%declare inputPath 'pigin'
%declare outputPath 'pigout'
-- Define MultiStorage to write output to different directories based on the
-- first element in the tuple
define MultiStorage org.apache.pig.piggybank.storage.MultiStorage('$outputPath','0');
-- Load the input files, prepending each tuple with the file name
A = load '$inputPath' using PigStorage(',', '-tagsource');
-- Write output to different directories
store A into '$outputPath' using MultiStorage();
The above script will create an output directory tree that looks like the following:
pigout
|-test1
| `-test1-0 => "test1 hello"
|-test2
| `-test2-0 => "test2 world"
|-test3
| `-test3-0 => "test3 Apache"
|-test4
| `-test4-0 => "test4 Hadoop"
|-test5
| `-test5-0 => "test5 Pig"
The -0 at the end of the filenames corresponds to the reducer that produced the output. If you have more than one reducer, you may see more than one file per directory.
You could extend the PigStorage code to add the file name to the tuple; see the code sample and look for the question "Q: I load data from a directory which contains different files. How do I find out where the data comes from?". For the output you could do a similar extension of PigStorage to write into different output files.

Mahout - Naive Bayes

I tried deploying the 20 newsgroups example with Mahout, and it seems to work fine. Out of curiosity I would like to dig deeper into the model statistics.
For example, the bayes-model directory contains the following subdirectories:
trainer-tfIdf trainer-thetaNormalizer trainer-weights
which contain part-0000 files. I would like to read the contents of these files for better understanding, but the cat command doesn't seem to work; it prints some garbage.
Any help is appreciated.
Thanks
The 'part-00000' files are created by Hadoop, and are in Hadoop's SequenceFile format, containing values specific to Mahout. You can't open them as text files, no. You can find the utility class SequenceFileDumper in Mahout that will try to output the content as text to stdout.
As to what those values are to begin with, they're intermediate results of the multi-stage Hadoop-based computation performed by Mahout. You can read the code to get a better sense of what these are. The "tfidf" directory for example contains intermediate calculations related to term frequency.
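If you would rather inspect such a file programmatically than through a dumper utility, a small hedged sketch using Hadoop's SequenceFile.Reader could look like this (the Mahout jars, e.g. the one containing VectorWritable, must be on the classpath; the class name and path are illustrative):
// Illustrative sketch: print the key/value pairs of a SequenceFile such as part-00000.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpSequenceFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]); // e.g. a part-00000 file under one of the trainer-* directories
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value); // relies on the Writables' toString()
      }
    } finally {
      reader.close();
    }
  }
}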
You can read part-0000 files using hadoop's filesystem -text option. Just get into the hadoop directory and type the following
`bin/hadoop dfs -text /Path-to-part-file/part-m-00000`
part-m-00000 will be printed to STDOUT.
If it gives you an error, you might need to set the HADOOP_CLASSPATH environment variable. For example, if running it gives you
text: java.io.IOException: WritableName can't load class: org.apache.mahout.math.VectorWritable
then add the corresponding class to the HADOOP_CLASSPATH variable
export HADOOP_CLASSPATH=/src/mahout/trunk/math/target/mahout-math-0.6-SNAPSHOT.jar
That worked for me ;)
In order to read part-00000 (sequence files) you need to use the "seqdumper" utility. Here's an example I used for my experiments:
MAHOUT_HOME$ bin/mahout seqdumper \
  -s ~/clustering/experiments-v1/t14/tfidf-vectors/part-r-00000 \
  -o ~/vectors-v2-1010
-s is the sequence file you want to convert to plain text
-o is the output file
