Hadoop streaming API: how to remove unwanted delimiters

Say I have a file on HDFS:
1
2
3
I want it transformed to
a
b
c
I wrote a mapper.py:
#!/usr/bin/python
import sys
# Map each integer n on stdin to the n-th lowercase letter: 1 -> a, 2 -> b, ...
for line in sys.stdin:
    print chr(int(line) + ord('a') - 1)
then using the streaming api:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
-mapper mapper.py -file mapper.py -input /input -output /output
But the result in /output is "a\t\nb\t\nc\t\n":
a\t
b\t
c\t
Note the extra tab characters; since they are unprintable, I've written them as '\t' above. This behavior is documented here:
If there is no tab character in the line, then entire line is considered as key and the value is null.
So the tabs were added by the streaming API as separators. But no matter how I modify the separator-related options, I can't make them disappear.
Thus my question is: is there a way to do this job cleanly, without extra characters like tabs?
Or, to put it another way, is there a way to use Hadoop simply as a distributed filter, discarding its key/value mechanism?
====
update # 2013.11.27
As I discussed with friends, there's no easy way to achieve this, so I worked around the problem by keeping the tab as the field separator in my output and setting tab as the field separator in Hive as well.
Some of my friends proposed using -D mapred.textoutputformat.ignoreseparator=true, but that parameter simply doesn't work. I looked into this file:
hadoop-1.1.2/src/mapred/org/apache/hadoop/mapred/TextOutputFormat.java
and didn't find any such option. As an alternative, the streaming API accepts an -outputformat parameter that specifies a different output format.
According to this article, you can make a copy of TextOutputFormat.java, remove the default '\t', compile it, pack it into a jar, and call the streaming API with -libjars yourjar.jar -outputformat path.to.your.outputformat. I didn't succeed this way with hadoop-1.1.2, but I'm writing it down for others' reference.
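For what it's worth, here is a rough sketch of what such a custom output format might look like against the Hadoop 1.x mapred API. The package and class names are placeholders of my own and I haven't verified it on a real cluster; it simply writes the key (the mapper's output line) followed by a newline, skipping the separator and the empty value.

// Hypothetical NoTabOutputFormat (package/class names are placeholders), written
// against the Hadoop 1.x org.apache.hadoop.mapred API. Compile against hadoop-core,
// pack into a jar, then run streaming with:
//   -libjars notab.jar -outputformat mypkg.NoTabOutputFormat
package mypkg;

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class NoTabOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
            String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        final DataOutputStream out = file.getFileSystem(job).create(file, progress);

        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                // Write only the key (the mapper's line) and a newline; skip the
                // separator and the empty value that TextOutputFormat would append.
                out.write(key.getBytes(), 0, key.getLength());
                out.write('\n');
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}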

You should be able to get rid of these delimiters by specifying that your job is a map-only job - essentially the distributed filter you want - so that the output of your mapper is the final output.
To do that in Hadoop streaming, you can use the following option:
-D mapred.reduce.tasks=0
So for the full command this would look something like:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar -D mapred.reduce.tasks=0 -mapper mapper.py -file mapper.py -input /input -output /output

Related

What are the parameters used for in this shell code?

hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
${IDF_OUT} \
${IG_OUT} \
${PROB_OUT} \
${MERGE_OUT} \
1.00 \
0.000001 \
0.0001 \
There is a piece of shell code, and I know that hadoop will run cc-jar-with-dependencies.jar against HDFS. But what is the meaning of the other parameters, from the second line onwards? Are they parameters needed by the jar?
${...} is a path on HDFS, like ${IDF_OUT} and so on.
The ${WORD} usage is the basic case of parameter expansion in bash/shell:
$PARAMETER
${PARAMETER}
The easiest form is to just use a parameter's name within braces. This is identical to using $FOO like you see it everywhere, but has the advantage that it can be immediately followed by characters that would be interpreted as part of the parameter name otherwise.
For example,
word="car"
echo "The plural of $word is most likely $words"
echo "The plural of $word is most likely ${word}s"
produces the output:
The plural of car is most likely
The plural of car is most likely cars
Note how the first line does not contain "cars" as expected: the shell could only interpret ${word}, not $words, which it treats as a single (unset) variable.
Coming back to your example,
hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
${IDF_OUT} \
${IG_OUT} \
${PROB_OUT} \
${MERGE_OUT} \
1.00 \
0.000001 \
0.0001 \
From the second line onwards, the variables ${IDF_OUT}, ${IG_OUT}, ${PROB_OUT} and ${MERGE_OUT} are in all likelihood shell variables (set earlier in the script or in the environment) holding the HDFS paths you mentioned; they are expanded to their values when the command is run.
While I have explained what the ${WORD} syntax is, the actual purpose of these variables is not something the shell itself determines.
Those parameters are passed to the hadoop command, so you would need to read the documentation for that command.
However, it might be interesting for you to find out the values contained in these parameters when your script is run. You can do that by modifying the code as shown below:
echo >&2 \
hadoop jar cc-jar-with-dependencies.jar com.coupang.pz.cc.merge.Merge_Run \
${IDF_OUT} \
${IG_OUT} \
${PROB_OUT} \
${MERGE_OUT} \
1.00 \
0.000001 \
0.0001 \
This change causes the whole command to be printed rather than executed; the >&2 sends the output to standard error, which may help it reach the terminal if standard output is being captured somewhere. Note that this change is for debugging/curiosity only: it makes your script skip executing the command.
Once you know the values, the whole command will likely be easier to make sense of.

Concat Avro files using avro-tools

I'm trying to merge Avro files into one big file; the problem is that the concat command does not accept a wildcard:
hadoop jar avro-tools.jar concat /input/part* /output/bigfile.avro
I get:
Exception in thread "main" java.io.FileNotFoundException: File does not exist: /input/part*
I tried wrapping the pattern in "" and '', but no luck.
I quickly checked Avro's source code (1.7.7) and it seems that concat does not support glob patterns (basically, they call FileSystem.open() on each argument except the last one).
This means you have to provide all the filenames explicitly as arguments. It is cumbersome, but the following command should do what you want:
IN=$(hadoop fs -ls /input/part* | awk '{printf "%s ", $NF}')
hadoop jar avro-tools.jar concat ${IN} /output/bigfile.avro
Adding glob-pattern support to this command would be a nice improvement.
Instead of hadoop jar avro-tools.jar you can run java -jar avro-tools.jar, since you don't need Hadoop for this operation.

Hadoop sort example fails with 'not a SequenceFile'. How do I provide a SequenceFile?

I'm trying to run bin/hadoop jar hadoop-examples-1.0.4.jar sort input output
but get an error "java.io.IOException: hdfs://master:9000/usr/ubuntu/input/file1 not a SequenceFile"
If I run bin/hadoop jar hadoop-examples-1.0.4.jar wordcount input output, it works.
So I can't figure out how to deal with this.
The error message here is exactly right; the sort example expects a sequence file as input - a flat file of binary keys and values, the kind often produced as output by MapReduce jobs.
The wordcount example, however, does not expect a sequence file in particular, merely a text file, which is read in with the key being the byte offset into the file and the value being the line content.
Since the input files you have are not sequence files, sort cannot run on them.
@Jork, if you look at the sort example in hadoop-examples-1.0.4.jar, you can change the input and output formats through command-line arguments, or you can change the program from SequenceFileInputFormat to a text input format.
I had the same issue. Here, https://wiki.apache.org/hadoop/Sort, it says "The inputs and outputs must be Sequence files."
You should convert your input file to a Hadoop sequence file (I wish there were an easier way). I found this tutorial helpful, good luck! https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/hadoop-sequence-file-example/
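For illustration, here is a minimal, untested sketch of such a conversion using the Hadoop 1.x SequenceFile API. The paths and the (LongWritable, Text) key/value choice are my own; depending on the key/value classes the sort example expects, you may also need the command-line options mentioned in the comment above.

// Rough sketch: turn a text file into a SequenceFile of (line number, line) pairs.
// Class name, jar name and paths below are hypothetical.
// Run with something like: hadoop jar texttoseq.jar TextToSequenceFile input/file1 input-seq/file1
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class TextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path textIn = new Path(args[0]);   // e.g. input/file1
        Path seqOut = new Path(args[1]);   // e.g. input-seq/file1

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, seqOut, LongWritable.class, Text.class);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(textIn)));
        try {
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                // One record per input line: key = line number, value = line text.
                writer.append(new LongWritable(lineNo++), new Text(line));
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}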

hadoop converting \r\n to \n and breaking ARC format

I am trying to parse data from commoncrawl.org using hadoop streaming. I set up a local hadoop to test my code, and have a simple Ruby mapper which uses a streaming ARCfile reader. When I invoke my code myself like
cat 1262876244253_18.arc.gz | mapper.rb | reducer.rb
It works as expected.
It seems that hadoop automatically sees that the file has a .gz extension and decompresses it before handing it to a mapper - however while doing so it converts \r\n linebreaks in the stream to \n. Since ARC relies on a record length in the header line, the change breaks the parser (because the data length has changed).
To double check, I changed my mapper to expect uncompressed data, and did:
cat 1262876244253_18.arc.gz | zcat | mapper.rb | reducer.rb
And it works.
I don't mind hadoop automatically decompressing (although I can quite happily deal with streaming .gz files), but if it does I need it to decompress in 'binary' without doing any linebreak conversion or similar. I believe that the default behaviour is to feed decompressed files to one mapper per file, which is perfect.
How can I either ask it not to decompress .gz (renaming the files is not an option) or make it decompress properly? I would prefer not to use a special InputFormat class which I have to ship in a jar, if at all possible.
All of this will eventually run on AWS ElasticMapReduce.
Looks like the Hadoop PipeMapper.java is to blame (at least in 0.20.2):
PipeMapper.java (0.20.2)
Around line 106, the input from TextInputFormat is passed to this mapper (at which stage the \r\n has been stripped), and the PipeMapper is writing it out to stdout with just a \n.
One suggestion would be to look at the PipeMapper.java source for your version, check whether this 'feature' still exists, and amend it as required (perhaps making the behaviour configurable via a property).

How do I concatenate a lot of files into one inside Hadoop, with no mapping or reduction

I'm trying to combine multiple files in multiple input directories into a single file, for various odd reasons I won't go into. My initial try was to write a 'null' mapper and reducer that just copied input to output, but that failed. My latest try is:
vcm_hadoop lester jar /vcm/home/apps/hadoop/contrib/streaming/hadoop-*-streaming.jar -input /cruncher/201004/08/17/00 -output /lcuffcat9 -mapper /bin/cat -reducer NONE
but I end up with multiple output files anyway. Anybody know how I can coax everything into a single output file?
Keep the cat mappers and use a single cat reducer. Make sure you're setting the number of reducers to one. The output will also have gone through the sorter.
You need to use a reducer because you can only suggest the number of mappers; the number of reducers, by contrast, can be set exactly.
If you don't want the output sorted, you could have your mappers take filenames as input, read each file, and emit the filename and line number as the key with the line as the value; the reducer would then throw away the key and output the value.
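As a side note, and not the streaming approach described above: if running a small Java driver is an option, you can skip MapReduce entirely and concatenate through the FileSystem API. A rough, untested sketch with a hypothetical class name:

// Hypothetical HdfsConcat driver: copies every file matched by the given glob
// patterns, in listing order, into one output file on HDFS.
// Usage: HdfsConcat <input glob> [<input glob> ...] <output file>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsConcat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path(args[args.length - 1]));
        try {
            for (int i = 0; i < args.length - 1; i++) {
                FileStatus[] matches = fs.globStatus(new Path(args[i]));
                if (matches == null) {
                    continue;  // glob matched nothing
                }
                for (FileStatus status : matches) {
                    if (status.isDir()) {
                        continue;  // only concatenate plain files
                    }
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        // 'false' keeps the shared output stream open between files.
                        IOUtils.copyBytes(in, out, conf, false);
                    } finally {
                        in.close();
                    }
                }
            }
        } finally {
            out.close();
        }
    }
}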
