Hadoop MapReduce Streaming: sorting on multiple columns

I have mapreduce input that looks like this:
key1 \t 4.1 \t more ...
key1 \t 10.3 \t more ...
key2 \t 6.9 \t more ...
key2 \t 3 \t more ...
I want to sort by the first column and then by the second column (reverse numerical). Is there a way to achieve this with Streaming MapReduce?
My current attempt is this:
hadoop jar hadoop-streaming-1.2.1.jar -Dnum.key.fields.for.partition=1 -Dmapred.text.key.comparator.options='-k1,2rn' -Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -mapper cat -reducer cat -file mr_base.py -file common.py -file mr_sort_combiner.py -input mr_combiner/2013_12_09__05_47_21/part-* -output mr_sort_combiner/2013_12_09__07_15_59/
But this sorts by the first part of the key and then the second, except that the second field is compared as a string rather than numerically.
Any ideas on how I can sort on two fields (one numeric and one textual)?

You can achieve numerical sorting on multiple columns by specifying multiple -k options in mapred.text.key.comparator.options (similar to the Linux sort command).
e.g. in bash
sort -k1,1 -k2rn
So for your example it would be:
hadoop jar hadoop-streaming-1.2.1.jar \
-Dmapred.text.key.comparator.options='-k1,1 -k2rn' \
-Dmapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-mapper cat \
-reducer cat \
-file mr_base.py \
-file common.py \
-file mr_sort_combiner.py \
-input mr_combiner/2013_12_09__05_47_21/part-* \
-output mr_sort_combiner/2013_12_09__07_15_59/

Related

How to find those words that repeat one character more than twice in a word (such as "aa", "aaxx")

I want to extract all the vocabulary from a text file (it was converted from a real ebook, so it may be large, and the solution needs to be efficient), and I have written the vocabulary to a text file named voclist. But it still contains some illegal words that I want to remove, such as "aa" and "aazzz".
I have tried "egrep [a-z]+ voclist", but of course that doesn't work.
This is a block containing illegal words:
2 accepting
2 absence
1 zz
1 yyybb
1 yarn
I want output like this:
2 accepting
2 absence
1 yarn
I have thought about this question many times. Removing "yyybb" while keeping "accepting" may be a little difficult, and "yyybb" rarely occurs in a real ebook, so we can just remove "zz". Does anybody have any ideas?
Suppose inputfile contains:
2 accepting
2 absence
1 zz
1 yyybb
1 yarn
To get a list of words with two or more repeated characters:
$ egrep "(\w)\w*\1" inputfile
2 accepting
2 absence
1 zz
1 yyybb
and to filter illegal words you could use a dictionary, e.g.
$ cat dictionary
accepting
absence
and compare against it:
$ egrep "(\w)\w*\1" inputfile | grep -f dictionary
2 accepting
2 absence
The format you have is a bit unwieldy. It looks like it comes from a combination of sort and uniq -c. For simplicity, I'll assume the following input format:
accepting
absence
zz
yyybb
yarn
In a somewhat lengthy way, you could write:
$ grep -v \
      -e '^.$' \
      -e '^\(.\)\1$' \
      -e '\(.\)\1\+' \
      -e '^[aeiou]\+$' \
      -e '^[bcdfghjklmnpqrstvwxyz]\+$' \
      file
The patterns drop, in order: a single character, a single doubled character (e.g. "zz"), any character repeated two or more times in a row, vowel-only words, and consonant-only words.
We use grep because it supports backreferences in the matching pattern, something that awk does not allow.
It is now possible to use this on the original format as:
awk '{print $2}' file \
| grep -v -e '^.$' -e '^\(.\)\1$' -e '\(.\)\1\+' \
-e '^[aeiou]\+$' -e '^[bcdfghjklmnpqrstvwxyz]\+$' \
| grep -wFf - file

How to do a secondary sort on filenames with numbers in hadoop streaming?

I'm trying to sort file names such as
cat1.pdf, cat2.pdf, ... cat10.pdf ...
I'm utilizing a sort right now with the following parameters:
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator
-D stream.num.map.output.key.fields=2
-D mapreduce.partition.keypartitioner.options="-k1,1"
-D mapreduce.partition.keycomparator.options="-k1,1 -k2,2 -V"
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
The key value pairs are separated by a tab with the file name as the value and a string as the key. The problem is that my sort right now secondary sorts the file names such that I get
cat1.pdf, cat10.pdf, cat2.pdf, cat3.pdf, cat30.pdf ...
How can I get it such that the files are sorted like this:
cat1.pdf, cat2.pdf, cat3.pdf ... cat10.pdf,cat11.pdf...
I'm using hadoop streaming 2.7.1

Hadoop sample map reduce program using Awk

I am familiar with Hadoop using Java. I am looking for a sample Hadoop MapReduce program that uses AWK only.
For a text file containing...
A k1
B k1
C k2
D k3
I am looking for output like:
k1 2
k2 1
k3 1
I would advise using Hadoop streaming to do this. I'm not an Awk expert by any means, but building on @sudo_O's answer and transplanting it into the Hadoop world, here is what I would do:
Write an Awk script that will be used as your mapper. You need only a mapper for this, no need for reducers.
$ cat mapper.awk
#!/usr/bin/awk -f
{a[$2]++}END{for(k in a)print k,a[k]}
You can run your Hadoop streaming job as follows:
${HADOOP_HOME}/bin/hadoop \
jar ${HADOOP_HOME}/contrib/streaming/*.jar \
-D mapreduce.job.reduces=0 \
-D mapred.reduce.tasks=0 \
-input /path/to/input.txt \
-output /path/to/output/dir \
-mapper mapper.awk \
-file /path/to/mapper.awk
You can view the results in HDFS by doing:
hadoop fs -cat /path/to/output/dir/*
This will do the trick:
$ awk '{a[$2]++}END{for(k in a)print k,a[k]}' file
k1 2
k2 1
k3 1

hadoop file splitting using KeyFieldBasedPartitioner

I have a big file that is formatted as follows
sample name \t index \t score
I'm trying to split this file based on sample name using Hadoop Streaming.
I know ahead of time how many samples there are, so I can specify how many reducers I need.
This post is doing something very similar, so I know that this is possible.
I tried using the following script to split this file into 16 files (there are 16 samples):
hadoop jar $STREAMING \
-D mapred.text.key.partitioner.options=-k1,1 \
-D stream.num.map.output.key.fields=2 \
-D mapred.reduce.tasks=16 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper cat \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-input input_dir/*part* -output output_dir
This somewhat works: some of the files contain only one sample name, but most of the part* files are blank and some of them contain multiple sample names.
Is there a better way to make sure that every reducer gets only one sample name?
FYI, there is actually a much cleaner way to split up files, using a custom OutputFormat.
This link describes how to do it really well. I ended up tailoring this other link for my specific application. Altogether, it's only a few extra lines of Java.
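As a rough sketch of what those extra lines can look like (the class name and key/value types are illustrative, assuming the old org.apache.hadoop.mapred API that streaming uses), you can subclass MultipleTextOutputFormat so that each record is written to a file named after its key, i.e. the sample name:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Hypothetical example class, not the code from the links above.
public class SampleNameOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Route each record to a file named after the sample name (the key)
        // instead of the reducer's default part-NNNNN file.
        return key.toString();
    }
}
Compiled into a jar and shipped with the job, a class like this should be usable from streaming via the -outputformat option, so getting one file per sample no longer depends on how the keys happen to be spread across reducers.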

How to run external program within mapper or reducer giving HDFS files as input and storing output files in HDFS?

I have an external program which takes a file as input and produces an output file:
//for example
input file: IN_FILE
output file: OUT_FILE
//Run External program
./vx < ${IN_FILE} > ${OUT_FILE}
I want both the input and output files to be in HDFS.
I have a cluster with 8 nodes, and I have 8 input files, each containing one line:
//1 input file : 1.txt
1:0,0,0
//2 input file : 2.txt
2:0,0,128
//3 input file : 3.txt
3:0,128,0
//4 input file : 4.txt
4:0,128,128
//5 input file : 5.txt
5:128,0,0
//6 input file : 6.txt
6:128,0,128
//7 input file : 7.txt
7:128,128,0
//8 input file : 8.txt
8:128,128,128
I am using KeyValueTextInputFormat:
key: file name
value: initial coordinates
For example, for the 5th file:
key: 5
value: 128,0,0
Each map task generates a huge amount of data according to its initial coordinates. Now I want to run the external program in each map task and generate an output file, but I am confused about how to do that with files in HDFS.
I can use zero reducers and create a file in HDFS:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outFile;
outFile = new Path(INPUT_FILE_NAME);
FSDataOutputStream out = fs.create(outFile);
//generating data ........ and writing to HDFS
out.writeUTF(lon + ";" + lat + ";" + depth + ";");
I am confused about how to run the external program on an HDFS file without first copying it to a local directory with dfs -get.
Without using MapReduce, I get results with a shell script as follows:
#!/bin/bash
if [ $# -lt 2 ]; then
    printf "Usage: %s: <infile> <outfile> \n" $(basename $0) >&2
    exit 1
fi
IN_FILE=/Users/x34/data/$1
OUT_FILE=/Users/x34/data/$2
cd "/Users/x34/Projects/externalprogram/model/"
./vx < ${IN_FILE} > ${OUT_FILE}
paste ${IN_FILE} ${OUT_FILE} | awk '{print $1,"\t",$2,"\t",$3,"\t",$4,"\t",$5,"\t",$22,"\t",$23,"\t",$24}' > /Users/x34/data/combined
if [ $? -ne 0 ]; then
    exit 1
fi
exit 0
And then I run it with
ProcessBuilder pb = new ProcessBuilder("SHELL_SCRIPT","in", "out");
Process p = pb.start();
I would much appreciate any idea of how to use Hadoop streaming, or any other way, to run the external program. I want both the INPUT and OUTPUT files in HDFS for further processing. Please help.
So, assuming that your external program doesn't know how to read from HDFS, what you will want to do is load the file from Java and pass it directly to the program as its input:
Path path = new Path("hdfs/path/to/input/file");
FileSystem fs = FileSystem.get(configuration);
FSDataInputStream fin = fs.open(path);

ProcessBuilder pb = new ProcessBuilder("SHELL_SCRIPT");
Process p = pb.start();

// pipe the HDFS file line by line into the external program's stdin
BufferedReader br = new BufferedReader(new InputStreamReader(fin));
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(p.getOutputStream()));
String line;
while ((line = br.readLine()) != null) {
    writer.write(line);
    writer.newLine();
}
writer.close();   // closing stdin signals EOF to the external program
br.close();
The output can be handled in the reverse manner: get the InputStream from the process and write it to an FSDataOutputStream on HDFS.
Essentially, with these two pieces your program becomes an adapter that streams HDFS input into the external program and the program's output back into HDFS.
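For example, a minimal sketch of that reverse direction, reusing fs and p from the snippet above (the output path here is only illustrative):
Path outPath = new Path("hdfs/path/to/output/file");
FSDataOutputStream fout = fs.create(outPath);

// read the external program's stdout and copy it line by line into HDFS
BufferedReader procOut = new BufferedReader(new InputStreamReader(p.getInputStream()));
BufferedWriter hdfsWriter = new BufferedWriter(new OutputStreamWriter(fout));
String outLine;
while ((outLine = procOut.readLine()) != null) {
    hdfsWriter.write(outLine);
    hdfsWriter.newLine();
}
hdfsWriter.close();
procOut.close();
p.waitFor();   // wait for the external program to exit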
You could employ Hadoop Streaming for that:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myPythonScript.py \
-reducer /bin/wc \
-file myPythonScript.py \
-file myDictionary.txt
See https://hadoop.apache.org/docs/r1.0.4/streaming.pdf for some examples.
There is also a nice article: http://princetonits.com/blog/technology/hadoop-mapreduce-streaming-using-bash-script/
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Another example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /bin/wc
In the above example, both the mapper and the reducer are executables that read the input from stdin (line by line) and emit the output to stdout. The utility will create a Map/Reduce job, submit the job to an appropriate cluster, and monitor the progress of the job until it completes.
When an executable is specified for mappers, each mapper task will launch the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. If there is no tab character in the line, then the entire line is considered the key and the value is null. However, this can be customized.
When an executable is specified for reducers, each reducer task will launch the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process, converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized.
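As a rough illustration of that default convention (a simplified sketch, not the framework's actual code), the split of an output line into key and value at the first tab behaves like this:
public class TabSplitExample {
    // Mimics streaming's default rule: key = text up to the first tab,
    // value = the rest of the line; no tab means the whole line is the key.
    static String[] splitLine(String line) {
        int tab = line.indexOf('\t');
        if (tab == -1) {
            return new String[] { line, null };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitLine("key1\t4.1\tmore ...");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);  // key=key1, value=4.1<tab>more ...

        kv = splitLine("no-tab-here");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);  // whole line is the key, value is null
    }
}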
