Sort the output in Hadoop MapReduce - hadoop

I am a beginner using Hadoop and I want to read a text file through MapReduce and output it. I have set up a counter to see the order of the data, but the output is not in order. Here are my code and screenshot.
Question: How can we sort the output based on the value of the key?
Sample input data in text file:
199907 21 22 23 24 25
199808 26 27 28 29 30
199909 31 32 33 34 35
200010 36 37 38 39 40
200411 41 42 43 44 45
Mapper
public static class TestMapper
extends Mapper<LongWritable, Text, Text, Text>{
int days = 1;
@Override
public void map(LongWritable key, Text value, Context context
) throws IOException, InterruptedException {
/* get the file name */
FileSplit fileSplit = (FileSplit)context.getInputSplit();
String filename = fileSplit.getPath().getName();
//context.write(new Text(filename), new Text(""));
StringTokenizer token = new StringTokenizer(value.toString());
String yearMonth = token.nextToken();
if(Integer.parseInt(yearMonth) ==0)
return;
while(token.hasMoreTokens()){
context.write(new Text(yearMonth+" "+days),new Text(token.nextToken()));
}
days++;
}
}
Reducer
public static class TestReducer
extends Reducer<Text,Text,Text,Text> {
@Override
public void reduce(Text key, Iterable<Text> values,Context context)
throws IOException, InterruptedException {
ArrayList<String> valList = new ArrayList<String>();
for(Text val: values)
valList.add(val.toString());
context.write(key,new Text(valList.toString()));
}
}
Driver/Main class
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length < 2) {
System.err.println("Usage: Class name <in> [<in>...] <out>");
System.exit(2);
}
Job job = Job.getInstance(conf, "My Class");
job.addFileToClassPath(new Path("/myPath"));
job.setJarByClass(myJar.class);
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputDirRecursive(job, true);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
Screenshot of part of the output (image not included).

Related

MultipleOutputFormat does only one iteration in Reduce step

Here is my Reducer. The Reducer takes in EdgeWritable and NullWritable.
An EdgeWritable holds 4 integers, say <71, 74, 7, 2000>:
the communication is from 71 (FromID) to 74 (ToID) in 7 (July) of 2000 (Year).
The Mapper outputs 10787 such records to the Reducer, but the Reducer outputs only 1.
I need to output 44 files, one for each of the 44 months between Oct 1998 and July 2002. The output should be in the format "out"+month+year; for example, July 2002 records will be in the file out72002.
I have debugged the code. After one iteration it outputs one file and stops without taking the next record. Please suggest how I should use MultipleOutputs. Thanks.
public class MultipleOutputReducer extends Reducer<EdgeWritable, NullWritable, IntWritable, IntWritable>{
private MultipleOutputs<IntWritable,IntWritable> multipleOutputs;
protected void setup(Context context) throws IOException, InterruptedException{
multipleOutputs = new MultipleOutputs<IntWritable, IntWritable>(context);
}
@Override
public void reduce(EdgeWritable key, Iterable val , Context context) throws IOException, InterruptedException {
int year = key.get(3).get();
int month= key.get(2).get();
int to = key.get(1).get();
int from = key.get(0).get();
//if(year >= 1997 && year <= 2001){
if((month >= 9 && year >= 1997) || (month <= 6 && year <= 2001)){
multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out"+month+year );
}
//}
}
@Override
public void cleanup(Context context) throws IOException, InterruptedException{
multipleOutputs.close();
}
Driver
public class TimeSlicingDriver extends Configured implements Tool{
static final SimpleDateFormat sdf = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");
public int run(String[] args) throws Exception {
if(args.length != 2){
System.out.println("Enter <input path> <output path>");
System.exit(-1);
}
Configuration setup = new Configuration();
//setup.set("Input Path", args[0]);
Job job = new Job(setup, "Time Slicing");
//job.setJobName("Time Slicing");
job.setJarByClass(TimeSlicingDriver.class);
job.setMapperClass(TimeSlicingMapper.class);
job.setReducerClass(MultipleOutputReducer.class);
//MultipleOutputs.addNamedOutput(setup, "output", org.apache.hadoop.mapred.TextOutputFormat.class, EdgeWritable.class, NullWritable.class);
job.setMapOutputKeyClass(EdgeWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
/**Set the Input File Path and output file path*/
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true)?0:1;
}
You are not iterating over your Iterable "val"; for that reason the logic in your code is executed only once for each group.
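A minimal sketch of the reduce method with the group actually iterated, keeping everything else from your original code (the month/year filter is left as a placeholder comment):
@Override
public void reduce(EdgeWritable key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
    int year = key.get(3).get();
    int month = key.get(2).get();
    int to = key.get(1).get();
    int from = key.get(0).get();
    // Write one record per value in the group instead of one per group.
    for (NullWritable ignored : values) {
        // apply your existing month/year filter here
        multipleOutputs.write(new IntWritable(from), new IntWritable(to), "out" + month + year);
    }
}
With the loop in place, the reducer emits one output record for every value the mappers sent for that key, rather than a single record per group.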

Bloom Filter in MapReduce

I have to use a Bloom filter in the reduce-side join algorithm to filter one of my inputs, but I have a problem with the readFields function that de-serialises the input stream of the distributed cache file (the Bloom filter) into a Bloom filter.
public class BloomJoin {
//function map : input transaction.txt
public static class TransactionJoin extends
Mapper<LongWritable, Text, Text, Text> {
private Text CID=new Text();
private Text outValue=new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String record[] = line.split(",", -1);
CID.set(record[1]);
outValue.set("A"+value);
context.write(CID, outValue);
}
}
//function map : input customer.txt
public static class CustomerJoinMapper extends
Mapper<LongWritable, Text, Text, Text> {
private Text outkey=new Text();
private Text outvalue = new Text();
private BloomFilter bfilter = new BloomFilter();
public void setup(Context context) throws IOException {
URI[] files = DistributedCache
.getCacheFiles(context.getConfiguration());
// if the files in the distributed cache are set
if (files != null) {
System.out.println("Reading Bloom filter from: "
+ files[0].getPath());
// Open local file for read.
DataInputStream strm = new DataInputStream(new FileInputStream(
files[0].toString()));
bfilter.readFields(strm);
strm.close();
// Read into our Bloom filter.
} else {
throw new IOException(
"Bloom filter file not set in the DistributedCache.");
}
};
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String record[] = line.split(",", -1);
outkey.set(record[0]);
if (bfilter.membershipTest(new Key(outkey.getBytes()))) {
outvalue.set("B"+value);
context.write(outkey, outvalue);
}
}
}
//function reducer: join customer with transaction
public static class JoinReducer extends
Reducer<Text, Text, Text, Text> {
private ArrayList<Text> listA = new ArrayList<Text>();
private ArrayList<Text> listB = new ArrayList<Text>();
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
listA.clear();
listB.clear();
for (Text t : values) {
if (t.charAt(0) == 'A') {
listA.add(new Text(t.toString().substring(1)));
System.out.println("liste A: "+listA);
} else /* if (t.charAt('0') == 'B') */{
listB.add(new Text(t.toString().substring(1)));
System.out.println("listeB :"+listB);
}
}
executeJoinLogic(context);
}
private void executeJoinLogic(Context context) throws IOException,
InterruptedException {
if (!listA.isEmpty() && !listB.isEmpty()) {
for (Text A : listB) {
for (Text B : listA) {
context.write(A, B);
System.out.println("A="+A+",B="+B);
}
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Path bloompath=new Path("/user/biadmin/ezzaki/bloomfilter/output/part-00000");
DistributedCache.addCacheFile(bloompath.toUri(),conf);
Job job = new Job(conf, "Bloom Join");
job.setJarByClass(BloomJoin.class);
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 3) {
System.err
.println("ReduceSideJoin <Transaction data> <Customer data> <out> ");
System.exit(1);
}
MultipleInputs.addInputPath(job, new Path(otherArgs[0]),
TextInputFormat.class,TransactionJoin.class);
MultipleInputs.addInputPath(job, new Path(otherArgs[1]),
TextInputFormat.class, CustomerJoinMapper.class);
job.setReducerClass(JoinReducer.class);
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
//job.setMapOutputKeyClass(Text.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 3);
}
}
How can I solve this problem?
Can you try changing
URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
to
Path[] cacheFilePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
FileSystem fs = FileSystem.getLocal(context.getConfiguration());
for (Path cacheFilePath : cacheFilePaths) {
DataInputStream fileInputStream = fs.open(cacheFilePath);
bloomFilter.readFields(fileInputStream);
fileInputStream.close();
}
Also, I think you are effectively doing a map-side join rather than a reduce-side one, since you are using the DistributedCache in the Mapper.
You can use a Bloom Filter from here:
https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilter.java
It comes with a dedicated serializer:
https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilterSerializer.java
You can serialize like this:
Path file = new Path(bloomFilterPath);
FileSystem hdfs = file.getFileSystem(context.getConfiguration());
OutputStream os = hdfs.create(file);
BloomFilterSerializer serializer = new BloomFilterSerializer();
serializer.serialize(bloomFilter, new DataOutputStream(os));
And deserialize:
Path path = new Path(bloomFilterPath);
InputStream is = path.getFileSystem(context.getConfiguration()).open(path);
BloomFilterSerializer serializer = new BloomFilterSerializer();
BloomFilter bloomFilter = serializer.deserialize(
new DataInputStream(new BufferedInputStream(is)));
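Putting it together, a sketch of how that deserialization could be wired into the mapper's setup(), assuming the Cassandra BloomFilter/BloomFilterSerializer classes linked above and a filter file shipped via the DistributedCache:
private BloomFilter bloomFilter;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // Local copies of the files registered with DistributedCache.addCacheFile(...)
    Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cacheFiles == null || cacheFiles.length == 0) {
        throw new IOException("Bloom filter file not set in the DistributedCache.");
    }
    InputStream is = new FileInputStream(cacheFiles[0].toString());
    BloomFilterSerializer serializer = new BloomFilterSerializer();
    bloomFilter = serializer.deserialize(new DataInputStream(new BufferedInputStream(is)));
    is.close();
}
The map() method would then test membership against bloomFilter as before; the exact membership-test method depends on which BloomFilter class you use.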

First Hadoop program using map and reducer

I'm trying to compile my first Hadoop program. As input I have a file that looks something like this:
1 54875451 2015 LA89LP
2 47451451 2015 LA89LP
3 878451 2015 LA89LP
4 54875 2015 LA89LP
5 2212 2015 LA89LP
When I run it I get map 100%, reduce 0% and a java.lang.Exception: java.util.NoSuchElementException with a long stack trace, including:
java.util.NoSuchElementException
java.util.StringTokenizer.nextToken(StringTokenizer.java:349)
I don't really understand why. Any help is really appreciated.
My Mapper and Reducer look like this:
public class Draft {
public static class TokenizerMapper extends Mapper<Object, Text, Text, Text>{
private Text word = new Text();
private Text word2 = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
String id = itr.nextToken();
String price = itr.nextToken();
String dateTransfer = itr.nextToken();
String postcode = itr.nextToken();
word.set(postcode);
word2.set(price);
context.write(word, word2);
}
}
}
public static class MaxReducer extends Reducer<Text,Text,Text,Text> {
private Text word = new Text();
private Text word2 = new Text();
public void reduce(Text key, Iterable<Text> values, Context context
) throws IOException, InterruptedException {
String max = "0";
HashSet<String> S = new HashSet<String>();
for (Text val: values) {
String d = key.toString();
String price = val.toString();
if (S.contains(d)) {
if (Integer.parseInt(price)>Integer.parseInt(max)) max = price;
} else {
S.add(d);
max = price;
}
}
word.set(key.toString());
word2.set(max);
context.write(word, word2);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Draft");
job.setJarByClass(Draft.class);
job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(MaxReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class); // output key type for mapper
job.setOutputValueClass(Text.class); // output value type for mapper
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
This error occurs when some of your records have fewer than 4 fields. The code in your mapper assumes that each record contains 4 fields: id, price, dateTransfer and postcode.
But some of the records may not contain all 4 fields.
For example, if the record is:
1 54875451 2015
then the following line will throw an exception (java.util.NoSuchElementException):
String postcode = itr.nextToken();
You are trying to assign postcode (which is assumed to be the 4th field), but there are only 3 fields in the input record.
To overcome this problem, you need to change the string tokenizer code in your map() method. Since you are emitting only postcode and price from the map(), you can change your code as below:
String[] tokens = value.toString().split(" ");
String price = "";
String postcode = "";
if(tokens.length >= 2)
price = tokens[1];
if(tokens.length >= 4)
postcode = tokens[3];
if(!price.isEmpty())
{
word.set(postcode);
word2.set(price);
context.write(word, word2);
}

cleaning data using mapreduce program

I have a data file with 30 lines. I am trying to clean the data using a MapReduce program. The data is cleaned properly, but only one line out of the 30 is displayed. I guess the record reader is not reading line by line here. Could you please check my code and let me know where the problem is? I am new to Hadoop.
Data :-
1 Vlan154.DEL-ISP-COR-SWH-002.mantraonline.com (61.95.250.140) 0.460 ms 0.374 ms 0.351 ms
2 202.56.223.213 (202.56.223.213) 39.718 ms 39.511 ms 39.559 ms
3 202.56.223.17 (202.56.223.17) 39.714 ms 39.724 ms 39.628 ms
4 125.21.167.153 (125.21.167.153) 41.114 ms 40.001 ms 39.457 ms
5 203.208.190.65 (203.208.190.65) 120.340 ms 71.384 ms 71.346 ms
6 ge-0-1-0-0.sngtp-dr1.ix.singtel.com (203.208.149.158) 71.493 ms ge-0-1-2-0.sngtp-dr1.ix.singtel.com (203.208.149.210) 71.183 ms ge-0-1-0-0.sngtp-dr1.ix.singtel.com (203.208.149.158) 71.739 ms
7 ge-0-0-0-0.sngtp-ar3.ix.singtel.com (203.208.182.2) 80.917 ms ge-2-0-0-0.sngtp-ar3.ix.singtel.com (203.208.183.20) 71.550 ms ge-1-0-0-0.sngtp-ar3.ix.singtel.com (203.208.182.6) 71.534 ms
8 203.208.151.26 (203.208.151.26) 141.716 ms 203.208.145.190 (203.208.145.190) 134.740 ms 203.208.151.26 (203.208.151.26) 142.453 ms
9 219.158.3.225 (219.158.3.225) 138.774 ms 157.205 ms 157.123 ms
10 219.158.4.69 (219.158.4.69) 156.865 ms 157.044 ms 156.845 ms
11 202.96.12.62 (202.96.12.62) 157.109 ms 160.294 ms 159.805 ms
12 61.148.3.58 (61.148.3.58) 159.521 ms 178.088 ms 160.004 ms
MPLS Label=33 CoS=5 TTL=1 S=0
13 202.106.48.18 (202.106.48.18) 199.730 ms 181.263 ms 181.300 ms
14 * * *
15 * * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
mapreduce program:-
public class TraceRouteDataCleaning {
/**
* @param args
* @throws IOException
* @throws InterruptedException
* @throws ClassNotFoundException
*/
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration conf = new Configuration();
String userArgs[] = new GenericOptionsParser(conf, args).getRemainingArgs();
if (userArgs.length < 2) {
System.out.println("Usage: hadoop jar jarfilename mainclass input output");
System.exit(1);
}
Job job = new Job(conf, "cleaning trace route data");
job.setJarByClass(TraceRouteDataCleaning.class);
job.setMapperClass(TraceRouteMapper.class);
job.setReducerClass(TraceRouteReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(userArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(userArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static class TraceRouteMapper extends Mapper<LongWritable, Text, Text, Text>{
StringBuilder emitValue = null;
StringBuilder emitKey = null;
Text kword = new Text();
Text vword = new Text();
public void map(LongWritable key, Text value, Context context) throws InterruptedException, IOException
{
// String[] cleanData;
String lines = value.toString();
//deleting ms in RTT time data
lines = lines.replace(" ms", "");
String[] data = lines.split(" ");
emitValue = new StringBuilder(1024);
emitKey = new StringBuilder(1024);
if (data.length == 6) {
emitKey.append(data[0]);
emitValue.append(data[1]).append("\t").append(data[2]).append("\t").append(data[3]).append("\t").append(data[4]).append("\t").append(data[5]);
kword.set(emitKey.toString());
vword.set(emitValue.toString());
context.write(kword, vword);
}
}
}
public static class TraceRouteReducer extends Reducer<Text, Text, Text, Text>{
Text vword = new Text();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
context.write(key,vword);
}
}
}
First, your reducer class should be as below, based on your requirement: if each key has only a single value, opt for the first reducer; otherwise choose the second one.
public static class TraceRouteReducer extends Reducer<Text, Text, Text, Text>{
Text vword = new Text();
public void reduce(Text key, Text values, Context context) throws IOException, InterruptedException{
vword=values;
/*
for (Iterator iterator = values.iterator(); iterator.hasNext();) {
vword.set(iterator.next().toString());
System.out.println("printing " +vword.toString());
}*/
context.write(key,vword);
}
}
----------or------------
public static class TraceRouteReducer extends Reducer<Text, Text, Text, Text>{
Text vword = new Text();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
for (Text value : values) {
vword.set(value);
context.write(key, vword);
}
}
}
Second, in your mapper you are splitting on a single space, which is not reliable when fields are separated by more than one space. Split on the "\\s+" regular expression instead:
String[] data = lines.split("\\s+");
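With that change in place, a sketch of how the map() method could look; the " ms" stripping and six-field check are kept from your original, and trim() is added here (an assumption) to drop any leading whitespace before the hop number:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // strip the " ms" suffixes, then split on runs of whitespace
    String lines = value.toString().trim().replace(" ms", "");
    String[] data = lines.split("\\s+");
    if (data.length == 6) {
        kword.set(data[0]);
        vword.set(data[1] + "\t" + data[2] + "\t" + data[3] + "\t" + data[4] + "\t" + data[5]);
        context.write(kword, vword);
    }
}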

Getting error:- Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable

I have written a MapReduce job for log file analysis. My mapper outputs Text both as key and as value, and I have explicitly set the map output classes in my driver class.
But I still get the error: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
public class CompositeUserMapper extends Mapper<LongWritable, Text, Text, Text> {
IntWritable a = new IntWritable(1);
//Text txt = new Text();
@Override
protected void map(LongWritable key, Text value,
Context context)
throws IOException, InterruptedException {
String line = value.toString();
Pattern p = Pattern.compile("\bd{8}\b");
Matcher m = p.matcher(line);
String userId = "";
String CompositeId = "";
if(m.find()){
userId = m.group(1);
}
CompositeId = line.substring(line.indexOf("compositeId :")+13).trim();
context.write(new Text(CompositeId),new Text(userId));
// TODO Auto-generated method stub
super.map(key, value, context);
}
My Driver class is as below:-
public class CompositeUserDriver extends Configured implements Tool {
public static void main(String[] args) throws Exception {
CompositeUserDriver wd = new CompositeUserDriver();
int res = ToolRunner.run(wd, args);
System.exit(res);
}
public int run(String[] arg0) throws Exception {
// TODO Auto-generated method stub
Job job=new Job();
job.setJarByClass(CompositeUserDriver.class);
job.setJobName("Composite UserId Count" );
FileInputFormat.addInputPath(job, new Path(arg0[0]));
FileOutputFormat.setOutputPath(job, new Path(arg0[1]));
job.setMapperClass(CompositeUserMapper.class);
job.setReducerClass(CompositeUserReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
//return 0;
}
}
Please advise how I can sort this problem out.
Remove the super.map(key, value, context); line from your mapper code: it calls the map method of the parent class, which is the identity mapper that emits the key and value passed to it unchanged; in this case the key is the LongWritable byte offset from the beginning of the file, which is what triggers the type mismatch.
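A sketch of the mapper with that line removed; the regex is also escaped and given a capturing group here, which the original m.group(1) call needs (assuming an eight-digit user id):
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    Pattern p = Pattern.compile("\\b(\\d{8})\\b");   // escaped \b and \d, with a capturing group
    Matcher m = p.matcher(line);
    String userId = "";
    if (m.find()) {
        userId = m.group(1);
    }
    String compositeId = line.substring(line.indexOf("compositeId :") + 13).trim();
    context.write(new Text(compositeId), new Text(userId));
    // no super.map(...) call, so only Text/Text pairs reach the output collector
}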
