Hadoop: not all values get assembled for one key

I have some data that I would like to aggregate by key using Mapper code and then perform something on all values that belong to a key using Reducer code. For example if I have:
key = 1, val = 1,
key = 1, val = 2,
key = 1, val = 3
I would like to get key=1, val=[1,2,3] in my Reducer.
The thing is, I get something like
key = 1, val=[1,2]
key = 1, val=[3]
Why is that so?
I thought that all the values for one specific key would be assembled in one reducer, but now it seems there can be several key, val[] pairs for the same key because there can be multiple reducers. Is that so?
Should I set the number of reducers to 1?
I'm new to Hadoop so this confuses me.
Here's the code
public class SomeJob {
public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
{
Job job = new Job();
job.setJarByClass(SomeJob.class);
FileInputFormat.addInputPath(job, new Path("/home/pera/data/input/some.csv"));
FileOutputFormat.setOutputPath(job, new Path("/home/pera/data/output"));
job.setMapperClass(SomeMapper.class);
job.setReducerClass(SomeReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
}
}
public class SomeMapper extends Mapper<LongWritable, Text, Text, Text>{
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String parts[] = line.split(";");
context.write(new Text(parts[0]), new Text(parts[4]));
}
}
public class SomeReducer extends Reducer<Text, Text, Text, Text>{
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String properties = "";
for(Text value : values)
{
properties += value + " ";
}
context.write(key, new Text(properties));
}
}
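For reference, Hadoop's default HashPartitioner picks the reducer from the key's hash code, so equal keys always land in the same partition no matter how many reducers run. The sketch below mirrors that formula in plain Java, without Hadoop dependencies; the class and method names are made up for illustration, and String.hashCode() stands in for Text.hashCode(). It also shows one thing worth ruling out: keys that only look equal (here "1" and "1 " with a trailing space) are distinct keys and form separate groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner; String.hashCode()
    // stands in for Text.hashCode() (an assumption for illustration).
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups values by exact key, similar to what reduce() receives per key.
    public static Map<String, List<String>> group(String[][] pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The third key has a trailing space, so it is a different key.
        String[][] pairs = {{"1", "1"}, {"1", "2"}, {"1 ", "3"}};
        System.out.println(group(pairs));
        // Identical keys always map to the same partition.
        System.out.println(partitionFor("1", 4) == partitionFor("1", 4));
    }
}
```

If the grouping still splits after trimming the keys in the mapper, the cause lies elsewhere and the sketch does not cover it.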

Related

Mapping with key as Text is not working, but parsing it to Intwritable works

I'm a beginner in Hadoop, and to learn I started doing an outer join on two tables.
One has details about movies and the other has ratings.
Sample data for the movies table:
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
Sample data for the ratings table:
userId,movieId,rating,timestamp
1,122,2.0,945544824
1,172,1.0,945544871
1,1221,5.0,945544788
1,1441,4.0,945544871
1,1609,3.0,945544824
1,1961,3.0,945544871
1,1972,1.0,945544871
2,441,2.0,1008942733
2,494,2.0,1008942733
2,1193,4.0,1008942667
2,1597,3.0,1008942773
2,1608,3.0,1008942733
2,1641,4.0,1008942733
movieId is the primary key in the movies table and a foreign key in the ratings table, so I used movieId as the key in the mapper classes. I have used two mappers, one for the movies table and the other for the ratings table.
The code that I have written:
public class Join {
public static class MovMapper
extends Mapper<Object, Text, Text, Text>{
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String[] arr= value.toString().split(",");
word.set(arr[0]);
//System.out.println(word.toString()+ " mov");
context.write(word, value);
}
}
public static class RatMapper
extends Mapper<Object, Text, Text, Text>{
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String[] arr= value.toString().split(",");
word.set(arr[1]);
//System.out.println(word.toString() + " rat");
context.write(word, value);
}
}
public static class JoinReducer
extends Reducer<Text,Text,Text,Text> {
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException {
List <Text> rat=new ArrayList<Text>();
Text mov= null;
System.out.println("#######################################################################################");
for(Text item:values){
if(item.toString().split(",").length == 3){
mov= new Text(item);
}
else
rat.add(new Text(item));
System.out.println("---->" + item);
}
System.out.println("item cnt: "+rat.size()+" mov"+mov+" key"+key+" byte: "+key.getBytes().toString());
for(Text item:rat){
if(mov != null) {
context.write(item,mov);
}
}
System.out.println("#######################################################################################");
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "join");
job.setJarByClass(Join.class);
job.setCombinerClass(JoinReducer.class);
job.setReducerClass(JoinReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
MultipleInputs.addInputPath(job, new Path(args[0]),TextInputFormat.class,MovMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),TextInputFormat.class,RatMapper.class);
FileOutputFormat.setOutputPath(job, new Path(args[2]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Problem: while mapping, records from the movies table and the ratings table are sent to different tasks even though the movieId is the same. Surprisingly, when I convert movieId to IntWritable, records from both tables with a matching key are sent to the same task.

Bloom Filter in MapReduce

I have to use a Bloom filter in the reduce-side join algorithm to filter one of my inputs, but I have a problem with the readFields function, which deserialises the input stream of a distributed cache file (the Bloom filter) into a Bloom filter object.
public class BloomJoin {
//function map : input transaction.txt
public static class TransactionJoin extends
Mapper<LongWritable, Text, Text, Text> {
private Text CID=new Text();
private Text outValue=new Text();
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String record[] = line.split(",", -1);
CID.set(record[1]);
outValue.set("A"+value);
context.write(CID, outValue);
}
}
//function map : input customer.txt
public static class CustomerJoinMapper extends
Mapper<LongWritable, Text, Text, Text> {
private Text outkey=new Text();
private Text outvalue = new Text();
private BloomFilter bfilter = new BloomFilter();
public void setup(Context context) throws IOException {
URI[] files = DistributedCache
.getCacheFiles(context.getConfiguration());
// if the files in the distributed cache are set
if (files != null) {
System.out.println("Reading Bloom filter from: "
+ files[0].getPath());
// Open local file for read.
DataInputStream strm = new DataInputStream(new FileInputStream(
files[0].toString()));
bfilter.readFields(strm);
strm.close();
// Read into our Bloom filter.
} else {
throw new IOException(
"Bloom filter file not set in the DistributedCache.");
}
};
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String record[] = line.split(",", -1);
outkey.set(record[0]);
if (bfilter.membershipTest(new Key(outkey.getBytes()))) {
outvalue.set("B"+value);
context.write(outkey, outvalue);
}
}
}
//function reducer: join customer with transaction
public static class JoinReducer extends
Reducer<Text, Text, Text, Text> {
private ArrayList<Text> listA = new ArrayList<Text>();
private ArrayList<Text> listB = new ArrayList<Text>();
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
listA.clear();
listB.clear();
for (Text t : values) {
if (t.charAt(0) == 'A') {
listA.add(new Text(t.toString().substring(1)));
System.out.println("liste A: "+listA);
} else /* if (t.charAt('0') == 'B') */{
listB.add(new Text(t.toString().substring(1)));
System.out.println("listeB :"+listB);
}
}
executeJoinLogic(context);
}
private void executeJoinLogic(Context context) throws IOException,
InterruptedException {
if (!listA.isEmpty() && !listB.isEmpty()) {
for (Text A : listB) {
for (Text B : listA) {
context.write(A, B);
System.out.println("A="+A+",B="+B);
}
}
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Path bloompath=new Path("/user/biadmin/ezzaki/bloomfilter/output/part-00000");
DistributedCache.addCacheFile(bloompath.toUri(),conf);
Job job = new Job(conf, "Bloom Join");
job.setJarByClass(BloomJoin.class);
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 3) {
System.err
.println("ReduceSideJoin <Transaction data> <Customer data> <out> ");
System.exit(1);
}
MultipleInputs.addInputPath(job, new Path(otherArgs[0]),
TextInputFormat.class,TransactionJoin.class);
MultipleInputs.addInputPath(job, new Path(otherArgs[1]),
TextInputFormat.class, CustomerJoinMapper.class);
job.setReducerClass(JoinReducer.class);
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
//job.setMapOutputKeyClass(Text.class);
//job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 3);
}
}
How can I solve this problem?
Can you try changing
URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
to
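Path[] cacheFilePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
FileSystem fs = FileSystem.getLocal(context.getConfiguration());
for (Path cacheFilePath : cacheFilePaths) {
// Read the filter inside the loop, where the stream is in scope.
DataInputStream fileInputStream = fs.open(cacheFilePath);
bloomFilter.readFields(fileInputStream);
fileInputStream.close();
}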
Also, I think you are using Map side join and not Reduce side since you are using the Distributed cache in Mapper.
You can use a Bloom Filter from here:
https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilter.java
It goes with dedicated serializer:
https://github.com/odnoklassniki/apache-cassandra/blob/master/src/java/org/apache/cassandra/utils/BloomFilterSerializer.java
You can serialize like this:
Path file = new Path(bloomFilterPath);
FileSystem hdfs = file.getFileSystem(context.getConfiguration());
OutputStream os = hdfs.create(file);
BloomFilterSerializer serializer = new BloomFilterSerializer();
serializer.serialize(bloomFilter, new DataOutputStream(os));
And deserialize:
// Open the serialized filter directly from HDFS.
Path path = new Path(bloomFilterPath);
InputStream is = path.getFileSystem(context.getConfiguration()).open(path);
BloomFilterSerializer serializer = new BloomFilterSerializer();
BloomFilter bloomFilter = serializer.deserialize(
new DataInputStream(new BufferedInputStream(is)));
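Whichever Bloom filter class is used, the reader has to consume exactly the byte layout that the writer produced. The following plain-Java sketch shows such a round trip through DataOutputStream and DataInputStream. TinyBloomFilter is a hypothetical toy class written for this illustration; it is neither Hadoop's nor Cassandra's BloomFilter, and its hashing scheme is a simple assumption, not what those libraries use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

public class TinyBloomFilter {
    private final int numBits;
    private final int numHashes;
    private BitSet bits;

    public TinyBloomFilter(int numBits, int numHashes) {
        this.numBits = numBits;
        this.numHashes = numHashes;
        this.bits = new BitSet(numBits);
    }

    // Derives the i-th bit position from two simple hashes (double hashing).
    private int position(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String key) {
        byte[] raw = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(raw, i));
        }
    }

    // May return false positives, never false negatives.
    public boolean mightContain(String key) {
        byte[] raw = key.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(raw, i))) {
                return false;
            }
        }
        return true;
    }

    // Layout: numBits, numHashes, byte count, then the raw bit set bytes.
    public void write(DataOutputStream out) throws IOException {
        byte[] raw = bits.toByteArray();
        out.writeInt(numBits);
        out.writeInt(numHashes);
        out.writeInt(raw.length);
        out.write(raw);
    }

    // Reads the fields back in exactly the order write() emitted them.
    public static TinyBloomFilter read(DataInputStream in) throws IOException {
        int numBits = in.readInt();
        int numHashes = in.readInt();
        byte[] raw = new byte[in.readInt()];
        in.readFully(raw);
        TinyBloomFilter filter = new TinyBloomFilter(numBits, numHashes);
        filter.bits = BitSet.valueOf(raw);
        return filter;
    }

    public static void main(String[] args) throws IOException {
        TinyBloomFilter filter = new TinyBloomFilter(1024, 3);
        filter.add("customer-42");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        filter.write(new DataOutputStream(buffer));
        TinyBloomFilter copy = read(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(copy.mightContain("customer-42"));
    }
}
```

The same principle applies to the cached file in the question: if it was written in a different format than readFields expects, deserialization will fail or yield a useless filter.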

Can I get a Partition number of Hadoop?

I am a Hadoop newbie.
I want to print the partition number in the output file.
First, I wrote a custom partitioner.
public static class MyPartitioner extends Partitioner<Text, LongWritable> {
public int getPartition(Text key, LongWritable value, int numReduceTasks) {
int numOfChars = key.toString().length();
return numOfChars % numReduceTasks;
}
}
It works, but I want to output the partition numbers 'visually' in the Reducer.
How can I get the partition number?
Below is my reducer source.
public static class MyReducer extends Reducer<Text, LongWritable, Text, Text>{
private Text textList = new Text();
public void reduce(Text key, Iterable<LongWritable> values, Context context)
throws IOException, InterruptedException {
String list = new String();
for(LongWritable value: values) {
list = new String(list + "\t" + value.toString());
}
textList.set(list);
context.write(key, textList);
}
}
I want to append the partition number to each entry in 'list'. It will be either '0' or '1'.
list = new String(list + "\t" + value.toString() + "\t" + ??);
It would be great if someone could help me.
Update:
Thanks to the answer, I got a solution, but it didn't work, and I think I did something wrong.
Below is the modified MyPartitioner.
public static class MyPartitioner extends Partitioner {
public int getPartition(Text key, LongWritable value, int numReduceTasks) {
int numOfChars = key.toString().length();
return numOfChars % numReduceTasks;
private int bring_num = 0;
public void configure(JobConf job) {
bring_num = jobConf.getInt(numOfChars & numReduceTasks);
}
}
}
Add the code below to the Reducer class to store the partition number in a field, which can then be used in the reduce method.
String partition;
protected void setup(Context context) throws IOException,
InterruptedException {
Configuration conf = context.getConfiguration();
partition = conf.get("mapred.task.partition");
}
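Because each reduce task handles exactly one partition, the number read in setup() is the same value the partitioner returned for every key that task receives. The plain-Java sketch below reproduces the partitioning rule from the question and the requested output format without Hadoop dependencies; the class and method names are hypothetical. In newer Hadoop releases the property is, as far as I know, also available under the name mapreduce.task.partition.

```java
import java.util.Arrays;
import java.util.List;

public class PartitionNumberSketch {
    // Same rule as MyPartitioner in the question: key length modulo reducers.
    public static int getPartition(String key, int numReduceTasks) {
        return key.length() % numReduceTasks;
    }

    // Appends the partition number after each value, tab-separated,
    // which is the format the question asks for.
    public static String format(List<Long> values, int partition) {
        StringBuilder list = new StringBuilder();
        for (long value : values) {
            list.append('\t').append(value).append('\t').append(partition);
        }
        return list.toString();
    }

    public static void main(String[] args) {
        int numReduceTasks = 2;
        for (String key : Arrays.asList("ab", "abc")) {
            int partition = getPartition(key, numReduceTasks);
            System.out.println(key + format(Arrays.asList(5L, 7L), partition));
        }
    }
}
```

In the real reducer, the partition argument would come from the partition field set in setup() rather than from calling the partitioner again.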

Writing a value to file without moving to reducer

I have an input of records like this,
a|1|Y,
b|0|N,
c|1|N,
d|2|Y,
e|1|Y
Now, in the mapper, I have to check the value of the third column. If it is 'Y', the record has to be written directly to the output file without being sent to the reducer; otherwise, i.e. for records with the value 'N', the record has to go to the reducer for further processing.
So,
a|1|Y,
d|2|Y,
e|1|Y
should not go to reducer but
b|0|N,
c|1|N
should go to reducer and then to output file.
How can I do this?
What you can probably do is use MultipleOutputs to separate the 'Y' and 'N' records into two different files in the mappers.
Next, run separate jobs for the two newly generated 'Y' and 'N' data sets.
For the 'Y' records, set the number of reducers to 0 so that reducers aren't used. For the 'N' records, process them the way you want using reducers.
Hope this helps.
See if this works,
public class Xxxx {
public static class MyMapper extends
Mapper<LongWritable, Text, LongWritable, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
FileSystem fs = FileSystem.get(context.getConfiguration());
Random r = new Random();
FileSplit split = (FileSplit)context.getInputSplit();
String fileName = split.getPath().getName();
FSDataOutputStream out = fs.create(new Path(fileName + "-m-" + r.nextInt()));
String parts[];
String line = value.toString();
String[] splits = line.split(",");
for(String s : splits) {
parts = s.split("\\|");
if(parts[2].equals("Y")) {
out.writeBytes(line);
}else {
context.write(key, value);
}
}
out.close();
fs.close();
}
}
public static class MyReducer extends
Reducer<LongWritable, Text, LongWritable, Text> {
public void reduce(LongWritable key, Iterable<Text> values,
Context context) throws IOException, InterruptedException {
for(Text t : values) {
context.write(key, t);
}
}
}
/**
* @param args
* @throws IOException
* @throws InterruptedException
* @throws ClassNotFoundException
*/
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
// TODO Auto-generated method stub
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:9000");
conf.set("mapred.job.tracker", "localhost:9001");
Job job = new Job(conf, "Xxxx");
job.setJarByClass(Xxxx.class);
Path outPath = new Path("/output_path");
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
FileInputFormat.addInputPath(job, new Path("/input.txt"));
FileOutputFormat.setOutputPath(job, outPath);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
In your map function, you get the input line by line. Split each line using | as the delimiter (with the String.split() method, to be exact). Because split() takes a regular expression, the | must be escaped.
It will look like this:
String[] line = value.toString().split("\\|");
Access the third element of this array with line[2].
Then, using a simple if-else statement, emit only the records with the value N for further processing.
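The split-and-branch step can be tried out in plain Java before wiring it into a mapper. This is a minimal sketch, assuming one record per line with an optional trailing comma as in the sample input; RecordRouter and its two lists are hypothetical stand-ins for "write directly to the output" and "send to the reducer".

```java
import java.util.ArrayList;
import java.util.List;

public class RecordRouter {
    // Records that would be written straight to the output file.
    public final List<String> direct = new ArrayList<>();
    // Records that would be emitted with context.write() for the reducer.
    public final List<String> toReducer = new ArrayList<>();

    public void route(String record) {
        String trimmed = record.trim();
        // The sample input ends each record with a comma; strip it if present.
        if (trimmed.endsWith(",")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        // "|" is a regex metacharacter, so it has to be escaped for split().
        String[] fields = trimmed.split("\\|");
        if (fields.length < 3) {
            return; // skip malformed records
        }
        if ("Y".equals(fields[2])) {
            direct.add(trimmed);
        } else {
            toReducer.add(trimmed);
        }
    }

    public static void main(String[] args) {
        RecordRouter router = new RecordRouter();
        for (String record : new String[] {"a|1|Y,", "b|0|N,", "c|1|N,", "d|2|Y,", "e|1|Y"}) {
            router.route(record);
        }
        System.out.println("direct: " + router.direct);
        System.out.println("to reducer: " + router.toReducer);
    }
}
```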

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter6 (listing 6.1 ~ 6.4) in the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
Mapper<LongWritable,Text,VarLongWritable,VarLongWritable> {
private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
@Override
public void map(LongWritable key,
Text value,
Context context)
throws IOException, InterruptedException {
String line = value.toString();
Matcher m = NUMBERS.matcher(line);
m.find();
VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
VarLongWritable itemID = new VarLongWritable();
while (m.find()) {
itemID.set(Long.parseLong(m.group()));
context.write(userID, itemID);
}
}
}
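The parsing step of this mapper can be checked on its own. The sketch below, with a hypothetical class name and a made-up input line, pulls every run of digits out of a line in the same way: the first number would be the user ID and the rest the item IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberExtraction {
    // In Java source the backslash must be doubled: "\\d" is the regex \d.
    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

    // Returns every run of digits in the line, in order of appearance.
    public static List<Long> extract(String line) {
        List<Long> numbers = new ArrayList<>();
        Matcher m = NUMBERS.matcher(line);
        while (m.find()) {
            numbers.add(Long.parseLong(m.group()));
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Made-up line: user ID first, then the item IDs.
        System.out.println(extract("98955: 590 22 9059"));
    }
}
```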
Reducer - 1
public class WikipediaToUserVectorReducer extends
Reducer<VarLongWritable,VarLongWritable,VarLongWritable,VectorWritable> {
@Override
public void reduce(VarLongWritable userID,
Iterable<VarLongWritable> itemPrefs,
Context context)
throws IOException, InterruptedException {
Vector userVector = new RandomAccessSparseVector(
Integer.MAX_VALUE, 100);
for (VarLongWritable itemPref : itemPrefs) {
userVector.set((int)itemPref.get(), 1.0f);
}
//LongWritable userID_lw = new LongWritable(userID.get());
context.write(userID, new VectorWritable(userVector));
//context.write(userID_lw, new VectorWritable(userVector));
}
}
The reducer outputs a userID and a userVector, which looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable> {
@Override
public void map(VarLongWritable userID,
VectorWritable userVector,
Context context)
throws IOException, InterruptedException {
Iterator<Vector.Element> it = userVector.get().iterateNonZero();
while (it.hasNext()) {
int index1 = it.next().index();
Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
while (it2.hasNext()) {
int index2 = it2.next().index();
context.write(new IntWritable(index1),
new IntWritable(index2));
}
}
}
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
Reducer<IntWritable,IntWritable,IntWritable,VectorWritable> {
@Override
public void reduce(IntWritable itemIndex1,
Iterable<IntWritable> itemIndex2s,
Context context)
throws IOException, InterruptedException {
Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
for (IntWritable intWritable : itemIndex2s) {
int itemIndex2 = intWritable.get();
cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
}
context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
}
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Job job_preferenceValues = new Job (getConf());
job_preferenceValues.setJarByClass(RecommenderJob.class);
job_preferenceValues.setJobName("job_preferenceValues");
job_preferenceValues.setInputFormatClass(TextInputFormat.class);
job_preferenceValues.setOutputFormatClass(SequenceFileOutputFormat.class);
FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
job_preferenceValues.setMapOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setMapOutputValueClass(VarLongWritable.class);
job_preferenceValues.setOutputKeyClass(VarLongWritable.class);
job_preferenceValues.setOutputValueClass(VectorWritable.class);
job_preferenceValues.setMapperClass(WikipediaToItemPrefsMapper.class);
job_preferenceValues.setReducerClass(WikipediaToUserVectorReducer.class);
job_preferenceValues.waitForCompletion(true);
Job job_cooccurence = new Job (getConf());
job_cooccurence.setJarByClass(RecommenderJob.class);
job_cooccurence.setJobName("job_cooccurence");
job_cooccurence.setInputFormatClass(SequenceFileInputFormat.class);
job_cooccurence.setOutputFormatClass(TextOutputFormat.class);
SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
job_cooccurence.setOutputKeyClass(IntWritable.class);
job_cooccurence.setOutputValueClass(VectorWritable.class);
job_cooccurence.setMapperClass(UserVectorToCooccurenceMapper.class);
job_cooccurence.setReducerClass(UserVectorToCooccurenceReducer.class);
job_cooccurence.waitForCompletion(true);
return 0;
}
public static void main(String[] args) throws Exception {
ToolRunner.run(new Configuration(), new RecommenderJob(), args);
}
}
The error that I get is:
java.io.IOException: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
In the course of Googling for a fix, I found out that my issue is similar to this question. But the difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob does more or less the same thing. According to my understanding and the Yahoo tutorial:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? Will really appreciate some pointers from someone.. I spent the day trying to fix this and got nowhere :(
Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
Mapper<VarLongWritable,VectorWritable,IntWritable,IntWritable>
But you define the following in your driver code:
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);
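The error message comes from a runtime check: the class of every key the mapper emits is compared with the class configured through setMapOutputKeyClass, and a mismatch raises the IOException quoted in the question. The sketch below imitates that check in plain Java. MapOutputTypeCheck is a hypothetical class, and Long and Integer stand in for VarLongWritable and IntWritable; it is a simplified model of the behaviour, not Hadoop's actual code.

```java
import java.io.IOException;

public class MapOutputTypeCheck {
    private final Class<?> expectedKeyClass;

    // expectedKeyClass plays the role of job.setMapOutputKeyClass(...).
    public MapOutputTypeCheck(Class<?> expectedKeyClass) {
        this.expectedKeyClass = expectedKeyClass;
    }

    // Plays the role of context.write(): rejects keys of the wrong class.
    public void collect(Object key) throws IOException {
        if (key.getClass() != expectedKeyClass) {
            throw new IOException("Type mismatch in key from map: expected "
                    + expectedKeyClass.getName() + ", received "
                    + key.getClass().getName());
        }
    }

    public static void main(String[] args) {
        // Driver declares Long (VarLongWritable), mapper emits Integer (IntWritable).
        MapOutputTypeCheck wrong = new MapOutputTypeCheck(Long.class);
        try {
            wrong.collect(Integer.valueOf(1));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
        // Driver declares the type the mapper really emits: no exception.
        MapOutputTypeCheck right = new MapOutputTypeCheck(Integer.class);
        try {
            right.collect(Integer.valueOf(1));
            System.out.println("accepted");
        } catch (IOException e) {
            System.out.println("unexpected: " + e.getMessage());
        }
    }
}
```

The map output classes set in the driver therefore describe what the mapper writes, not what it reads.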
