In the Hadoop word count example, IntWritable is made static so that it can be reused within the same JVM instead of a new object being created for every record. My question is: why not make the Text object static as well?
I tried it and it works fine, but I have never seen it done in any example. Am I missing something?
private ***static*** Text word = new Text();
private final static IntWritable intWritable = new IntWritable(1);
The original word count example:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
The OutputCollector API collects the key/value pairs output by mappers and reducers. Whether a variable should be shared or not depends on your logic and the kind of application you are writing. In the case of the WordCount program it works either way, because a mapper object does not share its state across multiple threads; making the Text static is therefore harmless here, it is simply unnecessary.
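The reuse pattern itself can be illustrated without Hadoop. Below, WordHolder is a hypothetical stand-in for org.apache.hadoop.io.Text: a mutable container whose contents are overwritten via set() instead of a fresh object being allocated per token.

```java
// Minimal sketch of the object-reuse pattern that Text/IntWritable follow.
// WordHolder is a hypothetical stand-in for Text: one allocation, reused
// for every token via set().
public class ReuseDemo {
    public static final class WordHolder {
        private String value;
        public void set(String v) { this.value = v; }
        public String get() { return value; }
    }

    public static void main(String[] args) {
        WordHolder word = new WordHolder();   // one allocation...
        String[] tokens = {"foo", "bar", "baz"};
        for (String t : tokens) {
            word.set(t);                      // ...reused for every token
            System.out.println(word.get());
        }
    }
}
```

As long as only one thread ever calls set(), reuse is safe; the Hadoop framework gives each map task its own mapper instance, which is why the pattern works there.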
Model
@Column(name="Desc", name="des", name="DS")
private String description;
How can I specify multiple names for a column, so that if any one of them is found, its value is mapped to description?
How can I specify multiple names for a column?
You can't. A database does not allow a column to have multiple names, hence you can't map multiple column names to a single class field.
so that if any one of them is found, its value is mapped to description?
If you have multiple stored procedures that return "Desc", "des", and "ds" in their respective result sets, and you need to map them to the same Java class, you need to define a different row mapper for each and describe the mapping there.
For example let's say you have SP1 and SP2 and you want both result sets from those to be mapped to ResultDto.
ResultDto looks like:
public class ResultDto {
    private String name;        // always maps to DB column "name"
    private String description; // maps to different DB columns: "ds", "desc"
    // getters and setters omitted
}
You can define a base row mapper that handles the mapping for all overlapping fields across the stored procedures' result sets.
Code Example:
protected static abstract class BaseRowMapper implements RowMapper<ResultDto> {

    public abstract ResultDto mapRow(ResultSet rs, int rowNum) throws SQLException;

    protected void mapBase(ResultSet rs, ResultDto resultDto) throws SQLException {
        resultDto.setName(rs.getString("name")); // map all overlapping fields/columns here
    }
}
private static class SP1RowMapper extends BaseRowMapper {

    @Override
    public ResultDto mapRow(ResultSet rs, int rowNum) throws SQLException {
        ResultDto resultDto = new ResultDto();
        mapBase(rs, resultDto);
        resultDto.setDescription(rs.getString("ds"));
        return resultDto;
    }
}
private static class SP2RowMapper extends BaseRowMapper {

    @Override
    public ResultDto mapRow(ResultSet rs, int rowNum) throws SQLException {
        ResultDto resultDto = new ResultDto();
        mapBase(rs, resultDto);
        resultDto.setDescription(rs.getString("desc"));
        return resultDto;
    }
}
I don't know how you call the Stored Procedures, but if you use Spring's SimpleJdbcCall, the code will look like:
new SimpleJdbcCall(datasource)
        .withProcedureName("SP NAME")
        .declareParameters(
                // Stored Proc params
        )
        .returningResultSet("result set id", rowMapperInstance);
I am trying to model the SQL query select distinct(col1) from table where col2 = value2 in MapReduce. The logic I am using is: each mapper checks the where clause, and if a match is found, it emits the where-clause value as the key and col1 as the value. Because the key is always the same where-clause value, the default hash partitioner sends all output to the same reducer. In the reducer I can then eliminate duplicates and emit the distinct values.
Is this a correct approach to implement this?
Note: Data for this query is in the CSV file.
// MAPPER pseudo code
public static class DistinctMapper extends Mapper<Object, Text, Text, NullWritable> {
    private Text col1 = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Logic to extract columns (pseudocode)
        String c1 = extractColumn(value); // col1
        String c2 = extractColumn(value); // col2
        if (!c2.equals("whereClauseValue")) { // filter on the where clause
            return;
        }
        // Mapper output key is the distinct column value
        col1.set(c1);
        // Mapper value is NULL
        context.write(col1, NullWritable.get());
    }
}
// REDUCER pseudo code
public static class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    public void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Emit the distinct column with a null value;
        // we are not concerned with the list of values here.
        context.write(key, NullWritable.get());
    }
}
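As a sanity check of the logic (plain Java, not Hadoop), the same filter-then-distinct behaviour can be modelled directly; the column positions and the filter value "value2" below are made up for the example:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the map/reduce logic above: keep rows whose second
// column matches the where-clause value, then collect distinct first columns.
// Column positions and the filter value are illustrative only.
public class DistinctModel {
    public static Set<String> distinctCol1(List<String> csvRows, String whereValue) {
        Set<String> distinct = new LinkedHashSet<>();   // plays the reducer's role
        for (String row : csvRows) {
            String[] cols = row.split(",");
            if (cols[1].equals(whereValue)) {           // mapper-side filter
                distinct.add(cols[0]);                  // mapper emits col1 as key
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a,value2", "b,other", "a,value2", "c,value2");
        System.out.println(distinctCol1(rows, "value2")); // [a, c]
    }
}
```

In MapReduce the set-building step happens for free: the shuffle groups identical keys, so the reducer sees each distinct col1 exactly once.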
I'm new to Hadoop. I got this code from the net:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class Gender {
    private static String genderCheck = "female";

    public static class Map extends MapReduceBase implements Mapper {
        private final static IntWritable one = new IntWritable(1);
        private Text locText = new Text();

        public void map(LongWritable key, Text value, OutputCollector output, Reporter reporter) throws IOException {
            String line = value.toString();
            String location = line.split(",")[14] + "," + line.split(",")[15];
            long male = 0L;
            long female = 0L;
            if (line.split(",")[17].matches("\d+") && line.split(",")[18].matches("\d+")) {
                male = Long.parseLong(line.split(",")[17]);
                female = Long.parseLong(line.split(",")[18]);
            }
            long diff = male - female;
            locText.set(location);
            if (Gender.genderCheck.toLowerCase().equals("female") && diff < 0) {
                output.collect(locText, new LongWritable(diff * -1L));
            } else if (Gender.genderCheck.toLowerCase().equals("male") && diff > 0) {
                output.collect(locText, new LongWritable(diff));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Gender.class);
        conf.setJobName("gender");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(Map.class);
        if (args.length != 3) {
            System.out.println("Usage:");
            System.out.println("[male/female] /path/to/2kh/files /path/to/output");
            System.exit(1);
        }
        if (!args[0].equalsIgnoreCase("male") && !args[0].equalsIgnoreCase("female")) {
            System.out.println("first argument must be male or female");
            System.exit(1);
        }
        Gender.genderCheck = args[0];
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[1]));
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
    }
}
When I compile this code using "javac -cp /usr/local/hadoop/hadoop-core-1.0.3.jar Gender.java",
I get the following error:
"Gender.Map is not abstract and does not override abstract method
map(java.lang.Object,java.lang.Object,org.apache.hadoop.mapred.OutputCollector,org.apache.hadoop.mapred.Reporter)
in org.apache.hadoop.mapred.Mapper
public static class Map extends MapReduceBase implements Mapper "
How can I compile it correctly?
Change the Map class declaration as follows:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable>
If you do not specify any type parameters, you would need to have the map function as follows:
@Override
public void map(Object arg0, Object arg1, OutputCollector arg2, Reporter arg3) throws IOException {
    // TODO Auto-generated method stub
}
Now, the specific types denote the expected input key-value pair types and the output key-value pair types of the mapper.
In your case the input key-value pair is LongWritable-Text.
And, judging from your output.collect calls, your mapper's output key-value pair is Text-LongWritable.
Hence, your Map class shall implement Mapper<LongWritable, Text, Text, LongWritable>.
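The underlying Java rule can be seen without Hadoop: implementing a raw generic interface forces Object-typed parameters, while binding the type parameters lets the compiler accept concrete types. The MiniMapper interface below is a hypothetical stand-in for Mapper:

```java
// Hypothetical stand-in for the Mapper interface, to show why the raw form
// forces map(Object, Object, ...) while the parameterized form allows
// concrete parameter types.
public class GenericsDemo {
    public interface MiniMapper<K, V> {
        String map(K key, V value);
    }

    // Binding the type parameters lets us declare map(Long, String);
    // implementing the raw MiniMapper would require map(Object, Object).
    public static class TypedMapper implements MiniMapper<Long, String> {
        @Override
        public String map(Long key, String value) {
            return key + ":" + value;
        }
    }

    public static void main(String[] args) {
        System.out.println(new TypedMapper().map(7L, "hello")); // 7:hello
    }
}
```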
There was one more error in your code:
Using "\d+" will not compile, because \d is not a valid escape sequence in a Java string literal; the backslash itself must be escaped. The following should work for you:
line.split(",")[17].matches("\\d+")
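A quick plain-Java check of the escaping (the sample strings are made up):

```java
// In a Java string literal the backslash must be doubled: "\\d+" produces
// the two-character regex \d+ which matches one or more digits.
public class RegexEscapeDemo {
    public static void main(String[] args) {
        System.out.println("12345".matches("\\d+")); // true
        System.out.println("12a45".matches("\\d+")); // false
    }
}
```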
Change the Map class as follows:
public static class Map extends MapReduceBase implements Mapper<Input Key, Input Value, Output Key, Output Value>
In your case the input key is LongWritable, the input value is Text, the output key is Text, and the output value is LongWritable:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable>
I am writing my own custom Partitioner (old API). Below is the code, where I am extending the Partitioner class:
public static class WordPairPartitioner extends Partitioner<WordPair, IntWritable> {
    @Override
    public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
        return wordPair.getWord().hashCode() % numPartitions;
    }
}
Setting the JobConf:
conf.setPartitionerClass(WordPairPartitioner.class);
The WordPair class contains:
private Text word;
private Text neighbor;
Questions:
1. I am getting the error: "actual argument class (WordPairPartitioner) cannot convert to Class (? extends Partitioner)".
2. Is this the right way to write a custom partitioner, or do I need to override some other functionality as well?
I believe you are mixing up the old API (classes from org.apache.hadoop.mapred.*) and the new API (classes from org.apache.hadoop.mapreduce.*).
Using the old API, you may do as follows:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
public static class WordPairPartitioner implements Partitioner<WordPair, IntWritable> {

    @Override
    public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
        return wordPair.getWord().hashCode() % numPartitions;
    }

    @Override
    public void configure(JobConf arg0) {
    }
}
In addition to Amar's answer, you should handle the possibility of hashCode() returning a negative number by bit masking the hash code:
@Override
public int getPartition(WordPair wordPair, IntWritable intWritable, int numPartitions) {
    return (wordPair.getWord().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
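The mask has to be applied before the modulo, because Java's % preserves the sign of the dividend; masking a negative remainder afterwards produces a huge positive number, not a valid partition index. A plain-Java demonstration with illustrative values:

```java
// Contrast of the two orderings for a negative hash code. Java's % keeps
// the dividend's sign, so -5 % 3 == -2; masking that afterwards yields
// 2147483646, which is far outside [0, numPartitions).
public class PartitionDemo {
    public static int maskThenMod(int hash, int n) {
        return (hash & Integer.MAX_VALUE) % n;   // always a valid index in [0, n)
    }

    public static int modThenMask(int hash, int n) {
        return (hash % n) & Integer.MAX_VALUE;   // broken for negative hashes
    }

    public static void main(String[] args) {
        System.out.println(maskThenMod(-5, 3)); // a valid index in [0, 3)
        System.out.println(modThenMask(-5, 3)); // 2147483646 -- out of range!
    }
}
```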
Can someone walk me through the basic work-flow of reading and writing data with classes generated from DDL?
I have defined some struct-like records using DDL. For example:
class Customer {
    ustring FirstName;
    ustring LastName;
    ustring CardNo;
    long LastPurchase;
}
I've compiled this to get a Customer class and included it into my project. I can easily see how to use this as input and output for mappers and reducers (the generated class implements Writable), but not how to read and write it to file.
The JavaDoc for the org.apache.hadoop.record package talks about serializing these records in Binary, CSV or XML format. How do I actually do that? Say my reducer produces IntWritable keys and Customer values. What OutputFormat do I use to write the result in CSV format? What InputFormat would I use to read the resulting files in later, if I wanted to perform analysis over them?
OK, so I think I have this figured out. I'm not sure if it is the most straightforward way, so please correct me if you know a simpler work-flow.
Every class generated from DDL implements the Record interface, and consequently provides two methods:
serialize(RecordOutput out) for writing
deserialize(RecordInput in) for reading
RecordOutput and RecordInput are utility interfaces provided in the org.apache.hadoop.record package. There are a few implementations (e.g. XmlRecordOutput, BinaryRecordOutput, CsvRecordOutput).
As far as I know, you have to implement your own OutputFormat or InputFormat classes to use these. This is fairly easy to do.
For example, the OutputFormat I talked about in the original question (one that writes Integer keys and Customer values in CSV format) would be implemented like this:
private static class CustomerOutputFormat extends TextOutputFormat<IntWritable, Customer> {

    public RecordWriter<IntWritable, Customer> getRecordWriter(FileSystem ignored,
            JobConf job, String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new CustomerRecordWriter(fileOut);
    }

    protected static class CustomerRecordWriter implements RecordWriter<IntWritable, Customer> {

        protected DataOutputStream outStream;

        public CustomerRecordWriter(DataOutputStream out) {
            this.outStream = out;
        }

        public synchronized void write(IntWritable key, Customer value) throws IOException {
            CsvRecordOutput csvOutput = new CsvRecordOutput(outStream);
            csvOutput.writeInt(key.get(), "id");
            value.serialize(csvOutput);
        }

        public synchronized void close(Reporter reporter) throws IOException {
            outStream.close();
        }
    }
}
Creating the InputFormat is much the same. Because the CSV format is one entry per line, we can use a LineRecordReader internally to do most of the work.
private static class CustomerInputFormat extends FileInputFormat<IntWritable, Customer> {

    public RecordReader<IntWritable, Customer> getRecordReader(InputSplit genericSplit,
            JobConf job, Reporter reporter) throws IOException {
        reporter.setStatus(genericSplit.toString());
        return new CustomerRecordReader(job, (FileSplit) genericSplit);
    }

    private class CustomerRecordReader implements RecordReader<IntWritable, Customer> {

        private LineRecordReader lrr;

        public CustomerRecordReader(Configuration job, FileSplit split) throws IOException {
            this.lrr = new LineRecordReader(job, split);
        }

        public IntWritable createKey() {
            return new IntWritable();
        }

        public Customer createValue() {
            return new Customer();
        }

        public synchronized boolean next(IntWritable key, Customer value) throws IOException {
            LongWritable offset = new LongWritable();
            Text line = new Text();
            if (!lrr.next(offset, line)) {
                return false;
            }
            CsvRecordInput cri = new CsvRecordInput(new ByteArrayInputStream(line.toString().getBytes()));
            key.set(cri.readInt("id"));
            value.deserialize(cri);
            return true;
        }

        public float getProgress() {
            return lrr.getProgress();
        }

        public synchronized long getPos() throws IOException {
            return lrr.getPos();
        }

        public synchronized void close() throws IOException {
            lrr.close();
        }
    }
}