Hadoop - MySQL new API connection

I am trying to use MySQL as the input for a Hadoop job. How do I use the DBInputFormat class for a Hadoop-MySQL connection in version 1.0.3? Configuring the job via JobConf, as in the example from hadoop-1.0.3/docs/api/ below, does not work.
// Create a new JobConf
JobConf job = new JobConf(new Configuration(), MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setCombinerClass(MyJob.MyReducer.class);
job.setReducerClass(MyJob.MyReducer.class);
job.setInputFormat(SequenceFileInputFormat.class);
job.setOutputFormat(SequenceFileOutputFormat.class);

You need to do something like the following (assuming a typical employees table as an example):
JobConf conf = new JobConf(getConf(), MyDriver.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/mydatabase");
String[] fields = { "employee_id", "name" };
DBInputFormat.setInput(conf, MyRecord.class, "employees", null /* conditions */, "employee_id", fields);
...
// other necessary configuration
JobClient.runJob(conf);
The configureDB() and setInput() calls configure the DBInputFormat. The first call specifies the JDBC driver implementation to use and which database to connect to. The second call specifies what data to load from the database. The MyRecord class is the Java class the data will be read into, and "employees" is the name of the table to read. The "employee_id" parameter specifies the table's primary key, used for ordering results; the "Limitations of the InputFormat" section of the Cloudera post linked below explains why this is necessary. Finally, the fields array lists which columns of the table to read. An overloaded definition of setInput() allows you to specify an arbitrary SQL query to read from instead.
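For reference, a minimal sketch of that query-based overload (the exact SQL is illustrative only; the second argument is a count query the framework uses when splitting the input):
// Hedged sketch of the query-based setInput() overload; adjust the SQL to your schema.
DBInputFormat.setInput(conf, MyRecord.class,
    "SELECT employee_id, name FROM employees ORDER BY employee_id",
    "SELECT COUNT(*) FROM employees");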
After calling configureDB() and setInput(), you should configure the rest of your job as usual, setting the Mapper and Reducer classes, specifying any other data sources to read from (e.g., datasets in HDFS) and other job-specific parameters.
You need to create your own implementation of Writable and DBWritable, something like the following (with id and name as the table fields):
class MyRecord implements Writable, DBWritable {
    long id;
    String name;

    // Deserialize from Hadoop's binary wire format
    public void readFields(DataInput in) throws IOException {
        this.id = in.readLong();
        this.name = Text.readString(in);
    }

    // Populate the record from a JDBC ResultSet row
    public void readFields(ResultSet resultSet) throws SQLException {
        this.id = resultSet.getLong(1);
        this.name = resultSet.getString(2);
    }

    // Serialize to Hadoop's binary wire format
    public void write(DataOutput out) throws IOException {
        out.writeLong(this.id);
        Text.writeString(out, this.name);
    }

    // Fill a JDBC PreparedStatement when writing back to the database
    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setLong(1, this.id);
        stmt.setString(2, this.name);
    }
}
The mapper then receives an instance of your DBWritable implementation as its input value. The input key is a row id provided by the database; you’ll most likely discard this value.
public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, MyRecord, LongWritable, Text> {
    public void map(LongWritable key, MyRecord val,
            OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
        // Use val.id, val.name here
        output.collect(new LongWritable(val.id), new Text(val.name));
    }
}
For more, read the following link (the actual source of my answer): http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/

Have a look at this post. It shows how to sink data from MapReduce into a MySQL database.
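For context, a rough sketch of the driver-side setup for that write path, using DBOutputFormat from the old mapred API (the table name "counts" and its columns are hypothetical):
// Hedged sketch: send job output to MySQL via DBOutputFormat.
conf.setOutputFormat(DBOutputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
    "jdbc:mysql://localhost/mydatabase", "user", "password");
// Write two columns into a hypothetical "counts" table.
DBOutputFormat.setOutput(conf, "counts", "word", "total");
The reducer then emits a key whose class implements DBWritable (its write(PreparedStatement) method fills those two columns) and a NullWritable value.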

Related

Getting the partition id of input file in Hadoop

I need to know the row index of the partitions of the input file that I'm using. I could force this in the original file by concatenating the row index to the data but I'd rather have a way of doing this in Hadoop. I have this in my mapper...
String id = context.getConfiguration().get("mapreduce.task.partition");
But "id" is 0 in every case. In the "Hadoop: The Definitive Guide" it mentions accessing properties like the partition id "can be accessed from the context object passed to all methods of the Mapper or Reducer". It does not, from what I can tell, actually go into how to access this information.
I went through the documentation for the Context object and it seems like the above is the way to do it and the script does compile. But since I'm getting 0 for every value, I'm not sure if I'm actually using the right thing and I'm unable to find any detail online that could help in figuring this out.
Code used to test...
public class Test {
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String id = context.getConfiguration().get("mapreduce.task.partition");
context.write(new Text("Test"), new Text(id + "_" + value.toString()));
}
}
public static class TestReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for(Text value : values) {
context.write(key, value);
}
}
}
public static void main(String[] args) throws Exception {
if(args.length != 2) {
System.err.println("Usage: Test <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(Test.class);
job.setJobName("Test");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Two options are:
Use the offset instead of the row number
Track the line number in the mapper
For the first one, the key, which is a LongWritable, tells you the offset of the line being processed. Unless your lines are exactly the same length, you won't be able to calculate the line number from an offset, but it does allow you to determine ordering if that's useful.
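A minimal sketch of this first option, reusing the new-API mapper from the question (nothing new is needed; the key the framework already passes in is the offset):
public static class OffsetMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of this line within the input file:
        // unique per line and order-preserving, but not a row number.
        context.write(new Text("Test"), new Text(key.get() + "_" + value.toString()));
    }
}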
The second option is to just track it in the mapper. You could change your code to something like:
public static class TestMapper extends Mapper<LongWritable, Text, Text, Text> {
    private long currentLineNum = 0;
    private Text test = new Text("Test");

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(test, new Text(currentLineNum + "_" + value));
        currentLineNum++;
    }
}
You could also represent your matrix as lines of tuples and include the row and col on every tuple, so that when you're reading in the file you already have that information. If you use a file that is just space- or comma-separated values making up a 2D array, it will be extremely hard to figure out which line (row) you are currently working on in the mapper.
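For example, if each line were a hypothetical "row,col,value" tuple (this format is an assumption, not from the question), the mapper could recover the coordinates directly from the record:
public static class TupleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input format: "row,col,value" per line.
        String[] parts = value.toString().split(",");
        String row = parts[0];
        String col = parts[1];
        String cell = parts[2];
        context.write(new Text(row + "," + col), new Text(cell));
    }
}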

When to go for custom Input format for Map reduce jobs

When should we go for a custom InputFormat when using MapReduce programming?
Say I have a file which I need to read line by line, and it has 15 columns delimited by pipes; should I go for a custom InputFormat?
I can use TextInputFormat as well as a custom InputFormat in this case.
A custom InputFormat can be written when you need to customize input record reading, but in your case you do not need such an implementation.
Below is one example of a custom InputFormat out of many such cases.
Example: Reading Paragraphs as Input Records
If you are working with Hadoop MapReduce or using AWS EMR, there might be a use case where the input files contain a paragraph as each key-value record instead of a single line (think about scenarios like analyzing comments on news articles). So instead of processing a single line as input, if you need to process a complete paragraph at once as a single record, then you will need to customize the default behavior of TextInputFormat, i.e. reading each line by default, into reading a complete paragraph as one input key-value pair for further processing in MapReduce jobs.
This requires us to create a custom record reader, which can be done by implementing the RecordReader interface. The next() method is where you tell the record reader to fetch a paragraph instead of one line. See the following implementation; it's self-explanatory:
public class ParagraphRecordReader implements RecordReader<LongWritable, Text> {
    private LineRecordReader lineRecord;
    private LongWritable lineKey;
    private Text lineValue;

    public ParagraphRecordReader(JobConf conf, FileSplit split) throws IOException {
        lineRecord = new LineRecordReader(conf, split);
        lineKey = lineRecord.createKey();
        lineValue = lineRecord.createValue();
    }

    @Override
    public void close() throws IOException {
        lineRecord.close();
    }

    @Override
    public LongWritable createKey() {
        return new LongWritable();
    }

    @Override
    public Text createValue() {
        return new Text("");
    }

    @Override
    public float getProgress() throws IOException {
        return lineRecord.getProgress();
    }

    @Override
    public synchronized boolean next(LongWritable key, Text value) throws IOException {
        boolean appended, isNextLineAvailable;
        boolean retval;
        byte space[] = {' '};
        value.clear();
        isNextLineAvailable = false;
        do {
            appended = false;
            retval = lineRecord.next(lineKey, lineValue);
            if (retval) {
                if (lineValue.toString().length() > 0) {
                    byte[] rawline = lineValue.getBytes();
                    int rawlinelen = lineValue.getLength();
                    // Append the line (plus a separating space) to the current paragraph
                    value.append(rawline, 0, rawlinelen);
                    value.append(space, 0, 1);
                    appended = true;
                }
                isNextLineAvailable = true;
            }
            // Keep appending until an empty line (paragraph boundary) or end of split
        } while (appended);
        return isNextLineAvailable;
    }

    @Override
    public long getPos() throws IOException {
        return lineRecord.getPos();
    }
}
With a ParagraphRecordReader implementation in place, we need to extend TextInputFormat to create a custom InputFormat, simply overriding the getRecordReader method to return a ParagraphRecordReader and thereby override the default behavior.
ParagraphInputFormat will look like:
public class ParagraphInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter)
            throws IOException {
        reporter.setStatus(split.toString());
        return new ParagraphRecordReader(conf, (FileSplit) split);
    }
}
Ensure that the job configuration uses our custom input format implementation for reading data into the MapReduce job. This is as simple as setting the input format to ParagraphInputFormat, as shown below:
conf.setInputFormat(ParagraphInputFormat.class);
With the above changes, we can read paragraphs as input records into MapReduce programs.
Let's assume that the input file contains paragraphs separated by blank lines.
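For illustration (this sample is not from the original post), such a file might look like:
This is the first paragraph. It spans
two lines in the file.

This is the second paragraph, which the
record reader delivers as a single value.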
And a simple mapper code would look like:
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    System.out.println(key + " : " + value);
}
Yes, you can use TextInputFormat for your case.
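As a rough sketch of that approach (shown with the new mapreduce API; which of the 15 fields to emit is arbitrary here), a plain TextInputFormat mapper can simply split each line on the pipe delimiter:
public class PipeDelimitedMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the 15 pipe-delimited columns; the -1 keeps trailing empty fields.
        String[] columns = value.toString().split("\\|", -1);
        // Emit, for example, the first column as the key and the second as the value.
        context.write(new Text(columns[0]), new Text(columns[1]));
    }
}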

set a conf value in mapper - get it in run method

In the run method of the Driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write them to your main job output (the keys and values that you emit from your job),
Or alternatively use MultipleOutputs, if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read from HDFS. For example, you can store the name/value pairs in the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
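For completeness, here is a sketch of the save side that the load() method above assumes; the method name and the choice of java.util.Properties format are my own, not part of the original answer:
public void save(Map<String, String> values, Path dst, FileSystem fs) throws IOException {
    Properties props = new Properties();
    for (Map.Entry<String, String> e : values.entrySet()) {
        props.setProperty(e.getKey(), e.getValue());
    }
    OutputStream os = fs.create(dst);
    try {
        // Store in the same Properties format that load() parses above.
        props.store(new OutputStreamWriter(os, "UTF8"), null);
    } finally {
        os.close();
    }
}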
Using Configuration you can only go the other way around.
You can set values in the Driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and then get them in the Mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option for your question is to write the data to a file, store it in HDFS, and then access it in the Driver class. Such files can be treated as "intermediate files".
Just try it and see.

Read/Write Hbase without reducer, exceptions

I want to read and write HBase without using any reducer.
I followed the example in "The Apache HBase™ Reference Guide", but there are exceptions.
Here is my code:
public class CreateHbaseIndex {
static final String SRCTABLENAME="sourceTable";
static final String SRCCOLFAMILY="info";
static final String SRCCOL1="name";
static final String SRCCOL2="email";
static final String SRCCOL3="power";
static final String DSTTABLENAME="dstTable";
static final String DSTCOLNAME="index";
static final String DSTCOL1="key";
public static void main(String[] args) {
System.out.println("CreateHbaseIndex Program starts!...");
try {
Configuration config = HBaseConfiguration.create();
Scan scan = new Scan();
scan.setCaching(500);
scan.setCacheBlocks(false);
scan.addColumn(Bytes.toBytes(SRCCOLFAMILY), Bytes.toBytes(SRCCOL1));//info:name
HBaseAdmin admin = new HBaseAdmin(config);
if (admin.tableExists(DSTTABLENAME)) {
System.out.println("table Exists.");
}
else{
HTableDescriptor tableDesc = new HTableDescriptor(DSTTABLENAME);
tableDesc.addFamily(new HColumnDescriptor(DSTCOLNAME));
admin.createTable(tableDesc);
System.out.println("create table ok.");
}
Job job = new Job(config, "CreateHbaseIndex");
job.setJarByClass(CreateHbaseIndex.class);
TableMapReduceUtil.initTableMapperJob(
SRCTABLENAME, // input HBase table name
scan, // Scan instance to control CF and attribute selection
HbaseMapper.class, // mapper
ImmutableBytesWritable.class, // mapper output key
Put.class, // mapper output value
job);
job.waitForCompletion(true);
} catch (IOException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
}
System.out.println("Program ends!...");
}
public static class HbaseMapper extends TableMapper<ImmutableBytesWritable, Put> {
private HTable dstHt;
private Configuration dstConfig;
@Override
public void setup(Context context) throws IOException{
dstConfig=HBaseConfiguration.create();
dstHt = new HTable(dstConfig,SRCTABLENAME);
}
@Override
public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
// this is just copying the data from the source table...
context.write(row, resultToPut(row,value));
}
private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
Put put = new Put(key.get());
for (KeyValue kv : result.raw()) {
put.add(kv);
}
return put;
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
dstHt.close();
super.cleanup(context);
}
}
}
By the way, "souceTable" is like this:
key name email
1 peter a#a.com
2 sam b#b.com
"dstTable" will be like this:
key value
peter 1
sam 2
I am a newbie in this field and need your help. Thx~
You are correct that you don't need a reducer to write to HBase, but there are some instances where a reducer might help. If you are creating an index, you might run into situations where two mappers are trying to write the same row. Unless you are careful to ensure that they are writing into different column qualifiers, you could overwrite one update with another due to race conditions. While HBase does do row level locking, it won't help if your application logic is faulty.
Without seeing your exceptions, I would guess that you are failing because you are trying to write key-value pairs from your source table into your index table, where the column family doesn't exist.
In this code you are not specifying the output format. You need to add the following code
job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, DSTTABLENAME);
Also, we are not supposed to create a new Configuration in setup(); we need to use the same configuration from the context.
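Putting those two points together, a sketch of a corrected setup() might look like the following (only needed if the mapper really requires a direct HTable handle; the plain context.write() path does not):
@Override
public void setup(Context context) throws IOException {
    // Reuse the job's configuration instead of creating a new one,
    // and open the destination table rather than the source table.
    dstConfig = context.getConfiguration();
    dstHt = new HTable(dstConfig, DSTTABLENAME);
}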

Any suggestions for reading two different dataset into Hadoop at the same time?

Dear hadooper:
I'm new to Hadoop, and recently tried to implement an algorithm.
This algorithm needs to calculate a matrix which represents the different ratings of every pair of songs. I already did this, and the output is a 600000*600000 sparse matrix which I stored in HDFS. Let's call this dataset A (size = 160G).
Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profiles first (which are 5G in size); let's call this dataset B, and then do the calculation using dataset A.
But now I don't know how to read the two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed file system, and I can't read dataset B into a single machine's memory.)
Any suggestions?
You can use two Map functions; each Map function can process one data set if you want to implement different processing. You need to register each mapper with your job conf. For example:
public static class FullOuterJoinStdDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String person_name, book_title,file_tag="person_book#";
private String emit_value = new String();
//emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text,Text>output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] person_detail = line.split(",");
person_name = person_detail[0].trim();
book_title = person_detail[1].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
person_name = "student name missing";
}
emit_value = file_tag + person_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static class FullOuterJoinResultDetMapper extends MapReduceBase implements Mapper <LongWritable ,Text ,Text, Text>
{
private String author_name, book_title,file_tag="auth_book#";
private String emit_value = new String();
// emit_value = "";
public void map(LongWritable key, Text values, OutputCollector<Text, Text> output, Reporter reporter)
throws IOException
{
String line = values.toString();
try
{
String[] author_detail = line.split(",");
author_name = author_detail[1].trim();
book_title = author_detail[0].trim();
}
catch (ArrayIndexOutOfBoundsException e)
{
author_name = "Not Appeared in Exam";
}
emit_value = file_tag + author_name;
output.collect(new Text(book_title), new Text(emit_value));
}
}
public static void main(String args[])
throws Exception
{
if(args.length !=3)
{
System.out.println("Input outpur file missing");
System.exit(-1);
}
Configuration conf = new Configuration();
String [] argum = new GenericOptionsParser(conf,args).getRemainingArgs();
conf.set("mapred.textoutputformat.separator", ",");
JobConf mrjob = new JobConf(conf); // reuse the Configuration set up above
mrjob.setJobName("Inner_Join");
mrjob.setJarByClass(FullOuterJoin.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[0]),TextInputFormat.class,FullOuterJoinStdDetMapper.class);
MultipleInputs.addInputPath(mrjob,new Path(argum[1]),TextInputFormat.class,FullOuterJoinResultDetMapper.class);
FileOutputFormat.setOutputPath(mrjob,new Path(args[2]));
mrjob.setReducerClass(FullOuterJoinReducer.class);
mrjob.setOutputKeyClass(Text.class);
mrjob.setOutputValueClass(Text.class);
JobClient.runJob(mrjob);
}
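The FullOuterJoinReducer referenced above is not shown in the answer; a minimal sketch of my own, matching the file_tag prefixes the two mappers attach, could group the tagged values per book title like this:
public static class FullOuterJoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        List<String> persons = new ArrayList<String>();
        List<String> authors = new ArrayList<String>();
        while (values.hasNext()) {
            String tagged = values.next().toString();
            // Route each value by the tag its mapper attached.
            if (tagged.startsWith("person_book#")) {
                persons.add(tagged.substring("person_book#".length()));
            } else if (tagged.startsWith("auth_book#")) {
                authors.add(tagged.substring("auth_book#".length()));
            }
        }
        // Emit all person and author names seen for this book title.
        output.collect(key, new Text(persons.toString() + "," + authors.toString()));
    }
}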
Hadoop allows you to use different input formats for different folders. So you can read from several data sources and then cast to a specific type in the Map function, i.e. in one case you get (String, User), in the other (String, SongSongRating), and your Map signature is (String, Object).
The second step is selecting the recommendation algorithm: join the data in some way so that the aggregator has at least enough information to calculate the recommendation.
