How to temporarily connect to MySQL in MapReduce - hadoop

I want to connect to MySQL from within a MapReduce job. In other words, I want to read a MySQL table inside the map function so I can use it while transforming the input data (Text), but the result was empty.
Below is my code:
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text outputKey = new Text();
    private final static IntWritable outputValue = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int i = 0;
        try {
            Connection con = DriverManager.getConnection("jdbc:sqlserver://locallhost:3306/test_db",
                    "user", "user");
            PreparedStatement ps = con.prepareStatement("select * from studentinfo");
            ResultSet rs = ps.executeQuery();
            while (rs.next())
                i++;
            AirlinePerformanceParser parser = new AirlinePerformanceParser(value);
            outputKey.set(parser.getMonth_day() + "," + i);
            context.write(outputKey, outputValue);
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
Map and reduce run without exceptions, but the result is empty.
I don't want to use DBInputFormat; the MySQL table must not be read through DBInputFormat.
What is the problem?

If your query is static, i.e. it does not depend on the map input (as is the case with your current query), you should move your SQL code into the setup method so that the DB is queried only once.
// a field holding the data buffer from your SQL results (here a counter)
int nbResults;

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    // query your DB
    Connection con [...]
}

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    AirlinePerformanceParser parser = new AirlinePerformanceParser(value);
    outputKey.set(parser.getMonth_day() + "," + nbResults);
    context.write(outputKey, outputValue);
}
As pointed out by @Ramzy, check the connection status.
In addition, do not perform a SELECT * if what you need is a COUNT(*); a sketch of both suggestions is given below.
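For illustration, a minimal sketch of that setup-based approach, assuming the MySQL Connector/J driver is on the task classpath and that the credentials, schema and table names from the question still apply (imports from java.sql and org.apache.hadoop are omitted to match the style above); note the jdbc:mysql URL replacing the jdbc:sqlserver one from the question:

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text outputKey = new Text();
    private static final IntWritable outputValue = new IntWritable(1);
    private int nbResults;   // filled once per task in setup()

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        String url = "jdbc:mysql://localhost:3306/test_db";
        try (Connection con = DriverManager.getConnection(url, "user", "user");
             PreparedStatement ps = con.prepareStatement("select count(*) from studentinfo");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                nbResults = rs.getInt(1);
            }
        } catch (SQLException e) {
            // rethrow instead of swallowing, so a failed connection is visible in the task logs
            throw new IOException("Could not query MySQL", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        AirlinePerformanceParser parser = new AirlinePerformanceParser(value);
        outputKey.set(parser.getMonth_day() + "," + nbResults);
        context.write(outputKey, outputValue);
    }
}

Rethrowing the SQLException instead of printing it is also the quickest way to check the connection status that the answer mentions: if the JDBC URL or host is wrong, the task fails loudly instead of silently producing empty output.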

Related

Hadoop Mapreduce: Custom Input Format

I have a file with data having text and "^" in between:
SOME TEXT^GOES HERE^
AND A FEW^MORE
GOES HERE
I am writing a custom input format to delimit the rows using the "^" character, i.e. the output of the mapper should be like:
SOME TEXT
GOES HERE
AND A FEW
MORE GOES HERE
I have written a custom input format which extends FileInputFormat and also a custom record reader that extends RecordReader. The code for my custom record reader is given below. I don't know how to proceed with this code; I am having trouble with the while loop in the nextKeyValue() method. How should I read the data from a split and generate my custom key-value? I am using the new mapreduce package rather than the old mapred package.
public class MyRecordReader extends RecordReader<LongWritable, Text> {
    long start, current, end;
    Text value;
    LongWritable key;
    LineReader reader;
    FileSplit split;
    Path path;
    FileSystem fs;
    FSDataInputStream in;
    Configuration conf;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext cont) throws IOException, InterruptedException {
        conf = cont.getConfiguration();
        split = (FileSplit) inputSplit;
        path = split.getPath();
        fs = path.getFileSystem(conf);
        in = fs.open(path);
        reader = new LineReader(in, conf);
        start = split.getStart();
        current = start;
        end = split.getLength() + start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (key == null)
            key = new LongWritable();
        key.set(current);
        if (value == null)
            value = new Text();
        long readSize = 0;
        while (current < end) {
            Text tmpText = new Text();
            readSize = read // here how should I read data from the split, and generate key-value?
            if (readSize == 0)
                break;
            current += readSize;
        }
        if (readSize == 0) {
            key = null;
            value = null;
            return false;
        }
        return true;
    }

    @Override
    public float getProgress() throws IOException {
    }

    @Override
    public LongWritable getCurrentKey() throws IOException {
    }

    @Override
    public Text getCurrentValue() throws IOException {
    }

    @Override
    public void close() throws IOException {
    }
}
There is no need to implement that yourself. You can simply set the configuration value textinputformat.record.delimiter to be the circumflex character.
conf.set("textinputformat.record.delimiter", "^");
This should work fine with the normal TextInputFormat.
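As a rough sketch of where that setting goes, the delimiter has to be set on the Configuration before the Job is constructed (the job name, driver class and paths below are placeholders, not from the question):

Configuration conf = new Configuration();
conf.set("textinputformat.record.delimiter", "^");

Job job = new Job(conf);
job.setJobName("caret-delimited-records");        // hypothetical job name
job.setJarByClass(MyDriver.class);                // MyDriver is a placeholder
job.setInputFormatClass(TextInputFormat.class);   // the default, shown for clarity
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

With this setting each value passed to the mapper is the text up to the next "^", so the mapper may still need to strip or normalize any embedded newlines.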

readFields with a complex type in Hadoop

I have this Class:
public class Stripe implements WritableComparable<Stripe> {
    private List<Term> occorrenze = new ArrayList<Term>();

    public Stripe() {}

    @Override
    public void readFields(DataInput in) throws IOException {
    }
}

public class Term implements WritableComparable<Term> {
    private Text key;
    private IntWritable frequency;

    @Override
    public void readFields(DataInput in) throws IOException {
        this.key.readFields(in);
        this.frequency.readFields(in);
    }
}
Stripe is a list of Term (a pair of Text and IntWritable).
How can I implement the readFields method to read the complex type Stripe from DataInput?
To serialize a list you'll need to write out the length of the list, followed by the elements themselves. A simple readFields / write method pair for Stripe could be:
@Override
public void readFields(DataInput in) throws IOException {
    occorrenze.clear();
    int cnt = in.readInt();
    for (int x = 0; x < cnt; x++) {
        Term term = new Term();
        term.readFields(in);
        occorrenze.add(term);
    }
}

@Override
public void write(DataOutput out) throws IOException {
    out.writeInt(occorrenze.size());
    for (Term term : occorrenze) {
        term.write(out);
    }
}
You could make this more efficient by using a VInt rather than an int, and by using a pool of Terms that can be re-used to save on object creation / garbage collection in the readFields method. A sketch of the VInt variant follows.
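For reference, a minimal sketch of that VInt variant, assuming Hadoop's org.apache.hadoop.io.WritableUtils helpers (writeVInt/readVInt) for the variable-length encoding:

@Override
public void readFields(DataInput in) throws IOException {
    occorrenze.clear();
    // readVInt decodes the variable-length count written by writeVInt below
    int cnt = WritableUtils.readVInt(in);
    for (int x = 0; x < cnt; x++) {
        Term term = new Term();
        term.readFields(in);
        occorrenze.add(term);
    }
}

@Override
public void write(DataOutput out) throws IOException {
    // small list sizes take one byte instead of four when encoded as a VInt
    WritableUtils.writeVInt(out, occorrenze.size());
    for (Term term : occorrenze) {
        term.write(out);
    }
}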
You could use ArrayWritable which is a list of writables of the same type.

How can I emit key/values at the end of the whole file processing?

The mapper reads lines from a file... How can I emit key/values at the end, after the whole file has been scanned, and not per line?
Using the new mapreduce API, you can override the Mapper.cleanup(Context) method and use Context.write(K, V) as you normally would in the map method.
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(new Text("key"), new Text("value"));
}
With the old mapred API you can override the close() method, but you'll need to store a reference to the OutputCollector given to the map method:
private OutputCollector cachedCollector = null;

public void map(LongWritable key, Text value, OutputCollector outputCollector, Reporter reporter) throws IOException {
    if (cachedCollector == null) {
        cachedCollector = outputCollector;
    }
    // ...
}

public void close() throws IOException {
    cachedCollector.collect(outputKey, outputValue);
}
Do you have one key/value for the whole file, or multiple?
Case #1:
Use a WholeFileInputFormat. You will receive the full file content as a single record. You can split this into records, process all of them, and emit the final key/value at the end of your processing.
Case #2:
Use the same FileInputFormat. Store all key/values in temporary storage. At the end, access your temporary storage and emit whatever key/values you want and suppress those you don't; a sketch of this approach follows.
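As an illustration of case #2, a minimal sketch that buffers values in memory and flushes them in cleanup(); the "aggregation" rule here (keeping only the last value seen per key from hypothetical tab-separated input) is a made-up placeholder, not part of the original answer:

public static class BufferingMapper extends Mapper<LongWritable, Text, Text, Text> {
    // temporary storage: one entry per key, overwritten as new lines arrive
    private final java.util.Map<String, String> buffer = new java.util.HashMap<String, String>();

    @Override
    public void map(LongWritable key, Text value, Context context) {
        String[] parts = value.toString().split("\t", 2);   // hypothetical tab-separated input
        if (parts.length == 2) {
            buffer.put(parts[0], parts[1]);                  // suppresses earlier values for this key
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // emit only once, after the whole split has been scanned
        for (java.util.Map.Entry<String, String> e : buffer.entrySet()) {
            context.write(new Text(e.getKey()), new Text(e.getValue()));
        }
    }
}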
An alternative to Chris's answer is to override the run() method of the Mapper class (new API):
public static class Map extends Mapper<IntWritable, IntWritable, IntWritable, IntWritable> {
    // map method here

    // Override the run()
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        // Have your last <key,value> emitted here
        context.write(lastOutputKey, lastOutputValue);
        cleanup(context);
    }
}
And in order to ensure that each mapper gets one whole file to process, you have to create your own version of FileInputFormat and override isSplitable(), like this:
class NonSplittableFileInputFormat extends FileInputFormat {
    @Override
    public boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }
}
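For the new API used in the run() example above, a rough equivalent (my assumption, not part of the original answer) is to extend TextInputFormat and override the JobContext-based isSplitable, then register the class in the driver:

public static class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // never split: each mapper processes one whole file
        return false;
    }
}

// in the driver
job.setInputFormatClass(NonSplittableTextInputFormat.class);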

What is proper way to use PreparedStatementCreator of Spring JDBC?

As per my understanding, the point of a PreparedStatement in Java is that we can reuse it multiple times.
But I have some confusion using PreparedStatementCreator of Spring JDBC.
For example consider following code,
public class SpringTest {
    JdbcTemplate jdbcTemplate;
    PreparedStatementCreator preparedStatementCreator;
    ResultSetExtractor<String> resultSetExtractor;

    public SpringTest() throws SQLException {
        jdbcTemplate = new JdbcTemplate(OracleUtil.getDataSource());
        preparedStatementCreator = new PreparedStatementCreator() {
            String query = "select NAME from TABLE1 where ID=?";
            public PreparedStatement createPreparedStatement(Connection connection) throws SQLException {
                return connection.prepareStatement(query);
            }
        };
        resultSetExtractor = new ResultSetExtractor<String>() {
            public String extractData(ResultSet resultSet) throws SQLException, DataAccessException {
                if (resultSet.next()) {
                    return resultSet.getString(1);
                }
                return null;
            }
        };
    }

    public String getNameFromId(int id) {
        return jdbcTemplate.query(preparedStatementCreator, new Table1Setter(id), resultSetExtractor);
    }

    private static class Table1Setter implements PreparedStatementSetter {
        private int id;

        public Table1Setter(int id) {
            this.id = id;
        }

        @Override
        public void setValues(PreparedStatement preparedStatement) throws SQLException {
            preparedStatement.setInt(1, id);
        }
    }

    public static void main(String[] args) {
        try {
            SpringTest springTest = new SpringTest();
            for (int i = 0; i < 10; i++) {
                System.out.println(springTest.getNameFromId(i));
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
In this code, when I call the springTest.getNameFromId(int id) method it returns the name for the given id. Here I've used PreparedStatementCreator to create the PreparedStatement, PreparedStatementSetter to set the input parameters, and ResultSetExtractor to get the result.
But performance is very slow.
After debugging and looking into what happens inside PreparedStatementCreator and JdbcTemplate, I found that PreparedStatementCreator creates a new PreparedStatement every single time!
Every time I call jdbcTemplate.query(preparedStatementCreator, preparedStatementSetter, resultSetExtractor), it creates a new PreparedStatement, and this slows down performance.
Is this the right way to use PreparedStatementCreator? In this code I am unable to reuse the PreparedStatement. And if this is the right way to use PreparedStatementCreator, then how do I get the benefit of PreparedStatement reuse?
Prepared statements are usually cached by the underlying connection pool, so you don't need to worry about creating a new one every time.
So I think that your actual usage is correct.
JdbcTemplate closes the statement after executing it, so if you really want to reuse the same prepared statement you could proxy the statement and intercept the close method in the statement creator
For example (not tested, only as an example):
public abstract class ReusablePreparedStatementCreator implements PreparedStatementCreator {

    private PreparedStatement statement;

    public PreparedStatement createPreparedStatement(Connection conn) throws SQLException {
        if (statement != null)
            return statement;
        PreparedStatement ps = doPreparedStatement(conn);
        ProxyFactory pf = new ProxyFactory(ps);
        MethodInterceptor closeMethodInterceptor = new MethodInterceptor() {
            @Override
            public Object invoke(MethodInvocation invocation) throws Throwable {
                return null; // don't close statement
            }
        };
        NameMatchMethodPointcutAdvisor closeAdvisor = new NameMatchMethodPointcutAdvisor();
        closeAdvisor.setMappedName("close");
        closeAdvisor.setAdvice(closeMethodInterceptor);
        pf.addAdvisor(closeAdvisor);
        statement = (PreparedStatement) pf.getProxy();
        return statement;
    }

    public abstract PreparedStatement doPreparedStatement(Connection conn) throws SQLException;

    public void close() {
        try {
            PreparedStatement ps = (PreparedStatement) ((Advised) statement).getTargetSource().getTarget();
            ps.close();
        } catch (Exception e) {
            // handle exception
        }
    }
}
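A possible way to use it (my own sketch, not from the original answer) is to keep one creator per query, pass it to JdbcTemplate as before, and call close() yourself when done; this assumes the Table1Setter and resultSetExtractor from the question are visible in scope:

ReusablePreparedStatementCreator creator = new ReusablePreparedStatementCreator() {
    @Override
    public PreparedStatement doPreparedStatement(Connection conn) throws SQLException {
        return conn.prepareStatement("select NAME from TABLE1 where ID=?");
    }
};

// the same creator (and hence the same proxied statement) is handed to every query
for (int i = 0; i < 10; i++) {
    System.out.println(jdbcTemplate.query(creator, new Table1Setter(i), resultSetExtractor));
}

// release the real statement wrapped by the proxy when finished
creator.close();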
You are using PreparedStatementCreator the right way.
In each new transaction you should create a brand new PreparedStatement instance; that is definitely correct. PreparedStatementCreator is mainly designed to wrap the code block that creates the PreparedStatement instance, not to say that you should reuse the same instance each time.
PreparedStatement is mainly designed to send a templated, pre-compiled SQL statement to the DBMS, which saves some compilation time for SQL execution.
To summarize, what you did is correct: using PreparedStatement will have better performance than Statement.
"After debugging and looking into what happens inside PreparedStatementCreator and JdbcTemplate, I found that PreparedStatementCreator creates a new PreparedStatement every single time!"
I'm not sure why that's so shocking, since it's your own code that creates a new PreparedStatement each time by calling connection.prepareStatement(query). If you want to reuse the same one, then you shouldn't create a new one.

Type mismatch in key from map, using SequenceFileInputFormat correctly

I am trying to run a recommender example from chapter 6 (listings 6.1 ~ 6.4) of the ebook Mahout in Action. There are two mapper/reducer pairs. Here is the code:
Mapper - 1
public class WikipediaToItemPrefsMapper extends
        Mapper<LongWritable, Text, VarLongWritable, VarLongWritable> {

    private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Matcher m = NUMBERS.matcher(line);
        m.find();
        VarLongWritable userID = new VarLongWritable(Long.parseLong(m.group()));
        VarLongWritable itemID = new VarLongWritable();
        while (m.find()) {
            itemID.set(Long.parseLong(m.group()));
            context.write(userID, itemID);
        }
    }
}
Reducer - 1
public class WikipediaToUserVectorReducer extends
        Reducer<VarLongWritable, VarLongWritable, VarLongWritable, VectorWritable> {

    @Override
    public void reduce(VarLongWritable userID, Iterable<VarLongWritable> itemPrefs, Context context)
            throws IOException, InterruptedException {
        Vector userVector = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        for (VarLongWritable itemPref : itemPrefs) {
            userVector.set((int) itemPref.get(), 1.0f);
        }
        //LongWritable userID_lw = new LongWritable(userID.get());
        context.write(userID, new VectorWritable(userVector));
        //context.write(userID_lw, new VectorWritable(userVector));
    }
}
The reducer outputs a userID and a userVector, and it looks like this: 98955 {590:1.0 22:1.0 9059:1.0 3:1.0 2:1.0 1:1.0}, provided FileInputFormat and TextInputFormat are used in the driver.
I want to use another pair of mapper-reducer to process this data further:
Mapper - 2
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable, VectorWritable, IntWritable, IntWritable> {

    @Override
    public void map(VarLongWritable userID, VectorWritable userVector, Context context)
            throws IOException, InterruptedException {
        Iterator<Vector.Element> it = userVector.get().iterateNonZero();
        while (it.hasNext()) {
            int index1 = it.next().index();
            Iterator<Vector.Element> it2 = userVector.get().iterateNonZero();
            while (it2.hasNext()) {
                int index2 = it2.next().index();
                context.write(new IntWritable(index1), new IntWritable(index2));
            }
        }
    }
}
Reducer - 2
public class UserVectorToCooccurenceReducer extends
        Reducer<IntWritable, IntWritable, IntWritable, VectorWritable> {

    @Override
    public void reduce(IntWritable itemIndex1, Iterable<IntWritable> itemIndex2s, Context context)
            throws IOException, InterruptedException {
        Vector cooccurrenceRow = new RandomAccessSparseVector(Integer.MAX_VALUE, 100);
        for (IntWritable intWritable : itemIndex2s) {
            int itemIndex2 = intWritable.get();
            cooccurrenceRow.set(itemIndex2, cooccurrenceRow.get(itemIndex2) + 1.0);
        }
        context.write(itemIndex1, new VectorWritable(cooccurrenceRow));
    }
}
This is the driver I am using:
public final class RecommenderJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job_preferenceValues = new Job(getConf());
        job_preferenceValues.setJarByClass(RecommenderJob.class);
        job_preferenceValues.setJobName("job_preferenceValues");
        job_preferenceValues.setInputFormatClass(TextInputFormat.class);
        job_preferenceValues.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.setInputPaths(job_preferenceValues, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(job_preferenceValues, new Path(args[1]));
        job_preferenceValues.setMapOutputKeyClass(VarLongWritable.class);
        job_preferenceValues.setMapOutputValueClass(VarLongWritable.class);
        job_preferenceValues.setOutputKeyClass(VarLongWritable.class);
        job_preferenceValues.setOutputValueClass(VectorWritable.class);
        job_preferenceValues.setMapperClass(WikipediaToItemPrefsMapper.class);
        job_preferenceValues.setReducerClass(WikipediaToUserVectorReducer.class);
        job_preferenceValues.waitForCompletion(true);

        Job job_cooccurence = new Job(getConf());
        job_cooccurence.setJarByClass(RecommenderJob.class);
        job_cooccurence.setJobName("job_cooccurence");
        job_cooccurence.setInputFormatClass(SequenceFileInputFormat.class);
        job_cooccurence.setOutputFormatClass(TextOutputFormat.class);
        SequenceFileInputFormat.setInputPaths(job_cooccurence, new Path(args[1]));
        FileOutputFormat.setOutputPath(job_cooccurence, new Path(args[2]));
        job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
        job_cooccurence.setMapOutputValueClass(VectorWritable.class);
        job_cooccurence.setOutputKeyClass(IntWritable.class);
        job_cooccurence.setOutputValueClass(VectorWritable.class);
        job_cooccurence.setMapperClass(UserVectorToCooccurenceMapper.class);
        job_cooccurence.setReducerClass(UserVectorToCooccurenceReducer.class);
        job_cooccurence.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new RecommenderJob(), args);
    }
}
The error that I get is:
java.io.IOException: Type mismatch in key from map: expected org.apache.mahout.math.VarLongWritable, received org.apache.hadoop.io.IntWritable
While Googling for a fix, I found that my issue is similar to this question, but the difference is that I am already using SequenceFileInputFormat and SequenceFileOutputFormat, I believe correctly. I also see that org.apache.mahout.cf.taste.hadoop.item.RecommenderJob does more or less the same thing. According to my understanding and the Yahoo tutorial:
SequenceFileOutputFormat rapidly serializes arbitrary data types to the file; the corresponding SequenceFileInputFormat will deserialize the file into the same types and presents the data to the next Mapper in the same manner as it was emitted by the previous Reducer.
What am I doing wrong? I would really appreciate some pointers; I spent the day trying to fix this and got nowhere :(
Your second mapper has the following signature:
public class UserVectorToCooccurenceMapper extends
        Mapper<VarLongWritable, VectorWritable, IntWritable, IntWritable>
But you define the following in your driver code:
job_cooccurence.setMapOutputKeyClass(VarLongWritable.class);
job_cooccurence.setMapOutputValueClass(VectorWritable.class);
The reducer is expecting <IntWritable, IntWritable> as input, so you should just amend your driver code to:
job_cooccurence.setMapOutputKeyClass(IntWritable.class);
job_cooccurence.setMapOutputValueClass(IntWritable.class);
