Passing a filename to Java UDF from Pig using distributed cache - user-defined-functions

I am using a small map file in my Java UDF function and I want to pass the filename of this file from Pig through the constructor.
The following is the relevant part of my UDF:
public GenerateXML() throws IOException {
    this(null);
}

public GenerateXML(String mapFilename) throws IOException {
    if (mapFilename != null) {
        // do processing
    }
}
In the Pig script I have the following line
DEFINE GenerateXML com.domain.GenerateXML('typemap.tsv');
This works in local mode, but not in distributed mode. I am passing the following parameters to Pig on the command line:
pig -Dmapred.cache.files="/path/to/typemap.tsv#typemap.tsv" -Dmapred.create.symlink=yes -f generate-xml.pig
And I am getting the following exception
2013-01-11 10:39:42,002 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<file generate-xml.pig, line 16, column 42> Failed to generate logical plan. Nested exception: java.lang.RuntimeException: could not instantiate 'com.domain.GenerateXML' with arguments '[typemap.tsv]'
Any idea what I need to change to make it work?

The problem is solved now.
It turns out that when running the Pig script with the following parameters
pig -Dmapred.cache.files="/path/to/typemap.tsv#typemap.tsv" -Dmapred.create.symlink=yes -f generate-xml.pig
the /path/to/typemap.tsv has to be a local path, not a path in HDFS.
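For reference, here is a minimal sketch of how the constructor could then read the symlinked file from the task's working directory, assuming the map file is tab-separated; the typeMap field and the trivial exec body are illustrative, not the real GenerateXML implementation:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GenerateXML extends EvalFunc<String> {

    private final Map<String, String> typeMap = new HashMap<String, String>();

    // The UDF is constructed with the symlink name ('typemap.tsv'), which the
    // distributed cache exposes in the task's working directory.
    public GenerateXML(String mapFilename) throws IOException {
        if (mapFilename != null) {
            BufferedReader reader = new BufferedReader(new FileReader(mapFilename));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        typeMap.put(parts[0], parts[1]);
                    }
                }
            } finally {
                reader.close();
            }
        }
    }

    @Override
    public String exec(Tuple input) throws IOException {
        // Placeholder: the real XML generation would go here.
        return input == null ? null : String.valueOf(input.get(0));
    }
}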

You can use the getCacheFiles method in a Pig UDF and it will be enough - you don't have to use any additional properties like mapred.cache.files. Your case can be implemented like this:
public class UdfCacheExample extends EvalFunc<Tuple> {

    private Dictionary dictionary;
    private String pathToDictionary;

    public UdfCacheExample(String pathToDictionary) {
        this.pathToDictionary = pathToDictionary;
    }

    @Override
    public Tuple exec(Tuple input) throws IOException {
        Dictionary dictionary = getDictionary();
        return createSomething(input);
    }

    @Override
    public List<String> getCacheFiles() {
        return Arrays.asList(pathToDictionary);
    }

    private Dictionary getDictionary() {
        // lazy initialization here
    }
}
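For illustration, the body of getDictionary() above could look roughly like this. It assumes pathToDictionary was given with a symlink fragment (e.g. '/hdfs/path/dict.txt#dict.txt') so the shipped copy appears under that name in the task's working directory; Dictionary.addEntry and the java.io classes used here are placeholders, not part of any Pig API:
    private Dictionary getDictionary() throws IOException {
        if (dictionary == null) {
            // Everything after '#' is the local symlink name in the working directory.
            String localName = pathToDictionary.substring(pathToDictionary.indexOf('#') + 1);
            BufferedReader reader = new BufferedReader(new FileReader(localName));
            try {
                dictionary = new Dictionary();
                String line;
                while ((line = reader.readLine()) != null) {
                    dictionary.addEntry(line);   // hypothetical Dictionary method
                }
            } finally {
                reader.close();
            }
        }
        return dictionary;
    }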

Related

Issue while Writing Multiple O/P Files in MapReduce

I have a requirement to split my input file into 2 output files based on a filter condition. My output directories should look like below:
/hdfs/base/dir/matched/YYYY/MM/DD
/hdfs/base/dir/notmatched/YYYY/MM/DD
I am using the MultipleOutputs class to split my data in my map function.
In my driver class I am using the following:
FileOutputFormat.setOutputPath(job, new Path("/hdfs/base/dir"));
and in the mapper I am using:
mos.write(key, value, fileName); // File Name is generating based on filter criteria
This program works fine for a single day, but on the second day it fails with:
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://nameservice1/hdfs/base/dir already exists
I cannot use a different base directory for the second day.
How can I handle this situation?
Note: I don't want to read the input twice to create 2 separate files.
Create a custom o/p format class like the one below:
package com.visa.util;

import java.io.IOException;

import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CostomOutputFormat<K, V> extends SequenceFileOutputFormat<K, V> {

    // Deliberately left empty: this skips the check that makes the job fail
    // when the output directory already exists.
    @Override
    public void checkOutputSpecs(JobContext arg0) throws IOException {
    }

    @Override
    public OutputCommitter getOutputCommitter(TaskAttemptContext arg0) throws IOException {
        return super.getOutputCommitter(arg0);
    }

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext arg0) throws IOException, InterruptedException {
        return super.getRecordWriter(arg0);
    }
}
and use it in the driver class:
job.setOutputFormatClass(CostomOutputFormat.class);
This will skip the check for the existence of the output directory.
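As a side note, a hedged sketch of the mapper side described in the question, where the MultipleOutputs base path carries the matched/notmatched split and the date partitioning (the filter check, the hard-coded date, and the NullWritable/Text output types are placeholders, not the real job's types):
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Placeholder filter criteria and hard-coded example date; in the real
        // job these would come from the record itself.
        boolean matched = value.toString().contains("MATCH");
        String subDir = matched ? "matched" : "notmatched";
        // The third argument is a base path relative to the job output directory,
        // so records land under /hdfs/base/dir/<subDir>/YYYY/MM/DD/part-m-*.
        String fileName = subDir + "/2013/01/01/part";
        mos.write(NullWritable.get(), value, fileName);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}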
You can have a flag column in your output value. Later you can process the output and split it by the flag column.

set a conf value in mapper - get it in run method

In the run method of the driver class, I want to fetch a String value (set in the mapper function) and write it to a file. I used the following code, but null was returned. Please help.
Mapper
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.getConfiguration().set("feedName", feedName);
}
Driver Class
@Override
public int run(String[] args) throws Exception {
    String lineVal = conf.get("feedName");
}
Configuration is one way.
If you want to pass non-counter types of values back to the driver, you can utilize HDFS for that.
Either write it to your main job output (the keys and values that you emit from your job).
Or alternatively use MultipleOutputs, if you do not want to mess with your standard job output.
For example, you can write any kind of properties as Text keys and Text values from your mappers or reducers.
Once control is back in your driver, simply read it from HDFS. For example, you can load the name/value pairs into the Configuration object to be used by the next job in your sequence:
public void load(Configuration targetConf, Path src, FileSystem fs) throws IOException {
    InputStream is = fs.open(src);
    try {
        Properties props = new Properties();
        props.load(new InputStreamReader(is, "UTF8"));
        for (Map.Entry prop : props.entrySet()) {
            String name = (String) prop.getKey();
            String value = (String) prop.getValue();
            targetConf.set(name, value);
        }
    } finally {
        is.close();
    }
}
Note that if you have multiple mappers or reducers where you write to MultipleOutputs, you will end up with multiple {name}-m-##### or {name}-r-##### files.
In that case, you will need to either read from every output file or run a single reducer job to combine your outputs into one and then just read from one file as shown above.
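For the write side, here is a hedged sketch of emitting such name/value pairs through MultipleOutputs so they do not mix with the standard job output; the named output "props", the class name and the feedName value are illustrative. In the driver you would declare the named output with MultipleOutputs.addNamedOutput(job, "props", TextOutputFormat.class, Text.class, Text.class), and a mapper could then look like this:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class FeedMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... normal map work and context.write(...) calls go here ...
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // One "name<TAB>value" line per task; Properties.load() in the driver's
        // load() method above treats the tab as the key/value separator.
        mos.write("props", new Text("feedName"), new Text("someFeedValue"));
        mos.close();
    }
}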
Using Configuration you can only do it the other way around.
You can set values in the driver class:
public int run(String[] args) throws Exception {
    conf.set("feedName", value);
}
and get them in the mapper class:
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    String lineVal = conf.get("feedName");
}
UPDATE
One option is to write the data to a file in HDFS and then access it in the driver class. These files can be treated as "intermediate files".
Just try it and see.

how to solve "Error during parsing. could not instantiate" in pig?

Hello everyone, I am new to Pig. I am trying a Pig script (shown below), and it shows the following error:
ERROR 1000: Error during parsing. could not instantiate 'UPER' with arguments 'null' Details at logfile: /home/training/pig_1371303109105.log
My Pig script:
register udf.jar;
A = LOAD 'data1.txt' USING PigStorage(',') AS (name:chararray, class:chararray, age:int);
B = foreach A generate UPER(class);
I followed this tutorial.
My Java class is:
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

import java.io.*;

public class UPER extends EvalFunc<String> {

    @Override
    public String exec(Tuple input) throws IOException {
        // TODO Auto-generated method stub
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            return str.toUpperCase();
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
I found the following information in your error log:
Caused by: java.lang.Error: Unresolved compilation problem:
The type org.apache.commons.logging.Log cannot be resolved. It is indirectly referenced from required .class files
at UPER.<init>(UPER.java:1)
I guess that org.apache.commons.logging.Log is not in your environment. How did you compile and run your Pig script? This class should be on the Pig classpath; org.apache.commons.logging.Log is in commons-logging-*.*.*.jar.

How to implement a Java MapReduce that produces output values larger than the maximum heap?

I need to process a 9 GB CSV file. The MapReduce job has to do some grouping and produce a special format for a legacy system.
The input file looks like this:
AppId;Username;Other Fields like timestamps...
app/10;Mr Foobar;...
app/10;d0x;...
app/10;Mr leet;...
app/110;kr1s;...
app/110;d0x;...
...
The output file is quite simple, like this:
app/10;3;Mr Foobar;d0x;Mr leet
app/110;2;kr1s;d0x
(Each line is: AppId; amount of users; a list of all users playing the game.)
To solve that, I wrote a mapper that emits the AppId as the key and the Username as the value. With this, the map phase runs fine.
The problem happens in the reduce phase. There I get an Iterable<Text> userIds that potentially contains a very large number of userIds (>5,000,000).
The reducer that processes this looks like this:
public class UserToAppReducer extends Reducer<Text, Text, Text, UserSetWritable> {

    final UserSetWritable userSet = new UserSetWritable();

    @Override
    protected void reduce(final Text appId, final Iterable<Text> userIds, final Context context) throws IOException, InterruptedException {
        this.userSet.clear();
        for (final Text userId : userIds) {
            this.userSet.add(userId.toString());
        }
        context.write(appId, this.userSet);
    }
}
The UserSetWritable is a custom writable that stores a list of users. This is needed to generate the output (key = appId, value = a list of usernames).
This is what the current UserSetWritable looks like:
public class UserSetWritable implements Writable {

    private final Set<String> userIds = new HashSet<String>();

    public void add(final String userId) {
        this.userIds.add(userId);
    }

    @Override
    public void write(final DataOutput out) throws IOException {
        out.writeInt(this.userIds.size());
        for (final String userId : this.userIds) {
            out.writeUTF(userId);
        }
    }

    @Override
    public void readFields(final DataInput in) throws IOException {
        final int size = in.readInt();
        for (int i = 0; i < size; i++) {
            this.userIds.add(in.readUTF());
        }
    }

    @Override
    public String toString() {
        String result = "";
        for (final String userId : this.userIds) {
            result += userId + "\t";
        }
        result += this.userIds.size();
        return result;
    }

    public void clear() {
        this.userIds.clear();
    }
}
With this approach I get a Java heap OutOfMemoryError.
Error: Java heap space
attempt_201303072200_0016_r_000002_0: WARN : mapreduce.Counters - Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
attempt_201303072200_0016_r_000002_0: WARN : org.apache.hadoop.conf.Configuration - session.id is deprecated. Instead, use dfs.metrics.session-id
attempt_201303072200_0016_r_000002_0: WARN : org.apache.hadoop.conf.Configuration - slave.host.name is deprecated. Instead, use dfs.datanode.hostname
attempt_201303072200_0016_r_000002_0: FATAL: org.apache.hadoop.mapred.Child - Error running child : java.lang.OutOfMemoryError: Java heap space
attempt_201303072200_0016_r_000002_0: at java.util.Arrays.copyOfRange(Arrays.java:3209)
attempt_201303072200_0016_r_000002_0: at java.lang.String.<init>(String.java:215)
attempt_201303072200_0016_r_000002_0: at java.nio.HeapCharBuffer.toString(HeapCharBuffer.java:542)
attempt_201303072200_0016_r_000002_0: at java.nio.CharBuffer.toString(CharBuffer.java:1157)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.decode(Text.java:394)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.decode(Text.java:371)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.io.Text.toString(Text.java:273)
attempt_201303072200_0016_r_000002_0: at com.myCompany.UserToAppReducer.reduce(UserToAppReducer.java:21)
attempt_201303072200_0016_r_000002_0: at com.myCompany.UserToAppReducer.reduce(UserToAppReducer.java:1)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
attempt_201303072200_0016_r_000002_0: at java.security.AccessController.doPrivileged(Native Method)
attempt_201303072200_0016_r_000002_0: at javax.security.auth.Subject.doAs(Subject.java:396)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
attempt_201303072200_0016_r_000002_0: at org.apache.hadoop.mapred.Child.main(Child.java:262)
UserToAppReducer.java:21 is this line: this.userSet.add(userId.toString());
On the same cluster I am able to process the data with this Pig script:
set job.name convertForLegacy
set default_parallel 4
data = load '/data/...txt'
using PigStorage(',')
as (appid:chararray,uid:chararray,...);
grp = group data by appid;
counter = foreach grp generate group, data.uid, COUNT(data);
store counter into '/output/....' using PigStorage(',');
So how can I solve this OutOfMemoryError with MapReduce?
Similar question for writing out 'large' values: Handling large output values from reduce step in Hadoop
In addition to using this concept for writing out large records (getting the CSV list you want with 100,000's of users), you'll need to use a composite key (the App ID and user ID) and a custom partitioner to ensure all the keys for a single App ID make their way to the same reducer.
Something like this gist (not tested).
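Since the gist is not reproduced here, a rough, untested sketch of the partitioner part could look like this; it assumes the mapper emits a composite Text key of the form "appId\tuserId", and the class name is illustrative:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions on the appId part of the composite key only, so all users of one
// app end up on the same reducer even though each (appId, userId) pair is a
// distinct key.
public class AppIdPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String composite = key.toString();
        int tab = composite.indexOf('\t');
        String appId = tab >= 0 ? composite.substring(0, tab) : composite;
        return (appId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
Combined with a grouping comparator that compares only the appId part, the reducer can then stream users out one by one instead of buffering them all in an in-memory set.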

Hadoop - Mysql new API connection

I am trying to use MySQL as the input of a Hadoop process. How do I use the DBInputFormat class for a Hadoop-MySQL connection in version 1.0.3? The configuration of the job via JobConf from hadoop-1.0.3/docs/api/ doesn't work.
// Create a new JobConf
JobConf job = new JobConf(new Configuration(), MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
FileInputFormat.setInputPaths(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setCombinerClass(MyJob.MyReducer.class);
job.setReducerClass(MyJob.MyReducer.class);
job.setInputFormat(SequenceFileInputFormat.class);
job.setOutputFormat(SequenceFileOutputFormat.class);
You need to do something like the following (assuming the typical employee table for example):
JobConf conf = new JobConf(getConf(), MyDriver.class);
conf.setInputFormat(DBInputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/mydatabase");
String[] fields = { "employee_id", "name" };
DBInputFormat.setInput(conf, MyRecord.class, "employees", null /* conditions */, "employee_id", fields);
...
// other necessary configuration
JobClient.runJob(conf);
The configureDB() and setInput() calls configure the DBInputFormat. The first call specifies the JDBC driver implementation to use and which database to connect to. The second call specifies what data to load from the database. The MyRecord class is the class the data will be read into in Java, and "employees" is the name of the table to read. The "employee_id" parameter specifies the table's primary key, used for ordering results; the "Limitations of the InputFormat" section of the Cloudera article linked below explains why this is necessary. Finally, the fields array lists which columns of the table to read. An overloaded definition of setInput() allows you to specify an arbitrary SQL query to read from, instead.
After calling configureDB() and setInput(), you should configure the rest of your job as usual, setting the Mapper and Reducer classes, specifying any other data sources to read from (e.g., datasets in HDFS) and other job-specific parameters.
You need to create your own implementation of Writable - something like the following (with id and name as the table fields):
class MyRecord implements Writable, DBWritable {

    long id;
    String name;

    public void readFields(DataInput in) throws IOException {
        this.id = in.readLong();
        this.name = Text.readString(in);
    }

    public void readFields(ResultSet resultSet) throws SQLException {
        this.id = resultSet.getLong(1);
        this.name = resultSet.getString(2);
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(this.id);
        Text.writeString(out, this.name);
    }

    public void write(PreparedStatement stmt) throws SQLException {
        stmt.setLong(1, this.id);
        stmt.setString(2, this.name);
    }
}
The mapper then receives an instance of your DBWritable implementation as its input value. The input key is a row id provided by the database; you’ll most likely discard this value.
public class MyMapper extends MapReduceBase implements Mapper<LongWritable, MyRecord, LongWritable, Text> {
    public void map(LongWritable key, MyRecord val, OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
        // Use val.id, val.name here
        output.collect(new LongWritable(val.id), new Text(val.name));
    }
}
For more, read the following link (the actual source of my answer): http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
Have a look at this post. It shows how to sink data from MapReduce into a MySQL database.
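That post is not reproduced here, but since MyRecord above already implements write(PreparedStatement), a rough, hedged sketch of the output side with the same old-style API might look like this (table and field names are the illustrative ones from above):
// Assumes the reduce output key is a DBWritable such as MyRecord, whose
// write(PreparedStatement) method fills in the INSERT statement.
JobConf conf = new JobConf(getConf(), MyDriver.class);
conf.setOutputFormat(DBOutputFormat.class);
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost/mydatabase");
// Each emitted key becomes one row in the "employees" table.
DBOutputFormat.setOutput(conf, "employees", "employee_id", "name");
// ... mapper/reducer classes and other job configuration ...
JobClient.runJob(conf);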
