Consider the following main class for a map-reduce job:
import java.nio.charset.Charset;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class App extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new App(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        System.out.println(Charset.defaultCharset().toString());
        return 0;
    }
}
When run from an interactive shell, it prints 'UTF-8'. When run from crontab, it prints 'US-ASCII'.
But with 'java -Dfile.encoding=UTF-8 -jar xxx.jar', it works fine under crontab. However, the 'hadoop jar' command does not pick up this parameter:
hadoop jar xxx.jar -Dfile.encoding=UTF-8
Under crontab, it still prints US-ASCII.
One workaround is to export LC_ALL in the crontab entry:
0 * * * * (export LC_ALL=en_US.UTF-8; hadoop jar xxx.jar)
Is there another way?
Update
Another environment variable I found useful is HADOOP_OPTS:
0 * * * * (export HADOOP_OPTS="-Dfile.encoding=UTF-8"; hadoop jar xxx.jar)
Try setting the HADOOP_OPTS environment variable to contain arguments like this; they are really arguments to java. See the bin/hadoop script: it appends them to the java command.
We found that the problem was that the mapper java processes didn't have -Dfile.encoding=UTF-8. We had to add that to "mapreduce.map.java.opts", and the same for "mapreduce.reduce.java.opts".
You can set these in the XML config files, as well as in Java, like:
config.set("mapreduce.map.java.opts","-Xmx1843M -Dfile.encoding=UTF-8");
See http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html for config details.
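For example, a minimal driver sketch (assuming a Tool/Configured driver on Hadoop 2.x; the class and job names are placeholders and the heap size is just an illustrative value) that sets both options before submitting:
public class Utf8JobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Pass -Dfile.encoding=UTF-8 to every map and reduce task JVM.
        conf.set("mapreduce.map.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
        conf.set("mapreduce.reduce.java.opts", "-Xmx1843M -Dfile.encoding=UTF-8");
        Job job = Job.getInstance(conf, "utf8-job");
        // ... set mapper/reducer, input/output paths, then submit as usual
        return job.waitForCompletion(true) ? 0 : 1;
    }
}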
Related
I have a hadoop (2.2.0) map-reduce job which reads text from a specified path (say INPUT_PATH), and does some processing. I don't want to hardcode the input path (since it comes from some other source which changes each week).
I believe there should be a way in Hadoop to specify an XML properties file on the command line when launching the job. How should I do it?
One way I thought of was to set an environment variable that points to the location of the properties file, then read that variable in code and load the file from there. This could work because the value of the environment variable can be changed each week without changing the code, but it feels like an ugly way of loading properties and overrides.
Please let me know the least hacky way of doing this.
There is no built-in way to read an arbitrary configuration file for input/output paths.
One way I can suggest is to implement a Java M/R driver program that does the following:
Read the configuration (XML, properties, anything), presumably generated or updated by the other process
Set the job properties
Submit the job with your hadoop command, passing the configuration file as an argument
Something like this:
import java.io.FileInputStream;
import java.util.Properties;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SampleMRDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Read the configuration file passed as the first argument
        Properties prop = new Properties();
        prop.loadFromXML(new FileInputStream(args[0]));

        Job job = Job.getInstance(getConf(), "Test Job");
        job.setJarByClass(SampleMRDriver.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(TestMapper.class);
        job.setReducerClass(TestReducer.class);

        FileInputFormat.setInputPaths(job, new Path(prop.getProperty("input_path")));
        FileOutputFormat.setOutputPath(job, new Path(prop.getProperty("output_path")));

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new SampleMRDriver(), args);
    }
}
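A lighter-weight alternative, since the driver already goes through ToolRunner: GenericOptionsParser understands a -conf option that merges a Hadoop-style XML configuration file into getConf(), so you could keep the weekly-changing paths there and read them with getConf().get("input_path") instead of loading a Properties file yourself. The jar and file names below are placeholders:
hadoop jar sample-driver.jar SampleMRDriver -conf /path/to/weekly-job.xml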
I am getting the following error when I try to run a MapReduce job on Avro:
14/02/26 20:07:50 INFO mapreduce.Job: Task Id : attempt_1393424169778_0002_m_000001_0, Status : FAILED
Error: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
How can I fix this?
I have Hadoop 2.2 up and running.
I'm using Avro 1.7.6.
Below is the code:
package avroColorCount;
import java.io.IOException;
import org.apache.avro.*;
import org.apache.avro.Schema.Type;
import org.apache.avro.mapred.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class MapredColorCount extends Configured implements Tool {
public static class ColorCountMapper extends AvroMapper<User, Pair<CharSequence, Integer>> {
@Override
public void map(User user, AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter)
throws IOException {
CharSequence color = user.getFavoriteColor();
// We need this check because the User.favorite_color field has type ["string", "null"]
if (color == null) {
color = "none";
}
collector.collect(new Pair<CharSequence, Integer>(color, 1));
}
}
public static class ColorCountReducer extends AvroReducer<CharSequence, Integer,
Pair<CharSequence, Integer>> {
@Override
public void reduce(CharSequence key, Iterable<Integer> values,
AvroCollector<Pair<CharSequence, Integer>> collector,
Reporter reporter)
throws IOException {
int sum = 0;
for (Integer value : values) {
sum += value;
}
collector.collect(new Pair<CharSequence, Integer>(key, sum));
}
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MapredColorCount <input path> <output path>");
return -1;
}
JobConf conf = new JobConf(getConf(), MapredColorCount.class);
conf.setJobName("colorcount");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
AvroJob.setMapperClass(conf, ColorCountMapper.class);
AvroJob.setReducerClass(conf, ColorCountReducer.class);
// Note that AvroJob.setInputSchema and AvroJob.setOutputSchema set
// relevant config options such as input/output format, map output
// classes, and output key class.
AvroJob.setInputSchema(conf, User.getClassSchema());
AvroJob.setOutputSchema(conf, Pair.getPairSchema(Schema.create(Type.STRING),
Schema.create(Type.INT)));
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MapredColorCount(), args);
System.exit(res);
}
}
You're using the wrong version of the Avro library.
The createDatumWriter method first appeared in the GenericData class in Avro 1.7.5. If Hadoop does not find it, it means there is an earlier version of the Avro library (possibly 1.7.4) on your classpath.
First, try to provide the correct version of the library with HADOOP_CLASSPATH or the -libjars option.
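For example (the jar paths below are placeholders for wherever your Avro 1.7.6 artifact lives; HADOOP_CLASSPATH fixes the client-side classpath, while -libjars ships the jar to the map and reduce tasks):
export HADOOP_CLASSPATH=/path/to/avro-1.7.6.jar
hadoop jar yourjob.jar avroColorCount.MapredColorCount -libjars /path/to/avro-1.7.6.jar input.avro output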
Unfortunately, it may be more tricky. In my case it was some other jar file that I loaded with my project but never actually used. It took me several weeks to find it; I hope you find yours quicker.
Here is some handy code to help you analyze your classpath during a job run (call it from a working job, such as the WordCount example):
// Requires java.net.URL and java.net.URLClassLoader imports.
public static void printClassPath() {
    ClassLoader cl = ClassLoader.getSystemClassLoader();
    URL[] urls = ((URLClassLoader) cl).getURLs();
    System.out.println("classpath BEGIN");
    for (URL url : urls) {
        System.out.println(url.getFile());
    }
    System.out.println("classpath END");
}
Hope it helps.
Viacheslav Rodionov's answer definitely points to the root cause. Thank you for posting! The following configuration setting then seemed to pick up the 1.7.6 library first and allowed my reducer code (where the createDatumWriter method was called) to complete successfully:
Configuration conf = getConf();
conf.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
Job job = Job.getInstance(conf);
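If you would rather not hard-code it, the same switch can be passed as a generic option on the command line (this assumes the driver goes through ToolRunner; MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST is just the constant for mapreduce.job.user.classpath.first, and the jar and class names are placeholders):
hadoop jar yourjob.jar YourDriver -Dmapreduce.job.user.classpath.first=true <args>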
I ran into exactly the same problem, and as Viacheslav suggested, it's a version conflict between the Avro shipped with the Hadoop distribution and the Avro version in your project.
The most reliable way to solve it seems to be to simply use the Avro version installed with your Hadoop distro, unless there is a compelling reason to use a different one.
Why is using the default Avro version that comes with the Hadoop distribution a good idea? Because in a production Hadoop environment you will most likely deal with numerous other jobs and services running on the same shared infrastructure, and they all share the jar dependencies that come with the installed Hadoop distribution.
Replacing the jar version for a specific MapReduce job is tricky but solvable. However, it creates a risk of introducing compatibility problems that may be very hard to detect and can backfire later somewhere else in your Hadoop ecosystem.
As we know, we need to pack all required classes into the job jar and upload it to the server. This is slow, so I would like to know whether there is a way to specify third-party jars when executing a map-reduce job, so that I only have to pack my own classes without the dependencies.
PS: I found there is a '-libjars' option, but I can't figure out how to use it. Here is the link: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
Those are called generic options.
So, to support those, your job should implement Tool.
Run your job like --
hadoop jar yourfile.jar [mainClass] args -libjars <comma separated list of jars>
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application --
public class YourClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new YourClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // Parse your normal arguments here.
        Configuration conf = getConf();
        Job job = new Job(conf, "Name of job");

        // Set the mapper/reducer class names etc.
        // Set the output data type classes etc.

        // Accept the HDFS input and output dirs at run time.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
For me, I had to specify the -libjars option before the application arguments; otherwise it was treated as a normal argument.
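In other words, the generic options go right after the main class and before the application arguments, roughly:
hadoop jar yourfile.jar [mainClass] -libjars <comma separated list of jars> <input> <output>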
I am trying to run a map reduce job using hadoop jar command.
I am trying to include external libraries using the -libjars option.
The command that I am running currently is
hadoop jar mapR.jar com.ms.hadoop.poc.CsvParser -libjars google-gson.jar Test1.txt output
But I am receiving this as the output:
usage: [input] [output]
Can anyone please help me out?
I have included the external libraries in my classpath as well.
Can you list the contents of your main(String args[]) method? Are you using ToolRunner to launch your job? The parsing of the -libjars argument is a function of the GenericOptionsParser, which is invoked for you via the ToolRunner utility class:
public class Driver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Driver(), args));
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        Configuration conf = job.getConfiguration();
        // other job configuration
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
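With ToolRunner wired in like this, the command from the question should work as-is, because GenericOptionsParser strips out -libjars google-gson.jar before Test1.txt and output ever reach run():
hadoop jar mapR.jar com.ms.hadoop.poc.CsvParser -libjars google-gson.jar Test1.txt output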
How do I copy a file that is required by a Hadoop program to all compute nodes? I am aware that the -file option for Hadoop Streaming does that. How do I do this for Java + Hadoop?
Exactly the same way.
Assuming you use the ToolRunner / Configured / Tool pattern, the files you specify with the -files option will be in the local working directory when your mapper / reducer / combiner tasks run:
public class Driver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Driver(), args);
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        // ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

public class MyMapper extends Mapper<K1, V1, K2, V2> {

    public void setup(Context context) {
        // "file.csv" was shipped with -files and is in the task's working directory
        File myFile = new File("file.csv");
        // do something with file
    }

    // ...
}
You can then execute with:
#> hadoop jar myJar.jar Driver -files file.csv ......
See the Javadoc for GenericOptionsParser for more info.
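If you would rather wire it up in code instead of on the command line, Hadoop 2's Job API also exposes the distributed cache directly; a minimal sketch, assuming the file already sits in HDFS (the path and symlink name below are placeholders):
// In the driver's run() method, before job submission (needs java.net.URI):
Job job = Job.getInstance(getConf(), "my job");
// Localize the HDFS file into every task's working directory;
// the "#file.csv" fragment is the file name the tasks will see.
job.addCacheFile(new URI("hdfs:///user/me/file.csv#file.csv"));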