How do i copy a file that is required for a hadoop program, to all compute nodes? I am aware that -file option for hadoop streaming does that. How do i do this for java+hadoop?
Exactly the same way.
Assuming you use the ToolRunner / Configured / Tool pattern, the files you specify after the -files option will be in the local dir when your mapper / reducer / combiner tasks run:
public class Driver extends Configured implements Tool {
public static void main(String args[]) {
ToolRunner.run(new Driver(), args);
}
public int run(String args[]) {
Job job = new Job(getConf());
// ...
job.waitForCompletion(true);
}
}
public class MyMapper extends Mapper<K1, V1, K2, V2> {
public void setup(Context context) {
File myFile = new File("file.csv");
// do something with file
}
// ...
}
You can then execute with:
#> hadoop jar myJar.jar Driver -files file.csv ......
See the Javadoc for GenericOptionsParser for more info
Related
When I submit hadoop job It always said
WARN [JobClient] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same
How can I fix this?
I am using CDH 4.6.0.
You should use something like below driver code to start your MapReduce job to get rid of warning (although it doesn't do any harm):
public class MyClass extends Configured implements Tool {
public int run(String [] args) throws IOException {
JobConf conf = new JobConf(getConf(), MyClass.class);
// run the job here.
return 0;
}
public static void main(String [] args) throws Exception {
int status = ToolRunner.run(new MyClass(), args); // calls your run() method.
System.exit(status);
}
}
I am trying to set the number of reducers to use via command line. It seems like I am using wrong syntax. I am using hadoop 2.5 (yarn) MR2.
hadoop jar mrjobs-0.1.jar com.example.Weather -D mapreduce.job.reduces=2 datasets/inputs output
This commands is not working when I added -D option else its working fine.
Any help appreciated !
thanks!
Syntax looks proper, I have tested against 2.5 YARN MR2 with the following it's working:
hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=5 input output
Most probably the problem could be your Driver class hasn't implemented ToolRunner which works in coordination with GenericOptionsParser to parse generic command line arguments.
Here is an example of how to implement ToolRunner in your MapReduce Driver class:
// imports ignored
public class ExampleDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: ExampleDriver <in> <out>");
System.exit(2);
}
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.setJobName("example driver");
job.setJarByClass(ExampleDriver.class);
job.setMapperClass(YourMapper.class);
job.setReducerClass(YourReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int ret = job.waitForCompletion(true) ? 0 : 1;
return ret;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
System.exit(res);
}
}
Hadoop's DistributedCache documentation doesn't seem to sufficently describe how to use the distributed cache. Here is the example given:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Setup the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"),
job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip", job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar", job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz", job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz", job);
3. Use the cached files in the Mapper
or Reducer:
public static class MapClass extends MapReduceBase
implements Mapper<K, V, K, V> {
private Path[] localArchives;
private Path[] localFiles;
public void configure(JobConf job) {
// Get the cached archives/files
File f = new File("./map.zip/some/file/in/zip.txt");
}
public void map(K key, V value,
OutputCollector<K, V> output, Reporter reporter)
throws IOException {
// Use data from the cached archives/files here
// ...
// ...
output.collect(k, v);
}
}
I've been searching around for over an hour trying to figure out how to use this. After piecing together a few other SO questions, here's what I came up with:
public static void main(String[] args) throws Exception {
Job job = new Job(new JobConf(), "Job Name");
JobConf conf = job.getConfiguration();
DistributedCache.createSymlink(conf);
DistributedCache.addCacheArchive(new URI("/ProjectDir/LookupTable.zip", job);
// *Rest of configuration code*
}
public static class MyMapper extends Mapper<Object, Text, Text, IntWritable>
{
private Path[] localArchives;
public void configure(JobConf job)
{
// Get the cached archive
File file1 = new File("./LookupTable.zip/file1.dat");
BufferedReader br1index = new BufferedReader(new FileInputStream(file1));
}
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{ // *Map code* }
}
Where am I supposed to call the void configure(JobConf job) function?
Where do I use the private Path[] localArchives object?
Is my code in the configure() function the correct way to access files within an archive and to link a file with a BufferedReader?
I will answer your questions w.r.t new API and common practices in use for distributed cache
Where am I supposed to call the void configure(JobConf job) function?
Framework will call protected void setup(Context context) method once at beginning of every map task, the logic associated with using cache files is usually handled here. For example, reading file and storing data in variable to be used in map() function which is called after setup()
Where do I use the private Path[] localArchives object?
It will be typically used in setup() method to retrieve path of cache files . Something like this.
Path[] localArchive =DistributedCache.getLocalCacheFiles(context.getConfiguration());
Is my code in the configure() function the correct way to access
files within an archive and to link a file with a BufferedReader?
Its missing a call to method to retrive path where cache files are stored (shown above). Once the path is retrieved the file(s) can be read as below.
FSDataInputStream in = fs.open(localArchive);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
I have been testing a map reduce job on a single node and it seems to work but now that I am trying to run it on a remote cluster I am getting a ClassNotFoundExcepton. My code is structured as follows:
public class Pivot {
public static class Mapper extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
#Override
public void map(ImmutableBytesWritable rowkey, Result values, Context context) throws IOException {
(map code)
}
}
public static class Reducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {
public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
(reduce code)
}
}
public static void main(String[] args) {
Configuration conf = HBaseConfiguration.create();
conf.set("fs.default.name", "hdfs://hadoop-master:9000");
conf.set("mapred.job.tracker", "hdfs://hadoop-master:9001");
conf.set("hbase.master", "hadoop-master:60000");
conf.set("hbase.zookeeper.quorum", "hadoop-master");
conf.set("hbase.zookeeper.property.clientPort", "2222");
Job job = new Job(conf);
job.setJobName("Pivot");
job.setJarByClass(Pivot.class);
Scan scan = new Scan();
TableMapReduceUtil.initTableMapperJob("InputTable", scan, Mapper.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
TableMapReduceUtil.initTableReducerJob("OutputTable", Reducer.class, job);
job.waitForCompletion(true);
}
}
The error I am receiving when I try to run this job is the following:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Pivot$Mapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
...
Is there something I'm missing? Why is the job having difficulty finding the mapper?
When running a job from Eclipse it's important to note that Hadoop requires you to launch your job from a jar. Hadoop requires this so it can send your code up to HDFS / JobTracker.
In your case i imagine you haven't bundled up your job classes into a jar, and then run the program 'from the jar' - resulting in a CNFE.
Try building a jar and running from the command line using hadoop jar myjar.jar ..., once this works then you can test running from within Eclipse
I am trying to run a map reduce job using hadoop jar command.
I am trying to include external libraries using the -libjars option.
The command that I am running currently is
hadoop jar mapR.jar com.ms.hadoop.poc.CsvParser -libjars google-gson.jar Test1.txt output
But I am recieveing this as the output
usage: [input] [output]
Can anyone please help me out.
I have included the the exteranal libraries in my classpath as well.
Can you list the contents of your main(String args[]) method? Are you using the ToolRunner interface to launch your job? The parsing of the -libjars argument is a function of the GenericOptionsParser, which is invoked for you via the ToolRunner utility class:
public class Driver extends Configured implements Tool {
public static void main(String args[]) {
System.exit(ToolRunner.run(new Driver(), args)));
}
public int run(String args[]) {
Job job = new Job(getConf());
Configuration conf = job.getConfiguration();
// other job configuration
return job.waitForCompletion(true) ? 0 : 1;
}
}