I have a custom partitioner like below:
import java.util.*;
import org.apache.hadoop.mapreduce.*;
public static class SignaturePartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        return (key.toString().split(" ")[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
I set the Hadoop streaming parameters like below:
-file SignaturePartitioner.java \
-partitioner SignaturePartitioner \
Then I get an error: Class Not Found. Do you know what the problem is?
I faced the same issue, but managed to solve it after a lot of research.
The root cause is that hadoop-streaming-2.6.0.jar uses the mapred API, not the mapreduce API. Also, implement the Partitioner interface rather than extending the Partitioner class. The following worked for me:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.JobConf;

public class Mypartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}

    public int getPartition(Text pkey, Text pvalue, int pnumparts)
    {
        // keys starting with "a" go to reducer 0, everything else to reducer 1
        if (pkey.toString().startsWith("a"))
            return 0;
        else
            return 1;
    }
}
Compile Mypartitioner, create a jar, and then run:
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
  -libjars /home/sanjiv/hadoop-2.6.0/Mypartitioner.jar \
  -D mapreduce.job.reduces=2 \
  -files /home/sanjiv/mymapper.sh,/home/sanjiv/myreducer.sh \
  -input indir -output outdir -mapper mymapper.sh \
  -reducer myreducer.sh -partitioner Mypartitioner
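For completeness, a minimal sketch of the compile-and-jar step; the classpath lookup and file locations are assumptions, so adjust them to your installation:
# compile against the Hadoop client jars, then package the class
javac -cp "$(bin/hadoop classpath)" Mypartitioner.java
jar cf Mypartitioner.jar Mypartitioner*.class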
-file SignaturePartitioner.java -partitioner SignaturePartitioner
The -file option will make the file available on all the required nodes through the Hadoop framework. The -partitioner option, however, needs to point to the class name, not the Java file name.
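Concretely, the difference looks roughly like this (a sketch; the jar is assumed to contain the compiled partitioner class, and the class must also match the mapred API as noted in the answer above):
# class not found: only the .java source is shipped, so no compiled class is on the task classpath
-file SignaturePartitioner.java -partitioner SignaturePartitioner
# expected to work: ship the compiled class in a jar and point -partitioner at the class name
-libjars SignaturePartitioner.jar -partitioner SignaturePartitioner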
Spark Streaming 1.6.0
Apache Kafka 10.0.1
I use Spark Streaming to read from the sample topic. The code runs with no errors or exceptions, but I don't get any data on the console via the print() method.
I checked to see if there are messages in the topic:
./bin/kafka-console-consumer.sh \
--zookeeper ip-172-xx-xx-xxx:2181 \
--topic sample \
--from-beginning
And I am getting the messages:
message no. 1
message no. 2
message no. 3
message no. 4
message no. 5
The command to run the streaming job:
./bin/spark-submit \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=512m" \
--jars /home/ubuntu/zifferlabs/target/ZifferLabs-1-jar-with-dependencies.jar \
--class "com.zifferlabs.stream.SampleStream" \
/home/ubuntu/zifferlabs/src/main/java/com/zifferlabs/stream/SampleStream.java
Here is the entire code:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.DefaultDecoder;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class SampleStream {
private static void processStream() {
SparkConf conf = new SparkConf().setAppName("sampleStream")
.setMaster("local[3]")
.set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
.set("spark.driver.memory", "2g").set("spark.streaming.blockInterval", "1000")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.scheduler.mode", "FAIR");
JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(Long.parseLong("2000")));
String[] topics = "sample".split(",");
Set<String> topicSet = new HashSet<String>(Arrays.asList(topics));
Map<String, String> props = new HashMap<String, String>();
props.put("metadata.broker.list", "ip-172-xx-xx-xxx:9092");
props.put("kafka.consumer.id", "sample_con");
props.put("group.id", "sample_group");
props.put("zookeeper.connect", "ip-172-xx-xx-xxx:2181");
props.put("zookeeper.connection.timeout.ms", "16000");
JavaPairInputDStream<String, byte[]> kafkaStream =
KafkaUtils.createDirectStream(jsc, String.class, byte[].class, StringDecoder.class,
DefaultDecoder.class, props, topicSet);
JavaDStream<String> data = kafkaStream.map(new Function<Tuple2<String,byte[]>, String>() {
public String call(Tuple2<String, byte[]> arg0) throws Exception {
System.out.println("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ value is: " + arg0._2().toString());
return arg0._2().toString();
}
});
data.print();
System.out.println("Spark Streaming started....");
jsc.checkpoint("/home/spark/sparkChkPoint");
jsc.start();
jsc.awaitTermination();
System.out.println("Stopped Spark Streaming");
}
public static void main(String[] args) {
processStream();
}
}
I think you've got the code right, but the command line to execute it is incorrect.
You spark-submit the application as follows (formatting's mine + spark.executor.extraJavaOptions removed for simplicity):
./bin/spark-submit \
--jars /home/ubuntu/zifferlabs/target/ZifferLabs-1-jar-with-dependencies.jar \
--class "com.zifferlabs.stream.SampleStream" \
/home/ubuntu/zifferlabs/src/main/java/com/zifferlabs/stream/SampleStream.java
I think it won't work, since spark-submit is given your Java source file rather than the compiled, executable code.
Please spark-submit your application as follows:
./bin/spark-submit \
--class "com.zifferlabs.stream.SampleStream" \
/home/ubuntu/zifferlabs/target/ZifferLabs-1-jar-with-dependencies.jar
Here --class defines the "entry point" to your Spark application, and the jar with the code and its dependencies is the only positional argument to spark-submit.
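If that assembly jar is missing or stale, rebuild it first. A minimal sketch, assuming the project is built with Maven and is already configured to produce the jar-with-dependencies artifact:
cd /home/ubuntu/zifferlabs
mvn clean package
ls target/ZifferLabs-1-jar-with-dependencies.jar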
Give it a shot and report back!
I'm relatively new to Hadoop and I'm struggling a little bit to understand the ClassNotFoundException I get when trying to run the job. I'm using the standard tutorial found here, and here is my WordCount class (running on Ubuntu 16.04, Hadoop 2.7.3, distributed cluster mode):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
To try and remain organized, I added a couple paths to my ~/.bashrc file:
hduser#mynode:~$ cd $HADOOP_CODE
hduser#mynode:/usr/local/hadoop/code$
This is one directory down from the $HADOOP_HOME directory. To compile the WordCount.java file, I ran:
hduser#mynode:/usr/local/hadoop$ hadoop com.sun.tools.javac.Main $HADOOP_CODE/WordCount.java
hduser#mynode:/usr/local/hadoop$ jar cf wc.jar $HADOOP_CODE/WordCount*.class
I then tried:
hduser#mynode:/usr/local/hadoop$ hadoop jar $HADOOP_CODE/wc.jar $HADOOP_CODE/WordCount /home/hduser/input /home/hduser/output/wordcount
which bombed with the following error:
Exception in thread "main" java.lang.ClassNotFoundException: /usr/local/hadoop/code/WordCount
EDIT
This gave me the same error:
hduser#mynode:/usr/local/hadoop/code$ hadoop jar $HADOOP_CODE/wc.jar WordCount /home/hduser/input /home/hduser/output/wordcount
To get it to run without error, I moved the WordCount.java file up one directory to the default Hadoop ($HADOOP_HOME) folder. I also know from here and here that the solution is to add a package to the file.
What I'm trying to understand is why that is the solution. With no package name, where is Hadoop looking for the specified package, and why can't I pass it a full path to get it to run correctly? This may be a basic Java question (apologies, I'm from the Python world), but what does the package name do during the compile process that lets me run without a path name, while leaving off the package name means the class has to be in that default directory? I'd prefer not to have to add a package name to every job I run. An explanation would be greatly appreciated!
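For reference, the package-based fix mentioned above looks roughly like the following sketch. The package name org.myorg is a placeholder of mine, and -d . is needed so the compiled classes land in the package directory layout:
# add `package org.myorg;` as the first line of WordCount.java, then:
hadoop com.sun.tools.javac.Main -d . $HADOOP_CODE/WordCount.java
jar cf wc.jar org/myorg/WordCount*.class
hadoop jar wc.jar org.myorg.WordCount /home/hduser/input /home/hduser/output/wordcount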
I am a beginner in Hadoop. When I try to set the number of reducers on the command line using GenericOptionsParser, the number of reducers does not change. There is no property set in the configuration file mapred-site.xml for the number of reducers, which I think makes the number of reducers 1 by default. I am using the Cloudera QuickVM and Hadoop version "Hadoop 2.5.0-cdh5.2.0".
Pointers appreciated. I also wanted to know the order of precedence among the ways of setting the number of reducers:
1. Using the configuration file mapred-site.xml: mapred.reduce.tasks
2. By specifying it in the driver class: job.setNumReduceTasks(4)
3. By specifying it at the command line via the Tool interface: -Dmapreduce.job.reduces=2
Mapper :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
{
String line = value.toString();
//Split the line into words
for(String word: line.split("\\W+"))
{
//Make sure that the word is legitimate
if(word.length() > 0)
{
//Emit the word as you see it
context.write(new Text(word), new IntWritable(1));
}
}
}
}
Reducer :
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
//Initializing the word count to 0 for every key
int count=0;
for(IntWritable value: values)
{
//Adding the word count counter to count
count += value.get();
}
//Finally write the word and its count
context.write(key, new IntWritable(count));
}
}
Driver :
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool
{
public int run(String[] args) throws Exception
{
//Instantiate the job object for configuring your job
Job job = new Job();
//Specify the class that hadoop needs to look in the JAR file
//This Jar file is then sent to all the machines in the cluster
job.setJarByClass(WordCount.class);
//Set a meaningful name to the job
job.setJobName("Word Count");
//Add the path from where the file input is to be taken
FileInputFormat.addInputPath(job, new Path(args[0]));
//Set the path where the output must be stored
FileOutputFormat.setOutputPath(job, new Path(args[1]));
//Set the Mapper and the Reducer class
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(WordCountReducer.class);
//Set the type of the key and value of Mapper and reducer
/*
* If the Mapper output type and Reducer output type are not the same then
* also include setMapOutputKeyClass() and setMapOutputValueClass()
*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
//job.setNumReduceTasks(4);
//Start the job and wait for it to finish. And exit the program based on
//the success of the program
System.exit(job.waitForCompletion(true)?0:1);
return 0;
}
public static void main(String[] args) throws Exception
{
// Let ToolRunner handle generic command-line options
int res = ToolRunner.run(new Configuration(), new WordCount(), args);
System.exit(res);
}
}
And I have tried the following commands to run the job :
hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -Dmapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take13
and
hadoop jar /home/cloudera/Misc/wordCount.jar WordCount -D mapreduce.job.reduces=2 hdfs:/Input/inputdata hdfs:/Output/wordcount_tool_D=2_take14
Answering your query on ordering: it would always be 2 > 3 > 1.
The option specified in your driver class takes precedence over the ones you specify as an argument to your GenericOptionsParser or the ones you specify in your site-specific config.
I would recommend debugging the configuration inside your driver class by printing it out before you submit the job. This way, you can be sure what the configuration is right before you submit the job to the cluster.
Configuration conf = getConf(); // available to you since you extended Configured
for (Map.Entry<String, String> entry : conf) { // Configuration is Iterable<Map.Entry<String, String>>; requires java.util.Map
    System.out.println(entry.getKey() + "=" + entry.getValue());
}
Getting the following error when I try to run mapreduce on avro:
14/02/26 20:07:50 INFO mapreduce.Job: Task Id : attempt_1393424169778_0002_m_000001_0, Status : FAILED
Error: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
How can I fix this?
I have Hadoop 2.2 up and running.
I'm using Avro 1.7.6.
Below is the code:
package avroColorCount;
import java.io.IOException;
import org.apache.avro.*;
import org.apache.avro.Schema.Type;
import org.apache.avro.mapred.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class MapredColorCount extends Configured implements Tool {
public static class ColorCountMapper extends AvroMapper<User, Pair<CharSequence, Integer>> {
@Override
public void map(User user, AvroCollector<Pair<CharSequence, Integer>> collector, Reporter reporter)
throws IOException {
CharSequence color = user.getFavoriteColor();
// We need this check because the User.favorite_color field has type ["string", "null"]
if (color == null) {
color = "none";
}
collector.collect(new Pair<CharSequence, Integer>(color, 1));
}
}
public static class ColorCountReducer extends AvroReducer<CharSequence, Integer,
Pair<CharSequence, Integer>> {
@Override
public void reduce(CharSequence key, Iterable<Integer> values,
AvroCollector<Pair<CharSequence, Integer>> collector,
Reporter reporter)
throws IOException {
int sum = 0;
for (Integer value : values) {
sum += value;
}
collector.collect(new Pair<CharSequence, Integer>(key, sum));
}
}
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MapredColorCount <input path> <output path>");
return -1;
}
JobConf conf = new JobConf(getConf(), MapredColorCount.class);
conf.setJobName("colorcount");
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
AvroJob.setMapperClass(conf, ColorCountMapper.class);
AvroJob.setReducerClass(conf, ColorCountReducer.class);
// Note that AvroJob.setInputSchema and AvroJob.setOutputSchema set
// relevant config options such as input/output format, map output
// classes, and output key class.
AvroJob.setInputSchema(conf, User.getClassSchema());
AvroJob.setOutputSchema(conf, Pair.getPairSchema(Schema.create(Type.STRING),
Schema.create(Type.INT)));
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new MapredColorCount(), args);
System.exit(res);
}
}
You're using the wrong version of the Avro library.
The createDatumWriter method first appeared in the GenericData class in version 1.7.5 of the Avro library. If Hadoop does not seem to find it, it means that there is an earlier version of the Avro library (possibly 1.7.4) on your classpath.
First, try to provide the correct version of the library with HADOOP_CLASSPATH or the -libjars option.
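A sketch of what that can look like; the jar locations, job jar name, and input/output paths are assumptions, so adjust them to your setup:
# put Avro 1.7.6 ahead of the bundled version on the client classpath
export HADOOP_CLASSPATH=/path/to/avro-1.7.6.jar
export HADOOP_USER_CLASSPATH_FIRST=true
# and ship it to the tasks with -libjars (parsed via ToolRunner/GenericOptionsParser)
hadoop jar colorcount.jar avroColorCount.MapredColorCount \
  -libjars /path/to/avro-1.7.6.jar input.avro outdir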
Unfortunately, it may be more tricky than that. In my case it was some other jar file that I loaded with my project but never actually used. I spent several weeks finding it. Hope you will find yours quicker.
Here is some handy code to help you analyze your classpath during your job run (use it inside a working job, like the WordCount example):
public static void printClassPath() {
    // requires java.net.URL and java.net.URLClassLoader imports;
    // on Java 8 and earlier the system class loader is a URLClassLoader, so we can list its URLs
    ClassLoader cl = ClassLoader.getSystemClassLoader();
    URL[] urls = ((URLClassLoader) cl).getURLs();
    System.out.println("classpath BEGIN");
    for (URL url : urls) {
        System.out.println(url.getFile());
    }
    System.out.println("classpath END");
}
Hope it helps.
Viacheslav Rodionov's answer definitely points to the root cause. Thank you for posting! The following configuration setting then seemed to pick up the 1.7.6 library first and allowed my reducer code (where the createDatumWriter method was called) to complete successfully:
Configuration conf = getConf();
conf.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
Job job = Job.getInstance(conf);
I ran into exactly the same problem, and as Viacheslav suggested, it's a version conflict between the Avro version installed with the Hadoop distribution and the Avro version in your project.
The most reliable way to solve the problem seems to be to simply use the Avro version installed with your Hadoop distro, unless there is a compelling reason to use a different one.
Why is using the default Avro version that comes with the Hadoop distribution a good idea? Because in a production Hadoop environment you will most likely be dealing with numerous other jobs and services running on the same shared infrastructure, and they all share the jar dependencies that come with the Hadoop distribution installed in your production environment.
Replacing the jar version for a specific MapReduce job is a tricky but solvable task. However, it creates a risk of introducing a compatibility problem that may be very hard to detect and can backfire later somewhere else in your Hadoop ecosystem.
How do I copy a file that is required by a Hadoop program to all compute nodes? I am aware that the -file option for Hadoop streaming does that. How do I do this for Java + Hadoop?
Exactly the same way.
Assuming you use the ToolRunner / Configured / Tool pattern, the files you specify with the -files option will be in the local working directory when your mapper / reducer / combiner tasks run:
public class Driver extends Configured implements Tool {
    public static void main(String args[]) throws Exception {
        int exitCode = ToolRunner.run(new Driver(), args);
        System.exit(exitCode);
    }

    public int run(String args[]) throws Exception {
        Job job = new Job(getConf());
        // ... configure mapper, reducer, input and output paths as usual
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
public class MyMapper extends Mapper<K1, V1, K2, V2> {
public void setup(Context context) {
File myFile = new File("file.csv");
// do something with file
}
// ...
}
You can then execute with:
#> hadoop jar myJar.jar Driver -files file.csv ......
See the Javadoc for GenericOptionsParser for more info.