Why does Spark Streaming not read from Kafka topic? - amazon-ec2

Spark Streaming 1.6.0
Apache Kafka 0.10.0.1
I use Spark Streaming to read from the sample topic. The code runs with no errors or exceptions, but I don't get any data on the console via the print() method.
I checked to see if there are messages in the topic:
./bin/kafka-console-consumer.sh \
--zookeeper ip-172-xx-xx-xxx:2181 \
--topic sample \
--from-beginning
And I am getting the messages:
message no. 1
message no. 2
message no. 3
message no. 4
message no. 5
The command to run the streaming job:
./bin/spark-submit \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:MaxDirectMemorySize=512m" \
--jars /home/ubuntu/zifferlabs/target/ZifferLabs-1-jar-with-dependencies.jar \
--class "com.zifferlabs.stream.SampleStream" \
/home/ubuntu/zifferlabs/src/main/java/com/zifferlabs/stream/SampleStream.java
Here is the entire code:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import kafka.serializer.DefaultDecoder;
import kafka.serializer.StringDecoder;
import scala.Tuple2;
public class SampleStream {
  private static void processStream() {
    SparkConf conf = new SparkConf().setAppName("sampleStream")
            .setMaster("local[3]")
            .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
            .set("spark.driver.memory", "2g").set("spark.streaming.blockInterval", "1000")
            .set("spark.driver.allowMultipleContexts", "true")
            .set("spark.scheduler.mode", "FAIR");
    JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(Long.parseLong("2000")));

    String[] topics = "sample".split(",");
    Set<String> topicSet = new HashSet<String>(Arrays.asList(topics));

    Map<String, String> props = new HashMap<String, String>();
    props.put("metadata.broker.list", "ip-172-xx-xx-xxx:9092");
    props.put("kafka.consumer.id", "sample_con");
    props.put("group.id", "sample_group");
    props.put("zookeeper.connect", "ip-172-xx-xx-xxx:2181");
    props.put("zookeeper.connection.timeout.ms", "16000");

    JavaPairInputDStream<String, byte[]> kafkaStream =
            KafkaUtils.createDirectStream(jsc, String.class, byte[].class, StringDecoder.class,
                    DefaultDecoder.class, props, topicSet);

    JavaDStream<String> data = kafkaStream.map(new Function<Tuple2<String, byte[]>, String>() {
      public String call(Tuple2<String, byte[]> arg0) throws Exception {
        System.out.println("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$ value is: " + arg0._2().toString());
        return arg0._2().toString();
      }
    });

    data.print();

    System.out.println("Spark Streaming started....");
    jsc.checkpoint("/home/spark/sparkChkPoint");
    jsc.start();
    jsc.awaitTermination();
    System.out.println("Stopped Spark Streaming");
  }

  public static void main(String[] args) {
    processStream();
  }
}

I think you've got the code right, but the command line to execute it is incorrect.
You spark-submit the application as follows (formatting's mine + spark.executor.extraJavaOptions removed for simplicity):
./bin/spark-submit \
--jars /home/ubuntu/zifferlabs/target/ZifferLabs-1-jar-with-dependencies.jar \
--class "com.zifferlabs.stream.SampleStream" \
/home/ubuntu/zifferlabs/src/main/java/com/zifferlabs/stream/SampleStream.java
I think it won't work since you spark-submit your Java source code, not the executable code.
Please spark-submit your application as follows:
./bin/spark-submit \
--class "com.zifferlabs.stream.SampleStream" \
/home/ubuntu/zifferlabs/target/ZifferLabs-1-jar-with-dependencies.jar
Here --class defines the "entry point" to your Spark application, and the jar with dependencies is the only input argument for spark-submit.
Give it a shot and report back!
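One more thing worth checking once the job is submitted correctly: in the map function above, arg0._2() is a byte[], so calling toString() on it prints the array reference (something like [B@6d06d69c) rather than the message text. A minimal sketch of decoding the payload instead, assuming the Kafka values are UTF-8 encoded strings:
import java.nio.charset.StandardCharsets;

JavaDStream<String> data = kafkaStream.map(new Function<Tuple2<String, byte[]>, String>() {
  @Override
  public String call(Tuple2<String, byte[]> record) throws Exception {
    // Decode the raw Kafka value; byte[].toString() would only print the array reference.
    return new String(record._2(), StandardCharsets.UTF_8);
  }
});
data.print();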

Related

Writing data from JavaDStream<String> in Apache spark to elasticsearch

I am working on a program to process data from Apache Kafka into Elasticsearch. For that purpose I am using Apache Spark. I have gone through many links but was unable to find an example that writes data from a JavaDStream in Apache Spark to Elasticsearch.
Below is the sample Spark code which gets data from Kafka and prints it.
import org.apache.log4j.Logger;
import org.apache.log4j.Level;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;
import scala.Tuple2;
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.*;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.Durations;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import com.google.common.collect.ImmutableMap;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.List;
public class SparkStream {
  public static JavaSparkContext sc;
  public static List<Map<String, ?>> alldocs;

  public static void main(String args[]) {
    if (args.length != 2) {
      System.out.println("SparkStream <broker1-host:port,broker2-host:port> <topic1,topic2,...>");
      System.exit(1);
    }
    Logger.getLogger("org").setLevel(Level.OFF);
    Logger.getLogger("akka").setLevel(Level.OFF);

    SparkConf sparkConf = new SparkConf().setAppName("Data Streaming");
    sparkConf.setMaster("local[2]");
    sparkConf.set("es.index.auto.create", "true");
    sparkConf.set("es.nodes", "localhost");
    sparkConf.set("es.port", "9200");

    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

    Set<String> topicsSet = new HashSet<>(Arrays.asList(args[1].split(",")));
    Map<String, String> kafkaParams = new HashMap<>();
    String brokers = args[0];
    kafkaParams.put("metadata.broker.list", brokers);
    kafkaParams.put("auto.offset.reset", "largest");
    kafkaParams.put("offsets.storage", "zookeeper");

    JavaPairDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicsSet
    );

    JavaDStream<String> lines = messages.map(new Function<Tuple2<String, String>, String>() {
      @Override
      public String call(Tuple2<String, String> tuple2) {
        return tuple2._2();
      }
    });
    lines.print();

    jssc.start();
    jssc.awaitTermination();
  }
}
One method to save to Elasticsearch is to use the saveToEs method inside a foreachRDD call. Any other method you wish to use would still require the foreachRDD call on your DStream.
For example:
lines.foreachRDD(lambda rdd: rdd.saveToEs("ESresource"))
See here for more
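For the Java API used in the question, a minimal sketch of the same idea, operating on the lines DStream from the code above, could look like the following. The index/type name "sparkstream/docs" and the one-field document layout are illustrative assumptions; the connection settings come from the es.nodes/es.port values already set on the SparkConf.
import java.util.Collections;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
  @Override
  public void call(JavaRDD<String> rdd) {
    // Wrap each line in a one-field map so Elasticsearch receives a valid document.
    JavaRDD<Map<String, String>> docs = rdd.map(new Function<String, Map<String, String>>() {
      @Override
      public Map<String, String> call(String line) {
        return Collections.singletonMap("message", line);
      }
    });
    // Resource is "index/type"; es.nodes and es.port are taken from the SparkConf above.
    JavaEsSpark.saveToEs(docs, "sparkstream/docs");
  }
});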
dstream.foreachRDD { rdd =>
  val es = sqlContext.createDataFrame(rdd).toDF("use headings suitable for your dataset")
  import org.elasticsearch.spark.sql._
  es.saveToEs("wordcount/testing")
  es.show()
}
In this code block, "dstream" is the DStream that receives data from a source such as Kafka. Inside the brackets of toDF() you have to pass column headings suitable for your dataset, and in saveToEs() you have to pass the Elasticsearch index. Before this you have to create the SQLContext:
val sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
If you are using Kafka to send the data, you have to add the dependency mentioned below:
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
Get the dependency
To see a full example, follow the link above. In that example, you first have to create the Kafka producer "test", then start Elasticsearch, and then run the program. The full sbt build and code are available at the same URL.

Running a MapReduce job: no output at all. It doesn't even run. Very weird. No error thrown on the terminal

I compiled the MapReduce code (driver, mapper and reducer classes) and created the JAR files. When I run it on the dataset, it doesn't seem to run. It just comes back to the prompt, as shown in the image. Any suggestions, folks?
thanks much
basam
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
//This driver program will bring all the information needed to submit this Map reduce job.
public class MultiLangDictionary {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MultiLangDictionary <input path> <output path>");
      System.exit(-1);
    }

    Configuration conf = new Configuration();
    Job ajob = new Job(conf, "MultiLangDictionary");
    //Assigning the driver class name
    ajob.setJarByClass(MultiLangDictionary.class);

    FileInputFormat.addInputPath(ajob, new Path(args[0]));
    //first argument is the job itself
    //second argument is the location of the output dataset
    FileOutputFormat.setOutputPath(ajob, new Path(args[1]));

    ajob.setInputFormatClass(TextInputFormat.class);
    ajob.setOutputFormatClass(TextOutputFormat.class);

    //Defining the mapper class name
    ajob.setMapperClass(MultiLangDictionaryMapper.class);
    //Defining the Reducer class name
    ajob.setReducerClass(MultiLangDictionaryReducer.class);

    //setting the second argument as a path in a path variable
    Path outputPath = new Path(args[1]);
    //deleting the output path automatically from hdfs so that we don't have to delete it explicitly
    outputPath.getFileSystem(conf).delete(outputPath);
  }
}
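Note that the driver as posted configures the Job but never submits it, which by itself would make main() return straight to the prompt. A minimal sketch of the missing submission step, using the standard Job API at the end of main():
// Sketch: submit the job, block until it finishes, then exit with its status.
boolean success = ajob.waitForCompletion(true);
System.exit(success ? 0 : 1);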
Try using the fully qualified Java class name (packagename.classname) in the command:
hadoop jar MultiLangDictionary.jar [yourpackagename].MultiLangDictionary input output
You could try adding the Map and Reduce output key and value types to your driver. Something like this (as an example):
job2.setMapOutputKeyClass(Text.class);
job2.setMapOutputValueClass(Text.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
In the above, both the Mapper and the Reducer would be writing (Text, Text) in their context.write() calls.
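As an illustration of what that implies on the mapper side, here is a hypothetical mapper consistent with those settings. This is a sketch only, not the poster's MultiLangDictionaryMapper; the tab-separated "word<TAB>translation" input layout is an assumption.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExampleDictionaryMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assume each line is "word<TAB>translation"; emit (word, translation) as (Text, Text),
    // matching setMapOutputKeyClass(Text.class) / setMapOutputValueClass(Text.class).
    String[] parts = value.toString().split("\t", 2);
    if (parts.length == 2) {
      context.write(new Text(parts[0]), new Text(parts[1]));
    }
  }
}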

Why does job submission from Java fail?

I submit a Spark job from Java as a RESTful service. I keep getting the following error:
Application application_1446816503326_0098 failed 2 times due to AM
Container for appattempt_1446816503326_0098_000002 exited with
exitCode: -1000 For more detailed output, check application tracking
page:http://ip-172-31-34-
108.us-west-2.compute.internal:8088/proxy/application_1446816503326_0098/Then,
click on links to logs of each attempt. Diagnostics:
java.io.FileNotFoundException: File file:/opt/apache-tomcat-
8.0.28/webapps/RESTfulExample/WEB-INF/lib/spark-yarn_2.10-1.3.0.jar does not exist Failing this attempt. Failing the application.
The spark-yarn_2.10-1.3.0.jar file is there in the lib folder.
Here is my program.
package SparkSubmitJava;
import org.apache.spark.deploy.yarn.Client;
import org.apache.spark.deploy.yarn.ClientArguments;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import java.io.IOException;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;
#Path("/spark")
public class JavaRestService {
#GET
#Path("/{param}/{param2}/{param3}")
public Response getMsg(#PathParam("param") String bedroom,#PathParam("param2") String bathroom,#PathParam("param3")String area) throws IOException {
String[] args = new String[] {
"--name",
"JavaRestService",
"--driver-memory",
"1000M",
"--jar",
"/opt/apache-tomcat-8.0.28/webapps/scalatest-0.0.1-SNAPSHOT.jar",
"--class",
"ScalaTest.ScalaTest.ScalaTest",
"--arg",
bedroom,
"--arg",
bathroom,
"--arg",
area,
"--arg",
"yarn-cluster",
};
Configuration config = new Configuration();
System.setProperty("SPARK_YARN_MODE", "true");
SparkConf sparkConf = new SparkConf();
ClientArguments cArgs = new ClientArguments(args, sparkConf);
Client client = new Client(cArgs, config, sparkConf);
client.run();
return Response.status(200).entity(client).build();
}
}
Any help will be appreciated.

How do I convert my Java Hadoop code to run on EC2?

I wrote a Driver, Mapper, and Reducer class in Java that runs the k-nearest neighbor algorithm on test data, and pulls in the training set using Distributed Cache. I used a Cloudera virtual machine to test the code, and it works in pseudo-distributed mode.
I'm trying to get through Amazon's EC2/EMR documentation ... it seems like there should be a way to easily convert working Java Hadoop code into something that will work in EC2, but I'm seeing a whole bunch of custom amazon import statements and methods that I've never seen before.
Here's my driver code for an example:
import java.net.URI;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class KNNDriverEC2 extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("rows", 1000);
    conf.setInt("columns", 613);

    DistributedCache.createSymlink(conf);
    // might have to start next line with ./!!!
    DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_sample.csv#train_sample.csv"), conf);
    DistributedCache.addCacheFile(new URI("knn-jg/cache_data/train_labels.csv#train_labels.csv"), conf);
    //DistributedCache.addCacheFile(new URI("cacheData/train_sample.csv"),conf);
    //DistributedCache.addCacheFile(new URI("cacheData/train_labels.csv"),conf);

    Job job = new Job(conf);
    job.setJarByClass(KNNDriverEC2.class);
    job.setJobName("KNN");

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(KNNMapperEC2.class);
    job.setReducerClass(KNNReducerEC2.class);
    // job.setInputFormatClass(KeyValueTextInputFormat.class);

    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new KNNDriverEC2(), args);
    System.exit(exitCode);
  }
}
I've gotten my instance running, but an exception is thrown at the line "FileInputFormat.setInputPaths(job, new Path(args[0]));". I'm going to try to work through the documentation on handling arguments, but I've run into so many errors so far I'm wondering if I'm far from making this work. Any help appreciated.

How to specify the partitioner for hadoop streaming

I have a custom partitioner like below:
import java.util.*;
import org.apache.hadoop.mapreduce.*;
public static class SignaturePartitioner extends Partitioner<Text, Text>
{
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks)
  {
    return (key.toString().Split(' ')[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
I set the Hadoop streaming parameters like below:
-file SignaturePartitioner.java \
-partitioner SignaturePartitioner \
Then I get an error: Class Not Found.
Do you know what the problem is?
Best Regards,
I faced the same issue, but managed to solve it after a lot of research.
The root cause is that hadoop-streaming-2.6.0.jar uses the mapred API, not the mapreduce API. Also, you need to implement the Partitioner interface rather than extend the Partitioner class. The following worked for me:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Partitioner;
import org.apache.hadoop.mapred.JobConf;

public class Mypartitioner implements Partitioner<Text, Text> {

  public void configure(JobConf job) {}

  public int getPartition(Text pkey, Text pvalue, int pnumparts)
  {
    if (pkey.toString().startsWith("a"))
      return 0;
    else
      return 1;
  }
}
Compile Mypartitioner, create a jar, and then run:
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
  -libjars /home/sanjiv/hadoop-2.6.0/Mypartitioner.jar \
  -D mapreduce.job.reduces=2 \
  -files /home/sanjiv/mymapper.sh,/home/sanjiv/myreducer.sh \
  -input indir -output outdir -mapper mymapper.sh \
  -reducer myreducer.sh -partitioner Mypartitioner
-file SignaturePartitioner.java -partitioner SignaturePartitioner
The -file option makes the file available on all the required nodes via the Hadoop framework. The -partitioner option needs to point to the class name, not the Java file name.
