Storm 1.2.2 and Kafka Version 2.x - apache-storm

I'm testing a case using Storm 1.2.2 and Kafka 2.x as my spout. So I created a LocalCluster just for test purposes.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka_spout", new KafkaSpout<>(KafkaSpoutConfig.builder("MYKAFKAIP:9092", "storm-test-dpi").build()), 1);
builder.setBolt("bolt", new LoggerBolt()).shuffleGrouping("kafka_spout");
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("kafkaBoltTest", new Config(), builder.createTopology());
After starting this app, I got the following:
9293 [Thread-20-kafka_spout-executor[3 3]] INFO o.a.k.c.u.AppInfoParser - Kafka version :
9293 [Thread-20-kafka_spout-executor[3 3]] INFO o.a.k.c.u.AppInfoParser - Kafka commitId : 3402a74efb23d1d4
And after that, a lot of errors like these:
9664 [Thread-20-kafka_spout-executor[3 3]] INFO o.a.s.k.s.KafkaSpout - Initialization complete
9703 [Thread-20-kafka_spout-executor[3 3]] WARN o.a.k.c.c.i.Fetcher - Unknown error fetching data for topic-partition storm-test-dpi-0
9714 [Thread-20-kafka_spout-executor[3 3]] WARN o.a.k.c.c.i.Fetcher - Unknown error fetching data for topic-partition storm-test-dpi-0
9742 [Thread-20-kafka_spout-executor[3 3]] WARN o.a.k.c.c.i.Fetcher - Unknown error fetching data for topic-partition storm-test-dpi-0
9756 [Thread-20-kafka_spout-executor[3 3]] WARN o.a.k.c.c.i.Fetcher - Unknown error fetching data for topic-partition storm-test-dpi-0
9767 [Thread-20-kafka_spout-executor[3 3]] WARN o.a.k.c.c.i.Fetcher - Unknown error fetching data for topic-partition storm-test-dpi-0
9781 [Thread-20-kafka_spout-executor[3 3]] WARN o.a.k.c.c.i.Fetcher - Unknown error fetching data for topic-partition storm-test-dpi-0
9806 [Thread-20-kafka_spout-executor[3 3]] WARN o.a.k.c.c.i.Fetcher - Unknown error fetching data for topic-partition storm-test-dpi-0
I think this problem is caused by the Kafka version: as you can see, the log shows version "" but my Kafka version is "2.x".
This is my pom.xml:
Where ${version.storm} is 1.2.2

You are supposed to also declare the version of kafka-clients you are using. The storm-kafka-client POM sets the kafka-clients scope to provided. This means kafka-clients won't be included when you build. We do this so you can easily upgrade.
The reason it's even running for you is that you are using LocalCluster in some test code, where provided dependencies are present.
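A quick way to see this for yourself is to check at runtime whether the Kafka client classes are on the classpath and which jar they come from. The sketch below is a hypothetical helper (ClasspathProbe is not part of Storm or Kafka) that uses only JDK reflection; the manifest version it prints may be absent, depending on how the jar was built.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Storm or Kafka.
final class ClasspathProbe {

    // Reports where a class is loaded from, or that it is missing.
    static String describe(String className) {
        try {
            // initialize=false: only locate the class, do not run static initializers.
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            String location = (source == null || source.getLocation() == null)
                    ? "<unknown location>"
                    : source.getLocation().toString();
            Package pkg = cls.getPackage();
            String version = (pkg == null) ? null : pkg.getImplementationVersion();
            return className + " loaded from " + location
                    + ", manifest version " + (version == null ? "<not declared>" : version);
        } catch (ClassNotFoundException | LinkageError e) {
            return className + " is NOT on the classpath";
        }
    }

    public static void main(String[] args) {
        // The consumer class shipped in the kafka-clients jar.
        System.out.println(describe("org.apache.kafka.clients.consumer.KafkaConsumer"));
    }
}
```

Running this once from the test code and once from the packaged topology jar should show whether kafka-clients is present in both places, and which copy is picked up.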
Add a dependency on org.apache.kafka:kafka-clients to your POM, with the version you are actually using, and it should work.


Flume Twitter Streaming Issue

I'm using Flume 1.6.0-cdh5.9.1 to stream Tweets using Twitter source.
The configuration file is below:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = hadoop, cloudera
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/cloudera/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
For the Cloudera .jar dependency, I've built flume-sources-1.0-SNAPSHOT.jar with Maven using the dependencies below:
<!-- For the Twitter API -->
<!-- Hadoop Dependencies -->
Now, when I run the Flume Agent, it starts successfully, connects to Twitter, but halts after the last line (Receiving status stream):
2017-02-08 21:55:12,556 (Twitter Stream consumer-1[initializing]) [INFO -] Establishing connection.
2017-02-08 21:55:46,474 (Twitter Stream consumer-1[Establishing connection]) [INFO -] Connection established.
2017-02-08 21:55:46,474 (Twitter Stream consumer-1[Establishing connection]) [INFO -] Receiving status stream.
After the last line nothing happens. It doesn't terminate, doesn't stream anything. I had a look at the HDFS location and nothing is created there.
Can someone help me here?
The problem lies in the TwitterAgent.sources.Twitter.keywords setting.
The Twitter source works fine and keeps pulling tweets as long as it finds data matching the keywords in the Firehose. With rarely tweeted keywords, nothing arrives and the agent looks as if it has stalled. I tried some other popular, recent keyword and it worked perfectly fine.

elasticsearch query using pyspark and elasticsearch-hadoop connector throws exception in RecordReader.close

Reading from elasticsearch into rdd throws exception: ActionRequestValidationException[Validation Failed: 1: no scroll ids specified;]
mr.EsInputFormat: Cannot determine task id...
Software versions: pyspark 1.6, elasticsearch-hadoop-2.2.1 as the connector to Elasticsearch, Elasticsearch 1.0.1, Hadoop 2.7.2 and Python 2.7
elasticsearch-hadoop-2.2.1 library taken from here:
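es_rdd = sc.newAPIHadoopRDD(
inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
keyClass="org.apache.hadoop.io.NullWritable",
valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
conf={ "es.resource" : "INDEX/TYPE", "es.nodes" : "NODE_NAME"})
print (es_rdd.first())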
Please help me resolve the exception.
The following warning is printed before the exception and may be related to it:
mr.EsInputFormat: Cannot determine task id...
INFO Configuration.deprecation: is deprecated. Instead, use
Full Exception:
16/04/26 21:00:02 INFO rdd.NewHadoopRDD: Input split: ShardInputSplit [node=[KHHV8pgMQySzw9Fz1Xt7VQ/Iguana|],shard=0]
16/04/26 21:00:02 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use
16/04/26 21:00:02 INFO Configuration.deprecation: is deprecated. Instead, use
16/04/26 21:00:02 INFO Configuration.deprecation: is deprecated. Instead, use
16/04/26 19:31:12 WARN mr.EsInputFormat: Cannot determine task id...
16/04/26 19:31:14 WARN rdd.NewHadoopRDD: Exception in RecordReader.close() ActionRequestValidationException[Validation Failed: 1: no scroll ids specified;]
at org.apache.spark.rdd.NewHadoopRDD$$anon$$apache$spark$rdd$NewHadoopRDD$$anon$$close(NewHadoopRDD.scala:191)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:166)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.api.python.SerDeUtil$
at org.apache.spark.api.python.SerDeUtil$
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
at org.apache.spark.api.python.PythonRunner$
Thank you!

Apache Storm - Kinesis Spout throwing AmazonClientException backing off

2016-02-02 16:15:18 c.a.s.k.s.u.InfiniteConstantBackoffRetry [DEBUG] Caught exception of type com.amazonaws.AmazonClientException, backing off for 1000 ms.
I tested GET and PUT requests against the stream, and both worked flawlessly. I have all 3 variants: Batch, Storm and Spark.
Spark: used KinesisStreams - working.
Batch: can Get and Put - working.
Storm: planning to use the KinesisSpout library from Kinesis - failing, with no clue as to why.
final KinesisSpoutConfig config = new KinesisSpoutConfig(streamname, zookeeperurl);
config.withZookeeperPrefix("kinesis-zooprefix-" + name);
System.setProperty("aws.accessKeyId", key);
System.setProperty("aws.secretKey", keysecret);
SystemPropertiesCredentialsProvider scp = new SystemPropertiesCredentialsProvider();
final KinesisSpout spout = new KinesisSpoutConflux(config, scp, new ClientConfiguration());
What am I doing wrong?
Storm Logs:
2016-02-02 16:15:17 c.a.s.k.s.KinesisSpout [INFO] KinesisSpoutConflux[taskIndex=0] open() called with topoConfig task index 0 for processing stream Kinesis-Conflux
2016-02-02 16:15:17 c.a.s.k.s.KinesisSpout [DEBUG] KinesisSpoutConflux[taskIndex=0] activating. Starting to process stream Kinesis-Test
2016-02-02 16:15:17 c.a.s.k.s.KinesisHelper [INFO] Using us-east-1 region
I don't see "nextTuple" getting called.
My Versions:
storm = 0.9.3
kinesis-storm-spout = 1.1.1

Storm HDFS Bolt not working

So I've just started working with Storm and am trying to understand it. I am trying to connect to a Kafka topic, read the data and write it to HDFS using the HDFS bolt.
At first I created it without the shuffleGrouping("stormspout"), and my Storm UI showed that the spout was consuming data from the topic but nothing was being written by the bolt (except for the empty files it was creating on HDFS). I then added shuffleGrouping("stormspout"); and now the bolt appears to be giving an error. If anyone can help with this, I would really appreciate it.
2015-04-13 00:02:58 s.k.PartitionManager [INFO] Read partition information from: /storm/partition_0 --> null
2015-04-13 00:02:58 s.k.PartitionManager [INFO] No partition information found, using configuration to determine offset
2015-04-13 00:02:58 s.k.PartitionManager [INFO] Last commit offset from zookeeper: 0
2015-04-13 00:02:58 s.k.PartitionManager [INFO] Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
2015-04-13 00:02:58 s.k.PartitionManager [INFO] Starting Kafka from offset 0
2015-04-13 00:02:58 s.k.ZkCoordinator [INFO] Task [1/1] Finished refreshing
2015-04-13 00:02:58 b.s.d.task [INFO] Emitting: stormspout default [colmanblah]
2015-04-13 00:02:58 b.s.d.executor [INFO] TRANSFERING tuple TASK: 2 TUPLE: source: stormspout:3, stream: default, id: {462820364856350458=5573117062061876630}, [colmanblah]
2015-04-13 00:02:58 b.s.d.task [INFO] Emitting: stormspout __ack_init [462820364856350458 5573117062061876630 3]
2015-04-13 00:02:58 b.s.d.executor [INFO] TRANSFERING tuple TASK: 1 TUPLE: source: stormspout:3, stream: __ack_init, id: {}, [462820364856350458 5573117062061876630 3]
2015-04-13 00:02:58 b.s.d.executor [INFO] Processing received message FOR 1 TUPLE: source: stormspout:3, stream: __ack_init, id: {}, [462820364856350458 5573117062061876630 3]
2015-04-13 00:02:58 b.s.d.executor [INFO] BOLT ack TASK: 1 TIME: TUPLE: source: stormspout:3, stream: __ack_init, id: {}, [462820364856350458 5573117062061876630 3]
2015-04-13 00:02:58 b.s.d.executor [INFO] Execute done TUPLE source: stormspout:3, stream: __ack_init, id: {}, [462820364856350458 5573117062061876630 3] TASK: 1 DELTA:
2015-04-13 00:02:59 b.s.d.executor [INFO] Prepared bolt stormbolt:(2)
2015-04-13 00:02:59 b.s.d.executor [INFO] Processing received message FOR 2 TUPLE: source: stormspout:3, stream: default, id: {462820364856350458=5573117062061876630}, [colmanblah]
2015-04-13 00:02:59 b.s.util [ERROR] Async loop died!
java.lang.RuntimeException: java.lang.NullPointerException
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor( ~[storm-core-]
at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable( ~[storm-core-]
at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-]
at backtype.storm.daemon.executor$fn__5697$fn__5710$fn__5761.invoke(executor.clj:794) ~[storm-core-]
at backtype.storm.util$async_loop$fn__452.invoke(util.clj:465) ~[storm-core-]
at [clojure-1.5.1.jar:na]
at [na:1.7.0_71]
Caused by: java.lang.NullPointerException: null
at org.apache.storm.hdfs.bolt.HdfsBolt.execute( ~[storm-hdfs-]
at backtype.storm.daemon.executor$fn__5697$tuple_action_fn__5699.invoke(executor.clj:659) ~[storm-core-]
at backtype.storm.daemon.executor$mk_task_receiver$fn__5620.invoke(executor.clj:415) ~[storm-core-]
at backtype.storm.disruptor$clojure_handler$reify__1741.onEvent(disruptor.clj:58) ~[storm-core-]
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor( ~[storm-core-]
... 6 common frames omitted
2015-04-08 04:26:39 b.s.d.executor [ERROR]
java.lang.RuntimeException: java.lang.NullPointerException
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor( ~[storm-core-]
at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable( ~[storm-core-]
at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-]
at backtype.storm.daemon.executor$fn__5697$fn__5710$fn__5761.invoke(executor.clj:794) ~[storm-core-]
at backtype.storm.util$async_loop$fn__452.invoke(util.clj:465) ~[storm-core-]
at [clojure-1.5.1.jar:na]
at [na:1.7.0_71]
Caused by: java.lang.NullPointerException: null
at org.apache.storm.hdfs.bolt.HdfsBolt.execute( ~[storm-hdfs-]
at backtype.storm.daemon.executor$fn__5697$tuple_action_fn__5699.invoke(executor.clj:659) ~[storm-core-]
at backtype.storm.daemon.executor$mk_task_receiver$fn__5620.invoke(executor.clj:415) ~[storm-core-]
at backtype.storm.disruptor$clojure_handler$reify__1741.onEvent(disruptor.clj:58) ~[storm-core-]
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor( ~[storm-core-]
TopologyBuilder builder = new TopologyBuilder();
Config config = new Config();
//LocalCluster cluster = new LocalCluster();
BrokerHosts brokerHosts = new ZkHosts("", "/brokers");
SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, "myTopic", "/kafkastorm", "KafkaSpout");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.forceFromStart = true;
builder.setSpout("stormspout", new KafkaSpout(spoutConfig),4);
SyncPolicy syncPolicy = new CountSyncPolicy(10); //Synchronize data buffer with the filesystem every 10 tuples
FileRotationPolicy rotationPolicy = new FileSizeRotationPolicy(5.0f, Units.MB); // Rotate data files when they reach five MB
FileNameFormat fileNameFormat = new DefaultFileNameFormat().withPath("/stormstuff"); // Use default, Storm-generated file names
builder.setBolt("stormbolt", new HdfsBolt()
//cluster.submitTopology("ColmansStormTopology", config, builder.createTopology());
try {
StormSubmitter.submitTopologyWithProgressBar("ColmansStormTopology", config, builder.createTopology());
} catch (AlreadyAliveException e) {
e.printStackTrace();
} catch (InvalidTopologyException e) {
e.printStackTrace();
}
POM.XML dependencies
First of all, try to emit the values from the execute method itself. If you are emitting from different worker threads, let all of those threads feed their data into a LinkedBlockingQueue, and allow only a single thread to emit the values taken from that queue.
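The queue hand-off described above can be sketched with plain JDK classes. This is a minimal illustration under assumptions, not Storm code: SingleEmitterSketch, the STOP sentinel and the emitted list are hypothetical names, and emitted.add(...) stands in for the real emit call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: many producers, exactly one emitting thread.
final class SingleEmitterSketch {
    private static final String STOP = "__STOP__"; // sentinel that ends the emitter loop

    static List<String> run(int workers, int perWorker) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> emitted = new ArrayList<>(); // written only by the emitter thread

        Thread emitter = new Thread(() -> {
            try {
                for (String v = queue.take(); !STOP.equals(v); v = queue.take()) {
                    emitted.add(v); // stand-in for the real emit call
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        emitter.start();

        List<Thread> threads = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int id = w;
            Thread t = new Thread(() -> {
                for (int i = 0; i < perWorker; i++) {
                    queue.add("worker-" + id + "-value-" + i); // workers only enqueue
                }
            });
            threads.add(t);
            t.start();
        }

        try {
            for (Thread t : threads) {
                t.join();
            }
            queue.add(STOP);  // all producers are done, tell the emitter to finish
            emitter.join();   // join makes the emitter's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println("emitted " + run(4, 25).size() + " values");
    }
}
```

The point of the design is that the emitting side is never touched by more than one thread; the worker threads only ever call queue.add.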
Secondly, try setting Config.setMaxSpoutPending to some value and run the code again. If the problem persists, try reducing that value.
Reference - Config.TOPOLOGY_MAX_SPOUT_PENDING: This sets the maximum number of spout tuples that can be pending on a single spout task at once (pending means the tuple has not been acked or failed yet). It is highly recommended you set this config to prevent queue explosion.
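To make the meaning of "pending" concrete, here is a small conceptual sketch of such a cap. It is not how Storm implements the setting; MaxPendingSketch and its method names are hypothetical and only model the rule that a new tuple may be emitted while fewer than the maximum are still waiting for an ack or fail.

```java
// Hypothetical model of a max-pending cap, not Storm's implementation.
final class MaxPendingSketch {
    private final int maxPending;
    private int pending; // tuples emitted but not yet acked or failed

    MaxPendingSketch(int maxPending) {
        this.maxPending = maxPending;
    }

    // Returns true if a new tuple may be emitted right now.
    synchronized boolean tryEmit() {
        if (pending >= maxPending) {
            return false; // cap reached, the spout has to wait
        }
        pending++;
        return true;
    }

    // Called when a tuple is acked or failed, which frees one slot.
    synchronized void ackOrFail() {
        if (pending > 0) {
            pending--;
        }
    }

    synchronized int pending() {
        return pending;
    }

    public static void main(String[] args) {
        MaxPendingSketch cap = new MaxPendingSketch(2);
        System.out.println(cap.tryEmit()); // first tuple
        System.out.println(cap.tryEmit()); // second tuple
        System.out.println(cap.tryEmit()); // refused while two are pending
        cap.ackOrFail();
        System.out.println(cap.tryEmit()); // allowed again after an ack
    }
}
```

A lower cap slows the spout down when the bolts cannot keep up, which is why reducing the value is a reasonable experiment.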
I eventually figured this out by going through the storm source code.
I wasn't setting
RecordFormat format = new DelimitedRecordFormat().withFieldDelimiter("|");
and passing it to the bolt with .withRecordFormat(format), in the builder chain that starts like
builder.setBolt("stormbolt", new HdfsBolt()
The HdfsBolt class (HdfsBolt.java) tries to use this and basically falls over if it's not set. That was where the NPE was coming from.
Hope this helps someone else out, ensure you have set all the bits that are required in this class. A more useful error message such as "RecordFormat not set" would be nice....
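The kind of check wished for above could look roughly like this. It is a sketch under assumptions, not the storm-hdfs code: FailFastBoltSketch is a hypothetical stand-in class, and a java.util.function.Function plays the role of the record format.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical stand-in for a bolt with a required setting; not the storm-hdfs HdfsBolt.
final class FailFastBoltSketch {
    private Function<String, byte[]> recordFormat; // plays the role of the record format

    FailFastBoltSketch withRecordFormat(Function<String, byte[]> format) {
        this.recordFormat = format;
        return this;
    }

    // Validate required settings once, before any tuple is processed.
    void prepare() {
        if (recordFormat == null) {
            throw new IllegalStateException("RecordFormat not set");
        }
    }

    byte[] execute(String tuple) {
        return recordFormat.apply(tuple);
    }

    public static void main(String[] args) {
        try {
            new FailFastBoltSketch().prepare();
        } catch (IllegalStateException e) {
            System.out.println("Misconfigured bolt: " + e.getMessage());
        }
        FailFastBoltSketch bolt = new FailFastBoltSketch()
                .withRecordFormat(s -> (s + "|").getBytes(StandardCharsets.UTF_8));
        bolt.prepare();
        System.out.println(bolt.execute("colmanblah").length + " bytes formatted");
    }
}
```

Checking in prepare rather than execute means the misconfiguration surfaces at startup with a named cause, instead of as a bare NullPointerException on the first tuple.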

Spark HbaseRDD giving Exception

I am trying to read from HBase using the following code
JavaPairRDD<ImmutableBytesWritable, Result> pairRdd = ctx
.newAPIHadoopRDD(conf, TableInputFormat.class,
ImmutableBytesWritable.class, Result.class);
But I am getting this exception:
java.lang.IllegalStateException: unread block data
The full code is below:
SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL");
/* String [] stjars={"/home/BreakDown/SparkDemo2/target/SparkDemo2-0.0.1-SNAPSHOT.jar"}; */
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
Configuration conf= HBaseConfiguration.create();
Any pointers would be a great help.
Spark version 1.1.1 hadoop 2
hadoop 2.2.0
Hbase 0.98.8-hadoop2
Please find the stack trace below:
14/12/17 21:18:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/17 21:18:46 INFO AppClient$ClientActor: Connecting to master spark://
14/12/17 21:18:46 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
14/12/17 21:18:46 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20141217211846-0035
14/12/17 21:18:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0,, ANY, 1256 bytes)
14/12/17 21:18:47 INFO BlockManagerMasterActor: Registering block manager with 265.4 MB RAM, BlockManagerId(0,, 41717, 0)
14/12/17 21:18:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, java.lang.IllegalStateException: unread block data$BlockDataInputStream.setBlockDataMode(
