Assume we have the following topology:
spout A -> bolt B -> bolt C -> bolt E
Bolt E is the final one; it persists info to the database and therefore does not need to emit any tuple. How do I implement such a solution?
If I define no output_fields, I get this exception:
Exception in thread "main" java.io.IOException: org.apache.storm.thrift.protocol.TProtocolException: Required field 'output_fields' is unset! Struct:StreamInfo(output_fields:null, direct:false)
at storm.petrel.ThriftReader.read(ThriftReader.java:77)
at storm.petrel.GenericTopology.readTopology(GenericTopology.java:36)
at storm.petrel.GenericTopology.main(GenericTopology.java:53)
Caused by: org.apache.storm.thrift.protocol.TProtocolException: Required field 'output_fields' is unset! Struct:StreamInfo(output_fields:null, direct:false)
at org.apache.storm.generated.StreamInfo.validate(StreamInfo.java:407)
at org.apache.storm.generated.StreamInfo$StreamInfoStandardScheme.read(StreamInfo.java:485)
at org.apache.storm.generated.StreamInfo$StreamInfoStandardScheme.read(StreamInfo.java:441)
at org.apache.storm.generated.StreamInfo.read(StreamInfo.java:377)
at org.apache.storm.generated.ComponentCommon$ComponentCommonStandardScheme.read(ComponentCommon.java:681)
at org.apache.storm.generated.ComponentCommon$ComponentCommonStandardScheme.read(ComponentCommon.java:636)
at org.apache.storm.generated.ComponentCommon.read(ComponentCommon.java:552)
at org.apache.storm.generated.Bolt$BoltStandardScheme.read(Bolt.java:451)
at org.apache.storm.generated.Bolt$BoltStandardScheme.read(Bolt.java:427)
at org.apache.storm.generated.Bolt.read(Bolt.java:358)
at org.apache.storm.generated.StormTopology$StormTopologyStandardScheme.read(StormTopology.java:727)
at org.apache.storm.generated.StormTopology$StormTopologyStandardScheme.read(StormTopology.java:683)
at org.apache.storm.generated.StormTopology.read(StormTopology.java:595)
at storm.petrel.ThriftReader.read(ThriftReader.java:75)
... 2 more
Please re-check bolt E and make sure it is not used as a tuple source by any other bolt (that is, bolt E is not referenced by any grouping in a TopologyBuilder.setBolt call), e.g.:
TopologyBuilder.setBolt("mybolt", new MyBolt()).fieldsGrouping("bolt E",
        new Fields(new String[] { "user_id" }));
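For reference, in Storm's Java API a terminal bolt simply leaves declareOutputFields empty and never calls emit. A minimal sketch (class and method names are illustrative, assuming Storm 2.x package names and signatures):

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Hypothetical terminal bolt: persists every tuple and emits nothing.
public class PersistBoltE extends BaseRichBolt {

    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // open the database connection here (omitted)
    }

    @Override
    public void execute(Tuple tuple) {
        saveToDatabase(tuple);  // hypothetical persistence call
        collector.ack(tuple);   // still ack so upstream components see the tuple as fully processed
    }

    private void saveToDatabase(Tuple tuple) {
        // write the tuple's fields to the database (omitted)
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: nothing to declare, nothing to emit
    }
}

In Petrel, the exception above suggests the Thrift StreamInfo requires output_fields to be set, so declaring an empty list of output fields for bolt E (rather than omitting the declaration entirely) may be enough.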
I have a spout which reads from a source at 40K QPS.
I have two bolts. The first one reads from the source and opens a database connection to build a cache that refreshes every hour. The database allows only 2 open connections for my user, so the executor count I have for this bolt is 2.
The other bolt is assigned 200 executors and 200 tasks to process the requests.
I can't increase the number of connections to the DB, and I see that all the requests go to a single worker. The other workers keep waiting and print "0 send message".
kafkaSpoutConfigList:
  - executorsCount: 30
    taskCount: 30
    spoutName: 'kafka_consumer_spout'
    topicName: 'request'
processingBoltConfigList:
  - executorsCount: 2
    taskCount: 2
    boltName: 'db_bolt'
    boltClassName: 'com.Bolt1Class'
    boltSourceList:
      - 'kafka_consumer_spout'
  - executorsCount: 200
    taskCount: 200
    boltName: 'bolt2'
    boltClassName: 'com.Bolt2Class'
    boltSourceList:
      - 'db_bolt::streamx'
kafkaBoltConfigList:
  - executorsCount: 15
    taskCount: 15
    boltName: 'kafka_producer_bolt'
    topicName: 'consumer_topic'
    boltSourceList:
      - 'bolt2::Stream1'
  - executorsCount: 15
    taskCount: 15
    boltName: 'kafka_producer_bolt'
    topicName: 'data_test'
    boltSourceList:
      - 'bolt2::Stream2'
I am using localOrShuffleGrouping.
When you use LocalOrShuffleGrouping, the following happens:
If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping
So let's say your workers look like this:
worker1: {"bolt1 task 1", "bolt2 task 0-50"}
worker2: { "bolt1 task 2", "bolt2 task 50-100"}
worker3: { "bolt2 task 100-150"}
worker4: { "bolt2 task 150-200"}
In this case, because you're telling Storm to use a local grouping when sending from bolt1 to bolt2, all the tuples will go to workers 1 and 2. Workers 3 and 4 will be idle.
If you want tuples to reach workers 3 and 4 as well, you need to switch to shuffle grouping.
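In the Java TopologyBuilder API the change is just the grouping call when wiring bolt2 to its source; a minimal sketch (component, class, and stream names mirror the YAML above, and the kafkaSpout variable is assumed to be built elsewhere):

import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka_consumer_spout", kafkaSpout, 30);

builder.setBolt("db_bolt", new Bolt1Class(), 2)
       .shuffleGrouping("kafka_consumer_spout");

// localOrShuffleGrouping prefers bolt2 tasks living in the sender's own worker,
// which is what leaves the workers without a db_bolt task idle:
// builder.setBolt("bolt2", new Bolt2Class(), 200)
//        .localOrShuffleGrouping("db_bolt", "streamx");

// shuffleGrouping distributes tuples evenly across all bolt2 tasks in all workers:
builder.setBolt("bolt2", new Bolt2Class(), 200)
       .shuffleGrouping("db_bolt", "streamx");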
I'm trying to convert the list of available Currency instances to a Map, so that I can look up the String currency code by its numeric code. The code is shown after the stack trace below.
It throws this error, and I'm very new to Java 8, hence banging my head:
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: java.lang.IllegalStateException: Duplicate key YUM
at java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
at java.util.HashMap.merge(HashMap.java:1254)
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1556)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
Here is the code:
import java.util.Currency;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class IsoCurrencyCode {

    private static final Set<Currency> ISO_CURRENCY = Currency.getAvailableCurrencies();

    private static final Map<Integer, Currency> NUMERIC_MAP =
            ISO_CURRENCY.stream()
                        .collect(Collectors.toMap(Currency::getNumericCode, Function.identity()));

    public static void main(String[] args) {
        Currency currency = NUMERIC_MAP.get(971);
        System.out.println(currency.getCurrencyCode());
    }
}
It should load all currencies into the map, with their numeric codes as keys.
Collectors.toMap() doesn't accept duplicate keys.
Since Currency::getNumericCode produces duplicates here, toMap() throws this exception when a duplicate key is encountered:
Caused by: java.lang.IllegalStateException: Duplicate key YUM
Note that the error message here is misleading: the keys are Integers, and YUM is not an Integer; it looks like a Currency instance, and it is.
Indeed, YUM refers to one of the values (a Currency) processed by toMap() that shares a duplicate key, not to the key itself. It is a Java bug that was fixed in Java 9.
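A tiny hypothetical repro (not from the original post) shows the same behaviour on Java 8: the "Duplicate key" message names one of the colliding values rather than the key:

import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DuplicateKeyDemo {
    public static void main(String[] args) {
        // Both words map to the key 'a'; on Java 8 the exception message
        // reports a value ("apple"), not the key 'a'. The collect call
        // throws before the println ever runs.
        Map<Character, String> byFirstLetter =
                Stream.of("apple", "avocado")
                      .collect(Collectors.toMap(s -> s.charAt(0), s -> s));
        System.out.println(byFirstLetter);
    }
}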
To solve your issue, either use Collectors.groupingBy() to collect into a Map<Integer, List<Currency>> (in which case you can have multiple values per key), or merge the duplicate keys with the three-argument toMap() overload, for example keeping the last entry:
private static final Map<Integer, Currency> NUMERIC_MAP =
        ISO_CURRENCY.stream()
                    .collect(Collectors.toMap(Currency::getNumericCode, Function.identity(), (v1, v2) -> v2));
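If you prefer the groupingBy() alternative mentioned above, a minimal sketch would be (same class and imports as the original code, plus java.util.List):

private static final Map<Integer, List<Currency>> BY_NUMERIC_CODE =
        ISO_CURRENCY.stream()
                    .collect(Collectors.groupingBy(Currency::getNumericCode));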
To answer your comment: you could find the culprits (the duplicate codes) this way:
Map<Integer, List<Currency>> dupMap =
        ISO_CURRENCY.stream()
                    .collect(Collectors.groupingBy(Currency::getNumericCode))
                    .entrySet()
                    .stream()
                    .filter(e -> e.getValue().size() > 1)
                    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
System.out.println(dupMap);
I want to index documents into Elasticsearch from Storm, but I couldn't get any document to be indexed.
In my topology I have a KafkaSpout that emits JSON like this, { "tweetId": 1, "text": "hello" }, to an EsBolt, the native bolt from the elasticsearch-hadoop library that writes Storm tuples to Elasticsearch (the doc is here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/storm.html).
These are the configs for my EsBolt:
Map conf = new HashMap();
conf.put("es.nodes","127.0.0.1");
conf.put("es.port","9200");
conf.put("es.resource","twitter/tweet");
conf.put("es.index.auto.create","no");
conf.put("es.input.json", "true");
conf.put("es.mapping.id", "tweetId");
EsBolt elasticsearchBolt = new EsBolt("twitter/tweet", conf);
The first two configurations have these values by default, but I chose to set them explicitly. I have also tried without them, getting the same result.
And this is how I build my topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(TWEETS_DATA_KAFKA_SPOUT_ID, kafkaSpout, kafkaSpoutParallelism)
.setNumTasks(kafkaSpoutNumberOfTasks);
builder.setBolt(ELASTICSEARCH_BOLT_ID, elasticsearchBolt, elasticsearchBoltParallelism)
.setNumTasks(elasticsearchBoltNumberOfTasks)
.shuffleGrouping(TWEETS_DATA_KAFKA_SPOUT_ID);
return builder.createTopology();
Before I run the topology locally I create the "twitter" index in Elasticsearch and a mapping "tweet" for this index.
This is what I get if I retrieve the mapping for my newly created type (curl -XGET 'http://localhost:9200/twitter/_mapping/tweet'):
{
  "twitter": {
    "mappings": {
      "tweet": {
        "properties": {
          "text": {
            "type": "string"
          },
          "tweetId": {
            "type": "string"
          }
        }
      }
    }
  }
}
I run the topology locally and this is what I get in my console when processing a tuple:
Processing received message FOR 6 TUPLE: source: tweets-data-kafka-spout:9, stream: default, id: {-8010897758788654352=-6240339405307942979}, [{"tweetId":"1","text":"hello"}]
Emitting: elasticsearch-bolt __ack_ack [-8010897758788654352 -6240339405307942979]
TRANSFERING tuple TASK: 2 TUPLE: source: elasticsearch-bolt:6, stream: __ack_ack, id: {}, [-8010897758788654352 -6240339405307942979]
BOLT ack TASK: 6 TIME: TUPLE: source: tweets-data-kafka-spout:9, stream: default, id: {-8010897758788654352=-6240339405307942979}, [{"tweetId":"1","text":"hello"}]
Execute done TUPLE source: tweets-data-kafka-spout:9, stream: default, id: {-8010897758788654352=-6240339405307942979}, [{"tweetId":"1","text":"hello"}] TASK: 6 DELTA:
So the tuples seem to be processed. However, I don't have any documents indexed in Elasticsearch.
I suppose I am doing something wrong when I set the configurations for EsBolt, maybe missing a configuration or something.
Documents will only be indexed once you reach the flush size, specified by es.storm.bolt.flush.entries.size.
Alternatively, you may set a tick frequency that triggers a queue flush:
config.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);
By default, es-hadoop flushes on tick, as per the es.storm.bolt.tick.tuple.flush parameter.
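A minimal sketch of setting both knobs together, reusing the EsBolt configuration map from the question (the flush size value is illustrative, and Config is assumed to be Storm's org.apache.storm.Config):

// EsBolt configuration: flush the queued documents once 100 are buffered (value illustrative)
Map conf = new HashMap();
conf.put("es.storm.bolt.flush.entries.size", "100");
EsBolt elasticsearchBolt = new EsBolt("twitter/tweet", conf);

// Topology configuration: emit a tick tuple every 5 seconds so buffered documents
// are also flushed when the queue never reaches the size threshold
Config stormConfig = new Config();
stormConfig.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5);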
I also ran into the same issue. When I looked through the es-hadoop documentation, I found it was because I had missed setting the frequency that triggers a queue flush. I then added a configuration (es.storm.bolt.flush.entries.size) to my Storm topology and it was fine. But when we set a value for Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, it throws an exception, java.lang.RuntimeException: java.lang.NullPointerException, in the bolt's execute function. I used debug mode to test my topology and found that the input tuple in the bolt's execute method doesn't contain any entries, yet this empty tuple still triggers execute.
That's what confuses me. Isn't the tuple supposed to be emitted at the configured interval even though it is empty, once we set Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS? I think this is a bug.
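If the NullPointerException comes from one of your own bolts receiving tick tuples, a common pattern is to detect them in execute() and treat them as a flush signal instead of reading fields; a minimal sketch (class name hypothetical, assuming Storm's org.apache.storm packages):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;

public class TickSafeBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (TupleUtils.isTick(tuple)) {
            // Tick tuples carry no user fields, so reading fields from them fails;
            // use them only as a signal to flush any locally buffered work.
            flush();
            return;
        }
        // ... normal per-tuple processing here ...
    }

    private void flush() {
        // hypothetical flush of locally buffered documents
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // nothing to declare in this sketch
    }
}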
For more information, see: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/storm.html
I want to make a MapReduce design like this inside one job.
Example:
I want, in one job:
[Mapper A] ---> [Mapper C]
[Mapper B] ---> [Reducer B]
After that, [Reducer B] ---> [Mapper C]
[Mapper C] ---> [Reducer C]
So [Mapper A] & [Reducer B] ---> [Mapper C], and then [Mapper C] continues to [Reducer C]. I want the whole scenario above to run in one job.
It's like routing inside one MapReduce job: I want to route many mappers to a particular reducer and then continue from that reducer to another mapper and reducer again, inside one job. I need your suggestions.
Thanks.
--Edit starts
To simplify the problem, let's say you have three jobs JobA, JobB, JobC, each comprising a map and a reduce phase.
Now you want to use the mapper output of JobA in the mapper task of JobC, so JobC just needs to wait for JobA to finish its map task. You can use the MultipleOutputs class in JobA to preserve/write the map-phase output to a location that JobC can poll (see the sketch just after this edit note).
--Edit ends
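A minimal sketch of the MultipleOutputs idea (the mapper's input/output types and the "mapperAOut"/"mapperA" names are illustrative, not from the original post):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Mapper of JobA: writes its normal output and, in addition, a copy of the
// map-phase records to a named output that JobC can later read as input.
public class MapperA extends Mapper<LongWritable, Text, Text, Text> {

    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Text outKey = new Text(value);   // hypothetical: key derived from the record
        context.write(outKey, value);                           // normal JobA map output
        mos.write("mapperAOut", outKey, value, "mapperA/part");  // side copy for JobC to poll
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

In the driver you would also register the named output before submitting JobA, e.g. MultipleOutputs.addNamedOutput(mapperAjob, "mapperAOut", TextOutputFormat.class, Text.class, Text.class), and point JobC's input path at the mapperA* files once JobA completes.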
Programmatically, you can do something like the code below, where getJob() should be defined in the respective MapReduce class; that is where you specify the configuration, DistributedCache, input formats, etc.
main() {
    processMapperA();
    processMapReduceB();
    processMapReduceC();
}

processMapperA()
{
    // configure the paths/inputs needed; for example's sake I am taking two paths
    String path1 = "path1";
    String path2 = "path2";
    String[] mapperApaths = new String[]{path1, path2};

    Job mapperAjob = MapperA.getJob(mapperApaths, <some other params you want to pass>);
    mapperAjob.submit();
    mapperAjob.waitForCompletion(true);
}

processMapReduceB()
{
    // init input params to job
    .
    .
    Job mapReduceBjob = MapReduceB.getJob(<input params you want to pass>);
    mapReduceBjob.submit();
    mapReduceBjob.waitForCompletion(true);
}

processMapReduceC()
{
    // init input params to job
    .
    .
    Job mapReduceCjob = MapReduceC.getJob(<input params you want to pass like outputMapperA, outputReducerB>);
    mapReduceCjob.submit();
    mapReduceCjob.waitForCompletion(true);
}
To gain more control over the workflow, you can consider using Oozie or Spring Batch.
With Oozie you can define workflow.xml files and schedule the execution of each job as required.
Spring Batch can also be used for the same purpose, but it would require some coding and understanding; if you already have a background in it, it can be used straight away.
--Edit starts
Oozie is a workflow management tool; it allows you to configure and schedule jobs.
--Edit Ends
Hope this helps.
Are the properties below in hive-site.xml correct for Hive access to Cassandra?
(I have copied the entire hive-default.xml content but have changed only the properties below.)
javax.jdo.option.ConnectionURL: cassandra://localhost:9160
javax.jdo.option.ConnectionDriverName: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbclass: jdbc:cassandra
hive.stats.jdbcdriver: org.apache.cassandra.cql.jdbc.CassandraDriver
hive.stats.dbconnectionstring: jdbc:cassandra:;databaseName=TempStatsStore;create=true
I am running a 1-node Cassandra setup, but would later make it at least a 2-node cluster.
When I run the table creation command below, I get an error:
CREATE EXTERNAL TABLE MyHiveTable
(m string, n string, o string, p string)
STORED BY 'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
TBLPROPERTIES ( "cassandra.ks.name" = "cql3ks",
"cassandra.cf.name" = "test",
"cassandra.cql3.type" = "text, text, text, text");
Error:
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Error creating transactional connection factory
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I don't know about the JDO settings, but you could try this link, which is a far better option for integrating Hive with Cassandra:
https://github.com/milliondreams/hive/tree/cas-support-cql/cassandra-handler