storm processing data extremely slow

storm processing data extremely slow - performance

We have 1 spout and 1 bolt on single node. Spout reads the data from RabbitMQ and emits it to the only bolt which writes data to Cassandra.
Our data source generates 10000 messages per second and storm takes around 10 sec to process this, which is too slow for us.
We tried increasing the parallelism of topology but that doesn't make any difference.
What is ideal no of messages that can be processed on a single node machine with 1 spout and 1 bolt? and what are the possible ways to increase the processing speed of storm topology?.
Update :
This is the sample code, it doesent have code for RabbitMQ and cassandra, but gives same performance issue.
// Topology Class
public class SimpleTopology {
public static void main(String[] args) throws InterruptedException {
System.out.println("hiiiiiiiiiii");
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("SimpleSpout", new SimpleSpout());
topologyBuilder.setBolt("SimpleBolt", new SimpleBolt(), 2).setNumTasks(4).shuffleGrouping("SimpleSpout");
Config config = new Config();
config.setDebug(true);
config.setNumWorkers(2);
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("SimpleTopology", config, topologyBuilder.createTopology());
Thread.sleep(2000);
}
}
// Simple Bolt
public class SimpleBolt implements IRichBolt{
private OutputCollector outputCollector;
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
this.outputCollector = oc;
}
public void execute(Tuple tuple) {
this.outputCollector.ack(tuple);
}
public void cleanup() {
// TODO
}
public void declareOutputFields(OutputFieldsDeclarer ofd) {
// TODO
}
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
// Simple Spout
public class SimpleSpout implements IRichSpout{
private SpoutOutputCollector spoutOutputCollector;
private boolean completed = false;
private static int i = 0;
public void open(Map map, TopologyContext tc, SpoutOutputCollector soc) {
this.spoutOutputCollector = soc;
}
public void close() {
// Todo
}
public void activate() {
// Todo
}
public void deactivate() {
// Todo
}
public void nextTuple() {
if(!completed)
{
if(i < 100000)
{
String item = "Tag" + Integer.toString(i++);
System.out.println(item);
this.spoutOutputCollector.emit(new Values(item), item);
}
else
{
completed = true;
}
}
else
{
try {
Thread.sleep(2000);
} catch (InterruptedException ex) {
Logger.getLogger(SimpleSpout.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
public void ack(Object o) {
System.out.println("\n\n OK : " + o);
}
public void fail(Object o) {
System.out.println("\n\n Fail : " + o);
}
public void declareOutputFields(OutputFieldsDeclarer ofd) {
ofd.declare(new Fields("word"));
}
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Update:
Is it possible that with shuffle grouping same tuple will be processed more than once? configuration used (spouts = 4. bolts = 4), the problem now is, with increase in no of bolts the performance is decreasing.

You should find out what is the bottleneck here -- RabbitMQ or Cassandra. Open the Storm UI and take a look at the latency times for each component.
If increasing parallelism didn't help (it normally should), there's definitely a problem with RabbitMQ or Cassandra, so you should focus on them.

In your code you only emit one tuple per call to nextTuple(). Try emitting more tuples per call.
something like:
public void nextTuple() {
int max = 1000;
int count = 0;
GetResponse response = channel.basicGet(queueName, autoAck);
while ((response != null) && (count < max)) {
// process message
spoutOutputCollector.emit(new Values(item), item);
count++;
response = channel.basicGet(queueName, autoAck);
}
try { Thread.sleep(2000); } catch (InterruptedException ex) {
}

We are successfully using RabbitMQ and Storm. The result gets stored in a different DB, but anyway. We first used basic_get in Spout, and had a terrible performance, but then we swiched to basic_consume, and performance is actually very good. So take a look at how you consuming messages from Rabbit.
Some important factors:
basic_consume instead of basic_get
prefetch_count (make it high enough)
If you want to increase performance, and you don't care about loosing messages - do not ack messages and set delivery_mode to 1.

Related

Subscribers onnext does not contain complete item

We are working with project reactor and having a huge problem right now. This is how we produce (publish our data):
public Flux<String> getAllFlux() {
return Flux.<String>create(sink -> {
new Thread(){
public void run(){
Iterator<Cache.Entry<String, MyObject>> iterator = getAllIterator();
ObjectMapper mapper = new ObjectMapper();
while(iterator.hasNext()) {
try {
sink.next(mapper.writeValueAsString(iterator.next().getValue()));
} catch (IOException e) {
e.printStackTrace();
}
}
sink.complete();
}
} .start();
});
}
As you can see we are taking data from an iterator and are publishing each item in that iterator as a json string. Our subscriber does the following:
flux.subscribe(new Subscriber<String>() {
private Subscription s;
int amount = 1; // the amount of received flux payload at a time
int onNextAmount;
String completeItem="";
ObjectMapper mapper = new ObjectMapper();
#Override
public void onSubscribe(Subscription s) {
System.out.println("subscribe");
this.s = s;
this.s.request(amount);
}
#Override
public void onNext(String item) {
MyObject myObject = null;
try {
System.out.println(item);
myObject = mapper.readValue(completeItem, MyObject.class);
System.out.println(myObject.toString());
} catch (IOException e) {
System.out.println(item);
System.out.println("failed: " + e.getLocalizedMessage());
}
onNextAmount++;
if (onNextAmount % amount == 0) {
this.s.request(amount);
}
}
#Override
public void onError(Throwable t) {
System.out.println(t.getLocalizedMessage())
}
#Override
public void onComplete() {
System.out.println("completed");
});
}
As you can see we are simply printing the String item which we receive and parsing it into an object using jackson wrapper. The problem we got now is that for most of our items everything works fine:
{"itemId": "someId", "itemDesc", "some description"}
But for some items the String is cut off like this for example:
{"itemId": "some"
And the next item after that would be
"Id", "itemDesc", "some description"}
There is no pattern for those cuts. It is completely random and it is different everytime we run that code. Ofcourse our jackson is gettin an error Unexpected end of Input with that behaviour.
So what is causing such a behaviour and how can we solve it?

Solution:
Send the Object inside the flux instead of the String:
public Flux<ItemIgnite> getAllFlux() {
return Flux.create(sink -> {
new Thread(){
public void run(){
Iterator<Cache.Entry<String, ItemIgnite>> iterator = getAllIterator();
while(iterator.hasNext()) {
sink.next(iterator.next().getValue());
}
}
} .start();
});
}
and use the following produces type:
#RequestMapping(value="/allFlux", method=RequestMethod.GET, produces="application/stream+json")
The key here is to use stream+json and not only json.

application with Chronicle-Queue memory keeps on growing

I have implemented a simple Spring Boot application which receive a network message, queue it into SingleChronicleQueue using appender.writeText(str), another thread polls for a message using tailer.readText(). After some processing a processed message is place in another SingleChronicleQueue to be sent away.
I have three queues in the application.
The application rotates the files every night and the first weird thing is that the file sizes (for each Q) are the same (different for every Q).
The largest cq4 file is about 220MB per day.
The problem that I face is that in three days from start until now the memory grew from 480MB to 1.6GB and it just unreasonable.
I have a notion that I am missing something in configuration, or a naive/bad implementation on my part. (I don't close the appender and tailer after every use, should I).
Here is a stripped down example, maybe someone can shed some light.
#Service
public class QueuesService {
private static Logger LOG = LoggerFactory.getLogger(QueuesService.class);
#Autowired
AppConfiguration conf;
private SingleChronicleQueue Q = null;
private ExcerptAppender QAppender = null;
private ExcerptTailer QTailer = null;
public QueuesService() {
}
#PostConstruct
private void init() {
Q = SingleChronicleQueueBuilder.binary(conf.getQueuePath()).indexSpacing(1).build();
QAppender = Q.acquireAppender();
QTailer = Q.createTailer();
}
public ExcerptAppender getQAppender() {
return QAppender;
}
public ExcerptTailer getQTailer() {
return QTailer;
}
}
#Service
public class ProcessingService {
private static Logger LOG = LoggerFactory.getLogger(ProcessingService.class);
#Autowired
AppConfiguration conf;
#Autowired
private TaskExecutor taskExecutor;
#Autowired
private QueuesService queueService;
private QueueProcessor processor = null;
public ProcessingService() {
}
#PostConstruct
private void init() {
processor = new QueueProcessor();
processor.start();
}
#Override
public Message processMessage(Message msg, Map<String, Object> metadata) throws SomeException {
String strMsg = msg.getMessage().toString();
if (LOG.isInfoEnabled()) {
LOG.info("\n" + strMsg);
}
try {
queueService.getQAppender().writeText(strMsg);
if (LOG.isInfoEnabled()) {
LOG.info("Added new message to queue. index: " + queueService.getQAppender().lastIndexAppended());
}
}
catch(Exception e) {
LOG.error("Unkbown error. reason: " + e.getMessage(), e);
}
}
class QueueProcessor extends Thread {
public void run() {
while (!interrupted()) {
try {
String msg = queueService.getEpicQTailer().readText();
if (msg != null) {
long index = queueService.getEpicQTailer().index();
// process
}
else {
Thread.sleep(10);
}
}
catch (InterruptedException e) {
LOG.warn(e);
this.interrupt();
break;
}
}
ThreadPoolTaskExecutor tp = (ThreadPoolTaskExecutor) taskExecutor;
tp.shutdown();
}
}
}

Chronicle Queue is designed to use virtual memory which can be much larger than main memory (or the heap) without a significant impact on your system. This allows you to access the data at random quickly.
Here is an example of a process writing 1 TB in 3 hours.
https://vanilla-java.github.io/2017/01/27/Chronicle-Queue-storing-1-TB-in-virtual-memory-on-a-128-GB-machine.html
This shows how much slower it gets as the queue grows
Even after it is 1 TB in size on a machine with 128 GB, it write 1 GB under 2 seconds pretty consistently.
While this doesn't cause a technical problem, we are aware this does concern people who also find this "weird", and we plan to have a mode which reduces virtual memory use (even if a little slower for some use cases)

Storm SQS messages not getting acked

I have a topology with 1 spout reading from 2 SQS queues and 5 bolts. After processing when i try to ack from second bolt it is not getting acked.
I'm running it in reliable mode and trying to ack in the last bolt. I get this message as if the messages are getting acked. But it is not getting deleted from the queue and the overwritten ack() methods are not getting called. It looks like it calls the default ack method in backtype.storm.task.OutputCollector instead of the overridden method in my spout.
8240 [Thread-24-conversionBolt] INFO backtype.storm.daemon.task - Emitting: conversionBolt__ack_ack [-7578372739434961741 -8189877254603774958]
I have anchored message ID to the tuple in my SQS queue spout and emitting to first bolt.
collector.emit(getStreamId(message), new Values(jsonObj.toString()), message.getReceiptHandle());
I have ack() and fail() methods overridden in my queue spout.Default Visibility Timeout has been set to 30 seconds
Code snippet from my topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("firstQueueSpout",
new SqsQueueSpout(StormConfigurations.getQueueURL()
+ StormConfigurations.getFirstQueueName(), true),
StormConfigurations.getAwsQueueSpoutThreads());
builder.setSpout("secondQueueSpout",
new SqsQueueSpout(StormConfigurations.getQueueURL()
+ StormConfigurations.getSecondQueueName(),
true), StormConfigurations.getAwsQueueSpoutThreads());
builder.setBolt("transformerBolt", new TransformerBolt(),
StormConfigurations.getTranformerBoltThreads())
.shuffleGrouping("firstQueueSpout")
.shuffleGrouping("secondQueueSpout");
builder.setBolt("conversionBolt", new ConversionBolt(),
StormConfigurations.getTranformerBoltThreads())
.shuffleGrouping("transformerBolt");
// To dispatch it to the corresponding bolts based on packet type
builder.setBolt("dispatchBolt", new DispatcherBolt(),
StormConfigurations.getDispatcherBoltThreads())
.shuffleGrouping("conversionBolt");
Code snippet from SQSQueueSpout(extends BaseRichSpout):
#Override
public void nextTuple()
{
if (queue.isEmpty()) {
ReceiveMessageResult receiveMessageResult = sqs.receiveMessage(
new ReceiveMessageRequest(queueUrl).withMaxNumberOfMessages(10));
queue.addAll(receiveMessageResult.getMessages());
}
Message message = queue.poll();
if (message != null)
{
try
{
JSONParser parser = new JSONParser();
JSONObject jsonObj = (JSONObject) parser.parse(message.getBody());
// ack(message.getReceiptHandle());
if (reliable) {
collector.emit(getStreamId(message), new Values(jsonObj.toString()), message.getReceiptHandle());
} else {
// Delete it right away
sqs.deleteMessageAsync(new DeleteMessageRequest(queueUrl, message.getReceiptHandle()));
collector.emit(getStreamId(message), new Values(jsonObj.toString()));
}
}
catch (ParseException e)
{
LOG.error("SqsQueueSpout SQLException in SqsQueueSpout.nextTuple(): ", e);
}
} else {
// Still empty, go to sleep.
Utils.sleep(sleepTime);
}
}
public String getStreamId(Message message) {
return Utils.DEFAULT_STREAM_ID;
}
public int getSleepTime() {
return sleepTime;
}
public void setSleepTime(int sleepTime)
{
this.sleepTime = sleepTime;
}
#Override
public void ack(Object msgId) {
System.out.println("......Inside ack in sqsQueueSpout..............."+msgId);
// Only called in reliable mode.
try {
sqs.deleteMessageAsync(new DeleteMessageRequest(queueUrl, (String) msgId));
} catch (AmazonClientException ace) { }
}
#Override
public void fail(Object msgId) {
// Only called in reliable mode.
try {
sqs.changeMessageVisibilityAsync(
new ChangeMessageVisibilityRequest(queueUrl, (String) msgId, 0));
} catch (AmazonClientException ace) { }
}
#Override
public void close() {
sqs.shutdown();
((AmazonSQSAsyncClient) sqs).getExecutorService().shutdownNow();
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("message"));
}
Code snipped from my first Bolt(extends BaseRichBolt):
public class TransformerBolt extends BaseRichBolt
{
private static final long serialVersionUID = 1L;
public static final Logger LOG = LoggerFactory.getLogger(TransformerBolt.class);
private OutputCollector collector;
#Override
public void prepare(Map stormConf, TopologyContext context,
OutputCollector collector) {
this.collector = collector;
}
#Override
public void execute(Tuple input) {
String eventStr = input.getString(0);
//some code here to convert the json string to map
//Map datamap, long packetId being sent to next bolt
this.collector.emit(input, new Values(dataMap,packetId));
}
catch (Exception e) {
LOG.warn("Exception while converting AWS SQS to HashMap :{}", e);
}
}
#Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("dataMap", "packetId"));
}
}
Code snippet from second Bolt:
public class ConversionBolt extends BaseRichBolt
{
private static final long serialVersionUID = 1L;
private OutputCollector collector;
#Override
public void prepare(Map stormConf, TopologyContext context,
OutputCollector collector) {
this.collector = collector;
}
#Override
public void execute(Tuple input)
{
try{
Map dataMap = (Map)input.getValue(0);
Long packetId = (Long)input.getValue(1);
//this ack is not working
this.collector.ack(input);
}catch(Exception e){
this.collector.fail(input);
}
}
#Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
Kindly let me know if you need more information. Somebody shed some light on why the overridden ack in my spout is not getting called(from my second bolt)...

You must ack all incoming tuples in all bolts, ie, add collector.ack(input) to TransformerBolt.execute(Tuple input).
The log message you see is correct: your code calls collector.ack(...) and this call gets logged. A call to ack in your topology is not a call to Spout.ack(...): Each time a Spout emits a tuple with a message ID, this ID gets registered by the running ackers of your topology. Those ackers will get a message on each ack of a Bolt, collect those and notify the Spout if all acks of a tuple got received. If a Spout receives this message from an acker, it calls it's own ack(Object messageID) method.
See here for more details: https://storm.apache.org/documentation/Guaranteeing-message-processing.html

RxJava cache last item for future subscribers

I have implemented simple RxEventBus which starts emitting events, even if there is no subscribers. I want to cache last emitted event, so that if first/next subscriber subscribes, it receive only one (last) item.
I created test class which describes my problem:
public class RxBus {
ApplicationsRxEventBus applicationsRxEventBus;
public RxBus() {
applicationsRxEventBus = new ApplicationsRxEventBus();
}
public static void main(String[] args) {
RxBus rxBus = new RxBus();
rxBus.start();
}
private void start() {
ExecutorService executorService = Executors.newScheduledThreadPool(2);
Runnable runnable0 = () -> {
while (true) {
long currentTime = System.currentTimeMillis();
System.out.println("emiting: " + currentTime);
applicationsRxEventBus.emit(new ApplicationsEvent(currentTime));
try {
Thread.sleep(500);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
};
Runnable runnable1 = () -> applicationsRxEventBus
.getBus()
.subscribe(new Subscriber<ApplicationsEvent>() {
#Override
public void onCompleted() {
}
#Override
public void onError(Throwable throwable) {
}
#Override
public void onNext(ApplicationsEvent applicationsEvent) {
System.out.println("runnable 1: " + applicationsEvent.number);
}
});
Runnable runnable2 = () -> applicationsRxEventBus
.getBus()
.subscribe(new Subscriber<ApplicationsEvent>() {
#Override
public void onCompleted() {
}
#Override
public void onError(Throwable throwable) {
}
#Override
public void onNext(ApplicationsEvent applicationsEvent) {
System.out.println("runnable 2: " + applicationsEvent.number);
}
});
executorService.execute(runnable0);
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
e.printStackTrace();
}
executorService.execute(runnable1);
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
e.printStackTrace();
}
executorService.execute(runnable2);
}
private class ApplicationsRxEventBus {
private final Subject<ApplicationsEvent, ApplicationsEvent> mRxBus;
private final Observable<ApplicationsEvent> mBusObservable;
public ApplicationsRxEventBus() {
mRxBus = new SerializedSubject<>(BehaviorSubject.<ApplicationsEvent>create());
mBusObservable = mRxBus.cache();
}
public void emit(ApplicationsEvent event) {
mRxBus.onNext(event);
}
public Observable<ApplicationsEvent> getBus() {
return mBusObservable;
}
}
private class ApplicationsEvent {
long number;
public ApplicationsEvent(long number) {
this.number = number;
}
}
}
runnable0 is emitting events even if there is no subscribers. runnable1 subscribes after 3 sec, and receives last item (and this is ok). But runnable2 subscribes after 3 sec after runnable1, and receives all items, which runnable1 received. I only need last item to be received for runnable2. I have tried cache events in RxBus:
private class ApplicationsRxEventBus {
private final Subject<ApplicationsEvent, ApplicationsEvent> mRxBus;
private final Observable<ApplicationsEvent> mBusObservable;
private ApplicationsEvent event;
public ApplicationsRxEventBus() {
mRxBus = new SerializedSubject<>(BehaviorSubject.<ApplicationsEvent>create());
mBusObservable = mRxBus;
}
public void emit(ApplicationsEvent event) {
this.event = event;
mRxBus.onNext(event);
}
public Observable<ApplicationsEvent> getBus() {
return mBusObservable.doOnSubscribe(() -> emit(event));
}
}
But problem is, that when runnable2 subscribes, runnable1 receives event twice:
emiting: 1447183225122
runnable 1: 1447183225122
runnable 1: 1447183225122
runnable 2: 1447183225122
emiting: 1447183225627
runnable 1: 1447183225627
runnable 2: 1447183225627
I am sure, that there is RxJava operator for this. How to achieve this?

Your ApplicationsRxEventBus does extra work by reemitting a stored event whenever one Subscribes in addition to all the cached events.
You only need a single BehaviorSubject + toSerialized as it will hold onto the very last event and re-emit it to Subscribers by itself.

You are using the wrong interface. When you susbscribe to a cold Observable you get all of its events. You need to turn it into hot Observable first. This is done by creating a ConnectableObservable from your Observable using its publish method. Your Observers then call connect to start receiving events.
You can also read more about in the Hot and Cold observables section of the tutorial.

Merging trident streams blocks the trident spout whereas the storm spout keeps working

I need some help understanding why merging two streams blocks one of the spouts of class FixedBatchSpout.
Short Description: I’m trying to merge two streams s1 and s2, but calling topology.merge(s1, s2) blocks the FixedBatchSpout (a trident spout) from which s1 originates, whereas the BaseRichSpout (a storm spout) from s2 seems to work properly.
Details: In the below main method, just adding the line topology.merge(s1, s2); prevents the FixedBatchSpout to emit past its first batch. This happens with multireduce as well.
FixedBatchSpout spout1 = new FixedBatchSpout(new Fields("sentence"), 2,
new Values("the cow jumped over the moon"),
new Values("the man went to the store and bought some candy”));
FixedLoopSpout spout2 = new FixedLoopSpout(new Fields("sentence"),
new Values("THE COW JUMPED OVER THE MOON"),
new Values("THE MAN WENT TO THE STORE AND BOUGHT SOME CANDY"));
Stream s1 = topology.newStream("hello", spout1);
Stream s2 = topology.newStream("world", spout2);
topology.merge(s1, s2);
public class FixedLoopSpout extends BaseRichSpout {
Values[] values;
List<Values> loop = new LinkedList<Values>();
Iterator<Values> head;
private SpoutOutputCollector collector;
private final Fields outputFields;
private long emitted = 0;
public FixedLoopSpout(Fields outputFields, Values... values) {
this.outputFields = outputFields;
this.values = values;
}
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
this.collector = collector;
for (Values value: this.values) {
this.loop.add(value);
}
this.head = this.loop.iterator();
}
public void nextTuple() {
if (!this.head.hasNext()) {
// wrap
this.head = this.loop.iterator();
}
this.collector.emit(this.head.next(), this.emitted++);
try {
Thread.sleep(100);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(this.outputFields);
}
}
Help is appreciated, Thanks!
Jacques

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

storm processing data extremely slow - performance

You should find out what is the bottleneck here -- RabbitMQ or Cassandra. Open the Storm UI and take a look at the latency times for each component. If increasing parallelism didn't help (it normally should), there's definitely a problem with RabbitMQ or Cassandra, so you should focus on them.

Related

Subscribers onnext does not contain complete item

application with Chronicle-Queue memory keeps on growing

Storm SQS messages not getting acked

RxJava cache last item for future subscribers

Merging trident streams blocks the trident spout whereas the storm spout keeps working

Categories

Resources