Apache Storm - Stateful bolt - scheduling method run - methods

Having a state-full apache storm bolt plugged into the topology, let's say it looks as follows:
public class WordCountBolt extends BaseStatefulBolt<KeyValueState<String, Long>> {
private KeyValueState<String, Long> wordCounts;
private OutputCollector collector;
...
#Override
public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
#Override
public void initState(KeyValueState<String, Long> state) {
wordCounts = state;
}
#Override
public void execute(Tuple tuple) {
String word = tuple.getString(0);
Integer count = wordCounts.get(word, 0);
count++;
wordCounts.put(word, count);
collector.emit(tuple, new Values(word, count));
collector.ack(tuple);
}
...
}
I would need to trigger a method to update the state (wordCounts), let's say each X seconds, independently of receiving an event or not. Is this possible in Apache Storm statefull bolts? Is it possible to simply schedule a method like this to be run in a defined interval repetitively?
public void updateState() {
wordCounts.put("NewKey", 1);
}

I'm not sure I follow why you need to update the state periodically, but if you need to run some code every so often, tick tuples might work for you. https://kitmenke.com/blog/2014/08/04/tick-tuples-within-storm/

Related

Kafka Streams: batch keys for a time window and do some processing on the batch of keys together

I have a stream of incoming primary keys (PK) that I am reading in my Kafkastreams app, I would like to batch them over say last 1 minute and query my transactional DB to get more data for the batch of PKs (deduplicated) in the last minute. And for each PK I would like to post a message on output topic.
I was able to code this using Processor API like below:
Topology topology = new Topology();
topology.addSource("test-source", inputKeySerde.deserializer(), inputValueSerde.deserializer(), "input.kafka.topic")
.addProcessor("test-processor", processorSupplier, "test-source")
.addSink("test-sink", "output.kafka.topic", outputKeySerde.serializer(), outputValueSerde.serializer, "test-processor");
Here processor supplier has a process method that adds the PK to a queue and a punctuator that is scheduled to run every minute and drains the queue and queries transactional DB and forwards a message for every PK.
ProcessorSupplier<Integer, ValueType> processorSupplier = new ProcessorSupplier<Integer, ValueType>() {
public Processor<Integer, ValueType> get() {
return new Processor<Integer, ValueType>() {
private ProcessorContext context;
private BlockingQueue<Integer> ids;
public void init(ProcessorContext context) {
this.context = context;
this.context.schedule(Duration.ofMillis(1000), PunctuationType.WALL_CLOCK_TIME, this::punctuate);
ids = new LinkedBlockingQueue<>();
}
#Override
public void process(Integer key, ValueType value) {
ids.add(key);
}
public void punctuate(long timestamp) {
Set<Long> idSet = new HashSet<>();
ids.drainTo(idSet, 1000);
List<Document> documentList = createDocuments(ids);
documentList.stream().forEach(document -> context.forward(document.getId(), document));
context.commit();
}
#Override
public void close() {
}
};
}
};
Wondering if there is a simpler way to accomplish this using DSL windowedBy and reduce/aggregate route?
***** Updated code to use state store ******
ProcessorSupplier<Integer, ValueType> processorSupplier = new ProcessorSupplier<Integer, ValueType>() {
public Processor<Integer, ValueType> get() {
return new Processor<Integer, ValueType>() {
private ProcessorContext context;
private KeyValueStore<Integer, Integer> stateStore;
public void init(ProcessorContext context) {
this.context = context;
stateStore = (KeyValueStore) context.getStateStore("MyStore");
this.context.schedule(Duration.ofMillis(5000), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
Set<Integer> ids = new HashSet<>();
try (KeyValueIterator<Integer, Integer> iter = this.stateStore.all()) {
while (iter.hasNext()) {
KeyValue<Integer, Integer> entry = iter.next();
ids.add(entry.key);
}
}
List<Document> documentList = createDocuments(dataRetriever, ids);
documentList.stream().forEach(document -> context.forward(document.getId(), document));
ids.stream().forEach(id -> stateStore.delete(id));
this.context.commit();
});
}
#Override
public void process(Integer key, ValueType value) {
Long id = key.getId();
stateStore.put(id, id);
}
#Override
public void close() {
}
};
}
};

Sorting DataStream using Apache Flink

I am learning Flink and I started with a simple word count using DataStream. To enhance the processing I filtered the output to show only the results with 3 or more words found.
DataStream<Tuple2<String, Integer>> dataStream = env
.socketTextStream("localhost", 9000)
.flatMap(new Splitter())
.keyBy(0)
.timeWindow(Time.seconds(5))
.apply(new MyWindowFunction())
.sum(1)
.filter(word -> word.f1 >= 3);
I would like to create a WindowFunction to sort the output by the value of words found. The WindowFunction that I am trying to implement does not compile at all. I am struggling to define the apply method and the parameters of the WindowFunction interface.
public static class MyWindowFunction implements WindowFunction<
Tuple2<String, Integer>, // input type
Tuple2<String, Integer>, // output type
Tuple2<String, Integer>, // key type
TimeWindow> {
void apply(Tuple2<String, Integer> key, TimeWindow window, Iterable<Tuple2<String, Integer>> input, Collector<Tuple2<String, Integer>> out) {
String word = ((Tuple2<String, Integer>)key).f0;
Integer count = ((Tuple2<String, Integer>)key).f1;
.........
out.collect(new Tuple2<>(word, count));
}
}
I am updating this answer to use Flink 1.12.0. In order to sort the elements of a stream in I had to use a KeyedProcessFunction after counting the stream with a ReduceFunction. Then I had to set the parallelism of the very last transformation to 1 in order to not change the order of the elements that I sorted using KeyedProcessFunction. The sequence that I am using is socketTextStream -> flatMap -> keyBy -> reduce -> keyBy -> process -> print().setParallelism(1). Bellow it the example:
public class SocketWindowWordCountJava {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream("localhost", 9000)
.flatMap(new SplitterFlatMap())
.keyBy(new WordKeySelector())
.reduce(new SumReducer())
.keyBy(new WordKeySelector())
.process(new SortKeyedProcessFunction(3 * 1000))
.print().setParallelism(1);
String executionPlan = env.getExecutionPlan();
System.out.println("ExecutionPlan ........................ ");
System.out.println(executionPlan);
System.out.println("........................ ");
env.execute("Window WordCount sorted");
}
}
The UDF that I used to sort the stream is the SortKeyedProcessFunction which extends KeyedProcessFunction. I use a ValueState<List<Event>> listState of Event implements Comparable<Event> to have a sorted list as state. On the processElement method I register the time stamp that I added the event to the state context.timerService().registerProcessingTimeTimer(timeoutTime); and I collect the event at the onTimer method. I am also using a time window of 3 seconds here.
public class SortKeyedProcessFunction extends KeyedProcessFunction<String, Tuple2<String, Integer>, Event> {
private static final long serialVersionUID = 7289761960983988878L;
// delay after which an alert flag is thrown
private final long timeOut;
// state to remember the last timer set
private ValueState<List<Event>> listState = null;
private ValueState<Long> lastTime = null;
public SortKeyedProcessFunction(long timeOut) {
this.timeOut = timeOut;
}
#Override
public void open(Configuration conf) {
// setup timer and HLL state
ValueStateDescriptor<List<Event>> descriptor = new ValueStateDescriptor<>(
// state name
"sorted-events",
// type information of state
TypeInformation.of(new TypeHint<List<Event>>() {
}));
listState = getRuntimeContext().getState(descriptor);
ValueStateDescriptor<Long> descriptorLastTime = new ValueStateDescriptor<Long>(
"lastTime",
TypeInformation.of(new TypeHint<Long>() {
}));
lastTime = getRuntimeContext().getState(descriptorLastTime);
}
#Override
public void processElement(Tuple2<String, Integer> value, Context context, Collector<Event> collector) throws Exception {
// get current time and compute timeout time
long currentTime = context.timerService().currentProcessingTime();
long timeoutTime = currentTime + timeOut;
// register timer for timeout time
context.timerService().registerProcessingTimeTimer(timeoutTime);
List<Event> queue = listState.value();
if (queue == null) {
queue = new ArrayList<Event>();
}
Long current = lastTime.value();
queue.add(new Event(value.f0, value.f1));
lastTime.update(timeoutTime);
listState.update(queue);
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
// System.out.println("onTimer: " + timestamp);
// check if this was the last timer we registered
System.out.println("timestamp: " + timestamp);
List<Event> queue = listState.value();
Long current = lastTime.value();
if (timestamp == current.longValue()) {
Collections.sort(queue);
queue.forEach( e -> {
out.collect(e);
});
queue.clear();
listState.clear();
}
}
}
class Event implements Comparable<Event> {
String value;
Integer qtd;
public Event(String value, Integer qtd) {
this.value = value;
this.qtd = qtd;
}
public String getValue() { return value; }
public Integer getQtd() { return qtd; }
#Override
public String toString() {
return "Event{" +"value='" + value + '\'' +", qtd=" + qtd +'}';
}
#Override
public int compareTo(#NotNull Event event) {
return this.getValue().compareTo(event.getValue());
}
}
So when I use $ nc -lk 9000 and type the words on the console I see them in order on the output
...
Event{value='soccer', qtd=7}
Event{value='swim', qtd=5}
...
Event{value='basketball', qtd=9}
Event{value='soccer', qtd=8}
Event{value='swim', qtd=6}
The other UDFs are for the other transformations of the stream program and they are here for completeness.
public class SplitterFlatMap implements FlatMapFunction<String, Tuple2<String, Integer>> {
private static final long serialVersionUID = 3121588720675797629L;
#Override
public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String word : sentence.split(" ")) {
out.collect(Tuple2.of(word, 1));
}
}
}
public class WordKeySelector implements KeySelector<Tuple2<String, Integer>, String> {
#Override
public String getKey(Tuple2<String, Integer> value) throws Exception {
return value.f0;
}
}
public class SumReducer implements ReduceFunction<Tuple2<String, Integer>> {
#Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> event1, Tuple2<String, Integer> event2) throws Exception {
return Tuple2.of(event1.f0, event1.f1 + event2.f1);
}
}
The .sum(1) method will do everything you need (no need for using apply()), as long as the Splitter class (which should be a FlatMapFunction) is emitting Tuple2<String, Integer> records, where String is the word, and Integer is always 1.
So then .sum(1) will do the aggregation for you. If you needed something different than what sum() does, you would typically use .reduce(new MyCustomReduceFunction()), as that's going to be the most efficient and scalable approach, in terms of not needing to buffer lots in memory.

Spring cloud data flow with spring batch job - scaling considerations

We are currently in evaluation process shifting from Spring batch + Batch Admin
into Spring Cloud based infrastructure.
our main challenges / questions:
1. As part of the monolithic design of the spring batch jobs we are fetching some general MD and aggregated it into common data structure that many jobs using to run in a more optimized way. is the nature of the SCDF Tasks going to be a problem in our case ? should we reconsider shifting into Streams ? and how its can be done ?
2. One of the major reasons to use SCDF is the support for scaling for better performance.
as first POC its going to be hard for us to create a real cloud infrastructure and i was looking for standalone SCDF that use the remote partitioning design for a scaling solution.we looking for a demo/intro GitHub project/guide - i didn't mange to find anything relevant. is it also requiring as past years solution communication between nodes via JMS infrastructure (Spring Integration) ?
3. The main challenge for us is to refactor on of our batch jobs and be able to support both remote partitioning and multiple threads on each node. is it possible to create a spring batch job with both of the aspects.
4. breaking up our monolithic jar with 20 Jobs into separate spring boot über jars isn't simple task to achieve - any thoughts / ideas / best practices.
Best,
Elad
I had the same problem as Elad's point 3 and eventually solved it by using the basic framework as demonstrated here but with modified versions of DeployerPartitionHandler and DeployerStepExecutionHandler.
I first tried the naive approach of creating a two-level partitioning where the step that each worker executes is itself partitioned into sub-partitions. But the framework doesn't seem to support that; it got confused about the step's state.
So I went back to a flat set of partitions but passing multiple step execution ids to each worker. For this to work, I created DeployerMultiPartitionHandler which launches the configured number of workers and passes each one a list of step execution ids. Note that there are now two degrees of freedom: the number of workers and the gridSize, which is the total number of partitions that get distributed as evenly as possible to the workers. Unfortunately, I had to duplicate a lot of DeployerPartitionHandler's code here.
#Slf4j
#Getter
#Setter
public class DeployerMultiPartitionHandler implements PartitionHandler, EnvironmentAware, InitializingBean {
public static final String SPRING_CLOUD_TASK_STEP_EXECUTION_IDS =
"spring.cloud.task.step-execution-ids";
public static final String SPRING_CLOUD_TASK_JOB_EXECUTION_ID =
"spring.cloud.task.job-execution-id";
public static final String SPRING_CLOUD_TASK_STEP_EXECUTION_ID =
"spring.cloud.task.step-execution-id";
public static final String SPRING_CLOUD_TASK_STEP_NAME =
"spring.cloud.task.step-name";
public static final String SPRING_CLOUD_TASK_PARENT_EXECUTION_ID =
"spring.cloud.task.parentExecutionId";
public static final String SPRING_CLOUD_TASK_NAME = "spring.cloud.task.name";
private int maxWorkers = -1;
private int gridSize = 1;
private int currentWorkers = 0;
private TaskLauncher taskLauncher;
private JobExplorer jobExplorer;
private TaskExecution taskExecution;
private Resource resource;
private String stepName;
private long pollInterval = 10000;
private long timeout = -1;
private Environment environment;
private Map<String, String> deploymentProperties;
private EnvironmentVariablesProvider environmentVariablesProvider;
private String applicationName;
private CommandLineArgsProvider commandLineArgsProvider;
private boolean defaultArgsAsEnvironmentVars = false;
public DeployerMultiPartitionHandler(TaskLauncher taskLauncher,
JobExplorer jobExplorer,
Resource resource,
String stepName) {
Assert.notNull(taskLauncher, "A taskLauncher is required");
Assert.notNull(jobExplorer, "A jobExplorer is required");
Assert.notNull(resource, "A resource is required");
Assert.hasText(stepName, "A step name is required");
this.taskLauncher = taskLauncher;
this.jobExplorer = jobExplorer;
this.resource = resource;
this.stepName = stepName;
}
#Override
public Collection<StepExecution> handle(StepExecutionSplitter stepSplitter,
StepExecution stepExecution) throws Exception {
final Set<StepExecution> tempCandidates =
stepSplitter.split(stepExecution, this.gridSize);
// Following two lines due to https://jira.spring.io/browse/BATCH-2490
final List<StepExecution> candidates = new ArrayList<>(tempCandidates.size());
candidates.addAll(tempCandidates);
int partitions = candidates.size();
log.debug(String.format("%s partitions were returned", partitions));
final Set<StepExecution> executed = new HashSet<>(candidates.size());
if (CollectionUtils.isEmpty(candidates)) {
return null;
}
launchWorkers(candidates, executed);
candidates.removeAll(executed);
return pollReplies(stepExecution, executed, partitions);
}
private void launchWorkers(List<StepExecution> candidates, Set<StepExecution> executed) {
int partitions = candidates.size();
int numWorkers = this.maxWorkers != -1 ? Math.min(this.maxWorkers, partitions) : partitions;
IntStream.range(0, numWorkers).boxed()
.map(i -> candidates.subList(partitionOffset(partitions, numWorkers, i), partitionOffset(partitions, numWorkers, i + 1)))
.filter(not(List::isEmpty))
.forEach(stepExecutions -> processStepExecutions(stepExecutions, executed));
}
private void processStepExecutions(List<StepExecution> stepExecutions, Set<StepExecution> executed) {
launchWorker(stepExecutions);
this.currentWorkers++;
executed.addAll(stepExecutions);
}
private void launchWorker(List<StepExecution> workerStepExecutions) {
List<String> arguments = new ArrayList<>();
StepExecution firstWorkerStepExecution = workerStepExecutions.get(0);
ExecutionContext copyContext = new ExecutionContext(firstWorkerStepExecution.getExecutionContext());
arguments.addAll(
this.commandLineArgsProvider
.getCommandLineArgs(copyContext));
String jobExecutionId = String.valueOf(firstWorkerStepExecution.getJobExecution().getId());
String stepExecutionIds = workerStepExecutions.stream().map(workerStepExecution -> String.valueOf(workerStepExecution.getId())).collect(joining(","));
String taskName = String.format("%s_%s_%s",
taskExecution.getTaskName(),
firstWorkerStepExecution.getJobExecution().getJobInstance().getJobName(),
firstWorkerStepExecution.getStepName());
String parentExecutionId = String.valueOf(taskExecution.getExecutionId());
if(!this.defaultArgsAsEnvironmentVars) {
arguments.add(formatArgument(SPRING_CLOUD_TASK_JOB_EXECUTION_ID,
jobExecutionId));
arguments.add(formatArgument(SPRING_CLOUD_TASK_STEP_EXECUTION_IDS,
stepExecutionIds));
arguments.add(formatArgument(SPRING_CLOUD_TASK_STEP_NAME, this.stepName));
arguments.add(formatArgument(SPRING_CLOUD_TASK_NAME, taskName));
arguments.add(formatArgument(SPRING_CLOUD_TASK_PARENT_EXECUTION_ID,
parentExecutionId));
}
copyContext = new ExecutionContext(firstWorkerStepExecution.getExecutionContext());
log.info("launchWorker context={}", copyContext);
Map<String, String> environmentVariables = this.environmentVariablesProvider.getEnvironmentVariables(copyContext);
if(this.defaultArgsAsEnvironmentVars) {
environmentVariables.put(SPRING_CLOUD_TASK_JOB_EXECUTION_ID,
jobExecutionId);
environmentVariables.put(SPRING_CLOUD_TASK_STEP_EXECUTION_ID,
String.valueOf(firstWorkerStepExecution.getId()));
environmentVariables.put(SPRING_CLOUD_TASK_STEP_NAME, this.stepName);
environmentVariables.put(SPRING_CLOUD_TASK_NAME, taskName);
environmentVariables.put(SPRING_CLOUD_TASK_PARENT_EXECUTION_ID,
parentExecutionId);
}
AppDefinition definition =
new AppDefinition(resolveApplicationName(),
environmentVariables);
AppDeploymentRequest request =
new AppDeploymentRequest(definition,
this.resource,
this.deploymentProperties,
arguments);
taskLauncher.launch(request);
}
private String resolveApplicationName() {
if(StringUtils.hasText(this.applicationName)) {
return this.applicationName;
}
else {
return this.taskExecution.getTaskName();
}
}
private String formatArgument(String key, String value) {
return String.format("--%s=%s", key, value);
}
private Collection<StepExecution> pollReplies(final StepExecution masterStepExecution,
final Set<StepExecution> executed,
final int size) throws Exception {
final Collection<StepExecution> result = new ArrayList<>(executed.size());
Callable<Collection<StepExecution>> callback = new Callable<Collection<StepExecution>>() {
#Override
public Collection<StepExecution> call() {
Set<StepExecution> newExecuted = new HashSet<>();
for (StepExecution curStepExecution : executed) {
if (!result.contains(curStepExecution)) {
StepExecution partitionStepExecution =
jobExplorer.getStepExecution(masterStepExecution.getJobExecutionId(), curStepExecution.getId());
if (isComplete(partitionStepExecution.getStatus())) {
result.add(partitionStepExecution);
currentWorkers--;
}
}
}
executed.addAll(newExecuted);
if (result.size() == size) {
return result;
}
else {
return null;
}
}
};
Poller<Collection<StepExecution>> poller = new DirectPoller<>(this.pollInterval);
Future<Collection<StepExecution>> resultsFuture = poller.poll(callback);
if (timeout >= 0) {
return resultsFuture.get(timeout, TimeUnit.MILLISECONDS);
}
else {
return resultsFuture.get();
}
}
private boolean isComplete(BatchStatus status) {
return status.equals(BatchStatus.COMPLETED) || status.isGreaterThan(BatchStatus.STARTED);
}
#Override
public void setEnvironment(Environment environment) {
this.environment = environment;
}
#Override
public void afterPropertiesSet() {
Assert.notNull(taskExecution, "A taskExecution is required");
if(this.environmentVariablesProvider == null) {
this.environmentVariablesProvider =
new CloudEnvironmentVariablesProvider(this.environment);
}
if(this.commandLineArgsProvider == null) {
SimpleCommandLineArgsProvider simpleCommandLineArgsProvider = new SimpleCommandLineArgsProvider();
simpleCommandLineArgsProvider.onTaskStartup(taskExecution);
this.commandLineArgsProvider = simpleCommandLineArgsProvider;
}
}
}
The partitions are distributed to workers with the help of static function partitionOffset, which ensures that the number of partitions each worker receives differ by at most one:
static int partitionOffset(int length, int numberOfPartitions, int partitionIndex) {
return partitionIndex * (length / numberOfPartitions) + Math.min(partitionIndex, length % numberOfPartitions);
}
On the receiving end I created DeployerMultiStepExecutionHandler which inherits the parallel execution of partitions from TaskExecutorPartitionHandler and in addition implements the command line interface matching DeployerMultiPartitionHandler:
#Slf4j
public class DeployerMultiStepExecutionHandler extends TaskExecutorPartitionHandler implements CommandLineRunner {
private JobExplorer jobExplorer;
private JobRepository jobRepository;
private Log logger = LogFactory.getLog(org.springframework.cloud.task.batch.partition.DeployerStepExecutionHandler.class);
#Autowired
private Environment environment;
private StepLocator stepLocator;
public DeployerMultiStepExecutionHandler(BeanFactory beanFactory, JobExplorer jobExplorer, JobRepository jobRepository) {
Assert.notNull(beanFactory, "A beanFactory is required");
Assert.notNull(jobExplorer, "A jobExplorer is required");
Assert.notNull(jobRepository, "A jobRepository is required");
this.stepLocator = new BeanFactoryStepLocator();
((BeanFactoryStepLocator) this.stepLocator).setBeanFactory(beanFactory);
this.jobExplorer = jobExplorer;
this.jobRepository = jobRepository;
}
#Override
public void run(String... args) throws Exception {
validateRequest();
Long jobExecutionId = Long.parseLong(environment.getProperty(SPRING_CLOUD_TASK_JOB_EXECUTION_ID));
Stream<Long> stepExecutionIds = Stream.of(environment.getProperty(SPRING_CLOUD_TASK_STEP_EXECUTION_IDS).split(",")).map(Long::parseLong);
Set<StepExecution> stepExecutions = stepExecutionIds.map(stepExecutionId -> jobExplorer.getStepExecution(jobExecutionId, stepExecutionId)).collect(Collectors.toSet());
log.info("found stepExecutions:\n{}", stepExecutions.stream().map(stepExecution -> stepExecution.getId() + ":" + stepExecution.getExecutionContext()).collect(joining("\n")));
if (stepExecutions.isEmpty()) {
throw new NoSuchStepException(String.format("No StepExecution could be located for step execution id %s within job execution %s", stepExecutionIds, jobExecutionId));
}
String stepName = environment.getProperty(SPRING_CLOUD_TASK_STEP_NAME);
setStep(stepLocator.getStep(stepName));
doHandle(null, stepExecutions);
}
private void validateRequest() {
Assert.isTrue(environment.containsProperty(SPRING_CLOUD_TASK_JOB_EXECUTION_ID), "A job execution id is required");
Assert.isTrue(environment.containsProperty(SPRING_CLOUD_TASK_STEP_EXECUTION_IDS), "A step execution id is required");
Assert.isTrue(environment.containsProperty(SPRING_CLOUD_TASK_STEP_NAME), "A step name is required");
Assert.isTrue(this.stepLocator.getStepNames().contains(environment.getProperty(SPRING_CLOUD_TASK_STEP_NAME)), "The step requested cannot be found in the provided BeanFactory");
}
}

Storm SQS messages not getting acked

I have a topology with 1 spout reading from 2 SQS queues and 5 bolts. After processing when i try to ack from second bolt it is not getting acked.
I'm running it in reliable mode and trying to ack in the last bolt. I get this message as if the messages are getting acked. But it is not getting deleted from the queue and the overwritten ack() methods are not getting called. It looks like it calls the default ack method in backtype.storm.task.OutputCollector instead of the overridden method in my spout.
8240 [Thread-24-conversionBolt] INFO backtype.storm.daemon.task - Emitting: conversionBolt__ack_ack [-7578372739434961741 -8189877254603774958]
I have anchored message ID to the tuple in my SQS queue spout and emitting to first bolt.
collector.emit(getStreamId(message), new Values(jsonObj.toString()), message.getReceiptHandle());
I have ack() and fail() methods overridden in my queue spout.Default Visibility Timeout has been set to 30 seconds
Code snippet from my topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("firstQueueSpout",
new SqsQueueSpout(StormConfigurations.getQueueURL()
+ StormConfigurations.getFirstQueueName(), true),
StormConfigurations.getAwsQueueSpoutThreads());
builder.setSpout("secondQueueSpout",
new SqsQueueSpout(StormConfigurations.getQueueURL()
+ StormConfigurations.getSecondQueueName(),
true), StormConfigurations.getAwsQueueSpoutThreads());
builder.setBolt("transformerBolt", new TransformerBolt(),
StormConfigurations.getTranformerBoltThreads())
.shuffleGrouping("firstQueueSpout")
.shuffleGrouping("secondQueueSpout");
builder.setBolt("conversionBolt", new ConversionBolt(),
StormConfigurations.getTranformerBoltThreads())
.shuffleGrouping("transformerBolt");
// To dispatch it to the corresponding bolts based on packet type
builder.setBolt("dispatchBolt", new DispatcherBolt(),
StormConfigurations.getDispatcherBoltThreads())
.shuffleGrouping("conversionBolt");
Code snippet from SQSQueueSpout(extends BaseRichSpout):
#Override
public void nextTuple()
{
if (queue.isEmpty()) {
ReceiveMessageResult receiveMessageResult = sqs.receiveMessage(
new ReceiveMessageRequest(queueUrl).withMaxNumberOfMessages(10));
queue.addAll(receiveMessageResult.getMessages());
}
Message message = queue.poll();
if (message != null)
{
try
{
JSONParser parser = new JSONParser();
JSONObject jsonObj = (JSONObject) parser.parse(message.getBody());
// ack(message.getReceiptHandle());
if (reliable) {
collector.emit(getStreamId(message), new Values(jsonObj.toString()), message.getReceiptHandle());
} else {
// Delete it right away
sqs.deleteMessageAsync(new DeleteMessageRequest(queueUrl, message.getReceiptHandle()));
collector.emit(getStreamId(message), new Values(jsonObj.toString()));
}
}
catch (ParseException e)
{
LOG.error("SqsQueueSpout SQLException in SqsQueueSpout.nextTuple(): ", e);
}
} else {
// Still empty, go to sleep.
Utils.sleep(sleepTime);
}
}
public String getStreamId(Message message) {
return Utils.DEFAULT_STREAM_ID;
}
public int getSleepTime() {
return sleepTime;
}
public void setSleepTime(int sleepTime)
{
this.sleepTime = sleepTime;
}
#Override
public void ack(Object msgId) {
System.out.println("......Inside ack in sqsQueueSpout..............."+msgId);
// Only called in reliable mode.
try {
sqs.deleteMessageAsync(new DeleteMessageRequest(queueUrl, (String) msgId));
} catch (AmazonClientException ace) { }
}
#Override
public void fail(Object msgId) {
// Only called in reliable mode.
try {
sqs.changeMessageVisibilityAsync(
new ChangeMessageVisibilityRequest(queueUrl, (String) msgId, 0));
} catch (AmazonClientException ace) { }
}
#Override
public void close() {
sqs.shutdown();
((AmazonSQSAsyncClient) sqs).getExecutorService().shutdownNow();
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("message"));
}
Code snipped from my first Bolt(extends BaseRichBolt):
public class TransformerBolt extends BaseRichBolt
{
private static final long serialVersionUID = 1L;
public static final Logger LOG = LoggerFactory.getLogger(TransformerBolt.class);
private OutputCollector collector;
#Override
public void prepare(Map stormConf, TopologyContext context,
OutputCollector collector) {
this.collector = collector;
}
#Override
public void execute(Tuple input) {
String eventStr = input.getString(0);
//some code here to convert the json string to map
//Map datamap, long packetId being sent to next bolt
this.collector.emit(input, new Values(dataMap,packetId));
}
catch (Exception e) {
LOG.warn("Exception while converting AWS SQS to HashMap :{}", e);
}
}
#Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("dataMap", "packetId"));
}
}
Code snippet from second Bolt:
public class ConversionBolt extends BaseRichBolt
{
private static final long serialVersionUID = 1L;
private OutputCollector collector;
#Override
public void prepare(Map stormConf, TopologyContext context,
OutputCollector collector) {
this.collector = collector;
}
#Override
public void execute(Tuple input)
{
try{
Map dataMap = (Map)input.getValue(0);
Long packetId = (Long)input.getValue(1);
//this ack is not working
this.collector.ack(input);
}catch(Exception e){
this.collector.fail(input);
}
}
#Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
}
Kindly let me know if you need more information. Somebody shed some light on why the overridden ack in my spout is not getting called(from my second bolt)...
You must ack all incoming tuples in all bolts, ie, add collector.ack(input) to TransformerBolt.execute(Tuple input).
The log message you see is correct: your code calls collector.ack(...) and this call gets logged. A call to ack in your topology is not a call to Spout.ack(...): Each time a Spout emits a tuple with a message ID, this ID gets registered by the running ackers of your topology. Those ackers will get a message on each ack of a Bolt, collect those and notify the Spout if all acks of a tuple got received. If a Spout receives this message from an acker, it calls it's own ack(Object messageID) method.
See here for more details: https://storm.apache.org/documentation/Guaranteeing-message-processing.html

storm processing data extremely slow

We have 1 spout and 1 bolt on single node. Spout reads the data from RabbitMQ and emits it to the only bolt which writes data to Cassandra.
Our data source generates 10000 messages per second and storm takes around 10 sec to process this, which is too slow for us.
We tried increasing the parallelism of topology but that doesn't make any difference.
What is ideal no of messages that can be processed on a single node machine with 1 spout and 1 bolt? and what are the possible ways to increase the processing speed of storm topology?.
Update :
This is the sample code, it doesent have code for RabbitMQ and cassandra, but gives same performance issue.
// Topology Class
public class SimpleTopology {
public static void main(String[] args) throws InterruptedException {
System.out.println("hiiiiiiiiiii");
TopologyBuilder topologyBuilder = new TopologyBuilder();
topologyBuilder.setSpout("SimpleSpout", new SimpleSpout());
topologyBuilder.setBolt("SimpleBolt", new SimpleBolt(), 2).setNumTasks(4).shuffleGrouping("SimpleSpout");
Config config = new Config();
config.setDebug(true);
config.setNumWorkers(2);
LocalCluster localCluster = new LocalCluster();
localCluster.submitTopology("SimpleTopology", config, topologyBuilder.createTopology());
Thread.sleep(2000);
}
}
// Simple Bolt
public class SimpleBolt implements IRichBolt{
private OutputCollector outputCollector;
public void prepare(Map map, TopologyContext tc, OutputCollector oc) {
this.outputCollector = oc;
}
public void execute(Tuple tuple) {
this.outputCollector.ack(tuple);
}
public void cleanup() {
// TODO
}
public void declareOutputFields(OutputFieldsDeclarer ofd) {
// TODO
}
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
// Simple Spout
public class SimpleSpout implements IRichSpout{
private SpoutOutputCollector spoutOutputCollector;
private boolean completed = false;
private static int i = 0;
public void open(Map map, TopologyContext tc, SpoutOutputCollector soc) {
this.spoutOutputCollector = soc;
}
public void close() {
// Todo
}
public void activate() {
// Todo
}
public void deactivate() {
// Todo
}
public void nextTuple() {
if(!completed)
{
if(i < 100000)
{
String item = "Tag" + Integer.toString(i++);
System.out.println(item);
this.spoutOutputCollector.emit(new Values(item), item);
}
else
{
completed = true;
}
}
else
{
try {
Thread.sleep(2000);
} catch (InterruptedException ex) {
Logger.getLogger(SimpleSpout.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
public void ack(Object o) {
System.out.println("\n\n OK : " + o);
}
public void fail(Object o) {
System.out.println("\n\n Fail : " + o);
}
public void declareOutputFields(OutputFieldsDeclarer ofd) {
ofd.declare(new Fields("word"));
}
public Map<String, Object> getComponentConfiguration() {
return null;
}
}
Update:
Is it possible that with shuffle grouping same tuple will be processed more than once? configuration used (spouts = 4. bolts = 4), the problem now is, with increase in no of bolts the performance is decreasing.
You should find out what is the bottleneck here -- RabbitMQ or Cassandra. Open the Storm UI and take a look at the latency times for each component.
If increasing parallelism didn't help (it normally should), there's definitely a problem with RabbitMQ or Cassandra, so you should focus on them.
In your code you only emit one tuple per call to nextTuple(). Try emitting more tuples per call.
something like:
public void nextTuple() {
int max = 1000;
int count = 0;
GetResponse response = channel.basicGet(queueName, autoAck);
while ((response != null) && (count < max)) {
// process message
spoutOutputCollector.emit(new Values(item), item);
count++;
response = channel.basicGet(queueName, autoAck);
}
try { Thread.sleep(2000); } catch (InterruptedException ex) {
}
We are successfully using RabbitMQ and Storm. The result gets stored in a different DB, but anyway. We first used basic_get in Spout, and had a terrible performance, but then we swiched to basic_consume, and performance is actually very good. So take a look at how you consuming messages from Rabbit.
Some important factors:
basic_consume instead of basic_get
prefetch_count (make it high enough)
If you want to increase performance, and you don't care about loosing messages - do not ack messages and set delivery_mode to 1.

Resources