How to E2E test functionality of Storm Topology by programmatically inserting messages - apache-storm

Our Apache Storm topology listens to messages from Kafka using a KafkaSpout and, after a lot of mapping/reducing/enrichment/aggregation, finally inserts the data into Cassandra. There is another Kafka input where we receive user queries for data; if the topology finds a response, it sends it to a third Kafka topic. Now we want to write an E2E test using JUnit in which we can programmatically insert data into the topology, then insert a user query message and assert on the third topic that the response received for our query is correct.
To achieve this, we thought of starting EmbeddedKafka and CassandraUnit, replacing the actual Kafka and Cassandra with them, and then starting the topology in the context of this single JUnit test.
But this approach doesn't fit well with JUnit because it makes the tests too bulky: starting Kafka, Cassandra and the topology is all time-consuming and resource-heavy. Is there anything in Apache Storm that can support the kind of testing we are planning to write?

There are a number of options you have here, depending on what kind of slowdown you can live with:
As you mentioned, you can start Kafka, Cassandra and the topology. This is the slowest option, and the "most realistic".
Start Kafka and Cassandra once, and reuse them for all the tests. You can do the same with the Storm LocalCluster. It is likely faster to clear Kafka/Cassandra between each test (e.g. deleting all topics) instead of restarting them.
Replace the Kafka spouts/bolts and the Cassandra bolt with stubs in the test. Storm has a number of tools built in for stubbing bolts and spouts, e.g. the FixedTupleSpout, FeederSpout, and the tracked topology and completable topology functionality in LocalCluster. This way you can insert some fixed tuples into the topology and assert on which tuples were sent to the Cassandra bolt stub. There are examples of some of this functionality here and here
Finally you can of course unit test individual bolts. This is the fastest kind of test. You can use Testing.testTuple to create test tuples to pass to the bolt.
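As a minimal sketch of that last option (EnrichmentBolt, its "word" input field and its uppercasing behaviour are made-up placeholders, and Mockito is assumed for mocking the collector), a bolt unit test could look roughly like this:
import static org.mockito.Mockito.*;

import java.util.HashMap;
import org.apache.storm.Testing;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.testing.MkTupleParam;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.junit.Test;

public class EnrichmentBoltTest {

    @Test
    public void emitsEnrichedTupleAndAcks() {
        EnrichmentBolt bolt = new EnrichmentBolt();               // hypothetical bolt under test
        OutputCollector collector = mock(OutputCollector.class);
        bolt.prepare(new HashMap<>(), mock(TopologyContext.class), collector);

        // Build a test tuple with a named field so the bolt can call getStringByField("word")
        MkTupleParam param = new MkTupleParam();
        param.setFields("word");
        Tuple input = Testing.testTuple(new Values("hello"), param);

        bolt.execute(input);

        // Assert on what the bolt emitted and that it acked the input
        verify(collector).emit(eq(input), eq(new Values("HELLO")));
        verify(collector).ack(input);
    }
}
This keeps the test in plain JUnit, with no Kafka, Cassandra or LocalCluster involved.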

Related

Configuring connectors for multiple topics on Kafka Connect Distributed Mode

We have producers that are sending the following to Kafka:
topic=syslog, ~25,000 events per day
topic=nginx, ~5,000 events per day
topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log
kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...
This can be invoked with bin/connect-standalone.sh. I realized that running, or attempting to run, tasks.max=24 when work is performed in a single process is not ideal. I know that using distributed mode would be a better alternative, but I am unclear on the performance-optimal way to submit connectors to distributed mode. Namely:
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break up multiple .properties configs + connectors (e.g. one for syslog, one for nginx, one for zeek.**) and submit them separately?
I understand that tasks should be equal to the number of topics x the number of partitions, but what dictates the number of workers?
Is there anywhere in the documentation that walks through best practices for a situation such as this where there is a noticeable imbalance of throughput for different topics?
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?
It'd be a JSON file, but yes.
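For example (a sketch only, reusing the properties above with the topics list abbreviated; the worker URL and port 8083 are assumptions based on the Connect defaults), the connector would be submitted as JSON to the REST API of any worker in the distributed cluster:
{
  "name": "elasticsearch-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "tasks.max": "24",
    "topics": "syslog,nginx,zeek.conn.log,zeek.dns.log,...",
    "key.ignore": "true",
    "schema.ignore": "true"
  }
}
posted with something like curl -X POST -H "Content-Type: application/json" --data @elasticsearch-sink.json http://localhost:8083/connectors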
what dictates the number of workers?
Up to you. JVM usage is one factor that you can monitor and scale on.
There is not really any documentation that I am aware of.

Possible Test Scenarios for a Kafka Consumer that Pushes Records to the S3 After Processing

I have a Kafka consumer that:
Consumes records from Kafka.
Processes each one of them in parallel by calling three downstream services.
Pushes a final processed document (corresponding to each record) to S3.
Some additional info:
I am using commitAsync(..);
I am using Spring Reactor.
Apart from the happy case, what are the possible scenarios I should cover, considering that I am processing X messages per poll(..) and processing and committing them all in parallel? I want my entire program to be tested as harshly as possible.
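For reference, here is a stripped-down sketch of the pipeline being described, i.e. the shape of the program the test scenarios have to exercise. A plain KafkaConsumer and CompletableFuture stand in for the Spring Reactor pipeline, and enrichViaDownstreamServices/uploadToS3 are hypothetical placeholders:
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ProcessingLoop {

    private final KafkaConsumer<String, String> consumer;

    public ProcessingLoop(Properties props) {
        this.consumer = new KafkaConsumer<>(props);
    }

    public void run(String topic) {
        consumer.subscribe(List.of(topic));
        while (true) {
            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
            // Process every record of the poll in parallel: three downstream calls, then S3
            List<CompletableFuture<Void>> inFlight = new ArrayList<>();
            for (ConsumerRecord<String, String> record : batch) {
                inFlight.add(CompletableFuture
                        .supplyAsync(() -> enrichViaDownstreamServices(record))
                        .thenAccept(this::uploadToS3));
            }
            CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
            consumer.commitAsync();  // commit the whole batch once everything has succeeded
        }
    }

    // Placeholders for the three downstream-service calls and the S3 upload
    private String enrichViaDownstreamServices(ConsumerRecord<String, String> record) { return record.value(); }
    private void uploadToS3(String document) { }
}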

Storm bolt following a kafka bolt

I have a Storm topology where I have to send output to Kafka as well as update a value in Redis. For this I have a KafkaBolt as well as a RedisBolt.
Below is what my topology looks like -
tp.setSpout("kafkaSpout", kafkaSpout, 3);
tp.setBolt("EvaluatorBolt", evaluatorBolt, 6).shuffleGrouping("kafkaStream");
tp.setBolt("ResultToRedisBolt",ResultsToRedisBolt,3).shuffleGrouping("EvaluatorBolt","ResultStream");
tp.setBolt("ResultToKafkaBolt", ResultsToKafkaBolt, 3).shuffleGrouping("EvaluatorBolt","ResultStream");
The problem is that both of the end bolts (Redis and Kafka) listen to the same stream from the preceding bolt (ResultStream), so both can fail independently. What I really need is to update the value in Redis only if the result was successfully published to Kafka. Is there a way to have an output stream from the KafkaBolt from which I can get the messages that were successfully published to Kafka? I could then listen to that stream in my RedisBolt and act accordingly.
It is not currently possible, unless you modify the bolt code. You would likely be better off changing your design slightly, since doing extra processing after the tuple is written to Kafka has some drawbacks. If you write the tuple to Kafka and you fail to write to Redis, you will get duplicates in Kafka, since the processing will start over at the spout.
It might be better, depending on your use case, to write the result to Kafka, and then have another topology read the result from Kafka and write to Redis.
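A rough sketch of that split, in the same style as the topology above (the result topic name, broker address and RedisWriterBolt are placeholders, not from the original answer):
// Second topology: read the already-published results back from Kafka and write them to Redis
KafkaSpoutConfig<String, String> resultSpoutConfig =
        KafkaSpoutConfig.builder("kafka:9092", "ResultTopic").build();

TopologyBuilder redisTp = new TopologyBuilder();
redisTp.setSpout("resultKafkaSpout", new KafkaSpout<>(resultSpoutConfig), 3);
redisTp.setBolt("ResultToRedisBolt", new RedisWriterBolt(), 3).shuffleGrouping("resultKafkaSpout");
With this design, Redis only ever sees results that made it into Kafka, and a Redis failure no longer replays the whole first topology from the spout.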
If you still need to be able to emit new tuples from the bolt, it should be pretty easy to implement. The bolt recently got the ability to add a custom Producer callback, so we could extend that mechanism.
See the discussion at https://github.com/apache/storm/pull/2790#issuecomment-411709331 for context.

Apache Storm scale/rebalance without downtime

I'm currently evaluating Apache Storm to see whether it is usable as a stream processing framework for me. It looks really nice, but what worries me is the scaling.
As far as I understood it, scaling is done by rebalancing.
E.g. if I want to add a new server to the cluster, I have to increase the number of workers. But when I do so with
storm rebalance storm_example -n 4
all the bolts and spouts stop working while it is rebalancing. But what I want is more like:
Add the server, add a new worker on it, and when new data arrives, also consider this new worker to work off the data.
Do I just not get the idea of Storm, or is this not possible with it?
I had a similar requirement, and as per my research it is not possible. In my case we ended up creating a new Storm cluster without disturbing the existing one. We were (and are) trying to assign servers/workers to Storm based on load, to reduce AWS cost.
It would be interesting to know if we can do so.

Which supervisor will be listening through its spout?

In my topology I have a spout with a socket opened on port 5555 to receive messages.
If I have 10 supervisors in my Storm cluster, will each one of them be listening on its own port 5555?
In the end, to which supervisor should I send messages?
Multiple comments here:
Storm uses a pull-based model for data ingestion via Spouts. If you open a socket, you will block the Spout until data is available (and this is bad; see this SO question for more details: Why should I not loop or block in Spout.nextTuple())
About Spout deployment (Supervisors):
first, it depends on the parallelism of your spout (i.e., parallelism_hint, default value is one)
second, supervisors do not execute Spout code: supervisors start up worker JVMs that execute Spouts/Bolts (see the config parameter number_of_workers for a topology)
third, Storm uses a load-balanced round-robin scheduler; thus, it might happen that two Spout executors are scheduled to the same worker JVM (or to different workers on the same host); in this case, you will get a port conflict (only one executor will be able to open the port)
Data distribution should not matter in this case: if you really go with push, you can choose any host to send the data to; Storm does not care. Of course, if you need some kind of key-based partitioning, you might want to send data from a single partition to a single Spout instance; as an alternative, just forward the data within the Spout and use fieldsGrouping to get your partitions for the consuming Bolt. However, if you use pull-based data ingestion in the Spout, you can ensure that each Spout pulls data from certain partitions and the problem resolves naturally.
To sum up: using push-based data ingestion might be a bad idea.
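For the fieldsGrouping alternative mentioned above, the wiring would look roughly like this (SocketSpout, ConsumingBolt and the "key" field are placeholders):
// The single spout forwards raw messages; fieldsGrouping on "key" ensures all tuples
// with the same key land on the same bolt executor, giving key-based partitioning.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("socketSpout", new SocketSpout(), 1);
builder.setBolt("partitionedBolt", new ConsumingBolt(), 10)
       .fieldsGrouping("socketSpout", new Fields("key"));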
