Sharing local store between different KafkaStreams - apache-kafka-streams

Given: I have two KafkaStreams instances, each with its own DSL topology. A local state store is added to one of the topologies. What is the optimal way for the second KafkaStreams instance to update the local store in the first one?
One idea is to add a processor to the KafkaStreams instance that owns the local store. This processor would have (1) a static task list populated by the second KafkaStreams instance, and (2) a Punctuator that processes tasks from that list.
Unfortunately, this design doesn't provide any fault-tolerance guarantees.
Any better approach?

The local state of an application should only be updated by the application itself.
I'm not sure exactly what you want to achieve. One way to "update" state from another Kafka Streams instance is via a topic: instance A creates a table from a topic, and instance B writes into this topic whenever it wants to update A's table state.
Hope this helps. If not, maybe update your question with more details about what you want to achieve.
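A minimal sketch of that pattern with the Java DSL (the topic and store names below are made up for illustration, and serde configuration is omitted):

import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

// Instance A: materialize the shared topic as a table backed by a local store.
StreamsBuilder builderA = new StreamsBuilder();
KTable<String, String> table = builderA.table(
    "table-updates",
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("my-store"));

// Instance B: to "update" A's state, write the new key/value into the same topic.
StreamsBuilder builderB = new StreamsBuilder();
builderB.<String, String>stream("input-topic")
    .mapValues(value -> value.toUpperCase())   // placeholder transformation
    .to("table-updates");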

Related

How to transfer NiFi flowfiles from one queue to another?

I have "unmatched" flowfiles in a queue. Is there any way to transfer these flowfiles into another queue?
EDIT: resolved with @Andy's suggested solution.
There isn't a way to directly transfer flow files between queues, because that would take away the meaning of how those flow files got into the queue in the first place: they have to pass through the previous processor, which makes the decision about which queue to place them in. You can create a loop using a processor that does nothing, such as UpdateAttribute, and connect it back to the original processor.
Bryan's answer is comprehensive and explains the ideal process for ongoing success. If this is a one-time task (I have this queue that contains data I was using during testing; now I want it to go to this other processor), you can simply select the queue containing the data and drag the blue endpoint to the other component.

Turn recovery on after first message

I have a persistent actor which receives many messages. The first message is CREATE (a case class) and subsequent messages are UPDATEs (case classes). So if it receives CREATE, it should not go to the persistence store to run recovery, because the storage is empty for this actor. That seems like wasted performance to me.
Is there any way to skip recovery for a particular input message (the first one, which is CREATE)?
A persistent actor will always have to hit the database, because there is no other way to know whether it existed before: it could have been created in a previous instance of the application that was stopped, or it could have been created on a different node in a cluster.
In general, a good pattern for performance is to keep the actor in memory after it has been hit the first time, as that allows responses to be as fast as possible. The most common way to do this is Cluster Sharding (which you can read more about in the docs here: https://doc.akka.io/docs/akka/current/cluster-sharding.html?language=scala#cluster-sharding).
I have never heard of anyone seeing the initial hit for an empty persistent actor as a performance problem, and I'm not sure it is possible to solve in a general way. If you do have such a problem and can somehow know that the actor was never created before, you cannot express that with Akka Persistence; you would have to build a special solution for it yourself.
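As a rough illustration, here is what starting such a sharded entity could look like with the classic Cluster Sharding Java API; Command, Create, and MyPersistentActor are placeholders for the CREATE/UPDATE case classes and persistent actor from the question:

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.cluster.sharding.ClusterSharding;
import akka.cluster.sharding.ClusterShardingSettings;
import akka.cluster.sharding.ShardRegion;

// Routes each command to the entity it belongs to; entities stay in memory
// after their first recovery until they are passivated.
ShardRegion.MessageExtractor extractor = new ShardRegion.MessageExtractor() {
    public String entityId(Object message) {
        return ((Command) message).entityId();          // hypothetical Command interface
    }
    public Object entityMessage(Object message) {
        return message;
    }
    public String shardId(Object message) {
        return String.valueOf(Math.floorMod(entityId(message).hashCode(), 100));
    }
};

ActorSystem system = ActorSystem.create("my-system");
ActorRef region = ClusterSharding.get(system).start(
    "MyEntity",
    Props.create(MyPersistentActor.class),              // the persistent actor from the question
    ClusterShardingSettings.create(system),
    extractor);

// CREATE triggers the one and only recovery; later UPDATEs hit the in-memory actor.
region.tell(new Create("entity-1"), ActorRef.noSender());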

How to restore bolt state during failover

I'm trying to figure out how to restore the state of a Storm bolt instance during failover. I can persist the state externally (DB or file system), but once the bolt instance is restarted I need to point to the specific state of that bolt instance to recover it.
The prepare method of a bolt receives a context, documented here http://nathanmarz.github.io/storm/doc/backtype/storm/task/TopologyContext.html
What is not clear to me is - is there any piece of this context that uniquely identifies the specific bolt instance so I can understand which persistent state to point to? Is that ID preserved during failover? Alternatively, is there any variable/object I can set for the specific bolt/instance that is preserved during failover? Any help appreciated!
br
Sib
P.S.
New to stackoverflow so pls bear with me...
You can probably look at Trident. It's basically an abstraction built on top of Storm. The documentation says:
Trident has first-class abstractions for reading from and writing to stateful sources. The state can either be internal to the topology – e.g., kept in-memory and backed by HDFS – or externally stored in a database like Memcached or Cassandra
In case of any failover, it says:
Trident manages state in a fault-tolerant way so that state updates are idempotent in the face of retries and failures.
You can go through the documentation for any further clarification.
Thanks (and credit) to the Storm user group!
http://mail-archives.apache.org/mod_mbox/storm-user/201312.mbox/%3C74083558E4509844944FF5CF2BA7B20F1060FD0E#ESESSMB305.ericsson.se%3E
In original Storm, both spouts and bolts are stateless. Storm manages restarting nodes, but it takes some effort to restore the state of a node. There are two solutions I can think of:
If a message fails to process, Storm will replay it from the ROOT of the topology, and the replay logic has to be implemented by the user. So in this case I would put more state information (e.g. the ID of some external state storage and the ID of this task) in the messages.
Or you can use Trident. It provides a txid for each transaction and simplifies the storage process.
I'm OK with the first solution because my app doesn't require transactional operations, and I have a better understanding of original Storm (Storm generates simpler logs than Trident does).
You can use the task ID.
Task ids are assigned at topology creation and are static. If a task dies/restarts or gets reassigned somewhere else, it will still have the same id.
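For example, a bolt can read its task id in prepare() and use it as the key for its externally persisted state. This is just a sketch using the old backtype.storm packages (to match the docs linked above); the restore/persist helpers are placeholders:

import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class StatefulBolt extends BaseRichBolt {
    private OutputCollector collector;
    private int taskId;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // The task id is assigned at topology creation and survives restarts and
        // reassignments, so it can identify this instance's persisted state.
        this.taskId = context.getThisTaskId();
        // restoreStateFrom("bolt-state/" + taskId);   // hypothetical helper
    }

    @Override
    public void execute(Tuple input) {
        // ... update in-memory state and persist it under "bolt-state/" + taskId ...
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this sketch emits nothing
    }
}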

#Storm: how to setup various metrics for the same data source

I'm trying to set up Storm to aggregate a stream, but with various (DRPC-available) metrics on the same stream.
E.g. the stream consists of messages that have a sender, a recipient, the channel through which the message arrived, and a gateway through which it was delivered. I'm having trouble deciding how to organize one or more topologies that could give me e.g. the total count of messages by gateway and/or by channel. And besides the totals, counts per minute would be nice too.
The basic idea is to have a spout that accepts messaging events and aggregates the data from there as needed. Currently I'm playing around with Trident and DRPC and I've come up with two possible topologies that solve the problem at this stage. I can't decide which approach is better, if either.
The entire source is available at this gist.
It has three classes:
- RandomMessageSpout
  - used to emit the messaging data
  - simulates the real data source
- SeparateTopology
  - creates a separate DRPC stream for each metric needed
  - also a separate query state is created for each metric
  - they all use the same spout instance
- CombinedTopology
  - creates a single DRPC stream with all the metrics needed
  - creates a separate query state for each metric
  - each query state extracts the desired metric and groups results for it
Now, for the problems and questions:
- SeparateTopology
  - is it necessary to use the same spout instance, or can I just say new RandomMessageSpout() each time?
  - I like the idea that I don't need to persist data grouped by all the metrics, but just the groupings we need to extract later
  - is the spout's emitted data actually processed by all the state/query combinations, i.e. not just the first one that comes?
  - would this also later enable dynamic addition of new state/query combinations at runtime?
- CombinedTopology
  - I don't really like the idea that I need to persist data grouped by all the metrics, since I don't need all the combinations
  - it came as a surprise that all the metrics always return the same data (e.g. channel and gateway inquiries return status metrics data)
  - I found that this was always the data grouped by the first field in the state definition
  - this topic explains the reasoning behind this behaviour, but I'm wondering if this is a good way of doing things in the first place (and I will find a way around this issue if need be)
- SnapshotGet vs TupleCollectionGet in stateQuery
  - with SnapshotGet things tended to work, but not always; only TupleCollectionGet solved the issue
  - any pointers as to what the correct way of doing this is?
I guess this is a longish question / topic, but any help is really appreciated!
Also, if I missed the architecture entirely, suggestions on how to accomplish this would be most welcome.
Thanks in advance :-)
You can't actually split a stream in SeparateTopology by invoking newStream() using the same spout instance, since that would create new instances of the same RandomMessageSpout spout, which would result in duplicate values being emitted to your topology by multiple, separate spout instances. (Spout parallelization is only possible in Storm with partitioned spouts, where each spout instance processes a partition of the whole dataset -- a Kafka partition, for example).
The correct approach here is to modify the CombinedTopology to split the stream into multiple streams as needed for each metric you need (see below), and then do a groupBy() by that metric's field and persistentAggregate() on each newly branched stream.
From the Trident FAQ,
"each" returns a Stream object, which you can store in a variable. You can then run multiple eaches on the same Stream to split it, e.g.:
Stream s = topology.each(...).groupBy(...).aggregate(...)
Stream branch1 = s.each(...)
Stream branch2 = s.each(...)
See this thread on Storm's mailing list, and this one for more information.
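Putting that together, here is a rough sketch of the branched version. It reuses RandomMessageSpout from the gist, assumes the old storm.trident/backtype.storm packages and an in-memory state, and the field and DRPC function names are illustrative:

import backtype.storm.LocalDRPC;
import backtype.storm.tuple.Fields;
import storm.trident.Stream;
import storm.trident.TridentState;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.operation.builtin.MapGet;
import storm.trident.testing.MemoryMapState;

TridentTopology topology = new TridentTopology();
LocalDRPC drpc = new LocalDRPC();

// One stream from the single spout instance...
Stream messages = topology.newStream("messages", new RandomMessageSpout());

// ...branched once per metric, each branch with its own grouping and persisted counts.
TridentState countsByChannel = messages
    .groupBy(new Fields("channel"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

TridentState countsByGateway = messages
    .groupBy(new Fields("gateway"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));

// Each DRPC query reads only the state for its own metric.
topology.newDRPCStream("channel-count", drpc)
    .groupBy(new Fields("args"))
    .stateQuery(countsByChannel, new Fields("args"), new MapGet(), new Fields("count"));

topology.newDRPCStream("gateway-count", drpc)
    .groupBy(new Fields("args"))
    .stateQuery(countsByGateway, new Fields("args"), new MapGet(), new Fields("count"));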

How to make EC2 instance call another instance?

I have two EC2 instances. I want that when one finishes a job, it signals the other one to do other stuff.
So, how do I make them communicate? I don't want to use cURL because it seems expensive. I think AWS should have some simple solution, but I still can't find relevant help in the documentation.
:(
Also, how do I send data between two instances quickly without going through SSH? I know SSH can be done, but it seems slow. Once again, is there any tool that EC2 provides for that?
Actually, I need two methods:
1) Instance A tells Instance B to grab the data from Instance A.
This was answered by Adrian: I can use SQS. I will try that.
2) Once Instance B gets the signal, the data (on EBS) in Instance A needs to be transferred to Instance B. The amount of data can be big even if I zip it; it's around 50 MB. And I need Instance B to get the data fast, so that it has enough time to process the data before the next interval comes in.
So, I am thinking of either these methods:
a) Instance A dumps the data from the DB and uploads it to S3, then signals Instance B. Instance B gets the data from S3.
b) Instance A dumps the data from the DB, then signals Instance B. Instance B establishes an SSH (or any other) connection to Instance A and grabs the data.
The data may need to be stored permanently but it is not a concern at this moment. It is mainly for Instance B to process.
This is a simple scenario; I'm also thinking about what the proper approach would be if I scaled it out to multiple instances. :)
Thanks.
Amazon has a special service for this -- it's called SQS, and it allows instances to send messages to each other through special queues. There are SDKs for SQS in various languages, like Java and PHP. This should serve your signaling needs.
For actually sending the bulky data over, it's best to use S3 (and send the object key in the SQS message). You're right that you're introducing latency by adding the extra middle-man, but you'll find that S3 is very fast from EC2 instances (if you put them in the same availability zone, that is), and more importantly than performance, S3 is very reliable. If you try to manage the transfer yourself through SSH, you'll have to work out a lot of error checking and retry logic that S3 handles for you. You can use S3FS to easily write and read to/from S3 from EC2.
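A rough sketch of that flow with the AWS SDK for Java; the bucket, key, and queue names are placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import java.io.File;
import java.util.List;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

String bucket = "my-transfer-bucket";
String key = "dumps/latest-dump.zip";
String queueUrl = sqs.getQueueUrl("instance-b-jobs").getQueueUrl();

// Instance A: upload the dump to S3, then signal B via SQS with the object key.
s3.putObject(bucket, key, new File("/tmp/dump.zip"));
sqs.sendMessage(queueUrl, key);                  // message body = the S3 key

// Instance B: poll the queue, then fetch the object named in the message.
List<Message> messages = sqs.receiveMessage(queueUrl).getMessages();
for (Message m : messages) {
    s3.getObject(new GetObjectRequest(bucket, m.getBody()), new File("/tmp/incoming.zip"));
    sqs.deleteMessage(queueUrl, m.getReceiptHandle());
}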
Edited to address your updated question.
You may want to look at SNS... which is kind of like push SQS.
How fast do you need this communication to be? SSH is pretty darn fast. The only thing that I can think of that might be faster is raw sockets (from within whatever program is running the jobs).
You could use a distributed workflow managing service.
If Instance B has already completed a task, it can go on to pick another one. Usually, you would want Instance B to signal that it has "picked up" a task and is working on it. Then other instances should try to pick up other tasks from your list. You need a central service which knows which tasks have already been picked up, and which ones are still up for grabs.
When Instance B completes the task successfully, it should signal the central service that it is free for a new task, and pick one up if there is something left.
If it fails to complete the task, the central service should be able to detect it (via heartbeats and timeouts you defined) and put the task back on the list so that some other instance can pick it up.
Amazon SWF is the central service which will provide you with all of this.
For data required by each instance, you should put it in a central store like s3, and configure s3 paths in a way such that each task knows where to download data from, without having to sync up.
e.g. data for task 1 could be placed in something like s3://my-bucket/task1
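For illustration, a tiny sketch of that convention with the AWS SDK for Java (the task name here is just an example handed out by the workflow service):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3Object;

AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
// Each worker derives its input location from the task it was handed, so no
// extra coordination is needed to find its data.
String task = "task1";
S3Object input = s3.getObject("my-bucket", task);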

Resources