I have a Kafka Streams Transformer that works like a windowing operator: it accumulates state into a state store in transform() and then forwards it to an output topic during punctuate(). The state store's backing topic is partitioned by the same key as the input topic.
During punctuate(), I would like each StreamThread to iterate only its own partition of the state store, to minimize the amount of data to be read from the backing Kafka topic. But the only iterator I can get is through
org.apache.kafka.streams.state.ReadOnlyKeyValueStore<K,V>.all()
which iterates through the whole state store.
Is there any way to "assign partitions" of a state store and make punctuate() iterate only on the assigned partitions?
I guess ReadOnlyKeyValueStore<K,V>.all() already does what you want. Note that the overall state is sharded into multiple stores, with one shard/store per partition. all() does not iterate through "other shards": "all" means "everything local", i.e., everything from the shard of a single partition.
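For illustration, here is a minimal sketch of such a transformer (the class name, the store name "agg-store", and the 30-second punctuation interval are made up); the punctuation iterates store.all(), which only covers the shard(s) of the partitions assigned to this task:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class AccumulatingTransformer implements Transformer<String, String, KeyValue<String, String>> {

    private ProcessorContext context;
    private KeyValueStore<String, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        // "agg-store" is illustrative; it must match the store registered in the topology.
        this.store = (KeyValueStore<String, String>) context.getStateStore("agg-store");
        context.schedule(Duration.ofSeconds(30), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            // all() only sees the local shard of the store, i.e. the data of the
            // partitions assigned to this task, not the shards owned by other tasks/threads.
            try (KeyValueIterator<String, String> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    context.forward(entry.key, entry.value);
                }
            }
        });
    }

    @Override
    public KeyValue<String, String> transform(String key, String value) {
        store.put(key, value); // accumulate only; forwarding happens in the punctuation
        return null;
    }

    @Override
    public void close() { }
}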
Related
I have one question w.r.t. Kafka Streams state stores' getAll() and delete().
My topology is simple: it has one source topic, one processor node, and one sink topic. I need to maintain the state of the stream, so I'm using a KeyValue-based state store.
The processor node reads each record and saves its state into the KVStore (RocksDB) in process(), and then a punctuator scheduled at a certain interval performs getAll() on the state store, runs some complex business logic, and once done deletes the retrieved entries/keys from the state store.
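For context, the punctuator part looks roughly like this (the store name, the types, and the one-minute schedule are placeholders; kvStore is the KeyValueStore obtained in init()):

context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
    List<String> processedKeys = new ArrayList<>();
    try (KeyValueIterator<String, String> it = kvStore.all()) {
        while (it.hasNext()) {
            KeyValue<String, String> entry = it.next();
            // ... complex business logic on entry.value ...
            processedKeys.add(entry.key);
        }
    }
    // Remove the entries that were just processed.
    for (String key : processedKeys) {
        kvStore.delete(key);
    }
});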
Now, my question: if there are multiple sub-tasks (stream threads) running this logic of doing getAll from the state store and then deleting, is it possible that each of these different sub-tasks (threads) ends up getting all the keys and then each of them attempts the delete?
Or, in other words, does KeyValueIterator<K, V> all() return all the keys across all the partitions of the state store, or only the sharded partition dedicated to that sub-task (thread)?
Thanks,
A unique queue would only allow a value to be queued once; queuing it again would not do anything.
If it is not obvious what the uniqueness criterion should be, a key could be added to define it.
Is there such a data structure in Ruby?
I've got my first Process Group that drops the indexes on a table.
Then that routes to another Process Group that does the inserts into the table.
After successfully inserting the half million rows, I want to create the indexes on the table and analyze it. This is typical Data Warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but cannot reference counters in Expression Language. I've tried RouteOnAttribute but I'm getting nowhere. Now I'm digging into the Wait & Notify processors; maybe there's a solution there?
I have gotten Counters to count the FlowFile SQL insert statements, but cannot reference the Counter values via Expression Language. I.e., this always returns null: "${InsertCounter}", even though InsertCounter appears to be set properly via the UpdateCounter processor in my flow.
So maybe this configuration can be used?
In the Wait processor, set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the Notify and Wait processors to ${fragment.identifier}.
So far, nothing works.
You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL, SplitAvro? If so, the flow will look like:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (= 5,000) attributes.
split:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: Increase count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original: to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waits for the count for fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so it is executed only once. (Concrete property values are shown after this list.)
PutSQL: Execute a query to create indices and analyze tables
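To make the settings concrete, the key properties would look roughly like this (both processors also need to point at the same DistributedMapCacheClientService controller service, which is assumed to already exist):

Notify (on the success route after PutSQL):
    Release Signal Identifier: ${fragment.identifier}
    Distributed Cache Service: the shared DistributedMapCacheClientService

Wait (on the route carrying the original FlowFile):
    Release Signal Identifier: ${fragment.identifier}
    Target Signal Count: ${fragment.count}
    Distributed Cache Service: the same DistributedMapCacheClientService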
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze tables. Since a single FlowFile contains all the records, no Wait/Notify is required; the output FlowFile can be connected to the downstream flow.
I think my suggestion to this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out
Recently I ran some experiments with loadCache, localLoadCache, and querying data from the cache. However, I became more and more puzzled. Here are my questions; please help me if you can explain them.
What's the difference between loadCache and localLoadCache?
What's the logic behind how the cache stores data? For example, I start a node called 'A', whose cache stores some data (say 10 items) from the Person table in the DB. Then, in the code, I have it query data from the cache every 5 seconds.
Then I start a new node called 'B', whose cache stores 20 other items from the Person table, and I also have it query data from the cache every 5 seconds. However, why do queries from both 'A' and 'B' return 30 items (the sum of 10 + 20)?
If I have 'B' put a new item into the cache, why can 'A' also query the new data? And when I shut down 'B', 'A' queries out only 10 items again, the same as at first. Why?
Ignite is a distributed data storage. It partitions your data set and equally distributes it across available nodes. E.g., if you have 30 entries and 2 nodes, you will have approximately 15 entries on each node. The ownership is defined by Ignite automatically, you can't decide where to store a particular entry (well, you can, but this is non-trivial).
Having said that, when a table is loaded into the cache, it is treated as a single data set. When you get an entry from the cache, it will be transparently returned regardless of where it is stored.
As for the loading, the process is the following:
Each node independently fetches the whole table from the DB and iterates through rows.
For each row the CacheStore implementation creates key and value objects and passes them to the cache.
Cache decides whether this particular key-value pair belongs to the local node. If yes, it is saved. If not, it is discarded.
As a result, the table will be fully stored in the cluster in a distributed fashion, and each node will have its own subset of the data.
The localLoadCache method executes this process on the local node only (useful in some specific cases). loadCache is basically a shortcut that broadcasts a closure and calls localLoadCache on all nodes, so it triggers the distributed data loading.
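A minimal sketch of the difference (the cache name "personCache" and the Person placeholder class are made up, and a CacheStore implementation such as CacheJdbcPojoStore is assumed to be configured for the cache):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class LoadCacheExample {

    // Placeholder for the Person POJO that your CacheStore produces.
    static class Person { }

    public static void main(String[] args) {
        Ignite ignite = Ignition.start("ignite-config.xml");
        IgniteCache<Long, Person> cache = ignite.cache("personCache");

        // Broadcasts to every node in the cluster; each node runs its configured
        // CacheStore, keeps the entries it owns, and discards the rest.
        cache.loadCache(null);

        // Runs the CacheStore on this node only; other nodes are not involved.
        cache.localLoadCache(null);
    }
}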
I need to poll a table and create a tuple for each row, for streaming in Apache Storm.
Where could I find an example?
Based on your requirements, this doesn't have much to do with Storm; it is mostly a database-related question.
Since you gave no info about the database in use, the table structure, and so on, I'll just outline some rough steps (a sketch of a simple spout follows the steps):
Supposing the table has a last-updated timestamp or an auto-increment ID, use it as a marker to pull data. Take the ID for example:
1) Execute the SQL select * from mytable where id > ${last retrieved id} order by id limit 100 every 100 ms. ${last retrieved id} will be -1 initially.
2) Iterate over the result set and emit tuples.
3) Update ${last retrieved id} with the last record's id.
(Please note that if you use the last-updated timestamp instead, there is some extra subtlety, because different records can have the same last-updated timestamp.)
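A rough sketch of a spout implementing these steps (the JDBC URL, table name, and column names are placeholders, and error handling is kept minimal):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class TablePollingSpout extends BaseRichSpout {

    private SpoutOutputCollector collector;
    private Connection conn;
    private long lastId = -1; // the ${last retrieved id} marker

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            // Placeholder connection details; open the connection on the worker, not in the constructor.
            conn = DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "password");
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try (PreparedStatement ps = conn.prepareStatement(
                "select * from mytable where id > ? order by id limit 100")) {
            ps.setLong(1, lastId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    lastId = rs.getLong("id");                                    // step 3: advance the marker
                    collector.emit(new Values(lastId, rs.getString("payload"))); // step 2: emit a tuple per row
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        Utils.sleep(100); // poll every 100 ms
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "payload"));
    }
}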
hope this helps
We have a Storm MySql Spout that caters to your requirements. It tails the bin logs to generate the tuples.
https://github.com/flipkart-incubator/storm-mysql
You can use the table filters to listen to bin log events of only the table you are interested in. So whenever an Insert/Delete/Update is done on the table, it generates a tuple.
The spout also gives you "at least once/at most once" guarantees. Since it stores the bin log offsets in Zookeeper, in the event of a crash it can recover from where it last was. There is no need for any polling.
Disclaimer: Author of the aforementioned spout