How to parallelize a Flink job with Guava cache? - caching

I have written a Flink job which uses Guava cache. The cache object is created and used in a run() function called in the main() function.
It is something like :
main() {
run(some,params)
}
run() {
//create and use Guava cache object here
}
If I run this Flink job with some level of parallelism, will all of the parallel tasks use the same cache object? If not, how can I make them all use a single cache?
The cache is used inside a process() function for a stream. So it's like
incoming_stream.process(new ProcessFunction() { //Use Guava Cache here })
You can think of my use case as cache-based deduplication, so I want all of the parallel tasks to refer to a single cache object.

Using a Guava cache with Flink is usually an anti-pattern. Not that it can't be made to work, but there's probably a simpler and more scalable solution.
The standard approach to deduplicating in a thoroughly scalable, performant way with Flink is to partition the stream by some key (using keyBy), and then use keyed state to remember the keys that have been seen. Flink's keyed state is managed by Flink in a way that makes it fault tolerant and rescalable, while keeping it local. Flink's keyed state is a sharded key/value store, with each instance handling all of the events for some portion of the key space. You are guaranteed that for each key, all events for the same key will be processed by the same instance -- which is why this works well for deduplication.
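To make the partitioning argument concrete, here is a minimal plain-Java sketch. It does not use the Flink API: the class name KeyedDedupSketch and its methods are hypothetical, and the hash-modulo routing is only an assumption standing in for what keyBy does. Each simulated instance keeps its own local set of seen keys, and because a key always routes to the same instance, no shared cache is needed. In a real job this role would be played by keyBy plus keyed state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of key-partitioned deduplication; plain Java, not the Flink API. */
final class KeyedDedupSketch {
    // One local "seen" set per simulated parallel instance.
    private final List<Set<String>> seenPerInstance = new ArrayList<>();

    KeyedDedupSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            seenPerInstance.add(new HashSet<>());
        }
    }

    /** Routes a key to exactly one instance (assumed hash-modulo routing). */
    int instanceFor(String key) {
        return Math.floorMod(key.hashCode(), seenPerInstance.size());
    }

    /** Returns true the first time a key is seen, false for duplicates. */
    boolean isFirstOccurrence(String key) {
        // Only the owning instance's local set is consulted and updated.
        return seenPerInstance.get(instanceFor(key)).add(key);
    }

    public static void main(String[] args) {
        KeyedDedupSketch dedup = new KeyedDedupSketch(4);
        for (String event : List.of("a", "b", "a", "c", "b")) {
            if (dedup.isFirstOccurrence(event)) {
                System.out.println("emit " + event);
            }
        }
    }
}
```

The point of the sketch is that duplicates of a key can only ever arrive at the instance that already remembers that key, so per-instance state is enough.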
If you need instead that all of the parallel instances have a complete copy of some (possibly evolving) data set, that's what broadcast state is for.

Flink tasks run across multiple JVMs or machines, so the real issue is how to share objects between JVMs.
Normally, you can access objects in a remote JVM through an RPC call (over TCP) or a REST call (over HTTP).
Alternatively, you can serialize the objects and store them in a database such as Redis, then read them back and deserialize them.
In Flink, there is a more graceful way to achieve this: you can store the objects in state, and broadcast state may fit your case.
Broadcast state was introduced to support use cases where some data coming from one stream needs to be broadcast to all downstream tasks.
Hope this helps.

Related

Is this Redis Race Condition Scenario Possible?

I'm debugging an issue in an application and I'm running into a scenario where I'm out of ideas, but I suspect a race condition might be in play.
Essentially, I have two API routes - let's call them A and B. Route A generates some data and Route B is used to poll for that data.
Route A first creates an entry in the redis cache under a given key, then starts a background process to generate some data. The route immediately returns a polling ID to the caller, while the background data thread continues to run. When the background data is fully generated, we write it to the cache using the same cache key. Essentially, an overwrite.
Route B is a polling route. We simply query the cache using that same cache key - we expect one of 3 scenarios in this case:
The object is in the cache but contains no data - this indicates that the data is still being generated by the background thread and isn't ready yet.
The object is in the cache and contains data - this means that the process has finished and we can return the result.
The object is not in the cache - we assume that this means you are trying to poll for an ID that never existed in the first place.
For the most part, this works as intended. However, every now and then we see scenario 3 being hit, where an error is being thrown because the object wasn't in the cache. Because we add the placeholder object to the cache before the creation route ever returns, we should be able to safely assume this scenario is impossible. But that's clearly not the case.
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying? That is, is it possible that the call to add the cache entry has completed, yet the data is briefly not returned by queries? It seems to be the only thing that can explain the behavior we are seeing.
If that is a possibility, how can I avoid this scenario? Is there some way to force Redis to wait until the data is available for query before returning?
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying?
Yes, and it may depend on your Redis topology and on your network configuration. Only a standalone Redis server provides strong consistency, albeit with some considerations (see below).
Redis replication
While using replication in Redis, the writes which happen in a master need some time to propagate to its replica(s) and the whole process is asynchronous. Your client may happen to issue read-only commands to replicas, a common approach used to distribute the load among the available nodes of your topology. If that is the case, you may want to lower the chance of an inconsistent read by:
directing your read queries to the master node; and/or,
issuing a WAIT command right after the write operation and checking that all the replicas acknowledged it. Replication then appears synchronous from the client's standpoint, but this option should be used only if absolutely needed because of its performance cost.
There would still be the (tiny) possibility of an inconsistent read if, during a failover, the replication process promotes a replica which did not receive the write operation.
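The stale read described above can be reproduced with a toy model. This is a plain-Java sketch, not a Redis client: the class AsyncReplicationSketch and all of its methods are hypothetical, and waitForReplica only stands in for the effect of WAIT (the replica has caught up before the caller proceeds).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of asynchronous primary/replica replication; not a Redis client. */
final class AsyncReplicationSketch {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> replica = new HashMap<>();
    // Writes acknowledged by the primary but not yet applied on the replica.
    private final Queue<String[]> pending = new ArrayDeque<>();

    /** The write is acknowledged as soon as the primary has applied it. */
    void write(String key, String value) {
        primary.put(key, value);
        pending.add(new String[] {key, value});
    }

    String readFromPrimary(String key) {
        return primary.get(key);
    }

    /** May return null for a key whose write has not been replicated yet. */
    String readFromReplica(String key) {
        return replica.get(key);
    }

    /** Applies all pending writes to the replica and returns how many were applied. */
    int waitForReplica() {
        int applied = 0;
        String[] entry;
        while ((entry = pending.poll()) != null) {
            replica.put(entry[0], entry[1]);
            applied++;
        }
        return applied;
    }
}
```

In this model a poll that reads from the replica immediately after write sees a missing key, which matches the "not in the cache" scenario from the question; reading from the primary, or waiting for the replica first, does not.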
Standalone Redis server
With a standalone Redis server, there is no need to synchronize data with replicas and, on top of that, your read-only commands would be always handled by the same server which processed the write commands. This is the only strongly consistent option, provided you are also persisting your data accordingly: in fact, you may end up having a server restart between your write and read operations.
Persistence
Redis supports several different persistence options; in your scenario, you may want to configure your server so that it
logs every write operation to disk (AOF, i.e. appendonly yes) and
fsyncs after every write (appendfsync always).
Of course, every configuration setting is a trade off between performance and durability.

What is the behaviour of ProcessorContext.getStateStore(String name) & ReadOnlyKeyValueStore.get(String key) in Kafka Streams?

I have a Kafka Streams 1.0.0 application with two classes, as described in "How to evaluate consuming time in kafka stream application". In my application, I read the events, perform some conditional checks, and forward them to another topic on the same Kafka cluster. During evaluation, I fetch some of the expressions from Kafka with the help of a global table store. I observed that most of the time is spent getting the value from the store (sample code is below).
Is the data read from Kafka only once and then maintained in a local store?
or
Is it read from Kafka whenever we call the org.apache.kafka.streams.state.ReadOnlyKeyValueStore.get(String key) API? If yes, how can I maintain a local store instead of reading from Kafka every time?
Please help.
Ex:
private KeyValueStore<String, List<String>> policyStore = (KeyValueStore<String, List<String>>) this.context
.getStateStore(policyGlobalTableName);
List<String> policyIds = policyStore.get(event.getCustomerCode());
By default, stores use an application-local RocksDB instance to buffer data. Thus, if you query the store with get(), the call does not go over the network to the brokers; it only reads from the local RocksDB instance.
You can try to change RocksDB setting to improve the performance, but I have no guidelines atm which configs you might wanna change. Configuring RocksDB is a quite tricky thing. But you might want to search the Internet for further information about it.
You can pass in RocksDB configs via StreamsConfig (cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#rocksdb-config-setter)
As an alternative, you could also try to reconfigure Streams to use in-memory stores instead of RocksDB. Note that this will increase your rebalance time, as there is no locally buffered state if you use in-memory stores instead of RocksDB. (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-and-creating-a-state-store)

Amazon Web Services: Spark Streaming or Lambda

I am looking for some high level guidance on an architecture. I have a provider writing "transactions" to a Kinesis pipe (about 1MM/day). I need to pull those transactions off, one at a time, validating data, hitting other SOAP or Rest services for additional information, applying some business logic, and writing the results to S3.
One approach that has been proposed is use Spark job that runs forever, pulling data and processing it within the Spark environment. The benefits were enumerated as shareable cached data, availability of SQL, and in-house knowledge of Spark.
My thought was to have a series of Lambda functions that would process the data. As I understand it, I can have a Lambda watching the Kinesis pipe for new data. I want to run the pulled data through a bunch of small steps (lambdas), each one doing a single step in the process. This seems like an ideal use of Step Functions. With regards to caches, if any are needed, I thought that Redis on ElastiCache could be used.
Can this be done using a combination of Lambda and Step Functions (using lambdas)? If it can be done, is it the best approach? What other alternatives should I consider?
This can be achieved using a combination of Lambda and Step Functions. As you described, the lambda would monitor the stream and kick off a new execution of a state machine, passing the transaction data to it as an input. You can see more documentation around kinesis with lambda here: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html.
The state machine would then pass the data from one Lambda function to the next where the data will be processed and written to S3. You need to contact AWS for an increase on the default 2 per second StartExecution API limit to support 1MM/day.
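As a quick sanity check on why the limit matters, the sketch below converts the 1MM/day volume into an average per-second rate and compares it with the 2 per second figure quoted above. It assumes the transactions are spread evenly over the day, which real traffic rarely is, so peaks would be higher. The class and method names are hypothetical.

```java
/** Back-of-the-envelope rate check for the figures quoted above. */
final class ExecutionRateSketch {
    static final double SECONDS_PER_DAY = 24 * 60 * 60; // 86,400 seconds

    /** Average events per second for a daily volume, assuming an even spread. */
    static double averagePerSecond(long eventsPerDay) {
        return eventsPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        double rate = averagePerSecond(1_000_000L);
        System.out.printf("average rate: %.2f per second%n", rate);
        System.out.println("exceeds a 2 per second limit: " + (rate > 2.0));
    }
}
```

One million executions per day works out to roughly 11.6 per second on average, well above 2 per second, so starting one state machine execution per transaction needs either a raised limit or some batching.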
Hope this helps!

How to add storage-level caching between DynamoDB and Titan?

I am using the Titan/DynamoDB library to use AWS DynamoDB as a backend for my Titan DB graphs. My app is very read-heavy and I noticed Titan is mostly executing query requests against DynamoDB. I am using transaction- and instance-local caches and indexes to reduce my DynamoDB read units and the overall latency. I would like to introduce a cache layer that is consistent for all my EC2 instances: A read/write-through cache between DynamoDB and my application to store query results, vertices, and edges.
I see two solutions to this:
Implicit caching done directly by the Titan/DynamoDB library. Classes like the ParallelScanner could be changed to read from AWS ElastiCache first. The change would have to be applied to read & write operations to ensure consistency.
Explicit caching done by the application before even invoking the Titan/Gremlin API.
The first option seems to be the more fine-grained, cross-cutting, and generic.
Does something like this already exist? Maybe for other storage backends?
Is there a reason why this does not exist already? Graph DB applications seem to be very read-intensive, so cross-instance caching seems like a pretty significant feature to speed up queries.
First, ParallelScanner is not the only thing you would need to change. Most importantly, all the changes you need to make are in DynamoDBDelegate (that is the only class that makes low level DynamoDB API calls).
Regarding implicit caching, you could add a caching layer on top of DynamoDB. For example, you could implement a cache using API Gateway on top of DynamoDB, or you could use ElastiCache. Either way, you need to figure out a way to invalidate Query/Scan pages. Inserting or deleting items will cause page boundaries to change, so it requires some thought.
Explicit caching may be easier to do than implicit caching. The level of abstraction is higher, so based on your incoming writes it may be easier for you to decide at the application level whether a traversal that is cached needs to be invalidated. If you treat your graph application as another service, you could cache the results at the service level.
Something in between may also be possible (but requires some work). You could continue to use your vertex/database caches as provided by Titan, and use a low value for TTL that is consistent with how frequently you write columns. Or, you could take your caching approach a step further and do the following.
Enable DynamoDB Stream on edgestore.
Use a Lambda function to stream the edgestore updates to a Kinesis Stream.
Consume the Kinesis Stream with edgestore updates in the same JVM as the Gremlin Server on each of your Gremlin Server instances. You would need to instrument the database level cache in Titan to consume the Kinesis stream and invalidate the cached columns as appropriate, in each Titan instance.
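The two ideas above (a low TTL, and invalidation driven by a stream of updates) can be sketched together in plain Java. This is not Titan or AWS SDK code: InvalidatingCacheSketch, the loader function, and onBackendUpdate are all hypothetical names, and the loader merely stands in for a backend read. The clock is injected so that expiry can be exercised deterministically.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Toy read-through cache with a TTL and explicit invalidation; not Titan API. */
final class InvalidatingCacheSketch<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical backend read
    private final long ttlMillis;
    private final LongSupplier clock;    // injectable time source

    InvalidatingCacheSketch(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, reloading it when missing or expired. */
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || entry.expiresAtMillis <= now) {
            entry = new Entry<>(loader.apply(key), now + ttlMillis);
            entries.put(key, entry);
        }
        return entry.value;
    }

    /** Called for each change record received from the update stream. */
    void onBackendUpdate(K key) {
        entries.remove(key);
    }
}
```

With stream-driven invalidation in place, the TTL acts only as a safety net for missed or delayed change records, so it can be longer than it would need to be on its own.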

How does Infinispan know that it has to take the changes from a DeltaAware object?

We are using Infinispan, and in our system we have a big object to which we have to push small changes per transaction. I have implemented the DeltaAware interface for this object, as well as the Delta. The problem I am facing is that the changes are not propagated to other nodes; only the initial object state is propagated. Also, the delta and commit methods are not called on the big object which implements DeltaAware. Do I need to register this object somewhere other than simply putting it in the cache?
Thanks
It's probably better if you simply use an AtomicHashMap, which is a construction within Infinispan. This allows you to group a series of key/value pairs as a single value. Infinispan can detect changes in this AtomicHashMap because it implements the DeltaAware interface. AHM is a higher level construct than DeltaAware, and one that probably suits you better.
To give you an example where AtomicHashMaps are used, they're heavily used by JBoss AS7 HTTP session replication, where each session id is mapped to an AtomicHashMap. This means that we can detect when individual session data changes and only replicate that.
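The idea of replicating only what changed can be illustrated with a small plain-Java model. It is not the Infinispan API and says nothing about how AtomicHashMap is implemented internally: DeltaTrackingMapSketch and its methods are hypothetical and only show the general technique of recording changes since the last commit and shipping just those.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy map that records changes since the last commit; not Infinispan API. */
final class DeltaTrackingMapSketch {
    private final Map<String, String> state = new HashMap<>();
    private final Map<String, String> delta = new HashMap<>();

    void put(String key, String value) {
        state.put(key, value);
        delta.put(key, value); // remember only what changed
    }

    String get(String key) {
        return state.get(key);
    }

    /** Returns the changes made since the previous call and resets the tracker. */
    Map<String, String> commitAndTakeDelta() {
        Map<String, String> changes = new HashMap<>(delta);
        delta.clear();
        return changes;
    }

    /** What a remote node would do with a received delta. */
    void applyDelta(Map<String, String> changes) {
        state.putAll(changes);
    }
}
```

After the initial state has been shipped, a single updated entry produces a delta of size one, regardless of how large the whole map has become.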
Cheers,
Galder
