I have a RocksDB write instance which writes data to "/rocksDB/data". I have read instances which also point to "/rocksDB/data", but they are unable to read the data being written by the write instance. If I restart the read instances, they are then able to read the data. Is there a way for the read instances to read the latest keys written by the write instance without having to restart?
I would also be interested to understand the reason behind this behavior. Is there any flag which can be added to let the read instances fetch data without having to restart?
Reads and writes should be handled in the same process; you can use multiple threads to do reads and writes, which guarantees read-your-writes. There is no such guarantee when you read from a different process. The RocksDB secondary instance feature is designed to open the same DB in read-only mode from a different process. It won't refresh automatically when the DB is changed, but there is an API, db->TryCatchUpWithPrimary(), to catch up with the primary DB.
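For illustration, a minimal sketch of a secondary instance using the RocksDB Java API (the C++ API is analogous); the secondary path and the key are placeholders:

    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;

    public class SecondaryReader {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            try (Options options = new Options();
                 // The second path is a private scratch directory for the
                 // secondary instance's own metadata; it must not be the
                 // primary's data directory.
                 RocksDB db = RocksDB.openAsSecondary(
                         options, "/rocksDB/data", "/rocksDB/secondary")) {
                // Replay the primary's latest WAL/MANIFEST changes on demand,
                // instead of restarting the reader process.
                db.tryCatchUpWithPrimary();
                byte[] value = db.get("someKey".getBytes()); // illustrative key
                System.out.println(value == null ? "not found" : new String(value));
            }
        }
    }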
I'm debugging an issue in an application and I'm running into a scenario where I'm out of ideas, but I suspect a race condition might be in play.
Essentially, I have two API routes - let's call them A and B. Route A generates some data and Route B is used to poll for that data.
Route A first creates an entry in the Redis cache under a given key, then starts a background process to generate some data. The route immediately returns a polling ID to the caller while the background thread continues to run. When the data is fully generated, we write it to the cache using the same cache key; essentially, an overwrite.
Route B is a polling route. We simply query the cache using that same cache key - we expect one of 3 scenarios in this case:
The object is in the cache but contains no data - this indicates that the data is still being generated by the background thread and isn't ready yet.
The object is in the cache and contains data - this means that the process has finished and we can return the result.
The object is not in the cache - we assume that this means you are trying to poll for an ID that never existed in the first place.
For the most part, this works as intended. However, every now and then we see scenario 3 being hit, where an error is being thrown because the object wasn't in the cache. Because we add the placeholder object to the cache before the creation route ever returns, we should be able to safely assume this scenario is impossible. But that's clearly not the case.
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying? That is, is it possible that even though the call to add the cache entry has completed, the data would briefly not be returned by queries? It seems to be the only thing that can explain the behavior we are seeing.
If that is a possibility, how can I avoid this scenario? Is there some way to force Redis to wait until the data is available for query before returning?
Is it possible that there is some delay between when a Redis write operation returns and when the data is actually available for querying?
Yes, and it may depend on your Redis topology and on your network configuration. Only a standalone Redis server provides strong consistency, albeit with some considerations - see below.
Redis replication
While using replication in Redis, the writes which happen on a master take some time to propagate to its replica(s), and the whole process is asynchronous. Your client may happen to issue read-only commands to replicas, a common approach used to distribute the load among the available nodes of your topology. If that is the case, you may want to lower the chance of an inconsistent read by:
directing your read queries to the master node; and/or,
issuing a WAIT command right after the write operation and ensuring all the replicas acknowledged it: this makes replication effectively synchronous from the client's standpoint, but it should be used only if absolutely needed because of its performance cost (see the sketch below).
There would still be the (tiny) possibility of an inconsistent read if, during a failover, the replication process promotes a replica which did not receive the write operation.
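For illustration, a minimal Jedis sketch of that WAIT approach, assuming a single replica and hypothetical host and key names:

    import redis.clients.jedis.Jedis;

    public class WaitAfterWrite {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("redis-master.example.com", 6379)) {
                jedis.set("poll:1234", "");      // the placeholder cache entry
                // Block until at least 1 replica acknowledges the write, for
                // at most 100 ms; returns how many replicas actually acked.
                long acked = jedis.waitReplicas(1, 100);
                if (acked < 1) {
                    // The write may not have propagated yet; retry or fail fast.
                }
            }
        }
    }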
Standalone Redis server
With a standalone Redis server, there is no need to synchronize data with replicas and, on top of that, your read-only commands are always handled by the same server which processed the write commands. This is the only strongly consistent option, provided you are also persisting your data accordingly: otherwise, a server restart between your write and your read could lose the entry.
Persistence
Redis supports several different persistence options; in your scenario, you may want to configure your server so that it:
appends every write operation to a log on disk (AOF), and
fsyncs on every write (appendfsync always).
Of course, every configuration setting is a trade off between performance and durability.
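As an illustration, those two settings can also be applied at runtime via CONFIG SET (or set permanently in redis.conf); a small Jedis sketch with a hypothetical host:

    import redis.clients.jedis.Jedis;

    public class EnableAof {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("redis-master.example.com", 6379)) {
                jedis.configSet("appendonly", "yes");      // enable the AOF log
                jedis.configSet("appendfsync", "always");  // fsync on every write
            }
        }
    }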
A few years ago I read an ODL recommendation not to use the READ operation but instead to use a Data Change Listener or one of its variations. Is that recommendation still valid?
Looking at the ODL code, I got the impression that each transaction commit is applied to the “In Memory Data Store” immediately during the commit, simultaneously with sending a notification to the listener. Is that correct?
Why, in that case, is reading not as efficient as using the notification?
Where did you read this recommendation? It depends on your use case. Using a data tree change listener (DTCL) with your own cache is going to have faster access than issuing a read operation, especially in a clustered environment if the shard leader is remote. However maintaining your own cache via a DTCL is eventually consistent, meaning your cache may not have up-to-date data. This has to be considered for the use case. If you need strong consistency, then you must use read operations.
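For illustration, a rough sketch of the cache-via-DTCL pattern against the MD-SAL binding APIs (method names shift slightly between ODL releases, so treat this as an outline); Node stands in for whatever YANG-generated model object you actually listen on:

    import java.util.Collection;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.opendaylight.mdsal.binding.api.DataBroker;
    import org.opendaylight.mdsal.binding.api.DataTreeChangeListener;
    import org.opendaylight.mdsal.binding.api.DataTreeIdentifier;
    import org.opendaylight.mdsal.binding.api.DataTreeModification;
    import org.opendaylight.mdsal.common.api.LogicalDatastoreType;
    import org.opendaylight.yangtools.yang.binding.InstanceIdentifier;

    public class NodeCache implements DataTreeChangeListener<Node> {
        private final Map<InstanceIdentifier<Node>, Node> cache = new ConcurrentHashMap<>();

        public NodeCache(DataBroker broker, InstanceIdentifier<Node> path) {
            broker.registerDataTreeChangeListener(
                    DataTreeIdentifier.create(LogicalDatastoreType.OPERATIONAL, path), this);
        }

        @Override
        public void onDataTreeChanged(Collection<DataTreeModification<Node>> changes) {
            for (DataTreeModification<Node> change : changes) {
                InstanceIdentifier<Node> id = change.getRootPath().getRootIdentifier();
                Node after = change.getRootNode().getDataAfter();
                if (after == null) {
                    cache.remove(id);      // the node was deleted
                } else {
                    cache.put(id, after);  // created or updated
                }
            }
        }

        // Fast local read - but eventually consistent: it may lag the datastore.
        public Node get(InstanceIdentifier<Node> id) {
            return cache.get(id);
        }
    }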
I have a batch processing project that I want to cluster across 5 machines.
Suppose my input source is a database with 1000 records.
I want to split these records equally, i.e. 200 records per instance of the batch job.
How can we distribute the workload?
Given below is a workflow you may want to follow.
Assumptions:
You have the necessary domain objects for the DB table.
You have a batch flow configured with a reader/writer/tasklet mechanism.
You have a messaging system (message queues are a great way to make distributed applications talk to each other).
The input object is a message on the queue containing the set of input records, split to the required size.
The result object is a message on the queue containing the processed records, or the result value if it is scalar.
The chunkSize is configured in a property file; here it is 200.
Design:
In the application:
Configure a queueReader to read from a queue.
Configure a queueWriter to write to a queue.
If using the task/tasklet mechanism, configure different queues to carry the input and result objects.
Configure a DB reader which reads from the DB.
Logic in the DBReader
Read records from the DB one by one and maintain a running count. Whenever count % chunkSize == 0, write the accumulated records to an inputMessage object and put that object on the queue (a sketch of this follows below).
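A minimal sketch of that DBReader logic, assuming plain JDBC plus a JMS broker (ActiveMQ here); the table, queue names, and connection details are all illustrative, and chunkSize would normally come from the property file:

    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class DbToQueueSplitter {
        private static final int CHUNK_SIZE = 200; // from the property file

        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            javax.jms.Connection jms = factory.createConnection();
            try (java.sql.Connection db = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/batch", "user", "pass")) {
                jms.start();
                Session session = jms.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer =
                        session.createProducer(session.createQueue("inputQueue"));

                ArrayList<String> chunk = new ArrayList<>();
                try (Statement st = db.createStatement();
                     ResultSet rs = st.executeQuery("SELECT payload FROM records")) {
                    while (rs.next()) {
                        chunk.add(rs.getString("payload"));
                        if (chunk.size() == CHUNK_SIZE) {
                            // count % chunkSize == 0: ship this chunk as one message
                            producer.send(session.createObjectMessage(new ArrayList<>(chunk)));
                            chunk.clear();
                        }
                    }
                }
                if (!chunk.isEmpty()) { // flush the final partial chunk
                    producer.send(session.createObjectMessage(new ArrayList<>(chunk)));
                }
            } finally {
                jms.close();
            }
        }
    }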
Logic in the queueReader
Read the messages one by one.
For each message, do the necessary processing.
Create a resultObject.
Logic in the queueWriter
Read the resultObject (batch frameworks usually provide a way to ensure that writers can read the output of readers).
If any additional processing or downstream interaction is needed, add it here.
Write the result object to the outputQueue (see the consumer sketch below).
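And a matching sketch for the queueReader/queueWriter side, with the same illustrative names; each of the 5 instances runs this loop, so whichever instance is free receives the next chunk (competing consumers), which balances the load:

    import java.util.ArrayList;
    import java.util.List;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageConsumer;
    import javax.jms.MessageProducer;
    import javax.jms.ObjectMessage;
    import javax.jms.Session;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ChunkWorker {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
            javax.jms.Connection jms = factory.createConnection();
            try {
                jms.start();
                Session session = jms.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageConsumer consumer = session.createConsumer(session.createQueue("inputQueue"));
                MessageProducer producer = session.createProducer(session.createQueue("outputQueue"));

                while (true) {
                    ObjectMessage message = (ObjectMessage) consumer.receive(); // one chunk
                    @SuppressWarnings("unchecked")
                    List<String> records = (List<String>) message.getObject();

                    List<String> results = new ArrayList<>();
                    for (String record : records) {
                        results.add(record.toUpperCase()); // placeholder processing
                    }
                    producer.send(session.createObjectMessage(new ArrayList<>(results)));
                }
            } finally {
                jms.close();
            }
        }
    }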
Deployment
Package the application once and deploy multiple instances. For better performance, keep the chunkSize small to enable fast processing. The queues are managed by the messaging system (the available systems on the market provide ways to monitor the queues), where you will be able to see the message flow.
I started using Realm as the storage layer for my app. This is the scenario I am trying to solve:
Scenario: I get a whole bunch of data from the server. I convert each piece of data into an RLMObject. I want to just "save" to persistent storage at the end. In between, I want the RLMObjects I create to be reflected when I do a query.
I don't see a solution for this in Realm. It looks like the only way is to write each object into the Realm DB as it is created. The documentation also says that writes are expensive. Is there any way around this?
To reduce the overhead, I guess I could maintain a list of created objects and write all of them in one transaction. That still seems like a lot of work. Is that how it is intended to be used?
You can create the objects as standalone, without adding them to the Realm, and then add them all in a single transaction (which is very efficient) at the end.
Check out the documentation about creating objects here: https://realm.io/docs/objc/latest/#creating-objects
There is also an example of adding objects in bulk here, where they get added in chunks so that other threads can observe the changes as they happen: https://realm.io/docs/objc/latest/#using-a-realm-across-threads
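To keep these sketches in one language, here is the same pattern in Realm Java; the Objective-C equivalent (building unmanaged RLMObjects, then addObjects: inside a single write transaction) is analogous, and the Item model and server payload are placeholders:

    import io.realm.Realm;
    import io.realm.RealmObject;
    import java.util.ArrayList;
    import java.util.List;

    // Placeholder model class.
    class Item extends RealmObject {
        private String name;
        void setName(String name) { this.name = name; }
    }

    public class BulkSave {
        // Assumes Realm.init(...) has already run for the app.
        static void save(List<String> serverNames) {
            List<Item> items = new ArrayList<>();
            for (String name : serverNames) {
                Item item = new Item();  // unmanaged: not yet persisted
                item.setName(name);
                items.add(item);
            }
            try (Realm realm = Realm.getDefaultInstance()) {
                // One transaction persists everything; queries will see the
                // objects once the commit completes.
                realm.executeTransaction(r -> r.insert(items));
            }
        }
    }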
I have two EC2 instances. I want that when one finishes a job, it signals the other one to do other stuff.
So, how do I make them communicate? I don't want to use cURL because it seems expensive. I think AWS should have a simple solution for this, but I still can't find relevant help in the documentation. :(
Also, how can I send data between two instances, without going through SSH, in a fast way? I know it can be done over SSH, but that seems slow. Once again, is there any tool that EC2 provides to do that?
Actually, I need two methods:
1) Instance A tells Instance B to grab the data from Instance A.
This is answered by Adrian: I can use SQS. I will try that.
2) Once Instance B gets the signal, the data (on EBS) in Instance A needs to be transferred to Instance B. The amount of data can be big even if I zip it - around 50 MB. And I need Instance B to get the data fast, so that it has enough time to process it before the next interval comes in.
So, I am thinking of one of these methods:
a) Instance A dumps the data from the DB and uploads it to S3, then signals Instance B. Instance B gets the data from S3.
b) Instance A dumps the data from the DB, then signals Instance B. Instance B establishes an SSH (or other) connection to Instance A and grabs the data.
The data may need to be stored permanently, but that is not a concern at this moment. It is mainly for Instance B to process.
This is a simple scenario. I'm also thinking about what the proper approach would be if I scaled it out to multiple instances. :)
Thanks.
Amazon has a special service for this -- it's called SQS, and it allows instances to send messages to each other through special queues. There are SDKs for SQS in various languages, like Java and PHP. This should serve your signaling needs.
For actually sending the bulky data over, it's best to use S3 (and send the object key in the SQS message). You're right that you're introducing latency by adding the extra middle-man, but you'll find that S3 is very fast from EC2 instances (if you keep them in the same region, that is), and, more importantly than performance, S3 is very reliable. If you try to manage the transfer yourself through SSH, you'll have to work out a lot of error checking and retry logic that S3 handles for you. You can use S3FS to easily write and read to/from S3 from EC2.
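A minimal sketch of that pattern with the AWS SDK for Java v2; the bucket, key, queue URL, and file path are all placeholders:

    import java.nio.file.Paths;
    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;
    import software.amazon.awssdk.services.sqs.SqsClient;
    import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

    public class SignalWithSqs {
        public static void main(String[] args) {
            String bucket = "my-data-bucket";  // hypothetical bucket
            String key = "dumps/latest.zip";   // hypothetical object key
            String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs";

            try (S3Client s3 = S3Client.create(); SqsClient sqs = SqsClient.create()) {
                // 1. Instance A uploads the ~50 MB dump to S3.
                s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                        RequestBody.fromFile(Paths.get("/tmp/dump.zip")));
                // 2. Instance A signals Instance B by sending the object key;
                //    B polls the queue, downloads the object, and processes it.
                sqs.sendMessage(SendMessageRequest.builder()
                        .queueUrl(queueUrl).messageBody(key).build());
            }
        }
    }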
Edited to address your updated question.
You may want to look at SNS... which is kind of like a push-based version of SQS.
How fast do you need this communication to be? SSH is pretty darn fast. The only thing that I can think of that might be faster is raw sockets (from within whatever program is running the jobs).
You could use a distributed workflow managing service.
If Instance B has already completed a task, it can go on to pick up another one. Usually, you would want Instance B to signal that it has "picked up" a task and is working on it. Then other instances should try to pick up other tasks on your list. You need a central service which knows which tasks have already been picked up, and which ones are still up for grabs.
When Instance B completes the task successfully, it should signal the central service that it is free for a new task, and pick one up if there is something left.
If it fails to complete the task, the central service should be able to detect it (via heartbeats and timeouts you defined) and put the task back on the list so that some other instance can pick it up.
Amazon SWF is the central service which will provide you with all of this.
For data required by each instance, you should put it in a central store like s3, and configure s3 paths in a way such that each task knows where to download data from, without having to sync up.
e.g. data for task 1 could be placed in something like s3://my-bucket/task1