Does RethinkDB support request pipelining? - rethinkdb

Does RethinkDB support request pipelining, i.e. grouping multiple requests on one connection? If yes, is it done automatically behind the scenes at a lower level?
Thanks!

RethinkDB currently does not process more than one query per connection at a time.
In the special case of insert operations, batched inserts can be used to obtain a similar effect.
Edit: This answer is outdated. Since RethinkDB 2.0, multiple queries can be executed on the same connection at the same time, as long as the driver supports issuing multiple queries without waiting for the previous one to finish.
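A minimal sketch of such a batched insert, using the official RethinkDB Java driver; the host, port, database and table names are placeholders:

```java
import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;

public class BatchedInsertSketch {
    private static final RethinkDB r = RethinkDB.r;

    public static void main(String[] args) {
        // Host, port, database and table names are illustrative only.
        Connection conn = r.connection().hostname("localhost").port(28015).connect();
        try {
            // A single insert query carrying an array of documents: one round trip
            // instead of one query per document.
            Object result = r.db("test").table("users").insert(r.array(
                    r.hashMap("id", 1).with("name", "alice"),
                    r.hashMap("id", 2).with("name", "bob")
            )).run(conn);
            System.out.println(result);
        } finally {
            conn.close();
        }
    }
}
```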

Related

JdbcBatchItemWriterBuilder vs org.springframework.jdbc.core.jdbcTemplate.batchUpdate

I understand that jdbcTemplate.batchUpdate is used to send several records to the database in one round trip.
Let's say I have 1000 records to update: instead of 1000 round trips from the application to the database, the application sends all 1000 records in one request.
Coming to JdbcBatchItemWriterBuilder, it is used as part of a combination of tasks in a job.
My question is: if there are 1000 records to be processed (INSERT statements) via JdbcBatchItemWriterBuilder, are all INSERTs executed in one go, or one after another?
If one after another, doesn't hitting the database 1000 times through JdbcBatchItemWriterBuilder cause performance issues? How is that handled?
I would like to understand whether Spring Batch performs better than running 1000 INSERT statements using jdbcTemplate.update.
The JdbcBatchItemWriter uses java.sql.PreparedStatement#addBatch and java.sql.Statement#executeBatch internally (See https://github.com/spring-projects/spring-batch/blob/c4010fbffa6b71cbcfe79d523023251ce73666a4/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/database/JdbcBatchItemWriter.java#L189-L195), so there will be a single batch insert for all items of the chunk.
Moreover, this will be executed in a single transaction as described in the Chunk-oriented Processing section of the reference documentation.
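A minimal sketch of what that boils down to in plain JDBC; the connection URL, credentials, table and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class JdbcBatchSketch {
    public static void main(String[] args) throws Exception {
        // JDBC URL, credentials, table and column names are illustrative only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/appdb", "app", "secret")) {
            conn.setAutoCommit(false); // the whole chunk runs in one transaction
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO person (id, name) VALUES (?, ?)")) {
                for (int i = 0; i < 1000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "name-" + i);
                    ps.addBatch(); // buffered locally, nothing sent yet
                }
                int[] counts = ps.executeBatch(); // one batched execution for all 1000 rows
                System.out.println("rows inserted: " + counts.length);
            }
            conn.commit();
        }
    }
}
```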

How to check properties before update in Elasticsearch?

I've already read the official documentation and found no way to do this.
My data flows into ES from Kafka and can sometimes arrive out of order. In the past, each Kafka message was parsed and directly inserted or updated the ES doc with a specific ID. To avoid older data overriding newer data, I now have to check whether the doc with that specific ID already exists and whether some of its properties meet certain conditions; only then do I perform the UPDATE (or INSERT).
What I'm doing now is 'search before update'.
Before updating a doc, I search ES for the specific ID (included in the Kafka message), then check whether this doc meets the conditions (for example, whether its update_time is older). Finally I update the doc, and I set refresh to true so the index is updated instantly.
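A minimal sketch of that search-before-update flow, assuming the Elasticsearch high-level REST client (7.x); the index name, document ID and field names are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.get.GetRequest;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.support.WriteRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.util.Map;

public class SearchBeforeUpdateSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            String docId = "order-42";                // ID carried in the Kafka message
            long incomingUpdateTime = 1700000000000L; // update_time from the Kafka message

            // 1. Search/get the existing doc by ID and compare update_time.
            GetResponse existing = client.get(new GetRequest("orders", docId), RequestOptions.DEFAULT);
            boolean stale = existing.isExists()
                    && ((Number) existing.getSourceAsMap().get("update_time")).longValue() >= incomingUpdateTime;

            // 2. Only write if the incoming message is newer. Note: the get and the
            //    index below are NOT atomic, which is exactly the concern raised here.
            if (!stale) {
                IndexRequest index = new IndexRequest("orders")
                        .id(docId)
                        .source(Map.of("status", "shipped", "update_time", incomingUpdateTime))
                        .setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE); // refresh=true
                client.index(index, RequestOptions.DEFAULT);
            }
        }
    }
}
```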
What I'm worried about:
It seems this flow needs to be transactional.
If there is only one thread executing synchronously, is it possible that, when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
If I have several threads consuming Kafka messages, how can I check before updating? Can I use a script to solve this problem?
If there is only one thread executing synchronously, is it possible that, when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
That is a possibility, since indexes are refreshed once every second (by default). Reducing this interval is neither recommended nor guaranteed to give you the desired result, since Elasticsearch is NOT designed for this.
If I have several threads consuming Kafka messages, how can I check before updating? Can I use a script to solve this problem?
You can use a script if the number of fields being updated is very limited. Personally, I have found scripts to be best suited to single-field updates, and even then only for corner cases; they should not be used as a general practice. Anything more than that and you run into the same risk as with stored procedures in the RDBMS world: it makes data management volatile overall and the system harder to maintain and extend in the long run.
Your use case is best suited for optimistic locking support available from Elasticsearch out of the box. Take a look at Elasticsearch Versioning Support for full details.
You can very well use the inbuilt doc version if concurrency is the only problem you need to solve. If, however, you need more than concurrency (out-of-order message delivery and the corresponding ES updates), then you should use an application/domain-specific field, as the inbuilt version won't work as-is.
You can very well use any app-specific (numeric) field as a version field and use it for optimistic locking during document updates. If you use this approach, pay special attention to all insert, update and delete operations for that index. Quoting as-is from the versioning support docs: when using external versioning, make sure you always add the current version (and version_type) to any index, update or delete calls. If you forget, Elasticsearch will use its internal system to process that request, which will cause the version to be incremented erroneously.
I recommend evaluating the inbuilt version first and using it if it fulfills your needs; it will make the overall design much simpler. Consider the app-specific version as the second option if the inbuilt version does not meet your requirements. A sketch of the external-versioning approach follows below.
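A minimal sketch of optimistic locking with external versioning, assuming the high-level REST client (7.x); the index name and the use of an event timestamp as the version field are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.ElasticsearchStatusException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.VersionType;

import java.util.Map;

public class ExternalVersionUpsertSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            String docId = "order-42";
            long eventTimestamp = 1700000000000L; // app-specific, monotonically increasing version

            IndexRequest request = new IndexRequest("orders")
                    .id(docId)
                    .source(Map.of("status", "shipped", "update_time", eventTimestamp))
                    .version(eventTimestamp)
                    .versionType(VersionType.EXTERNAL); // ES accepts the write only if this version is higher
            try {
                client.index(request, RequestOptions.DEFAULT);
            } catch (ElasticsearchStatusException e) {
                // Version conflict (HTTP 409): a newer document is already indexed,
                // so this out-of-order message can safely be dropped.
            }
        }
    }
}
```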
If there is only one thread executing synchronously, is it possible that, when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?
Ad 1: It is possible to save data in Elasticsearch and receive a stale result shortly afterwards (before the index has refreshed).
If I have several threads consuming Kafka messages, how can I check before updating? Can I use a script to solve this problem?
Ad 2: If you process Kafka messages in several threads, it is best to use business data (e.g. some business IDs) as partition keys in Kafka to ensure data is processed in order (see the sketch below). Remember to let Kafka spread messages across many consumer threads; don't consume with a single consumer and then fan out to multiple threads yourself.
It seems best to ensure data is processed in order and then drop the check in Elasticsearch, since it is not guaranteed to give valid results.
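A minimal sketch of keying messages by a business ID, assuming the Kafka Java client; the topic name, broker address and key are placeholders:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Business ID used as the key: all updates for the same entity go to the
            // same partition and are therefore delivered in order.
            String orderId = "order-42";
            producer.send(new ProducerRecord<>("order-updates", orderId, "{\"status\":\"shipped\"}"));
        }
        // On the consuming side, run one KafkaConsumer per thread in the same consumer
        // group; Kafka assigns partitions to consumers, so per-key ordering is preserved
        // without fanning out from a single consumer.
    }
}
```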

How to know when data has been inserted in ClickHouse

I understand that ClickHouse is eventually consistent, so the fact that an insert call has returned doesn't mean the data will appear in a select query.
1. Does that apply to stand-alone ClickHouse (no distribution, no replication)?
2. I understand the concept of eventual consistency for data replication, but does it apply with distribution but no replication?
3. Using a distributed + replicated ClickHouse, what is a recommended way to know that some insert(s) can be safely looked up?
Basically I didn't find much information on this topic, so maybe I am not asking the best questions. Feel free to enlighten me.
1. No, but a single-node setup shouldn't be considered reliable either.
2. By default, yes: you insert into the node the client is connected to (probably via some load balancer), and the Distributed table asynchronously forwards each piece of data to the node where it belongs. The insert_distributed_sync=1 setting makes the client wait for that forwarding synchronously.
3. On insert, use the *MergeTree shard tables directly (not the Distributed table) with the insert_quorum=2 setting (if there are 3 replicas), and retry indefinitely with exactly the same batch if there are errors (you can use different replicas on retry, since there is deduplication based on the batch hash). Then, on reads, use the select_sequential_consistency=1 setting.
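A minimal sketch of that pattern over JDBC; the URL, table names and replica host are placeholders, and it assumes a ClickHouse version that accepts a SETTINGS clause directly on INSERT and SELECT statements (otherwise apply the settings per session or in the user profile):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class QuorumInsertSketch {
    public static void main(String[] args) throws Exception {
        // Connect to one shard replica directly (a ReplicatedMergeTree table),
        // not to the Distributed table. URL and names are illustrative only.
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://replica-1:8123/appdb");
             Statement st = conn.createStatement()) {

            // Write: wait until at least 2 of the 3 replicas have the data. On failure,
            // retry the exact same batch (deduplicated by batch hash), possibly
            // against another replica.
            st.execute("INSERT INTO events_local (id, ts) SETTINGS insert_quorum = 2 " +
                       "VALUES (1, now())");

            // Read: only see data that has been acknowledged by the quorum.
            try (ResultSet rs = st.executeQuery(
                    "SELECT count() FROM events_local SETTINGS select_sequential_consistency = 1")) {
                while (rs.next()) {
                    System.out.println("rows visible under quorum: " + rs.getLong(1));
                }
            }
        }
    }
}
```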

RethinkDB changefeeds performance: architectural advice?

I am building an application with RethinkDB and I'm about to switch to using changefeeds. But I'm facing an architectural choice and I'd like to get some advice.
My application currently loads all user data from several tables on user login (sending all of it to the frontend), and then processes requests from the frontend, altering the database, and preparing and sending changed items to users. I'd like to switch that over to changefeeds. The way I see it, I have two choices:
1. Set up a single changefeed for each table. Filter by users logged in to a particular server, and distribute the changes to users manually. These changefeeds are never closed, i.e. they have the lifetime of my servers.
2. When a user logs in, set up an individual changefeed for that user, for that user's data only (using a getAll with a secondary index). Maintain as many changefeeds as there are currently logged-in users. Close them when users log out.
Solution #1 has a big disadvantage: RethinkDB changefeeds do not have a concept of time (or version number), like for example Kafka does. This means that there is no way to (a) load initial data and then (b) get exactly the changes that happened since that initial load. There is a time window where changes can be lost: between the initial data load (a) and the moment the changefeed is set up (b). I find this worrying.
Solution #2 seems better, because includeInitial can be used to get initial data, and then get subsequent changes without interruption. I'd have to deal with initial load performance (it's faster to load a single dump of all data than process thousands of updates), but it seems more "correct". But what about scaling? I'm planning to handle up to 1k users per server — is RethinkDB prepared to handle thousands of changefeeds, each being essentially a getAll query? The actual activity in these changefeeds will be very low, it's just the number that I'm worried about.
The RethinkDB manual is a bit terse about changefeed scaling, saying that:
Changefeeds perform well as they scale, although they create extra intracluster messages in proportion to the number of servers with open feed connections on each write.
Solution #2 creates many more feeds, but the number of servers with open feed connections is actually the same for both solutions. And "changefeeds perform well as they scale" isn't quite enough to go on :-)
I'd also be interested to know what are recommended practices for handling server restarts/upgrades and disconnections. The way I see it, if anything happens to RethinkDB, clients have to perform a full data load (using includeInitial) after reconnecting, because there is no way to know what changes have been lost during downtime. Is that what people do?
RethinkDB should be able to handle thousands of changefeeds just fine if it's on reasonable hardware. One thing some people do to lower network load in that case is to put a proxy node on the same machine as their app server and connect to that, since the proxy node knows enough to deduplicate the changefeed messages coming in over the network, and it takes a lot of CPU/memory load off their main cluster.
Currently the only way to recover from a crash is to restart the changefeed using includeInitial. There are plans to add write timestamps in the future, but handling deletes is complicated in that case.
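A minimal sketch of option #2 (one per-user feed with includeInitial), written against the 2.3-era RethinkDB Java driver API; the table, secondary index name and user ID are placeholders (newer drivers return a Result with an equivalent iterator):

```java
import com.rethinkdb.RethinkDB;
import com.rethinkdb.net.Connection;
import com.rethinkdb.net.Cursor;

public class UserChangefeedSketch {
    private static final RethinkDB r = RethinkDB.r;

    public static void main(String[] args) {
        Connection conn = r.connection().hostname("localhost").port(28015).connect();
        try {
            // getAll over a secondary index restricts the feed to one user's rows;
            // include_initial emits the current documents before the live changes.
            Cursor<Object> feed = r.db("test").table("items")
                    .getAll("user-123").optArg("index", "user_id")
                    .changes().optArg("include_initial", true)
                    .run(conn);
            while (feed.hasNext()) {
                System.out.println(feed.next()); // initial docs, then {old_val, new_val} changes
            }
        } finally {
            conn.close();
        }
    }
}
```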

bulk.find.update() vs update.collection(multi=true) in mongo

I am a newbie to MongoDB and would like to use it in a project with millions of records. I would like to know which I should prefer for updates, for performance: bulk.find.update() or collection.update with multi=true.
As far as I know, the biggest gains Bulk provides are these:
Bulk operations send only one request to MongoDB for all the operations in the bulk. The other methods send one request per document, or one request covering a single operation type (insert, update, updateOne, upsert with update, or remove).
Bulk can collect many different operations from different lines in your code before sending them.
Bulk operations can work asynchronously. The others cannot.
But today some operations are bulk-based under the hood, for instance insertMany.
If the gains above are taken into account, update() should show the same performance as a bulk.find.update() operation.
That's because update() sends only one query object to MongoDB, and multi: true is just an argument specifying that all matched documents have to be updated. This means it makes only one request over the network, just like a bulk operation.
So both send only one request to MongoDB; MongoDB evaluates the query clause to find the documents to be updated and then updates them!
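A minimal sketch contrasting the two, using the MongoDB Java driver; the connection string, database, collection and field names are placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateManyModel;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.Arrays;
import java.util.List;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class BulkVsMultiUpdateSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll = client.getDatabase("app").getCollection("orders");

            // Equivalent of update with multi: true -- one request, one filter,
            // all matching documents updated.
            coll.updateMany(eq("status", "pending"), set("status", "processed"));

            // Bulk write -- still one request on the wire, but it can carry many
            // different filters and operation types in a single round trip.
            List<WriteModel<Document>> ops = Arrays.asList(
                    new UpdateManyModel<>(eq("status", "pending"), set("status", "processed")),
                    new UpdateOneModel<>(eq("_id", 42), set("priority", "high"))
            );
            coll.bulkWrite(ops);
        }
    }
}
```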
I tried to find an answer to this question on the official MongoDB site, but I could not.
So, an explanation from #AsyaKamsky would be great!
