I keep finding information about write atomicity, but not about reads. Are reads always atomic? For example, I run a read-only query that populates an array with append() by reading from multiple different documents. Basically, am I guaranteed to receive a "snapshot", with no writes that happened after the query started appearing in the result?
No, if you read from multiple documents in the same query it's possible for one of those reads to see a write and another one not to.
I have a Spring Boot project where I would like to execute a specific query in a database from x different threads while preventing different threads from reading the same database entries. So far I have been able to run the query in multiple threads, but have had no luck finding a way to "split" the read load. My code so far is as follows:
@Async
@Transactional
public CompletableFuture<List<Book>> scanDatabase() {
    final List<Book> books = booksRepository.findAllBooks();
    return CompletableFuture.completedFuture(books);
}
Any ideas on how should I approach this?
There are plenty of ways to do that.
If you have a numeric field in the data that is somewhat random you can add a condition to your where clause like ... and some_value % :N = :i with :N being a parameter for the number of threads and :i being the index of the specific thread (0 based).
If you don't have a numeric field you can create one by using a hash function and apply it on some other field in order to turn it into something numeric. See your database specific documentation for available hash functions.
You could use an analytic function like ROW_NUMBER() to create a numeric value to be used in the condition.
You could query the number of rows in a first query and then query the right Slice using Spring Data's pagination feature.
And many more variants.
They all have in common that the complete set of rows must not change during the processing, otherwise you may get rows queried multiple times or not at all.
If you can't guarantee that, you need to mark the records to be processed by a thread before actually selecting them, for example by marking them in an extra field or by using a FOR UPDATE clause in your query.
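For illustration, here is a minimal sketch of the first variant above (the modulo condition) using Spring Data JPA and a native query; the Book entity, the book table, the id column, and the Long id type are assumptions borrowed from the question, not a definitive implementation:

import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface BooksRepository extends JpaRepository<Book, Long> {

    // Each thread i of n sees only the rows where MOD(id, n) = i.
    @Query(value = "SELECT * FROM book WHERE MOD(id, :n) = :i", nativeQuery = true)
    List<Book> findSlice(@Param("n") int threadCount, @Param("i") int threadIndex);
}

Each worker thread would then call booksRepository.findSlice(N, i) with its own index i, so the threads read disjoint slices of the table.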
And finally there is the question if this is really what you need.
Querying the data in multiple threads probably doesn't make the querying part faster since it makes the query more complex and doesn't speed up those parts that typically limit the throughput: network between application and database and I/O in the database.
So it might be a better approach to select the data with one query and iterate through it, passing it on to a pool of threads for processing.
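As a rough sketch of that alternative, assuming a processBook() method that does the per-row work (the method name and the pool size are placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public void scanAndProcess() throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(8);      // pool size is an assumption
    try {
        List<Book> books = booksRepository.findAllBooks();       // one query fetches everything
        List<Future<?>> futures = new ArrayList<>();
        for (Book book : books) {
            futures.add(pool.submit(() -> processBook(book)));   // processBook() is a placeholder
        }
        for (Future<?> f : futures) {
            f.get();                                             // wait for all workers to finish
        }
    } finally {
        pool.shutdown();
    }
}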
You also might want to take a look at Spring Batch, which might be helpful with processing large amounts of data.
I've got my first Process Group that drops indexes on a table.
Then that routes to another Process Group that does inserts into the table.
After successfully inserting the half million rows, I want to create the indexes on the table and analyze it. This is typical Data Warehouse methodology. Can anyone please give advice on how to do this?
I've tried setting counters, but cannot reference counters in Expression Language. I've tried RouteOnAttribute but getting nowhere. Now I'm digging into Wait & Notify Processors - maybe there's a solution there??
I have gotten Counters to count the flow file SQL insert statements, but cannot reference the Counter values via Expression Language. I.e., this always returns null: "${InsertCounter}", where InsertCounter appears to be set properly via my UpdateCounter processor in my flow.
So maybe this code can be used?
In the wait processor set the Target Signal Count to ${fragment.count}.
Set the Release Signal Identifier in both the Notify and Wait processors to ${fragment.identifier}.
Nothing works.
You can use Wait/Notify processors to do that.
I assume you're using ExecuteSQL and SplitAvro? If so, the flow will look like this:
Split approach
At the 2nd ProcessGroup
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
SplitAvro: creates 5,000 FlowFiles; this processor adds the fragment.identifier and fragment.count (=5,000) attributes.
split:
XXXX: Do some conversion per record
PutSQL: Insert records individually
Notify: Increase count for the fragment.identifier (Release Signal Identifier) by 1. Executed 5,000 times.
original - to the next ProcessGroup
At the 3rd ProcessGroup
Wait: waiting for fragment.identifier (Release Signal Identifier) to reach fragment.count (Target Signal Count). This route processes the original FlowFile, so executed only once.
PutSQL: Execute a query to create indices and analyze tables
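For reference, a hedged sketch of the relevant Wait/Notify properties for the split approach above; the Distributed Cache Service entry assumes you already have a DistributedMapCacheClientService configured, and the branch names simply restate the flow described here:

Notify (on the split branch, after PutSQL):
    Release Signal Identifier : ${fragment.identifier}
    Signal Counter Delta      : 1
    Distributed Cache Service : your DistributedMapCacheClientService

Wait (in the 3rd ProcessGroup, on the original branch):
    Release Signal Identifier : ${fragment.identifier}
    Target Signal Count       : ${fragment.count}
    Distributed Cache Service : your DistributedMapCacheClientService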
Alternatively, if possible, using Record aware processors would make the flow simpler and more efficient.
Record approach
ExecuteSQL: e.g. 1 output FlowFile containing 5,000 records
Perform record level conversion: With UpdateRecord or LookupRecord, you can do data processing without splitting records into multiple FlowFiles.
PutSQL: Execute a query to create indices and analyze tables. Since the single FlowFile contains all records, no Wait/Notify is required; the output FlowFile can be connected to the downstream flow.
I think my suggestion to this question will fit your scenario as well:
How to execute a processor only when another processor is not executing?
Check it out
I'm updating a DB that has several million documents with fewer than 10 _id collisions.
I'm currently using the PyMongo module to do batch inserts using insert_many by:
Querying the db to see if the _id exists
Then adding the document to an array if _id doesn't exist
Inserting into the database using insert_many, 1,000 documents at a time.
There are only about 10 collisions out of several million documents and I'm currently querying the database for each _id. I think that I could cut down on overall insert time by a day or two if I could cut out the query process.
Is there something similar to upsert perhaps that only inserts a document if it doesn't exist?
The better way to handle this, and also to "insert/update" many documents in an efficient way, is to use the Bulk Operations API to submit everything in "batches", with efficient sending of all operations and receiving a "singular response" in confirmation.
This can be handled in two ways.
Firstly, to ignore any "duplicate errors" on the primary key or other indexes, you can use an "UnOrdered" form of operation:
bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=False)
for doc in docs:
    bulk.insert(doc)
response = bulk.execute()
The "UnOrdered" or false argument there means that the operations can both execute in any order and that the "whole" batch will be completed with any actual errors simply being "reported" in the response. So that is one way to basically "ignore" the duplicates and move along.
The alternate approach is much the same but using the "upsert" functionality along with $setOnInsert:
bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=True)
for doc in docs:
    bulk.find({ "_id": doc["_id"] }).upsert().update_one({
        "$setOnInsert": doc
    })
response = bulk.execute()
Whereby the "query" portion in .find() is used to query for the presence of a document using the "primary key" or alternately the "unique keys" of the document. Where no match is found an "upsert" occurs with a new doccument created. Since all the modification content is within $setOnInsert then the document fields are only modified here when an "upsert" occurs. Otherwise while the document is "matched" nothing is actually changed with respect to the data kept under this operator.
The "Ordered" in this case means that every statement is actually committed in the "same" order it was created in. Also any "errors" here will halt the update ( at the point where the error occurred ) so that no more operations will be committed. It's optional, but probably advised for normal "dupliate" behaviour where later statements "duplicate" the data of a previous one.
So for more efficient writes, the general idea is to use the "Bulk" API and build your actions accordingly. The choice here really comes down to whether the "order of insertion" from the source is important to you or not.
Of course the same ordered=False operation applies to insert_many, which actually uses "Bulk" operations in the newer driver releases. But you will get more flexibility from sticking with the general interface, which can "mix" operations with a simple API.
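For what it's worth, in newer PyMongo releases (3.x and later) the same $setOnInsert upsert pattern can be written with bulk_write() and UpdateOne; this is only a sketch, assuming collection and docs are the same objects as above:

from pymongo import UpdateOne

# Build one upsert per document; $setOnInsert only writes fields when the
# _id is not already present, so existing documents are left untouched.
operations = [
    UpdateOne({"_id": doc["_id"]}, {"$setOnInsert": doc}, upsert=True)
    for doc in docs
]
result = collection.bulk_write(operations, ordered=False)
print(result.upserted_count, "documents inserted")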
While Blakes' answer is great, for most cases it's fine to use the ordered=False argument and catch a BulkWriteError in case of duplicates:
from pymongo.errors import BulkWriteError

try:
    collection.insert_many(data, ordered=False)
except BulkWriteError:
    logger.info('Duplicates were found.')
I have a case in which we need to insert records into an HBase table, where 90% of the records coming from the source are repeated. In this case,
is it advisable to first query for the record from HBase and, if it's not present, then call put,
or
just simply call put?
Which of the above will be better in terms of performance?
Both HTable methods checkAndPut() and exists() require reading the table data, which could hurt you badly if you receive lots of write requests and the data is not in the memstore.
Plain writes in HBase are usually not that expensive, so if you have a good rowKey design and you're already avoiding hot regions, I'd just stick to overwriting the data.
If you don't want to re-insert existing records you can use the checkAndPut method of HTable. With this, the put will be applied only if the condition you specify is met, so you could check for the existence of a column and put only if it does not exist.
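A minimal sketch of that checkAndPut approach with the standard HBase client API; the table name, column family and qualifier are assumptions for illustration, and passing null as the expected value means "apply the Put only if this cell does not exist yet":

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutIfAbsent {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("records"))) {   // table name is an assumption
            byte[] row = Bytes.toBytes("row-1");
            byte[] cf = Bytes.toBytes("d");                                 // column family is an assumption
            byte[] qual = Bytes.toBytes("payload");

            Put put = new Put(row);
            put.addColumn(cf, qual, Bytes.toBytes("some value"));

            // With a null expected value, the Put is applied only when the
            // checked cell is absent; returns false if the row already exists.
            boolean applied = table.checkAndPut(row, cf, qual, null, put);
            System.out.println(applied ? "inserted" : "already present, skipped");
        }
    }
}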
I kind of agree with both answers. It is true that before using the CAS (Check And Set) mechanism, one has to revise his design first, and see if it is possible to refactor it and use plain writes instead. However, in some cases, this is not trivial.
Another thing I would make sure of before using checkAndPut() is that this operation requires isolation when updating values; HBase only guarantees it when rewriting, not when updating.
And lastly, check if it is possible to use Append instead of checkAndPut.
In RethinkDB, is it possible to create multiple indexes at once?
Something like (which doesn't work):
r.db('test').table('user').indexCreate('name').indexCreate('email').run(conn, callback)
Index creation is a fairly heavyweight operation because it requires scanning the existing documents to bring the index up to date. It's theoretically possible to allow creation of 2 indexes at the same time such that they both perform this process in parallel and halve the work. We don't support that right now, though.
However I suspect that's not what you're asking about. If you're just looking for a way to not have to wait for the index to complete before coming back and starting the next one, the best way would be to do:
table.index_create("foo").run(noreply=True)
# returns immediately
table.index_create("bar").run(noreply=True)
# returns immediately
You also can always do any number of writes in a single query by putting them in an array like so:
r.expr([table.index_create("foo"), table.index_create("bar")]).run()
I can't actually think of why this would be useful for index creation, because index writes don't block until the index is ready, but hey, who knows. It's definitely useful for table creation.