PyMongo script thread safety

I have a PyMongo script which checks for an element: if it is in the collection it returns its id, otherwise it inserts the element and returns the id of the newly inserted document. I want this to be thread safe, because several scripts may use it to check for the element.
How should I make this thread safe?
I saw the start_request method and thought it would work, but it doesn't: it inserts two documents referring to the same element.

As shx2 mentioned, you are not looking for thread safety, but for atomic database transactions.
MongoDB's findAndModify might be what you are looking for. It atomically updates the document, or inserts it if you specify upsert: true, and returns the resulting document when you also pass new: true.
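In PyMongo this is exposed as find_one_and_update (or find_and_modify in older driver versions). A minimal sketch, assuming PyMongo 3+ and that each element is identified by a hypothetical unique "name" field:

from pymongo import MongoClient, ReturnDocument

client = MongoClient()
collection = client.test_db.elements  # assumed database and collection names

def get_or_create_id(element):
    # Atomically return the existing document, or insert it when it is missing.
    doc = collection.find_one_and_update(
        {"name": element["name"]},             # the element's unique key (assumption)
        {"$setOnInsert": element},             # written only when an insert happens
        upsert=True,
        return_document=ReturnDocument.AFTER,  # the equivalent of new: true
    )
    return doc["_id"]

To be fully safe under concurrency, also create a unique index on the key field so that two racing upserts cannot each insert their own copy of the element.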

Related

Atomic reads in RethinkDB

I keep finding information about write atomicity, but not about reads. Are reads always atomic? For example, I pass a read-only query that populates an array with append() by querying multiple different documents. Basically, am I guaranteed to receive a "snapshot", with no writes that happened after the query started appearing in the result?
No, if you read from multiple documents in the same query it's possible for one of those reads to see a write and another one not to.

Is there a way to skip over existing _id's for insert_many in Pymongo 3.0?

I'm updating a DB that has several million documents with fewer than 10 _id collisions.
I'm currently using the PyMongo module to do batch inserts with insert_many by:
1. Querying the db to see if the _id exists
2. Adding the document to an array if the _id doesn't exist
3. Inserting to the database using insert_many, 1000 documents at a time
There are only about 10 collisions out of several million documents, and I'm currently querying the database for each _id. I think I could cut the overall insert time down by a day or two if I could cut out the query step.
Is there something similar to upsert that only inserts a document if it doesn't already exist?
The better way to handle this, and also to "insert/update" many documents efficiently, is to use the Bulk Operations API to submit everything in "batches", with efficient sending of it all and a "singular response" received in confirmation.
This can be handled in two ways.
Firstly, to ignore any "duplicate errors" on the primary key or other indexes, you can use an "UnOrdered" form of operation:
import pymongo.bulk

bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=False)
for doc in docs:
    bulk.insert(doc)   # queue each insert for the single batched round trip
response = bulk.execute()  # submit the whole batch in one go
The "UnOrdered" or false argument there means that the operations can both execute in any order and that the "whole" batch will be completed with any actual errors simply being "reported" in the response. So that is one way to basically "ignore" the duplicates and move along.
The alternate approach is much the same but using the "upsert" functionality along with $setOnInsert:
bulk = pymongo.bulk.BulkOperationBuilder(collection, ordered=True)
for doc in docs:
    bulk.find({ "_id": doc["_id"] }).upsert().update_one({
        "$setOnInsert": doc
    })
response = bulk.execute()
Whereby the "query" portion in .find() is used to query for the presence of a document using the "primary key" or alternately the "unique keys" of the document. Where no match is found an "upsert" occurs with a new doccument created. Since all the modification content is within $setOnInsert then the document fields are only modified here when an "upsert" occurs. Otherwise while the document is "matched" nothing is actually changed with respect to the data kept under this operator.
The "Ordered" in this case means that every statement is actually committed in the "same" order it was created in. Also any "errors" here will halt the update ( at the point where the error occurred ) so that no more operations will be committed. It's optional, but probably advised for normal "dupliate" behaviour where later statements "duplicate" the data of a previous one.
So for more efficient writes, the general idea is to use the "Bulk" API and build your actions accordingly. The choice here really comes down to whether the "order of insertion" from the source is important to you or not.
Of course the same "ordered"=False operation applies to insert_many which actually uses "Bulk" operations in the newer driver releases. But you will get more flexibilty from sticking with the general interface which can "mix" operations wit a simple API.
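For what it's worth, newer PyMongo releases expose the same upsert pattern through collection.bulk_write; a minimal sketch, assuming docs is a list of documents that each carry their own _id:

from pymongo import UpdateOne

requests = [
    UpdateOne({"_id": doc["_id"]}, {"$setOnInsert": doc}, upsert=True)
    for doc in docs
]
# ordered=False keeps processing past any errors and reports them at the end
result = collection.bulk_write(requests, ordered=False)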
While Blakes' answer is great, for most cases it's fine to use the ordered=False argument and catch BulkWriteError in case of duplicates.
from pymongo.errors import BulkWriteError

try:
    collection.insert_many(data, ordered=False)
except BulkWriteError:
    logger.info('Duplicates were found.')
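If you want to be sure that the only failures really were duplicates, the exception's details can be inspected; a small sketch along the same lines (11000 is MongoDB's duplicate-key error code):

try:
    collection.insert_many(data, ordered=False)
except BulkWriteError as exc:
    # re-raise if anything other than a duplicate key error occurred
    if any(err['code'] != 11000 for err in exc.details['writeErrors']):
        raise
    logger.info('Duplicates were found.')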

Get a collection and then changes to it without gaps or overlap

How do I reliably get the contents of a table, and then changes to it, without gaps or overlap? I'm trying to end up with a consistent view of the table over time.
I can first query the database, and then subscribe to a change feed, but there might be a gap where a modification happened between those queries.
Or I can first subscribe to the changes, and then query the table, but then a modification might happen in the change feed that's already processed in the query.
Example of this case:
A subscribe 'messages'
B add 'messages' 'message'
A <- changed 'messages' 'message'
A run get 'messages'
A <- messages
Here A received a "changed" message before it sent its messages query, and the result of the messages query already includes the changed message. Possibly A could simply ignore any changed messages received before it has the query result. Is it guaranteed that changes received after a query (on the same connection) were not already applied in that query, i.e. are they handled on the same thread?
What's the recommended way? I couldn't find any docs on this use case.
I know you said you came up with an answer but I've been doing this quite a bit and here is what I've been doing:
r.db('test').table('my_table').between(tsOne, tsTwo, {index: 'timestamp'});
So in my jobs, I run an indexed between query which captures data between the last run time and that exact moment. You can take a lock on the config table which tracks the last_run_time for your jobs, so you can even scale with multiple processors! And because we are using between, the next job that is waiting for the lock to complete will only grab data after the first processor ran. Hope that helps!
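As a rough illustration of that pattern with the RethinkDB Python driver (the job_config table, its last_run_time field, the timestamp index, and process() are all assumptions):

import time
import rethinkdb as r

conn = r.connect('localhost', 28015)

# Read the last run time for this job from a hypothetical bookkeeping table.
last_run = r.table('job_config').get('my_job').run(conn)['last_run_time']
now = time.time()

# Grab only the documents written since the previous run.
for doc in r.table('my_table').between(last_run, now, index='timestamp').run(conn):
    process(doc)  # placeholder handler

# Record the new high-water mark for the next run.
r.table('job_config').get('my_job').update({'last_run_time': now}).run(conn)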
Michael Lucy of RethinkDB Wrote:
For .get.changes and .order_by.limit.changes you should be fine because we already send the initial value of the query for those. For other queries, the only way to do that right now is to subscribe to changes on the query, execute the query, and then read from the changefeed and discard any changes from before the read (how to do this depends on what read you're executing and what legal changes to it are, but the easiest way to hack it would probably be to add a timestamp field to your objects that you increment whenever you do an update).
In 2.1 we're planning to add an optional argument return_initial that will do what I just described automatically and without any need to change your document schema.
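A rough sketch of that workaround with the Python driver, assuming each document carries an updated_at value that is bumped on every write (as suggested above); handle() and the table name are placeholders:

import rethinkdb as r

feed_conn = r.connect('localhost', 28015)
query_conn = r.connect('localhost', 28015)

# 1. Subscribe to the changefeed first so nothing can slip through the gap.
feed = r.table('messages').changes().run(feed_conn)

# 2. Execute the read and note the newest timestamp it contains.
snapshot = list(r.table('messages').run(query_conn))
latest = max((doc.get('updated_at', 0) for doc in snapshot), default=0)

# 3. Read the feed, discarding changes the snapshot already reflects.
for change in feed:
    new_val = change['new_val']
    if new_val is not None and new_val.get('updated_at', 0) <= latest:
        continue  # already included in the snapshot
    handle(change)  # placeholder handler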

How to create multiple indexes at once in RethinkDB?

In RethinkDB, is it possible to create multiple indexes at once?
Something like (which doesn't work):
r.db('test').table('user').indexCreate('name').indexCreate('email').run(conn, callback)
Index creation is a fairly heavyweight operation because it requires scanning the existing documents to bring the index up to date. It's theoretically possible to allow creation of 2 indexes at the same time such that they both perform this scan in parallel and halve the work. We don't support that right now though.
However, I suspect that's not what you're asking about. If you're just looking for a way to not have to wait for one index to complete before coming back and starting the next one, the best way would be to do:
table.index_create("foo").run(noreply=True)
# returns immediately
table.index_create("bar").run(noreply=True)
# returns immediately
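If you later need to block until those indexes have finished building, the driver's index_wait command can do that, in the same style as above:

table.index_wait("foo", "bar").run()
# returns once both indexes are ready to use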
You can also always do any number of writes in a single query by putting them in an array, like so:
r.expr([table.index_create("foo"), table.index_create("bar")]).run()
I can't actually think of why this would be useful for index creation, because index writes don't block until the index is ready, but hey, who knows. It's definitely useful in table creation.

One data store. Multiple processes. Will this SQL prevent race conditions?

I'm trying to create a Ruby script that spawns several concurrent child processes, each of which needs to access the same data store (a queue of some type) and do something with the data. The problem is that each row of data should be processed only once, and a child process has no way of knowing whether another child process might be operating on the same data at the same instant.
I haven't picked a data store yet, but I'm leaning toward PostgreSQL simply because it's what I'm used to. I've seen the following SQL fragment suggested as a way to avoid race conditions, because the UPDATE clause supposedly locks the table row before the SELECT takes place:
UPDATE jobs
SET status = 'processed'
WHERE id = (
SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
) RETURNING id, data_to_process;
But will this really work? It doesn't seem intuitive that Postgres (or any other database) could lock the table row before performing the SELECT, since the SELECT has to be executed to determine which row needs to be locked for updating. In other words, I'm concerned that this SQL fragment won't really prevent two separate processes from selecting and operating on the same table row.
Am I being paranoid? And are there better options than traditional RDBMSs to handle concurrency situations like this?
As you said, use a queue. The standard solution for this in PostgreSQL is PgQ. It has all these concurrency problems worked out for you.
Do you really want many concurrent child processes that must operate serially on a single data store? I suggest that you create one writer process who has sole access to the database (whatever you use) and accepts requests from the other processes to do the database operations you want. Then do the appropriate queue management in that thread rather than making your database do it, and you are assured that only one process accesses the database at any time.
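A bare-bones sketch of that single-writer layout, using Python's multiprocessing rather than Ruby (an assumption; the payloads and the database call are placeholders):

from multiprocessing import Process, Queue

def writer(requests):
    # Sole owner of the database connection; everything else goes through the queue.
    while True:
        job = requests.get()
        if job is None:
            break  # shutdown sentinel
        # ... perform the database operation for `job` here ...

def worker(requests, worker_id):
    for i in range(10):
        requests.put({'worker': worker_id, 'row': i})  # placeholder payload

if __name__ == '__main__':
    queue = Queue()
    db_writer = Process(target=writer, args=(queue,))
    db_writer.start()
    workers = [Process(target=worker, args=(queue, n)) for n in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    queue.put(None)  # tell the writer to stop
    db_writer.join()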
The situation you are describing is called a "non-repeatable read". There are two ways to solve this.
The preferred way is to set the transaction isolation level to at least REPEATABLE READ. This means that concurrent updates of the nature you described will fail: if two processes update the same rows in overlapping transactions, one of them will be canceled, its changes ignored, and it will return an error. That transaction will have to be retried. This is achieved by calling
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
at the start of the transaction. I can't seem to find documentation that explains an idiomatic way of doing this for Ruby; you may have to emit that SQL explicitly.
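For illustration, a minimal retry loop around that statement, sketched with Python's psycopg2 rather than Ruby (the connection string is an assumption; the jobs table comes from the question):

import psycopg2
from psycopg2 import errors

conn = psycopg2.connect("dbname=jobs_db")  # assumed connection string

def claim_job():
    while True:
        try:
            with conn, conn.cursor() as cur:
                cur.execute("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ")
                cur.execute("""
                    UPDATE jobs
                    SET status = 'processed'
                    WHERE id = (
                        SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
                    ) RETURNING id, data_to_process
                """)
                return cur.fetchone()  # None when no pending job exists
        except errors.SerializationFailure:
            continue  # a concurrent worker won the race; retry the transaction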
The other option is to manage the locking of tables explicitly, which can cause a transaction to block (and possibly deadlock) until the table is free. Transactions won't fail in the same way as they do above, but contention will be much higher, and so I won't describe the details.
That's pretty close to the approach I took when I wrote pg_message_queue, which is a simple queue implementation for PostgreSQL. Unlike PgQ, it requires no components outside of PostgreSQL to use.
It will work just fine. MVCC will come to the rescue.
