bulk.find().update() vs collection.update(multi: true) in MongoDB - performance

I am a newbie to MongoDB and am about to use it in a project with millions of records. Which should I prefer for updates, performance-wise: bulk.find().update() or collection.update() with multi: true?

As far as I know, the biggest gains Bulk provides are these:
Bulk operations send only one request to MongoDB for all of the operations in the bulk. The alternatives send one request per document, or one request for a single operation type (insert, update, updateOne, upsert, or remove).
A bulk can collect many different operation types, built up at different points in your code, and submit them together.
Bulk operations can run asynchronously; the alternatives cannot.
That said, some newer operations are bulk-based under the hood, insertMany for instance.
Taking the gains above into account, update() should show the same performance as a bulk.find().update() operation.
update() sends only one query object to MongoDB, and multi: true is just an argument specifying that all matching documents have to be updated. So it, too, puts only one request on the network, just like a Bulk operation.
Either way, a single request reaches MongoDB; the server evaluates the query clause to find the documents to update, then updates them.
I tried to find an answer to this on the official MongoDB site, but could not.
So an explanation from @AsyaKamsky would be great!
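For reference, here is what the two forms look like in the mongo shell; collection, field, and filter values are made up for illustration:

// One request; the server updates every matching document
db.items.update(
  { status: "pending" },
  { $set: { status: "done" } },
  { multi: true }
);

// Bulk API: heterogeneous operations collected in code, sent as one batch
var bulk = db.items.initializeUnorderedBulkOp();
bulk.find({ status: "pending" }).update({ $set: { status: "done" } });
bulk.find({ retries: { $gt: 5 } }).remove();
bulk.insert({ status: "pending", retries: 0 });
bulk.execute();

For a single multi-document update like the first one, both forms put exactly one request on the wire, which is the point made above.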

Related

Update nested field for millions of documents

I use a bulk update with a script to update a nested field, but it is very slow:
POST index/type/_bulk
{"update":{"_id":"1"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
{"update":{"_id":"2"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"3","field2":"4"}}}}
... [a lot more, split into several batches]
Do you know of another way that could be faster?
It seems possible to store the script in order to not repeat it for each update, but I couldn't find a way to keep "dynamic" params.
As often with performance optimization questions, there is no single answer since there are many possible causes of poor performance.
In your case you are making bulk update requests. When an update is performed, the document is actually being re-indexed:
... to update a document is to retrieve it, change it, and then reindex the whole document.
Hence it makes sense to take a look at indexing performance tuning tips. The first few things I would consider in your case are selecting the right bulk size, using several threads for bulk requests, and increasing or disabling the index refresh interval, as sketched below.
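For example, the refresh interval can be disabled for the duration of a heavy bulk load and restored afterwards (index name as in the question):

PUT index/_settings
{"index":{"refresh_interval":"-1"}}

... run the bulk updates ...

PUT index/_settings
{"index":{"refresh_interval":"1s"}}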
You might also consider using a ready-made client that supports parallel bulk requests, as the Python Elasticsearch client does.
It would be ideal to monitor Elasticsearch performance metrics to understand where the bottleneck is and whether your performance tweaks bring actual gains. Here is an overview blog post about Elasticsearch performance metrics.
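On the stored-script idea from the question: dynamic params are not a blocker, because params are supplied at call time rather than baked into the stored script. A sketch, assuming Elasticsearch 5.6+ stored-script syntax and the field names from the question:

POST _scripts/add-nested
{"script":{"lang":"painless","source":"ctx._source.nestedfield.add(params.nestedfield)"}}

POST index/type/_bulk
{"update":{"_id":"1"}}
{"script":{"id":"add-nested","params":{"nestedfield":{"field1":"1","field2":"2"}}}}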

Optimistic locking over multiple documents

I need to update several documents at once, as in an RDBMS transaction. The best way to do this for a single document in a key-value store like Couchbase seems to be optimistic locking. That would work for me. However, I need to update multiple documents at once.
I need all of the documents to be updated, or none. Is this possible in Couchbase or some similar highly scalable database?
(by the way, I'm using Go)
There are three approaches to this:
Take another look at your key/document design and see whether it's possible to combine your multiple docs into one. Then you can do a single transactional update in Couchbase.
Simulate a transaction: the effect can be simulated by writing a suitable document and view definition that produce the effect while still requiring only a single document update.
Simulate multi-phase transactions: use a transaction record to record each stage of the update process.
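For the single-document optimistic locking mentioned in the question, the mechanism in Couchbase is the CAS value returned with every read. The asker is on Go, but the pattern is identical across SDKs; a minimal sketch with the Node.js SDK 3.x, with bucket and key names made up:

const couchbase = require('couchbase');

async function updateWithCas() {
  const cluster = await couchbase.connect('couchbase://localhost', {
    username: 'user',
    password: 'pass',
  });
  const coll = cluster.bucket('orders').defaultCollection();

  // Read the document together with its CAS value
  const res = await coll.get('order::1001');
  const doc = res.content;
  doc.status = 'shipped';

  // replace() throws if the document changed since our get(),
  // i.e. if the stored CAS no longer matches the one we pass in
  await coll.replace('order::1001', doc, { cas: res.cas });
}

If the replace fails, re-read and retry. Combining the documents (the first option above) lets this single-document check cover the whole update.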

Pattern to load data to Elasticsearch from SQL server

Here is what we came up with, using a three-valued status column:
0 = Not indexed
1 = Updated
2 = Indexed
There will be 2 jobs...
Job 1 will select the top X records where status = 0 and pop them into a queue like RabbitMQ.
Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
For updates, since we have control of our data, the SQL stored proc that updates a particular record will set its status to 2. Job 2 will select the top X records where status = 2 and pop them onto RabbitMQ. Then a consumer will bulk insert those records into ES and update the status of the DB records to 1.
Of course we may need an intermediate "queued" status so that neither job picks up the same record again, and the same job should not run if it hasn't completed. The chances of a queued record being updated are slim to none, since updates only happen at the end of the day, usually for the next day. A sketch of Job 1 follows.
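A rough sketch of Job 1 in Node.js; the mssql and amqplib packages, and all table and queue names, are assumptions for illustration:

const sql = require('mssql');
const amqp = require('amqplib');

async function enqueueUnindexed() {
  const pool = await sql.connect('mssql://user:pass@localhost/mydb');
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('es-index');

  // Top X records that have never been indexed (status = 0)
  const { recordset } = await pool.request()
    .query('SELECT TOP 500 * FROM records WHERE status = 0');

  // The consumer on the other side bulk-inserts into ES
  // and sets the status of the DB records to 1
  for (const row of recordset) {
    ch.sendToQueue('es-index', Buffer.from(JSON.stringify(row)));
  }
}

Job 2 is the same query with WHERE status = 2.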
So, I know there are rivers (but they are being deprecated and probably not as flexible as ETL).
I would like to bulk insert records from my SQL Server into Elasticsearch.
Write a scheduled batch job of some sort, either ETL or any other tool, it doesn't matter.
select * from table where id > lastIdInsertedToElasticSearch: this will load the latest records into Elasticsearch at a scheduled interval.
But what if a record is updated in SQL Server? What would be a good pattern to track updated records in SQL Server and then push them to ES? I know ES keeps document versions when putting the same id, but I can't seem to visualize a pattern.
IMHO, batch inserts are good for building or rebuilding the index. So for the first run, you can use batch jobs that run SQL queries and perform bulk updates. Rivers, as you correctly pointed out, don't provide a lot of flexibility in terms of transformation.
If the entries in your SQL data store are created by you (i.e. some codebase in your control), it would be better for that same codebase to update the documents in Elasticsearch as well, maybe not directly but by notifying some other service or via queues, so as not to add latency to request handling (if that's the kind of setup you have).
We have a pretty similar use case for Elasticsearch. We provide search inside our app across different categories of data. Some of this data is created by the users of our app through the app itself, so we handle this easily: our app writes that data to our SQL data store and pushes the same data onto RabbitMQ for indexing/updating in Elasticsearch. On the other side of RabbitMQ, we have a consumer, written in Python, that basically replaces the entire document in Elasticsearch. The corresponding rows in our SQL data store and documents in Elasticsearch share the same ID, which enables us to update the document.
The other case is a few types of searchable data that come from a 3rd-party service which exposes the data over its HTTP API. The data creation is in our control, but we don't have an automated mechanism for updating the entries in Elasticsearch, so we run a cron job that takes care of it. We have tuned the cron schedule because we also have a limited API query quota; but our data is not updated that much per day, so this kind of system works for us.
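Going back to the RabbitMQ consumer described above, here is a sketch in Node.js rather than Python; the amqplib and @elastic/elasticsearch (7.x) packages and all names are assumptions for illustration:

const amqp = require('amqplib');
const { Client } = require('@elastic/elasticsearch');

async function run() {
  const es = new Client({ node: 'http://localhost:9200' });
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();
  await ch.assertQueue('search-updates');

  ch.consume('search-updates', async (msg) => {
    const row = JSON.parse(msg.content.toString());
    // The ES document id mirrors the SQL primary key, so indexing
    // with the same id replaces the entire existing document
    await es.index({ index: 'app-search', id: String(row.id), body: row });
    ch.ack(msg);
  });
}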
Disclaimer: I co-developed this solution.
I needed something like the jdbc-river that could do more complex "roll-ups" of data. After careful consideration of what it would take to modify the jdbc-river to suit my needs, I ended up writing the river-net.
Here are a few of the features:
It gets fairly decent performance (comparable to the jdbc-river; we get upwards of 6k rows/sec).
It can join many tables to create complex nested arrays of documents without creating duplicate child documents
It follows a lot of the same conventions as the jdbc-river.
It also supports reading from files.
It's written in C#
It uses Quartz.Net and supports cron expressions for scheduling.
This project is open source, and we already have a second project (also to be open sourced) that does generic job scheduling with RabbitMQ. We have ported over a lot of this project, and plan to move to the RabbitMQ river for better performance and stability when indexing into Elasticsearch.
To handle large updates, we aren't hitting tables directly. Instead, we use stored procedures that only grab deltas. We also have an option on the stored procedure to reset the delta and reindex everything.
The project is fairly young with only a few commits, but we are open to collaboration and new ideas.

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to have a database set up and use socket.io for real-time updates, which will either query the db again or push the new update to the client.
I am trying to figure out the best way to filter the database.
Some more information in regards to what is being queried and what the real time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
What is the fastest way for the user to filter?
Also, if it differs from the most cost-effective way, I'd like to know that as well (which I assume means fewer db queries).
It's hard to say because we don't see the exact conditions of the filter, but in general:
Mongo can use only one index per query condition. Thus whatever fields are covered by that index can be used for efficient filtering; otherwise it may do a full collection scan, which is slow. If you are using an index, you are probably running the most efficient query. (Mongo can still use another index for sorting, though.)
Sometimes you will be forced to do processing on the client side, because Mongo can't do what you want or it would take too many queries.
The least efficient option is to store results somewhere, simply because I/O is slow. That only pays off if you use them as a cache and do not recalculate.
Also consider the overhead and latency of networking: if you have to send lots of data back to the client, it will be slower. In general, Mongo will do a better job filtering than you would on the client.
From what you describe, if you can filter addresses by city and time period, an index on those fields would cut down a lot of documents. You most likely need a compound index over multiple fields, as sketched below.
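A sketch in the mongo shell; the city and time fields come from the question, the collection name is made up:

// Compound index covering the common filter combination
db.deliveries.createIndex({ city: 1, time: 1 });

// Method 1: let MongoDB do the filtering server-side
db.deliveries.find({
  city: "Boston",
  time: { $gte: ISODate("2017-01-01"), $lt: ISODate("2017-01-08") }
});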

Apply Laravel 4 query cache to all database reads

Laravel 4 has a query cache built into its query builder: just add ->remember(), according to the docs.
Can anybody tell me how I can apply this method to all queries in my application, without appending ->remember() to each and every database call in it? Some kind of after filter, I suppose.
You might be able to extend the query builder and simply overload the get() method to first call remember() and then run the original get().
Practically, though, if you want to cache every single query, you might as well do it at the database level. MySQL, for example, has a configuration option to automatically cache all queries for a certain amount of time. However, in an application that does a lot of inserts/updates/deletes, this performs poorly, since the cache for a table is cleared on every such call.
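That MySQL feature is the query cache (removed in MySQL 8.0), enabled in my.cnf along these lines:

# my.cnf: cache all cacheable SELECT results
query_cache_type = 1
query_cache_size = 64M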
Caching every query through Laravel would likewise mean serving outdated data if you do inserts/updates/deletes in the meantime, so you'd have to clear the cache on every write.
Best practice would be to diligently decide if a query should be cached or not.
