Update nested field for millions of documents - elasticsearch

I use bulk updates with a script to update a nested field, but this is very slow:
POST index/type/_bulk
{"update":{"_id":"1"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"1","field2":"2"}}}}
{"update":{"_id":"2"}}
{"script":{"inline":"ctx._source.nestedfield.add(params.nestedfield)","params":{"nestedfield":{"field1":"3","field2":"4"}}}}
... [many more, split into several batches]
Do you know another way that could be faster?
It seems possible to store the script in order to not repeat it for each update, but I couldn't find a way to keep "dynamic" params.

As often with performance optimization questions, there is no single answer since there are many possible causes of poor performance.
In your case you are making bulk update requests. When an update is performed, the document is actually being re-indexed:
... to update a document is to retrieve it, change it, and then reindex the whole document.
Hence it makes sense to take a look at indexing performance tuning tips. The first few things I would consider in your case are selecting the right bulk size, using several threads for bulk requests, and increasing or disabling the index refresh interval.
You might also consider using a ready-made client that supports parallel bulk requests, as the Python elasticsearch client does.
Ideally, monitor Elasticsearch performance metrics to understand where the bottleneck is and whether your performance tweaks give an actual gain. Here is an overview blog post about Elasticsearch performance metrics.
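As a rough sketch of the stored-script idea raised in the question, the bulk body can reference a script by id and still pass different params on every update line. The script name "append-nested" and the helper names below are hypothetical, and the key used to reference a stored script ("id" here) has varied between Elasticsearch versions, so check the documentation for your release.

```python
import json
from typing import Dict, Iterable, Iterator


def bulk_update_lines(updates: Iterable[Dict], script_id: str) -> Iterator[str]:
    """Yield the NDJSON lines of a _bulk request, two lines per update.

    The script source is not repeated; each update only carries the id of a
    previously stored script (assumed to exist) plus its own params.
    """
    for update in updates:
        # Action/metadata line: which document to update.
        yield json.dumps({"update": {"_id": update["id"]}})
        # Payload line: stored script reference with per-document params.
        yield json.dumps(
            {
                "script": {
                    "id": script_id,
                    "params": {"nestedfield": update["nestedfield"]},
                }
            }
        )


def bulk_body(updates: Iterable[Dict], script_id: str) -> str:
    """Join the lines; a _bulk body must end with a trailing newline."""
    return "\n".join(bulk_update_lines(updates, script_id)) + "\n"


updates = [
    {"id": "1", "nestedfield": {"field1": "1", "field2": "2"}},
    {"id": "2", "nestedfield": {"field1": "3", "field2": "4"}},
]
body = bulk_body(updates, script_id="append-nested")
print(body)
```

This only removes the repeated script text from each request. The reindexing cost per document stays the same, so bulk size, parallel requests and the refresh interval are still the main levers.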

Related

elasticsearch mapping for Friend to Friend list

We have started using Elasticsearch in our project. We store each user's data with the friend list as a nested object, and inside each friend a further nested object holding that friend's friend list, because we need this data for global search.
We are now syncing this data with our database in real time. Is it fine to sync in real time at 50-100 TPS, or will that cause problems in the future?
We need complex queries to update the data because we manage the friend list at the second level. How can we write advanced scripts in Painless? I searched Google but found nothing detailed.
If my approach is wrong, please let me know.
To answer your first question
Is it fine to sync in real time at 50-100 TPS, or will that cause
problems in the future?
As of ES 6.0, multi-level nesting is supported and detected automatically, so an inner nested query automatically matches the relevant nesting level (and not the root) if it sits inside another nested query. There is a caveat, though: indexing a document with 100 nested objects actually indexes 101 documents, because each nested object is indexed as a separate document. To safeguard against ill-defined mappings, the number of nested fields that can be defined per index is limited, 50 by default, through the index.mapping.nested_fields.limit setting. Limits like this one cap the field mappings that can be created manually or dynamically, in order to prevent bad documents from causing a mapping explosion. So to answer your question: this is fine, but as your data grows it becomes more complicated to manage, and you run the risk of a mapping explosion.
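To make the document-count caveat concrete, here is a small sketch that counts how many underlying documents a two-level friend list would produce. The field name "friends" and the sample user are invented for illustration, and the function only mirrors the "one document per nested object, plus the root" rule described above.

```python
from typing import Any, Dict


def count_indexed_docs(doc: Dict[str, Any], nested_field: str = "friends") -> int:
    """Count the root document plus every nested object beneath it.

    Assumes the same (hypothetical) field name is mapped as nested at
    every level of the friend hierarchy.
    """
    children = doc.get(nested_field, [])
    return 1 + sum(count_indexed_docs(child, nested_field) for child in children)


user = {
    "name": "alice",
    "friends": [
        {"name": "bob", "friends": [{"name": "carol"}, {"name": "dave"}]},
        {"name": "erin", "friends": [{"name": "frank"}]},
    ],
}

total = count_indexed_docs(user)
print(total)
```

With realistic friend counts the total grows roughly with the product of the two levels, and every update to the user reindexes all of those documents together.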
To answer your second question
We need complex queries to update the data because we manage the
friend list at the second level. How can we write advanced scripts
in Painless? I searched Google but found nothing detailed.
You may need to give more context for us to understand why your approach is necessary. In general, though, in a social-profile context, managing a friends list this way is a bad idea, especially if you anticipate scaling. It may work for smaller use cases, but it does not scale well, because the relationships become more complex and you end up with too many multi-level nested objects. All else being equal, you might want to look at a graph database for this kind of scenario. You may, however, have other reasons for your approach, so please describe your context so we can advise better.
Hope this helps!!

Limit server resources for a query in RethinkDB

I want to run a heavy query and somehow limit the resources it uses, so it never affects other clients' queries.
Is it possible?
This is currently not possible. You could use sharding to make the query not affect other queries that are reading totally different data, but there's no way right now to prioritize between different queries operating on the same data.

bulk.find.update() vs update.collection(multi=true) in mongo

I am a newbie to MongoDB and would like to use it in a project with millions of records. For updates, which should I prefer for performance: bulk.find.update() or update.collection with multi=true?
As far as I know the biggest gains Bulk provides are these:
Bulk operations send a single request to MongoDB for all operations in the bulk. Other methods send one request per document, or one request covering a single operation type (insert, update, updateOne, upsert via update, or remove).
A bulk can collect many different operations queued at different points in the code.
Bulk operations can run asynchronously. Others cannot.
Today, though, some operations are bulk-based themselves, for instance insertMany.
Taking the gains above into account, update() should perform the same as a bulk.find.update() operation.
That is because update() sends a single query object to MongoDB, and multi: true is just an argument specifying that all matched documents must be updated. So it makes only one request over the network, just like a bulk operation.
So both send only one request to MongoDB, which evaluates the query clause to find the documents to update and then updates them.
I tried to find an answer to this question on the official MongoDB site but could not.
So, an explanation from @AsyaKamsky would be great!

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to set up a database and use socket.io for real-time updates, which will either query the db again or push the new update to the client.
I am trying to figure out the best way to filter the database.
Some more information in regards to what is being queried and what the real time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
I am trying to figure out which is the fastest way for the user to filter.
Also, if the fastest way is not the most cost-effective one, I would like to know the most cost-effective as well (which I assume means fewer db queries).
It's hard to say because we don't see exact conditions of the filter, but in general:
Mongo generally uses a single index for a query condition, so only the fields covered by that index can be filtered efficiently. Otherwise it might do a full collection scan, which is slow. If your query uses an index, you are probably already doing the most efficient query. (Mongo can still use another index for sorting, though.)
Sometimes you will be forced to do processing on the client side because Mongo can't do what you want or it takes too many queries.
The least efficient option is to store results somewhere, because I/O is slow. This only benefits you if you use the stored results as a cache and do not recalculate them.
Also consider the overhead and latency of networking. If you have to send lots of data back to the client, it will be slower. In general Mongo will do a better job of filtering than you would on the client.
From what you describe, if you filter by addresses within a time period, an index could cut out a lot of documents. You most likely need a compound index, that is, one over multiple fields.
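A minimal sketch of Method 1 (filtering in the database) might look like the following. It builds a MongoDB filter document from whichever filters the user selected. The field names come from the question, while the function name, the sample values and the choice of index fields are assumptions for illustration. The same filter shape applies in the Node.js driver.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Hypothetical compound index covering the city and time filters, e.g.
# with pymongo: collection.create_index(COMPOUND_INDEX)
COMPOUND_INDEX = [("city", 1), ("time", 1)]


def build_filter(
    city: Optional[str] = None,
    max_price: Optional[float] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> Dict[str, Any]:
    """Build a MongoDB query document containing only the active filters."""
    query: Dict[str, Any] = {}
    if city is not None:
        query["city"] = city
    if max_price is not None:
        query["price"] = {"$lte": max_price}
    # Half-open time window: start <= time < end.
    time_range: Dict[str, Any] = {}
    if start is not None:
        time_range["$gte"] = start
    if end is not None:
        time_range["$lt"] = end
    if time_range:
        query["time"] = time_range
    return query


query = build_filter(
    city="Austin",
    start=datetime(2024, 1, 1),
    end=datetime(2024, 2, 1),
)
print(query)
# With pymongo this would be passed as: collection.find(query)
```

Because only the matching documents leave the database, this keeps both the scan and the network transfer small, which is the point made above about letting Mongo do the filtering.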

Joomla getItems default Pagination

Can anyone tell me if the getItems() function in the model automatically adds the globally set LIMIT before it runs the query (from getListQuery())? Joomla is really struggling, seemingly trying to cache the entire result set (over 1 million records here!).
After looking in /libraries/legacy/model/list.php AND /libraries/legacy/model/legacy.php, it appears that getItems() does add LIMIT to setQuery using $this->getState('list.limit') before it sends the results to the cache, but if this is the case, why is Joomla struggling so much?
So what's going on? How come phpMyAdmin can return the limited results within a second and Joomla just times out?
Many thanks!
If you have one million records, you'll most definitely want to do as Riccardo is suggesting: override and optimize the model.
JModelList runs the query twice, once for the pagination numbers and then for the display query itself. You'll want to inherit from JModelList carefully to avoid the pagination query.
Also, the articles query is notorious for its joins. You can definitely lose some of that slowdown (I doubt you are using the contacts link, for example).
If all articles are visible to public, you can remove the ACL check - that's pretty costly.
There is no DBA from the West or the East who is able to explain why all of those GROUP BY's are needed, either.
Losing those things will help considerably. In fact, building your query from scratch might be best.
It does add the pagination automatically.
The struggle is most likely due to a large dataset (i.e. 1000+ items returned in the collection) and many lookup fields: the content modules, for example, join as many as 10 tables to get author names and so on.
This can be a real killer: I had queries running for over one second on a dedicated server with only 3000 content items. One tag cloud component we found could take as long as 45 seconds to return a keywords list. If this is the situation (a lot of records and many joins), your only way out is to further limit the filters in the options to see if you can get faster results (for example, limiting to articles from the last 3 months can reduce the time needed dramatically).
But if this is not sufficient or not viable, you're left with writing a new optimized query in a new model, which will ultimately bring a bigger performance gain than any other optimization. In writing the query, consider leveraging database-specific optimizations, i.e. adding indexes and full-text indexes, and only use joins if you really need them.
Also make sure the number of joins never grows with the number of fields, translations, or anything else.
A constant query is easy for the db engine to optimize and cache, whilst a dynamic query will never be as efficient.
