Elasticsearch get multiple documents by uids over multiple indices - elasticsearch

Previously, all documents of one type were in the same index. But because the types take different forms (conceptually), and for backup purposes, I now need multiple indices for a single type.
They will all be of the form _feed. While this setup is great in some circumstances, for
client.prepareGet(index, typename, ids).execute().actionGet(); // works great if you know in which index to search
it is useless, since no wildcards may be used. What I can do is use multiple multigets and interleave the results. This gives me what I want, but increases the number of queries significantly.
Assuming I know for sure that only one document exists with a given id, is there a better way to query this than calling a multiget on all _uids for each possible index?

The best way would be to develop a mechanism in your application that lets you deduce the index name from the id. But assuming that this is not possible or practical, you have pretty much only two choices. If you need realtime get, then your approach is the only way to do it. If realtime get is not a requirement, you can perform a search across all indices using an ids filter. If the id list is small, you can benefit from using routing on your search query: this way the search request will only be dispatched to the shards that might contain any of the ids listed in the query. However, if the list of ids is big enough to span most of the shards, routing will not provide any benefit.
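As a rough illustration of the non-realtime option, here is a minimal sketch that only builds the search path and request body as plain Python data. The index pattern "*_feed" and the helper name build_ids_search are assumptions made for this example, and passing the ids as routing values only makes sense if the documents were indexed with default routing (that is, routed by their own id).

```python
import json

def build_ids_search(index_pattern, ids, use_routing=False):
    """Return (path, body) for a search that looks up documents by id.

    index_pattern is a hypothetical wildcard such as "*_feed".
    """
    path = f"/{index_pattern}/_search"
    if use_routing:
        # Assumption: default routing, so each id doubles as its routing value.
        # Ids containing special characters would need URL-encoding first.
        path += "?routing=" + ",".join(ids)
    body = {
        "query": {"ids": {"values": list(ids)}},
        # Assumption from the question: at most one document per id.
        "size": len(ids),
    }
    return path, body

path, body = build_ids_search("*_feed", ["a1", "b2"], use_routing=True)
print(path)
print(json.dumps(body))
```

With a long id list, leaving use_routing off is probably the better default, since the request would hit most shards anyway.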

Related

Elastic Search Number of Document Views

I have a web app that is used to search and view documents in Elastic Search.
The goal now is to maintain two values.
1. How many times the document was fetched in total (life time views)
2. How many times the document was fetched in last 30 days.
Achieving the first is somewhat possible, but the second one seems to be a very hard problem.
The two values need to be part of the document as they will be used for sorting the results.
What is the best way to achieve this?
To maintain expiring data like that you will need to store each view with its timestamp. I suppose you could store them in an array in the ES document, but you're asking for trouble doing it like that, as the update operation that you'd need to call every time the document is viewed will have to delete and recreate the document (that's how ES does updates), and if two views happen at the same time it will be difficult to make sure they both get stored.
There are two ways to store the views, and make use of them in the query:
Put them in a separate store (could be a different index in ES if you like), and run a cron job or similar every day to update every item in the main index with the number of views from the last thirty days in the view store. Even with a lot of data it should be possible to make this quite efficient, depending on your choice of store for views.
Use the ElasticSearch parent/child datatype to store views in the same index as the main documents, as children. I'm not sure that I'd particularly recommend this approach, but I think it should be possible with aggregations to write a query that sorts primary documents by the number of children (filtered by date). It might be quite slow though.
I doubt there is any other way to do this with current versions of ES, because it doesn't support joining across indices. Either the data must be aggregated in advance onto the document, or it has to be available in the same index.
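To make the first option (a separate view store plus a daily job) more concrete, here is a minimal sketch of the counting step. The event layout (doc id, timestamp), the field name views_30d and the helper name views_in_window are all assumptions for illustration; a real job would read the events from whatever store you chose and send the resulting partial updates in bulk.

```python
from collections import Counter
from datetime import datetime, timedelta

def views_in_window(view_events, now, days=30):
    """Count views per document id that fall inside the last `days` days."""
    cutoff = now - timedelta(days=days)
    return Counter(doc_id for doc_id, ts in view_events if ts >= cutoff)

now = datetime(2024, 3, 31)
events = [
    ("doc1", datetime(2024, 3, 30)),  # inside the window
    ("doc1", datetime(2024, 1, 1)),   # too old, ignored
    ("doc2", datetime(2024, 3, 15)),  # inside the window
]
counts = views_in_window(events, now)

# One partial-update payload per document; "views_30d" is a hypothetical field.
updates = [{"_id": doc_id, "doc": {"views_30d": n}}
           for doc_id, n in sorted(counts.items())]
print(updates)
```

Documents whose views have all expired do not appear in the counts, so the job would also need to reset their views_30d to zero.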

Associating each document with a function to be satisfied by search parameters in Elasticsearch

In Elasticsearch, can I associate each document with a (different) function that must be satisfied by parameters I supply on a search, in order to be returned on that search?
The particular functions I would particularly like to use involve a loop, some kind of simple branching (if-statement of switch-statement), an array-like data structure, strings comparisons, and simple boolean operators.
A couple of key notes here:
At query time:
- If you're looking to shape the relevancy function, meaning the actual relevancy score of each document, you could use a script score query.
- If you're only looking to filter out unwanted documents, you could use a script query, which allows you to do just that.
Both of those solutions enable you to compute a value by comparing incoming query parameters against previously indexed values.
Take note that using scripts at query time can lead to increased memory usage and performance issues.
Elasticsearch can also apply a second batch of filtering rules to the actual query result, in the form of a post filter. That can come in handy if you're not in a position to stream-process the output at the API view level.
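For the filtering case, a script query body could look roughly like the sketch below, built as a plain Python dict. The field name min_quantity, the parameter name quantity and the helper name build_script_filter are made up for this example; the Painless expression is just one possible per-document condition.

```python
import json

def build_script_filter(source, params):
    """Wrap a Painless script in a bool/filter script query."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            "source": source,
                            "params": params,
                        }
                    }
                }
            }
        }
    }

# Hypothetical rule: keep documents whose indexed min_quantity is not
# larger than the quantity supplied with the search.
source = "doc['min_quantity'].value <= params.quantity"
body = build_script_filter(source, {"quantity": 5})
print(json.dumps(body, indent=2))
```

Passing values through params rather than concatenating them into the source keeps the script text identical between searches, so it can be compiled once and reused.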
At index time:
There is such a thing as script fields, which allow you to store a function that computes a result based on other fields' values and incoming query parameters. They can be really powerful given the fact that they are assigned at index time. I think they might be what you are looking for.
I would not use those if I didn't need those field values compared against query params. The reason is that I like my indexing process to be lean and fast, so I tend to compute those kinds of values at stream level, upstream of the actual bulk indexing request.
Although convenient, the results of those custom scripts are likely to be achievable with a combination of regular queries and filters. In each release, the Elasticsearch team adds new query and field types that let you do what you used to do via scripted queries, without the risk of blowing out your memory. A good example of this is the rank feature datatype recently introduced in the 7.x release.
A piece of advice for you: think of your Elasticsearch service as a regular API in your data layer. As such, you can do query processing before the actual call to Elasticsearch, and you can do data processing on the actual Elasticsearch results. If you really can't fit your business rules in there, then scripting would be your last resort.
Feel free to contact me if you still have any questions. All the best.

MongoDB efficient dealing with embedded documents

I have serious trouble finding anything useful in the Mongo documentation about dealing with embedded documents. Let's say I have the following schema:
{
_id: ObjectId,
...
data: [
{
_childId: ObjectId // let's use custom name so we can distinguish them
...
}
]
}
1. What's the most efficient way to remove everything inside data for a particular _id?
2. What's the most efficient way to remove an embedded document with a particular _childId inside a given _id? What's the performance here: can _childId be indexed in order to achieve logarithmic (or similar) complexity instead of a linear lookup? If so, how?
3. What's the most efficient way to insert a lot of (let's say 1000) documents into data for a given _id? And like above, can we get O(n log n) or similar complexity with proper indexing?
4. What's the most efficient way to get the count of documents inside data for a given _id?
The other two answers give sensible advice on your questions 1-4, but I want to address your question by interrogating the basis for asking it in the first place. The terminology of "embedded document" in the context of MongoDB storing "documents" confuses people. You should not think of an embedded document as another document in MongoDB that you search for, index, or update as its own document, because that's not what it is. It's a grouped collection of fields inside a document; it's a BSON field of type Object. To quote the embedded document docs,
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations.
Starting from knowledge about your use case, you should pick your documents and document structure to make your common operations easier. If you are so concerned about 1-4, you probably want to unwind your data array of childIds into separate documents. A concrete example of this common "antipattern" is a blog with many authors - you could have a user document with a large, changing array of posts embedded inside, or a post document with user information replicated in each. I can't say for sure what is or isn't wrong with your data model as you've given no specific details about it, but struggling to understand why 1-4 seem hard or undocumented or slow in MongoDB is a good sign that you should rethink the data model so the equivalent of 1-4 are fun and easy! Or at least easier and more fun.
I can't find anything on speed, so I will go with the ways found in the documentation, in the hope that the documented ways are also the most efficient ones:
If you want to remove all subdocuments in data, you can just update data to [].
The official way to remove a document with a specific _childId from data would be $pull:
db.collection.update(
{ },
{ $pull: { data: { _childId: id } } },
)
You might need to add { multi: true } if _childId is not unique (multipart subdocuments).
On indexing on subdocuments I would refer you to this question. Short answer: yes, you can index fields in subdocuments for faster lookup, just as you would index normal fields, with
db.collection.ensureIndex({"data._childId" : 1})
If you want to search for a subdocument in only one specific document, you can use aggregation, i.e.
db.collection.aggregate([
{$match: {_id: _id}},
{$unwind: '$data'},
{$match: {'data._childId': _childId}}
])
which will first match for _id and only then for _childId. It will return the parent document with data only containing the subdocument(s) with _childId.
There is $push for that, although for 1000 subdocuments you might not want to do it in one query anyway.
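One way to split such an insert into several smaller updates is to combine $push with the $each modifier and send the subdocuments in batches. The sketch below only builds the (filter, update) pairs as Python dicts; the helper name batched_push_updates and the batch size of 250 are arbitrary choices for this example, not a recommendation from the documentation.

```python
def batched_push_updates(parent_id, children, batch_size=250):
    """Yield (filter, update) pairs that append `children` to data in batches."""
    for start in range(0, len(children), batch_size):
        batch = children[start:start + batch_size]
        # $each appends every element of the batch in a single update.
        yield {"_id": parent_id}, {"$push": {"data": {"$each": batch}}}

# Hypothetical subdocuments; real ones would carry ObjectId values.
children = [{"_childId": i} for i in range(1000)]
ops = list(batched_push_updates("parent-1", children))
print(len(ops))
```

Each pair can then be passed to the update call of whichever driver you use, or collected into a bulk operation.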
Trudbert is right: db.collection.update({_id:yourId},{$set:{data:[]}})
Two points for Trudbert. However, I would like to add that if you have the whole document available in your app, it might be reasonable to simply replace the contents of the whole document if suitable for your use case.
I have had good experiences with bulk updates, performance-wise. You might want to try them.
I don't know how you came to the idea that an aggregate wouldn't use indices, but since _id is unique, it would make much more sense to use db.collection.findOne({_id:yourId},{"data._childId":1,_id:0}).data.length or use its equivalent as a raw command in the driver of your choice. Since the connection is already established, unless the array is very big, it should be faster to simply return the data instead of having the calculations done on a possibly (over)loaded server.
As per your comments on Trudbert's answer: _id is unique. So exactly one doc will need to be modified for a known _id: db.collection.update({_id:theId},{$pull..... It does not get more efficient. For an unknown id, create an index on childId and do the same pull operation with a match on childId instead of id, with the multi option set, to remove all references to a specific childId.
I strongly second Trudbert's suggestion of using the aggregation framework to create documents when needed out of optimized data. Currently, I have an aggregation pipeline which analyses 5M records with more than 7 million relations to each other in some 6 seconds. On a non-sharded standalone instance. With spinning disks, crappy IO and not even optimized. With careful planning of the aggregations (an early match limiting the documents passed to the ones not processed so far) and merging them with earlier results (adapting the _id in the group phase can achieve that), you can even optimize this down to mere fractions of a second, if absolutely necessary.

Data model for fields that change frequently in ElasticSearch

What is the best way to deal with fields that change frequently inside a document for ElasticSearch? Per their docs about partial updates...
Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described.
In particular, what should be done when indexing the document is likely to be expensive, given the number of indexed fields and the size of some of the text fields that have to be analyzed?
As a concrete example, use SO's view and vote counts on questions and answers. It would seem expensive to reindex the text body just to update those values.
Maybe you shouldn't update so frequently. Perhaps things like vote/views should only be periodically updated in ES, while more critical fields like answers/questions be pushed immediately. Consider what's most important and see if you can get away with some level of staleness.
ElasticSearch is great for text search, but I would not consider ES to support SO in its entirety (or similar applications). It could be a useful tool for searching for answers/questions on SO, or for internal applications (like log/event analysis). But perhaps the actual serving of data could be better done with a different solution? Maybe it should be powered by Cassandra instead for the bulk of the work? You get the idea...
If you want to use ES as a solution to your needs, and you MUST update frequently, you could definitely consider the parent/child model mentioned already. Of course, that method will require more memory/disk space, and it will take up more CPU/time when you query for totals. An alternative would be to have the parent store the searchable fields, and let the child hold the metadata (where the child's fields are not analyzed). This will allow you to make frequent updates without having to undergo an expensive re-index, since there is nothing to index.
You could also consider what I mentioned above and see if you can get away with some staleness. This can be done in many ways too. You can throttle your requests by type of change, or change the refresh/flush interval, or consider de-duping updates if you are sending updates in bulk. These too have their shortcomings...
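As a sketch of the de-duping idea, pending counter changes can be collapsed per document before a bulk request is built, so each document is re-indexed at most once per flush. The event layout (doc id, field, delta) and the helper name coalesce_increments are assumptions for this example.

```python
from collections import defaultdict

def coalesce_increments(events):
    """Sum pending counter deltas so each document is updated once."""
    totals = defaultdict(lambda: defaultdict(int))
    for doc_id, field, delta in events:
        totals[doc_id][field] += delta
    # Convert to plain dicts so the result is easy to serialize.
    return {doc_id: dict(fields) for doc_id, fields in totals.items()}

pending = [
    ("q1", "views", 1),
    ("q1", "views", 1),
    ("q1", "votes", -1),
    ("q2", "views", 1),
]
merged = coalesce_increments(pending)
print(merged)
```

How long to buffer before flushing is the staleness trade-off mentioned above: a longer interval means fewer re-indexes but older counts in search results.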
I think the best way to handle the change is to split the document (you can use a parent/child relationship, or just store the parent id) and make each document as small as possible (moving the changeable parts to new types).
Taking SO as an example, this is one way to accomplish your requirement.
You can use multiple types for this. Consider this post (view and vote counts).
Create a type for post, view and vote.
For a post, index a document to the post type (index the post id, title, description, tags). For every view of that post, index a document to the view type (with the id of the post). If the post is voted on, index a document to the vote type (with the number of votes, the id of the post, and any other info you need, like a positive or negative flag).
So, to get the views for a post, filter on the post id and get the document count in the view type.
To get the number of votes, use a stats aggregation on the number of votes, or a terms aggregation followed by a stats aggregation to get positive and negative votes separately.
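The two lookups described above could be expressed with request bodies along these lines, built here as plain Python dicts. The field names post_id, positive and vote_count are assumptions for this example, as are the helper names; the first body is meant for a count request against the view documents, the second for a search against the vote documents.

```python
def view_count_query(post_id):
    """Body for a count request: all view documents of one post."""
    return {"query": {"bool": {"filter": {"term": {"post_id": post_id}}}}}

def vote_stats_query(post_id):
    """Body for a search: vote stats per positive/negative flag."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"bool": {"filter": {"term": {"post_id": post_id}}}},
        "aggs": {
            "by_flag": {
                "terms": {"field": "positive"},
                "aggs": {"votes": {"stats": {"field": "vote_count"}}},
            }
        },
    }

print(view_count_query(42))
print(vote_stats_query(42))
```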
This is the way I think is best, though there can be other opinions too.
Thanks
What I do is use a database like Mongo or MySQL for storing properties that get updated frequently, and use Elasticsearch to store documents for text searching.
Example: I want to keep data about a book and its contents, and I also want to keep the total number of views. Updating and reindexing the document each time a user views it is total overkill.

How to write fast Elastic Search queries

Is there a guide to writing ES queries: what to do, what to avoid, that sort of stuff? The official site describes all the various ways to search, but provides little guidance as to when to select what.
In my particular instance I have a list of providers; each one has a name, an address and a number of IDs. I want to give the user a box where he can type in anything he knows about the provider and run a search based on whatever is provided. Essentially I would like to match every word from the box against the records (documents) in the index.
For the end user this should look like a simple keyword search.
Matching should cover exact matches, wild card matches, phonetic matches, synonyms (for names). Also some fuzziness should be included too.
The official site describes various ways to do that, but how do I combine them? For instance, to support wildcard search, do I use a wildcard query, or do I index with NGram and do just a text query?
With SQL queries, a sure way to get this sort of information is to check the execution plan for the query. If the SQL optimizer tells you that it will use a table scan against a table of considerable size, you know you should change your query or, maybe, add an index. AFAIK there is no equivalent for this powerful feature in ES, and I am not even sure if it is possible to build it.
But at least some generic considerations...? Pretty please...
There is no single best way to go about doing things, because a lot of the time it depends on what you are indexing and how you map your data into fields within Elasticsearch.
Some rules of thumb that you should look out for:
a. Faceted Queries in Elasticsearch work in sequences:
{
"query": {
// data will be searched from this block first //
}, "facets": {
// after the data is received, it will be processed into facets //
}
}
Hence if your query size is huge, you are going to slow down your query further by faceting. Monitor the results of your query.
b. Filters vs Queries
Filters work on a subset of your query: they take the entire result of your query and then filter out what you do or do not want.
Queries are usually direct searches for data.
Hence, if you can make your query as specific as possible before you do a filter, it should yield faster results.
c. Queries are cached; running them again and again will generally yield faster responses. The Warmers API should be able to make your queries even quicker if you are always going to use the same set of queries.
Again, all these are rules of thumb and cannot be followed strictly, because what you index into specific fields will affect processing times. A string is different from long types, and strings with analyzers are different from strings without. What you probably need to do is experiment with your queries to get a better judgement.
One correction to the above: filters are cacheable by ES, not queries. Queries do the extra step of relevance scoring and full-text search. So, wherever full-text search is not needed, using a filter is advised.
Also, design your mappings with correct index values (not_analyzed, no, analyzed)
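To illustrate the filter-versus-query advice for the provider search in the question, here is a minimal sketch of a bool query that keeps the scored full-text part in must and puts an exact-match restriction in filter. The field names name, address, ids and state, and the helper name provider_search, are assumptions for this example; phonetic and synonym matching would come from the analyzers in the mapping rather than from the query itself.

```python
def provider_search(text, state=None):
    """Build a search body: scored fuzzy match plus an optional exact filter."""
    query = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": text,
                        "fields": ["name", "address", "ids"],
                        "fuzziness": "AUTO",  # tolerate small typos
                    }
                }
            ]
        }
    }
    if state is not None:
        # Exact-match restriction: no scoring needed, so it goes in filter.
        query["bool"]["filter"] = [{"term": {"state": state}}]
    return {"query": query}

print(provider_search("jon smith main street", state="NY"))
```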
