Lucene filter with docIds - filter

I'm trying to do the following: I want to create a set of candidates by querying each field separately and then adding the top k matches to this set. After I'm done with that, I need to run another query on this candidate set.
The way I have implemented it right now is with a QueryWrapperFilter around a BooleanQuery that matches the unique id field of each candidate document. However, this means I have to call IndexSearcher.doc().get("docId") for each candidate document before I can add it to my BooleanQuery, which is the major bottleneck. I'm only loading the docId field via MapFieldSelector("docId").
I wanted to create my own Filter class, but I can't use the internal Lucene doc ids directly, because they are specified per segment. Any thoughts on how to approach this?

Instead of reading the stored docId, index the field (it probably already is) and use the FieldCache to retrieve docIds much faster. Then, instead of using the docIds in a BooleanQuery, try using a TermsFilter or FieldCacheTermsFilter. The documentation for the latter describes the performance trade-offs.
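As a rough mental model of this suggestion (plain Python, not the Lucene API): a field cache behaves like an array indexed by internal document number that holds each document's docId term, so collecting candidate ids becomes an array lookup instead of a stored-field read, and a terms filter is then a set-membership test. The names field_cache, candidate_ids and terms_filter, and the sample data, are hypothetical.

```python
# Conceptual model only: field_cache[i] holds the indexed "docId" value of the
# document with internal number i (what a field cache lookup would return).
field_cache = ["a1", "b2", "c3", "d4", "e5"]

def candidate_ids(per_field_hits, k):
    """Union of the top-k hits of each per-field query, as unique docId values."""
    ids = set()
    for hits in per_field_hits:
        for doc_num in hits[:k]:
            ids.add(field_cache[doc_num])  # array lookup, no stored-field read
    return ids

def terms_filter(allowed_ids):
    """Return a predicate that accepts only documents whose docId is allowed."""
    return lambda doc_num: field_cache[doc_num] in allowed_ids

# Hits are internal doc numbers, best first, one list per queried field.
hits_by_field = [[0, 3, 1], [4, 0, 2]]
candidates = candidate_ids(hits_by_field, k=2)
accept = terms_filter(candidates)

# The second query then only considers documents that pass the filter.
final_hits = [d for d in range(len(field_cache)) if accept(d)]
```

Because the filter works on the unique docId values rather than on internal document numbers, it does not depend on how the index is split into segments.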

Related

elasticsearch copy field when indexing

I would like to create a one-to-many relationship for the purpose of aggregations.
The "join" will be according to a field called "common_id":
When I create the first document belonging to a group, I would like to use its flakeId (its _id) as the common_id.
When adding other document belonging to the same group I would like to explicitly set the common_id to have the same value as the first document I added. This can be done by my app since my application will know the common_id of the first element.
My problem is with the first document:
How can I tell Elasticsearch to copy the _id into common_id in a single call? (I know I can do it with an update script, or with two calls, one to index and one to update, but that requires two requests instead of one.)
I would like a simple syntax for this.
thanks

Solr boost query sort by whether result is boosted then by another field

I'm using Solr to run a query on one of our cores. Suppose my documents have two fields: ID, and Name. I also have a separate list of IDs I'm grabbing from a database and passing into the query to boost certain results.
If the document gets returned in the query and the ID is in the list it goes to the top of the results, and if it gets returned in the query and the ID is not in the list then it goes below those that are in the list. The former is from the "boost". My query is something like this -
http://mysolrserver:8983/solr/MyCore/MyQueryHandler?q=Smith&start=0&rows=25&bq=Id%3a(36+OR+76+OR+90+OR+224+OR+391)
I am able to get the boost query working, but I need the boosted results to be in alphabetical order by name, with the non-boosted results under them, also in alphabetical order by name. I need to know what to use for the &sort= parameter.
&sort=score%20desc,Name+asc does not work.
I've looked over a lot of documentation, but I still don't know if this is even possible. Any help is appreciated. Thanks!
Solr version is 6.0.1. I am actually using SolrNet to interface with Solr, but I think I can figure out the SolrNet part if I know what the url's &sort= parameter value needs to be.
I figured it out by doing away with the boost query. I added a sort clause that uses the "exists" function and passes it a sub-query for the ID. The exists function returns a boolean value to sort on, and I added the name as a second sort. It works perfectly!
The URL looks like this:
http://mysolrserver:8983/solr/MyCore/MyQueryHandler?q=Smith&start=0&rows=25&sort=exists(query({!v=%27Id:(36+OR+76+OR+90+OR+224+OR+391)%27}))%20DESC,%20Name%20ASC
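A small sketch of how that URL could be assembled in code, assuming the same host, core and handler as above. The helper names build_url and local_order are hypothetical; local_order only mimics the intended ordering in memory (boosted ids first, then by name) and does not talk to Solr.

```python
from urllib.parse import urlencode, quote, urlsplit, parse_qs

def build_url(base, q, boosted_ids, start=0, rows=25):
    """Build a request URL whose sort puts documents matching the id list first."""
    id_clause = " OR ".join(str(i) for i in boosted_ids)
    sort = "exists(query({!v='Id:(%s)'})) desc, Name asc" % id_clause
    params = {"q": q, "start": start, "rows": rows, "sort": sort}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return base + "?" + urlencode(params, quote_via=quote)

def local_order(docs, boosted_ids):
    """In-memory equivalent of the sort: listed ids first, then Name ascending."""
    boosted = set(boosted_ids)
    return sorted(docs, key=lambda d: (d["Id"] not in boosted, d["Name"]))

url = build_url("http://mysolrserver:8983/solr/MyCore/MyQueryHandler",
                "Smith", [36, 76, 90, 224, 391])
# Decoding the query string again recovers the raw sort expression.
decoded_sort = parse_qs(urlsplit(url).query)["sort"][0]
```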
The closest match to your requirement is the query elevation component [1].
In your particular case I would first sort my Ids according to my requirements (sorting them by name, for example), then maintain them in the elevate.xml.
At query time you can use the "forceElevation" parameter to force the elevation and then sort the remaining results by name.
[1] https://cwiki.apache.org/confluence/display/solr/The+Query+Elevation+Component

elasticsearch: decide which query should run first

We have a simple web page, where the user can provide some input and query the database. We currently use mongodb but want to migrate to elasticsearch, since the queries are faster.
There are some required search fields, like start and end date, and some optional ones, like a search string to match an entry, or a parent search string, to match parent entries. Parent-child relations are just described through fields containing each entry's ancestors ids.
The question is the following: If both search and parent search string are provided, is there a way to know before executing the queries, which query should be executed first, in order to provide results faster and to be more performant?
For example, it could be that a specific parent search results in only 2 docs/parent entries, and then we can fetch all children matching the search string. In that case we should execute the parent query first and then the entry query.
One option would be to get the count of both queries and then execute first the one with the smallest count, but isn't this solution worse, since the queries are going to be executed twice? Once for the count and once for the actual query.
Are there any other options to solve this?
PS. We use elasticsearch v1.7
Example
Let's say the user wants to search for all entries matching the following fields.
searchString: type:BLOCK AND name:test
parentSearchString: name:parentTest AND NOT type:BLOCK
This means that we either have to
fetch all entries (parents) matching the parentSearchString and store their ids. Then, we have to fetch all entries that match the searchString and also have to contain any of the parent ids in the ancestors field.
OR
fetch all entries that match the searchString and store all ancestors ids. Then fetch all entries that match the parentSearchString and whose id is one of the ancestors ids.
Just to clarify, both parent and children entries have the exact same structure and reside in the same index. We cannot have different indices since the parent-child relation can be nested up to 10 levels deep, so an entry can be both a parent and a child. An entry looks more or less like:
{
  id: "e32452365321",
  name: "name",
  type: "type",
  ancestors: "id1 id2 id3" // stored in node as an array of ids
}
First of all, I would advise you to upgrade your Elasticsearch version if possible. A lot has happened since 1.7 and, to be honest, I can't tell whether everything in the article linked below is valid for such an old version (probably it isn't).
But to your actual question: if I understand you correctly, you are trying to estimate how costly a query is for Elasticsearch? Well, you don't have to. If you provide all the 'queries' in one nested query, Elasticsearch will do that for you: https://www.elastic.co/blog/elasticsearch-query-execution-order
Regarding speed, there is one other thing I can mention: calculating the score takes time. So if sorting is not based on the Elasticsearch _score, you want to use boolean filter queries. The same applies if you want to sort only by the _score of parent matches: you could then put the query for children into a filter.
update
Thanks to your example, I now see the problem. Self-referential parent-child relations are unfortunately not supported by Elasticsearch, so your approach is probably right. You might want to check out the short chapter of the documentation about application-side joins.
So yes, in general, you want to send the second query with the least possible amount of ids/terms. While getting counts for both queries is not as bad as you might think, because the results are most likely still cached, does it actually help? Because if you're going from child to parent, you would have to count the ancestors (field values), and not the actual document count.
I would argue, that the most expensive operation is very often fetching result source from disk. So whichever way you go, you probably should only fetch what you need in the first query. So your options are:
Fetch only the id of parent matches, and then use a terms filter on ancestors in the second query.
Or, fetch only the ancestors field of child matches, and use an id filter in your second query.
Unfortunately, I can't help you more than that, since I don't have enough experience in comparing speed of those approaches. My guess would be, that an id filter might be faster in general. But that's just a guess...
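To make the two options concrete, here is an in-memory sketch of both directions of the application-side join. It is plain Python over a list of dicts, not Elasticsearch client code; the function names and sample entries are hypothetical, and it assumes the desired result is the child entries that have at least one matching parent among their ancestors.

```python
def parent_first(entries, matches_parent, matches_child):
    """Option 1: collect parent ids, then filter children by ancestors."""
    # Query 1: fetch only the ids of parent matches.
    parent_ids = {e["id"] for e in entries if matches_parent(e)}
    # Query 2: child matches with a terms-style filter on ancestors.
    return [e for e in entries
            if matches_child(e) and parent_ids & set(e["ancestors"])]

def child_first(entries, matches_parent, matches_child):
    """Option 2: collect ancestors of child matches, then filter parents by id."""
    # Query 1: fetch only the ancestors field of child matches.
    children = [e for e in entries if matches_child(e)]
    ancestor_ids = set().union(*(e["ancestors"] for e in children))
    # Query 2: parent matches with an ids-style filter.
    parents = {e["id"] for e in entries
               if matches_parent(e) and e["id"] in ancestor_ids}
    return [e for e in children if parents & set(e["ancestors"])]

entries = [
    {"id": "p1", "name": "parentTest", "type": "FOLDER", "ancestors": []},
    {"id": "p2", "name": "other", "type": "FOLDER", "ancestors": []},
    {"id": "c1", "name": "test", "type": "BLOCK", "ancestors": ["p1"]},
    {"id": "c2", "name": "test", "type": "BLOCK", "ancestors": ["p2"]},
]
is_parent = lambda e: e["name"] == "parentTest" and e["type"] != "BLOCK"
is_child = lambda e: e["type"] == "BLOCK" and e["name"] == "test"

result_a = parent_first(entries, is_parent, is_child)
result_b = child_first(entries, is_parent, is_child)
```

Both directions give the same entries; what differs is the size of the id set carried into the second query (parent ids in one case, distinct ancestor ids in the other), which is the quantity worth keeping small.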

Elasticsearch - Lookup value for query score

I am using Elasticsearch with a custom script. I need to create a custom score based on a value defined in another index. Let's call this value "ranking". The score should look like this:
Math.log(_score * 2) + doc['ranking'].value
The ranking depends on the provider field in the type that is being scored. My problem is that, since the ranking is in another index altogether, the script doesn't retrieve the value. What is the best way to compute such a score? Also, how could it be done so that the results are cached? For example, is there any way in mvel to retrieve the value from a Redis key-value pair?
You must forget about mvel :)
You should use Groovy scripting
You could check this link for a caching technique: https://github.com/imotov/elasticsearch-native-script-example/blob/master/src/main/java/org/elasticsearch/examples/nativescript/script/LookupScript.java. It's a native Java plugin, though it could be translated into a Groovy script.
Another way (depending on your index relations, for example parent/child) is to use a has_child query and use the score field to carry the value between indexes, but this is too limited.
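The caching idea can be illustrated outside Elasticsearch. This is a minimal Python sketch, not a script plugin: RANKING_BY_PROVIDER is a made-up stand-in for the second index or a Redis lookup, and lookup_ranking and custom_score are hypothetical names. The formula is the one from the question.

```python
import math
from functools import lru_cache

# Stand-in for the external source of rankings; the values are invented.
RANKING_BY_PROVIDER = {"acme": 3.0, "globex": 1.5}

@lru_cache(maxsize=1024)
def lookup_ranking(provider):
    """Fetch the ranking for a provider; repeated calls are served from cache."""
    return RANKING_BY_PROVIDER.get(provider, 0.0)

def custom_score(score, provider):
    """Math.log(_score * 2) + ranking, with the ranking looked up by provider."""
    return math.log(score * 2) + lookup_ranking(provider)

first = custom_score(0.5, "acme")   # triggers the lookup
second = custom_score(0.5, "acme")  # ranking comes from the cache
```

The point of the cache is that the expensive lookup happens once per provider rather than once per scored document.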

Lucene equivalent of SQL Server's ORDER BY [duplicate]

I have a Lucene index with a field that I need to sort on.
I have my query and I can make my Sort object.
If I understand right from the javadoc I should be able to do query.SetSort(). But there seems to be no such method...
I'm sure I'm missing something vital.
Any suggestions?
There are actually two important points. First, the field must be indexed. Second, pass the Sort object into the overloaded search method.
Last time I looked, the docs didn't do a very good job of pointing out the indexing part, and certainly didn't explain why this is so. It took some digging to find out why.
When a field is sortable, the searcher creates an array with one element for each document in the index. It uses information from the term index to populate this array so that it can perform sorting very quickly. If you have a lot of documents, it can use a lot of memory, so don't make a field sortable unless there is a need.
One more caveat: a sortable field must have no more than one value per document. If there are multiple values, Lucene doesn't know which to use as the sort key.
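A toy model of the per-document array described above, in plain Python rather than Lucene code. The names docs, sort_keys and sorted_hits are hypothetical; the sketch only shows why one key per document is needed and why memory grows with the number of documents.

```python
# Three documents, identified by their position (internal doc number).
docs = [{"title": "pear"}, {"title": "apple"}, {"title": "fig"}]

# One element per document in the index: this is the memory cost.
sort_keys = [d["title"] for d in docs]

def sorted_hits(hit_doc_nums):
    """Order matching doc numbers by looking up their key in the array."""
    return sorted(hit_doc_nums, key=lambda n: sort_keys[n])

ordered = sorted_hits([0, 1, 2])
```

With several values per document there would be no single entry to put in sort_keys for that document, which is the ambiguity the caveat refers to.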
It looks like the actual method you want is e.g. Searcher.search(Query query, Filter filter, int n, Sort sort). setSort is a method of Sort.

Resources