Elastic Search Query that identifies not only the results but also highlights the items that satisfy a particular attribute - elasticsearch

We have a requirement that we think is a candidate for a Multi-Search query but we are not sure.
Say we are selling clothes.
The user can enter a type of clothing such as Shirts and we bring back all the shirts using a filter.
We would also like to give the user the option of typing in a keyword such as "formal" or "beach". This keyword should not affect which results are returned; it should only flag the items in the results in which the keyword appears.
Any suggestion would be greatly appreciated.

What about using Named Queries?
The idea is that this Elasticsearch feature lets you mark each clause in the query with a _name value identifying that clause. If a particular clause matches, say a clause with the "formal" keyword or a clause with the "beach" keyword, the response includes a matched_queries property on each hit listing the named clauses that hit matched.
Another option might be to use Highlights. The feature does allow you to specify a separate highlighting query. However, the response format here may not be as straightforward or meet your requirements.
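A minimal sketch of the named-queries idea, written as plain Python dicts so it does not depend on any client library. The field names type and description, the kw_ prefix, and both helper functions are assumptions made for illustration; only _name and matched_queries come from Elasticsearch itself.

```python
def build_flagged_query(clothing_type, keywords):
    """Filter by clothing type and tag each optional keyword clause with _name.

    The field names "type" and "description" are hypothetical.
    """
    should = [
        {"match": {"description": {"query": kw, "_name": f"kw_{kw}"}}}
        for kw in keywords
    ]
    return {
        "query": {
            "bool": {
                # The filter decides which documents come back.
                "filter": [{"term": {"type": clothing_type}}],
                # Named, optional clauses: they only mark matching hits.
                "should": should,
            }
        }
    }


def keyword_flags(hit, keywords):
    """Turn a hit's matched_queries list into one boolean flag per keyword."""
    matched = set(hit.get("matched_queries", []))
    return {kw: f"kw_{kw}" in matched for kw in keywords}
```

Because the bool query has a filter clause, the should clauses are optional and do not change which documents are returned. They do still contribute to _score, so the ordering can shift unless you sort on something else.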

Related

Elasticsearch multiple score fields

Maybe a dummy question: is it possible to have multiple score fields?
I use a custom score based on a function_score query. This score is displayed to the user to show how much each document matches his/her preferences. So far so good.
But! The user should be able to filter the documents and (of course) sort them not only by the custom relevance (how much each document matches his/her preferences) but also by the common relevance - how much each document matches the filter criteria.
So my first idea was to place the score calculated by the function_score query in a custom field, but that does not seem to be supported.
Or am I completely wrong and I should use another approach?
I took a different approach: when the user applies a filter, I run the query without the function_score percolation, use the score calculated by ES, and sort by it. Then I take all IDs from the result page and run the percolation query with these IDs to get the custom "matching score". It does not seem to cause a noticeable slowdown.
Anyway, I welcome any feedback.
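The two-pass approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper names and the output keys relevance and matching_score are invented here, and the scoring functions are passed in rather than guessed.

```python
def second_pass_query(doc_ids, functions):
    """Score only the documents from the current result page.

    `functions` is whatever list of function_score functions the
    application already uses; it is not defined here.
    """
    return {
        "query": {
            "function_score": {
                "query": {"ids": {"values": list(doc_ids)}},
                "functions": functions,
                # Keep only the function result as the score.
                "boost_mode": "replace",
            }
        },
        "size": len(doc_ids),
    }


def merge_scores(first_hits, second_hits):
    """Attach the custom score to each first-pass hit, keeping first-pass order."""
    custom = {hit["_id"]: hit["_score"] for hit in second_hits}
    return [
        {
            "_id": hit["_id"],
            "relevance": hit["_score"],
            "matching_score": custom.get(hit["_id"]),
        }
        for hit in first_hits
    ]
```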

Elasticsearch query for wikipedia pages

I have indexed all wikipedia pages on elasticsearch, and now I would like to search through them according to a list of keywords that I have created. The documents on elasticsearch have only three fields: id for the page id, title for the page title and content for the page content (already clean of wikipedia markup).
My goal is to reproduce the mediawiki query api as much as possible, with parameters action=query and list=search. For instance, given the keywords "non riemannian metric spaces", a call to
https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=10&srprop=&srsearch=non%20riemannian%20metric%20spaces
gives a list of the most relevant pages for those keywords.
So far I have been using rather simple elasticsearch search queries, like for instance
POST _search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": {
            "query": "non riemannian metric spaces"
          }
        }
      },
      "should": {
        "match": {
          "title": {
            "query": "non riemannian metric spaces",
            "boost": x
          }
        }
      }
    }
  }
}
for several values of boost, like 1, 2 or 0.5. This already gives some decent results, in the sense that the pages I obtain are relevant to the keywords, but sometimes they are not quite the same as the ones I get with the mediawiki api.
I would be glad to hear some suggestions on how to fine-tune the elasticsearch query to mimic more accurately the mediawiki api behavior. Or even, since the mediawiki api itself is built with elasticsearch and cirrussearch, I would like to know whether the actual elasticsearch query for the entry point above with those specific parameters is openly available.
Thank you in advance!
UPDATE (after Robis Koopmans' answer): Seeing the actual query with cirrusDumpQuery has indeed been very useful. I do however have some followup questions concerning the query:
The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I don't know what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related to the match_all clause?
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
Please feel free to answer whichever question you know the answer to, and again thanks a lot!
The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I don't know what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
The .plain fields are generated as part of the elasticsearch mapping. The current settings and mappings are available to see how exactly they work. mediawiki.org includes a search glossary entry on the plain field as well. In general the top level field contains a highly processed form of the text, and the plain field uses minimal analysis.
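As a rough illustration of how a sub-field such as title.plain can come out of a mapping, here is a generic Elasticsearch multi-field sketch. It is not the actual CirrusSearch mapping: the choice of the built-in english and standard analyzers and both helper names are assumptions made for this example.

```python
def title_mapping():
    """A generic multi-field mapping: heavily analyzed title plus a lightly analyzed title.plain."""
    return {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english",  # assumed: stemming, stop words
                "fields": {
                    # Indexed from the same source text with lighter analysis.
                    "plain": {"type": "text", "analyzer": "standard"}
                },
            }
        }
    }


def title_multi_match(keywords):
    """Search both variants, using the field boosts quoted in the question."""
    return {
        "multi_match": {
            "query": keywords,
            "fields": ["title.plain^1", "title^3"],
        }
    }
```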
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related to the match_all clause?
mediawiki.org also contains an (incomplete) CirrusSearch schema that gives a brief description of these fields and the various analysis chain components used. The all field is an optimization to give a strong first-pass filter against the search index.
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
The suggest field contains shingles (word n-grams) of the article's title and redirects, essentially a pre-calculation of phrase queries. The suggest matches might look like synonyms in the explain output, and redirects often do contain synonyms, but they also include misspellings, translations, and whatever else editors create redirects for. Matches on redirects are generally a strong relevance signal.
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
This is the fraction of page views on the wiki that go to that article.
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
This model is generated by mjolnir and essentially overwrites the score from the rest of the query. There is some information in wikitech (found there as it is more specific to the WMF deployment of mediawiki than mediawiki itself); a slide deck called From Clicks to Models might also give some insight into what's happening in that code base. It is perhaps important to know that mjolnir only applies to bag-of-words queries; queries invoking phrases or other expert functionality skip the ML model.
No one had asked for the models before; in case they might be useful, I dumped the current models from the ranking plugin. The dump contains both the feature definitions used and the decision trees generated by xgboost.
I didn't find an excuse to link it above, but maybe the draft page at CirrusSearch/Scoring that mentions some of the factors that go into retrieval and scoring, particularly for queries that can't be run through mjolnir models, might help as well.
You can add cirrusDumpQuery to your query
example:
https://en.wikipedia.org/w/index.php?title=Special:Search&cirrusDumpQuery=&search=cat+dog+chicken&ns0=1
more information:
https://www.mediawiki.org/wiki/Extension:CirrusSearch#API
You can't make Elasticsearch queries to Wikipedia directly, but CirrusSearch can generate many types of queries beyond fulltext search. It's not clear from your question exactly what type of query you are looking for, but it might be worth looking at the sorting options if you prefer to weight results by text similarity only, and not by things like page views.
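For convenience, the example URL above can be assembled programmatically. This is a small sketch using only the Python standard library; the helper name cirrus_dump_url is made up here, and the parameters are exactly the ones in the example URL.

```python
from urllib.parse import urlencode


def cirrus_dump_url(search_terms):
    """Build a Special:Search URL that asks CirrusSearch to dump its query."""
    params = {
        "title": "Special:Search",
        "cirrusDumpQuery": "",  # present but empty, as in the example URL
        "search": search_terms,
        "ns0": "1",             # main namespace only, as in the example URL
    }
    return "https://en.wikipedia.org/w/index.php?" + urlencode(params)
```

Calling cirrus_dump_url("cat dog chicken") yields the same query string as the example, except that the colon in Special:Search is percent-encoded.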

Exact match on some fields and "fuzzy" search on others?

I am attempting to create a query to exactly match on a few fields, such as account_id and from_addresses (which is an array), while also fuzzy matching on another field such as message_content. What is the best way to do this?
I have tried a Bool query with a few must and should parameters but can't seem to get it working.
I believe what you want to do is use filters, more specifically an AND filter. So you query message_content, but filter by account_id and from_addresses.
I don't know which library you are using, so I can't really provide any code examples.
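Independent of any client library, the request body might look roughly like the sketch below, written as a plain Python dict. It uses a bool query with filter clauses rather than the AND filter named above, and it assumes that account_id and from_addresses are indexed as exact (non-analyzed) values; the function name is hypothetical.

```python
def build_message_query(account_id, from_addresses, text):
    """Fuzzy full-text match on message_content, exact filters on the other fields."""
    return {
        "query": {
            "bool": {
                # Scored, typo-tolerant match on the message body.
                "must": [
                    {
                        "match": {
                            "message_content": {
                                "query": text,
                                "fuzziness": "AUTO",
                            }
                        }
                    }
                ],
                # Unscored exact conditions; every filter clause must hold.
                "filter": [
                    {"term": {"account_id": account_id}},
                    # Matches documents containing any of the given addresses.
                    {"terms": {"from_addresses": list(from_addresses)}},
                ],
            }
        }
    }
```

Note that a terms clause matches if any one of the listed addresses is present. If every address must be present, one term clause per address inside filter would express that instead.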

Haystack boosting based on specific value in specific field

I am using Haystack with ElasticSearch and I would like to perform boosts that don't just boost a term in general, but instead boost a term only when it is found on a specific field.
For instance, on my UserIndex, I would like to prioritize (boost) search results where the user is marked as active. is_active is a BooleanField on the index model. I know how to filter so that I only fetch active users, but how can I boost active users but not outright filter out inactive users? I could apply a boost to the field in UserIndex, but that doesn't seem like it would work without some way other than an outright filter to search against that BooleanField (since otherwise there are no search terms that the field boost would affect). I could apply a boost to the SearchQuerySet, but the boost() function takes a string which appears to just be a straight-up search term, and you cannot specify a field for that term to occur in.
I might be able to solve that issue in isolation with order_by, but I have a bunch of other complex boosts I want to do:
I want to be able to boost matching users if they have IDs in a list specified by the application at runtime (this is so I can boost users relative to the context of the page where the search button was pressed). I could simply boost a search term containing the user's ID, but then if that number was coincidentally in another field, it would boost that field too and thus give very strange results.
I want to be able to boost the searching user's friends. I currently have the list of every user's friends in a MultiValueField on the search index model. I want to pass the searching user's ID in with the search query, and boost any users in the index who have the searching user's ID in their friends list. Again, I have the same problem as above -- I can boost the ID, but I can't specify that I only want to boost the occurrence of that ID in that specific field.
I have a second BooleanField I want to boost by, similar to is_active but boosted by a smaller amount.
All of this is easy-ish if I can boost by a combination of a term and a field, but it seems very hard if I can only boost a term and not a field.
The only thing I have been able to think of so far is basically a hack: instead of BooleanFields, use CharFields with magic strings in them. Then boost those magic strings as search terms, and count on nobody accidentally using the magic strings in their inputted text. Likewise, instead of raw ids in my MultiValueFields, use ids prepended with magic strings. This is awkward, fragile and potentially buggy given that the behavior of the ElasticSearch standard tokenizer may be unpredictable given nonsensical "magic strings".
Another option I considered was using a Raw input type and adding ElasticSearch-specific syntax, but usage of Raw with ElasticSearch is almost entirely undocumented and the ElasticSearch boosting documentation itself is very thin.
Is there any way to solve this that does not involve mangling my index data in such a fashion?
In your mapping you could add:
"is_active": {
  "type": "boolean",
  "boost": 10.0
}
and
"friends": {
  "type": "integer",
  "index": "not_analyzed",
  "boost": 5.0
}
And then wrap your original query in a boolean query with a MUST on your original query, a SHOULD on is_active:true, and a SHOULD on friends:1234.
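The wrapping step could be sketched as below, as a plain Python dict. Note one deliberate difference: this sketch puts the boosts on the query clauses instead of in the mapping, which is an alternative to the mapping-level boost shown above and not the same thing. The function name and its default boost values are illustrative assumptions.

```python
def boosted_user_query(original_query, searcher_id,
                       active_boost=10.0, friend_boost=5.0):
    """Wrap a query so active users and the searcher's friends rank higher.

    Nothing is filtered out: the should clauses only add to the score.
    """
    return {
        "query": {
            "bool": {
                "must": [original_query],
                "should": [
                    # Boost applies only when this term is found in is_active.
                    {"term": {"is_active": {"value": True, "boost": active_boost}}},
                    # Boost applies only when the ID is found in friends.
                    {"term": {"friends": {"value": searcher_id, "boost": friend_boost}}},
                ],
            }
        }
    }
```

Because each term clause names its field, the searcher's ID appearing by coincidence in some other field does not trigger the boost.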

ElasticSearch / Tire & Keywords. Right way to match "or" for a keyword list?

I've got an Entity model (in Mongoid) that I'm trying to search on its keywords field which is an array. I want to do a query where I pass in an array of potential search terms, and any entity that matches any of the terms will pass.
I don't have this working well yet.
But the reason I'm asking this question is that it's more complex: I also DON'T want to return any entities that have been marked as "do not return", which I do via an "ignore_project_ids" parameter.
So, when I query, I get 0 results. I was using Bonsai.io. But, I've moved this to my own EC2 instance to reduce complexity/variables on solving the problem.
So, what am I doing wrong? Here are the relevant bits of code.
https://gist.github.com/3405763
You want a terms query rather than a term query - a term query is only interested in equality, whereas a terms query requires that the field match any of the specified values.
Given that you don't seem to care about the query score (you're sorting by another attribute), you'll get faster queries by using a filtered query and expressing your conditions as filters.
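A sketch of that shape as a raw query body, written as a Python dict and not in Tire's DSL. It uses a bool query with filter and must_not clauses, which is what later Elasticsearch versions use in place of the filtered query. The field name project_id is a guess (the gist's actual field is not shown here), and the function name is hypothetical.

```python
def keyword_filter_query(search_terms, ignore_project_ids):
    """Match entities having any of the keywords, excluding ignored projects."""
    return {
        "query": {
            "bool": {
                # terms: the keywords field must contain at least one value.
                "filter": [{"terms": {"keywords": list(search_terms)}}],
                # "project_id" is an assumed field name for the exclusion.
                "must_not": [{"terms": {"project_id": list(ignore_project_ids)}}],
            }
        }
    }
```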

Resources