Elasticsearch query for wikipedia pages - elasticsearch

I have indexed all wikipedia pages on elasticsearch, and now I would like to search through them according to a list of keywords that I have created. The documents on elasticsearch have only three fields: id for the page id, title for the page title and content for the page content (already clean of wikipedia markup).
My goal is to reproduce the mediawiki query api as much as possible, with parameters action=query and list=search. For instance, given the keywords "non riemannian metric spaces", a call to
https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=10&srprop=&srsearch=non%20riemannian%20metric%20spaces
gives a list of the most relevant pages for those keywords.
So far I have been using rather simple elasticsearch search queries, like for instance
POST _search
{
"query": {
"bool" : {
"must" : {
"match" : {
"content": {
"query": "non riemannian metric spaces"
}
}
},
"should" : {
"match" : {
"title": {
"query": "non riemannian metric spaces",
"boost": x
}
}
}
}
}
}
for several values of boost, like 1, 2 or 0.5. This gives already some decent results, in the sense that the pages I obtain are relevant to the keywords, but sometimes they are not quite the same I get with the mediawiki api.
I would be glad to hear some suggestions on how to fine-tune the elasticsearch query to mimic more accurately the mediawiki api behavior. Or even, since the mediawiki api itself is built with elasticsearch and cirrussearch, I would like to know whether the actual elasticsearch query for the entry point above with those specific parameters is openly available.
Thank you in advance!
UPDATE (after Robis Koopmans' answer): Seeing the actual query with cirrusDumpQuery has indeed been very useful. I do however have some followup questions concerning the query:
The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I ignore what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related with the match_all clause?
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
Please feel free to answer whichever question you know the answer to, and again thanks a lot!

The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I ignore what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
The .plain fields are generated as part of the elasticsearch mapping. The current settings and mappings are available to see how exactly they work. mediawiki.org includes a search glossary entry on the plain field as well. In general the top level field contains a highly processed form of the text, and the plain field uses minimal analysis.
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related with the match_all clause?
mediawiki.org also contains an (incomplete) CirrusSearch schema that gives a brief description of these fields and the various analysis chain components used. The all field is an optimization to give a strong first-pass filter against the search index.
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
Suggest field contains shingles (word ngrams) of the articles title and redirects, essentially a pre-calculation of phrase queries. The suggest might look like synonyms in the explain output, and they often contain those, but it also includes misspellings, translations, and numerous other reasons editors have for creating redirects. Matches on redirects are generally a strong relevance signal.
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
This is the fraction of page views on the wiki that go to that article.
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
This model is generated by mjolnir and essentially overwrites the score from the rest of the query. There is some information in wikitech (found there as it is more specific to the WMF deployment of mediawiki than mediawiki itself), also a slide deck called From Clicks to Models might give some insight into whats happening in that code base. Perhaps important to know mjolnir only applies to bag of words queries, queries invoking phrases or other expert functionality skip the ML model.
Noone had asked for the models before, if they might be useful i dumped the current models from the ranking plugin. This contains both the feature definitions used and the decision trees generated by xgboost.
I didn't find an excuse to link it above, but maybe the draft page at CirrusSearch/Scoring that mentions some of the factors that go into retrieval and scoring, particularly for queries that can't be run through mjolnir models, might help as well.

You can add cirrusDumpQuery to your query
example:
https://en.wikipedia.org/w/index.php?title=Special:Search&cirrusDumpQuery=&search=cat+dog+chicken&ns0=1
more information:
https://www.mediawiki.org/wiki/Extension:CirrusSearch#API

You can't make Elasticsearch queries to Wikipedia directly, but CirrusSearch can generate many types of queries beyond fulltext search. It's not clear from your question exactly what type of query you are looking for, but it might be worth to look at sorting options, if you prefer to weight results by text similarity only, and not things like page views.

Related

Elastic Search Query that identifies not only the results but also highlights the items that satisfy a particular attribute

We have a requirement that we think is a candidate for a Multi-Search query but we are not sure.
Say we are selling clothes.
The user can enter a type of clothing such as Shirts and we bring back all the shirts using a filter.
We would also like to provide the user with an option of them typing in a keyword such as "formal" or "beach" etc. But this keyword should not effect the results but identify with a flag which items in the results these keywords appear.
Any suggestion would be greatly appreciated.
What about using Named Queries?
The ideas is that this feature of Elasticsearch allows users to mark each clause in the query with a _name value identifying the particular clause. If the particular clauses, say a clause with "formal" keyword or a clause with "beach" keyword", are matched, the response will include a matched_queries prop with each result noting which particular item was matched.
Another option might be to use Highlights. The feature does allow you to specify a separate highlighting query. However, the response format here may not be as straightforward or meet your requirements.

elasticsearch scoring on multiple indexes

i have an index for any quarter of a year ("index-2015.1","index-2015.2"... )
i have around 30 million documents on each index.
a document has a text field ('title')
my document sorting method is (1)_score (2)created date
the problem is:
when searching for some text on on 'title' field for all indexes ("index-201*"), always the first results is from one index.
lets say if i am searching for 'title=home' and i have 10k documents on "index-2015.1" with title=home and 10k documents on "index-2015.2" with title=home then the first results are all documents from "index-2015.1" (and not from "index-2015.2", or mixed) even that on "index-2015.2" there are documents with "created date" higher then in "index-2015.1".
is there a reason for this?
The reason is probably, that the scores are specific to the index. So if you really have multiple indices, the result score of the documents will be calculated (slightly) different for each index.
Simply put, among other things, the score of a matching document is dependent on the query terms and their occurrences in the index. The score is calculated in regard to the index (actually, by default even to each separate shard). There are some normalizations elasticsearch does, but I don't know the details of those.
I'm not really able to explain it well, but here's the article about scoring. I think you want to read at least the part about TF/IDF. Which I think, should explain why you get different scores.
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
EDIT:
So, after testing it a bit on my machine, it seems possible to use another search_type, to achieve a score suitable for your case.
POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
"query" : {
"match": {
"title": "home"
}
}
}
The important part is search_type=dfs_query_then_fetch. If you are programming java or something similar, there should be a way to specify it in the request. For details about the search_types, refer to the documentation.
Basically it will first collect the term-frequencies on all affected shards (+ indexes). Therefore the score should be generalized over all these.
according to Andrei Stefan and Slomo, index boosting solve my problem:
body={
"indices_boost" : { "index-2015.4" : 1.4, "index-2015.3" : 1.3,"index-2015.2" : 1.2 ,"index-2015.1" : 1.1 }
}
EDIT:
using search_type=dfs_query_then_fetch (as Slomo described) will solve the problem in better way (depend what is your business model...)

Return a list of search results with results related to user first with ElasticSearch or Neo4j

I'm trying to choose a database/search engine to return a list of results which shows any results the user has a relationship with first, then others after. Similar to the way Facebook works where you search a business name and one's you have liked appear first then others after?
I've seen this question which is similar to what I need but I believe it only show's results for that user: How can ElasticSearch be used to implement social search?
Is this possible with either ElasticSearch, Neo4j or anything else?
Elasticsearch can certainly do this.
Results are returned from Elasticsearch based on the score, which basically means the better the match the bigger the score.
You could use the "bool" query to specify your query as a "must" and then the user match as a "should". Optionally you might want to add a "boost" to the should query so it scores highest if matched.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

Multiple field autocomplete with index type boost

What I'm trying to accomplish on a high level is an autocomplete input field which queries both customers and orders on multiple fields, with customers ranking higher for customer name searches.
It seems to me that there are various ways to approach this problem with the tools that elasticsearch provides.
The way that I have approached this is to use multi_match queries with prefix_phrase type in order to get partial queries to work across multiple fields.
For example, "bo" should return back matches for "Bob Smith" as well as "Adam Boss". I'm indexing fullname as a separate field from firstname and lastname, so that "adam boss" will return a valid prefix match as well.
In addition, I'd like to boost customer results - trying to do that with a boost param on the multi_match, but that doesn't seem to be working the way I'd expect it to.
What would be a straight forward way to tackle this problem?
One of the challenges I'm facing with the elasticsearch docs is that it's not always clear which properties and features apply to which others. For example, the multi_match documentation doesn't talk about using a custom boost, other than on a field-level.
I think the best way is using completion suggester of ES (v0.90.3+), please refer here for a real use case:
http://www.elasticsearch.org/blog/you-complete-me/
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-suggesters-completion.html

Searching only specific fields with elasticsearch

How can I tell Elasticsearch to exclude a field when searching by a term?
I have an index of users (names, email, certifications, experience, office...), but only certain people can search for users by certification. In my current PHP Lucene implementation I have 2 separate indexes with and without that data. Is there a way I can do this with only one users index? I assume I need to apply some kind of filter [1] [2], but don't see one that will allow me to ignore a field entirely.
If there is any way specifically to do this with Elastica (PHP Client) that would be even more helpful, but native ES would be equally as helpful.
Say I have 2 users in my index
Kevin Smith
Certified in Muffin Making
Mark Smith
Certified in Motorcycle jumping.
When a normal user searched for motorcycle, nothing should be returned, but if they search for Smith both should be returned.
A user with the ability to search the certifications field will return Mark if they search for motorcycle and both if they search for Smith.
I have not tested anything, but it seems that you might be able to set "include_in_all" to false on the mapping phase. That means that your field won't be include in the "_all". Then you just have to make a query on "_all" fields.
Note that the field is still available for search and indexed, you can query it by specifying it in the field query; it's just excluded from the "_all" parameters.
Again I haven't tested anything yet.

Resources