Return a list of search results with results related to user first with ElasticSearch or Neo4j - elasticsearch

I'm trying to choose a database/search engine to return a list of results which shows any results the user has a relationship with first, then others after. Similar to the way Facebook works where you search a business name and one's you have liked appear first then others after?
I've seen this question which is similar to what I need but I believe it only show's results for that user: How can ElasticSearch be used to implement social search?
Is this possible with either ElasticSearch, Neo4j or anything else?

Elasticsearch can certainly do this.
Results are returned from Elasticsearch based on the score, which basically means the better the match the bigger the score.
You could use the "bool" query to specify your query as a "must" and then the user match as a "should". Optionally you might want to add a "boost" to the should query so it scores highest if matched.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

Related

Elasticsearch multiple score fields

Maybe a dummy question: is it possible to have multiple score fields?
I use a custom score based on function_score query. This score is being displayed to the user to show, how much each document matches his/her preferences. So far so good.
But! The user should be able to filter the documents and (of course) sort them not only by the custom relevance (how much each document matches his/her preferences) but also by the common relevance - how much each document matches the filter criteria.
So my first idea was to place the score calculated by function_score query to a custom field but it does not seems to be supported.
Or am I completely wrong and I should use another approach?
I took a different approach - in case user applies some filter the I run the query without function_score percolation and use the score calculated by ES and sort by it. Then I take all IDs from the result page and run percolation query with these IDs to get the custom "matching score". It does not seems to cause noticeable slowdown.
Anyway, I welcome any feedback.

Elasticsearch query for wikipedia pages

I have indexed all wikipedia pages on elasticsearch, and now I would like to search through them according to a list of keywords that I have created. The documents on elasticsearch have only three fields: id for the page id, title for the page title and content for the page content (already clean of wikipedia markup).
My goal is to reproduce the mediawiki query api as much as possible, with parameters action=query and list=search. For instance, given the keywords "non riemannian metric spaces", a call to
https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srlimit=10&srprop=&srsearch=non%20riemannian%20metric%20spaces
gives a list of the most relevant pages for those keywords.
So far I have been using rather simple elasticsearch search queries, like for instance
POST _search
{
"query": {
"bool" : {
"must" : {
"match" : {
"content": {
"query": "non riemannian metric spaces"
}
}
},
"should" : {
"match" : {
"title": {
"query": "non riemannian metric spaces",
"boost": x
}
}
}
}
}
}
for several values of boost, like 1, 2 or 0.5. This gives already some decent results, in the sense that the pages I obtain are relevant to the keywords, but sometimes they are not quite the same I get with the mediawiki api.
I would be glad to hear some suggestions on how to fine-tune the elasticsearch query to mimic more accurately the mediawiki api behavior. Or even, since the mediawiki api itself is built with elasticsearch and cirrussearch, I would like to know whether the actual elasticsearch query for the entry point above with those specific parameters is openly available.
Thank you in advance!
UPDATE (after Robis Koopmans' answer): Seeing the actual query with cirrusDumpQuery has indeed been very useful. I do however have some followup questions concerning the query:
The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I ignore what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related with the match_all clause?
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
Please feel free to answer whichever question you know the answer to, and again thanks a lot!
The query has a set of similar multi_match clauses searching my keywords in fields like ["title.plain^1", "title^3"]. While I understand the ^n boost, I ignore what .plain refers to. Does it have to do with elasticsearch itself (i.e. is it a field derived from title at index time?) or is it something that has to do with the specific mediawiki mapping they use? In any case, I would appreciate some more information about this.
The .plain fields are generated as part of the elasticsearch mapping. The current settings and mappings are available to see how exactly they work. mediawiki.org includes a search glossary entry on the plain field as well. In general the top level field contains a highly processed form of the text, and the plain field uses minimal analysis.
At some other point in the query, there is a {"match": {"all": {...}}} clause. What exactly is the all key here? Is it a document field? Is it related with the match_all clause?
mediawiki.org also contains an (incomplete) CirrusSearch schema that gives a brief description of these fields and the various analysis chain components used. The all field is an optimization to give a strong first-pass filter against the search index.
What is the suggest field that appears in the query? In the score explanation it seems to be associated with synonyms. How are those handled in this case?
Suggest field contains shingles (word ngrams) of the articles title and redirects, essentially a pre-calculation of phrase queries. The suggest might look like synonyms in the explain output, and they often contain those, but it also includes misspellings, translations, and numerous other reasons editors have for creating redirects. Matches on redirects are generally a strong relevance signal.
To be performed after the search, there is a rescore clause with two other score functions. One of them uses the popularity_score of a wikipedia page. What is that?
This is the fraction of page views on the wiki that go to that article.
And finally, the most relevant score that ends up ranking the pages is the output of the sltr clause. In it, there is a "model": "enwiki-20220421-20180215-query_explorer", and in the score explanation it is identified with a LtrModel: naive_additive_decision_tree. I understand that this model is some stored LTR model. However, since it seems to be the most relevant number in the final sorting of the results, what exactly is that model and is it openly available?
This model is generated by mjolnir and essentially overwrites the score from the rest of the query. There is some information in wikitech (found there as it is more specific to the WMF deployment of mediawiki than mediawiki itself), also a slide deck called From Clicks to Models might give some insight into whats happening in that code base. Perhaps important to know mjolnir only applies to bag of words queries, queries invoking phrases or other expert functionality skip the ML model.
Noone had asked for the models before, if they might be useful i dumped the current models from the ranking plugin. This contains both the feature definitions used and the decision trees generated by xgboost.
I didn't find an excuse to link it above, but maybe the draft page at CirrusSearch/Scoring that mentions some of the factors that go into retrieval and scoring, particularly for queries that can't be run through mjolnir models, might help as well.
You can add cirrusDumpQuery to your query
example:
https://en.wikipedia.org/w/index.php?title=Special:Search&cirrusDumpQuery=&search=cat+dog+chicken&ns0=1
more information:
https://www.mediawiki.org/wiki/Extension:CirrusSearch#API
You can't make Elasticsearch queries to Wikipedia directly, but CirrusSearch can generate many types of queries beyond fulltext search. It's not clear from your question exactly what type of query you are looking for, but it might be worth to look at sorting options, if you prefer to weight results by text similarity only, and not things like page views.

Elastic Search and Search Ranking Models

I am new to Elastic Search. I would like to know if the following steps are how typically people use ES to build a search engine.
Use Elastic Search to get a list of qualified documents/results based on a user's input.
Build and use a search ranking model to sort this list.
Use this sorted list as the output of the search engine to the user.
I would probably add a few steps
Think about your information model.
What kinds of documents are you indexing?
What are the important fields and what field types are they?
What fields should be shown in the search result?
All this becomes part of your mapping
Index documents
Are the underlying data changing or can you index it just once?
How are you detecting new docuemtns/deletes/updates?
This will be included in your connetors, that can be set up in multiple ways, for example using the Documents API
A bit of trial and error to sort out your ranking model
Depending on your use case, the default ranking may be enough.
have a look at the Search API to try out different ranking.
Use the search result list to present the results to the end user

Cannot use "OR" with "NOT _exists_" in Kibana 6.8.0 search bar

I am trying to create one query in the Kibana search bar to retrieve some specific documents.
The goal is to get the documents that either have the field "myDate" before 2019-10-08 or "myDate" does not exist.
I have documents that meet one or the other condition.
I started by creating this query :
myDate:<=2019-10-08 OR NOT _exists_:myDate
But no documents were returned.
Since it did not work, I tried some other ways i found online :
myDate:<=2019-10-08 OR NOT (_exists_:myDate)
myDate:<=2019-10-08 OR !(_exists_:myDate)
myDate:<=2019-10-08 OR NOT (myDate:*)
But still, no results.
When I use either "part" of the "OR" condition, it works perfectly : I get either the documents who have myDate<=2019-10-08 or the ones that do not have a "myDate" field filled.
But when I try with both conditions, I get no document.
I have to use only the search bar to find these documents, neither an elasticsearch rest query nor by using kibana filters.
Thank you for your help :)
Below query works. Use Inspect button in kibana to see what query is actually being fired and make sure you are using correct index pattern as well.
(myDate:<=2019-12-31) OR (NOT _exists_:myDate)
Take a look at Query DSL documentation for Boolean operators for more better understanding with different use cases

How do I match a partial result with Elastic Search?

I'm trying to find out how to properly write my query in order to do a LIKE query with ElasticSearch.
Let's say I have a record of firstname and I want to find every one where there is ma in it.
So I've tried multiple things but none are working. Here is a list :
{"match": {"text": ".*ma.*"}}
{"match": {"text": "*ma*"}}
{"match":{"text"{"query":"ma","fuzziness":"AUTO","prefix_length":1}}}
Do you have an idea of how to do that or where am I missing something?
You might look into using the N-Gram tokenizer to split your documents' tokens up into their substrings.
This will allow you to search against the index with the "partial" matches you're describing.
Bear in mind that this will affect how your documents are tokenized for search so, if you are using other types of analysis for other parts of your application, you may want to create additional fields for your N-Gram tokenized values (or even create a separate index for them).
As a rule of thumb, always try to optimize your index for the queries you want to perform, rather than trying to solve your search problems at query time.

Resources