Elastic Search Word Combination Understanding - elasticsearch

Example: the index contains documents with movie names such as "movie_name", "movie_name part 2", "movie_name part 3" and so on. The search query "movie_name part 2" returns the exact document. But how can I search for "movie_name 2" and still get the document "movie_name part 2"? Currently I only get "movie_name" as the result document.

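You can use a wildcard query, as below. Note that wildcard patterns only support * (any sequence of characters) and ? (any single character), not regex syntax such as ^, $ or character classes. The pattern is matched against individual indexed terms, so to match the whole name the field needs to be a keyword (not analyzed) field.
{
"query": {
"wildcard": {
"<field_name>": {
"value": "movie_name*2"
}
}
}
}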
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html
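As a rough illustration of how a user query such as "movie_name 2" could be turned into that kind of pattern, here is a minimal Python sketch. The helper names and the field name title.keyword are hypothetical, and fnmatch is used only as a local stand-in to preview which names a * pattern would match; it is not how Elasticsearch evaluates the query.

```python
from fnmatch import fnmatchcase


def to_wildcard_pattern(user_query):
    """Join the whitespace-separated words of the user query with '*'."""
    return "*".join(user_query.split())


def wildcard_query(field, pattern):
    """Build the request body for a wildcard query on a keyword field."""
    return {"query": {"wildcard": {field: {"value": pattern}}}}


def simulate_wildcard(values, pattern):
    """Local preview only: fnmatch also treats [..] specially, Elasticsearch does not."""
    return [v for v in values if fnmatchcase(v, pattern)]


names = ["movie_name", "movie_name part 2", "movie_name part 3"]
pattern = to_wildcard_pattern("movie_name 2")      # "movie_name*2"
body = wildcard_query("title.keyword", pattern)    # hypothetical field name
print(simulate_wildcard(names, pattern))
```

Keep in mind that such a pattern is loose: "movie_name*2" would also match a name like "movie_name part 12".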

Related

Requiring Phrase Matches in Elasticsearch SimpleStringQuery

I'm creating a simple search engine using Elasticsearch 7.7 and the python elasticsearch_dsl package version 7.0.0. I'm using the simple_query_string search, because I'd like to enable most common search functionality (boolean operators, phrase search) without having to parse the query myself. This is largely working well except for the phrase match functionality.
I would like to ensure that all results include a phrase match if the query contains a phrase, the way Google works: if I search for "green eggs" and ham, there will be no results that do not include "green eggs".
Let's assume I have 3 documents in my index:
{
"question": "I love my phrase",
"background": "dont you"
},
{
"question": "I love my phrase",
"background": "and other terms"
},
{
"question": "I have other terms",
"background": "and more"
}
What I am seeing now:
As expected, the below query only returns the first two documents, which have "my phrase" in one of the fields.
{
'simple_query_string':
{
'query': '"my phrase"',
'fields': ['question', 'background']
}
}
Contrary to what I expect, the below query will return all 3 results, with the 3rd one scored higher than the 1st.
{
'simple_query_string':
{
'query': '"my phrase" other terms',
'fields': ['question', 'background']
}
}
How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
Things I have tried that have not worked:
'query': '"my phrase" AND (other terms)'
'query': '"my phrase" AND other terms'
Thank you
> Contrary to what I expect, the below query will return all 3 results
By default, the words in a query are combined with the OR operator: see the description of the default_operator parameter in the simple_query_string documentation. Your second query is interpreted as "my phrase" OR other OR terms, so it returns all 3 results: each document contains at least one of "my phrase", other, terms.
> How can I alter my query so that searching for '"my phrase" other terms' will not return the 3rd document because it does not contain the phrase search, but score the 2nd document higher than the 1st because it contains additional search terms outside of the phrase?
AFAIK, this isn't possible with a simple_query_string search. You can try the query_string search instead, which supports boolean operators: a leading + marks a term or phrase as required, while unmarked terms stay optional and only contribute to the score. With that you can write a query that gives the desired result:
{
"query": {
"query_string": {
"query": "+\"my phrase\" other terms",
"fields": ["question", "background"]
}
}
}
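If you would rather not hand the raw user input to query_string, the same "required phrase, optional extra terms" idea can also be expressed as a bool query. The sketch below is only one possible way to do it, not part of the answer above: split_query and phrase_required_query are hypothetical helpers, and the parsing handles nothing beyond double-quoted phrases.

```python
import re


def split_query(user_query):
    """Separate double-quoted phrases from the remaining free terms."""
    phrases = re.findall(r'"([^"]+)"', user_query)
    rest = " ".join(re.sub(r'"[^"]*"', " ", user_query).split())
    return phrases, rest


def phrase_required_query(user_query, fields):
    """Phrases go into 'must' (required), remaining terms into 'should' (scoring only)."""
    phrases, rest = split_query(user_query)
    bool_query = {
        "must": [
            {"multi_match": {"query": p, "type": "phrase", "fields": fields}}
            for p in phrases
        ]
    }
    if rest:
        bool_query["should"] = [{"multi_match": {"query": rest, "fields": fields}}]
    return {"query": {"bool": bool_query}}


body = phrase_required_query('"my phrase" other terms', ["question", "background"])
print(body)
```

Because the should clause sits next to a must clause, it does not exclude documents; it only raises the score of documents that also contain the extra terms.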

Elasticsearch simple query string: removing documents containing words

I created a toy example to express what I mean. Suppose we have an index whose documents contain the words Text and Texture.
I'd like to select all documents containing the word Text (I'm using the simple query string).
When I use the query "query": "Text", I get areas 1, 2 and 3 from the picture below.
When I use the query "query": "Text -Texture", I get only area 3 from the picture below.
How can I get both areas 2 and 3?
Thanks.
To understand your problem you need to post your query.
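Try a term query. Note that term queries are not analyzed, so if myField is a text field indexed with the standard analyzer, the indexed token is lowercase ("text") and the value has to match it exactly: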
{
"query": {
"term": {
"myField": "Text"
}
}
}

elasticsearch: or operator, number of matches

Is it possible to score my searches according to the number of matches when using operator "or"?
Currently query looks like this:
"query": {
"function_score": {
"query": {
"match": {
"tags.eng": {
"query": "apples banana juice",
"operator": "or",
"fuzziness": "AUTO"
}
}
},
"script_score": {
"script": # TODO
},
"boost_mode": "replace"
}
}
I don't want to use "and" operator, since I want documents containing "apple juice" to be found, as well as documents containing only "juice", etc. However a document containing the three words should score more than documents containing two words or a single word, and so on.
I found a possible solution here https://github.com/elastic/elasticsearch/issues/13806
which uses bool queries. However I don't know how to access the tokens (in this example: apples, banana, juice) generated by the analyzer.
Any help?
Based on the discussions above I came up with the following solution, which is a bit different from what I imagined when I asked the question, but works for my case.
First of all I defined a new similarity:
"settings": {
"similarity": {
"boost_similarity": {
"type": "scripted",
"script": {
"source": "return 1;"
}
}
}
...
}
Then I had the following problem:
A query for "apple banana juice" gave the same score to a doc with tags ["apple juice", "apple"] and to another doc with tags ["banana", "apple juice"], although I would like the second one to score higher.
From another discussion I found out that this was caused by my using a nested field, so I added a regular text field to address it.
But I also wanted to distinguish between a doc with tags ["apple", "banana", "juice"] and another doc with the tag ["apple banana juice"] (all three words in the same tag). The final solution was therefore to keep both fields (a nested field and a text field) for my tags.
Finally, the query consists of a bool query with two should clauses: the first should clause runs on the text field and uses an "or" operator. The second should clause runs on the nested field and uses an "and" operator.
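The exact query is not shown in the post, so the following Python sketch is only a guess at its shape. The field names are assumptions: tags_text stands for the plain text field, and tags is assumed to be the nested path containing tags.eng.

```python
def tag_query(text, text_field="tags_text", nested_path="tags", nested_field="tags.eng"):
    """Bool query with two should clauses, as described above (field names are hypothetical)."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Clause 1: any of the words may match in the plain text field.
                    {"match": {text_field: {"query": text, "operator": "or"}}},
                    # Clause 2: extra score when a single nested tag contains all words.
                    {
                        "nested": {
                            "path": nested_path,
                            "query": {
                                "match": {nested_field: {"query": text, "operator": "and"}}
                            },
                        }
                    },
                ]
            }
        }
    }


body = tag_query("apple banana juice")
print(body)
```

A document matching both clauses gets the sum of both scores, which is what would separate ["apple banana juice"] from ["apple", "banana", "juice"].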
Although I found a solution for this specific issue, I still face a few other problems when using ES to search for tagged documents. The examples in the documentation seem to work very well for full-text search. But does someone know where I can find something more specific to tagged documents?

Elasticsearch how to match documents for which the field tokens are a sub-set of the query tokens

I have a keyword/key-phrase field that I tokenize using the standard analyser. I want this field to match if there is a search phrase that has all tokens of this field in it.
For example if the field value is "veni, vidi, vici" and the search phrase is "Ceaser veni,vidi,vici" I want this search phrase to match but search phrase "veni, vidi" not match.
I also need "vidi, veni, vici" (weird!) to match. So the positions and ordering of the terms is not really important. A phrase match would not quite work for me I think.
I can use "bool query" with "minimum_should_match" parameter for this specific example but that is not really what I want as minimum should match is about ratio/number of tokens in the search phrase.
Pure ES solution would go like this. You will need two requests.
1) First you need to pass user query through analyze api to get all the search tokens.
curl -XGET 'localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
"analyzer" : "standard",
"text" : "Ceaser veni,vidi,vici"
}'
You will get 4 tokens: ceaser, veni, vidi, vici. You need to pass these tokens as an array to the next search request.
2) We need to search for documents whose tokens are subset of search tokens.
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"query": {
"match": {
"title": "Ceaser veni,vidi,vici"
}
}
},
{
"script": {
"script": "if(search_tokens.containsAll(doc['title'].values)){return true;}",
"params": {
"search_tokens": [
"ceaser",
"veni",
"vidi",
"vici"
]
}
}
}
]
}
}
}
}
}
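Here the job of the first match query inside the filter is to narrow down the documents on which the script has to run. The containsAll method checks whether the document's tokens are a subset of the search tokens. This will be slow but will do the job with your current setup. One big improvement would be to store the tokens as an array in a separate field, so that doc['title'].values can be replaced with that field, which will speed up the script. Note that the filtered query was deprecated in Elasticsearch 2.0 and removed in 5.0; on newer versions the same structure is written as a bool query with a filter clause.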
Hope this helps!
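To make the subset condition concrete, here is a small Python sketch of what the script filter checks. It runs locally, not in Elasticsearch, and the analyze function is only a rough approximation of the standard analyzer (lowercase, split on anything that is not a letter or digit).

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def field_matches(field_value, search_phrase):
    """True when every token of the field also occurs in the search phrase."""
    return set(analyze(field_value)) <= set(analyze(search_phrase))


field = "veni, vidi, vici"
for phrase in ["Ceaser veni,vidi,vici", "veni, vidi", "vidi, veni, vici"]:
    print(phrase, "->", field_matches(field, phrase))
```

Using sets means that neither order nor repetition of tokens matters, which fits the requirement that "vidi, veni, vici" should match as well.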
There is no built-in solution, but this works:
1) Add an extra field with the number of terms in the field for each document. So in your "veni, vidi, vici" example, you would have a field like "field_term_count" : 3.
2) Perform a separate match search for each token in the search query.
3) Sum the number of searches that matched for each document with at least one match (e.g. in a hashtable with the document ID as key and the count as value).
4) Compare the number of matches from step 3 to the "field_term_count" field for each of the documents with matches. If they are equal then the document is a match.
Then "Ceaser veni,vidi,vici" will match but the search phrase "veni, vidi" will not, as desired. It should be quite fast for reasonable numbers of matches.
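The counting steps can be sketched in Python as follows. This is a local simulation under stated assumptions: the docs dictionary stands in for the index, search_token stands in for one match search per token, and field_term_count is assumed to count distinct terms.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the index.
docs = {
    "doc1": {"tokens": {"veni", "vidi", "vici"}, "field_term_count": 3},
    "doc2": {"tokens": {"alea", "iacta", "est"}, "field_term_count": 3},
}


def search_token(index, token):
    """Stand-in for one match search: IDs of the documents containing the token."""
    return [doc_id for doc_id, doc in index.items() if token in doc["tokens"]]


def subset_matches(index, search_tokens):
    """Count matching searches per document and compare with field_term_count."""
    counts = Counter()
    for token in set(search_tokens):  # de-duplicate so a repeated token is counted once
        for doc_id in search_token(index, token):
            counts[doc_id] += 1
    return sorted(d for d, c in counts.items() if c == index[d]["field_term_count"])


print(subset_matches(docs, ["ceaser", "veni", "vidi", "vici"]))
print(subset_matches(docs, ["veni", "vidi"]))
```

Against a real cluster, the per-token searches would be sent as separate requests and the counting would happen client-side in the same way.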

How to filter results based on frequency of repeating terms in an array in elasticsearch

I have an array field with a lot of keywords, and I need to filter the documents based on how many times a particular keyword is repeated in those arrays.
For example, if my field name is "nationality" and document 1 consists of the following
doc1
nationality :
["US","UK","Australia","India","US","US"]
and for doc2
nationality:
["US","UK","US","US","US","China"]
I want only those documents to be shown where the term "US" occurs more than 3 times. That would make only doc2 to be shown. How to do this?
You can implement this with scripting.
{
"query": {
"filtered": {
"filter": {
"script": {
"script": "_index['nationality']['US'].tf() > 3"
}
}
}
}
}
In this script the array "nationality" is checked for the term "US", and the count is taken with tf() (term frequency). Only the documents with a term frequency greater than three are shown in the results: doc1 contains "US" three times and is filtered out, doc2 contains it four times and is kept. You can learn more about the filter operations here
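For reference, the condition the script expresses can be checked locally with a few lines of Python. This only illustrates the "more than 3 occurrences" rule on the two example documents; the helper name is hypothetical and nothing here talks to Elasticsearch.

```python
docs = {
    "doc1": ["US", "UK", "Australia", "India", "US", "US"],
    "doc2": ["US", "UK", "US", "US", "US", "China"],
}


def docs_with_term_more_than(index, term, threshold):
    """IDs of the documents whose array contains the term more than `threshold` times."""
    return [doc_id for doc_id, values in index.items() if values.count(term) > threshold]


print(docs_with_term_more_than(docs, "US", 3))
```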
