problems with phrase matching in elasticsearch

I'm trying to perform Phrase matching using elasticsearch.
Here is what I'm trying to accomplish:
data -
1: { "test": { "title": "text1 text2" } }
2: { "test": { "title": "text3 text4" } }
3: { "test": { "title": "text5" } }
4: { "test": { "title": "text6" } }
Search terms:
If I lookup for "text0 text1 text2 text3" - It should return #1 (matches full string)
If I lookup for "text6 text5 text4 text3" - It should return #4, #3, but not #2 as its not in same order.
Here is what I've tried:
set the index_analyzer as keyword, and search_analyzer as standard
also tried creating custom tokens
but none of my solutions lets me match a substring of the search query against the keyword stored in the document.
If anyone has written similar queries, could you share how the mappings are configured and what kind of query is used?

What I see here is this: you want your search to match on any tokens sent in the query, but whenever those tokens do match, they must match the title exactly and in order.
This means that indexing your title field as keyword would get you that mandatory exact match. However, the standard analyzer at search time would never match titles containing spaces, because your index token would be ["text1 text2"] while your search tokens would be ["text1", "text2"]. You also can't use a phrase match with any slop value, or else your token-order requirement will be ignored.
So, what you really need is to generate keyword tokens at index time, but to generate shingles whenever you search. Your shingles will maintain order, and if one of them matches, consider it a hit. I would set the filter to not output unigrams, but to allow unigrams if there are no shingles. This means that if you have just one word, it will output that single token, but if it can combine your search words into various numbers of shingled tokens, it will not emit single-word tokens.
PUT
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle": {
          "type": "shingle",
          "max_shingle_size": 50,
          "output_unigrams": false,
          "output_unigrams_if_no_shingles": true
        }
      },
      "analyzer": {
        "my_shingler": {
          "filter": [
            "lowercase",
            "asciifolding",
            "my_shingle"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        }
      }
    }
  }
}
Then you just want to set your type mapping to use the keyword analyzer for index and the `my_shingler` analyzer for search.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html
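As a rough sketch (not from the original answer), the mapping could look something like the following, assuming a hypothetical index name my_index, the test type and title field from the question, and the search_analyzer mapping parameter (on older releases the index-side analyzer was set via index_analyzer; on current ones you'd use "type": "text" with "analyzer"):
PUT my_index/_mapping/test
{
  "test": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "keyword",
        "search_analyzer": "my_shingler"
      }
    }
  }
}
With that in place, a plain match query against title should only hit documents whose entire title appears, in order, inside the query string (case differences aside, since the keyword analyzer does not lowercase).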

Related

Elastic query bool must match issue

Below is the query part of an Elastic GET API call run via the command line inside an OpenShift pod. I get all the matching as well as non-matching elements in the fetch of 2000 documents. How can I limit it to only the matching elements?
I want to specifically get only {"kubernetes.container_name":"xyz"}.
Any suggestions will be appreciated.
-d '{"query": {"bool": {"must": {"match": {"kubernetes.container_name": "xyz"}}, "filter": {"range": {"#timestamp": {"gte": "now-2m", "lt": "now-1m"}}}}}, "_source": ["#timestamp", "message", "kubernetes.container_name"], "size": 2000}'
For exact matches there are two things you need to do:
Make use of term queries
Ensure that the field is of the keyword datatype
The text datatype goes through an analysis phase.
For example, if your data is This is a beautiful day, then during ingestion the text datatype breaks the sentence into tokens, lowercases them to [this, is, a, beautiful, day] and adds them to the inverted index. This happens via the Standard Analyzer, which is the default analyzer applied to text fields.
So when you query, the analyzer is applied again at query time and Elasticsearch checks whether the resulting words are present in the respective documents. As a result, you see documents appearing even without an exact match.
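You can see this tokenization for yourself with the _analyze API; a quick illustrative call (no index required):
POST _analyze
{
  "analyzer": "standard",
  "text": "This is a beautiful day"
}
The response lists the tokens [this, is, a, beautiful, day], which is exactly what ends up in the inverted index for a text field.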
In order to do an exact match, you need to make use of a keyword field, as it does not go through the analysis phase.
What I'd suggest is to create a keyword sibling field for the text field in the manner below, and then re-ingest all the data:
Mapping:
PUT my_sample_index
{
  "mappings": {
    "properties": {
      "kubernetes": {
        "type": "object",
        "properties": {
          "container_name": {
            "type": "text",
            "fields": {              <--- Note this
              "keyword": {           <--- This is the container_name.keyword field
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }
}
Note that I'm assuming you are making use of object type.
Request Query:
POST my_sample_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "kubernetes.container_name.keyword": {
              "value": "xyz"
            }
          }
        }
      ]
    }
  }
}
Hope this helps!

Elasticsearch search result relevance issue

Why does match query return less relevant results first? I have an index field named normalized. Its mapping is:
"normalized": {
  "type": "text",
  "analyzer": "autocomplete"
}
settings for this field are:
"analysis": {
  "filter": {
    "autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": "1",
      "max_gram": "20"
    }
  },
  "analyzer": {
    "autocomplete": {
      "filter": [
        "lowercase",
        "asciifolding",
        "autocomplete_filter"
      ],
      "type": "custom",
      "tokenizer": "standard"
    }
  }
}
So, as far as I know, it produces ASCII-folded, lowercased edge-ngram tokens, e.g. MOUSE = m, mo, mou, mous, mouse.
The problem is that a request like:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "normalized": "simag"
        }
      }
    }
  }
}
returns results like
"siman siman service"
"mgr simona simunkova simiki"
"Siman - SIMANS"
"simunek simunek a simunek"
.....
But there is no SIMAG which contains all the letters of the match phrase.
How can I make the most relevant results be the words that contain all the letters of the query, ranked before the results that do not contain all the letters?
I hope somebody understands what I need.
Thanks.
PS: I am not sure but what about this query:
{
  "query": {
    "bool": {
      "should": [
        { "term": { "normalized": "simag" } },
        { "match": { "normalized": "simag" } }
      ]
    }
  }
}
Does it make sense in comparison to previous code?
Please note that the match query is analyzed, which means that the same analyzer that was used at index time for the field is also applied to your query string at search time.
In your case, you applied the autocomplete analyzer to your normalized field, and as you mentioned, it generates the tokens below for MOUSE:
MOUSE = m, mo, mou, mous, mouse
In a similar way, if you search for mouse using the match query on the same field, it will search for the query strings m, mo, mou, mous, mouse. Hence results containing words like mousee or mouser will also come back, because at index time those words produced tokens that match the tokens generated from the search term.
Read more about the match query on the Elastic site: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html. The first line itself explains your search results:
match queries accept text/numerics/dates, analyzes them, and
constructs a query
If you want to go deeper and understand how your search query matches the documents and how each score is computed, use the Explain API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html
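A minimal sketch of such a call (the names here are illustrative, not from the question), assuming an index named my_index, a document with id 1, and the 7.x-style endpoint:
GET my_index/_explain/1
{
  "query": {
    "match": {
      "normalized": "simag"
    }
  }
}
The response breaks the score down per matching term, so you can see exactly which edge-ngram tokens contributed and why the shorter partial matches rank where they do.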

Elastic exact matching and substring matching together

I know that Elastic has a "keyword" type for finding something with exact matching. Ex:
"address": { "type": "keyword"}
That's cool, exact matching works!
but I would like to have both "exact matching" and "sub-string" matching. So I decided to create the following mapping:
"address": { "type": "text" , "index": true }
Problem
If I have "text" type, how can I search exact matching string? (not sub-string). I've tried several ways but does not works:
GET testing_index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "address": "washington"
        }
      }
    }
  }
}
or
GET testing_index/_search
{
  "query": {
    "match": {
      "address": "washington"
    }
  }
}
I just need some universal mapping:
to find an exact string
to find sub-strings
I hope elastic can do this.
By default, text fields use the default analyzer, which drops most punctuation, breaks up text into individual words, and lower cases them. For instance, the standard analyzer would turn the string “Quick Brown Fox!” into the terms [quick, brown, fox]. As you can imagine, this makes it difficult to write an exact match query against the text field. For your use case, I suggest one of 2 options:
store it as keyword, and accomplish sub-string-like matching using wildcard or fuzzy queries. Wildcard queries, in particular queries with a leading wildcard, are notoriously slow, so proceed with caution.
store the field twice: once as keyword and once as text (see the sketch below). The obvious downside here is bloating the size of the index.
For more background, see the "Term Query" Elasticsearch documentation, and in particular the section on "Why doesn’t the term query match my document?": https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html
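A minimal sketch of the second option (not from the original answer; the sub-field name raw is arbitrary), reusing the address field and testing_index from the question with a 7.x-style typeless mapping:
PUT testing_index
{
  "mappings": {
    "properties": {
      "address": {
        "type": "text",
        "fields": {
          "raw": { "type": "keyword" }
        }
      }
    }
  }
}
A term query on address.raw then gives the exact match, while match on the analyzed address field covers the looser cases:
GET testing_index/_search
{
  "query": {
    "term": { "address.raw": "washington" }
  }
}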

In Elasticsearch, how do I search for an arbitrary substring?

In Elasticsearch, how do I search for an arbitrary substring, perhaps including spaces? (Searching for part of a word isn't quite enough; I want to search any substring of an entire field.)
I imagine it has to be in a keyword field, rather than a text field.
Suppose I have only a few thousand documents in my Elasticsearch index, and I try:
"query": {
"wildcard" : { "description" : "*plan*" }
}
That works as expected--I get every item where "plan" is in the description, even ones like "supplantation".
Now, I'd like to do
"query": {
"wildcard" : { "description" : "*plan is*" }
}
...so that I might match documents with "Kaplan isn't" among many other possibilities.
It seems this isn't possible with wildcard, match prefix, or any other query type I might see. How do I simply search on any substring? (In SQL, I would just do description LIKE '%plan is%')
(I am aware any such query would be slow or perhaps even impossible for large data sets.)
Have you tried the regexp query in Elasticsearch? It sure does sound like something you might be interested in.
I was hoping there might be something built into Elasticsearch for this, given that this simple substring search seems like a very basic capability (thinking about it, it is implemented as strstr() in C, LIKE '%%' in SQL, Ctrl+F in most text editors, String.IndexOf in C#, etc.), but this seems not to be the case. Note that the regexp query doesn't support case insensitivity, so I also needed to pair it with this custom analyzer, so that the index is all-lowercase. Then I can convert my search string to lowercase as well.
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    ...
    "description": { "type": "text", "analyzer": "lowercase_keyword" }
  }
}
Example query:
"query": {
"regexp" : { "description" : ".*plan is.*" }
}
Thanks to Jai Sharma for leading me; I just wanted to provide more detail.

How to match parts of a word in elasticsearch?

How can I match parts of a word to the parent word? For example: I need to match "eese" or "heese" to the word "cheese".
The best way to achieve this is using an edgeNGram token filter combined with two reverse token filters. So, first you need to define a custom analyzer called reverse_analyzer in your index settings like below. Then you can see that I've declared a string field called your_field with a sub-field called suffix which has our custom analyzer defined.
PUT your_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "substring", "reverse"]
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "your_field": {
          "type": "string",
          "fields": {
            "suffix": {
              "type": "string",
              "analyzer": "reverse_analyzer"
            }
          }
        }
      }
    }
  }
}
Then you can index a test document with "cheese" inside, like this:
PUT your_index/your_type/1
{"your_field": "cheese"}
When this document is indexed, the your_field.suffix field will contain the following tokens:
e
se
ese
eese
heese
cheese
Under the hood what is happening when indexing cheese is the following:
The keyword tokenizer will emit a single token => cheese
The lowercase token filter will put the token in lowercase => cheese
The reverse token filter will reverse the token => eseehc
The substring token filter will produce different tokens of length 1 to 10 => e, es, ese, esee, eseeh, eseehc
Finally, the second reverse token filter will reverse again all tokens => e, se, ese, eese, heese, cheese
Those are all the tokens that will be indexed
So we can finally search for eese (or any suffix of cheese) in that sub-field and find our match
POST your_index/_search
{
  "query": {
    "match": {
      "your_field.suffix": "eese"
    }
  }
}
=> Yields the document we've just indexed above.
You can do it in two ways:
If you need it to happen only for some searches, then in the search box you can just pass
*eese* or *heese*
i.e. put a * at the beginning and end of your search word. If you need it for every search, build the query string as
"*#{params[:query]}*"
This will match your parent word and return the result.
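In query DSL terms this corresponds to a wildcard query; a minimal sketch, borrowing the your_index / your_field names from the answer above:
GET your_index/_search
{
  "query": {
    "wildcard": {
      "your_field": "*eese*"
    }
  }
}
Keep in mind that leading-wildcard patterns like this scan the whole term dictionary, so they get slow on large indices.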
There are multiple ways to do this
The analyzer approach - here you use an ngram tokenizer to break all the words into sub-tokens, so for the word "cheese" -> ["chee", "hees", "eese", "cheese"] and all kinds of substrings are generated (a sketch follows after this list). With this the index size grows, but search speed is optimized.
The wildcard query approach - in this approach a scan happens over the inverted index. This does not occupy additional index space, but it takes more time at search time.
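A minimal sketch of the analyzer approach (the index name, field name, and gram sizes are all illustrative, not from the answer), assuming a 7.x-style typeless mapping; 4-character ngrams reproduce the chee / hees / eese example above:
PUT substring_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 4
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "your_field": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}
A match query for eese on your_field then hits the cheese document, because eese was indexed as one of its ngram tokens.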

Resources