Tokens at index time vs. query time are not the same when using the common_grams filter in Elasticsearch

I want to use the common_grams token filter, based on this link.
My Elasticsearch version is 7.17.8.
Here are the settings of my index in Elasticsearch.
I have defined a filter named "common_grams" that uses "common_grams" as its type.
I have defined a custom analyzer named "index_grams" that uses "whitespace" as its tokenizer and the above filter as a token filter.
I have just one field, named "title_fa", and I have used my custom analyzer for this field.
PUT /my-index-000007
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "boost": 40
      }
    }
  }
}
It works fine at index time, and the tokens are what I expect. Here I get the tokens via the Kibana Dev Tools console.
GET /my-index-000007/_analyze
{
  "analyzer": "index_grams",
  "text": "brown is the"
}
Here are the resulting tokens for that text.
{
  "tokens" : [
    {
      "token" : "brown",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "brown_is",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "is_the",
      "start_offset" : 6,
      "end_offset" : 12,
      "type" : "gram",
      "position" : 1,
      "positionLength" : 2
    },
    {
      "token" : "the",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 2
    }
  ]
}
When I search for "brown is the", I expect these tokens to be searched:
["brown", "brown_is", "is", "is_the", "the" ]
But these are the tokens that will actually be searched:
["brown is the", "brown is_the", "brown_is the"]
Here you can see the details: [screenshot: query-time tokens]
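One way to reproduce these query-time tokens (a sketch, not from the original post) is the validate API with explain=true, which prints the Lucene query that Elasticsearch actually builds from the analyzed text:
GET /my-index-000007/_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "brown is the",
      "default_field": "title_fa"
    }
  }
}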
UPDATE:
I have added a sample document like this:
POST /my-index-000007/_doc/1
{ "title_fa" : "brown" }
When I search "brown coat"
GET /my-index-000007/_search
{
"query": {
"query_string": {
"query": "brown is coat",
"default_field": "title_fa"
}
}
}
it returns the document, because the query searches for:
["brown", "coat"]
When I search for "brown is coat", it can't find the document, because the query searches for:
["brown is coat", "brown_is coat", "brown is_coat"]
Clearly, when the query contains a common word, it behaves differently, and I guess that is because the index-time tokens and query-time tokens differ.
Do you know where I am going wrong? Why is it acting differently?
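A commonly suggested setup for this (a sketch based on the common_grams documentation, not something the original post confirmed) is to keep the bigram analyzer for indexing but add a second common_grams filter with query_mode enabled, used in a dedicated search_analyzer, so that query-time analysis emits bigrams compatible with the index:
PUT /my-index-000008
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        },
        "search_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams_query" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "the", "is" ]
        },
        "common_grams_query": {
          "type": "common_grams",
          "common_words": [ "the", "is" ],
          "query_mode": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title_fa": {
        "type": "text",
        "analyzer": "index_grams",
        "search_analyzer": "search_grams"
      }
    }
  }
}
Here my-index-000008, search_grams, and common_grams_query are names made up for the sketch; query_mode is a real common_grams option that drops the unigrams for common words at query time.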

Related

ElasticSearch catenate_words -- only keep concatenated value

Following examples here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-word-delimiter-graph-tokenfilter.html
Specifically the catenate_words option.
I would like to use this to concatenate words so that I can run phrase queries across text before and after the concatenated word, but the emitted word parts prevent this.
Their documentation example is this:
super-duper-xl → [ superduperxl, super, duper, xl ]
Now if my actual phrase was "what a great super-duper-xl" that would turn into a sequence:
[what,a,great,superduperxl,super,duper,xl]
That matches the phrase "great superduperxl" which is fine.
However, if the phrase was "the super-duper-xl emerged" the sequence would be:
[the,superduperxl,super,duper,xl,emerged]
This does not phrase-match "superduperxl emerged"; it would, however, if the part tokens (super, duper, xl) were not emitted.
Is there any way I can concatenate words keeping only the concatenated word and filtering out the word parts?
A pattern_replace character filter can be used here: "-" is replaced with "" so that the hyphenated word is emitted as a single token.
Query
PUT my-index1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "remove_hyphen_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "remove_hyphen_filter"
          ]
        }
      },
      "char_filter": {
        "remove_hyphen_filter": {
          "type": "pattern_replace",
          "pattern": "-",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "remove_hyphen_analyzer"
      }
    }
  }
}
POST my-index1/_analyze
{
  "analyzer": "remove_hyphen_analyzer",
  "text": "the super-duper-xl emerged"
}
Result
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "superduperxl",
      "start_offset" : 4,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "emerged",
      "start_offset" : 19,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
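To see the effect on the phrase match from the question, you could index a document and run a match_phrase query (an illustration of mine, reusing the index above):
POST my-index1/_doc/1
{
  "title": "the super-duper-xl emerged"
}
GET my-index1/_search
{
  "query": {
    "match_phrase": {
      "title": "superduperxl emerged"
    }
  }
}
Because the char_filter removes the hyphens before tokenization, both the document and the query produce the adjacent tokens superduperxl and emerged, so the phrase matches.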

Query string with anomalous behavior

I am trying to understand the query_string clause in Elasticsearch. Specifically, I need to understand the following behavior. I indexed this document:
PUT test/doc/1
{
  "name": "1RD.ISABELA.GRADOS"
}
I expected both of the following queries to return one document, but only the second one does. My question is: why does the first query return nothing? Could you help me, please?
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "1RD.ISABELA",
            "default_field": "*"
          }
        }
      ]
    }
  }
}
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "1RD.ISABELA.GRADOS",
            "default_field": "*"
          }
        }
      ]
    }
  }
}
If you run the query below:
GET index28/_analyze
{
  "text": "1RD.ISABELA.GRADOS",
  "analyzer": "standard"
}
Response:
"tokens" : [
{
"token" : "1rd.isabela.grados",
"start_offset" : 0,
"end_offset" : 18,
"type" : "<ALPHANUM>",
"position" : 0
}
]
A single token is generated for the entire text. By default, the standard analyzer (with the standard tokenizer) is used. It splits text on whitespace and most punctuation, but a period between alphanumeric characters with no adjacent whitespace does not break the token, which is why the whole string survives as one token.
So only the exact term 1rd.isabela.grados will match this token.
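You can confirm this by analyzing the failing query text the same way (an added example, using the same standard analyzer):
GET index28/_analyze
{
  "text": "1RD.ISABELA",
  "analyzer": "standard"
}
This also yields a single token, 1rd.isabela, which is not equal to the indexed token 1rd.isabela.grados, so the first query matches nothing.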
If you execute the query below:
GET index28/_analyze
{
  "text": "RD ISABELA GRADOS.",
  "analyzer": "standard"
}
Response
"tokens" : [
{
"token" : "rd",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "isabela",
"start_offset" : 3,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "grados",
"start_offset" : 11,
"end_offset" : 17,
"type" : "<ALPHANUM>",
"position" : 2
}
]
Three tokens are generated, so a search matching any of these tokens will return the document.

How to map one word to another word in Elasticsearch?

How can I map one word to another in Elasticsearch? That is, suppose I have the following document:
{
  "carName": "Porche",
  "review": " this car is so awesome"
}
Now when I search for good/fantastic etc., it should map to "awesome".
Is there any way I can do this in Elasticsearch?
Yes, you can achieve this by using a synonym token filter.
First you need to define a new custom analyzer in your index and use that analyzer in your mapping.
curl -XPUT localhost:9200/cars -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "synonyms"
          ]
        }
      },
      "filter": {
        "synonyms": {
          "type": "synonym",
          "synonyms": [
            "good, awesome, fantastic"
          ]
        }
      }
    }
  },
  "mappings": {
    "car": {
      "properties": {
        "carName": {
          "type": "string"
        },
        "review": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}'
You can add as many synonyms as you want, either in the settings directly or in a separate file that you can reference in the settings using the synonyms_path property.
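For example, a file-based variant of the same filter could look like this (the file path is hypothetical; it is resolved relative to the Elasticsearch config directory):
"filter": {
  "synonyms": {
    "type": "synonym",
    "synonyms_path": "analysis/synonyms.txt"
  }
}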
Then we can index your sample document above:
curl -XPUT localhost:9200/cars/car/1 -d '{
  "carName": "Porche",
  "review": " this car is so awesome"
}'
What is going to happen is that when the synonyms token filter kicks in, it will also index the tokens good and fantastic along with awesome so that you can search and find that document by those tokens as well. Concretely, analyzing the sentence this car is so awesome...
curl -XGET 'localhost:9200/cars/_analyze?analyzer=my_analyzer&pretty' -d 'this car is so awesome'
...will produce the following tokens (see the last three tokens)
{
  "tokens" : [
    {
      "token" : "this",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "car",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "so",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "good",
      "start_offset" : 15,
      "end_offset" : 22,
      "type" : "SYNONYM",
      "position" : 5
    },
    {
      "token" : "awesome",
      "start_offset" : 15,
      "end_offset" : 22,
      "type" : "SYNONYM",
      "position" : 5
    },
    {
      "token" : "fantastic",
      "start_offset" : 15,
      "end_offset" : 22,
      "type" : "SYNONYM",
      "position" : 5
    }
  ]
}
Finally, you can search like this and the document will be retrieved:
curl -XGET localhost:9200/cars/car/_search?q=review:good

Combine search terms automatically with Elasticsearch?

Using Elasticsearch to search our documents, we discovered that when we search for "wave board" we get no good results, because documents containing "waveboard" are not at the top of the results. Google does this kind of term combining. Is there a simple way to do this in ES?
Found a good solution: create a custom analyzer with a shingle filter using "" as the token separator, and use that in a query (use a bool query to combine it with standard queries). A sketch of such an analyzer follows.
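Here is a minimal sketch of that analyzer (the index and filter names are made up; token_separator is set to the empty string so adjacent terms are joined, and output_unigrams keeps the original terms):
PUT /shingle-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "joining_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "joining_shingles" ]
        }
      },
      "filter": {
        "joining_shingles": {
          "type": "shingle",
          "token_separator": "",
          "output_unigrams": true
        }
      }
    }
  }
}
Analyzing "wave board" with this analyzer produces the tokens wave, waveboard, and board, so the combined form can match documents that contain waveboard.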
To do this at analysis time, you can also use what is known as a "decompounding" token filter. Here is an example that decompounds the text "catdogmouse" into the tokens "cat", "dog", and "mouse":
POST /decom
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "decom_filter" ]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": [ "cat", "dog", "mouse" ]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
And then you can see how they are applied to certain terms:
POST /decom/_analyze?field=body&pretty
racecatthings
{
  "tokens" : [
    {
      "token" : "racecatthings",
      "start_offset" : 1,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "cat",
      "start_offset" : 1,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
And another (you should be able to extrapolate from this to separate "waveboard" into "wave" and "board"; a concrete sketch follows after the output below):
POST /decom/_analyze?field=body&pretty
catdogmouse
{
  "tokens" : [
    {
      "token" : "catdogmouse",
      "start_offset" : 1,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "cat",
      "start_offset" : 1,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "dog",
      "start_offset" : 1,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "mouse",
      "start_offset" : 1,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
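For the "waveboard" case from the question, the same filter with a different word list should work (a sketch; the index name is made up):
POST /decom2
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [ "decom_filter" ]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": [ "wave", "board" ]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
Analyzing "waveboard" against this index then yields waveboard, wave, and board at the same position.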

Elasticsearch, search for domains in URLs

We index HTML documents which may include links to other documents. We're using Elasticsearch, and things are pretty smooth for most keyword searches, which is great.
Now we're adding more complex searches similar to Google's site: or link: searches: basically, we want to retrieve documents which point to either specific URLs or even whole domains. (If document A has a link to http://a.site.tld/path/, the search link:http://a.site.tld should yield it.)
We're now trying to figure out the best way to achieve this.
So far, we have extracted the links from the documents and added a links field to our documents, set up as not analyzed. We can then run searches that match the exact URL, link:http://a.site.tld/path/, but of course link:http://a.site.tld does not yield anything.
Our initial idea was to create a new field linkedDomains which would work similarly... but perhaps better solutions exist?
You could try the Path Hierarchy Tokenizer:
Define a mapping as follows:
PUT /link-demo
{
  "settings": {
    "analysis": {
      "analyzer": {
        "path-analyzer": {
          "type": "custom",
          "tokenizer": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "link": {
          "type": "string",
          "index_analyzer": "path-analyzer"
        }
      }
    }
  }
}
Index a doc:
POST /link-demo/doc
{
  "link": "http://a.site.tld/path/"
}
The following term query returns the indexed doc:
POST /link-demo/_search?pretty
{
  "query": {
    "term": {
      "link": {
        "value": "http://a.site.tld"
      }
    }
  }
}
To get a feel for how this is being indexed:
GET link-demo/_analyze?analyzer=path-analyzer&text="http://a.site.tld/path"&pretty
Shows the following:
{
  "tokens" : [
    {
      "token" : "\"http:",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "\"http:/",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "\"http://a.site.tld",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "\"http://a.site.tld/path\"",
      "start_offset" : 0,
      "end_offset" : 24,
      "type" : "word",
      "position" : 1
    }
  ]
}
