Count n-grams with token_count field - elasticsearch

Is it possible to count the number of produced n-grams using a token_count field?
Let's suppose I have the following mapping:
{
  "settings": {
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      },
      "analyzer": {
        "trigrams": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "trigrams_filter"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "message": {
          "type": "text",
          "analyzer": "trigrams",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "trigrams"
            }
          }
        }
      }
    }
  }
}
With this mapping I'd expect to get three terms for the value "quick": "qui", "uic", and "ick". But the following query doesn't return any hits, despite the fact that the message.length field uses the trigrams analyzer:
{
  "query": {
    "term": {
      "message.length": 3
    }
  }
}
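One way to narrow this down is to check what the trigrams analyzer actually emits with the _analyze API (a diagnostic sketch; my_index is a placeholder for your index name):

POST my_index/_analyze    // my_index is a placeholder
{
  "analyzer": "trigrams",
  "text": "quick"
}

If the response lists "qui", "uic", and "ick" as expected, the analyzer itself is fine and the discrepancy lies on the token_count side.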

Related

How to match partial words in elastic search text search

I have a field name in my Elasticsearch index with a value of Single V.
If I search with a value of S or Sing, I get no results, but if I enter the full value Single, then I get the result Single V. The query I am using is the following:
{
  "query": {
    "match": {
      "name": "singl"
    }
  },
  "sort": []
}
This gives me no results. Do I need to change the mapping/settings for name, or the analyzer?
EDIT:
I am trying to create the following index with the following mapping/settings:
PUT my_cars
{
  "settings": {
    "analysis": {
      "normalizer": {
        "sortable": {
          "filter": ["lowercase"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 1,
            "max_gram": 36,
            "token_chars": [
              "letter"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "sortable"
          }
        }
      }
    }
  }
}
But I get the following error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "analyzer [tokenizer] must specify either an analyzer type, or a tokenizer"
  },
  "status" : 400
}
Elasticsearch by default uses the standard analyzer for a text field if no analyzer is specified. This tokenizes "Single V" into "single" and "v". Because of this, you get a result for "Single" but not for the other partial terms.
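You can see this for yourself with the _analyze API (a quick check, not part of the original answer):

POST _analyze
{
  "analyzer": "standard",
  "text": "Single V"
}

The response contains exactly two tokens, single and v, neither of which matches the partial input singl.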
If you want to do a partial search, you can use an edge n-gram tokenizer or a wildcard query.
The mapping for the edge n-gram tokenizer would be:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 6,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
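With that mapping in place, a plain match query should hit on prefixes of two to six characters (a usage sketch; the index name my_cars is taken from the question):

POST my_cars/_search
{
  "query": {
    "match": {
      "name": "sing"
    }
  }
}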
Update 1:
In the index mapping given above, one closing bracket } is missing: the tokenizer block ends up nested inside the analyzer block, which is exactly what triggers the error. Modify your index mapping as shown below:
{
  "settings": {
    "analysis": {
      "normalizer": {
        "sortable": {
          "filter": [
            "lowercase"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },              // note this
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 36,
          "token_chars": [
            "letter"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "sortable"
          }
        }
      }
    }
  }
}
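Once the index creates successfully, you can verify that the custom analyzer produces n-grams as intended (a diagnostic sketch):

POST my_cars/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Single V"
}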
This is because of the default analyzer. The analyzer breaks the field into the tokens [single, v].
A match query tries to find an exact match for any of the query's tokens. Since you are only passing singl, that will be the only query token, and it matches neither of the two tokens stored in the index.
Alternatively, you can use a wildcard query (the field name goes under wildcard, with the pattern in value):

{
  "query": {
    "wildcard": {
      "name": {
        "value": "*singl*"
      }
    }
  }
}

Elasticsearch term query to number token

I need to explain some weird behavior of a term query against an Elasticsearch database, where the string contains a number. The query is pretty simple:
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "address.street": "8 kvetna"
          }
        }
      ]
    }
  }
}
The problem is that the term 8 kvetna returns an empty result. I tried to _analyze it and it makes regular tokens like 8, k, kv, kve, and so on. Also, I am pretty sure the value 8 kvetna is in the database.
Here is the mapping for the field:
{
  "settings": {
    "index": {
      "refresh_interval": "1m",
      "number_of_shards": "1",
      "number_of_replicas": "1",
      "analysis": {
        "filter": {
          "autocomplete_filter": {
            "type": "edge_ngram",
            "min_gram": "1",
            "max_gram": "20"
          }
        },
        "analyzer": {
          "autocomplete": {
            "filter": [
              "lowercase",
              "asciifolding",
              "autocomplete_filter"
            ],
            "type": "custom",
            "tokenizer": "standard"
          },
          "default": {
            "filter": [
              "lowercase",
              "asciifolding"
            ],
            "type": "custom",
            "tokenizer": "standard"
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "dynamic": "strict",
      "_all": {
        "enabled": false
      },
      "properties": {
        "address": {
          "properties": {
            "city": {
              "type": "text",
              "analyzer": "autocomplete"
            },
            "street": {
              "type": "text",
              "analyzer": "autocomplete"
            }
          }
        }
      }
    }
  }
}
What causes this weird result? I don't understand it. Thanks for any help.
Great start so far! Your only issue is that you're using a term query, when you should use a match query. A term query tries to do an exact match for 8 kvetna, and that's not what you want. The following query will work:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {            <--- change this
            "address.street": "8 kvetna"
          }
        }
      ]
    }
  }
}
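To see why the term query can never match, you can inspect the tokens the autocomplete analyzer actually indexes (a diagnostic sketch; my_index is a placeholder for the real index name):

POST my_index/_analyze    // my_index is a placeholder
{
  "analyzer": "autocomplete",
  "text": "8 kvetna"
}

The response lists edge grams such as 8, k, kv, kve, and so on. No single indexed token equals the whole string 8 kvetna, so an exact term lookup finds nothing, while match analyzes the query text and matches the individual tokens.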

Search partial word in elasticsearch

I'm kind of new to Elasticsearch, but I would like to search for a partial word.
For example, for the indexed word "helloworld", is it possible to type only "world"?
Right now it works perfectly for the case "hello": Elasticsearch returns the suggestion helloworld for me.
Here is the code:
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "word": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "autocomplete"
        }
      }
    }
  }
}
Can anyone give me any suggestion?
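One option, sketched here as an assumption rather than an answer from the original thread, is to swap the edge_ngram filter for a plain ngram filter. Edge n-grams only emit prefixes ("h", "he", ..., "helloworld"), while plain n-grams start at every offset, so "world" itself becomes an indexed term:

{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",       // was edge_ngram
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

Be aware that plain n-grams inflate the index considerably compared to edge n-grams, and on recent Elasticsearch versions a min_gram/max_gram spread this wide also requires raising the index.max_ngram_diff setting.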

Using Elasticsearch to search special characters

How can I force Elasticsearch query_string to recognize '#' as a simple character?
Assuming I have an index, and I added a few documents with this statement:
POST test/item/_bulk
{"index": {}}
{"text": "john.doe#gmail.com"}
{"index": {}}
{"text": "john.doe#outlook.com"}
{"index": {}}
{"text": "john.doe#gmail.com, john.doe#outlook.com"}
{"index": {}}
{"text": "john.doe[at]gmail.com"}
{"index": {}}
{"text": "john.doe gmail.com"}
I want this search:
GET test/item/_search
{
  "query": {
    "query_string": {
      "query": "*#gmail.com",
      "analyze_wildcard": "true",
      "allow_leading_wildcard": "true",
      "default_operator": "AND"
    }
  }
}
to return only the first and third documents.
I tried three kinds of mappings.
First I tried:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "tokenizer": "email_tokenizer"
        }
      },
      "tokenizer": {
        "email_tokenizer": {
          "type": "uax_url_email"
        }
      }
    }
  },
  "mappings": {
    "item": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "email_analyzer"
        }
      }
    }
  }
}
Then I tried:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "item": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
And I also tried this one:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "item": {
      "properties": {
        "text": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
None of the above worked; in fact, they all returned all the documents.
Is there an analyzer/tokenizer/parameter that will make Elasticsearch acknowledge the '#' sign like it does any other character?
This works with your last setting, which sets the text field to not_analyzed:
GET test/item/_search
{
  "query": {
    "wildcard": {
      "text": "*#gmail.com*"
    }
  }
}
When using a not_analyzed field, you should use a term-level query rather than a full-text query: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/term-level-queries.html
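For example, an exact lookup against the not_analyzed field uses a term query (a sketch following the linked documentation):

GET test/item/_search
{
  "query": {
    "term": {
      "text": "john.doe#gmail.com"
    }
  }
}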

Google type query using Elasticsearch

Suppose I have the following document:
{ "title": "Sennheiser HD 800" }
I want all of these queries to return this document:
senn
heise
sennheise
sennheiser
sennheiser 800
sennheiser hd
hd
800 hd
hd ennheise
In short, I want to match one or more partial words.
In my mapping I am using this analyzer:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
and the mapping:
{
  "title": {
    "type": "string",
    "fields": {
      "raw": {
        "type": "string",
        "index": "not_analyzed"
      },
      "lower_case_sort": {
        "type": "string",
        "analyzer": "case_insensitive_sort"
      }
    }
  }
}
and the query is a simple query_string query:
{
  "query": {
    "query_string": {
      "fields": [
        "title.lower_case_sort"
      ],
      "query": "*800 hd*"
    }
  }
}
For example, this query fails.
You need ngrams.
Here is a blog post I wrote up about it for Qbox:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch
(Note that "index_analyzer" no longer works in ES 2.x; use "analyzer" instead; "search_analyzer" still works, though.)
Using this mapping (slightly modified from one in the blog post; I'll refer you there for an in-depth explanation):
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
Index your document:
POST /test_index/doc/1
{
  "title": "Sennheiser HD 800"
}
and then any of your listed queries work, in the following form:
POST /test_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "heise hd 800",
        "operator": "and"
      }
    }
  }
}
If you only have a single term, then you don't need the "operator" part:
POST /test_index/_search
{
  "query": {
    "match": {
      "title": "hd"
    }
  }
}
Here is some code I used to play around with it:
http://sense.qbox.io/gist/a9accf67f1713ca99819f45ce0ac28adaea691a9
