Autocompletion with whitespace tokenizer in Elasticsearch: tokenize on whitespace correctly

I have an Elasticsearch index I want to do autocompletion with.
Therefore I have a suggestField of type completion where I put the text that should be autocompleted.
"suggestField" : {
"type" : "completion",
"analyzer" : "IndexAnalyzer",
"search_analyzer" : "SearchAnalyzer",
"preserve_separators" : true,
"preserve_position_increments" : true,
"max_input_length" : 50
},
With Analyzers:
"IndexAnalyzer" : {
"filter" : [
"lowercase",
"stop",
"stopGerman",
"EdgeNGramFilter"
],
"type" : "custom",
"tokenizer" : "MyTokenizer"
},
"SearchAnalyzer" : {
"filter" : [
"lowercase"
],
"type" : "custom",
"tokenizer" : "MyTokenizer"
},
Filters and Tokenizer:
"filter" : {
"EdgeNGramFilter" : {
"type" : "edge_ngram",
"min_gram" : "1",
"max_gram" : "50"
},
"stopGerman" : {
"type" : "stop",
"stopwords" : "_german_"
}
},
"tokenizer" : {
"MyTokenizer" : {
"type" : "whitespace"
}
}
My problem now is that if I query that field, the autocompletion only works when starting at the beginning of the text, not for every word.
E.g. I have one value in my suggest field that looks like: "123-456-789 thisisatest"
If I search my suggest field for 123- I get that value as a result.
But if I search for thisis I do not get a result.
This is my query:
POST myindex/_search?typed_keys=true
{
  "suggest": {
    "completion-term": {
      "completion" : {
        "field" : "suggestField"
      },
      "prefix" : "thisis"
    }
  }
}
The question: how do I have to change the above setup to get the given value as a result when I search for thisis?
FYI: if I use the IndexAnalyzer in Kibana with an _analyze query for 123-456-789 thisisatest, I get the (from my point of view correct) tokens:
1
12
123
123-
123-4
123-45
123-456
123-456-7
123-456-78
123-456-789
t
th
thi
this
thisi
thisis
thisisa
thisisat
thisisate
thisisates
thisisatest
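A note for context (this explanation is not part of the original question): the completion suggester matches the prefix from the beginning of each indexed input, so index-time edge n-grams cannot make later words searchable on their own. A commonly suggested workaround is to store each word as an additional input when indexing, along the lines of:

PUT myindex/_doc/1
{
  "suggestField": {
    "input": [
      "123-456-789 thisisatest",
      "thisisatest"
    ]
  }
}

With both inputs stored, the prefix thisis matches the second input and the document is returned.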

Related

Extract Hashtags and Mentions into separate fields

I am doing a DIY tweet sentiment analyser. I have an index of tweets like these:
"_source" : {
"id" : 26930655,
"status" : 1,
"title" : "Hereโ€™s 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : null,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan ๐Ÿ™๐Ÿšฉ๐Ÿšฉ๐Ÿ‡ฎ๐Ÿ‡ณ""",
"hashtags" : null,
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
Their mappings and settings look like this:
"sentiment-en" : {
"mappings" : {
"properties" : {
"category" : {
"type" : "text"
},
"created_at" : {
"type" : "integer"
},
"hashtags" : {
"type" : "text"
},
"id" : {
"type" : "long"
},
"language" : {
"type" : "integer"
},
"status" : {
"type" : "integer"
},
"title" : {
"type" : "text",
"fields" : {
"raw" : {
"type" : "keyword"
},
"raw_text" : {
"type" : "text"
},
"stop" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "stop_words_filter"
},
"syn" : {
"type" : "text",
"index_options" : "docs",
"analyzer" : "synonyms_filter"
}
},
"index_options" : "docs",
"analyzer" : "all_ok_filter"
}
}
}
}
}
"settings" : {
"index" : {
"number_of_shards" : "10",
"provided_name" : "sentiment-en",
"creation_date" : "1627975717560",
"analysis" : {
"filter" : {
"stop_words" : {
"type" : "stop",
"stopwords" : [ ]
},
"synonyms" : {
"type" : "synonym",
"synonyms" : [ ]
}
},
"analyzer" : {
"stop_words_filter" : {
"filter" : [ "stop_words" ],
"tokenizer" : "standard"
},
"synonyms_filter" : {
"filter" : [ "synonyms" ],
"tokenizer" : "standard"
},
"all_ok_filter" : {
"filter" : [ "stop_words", "synonyms" ],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "0",
"uuid" : "Q5yDYEXHSM-5kvyLGgsYYg",
"version" : {
"created" : "7090199"
}
}
Now the problem is that I want to extract all the hashtags and mentions into a separate field.
What I want as output:
"id" : 26930655,
"status" : 1,
"title" : "Hereโ€™s 5 underrated #BTC and realistic crypto accounts that everyone should follow: #Quinnvestments , #JacobOracle , #jevauniedaye , #ginsbergonomics , #InspoCrypto",
"hashtags" : BTC,
"created_at" : 1622390229,
"category" : null,
"language" : 50
},
{
"id" : 22521897,
"status" : 1,
"title" : "#bulls gonna overtake the #bears soon #ATH coming #ALTSEASON #BSCGem #eth #btc #memecoin #100xgems #satyasanatan ๐Ÿ™๐Ÿšฉ๐Ÿšฉ๐Ÿ‡ฎ๐Ÿ‡ณ""",
"hashtags" : bulls,bears,ATH, ALTSEASON, BSCGem, eth , btc, memecoin, 100xGem, satyasanatan
"created_at" : 1620045296,
"category" : null,
"language" : 50
}
What I have tried so far:
I created a pattern-based tokenizer to read just hashtags and mentions and no other tokens for the hashtags and mentions fields, but did not have much success there.
I also tried to write an n-gram tokenizer without any analyzers, and did not achieve much success there either.
Any help would be appreciated; I am open to reindexing my data. Thanks in advance!
You can use the Logstash Twitter input plugin for indexing data and configure the Ruby script below in the filter plugin, as mentioned in the blog:
if [message] {
  ruby {
    # \w+ also matches digits and underscores, so tags like #100xgems are captured
    code => "event.set('hashtags', event.get('message').scan(/\#\w+/))"
  }
}
You can use the Logstash Elasticsearch input plugin for the source index, the above Ruby code in the filter plugin, and the Logstash Elasticsearch output plugin for the destination index:
input {
  elasticsearch {
    hosts => "localhost:9200"
    index => "current_twitter"
    query => '{ "query": { "query_string": { "query": "*" } } }'
    size => 500
    scroll => "5m"
  }
}

filter {
  if [message] {
    ruby {
      code => "event.set('hashtags', event.get('message').scan(/\#\w+/))"
    }
  }
}

output {
  elasticsearch {
    index => "new_twitter"
  }
}
Another option is to use the Reindex API with an ingest pipeline, but ingest pipelines do not support Ruby code, so you would need to convert the above Ruby code to a Painless script.
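For illustration only (this sketch is not from the original answer): such a pipeline could look roughly like the following, assuming the hashtags live in the title field, that the pipeline and destination index names are placeholders, and that regex support is enabled for Painless (script.painless.regex.enabled):

PUT _ingest/pipeline/extract_hashtags
{
  "description": "Copy #hashtags from title into hashtags (sketch)",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.title != null) { def m = /#(\\w+)/.matcher(ctx.title); def tags = []; while (m.find()) { tags.add(m.group(1)); } ctx.hashtags = tags; }"
      }
    }
  ]
}

POST _reindex
{
  "source": { "index": "sentiment-en" },
  "dest": { "index": "sentiment-en-v2", "pipeline": "extract_hashtags" }
}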

Elasticsearch multi_match + nested search

I am trying to execute a multi_match + nested search in Elasticsearch 6.4. I have the following mappings:
"name" : {
"type" : "text"
},
"status" : {
"type" : "short"
},
"user" : {
"type" : "nested",
"properties" : {
"first_name" : {
"type" : "text"
},
"last_name" : {
"type" : "text"
},
"pk" : {
"type" : "integer"
},
"profile" : {
"type" : "nested",
"properties" : {
"location" : {
"type" : "nested",
"properties" : {
"name" : {
"type" : "text",
"analyzer" : "html_strip"
}
}
}
}
}
}
},
And this is the html_strip analyzer:
"html_strip" : {
"filter" : [
"lowercase",
"stop",
"snowball"
],
"char_filter" : [
"html_strip"
],
"type" : "custom",
"tokenizer" : "standard"
}
And my current query is this one:
"query": {
"bool": {
"must": {
"multi_match": {
"query": 'Paris',
"fields": ['name', 'user.profile.location.name']
},
},
"filter": {
"term": {
"status": 1
}
}
}
}
Obviously, searching for "Paris" in user.profile.location.name doesn't work. I was trying to adapt my code following this answer https://stackoverflow.com/a/48836012/12007123 but without any success.
What I am basically trying to achieve is to be able to search for a value in multiple fields, which may or may not be nested.
I was also checking this discussion https://discuss.elastic.co/t/multi-match-query-string-with-nested-and-non-nested-fields/118652/5 but everything I tried was unsuccessful.
If I just search for name, the search works fine.
Any tips on how I can achieve this the right way would be much appreciated.
EDIT:
While I didn't get an answer to my initial question, I followed Nikolay's (#nikolay-vasiliev) comment and changed the mappings to object instead of nested.
At least now I am able to search in user.profile.location.name. This is what the new mapping for user looks like:
"user" : {
"properties" : {
"first_name" : {
"type" : "text"
},
"last_name" : {
"type" : "text"
},
"pk" : {
"type" : "integer"
},
"profile" : {
"properties" : {
"location" : {
"properties" : {
"name" : {
"type" : "text",
"analyzer" : "html_strip"
}
}
}
}
}
}
},
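For reference (a sketch, not taken from the original thread): if the nested mapping has to be kept, the usual pattern is to put the non-nested match and a nested query side by side in a bool/should clause; with multi-level nested objects, each level needs its own nested wrapper:

"query": {
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "match": { "name": "Paris" } },
          {
            "nested": {
              "path": "user",
              "query": {
                "nested": {
                  "path": "user.profile",
                  "query": {
                    "nested": {
                      "path": "user.profile.location",
                      "query": {
                        "match": { "user.profile.location.name": "Paris" }
                      }
                    }
                  }
                }
              }
            }
          }
        ]
      }
    },
    "filter": {
      "term": { "status": 1 }
    }
  }
}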

How to highlight ngram tokens in a word using elastic search

I would like to highlight just the ngrams which match, not the whole word.
Example:
term: "Wo"
highlight should be: "<em>Wo</em>nderfull world!"
currently it is: "<em>Wonderfull</em> world!"
Mapping is:
{
  "global_search_1495732922733" : {
    "mappings" : {
      "meeting" : {
        "properties" : {
          ...
          "name" : {
            "type" : "text",
            "analyzer" : "meeteor_index_analyzer",
            "search_analyzer" : "meeteor_search_term_analyzer"
          },
          ...
        }
      }
    }
  }
}
Analyzers are:
"analysis" : {
"filter" : {
"meeteor_stemmer" : {
"name" : "english",
"type" : "stemmer"
},
"meeteor_ngram" : {
"type" : "nGram",
"min_gram" : "2",
"max_gram" : "15"
}
},
"analyzer" : {
"meeteor_search_term_analyzer" : {
"filter" : [
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
},
"meeteor_index_analyzer" : {
"filter" : [
"lowercase",
"asciifolding",
"meeteor_ngram"
],
"tokenizer" : "standard"
},
"meeteor_project_id_analyzer" : {
"tokenizer" : "standard"
}
}
},
Concrete example:
curl -XGET 'localhost:9200/global_search/meeting/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": "Me"
    }
  },
  "highlight": {
    "fields": {
      "name": {}
    }
  }
}
'
The result is:
"...highlight" : {
"name" : [
"Sad <em>Meeting</em>"
]
}
The correct way to achieve what you want is to use ngram as a tokenizer, not as a filter. You can do something like this:
"analysis" : {
"filter" : {
"meeteor_stemmer" : {
"name" : "english",
"type" : "stemmer"
}
},
"tokenizer" : {
"meeteor_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "2",
"max_gram" : "15"
}
},
"analyzer" : {
"meeteor_search_term_analyzer" : {
"filter" : [
"lowercase",
"asciifolding"
],
"tokenizer" : "standard"
},
"meeteor_index_analyzer" : {
"filter" : [
"lowercase",
"asciifolding"
],
"tokenizer" : "meeteor_ngram_tokenizer"
},
"meeteor_project_id_analyzer" : {
"tokenizer" : "standard"
}
}
},
It will generate the highlighting by ngram for you like this:
"...highlight" : {
"name" : [
"Sad <em>Me</em>eting"
]
}
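To double-check the analyzer output before reindexing, the _analyze API can be used (the index name is taken from the mapping above):

curl -XGET 'localhost:9200/global_search_1495732922733/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "meeteor_index_analyzer",
  "text": "Meeting"
}
'

With the tokenizer-based setup, each gram (Me, Mee, ..., Meeting) comes back as its own token with its own start and end offsets, which is exactly what lets the highlighter wrap only the matching gram instead of the whole word.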

Why doesn't Elasticsearch match any number in the keyphrase?

I am searching for the following keyphrase: "xiaomi redmi note 3" in my Elasticsearch database. I'm making the following bool query:
"filtered" : {
"query" : {
"match" : {
"name" : {
"query" : "xiaomi redmi note 3",
"type" : "boolean",
"operator" : "AND"
}
}
}
}
However, no matches are found. Still, in Elasticsearch there is the following document:
xiaomi redmi note 3 16GB 4G Phablet
Why doesn't ES match this document?
What I noticed in general is that ES doesn't match any numbers in the keyphrase. Does it have to do with the analyzer I'm using?
EDIT
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
},
and the mapping for my field is:
"name" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
},
"search_quote_analyzer" : "second"
},
Autocomplete_filter:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
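A likely explanation (this diagnosis is not from the original thread): with "min_gram": 2 the edge_ngram filter emits no grams for the single-character token 3, so 3 is never indexed, while the standard search analyzer still produces a 3 token at query time; with "operator": "AND" the whole query then fails. This can be checked with _analyze (myindex is a placeholder; the query-parameter form matches this older setup, which still uses index_analyzer):

curl -XGET 'localhost:9200/myindex/_analyze?analyzer=second&pretty' -d 'xiaomi redmi note 3'

If no 3 token comes back, setting min_gram to 1, or searching a subfield that is not edge-n-grammed, should make the document match.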

How to get the definition of a search analyzer of an index in Elasticsearch

The mapping of the Elasticsearch index has a custom analyzer attached to it. How can I read the definition of the custom analyzer?
http://localhost:9200/test_namespace/test_namespace/_mapping
"matchingCriteria": {
"type": "string",
"analyzer": "custom_analyzer",
"include_in_all": false
}
My search is not working with the analyzer; that's why I need to know what exactly this analyzer is doing.
The docs explain how to modify an analyzer or attach a new analyzer to an existing index, but I didn't find a way to see what an analyzer does.
Use the _settings API:
curl -XGET 'http://localhost:9200/test_namespace/_settings?pretty=true'
it should generate a response similar to:
{
  "test_namespace" : {
    "settings" : {
      "index" : {
        "creation_date" : "1418990814430",
        "routing" : {
          "allocation" : {
            "disable_allocation" : "false"
          }
        },
        "uuid" : "FmX9NrSNSTO2bQM5pd-iQQ",
        "number_of_replicas" : "2",
        "analysis" : {
          "analyzer" : {
            "edi_analyzer" : {
              "type" : "custom",
              "char_filter" : [ "my_pattern" ],
              "filter" : [ "lowercase", "length" ],
              "tokenizer" : "whitespace"
            },
            "xml_analyzer" : {
              "type" : "custom",
              "char_filter" : [ "html_strip" ],
              "filter" : [ "lowercase", "length" ],
              "tokenizer" : "whitespace"
            },
            ...
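The _settings output shows the definition; to see what an analyzer actually does to a concrete input, the _analyze API can be used as well (the sample text is made up; this query-parameter form matches the old version shown above, while newer versions take a JSON body):

curl -XGET 'http://localhost:9200/test_namespace/_analyze?analyzer=custom_analyzer&pretty=true' -d 'some sample text'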