How do I get the most frequent uni-, bi-, tri-grams using shingles in Elasticsearch across all documents - elasticsearch

I am using the following field definition in my elasticsearch index:
"my_text" :{
"type" : "string",
"index" : "analyzed",
"analyzer" : "my_ngram_analyzer",
"term_vector": "with_positions",
"term_statistics" : true
}
where my_ngram_analyzer is used to tokenize text into n-grams using shingles and is defined as:
"settings" : {
"analysis" : {
"filter" : {
"nGram_filter": {
"type": "shingle",
"max_shingle_size": 5,
"min_shingle_size": 2,
"output_unigrams":"true"
}
},
"analyzer" : {
"my_ngram_analyzer" :{
"tokenizer" : "standard",
"filter" : [
"lowercase",
"nGram_filter"
]
}
}
}
}
I have two questions:
How can I find the most frequent n-gram (n = 1 to 5) and its frequency across all documents?
Is there a way to get the total term frequency of an n-gram without querying for a document via the term vectors API with term_statistics?
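For reference, a minimal sketch of one possible approach to the first question: a terms aggregation over the analyzed field returns the most frequent tokens, including the shingles, ranked by how many documents contain them. On newer Elasticsearch versions this requires fielddata to be enabled on my_text (not shown in the mapping above) and can be memory-hungry; the index name my_index and the size of 20 are placeholders.
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "frequent_ngrams": {
      "terms": {
        "field": "my_text",
        "size": 20
      }
    }
  }
}
Note that doc_count in the result is document frequency, not total term frequency; the term vectors API with term_statistics (its ttf value) remains the per-document way to read total term frequency.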

Related

Autocompletion with whitespace tokenizer in Elasticsearch: tokenize whitespace correctly

I have an Elasticsearch index I want to use for autocompletion.
Therefore I have a suggestField of type completion where I put the text that should be autocompleted.
"suggestField" : {
"type" : "completion",
"analyzer" : "IndexAnalyzer",
"search_analyzer" : "SearchAnalyzer",
"preserve_separators" : true,
"preserve_position_increments" : true,
"max_input_length" : 50
},
With Analyzers:
"IndexAnalyzer" : {
"filter" : [
"lowercase",
"stop",
"stopGerman",
"EdgeNGramFilter"
],
"type" : "custom",
"tokenizer" : "MyTokenizer"
},
"SearchAnalyzer" : {
"filter" : [
"lowercase"
],
"type" : "custom",
"tokenizer" : "MyTokenizer"
},
Filters and Tokenizer:
"filter" : {
"EdgeNGramFilter" : {
"type" : "edge_ngram",
"min_gram" : "1",
"max_gram" : "50"
},
"stopGerman" : {
"type" : "stop",
"stopwords" : "_german_"
}
},
"tokenizer" : {
"MyTokenizer" : {
"type" : "whitespace"
}
}
My problem is that if I query that field, the autocompletion only works if I start at the beginning of the text, not for every word.
E.g. I have one value in my suggest field that looks like: "123-456-789 thisisatest"
If I search my suggest field for 123- I get that value as a result.
But if I search for thisis I do not get a result.
This is my query:
POST myindex/_search?typed_keys=true
{
"suggest": {
"completion-term": {
"completion" : {
"field" : "suggestField"
} ,
"prefix" : "thisis"
}
}
}
The question: How do I have to change the above setup to get the given value as a result if I search for thisis?
FYI: If I use the IndexAnalyzer in Kibana with an _analyze query for 123-456-789 thisisatest (sketched after the token list below), I get the (from my point of view correct) tokens:
1
12
123
123-
123-4
123-45
123-456
123-456-7
123-456-78
123-456-789
t
th
thi
this
thisi
thisis
thisisa
thisisat
thisisate
thisisates
thisisatest
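For reference, a sketch of that _analyze request, reusing the index name myindex from the search request above and the IndexAnalyzer defined earlier:
POST myindex/_analyze
{
  "analyzer": "IndexAnalyzer",
  "text": "123-456-789 thisisatest"
}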

Query to partially match every word in a search term in Elasticsearch

I have an array of tags containing words.
tags: ['australianbrownsnake', 'venomoussnake', ...]
How do I match this against these search terms:
'brown snake', 'australian snake', 'venomous', 'venomous brown snake'
I am not even sure if this is possible since I am new to Elasticsearch.
Help would be appreciated. Thank you.
Edit: I have created an ngram analyzer and added a field called ngram, like so:
"properties": {
"tags": {
"type": "text",
"fields": {
"ngram": {
"type": "text",
"analyzer": "my_analyzer"
}
}
}
I tried the following query, but no luck:
"query": {
"multi_match": {
"query": "snake",
"fields": [
"tags.ngram"
],
"type": "most_fields"
}
}
My tag mapping is as follows:
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
},
"ngram" : {
"type" : "text",
"analyzer" : "my_analyzer"
}
}
},
My settings are:
{
"image" : {
"settings" : {
"index" : {
"max_ngram_diff" : "10",
"number_of_shards" : "1",
"provided_name" : "image",
"creation_date" : "1572590562106",
"analysis" : {
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "my_tokenizer"
}
},
"tokenizer" : {
"my_tokenizer" : {
"token_chars" : [
"letter",
"digit"
],
"min_gram" : "3",
"type" : "ngram",
"max_gram" : "10"
}
}
},
"number_of_replicas" : "1",
"uuid" : "pO9F7W43QxuZmI9vmXfKyw",
"version" : {
"created" : "7040299"
}
}
}
}
}
Update:
This config should work fine.
I believe it was my mistake; I was searching on the wrong index.
You need to index your tags in the way you want to search them. For queries like 'brown snake' or 'australian snake' to match your tags, you would need to break them into smaller tokens.
By default, Elasticsearch indexes strings by passing them through its standard analyzer. You can always create a custom analyzer to store your field however you want, for example one that tokenizes strings into n-grams. With a gram size of 3-10, your 'australianbrownsnake' tag would be stored as something like: ['aus', 'aust', ..., 'tra', 'tral', ...]
You can then modify your search query to match on your tags.ngram field and you should get the desired results.
The tags.ngram field can be created like so:
https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
using ngram tokenizer:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
EDIT1: Elasticsearch tends to use the analyzer of the field being matched on to analyze the query keywords. You might not need the user query to be tokenized into n-grams, since there should already be a matching n-gram stored in the tags field. You could specify a standard search_analyzer in your mappings, as sketched below.
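A minimal sketch of that mapping change, reusing the field and analyzer names from the mapping above; only the search_analyzer line is new:
"tags": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    },
    "ngram": {
      "type": "text",
      "analyzer": "my_analyzer",
      "search_analyzer": "standard"
    }
  }
}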

Why doesn't Elasticsearch match any number in the keyphrase?

I am searching for the following keyphrase: "xiaomi redmi note 3" in my Elasticsearch database. I'm making the following bool query:
"filtered" : {
"query" : {
"match" : {
"name" : {
"query" : "xiaomi redmi note 3",
"type" : "boolean",
"operator" : "AND"
}
}
}
}
However, no matches are found. Still, Elasticsearch contains the following document:
xiaomi redmi note 3 16GB 4G Phablet
Why doesn't Elasticsearch match this document?
What I noticed in general is that Elasticsearch doesn't match any numbers in the keyphrase. Does it have to do with the analyzer I'm using?
EDIT
"analyzer": {
"second": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"trim",
"autocomplete_filter"
]
},
and the mapping for my field is:
"name" : {
"type" : "string",
"index_analyzer" : "second",
"search_analyzer" : "standard",
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed"
}
},
"search_quote_analyzer" : "second"
},
Autocomplete_filter:
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10
}
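One way to check this is to run the index analyzer against the phrase and compare the emitted tokens with the query terms. A sketch follows; the index name my_index is a placeholder, since the question does not show it, and older versions that still support index_analyzer take analyzer and text as query parameters instead of a JSON body:
POST my_index/_analyze
{
  "analyzer": "second",
  "text": "xiaomi redmi note 3"
}
With min_gram set to 2, a one-character token such as 3 produces no edge n-grams at index time, so an AND query that includes the term 3 would have nothing to match; the _analyze output should confirm whether that is what is happening here.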

Elasticsearch query response influenced by _id

I created an index with the following mappings and settings:
{
"settings": {
"analysis": {
"analyzer": {
"case_insensitive_index": {
"type": "custom",
"tokenizer": "filename",
"filter": ["icu_folding", "edge_ngram"]
},
"default_search": {
"type":"standard",
"tokenizer": "filename",
"filter": [
"icu_folding"
]
}
},
"tokenizer" : {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
"filter" : {
"edge_ngram" : {
"side" : "front",
"max_gram" : 20,
"min_gram" : 3,
"type" : "edgeNGram"
}
}
}
},
"mappings": {
"metadata": {
"properties": {
"title": {
"type": "string",
"analyzer": "case_insensitive_index"
}
}
}
}
}
I have the following documents:
{"title":"P-20150531-27332_News.jpg"}
{"title":"P-20150531-27341_News.jpg"}
{"title":"P-20150531-27512_News.jpg"}
{"title":"P-20150531-27343_News.jpg"}
Creating these documents with simple numerical IDs
111
112
113
114
and querying using the following query
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title" : {
"query" : "P-20150531-27332_News.jpg",
"type" : "boolean",
"fuzziness" : "AUTO"
}
}
}
}
results in the correct scoring and ordering of the documents returned:
P-20150531-27332_News.jpg -> 2.780985
P-20150531-27341_News.jpg -> 0.8262239
P-20150531-27512_News.jpg -> 0.8120311
P-20150531-27343_News.jpg -> 0.7687101
Strangely, creating the same documents with UUIDs
557eec2e3b00002c03de96bd
557eec0f3b00001b03de96b8
557eec0c3b00001b03de96b7
557eec123b00003a03de96ba
as IDs results in different scorings of the documents:
P-20150531-27341_News.jpg -> 2.646321
P-20150531-27332_News.jpg -> 2.1998127
P-20150531-27512_News.jpg -> 1.7725387
P-20150531-27343_News.jpg -> 1.2718291
Is this an intentional behaviour of Elasticsearch? If yes - how can I preserve the correct ordering regardless of the IDs used?
In the query, it looks like you should be using 'default_search' as the analyzer for the match query, unless you actually intended to use edge-ngram on the search query too.
Example:
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title" : {
"query" : "P-20150531-27332_News.jpg",
"type" : "boolean",
"fuzziness" : "AUTO",
"analyzer" : "default_search"
}
}
}
}
default_search would be the default search analyzer only if there is no explicit search_analyzer or analyzer specified in the mapping of the field.
The article here gives a good explanation of the rules by which analyzers are applied.
Also, to ensure IDF takes documents across shards into account, you could use search_type=dfs_query_then_fetch, for example:
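A sketch of the same request with that search type (the index name myindex is a placeholder; the question does not show one):
POST myindex/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {
      "title": {
        "query": "P-20150531-27332_News.jpg",
        "fuzziness": "AUTO",
        "analyzer": "default_search"
      }
    }
  }
}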

How to get the definition of a search analyzer of an index in Elasticsearch

The mapping of the Elasticsearch index has a custom analyzer attached to it. How can I read the definition of the custom analyzer?
http://localhost:9200/test_namespace/test_namespace/_mapping
"matchingCriteria": {
"type": "string",
"analyzer": "custom_analyzer",
"include_in_all": false
}
My search is not working with the analyzer, which is why I need to know what exactly this analyzer is doing.
The documentation explains how to modify an analyzer or attach a new analyzer to an existing index, but I didn't find a way to see what an analyzer does.
Use the _settings API:
curl -XGET 'http://localhost:9200/test_namespace/_settings?pretty=true'
It should generate a response similar to:
{
"test_namespace" : {
"settings" : {
"index" : {
"creation_date" : "1418990814430",
"routing" : {
"allocation" : {
"disable_allocation" : "false"
}
},
"uuid" : "FmX9NrSNSTO2bQM5pd-iQQ",
"number_of_replicas" : "2",
"analysis" : {
"analyzer" : {
"edi_analyzer" : {
"type" : "custom",
"char_filter" : [ "my_pattern" ],
"filter" : [ "lowercase", "length" ],
"tokenizer" : "whitespace"
},
"xml_analyzer" : {
"type" : "custom",
"char_filter" : [ "html_strip" ],
"filter" : [ "lowercase", "length" ],
"tokenizer" : "whitespace"
},
...
