Elasticsearch's minimumShouldMatch for each member of an array - elasticsearch

Consider an Elasticsearch entity:
{
  "id": 123456,
  "keywords": ["apples", "bananas"]
}
Now, imagine I would like to find this entity by searching for apple.
{
  "match": {
    "keywords": {
      "query": "apple",
      "operator": "AND",
      "minimum_should_match": "75%"
    }
  }
}
The problem is that the 75% minimum for matching is required across both strings of the array together – so nothing will be found. Is there a way to say something like minimumShouldMatch: "75% of any single array item"?
Note that I need to use AND as each item of keywords may be composed of longer text.
EDIT:
I tried the proposed solutions, but none of them gave the expected results. I guess the problem is that the text can be quite long, e.g.:
["national gallery in prague", "narodni galerie v praze"]
I guess the fuzzy expansion is just not able to expand such long strings if you start searching with just "national g".
Would this maybe be possible via nested objects?
{ "keywords": [{ "keyword": "apples" }, { "keyword": "bananas" }] }
and then have minimumShouldMatch=1 on keywords and 75% on each keyword?
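The nested-objects idea could be sketched like this (the index name my-index and the exact request bodies are assumptions, not tested against the original data). A nested query matches as soon as at least one nested object matches, which gives the "minimumShouldMatch=1 on keywords" part, while the inner match applies 75% per individual keyword:

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "keywords": {
        "type": "nested",
        "properties": {
          "keyword": { "type": "text" }
        }
      }
    }
  }
}

GET my-index/_search
{
  "query": {
    "nested": {
      "path": "keywords",
      "query": {
        "match": {
          "keywords.keyword": {
            "query": "national gallery in prague",
            "minimum_should_match": "75%"
          }
        }
      }
    }
  }
}
```

Note that minimum_should_match only has an effect with the default OR operator; with AND, all tokens are required regardless.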

As per the docs:
The match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator parameter can be set to or or and to control the boolean clauses (defaults to or). The minimum number of optional should clauses to match can be set using the minimum_should_match parameter.
If you are searching for multiple tokens, e.g. "apples mangoes", and set the minimum to 100%, both tokens must be present in the document. If you set it to 50%, at least one of them must be present.
If you want to match tokens partially, you can use the fuzziness parameter. With fuzziness you can set the maximum edit distance allowed for a match:
{
  "query": {
    "match": {
      "keywords": {
        "query": "apple",
        "fuzziness": "auto"
      }
    }
  }
}
If you are trying to match a word to its root form, you can use the "stemmer" token filter:
PUT index-name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "stemmer" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Tokens generated:
GET index-name/_analyze
{
  "text": ["apples", "bananas"],
  "analyzer": "my_analyzer"
}
"tokens" : [
  {
    "token" : "appl",
    "start_offset" : 0,
    "end_offset" : 6,
    "type" : "<ALPHANUM>",
    "position" : 0
  },
  {
    "token" : "banana",
    "start_offset" : 7,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 101
  }
]
Stemming reduces words to their root form.
You can also explore n-grams and edge n-grams for partial matching.
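Since the edit mentions prefix searches like "national g", the edge n-gram option could be sketched like this (the index name, gram sizes, and the choice of a plain standard search_analyzer, so the query itself is not grammed, are all assumptions):

```json
PUT edge-index
{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "edge_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "edge_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "keywords": {
        "type": "text",
        "analyzer": "edge_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```

With this, each indexed token is expanded into its prefixes at index time, so a query for "national g" can match "national gallery in prague" without relying on fuzzy expansion.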

Related

ElasticSearch custom analyzer breaks words containing special characters

If user searches for foo(bar), elasticsearch breaks it into foo and bar.
What I'm trying to achieve is this: when a user types in, say, i want a foo(bar), I want to match exactly the item named foo(bar). The name is fixed and will be used by a filter, so it is set to the keyword type.
The approximate steps I did:
define a custom analyzer
define a dictionary containing foo(bar)
define a synonym mapping containing abc => foo(bar)
Now, when I search for abc, elasticsearch translates it to foo(bar), but then it breaks it into foo and bar.
The question, as you may have guessed, is how to preserve special characters in elasticsearch analyzer?
I tried to use quotes (") in the dictionary file, like "foo(bar)", but it didn't work.
Or is there maybe another way to work around this problem?
By the way, I'm using foo(bar) here just for simplicity, the actual case is much more complicated.
Thanks in advance.
---- edit ----
Thanks to #star67, I now realize that this is an issue with the tokenizer.
I am using the plugin medcl/elasticsearch-analysis-ik, which provides an ik_smart tokenizer designed for the Chinese language.
Although I now know what the real problem is, I still don't know how to solve it. I mean, I have to use the ik_smart tokenizer, but how do I modify it to exclude certain special characters?
I know that I can define a custom pattern tokenizer like the one #star67 provided:
{
  "tokenizer": {
    "my_tokenizer": {
      "type": "pattern",
      "pattern": "[^\\w\\(\\)]+"
    }
  }
}
But I also want to use the ik_smart tokenizer, because in Chinese, words are not separated by spaces. For example, 弹性搜索很厉害 should be tokenized as ['弹性', '搜索', '很', '厉害']; words can only be split based on a dictionary, so the default behavior is not desirable. What I want is maybe something like this:
{
  "tokenizer": {
    "my_tokenizer": {
      "tokenizer": "ik_smart",
      "ignores": "[\\w\\(\\)]+"
    }
  }
}
And I couldn't find an equivalent setting in Elasticsearch.
Do I have to build my own plugin to achieve this?
You might want to use another tokenizer in your custom analyzer for your index.
For example, the standard tokenizer (the one the standard analyzer uses) splits on all non-word characters (\W+):
POST _analyze
{
  "analyzer": "standard",
  "text": "foo(bar)"
}
==>
{
  "tokens" : [
    {
      "token" : "foo",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bar",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
Compare this to a custom tokenizer that splits on all non-word characters except ( and ) (i.e. [^\w\(\)]+):
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\w\\(\\)]+"
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "foo(bar)"
}
===>
{
  "tokens" : [
    {
      "token" : "foo(bar)",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "word",
      "position" : 0
    }
  ]
}
I used a Pattern Tokenizer as an example to exclude certain symbols (( and ) in your case) from being used in tokenization.
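For the edited question, where ik_smart must be kept, one workaround that may be worth trying is a mapping char filter that rewrites ( and ) to placeholder strings before the tokenizer runs, so they survive tokenization as part of the token text. This is only a sketch that has not been verified against the ik plugin; the index name, the placeholder strings, and how ik_smart treats the rewritten text are all assumptions, and the same analyzer must be applied at search time so queries undergo the same substitution:

```json
PUT ik-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "protect_parens": {
          "type": "mapping",
          "mappings": [
            "( => _lparen_",
            ") => _rparen_"
          ]
        }
      },
      "analyzer": {
        "my_ik_analyzer": {
          "char_filter": [ "protect_parens" ],
          "tokenizer": "ik_smart"
        }
      }
    }
  }
}
```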

Indexing the last word of a string in Elasticsearch

I'm looking for a way to index the last word (or, more generally, the last token) of a field into a separate sub-field. I've looked into the Predicate Script token filter, but the Painless script API in that context only provides the absolute position of the token from the start of the original input string. So I could find the first token like this:
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.position == 0
        """
      }
    }
  ],
  "text": "the fox jumps the lazy dog"
}
This works and results in:
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}
But I need the last token, not the first. Is there any way to achieve this without preparing a separate field pre-indexing, outside of Elasticsearch?
You're on the right path!! The solution is not far from what you have... You know how to easily fetch the first token, but what you need is the last one... so just reverse the string!
The following analyzer will output just the token you need, i.e. dog.
We first reverse the whole string, then split it into tokens, use your predicate script to select only the first one, and finally reverse that token again. Voilà!
POST test/_analyze
{
  "text": "the fox jumps the lazy dog",
  "tokenizer": "keyword",
  "filter": [
    "reverse",
    "word_delimiter",
    {
      "type": "predicate_token_filter",
      "script": {
        "source": """
          token.position == 0
        """
      }
    },
    "reverse"
  ]
}
Result:
{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    }
  ]
}
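To actually index this into a separate sub-field, as the question asks, the same filter chain can be registered in the index settings and attached to a multi-field (a sketch only; the index, analyzer, and field names are assumptions):

```json
PUT last-word-index
{
  "settings": {
    "analysis": {
      "filter": {
        "first_token_only": {
          "type": "predicate_token_filter",
          "script": {
            "source": "token.position == 0"
          }
        }
      },
      "analyzer": {
        "last_word_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "reverse",
            "word_delimiter",
            "first_token_only",
            "reverse"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "last_word": {
            "type": "text",
            "analyzer": "last_word_analyzer"
          }
        }
      }
    }
  }
}
```

With this mapping, title is indexed normally while title.last_word holds only the final token of each value.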

Get top 100 most used three word phrases in all documents

I have about 15,000 scraped websites with their body texts stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all these texts, something like this:
Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]
I'm new to this. I looked into term vectors, but they appear to apply to single documents, so I feel the solution will be some combination of term vectors and aggregations with n-gram analysis. But I have no idea how to implement this. Any pointers would be helpful.
My current mapping and settings:
{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store": true,
          "analyzer": "fulltext_analyzer"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}
What you're looking for are called Shingles. Shingles are like "word n-grams": serial combinations of more than one term in a string. (E.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine")
Take a look here: https://www.elastic.co/blog/searching-with-shingles
Basically, you need a field with a shingle analyzer producing solely 3-term shingles:
the Elastic blog-post configuration, but with:
"filter_shingle": {
  "type": "shingle",
  "max_shingle_size": 3,
  "min_shingle_size": 3,
  "output_unigrams": "false"
}
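Put together, the settings could look roughly like this (a sketch; the index name and analyzer name are assumptions based on the blog post's naming, and on recent Elasticsearch versions a terms aggregation on a text field additionally requires fielddata to be enabled, as shown):

```json
PUT phrases-index
{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 3,
          "output_unigrams": "false"
        }
      },
      "analyzer": {
        "analyzer_shingle": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "filter_shingle" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "analyzer_shingle",
        "fielddata": true
      }
    }
  }
}
```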
Then, after applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query with a simple terms aggregation on your body field to see the top one hundred 3-word phrases:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "three-word-phrases": {
      "terms": {
        "field": "body",
        "size": 100
      }
    }
  }
}

Elastic search is not returning expected results for term query

This is how my article data looks in Elasticsearch:
id: 123,
title: xyz,
keywords: "Test Example"

id: 124,
title: xyzz,
keywords: "Test Example|test1"

When a keyword is clicked on the front end, say for example 'Test Example', then I should get the articles having that keyword (I should get the above two articles as my results). But I am getting only the first article as my result. Below is my mapping:
"keywords": {
  "type": "string",
  "index": "not_analyzed"
}
How can I get both articles in the search results? Thank you
The term query searches for exact terms. That's why when you search for Test Example you get only one result: there is only one record that exactly matches Test Example. If you want both results, you need to use something like match or query_string. You can use query_string like this:
{
  "query": {
    "query_string": {
      "default_field": "keywords",
      "query": "Test Example*"
    }
  }
}
You have to query with query_string; a term query searches only for the exact term.
You set your keywords field to not_analyzed: if you want the field to be full-text searchable, you should remove the index clause, like so:
"keywords": {
  "type": "string"
}
Searching over this field with a match query, however, will return results matching a superset of the provided query: searching for test will return both documents even though the tag is actually Test Example.
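For instance, with the analyzed mapping, a match query like the following would return both documents (the index name articles is an assumption):

```json
GET articles/_search
{
  "query": {
    "match": {
      "keywords": "test"
    }
  }
}
```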
If you can change your documents to something like this:
id: 123,
title: xyz,
keywords: "Test Example"

id: 124,
title: xyzz,
keywords: ["Test Example", "test1"]

you can use your original mapping with "index": "not_analyzed", and a term query will return only the documents containing exactly the tag you were looking for:
{
  "query": {
    "term": {
      "keywords": "test1"
    }
  }
}
Another option to accomplish the same result is to use a pattern tokenizer to split your tag string on the | character:
"tokenizer": {
  "split_tags": {
    "type": "pattern",
    "group": -1,
    "pattern": "\\|"
  }
}
I have got it working with the following tokenizer:
"split_keywords": {
  "type": "pattern",
  "group": "0",
  "pattern": "([^|]+)"
}
Keywords will be split at the pipe character (below is an example):
{
  "tokens" : [
    {
      "token" : "TestExample",
      "start_offset" : 0,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "test",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "1",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "test1",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    }
  ]
}
Now when I search for 'TestExample', I get the above two articles.
Thanks a lot for your help :)

Elastic Search multilingual field

I have read through a few articles and pieces of advice, but unfortunately I haven't found a working solution.
The problem is that I have a field in the index that can contain content in any possible language, and I don't know which language it is. I need to search and sort on it. It is not localisation, just values in different languages.
The first language (excluding a few European ones) I tried it on was Japanese. To begin with, I set only one analyzer for this field and tried to search only for Japanese words/phrases. I took the example from here. This is what I used:
"analysis": {
  "filter": {
    ...
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": [
        "\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c",
        "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"
      ]
    },
    ...
  },
  "analyzer": {
    ...
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    ...
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}
Mapping:
"name": {
  "type": "string",
  "index": "analyzed",
  "analyzer": "ja_analyzer"
}
And here are a few of my attempts to get results from it:
{
  'filter': {
    'query': {
      'bool': {
        'must': [
          {
            # 'wildcard': {'name': u'*ネバーランド福島*'},
            # 'match': {'name': u'ネバーランド福島'},
            "query_string": {
              "fields": ['name'],
              "query": u'ネバーランド福島',
              "default_operator": 'AND'
            }
          },
        ],
        'boost': 1.0
      }
    }
  }
}
None of them works.
If I just take the standard analyser and query with query_string, or break the phrase myself (on whitespace, which I don't have here) and use a *<>* wildcard, it again finds nothing. The analyser says that ネバーランド and 福島 are separate words/parts:
curl -XPOST 'http://localhost:9200/test/_analyze?analyzer=ja_analyzer&pretty' -d 'ネバーランド福島'
{
  "tokens" : [
    {
      "token" : "ネハラント",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "福島",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    }
  ]
}
And in the case of the standard analyser I do get a result if I look for ネバーランド. But if I use the customised analyser and try the same, or just one symbol, I still get nothing.
The behaviour I'm looking for is: break the query string into words/parts, and all of those words/parts must be present in the resulting name field.
Thank you in advance
