Elasticsearch crashes repeatedly after phrase prefix search on whitespace analyzer - elasticsearch

I have defined my mapping as:
{
  "mappings": { ... },   // all field mappings defined here
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "whitespace",
            "filter": ["lowercase"]
          }
        }
      }
    }
  }
}
The query which I am executing is this one:
{
  "bool": {
    "must": [
      {
        "query_string": {
          "query": "*2AW\\-COTTON_\\&_SON_\\(*",
          "fields": [],
          "type": "phrase_prefix",
          "default_operator": "or",
          "max_determinized_states": 10000,
          "enable_position_increments": true,
          "fuzziness": "AUTO",
          "fuzzy_prefix_length": 0,
          "fuzzy_max_expansions": 50,
          "phrase_slop": 0,
          "escape": false,
          "auto_generate_synonyms_phrase_query": true,
          "fuzzy_transpositions": true,
          "boost": 1.0
        }
      }
    ],
    "filter": [
      {
        "terms": {
          "id": [
            "50010",
            "1604"
          ],
          "boost": 1.0
        }
      }
    ],
    "adjust_pure_negative": true,
    "boost": 1.0
  }
}
I am using a whitespace analyzer instead of the standard one because I need to search on special characters as well, and I have escaped the special characters in this search. But when I run a phrase prefix query on this index, the whole Elasticsearch node crashes every time. The first two queries take 20-30 seconds; after that, any further query crashes ES. Right now I am testing this on a 2GB RAM machine with an allocated heap size of 1GB. Can this be the reason, and will increasing the machine size help? Thanks for any help!

Since you haven't specified a field to perform the wildcard on, ES will search almost all fields.
Have you tried using a wildcard or regexp filter instead of query_string?
If you do know which field you want to query (and I suspect you do), use something along the lines of:
GET fuzzy/_search?request_cache=false
{
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "identifier": ".*2aw-cotton_\\&_son_\\(.*"
          }
        }
      ]
    }
  }
}
Even with 400 sample docs on my machine, the speed improvements are 40x over the wide-range query_string.
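If you prefer the wildcard query mentioned above, a roughly equivalent sketch would be the following (still assuming a hypothetical identifier field analyzed with the lowercase whitespace analyzer; only * and ? are special in wildcard syntax, so the other characters need no escaping):
GET fuzzy/_search?request_cache=false
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "identifier": {
              "value": "*2aw-cotton_&_son_(*"
            }
          }
        }
      ]
    }
  }
}
Like the regexp version, a leading wildcard still has to scan the terms dictionary, so it is faster than the multi-field query_string but not free.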
P.S.: Of course, remove request_cache when in production.

Related

Elasticsearch: simple_query_string and multi-words synonyms

I have a field with the following search_analyzer:
"name_search_en" : {
"filter" : [
"english_possessive_stemmer",
"lowercase",
"name_synonyms_en",
"english_stop",
"english_stemmer",
"asciifolding"
],
"tokenizer" : "standard"
}
name_synonyms_en is a synonym_graph filter that looks like this:
"name_synonyms_en" : {
  "type" : "synonym_graph",
  "synonyms" : [
    "beach bag => straw bag,beach bag",
    "bicycle,bike"
  ]
}
Running the following multi_match query, the synonyms are correctly applied:
{
  "query": {
    "multi_match": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "type": "cross_fields",
      "fields": [
        "brand.en-US^1.0",
        "name.en-US^1.0"
      ]
    }
  }
}
Here is the _validate explanation output. Both beach bag and straw bag are present, as expected, in the raw query:
"explanations" : [
{
"index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
"valid" : true,
"explanation" : "+((((+name.en-US:straw +name.en-US:bag) (+name.en-US:beach +name.en-US:bag))) | (brand.en-US:beach brand.en-US:bag)) #DocValuesFieldExistsQuery [field=_primary_term]"
}
]
I would expect the same from the following simple_query_string:
{
  "query": {
    "simple_query_string": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "fields": [
        "brand.en-US^1.0",
        "name.en-US^1.0"
      ]
    }
  }
}
but the straw bag synonym is not present in the raw query:
"explanations" : [
  {
    "index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
    "valid" : true,
    "explanation" : "+((name.en-US:beach | brand.en-US:beach)~1.0 (name.en-US:bag | brand.en-US:bag)~1.0) #DocValuesFieldExistsQuery [field=_primary_term]"
  }
]
The problem seems to be related to multi-term synonyms only. If I search for bike, the bicycle synonym is correctly present in the query:
"explanations" : [
{
"index" : "d7598351-311f-4844-bb91-4f26c9f538f3",
"valid" : true,
"explanation" : "+(Synonym(name.en-US:bicycl name.en-US:bike) | brand.en-US:bike)~1.0 #DocValuesFieldExistsQuery [field=_primary_term]"
}
]
Is this the expected behaviour (meaning multi-term synonyms are not supported for this query)?
By default simple_query_string has the WHITESPACE flag enabled, so the input text is split on whitespace before it is analyzed. That is why the synonym filter doesn't handle multi-word synonyms correctly. The following query disables all flags, making multi-word synonyms work as expected:
{
  "query": {
    "simple_query_string": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "flags": "NONE",
      "fields": [
        "brand.en-US^1.0",
        "name.en-US^1.0"
      ]
    }
  }
}
This unfortunately does not play well with the minimum_should_match parameter. A full discussion and more details can be found here: https://discuss.elastic.co/t/simple-query-string-and-multi-terms-synonyms/174780
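To confirm that the multi-word synonym is now being expanded, you can re-run the same _validate call used earlier against the simple_query_string with the flags disabled (a quick check, assuming the index shown in the explanation output above):
GET d7598351-311f-4844-bb91-4f26c9f538f3/_validate/query?explain=true
{
  "query": {
    "simple_query_string": {
      "query": "beach bag",
      "auto_generate_synonyms_phrase_query": false,
      "flags": "NONE",
      "fields": [
        "brand.en-US^1.0",
        "name.en-US^1.0"
      ]
    }
  }
}
The explanation should now contain both the beach bag and straw bag clauses on name.en-US, as in the multi_match output.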

Match fails in Elasticsearch

I have the following index in which I index mail addresses.
PUT _myindex
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "email" : {
          "type" : "pattern_capture",
          "preserve_original" : true,
          "patterns" : [
            "^(.*?)#",
            "(\\w+(?=.*#))"
          ]
        }
      },
      "analyzer" : {
        "email" : {
          "tokenizer" : "uax_url_email",
          "filter" : [ "lowercase", "email", "unique" ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "text",
          "analyzer": "email"
        }
      }
    }
  }
}
My e-mails are in the following form: "example.elastic#yahoo.com". When I index them, they get analysed into tokens like example.elastic#yahoo.com, example.elastic, elastic, example.
When I run a match query:
GET _myindex/_search
{
  "query": {
    "match": {
      "email": "example.elastic#yahoo.com"
    }
  }
}
or use example, elastic, or Elastic as the query string, it works and retrieves results. But the problem is that when I search for "example.elastic.blabla#yahoo.com", it also returns the same results. What can be the problem?
Using a term query instead of a match query will solve this.
The reason is that the match query applies the analyzer to the search term and will therefore match whatever analyzed tokens are stored in the index, whereas the term query does not apply any analyzer to the search term and only looks for that exact term in the index.
Ref: https://stackoverflow.com/a/23151332/6546289
GET _myindex/_search
{
  "query": {
    "term": {
      "email": "example.elastic#yahoo.com"
    }
  }
}
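To see why the longer address still matches the original match query, you can inspect the tokens the email analyzer produces for it (a quick check, assuming the index and analyzer names above):
GET _myindex/_analyze
{
  "analyzer": "email",
  "text": "example.elastic.blabla#yahoo.com"
}
If tokens such as example and elastic appear for both addresses, that explains why the match query returns the same results for them.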

Exact match in Elasticsearch after incorporating hunspell filter

We have added the hunspell filter to our Elasticsearch instance. Nothing fancy...
{
  "index" : {
    "analysis" : {
      "tokenizer" : {
        "comma" : {
          "type" : "pattern",
          "pattern" : ","
        }
      },
      "filter": {
        "en_GB": {
          "type": "hunspell",
          "language": "en_GB"
        }
      },
      "analyzer" : {
        "comma" : {
          "type" : "custom",
          "tokenizer" : "comma"
        },
        "en_GB": {
          "filter": [
            "lowercase",
            "en_GB"
          ],
          "tokenizer": "standard"
        }
      }
    }
  }
}
Now, though, we seem to have lost the built-in facility to do exact match queries using quotation marks. Searching for "lace" will, for example, also do an equal-score search for "lacy". I understand this is kind of the point of including hunspell, but I would like to be able to force exact matches by using quotes.
I am doing boolean queries for this, by the way, along the lines of (in Java):
"bool" : {
"must" : {
"query_string" : {
"query" : "\"lace\"",
"fields" :
...
or (Postman, direct to port 9200):
{
  "query" : {
    "query_string" : {
      "query" : "\"lace\"",
      "fields" :
      ....
Is this possible? I'm guessing this might be something we would do in the tokenizer, but I'm not quite sure where to start.
You will not be able to handle this at the tokenizer level, but you can tweak the configuration at the mapping level to use multi-fields: keep a copy of the same field that is not analyzed and use it later in the query to support your use case.
You can update your mappings like the following:
"mappings": {
"desc": {
"properties": {
"labels": {
"type": "string",
"analyzer": "en_GB",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
Then modify your query to search on the raw field instead of the analyzed field:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "labels.raw",
            "query": "lace"
          }
        }
      ]
    }
  }
}
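If you only need an exact, un-analyzed match and no query-string syntax, a plain term query against the raw sub-field is a slightly simpler sketch (assuming the labels.raw mapping above):
{
  "query": {
    "term": {
      "labels.raw": "lace"
    }
  }
}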
Hope this helps
Thanks

Elasticsearch - no hit though there should be a result

I've encountered the following problem with Elasticsearch; does anyone know where I should troubleshoot?
I'm happily retrieving results with the following query:
{
  "query" : {
    "match" : { "name" : "A1212001" }
  }
}
But when I shorten the value of the search field "name" to a substring, I get no hit:
{
  "query" : {
    "match" : { "name" : "A12120" }
  }
}
"A12120" is a substring of already hit query "A1212001"
If you don't have too many documents, you can go with a regexp query:
POST /index/_search
{
  "query" : {
    "regexp" : {
      "name": "A12120.*"
    }
  }
}
or even a wildcard one:
POST /index/_search
{
  "query": {
    "wildcard" : { "name" : "A12120*" }
  }
}
However, as #Waldemar suggested, if you have many documents in your index, the best approach for this is to use an EdgeNGram tokenizer since the above queries are not ultra-performant.
First, you define your index settings like this:
PUT index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_analyzer" : {
          "type": "custom",
          "tokenizer" : "edge_tokens",
          "filter": ["lowercase"]
        }
      },
      "tokenizer" : {
        "edge_tokens" : {
          "type" : "edgeNGram",
          "min_gram" : "1",
          "max_gram" : "10",
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "my_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
Then, when indexing a document whose name field contains A1212001, the following tokens will be indexed: A, A1, A12, A121, A1212, A12120, A121200, A1212001. So when you search for A12120 you'll find a match.
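You can confirm which tokens the analyzer actually emits with the _analyze API (a quick check against the index and analyzer defined above):
POST index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A1212001"
}
Note that the lowercase filter in my_analyzer means the stored tokens will be lowercased; the standard search analyzer lowercases the query in the same way, so A12120 still matches.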
If you are using a match query, it will look for whole terms inside Lucene, and your indexed term is A1212001. If you need to find a part of your term, you can use a regexp query, but be aware that there are internal performance impacts when using regex, because the shard will have to check all of your terms.
If you need a more "professional" way to search for a part of a term, you can use n-grams.
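As a rough illustration of the n-gram approach (not settings from either answer; the index and analyzer names here are made up), an analyzer based on the ngram tokenizer could look like this:
PUT my_ngram_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokens",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokens": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
Unlike edge n-grams, this also matches substrings in the middle of a term (for example 1212 inside A1212001), at the cost of a larger index.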

Elasticsearch query response influenced by _id

I created an index with the following mappings and settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_index": {
          "type": "custom",
          "tokenizer": "filename",
          "filter": ["icu_folding", "edge_ngram"]
        },
        "default_search": {
          "type": "standard",
          "tokenizer": "filename",
          "filter": [
            "icu_folding"
          ]
        }
      },
      "tokenizer" : {
        "filename" : {
          "pattern" : "[^\\p{L}\\d]+",
          "type" : "pattern"
        }
      },
      "filter" : {
        "edge_ngram" : {
          "side" : "front",
          "max_gram" : 20,
          "min_gram" : 3,
          "type" : "edgeNGram"
        }
      }
    }
  },
  "mappings": {
    "metadata": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "case_insensitive_index"
        }
      }
    }
  }
}
I have the following documents:
{"title":"P-20150531-27332_News.jpg"}
{"title":"P-20150531-27341_News.jpg"}
{"title":"P-20150531-27512_News.jpg"}
{"title":"P-20150531-27343_News.jpg"}
Creating these documents with simple numerical IDs
111
112
113
114
and querying with the following query
{
  "from" : 0,
  "size" : 10,
  "query" : {
    "match" : {
      "title" : {
        "query" : "P-20150531-27332_News.jpg",
        "type" : "boolean",
        "fuzziness" : "AUTO"
      }
    }
  }
}
results in the correct scoring and ordering of the documents returned:
P-20150531-27332_News.jpg -> 2.780985
P-20150531-27341_News.jpg -> 0.8262239
P-20150531-27512_News.jpg -> 0.8120311
P-20150531-27343_News.jpg -> 0.7687101
Strangely, creating the same documents with UUIDs
557eec2e3b00002c03de96bd
557eec0f3b00001b03de96b8
557eec0c3b00001b03de96b7
557eec123b00003a03de96ba
as IDs results in different scorings of the documents:
P-20150531-27341_News.jpg -> 2.646321
P-20150531-27332_News.jpg -> 2.1998127
P-20150531-27512_News.jpg -> 1.7725387
P-20150531-27343_News.jpg -> 1.2718291
Is this an intentional behaviour of Elasticsearch? If so, how can I preserve the correct ordering regardless of the IDs used?
In the query, it looks like you should be using 'default_search' as the analyzer for the match query, unless you actually intended to use edge-ngram on the search query too.
Example:
{
  "from" : 0,
  "size" : 10,
  "query" : {
    "match" : {
      "title" : {
        "query" : "P-20150531-27332_News.jpg",
        "type" : "boolean",
        "fuzziness" : "AUTO",
        "analyzer" : "default_search"
      }
    }
  }
}
default_search would be used as the search analyzer only if there is no explicit search_analyzer or analyzer specified in the mapping of the field.
The article here gives a good explanation of the rules by which analyzers are applied.
Also, to ensure IDF takes documents across shards into account, you could use search_type=dfs_query_then_fetch.
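For example (a sketch reusing the match query above, with my_index standing in for your actual index name):
GET my_index/_search?search_type=dfs_query_then_fetch
{
  "from" : 0,
  "size" : 10,
  "query" : {
    "match" : {
      "title" : {
        "query" : "P-20150531-27332_News.jpg",
        "type" : "boolean",
        "fuzziness" : "AUTO",
        "analyzer" : "default_search"
      }
    }
  }
}
dfs_query_then_fetch first collects term statistics from all shards, so the IDF used for scoring is the same regardless of which shard a document's ID routed it to.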
