elasticsearch multi_match with regexp

I'm trying to rebuild my Elasticsearch query because I found that I'm not receiving all the documents I'm looking for.
So, let's assume that I have a document like this:
{
  "id": 1234,
  "mail_id": 5,
  "sender": "john smith",
  "email": "johnsmith@gmail.com",
  "subject": "somesubject",
  "txt": "abcdefgh\r\n",
  "html": "<div dir=\"ltr\">abcdefgh</div>\r\n",
  "date": "2017-07-20 10:00:00"
}
I have a few million documents like this, and now I'm trying to search for some of them with a query like this:
{
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "minimum_should_match": "100%",
      "should": [
        {
          "multi_match": {
            "type": "cross_fields",
            "query": "abcdefgh johnsmith john smith",
            "operator": "and",
            "fields": [
              "email.full",
              "sender",
              "subject",
              "txt",
              "html"
            ]
          }
        }
      ],
      "must": [
        {
          "ids": {
            "values": [
              "1234"
            ]
          }
        },
        {
          "term": {
            "mail_id": 5
          }
        }
      ]
    }
  }
}
A query like this works fine, but when I want to find the document by querying 'gmail' or 'com', it does not work:
"query": "abcdefgh johnsmith john smith gmail"
"query": "abcdefgh johnsmith john smith com"
It works only when I search for 'gmail.com':
"query": "abcdefgh johnsmith john smith gmail.com"
So... I tried attaching an analyzer:
...
"type": "cross_fields",
"query": "abcdefgh johnsmith john smith",
"operator": "and",
"analyzer": "simple",
...
That does not help at all. The only way I was able to find this document was to add a regexp, e.g.:
"minimum_should_match": 1,
"should": [
{
"multi_match": {
"type": "cross_fields",
"query": "fdsfs wukamil kam wuj gmail.com",
"operator": "and",
"fields": [
"email.full",
"sender",
"subject",
"txt",
"html"
]
}
},
{
"regexp": {
"email.full": ".*gmail.*"
}
}
],
but with this approach I would have to add (queries × fields) regexp objects to my JSON, so I don't think it's the best solution. I also know about wildcards, but they would be just as messy as regexps.
If anyone has had a problem like this and knows the solution, I'd be thankful for help :)

If you run your search term through the standard analyzer, you can see what tokens johnsmith@gmail.com gets broken down into. You can do this directly in your browser using the URL below:
https://<your_site>:<es_port>/_analyze/?analyzer=standard&text=johnsmith@gmail.com
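If you're on a newer Elasticsearch version where the URL-parameter form of _analyze has been removed, the request-body equivalent (same output) is:
GET _analyze
{
  "analyzer": "standard",
  "text": "johnsmith@gmail.com"
}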
This will show that the email gets broken down into the following tokens:
{
  "tokens": [
    {
      "token": "johnsmith",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gmail.com",
      "start_offset": 10,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
So this shows that you can't search using just gmail, but you can using gmail.com. To split your text on the dot too, you can update your mapping to use the Simple Analyzer, whose documentation says:
The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.
We can show this works by updating our URL from earlier to use the simple analyser, as below:
https://<your_site>:<es_port>/_analyze/?analyzer=simple&text=johnsmith@gmail.com
Which returns:
{
  "tokens": [
    {
      "token": "johnsmith",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "gmail",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "com",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}
This analyser may not be the right tool for the job, as it discards any non-letter characters, but you can experiment with analysers and tokenisers until you get what you need.
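One thing worth stressing: setting "analyzer": "simple" inside the multi_match, as the question tried, only changes how the query text is tokenised; the documents were already indexed with the standard analyser, so no standalone gmail token exists in the index. The analyser has to be applied in the mapping, and the data reindexed. A minimal sketch, assuming the question's field names and the typeless mapping syntax of recent versions (the question's own mapping isn't shown, so email.full is just kept here as a standard-analysed text field):
PUT my_index
{
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "simple",
        "fields": {
          "full": {
            "type": "text"
          }
        }
      }
    }
  }
}
After reindexing, email holds the tokens johnsmith, gmail and com, so adding email to the multi_match fields list makes the gmail and com queries match.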

Related

Elasticsearch: FVH highlights with multiple pre and post tags marking tokens incorrectly?

I'm querying my index using a boolean query with two match terms. For each term I have a separate set of pre- and post-tags. Using highlighting, I would like to obtain the documents in which both terms exist and see which tokens were matched as each of them. The index contains documents in Polish, analyzed using morfologik. Let's call the two terms I'm searching for aspect and feature. I want to query the index, retrieve the documents in which both a specific aspect and a specific feature exist, and have the highlighter mark the aspect token with an <aspect> tag and the feature with a <feature> tag. Most of the time it works as expected; sometimes, though, Elasticsearch marks one or both of the tokens incorrectly. I'll give you an example.
So let's say my index contains the following document:
"Najlepsza maseczka na zniszczone włosy!"
If I search for "maseczka" (aspect) and "dobry" (feature) I expect the output to be like this:
"<feature>Najlepsza</feature> <aspect>maseczka</aspect> na zniszczone włosy! "
For some reason the results from Elasticsearch are like this:
"<aspect>Najlepsza</aspect> <aspect>maseczka</aspect> na zniszczone włosy! "
What I know so far:
I thought maybe the aspect and feature have a similar form when analyzed, but that's not the case; for example, _analyze for the above example returns:
# query
GET my_index/_analyze
{
  "analyzer": "morfologik",
  "text": "dobra maseczka"
}
# results
{
  "tokens": [
    {
      "token": "dobra",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "dobro",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "dobry",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "maseczka",
      "start_offset": 6,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
# Analysis of the document:
GET my_index/_analyze
{
  "analyzer": "morfologik",
  "text": "Najlepsza maseczka na zniszczone włosy"
}
# response
{
  "tokens": [
    {
      "token": "dobry",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "maseczka",
      "start_offset": 10,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "na",
      "start_offset": 19,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    ...
  ]
}
It's also not a problem with a specific aspect or feature, because for some queries the index will return both correctly and incorrectly highlighted documents (so I'd expect it to be a problem with documents rather than queries).
In some cases both terms are highlighted as aspects; in others the aspect is marked as feature and the feature as aspect. I haven't found any rule so far.
I thought that if my search terms match the order of the highlight tags, the first term should always get the first tag and the second term always the second tag, but maybe they work in a different way? I assumed that's how it works based on this response:
Using the Fast Vector Highlighter, you can specify tags in order of "importance" which seems to mean that their order and the order of your search terms should match.
Here's how my index is constructed:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "morfologik": {
          "tokenizer": "standard",
          "filter": [
            "morfologik_stem",
            "lowercase"
          ],
          "type": "custom"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "term_vector": "with_positions_offsets",
          "analyzer": "morfologik"
        },
        "original_doc": {
          "type": "integer"
        }
      }
    }
  }
}
Here's my query:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "maseczki" } },
        { "match": { "content": "dobre" } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "fragment_size": 200,
        "type": "fvh",
        "pre_tags": ["<aspect>", "<feature>"],
        "post_tags": ["</aspect>", "</feature>"]
      }
    }
  }
}
And here's a sample response:
{
  "_index": "my_index",
  "_type": "doc",
  "_id": "R91v7GkB0hUBqPARgC54",
  "_score": 16.864662,
  "_source": {
    "content": "Najlepsza maseczka na zniszczone włosy! ",
    "original_doc_id": 74290
  },
  "highlight": {
    "content": [
      "<aspect>Najlepsza</aspect> <aspect>maseczka</aspect> na zniszczone włosy! "
    ]
  }
},
As I said, most of the time the query works fine; sometimes the all-aspect highlighting occurs only for a subset of a specific query's results, as it does in the case of "(opakowanie, solidne)":
Aspect here is in fact feature, and feature is aspect:
<aspect>solidne</aspect>, naprawdę świetne <feature>opakowanie</feature>
solidne should be marked as feature here:
Jedyne do czego mogłabym się przyczepić to <aspect>opakowanie</aspect> które wg mnie niestety nie jest <aspect>solidne</aspect>
In my understanding, if you want to do a match query on a space-separated string, you should be using the whitespace tokenizer.
I would suggest you check out this tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-whitespace-tokenizer.html
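A minimal sketch of wiring that in, based on the question's analyzer definition with the tokenizer swapped (the name morfologik_ws is made up here, and a new index plus a reindex would be needed):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "morfologik_ws": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "morfologik_stem",
            "lowercase"
          ]
        }
      }
    }
  }
}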

Elasticsearch - Stop analyzer doesn't allow number

I'm trying to build a search utility using Elasticsearch 6.3.0 where any term can be searched within the database. I have applied the stop analyzer to exclude some generic words. However, after adding that analyzer, the system stopped giving me terms with numbers as well.
For example, if I search for news24, it drops the 24 and searches only for the term "news" in all records. I'm unsure why.
Below is the query I am using
{
  "from": 0,
  "size": 10,
  "explain": false,
  "stored_fields": [
    "_source"
  ],
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "news24",
          "analyzer": "stop",
          "fields": [
            "title",
            "keywords",
            "url"
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "( (doc['isSponsered'].value == 'y') ? 100 : 0 )"
          }
        },
        {
          "script_score": {
            "script": "doc['linksCount'].value"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  },
  "script_fields": {
    "custom_score": {
      "script": {
        "lang": "painless",
        "source": "params._source.linksArray"
      }
    }
  },
  "highlight": {
    "pre_tags": [
      ""
    ],
    "post_tags": [
      "<\/span>"
    ],
    "fields": {
      "title": {
        "type": "plain"
      },
      "keywords": {
        "type": "plain"
      },
      "description": {
        "type": "plain"
      },
      "url": {
        "type": "plain"
      }
    }
  }
}
That is because the stop analyzer is just an extension of the simple analyzer, which makes use of the lowercase tokenizer. That tokenizer simply breaks text into tokens whenever it encounters a character which is not a letter (and also lowercases all terms).
So basically, if you have something like news24, it breaks it into news as soon as it encounters the 2.
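You can verify this against the built-in stop analyzer directly; the following returns the single token news, with the 24 gone:
POST _analyze
{
  "analyzer": "stop",
  "text": "news24"
}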
This is the default behaviour of the stop analyzer. If you intend to make use of stop words and still want to keep numerics in the picture, you need to create a custom analyzer as shown below.
Index settings:
POST sometestindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
What this does is make use of the standard analyzer, which internally uses the standard tokenizer, while also removing English stop words.
Analysis Query To Test
POST sometestindex/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "the name of the channel is news24"
}
Query Result
{
  "tokens": [
    {
      "token": "name",
      "start_offset": 4,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "channel",
      "start_offset": 16,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "news24",
      "start_offset": 27,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}
You can see in the above tokens that news24 is preserved as a token.
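To use it in the original search, the analyzer must exist in the settings of the index being queried; the multi_match can then reference it instead of stop. A sketch based on the question's query:
"multi_match": {
  "query": "news24",
  "analyzer": "my_english_analyzer",
  "fields": [
    "title",
    "keywords",
    "url"
  ]
}
Applying the analyzer to the fields in the mapping instead would also keep stop words out of the index itself.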
Hope it helps!

Elasticsearch match certain fields exactly but not others

I need Elasticsearch to match certain fields exactly; I'm currently using multi_match.
For example, a user types in long beach chiropractor.
I want long beach to match the city field exactly, and not return results for seal beach or glass beach.
At the same time chiropractor should also match chiropractic.
Here is the current query I am using:
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": [
"title",
"location_address_address_1.value",
"location_address_city.value^2",
"location_address_state.value",
"specialty" // e.g. chiropractor
],
"query": "chiropractor long beach",
"boost": 6,
"type": "cross_fields"
}
}
]
}
},
The right approach would be to separate the searched term from the location and store the location as the keyword type. If that's not possible, then you can use a synonym token filter to store locations as single tokens, but this will require having a list of all possible locations, e.g.:
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "long beach=>long-beach"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
Now if you call
POST /my_index/_analyze?analyzer=my_synonyms
{
  "text": ["chiropractor long beach"]
}
the response is
{
  "tokens": [
    {
      "token": "chiropractor",
      "start_offset": 0,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "long-beach",
      "start_offset": 13,
      "end_offset": 23,
      "type": "SYNONYM",
      "position": 1
    }
  ]
}
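To make the index side line up, the same analyzer would also be applied to the city field in the mapping, so that long beach is indexed as the single token long-beach too. A sketch, inferring the nested value structure from the question's field names:
"mappings": {
  "properties": {
    "location_address_city": {
      "properties": {
        "value": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}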

Phonetic search results for integers with Elasticserach

Forgive me, as I am new to Elasticsearch, but I am following the phonetic matching guide found here: Phonetic Matching
I have the following:
POST /app
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "dbl_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone"
          }
        },
        "analyzer": {
          "dbl_metaphone": {
            "tokenizer": "standard",
            "filter": "dbl_metaphone"
          }
        }
      }
    }
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        },
        "year": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        }
      }
    }
  }
}
I add some documents by doing:
POST /app/movie
{ "title": "300", "year": "2006" } and { "title": "500 days of summer", "year": "2009" }
I then want to query for the movie '300' with this query:
POST /app/movie/_search
{
  "query": {
    "match": {
      "title.phonetic": {
        "query": "three hundred"
      }
    }
  }
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
  "tokens": [
    {
      "token": "300",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
I see that only a number token is returned, not an alphanumeric version like:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
  "tokens": [
    {
      "token": "0R",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "TR",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "HNTR",
      "start_offset": 6,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Is there something that I am missing with my phonetic query that I am supposed to define to get both the numerical and alphanumeric tokens?
That is not possible. Double Metaphone is a form of phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This makes it possible to search for terms, like names, that could be spelt differently but sound the same.
As you can see from the algorithm, Double Metaphone ignores numbers/numeric characters.
You can read more about Double Metaphone here.
A better case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
Shouldn't be too hard to find/write a function that converts integers to English.
Call your function when constructing your document to ingest into ES.
Alternatively, write it in Groovy and call it as a transform script in your mapping.
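A related option, sketched here rather than taken from the answers above, is to do the number-to-words mapping inside Elasticsearch with a synonym filter, so a query for "three hundred" is rewritten to "300" at search time (the index, filter, and analyzer names are invented, and you'd need one synonym rule per number you care about):
PUT /app2
{
  "settings": {
    "analysis": {
      "filter": {
        "number_words": {
          "type": "synonym",
          "synonyms": [
            "three hundred => 300"
          ]
        }
      },
      "analyzer": {
        "number_aware": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "number_words"
          ]
        }
      }
    }
  }
}
Using number_aware as the search_analyzer on title, the query "three hundred" is tokenised to 300 and matches the stored title.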

Confusing query_string search results

I've got Elasticsearch set up and am running queries against it, but I'm getting odd results and can't figure out why.
For example, here's one relevant portion of my mapping:
"classification": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
And here are some of the queries and results. For all of these, there are objects with a classification value of "Jewelry & Adornment":
Query:
"query": {
"bool": {
"must": [
{
"match_all": {}
},
{
"query_string": {
"query": "(classification:/jewel.*/)"
}
}
]
}
}
Result:
"hits": {
"total": 2541,
"max_score": 1.4142135,
"hits": [
{
...
Yet if I add "ry":
Query:
"query_string": {
"query": "(classification:/jewelry.*/)"
}
Result:
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
I've also tried running the queries:
"query_string": {
"query": "(classification\\*:/jewelry.*/)"
}
(should match either "classification" or "classification.raw")
And:
"query_string": {
"query": "(classification.raw:/jewelry.*/)"
}
I've also tried case variations, e.g. "Jewelry" vs. "jewelry", to no effect. All of these return no results. This makes no sense to me. Even when querying "classification.raw" with "Jewelry" (same case, and on a completely unanalyzed field), I get no results. Any ideas?
UPDATE
As per the request of @keety:
{
  "tokens": [
    {
      "token": "jewelri",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "adorn",
      "start_offset": 10,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
I imagine the fact that it's stemming "jewelry" to "jewelri" is my problem, but I'm not sure why it's doing that or how to fix it.
UPDATE #2
These are the analyzers in play:
"analyzer": {
"default_index": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"index_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
},
"default_search": {
"type": "custom",
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding",
"custom_stem",
"porter_stem",
"search_filter"
],
"char_filter": [
"html_strip",
"quotes"
]
}
}
UPDATE #3
I ran an _explain query on one of the objects that should be matching but isn't, and got the following:
"matched": false,
"explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0.70710677,
"description": "ConstantScore(*:*), product of:",
"details": [
{
"value": 1,
"description": "boost"
},
{
"value": 0.70710677,
"description": "queryNorm"
}
]
},
{
"value": 0,
"description": "no match on required clause (ConstantScore())"
}
]
}
I don't know what "required clause (ConstantScore())" means. The only related thing I can find is the Constant Score Query, but I'm not employing that particular query anywhere.
UPDATE #4
Okay, this is getting a little long-winded; sorry about that. However, I just discovered that the problem seems to lie in the regexp syntax. If I just use a basic wildcard (along with "analyze_wildcard": true), then all my queries start working.
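For reference, the working form described in this update would look something like the following (reconstructed, not copied from the post). This also explains the earlier results: the regexp syntax in query_string is never analysed, so /jewelry.*/ was compared against the stemmed token jewelri and could not match, whereas jewel.* matched the stem's prefix; an analysed wildcard term, by contrast, gets stemmed the same way the indexed text was:
"query_string": {
  "query": "classification:jewelry*",
  "analyze_wildcard": true
}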
