Elasticsearch - Stop analyzer doesn't allow numbers

I'm trying to build a search utility using Elasticsearch 6.3.0 where any term can be searched within the database. I have applied the Stop Analyzer to exclude some generic words. However, since adding that analyzer, the system has stopped returning terms that contain numbers as well.
For example, if I search for news24 it drops the 24 and searches only for the term "news" across all records. I'm unsure why.
Below is the query I am using:
{
"from": 0,
"size": 10,
"explain": false,
"stored_fields": [
"_source"
],
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "news24",
"analyzer": "stop",
"fields": [
"title",
"keywords",
"url"
]
}
},
"functions": [
{
"script_score": {
"script": "( (doc['isSponsered'].value == 'y') ? 100 : 0 )"
}
},
{
"script_score": {
"script": "doc['linksCount'].value"
}
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"script_fields": {
"custom_score": {
"script": {
"lang": "painless",
"source": "params._source.linksArray"
}
}
},
"highlight": {
"pre_tags": [
""
],
"post_tags": [
"<\/span>"
],
"fields": {
"title": {
"type": "plain"
},
"keywords": {
"type": "plain"
},
"description": {
"type": "plain"
},
"url": {
"type": "plain"
}
}
}
}

That is because the stop analyzer is just an extension of the Simple Analyzer, which uses the Lowercase Tokenizer. That tokenizer breaks text into tokens whenever it encounters a character that is not a letter (and of course also lowercases all terms).
So basically, if you have something like news24, it breaks the term at the 2 and keeps only news.
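You can confirm this with the _analyze API (a quick check against the built-in stop analyzer; no index is needed):
POST _analyze
{
"analyzer": "stop",
"text": "news24"
}
This should return a single token news, with the 24 dropped entirely.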
This is the default behaviour of the stop analyzer. If you want stop words removed but still want to keep numeric characters in the picture, you need to define a custom analyzer as shown below:
Index settings:
PUT sometestindex
{
"settings":{
"analysis":{
"analyzer":{
"my_english_analyzer":{
"type":"standard",
"stopwords":"_english_"
}
}
}
}
}
What this does is make use of the Standard Analyzer, which internally uses the Standard Tokenizer, and it also removes English stop words.
Analysis Query To Test
POST sometestindex/_analyze
{
"analyzer": "my_english_analyzer",
"text": "the name of the channel is news24"
}
Query Result
{
"tokens": [
{
"token": "name",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "channel",
"start_offset": 16,
"end_offset": 23,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "news24",
"start_offset": 27,
"end_offset": 33,
"type": "<ALPHANUM>",
"position": 6
}
]
}
You can see in the above tokens that news24 is preserved as a token.
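To use this at search time, reference the custom analyzer in your multi_match instead of the built-in stop analyzer. A minimal sketch, assuming the index above and the fields from your original query:
POST sometestindex/_search
{
"query": {
"multi_match": {
"query": "news24",
"analyzer": "my_english_analyzer",
"fields": [
"title",
"keywords",
"url"
]
}
}
}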
Hope it helps!

Related

how to search a document containing a substring

I have the following document with this (partial) mapping:
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
I'm trying to perform a query for documents containing "success":"0" through the following DSL query:
{
"query": {
"bool": {
"must": {
"regexp": {
"message": ".*\"success\".*0.*"
}
}
}
}
}
but I don't get any results, whereas if I perform the following DSL:
{
"query": {
"bool": {
"must": {
"regexp": {
"message": ".*\"success\""
}
}
}
}
}
I do get some documents back, e.g.:
{"data":"[{\"appVersion\":\"1.1.1\",\"installationId\":\"any-ubst-id\",\"platform\":\"aaa\",\"brand\":\"Dalvik\",\"screenSize\":\"xhdpi\"}]","executionTime":"0","flags":"0","method":"aaa","service":"myService","success":"0","type":"aservice","version":"1"}
What's wrong with my query?
The text field message uses the standard analyzer, which breaks the input string into tokens.
If we analyze the string "success":"0" using the standard analyzer, we get these tokens:
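You can reproduce this with the _analyze API (a quick check; note the escaped quotes in the JSON body, and the offsets may differ slightly depending on the exact string analyzed):
POST _analyze
{
"analyzer": "standard",
"text": "\"success\":\"0\""
}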
{
"tokens": [
{
"token": "success",
"start_offset": 2,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "0",
"start_offset": 12,
"end_offset": 13,
"type": "<NUM>",
"position": 1
}
]
}
So you can see that the colon, double quotes, etc. are removed. And since the regexp query is applied to each token separately, it will not match your pattern.
But if we use message.keyword, which has field type keyword, the value is not analyzed and the string is kept as it is.
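Again, you can see this with _analyze against the keyword sub-field (a sketch, assuming your index is named my_index; the surrounding spaces simply mirror the output below):
GET my_index/_analyze
{
"field": "message.keyword",
"text": " \"success\":\"0\" "
}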
{
"tokens": [
{
"token": """ "success":"0" """,
"start_offset": 0,
"end_offset": 15,
"type": "word",
"position": 0
}
]
}
So if we use the query below, it should work:
{
"query": {
"regexp": {
"message.keyword": """.*"success".*0.*"""
}
}
}
But there is another catch: you have set the message.keyword field to "ignore_above": 256, so this sub-field will not index any string longer than 256 characters.
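If your messages can be longer than that, you could raise the limit on the keyword sub-field. A minimal sketch of the field mapping, where 1024 is only an example value and a reindex is required for the change to take effect:
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 1024
}
}
}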

Elasticsearch match certain fields exactly but not others

I need Elasticsearch to match certain fields exactly; I am currently using multi_match.
For example, a user types in long beach chiropractor.
I want long beach to match the city field exactly, and not return results for seal beach or glass beach.
At the same time chiropractor should also match chiropractic.
Here is the current query I am using:
"query": {
"bool": {
"should": [
{
"multi_match": {
"fields": [
"title",
"location_address_address_1.value",
"location_address_city.value^2",
"location_address_state.value",
"specialty" // e.g. chiropractor
],
"query": "chiropractor long beach",
"boost": 6,
"type": "cross_fields"
}
}
]
}
},
The right approach would be to separate the term that is searched from the location, and store the location as a keyword type (see the sketch after the example below). If that's not possible, then you can use a synonym token filter to store locations as single tokens, but this will require having the list of all possible locations, e.g.:
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"long beach=>long-beach"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
}
}
Now if you call
POST /my_index/_analyze?analyzer=my_synonyms
{
"text": ["chiropractor long beach"]
}
the response is
{
"tokens": [
{
"token": "chiropractor",
"start_offset": 0,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "long-beach",
"start_offset": 13,
"end_offset": 23,
"type": "SYNONYM",
"position": 1
}
]
}
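For the first approach (separate fields), a minimal sketch of a query could look like the following; it assumes your application splits the user input into a specialty part and a city part, and that location_address_city.value is mapped as (or has a sub-field of) type keyword holding the city name exactly as indexed:
{
"query": {
"bool": {
"must": [
{ "match": { "specialty": "chiropractor" } }
],
"filter": [
{ "term": { "location_address_city.value": "long beach" } }
]
}
}
}
With the city as an exact keyword filter, seal beach and glass beach can no longer match, while the match on specialty still goes through analysis (so chiropractor can also match chiropractic if you add a stemmer or synonym to that field).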

elasticsearch multi_match with regexp

I'm trying to rebuild my Elasticsearch query, because I found that I'm not receiving all the documents I am looking for.
So, let's assume that I have document like this:
{
"id": 1234,
"mail_id": 5,
"sender": "john smith",
"email": "johnsmith#gmail.com",
"subject": "somesubject",
"txt": "abcdefgh\r\n",
"html": "<div dir=\"ltr\">abcdefgh</div>\r\n",
"date": "2017-07-020 10:00:00"
}
I have a few million documents like this, and now I am trying to search for some of them with a query like this:
{
"sort": [
{
"date": {
"order": "desc"
}
}
],
"query": {
"bool": {
"minimum_should_match": "100%",
"should": [
{
"multi_match": {
"type": "cross_fields",
"query": "abcdefgh johnsmith john smith",
"operator": "and",
"fields": [
"email.full",
"sender",
"subject",
"txt",
"html"
]
}
}
],
"must": [
{
"ids": {
"values": [
"1234"
]
}
},
{
"term": {
"mail_id": 5
}
}
]
}
}
}
For a query like this everything is fine, but when I want to find the document by querying 'gmail' or 'com', it does not work:
"query": "abcdefgh johnsmith john smith gmail"
"query": "abcdefgh johnsmith john smith com"
It only works when I search for 'gmail.com':
"query": "abcdefgh johnsmith john smith gmail.com"
So... I tried attaching an analyzer:
...
"type": "cross_fields",
"query": "abcdefgh johnsmith john smith",
"operator": "and",
"analyzer": "simple",
...
That does not help at all. The only way I was able to find this document was to define a regexp, e.g.:
"minimum_should_match": 1,
"should": [
{
"multi_match": {
"type": "cross_fields",
"query": "fdsfs wukamil kam wuj gmail.com",
"operator": "and",
"fields": [
"email.full",
"sender",
"subject",
"txt",
"html"
]
}
},
{
"regexp": {
"email.full": ".*gmail.*"
}
}
],
but with this approach I would have to add (queries * fields) regexp objects to my JSON, so I don't think it is the best solution. I also know about wildcard queries, but they would be just as messy as the regexps.
If anyone has had a problem like this and knows the solution, I would be thankful for help :)
If you run your search term through the standard analyser you can see what tokens johnsmith@gmail.com gets broken down into. You can do this directly in your browser using the URL below:
https://<your_site>:<es_port>/_analyze/?analyzer=standard&text=johnsmith@gmail.com
This will show that the email gets broken down into the following tokens:
{
"tokens": [
{
"token": "johnsmith",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "gmail.com",
"start_offset": 10,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 2
}
]
}
So this shows that you can't search using just gmail, but you can using gmail.com. To also split your text on the dot, you can update your mapping to use the Simple Analyzer, whose documentation says:
The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.
We can show this works by updating our URL from earlier to use the simple analyser as below:
https://<your_site>:<es_port>/_analyze/?analyzer=simple&text=johnsmith@gmail.com
Which returns:
{
"tokens": [
{
"token": "johnsmith",
"start_offset": 0,
"end_offset": 9,
"type": "word",
"position": 1
},
{
"token": "gmail",
"start_offset": 10,
"end_offset": 15,
"type": "word",
"position": 2
},
{
"token": "com",
"start_offset": 16,
"end_offset": 19,
"type": "word",
"position": 3
}
]
}
This analyser may not be the right tool for the job, as it discards any non-letter characters, but you can experiment with analysers and tokenisers until you get what you need.
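As a hedged sketch, you could add a simple-analysed sub-field next to your existing email.full sub-field and include it in the multi_match fields; the sub-field name parts is only an illustration, the rest of your email mapping is assumed to stay as it is, and existing documents would need to be reindexed:
"email": {
"type": "text",
"fields": {
"parts": {
"type": "text",
"analyzer": "simple"
}
}
}
Searching email.parts for gmail or com should then match, because those are exactly the tokens the simple analyser produces.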

No results returned for filtered Elasticsearch query

I'm having trouble executing the following request against Elasticsearch v2.2.0. If I remove the filter property (and contents, of course), I get my entity back (only one exists). With the filter clause in place, I just get 0 results, but no error. Same if I remove the email filter and/or the name filter. Am I doing something wrong with this request?
Request
GET http://localhost:9200/my-app/my-entity/_search?pretty=1
{
"query": {
"filtered" : {
"query": {
"match_all": {}
},
"filter": {
"and": [
{
"term": {
"email": "my.email#email.com"
}
},
{
"term": {
"name": "Test1"
}
}
]
}
}
}
}
Existing Entity
{
"email": "my.email#email.com",
"name": "Test1"
}
Mapping
"properties": {
"name": {
"type": "string"
},
"email": {
"type": "string"
},
"term": {
"type": "long"
}
}
Since the email field is analyzed and no custom analyzer is specified, the Standard Analyzer gets applied to it, and the value is split into tokens.
Read about the Standard Tokenizer here.
You can use the command below to see how my.email@email.com gets tokenized:
curl -XGET "http://localhost:9200/_analyze?tokenizer=standard" -d "my.email@email.com"
This will generate the following output:
{
"tokens": [
{
"token": "my.email", ===> Notice this
"start_offset": 0,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "email.com", ===> Notice this
"start_offset": 9,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 2
}
]
}
If you want a full or exact match, you need to make the field not_analyzed. Read about how to create a not_analyzed field here.
{
"email": {
"type": "string",
"index": "not_analyzed"
}
}
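A minimal sketch of the full mapping for ES 2.x, assuming both fields should be matched exactly (the index has to be recreated or the data reindexed for this to apply):
"properties": {
"name": {
"type": "string",
"index": "not_analyzed"
},
"email": {
"type": "string",
"index": "not_analyzed"
}
}
With that mapping in place, the original filtered query with the two term filters should return the entity.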
Hope it is clear

Phonetic search results for integers with Elasticsearch

Forgive me as I am new to Elasticsearch, but I am following the Phonetic start guide found here: Phonetic Matching
I have the following:
POST /app
{
"settings": {
"index": {
"analysis": {
"filter": {
"dbl_metaphone": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "standard",
"filter": "dbl_metaphone"
}
}
}
}
},
"mappings": {
"movie": {
"properties": {
"title": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
},
"year": {
"type": "string",
"fields": {
"phonetic": {
"type": "string",
"analyzer": "dbl_metaphone"
}
}
}
}
}
} }
I add some results by doing:
POST /app/movie
{ "title": "300", "year": 2006"} & { "title":"500 days of summer", "year": "2009" }
I want to query for the movie '300' by entering this query though:
POST /app/movie/_search
{
"query": {
"match": {
"title.phonetic": {
"query": "three hundred"
}
}
}
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
"tokens": [
{
"token": "300",
"start_offset": 0,
"end_offset": 3,
"type": "<NUM>",
"position": 0
}
]
}
I see that only a numeric token is returned, not phonetic tokens like the ones I get for words:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
"tokens": [
{
"token": "0R",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "TR",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "HNTR",
"start_offset": 6,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
}
]
}
Is there something that I am missing with my phonetic query that I am supposed to define to get both the numerical and alphanumeric tokens?
That is not possible. Double Metaphone is a phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This makes it possible to search for terms, such as names, that can be spelt differently but sound the same.
As you can see from the output above, the Double Metaphone algorithm ignores numbers/numeric characters.
You can read more about Double Metaphone here.
A better case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
It shouldn't be too hard to find or write a function that converts integers to English words.
Call your function when constructing the document you ingest into ES.
Alternatively, write it in Groovy and call it as a transform script in your mapping.
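Another option, sketched below under the assumption that you only need a bounded set of numbers: put a synonym filter in front of the phonetic filter so that, for example, "300" is expanded to "three hundred" before encoding. The synonym list and the filter name number_words are purely illustrative and would have to be generated for your data:
"analysis": {
"filter": {
"number_words": {
"type": "synonym",
"synonyms": [
"300 => three hundred",
"500 => five hundred"
]
},
"dbl_metaphone": {
"type": "phonetic",
"encoder": "double_metaphone"
}
},
"analyzer": {
"dbl_metaphone": {
"tokenizer": "standard",
"filter": [
"number_words",
"dbl_metaphone"
]
}
}
}
With this analyzer on title.phonetic, both "300" and "three hundred" should produce the same phonetic tokens at index and search time.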
