How to search a document containing a substring - Elasticsearch

I have the following document with this (partial) mapping:
"message": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
I'm trying to query for documents containing "success":"0" using the following DSL query:
{
  "query": {
    "bool": {
      "must": {
        "regexp": {
          "message": ".*\"success\".*0.*"
        }
      }
    }
  }
}
but I don't get any results, whereas if I run the following DSL:
{
  "query": {
    "bool": {
      "must": {
        "regexp": {
          "message": ".*\"success\""
        }
      }
    }
  }
}
I do get some documents back! E.g.:
{"data":"[{\"appVersion\":\"1.1.1\",\"installationId\":\"any-ubst-id\",\"platform\":\"aaa\",\"brand\":\"Dalvik\",\"screenSize\":\"xhdpi\"}]","executionTime":"0","flags":"0","method":"aaa","service":"myService","success":"0","type":"aservice","version":"1"}
What's wrong with my query?

The text field message uses the standard analyzer, which breaks the input string into tokens. If we analyze the string "success":"0" with the standard analyzer, we get the tokens shown below.
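You can reproduce this with the _analyze API (request-body syntax shown for recent Elasticsearch versions; older ones used URL parameters):
POST /_analyze
{
  "analyzer": "standard",
  "text": "\"success\":\"0\""
}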
{
  "tokens": [
    {
      "token": "success",
      "start_offset": 2,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "0",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<NUM>",
      "position": 1
    }
  ]
}
So you can see that the colon, double quotes, etc. are removed. And since a regexp query is applied to each token separately, it will not match your pattern.
But if we use message.keyword, which has the keyword field type, the value is not analyzed and is kept as a single token:
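Again, this can be verified with the _analyze API, this time targeting the field so that its own analysis chain is used (my-index is a placeholder for your index name):
POST /my-index/_analyze
{
  "field": "message.keyword",
  "text": "\"success\":\"0\""
}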
{
  "tokens": [
    {
      "token": """ "success":"0" """,
      "start_offset": 0,
      "end_offset": 15,
      "type": "word",
      "position": 0
    }
  ]
}
So if we use the query below, it should work:
{
  "query": {
    "regexp": {
      "message.keyword": """.*"success".*0.*"""
    }
  }
}
But there is another catch: the message.keyword field is set to "ignore_above": 256, so this field ignores any string longer than 256 characters.
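If your messages can exceed that, one option is to raise the limit; a minimal sketch, assuming the index is called my-index and 8192 is an acceptable limit (ignore_above can be updated in place, but documents indexed before the change must be reindexed to get a keyword token):
PUT /my-index/_mapping
{
  "properties": {
    "message": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 8192
        }
      }
    }
  }
}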

Related

elasticsearch match query in array

I have the following terms query, which works fine:
{
  "query": {
    "terms": {
      "130": [
        "jon#domain.com",
        "mat#domain.com"
      ]
    }
  }
}
Found 2 docs.
But now I would like to build a similar query with match (I want to find all users in a domain). I've tried the following query, without any result:
{
  "query": {
    "match": {
      "130": {
        "query": "#domain.com"
      }
    }
  }
}
Found 0 docs. Why??
Field 130 has the following mapping:
"130":{"type":"text","analyzer":"whitespace","fielddata":true}
If you are using the whitespace analyzer, the generated token will be:
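This is easy to check with the _analyze API:
POST /_analyze
{
  "analyzer": "whitespace",
  "text": "jon#domain.com"
}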
{
  "tokens": [
    {
      "token": "jon#domain.com",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
So the terms query matches the above token, since it returns documents that contain one or more exact terms in a provided field. The match query, however, analyzes its input with the same whitespace analyzer, producing the single token #domain.com, which does not equal the indexed token jon#domain.com, hence 0 results.
Instead, you should use the standard analyzer (which is the default one), which will generate the following tokens:
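Again, reproducible with the _analyze API:
POST /_analyze
{
  "analyzer": "standard",
  "text": "jon#domain.com"
}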
{
  "tokens": [
    {
      "token": "jon",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "domain.com",
      "start_offset": 4,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
You can even use the uax_url_email tokenizer, which is like the standard tokenizer except that it recognizes URLs and email addresses as single tokens.
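A minimal sketch of an index using it (the index and analyzer names are placeholders; note that a well-formed address such as jon@domain.com then survives as a single <EMAIL> token):
PUT /emails
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email"
        }
      }
    }
  }
}

POST /emails/_analyze
{
  "analyzer": "email_analyzer",
  "text": "jon@domain.com"
}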
Adding a working example with index data, mapping, search query, and search result
Index Mapping:
{
  "mappings": {
    "properties": {
      "130": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
Index Data:
{
  "130": "jon#domain.com"
}
Search Query:
{
  "query": {
    "match": {
      "130": {
        "query": "#domain.com"
      }
    }
  }
}
Search Result:
"hits": [
{
"_index": "65121147",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"130": "jon#domain.com"
}
}
]

Elasticsearch match vs. term in filter

I don't see any difference between term and match in filter:
POST /admin/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "partnumber": "j1knd"
          }
        }
      ]
    }
  }
}
And the result also contains partnumbers that are not exact matches, e.g. "52527.J1KND-H".
Why?
Term queries are not analyzed: whatever you send is used as-is to match the tokens in the inverted index. Match queries, on the other hand, are analyzed: the same analyzer that was applied to the field at index time is applied to the query text, and the resulting tokens are matched against the documents.
Read more about the term query and the match query. As the match query docs put it:
Returns documents that match a provided text, number, date or boolean
value. The provided text is analyzed before matching.
You can also use the analyze API to see the tokens generated for a particular field.
Tokens generated by the standard analyzer for the text 52527.J1KND-H:
POST /_analyze
{
  "text": "52527.J1KND-H",
  "analyzer": "standard"
}
{
  "tokens": [
    {
      "token": "52527",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "j1knd",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "h",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
The above explains why you are also getting partnumbers that don't match exactly, e.g. "52527.J1KND-H". Now let me take your example and show how you can make it work.
Index mapping
{
  "mappings": {
    "properties": {
      "partnumber": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword" --> note this
          }
        }
      }
    }
  }
}
Index docs
{
  "partnumber": "j1knd"
}
{
  "partnumber": "52527.J1KND-H"
}
Search query to return only the exact match
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "partnumber.raw": "j1knd" --> note `.raw` in field
          }
        }
      ]
    }
  }
}
Result
"hits": [
{
"_index": "so_match_term",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"partnumber": "j1knd"
}
}
]
}

Elasticsearch - Stop analyzer doesn't allow numbers

I'm trying to build a search utility using Elasticsearch 6.3.0 where any term can be searched within the database. I have applied the stop analyzer to exclude some generic words. However, after adding that analyzer, the system stopped matching terms with numbers as well.
For example, if I search for news24, it removes the 24 and searches only for the term "news" in all records. Unsure why.
Below is the query I am using:
{
  "from": 0,
  "size": 10,
  "explain": false,
  "stored_fields": [
    "_source"
  ],
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "news24",
          "analyzer": "stop",
          "fields": [
            "title",
            "keywords",
            "url"
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "( (doc['isSponsered'].value == 'y') ? 100 : 0 )"
          }
        },
        {
          "script_score": {
            "script": "doc['linksCount'].value"
          }
        }
      ],
      "score_mode": "sum",
      "boost_mode": "sum"
    }
  },
  "script_fields": {
    "custom_score": {
      "script": {
        "lang": "painless",
        "source": "params._source.linksArray"
      }
    }
  },
"highlight": {
"pre_tags": [
""
],
"post_tags": [
"<\/span>"
],
"fields": {
"title": {
"type": "plain"
},
"keywords": {
"type": "plain"
},
"description": {
"type": "plain"
},
"url": {
"type": "plain"
}
}
}
}
That is because the stop analyzer is just an extension of the simple analyzer, which makes use of the lowercase tokenizer: it breaks the input into tokens whenever it encounters a character that is not a letter (and of course lowercases all the terms).
So basically, if you have something like news24, it breaks it into news as soon as it encounters the digit 2.
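You can see this directly with the _analyze API:
POST /_analyze
{
  "analyzer": "stop",
  "text": "news24"
}
This returns the single token news; the 24 is already gone at tokenization time.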
This is the default behaviour of the stop analyzer. If you intend to make use of stop words and still want to keep numbers in the picture, you need to configure your own analyzer, as shown below:
Index settings:
POST sometestindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
What this does is make use of the standard analyzer, which internally uses the standard tokenizer, while also removing the English stop words.
Analysis Query To Test
POST sometestindex/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "the name of the channel is news24"
}
Query Result
{
  "tokens": [
    {
      "token": "name",
      "start_offset": 4,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "channel",
      "start_offset": 16,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "news24",
      "start_offset": 27,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 6
    }
  ]
}
You can see in the above tokens that news24 is preserved as a token, while the stop words the, of and is are removed.
Hope it helps!

No results returned for filtered Elasticsearch query

I'm having trouble executing the following request against Elasticsearch v2.2.0. If I remove the filter property (and contents, of course), I get my entity back (only one exists). With the filter clause in place, I just get 0 results, but no error. Same if I remove the email filter and/or the name filter. Am I doing something wrong with this request?
Request
GET http://localhost:9200/my-app/my-entity/_search?pretty=1
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "and": [
          {
            "term": {
              "email": "my.email#email.com"
            }
          },
          {
            "term": {
              "name": "Test1"
            }
          }
        ]
      }
    }
  }
}
Existing Entity
{
  "email": "my.email#email.com",
  "name": "Test1"
}
Mapping
"properties": {
"name": {
"type": "string"
},
"email": {
"type": "string"
},
"term": {
"type": "long"
}
}
Since the email field is analyzed and no custom analyzer is configured, the standard analyzer is applied and the value is split into tokens.
Read about the Standard Tokenizer here.
You can use the command below to see how my.email#email.com gets tokenized:
curl -XGET "http://localhost:9200/_analyze?tokenizer=standard" -d "my.email#email.com"
This generates the following output:
{
  "tokens": [
    {
      "token": "my.email", ===> Notice this
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "email.com", ===> Notice this
      "start_offset": 9,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
If you want exact search on the whole value, you need to make the field not_analyzed. Read about how to create a not_analyzed field here:
{
  "email": {
    "type": "string",
    "index": "not_analyzed"
  }
}
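After reindexing your entity with this mapping, the term filter compares against the whole, untouched value, so a query like the original should find it. A sketch using the same 2.x syntax as the question (filtering on email only):
GET http://localhost:9200/my-app/my-entity/_search?pretty=1
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "email": "my.email#email.com"
        }
      }
    }
  }
}
Note that the same reasoning applies to the name field: the standard analyzer lowercases Test1 to test1, so a term filter on "Test1" will also fail unless that field is made not_analyzed too.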
Hope it is clear

Phonetic search results for integers with Elasticsearch

Forgive me as I am new to Elasticsearch, but I am following the Phonetic start guide found here: Phonetic Matching
I have the following
POST /app
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "dbl_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone"
          }
        },
        "analyzer": {
          "dbl_metaphone": {
            "tokenizer": "standard",
            "filter": "dbl_metaphone"
          }
        }
      }
    }
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        },
        "year": {
          "type": "string",
          "fields": {
            "phonetic": {
              "type": "string",
              "analyzer": "dbl_metaphone"
            }
          }
        }
      }
    }
  }
}
I add some results by doing:
POST /app/movie
{ "title": "300", "year": 2006"} & { "title":"500 days of summer", "year": "2009" }
I want to query for the movie '300' by entering this query though:
POST /app/movie/_search
{
  "query": {
    "match": {
      "title.phonetic": {
        "query": "three hundred"
      }
    }
  }
}
but I get no results. If I change my query to "300", though, it works just fine.
If I do:
GET /app/_analyze?analyzer=dbl_metaphone&text=300
{
  "tokens": [
    {
      "token": "300",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
I see that only a number token is returned, not a phonetic encoding like the one I get for the spelled-out words:
GET /app/_analyze?analyzer=dbl_metaphone&text=three hundred
{
  "tokens": [
    {
      "token": "0R",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "TR",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "HNTR",
      "start_offset": 6,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Is there something that I am missing with my phonetic query that I am supposed to define to get both the numerical and alphanumeric tokens?
That is not possible. Double Metaphone is a phonetic encoding algorithm.
Simply put, it tries to encode similarly pronounced words to the same key.
This facilitates searching for terms, like names, that can be spelt differently but sound the same.
As you can see from the algorithm, double metaphone ignores numbers/numeric characters.
You can read more about double metaphone here.
A better case for phonetic matching is finding "Judy Steinheiser" when the search query is [Jodi Stynehaser].
If you need to be able to search numbers using English, then you'll need to create some synonyms or alternate text at index-time, so that both "300" and "three hundred" are stored in Elasticsearch.
Shouldn't be too hard to find/write a function that converts integers to English.
Call your function when constructing your document to ingest into ES.
Alternatively, write it in Groovy and call it as a Transform script in your mapping.
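To sketch the synonym route (all names here are illustrative, and every number you care about must be enumerated or generated up front; for search-time use, the synonym_graph filter handles multi-word entries better):
PUT /movies
{
  "settings": {
    "analysis": {
      "filter": {
        "number_synonyms": {
          "type": "synonym",
          "synonyms": [
            "300, three hundred",
            "500, five hundred"
          ]
        }
      },
      "analyzer": {
        "number_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "number_synonyms"
          ]
        }
      }
    }
  }
}
Apply number_synonym_analyzer to the title field (or a subfield of it), and a search for "three hundred" can then match a document whose title is "300".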
