Elasticsearch term vs match

I have to write a search query with two conditions:
timestamp
directory
When I use match in the search query, as below:
{
"query":{
"bool":{
"must":{
"match":{
"directory":"/user/ayush/test/error/"
}
},
"filter":{
"range":{
"#timestamp":{
"gte":"2020-08-25 01:00:00",
"lte":"2020-08-25 01:30:00",
"format":"yyyy-MM-dd HH:mm:ss"
}
}
}
}
}
}
In the filtered results I get records with these directory values:
/user/ayush/test/error/
/user/hive/
/user/
But when I use term, as below:
{
"query":{
"bool":{
"must":{
"term":{
"directory":"/user/ayush/test/error/"
}
},
"filter":{
"range":{
"#timestamp":{
"gte":"2020-08-25 01:00:00",
"lte":"2020-08-25 01:30:00",
"format":"yyyy-MM-dd HH:mm:ss"
}
}
}
}
}
}
I do not get any results, not even the record whose directory value is /user/ayush/test/error/.

The match query analyzes the input string and constructs more basic
queries from the resulting tokens.
The term query matches exact terms and does not analyze its input.
Refer to these for more detail:
SO question on Term vs Match query
https://discuss.elastic.co/t/term-query-vs-match-query/14455
elasticsearch match vs term query
With the standard analyzer, the field value /user/ayush/test/error/ is analyzed as follows:
POST /_analyze
{
"analyzer" : "standard",
"text" : "/user/ayush/test/error/"
}
The tokens generated are:
{
"tokens": [
{
"token": "user",
"start_offset": 1,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "ayush",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "test",
"start_offset": 12,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "error",
"start_offset": 17,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 3
}
]
}
Index data:
{ "directory":"/user/ayush/test/error/" }
{ "directory":"/user/ayush/" }
{ "directory":"/user" }
Search Query using Term query:
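The term query does not analyze the search term, so it looks only for that exact term in the inverted index. Because the text field directory was analyzed into the tokens user, ayush, test and error, the full path is not a term in its index. To search for the exact value, query directory.keyword (assuming the index uses the default dynamic mapping, which adds a keyword sub-field to string fields) or change the mapping of the field to keyword.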
{
"query": {
"term": {
"directory.keyword": {
"value": "/user/ayush/test/error/",
"boost": 1.0
}
}
}
}
Search Result for Term query:
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": 0.9808291,
"_source": {
"directory": "/user/ayush/test/error/"
}
}
]
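To see why the two queries behave differently, here is a minimal Python sketch of the lookup. It is only a model, not how Elasticsearch is implemented: simple_analyze is a hypothetical stand-in that roughly approximates the standard analyzer, and the two dictionaries model the inverted index of a text field and of a keyword field for the three sample documents above.

```python
import re
from collections import defaultdict


def simple_analyze(text):
    """Rough stand-in for the standard analyzer (assumption):
    split on non-alphanumeric characters and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


docs = {
    1: "/user/ayush/test/error/",
    2: "/user/ayush/",
    3: "/user",
}

# Model of an analyzed text field: token -> ids of documents containing it.
text_index = defaultdict(set)
for doc_id, value in docs.items():
    for token in simple_analyze(value):
        text_index[token].add(doc_id)

# Model of a keyword field: the whole value is stored as a single term.
keyword_index = defaultdict(set)
for doc_id, value in docs.items():
    keyword_index[value].add(doc_id)


def term_query(index, value):
    """Look up the value as-is, without analyzing it."""
    return set(index.get(value, set()))


def match_query(index, text):
    """Analyze the text, then return documents containing any token."""
    hits = set()
    for token in simple_analyze(text):
        hits |= index.get(token, set())
    return hits


query = "/user/ayush/test/error/"
print(sorted(term_query(text_index, query)))     # [] : the full path is not a token
print(sorted(match_query(text_index, query)))    # [1, 2, 3] : every doc has "user"
print(sorted(term_query(keyword_index, query)))  # [1] : exact value on keyword
```

The last line corresponds to the term query on directory.keyword: only the document whose whole value equals the search string is returned.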

Related

Elasticsearch: how to return the document with the exact word searched and not all documents that contain that word in a sentence?

I have a field (type text) named 'description'.
I have 3 documents:
doc1 description = "test"
doc2 description = "test dsc"
doc3 description = "2021 test desc"
CASE 1: if I search "test" I want only doc1
CASE 2: if I search "test dsc" I want only doc2
CASE 3: if I search "2021 test desc" I want only doc3
But currently only CASE 3 works.
For example, CASE 1 does not work. If I run this query I get all 3 documents:
GET /myindex/_search
{
"query": {
"match" : {
"Description" : "test"
}
}
}
thanks
You are getting all three documents in your search because, by default, Elasticsearch uses the standard analyzer for text fields. It tokenizes "2021 test desc" into:
{
"tokens": [
{
"token": "2021",
"start_offset": 0,
"end_offset": 4,
"type": "<NUM>",
"position": 0
},
{
"token": "test",
"start_offset": 5,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "desc",
"start_offset": 10,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 2
}
]
}
Therefore, a match query returns every document that contains any of the query's tokens, and the token "test" appears in all three documents.
If you want to search for the exact value, you need to update your index mapping.
You can index the same field in multiple ways by using multi-fields:
PUT /myindex/_mapping
{
"properties": {
"description": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
Then reindex the data. After this, you can query "description" as a text field and "description.raw" as a keyword field.
Search Query:
{
"query": {
"match": {
"description.raw": "test dsc"
}
}
}
Search Result:
"hits": [
{
"_index": "67777521",
"_type": "_doc",
"_id": "2",
"_score": 0.9808291,
"_source": {
"description": "test dsc"
}
}
]

elasticsearch match query in array

I have the following query with terms, which works fine.
{
"query": {
"terms": {
"130": [
"jon#domain.com",
"mat#domain.com"
]
}
}
}
Found 2 docs.
But now I would like to build a similar query with match (I want to find all users in a domain). I tried the following query, without any results:
{
"query": {
"match": {
"130": {
"query":"#domain.com"
}
}
}
}
Found 0 docs. Why?
Field 130 has the following mapping:
"130":{"type":"text","analyzer":"whitespace","fielddata":true}
If you are using the whitespace analyzer, the token generated for jon#domain.com will be:
{
"tokens": [
{
"token": "jon#domain.com",
"start_offset": 0,
"end_offset": 14,
"type": "word",
"position": 0
}
]
}
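So the terms query matches the above token, because it returns documents that contain one or more exact terms in the provided field. The match query gives 0 results because its input #domain.com is analyzed with the same whitespace analyzer into the single token #domain.com, which is not equal to the indexed token jon#domain.com.
Instead, you should use the standard analyzer (the default), which generates the following tokens: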
{
"tokens": [
{
"token": "jon",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "domain.com",
"start_offset": 4,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
}
]
}
You can also look at the uax_url_email tokenizer, which is like the standard tokenizer except that it recognizes URLs and email addresses as single tokens.
Here is a working example with index mapping, index data, search query, and search result.
Index Mapping:
{
"mappings": {
"properties": {
"130": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
Index Data:
{
"130":"jon#domain.com"
}
Search Query:
{
"query": {
"match": {
"130": {
"query": "#domain.com"
}
}
}
}
Search Result:
"hits": [
{
"_index": "65121147",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"130": "jon#domain.com"
}
}
]
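The difference between the two analyzers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace_analyze splits on whitespace like the whitespace analyzer, while approx_standard_analyze is a hypothetical regex approximation of the standard analyzer (the real tokenizer follows Unicode text segmentation rules and handles many more cases). match_hits is a helper invented for this example.

```python
import re


def whitespace_analyze(text):
    """Split on whitespace only; no lowercasing, no punctuation removal."""
    return text.split()


def approx_standard_analyze(text):
    """Crude approximation (assumption): keep runs of letters/digits,
    allow dots inside a run (so "domain.com" stays whole), lowercase."""
    pattern = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"
    return [t.lower() for t in re.findall(pattern, text)]


def match_hits(analyze, indexed_value, query_text):
    """Return the query tokens that also occur among the indexed tokens,
    using the same analyzer at index time and at search time."""
    indexed_tokens = set(analyze(indexed_value))
    return [t for t in analyze(query_text) if t in indexed_tokens]


indexed = "jon#domain.com"
query = "#domain.com"

print(whitespace_analyze(indexed))       # ['jon#domain.com']
print(approx_standard_analyze(indexed))  # ['jon', 'domain.com']

print(match_hits(whitespace_analyze, indexed, query))       # []
print(match_hits(approx_standard_analyze, indexed, query))  # ['domain.com']
```

An empty list means the match query finds nothing for that document; a non-empty list means at least one query token is present in the index, so the document is returned.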

Elasticsearch match vs. term in filter

I don't see any difference between term and match in filter:
POST /admin/_search
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber": "j1knd"
}
}
]
}
}
}
And the result also contains part numbers that do not match exactly, e.g. "52527.J1KND-H".
Why?
Term queries are not analyzed: whatever you send is used as-is to match tokens in the inverted index. Match queries are analyzed: the analyzer that was applied to the field at index time is also applied to the query text, and the resulting tokens are matched against the documents.
Read more about term query and match query. As mentioned in the match query:
Returns documents that match a provided text, number, date or boolean
value. The provided text is analyzed before matching.
You can also use the analyze API to see the tokens generated for a particular field.
Tokens generated by the standard analyzer for the text 52527.J1KND-H:
POST /_analyze
{
"text": "52527.J1KND-H",
"analyzer" : "standard"
}
{
"tokens": [
{
"token": "52527",
"start_offset": 0,
"end_offset": 5,
"type": "<NUM>",
"position": 0
},
{
"token": "j1knd",
"start_offset": 6,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "h",
"start_offset": 12,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 2
}
]
}
The tokens above explain why you also get part numbers that do not match exactly, e.g. "52527.J1KND-H": the analyzed text field contains the token j1knd. Here is your example, reworked so that it returns only the exact match.
Index mapping (note the raw sub-field of type keyword):
{
"mappings": {
"properties": {
"partnumber": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
Index docs
{
"partnumber" : "j1knd"
}
{
"partnumber" : "52527.J1KND-H"
}
Search query to return only the exact match (note `.raw` in the field name):
{
"query": {
"bool": {
"filter": [
{
"term": {
"partnumber.raw": "j1knd"
}
}
]
}
}
}
Result:
"hits": [
{
"_index": "so_match_term",
"_type": "_doc",
"_id": "2",
"_score": 0.0,
"_source": {
"partnumber": "j1knd"
}
}
]

Wildcard query only returns results if query is exactly "*"

I'm using Elasticsearch 6.7.0, and I'm trying to make a wildcard query, say to select documents where the field datafile_url ends with .RLF.
To start with a simple query, I just use the wildcard * to query for any value:
GET data/_search
{
"query": {
"wildcard": {
"datafile_url": "*"
}
}
}
This returns documents, such as this one:
{
"_index" : "data",
"_type" : "doc",
"_id" : "1HzJaWoBVj7X61Ih767N",
"_score" : 1.0,
"_source" : {
"datafile_url" : "/uploads/data/1/MSN001.RLF",
...
}
},
Ok, great. But when I change the wildcard query to *.RLF, I get no results.
Short answer: Elasticsearch applies the standard analyzer to a text field when no analyzer is explicitly specified for it.
If you run the wildcard search on the keyword sub-field, it works and returns the expected result:
GET data/_search
{
"query": {
"wildcard": {
"datafile_url.keyword": "*.RLF"
}
}
}
Now, some background on why it doesn't work without .keyword.
Take a look at this example and try running it on your own index:
POST data/_analyze
{
"field": "datafile_url",
"text" : "/uploads/data/1/MSN001.RLF"
}
#Result
{
"tokens": [
{
"token": "uploads",
"start_offset": 1,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "data",
"start_offset": 9,
"end_offset": 13,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "1",
"start_offset": 14,
"end_offset": 15,
"type": "<NUM>",
"position": 2
},
{
"token": "msn001",
"start_offset": 16,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "rlf",
"start_offset": 23,
"end_offset": 26,
"type": "<ALPHANUM>",
"position": 4
}
]
}
Notice that the special characters are missing from the inverted index and that the tokens are lowercased. A wildcard query on this field is matched against each of the above tokens individually, not against the original string. For example:
#this will work
GET data/_search
{
"query": {
"wildcard": {
"datafile_url": "*rlf"
}
}
}
#this will NOT work because the terms in the inverted index are lowercase and the match is case-sensitive.
GET data/_search
{
"query": {
"wildcard": {
"datafile_url": "*RLF"
}
}
}
You would need to write a custom analyzer if you want to preserve those special characters.
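The behaviour above can be reproduced with a small Python sketch. It is a simplified model, not Elasticsearch itself: simple_analyze is a hypothetical approximation of the standard analyzer, and wildcard_matches uses Python's fnmatchcase as a stand-in for the wildcard query (both support * and ?, but fnmatchcase also gives [...] a special meaning, which the patterns here do not use).

```python
import re
from fnmatch import fnmatchcase


def simple_analyze(text):
    """Rough stand-in for the standard analyzer (assumption):
    split on non-alphanumeric characters and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def wildcard_matches(terms, pattern):
    """Return the indexed terms that the pattern matches.
    The whole term must match and the comparison is case-sensitive."""
    return [t for t in terms if fnmatchcase(t, pattern)]


value = "/uploads/data/1/MSN001.RLF"
text_terms = simple_analyze(value)  # terms of the analyzed text field
keyword_terms = [value]             # single term of the .keyword sub-field

print(text_terms)  # ['uploads', 'data', '1', 'msn001', 'rlf']

print(wildcard_matches(text_terms, "*.RLF"))     # [] : no token contains ".RLF"
print(wildcard_matches(text_terms, "*rlf"))      # ['rlf']
print(wildcard_matches(text_terms, "*RLF"))      # [] : tokens are lowercase
print(wildcard_matches(keyword_terms, "*.RLF"))  # ['/uploads/data/1/MSN001.RLF']
```

The last call models the query on datafile_url.keyword, where the original string is kept as one term, so *.RLF matches it.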

Elasticsearch: FVH highlights with multiple pre and post tags marking tokens incorrectly?

I'm querying my index using boolean query with two match terms. For each term I have a separate set of pre- and post- tags. Using highlights I would like to obtain the documents in which both terms exist and see which tokens were matched as each of them. The index contains documents in Polish analyzed using morfologik. Let's call the two terms I'm searching for aspect and feature. I want to query the index and retrieve the documents in which both a specific aspect and feature exist and I want the highlight feature to mark the aspect token with <aspect> tag and the feature with <feature> tag. Most of the time it works as expected, sometimes, though, Elasticsearch is marking one or both of the tokens incorrectly. I'll give you an example.
So let's say my index contains the following document:
"Najlepsza maseczka na zniszczone włosy!"
If I search for "maseczka" (aspect) and "dobry" (feature) I expect the output to be like this:
"<feature>Najlepsza</feature> <aspect>maseczka</aspect> na zniszczone włosy! "
For some reason the results from Elasticsearch are like this:
"<aspect>Najlepsza</aspect> <aspect>maseczka</aspect> na zniszczone włosy! "
What I know so far:
I thought maybe the aspect and feature have similar form when analyzed, but it's not the case, for example _analyze for the above example returns:
#query
GET my_index/_analyze
{
"analyzer": "morfologik",
"text": "dobra maseczka"
}
#results
{
"tokens": [
{
"token": "dobra",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dobro",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "dobry",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "maseczka",
"start_offset": 6,
"end_offset": 14,
"type": "<ALPHANUM>",
"position": 1
}
]
}
# Analysis of the document:
GET my_index/_analyze
{
"analyzer": "morfologik",
"text": "Najlepsza maseczka na zniszczone włosy"
}
# response
{
"tokens": [
{
"token": "dobry",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "maseczka",
"start_offset": 10,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "na",
"start_offset": 19,
"end_offset": 21,
"type": "<ALPHANUM>",
"position": 2
},
...
]
}
It's also not a problem with a specific aspect or feature, because for some queries the index returns both correctly and incorrectly highlighted documents (so I'd expect it to be a problem with the documents rather than the queries).
In some cases both terms are highlighted as aspects; in others the aspect is marked as feature and the feature as aspect. I haven't found any rule so far.
I thought that if my search terms match the order of the highlight tags, the first term should always get the first tag and the second term the second tag, but maybe they work in a different way? I assumed that's how it works based on this response:
Using the Fast Vector Highlighter, you can specify tags in order of "importance" which seems to mean that their order and the order of your search terms should match.
Here's how my index is constructed:
{
"settings": {
"analysis": {
"analyzer": {
"morfologik": {
"tokenizer": "standard",
"filter": [
"morfologik_stem",
"lowercase"
],
"type": "custom"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"term_vector": "with_positions_offsets",
"analyzer": "morfologik"
},
"original_doc": {
"type": "integer"
}
}
}
}
}
Here's my query:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{ "match" : { "content" : "maseczki" } },
{ "match" : { "content" : "dobre" } }
]
}},
"highlight": {
"fields": {
"content": {
"fragment_size": 200,
"type": "fvh",
"pre_tags": ["<aspect>", "<feature>"],
"post_tags": ["</aspect>", "</feature>"]
}
}
}
}
And here's a sample response:
{
"_index": "my_index",
"_type": "doc",
"_id": "R91v7GkB0hUBqPARgC54",
"_score": 16.864662,
"_source": {
"content": "Najlepsza maseczka na zniszczone włosy! ",
"original_doc_id": 74290
},
"highlight": {
"content": [
"<aspect>Najlepsza</aspect> <aspect>maseczka</aspect> na zniszczone włosy! "
]
}
},
As I said, most of the time the query works fine, and sometimes the all-aspect highlighting occurs only for a subset of a specific query's results, as in the case of "(opakowanie, solidne)":
Here the aspect is in fact the feature and the feature is the aspect:
<aspect>solidne</aspect>, naprawdę świetne <feature>opakowanie</feature>
Here solidne should be marked as feature:
Jedyne do czego mogłabym się przyczepić to <aspect>opakowanie</aspect> które wg mnie niestety nie jest <aspect>solidne</aspect>
As I understand it, if you want to run a match query on a space-separated string, you should use the whitespace tokenizer.
I would suggest checking this tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/5.4/analysis-whitespace-tokenizer.html
