How to deal with punctuation in an ElasticSearch field

I have a field in a document stored in Elasticsearch, which I want to be analyzed as a full-text field. In one case, it contains a value for the name field like this:
A&B Corp
I want to be able to search the documents for an auto-complete widget, using a query like this (suppose the user typed A&B into the autocomplete field). The intention is to match documents that contain any terms with the typed prefix.
{ "query": {
"filtered": {
"query": {
"query_string": {
"query": "A&B*",
"fields": [
"firstName",
"lastName",
"name",
"key",
"email"
]
}
},
"filter": {
"terms": {
"environmentId": [
"foo"
]
}
}
}
}
}
My mapping for the name field looks like this:
"name": {
"type": "string"
},
But I get no results. The query structure works for documents that don't have & in the field, so I'm pretty sure that is part of the problem.
I'm not sure how to deal with this, though. I am pretty sure I still want to analyze the field for full-text search.
In addition, if I add a space before the * in the query (i.e., "query": "A&B *"), then I get results including A&B, so I don't think it is just discarding the ampersand and treating the A and B as separate terms.
Should I change my mapping? The query?

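The query_string query has a set of reserved characters that need to be escaped when they should be matched literally; see the reserved characters section of the query_string documentation.
Be careful with the asterisk, though: written as "A&B\\*" it is escaped, so the parser looks for a literal * in your data instead of treating it as the wildcard operator. For a prefix search the * has to stay unescaped.
With "A&B*", the whole of A&B* is a single wildcard term, and wildcard terms are not analyzed. The query therefore looks for one indexed term starting with a&b. With the default (standard) analyzer, A&B Corp is indexed as the separate terms a, b and corp, so no such term exists and you get no results.
When you search for "A&B *", the whitespace splits the query into two clauses, so it is now searching for "A&B" (or) "*". The A&B part is analyzed into a and b, and hence you get a match in this case.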
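For an autocomplete box that passes user input into query_string, the reserved characters can be escaped on the client before the wildcard is appended. The following is only a minimal sketch: the helper names escape_query_string and prefix_query are made up for illustration, and the character list follows the reserved characters section of the query_string documentation, where < and > cannot be escaped and have to be removed. Escaping only keeps the parser from reading the input as syntax; it does not change how the field was tokenized at index time.

```python
import json

# Characters the query_string syntax treats as operators. "<" and ">" are
# handled separately because they cannot be escaped and have to be removed.
RESERVED = set('+-=&|!(){}[]^"~*?:\\/')


def escape_query_string(text):
    """Backslash-escape reserved characters in user-typed text."""
    out = []
    for ch in text:
        if ch in "<>":
            continue  # not escapable, so drop them
        if ch in RESERVED:
            out.append("\\")
        out.append(ch)
    return "".join(out)


def prefix_query(user_input, fields):
    """Build a query_string clause: escaped input plus one unescaped wildcard."""
    return {
        "query_string": {
            "query": escape_query_string(user_input) + "*",
            "fields": fields,
        }
    }


if __name__ == "__main__":
    # Hypothetical usage with two of the fields from the question.
    print(json.dumps(prefix_query("A&B", ["name", "email"]), indent=2))
```

The trailing * is appended after escaping, so it is the only character the parser still sees as a wildcard.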

Related

Elasticsearch match_phrase_prefix query, with exact prefix match

I have a match_phrase_prefix query, which works as expected. But when the user passes any special characters at the end of the keyword, ES ignores these characters and still returns the result.
{ "query": { "match_phrase_prefix": { "content": { "query": searchTerm } } } }
I am using this query to search by prefix. If I pass a term like overflow####!!, ES returns all the results with the word overflow in them. Instead, I want an exact prefix match in which the special characters are not ignored. The search term could also consist of multiple words, e.g. stack overflow search.
How can I make ES do a prefix match without ignoring the special characters?

You can use the keyword analyzer when defining your query.
{
"query": {
"match_phrase_prefix": {
"content": {
"query": "overflow####!!",
"analyzer": "keyword"
}
}
}
}

Elasticsearch wrong explanation validate api

I'm using Elasticsearch 5.2. I'm running the query below against an index that contains only one document.
Query:
GET test/val/_validate/query?pretty&explain=true
{
"query": {
"bool": {
"should": {
"multi_match": {
"query": "alkis stackoverflow",
"fields": [
"name",
"job"
],
"type": "most_fields",
"operator": "AND"
}
}
}
}
}
Document:
PUT test/val/1
{
"name": "alkis stackoverflow",
"job": "developer"
}
The explanation of the query is
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow))) #(#_type:val)
I read this as:
Field job must have alkis and stackoverflow
AND
Field name must have alkis and stackoverflow
This is not the case with my document though. The AND between the two fields is actually OR (as it seems from the result I'm getting)
When I change the type to best_fields I get
+(((+job:alkis +job:stackoverflow) | (+name:alkis +name:stackoverflow))) #(#_type:val)
Which is the correct explanation.
Is there a bug with the validate api? Have I misunderstood something? Isn't the scoring the only difference between these two types?

Since you picked the most_fields type with an explicit AND operator, one match query is generated per field, and all terms must be present in a single field for a document to match. That is your case: both terms, alkis and stackoverflow, are present in the name field, which is why the document matches.
So in the explanation of the corresponding Lucene query, i.e.
+(((+job:alkis +job:stackoverflow) (+name:alkis +name:stackoverflow)))
when no operator is specified between two clauses, the default one is an OR.
So you need to read this as: Field job must have both alkis and stackoverflow OR field name must have both alkis and stackoverflow.
The AND operator that you apply only concerns the terms of your query with regard to a single field; it is not an AND between the fields. Said differently, your query will be executed as two match queries (one per field) in a bool/should clause, like this:
{
"query": {
"bool": {
"should": [
{ "match": { "job": { "query": "alkis stackoverflow", "operator": "and" }}},
{ "match": { "name": { "query": "alkis stackoverflow", "operator": "and" }}}
]
}
}
}
In summary, the most_fields type is most useful when querying multiple fields that contain the same text analyzed in different ways. This is not your case and you'd probably better be using cross_fields or best_fields depending on your use case, but certainly not most_fields.
UPDATE
When using the best_fields type, ES generates a dis_max query instead of a bool/should, and the | sign (which is not an OR!) separates the sub-queries of the dis_max query.
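The field-centric reading of most_fields with an AND operator can be checked with a small simulation. This is only a sketch of the boolean logic (AND within a field, OR across fields), using plain whitespace splitting in place of real analysis and ignoring scoring; the function names are invented for illustration.

```python
def field_has_all_terms(doc, field, terms):
    """True when every query term occurs in the given field (operator AND)."""
    tokens = doc.get(field, "").lower().split()
    return all(term in tokens for term in terms)


def most_fields_and_matches(doc, fields, query):
    """Field-centric matching: AND within a field, OR across fields."""
    terms = query.lower().split()
    return any(field_has_all_terms(doc, f, terms) for f in fields)


# The single document from the question.
doc = {"name": "alkis stackoverflow", "job": "developer"}

# Both terms are in "name", so one should-clause is satisfied.
print(most_fields_and_matches(doc, ["name", "job"], "alkis stackoverflow"))  # True

# The terms are spread over two fields, so no single field has both.
print(most_fields_and_matches(doc, ["name", "job"], "alkis developer"))  # False
```

The second call is the case where a term-centric type such as cross_fields would behave differently, since there the terms may be found in different fields.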

In Elasticsearch match query how to deal with slash

I have a match query searching for a type of doc:
{
"query": {
"bool": {
"should": {
"match": {
"ph1_enc": "EAAQnb1kMr/e2/ADqo"
}
}
}
}
}
"EAAQnb1kMr/e2/ADqo" is the string I'm trying to match; however, in the search results I can see that multiple records containing the substring "/e2/" are also returned.
It looks like "/e2/" is indexed separately, which would explain this. I thought the match query did a full-text match... Is it because I missed something when creating the template? Any idea?
Add-on: instead of reindexing, how can I modify the query to match the exact value in the query?

Which analyzer do you set in the mapping to index your data?
If you are using the default one (the standard analyzer), then according to the documentation it uses the standard tokenizer, which also splits the text on slashes ('/'). The documentation for the standard tokenizer has more information.
So the field is indexed as the separate terms 'EAAQnb1kMr', 'e2' and 'ADqo' (lowercased by the standard analyzer). Your query value is analyzed the same way the field was indexed, which is why documents containing 'e2' are also returned.
If you don't need to tokenize the 'ph1_enc' field, you can just set its type in the mapping as 'keyword'.
"properties": {
"ph1_enc": {
"type": "keyword"
}
}
The field will then not be analyzed, and a query will match only the exact value.
I hope that it helps.
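To illustrate why a shared 'e2' token is enough for a hit, here is a rough simulation. The toy_standard_analyze function is only a stand-in that splits on non-word characters and lowercases; it is not the real standard analyzer, although it should give the same tokens for this input. The second stored value is a made-up example.

```python
import re


def toy_standard_analyze(text):
    """Rough stand-in for the standard analyzer: split on non-word chars, lowercase."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def match_or(field_value, query):
    """A match query with the default OR operator: any shared token is a hit."""
    return bool(set(toy_standard_analyze(field_value)) & set(toy_standard_analyze(query)))


def keyword_match(field_value, query):
    """A keyword field is not analyzed, so only the exact value matches."""
    return field_value == query


query = "EAAQnb1kMr/e2/ADqo"
other = "ZZZ9/e2/abc"  # hypothetical value of another document

print(toy_standard_analyze(query))   # ['eaaqnb1kmr', 'e2', 'adqo']
print(match_or(other, query))        # True, because both contain the token 'e2'
print(keyword_match(other, query))   # False
print(keyword_match(query, query))   # True
```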

Elasticsearch find missing word in phrase

How can I use Elasticsearch to find the missing word in a phrase? For example, I want to find all documents which contain the pattern make * great again. I tried using a wildcard query, but it returned no results:
{
"fields": [
"file_name",
"mime_type",
"id",
"sha1",
"added_at",
"content.title",
"content.keywords",
"content.author"
],
"highlight": {
"encoder": "html",
"fields": {
"content.content": {
"number_of_fragments": 5
}
},
"order": "score",
"tags_schema": "styled"
},
"query": {
"wildcard": {
"content.content": "make * great again"
}
}
}
If I put in a word and use a match_phrase query, I get results, so I know I have data which matches the pattern.
Which type of query should I use? Or do I need to add some type of custom analyzer to the field?

Wildcard queries operate on terms, so if you use one on an analyzed field, it will actually try to match every term in that field separately. In your case, you can create a not_analyzed sub-field (such as content.content.raw) and run the wildcard query on that. Or just map the actual field to not be analyzed, if you don't need to query it in other ways.
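The difference between matching per term and matching the raw value can be sketched with Python's fnmatch, which is used here only as an approximation of wildcard matching (it also supports [...] sets, which the wildcard query does not). The sample text and the toy analyzer are assumptions for illustration. Note that on the raw value the pattern has to cover the whole string, so it needs a leading and trailing *.

```python
import re
from fnmatch import fnmatchcase


def toy_analyze(text):
    """Rough stand-in for an analyzed field: one lowercase term per word."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def wildcard_on_terms(text, pattern):
    """Analyzed field: the pattern is tested against each term on its own."""
    return any(fnmatchcase(term, pattern) for term in toy_analyze(text))


def wildcard_on_raw(text, pattern):
    """not_analyzed field: the whole value is a single term."""
    return fnmatchcase(text, pattern)


text = "we will make america great again"  # hypothetical field content

# No single term contains spaces, so the phrase pattern can never match.
print(wildcard_on_terms(text, "make * great again"))    # False

# On the raw value the pattern must match the entire string.
print(wildcard_on_raw(text, "make * great again"))      # False
print(wildcard_on_raw(text, "*make * great again*"))    # True
```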

Elasticsearch wildcard query not honoring the analyzer of the field

I have a field named "tag" which is analyzed (the default behavior) in Elasticsearch. The "tag" field can hold a single word or a comma-separated string of multiple tags, for example "Festive, Fast, Feast".
Before indexing, I convert each tag to lowercase (to ignore case), so the tag "Festive" is indexed as "festive".
If I search using a match query in all caps, as shown below, I get results (as expected).
{
"query": {
"match": {
"tag": "FESTIVE"
}
}
}
But if I do a wildcard query as mentioned below I don't get results :(
{
"query": {
"wildcard": {
"tag": {
"value": "F*"
}
}
}
}
If I change the value in the wildcard search to "f*" instead of "F*", then I get results.
Does anyone have a clue why the wildcard query is case sensitive?

Wildcard queries fall under term-level queries and hence are not analyzed. From the docs:
Matches documents that have fields matching a wildcard expression (not
analyzed)
You will get the expected results with a query_string query, which lowercases wildcard terms because lowercase_expanded_terms is true by default. Try this:
GET your_index/_search
{
"query": {
"query_string": {
"default_field": "tag",
"query": "F*"
}
}
}
Hope this helps!
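If you would rather keep the wildcard query, another option is to lowercase the pattern on the client before sending it, since the value is not analyzed and the question already shows that "f*" matches. A minimal sketch, assuming the indexed terms are all lowercase as described in the question; wildcard_query is a hypothetical helper name.

```python
import json


def wildcard_query(field, value):
    """Build a wildcard query body, lowercasing the pattern on the client."""
    return {"query": {"wildcard": {field: {"value": value.lower()}}}}


# "F*" is sent as "f*", which lines up with the lowercased terms in the index.
print(json.dumps(wildcard_query("tag", "F*"), indent=2))
```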
