Elastic Search: Matching sub token default operator

Is there a way to set the default operator for sub-tokens (tokens generated through the analyzer)? It currently seems to default to OR, and setting the operator does not work.
I'm using the validate API to see how Elasticsearch is understanding the query:
/myIndex/mapping/_validate/query?explain=true
{
  "query": {
    "multi_match": {
      "type": "phrase_prefix",
      "query": "test123",
      "fields": [
        "message"
      ],
      "lenient": true,
      "analyzer": "myAnalyzer"
    }
  }
}
Which returns
+(message:test123 message:test message:123)
What I want is
+message:test123 +message:test +message:123
Is there any way to do this without using a script or splitting the terms and creating a more complex query in the application?
EDIT
Using operator or minimum_should_match does not make a difference.
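For reference, this is the kind of variant that was tried, with the operator set explicitly (a sketch; operator is a documented multi_match parameter, but the validate output is unchanged here):
{
  "query": {
    "multi_match": {
      "type": "phrase_prefix",
      "query": "test123",
      "operator": "and",
      "fields": [
        "message"
      ],
      "lenient": true,
      "analyzer": "myAnalyzer"
    }
  }
}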
My Elasticsearch analysis settings for myAnalyzer are:
{
  "analysis": {
    "filter": {
      "foldAscii": {
        "type": "asciifolding",
        "preserve_original": "1"
      },
      "capturePattern": {
        "type": "pattern_capture",
        "patterns": [
          "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+(?!\\p{Ll}+))",
          "(\\d+)"
        ]
      },
      "noDuplicates": {
        "type": "unique",
        "only_on_same_position": "true"
      }
    },
    "analyzer": {
      "myAnalyzer": {
        "filter": [
          "capturePattern",
          "lowercase",
          "foldAscii",
          "noDuplicates"
        ],
        "tokenizer": "standard"
      }
    }
  }
}
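To confirm how myAnalyzer splits the input, the _analyze API can be run against the same text (a minimal sketch; myIndex is the index from the validate call above):
GET myIndex/_analyze
{
  "analyzer": "myAnalyzer",
  "text": "test123"
}
Given the pattern_capture filter above, this should emit test123, test and 123 as separate tokens at the same position, which is what the validate output reflects.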

Related

Elasticsearch: search with wildcard and custom analyzer

Requirement: Search with special characters in a text field.
My solution so far: Use a wildcard query with a custom analyzer. I want to use wildcards because it seems the easiest way to do partial searches in a long string with multiple search keys. See the ES query below.
I have an index called "invoices" and it has documents with one of the fields as
"searchString" : "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io msd-company msdhoni Dhoni, MS (3241480)"
Note: This field acts as a replacement for the deprecated _all field in ES.
Index Mapping for this field:
"searchString": {"type": "text","analyzer": "multi_level_analyzer"},
Analyzer settings:
PUT invoices
{
  "settings": {
    "analysis": {
      "analyzer": {
        "multi_level_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
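For reference, how this analyzer tokenizes part of the sample document can be checked with _analyze (a sketch; the text is taken from the searchString value above):
POST invoices/_analyze
{
  "analyzer": "multi_level_analyzer",
  "text": "MSD-Company MSD msd-company msdhoni"
}
With the whitespace tokenizer, "MSD-Company" stays a single token, which the lowercase filter turns into "msd-company".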
My query looks something like this:
GET invoices/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "searchString": {
              "value": "msd-company*",
              "boost": 1.0
            }
          }
        },
        {
          "wildcard": {
            "searchString": {
              "value": "Saginaw*",
              "boost": 1.0
            }
          }
        }
      ]
    }
  }
}
My question:
Earlier, when I was not using a custom analyzer, the above query worked, BUT I was not able to search for words with special characters like "msd-company".
After attaching the custom analyzer (multi_level_analyzer), the above query fails to return any result. I changed the wildcard query and prepended an asterisk to the search key, and for some reason it works now (following another answer).
I want to know the impact of using "*msd-company*" instead of "msd-company*" in the wildcard query for the text field.
How can I still use the wildcard query "msd-company*" with the custom analyzer?
Open to suggestions for any other approach to my problem statement.
I have solved my problem by changing the mapping of the said field to this:
"searchString": {"type": "text","analyzer": "multi_level_analyzer", "search_analyzer": "standard"},
But since wildcard queries are expensive, I would still like to know if there exists a better solution to satisfy my search use case.

Add simple search on every field in ElasticSearch

Here's my query:
"query":{
"query":{
"bool":{
"must":[
{
"term":{
"device_id":"1"
}
},
{
"range":{
"created_at":{
"gte":"2019-10-01",
"lte":"2019-12-10"
}
}
}
],
"must_not":[
{
"term":{
"action":"ping"
}
},
{
"term":{
"action":"checkstatus"
}
}
]
}
},
"sort":{
"created_at":"desc"
},
"size":10,
"from":0
}
The logs vary a lot, with completely different sets of fields. What can I do to search whether any of them contains a string I'm looking for? Let's say I'm looking for "mit" and it should show me one log with
surname -> Smith
and the second one with
action -> "message comMITted"
I've tried using match [ '_all' => "mit" ], but I've heard _all is deprecated.
I'm using Elasticsearch 7.3.1.
To perform a search across all fields, use copy_to to copy the values of all fields into a single grouped field, and run your queries against that field. If your index mappings are dynamic, a dynamic template can apply copy_to to newly added fields.
For infix matching you can use ngrams; the Elasticsearch ngram tokenizer documentation is a good starting point.
Infix matching can also be done with a wildcard query, but that is not recommended for performance reasons.
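A minimal sketch of the copy_to approach combined with an ngram analyzer for infix matching (the index, field and analyzer names here are hypothetical, not from the question):
PUT logs
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "surname":    { "type": "text", "copy_to": "all_fields" },
      "action":     { "type": "text", "copy_to": "all_fields" },
      "all_fields": { "type": "text", "analyzer": "ngram_analyzer", "search_analyzer": "standard" }
    }
  }
}

GET logs/_search
{
  "query": {
    "match": {
      "all_fields": "mit"
    }
  }
}
With this mapping, "Smith" and "comMITted" are both broken into lowercase trigrams at index time, so the search term "mit" matches either document.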

Exact Sub-String Match | ElasticSearch

We are migrating our search strategy from database to Elasticsearch. As part of this, we need to preserve the existing functionality of partially searching a field, similar to the SQL query below (including whitespace):
SELECT *
FROM customer
WHERE customer_id LIKE '%0995%';
Having said that, I've gone through multiple articles related to ES and achieving the said functionality. The following is what I've come up with:
The majority of the articles I read recommended using an nGram analyzer/filter; hence, here is how the mapping and settings look:
Note:
The max length of the customer_id field is VARCHAR2(100).
{
  "customer-index": {
    "aliases": {},
    "mappings": {
      "customer": {
        "properties": {
          "customerName": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "customerId": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "substring_analyzer"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "3",
        "provided_name": "customer-index",
        "creation_date": "1573333835055",
        "analysis": {
          "filter": {
            "substring": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "100"
            }
          },
          "analyzer": {
            "substring_analyzer": {
              "filter": [
                "lowercase",
                "substring"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "XXXXXXXXXXXXXXXXX",
        "version": {
          "created": "5061699"
        }
      }
    }
  }
}
Request to query the data looks like this:
{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "name.keyword": {
        "missing": "_first",
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": [
              "customer_id"
            ],
            "analyzer": "substring_analyzer"
          }
        }
      ]
    }
  }
}
With that being said, here are a couple of questions/issues:
Let's say there are 3 records with customer_id:
0009950011214,
0009900011214,
0009920011214
When I search for "0995", ideally I expect to get only customer_id 0009950011214.
But I get all three records as part of the result set, and I believe it's due to the nGram analyzer and the way it splits the string (note: min_gram: 3 and max_gram: 100). Setting max_gram to 100 was intended to allow exact matches.
How should I fix this?
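One way to see why all three match is to run the search text through the same analyzer (a sketch; substring_analyzer emits every gram of length 3 to 100, so "0995" becomes 099, 995 and 0995, and the shorter grams also occur in the other two IDs, which the query_string then matches with its default OR):
GET customer-index/_analyze
{
  "analyzer": "substring_analyzer",
  "text": "0995"
}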
This brings me to my second point. Is using an nGram analyzer for this kind of requirement the most effective strategy? My concern is the memory utilization of having min_gram = 3 and max_gram = 100. Is there a better way to implement the same?
P.S.: I'm on NEST 5.5.
On your customerId field you can set "search_analyzer": "standard". Then, in your search query, remove the line "analyzer": "substring_analyzer".
This will ensure that the searched customer ID is not tokenized into nGrams and is searched as-is, while the indexed customer IDs are still broken into nGrams.
I believe that's the functionality that you were trying to replicate from your SQL query.
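A sketch of what that mapping change might look like (only the customerId field from the mapping above is shown, with search_analyzer added):
"customerId": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  },
  "analyzer": "substring_analyzer",
  "search_analyzer": "standard"
}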
From the mapping I can see that the field customerId is a text field with a keyword sub-field (see Difference between keyword and text in ElasticSearch).
So you can use a regexp query inside a filter, as shown below, to run searches like the SQL query you gave as an example. Try this:
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "regexp": {
                "customerId": {
                  "value": ".*0995.*",
                  "flags": "ALL"
                }
              }
            }
          ]
        }
      }
    }
  }
}
Notice the ".*" on each side of the value in the regexp expression:
".*<term>.*" is the same as a contains search
~(...) is the same as a not-contains search
You can also append ".*" only at the start or the end of the search term to do ends-with and starts-with type searches. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-regexp-query.html
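For example, an anchored starts-with style search could be run against the untokenized keyword sub-field from the mapping above (a sketch, not from the original answer):
{
  "query": {
    "regexp": {
      "customerId.keyword": {
        "value": "0995.*"
      }
    }
  }
}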

Not getting where data with filter (elastic search 6.4)

elasticsearch version: 6.4
Here is my current data:
I want to search for products which have "Xbox" in the name. I am using a match query, but that is not working.
Below is my elastic search query:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "xbox"
            }
          }
        },
        {
          "terms": {
            "deep_sub": [
              "Konsol Oyunları",
              "Konsol Aksesuarları"
            ]
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 50
}
Whenever you face this kind of issue, try breaking down the query. You have a Match query and a Terms query. Run both of them individually and see which one is not working.
From what I understand, it looks like your field deep_sub is of text type, and this would mean the Terms query is not returning results.
You would need to create its sibling equivalent using the keyword type and then run the Terms query on it for exact matches.
From the documentation on keyword fields:
Keyword fields are only searchable by their exact value.
If you do not control the mapping, meaning your mapping is dynamic, then you will already have its sibling keyword equivalent available, which would be deep_sub.keyword.
You can check by running GET <your_index_name>/_mapping
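With dynamic mapping, a string field gets a keyword sub-field by default, so the relevant part of the _mapping output would look something like this (a sketch of the default multi-field layout):
"deep_sub": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}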
Your query would then be as follows:
POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "xbox"
            }
          }
        },
        {
          "terms": {
            "deep_sub.keyword": [    <----- Change This
              "Konsol Oyunları",
              "Konsol Aksesuarları"
            ]
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 50
}
Let me know if this helps!

How to find numbers with comma in Elasticsearch?

Querying a number in data like the sample below returns no result, but when there is a space after the commas it can be found.
Sample data:
{
"data":"34543,2525,5674,879"
}
Query:
"query": {
"query_string" : {
"query" : "(data:2525)"
}
}
Settings:
"analysis":{
"filter":{
"my_ascii_folding":{
"type":"asciifolding",
"preserve_original":"true"
}
},
"analyzer":{
"default":{
"filter":[
"lowercase",
"my_ascii_folding"
],
"char_filter":[
"html_strip"
],
"tokenizer":"standard"
}
}
}
For example, querying 2525 in "34543, 2525, 5674, 879" finds a match, but in "34543,2525,5674,879" it does not.
From your settings it looks like you're using the standard tokenizer. You can see how your text is analyzed by using
GET users/_analyze
{
"text": "34543, 2525, 5674, 879"
}
or
GET users/_analyze
{
"text": "34543,2525,5674,879"
}
If you're using the standard tokenizer, then 34543,2525,5674,879 is only one token in your inverted index. When you search for 2525 it won't match that token. On the other hand, 34543, 2525, 5674, 879 is tokenized into four tokens without commas, and 2525 matches the second token.
If you want to solve this problem you'll need to use a different tokenizer that always splits on a comma, rather than only when it's at the beginning or end of a token (see, for example, Indexing a comma-separated value field in Elastic Search).
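One possible way to do that (a sketch, not from the original answer) is to keep the standard tokenizer but add a pattern_replace char filter that turns commas into spaces before tokenization; the index name and the commas_to_spaces filter name here are hypothetical, while the other filters are taken from the settings above:
PUT myindex
{
  "settings": {
    "analysis": {
      "char_filter": {
        "commas_to_spaces": {
          "type": "pattern_replace",
          "pattern": ",",
          "replacement": " "
        }
      },
      "filter": {
        "my_ascii_folding": {
          "type": "asciifolding",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "char_filter": [
            "html_strip",
            "commas_to_spaces"
          ],
          "filter": [
            "lowercase",
            "my_ascii_folding"
          ]
        }
      }
    }
  }
}
With this in place, "34543,2525,5674,879" is indexed as four separate number tokens, so data:2525 matches.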

Resources