Exact Sub-String Match | ElasticSearch

We are migrating our search strategy from database queries to ElasticSearch. During this migration we need to preserve the existing functionality of partially searching a field, similar to the SQL query below (including whitespace):
SELECT *
FROM customer
WHERE customer_id LIKE '%0995%';
I've gone through multiple articles on achieving this functionality in ES, and here is what I've come up with:
The majority of the articles I read recommended using an nGram analyzer/filter, so this is how the mapping and settings look:
Note:
The max length of the customer_id field is VARCHAR2(100).
{
  "customer-index": {
    "aliases": {},
    "mappings": {
      "customer": {
        "properties": {
          "customerName": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "customerId": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            },
            "analyzer": "substring_analyzer"
          }
        }
      }
    },
    "settings": {
      "index": {
        "number_of_shards": "3",
        "provided_name": "customer-index",
        "creation_date": "1573333835055",
        "analysis": {
          "filter": {
            "substring": {
              "type": "ngram",
              "min_gram": "3",
              "max_gram": "100"
            }
          },
          "analyzer": {
            "substring_analyzer": {
              "filter": [
                "lowercase",
                "substring"
              ],
              "type": "custom",
              "tokenizer": "standard"
            }
          }
        },
        "number_of_replicas": "1",
        "uuid": "XXXXXXXXXXXXXXXXX",
        "version": {
          "created": "5061699"
        }
      }
    }
  }
}
The request to query the data looks like this:
{
  "from": 0,
  "size": 10,
  "sort": [
    {
      "customerName.keyword": {
        "missing": "_first",
        "order": "asc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "0995",
            "fields": [
              "customerId"
            ],
            "analyzer": "substring_analyzer"
          }
        }
      ]
    }
  }
}
With that being said, here are a couple of questions/issues:
Let's say there are 3 records with customer_id:
0009950011214,
0009900011214,
0009920011214
When I search for "0995", ideally I expect to get back only customer_id 0009950011214.
But I get all three records in the result set, and I believe it's due to the nGram analyzer and the way it splits the string (note: min_gram: 3 and max_gram: 100). Setting max_gram to 100 was intended to allow exact matches.
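For illustration, running the search term through the same analyzer shows the grams it produces (a quick _analyze sketch):
GET customer-index/_analyze
{
  "analyzer": "substring_analyzer",
  "text": "0995"
}
This yields the tokens "099", "995" and "0995", and the gram "099" occurs in all three customer IDs above, which is why all three records match.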
How should I fix this?
This brings me to my second point. Is using an nGram analyzer for this kind of requirement the most effective strategy? My concern is the memory utilization of having min_gram = 3 and max_gram = 100. Is there a better way to implement the same?
P.S: I'm on NEST 5.5.

In your customerId field you can set "search_analyzer": "standard". Then remove the line "analyzer": "substring_analyzer" from your search query.
This will ensure that the searched customerId is not tokenized into nGrams and is searched as is, while the indexed customerIds are still stored as nGrams.
I believe that's the functionality that you were trying to replicate from your SQL query.
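A minimal sketch of that change against the mapping from the question (only the customerId property shown; the rest of the mapping stays the same):
{
  "customerId": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    },
    "analyzer": "substring_analyzer",
    "search_analyzer": "standard"
  }
}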

From the mapping I can see that the field customerId is a text/keyword field (see Difference between keyword and text in ElasticSearch).
So you can use a regexp filter, as shown below, to run searches like the SQL query you have given as an example. Try this:
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            {
              "regexp": {
                "customerId": {
                  "value": ".*0995.*",
                  "flags": "ALL"
                }
              }
            }
          ]
        }
      }
    }
  }
}
Notice the ".*" on either side of the term in the regexp value:
".*term.*" is the same as a contains search
"~(term)" (the complement operator) is the same as not-contains
You can also prepend or append ".*" to the search term to do ends-with and starts-with type searches. Reference: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-regexp-query.html
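For instance, a starts-with sketch on the same field (pattern per the linked reference):
{
  "query": {
    "regexp": {
      "customerId": {
        "value": "0995.*"
      }
    }
  }
}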

Related

Not getting where data with filter (elastic search 6.4)

elasticsearch version: 6.4
Here is my current data:
I want to search for products which have Xbox in the name. I am using the match keyword, but that is not working.
Below is my Elasticsearch query:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "xbox"
            }
          }
        },
        {
          "terms": {
            "deep_sub": [
              "Konsol Oyunları",
              "Konsol Aksesuarları"
            ]
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 50
}
Whenever you face this kind of issue, try breaking down the query. You have a Match Query and a Terms Query. Run both of them individually and see what's not working.
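For example, a sketch of running just the terms clause on its own (index name is a placeholder; if deep_sub is a text field, this should come back empty):
POST <your_index_name>/_search
{
  "query": {
    "terms": {
      "deep_sub": [
        "Konsol Oyunları",
        "Konsol Aksesuarları"
      ]
    }
  }
}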
From what I understand, it looks like your field deep_sub is of text type, which would mean the Terms Query is not returning results.
You would need to create its sibling equivalent using the keyword type and then run the Terms Query on it for exact matches.
From the documentation we have the below info:
Keyword fields are only searchable by their exact value.
If you do not control the mapping, meaning your mapping is dynamic, then you should already have its sibling keyword field available, which would be deep_sub.keyword.
You can check by running GET <your_index_name>/_mapping
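If dynamic mapping applied, the relevant part of the _mapping output should look roughly like this (a sketch of the default text-with-keyword mapping, abbreviated):
{
  "deep_sub": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}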
Your query would then be as follows:
POST <your_index_name>/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "xbox"
            }
          }
        },
        {
          "terms": {
            "deep_sub.keyword": [        <----- Change This
              "Konsol Oyunları",
              "Konsol Aksesuarları"
            ]
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 50
}
Let me know if this helps!

ElasticSearch query with MUST and SHOULD

I have this query to get data from an AWS ElasticSearch instance v6.2:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": { "logLevel": "error" }
        },
        {
          "bool": {
            "should": [
              {
                "match": { "EventCategory": "Home Management" }
              }
            ]
          }
        }
      ],
      "filter": [
        {
          "range": { "timestamp": { "gte": 155254550880 } }
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
My data has multiple EventCategories, for example 'Home Management' and 'User Account Management'. The problem is that the match inside should returns all the data, because the word 'Management' appears in both categories. If I use term instead of match, it doesn't return anything at all, even when the given value is exactly the same as in the document.
I need to get data when any of the given categories matches, along with the rest of the filters.
EDIT:
There may be none, one, or more than one EventCategory passed to the should clause.
I'm not sure why you added a should within a must. Do you expect to have more than one should case? It looks a bit odd.
As for your question, you can't use the term query on an analysed field, only on keyword-typed fields. If your EventCategory field has the default mapping, you can run the term query against the default non-analysed multi-field of EventCategory as follows:
...
{
  "term": { "EventCategory.keyword": "Home Management" }
}
...
Furthermore, if you just want to filter documents in or out without caring about their relevance, I'd recommend moving all the conditions into the filter block, to speed up your query and make better use of the cache.
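A sketch of that shape, moving every condition into filter and using a terms query on the keyword sub-field so that any number of categories can be passed (the category list here is illustrative):
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "logLevel": "error" } },
        { "terms": { "EventCategory.keyword": [ "Home Management", "User Account Management" ] } },
        { "range": { "timestamp": { "gte": 155254550880 } } }
      ]
    }
  },
  "size": 10,
  "from": 0
}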
Below query should work.
I've just removed the should and created two must match clauses, one for each of the words 'home' and 'management'. Note that the query is meant for text datatypes.
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "logLevel": "error"
          }
        },
        {
          "match": {
            "EventCategory": "home"
          }
        },
        {
          "match": {
            "EventCategory": "management"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "timestamp": {
              "gte": 155254550880
            }
          }
        }
      ]
    }
  },
  "size": 10,
  "from": 0
}
Hope it helps!

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches for its keyword value, but I'm having some issues:
Some records saying UK and some other saying GB
Some records saying US and some other USA
And the list goes on..
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms on my query filter?
What you are looking for is a way to have your tokens match similar tokens which may or may not share characters. This is only possible using synonyms.
Elasticsearch lets you configure synonyms and have your query use them, returning results accordingly.
I have configured a field with a custom analyzer that uses the synonym token filter. I have created a sample mapping and query so that you can play with them and see if that fits your needs.
Mapping
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa, us",
            "uk, gb"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mydocs": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Sample Document
POST my_index/mydocs/1
{
  "name": "uk is pretty cool country"
}
And when you make use of the below query, it does return the above document as well.
Query
GET my_index/mydocs/_search
{
  "query": {
    "match": {
      "name": "gb"
    }
  }
}
Refer to their official documentation to understand more on this. Hope this helps!
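You can verify the expansion with the _analyze API; with the filter above, analyzing "gb" should emit both gb and uk at the same position:
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": "gb"
}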
To handle this within ES itself, without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
  "gsub": {
    "field": "countryCode",
    "pattern": "GB",
    "replacement": "UK"
  }
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
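A sketch of the full pipeline definition and of applying it at index time (the pipeline name and document are hypothetical):
PUT _ingest/pipeline/normalize_country_code
{
  "description": "Normalize countryCode variants",
  "processors": [
    {
      "gsub": {
        "field": "countryCode",
        "pattern": "GB",
        "replacement": "UK"
      }
    }
  ]
}

POST my_index/mydocs/1?pipeline=normalize_country_code
{
  "countryCode": "GB"
}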

Getting results for multi_match cross_fields query in elasticsearch with custom analyzer

I have an Elasticsearch 5.3 server with products.
Each product has a 14-digit product code that has to be searchable by the following rules: the complete code should match, as well as a search term with only the last 9, 6, 5, or 4 digits.
In order to achieve this I created a custom analyzer which creates the appropriate tokens at index time using the pattern capture token filter. This seems to be working correctly: the _analyze API shows that the correct terms are created.
To fetch the documents from Elasticsearch I'm using a multi_match cross_fields bool query to search a number of fields simultaneously.
When I have a query string with a part that matches a product code and a part that matches any of the other fields, no results are returned; but when I search for each part separately, the appropriate results are returned. Also, when I have multiple parts spanning any of the fields except the product code, the correct results are returned.
My mapping and analyzer:
PUT /store
{
  "mappings": {
    "products": {
      "properties": {
        "productCode": {
          "analyzer": "ProductCode",
          "search_analyzer": "standard",
          "type": "text"
        },
        "description": {
          "type": "text"
        },
        "remarks": {
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "ProductCodeNGram": {
          "type": "pattern_capture",
          "preserve_original": "true",
          "patterns": [
            "\\d{5}(\\d{9})",
            "\\d{8}(\\d{6})",
            "\\d{9}(\\d{5})",
            "\\d{10}(\\d{4})"
          ]
        }
      },
      "analyzer": {
        "ProductCode": {
          "filter": ["ProductCodeNGram"],
          "type": "custom",
          "preserve_original": "true",
          "tokenizer": "standard"
        }
      }
    }
  }
}
The query
GET /store/products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "[query_string]",
            "fields": ["productCode", "description", "remarks"],
            "type": "cross_fields",
            "operator": "and"
          }
        }
      ]
    }
  }
}
Sample data
POST /store/products
{
  "productCode": "999999123456789",
  "description": "Foo bar",
  "remarks": "Foobar"
}
The following query strings all return one result:
"456789", "foo", "foobar", "foo foobar".
But the query_string "foo 456789" returns no results.
I am very curious as to why the last search does not return any results. I am convinced that it should.
The problem is that you are doing a cross_fields search over fields with different analyzers. cross_fields only works for fields using the same analyzer; it in fact groups the fields by analyzer before doing the cross-field search. You can find more information in the documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#_literal_cross_field_literal_and_analysis
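One workaround, sketched here under the assumption that the product code part can be matched on its own, is to split the search into one clause per analyzer group and combine them in a bool/should:
GET /store/products/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "[query_string]",
            "fields": ["description", "remarks"],
            "type": "cross_fields"
          }
        },
        {
          "match": {
            "productCode": "[query_string]"
          }
        }
      ]
    }
  }
}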
Although cross_fields needs the same analyzer across the fields it operates on, I've had luck using the tie_breaker parameter to allow other fields (that use different analyzers) to be weighted in the total score.
This has the added benefit of allowing per-field boosting to be calculated in the final score, too.
Here's an example using your query:
GET /store/products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "[query_string]",
            "fields": ["productCode", "description", "remarks"],
            "type": "cross_fields",
            "tie_breaker": 1      # You may need to tweak this
          }
        }
      ]
    }
  }
}
I also removed the operator field, as I believe using the "AND" operator will cause fields that don't have the same analyzer to be scored inappropriately.

Proper way to query documents using an 'uppercase' token filter

We have an ElasticSearch index with some fields that use custom analyzers. One of the analyzers includes an uppercase token filter in order to get rid of case sensitivity while making queries (e.g. we want "ball" to also match "Ball" or "BALL").
The issue here is that when doing regular expressions, the pattern is matched against the term in the index, which is all uppercase. So "app*" won't match "Apple" in our index, because behind the scenes it's really indexed as "APPLE".
Is there a way to get this to work without doing some hacky things outside of ES?
I might play around with "query_string" instead and see if that has any different results.
This all depends on the type of query you are using. If that query type uses the analyzer of the field itself to analyze the input string, then it should be fine.
If you are using the regexp query, this one will NOT analyze the input string, so if you pass app.* to it, it will stay the same and this is what it will use for the search.
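One workaround sketch, if you have to stay with regexp: uppercase the pattern yourself before sending it, so it matches the uppercased terms in the index (field name some_field as in the mapping below):
{
  "query": {
    "regexp": {
      "some_field": "APP.*"
    }
  }
}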
But if you properly use the query_string query, that one should work:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "uppercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "some_field": {
          "type": "text",
          "analyzer": "my"
        }
      }
    }
  }
}
And the query itself:
{
  "query": {
    "query_string": {
      "query": "some_field:app*"
    }
  }
}
To make sure it's doing what I think it is, I always use the _validate API:
GET /_validate/query?explain&index=test
{
  "query": {
    "query_string": {
      "query": "some_field:app*"
    }
  }
}
which will show what ES is doing to the input string:
"explanations": [
{
"index": "test",
"valid": true,
"explanation": "some_field:APP*"
}
]
