Elasticsearch: search with wildcard and custom analyzer

Requirement: Search with special characters in a text field.
My solution so far: use a wildcard query with a custom analyzer. I want to use wildcards because they seem the easiest way to do partial searches in a long string with multiple search keys. See the ES query below.
I have an index called "invoices", and its documents have a field like this:
"searchString" : "I000010-1 000010 3901 North Saginaw Road add 2 Midland MI 48640 US MS Dhoni MSD-Company MSD (777) 777-7777 (333) 333-3333 sandeep#xyz.io msd-company msdhoni Dhoni, MS (3241480)"
Note: This field acts as the deprecated _all field in ES.
Index Mapping for this field:
"searchString": {"type": "text","analyzer": "multi_level_analyzer"},
Analyzer settings:
PUT invoices
{
"settings": {
"analysis": {
"analyzer": {
"multi_level_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"char_filter": [
"html_strip"
],
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
}
}
My query looks something like this:
GET invoices/_search
{
"query": {
"bool": {
"must": [{
"wildcard": {
"searchString": {
"value": "msd-company*",
"boost": 1.0
}
}
},
{
"wildcard": {
"searchString": {
"value": "Saginaw*",
"boost": 1.0
}
}
}
]
}
}
}
My question:
Earlier, when I was not using a custom analyzer, the above query worked, BUT I was not able to search for words with special characters like "msd-company".
After attaching the custom analyzer (multi_level_analyzer), the above query fails to return any result. I changed the wildcard query by prepending an asterisk to the search key, and for some reason it works now. (I referred to another answer for this.)
I want to know the impact of using "*msd-company*" instead of "msd-company*" in the wildcard query for the text field.
How can I still use the wildcard query "msd-company*" with the custom analyzer?
I am open to suggestions for any other approach to my problem statement.

I have solved my problem by changing the mapping of the said field to this:
"searchString": {"type": "text","analyzer": "multi_level_analyzer", "search_analyzer": "standard"},
But since wildcard queries are expensive, I would still like to know if there exists a better solution to satisfy my search use case.
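As background for the observation that "msd-company" could not be found before the custom analyzer was attached, here is a rough Python sketch of how the two tokenization styles differ. It is only an approximation: the real standard tokenizer follows Unicode text segmentation rules, and the regex below is a crude stand-in for it. The function names and the shortened sample text are made up for illustration. A wildcard pattern is compared against individual indexed terms, which is modeled here with fnmatchcase.

```python
import re
from fnmatch import fnmatchcase

# Shortened, hypothetical excerpt of the searchString value.
TEXT = "MS Dhoni MSD-Company MSD msd-company msdhoni"

def whitespace_lowercase(text):
    """Split on whitespace only, then lowercase (whitespace tokenizer + lowercase filter)."""
    return [tok.lower() for tok in text.split()]

def rough_standard(text):
    """Crude stand-in for the standard analyzer: break on anything that is not a letter or digit."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

def wildcard_hits(terms, pattern):
    """Return the distinct indexed terms that a wildcard pattern matches, term by term."""
    return sorted({t for t in terms if fnmatchcase(t, pattern)})

# The hyphenated term survives whitespace tokenization...
print(wildcard_hits(whitespace_lowercase(TEXT), "msd-company*"))
# ...but is broken into "msd" and "company" by the rough standard tokenization.
print(wildcard_hits(rough_standard(TEXT), "msd-company*"))
```

This only illustrates why the hyphenated term is not present as a single token under the default analysis. It does not model how a particular Elasticsearch version normalizes the wildcard pattern itself.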

Related

Elasticsearch combined_fields query with synonym_graph token filter

I'm trying to use a combined_fields query with a synonym_graph search-time token filter in Elasticsearch. When I query for a multi-term phrase in my synonym file, Elasticsearch seems to unconfigurably switch from "or logic" to "and logic" between my original terms. Here's an example Elasticsearch query that has been exaggerated for demonstration purposes:
GET /products/_search
{
"query": {
"bool": {
"should": [
{
"combined_fields": {
"query": "boxes other rectangle hinged lid hook cutout",
"operator": "or",
"minimum_should_match": 1,
"fields": [
"productTitle^9",
"fullDescription^5"
],
"auto_generate_synonyms_phrase_query": false
}
}
]
}
}
}
When I submit the query on my index with an empty synonyms.txt file, it returns >1000 hits. As expected, the top hits contain all or many of the terms in the query, and the result set is composed of all documents that contain any of the terms. However, when I add this line to the synonyms.txt file:
black spigot, boxes other rectangle hinged lid hook cutout
the query only returns 4 hits. These hits either contain all of the terms in my query across the queried fields, or both the terms "black" and "spigot".
My conclusion is that the presence of the phrase in the synonyms file is influencing how the "non-synonym-replaced" phrase is being searched for. This seems counterintuitive: adding a phrase to the synonyms file should only ever increase the number of results that a search for that exact phrase produces, right?
Does anyone know what I'm doing incorrectly, or if my expectations are reliant upon some fundamental misunderstanding of how Elasticsearch works? I observe the same behavior when I use a multi-match query or an array of match queries, and I've tried every combination of query options that I reasonably think might resolve the problem.
For reference, here is my analyzer configuration:
"analysis": {
"analyzer": {
"indexAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop"
]
},
"searchAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"porter_stem",
"stop",
"productSynonym"
]
}
},
"filter": {
"productSynonym": {
"type": "synonym_graph",
"synonyms_path": "analysis/synonyms.txt"
}
}
}

Elasticsearch - Do searches for alternative country codes

I have a document with a field called 'countryCode'. I have a term query that searches for its keyword value, but I am running into some issues:
Some records say UK and others say GB
Some records say US and others say USA
And the list goes on.
Can I instruct my index to handle all those variations somehow, instead of me having to expand the terms on my query filter?
What you are looking for is a way to make a token match equivalent tokens that may or may not share any characters with it. The usual way to do this is with synonyms.
Elasticsearch lets you configure synonyms so that your queries use them and return results accordingly.
I have configured a field with a custom analyzer that uses a synonym token filter. Here is a sample mapping and query so that you can play with it and see if it fits your needs.
Mapping
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"my_synonym_filter": {
"type": "synonym",
"synonyms": [
"usa, us",
"uk, gb"
]
}
},
"analyzer": {
"my_synonyms": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_synonym_filter"
]
}
}
}
},
"mappings": {
"mydocs": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_synonyms"
}
}
}
}
}
Sample Document
POST my_index/mydocs/1
{
"name": "uk is pretty cool country"
}
And when you make use of the below query, it does return the above document as well.
Query
GET my_index/mydocs/_search
{
"query": {
"match": {
"name": "gb"
}
}
}
Refer to their official documentation to understand more on this. Hope this helps!
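To make the effect of the synonym filter above easier to picture, here is a small Python sketch. It is a simplified model and not how Elasticsearch implements the filter: it lowercases, splits on whitespace, and expands every token to its whole synonym group. The helper names are made up for illustration.

```python
# Same rules as in the my_synonym_filter definition above.
SYNONYM_RULES = ["usa, us", "uk, gb"]

def build_synonym_map(rules):
    """Map every word in a rule to the full set of words in that rule."""
    mapping = {}
    for rule in rules:
        group = {word.strip().lower() for word in rule.split(",")}
        for word in group:
            mapping[word] = group
    return mapping

SYNONYMS = build_synonym_map(SYNONYM_RULES)

def analyze(text):
    """Lowercase, split on whitespace, and expand each token to its synonym group."""
    terms = set()
    for token in text.lower().split():
        terms |= SYNONYMS.get(token, {token})
    return terms

def matches(query, document):
    """A match query hits when query and document share at least one analyzed term."""
    return bool(analyze(query) & analyze(document))

print(matches("gb", "uk is pretty cool country"))
print(matches("us", "uk is pretty cool country"))
```

Because "uk" and "gb" expand to the same group, a query for either one finds a document containing the other.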
To handle this within ES itself, without using Logstash, I'd suggest a simple ingest pipeline with a gsub processor to update the field in place:
{
"gsub": {
"field": "countryCode",
"pattern": "GB",
"replacement": "UK"
}
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/gsub-processor.html
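For illustration only, here is roughly what such a replacement does to a document, sketched in Python with re.sub standing in for the processor's regex replacement. The rule list and function names are hypothetical. The patterns are anchored with ^ and $ as an assumption of this sketch, so that only whole values are rewritten; the fragment above uses the unanchored pattern "GB".

```python
import re

def gsub(doc, field, pattern, replacement):
    """Return a copy of doc with regex matches in one string field replaced."""
    updated = dict(doc)
    updated[field] = re.sub(pattern, replacement, str(doc[field]))
    return updated

# Hypothetical normalization rules: (pattern, canonical value).
RULES = [("^GB$", "UK"), ("^USA$", "US")]

def normalize_country(doc):
    """Apply every rule in order, like a pipeline with one gsub step per variant."""
    for pattern, replacement in RULES:
        doc = gsub(doc, "countryCode", pattern, replacement)
    return doc

print(normalize_country({"countryCode": "GB"}))
print(normalize_country({"countryCode": "USA"}))
print(normalize_country({"countryCode": "DE"}))
```

Unlike the synonym approach, this changes the stored value itself, so a plain term query on the canonical code is enough afterwards.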

Query elasticsearch to make all analyzed ngram tokens to match

I indexed some data using an nGram analyzer (which emits only trigrams) to solve the compound words problem, exactly as described in the ES guide.
However, this doesn't work as expected: the corresponding match query returns all documents where at least one nGram token (per word) matched.
Example:
Let's take these two indexed documents with a single field, using that nGram analyzer:
POST /compound_test/doc/_bulk
{ "index": { "_id": 1 }}
{ "content": "elasticsearch is awesome" }
{ "index": { "_id": 2 }}
{ "content": "some search queries don't perform good" }
Now if I run the following query, I get both results:
"match": {
"content": {
"query": "awesome search",
"minimum_should_match": "100%"
}
}
The query that is constructed from this could be expressed like this:
(awe OR wes OR eso OR som OR ome) AND (sea OR ear OR arc OR rch)
That's why the second document matches (it contains "some" and "search"). It would even match a document with words that contain the tokens "som" and "rch".
What I actually want is a query where each analyzed token must match (ideally subject to minimum_should_match), so something like this:
"match": {
"content": {
"query": "awe wes eso som ome sea ear arc rch",
"analyzer": "whitespace",
"minimum_should_match": "100%"
}
}
...without actually creating that query by hand or pre-analyzing it on the client side.
All settings and data to reproduce that behavior can be found at https://pastebin.com/97QxfaSb
Is there such a possibility?
While writing the question, I accidentally found the answer:
If the ngram analyzer uses an ngram filter to generate trigrams (as described in the guide), it behaves as described above. (I guess this is because the actual token is not each single ngram but the combination of all ngrams created for a word.)
To achieve the wanted behavior, the analyzer must use the ngram tokenizer:
"tokenizer": {
"trigram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit"
]
}
},
"analyzer": {
"trigrams_with_tokenizer": {
"type": "custom",
"tokenizer": "trigram_tokenizer"
}
}
Producing tokens this way gives the desired result when querying that field.
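The difference between the two behaviours can be sketched in plain Python. This is a simplified model under my own assumptions, not Elasticsearch code: words are runs of letters and digits, each word is cut into trigrams, and the two functions mimic "any trigram per word" versus "every trigram must match". The function names are made up.

```python
import re

DOC1 = "elasticsearch is awesome"
DOC2 = "some search queries don't perform good"

def words(text):
    """Lowercase and keep runs of letters/digits, similar to token_chars letter + digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def trigrams(word):
    """All 3-character substrings of a word (empty for words shorter than 3)."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def doc_terms(text):
    """The set of trigrams indexed for a document."""
    return {gram for word in words(text) for gram in trigrams(word)}

def match_any_per_word(query, doc):
    """Filter-style behaviour: each query word needs at least one matching trigram."""
    terms = doc_terms(doc)
    return all(any(g in terms for g in trigrams(w)) for w in words(query))

def match_all_trigrams(query, doc):
    """Tokenizer-style behaviour with 100% minimum_should_match: every trigram must match."""
    terms = doc_terms(doc)
    return all(g in terms for w in words(query) for g in trigrams(w))

for doc in (DOC1, DOC2):
    print(doc, match_any_per_word("awesome search", doc), match_all_trigrams("awesome search", doc))
```

In this model the second document passes the per-word check through "som" and "ome", but fails the strict check because it contains no "awe", "wes" or "eso".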

Getting results for multi_match cross_fields query in elasticsearch with custom analyzer

I have an Elasticsearch 5.3 server with products.
Each product has a 14-digit product code that has to be searchable by the following rules: the complete code should match, as should a search term with only the last 9, 6, 5 or 4 digits.
In order to achieve this, I created a custom analyzer which creates the appropriate tokens at index time using the pattern capture token filter. This seems to be working correctly: the _analyze API shows that the correct terms are created.
To fetch the documents from Elasticsearch, I'm using a multi_match cross_fields bool query to search a number of fields simultaneously.
When I have a query string with one part that matches a product code and another part that matches any of the other fields, no results are returned, but when I search for each part separately, the appropriate results are returned. Also, when I have multiple parts spanning any of the fields except the product code, the correct results are returned.
My mapping and analyzer:
PUT /store
{
"mappings": {
"products":{
"properties":{
"productCode":{
"analyzer": "ProductCode",
"search_analyzer": "standard",
"type": "text"
},
"description": {
"type": "text"
},
"remarks": {
"type": "text"
}
}
}
},
"settings": {
"analysis": {
"filter": {
"ProductCodeNGram": {
"type": "pattern_capture",
"preserve_original": "true",
"patterns": [
"\\d{5}(\\d{9})",
"\\d{8}(\\d{6})",
"\\d{9}(\\d{5})",
"\\d{10}(\\d{4})"
]
}
},
"analyzer": {
"ProductCode": {
"filter": ["ProductCodeNGram"],
"type": "custom",
"preserve_original": "true",
"tokenizer": "standard"
}
}
}
}
}
The query
GET /store/products/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "[query_string]",
"fields": ["productCode", "description", "remarks"],
"type": "cross_fields",
"operator": "and"
}
}
]
}
}
}
Sample data
POST /store/products
{
"productCode": "99999123456789",
"description": "Foo bar",
"remarks": "Foobar"
}
The following query strings all return one result:
"456789", "foo", "foobar", "foo foobar".
But the query_string "foo 456789" returns no results.
I am very curious as to why the last search does not return any results. I am convinced that it should.
The problem is that you are doing a cross_fields query over fields with different analyzers. cross_fields only works as expected for fields that use the same analyzer; it in fact groups the fields by analyzer before doing the cross-field matching. You can find more information in the documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#_literal_cross_field_literal_and_analysis
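The grouping described above can be pictured with a small Python model. It is a deliberately simplified sketch, not the real scoring logic: fields are put into groups, and with the "and" operator every query term has to be found inside a single group. The group labels, the pre-analyzed terms and the function name are all hypothetical; the suffix tokens are assumed to be what the pattern_capture filter emits for the sample product code.

```python
from collections import defaultdict

# Hypothetical analyzer groups: productCode ends up apart from the two text fields.
FIELD_GROUP = {
    "productCode": "code_analyzer",
    "description": "text_analyzer",
    "remarks": "text_analyzer",
}

# Hypothetical, already-analyzed index terms for the sample document.
INDEXED = {
    "productCode": {"123456789", "456789", "56789", "6789"},
    "description": {"foo", "bar"},
    "remarks": {"foobar"},
}

def cross_fields_and(query, indexed=INDEXED, field_group=FIELD_GROUP):
    """Return True if all query terms are found within one analyzer group."""
    groups = defaultdict(set)
    for field, group in field_group.items():
        groups[group] |= indexed[field]
    terms = query.lower().split()
    # Operator "and": a group only matches if it contains every term.
    return any(all(term in group_terms for term in terms)
               for group_terms in groups.values())

for q in ["456789", "foo", "foobar", "foo foobar", "foo 456789"]:
    print(q, cross_fields_and(q))
```

In this model "foo 456789" fails because neither group contains both terms, which is consistent with the behaviour reported in the question.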
Although cross_fields needs the same analyzer across the fields it operates on, I've had luck using the tie_breaker parameter to allow other fields (that use different analyzers) to be weighed for the total score.
This has the added benefit of allowing per-field boosting to be calculated in the final score, too.
Here's an example using your query:
GET /store/products/_search
{
"query": {
"bool": {
"must": [
{
"multi_match": {
"query": "[query_string]",
"fields": ["productCode", "description", "remarks"],
"type": "cross_fields",
"tie_breaker": 1 # You may need to tweak this
}
}
]
}
}
}
I also removed the operator field, as I believe using the "AND" operator will cause fields that don't have the same analyzer to be scored inappropriately.

Proper way to query documents using an 'uppercase' token filter

We have an ElasticSearch index with some fields that use custom analyzers. One of the analyzers includes an uppercase token filter in order to get rid of case sensitivity while making queries (e.g. we want "ball" to also match "Ball" or "BALL")
The issue here is when doing regular expressions: the pattern is matched against the term in the index, which is all uppercase. So "app*" won't match "Apple" in our index, because behind the scenes it's really indexed as "APPLE".
Is there a way to get this to work without doing some hacky things outside of ES?
I might play around with "query_string" instead and see if that has any different results.
This all depends on the type of query you are using. If that type uses the analyzer of the field itself to analyze the input string, then it should be fine.
If you are using the regexp query, it will NOT analyze the input string, so if you pass app.* to it, it will stay the same, and this is what it will use for the search.
But if you use the query_string query properly, it should work:
{
"settings": {
"analysis": {
"analyzer": {
"my": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"uppercase"
]
}
}
}
},
"mappings": {
"test": {
"properties": {
"some_field": {
"type": "text",
"analyzer": "my"
}
}
}
}
}
And the query itself:
{
"query": {
"query_string": {
"query": "some_field:app*"
}
}
}
To make sure it's doing what I think it is, I always use the _validate API:
GET /_validate/query?explain&index=test
{
"query": {
"query_string": {
"query": "some_field:app*"
}
}
}
which will show what ES is doing to the input string:
"explanations": [
{
"index": "test",
"valid": true,
"explanation": "some_field:APP*"
}
]
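The contrast between an unanalyzed pattern and the rewritten "some_field:APP*" shown above can be mimicked in a few lines of Python. This is only an illustration under my own assumptions: fnmatchcase stands in for pattern matching against indexed terms, the term list is made up, and uppercasing the pattern plays the role of the field's analyzer being applied to the query text.

```python
from fnmatch import fnmatchcase

# Hypothetical terms as stored by an analyzer with the uppercase filter.
INDEXED_TERMS = ["APPLE", "BALL"]

def unanalyzed_pattern_hits(pattern):
    """Pattern is used exactly as typed, so case must already match the index."""
    return [term for term in INDEXED_TERMS if fnmatchcase(term, pattern)]

def normalized_pattern_hits(pattern):
    """Pattern is uppercased first, mirroring the APP* rewrite in the explanation."""
    return [term for term in INDEXED_TERMS if fnmatchcase(term, pattern.upper())]

print(unanalyzed_pattern_hits("app*"))
print(normalized_pattern_hits("app*"))
```

The first call finds nothing because "app*" is compared to "APPLE" case-sensitively, while the second finds the term once the pattern has gone through the same case change as the index.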
