How to handle wildcards in Elasticsearch structured queries

My use case requires querying our Elasticsearch domain with trailing wildcards. I wanted to get your opinion on the best practices for handling such wildcards in the queries.
Do you think adding the following clauses is a good practice for the queries:
"query" : {
"query_string" : {
"query" : "attribute:postfix*",
"analyze_wildcard" : true,
"allow_leading_wildcard" : false,
"use_dis_max" : false
}
}
I've disallowed leading wildcards since they are a heavy operation. However, I wanted to know how good analyzing the wildcard for every query request is in the long run. My understanding is that analyze_wildcard would have no impact if the query doesn't actually have any wildcards. Is that correct?

If you have the possibility of changing your mapping type and index settings, the right way to go is to create a custom analyzer with an edge-n-gram token filter that would index all prefixes of the attribute field.
curl -XPUT http://localhost:9200/your_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "edge_filter": {
          "type": "edgeNGram",
          "min_gram": 1,
          "max_gram": 15
        }
      },
      "analyzer": {
        "attr_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "edge_filter"]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "attribute": {
          "type": "string",
          "analyzer": "attr_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}'
Then, when you index a document, the attribute field value (e.g. postfixing) will be indexed as the following tokens: p, po, pos, post, postf, postfi, postfix, postfixi, postfixin, postfixing.
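To double-check what will be indexed, you can run the analyzer through the _analyze API (a quick sketch using the index and analyzer names from the mapping above, in the query-parameter form accepted by the older Elasticsearch versions this mapping style targets):
curl -XGET 'http://localhost:9200/your_index/_analyze?analyzer=attr_analyzer&text=postfixing'
The response should list exactly the edge-ngram tokens above.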
Finally, you can easily query the attribute field for the postfix value using a simple match query like the one below. There is no need to use an under-performing wildcard in a query_string query.
{
  "query": {
    "match" : {
      "attribute" : "postfix"
    }
  }
}

Related

Why fuzzy query returns a match but query with fuzziness doesn't on the same input?

I created the following index in Elasticsearch:
PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "3_5_edgegrams"]
        }
      },
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
Then I inserted the following document:
{
  "name": "Nuvus Gro Corp"
}
When I make the following query (let's call it fuzzy_query):
GET /my-index/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "qnuv"
      }
    }
  }
}
I get a match for the above document.
When I make the query (let's call the query match_with_fuzziness):
GET /my-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "qnuv",
        "fuzziness": "AUTO"
      }
    }
  }
}
I don't get a match. If I make the following query:
GET /my-index/_search
{
  "query": {
    "match": {
      "name": {
        "query": "nuvq",
        "fuzziness": "AUTO"
      }
    }
  }
}
I again get a match. I don't understand why when I make the match_with_fuzziness query I don't get any matches.
EDIT: I analyzed the queries with the Kibana Profiler, and according to the profiler, match_with_fuzziness is a SynonymQuery Synonym(name:qnu name:qnuv) while fuzzy_query is a BoostQuery (name:nuv)^0.6666666.
Very similar problem to the one explained in your other question.
The problem is that you haven't specified a dedicated search_analyzer, so at search time qnuv and nuvq also get analyzed by my_analyzer and edge-ngrammed as well, hence the matches you're getting.
If we check the first query: since you're using the fuzzy query, qnuv (the search term) will match nuv (the first indexed edge-ngrammed token) with a distance of 1 (i.e. the leading q is "tolerated"), which is what the fuzzy query does by default (with "fuzziness": "AUTO").
In the third query, nuv (the first edge-ngrammed token of the search term) will match nuv (the first indexed edge-ngrammed token).
The case of the second query is a bit special; the quote below describes how the fuzziness parameter works in the context of match queries:
Fuzzy matching is not applied to terms with synonyms or in cases where the analysis process produces multiple tokens at the same position. Under the hood these terms are expanded to a special synonym query that blends term frequencies, which does not support fuzzy expansion.
The part about the analysis process producing multiple tokens at the same position is what applies to your case. Since the search term qnuv is analyzed by my_analyzer, it produces the two tokens qnu and qnuv at the same position, and that does not support fuzzy matching.
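You can see the two tokens for yourself with the _analyze API (a quick check against the index defined above); both qnu and qnuv come back at the same position:
GET /my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "qnuv"
}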
You need to change your mapping to this one instead and it will work the way you expect, i.e. all three queries will return your document:
"mappings": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard" <---- add this line
}
}
}

Proper way to query documents using an 'uppercase' token filter

We have an Elasticsearch index with some fields that use custom analyzers. One of the analyzers includes an uppercase token filter in order to get rid of case sensitivity while making queries (e.g. we want "ball" to also match "Ball" or "BALL").
The issue here is that when doing regular expressions, the pattern is matched against the term in the index, which is all uppercase. So "app*" won't match "Apple" in our index, because behind the scenes it's really indexed as "APPLE".
Is there a way to get this to work without doing some hacky things outside of ES?
I might play around with "query_string" instead and see if that has any different results.
This all depends on the type of query you are using. If that query type analyzes the input string with the field's own analyzer, then it should be fine.
If you are using the regexp query, that one will NOT analyze the input string, so if you pass app.* to it, it will stay the same and that is what it will use for the search.
But if you properly use the query_string query, that one should work:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "uppercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "some_field": {
          "type": "text",
          "analyzer": "my"
        }
      }
    }
  }
}
And the query itself:
{
  "query": {
    "query_string": {
      "query": "some_field:app*"
    }
  }
}
To make sure it's doing what I think it is, I always use the _validate api:
GET /_validate/query?explain&index=test
{
  "query": {
    "query_string": {
      "query": "some_field:app*"
    }
  }
}
which will show what ES is doing to the input string:
"explanations": [
{
"index": "test",
"valid": true,
"explanation": "some_field:APP*"
}
]
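For contrast (a sketch against the same test index), validating a regexp query shows that its input is not run through the field's analyzer, so the lowercase pattern is kept as-is and won't match the uppercased terms:
GET /_validate/query?explain&index=test
{
  "query": {
    "regexp": {
      "some_field": "app.*"
    }
  }
}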

What is the best way to handle common terms which contain special chars, like C# or C++

I have some documents that contain c# or c++ in the title field, which uses the standard analyzer.
When I query for c# on the title field, I get all c# and C++ documents, and the c++ documents even have a higher score than the c# documents. That makes sense, since both '#' and '++' are removed from the tokens by the standard analyzer.
What is the best way to handle these kinds of special terms? In my case specifically, I want c# documents to get a higher score than c++ documents when searching for "C#".
Here is an approach you can use:
Introduce a copy field that will hold the values with the special characters preserved. For that you'll need to:
Introduce a custom analyzer (the whitespace tokenizer is important here: it will preserve your special characters):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
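As a quick sanity check (a sketch, assuming the index above), the _analyze API should show that "C#" survives as the single lowercased token c#, whereas the standard analyzer would strip the #:
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "C#"
}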
Create the copy field (the _wcc suffix stands for 'with special characters'):
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "prog_lang": {
          "type": "text",
          "copy_to": "prog_lang_wcc",
          "analyzer": "standard"
        },
        "prog_lang_wcc": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
When issuing the query itself, you combine a query with a boost on the prog_lang_wcc field, like this (it could be either a multi_match or a pure boolean query plus boost; both are sketched below):
GET /_search
{
  "query": {
    "multi_match" : {
      "query" : "c#",
      "type": "match_phrase",
      "fields" : [ "prog_lang_wcc^3", "prog_lang" ]
    }
  }
}
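For completeness, a rough sketch of the equivalent bool form (the boost value of 3 is arbitrary):
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "prog_lang_wcc": { "query": "c#", "boost": 3 } } },
        { "match_phrase": { "prog_lang": { "query": "c#" } } }
      ]
    }
  }
}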

How to search with keyword analyzer?

I have the keyword analyzer set as the default analyzer, like so:
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
But now I can't search for anything, e.g.:
{
  "query": {
    "query_string": {
      "query": "cast"
    }
  }
}
Gives me 0 results, although "cast" is a common value in the indexed documents (http://gist.github.com/baelter/b0720a52ee5a27e27d3a).
Searching for "*" works fine, by the way.
I only have explicit defaults in my mapping:
{
  "oceanography_point": {
    "_all" : {
      "enabled" : true
    },
    "properties" : {}
  }
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?
Using the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer and no filters.
In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor "away" will work. You need to search for the exact string "Cast away in forest" to match it. (Assuming no lowercase filter is used, you need to give the right case too.)
A better approach would be to use multi fields to declare one copy as keyword-analyzed and the other as normally analyzed (see the sketch below).
You can search on one of these fields and aggregate on the other.
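A minimal sketch of such a multi-field mapping (the field name my_field is hypothetical, and the pre-5.x string type is used to match the era of the question):
{
  "properties": {
    "my_field": {
      "type": "string",
      "analyzer": "standard",
      "fields": {
        "raw": {
          "type": "string",
          "analyzer": "keyword"
        }
      }
    }
  }
}
Full-text queries would then go against my_field, and exact matches or aggregations against my_field.raw.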
Okay, after some 15 hours of trial and error I can conclude that this works for search:
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "default": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],

How to index both a string and its reverse?

I'm looking for a way to analyze the string "abc123" as ["abc123", "321cba"]. I've looked at the reverse token filter, but that only gets me ["321cba"]. Documentation on this filter is pretty sparse, only stating that
"A token filter of type reverse ... simply reverses each token."
(see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-reverse-tokenfilter.html).
I've also tinkered with using the keyword_repeat filter, which gets me two instances. I don't know if that's useful, but for now all it does is reverse both instances.
How can I use the reverse token filter but keep the original token as well?
My analyzer:
{ "settings" : { "analysis" : {
"analyzer" : {
"phone" : {
"type" : "custom"
,"char_filter" : ["strip_non_numeric"]
,"tokenizer" : "keyword"
,"filter" : ["standard", "keyword_repeat", "reverse"]
}
}
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
}
}}}
Create and put an analyzer that reverses a string (say, reverse_analyzer):
PUT index_name
{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "type": "custom",
          "char_filter": [
            "strip_non_numeric"
          ],
          "tokenizer": "keyword",
          "filter": [
            "standard",
            "keyword_repeat",
            "reverse"
          ]
        }
      },
      "char_filter": {
        "strip_non_numeric": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": ""
        }
      }
    }
  }
}
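To see what actually gets indexed, a quick check with the _analyze API (a sketch in the query-parameter form accepted by the older versions this string-based mapping implies):
GET index_name/_analyze?analyzer=reverse_analyzer&text=911220
The response should show the reversed token 022119 (twice, since keyword_repeat duplicates the token and reverse then flips both copies).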
Then, for a field (say phone_no), use a mapping like the one below (create a type and append the mapping for the phone field):
PUT index_name/type_name/_mapping
{
  "type_name": {
    "properties": {
      "phone_no": {
        "type": "string",
        "fields": {
          "reverse": {
            "type": "string",
            "analyzer": "reverse_analyzer"
          }
        }
      }
    }
  }
}
So phone_no is a multi-field, which will store the string and its reverse. For example,
if you index
phone_no: 911220
then in Elasticsearch there will be two fields:
phone_no: 911220 and phone_no.reverse: 022119, so you can search or filter on either the reversed or the non-reversed field.
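One way to use the reversed sub-field is suffix matching: reverse the suffix yourself and run a prefix query against phone_no.reverse (a sketch, assuming the mapping above; the prefix query does not analyze its term, so you pass the already-reversed suffix, here 022 for the suffix 220):
GET index_name/_search
{
  "query": {
    "prefix": {
      "phone_no.reverse": "022"
    }
  }
}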
Hope this helps.
I don't believe you can do this directly, as I am unaware of any way to get the reverse token filter to also output the original.
However, you could use the fields parameter to index both the original and the reversed at the same time with no additional coding. You would then search both fields.
So let's say your field was called phone_number:
"phone_number": {
"type": "string",
"fields": {
"reverse": { "type": "string", "index": "phone" }
}
}
In this case we're indexing using the default analyzer (assume standard), plus also indexing into reverse with your custom analyzer phone, which reverses. You then issue your queries against both fields.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/_multi_fields.html
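A minimal sketch of such a query against both fields (the value is hypothetical):
GET /_search
{
  "query": {
    "multi_match": {
      "query": "911220",
      "fields": ["phone_number", "phone_number.reverse"]
    }
  }
}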
I'm not sure it's possible to do this using the built-in set of token filters. I would recommend creating your own plugin. There is the ICU Analysis plugin supported by the Elasticsearch team that you can use as an example.
I wound up using the following two char_filters in my analyzer. It's an ugly abuse of regex, but it seems to work. It is limited to the first 20 numeric characters, but in my use case that is acceptable.
First it groups all numeric characters, then explicitly rebuilds the string with its own (numeric-only!) reverse. The space in the center of the replacement pattern then causes the tokenizer to split it into two tokens - the original and the reverse.
,"char_filter" : {
"strip_non_numeric" : {
"type" : "pattern_replace"
,"pattern" : "[^0-9]"
,"replacement" : ""
}
,"dupe_and_reverse" : {
"type" : "pattern_replace"
,"pattern" : "([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)([0-9]?)"
,"replacement" : "$1$2$3$4$5$6$7$8$9$10$11$12$13$14$15$16$17$18$19$20 $20$19$18$17$16$15$14$13$12$11$10$9$8$7$6$5$4$3$2$1"
}
}
