Elasticsearch Query String Query with # symbol and wildcards - elasticsearch

I defined a custom analyzer that I was surprised to find isn't built-in:
analyzer": {
"keyword_lowercase": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
Then my mapping for this field is:
"email": {
"type": "string",
"analyzer": "keyword_lowercase"
}
This works great. (http://.../_analyze?field=email&text=me#example.com) ->
"tokens": [
{
"token": "me#example.com",
"start_offset": 0,
"end_offset": 16,
"type": "word",
"position": 1
}
]
Finding by that keyword works great. http://.../_search?q=me#example.com yields results.
The problem is trying to incorporate wildcards anywhere in the Query String Query. http://.../_search?q=*me#example.com yields no results. I would expect results containing emails such as "me#example.com" and "some#example.com".
It looks like Elasticsearch performs the search with the default analyzer, which doesn't make sense. Shouldn't it perform the search with each field's own analyzer?
For instance, http://.../_search?q=email:*me#example.com returns results because I am telling it which analyzer to use based on the field.
Can Elasticsearch not do this?

See http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
Set analyze_wildcard to true, as it is false by default.
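A minimal request-body sketch of that setting (my_index is a placeholder index name; the URI form accepts the same flag as a query parameter, e.g. ...&analyze_wildcard=true):
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "email:*me#example.com",
      "analyze_wildcard": true
    }
  }
}
With analyze_wildcard enabled, the wildcard term is run through the field's analyzer (keyword_lowercase here), so it should match emails like "me#example.com" and "some#example.com".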

Related

Elastic search with Java: exclude matches with random leading characters in a letter

I am new to Elasticsearch. I managed to get things working somewhat close to what I intended. I am using the following configuration.
{
  "analysis": {
    "filter": {
      "shingle_filter": {
        "type": "shingle",
        "min_shingle_size": 2,
        "max_shingle_size": 3,
        "output_unigrams": true,
        "token_separator": ""
      },
      "autocomplete_filter": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 20
      }
    },
    "analyzer": {
      "shingle_search": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase"
        ]
      },
      "shingle_index": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "lowercase",
          "shingle_filter",
          "autocomplete_filter"
        ]
      }
    }
  }
}
I have this applied over multiple fields and I am running a multi-match query. Here is the Java code:
// "i" is the user-provided search input
NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
    .withQuery(QueryBuilders.multiMatchQuery(i)
        .field("title")
        .field("alias")
        .fuzziness(Fuzziness.ONE)
        .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
    .build();
The problem is that it matches fields where the letters have leading characters.
For example, if my search input is "ron" I want it to match "ron mathews", but I don't want it to match "iron". How can I make sure I only match terms with no leading characters?
Update-1
Turning off fuzzy transposition seems to improve search results. But I think we can make it better.
You probably want to score "ron" higher than "ronaldo", and an exact match on the complete field "ron" higher still, so the best option here would be to use a few subfields with standard and keyword analyzers and boost those fields in your multi_match query.
Also, as you figured out yourself, be careful with fuzziness. It might make sense to run two queries in a should clause, one fuzzy and one boosted, so that exact matches are ranked higher.
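A minimal sketch of that two-query idea, reusing the title and alias fields from the question (the boost value of 5 is just an illustration):
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "ron",
            "fields": [ "title", "alias" ],
            "fuzziness": 1
          }
        },
        {
          "multi_match": {
            "query": "ron",
            "fields": [ "title", "alias" ],
            "boost": 5
          }
        }
      ]
    }
  }
}
Documents that only match fuzzily (like "iron") still surface, but documents that also match exactly get the boosted clause as well and should rank above them.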

Elasticsearch : Problem with querying document where "." is included in field

I have an index where some entries are like
{
  "name": " Stefan Drumm"
}
...
{
  "name": "Dr. med. Elisabeth Bauer"
}
The mapping of the name field is
{
  "name": {
    "type": "text",
    "analyzer": "index_name_analyzer",
    "search_analyzer": "search_cross_fields_analyzer"
  }
}
When I use the query below
GET my_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "Stefan Drumm", "operator": "AND" } } }
      ],
      "boost": 1.0
    }
  },
  "min_score": 0.0
}
it returns the first document.
But when I try to get the second document using the query below
GET my_index/_search
{
  "size": 10,
  "query": {
    "bool": {
      "must": [
        { "match": { "name": { "query": "Dr. med. Elisabeth Bauer", "operator": "AND" } } }
      ],
      "boost": 1.0
    }
  },
  "min_score": 0.0
}
it is not returning anything.
Things I can't do:
- change the index
- use a term query
- change the operator to 'OR', because in that case it will return multiple entries, which I don't want
What am I doing wrong, and how can I achieve this by modifying the query?
You have configured different analyzers for indexing and searching (index_name_analyzer and search_cross_fields_analyzer). If these analyzers process the input Dr. med. Elisabeth Bauer in an incompatible way, the search isn't going to match. This is described in more detail in Index and search analysis, as well as in Controlling Analysis.
You don't provide the definition of these two analyzers, so it's hard to guess from your question what they are doing. Depending on the analyzers, it may be possible to preprocess your query string (e.g. by removing .) before executing the search so that the search will match.
You can investigate how analysis affects your search by using the _analyze API, as described in Testing analyzers. For your example, the commands
GET my_index/_analyze
{
  "analyzer": "index_name_analyzer",
  "text": "Dr. med. Elisabeth Bauer"
}
and
GET my_index/_analyze
{
  "analyzer": "search_cross_fields_analyzer",
  "text": "Dr. med. Elisabeth Bauer"
}
should show you how the two analyzers configured for your index treat the target string, which might give you a clue about what's wrong. The response will be something like
{
  "tokens": [
    {
      "token": "dr",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "med",
      "start_offset": 4,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "elisabeth",
      "start_offset": 9,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "bauer",
      "start_offset": 19,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
For the example output above, the analyzer has split the input into one token per word, lowercased each word, and discarded all punctuation.
My guess would be that index_name_analyzer preserves punctuation, while search_cross_fields_analyzer discards it, so that the tokens won't match. If this is the case, and you can't change the index configuration (as you state in your question), one other option would be to specify a different analyzer when running the query:
GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": {
              "query": "Dr. med. Elisabeth Bauer",
              "operator": "AND",
              "analyzer": "index_name_analyzer"
            }
          }
        }
      ],
      "boost": 1
    }
  },
  "min_score": 0
}
In the query above, the analyzer parameter overrides the search analysis to use the same analyzer (index_name_analyzer) as the one used at index time. Which analyzer makes sense to use depends on your setup. Ideally, you should configure the analyzers to align so that you don't have to override them at search time, but it sounds like you are not living in an ideal world.

ElasticSearch - exclude special character from standard stemmer

I'm using the standard analyzer for my Elasticsearch index, and I have noticed that when I search a query with % in it, the analyzer drops the % as part of the tokenization steps (on the query "2% milk"):
GET index_name/_analyze
{
  "field": "text.english",
  "text": "2% milk"
}
The response is the following 2 tokens (2 and milk):
{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "milk",
      "start_offset": 3,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
Meaning, the 2% becomes 2.
I want to keep the standard tokenizer's handling of punctuation; I don't want to switch to the whitespace tokenizer or another non-standard one, but I do want the <number>% sign to be kept as a term in the index.
Is there a way to configure the tokenizer to ignore this special character when it's next to a number? Worst case, not to ignore it at all?
Thanks!
You can achieve the desired behavior by configuring a custom analyzer with a character filter that prevents the "%" character from being stripped away.
Check the Elasticsearch documentation on configuring the built-in analyzers and use it as a blueprint for your custom analyzer (see Elasticsearch Reference: english analyzer).
Add a character filter that maps the percentage character to a different string, as demonstrated in the following code snippet:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_percent_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_percent_char_filter": {
          "type": "mapping",
          "mappings": [
            "0% => 0_percent",
            "1% => 1_percent",
            "2% => 2_percent",
            "3% => 3_percent",
            "4% => 4_percent",
            "5% => 5_percent",
            "6% => 6_percent",
            "7% => 7_percent",
            "8% => 8_percent",
            "9% => 9_percent"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fee is between 0.93% or 2%"
}
With this, you can even search for specific percentages (like 2%)!
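As a quick illustration, assuming a hypothetical text field description mapped with "analyzer": "my_analyzer" (that mapping is not shown above), a match query would inherit the same char_filter at search time:
GET my_index/_search
{
  "query": {
    "match": {
      "description": "2%"
    }
  }
}
The query text 2% is rewritten to 2_percent before tokenization, so it should match only documents that contained 2%, not every document containing the number 2.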
Alternative Solution
If you simply want to remove the percentage character, you can use the very same approach, but map the %-character to an empty string, as shown in the following code snippet:
"char_filter": {
  "my_percent_char_removal_filter": {
    "type": "mapping",
    "mappings": [
      "% => "
    ]
  }
}
BTW: This approach is not considered a "hack"; it's the standard way to modify the original string before it gets sent to the tokenizer.
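To double-check the stripping behavior, you could register this filter in an analyzer (my_removal_analyzer is a hypothetical name, assuming it is wired into the index settings like my_analyzer above) and test it with _analyze:
POST my_index/_analyze
{
  "analyzer": "my_removal_analyzer",
  "text": "2% milk"
}
This should return the tokens 2 and milk, with the % removed before the tokenizer ever sees it.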

Elasticsearch and Drupal: how to add the lowercase and asciifolding filters

I created an index in Drupal, and my queries work.
Now I am trying to add the lowercase and asciifolding filters in the elasticsearch.yml file, but without success. I added these lines:
index:
  analysis:
    analyzer:
      default:
        filter: [standard, lowercase, asciifolding]
I get this error: IndexCreationException: [myindex] failed to create index.
But 'myindex' already exists; I am just trying to add filters to this existing index.
How can I add these filters so that indexing works the way I need?
Thank you very much for your help.
The reason you get this exception is that it's not possible to update the settings of an index by calling the general create-index endpoint. In order to update analyzers you have to call the _settings endpoint.
I've made a small example for you on how to do this:
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "new_analyzer": {
          "tokenizer": "standard"
        }
      }
    }
  }
}

GET test/_analyze
{
  "analyzer": "new_analyzer",
  "text": "NoLowercasse"
}

POST test/_close

PUT test/_settings
{
  "analysis": {
    "analyzer": {
      "new_analyzer": {
        "tokenizer": "standard",
        "filter": [
          "asciifolding",
          "lowercase"
        ]
      }
    }
  }
}

POST test/_open

GET test/_analyze
{
  "analyzer": "new_analyzer",
  "text": "LowerCaseAdded"
}
Response:
{
  "tokens": [
    {
      "token": "lowercaseadded",
      "start_offset": 0,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
You can see that after the second _analyze call, the lowercase filter is applied. The reason you have to close your index is that it needs to rebuild the analyzer. Note that the new analyzer won't work as expected on documents added earlier, since they were indexed with the old analyzer, without the asciifolding and lowercase filters.
In order to fix this, you'll have to rebuild your index (with the Reindex API, for example).
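A minimal sketch of such a rebuild, assuming a hypothetical target index test_v2 that was already created with the updated analyzer:
POST _reindex
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test_v2"
  }
}
Documents are re-analyzed as they are written into test_v2, so they pick up the asciifolding and lowercase filters; you can then point an alias at the new index.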
Hope this helps!
Edit: I may have been a bit too quick in responding, as this is not a Drupal-specific solution, but it might point you in the right direction. To be honest, I'm not familiar with running ES in combination with Drupal.

Keyword is tokenized and exact match does not work

I have a field named id that looks like this:
ventures.something.123
Its mapping is:
{
  "id": {
    "fields": {
      "keyword": {
        "ignore_above": 256,
        "type": "keyword"
      }
    },
    "type": "text"
  }
}
My understanding is that a keyword only allows for EXACT matching - which is what I want.
However, the analyzer tells me it's tokenized:
> http http://localhost:9200/my_index/_analyze field=id text='ventures.house.1137'
{
  "tokens": [
    {
      "end_offset": 14,
      "position": 0,
      "start_offset": 0,
      "token": "ventures.house",
      "type": "<ALPHANUM>"
    },
    {
      "end_offset": 19,
      "position": 1,
      "start_offset": 15,
      "token": "1137",
      "type": "<NUM>"
    }
  ]
}
... and a search for an id indeed returns ALL ids that start with ventures.house.
Why is that, and how can I get EXACT matching?
It's ES 5.2.
From https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2
not_analyzed:
Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it.
{
  "tag": {
    "type": "string",
    "index": "not_analyzed"
  }
}
I misread the mapping; it looks like my elasticsearch-dsl library does not create a keyword field directly, but adds it as a subfield.
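Given that subfield, a sketch of an exact match against it (reusing the example id from the question):
GET my_index/_search
{
  "query": {
    "term": {
      "id.keyword": "ventures.house.1137"
    }
  }
}
The term query skips analysis, and the keyword subfield stores the whole value as a single token, so only the exact id should match.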
Have you tried defining the field id as keyword?
In that case it does not get analyzed but is stored as-is.
If I understand your question correctly, this is what you want:
{
  "id": {
    "type": "keyword"
  }
}
See https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html
I hope this helped. Christian
