I am storing tags as an array of keywords:
...
Tags: {
type: "keyword"
},
...
Resulting in arrays like this:
Tags: [
"windows",
"opengl",
"unicode",
"c++",
"c",
"cross-platform",
"makefile",
"emacs"
]
I thought that since I am using the keyword type I could easily do exact term searches, as it is not supposed to use any analyser.
Apparently I was wrong! This gives me results:
body.query.bool.must.push({term: {"_all": "c"}}); # 38 results
But this doesn't:
body.query.bool.must.push({term: {"_all": "c++"}}); # 0 results
Although there are obviously instances of this tag, as seen above.
If I use body.query.bool.must.push({match: {"_all": search}}); instead (match instead of term), then "c" and "c++" return the exact same results, which is wrong as well.
The problem here is that you are searching the _all field, which uses an analyzer (standard by default). Run a small test with your data to confirm:
Test 1:
curl -X POST http://127.0.0.1:9200/script/test/_search \
-d '{
"query": {
"term" : { "_all": "c++"}
}
}'
Test 2:
curl -X POST http://127.0.0.1:9200/script/test/_search \
-d '{
"query": {
"term" : { "tags": "c++"}
}
}'
In my test the second query returns documents, the first does not.
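You can see why by running the search term through the standard analyzer yourself - a quick sketch with the _analyze API (on newer ES versions you may need to add a Content-Type: application/json header):
curl -X GET http://127.0.0.1:9200/_analyze \
-d '{
  "analyzer": "standard",
  "text": "c++"
}'
# The standard analyzer drops the "+" signs and emits the single token "c",
# so a term query for the exact string "c++" finds nothing in _all,
# and "c" and "c++" look identical to a match query.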
Do you really need to search across multiple fields? If so, you can override the default analyzer of the _all field - for a quick test I created an index with settings like this:
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"test" : {
"_all" : {"type" : "string", "index" : "not_analyzed", "analyzer" : "keyword"},
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
}
Or you can create a custom _all field.
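For example, a custom all-like field can be built with copy_to - a rough sketch (index and field names are made up for illustration):
curl -X PUT http://127.0.0.1:9200/script_v2 -d '{
  "mappings": {
    "test": {
      "properties": {
        "title": { "type": "text", "copy_to": "everything" },
        "tags": { "type": "keyword", "copy_to": "everything" },
        "everything": { "type": "text" }
      }
    }
  }
}'
# Queries that previously targeted _all can point at "everything" instead,
# while "tags" stays a keyword field for exact term matches.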
Solutions like a multi-field query, which allow you to define a list of fields to be searched over, would rather behave like your example with body.query.bool.must.push({match: {"_all": search}});.
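One such option is the multi_match query - a sketch (field names other than tags are assumptions):
{
  "query": {
    "multi_match": {
      "query": "c++",
      "fields": ["tags", "title", "body"]
    }
  }
}
Each field is searched with its own search analyzer, so the analyzed fields behave much like the match example above, while the keyword-typed tags field still requires the exact value.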
I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:
multiple adjacent character ngram highlights are not merged into one
tra is incorrectly highlighted in doc 9
The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter - if the problem was with the query doc 8 would have also been returned.
Example code
#!/usr/bin/env bash
# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from
# https://github.com/elastic/elasticsearch/issues/21000
# DELETE INDEX IF EXISTS
curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'
# CREATE NEW INDEX
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"trigrams": {
"tokenizer": "my_ngram_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "3",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
}
},
"mappings": {
"my_type": {
"properties": {
"text": {
"type": "text",
"analyzer": "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
}
'
printf '\n-------------\n'
# POPULATE INDEX
curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1 # Give ES time to index
# QUERY
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"text": {
"query": "auftrag",
"minimum_should_match": "100%"
}
}
},
"highlight": {
"fields": {
"text": {
"fragment_size": 120,
"type": "fvh"
}
}
}
}
'
The hits I get are (abbreviated):
"hits" : [
{
"_id" : "9",
"_source" : {
"text" : "betrag auftragen"
},
"highlight" : {
"text" : [
"be<em>tra</em>g <em>auf</em><em>tra</em>gen"
]
}
},
{
"_id" : "7",
"_source" : {
"text" : "auftragen"
},
"highlight" : {
"text" : [
"<em>auf</em><em>tra</em>gen"
]
}
}
]
I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.
The problem here is not with highlighting but with how you are using the nGram analyzer.
First of all, when you configure the mapping this way:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
you are telling Elasticsearch that you want to use this analyzer for both the indexed text and the provided search term. In your case, this simply means that:
your text from document 9 = "betrag auftragen" is split into trigrams, so in the index you have something like: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]
your text from document 7 = "auftragen" is split into trigrams, so in the index you have something like: [auf, uft, ftr, tra, rag, age, gen]
your search term = "auftrag" is also split into trigrams and Elasticsearch sees it as: [auf, uft, ftr, tra, rag]
In the end Elasticsearch matches all the trigrams from the search term against those from your index, and because of this you have 'auf' and 'tra' highlighted separately. 'uft', 'ftr', and 'rag' also match, but they overlap 'auf' and 'tra' and are not highlighted.
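You can confirm this by running the search term through the index's own analyzer - a sketch using the _analyze API against the example index:
curl -sS -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "trigrams",
  "text": "auftrag"
}
'
# Returns exactly the tokens auf, uft, ftr, tra, rag - the trigrams that get
# matched (and partially highlighted) against the indexed text.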
The first thing you need to do is tell Elasticsearch that you do not want to split the search term into grams. All you need to do is add a search_analyzer property to your mapping:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"search_analyzer": "standard",
"term_vector" : "with_positions_offsets"
}
}
}
}
Now the words from the search term are treated by the standard analyzer as separate words, so in your case it will be just "auftrag".
But this single change alone will not help you. It will even break the search, because "auftrag" does not match any trigram in your index.
Now you need to improve your nGram tokenizer by increasing max_gram:
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "10",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
This way the texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among these, the 7-grams include "auftrag", which is your search term.
After these two improvements, the highlighting in your search results should look as below:
"betrag <em>auftrag</em>en"
for document 9 and:
"<em>auftrag</em>en"
for document 7.
This is how ngrams and highlighting work together. I know the ES documentation says:
It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.
This is true. For performance reasons you need to experiment with this configuration, but I hope I have explained how it works.
I have the same problem here with the ngram (trigram) tokenizer and get an incomplete highlight like:
query with `match`: samp
field data: sample
result highlight: <em>sam</em>ple
expected highlight: <em>samp</em>le
Use match_phrase with the fvh highlight type and the field's term_vector set to with_positions_offsets; this may produce the correct highlight.
<em>samp</em>le
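A sketch of such a query (field name and analyzer setup taken from the example above; term_vector must be with_positions_offsets for the fvh highlighter to work):
{
  "query": {
    "match_phrase": {
      "text": "samp"
    }
  },
  "highlight": {
    "fields": {
      "text": { "type": "fvh" }
    }
  }
}
Because match_phrase requires the query trigrams (sam, amp) to appear in consecutive positions, the highlighted offsets should join up into a single <em>samp</em> fragment.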
I hope this can help you, as you would not need to change the tokenizer nor increase max_gram.
But my problem is that I want to use simple_query_string, which does not support phrase queries for the default field; the only way is to wrap the string in quotes, like "samp", but since there is already some logic in the query string I can't do that for users, and I can't require users to do it themselves either.
The solution from #piotr-pradzynski may not help me either, as I have a lot of data and increasing max_gram would lead to much higher storage usage.
Having this mapping:
curl -XPUT 'localhost:9200/testindex?pretty=true' -d '{
"mappings": {
"items": {
"dynamic": "strict",
"properties" : {
"title" : { "type": "string" },
"body" : { "type": "string" },
"tags" : { "type": "string" }
}}}}'
I add two simple items:
curl -XPUT 'localhost:9200/testindex/items/1' -d '{
"title": "This is a test title",
"body" : "This is the body of the java",
"tags" : "csharp"
}'
curl -XPUT 'localhost:9200/testindex/items/2' -d '{
"title": "Another text title",
"body": "My body is great and Im super handsome",
"tags" : ["cplusplus", "python", "java"]
}'
If I search for the string java:
curl -XGET 'localhost:9200/testindex/items/_search?q=java&pretty=true'
... it will match both items. The first item will match on the body and the other one on the tags.
How can I avoid searching in certain fields? In the example, I don't want it to match on the tags field, but I want to keep tags indexed, as I use them for aggregations.
I know I can do it using this:
{
"query" : {
"query_string": {
"query": "java AND -tags:java"
}},
"_source" : {
"exclude" : ["*.tags"]
}
}
But is there any other more elegant way, like putting something in the mapping?
PS: My searches are always query_strings and term / terms and I'm using ES 2.3.2
You can specify the fields option if you only want to match against certain fields:
{
"query_string" : {
"fields" : ["body"],
"query" : "java"
}
}
EDIT 1
You could use the "include_in_all": false param in the mapping. Check the documentation. The query_string query defaults to _all, so you can add "include_in_all": false to every field in which you don't want matches; after that, this query would only look in the body field:
{
"query_string" : {
"query" : "java"
}
}
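A mapping sketch for that (ES 2.x string syntax to match the question; include_in_all was removed in later major versions):
curl -XPUT 'localhost:9200/testindex' -d '{
  "mappings": {
    "items": {
      "properties": {
        "title": { "type": "string" },
        "body": { "type": "string" },
        "tags": { "type": "string", "include_in_all": false }
      }
    }
  }
}'
With tags excluded from _all, a plain ?q=java search no longer matches item 2 through its tags, but tags can still be used for aggregations and explicit tags:java queries.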
Does this help?
I have the keyword analyzer set as the default analyzer, like so:
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"default": {
"type": "keyword"
}}}}}}
But now I can't search for anything, e.g.:
{
"query": {
"query_string": {
"query": "cast"
}}}
This gives me 0 results although "cast" is a common value in the indexed documents. (http://gist.github.com/baelter/b0720a52ee5a27e27d3a)
Searching for "*" works fine, by the way.
I only have explicit defaults in my mapping:
{
"oceanography_point": {
"_all" : {
"enabled" : true
},
"properties" : {}
}
}
The index behaves as if no fields are included in _all, because field:value queries work fine.
Am I misusing the keyword analyzer?
Using the keyword analyzer, you can only do an exact string match.
Let's assume that you have used the keyword analyzer and no filters.
In that case, for a string indexed as "Cast away in forest", neither a search for "cast" nor "away" will work. You need to search for the exact string "Cast away in forest" to match it. (Assuming no lowercase filter is used, you also need to give the right case.)
A better approach would be to use multi-fields to declare one copy as keyword-analyzed and the other one analyzed normally.
You can search on one of these fields and aggregate on the other.
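A multi-field mapping sketch for that approach (pre-5.x string syntax; the field name is just an example):
curl -XPUT 'localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}'
# Full-text queries go against "title", while exact matches and aggregations
# use "title.raw".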
Okay, after some 15 hours of trial and error, I can conclude that this works for search:
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"default": {
"type": "keyword"
}}}}}}
However, this breaks faceting, so I ended up using a dynamic template instead:
"dynamic_templates" : [
{
"strings_not_analyzed" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string",
"index" : "not_analyzed"
}
}
}
],
I'm using the Elasticsearch search engine and when I run the code below, the results returned don't match the range criteria (I get items with a published date below the desired limit):
#!/bin/bash
curl -X GET 'http://localhost:9200/newsidx/news/_search?pretty' -d '{
"fields": [
"art_text",
"title",
"published",
"category"
],
"query": {
"bool": {
"should": [
{
"fuzzy": {"art_text": {"boost": 89, "value": "google" }}
},
{
"fuzzy": {"art_text": {"boost": 75, "value": "twitter" }}
}
],
"minimum_number_should_match": 1
}
},
"filter" : {
"range" : {
"published" : {
"from" : "2013-04-12 00:00:00"
}
}
}
}
'
I also tried putting the range clause in a must clause inside the bool query, but the results were the same.
Edit: I use Elasticsearch to search a MongoDB database through a river plugin. This is the script I ran to set up the river:
#!/bin/bash
curl -X PUT localhost:9200/_river/mongodb/_meta -d '{
"type":"mongodb",
"mongodb": {
"db": "newsful",
"collection": "news"
},
"index": {
"name": "newsidx",
"type": "news"
}
}'
Besides this, I didn't create any other indexes.
Edit 2:
A view of the ES mapping:
http://localhost:9200/newsidx/news/_mapping
published: {
type: "string"
}
The reason is in your mapping. The published field, which you are using as a date, is indexed as a string. That's probably because the date format you are using is not the default one in elasticsearch, thus the field type is not auto-detected and it's indexed as a simple string.
You should change your mapping using the put mapping API: define the published field as a date there, specifying the format you're using (there can be more than one), and reindex your data.
After that your range filter should work!
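A sketch of the put mapping call (the format string is an assumption based on the dates in your query; depending on the ES version you may have to recreate the index instead of changing the field type in place, and existing documents need to be reindexed either way):
curl -XPUT 'localhost:9200/newsidx/news/_mapping' -d '{
  "news": {
    "properties": {
      "published": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}'
# Once "published" is a real date field, range filters compare actual dates
# instead of analyzed string tokens.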
I have the following field in my mapping definition:
...
"my_field": {
"type": "string",
"index":"not_analyzed"
}
...
When I index a document with the value my_field = 'test-some-another', that value is split into 3 terms: test, some, another.
What am I doing wrong?
I created the following index:
curl -XPUT localhost:9200/my_index -d '{
"index": {
"settings": {
"number_of_shards": 5,
"number_of_replicas": 2
},
"mappings": {
"my_type": {
"_all": {
"enabled": false
},
"_source": {
"compressed": true
},
"properties": {
"my_field": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}'
Then I index the following document:
curl -XPOST localhost:9200/my_index/my_type -d '{
"my_field": "test-some-another"
}'
Then I use the plugin https://github.com/jprante/elasticsearch-index-termlist with the following API:
curl -XGET localhost:9200/my_index/_termlist
That gives me the following response:
{"ok":true,"_shards":{"total":5,"successful":5,"failed":0},"terms": ["test","some","another"]}
Verify that the mapping is actually getting set by running:
curl localhost:9200/my_index/_mapping?pretty=true
The command that creates the index seems to be incorrect. It shouldn't contain "index" : { as a root element. Try this:
curl -XPUT localhost:9200/my_index -d '{
"settings": {
"number_of_shards": 5,
"number_of_replicas": 2
},
"mappings": {
"my_type": {
"_all": {
"enabled": false
},
"_source": {
"compressed": true
},
"properties": {
"my_field": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}'
In Elasticsearch a field is indexed when it goes into the inverted index, the data structure that Lucene uses to provide its great and fast full-text search capabilities. If you want to search on a field, you do have to index it.
When you index a field you can decide whether you want to index it as it is, or you want to analyze it, which means choosing a tokenizer to apply to it that will generate a list of tokens (words), and a list of token filters that can modify the generated tokens (even add or delete some).
The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only by searching for that exact specific text, whitespaces included.
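As a quick illustration of analysis (a sketch; the _analyze API syntax differs a bit across ES versions), you can see what the standard analyzer does to the value from the question:
curl -XGET 'localhost:9200/_analyze?analyzer=standard&text=test-some-another&pretty=true'
# Returns the tokens [test, some, another] - the same terms that showed up in
# the termlist, which suggests the not_analyzed mapping was never applied.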
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but that you do want to retrieve to show them: not indexed and stored (see the mapping sketch below).
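These combinations correspond to the index and store options of a pre-5.x string mapping - a rough sketch with made-up field names:
curl -XPUT 'localhost:9200/example_index' -d '{
  "mappings": {
    "example_type": {
      "properties": {
        "search_only": { "type": "string", "index": "analyzed", "store": false },
        "search_and_show": { "type": "string", "index": "analyzed", "store": true },
        "show_only": { "type": "string", "index": "no", "store": true }
      }
    }
  }
}'
# "index" controls whether (and how) the field is searchable: analyzed,
# not_analyzed, or no. "store" controls whether the original value is kept
# separately from _source so it can be fetched on its own.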