ElasticSearch n-gram token filter not finding partial words

I have been playing around with ElasticSearch for a new project of mine. I have set the default analyzers to use the ngram token filter. This is my elasticsearch.yml file:
index:
  analysis:
    analyzer:
      default_index:
        tokenizer: standard
        filter: [standard, stop, mynGram]
      default_search:
        tokenizer: standard
        filter: [standard, stop]
    filter:
      mynGram:
        type: nGram
        min_gram: 1
        max_gram: 10
I created a new index and added the following document to it:
$ curl -XPUT http://localhost:9200/test/newtype/3 -d '{"text": "one two three four five six"}'
{"ok":true,"_index":"test","_type":"newtype","_id":"3"}
However, when I search using the query text:hree or text:ive or any other partial terms, ElasticSearch does not return this document. It returns the document only when I search for the exact term (like text:two).
I have also tried changing the config file such that default_search also uses the ngram token filter, but the result was the same. What am I doing wrong here and how do I correct it?

Not sure about the default_* settings.
But applying a mapping that specifies index_analyzer and search_analyzer works:
curl -XDELETE localhost:9200/twitter
curl -XPOST localhost:9200/twitter -d '
{
  "index": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "mynGram": { "type": "nGram", "min_gram": 2, "max_gram": 10 }
      },
      "analyzer": {
        "a1": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "mynGram"]
        }
      }
    }
  }
}'
curl -XPUT localhost:9200/twitter/tweet/_mapping -d '{
  "tweet": {
    "index_analyzer": "a1",
    "search_analyzer": "standard",
    "date_formats": ["yyyy-MM-dd", "dd-MM-yyyy"],
    "properties": {
      "user": { "type": "string", "analyzer": "standard" },
      "message": { "type": "string" }
    }
  }
}'
curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elastic Search"
}'
curl -XGET localhost:9200/twitter/_search?q=ear
curl -XGET localhost:9200/twitter/_search?q=sea
curl -XGET localhost:9200/twitter/_mapping

You should check the get mapping API to see if your mapping has been applied:
http://www.elasticsearch.org/guide/reference/api/admin-indices-get-mapping.html
Btw, it has been said on the mailing list that when an index already contains documents, the analyzer settings you put in elasticsearch.yml are not applied. You need to clean your index first.
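For example (a sketch, assuming your index is named test):
curl -XDELETE 'localhost:9200/test'
# then re-create the index and re-index your documents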
I've tried ngrams with ES and it works fine for me.

Related

Elasticsearch highlighter false positives

I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:
multiple adjacent character ngram highlights are not merged into one
tra is incorrectly highlighted in doc 9
The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter: if the problem were with the query, doc 8 would also have been returned.
Example code
#!/usr/bin/env bash
# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from from
# https://github.com/elastic/elasticsearch/issues/21000
# DELETE INDEX IF EXISTS
curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'
# CREATE NEW INDEX
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "trigrams",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
'
printf '\n-------------\n'
# POPULATE INDEX
curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1 # Give ES time to index
# QUERY
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text": {
        "query": "auftrag",
        "minimum_should_match": "100%"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "fragment_size": 120,
        "type": "fvh"
      }
    }
  }
}
'
The hits I get are (abbreviated):
"hits" : [
{
"_id" : "9",
"_source" : {
"text" : "betrag auftragen"
},
"highlight" : {
"text" : [
"be<em>tra</em>g <em>auf</em><em>tra</em>gen"
]
}
},
{
"_id" : "7",
"_source" : {
"text" : "auftragen"
},
"highlight" : {
"text" : [
"<em>auf</em><em>tra</em>gen"
]
}
}
]
I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.
The problem here is not with highlighting but with how you are using the nGram analyzer.
First of all, when you configure the mapping this way:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
you are telling Elasticsearch that you want to use this analyzer for both the indexed text and the search term. In your case, this means that:
your text from document 9 = "betrag auftragen" is split into trigrams, so the index contains something like: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]
your text from document 7 = "auftragen" is split into trigrams, so the index contains something like: [auf, uft, ftr, tra, rag, age, gen]
your search term = "auftrag" is also split into trigrams, and Elasticsearch sees it as: [auf, uft, ftr, tra, rag]
In the end, Elasticsearch matches all the trigrams from the search term against those in your index, and because of this you have 'auf' and 'tra' highlighted separately. 'uft', 'ftr', and 'rag' also match, but they overlap 'auf' and 'tra' and are not highlighted.
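You can verify this with the _analyze API; a quick sketch against the index from the question (ES 6.x syntax):
curl -sS -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "trigrams",
  "text": "auftrag"
}
'
It should return exactly the five trigrams listed above.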
The first thing you need to do is tell Elasticsearch that you do not want to split the search term into grams. All you need to do is add the search_analyzer property to your mapping:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"search_analyzer": "standard",
"term_vector" : "with_positions_offsets"
}
}
}
}
Now words from a search term are treated by the standard analyzer as separate words, so in your case it will be just "auftrag".
But this single change alone will not help you. It will even break the search, because "auftrag" does not match any trigram in your index.
Now you need to improve your nGram tokenizer by increasing max_gram:
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "10",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
This way texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among the 7-grams you will find "auftrag", which is your search term.
After these two improvements, highlighting in your search results should look as below:
"betrag <em>auftrag</em>en"
for document 9 and:
"<em>auftrag</em>en"
for document 7.
This is how ngrams and highlighting work together. I know that the ES documentation says:
It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.
This is true. For performance reasons you need to experiment with this configuration, but I hope I have explained how it works.
I have the same problem here: with an ngram (trigram) tokenizer, I get an incomplete highlight like:
query with `match`: samp
field data: sample
result highlight: <em>sam</em>ple
expected highlight: <em>samp</em>le
Use match_phrase, and use the fvh highlight type with the field's term_vector set to with_positions_offsets; this may produce the correct highlight:
<em>samp</em>le
I hope this can help you, as you then do not need to change the tokenizer or increase max_gram.
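For reference, a minimal sketch of such a query, reusing the my_index example above (where text already has term_vector set to with_positions_offsets):
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": { "text": "samp" }
  },
  "highlight": {
    "fields": {
      "text": { "type": "fvh" }
    }
  }
}
'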
But my problem is that I want to use simple_query_string, which does not support phrase queries for the default field; the only way is to wrap the string in quotes, like "samp". Since there is already some logic in the query string, I can't do that for users, and I can't require users to do it either.
The solution from #piotr-pradzynski may not help me either, as I have a lot of data and increasing max_gram would lead to much higher storage usage.

Elasticsearch: match exact keywords with special characters

I am storing tags as an array of keywords:
...
Tags: {
type: "keyword"
},
...
Resulting in arrays like this:
Tags: [
"windows",
"opengl",
"unicode",
"c++",
"c",
"cross-platform",
"makefile",
"emacs"
]
I thought that since I am using the keyword type I could easily do exact term searches, as it is not supposed to use any analyzer.
Apparently I was wrong! This gives me results:
body.query.bool.must.push({term: {"_all": "c"}}); # 38 results
But this doesn't:
body.query.bool.must.push({term: {"_all": "c++"}}); # 0 results
Although there are obviously instances of this tag, as seen above.
If I use body.query.bool.must.push({match: {"_all": search}}); instead (using match instead of term), then "c" and "c++" return exactly the same results, which is wrong as well.
The problem here is that you are using the _all field, which uses an analyzer (standard by default). Run a small test with your data to be sure:
Test 1:
curl -X POST http://127.0.0.1:9200/script/test/_search \
  -d '{
    "query": {
      "term": { "_all": "c++" }
    }
  }'
Test 2:
curl -X POST http://127.0.0.1:9200/script/test/_search \
  -d '{
    "query": {
      "term": { "tags": "c++" }
    }
  }'
In my test, the second query returns documents; the first does not.
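To see why, run the tag through the standard analyzer with the _analyze API (a quick sketch, ES 5.x body syntax):
curl -X POST 'http://127.0.0.1:9200/_analyze?pretty' -d '{
  "analyzer": "standard",
  "text": "c++"
}'
The standard analyzer strips the "++" and emits the single token c, which is why the term query on _all matches "c" but never "c++".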
Do you really need to search across multiple fields? If so, you can override the default analyzer of the _all field. For a quick test I created an index with settings like this:
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "test": {
      "_all": { "type": "string", "index": "not_analyzed", "analyzer": "keyword" },
      "properties": {
        "tags": {
          "type": "keyword"
        }
      }
    }
  }
}
Or you can create a custom _all field.
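As a sketch of the custom catch-all approach, copy_to can gather fields into one keyword field (the field name all_tags here is made up for illustration):
{
  "mappings": {
    "test": {
      "properties": {
        "tags": { "type": "keyword", "copy_to": "all_tags" },
        "all_tags": { "type": "keyword" }
      }
    }
  }
}
A term query against all_tags would then match "c++" exactly.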
Solutions like the multi-field query, which let you define a list of fields to search over, would behave more like your example with body.query.bool.must.push({match: {"_all": search}});.

After using the Elasticsearch JDBC Importer, 'asciifolding' is not working as expected

Using the Elasticsearch JDBC importer with this configuration:
bin=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/bin
lib=/usr/share/elasticsearch/elasticsearch-jdbc-2.1.1.2/lib
echo '{
"type" : "jdbc",
"jdbc" : {
"url" : "ip/db",
"user" : "myuser",
"password" : "a7sdf7hsdf8hn78df",
"sql" : "SELECT title, body, source_id, time_order, type, blablabla...",
"index" : "importeditems",
"type" : "item",
"elasticsearch.host": "_eth0_",
"detect_json" : false
}
}' | java \
-cp "${lib}/*" \
-Dlog4j.configurationFile=${bin}/log4j2.xml \
org.xbib.tools.Runner \
org.xbib.tools.JDBCImporter
I've indexed some documents correctly with the form:
{
  "title": "Tiempo de Opinión: Puede comenzar un ciclo",
  "body": "Sebas Álvaro nos trae cada lunes historias y anécdotas de la montaña<!-- com -->",
  "source_id": 21188,
  "time_order": "1438638043:55c2c6bb96d4c",
  "type": "rss"
}
I'm trying to ignore accents (for example, opinión in title has an ó), so that if a user searches for "tiempo de opinión" or "tiempo de opinion" with a match_phrase, it matches the documents with or without the accent.
So after using the importer and indexing everything, I changed my index settings to a default analyzer with an asciifolding filter:
curl -XPOST 'localhost:9200/importeditems/_close'
curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
  "analysis": {
    "analyzer": {
      "default": {
        "tokenizer": "standard",
        "filter": ["lowercase", "asciifolding"]
      }
    }
  }
}'
curl -XPOST 'localhost:9200/importeditems/_open'
Then I run a match_phrase to match "tiempo de opinion" (no accent) and "tiempo de opinión" (with accent):
# No accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d '
{
  "query": {
    "match_phrase": {
      "title": "tiempo de opinion"
    }
  }
}'
# With accent
curl -XGET 'localhost:9200/importeditems/_search?pretty=true' -d '
{
  "query": {
    "match_phrase": {
      "title": "tiempo de opinión"
    }
  }
}'
But no match is returned even though the documents exist (if I match_phrase just the phrase tiempo de, it returns some hits containing tiempo de opinión).
I think the problem is due to the JDBC Importer, because I reproduced the setup without the importer, adding another index and entries by hand and changing the index settings to asciifolding in the same way, and everything works as expected. You can see this working example right here.
If I check the settings of the index created after using the importer (importeditems)
curl -XGET 'localhost:9200/importeditems/_settings?pretty=true'
This outputs:
{
  "importeditems" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457533907278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
        }
      }
    }
  }
}
... and if I check the settings of the manually created index (test):
curl -XGET 'localhost:9200/test/_settings?pretty=true'
I get the same exact output:
{
  "test" : {
    "settings" : {
      "index" : {
        "creation_date" : "1457603253278",
        "analysis" : {
          "analyzer" : {
            "default" : {
              "filter" : [ "lowercase", "asciifolding" ],
              "tokenizer" : "standard"
            }
          }
        },
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "x",
        "version" : {
          "created" : "2010199"
        }
      }
    }
  }
}
Can someone please tell me why this does not work when I use the Elasticsearch JDBC Importer, and why it works when I add the raw data myself?
I finally solved the issue by first changing the settings to add the analysis configuration:
curl -XPOST 'localhost:9200/importeditems/_close'
curl -XPUT 'localhost:9200/importeditems/_settings?pretty=true' -d '{
  "analysis": {
    "analyzer": {
      "default": {
        "tokenizer": "standard",
        "filter": ["lowercase", "asciifolding"]
      }
    }
  }
}'
curl -XPOST 'localhost:9200/importeditems/_open'
... and then importing all the data again.
It's strange, because as I stated in the post, I did exactly the same in both cases (with the JDBC Importer and with the raw data):
Index data
Change index settings
Make the query with match_phrase
And it worked with the raw data (test) but not with the index I used the importer for (importeditems). The only thing I can think of is that importeditems held more than 12 GB, and it takes time to re-create the content with asciifolding applied; that's why the changes were not reflected right after asciifolding was activated.
Anyway, if someone is having the same issue, especially those working with a huge amount of data: remember to set the analyzer first, and then index all the data.
According to the docs:
Queries can find only terms that actually exist in the inverted index, so it is important to ensure that the same analysis process is applied both to the document at index time, and to the query string at search time, so that the terms in the query match the terms in the inverted index.
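To confirm that the default analyzer with asciifolding is actually active before re-indexing, the _analyze API can be used; a quick sketch against this 2.x index (it should output the token opinion, without the accent):
curl -XGET 'localhost:9200/importeditems/_analyze?analyzer=default&pretty' -d 'opinión'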

Is there any analyzer that performs exact word matching in elastic search

Is there any analyzer that performs exact word matching in Elasticsearch?
For example, if I have the words "America", "American", and "America's", and I search for "America", I should get only the first one. With the standard analyzer I get all three.
I want to achieve this at query time only; I don't want to make changes to the existing index. Please help me.
Set your mapping to not_analyzed
curl -XPOST localhost:9200/test -d '{
  "mappings": {
    "type1": {
      "properties": {
        "field1": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
That will not do anything to your input string, but it will still be searchable. Note that if you do this, searching for "america" will not match "America", as there is a difference in case.
If you want to be able to match those, then you should try the keyword analyzer.
curl -XPOST localhost:9200/test -d '{
  "mappings": {
    "type1": {
      "properties": {
        "field1": { "type": "string", "analyzer": "keyword" }
      }
    }
  }
}'
You need not worry about the analyzer. When querying, use term and terms queries; they will behave as you asked.
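For example, a term query (a minimal sketch; the index and field names reuse the previous answer's mapping):
curl -XPOST 'localhost:9200/test/_search?pretty' -d '{
  "query": {
    "term": { "field1": "America" }
  }
}'
Because term queries skip analysis, this matches only documents whose field1 contains the exact term "America".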

How can I boost certain fields over others in elasticsearch?

My goal is to apply a boost to the field "name" (see example below), but I have two problems when I search for "john":
the search also matches {name: "dany", message: "hi bob"}, where name is "dany", and
the search does not boost name over message (rows with name="john" should be at the top)
The gist is at https://gist.github.com/tomaspet262/5535774 (since stackoverflow's form submit returned 'Your post appears to contain code that is not properly formatted as code', even though it was formatted properly).
I would suggest using query time boosting instead of index time boosting.
# DELETE
curl -XDELETE 'http://localhost:9200/test'
echo
# CREATE
curl -XPUT 'http://localhost:9200/test?pretty=1' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyz_1": {
          "filter": [
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}'
echo
# DEFINE
curl -XPUT 'http://localhost:9200/test/posts/_mapping?pretty=1' -d '{
  "posts": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "my_analyz_1"
      },
      "message": {
        "type": "string",
        "analyzer": "my_analyz_1"
      }
    }
  }
}'
echo
# INSERT
curl localhost:9200/test/posts/1 -d '{"name": "john", "message": "hi john"}'
curl localhost:9200/test/posts/2 -d '{"name": "bob", "message": "hi john, how are you?"}'
curl localhost:9200/test/posts/3 -d '{"name": "john", "message": "bob?"}'
curl localhost:9200/test/posts/4 -d '{"name": "dany", "message": "hi bob"}'
curl localhost:9200/test/posts/5 -d '{"name": "dany", "message": "hi john"}'
echo
# REFRESH
curl -XPOST localhost:9200/test/_refresh
echo
# SEARCH
curl "localhost:9200/test/posts/_search?pretty=1" -d '{
"query": {
"multi_match": {
"query": "john",
"fields": ["name^2", "message"]
}
}
}'
I'm not sure if this is relevant in this case, but when testing with such small amounts of data I always use 1 shard instead of the default settings, to avoid scoring surprises from distributed calculation (term statistics such as IDF are computed per shard).
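Alternatively, the search can be run with dfs_query_then_fetch so term statistics are computed globally first; a sketch of the SEARCH step above:
curl "localhost:9200/test/posts/_search?search_type=dfs_query_then_fetch&pretty=1" -d '{
  "query": {
    "multi_match": {
      "query": "john",
      "fields": ["name^2", "message"]
    }
  }
}'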
