elasticsearch multi-token keyword synonyms

I'm trying to implement simple multi-token synonyms in Elasticsearch, but I'm not getting the results I expect. Here's some curl:
curl -XPOST "http://localhost:9200/test" -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "blah": {
          "type": "string",
          "analyzer": "my_synonyms"
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_syn_filt": {
            "type": "synonym",
            "synonyms": [
              "foo bar, fooo bar"
            ]
          }
        },
        "analyzer": {
          "my_synonyms": {
            "filter": [
              "lowercase",
              "my_syn_filt"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  }
}'
Index a few documents:
curl -XPUT localhost:9200/test/my_type/1 -d '{"blah": "fooo bar"}'
curl -XPUT localhost:9200/test/my_type/2 -d '{"blah": "fooo barr"}'
curl -XPUT localhost:9200/test/my_type/3 -d '{"blah": "foo bar"}'
Now query:
curl -XPOST "http://localhost:9200/test/_search" -d'
{
"query": {
"match": {
"blah": "foo bar"
}
}
}'
I'm expecting to get back documents 1 and 3; however, I only get back 3. Does anyone know what the problem could be?
Upon further inspection I'm also not getting the expected tokens when calling the analyzer directly:
curl 'localhost:9200/test/_analyze?analyzer=my_synonyms' -d 'fooo bar'
Returns only one token, "fooo bar", when I'm expecting two tokens: "fooo bar" and "foo bar".

It looks like if you searched for 'fooo bar' instead, you would get documents 1 and 3. To get the results you were expecting, you have to flip your synonym rule to go the other way:
"fooo bar => foo bar"
The arrow tells ES to add the terms on the right-hand side as synonyms for all terms on the left. If you want them to be bi-directional, you can simply use 'fooo bar, foo bar' and make sure expand is not explicitly set to false.
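For illustration, a minimal sketch of recreating the index with the flipped rule (everything else is unchanged from the question; whether the flipped rule behaves this way with the keyword tokenizer is as this answer describes):
curl -XDELETE "http://localhost:9200/test"
curl -XPOST "http://localhost:9200/test" -d'
{
  "mappings": {
    "my_type": {
      "properties": {
        "blah": {
          "type": "string",
          "analyzer": "my_synonyms"
        }
      }
    }
  },
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_syn_filt": {
            "type": "synonym",
            "synonyms": [
              "fooo bar => foo bar"
            ]
          }
        },
        "analyzer": {
          "my_synonyms": {
            "filter": [
              "lowercase",
              "my_syn_filt"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  }
}'
# If the rule applies as described, the analyzer should now rewrite
# "fooo bar" to "foo bar" at index time:
curl 'localhost:9200/test/_analyze?analyzer=my_synonyms' -d 'fooo bar'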

Related

Elasticsearch join-like query within same index

I have an index with a following structure (mapping)
{
  "properties": {
    "content": {
      "type": "text"
    },
    "prev_id": {
      "type": "text"
    },
    "next_id": {
      "type": "text"
    }
  }
}
where prev_id and next_id are IDs of documents in this index (they may be null).
I want to perform a _search query and get the prev.content and next.content fields.
Right now I use two queries: the first searches by the content field
curl -X GET 'localhost:9200/idx/_search' -H 'content-type: application/json' -d '{
  "query": {
    "match": {
      "content": "yellow fox"
    }
  }
}'
and the second to get next and prev records.
curl -X GET 'localhost:9200/idx/_search' -H 'content-type: application/json' -d '{
  "query": {
    "ids": {
      "values" : ["5bb93552e42140f955501d7b77dc8a0a", "cd027a48445a0a193bc80982748bc846", "9a5b7359d3081f10d099db87c3226d82"]
    }
  }
}'
Then I join results on application-side.
Can I achieve my goal with one query only?
PS: the reason for storing next/prev as IDs is to save disk space. I have a lot of records and the content field is quite large.
What you are doing is the way to go. But how large is the content? Maybe you can consider not storing the content at all (_source = false)?
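For reference, a minimal sketch of what this answer hints at: disabling _source at index-creation time (the index name idx and the typeless mapping are taken from the question; note that disabling _source also disables features that depend on it, such as _reindex, update, and highlighting):
curl -X PUT 'localhost:9200/idx' -H 'content-type: application/json' -d '{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "content": { "type": "text" },
      "prev_id": { "type": "text" },
      "next_id": { "type": "text" }
    }
  }
}'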

elasticsearch mapping is empty after creating

I'm trying to create an autocomplete index for my elasticsearch using the search_as_you_type datatype.
My first command I run is
curl --request PUT 'https://elasticsearch.company.me/autocomplete' \
'{
  "mappings": {
    "properties": {
      "company_name": {
        "type": "search_as_you_type"
      },
      "serviceTitle": {
        "type": "search_as_you_type"
      }
    }
  }
}'
which returns
{"acknowledged":true,"shards_acknowledged":true,"index":"autocomplete"}curl: (3) nested brace in URL position 18:
{
"mappings": {
"properties": etc.the rest of the json object I created}}
Then I reindex using
curl --silent --request POST 'http://elasticsearch.company.me/_reindex?pretty' --data-raw '{
  "source": {
    "index": "existing_index"
  },
  "dest": {
    "index": "autocomplete"
  }
}' | grep "total\|created\|failures"
I expect to see something like "total":1000,"created":5, etc., or at least some kind of response in the terminal, but I get nothing. Also, when I check the mapping of my autocomplete index by running curl -u thething 'https://elasticsearch.company.me/autocomplete/_mappings?pretty',
I get an empty mapping result:
{
  "autocomplete" : {
    "mappings" : { }
  }
}
Is my error in the creation of the index or in the reindexing? I'm expecting the autocomplete mappings to show the two fields I'm searching for, i.e. "company_name" and "serviceTitle". Any ideas how to fix this?
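One hint is in the output above: "curl: (3) nested brace in URL position 18" means curl parsed the JSON body as a second URL, because the first command never attaches it with a data flag. The index is therefore created with empty mappings before the body is even read. A sketch of the corrected creation request (same mapping as in the question):
curl --request PUT 'https://elasticsearch.company.me/autocomplete' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "mappings": {
    "properties": {
      "company_name": {
        "type": "search_as_you_type"
      },
      "serviceTitle": {
        "type": "search_as_you_type"
      }
    }
  }
}'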

Elasticsearch highlighter false positives

I am using an nGram tokenizer in ES 6.1.1 and getting some weird highlights:
- multiple adjacent character ngram highlights are not merged into one
- tra is incorrectly highlighted in doc 9
The query auftrag matches documents 7 and 9 as expected, but in doc 9 betrag is highlighted incorrectly. That's a problem with the highlighter; if the problem were with the query, doc 8 would also have been returned.
Example code
#!/usr/bin/env bash
# Example based on
# https://www.elastic.co/guide/en/elasticsearch/guide/current/ngrams-compound-words.html
# with suggestions from
# https://github.com/elastic/elasticsearch/issues/21000
# DELETE INDEX IF EXISTS
curl -sS -XDELETE 'localhost:9200/my_index'
printf '\n-------------\n'
# CREATE NEW INDEX
curl -sS -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "trigrams",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}
'
printf '\n-------------\n'
# POPULATE INDEX
curl -sS -XPOST 'localhost:9200/my_index/my_type/_bulk?pretty' -H 'Content-Type: application/json' -d'
{ "index": { "_id": 7 }}
{ "text": "auftragen" }
{ "index": { "_id": 8 }}
{ "text": "betrag" }
{ "index": { "_id": 9 }}
{ "text": "betrag auftragen" }
'
printf '\n-------------\n'
sleep 1 # Give ES time to index
# QUERY
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "text": {
        "query": "auftrag",
        "minimum_should_match": "100%"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "fragment_size": 120,
        "type": "fvh"
      }
    }
  }
}
'
The hits I get are (abbreviated):
"hits" : [
{
"_id" : "9",
"_source" : {
"text" : "betrag auftragen"
},
"highlight" : {
"text" : [
"be<em>tra</em>g <em>auf</em><em>tra</em>gen"
]
}
},
{
"_id" : "7",
"_source" : {
"text" : "auftragen"
},
"highlight" : {
"text" : [
"<em>auf</em><em>tra</em>gen"
]
}
}
]
I have tried various workarounds, such as using the unified/fvh highlighter and setting all options that seemed relevant, but no luck. Any hints are greatly appreciated.
The problem here is not with highlighting but with how you are using the nGram analyzer.
First of all, when you configure the mapping this way:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"term_vector": "with_positions_offsets"
}
}
}
}
you are telling Elasticsearch that you want to use it for both the indexed text and the provided search term. In your case, this simply means that:
your text from document 9 = "betrag auftragen" is split into trigrams, so in the index you have something like: [bet, etr, tra, rag, auf, uft, ftr, tra, rag, age, gen]
your text from document 7 = "auftragen" is split into trigrams, so in the index you have something like: [auf, uft, ftr, tra, rag, age, gen]
your search term = "auftrag" is also split into trigrams, and Elasticsearch sees it as: [auf, uft, ftr, tra, rag]
In the end, Elasticsearch matches all the trigrams from the search with those from your index, and because of this you have 'auf' and 'tra' highlighted separately. 'uft', 'ftr', and 'rag' also match, but they overlap 'auf' and 'tra' and are not highlighted.
The first thing you need to do is tell Elasticsearch that you do not want to split the search term into grams. All you need to do is add the search_analyzer property to your mapping:
"mappings": {
"my_type": {
"properties": {
"text": {
"type" : "text",
"analyzer" : "trigrams",
"search_analyzer": "standard",
"term_vector" : "with_positions_offsets"
}
}
}
}
Now the words in a search term are treated by the standard analyzer as separate words, so in your case it will be just "auftrag".
But this single change alone will not help you. It will even break the search, because "auftrag" no longer matches any trigram in your index.
Now you need to improve your nGram tokenizer by increasing max_gram:
"tokenizer": {
"my_ngram_tokenizer": {
"type": "nGram",
"min_gram": "3",
"max_gram": "10",
"token_chars": [
"letter",
"digit",
"symbol",
"punctuation"
]
}
}
This way the texts in your index will be split into 3-grams, 4-grams, 5-grams, 6-grams, 7-grams, 8-grams, 9-grams, and 10-grams. Among these grams you will find the 7-gram "auftrag", which is your search term.
After these two improvements, the highlighting in your search result should look like this:
"betrag <em>auftrag</em>en"
for document 9, and:
"<em>auftrag</em>en"
for document 7.
This is how ngrams and highlighting work together. I know the ES documentation says:
It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.
This is true. For performance reasons you need to experiment with this configuration, but I hope I have explained how it works.
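As a quick sanity check of the tokenizer change, one can run the _analyze API against the index (a sketch; the trigrams analyzer name comes from the question, and the JSON body form of _analyze is available in ES 6.x):
curl -sS -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "trigrams",
  "text": "auftrag"
}
'
# With max_gram increased to 10, the token list should now include
# the full 7-gram "auftrag" alongside the shorter grams.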
I have the same problem here: with an ngram (trigram) tokenizer I got an incomplete highlight like this:
query with `match`: samp
field data: sample
result highlight: <em>sam</em>ple
expected highlight: <em>samp</em>le
Use match_phrase, and use the fvh highlight type with the field's term_vector set to with_positions_offsets; this may get you the correct highlight:
<em>samp</em>le
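For illustration, a minimal sketch of such a phrase query with the fvh highlighter (the index, type, and field names are carried over from the question above; samp is the query from this answer):
curl -sS -XGET 'localhost:9200/my_index/my_type/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": {
      "text": "samp"
    }
  },
  "highlight": {
    "fields": {
      "text": {
        "type": "fvh"
      }
    }
  }
}
'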
I hope this can help you, as you do not need to change the tokenizer or increase max_gram.
But my problem is that I want to use simple_query_string, which does not support phrase queries for the default field; the only way is to wrap the string in quotes, like "samp". Since there is some logic in the query string I can't do that for users, and I can't require users to do it either.
The solution from #piotr-pradzynski may not help me, as I have a lot of data and increasing max_gram will lead to lots of storage usage.

Elasticsearch updating the analyzer creates a members field

I came across a problem where I needed to update the stopwords on an index that specified the english analyzer as the default analyzer. Typically, the analyzers are specified in the settings for the index:
{
  "twitter": {
    "settings": {
      "index": {
        "creation_date": "1469465586110",
        "analysis": {
          "filter": {
            "lowercaseFilter": {
              "type": "lowercase"
            }
          },
          "analyzer": {
            "default": {
              "type": "english"
            },
            ...
So, the analyzers are located at <index name>.settings.index.analysis.analyzer
To update the analyzer, I ran these commands:
curl -XPOST "http://localhost:9200/twitter/_close" && \
curl -XPUT "http://localhost:9200/twitter/_settings" -d'
{
  "analysis": {
    "analyzer": {
      "default": {
        "type": "english",
        "stopwords": "_none_"
      }
    }
  }
}' && \
curl -XPOST "http://localhost:9200/twitter/_open"
After running those commands, I verified that the default analyzer was analyzing text, and keeping all stopwords.
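For example, one way to check this is the _analyze API in its query-string form, as used earlier in this thread (a sketch; the sample sentence is arbitrary):
curl 'localhost:9200/twitter/_analyze?analyzer=default' -d 'the quick brown fox'
# With "stopwords": "_none_", the stopword "the" should appear in the token output.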
However, when I apply the same settings through the Jest client, the settings end up looking like this, and the analysis isn't happening properly (note how the analysis settings are now under a "members" property):
{
  "twitter": {
    "settings": {
      "index": {
        "members": {
          "analysis": {
            "analyzer": {
              "default": {
                "type": "english",
                "stopwords": "_none_"
              },
I've stepped through the code and everything looks in order.
I figured it out. By running:
sudo tcpflow -p -c -i lo0 port 9200 2>/dev/null | grep -oE '.*(GET|POST|PUT|DELETE) .*_dev.*' -A30
I could see that the JsonObject I was sending included a members field, which is where Gson's JsonObject stores its child objects internally. Since I was passing this raw object into Jest's UpdateSettings builder, it was serialized in a way I didn't expect (including the members field) and sent to Elasticsearch that way. I solved the problem by calling the JsonObject's toString() method and passing that to the UpdateSettings builder.
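After the fix, reading the settings back should show the analysis block directly under index again (standard settings endpoint):
curl -XGET "http://localhost:9200/twitter/_settings?pretty"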

Elasticsearch find ONLY perfect match

I've been trying to do this for some time; I've read and searched a lot, and I haven't found any definitive answer or solution.
Let's say we add some documents:
$ curl -XPUT http://localhost:9200/tm/entries/1 -d '{"item": "foo" }'
{"_index":"tm","_type":"entries","_id":"1","_version":1,"created":true}
$ curl -XPUT http://localhost:9200/tm/entries/2 -d '{"item": "foo bar" }'
{"_index":"tm","_type":"entries","_id":"2","_version":1,"created":true}
$ curl -XPUT http://localhost:9200/tm/entries/3 -d '{"item": "foo bar foo" }'
{"_index":"tm","_type":"entries","_id":"3","_version":1,"created":true}
After this, I want to find ONLY the document(s) that perfectly match the search query
$ curl -XGET http://localhost:9200/_search?q=foo
The result contains all 3 documents and I only want to get the one which matches "foo" only and nothing else.
Also,
$ curl -XGET http://localhost:9200/_search?q=bar foo
Should not return any results.
Can Elasticsearch do that?
How?
Update:
Existing mapping:
{
  "tm": {
    "mappings": {
      "entries": {
        "properties": {
          "item": {
            "type": "string"
          }
        }
      }
    }
  }
}
Use the following mapping.
{
  "tm": {
    "mappings": {
      "entries": {
        "properties": {
          "item": {
            "type": "string",
            "index" : "not_analyzed"
          }
        }
      }
    }
  }
}
And use a term query to find the exact match; term queries are not analyzed:
curl -XGET "http://localhost:9200/tm/entries/_search" -d'
{
  "query": {
    "term": {
      "item": {
        "value": "foo bar"
      }
    }
  }
}'
Try adding "index" : "not_analyzed" to the mapping.
And the query should be something like:
{
  "match_phrase": {
    "item": "foo"
  }
}
You should use a match query instead of query_string; it'll solve your issue.
{
  "match" : {
    "item" : "bar foo"
  }
}
Take a look at this. Also, make sure the terms you are searching for are actually present in the indexed field; for that you need to use the "keyword" analyser. For more information, take a look at this.
Thanks
If you are trying to search from a GET request, I think this might help:
$ curl -XGET http://localhost:9200/tm/entries/_search?q=item:foo
so the syntax is _search?q=<field>:<value>
You can find the documentation here: URI Search
And if you are trying to filter, it is good to have a mapping with not_analyzed (as described above).
And for complex queries,
curl -XPOST "http://localhost:9200/tm/entries/_search" -d'
{
  "filter": {
    "term": {
      "item": "foo"
    }
  }
}'
Hope this helps.
