Treatment of special characters in elasticsearch

I use the following analyzer:
curl -XPUT 'http://localhost:9200/sample/' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["trim", "lowercase"]
          }
        }
      }
    }
  }
}'
Then when I try to insert documents that contain special characters (%, /, and the like), the value is converted to hex escapes:
1/PJJP3JV2C24iDfEu9XpHBaYxXh/dHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8 -> actual value
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8 -> stored value
Sample:
curl -XPUT 'http://localhost:9200/sample/strom/1' -d '{
  "user" : "user1",
  "message" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}'
The problem started occurring only once the data crossed a few million documents. Earlier it used to store the value as-is.
Now if I try to search using
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8
it is not able to retrieve the document. How do I deal with this? The behavior in converting special characters to hex seems to be non-deterministic.
I am unable to replicate the issue on my local machine.
Can someone explain the mistake I am making?

That is not how the document is tokenized on my end with that analyzer:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=trim,lowercase&pretty' -d '1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8'
{
  "tokens" : [ {
    "token" : "1%2fpjjp3jv2c24idfeu9xphbayxxh%2fdhtbmchb35sdznxo2g8vz4d7gtivy54imix_149c95f02a8",
    "start_offset" : 0,
    "end_offset" : 80,
    "type" : "word",
    "position" : 1
  } ]
}
Reading the analyzer output above, your example text is converted into a single, lowercased-but-otherwise-identical token by the analyzer shown. Are you sure there is no character filter at play? That is what could be doing the percent (URL) encoding.
You should be able to run it as:
curl -XGET 'localhost:9200/sample/_analyze?field=message' -d 'text to analyze'
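For reference, a mapping char_filter is the kind of thing that can silently rewrite characters before tokenization. This is a hypothetical sketch (the filter name and mapping rule are invented for illustration, not taken from your config):
curl -XPUT 'http://localhost:9200/sample-charfilter/' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "percent_encode_slash": {
            "type": "mapping",
            "mappings": ["/=>%2f"]
          }
        },
        "analyzer": {
          "default": {
            "type": "custom",
            "char_filter": ["percent_encode_slash"],
            "tokenizer": "keyword",
            "filter": ["trim", "lowercase"]
          }
        }
      }
    }
  }
}'
Note that a char_filter only changes the indexed tokens, never the stored _source, so it would explain failed searches but not a changed value when you GET the document back.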
Since it was not reproducing with the analyzer directly, I tried to reproduce this on my end by creating an index to test it:
curl -XPUT localhost:9200/indexed-analysis -d '
{
  "settings": {
    "number_of_shards" : 1,
    "number_of_replicas" : 0,
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["trim", "lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "indexed" : {
      "properties": {
        "text" : { "type" : "string" }
      }
    }
  }
}'
curl -XPUT localhost:9200/indexed-analysis/indexed/1 -d '{
  "text" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}'
curl -XGET localhost:9200/indexed-analysis/indexed/1?pretty
This produced the correct, identical result:
{
  "_index" : "indexed-analysis",
  "_type" : "indexed",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "text" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
  }
}
So I tried searching for it, and it was found as expected.
curl -XGET localhost:9200/indexed-analysis/_search -d '{
  "query": {
    "match": {
      "text": "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
    }
  }
}'
Result:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "indexed-analysis",
        "_type": "indexed",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "text": "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
        }
      }
    ]
  }
}
All of this leads back to three possibilities:
1) Your search analyzer is different from your index analyzer. This will almost always produce unexpected results. Using default should force the same analyzer to be used for both indexing and searching, but you can and should verify what is actually in effect (as opposed to default_index or default_search; see the sketch after this list):
curl -XGET localhost:9200/sample/_settings
curl -XGET localhost:9200/sample/_mapping
If you see analyzers configured in the mapping for the message field, that should be a red flag.
2) You have a character filter messing with the indexed string (and it is probably not doing the same thing to your search string, which points back to #1).
3) There is a bug in the version of Elasticsearch you are using (hopefully not, but you never know). All of the tests above were done against version 1.3.2.
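To illustrate #1, a configuration like the following (hypothetical; not taken from your cluster) indexes with the keyword tokenizer but searches with the standard analyzer. The standard analyzer splits your example string apart at the % characters, so the full-string query never matches the single indexed token, which produces exactly this kind of miss:
curl -XPUT 'http://localhost:9200/mismatched-analysis' -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "default_index": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["trim", "lowercase"]
          },
          "default_search": {
            "type": "standard"
          }
        }
      }
    }
  }
}'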

Related

Query Elasticsearch index for words with and without accent

I query for the word "café" and get 20 articles. Then I repeat the search for the word "cafe" and only get 3 articles. So I'm looking for a way to handle words containing accented letters the same way as their unaccented counterparts.
My problem is also that the index is already populated, so I have to modify an existing system. I'm using Elasticsearch 6.5.
I found some useful information and went through the following steps:
Setting up folding analyzer
curl -H "Content-Type: application/json" --user <user:pass> -XPUT http://localhost/test/_settings?pretty -d '{
"analysis": {
"analyzer": {
"folding": {
"tokenizer": "standard",
"filter": [ "lowercase", "asciifolding" ]
}
}
}
}'
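One caveat worth noting: analysis settings can only be changed while the index is closed, so on a live index a sequence like this is typically required (same host and credentials as above assumed):
curl --user <user:pass> -XPOST http://localhost/test/_close
# ...re-run the PUT /test/_settings request shown above here...
curl --user <user:pass> -XPOST http://localhost/test/_open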
Modify existing mapping for the content field
curl -H "Content-Type: application/json" --user <user:pass> -XPUT http://localhost/test/mytype/_mapping -d '{
"properties" : {
"content" : {
"type" : "text",
"fields" : {
"folded" : {
"type" : "text",
"analyzer" : "folding"
}
}
}
}
}'
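To verify the analyzer behaves as intended, you can run an accented word through the _analyze API (assuming the settings above were applied successfully); it should come back as the single token cafe:
curl -H "Content-Type: application/json" --user <user:pass> -XGET http://localhost/test/_analyze?pretty -d '{
  "analyzer": "folding",
  "text": "café"
}'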
Do the search
curl -H "Content-Type: application/json" --user <user:pass> -XGET http://localhost/test/_search -d '{
"query" : {
"bool" : {
"must" : [
{
"query_string" : {
"query" : "cafe"
}
}
]
}
},
"size" : 10,
"from" : 0
}'
But it has the same effect as before: I only find the articles containing "cafe", not the ones containing "café". Is there something I'm missing?
Great start! You have created a new analyzer and changed your mapping; however, you now also need to reindex your data in order to populate the new content.folded field.
You can do that very easily by calling the update-by-query endpoint like this:
curl --user <user:pass> -XPOST http://localhost/test/_update_by_query
In your search query you should reference content.folded; the folding analyzer is assigned to content.folded, not to content.
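For example, the search from the question only needs its query_string clause pointed at the sub-field (everything else unchanged; same host and auth as before):
curl -H "Content-Type: application/json" --user <user:pass> -XGET http://localhost/test/_search -d '{
  "query" : {
    "bool" : {
      "must" : [
        {
          "query_string" : {
            "query" : "cafe",
            "fields" : ["content.folded"]
          }
        }
      ]
    }
  },
  "size" : 10,
  "from" : 0
}'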
After a mappings update you will have to reindex your data in order to apply the change.
For a step-by-step guide, see the Reindex documentation.
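If you prefer a full reindex over updating in place, a minimal _reindex call looks like this; test_v2 is a hypothetical target index that you would first create with the new mapping:
curl -H "Content-Type: application/json" --user <user:pass> -XPOST http://localhost/_reindex -d '{
  "source": { "index": "test" },
  "dest": { "index": "test_v2" }
}'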
A working example:
Mappings
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "fields": {
            "folded": {
              "type": "text",
              "analyzer": "folding"
            }
          }
        }
      }
    }
  }
}
Inserting a few documents
POST my_index/_doc/1
{
  "content": "café"
}
POST my_index/_doc/2
{
  "content": "cafe"
}
Search Query
GET my_index/_search
{
  "query": {
    "match": {
      "content.folded": "cafe"
    }
  }
}
Results
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 0.18232156,
"hits" : [
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.18232156,
"_source" : {
"content" : "café"
}
},
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "2",
"_score" : 0.18232156,
"_source" : {
"content" : "cafe"
}
}
]
}
Hope this helps

elasticsearch query_string doesn't search by word part

I'm sending this request
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query" : {
    "query_string" : {
      "query" : "\"*cor interface*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'
And I get the correct result:
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 5.421598,
    "hits": [
      {
        "_index": "process_test_3",
        "_type": "14",
        "_id": "141_dashboard_14",
        "_score": 5.421598,
        "_source": {
          "obj_type": "dashboard",
          "obj_id": "141",
          "title": "Cor Interface Monitoring"
        }
      }
    ]
  }
}
But when I want to search by a word part, for example:
curl -XGET 'host/process_test_3/14/_search' -d '
{
  "query" : {
    "query_string" : {
      "query" : "\"*cor inter*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'
I'm getting no results back:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : []
  }
}
What am I doing wrong?
This is because your title field has probably been analyzed by the standard analyzer (default setting) and the title Cor Interface Monitoring has been tokenized as the three tokens cor, interface and monitoring.
In order to search any substring of words, you need to create a custom analyzer which leverages the ngram token filter in order to also index all substrings of each of your tokens.
You can create your index like this:
curl -XPUT localhost:9200/process_test_3 -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "substring"]
        }
      },
      "filter": {
        "substring": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "14": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "substring_analyzer"
        }
      }
    }
  }
}'
Then you can reindex your data. As a result, the title Cor Interface Monitoring will now be tokenized as:
co, cor, or
in, int, inte, inter, interf, etc.
mo, mon, moni, etc.
so your second search query will now return the document you expect, because the tokens cor and inter will match.
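If you want to double-check the resulting tokens, you can run the title through the _analyze API once the index above exists (1.x syntax, where the text to analyze goes in the request body):
curl -XGET 'localhost:9200/process_test_3/_analyze?analyzer=substring_analyzer&pretty' -d 'Cor Interface Monitoring'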
+1 to Val's solution.
Just wanted to add something.
Since your query is relatively simple, you may want to have a look at the match/match_phrase queries. Match queries don't have the regex/wildcard parsing that query_string does, and are thus lighter.
You can find the details here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
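For instance, with the substring_analyzer above in place, the second search could be expressed as a plain match query. This is a sketch; note that match targets a single field, unlike the fields list of query_string:
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query": {
    "match": {
      "title": "cor inter"
    }
  }
}'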

ElasticSearch - searching different doc_types with the same field name but different analyzers

Let's say I make a simple ElasticSearch index:
curl -XPUT 'http://localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "de_acronym": {
          "type": "mapping",
          "mappings": [".=>"]
        }
      },
      "analyzer": {
        "analyzer1": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": ["de_acronym"]
        }
      }
    }
  }
}'
And I make two doc_types that have the same property name but they are analyzed slightly differently from one another:
curl -XPUT 'http://localhost:9200/test/_mapping/docA' -d '{
  "docA": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "simple"
      }
    }
  }
}'
curl -XPUT 'http://localhost:9200/test/_mapping/docB' -d '{
  "docB": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "analyzer1"
      }
    }
  }
}'
Next, let's say I put a document in each doc_type with the same name:
curl -XPUT 'http://localhost:9200/test/docA/1' -d '{ "name" : "U.S. Army" }'
curl -XPUT 'http://localhost:9200/test/docB/1' -d '{ "name" : "U.S. Army" }'
Let's try to search for "U.S. Army" in both doc types at the same time:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
  "query": {
    "match_phrase": {
      "name": {
        "query": "U.S. Army"
      }
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.5,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docA",
      "_id" : "1",
      "_score" : 1.5,
      "_source" : { "name" : "U.S. Army" }
    } ]
  }
}
I only get one result! I get the other result when I specify docB's analyzer:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "U.S. Army",
        "analyzer": "analyzer1"
      }
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docB",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : { "name" : "U.S. Army" }
    } ]
  }
}
I was under the impression that ES would search each doc_type with the appropriate analyzer. Is there a way to do this?
The ElasticSearch docs say that precedence for search analyzer goes:
1) The analyzer defined in the query itself, else
2) The analyzer defined in the field mapping, else
...
In this case, is ElasticSearch arbitrarily choosing which field mapping to use?
Take a look at this issue on GitHub, which seems to have started from this post in the ES Google Groups. I believe it answers your question:
if its in a filtered query, we can't infer it, so we simply pick one of those and use its analysis settings
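One possible workaround (an untested sketch using 1.x-era syntax) is to take the guesswork away entirely: wrap one match_phrase per type, each with its analyzer spelled out, inside a bool query with type filters:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "filtered": {
            "query": { "match_phrase": { "name": { "query": "U.S. Army", "analyzer": "simple" } } },
            "filter": { "type": { "value": "docA" } }
          }
        },
        {
          "filtered": {
            "query": { "match_phrase": { "name": { "query": "U.S. Army", "analyzer": "analyzer1" } } },
            "filter": { "type": { "value": "docB" } }
          }
        }
      ]
    }
  }
}'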

Elasticsearch: _id based on document field?

I am new to Elasticsearch. I am having difficulty using a field of the document as the _id. Here is my mapping:
{
  "product": {
    "_id": {
      "path": "id"
    },
    "properties": {
      "id": {
        "type": "long",
        "index": "not_analyzed",
        "store": "yes"
      },
      "title": {
        "type": "string",
        "analyzer": "snowball",
        "store": "no",
        "index": "not_analyzed"
      }
    }
  }
}
Here is a sample document:
{
  "id": 1,
  "title": "All Quiet on the Western Front"
}
When indexing this document, I got something like:
{
  "_index": "myindex",
  "_type": "book",
  "_id": "PZQu4rocRy60hO2seUEziQ",
  "_version": 1,
  "created": true
}
Did I do anything wrong? How should this work?
EDIT: _id.path was deprecated in v1.5 and removed in v2.0.
EDIT 2: on versions where this is supported, there is a performance penalty in that the coordinating node is forced to parse all requests (including bulk) in order to determine the correct primary shard for each document.
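On versions without _id.path, the usual alternative is to have the client read the ID from the document and set it explicitly in the request URL; for the product example above that would be something like this (index/type names assumed from the question):
curl -XPUT 'http://localhost:9200/myindex/product/1' -d '{
  "id": 1,
  "title": "All Quiet on the Western Front"
}'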
Provide an _id.path in your mapping, as described here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-id-field.html
Here is a full, working demonstration:
#!/bin/sh
echo "--- delete index"
curl -X DELETE 'http://localhost:9200/so_id_from_field/'
echo "--- create index and put mapping into place"
curl -XPUT http://localhost:9200/so_id_from_field/?pretty=true -d '{
  "mappings": {
    "tweet" : {
      "_id" : {
        "path" : "post_id"
      },
      "properties": {
        "post_id": {
          "type": "string"
        },
        "nickname": {
          "type": "string"
        }
      }
    }
  },
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 0
  }
}'
echo "--- index some tweets by POSTing"
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
  "post_id": "1305668",
  "nickname": "Uncle of the month club"
}'
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
  "post_id": "blarger",
  "nickname": "Hurry up and spend my money"
}'
curl -XPOST http://localhost:9200/so_id_from_field/tweet -d '{
  "post_id": "9",
  "nickname": "Who is the guy with the shoe hat?"
}'
echo "--- get the tweets"
curl -XGET http://localhost:9200/so_id_from_field/tweet/1305668?pretty=true
curl -XGET http://localhost:9200/so_id_from_field/tweet/blarger?pretty=true
curl -XGET http://localhost:9200/so_id_from_field/tweet/9?pretty=true

ElasticSearch: snowball not working?

I built the following:
curl -XDELETE "http://localhost:9200/testindex"
curl -XPOST "http://localhost:9200/testindex" -d'
{
"mappings" : {
"article" : {
"dynamic" : false,
"properties" : {
"text" : {
"type" : "string",
"analyzer" : "snowball"
}
}
}
}
}'
... I populate the following:
curl -XPUT "http://localhost:9200/testindex/article/1" -d'{"text": "grey"}'
curl -XPUT "http://localhost:9200/testindex/article/2" -d'{"text": "gray"}'
curl -XPUT "http://localhost:9200/testindex/article/3" -d'{"text": "greyed"}'
curl -XPUT "http://localhost:9200/testindex/article/4" -d'{"text": "greying"}'
... I see the following when I search:
curl -XPOST "http://localhost:9200/testindex/_search" -d'
{
"query": {
"query_string": {
"query": "grey",
"analyzer" : "snowball"
}
}
}'
The result is:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "testindex",
        "_type": "article",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "text": "grey"
        }
      }
    ]
  }
}
... I'm expecting 3 hits: grey, greyed, and greying. Why doesn't this work? Note that I'm not interested in adding fuzziness to the search, since that would by default match gray (but not greying).
What am I doing wrong here?
Your problem is that you are using query_string without defining a default_field, so it's searching against the _all field, which uses your default analyzer (most likely standard).
To fix this, do this:
curl -XPOST "http://localhost:9200/testindex/_search" -d'
{
"query": {
"query_string": {
"default_field": "text",
"query": "grey"}
}
}
}'
{"took":7,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":0.30685282,"hits":[{"_index":"testindex","_type":"article","_id":"4","_score":0.30685282, "_source" : {"text": "greying"}},{"_index":"testindex","_type":"article","_id":"1","_score":0.30685282, "_source" : {"text": "grey"}},{"_index":"testindex","_type":"article","_id":"3","_score":0.30685282, "_source" : {"text": "greyed"}}]}}
I try to stay away from query_string searching, though, unless I really can't avoid it. Sometimes people coming from Solr prefer this method of searching over the query DSL. In this case, try using match:
curl -XPOST "http://localhost:9200/testindex/_search" -d'
> {
> "query": {
> "match": {
> "text": "grey"
> }
> }
> }'
{"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":0.30685282,"hits":[{"_index":"testindex","_type":"article","_id":"4","_score":0.30685282, "_source" : {"text": "greying"}},{"_index":"testindex","_type":"article","_id":"1","_score":0.30685282, "_source" : {"text": "grey"}},{"_index":"testindex","_type":"article","_id":"3","_score":0.30685282, "_source" : {"text": "greyed"}}]}}
But either way yields the correct results.
See documentation here for the query_string:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
