ElasticSearch: snowball not working?

I built the following:
curl -XDELETE "http://localhost:9200/testindex"
curl -XPOST "http://localhost:9200/testindex" -d'
{
"mappings" : {
"article" : {
"dynamic" : false,
"properties" : {
"text" : {
"type" : "string",
"analyzer" : "snowball"
}
}
}
}
}'
... I populate it with the following:
curl -XPUT "http://localhost:9200/testindex/article/1" -d'{"text": "grey"}'
curl -XPUT "http://localhost:9200/testindex/article/2" -d'{"text": "gray"}'
curl -XPUT "http://localhost:9200/testindex/article/3" -d'{"text": "greyed"}'
curl -XPUT "http://localhost:9200/testindex/article/4" -d'{"text": "greying"}'
... I see the following when I search:
curl -XPOST "http://localhost:9200/testindex/_search" -d'
{
"query": {
"query_string": {
"query": "grey",
"analyzer" : "snowball"
}
}
}'
The result is:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "testindex",
        "_type": "article",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "text": "grey"
        }
      }
    ]
  }
}
... I'm expecting 3 hits: grey, greyed, and greying. Why doesn't this work? Note that I'm not interested in adding fuzziness to the search, since that would by default match gray (but not greying).
What am I doing wrong here?

Your problem is that you are using query_string without defining a default_field, so the query runs against the _all field, which uses your default analyzer (most likely standard) rather than snowball.
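You can check what snowball actually does to these terms with the _analyze API (a quick sanity check; the query-string form below assumes an ES 1.x-style endpoint, so adjust to your version). All three of grey, greyed, and greying should reduce to the same stem:
curl -XGET "http://localhost:9200/_analyze?analyzer=snowball&pretty" -d 'grey greyed greying'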
To fix this, specify the field:
curl -XPOST "http://localhost:9200/testindex/_search" -d'
{
"query": {
"query_string": {
"default_field": "text",
"query": "grey"}
}
}
}'
{"took":7,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":0.30685282,"hits":[{"_index":"testindex","_type":"article","_id":"4","_score":0.30685282, "_source" : {"text": "greying"}},{"_index":"testindex","_type":"article","_id":"1","_score":0.30685282, "_source" : {"text": "grey"}},{"_index":"testindex","_type":"article","_id":"3","_score":0.30685282, "_source" : {"text": "greyed"}}]}}
I try to stay away from query_string searching, though, unless I really can't avoid it. Sometimes people coming from Solr prefer that method of searching over the query DSL. In this case, try using match:
curl -XPOST "http://localhost:9200/testindex/_search" -d'
> {
> "query": {
> "match": {
> "text": "grey"
> }
> }
> }'
{"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":0.30685282,"hits":[{"_index":"testindex","_type":"article","_id":"4","_score":0.30685282, "_source" : {"text": "greying"}},{"_index":"testindex","_type":"article","_id":"1","_score":0.30685282, "_source" : {"text": "grey"}},{"_index":"testindex","_type":"article","_id":"3","_score":0.30685282, "_source" : {"text": "greyed"}}]}}
Either way yields the correct results.
See the query_string documentation here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

Related

Search by exact match in all fields in Elasticsearch

Let's say I have 3 documents, each of which contains only one field (but let's imagine that there are more, and we need to search through all fields).
Field value is "first second"
Field value is "second first"
Field value is "first second third"
Here is a script that can be used to create these 3 documents:
# drop the index completely, use with care!
curl -iX DELETE 'http://localhost:9200/test'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/one' -d '{"name":"first second"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/two' -d '{"name":"second first"}'
curl -H 'content-type: application/json' -iX PUT 'http://localhost:9200/test/_doc/three' -d '{"name":"first second third"}'
I need to find the only document (document 1) that has exactly "first second" text in one of its fields.
Here is what I tried.
A. Plain search:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
  "query": {
    "query_string": {
      "query": "first second"
    }
  }
}'
This returns all 3 documents.
B. Quoting
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
  "query": {
    "query_string": {
      "query": "\"first second\""
    }
  }
}'
This gives 2 documents: 1 and 3, because both contain 'first second'.
Here https://stackoverflow.com/a/28024714/7637120 they suggest using the 'keyword' analyzer to analyze the fields when indexing, but I would like to avoid any customizations to the mapping.
Is it possible to avoid them and still only find document 1?
Yes, you can do that by declaring the name field's mapping type as keyword.
To demonstrate, I have done the following:
1) created a mapping with `keyword` type for the `name` field
2) indexed the three documents
3) searched with a `match` query
mappings
PUT so_test16
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "keyword"
        }
      }
    }
  }
}
Indexing the documents
POST /so_test16/_doc
{
  "id": 1,
  "name": "first second"
}
POST /so_test16/_doc
{
  "id": 2,
  "name": "second first"
}
POST /so_test16/_doc
{
  "id": 3,
  "name": "first second third"
}
The query
GET /so_test16/_search
{
  "query": {
    "match": { "name": "first second" }
  }
}
and the result
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "so_test16",
        "_type" : "_doc",
        "_id" : "m1KXx2sB4TH56W1hdTF9",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 1,
          "name" : "first second"
        }
      }
    ]
  }
}
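Since name is mapped as keyword, its value is indexed verbatim, so an equivalent (and arguably more direct) query is a term query, which skips query-time analysis entirely. A minimal sketch, assuming the same so_test16 index:
GET /so_test16/_search
{
  "query": {
    "term": { "name": "first second" }
  }
}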
Adding a second solution, for the case where name is not a keyword type but a text type. The only extra requirement here is that fielddata: true also needs to be enabled for the name field.
Mappings
PUT so_test18
{
  "mappings" : {
    "_doc" : {
      "properties" : {
        "id" : {
          "type" : "long"
        },
        "name" : {
          "type" : "text",
          "fielddata": true
        }
      }
    }
  }
}
and the search query (the match_phrase clause matches the phrase itself, while the script filter keeps only documents whose name field contains exactly two tokens, i.e. no extra words):
GET /so_test18/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "name": "first second" } }
      ],
      "filter": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "doc['name'].values.length == 2"
          }
        }
      }
    }
  }
}
and the response
{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.3971361,
    "hits" : [
      {
        "_index" : "so_test18",
        "_type" : "_doc",
        "_id" : "o1JryGsB4TH56W1hhzGT",
        "_score" : 0.3971361,
        "_source" : {
          "id" : 1,
          "name" : "first second"
        }
      }
    ]
  }
}
In Elasticsearch 7.1.0, it seems that you can use the keyword analyzer even without creating a special mapping. At least I didn't create one, and the following query does what I need:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
  "query": {
    "query_string": {
      "query": "first second",
      "analyzer": "keyword"
    }
  }
}'
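A related option with default dynamic mappings (Elasticsearch 5+ maps JSON strings as text with a keyword sub-field) is a term query against the .keyword sub-field. A minimal sketch, assuming the index was created with those defaults:
curl -H 'Content-Type: application/json' -iX POST 'http://localhost:9200/test/_search' -d '{
  "query": {
    "term": { "name.keyword": "first second" }
  }
}'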

Elasticsearch query string doesn't search by word part

I'm sending this request
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query" : {
    "query_string" : {
      "query" : "\"*cor interface*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'
And I'm getting the correct result:
{
  "took": 12,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 5.421598,
    "hits": [
      {
        "_index": "process_test_3",
        "_type": "14",
        "_id": "141_dashboard_14",
        "_score": 5.421598,
        "_source": {
          "obj_type": "dashboard",
          "obj_id": "141",
          "title": "Cor Interface Monitoring"
        }
      }
    ]
  }
}
But when I want to search by part of a word, for example:
curl -XGET 'host/process_test_3/14/_search' -d '
{
  "query" : {
    "query_string" : {
      "query" : "\"*cor inter*\"",
      "fields" : ["title", "obj_id"]
    }
  }
}'
I'm getting no results back:
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : []
  }
}
What am I doing wrong?
This is because your title field has probably been analyzed by the standard analyzer (default setting) and the title Cor Interface Monitoring has been tokenized as the three tokens cor, interface and monitoring.
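You can confirm the tokenization with the _analyze API (a quick check; the query-string form below assumes an ES 1.x-style endpoint):
curl -XGET 'host/process_test_3/_analyze?analyzer=standard&pretty' -d 'Cor Interface Monitoring'
This should return exactly the three tokens cor, interface, and monitoring.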
In order to search any substring of words, you need to create a custom analyzer which leverages the ngram token filter in order to also index all substrings of each of your tokens.
You can create your index like this:
curl -XPUT localhost:9200/process_test_3 -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "substring"]
        }
      },
      "filter": {
        "substring": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "14": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "substring_analyzer"
        }
      }
    }
  }
}'
Then you can reindex your data. As a result, the title Cor Interface Monitoring will now be tokenized as:
co, cor, or
in, int, inte, inter, interf, etc
mo, mon, moni, etc
so that your second search query will now return the document you expect because the tokens cor and inter will now match.
+1 to Val's solution.
Just wanted to add something.
Since your query is relatively simple, you may want to have a look at the match/match_phrase queries. Match queries don't do the regex parsing that query_string does, and are thus lighter.
You can find the details here: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
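For instance, a match version of the query might look like this (a sketch, untested, assuming the substring_analyzer mapping from Val's answer is in place):
curl -XGET 'host/process_test_3/14/_search' -d '{
  "query": {
    "match": {
      "title": "cor inter"
    }
  }
}'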

ElasticSearch - searching different doc_types with the same field name but different analyzers

Let's say I make a simple ElasticSearch index:
curl -XPUT 'http://localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "char_filter": {
        "de_acronym": {
          "type": "mapping",
          "mappings": [".=>"]
        }
      },
      "analyzer": {
        "analyzer1": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": ["de_acronym"]
        }
      }
    }
  }
}'
And I make two doc_types that have the same property name but they are analyzed slightly differently from one another:
curl -XPUT 'http://localhost:9200/test/_mapping/docA' -d '{
  "docA": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "simple"
      }
    }
  }
}'
curl -XPUT 'http://localhost:9200/test/_mapping/docB' -d '{
  "docB": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "analyzer1"
      }
    }
  }
}'
Next, let's say I put a document in each doc_type with the same name:
curl -XPUT 'http://localhost:9200/test/docA/1' -d '{ "name" : "U.S. Army" }'
curl -XPUT 'http://localhost:9200/test/docB/1' -d '{ "name" : "U.S. Army" }'
Let's try to search for "U.S. Army" in both doc types at the same time:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
  "query": {
    "match_phrase": {
      "name": {
        "query": "U.S. Army"
      }
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.5,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docA",
      "_id" : "1",
      "_score" : 1.5,
      "_source" : { "name" : "U.S. Army" }
    } ]
  }
}
I only get one result! I get the other result when I specify docB's analyzer:
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '
{
  "query": {
    "match_phrase": {
      "name": {
        "query": "U.S. Army",
        "analyzer": "analyzer1"
      }
    }
  }
}'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docB",
      "_id" : "1",
      "_score" : 1.0,
      "_source" : { "name" : "U.S. Army" }
    } ]
  }
}
I was under the impression that ES would search each doc_type with the appropriate analyzer. Is there a way to do this?
The ElasticSearch docs say that precedence for search analyzer goes:
1) The analyzer defined in the query itself, else
2) The analyzer defined in the field mapping, else
...
In this case, is ElasticSearch arbitrarily choosing which field mapping to use?
Take a look at this issue on GitHub, which seems to have started from this post in the ES Google group. I believe it answers your question:
if its in a filtered query, we can't infer it, so we simply pick one of those and use its analysis settings
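Until that changes, one workaround is to spell the analyzer out per type yourself, e.g. a bool query with one type-filtered clause per doc_type. A sketch using the ES 1.x filtered query and type filter (untested):
curl -XGET 'http://localhost:9200/test/_search?pretty' -d '{
  "query": {
    "bool": {
      "should": [
        {
          "filtered": {
            "query": { "match_phrase": { "name": { "query": "U.S. Army", "analyzer": "simple" } } },
            "filter": { "type": { "value": "docA" } }
          }
        },
        {
          "filtered": {
            "query": { "match_phrase": { "name": { "query": "U.S. Army", "analyzer": "analyzer1" } } },
            "filter": { "type": { "value": "docB" } }
          }
        }
      ]
    }
  }
}'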

Treatment of special characters in elasticsearch

I use the following analyzer:
curl -XPUT 'http://localhost:9200/sample/' -d '
{
  "settings" : {
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["trim", "lowercase"]
          }
        }
      }
    }
  }
}'
Then when I try to insert some documents which contain special characters like %, it converts them to hex.
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8 -> actual value
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8 -> stored value
Sample:
curl -XPUT 'http://localhost:9200/sample/strom/1' -d '{
  "user" : "user1",
  "message" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}'
The problem started occurring only once the data crossed some millions of documents. Earlier it used to store the value as-is.
Now if I try to search using
1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8
it is not able to retrieve the document. How do I deal with this? The behavior in converting special characters to hex seems to be non-deterministic.
I am unable to replicate the issue on my local machine.
Can someone explain the mistake I am making?
That is not how the document is tokenized on my end with that analyzer:
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&filters=trim,lowercase&pretty' -d '1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8'
{
  "tokens" : [ {
    "token" : "1%2fpjjp3jv2c24idfeu9xphbayxxh%2fdhtbmchb35sdznxo2g8vz4d7gtivy54imix_149c95f02a8",
    "start_offset" : 0,
    "end_offset" : 80,
    "type" : "word",
    "position" : 1
  } ]
}
Reading the analyzer output above, your example text is converted into a single, lowercased-but-otherwise-identical token by that analyzer. Are you sure there is no character filter at play? That's what could be doing the encoding.
You should be able to run it as:
curl -XGET 'localhost:9200/sample/_analyze?field=message' -d 'text to analyze'
Since it was not reproducing with the analyzer directly, I tried to reproduce this on my end by creating an index to test it:
curl -XPUT localhost:9200/indexed-analysis -d '
{
  "settings": {
    "number_of_shards" : 1,
    "number_of_replicas" : 0,
    "index": {
      "analysis": {
        "analyzer": {
          "default": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["trim", "lowercase"]
          }
        }
      }
    }
  },
  "mappings": {
    "indexed" : {
      "properties": {
        "text" : { "type" : "string" }
      }
    }
  }
}'
curl -XPUT localhost:9200/indexed-analysis/indexed/1 -d '{
  "text" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
}'
curl -XGET localhost:9200/indexed-analysis/indexed/1?pretty
This produced the correct, identical result:
{
  "_index" : "indexed-analysis",
  "_type" : "indexed",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "text" : "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
  }
}
So, I tried _searching for it, and I found it appropriately.
curl -XGET localhost:9200/indexed-analysis/_search -d '{
  "query": {
    "match": {
      "text": "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
    }
  }
}'
Result:
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "indexed-analysis",
        "_type": "indexed",
        "_id": "1",
        "_score": 0.30685282,
        "_source": {
          "text": "1%2fPJJP3JV2C24iDfEu9XpHBaYxXh%2fdHTbmchB35SDznXO2g8Vz4D7GTIvY54iMiX_149c95f02a8"
        }
      }
    ]
  }
}
All of this leads back to three possibilities:
1) Your search analyzer is different from your index analyzer. This is almost always going to produce unexpected results. Using default should force it to be used for both reading and writing, but you can/should verify that it is actually being used (as opposed to default_index or default_search):
curl -XGET localhost:9200/sample/_settings
curl -XGET localhost:9200/sample/_mapping
If you see analyzers being configured in the mapping for the message field, then that should probably be a red flag.
2) You have a character filter messing with the indexed string (and it's probably not doing the same thing to your search string, thus pointing back to #1).
3) There is a bug in the version of Elasticsearch that you are using (hopefully not, but you never know). All of the tests above were done against version 1.3.2.

Elasticsearch array must and must_not

I have documents looking like this in my Elasticsearch DB:
{
  "tags" : [
    "tag-1",
    "tag-2",
    "tag-3",
    "tag-A"
  ],
  "created_at" : "2013-07-02 12:42:19 UTC",
  "label" : "Mon super label"
}
I would like to be able to filter my documents with this criterion:
the document's tags array must contain tag-1, tag-2, and tag-3, but must not contain tag-A.
I tried to use a bool filter but I can't manage to make it work !
Here is a method that seems to accomplish what you want: http://sense.qbox.io/gist/4dd806936f12a9668d61ce63f39cb2c284512443
First I created an index with an explicit mapping. I did this so I could set the "tags" property to "index": "not_analyzed". This means that the text will not be modified in any way, which will simplify the querying process for this example.
curl -XPUT "http://localhost:9200/test_index" -d'
{
"mappings": {
"docs" : {
"properties": {
"tags" : {
"type": "string",
"index": "not_analyzed"
},
"label" : {
"type": "string"
}
}
}
}
}'
and then add some docs:
curl -XPUT "http://localhost:9200/test_index/docs/1" -d'
{
"tags" : [
"tag-1",
"tag-2",
"tag-3",
"tag-A"
],
"label" : "item 1"
}'
curl -XPUT "http://localhost:9200/test_index/docs/2" -d'
{
"tags" : [
"tag-1",
"tag-2",
"tag-3"
],
"label" : "item 2"
}'
curl -XPUT "http://localhost:9200/test_index/docs/3" -d'
{
"tags" : [
"tag-1",
"tag-2"
],
"label" : "item 3"
}'
Then we can query using must and must_not clauses in a bool filter as follows:
curl -XPOST "http://localhost:9200/test_index/_search" -d'
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"terms": {
"tags": [
"tag-1",
"tag-2",
"tag-3"
],
"execution" : "and"
}
}
],
"must_not": [
{
"term": {
"tags": "tag-A"
}
}
]
}
}
}
}
}'
which yields the correct result:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "docs",
        "_id": "2",
        "_score": 1,
        "_source": {
          "tags": [
            "tag-1",
            "tag-2",
            "tag-3"
          ],
          "label": "item 2"
        }
      }
    ]
  }
}
Notice the "execution" : "and" parameter in the terms filter in the must clause. This means only docs that have all the "tags" specified will be returned (rather than those that match one or more). That may have been what you were missing. You can read more about the options in the ES docs.
I made a runnable example here that you can play with, if you have ES installed and running at localhost:9200, or you can provide your own endpoint.
