Elastic exact match w/o changing indexing

I have the following query to Elastic:
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"entities.hashtags": "gf"
}
}
]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
},
entities.hashtags is an array, and as a result I receive entries with hashtags like gf_anime, gf_whatever, gf_foobar, etc.
But what I need is to receive only entries where the exact hashtag "gf" exists.
I've looked at other questions on SO and saw that the usual solution is to change how entities.hashtags is analyzed so that it only matches exact values (I'm pretty new to Elastic, so I may be getting the terminology wrong).
My question is whether it's possible to define an exact-match search INSIDE THE QUERY, i.e. without changing how Elastic indexes its fields?

Are you sure that you need to do anything? Given your examples, you don't, and you probably don't want not_analyzed here:
curl -XPUT localhost:9200/test -d '{
  "mappings": {
    "test" : {
      "properties": {
        "body" : { "type" : "string" },
        "entities" : {
          "type" : "object",
          "properties": {
            "hashtags" : {
              "type" : "string"
            }
          }
        }
      }
    }
  }
}'
curl -XPUT localhost:9200/test/test/1 -d '{
"body" : "anime", "entities" : { "hashtags" : "gf_anime" }
}'
curl -XPUT localhost:9200/test/test/2 -d '{
"body" : "anime", "entities" : { "hashtags" : ["GF", "gf_anime"] }
}'
curl -XPUT localhost:9200/test/test/3 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf_whatever", "gf_anime"] }
}'
With the above data indexed, your query only returns document 2 (note: this is a simplified version of your query without the unnecessary and filter; at least for the time being, you should always use the bool filter rather than and/or, as it knows how to use the filter caches):
curl -XGET localhost:9200/test/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "entities.hashtags": "gf"
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}
Where this breaks down is when you start putting in hashtag values that get split into multiple tokens, thereby triggering false hits with the term filter. You can check how a field's analyzer will treat any value by passing it to the _analyze endpoint and telling it which field to take the analyzer from:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf_anime'
{
  "tokens" : [ {
    "token" : "gf_anime",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}
# Note the space instead of the underscore:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf anime'
{
  "tokens" : [ {
    "token" : "gf",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "anime",
    "start_offset" : 3,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
If you were to add a fourth document with the "gf anime" variant, then you would get a false hit:
curl -XPUT localhost:9200/test/test/4 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf whatever", "gf anime"] }
}'
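Re-running the earlier search (the same request as above) demonstrates the problem: document 4 now matches even though none of its hashtags is exactly "gf", because "gf whatever" and "gf anime" both analyze to token streams containing gf:
curl -XGET localhost:9200/test/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "entities.hashtags": "gf"
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}
# Now hits both document 2 (correct) and document 4 (the false hit).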
This is really not an indexing problem, but a bad-data problem.
With all of the explanation out of the way, you can solve this (inefficiently) with a script filter that always runs after the term filter, which efficiently rules out the more common documents that cannot match:
curl -XGET localhost:9200/test/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool" : {
          "must" : [{
            "term" : {
              "entities.hashtags" : "gf"
            }
          },
          {
            "script" : {
              "script" : "_source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null",
              "params" : {
                "tag" : "gf"
              }
            }
          }]
        }
      },
      "query": {
        "match_phrase": {
          "body": "anime"
        }
      }
    }
  }
}
This works by parsing the original _source (rather than using the indexed values), which is why it is not very efficient, but it will work until you reindex. The _source.entities.hashtags == tag portion is only necessary if hashtags is not always an array (in my example, document 1 is not an array). If it is always an array, then you can use _source.entities.hashtags.contains(tag) instead of _source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null.
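For instance, if hashtags were always an array, the script portion of the filter above could shrink to this (a sketch; same params as before):
{
  "script" : {
    "script" : "_source.entities.hashtags.contains(tag)",
    "params" : {
      "tag" : "gf"
    }
  }
}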
Note: The script language is Groovy, which is the default starting in 1.4.0. It is not the default in earlier versions, and it must be explicitly enabled using script.default_lang : groovy.
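On those earlier versions, enabling it is a one-line change in elasticsearch.yml, followed by a node restart:
# elasticsearch.yml (only needed before 1.4.0; Groovy is the default afterwards)
script.default_lang: groovy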

Related

How do I configure Elasticsearch to use the icu_tokenizer?

I'm trying to search text indexed by Elasticsearch with the icu_tokenizer, but I can't get it working.
My test case is to tokenize the sentence "Hello. I am from Bangkok", in Thai สวัสดี ผมมาจากกรุงเทพฯ, which should be tokenized into the five words สวัสดี, ผม, มา, จาก, กรุงเทพฯ (sample from Elasticsearch: The Definitive Guide).
Searching with any of the last four words fails for me. Searching with either of the space-separated words สวัสดี or ผมมาจากกรุงเทพฯ works fine.
If I specify the icu_tokenizer on the command line, like
curl -XGET 'http://localhost:9200/icu/_analyze?tokenizer=icu_tokenizer' -d "สวัสดี ผมมาจากกรุงเทพฯ"
it tokenizes to five words.
My settings are:
curl http://localhost:9200/icu/_settings?pretty
{
  "icu" : {
    "settings" : {
      "index" : {
        "creation_date" : "1474010824865",
        "analysis" : {
          "analyzer" : {
            "nfkc_cf_normalized" : [ "icu_normalizer" ],
            "tokenizer" : "icu_tokenizer"
          }
        }
      },
      "number_of_shards" : "5",
      "number_of_replicas" : "1",
      "uuid" : "tALRehqIRA6FGPu8iptzww",
      "version" : {
        "created" : "2040099"
      }
    }
  }
}
The index is populated with
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
  "first_name" : "John",
  "last_name" : "Doe",
  "about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
Searching with
curl -XGET 'http://localhost:9200/_search' -d'
{
  "query" : {
    "match" : {
      "about" : "กรุงเทพฯ"
    }
  }
}'
Returns nothing ("hits" : [ ]).
Performing the same search with one of สวัสดี or ผมมาจากกรุงเทพฯ works fine.
I guess I've misconfigured the index; how should it be done?
The missing part is:
"mappings": {
"employee" : {
"properties": {
"about":{
"type": "text",
"analyzer": "icu_analyzer"
}
}
}
}
In the mapping, the document field has to specify the analyzer to use.
[Index] : icu
[type] : employee
[field] : about
PUT /icu
{
  "settings": {
    "analysis": {
      "analyzer": {
        "icu_analyzer" : {
          "char_filter": [
            "icu_normalizer"
          ],
          "tokenizer" : "icu_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "employee" : {
      "properties": {
        "about" : {
          "type": "text",
          "analyzer": "icu_analyzer"
        }
      }
    }
  }
}
Test the custom analyzer using the following JSON DSL:
POST /icu/_analyze
{
  "text": "สวัสดี ผมมาจากกรุงเทพฯ",
  "analyzer": "icu_analyzer"
}
The result should be [สวัสดี, ผม, มา, จาก, กรุงเทพฯ]
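With the index recreated like this, reindexing the employee document and re-running the match query from the question should now return a hit (a sketch reusing the requests from the question, scoped to the icu index):
curl -XPOST 'http://localhost:9200/icu/employee/' -d '
{
  "first_name" : "John",
  "last_name" : "Doe",
  "about" : "สวัสดี ผมมาจากกรุงเทพฯ"
}'
curl -XGET 'http://localhost:9200/icu/_search' -d '
{
  "query" : {
    "match" : {
      "about" : "กรุงเทพฯ"
    }
  }
}'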
My suggestion: the Kibana Dev Tools console helps a lot with crafting queries effectively.

How to aggregate by substring in elasticsearch

I have to index many documents like this:
POST /example/doc
{
  "id" : "type-name",
  "foo" : "bar"
}
and I would like to retrieve a list of all the types that are present. For example:
POST /example/doc
{
  "id" : "AAA-123",
  "foo" : "bar"
}
POST /example/doc
{
  "id" : "AAA-456",
  "foo" : "bar"
}
POST /example/doc
{
  "id" : "BBB-123",
  "foo" : "bar"
}
and ask Elasticsearch to give me a list containing AAA and BBB.
UPDATE
I've also solved this using a custom analyzer:
"settings": {
"analysis": {
"char_filter" : {
"remove_after_minus":{
"type":"pattern_replace",
"pattern":"-(.*)",
"replacement":""
}
},
"analyzer": {
"id_analyzer":{
"tokenizer" : "standard",
"char_filter" : ["remove_after_minus"]
}
}
}
}
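For the analyzer to take effect, it still has to be attached to the id field in the mapping; a minimal sketch (the doc type name is assumed from the question):
"mappings": {
  "doc": {
    "properties": {
      "id": {
        "type": "string",
        "analyzer": "id_analyzer"
      }
    }
  }
}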
If you keep the standard analyzer, the id will be split at the "-". So, if lower and upper case are the same for your types, you can just go with a simple facet query:
curl -XPOST "http://localhost:9023/index/type/_search?size=0&pretty=true" -d
'{
"query" : {
{ "regexp":{ "id": "[A-Z]+" }
},
"facets" : {
"id" : {
"terms" : {
"field" : "id",
"size" : 50
}
}
}
}'
should give you something that you can use.
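Note that facets were deprecated in Elasticsearch 1.0 and removed in 2.0; on newer versions the equivalent is a terms aggregation, e.g.:
POST /example/_search
{
  "size": 0,
  "aggs": {
    "id": {
      "terms": {
        "field": "id",
        "size": 50
      }
    }
  }
}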

Problems with elasticsearch filtering

I'm having trouble with filtering in Elasticsearch. I want to filter an index of order lines, like this SQL query:
SELECT * FROM orderrow WHERE item_code = '7X-BogusItem'
Here's my elasticsearch query:
GET /myindex/orderrow/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "item_code": "7X-BogusItem"
        }
      }
    }
  }
}
I'm getting no results back. Yet when I run this query:
GET /myindex/orderrow/_search
{
  "query": {
    "query_string": {
      "query": "7X-BogusItem"
    }
  }
}
I get the proper results. What am I doing wrong?
You could try with:
GET /myindex/orderrow/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "query": {
          "query_string": {
            "query": "7X-BogusItem"
          }
        }
      }
    }
  }
}
The thing is that the query_string query is analyzed while the term query is not. Your data 7X-BogusItem was probably transformed by the default analyzer during indexing into the terms 7x and bogusitem. When you run a term query for 7X-BogusItem it will not work, because the index does not contain the term 7X-BogusItem, only the terms 7x and bogusitem. A query_string query, however, transforms your query 7X-BogusItem into the terms 7x and bogusitem under the hood, and it finds what you want.
If you don't want your text 7X-BogusItem to be transformed by the analyzer, you can change the mapping option for the field item_code to "index" : "not_analyzed" (see the sketch at the end of this answer).
You can check what your data will look like after analysis:
curl -XGET "localhost:9200/_analyze?analyzer=standard&pretty" -d '7X-BogusItem'
{
  "tokens" : [ {
    "token" : "7x",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "bogusitem",
    "start_offset" : 3,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}
So for the text 7X-BogusItem, the index contains the terms 7x and bogusitem.
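A minimal sketch of the not_analyzed mapping mentioned above (pre-5.x syntax, matching the era of this question; note that you cannot change the mapping of an existing field in place, so this belongs in a fresh index followed by a reindex):
curl -XPUT "localhost:9200/myindex/orderrow/_mapping" -d '{
  "orderrow": {
    "properties": {
      "item_code": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}'
# After reindexing, the original constant_score + term query on
# "7X-BogusItem" matches, since the value is now indexed verbatim.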

How to avoid cross-object search behavior with nested types in Elasticsearch

I am trying to determine the best way to index a document in Elasticsearch. I have a document, Doc, which has some fields:
Doc
created_at
updated_at
field_a
field_b
But Doc will also have some fields specific to individual users. For example, field_x will have value 'A' for user 1, and field_x will have value 'B' for user 2. For each doc, there will be a very limited number of users (typically 2, up to ~10). When a user searches on field_x, they must search on the value that belongs to them. I have been exploring nested types in ES.
Doc
created_at
updated_at
field_x: [{
user: 1
field_x: A
},{
user: 2
field_x: B
}]
When user 1 searches on field_x for value 'A', this doc should result in a hit. However, it should not when user 1 searches by value 'B'.
However, according to the docs:
One of the problems when indexing inner objects that occur several
times in a doc is that “cross object” search match will occur
Is there a way to avoid this behavior with nested types or should I explore another type?
Additional information regarding the performance of such queries would be very valuable. The docs state that nested queries are not too different from regular queries in terms of performance. If anyone has real experience with this, I would love to hear it.
Nested type is what you are looking for, and don't worry too much about performance.
Before indexing your documents, you need to set the mapping for your documents:
curl -XDELETE localhost:9200/index
curl -XPUT localhost:9200/index
curl -XPUT localhost:9200/index/type/_mapping -d '{
  "type": {
    "properties": {
      "field_x": {
        "type": "nested",
        "include_in_parent": false,
        "include_in_root": false,
        "properties": {
          "user": {
            "type": "string"
          },
          "field_x": {
            "type": "string",
            "index" : "not_analyzed" // NOTE*
          }
        }
      }
    }
  }
}'
*note: If your field really contains only single letters like "A" and "B", you don't want to analyze the field; otherwise Elasticsearch's analysis may transform or even drop these single-letter "words" (for example, lowercasing "A" to "a"), and your term query for "A" would no longer match.
If that was just your example, and in your real documents you are searching for proper words, remove this line and let Elasticsearch analyze the field.
Then, index your documents:
curl -XPUT http://localhost:9200/index/type/1 -d '
{
  "field_a": "foo",
  "field_b": "bar",
  "field_x" : [{
    "user" : "1",
    "field_x" : "A"
  },
  {
    "user" : "2",
    "field_x" : "B"
  }]
}'
And run your query:
curl -XGET localhost:9200/index/type/_search -d '{
  "query": {
    "nested" : {
      "path" : "field_x",
      "score_mode" : "avg",
      "query" : {
        "bool" : {
          "must" : [
            {
              "term": {
                "field_x.user": "1"
              }
            },
            {
              "term": {
                "field_x.field_x": "A"
              }
            }
          ]
        }
      }
    }
  }
}'
This will result in
{"took":13,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":1.987628,"hits":[{"_index":"index","_type":"type","_id":"1","_score":1.987628, "_source" :
{
  "field_a": "foo",
  "field_b": "bar",
  "field_x" : [{
    "user" : "1",
    "field_x" : "A"
  },
  {
    "user" : "2",
    "field_x" : "B"
  }]
}}]}}
However, querying
curl -XGET localhost:9200/index/type/_search -d '{
  "query": {
    "nested" : {
      "path" : "field_x",
      "score_mode" : "avg",
      "query" : {
        "bool" : {
          "must" : [
            {
              "term": {
                "field_x.user": "1"
              }
            },
            {
              "term": {
                "field_x.field_x": "B"
              }
            }
          ]
        }
      }
    }
  }
}'
won't return any results
{"took":6,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

Elasticsearch: search for words containing the '#' character

For example, right now I am searching like this:
http://localhost:9200/posts/post/_search?q=content:%23sachin
But I am getting all the results containing 'sachin', not '#sachin'. I am also writing a regular expression to get the count of terms. The facet looks like this:
"facets": {
"content": {
"terms": {
"field": "content",
"size": 1000,
"all_terms": false,
"regex": "#sachin",
"regex_flags": [
"DOTALL",
"CASE_INSENSITIVE"
]
}
}
}
This does not return any values. I think it has something to do with escaping the '#' inside the regular expression, but I am not sure how to do it. I have tried escaping it with \ and \\, but neither worked. Can anyone help me in this regard?
This article gives information on how to preserve '@' and '#' characters using custom analyzers:
https://web.archive.org/web/20160304014858/http://www.fullscale.co/blog/2013/03/04/preserving_specific_characters_during_tokenizing_in_elasticsearch.html
curl -XPUT 'http://localhost:9200/twitter' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "tweet_filter" : {
          "type" : "word_delimiter",
          "type_table" : ["@ => ALPHA", "# => ALPHA"]
        }
      },
      "analyzer" : {
        "tweet_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "tweet_filter"]
        }
      }
    }
  },
  "mappings" : {
    "tweet" : {
      "properties" : {
        "msg" : {
          "type" : "string",
          "analyzer" : "tweet_analyzer"
        }
      }
    }
  }
}'
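You can verify that the '#' survives analysis by testing the analyzer directly against the index created above:
curl -XGET 'http://localhost:9200/twitter/_analyze?analyzer=tweet_analyzer&pretty' -d '#sachin'
# Expected: a single token "#sachin" (lowercased, with '#' kept as an ALPHA character)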
This doesn't deal with facets, but redefining the type of those special characters in the analyzer could help.
Another approach worth considering is to index a special (e.g. "reserved") word instead of the hash symbol, for example HASHSYMBOLCHAR. Make sure that you replace '#' characters in the query as well.
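If you would rather do that substitution inside Elasticsearch than in your application code, a mapping char filter can rewrite the symbol at both index and query time; a sketch (the index, filter, and analyzer names here are made up):
curl -XPUT 'http://localhost:9200/posts' -d '{
  "settings" : {
    "analysis" : {
      "char_filter" : {
        "hash_to_word" : {
          "type" : "mapping",
          "mappings" : ["# => HASHSYMBOLCHAR"]
        }
      },
      "analyzer" : {
        "hash_analyzer" : {
          "type" : "custom",
          "char_filter" : ["hash_to_word"],
          "tokenizer" : "standard",
          "filter" : ["lowercase"]
        }
      }
    }
  }
}'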
