Problems with elasticsearch filtering - elasticsearch

I'm having trouble with filtering in elastic search. I want to filter an index of order lines. Like this sql query:
SELECT * FROM orderrow WHERE item_code = '7X-BogusItem'
Here's my elasticsearch query:
GET /myindex/orderrow/_search
{
"query": {
"constant_score": {
"filter": {
"term": {
"item_code": "7X-BogusItem"
}
}
}
}
}
I'm getting no results back. Yet when I run this query:
GET /myindex/orderrow/_search
{
"query": {
"query_string": {
"query": "7X-BogusItem"
}
}
}
I get the proper results. What am I doing wrong?

You could try with:
GET /myindex/orderrow/_search
{
"query": {
"constant_score": {
"filter": {
"query": {
"query_string": {
"query": "7X-BogusItem"
}
}
}
}
}
}
The thing is that query_string query is analyzed while term query is not. Probably your data 7X-BogusItem was transformed by default analyzer during indexing to terms like 7x and bogusitem. When you try to do a query with term 7X-BogusItem it will not work because you don't have term 7X-BogusItem - you have only terms 7x and bogusitem. However performing query_string will transform your query 7X-BogusItem to terms 7x and bogusitem under the hood and it will find what you want.
If you don't want your text 7X-BogusItem to be transformed by analyzer, you could change mapping option for field item_code to "index" : "not_analyzed".
You can check what your data will look like after analysis:
curl -XGET "localhost:9200/_analyze?analyzer=standard&pretty" -d '7X-BogusItem'
{
"tokens" : [ {
"token" : "7x",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "bogusitem",
"start_offset" : 3,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
So for text 7X-BogusItem we have in index terms 7x and bogusitem.

Related

Multiple OR filter in Elasticsearch

Hello I'm having trouble deciding the correctness of the following query for multiple OR in Elasticsearch. I want to select all the unique data (not count, but select all rows)
My best try for this in elastic query is
GET mystash/_search
{
"aggs": {
"uniques":{
"filter":
{
"or":
[
{ "term": { "url.raw" : "/a.json" } },
{ "term": { "url.raw" : "/b.json" } },
{ "term": { "url.raw" : "/c.json"} },
{ "term": { "url.raw" : "/d.json"} }
]
},
"aggs": {
"unique" :{
"terms" :{
"field" : "id.raw",
"size" : 0
}
}
}
}
}
}
The equivalent SQL would be
SELECT DISTINCT id
FROM json_record
WHERE
json_record.url = 'a.json' OR
json_record.url = 'b.json' OR
json_record.url = 'c.json' OR
json_record.url = 'd.json'
I was wondering whether the query above is correct, since the data will be needed for report generations.
Some remarks:
You should use a query filter instead of an aggregation filter. Your query loads all documents.
You can replace your or+term filter by a single terms filter
You could use a size=0 at the root of the query to get only agg result and not search results
Example code:
{"size":0,
"query" :{"filtered":{"filter":{"terms":{"url":["a", "b", "c"]}}}},
"aggs" :{"unique":{"term":{"field":"id", "size" :0}}}
}

match or term query on a long property for exact match?

My document has the following mapping property:
"sid" : {"type" : "long", "store": "yes", "index": "not_analyzed"},
This property has only one long value for each record. I would like to query this property. I tried the following two queries:
{
"query" : {
"term" : {
"sid" : 10
}
}
}
{
"query" : {
"match" : {
"sid" : 10
}
}
}
Both queries work and return the target document. My question: which one is more efficient? And why?
You want to use a term query, and if you want to be even more effecient, use a filtered query so your results get cached.
GET index1/test/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"sid": 10
}
}
}
}
}
Both work like the same way as you mentioned. As distinguished from match query the term query matches documents that have fields that contain a term (not analyzed!). So my opinion is that term query is more efficient in your case, because no analyzing have to be done.See:http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-term-query.html

Elastic exact match w/o changing indexing

I have following query to elastic:
"query": {
"filtered": {
"filter": {
"and": {
"filters": [
{
"term": {
"entities.hashtags": "gf"
}
}
]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
},
entities.hashtags is array and as a result I receive entries with hashtags gf_anime, gf_whatever, gf_foobar etc.
But what I need is receive entries where exact "gf" hashtag exists.
I've looked in other questions on SO and saw that the solution in this case is to change analyzing of entities.hashtags so it'll match only exact values (I am pretty new with elastic hence can mistake with terms here).
My question is whether it's possible to define exact match search INSIDE THE QUERY? Id est w/o changing how elastic indexes its fields?
Are you sure that you need to do anything? Given your examples, you don't and you probably don't want to do not_analyzed:
curl -XPUT localhost:9200/test -d '{
"mappings": {
"test" : {
"properties": {
"body" : { "type" : "string" },
"entities" : {
"type" : "object",
"properties": {
"hashtags" : {
"type" : "string"
}
}
}
}
}
}
}'
curl -XPUT localhost:9200/test/test/1 -d '{
"body" : "anime", "entities" : { "hashtags" : "gf_anime" }
}'
curl -XPUT localhost:9200/test/test/2 -d '{
"body" : "anime", "entities" : { "hashtags" : ["GF", "gf_anime"] }
}'
curl -XPUT localhost:9200/test/test/3 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf_whatever", "gf_anime"] }
}'
With the above data indexed, your query only returns document 2 (note: this is simplified version of your query without the unnecessary/undesirable and filter; at least for the time being, you should always use the bool filter rather than and/or as it understands how to use the filter caches):
curl -XGET localhost:9200/test/_search
{
"query": {
"filtered": {
"filter": {
"term": {
"entities.hashtags": "gf"
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
}
}
Where this breaks down is when you start putting in hashtag values that will be split into multiple tokens, thereby triggering false hits with the term filter. You can determine how the field's analyzer will treat any value by passing it to the _analyze endpoint and telling it the field to use the analyzer from:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf_anime'
{
"tokens" : [ {
"token" : "gf_anime",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
# Note the space instead of the underscore:
curl -XGET localhost:9200/test/_analyze?field=entities.hashtags\&pretty -d 'gf anime'
{
"tokens" : [ {
"token" : "gf",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 1
}, {
"token" : "anime",
"start_offset" : 3,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
If you were to add a fourth document with the "gf anime" variant, then you will get a false hit.
curl -XPUT localhost:9200/test/test/4 -d '{
"body" : "anime", "entities" : { "hashtags" : ["gf whatever", "gf anime"] }
}'
This is really not an indexing problem, but a bad data problem.
With all of the explanation out of the way, you can inefficiently solve this by using a script that always follows the term filter (to efficiently rule out the more common cases that don't hit it):
curl -XGET localhost:9200/test/_search
{
"query": {
"filtered": {
"filter": {
"bool" : {
"must" : [{
"term" : {
"entities.hashtags" : "gf"
}
},
{
"script" : {
"script" :
"_source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null",
"params" : {
"tag" : "gf"
}
}
}]
}
},
"query": {
"match_phrase": {
"body": "anime"
}
}
}
}
}
This works by parsing the original the _source (and not using the indexed doc values). That is why this is not going to be very efficient, but it will work until you reindex. The _source.entities.hashtags == tag portion is only necessary if hashtags is not always an array (in my example, document 1 would not be an array). If it is always an array, then you can use _source.entities.hashtags.contains(tag) instead of _source.entities.hashtags == tag || _source.entities.hashtags.find { it == tag } != null.
Note: The script language is Groovy, which is the default starting in 1.4.0. It is not the default in earlier versions, and it must be explicitly enabled using script.default_lang : groovy.

Favor exact matches over nGram in elasticsearch

I am trying to map a field as nGram and 'exact' match, and make the exact matches appear first in the search results. This is an answer to a similar question, but I am struggling to make it work.
No matter what boost value I specify for the 'exact' field I get the same results order each time. This is how my field mapping looks:
"name" : {
"type" : "multi_field",
"fields" : {
"name" : {
"type" : "string",
"boost" : 2.0,
"analyzer" : "ngram"
},
"exact" : {
"type" : "string",
"boost" : 4.0,
"analyzer" : "simple",
"include_in_all" : false
}
}
}
And this is how the query looks like:
{
"query": {
"filtered": {
"query": {
"query_string": {
"fields":["name","name.exact"],
"query":"Woods"
}
}
}
}
}
Understating how score is calculated
Elasticsearch has an option for producing an explanation with every search result. by setting the explain parameter to be true
POST <Index>/<Type>/_search?explain&format=yaml
{
"query" : " ....."
}
it will produce a lot of output for every hit and that can be overwhelming, but it worth taking some time to understand what it all means
the output of eplian might be harder to read in json, so adding format=yaml makes it easier to read
Understanding why a document is matched or not
you can pass the query to a specific document like below to see explanation how matching is being done.
GET <Index>/<type>/<id>/_explain
{
"query": "....."
}
The multi_field mapping is correct, but the search query needs to be changed like this:
{
"query": {
"filtered": {
"query": {
"multi_match": { # changed from "query_string"
"fields": ["name","name.exact"],
"query": "Woods",
# added this so the engine does a "sum of" instead of a "max of"
# this is deprecated in the latest versions but works with 0.x
"use_dis_max": false
}
}
}
}
}
Now the results take into account the 'exact' match and adds up to the score.

How to search for a term and match a boolean condition

I've had good success getting results for searches using the below syntax, but I'm having trouble adding a boolean condition.
http://localhost:9200/index_name/type_name/_search?q=test
My documents look like:
{
"isbn":"9780307414922",
"name":"Dark of the Night",
"adult":false
}
Here's my best guess as to how to achieve what I'm trying to do.
{
"query_string": {
"default_field": "_all",
"query": "test"
},
"from": 0,
"size": 20,
"terms": {
"adult": true
}
}
However this results in "Parse Failure [No parser for element [query_string]]]; }]"
I'm using elastic search 0.20.5.
How can I match documents containing a search term the way "?q=test" does and filter by the document's adult property?
Thanks in advance.
Your adult == true clause has to be part of the query - you can't pass in a term clause as a top level parameter to search.
So you could add it to the query as a query clause, in which case you need to join both query clauses using a bool query, as follows:
curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
"query" : {
"bool" : {
"must" : [
{
"query_string" : {
"query" : "test"
}
},
{
"term" : {
"adult" : true
}
}
]
}
},
"from" : 0,
"size" : 20
}
'
Really, though, query clauses should be used for:
full text search
clauses which affect the relevance score
However, your adult == true clause is not being used to change the relevance, and it doesn't involve full text search. It's more of a yes/no response, in other words it is better applied as a filter clause.
This means that you need to wrap your full text query (_all contains test) in a query clause which accepts both a query and a filter: the filtered query:
curl -XGET 'http://127.0.0.1:9200/_all/_search?pretty=1' -d '
{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"adult" : true
}
},
"query" : {
"query_string" : {
"query" : "test"
}
}
}
},
"from" : 0,
"size" : 20
}
'
Filters are usually faster because:
they don't have to score documents, just include or exclude them
they can be cached and reused

Resources