ElasticSearch - Statistical facets on length of the list - elasticsearch

I have the following sample mapping:
{
"book" : {
"properties" : {
"author" : { "type" : "string" },
"title" : { "type" : "string" },
"reviews" : {
"properties" : {
"url" : { "type" : "string" },
"score" : { "type" : "integer" }
}
},
"chapters" : {
"include_in_root" : 1,
"type" : "nested",
"properties" : {
"name" : { "type" : "string" }
}
}
}
}
}
I would like to get a facet on number of reviews - i.e. length of the "reviews" array.
For instance, in words, the results I need are: "100 documents with 10 reviews, 20 documents with 5 reviews, ..."
I'm trying the following statistical facet:
{
"query" : {
"match_all" : {}
},
"facets" : {
"stat1" : {
"statistical" : {"script" : "doc['reviews.score'].values.size()"}
}
}
}
but it keeps failing with:
{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], total failure; shardFailures {[mDsNfjLhRIyPObaOcxQo2w][facettest][0]: QueryPhaseExecutionException[[facettest][0]: query[ConstantScore(NotDeleted(cache(org.elasticsearch.index.search.nested.NonNestedDocsFilter#a2a5984b)))],from[0],size[10]: Query Failed [Failed to execute main query]]; nested: PropertyAccessException[[Error: could not access: reviews; in class: org.elasticsearch.search.lookup.DocLookup]
[Near : {... doc[reviews.score].values.size() ....}]
^
[Line: 1, Column: 5]]; }]",
"status" : 500
}
How can I achieve my goal?
ElasticSearch version is 0.19.9.
Here is my sample data:
{
"author" : "Mark Twain",
"title" : "The Adventures of Tom Sawyer",
"reviews" : [
{
"url" : "amazon.com",
"score" : 10
},
{
"url" : "www.barnesandnoble.com",
"score" : 9
}
],
"chapters" : [
{ "name" : "Chapter 1" }, { "name" : "Chapter 2" }
]
}
{
"author" : "Jack London",
"title" : "The Call of the Wild",
"reviews" : [
{
"url" : "amazon.com",
"score" : 8
},
{
"url" : "www.barnesandnoble.com",
"score" : 9
},
{
"url" : "www.books.com",
"score" : 5
}
],
"chapters" : [
{ "name" : "Chapter 1" }, { "name" : "Chapter 2" }
]
}

It looks like you are using curl to execute your query and this curl statement looks like this:
curl localhost:9200/my-index/book -d '{....}'
The problem is that because the request body is wrapped in apostrophes, every apostrophe inside it has to be escaped. So your script should become:
{"script" : "doc['\''reviews.score'\''].values.size()"}
or
{"script" : "doc[\"reviews.score\"].values.size()"}
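One way to sidestep manual shell escaping is to build the request body programmatically and let a library do the quoting. The following is a minimal sketch in Python using only the standard library; the variable names are illustrative, and the URL is the same guess used in the curl statement above, not something confirmed by the question.

```python
import json
import shlex

# The script exactly as Elasticsearch should receive it, with no shell escaping.
script = "doc['reviews.score'].values.size()"

# json.dumps takes care of JSON-level escaping.
body = json.dumps({
    "query": {"match_all": {}},
    "facets": {"stat1": {"statistical": {"script": script}}},
})

# shlex.quote turns the body into one shell-safe argument,
# escaping the apostrophes it contains.
quoted = shlex.quote(body)
print("curl localhost:9200/my-index/book -d " + quoted)
```

The printed command can be pasted into a POSIX shell as is; the apostrophes around the field name arrive at the server unchanged.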
The second issue is that, from your description, it looks like you are looking for a histogram facet or a range facet rather than a statistical facet. So I would suggest trying something like this:
curl "localhost:9200/test-idx/book/_search?search_type=count&pretty" -d '{
"query" : {
"match_all" : {}
},
"facets" : {
"histo1" : {
"histogram" : {
"key_script" : "doc[\"reviews.score\"].values.size()",
"value_script" : "doc[\"reviews.score\"].values.size()",
"interval" : 1
}
}
}
}'
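To make the intended output concrete, here is a small Python sketch that computes the same kind of histogram ("how many documents have N reviews") over the two sample documents from the question. It only illustrates the result the facet is meant to produce; it does not model how Elasticsearch stores or counts field values, and the function name is made up for this example.

```python
from collections import Counter

# The two sample documents from the question, reduced to the relevant field.
books = [
    {"title": "The Adventures of Tom Sawyer",
     "reviews": [{"score": 10}, {"score": 9}]},
    {"title": "The Call of the Wild",
     "reviews": [{"score": 8}, {"score": 9}, {"score": 5}]},
]

def review_count_histogram(docs):
    """Count how many documents have each number of reviews."""
    return Counter(len(doc.get("reviews", [])) for doc in docs)

histogram = review_count_histogram(books)
for count, num_docs in sorted(histogram.items()):
    print(f"{num_docs} document(s) with {count} review(s)")
```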
The third problem is that the script in the facet is called for every single record in the result set, so with a lot of results it might take a really long time. I would therefore suggest indexing an additional field called number_of_reviews, populated by your client with the number of reviews. Then your query would simply become:
curl "localhost:9200/test-idx/book/_search?search_type=count&pretty" -d '{
"query" : {
"match_all" : {}
},
"facets" : {
"histo1" : {
"histogram" : {
"field" : "number_of_reviews",
"interval" : 1
}
}
}
}'
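Populating number_of_reviews on the client can be as simple as counting the array before the document is sent. A minimal sketch in Python, assuming the documents are plain dictionaries shaped like the sample data in the question; the helper name with_review_count is hypothetical.

```python
import json

def with_review_count(book):
    """Return a copy of the document with number_of_reviews filled in."""
    enriched = dict(book)
    enriched["number_of_reviews"] = len(book.get("reviews", []))
    return enriched

book = {
    "author": "Mark Twain",
    "title": "The Adventures of Tom Sawyer",
    "reviews": [
        {"url": "amazon.com", "score": 10},
        {"url": "www.barnesandnoble.com", "score": 9},
    ],
}

# This JSON is what the client would send to the index API.
print(json.dumps(with_review_count(book), indent=2))
```

The count has to be recomputed whenever reviews are added or removed, so it is easiest to apply the helper in the one place where documents are written.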

Related

ElasticSearch, simple two fields comparison with painless

I'm trying to run a query such as SELECT * FROM indexPeople WHERE info.Age > info.AgeExpectancy
Note that the two fields are NOT nested; they are just properties of a plain JSON object.
POST /indexPeople/_search
{
"from" : 0,
"size" : 200,
"query" : {
"bool" : {
"filter" : [
{
"bool" : {
"must" : [
{
"script" : {
"script" : {
"source" : "doc['info.Age'].value > doc['info.AgeExpectancy'].value",
"lang" : "painless"
},
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
}
],
"adjust_pure_negative" : true,
"boost" : 1.0
}
},
"_source" : {
"includes" : [
"info"
],
"excludes" : [ ]
}
}
However this query fails as
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.get(ScriptDocValues.java:121)",
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.getValue(ScriptDocValues.java:115)",
"doc['info.Age'].value > doc['info.AgeExpectancy'].value",
" ^---- HERE"
],
"script" : "doc['info.Age'].value > doc['info.AgeExpectancy'].value",
"lang" : "painless",
"position" : {
"offset" : 22,
"start" : 0,
"end" : 70
}
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "indexPeople",
"node" : "c_Dv3IrlQmyvIVpLoR9qVA",
"reason" : {
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.get(ScriptDocValues.java:121)",
"org.elasticsearch.index.fielddata.ScriptDocValues$Longs.getValue(ScriptDocValues.java:115)",
"doc['info.Age'].value > doc['info.AgeExpectancy'].value",
" ^---- HERE"
],
"script" : "doc['info.Age'].value > doc['info.AgeExpectancy'].value",
"lang" : "painless",
"position" : {
"offset" : 22,
"start" : 0,
"end" : 70
},
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "A document doesn't have a value for a field! Use doc[<field>].size()==0 to check if a document is missing a field!"
}
}
}
]
},
"status" : 400
}
Is there a way to achieve this?
What is the best way to debug it? I wanted to print the objects or look at the logs (which aren't there), but I couldn't find a way to do either.
The mapping is:
{
"mappings": {
"_doc": {
"properties": {
"info": {
"properties": {
"Age": {
"type": "long"
},
"AgeExpectancy": {
"type": "long"
}
}
}
}
}
}
}
Perhaps you have already solved the issue. The reason the query failed is clear:
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "A document doesn't have a value for a field! Use doc[<field>].size()==0 to check if a document is missing a field!"
}
Basically, one or more documents lack one of the queried fields. You can get the result you need by using an if to check that both fields exist and returning false when they do not:
{
"script": """
if (doc['info.Age'].size() > 0 && doc['info.AgeExpectancy'].size() > 0) {
return doc['info.Age'].value > doc['info.AgeExpectancy'].value;
}
return false;
"""
}
I tested it with an Elasticsearch 7.10.2 and it works.
What is the best way to debug it?
That is a tough question; perhaps someone has a better answer for it. I'll try to list some options. Obviously, debugging requires reading the error messages carefully.
PAINLESS LAB
If you have a pretty recent version of Kibana, you can try to use the painless lab to simulate your documents and get the errors quicker and in a more focused environment.
KIBANA Scripted Field
You can try to create a boolean scripted field named condition in the index pattern. Before clicking create, remember to click "preview result".
MINIMAL EXAMPLE
Create a minimal example to reduce the complexity.
For this answer I used a sample index with four documents with all possible cases.
No info: { "message": "ok"}
Info.Age but not AgeExpectancy: {"message":"ok","info":{"Age":14}}
Info.AgeExpectancy but not Age: {"message":"ok","info":{"AgeExpectancy":12}}
Info.Age and AgeExpectancy: {"message":"ok","info":{"Age":14, "AgeExpectancy": 12}}
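The behavior expected from the guarded script on these four documents can be checked with a small Python sketch. It mirrors the logic of the Painless script on plain dictionaries; it is only a model of the condition, not how Elasticsearch evaluates scripts, and the function name is made up for illustration.

```python
def age_exceeds_expectancy(doc):
    """Mirror the guarded Painless script on a plain dict."""
    info = doc.get("info", {})
    # Equivalent of doc['info.Age'].size() > 0 && doc['info.AgeExpectancy'].size() > 0
    if "Age" in info and "AgeExpectancy" in info:
        return info["Age"] > info["AgeExpectancy"]
    # A missing field means the document cannot match.
    return False

samples = [
    {"message": "ok"},
    {"message": "ok", "info": {"Age": 14}},
    {"message": "ok", "info": {"AgeExpectancy": 12}},
    {"message": "ok", "info": {"Age": 14, "AgeExpectancy": 12}},
]

matches = [age_exceeds_expectancy(doc) for doc in samples]
print(matches)  # [False, False, False, True]
```

Only the last document has both fields, so it is the only one the query should return.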

Partial update overwriting whole structure

I'm indexing a new document with the following content
{
"lastUpdate" : "20180114144020452",
"name" : "My Process",
"startDate" : "20180114162356585",
"endData" : "",
"tasks" : [
{
"1" : {
"lastUpdate" : "20180114144020452",
"taskId" : "123",
"subject" : "Terceira Atividade",
"status" : "Active",
"type" : "userTask",
"assign" : [
{
"date" : "20180114144020452",
"type" : "role",
"name" : "Time 3",
"id" : "Team3_345"
}
],
"receivedDate" : "",
"readDate" : "",
"finishDate" : ""
}
}
]
}
Then I try to change the tasks.1.status value with the following update content
{
"doc" : {
"tasks" : [
{
"1" : {
"status" : "Closed"
}
}
]
}
}
But it overwrites the whole tasks.1 structure: the other values are deleted and only status, set to Closed, remains, instead of the other values being kept and only status changing.
How can I solve this? Thanks
You need to do it via a scripted partial update, like this:
POST updates/update/1/_update
{
"script": {
"source": "ctx._source.tasks[0].1.status = 'Closed'"
}
}
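The overwrite happens because a partial doc update merges objects key by key but replaces any other value, arrays included, wholesale, and tasks is an array. The sketch below models that rule in Python on trimmed-down versions of the documents from the question. It is an illustration of the merge semantics, not Elasticsearch's actual code, and merge_partial is a hypothetical name.

```python
def merge_partial(source, partial):
    """Merge objects recursively; any other value, arrays included, is replaced."""
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_partial(merged[key], value)
        else:
            merged[key] = value
    return merged

stored = {
    "name": "My Process",
    "tasks": [{"1": {"taskId": "123", "status": "Active"}}],
}
update = {"tasks": [{"1": {"status": "Closed"}}]}

result = merge_partial(stored, update)
print(result["tasks"])  # [{'1': {'status': 'Closed'}}]
print(result["name"])   # My Process
```

The top-level name survives because it is not part of the update, while the whole tasks array is swapped for the one in the request. A script that assigns to a single path avoids this because it never replaces the array.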

Double wildcard in query causes weird highlighting for plain/fast vectors elasticsearch highlighters

I'm working on elasticsearch 1.5.2
After indexing following mapping:
PUT http://localhost:9200/index/_mapping/sometype
{
"properties" : {
"sometext" : {
"type" : "string",
"term_vector" : "with_positions_offsets"
}
}
}
and data:
POST http://localhost:9200/index/sometype
{
"sometext" : "A supervisor is responsible for the productivity and actions of a small group of employees. The supervisor has several manager-like roles, responsibilities, and powers. Two of the key differences between a supervisor and a manager are (1) the supervisor does not typically have hire and fire authority, and (2) the supervisor does not have budget authority."
}
a user tried to find all documents, but typed a double wildcard instead of a single one:
POST http://localhost:9200/index/sometype/_search
{
"query" : {
"query_string" : {
"query" : "**",
"fields" : ["sometext"]
}
},
"highlight" : {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"order" : "score",
"require_field_match" : true,
"fields" : {
"sometext" : {
"fragment_size" : 150,
"number_of_fragments" : 1
}
}
}
}
and got following highlight:
"highlight" : {
"sometext" : ["responsibilities, <em>and</em> <em>powers</em>. <em>Two</em> <em>of</em> <em>the</em> <em>key</em> <em>differences</em> <em>between</em> <em>a</em> <em>supervisor</em> <em>and</em> <em>a</em> <em>manager</em> <em>are</em> (<em>1</em>) <em>the</em> <em>supervisor</em> <em>does</em> <em>not</em> <em>typically</em> <em>have</em> <em>hire</em> <em>and</em> <em>fire</em> <em>authority</em>, and"]
}
The query *? produces the same highlighting results.
But when the query consists of just a single asterisk, the highlighter returns nothing.
With the plain highlighter (I just added "type" : "plain" to highlight) the result looks a bit different (but still weird):
"highlight" : {
"sometext" : [", <em>responsibilities</em>, <em>and</em> <em>powers</em>. <em>Two</em> <em>of</em> <em>the</em> <em>key</em> <em>differences</em> <em>between</em> <em>a</em> <em>supervisor</em> <em>and</em> <em>a</em> <em>manager</em> <em>are</em> (<em>1</em>) <em>the</em> <em>supervisor</em> <em>does</em> <em>not</em> <em>typically</em> <em>have</em> <em>hire</em> <em>and</em> <em>fire</em> <em>authority</em>, <em>and</em> (<em>2</em>) <em>the</em> <em>supervisor</em> <em>does</em> <em>not</em> <em>have</em> <em>budget</em> <em>authority</em>."]
}
Does anybody know the reason for this behavior?
Maybe queries like ** and *? have some special meaning?
Thanks a lot.
Answered on the Elasticsearch forum:
https://discuss.elastic.co/t/double-wildcard-in-string-query-causes-incorrect-highlighting-for-plain-and-fast-vectors-highlighters/45939
POST /index/sometype/_search
{
"query" : {
"query_string" : {
"query" : "**",
"fields" : ["sometext"]
}
},
"highlight" : {
"pre_tags" : ["<em>"],
"post_tags" : ["</em>"],
"order" : "score",
"require_field_match" : true,
"fields" : {
"sometext" : {
"fragment_size" : 180,
"number_of_fragments" : 1
}
}
}
}
We can use this query.

Specify Routing on Index Alias's Term Lookup Filter

I am using Logstash, ElasticSearch and Kibana to allow multiple users to log in and view the log data they have forwarded. I have created index aliases for each user. These restrict their results to contain only their own data.
I'd like to assign users to groups, and allow users to view data for the computers in their group. I created a parent-child relationship between the groups and the users, and I created a term lookup filter on the alias.
My problem is, I receive a RoutingMissingException when I try to apply the alias.
Is there a way to specify the routing for the term lookup filter? How can I lookup terms on a parent document?
I posted the mapping and alias below, but a full gist recreation is available at this link.
curl -XPUT 'http://localhost:9200/accesscontrol/' -d '{
"mappings" : {
"group" : {
"properties" : {
"name" : { "type" : "string" },
"hosts" : { "type" : "string" }
}
},
"user" : {
"_parent" : { "type" : "group" },
"_routing" : { "required" : true, "path" : "group_id" },
"properties" : {
"name" : { "type" : "string" },
"group_id" : { "type" : "string" }
}
}
}
}'
# Create the logstash alias for cvializ
curl -XPOST 'http://localhost:9200/_aliases' -d '
{
"actions" : [
{ "remove" : { "index" : "logstash-2014.04.25", "alias" : "cvializ-logstash-2014.04.25" } },
{
"add" : {
"index" : "logstash-2014.04.25",
"alias" : "cvializ-logstash-2014.04.25",
"routing" : "intern",
"filter": {
"terms" : {
"host" : {
"index" : "accesscontrol",
"type" : "user",
"id" : "cvializ",
"path" : "group.hosts"
},
"_cache_key" : "cvializ_hosts"
}
}
}
}
]
}'
In attempting to find a workaround for this error, I submitted a bug to the ElasticSearch team, and received an answer from them. It was a bug in ElasticSearch where the filter is applied before the dynamic mapping, causing some erroneous output. I've included their workaround below:
PUT /accesscontrol/group/admin
{
"name" : "admin",
"hosts" : ["computer1","computer2","computer3"]
}
PUT /_template/admin_group
{
"template" : "logstash-*",
"aliases" : {
"template-admin-{index}" : {
"filter" : {
"terms" : {
"host" : {
"index" : "accesscontrol",
"type" : "group",
"id" : "admin",
"path" : "hosts"
}
}
}
}
},
"mappings": {
"example" : {
"properties": {
"host" : {
"type" : "string"
}
}
}
}
}
POST /logstash-2014.05.09/example/1
{
"message":"my sample data",
"@version":"1",
"@timestamp":"2014-05-09T16:25:45.613Z",
"type":"example",
"host":"computer1"
}
GET /template-admin-logstash-2014.05.09/_search
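Conceptually, the terms lookup in the alias filter reads a list of terms from another document (here the hosts field of the admin group) and keeps only the log events whose host appears in that list. The Python sketch below imitates that with in-memory dictionaries standing in for the indices. The helper name and the second event are made up for illustration, and nothing here calls Elasticsearch.

```python
# Stand-in for the accesscontrol index: (type, id) -> document.
accesscontrol = {
    ("group", "admin"): {
        "name": "admin",
        "hosts": ["computer1", "computer2", "computer3"],
    },
}

def terms_lookup_filter(events, field, lookup_type, lookup_id, path):
    """Keep events whose `field` value appears in the looked-up term list."""
    terms = set(accesscontrol[(lookup_type, lookup_id)][path])
    return [event for event in events if event.get(field) in terms]

events = [
    {"message": "my sample data", "host": "computer1"},
    # Invented event from a host outside the admin group.
    {"message": "other data", "host": "computer9"},
]

visible = terms_lookup_filter(events, "host", "group", "admin", "hosts")
print(visible)
```

Searching through the alias behaves like the filtered list: events from hosts outside the group's list are simply not visible.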

ElasticSearch query/search/match

I have inserted 3 records in my ElasticSearch index as follows:
curl -XPOST 'http://127.0.0.1:9200/geoindex_test/STREET?pretty=1' -d '
{ "cityNames" : [ { "language" : "ENG",
"name" : "w bridgewater",
"raw_name" : "W BRIDGEWATER"
},
{ "language" : "ENG",
"name" : "west bridgewater",
"raw_name" : "West Bridgewater"
}
],
"id" : 1,
"streetNames" : [ { "language" : "ENG",
"name" : "cram rd",
"raw_name" : "Cram Rd"
} ]
}'
curl -XPOST 'http://127.0.0.1:9200/geoindex_test/STREET?pretty=1' -d '
{ "cityNames" : [ { "language" : "ENG",
"name" : "bridgewater corners",
"raw_name" : "BRIDGEWATER CORNERS"
},
{ "language" : "ENG",
"name" : "bridgewater center",
"raw_name" : "Bridgewater Center"
}
],
"id" : 2,
"streetNames" : [ { "language" : "ENG",
"name" : "valley view rd",
"raw_name" : "Valley View Rd"
} ]
}'
curl -XPOST 'http://127.0.0.1:9200/geoindex_test/STREET?pretty=1' -d '
{ "cityNames" : [ { "language" : "ENG",
"name" : "bridgewater",
"raw_name" : "Bridgewater"
},
{ "language" : "ENG",
"name" : "windsor",
"raw_name" : "Windsor"
}
],
"id" : 3,
"streetNames" : [ { "language" : "ENG",
"name" : "valley view rd",
"raw_name" : "Valley View Rd"
} ]
}'
And I perform a search as follows:
curl -XGET 'http://127.0.0.1:9200/geoindex_test/STREET/_search?pretty=1' -d '
{
"query" : {
"match" : { "cityNames.name" : "bridgewater" }
}
}'
I thought ElasticSearch would return the third record (id == 3) as the best match (record 3 is the only exact match to "bridgewater"), but instead it returns the record for id 1 (w bridgewater) as the best match. What am I doing wrong?
I imagine this is happening because you are using inner objects, which are basically collapsed into a single object for search purposes. So when you query the name field for Object 1, for example, you're querying against ["w bridgewater", "west bridgewater"] and not against discrete fields as you might imagine.
Since 'bridgewater' appears twice in object 1 and 2 (two name fields) vs once in object 3, those items rank higher in the search. Object 1 is ultimately picked, because the fields that 'bridgewater' appears in are shorter strings than in Object 2 ("w bridgewater" vs "bridgewater corners").
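The collapsing can be pictured with a short Python sketch that flattens an array of inner objects into one multi-valued entry per sub-field. This is a simplified mental model of the effect, not Elasticsearch's internal representation, and the function name is hypothetical.

```python
def flatten_inner_objects(doc, field):
    """Collapse an array of inner objects into one multi-valued entry per sub-field."""
    flattened = {}
    for obj in doc.get(field, []):
        for key, value in obj.items():
            flattened.setdefault(f"{field}.{key}", []).append(value)
    return flattened

record_1 = {
    "id": 1,
    "cityNames": [
        {"language": "ENG", "name": "w bridgewater", "raw_name": "W BRIDGEWATER"},
        {"language": "ENG", "name": "west bridgewater", "raw_name": "West Bridgewater"},
    ],
}

flat = flatten_inner_objects(record_1, "cityNames")
print(flat["cityNames.name"])  # ['w bridgewater', 'west bridgewater']
```

After flattening, the match query sees both names as values of a single cityNames.name field, which is why "bridgewater" counts twice for record 1.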
Instead of using inner objects like you're doing, use nested objects: http://www.elasticsearch.org/guide/reference/mapping/nested-type/. Setting the score mode to "max" will then make things match in a more intuitive manner for you.
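As a rough sketch of what that could look like, the Python snippet below builds a mapping that declares cityNames as nested, plus a nested query with score_mode set to "max". The field types follow the string type used elsewhere on this page and are an assumption about your data; the index would also need to be recreated with this mapping and the documents reindexed.

```python
import json

# Mapping that declares cityNames as nested instead of a plain inner object.
mapping = {
    "STREET": {
        "properties": {
            "cityNames": {
                "type": "nested",
                "properties": {
                    "language": {"type": "string"},
                    "name": {"type": "string"},
                    "raw_name": {"type": "string"},
                },
            }
        }
    }
}

# Nested query: each cityNames object is scored on its own,
# and the best-scoring one determines the document's score.
query = {
    "query": {
        "nested": {
            "path": "cityNames",
            "score_mode": "max",
            "query": {"match": {"cityNames.name": "bridgewater"}},
        }
    }
}

print(json.dumps(mapping, indent=2))
print(json.dumps(query, indent=2))
```

With "max", a document is ranked by its single best-matching city name rather than by how many of its names contain the term.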
