Syntax for function_score in Elasticsearch

I am new to Elasticsearch. I have indexed movie artists (actors and directors) in Elasticsearch, and a simple text search works fine. For example, if I search for 'steven' with the following syntax:
{"query":
{"query_string":
{"query":"steven"}
}
}
... I get the following results, which are fine:
1. Steven Conrad - Popularity (from document) = 487 - elasticsearch _score = 3.2589545
2. Steven Knight - Popularity (from document) = 487 - elasticsearch _score = 3.076738
3. Steven Waddington - Popularity (from document) = 431 - elasticsearch _score = 2.4931839
4. Steven E. De Souza - Popularity (from document) = 534 - elasticsearch _score = 2.4613905
5. Steven R. Monroe - Popularity (from document) = 293 - elasticsearch _score = 2.4613905
6. Steven Mackintosh - Popularity (from document) = 363 - elasticsearch _score = 2.2812681
7. Steven Wright - Popularity (from document) = 356 - elasticsearch _score = 2.2812681
8. Steven Soderbergh - Popularity (from document) = 5947 - elasticsearch _score = 2.270944
9. Steven Seagal - Popularity (from document) = 1388 - elasticsearch _score = 2.270944
10. Steven Bauer - Popularity (from document) = 714 - elasticsearch _score = 2.270944
However, as you can see above, I have a numeric popularity field in my documents, and when searching for 'steven' I would like the most popular artists (Steven Soderbergh, Steven Seagal, ...) to come first.
Ideally, I'd like to sort the results above by popularity * _score.
I am pretty sure I have to use the function_score feature of Elasticsearch, but I can't figure out the exact syntax.
I've tried to do my "improved" search with the following syntax:
{
  "query": {
    "custom_score": {
      "query": {
        "query_string": {
          "query": "steven"
        }
      },
      "script": "_score * doc['popularity']"
    }
  }
}
But I get an exception (extract from the error message below):
org.elasticsearch.search.query.QueryPhaseExecutionException: [my_index][4]: query[filtered(function score (_all:steven,function=script[_score * doc['popularity']], params [null]))->cache(_type:artist)],from[0],size[10]: Query Failed [Failed to execute main query]
// ....
Caused by: java.lang.RuntimeException: uncomparable values <<1.9709579>> and <<org.elasticsearch.index.fielddata.ScriptDocValues$Longs@7c5b73bc>>
// ...
... 9 more
Caused by: java.lang.ClassCastException: org.elasticsearch.index.fielddata.ScriptDocValues$Longs cannot be cast to java.lang.Float
at java.lang.Float.compareTo(Float.java:33)
at org.elasticsearch.common.mvel2.math.MathProcessor.doOperationNonNumeric(MathProcessor.java:266)
I have the impression that the syntax I am using is incorrect.
What should be the right syntax? Or is there something else that I am missing? Thanks a lot in advance.
Edit
The mapping for my artist type is defined as follows:
"mappings" : {
"artist" : {
"_all" : {
"auto_boost" : true
},
"properties" : {
"first_name" : {
"type" : "string",
"index" : "not_analyzed",
"analyzer" : "standard"
},
"last_name" : {
"type" : "string",
"boost" : 2.0,
"index" : "not_analyzed",
"norms" : {
"enabled" : true
},
"analyzer" : "standard"
},
"popularity" : {
"type" : "integer"
}
}
}
}

Have you missed the .value after doc['...']?
This works for me (I stored the integers without an explicit mapping):
$ curl -XPUT localhost:9200/test/test/a -d '{"name":"steven", "popularity": 666}'
{"_index":"test","_type":"test","_id":"a","_version":1,"created":true}
$ curl -XPUT localhost:9200/test/test/b -d '{"name":"steven", "popularity": 42}'
{"_index":"test","_type":"test","_id":"b","_version":1,"created":true}
$ curl -XPOST localhost:9200/test/test/_search\?pretty -d '{ "query": { "custom_score": { "query": { "match_all": {}}, "script": "_score * doc[\"popularity\"].value" } } }'
{
  "took" : 83,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 666.0,
    "hits" : [ {
      "_index" : "test",
      "_type" : "test",
      "_id" : "a",
      "_score" : 666.0, "_source" : {"name":"steven", "popularity": 666}
    }, {
      "_index" : "test",
      "_type" : "test",
      "_id" : "b",
      "_score" : 42.0, "_source" : {"name":"steven", "popularity": 42}
    } ]
  }
}
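If you are on a version where custom_score has already been removed in favour of function_score, the equivalent query should look roughly like the sketch below (same popularity field, not tested here):
{
  "query": {
    "function_score": {
      "query": {
        "query_string": { "query": "steven" }
      },
      "script_score": {
        "script": "doc['popularity'].value"
      },
      "boost_mode": "multiply"
    }
  }
}
With boost_mode set to multiply (which is also the default), the final score is the query _score multiplied by the script result, i.e. _score * popularity.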

Related

dis_max query isn't looking for the best matching clause

I'm testing the dis_max query with the two documents below:
PUT /blog/post/1
{
  "title": "Quick brown rabbits",
  "body": "Brown rabbits are commonly seen."
}
PUT /blog/post/2
{
  "title": "Keeping pets healthy",
  "body": "My quick brown fox eats rabbits on a regular basis."
}
This example is taken from the book "Elasticsearch: The Definitive Guide", which explains that the query below should return equal _score values for both documents.
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ]
    }
  }
}
But, as you can see, the result of the query shows different _score values.
{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.02250402,
    "hits" : [ {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "2",
      "_score" : 0.02250402,
      "_source" : {
        "title" : "Keeping pets healthy",
        "body" : "My quick brown fox eats rabbits on a regular basis."
      }
    }, {
      "_index" : "blog",
      "_type" : "post",
      "_id" : "1",
      "_score" : 0.016645055,
      "_source" : {
        "title" : "Quick brown rabbits",
        "body" : "Brown rabbits are commonly seen."
      }
    } ]
  }
}
Elasticsearch is not returning the _score from the best-matching clause but is, somehow, blending the results. How can I fix this?
I've got the answer.
This confusing behavior happens because the index used in the example has 5 shards (the default number of shards). The _score is not calculated against the index as a whole but within each individual shard (so term statistics are per shard), and the per-shard results are then merged before the user gets the answer.
This is not an issue when you have a huge number of documents, which is not my case.
So, to test my thesis, I deleted my index:
DELETE /blog
And then, created a new index using only 1 shard:
PUT /blog
{ "settings" : { "number_of_shards" : 1 } }
So, I performed my query again and got both documents with the same _score: 0.12713557
Sweet =)
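If you prefer not to reindex into a single shard, another option for this kind of small test is to ask Elasticsearch to compute term statistics across all shards before scoring, using the dfs_query_then_fetch search type. A sketch of the same query with that search type:
GET /blog/post/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title": "Quick pets" }},
        { "match": { "body": "Quick pets" }}
      ]
    }
  }
}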

Elasticsearch: more_like_this query returns no hits

I am trying to find documents in my Elasticsearch sandbox that are similar to one document (the document with id '4' in this case), based on a field (the 'town' field in this case).
So I wrote this query, which returns no hits:
GET _search
{
  "query": {
    "more_like_this" : {
      "fields" : ["town"],
      "docs" : [
        {
          "_index" : "app",
          "_type" : "house",
          "_id" : "4"
        }
      ],
      "min_term_freq" : 1,
      "max_query_terms" : 12
    }
  }
}
In my dataset, document #4 is located in a town named 'Paris'. Thus, when I run the following query, document #4 appears in the hits along with a lot of other results:
GET _search
{
  "query": {
    "match": { "town": "Paris" }
  }
}
I don't understand why the more_like_this query does not return any results, whereas there are other documents that have a field with the same value.
Note that I checked the _index, _type and _id parameters using a "match_all": {} query.
My query looks like the second example of this official Elasticsearch resource: http://www.elastic.co/guide/en/elasticsearch/reference/1.5/query-dsl-mlt-query.html
What's wrong with my more_like_this query?
I am assuming you have only a small number of documents.
In that case, try lowering min_doc_freq (the default is 5, which drops terms that appear in fewer than 5 documents) and run the query again.
Also, use POST for the search:
POST _search
{
  "query": {
    "more_like_this" : {
      "fields" : ["town"],
      "docs" : [
        {
          "_index" : "app",
          "_type" : "house",
          "_id" : "4"
        }
      ],
      "min_term_freq" : 1,
      "max_query_terms" : 12,
      "min_doc_freq" : 1
    }
  }
}
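As a side note, if you are on Elasticsearch 2.x or newer, the docs parameter has been folded into like; the same query would then look roughly like this (a sketch, with the field and id taken from the question):
POST _search
{
  "query": {
    "more_like_this" : {
      "fields" : ["town"],
      "like" : [
        { "_index" : "app", "_type" : "house", "_id" : "4" }
      ],
      "min_term_freq" : 1,
      "min_doc_freq" : 1,
      "max_query_terms" : 12
    }
  }
}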

Elasticsearch index last update time

Is there a way to retrieve from Elasticsearch information on when a specific index was last updated?
My goal is to be able to tell when it was the last time that any documents were inserted/updated/deleted in the index. If this is not possible, is there something I can add in my index modification requests that will provide this information later on?
You can get the modification time from the _timestamp field.
To make it easier to return the timestamp, you can set up Elasticsearch to store it:
curl -XPUT "http://localhost:9200/myindex/mytype/_mapping" -d'
{
  "mytype": {
    "_timestamp": {
      "enabled": "true",
      "store": "yes"
    }
  }
}'
If I insert a document and then query on it I get the timestamp:
curl -XGET 'http://localhost:9200/myindex/mytype/_search?pretty' -d '{
  "fields": ["_timestamp"],
  "query": {
    "query_string": { "query": "*" }
  }
}'
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "myindex",
      "_type" : "mytype",
      "_id" : "1",
      "_score" : 1.0,
      "fields" : {
        "_timestamp" : 1417599223918
      }
    } ]
  }
}
Updating the existing document:
curl -XPOST "http://localhost:9200/myindex/mytype/1/_update" -d'
{
  "doc" : {
    "field1": "data",
    "field2": "more data"
  },
  "doc_as_upsert" : true
}'
Re-running the previous query shows me an updated timestamp:
"fields" : {
"_timestamp" : 1417599620167
}
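If what you want is the last update time of the index as a whole, one option (a sketch, assuming _timestamp is indexed, which it is by default once it is enabled) is to sort on it in descending order and take the top hit:
curl -XGET 'http://localhost:9200/myindex/mytype/_search?pretty' -d '{
  "size": 1,
  "fields": ["_timestamp"],
  "sort": [ { "_timestamp": { "order": "desc" } } ],
  "query": { "match_all": {} }
}'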
I don't know if people are still looking for an equivalent, but here is a workaround using the shard stats, for Elasticsearch 5+ users:
curl -XGET 'http://localhost:9200/_stats?level=shards'
As you'll see, you get some information per index about commits and/or flushes that you can use to see whether the index has changed (or not).
I hope it will help someone.
I just looked into a solution for this problem. Recent Elasticsearch versions have a <index>/_recovery API.
It returns a list of shards with a field called stop_time_in_millis, which looks like a timestamp of the last write to that shard.
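For example (reusing the myindex index from above):
curl -XGET 'http://localhost:9200/myindex/_recovery?pretty'
Each shard entry in the response carries its own stop_time_in_millis, so you would look at the most recent one.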

Elasticsearch river - no _meta document found after 5 attempts

I am using Elasticsearch version 1.3.0. When I create a river using the wikipedia plugin version 2.3.0 like this:
PUT _river/my_river/_meta -d
{
  "type" : "wikipedia",
  "wikipedia" : {
    "url" : "http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
  },
  "index" : {
    "index" : "wikipedia",
    "type" : "wiki",
    "bulk_size" : 1000,
    "max_concurrent_bulk" : 3
  }
}
the server responds with this message:
{
  "_index": "_river",
  "_type": "my_river",
  "_id": "_meta -d",
  "_version": 1,
  "created": true
}
However, I don't see the wikipedia documents when I run a search. Also, when I restart my server I get: river-routing no _meta document found after 5 attempts.
Remove the -d at the end, as it creates a document named _meta -d and not _meta:
PUT _river/my_river/_meta
{
  "type" : "wikipedia",
  "wikipedia" : {
    "url" : "http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
  },
  "index" : {
    "index" : "wikipedia",
    "type" : "wiki",
    "bulk_size" : 1000,
    "max_concurrent_bulk" : 3
  }
}
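A quick sanity check after creating the river is to fetch the meta document back and confirm that its _id is _meta (and not _meta -d):
GET _river/my_river/_meta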

Join query in Elasticsearch

Is there any way (a query) to join the 2 JSON documents below in Elasticsearch?
{
  "product_id": "1111",
  "price": "23.56",
  "stock": "100"
}
{
  "product_id": "1111",
  "category": "iPhone case",
  "manufacturer": "Belkin"
}
The 2 JSON documents above are processed (as input) under 2 different types in Logstash, so they end up under different 'type' fields in Elasticsearch.
What I want is to join the 2 documents on the product_id field.
It depends on what you mean by JOIN. Elasticsearch is not like a regular database that supports JOINs between tables; it is a text search engine that manages documents within indexes.
On the other hand, you can search within the same index over multiple types using fields that are common to every type.
For example, taking your data, I can create an index with 2 types and their data as follows:
curl -XPOST localhost:9200/product -d '{
  "settings" : {
    "number_of_shards" : 5
  }
}'
curl -XPOST localhost:9200/product/type1/_mapping -d '{
  "type1" : {
    "properties" : {
      "product_id" : { "type" : "string" },
      "price" : { "type" : "integer" },
      "stock" : { "type" : "integer" }
    }
  }
}'
curl -XPOST localhost:9200/product/type2/_mapping -d '{
  "type2" : {
    "properties" : {
      "product_id" : { "type" : "string" },
      "category" : { "type" : "string" },
      "manufacturer" : { "type" : "string" }
    }
  }
}'
curl -XPOST localhost:9200/product/type1/1 -d '{
  "product_id": "1111",
  "price": "23",
  "stock": "100"
}'
curl -XPOST localhost:9200/product/type2/1 -d '{
  "product_id": "1111",
  "category": "iPhone case",
  "manufacturer": "Belkin"
}'
I have effectively created one index called product with 2 types, type1 and type2.
Now I can run the following query and it returns both documents:
curl -XGET 'http://localhost:9200/product/_search?pretty=1' -d '{
  "query": {
    "query_string" : {
      "query" : "product_id:1111"
    }
  }
}'
{
  "took" : 95,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.5945348,
    "hits" : [ {
      "_index" : "product",
      "_type" : "type1",
      "_id" : "1",
      "_score" : 0.5945348, "_source" : {
        "product_id": "1111",
        "price": "23",
        "stock": "100"
      }
    }, {
      "_index" : "product",
      "_type" : "type2",
      "_id" : "1",
      "_score" : 0.5945348, "_source" : {
        "product_id": "1111",
        "category": "iPhone case",
        "manufacturer": "Belkin"
      }
    } ]
  }
}
The reason is that Elasticsearch searches over all documents within that index regardless of their type. This is still different from a JOIN in the sense that Elasticsearch is not going to compute a Cartesian product of the documents that belong to each type.
Hope that helps
isaac.hazan's answer works quite well, but I would like to add a few points that helped me in this kind of situation:
I landed on this page when I was trying to solve a similar problem, in which I had to exclude records from one index based on documents in another index. The lack of relationships is one of the main downsides of unstructured databases.
The Elasticsearch documentation page on Handling Relationships explains a lot.
Four common techniques are used to manage relational data in Elasticsearch:
Application-side joins
Data denormalization
Nested objects
Parent/child relationships
Often the final solution will require a mixture of a few of these techniques.
I've mostly used nested objects and application-side joins. While reusing the same field name across types can solve the immediate problem, I think it is better to rethink the model and create the mapping best suited to your application.
For instance, you might find that you want to list all products with a price greater than x, or list all products that are no longer in stock. To deal with such scenarios it helps if you are using one of the solutions mentioned above, for example the denormalization sketch below.
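Here is a minimal data-denormalization sketch (the product_denorm index name is made up; the field names come from the question) that folds both documents into one, so that a "price greater than x" query works directly:
curl -XPOST 'localhost:9200/product_denorm/product/1111' -d '{
  "product_id": "1111",
  "price": 23.56,
  "stock": 100,
  "category": "iPhone case",
  "manufacturer": "Belkin"
}'
curl -XGET 'localhost:9200/product_denorm/product/_search?pretty' -d '{
  "query": {
    "range": { "price": { "gt": 20 } }
  }
}'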
To perform joins in Elasticsearch, take a look at the Siren "Federate" plugin. It adds join capabilities by extending the native Elasticsearch query syntax.
https://siren.io/federate/
