Sum field and sort on Solr - sorting

I'm implementing a grouped search in Solr. I'm looking for a way of summing one field and sort the results by this sum. With the following data example I hope it will be clearer.
{
[
{
"id" : 1,
"parent_id" : 22,
"valueToBeSummed": 3
},
{
"id" : 2,
"parent_id" : 22,
"valueToBeSummed": 1
},
{
"id" : 3,
"parent_id" : 33,
"valueToBeSummed": 1
},
{
"id" : 4,
"parent_id" : 5,
"valueToBeSummed": 21
}
]
}
If the search is made over this data I'd like to obtain
{
[
{
"numFound": 1,
"summedValue" : 21,
"parent_id" : 5
},
{
"numFound": 2,
"summedValue" : 4,
"parent_id" : 22
},
{
"numFound": 1,
"summedValue" : 1,
"parent_id" : 33
}
]
}
Do you have any advice on this ?

Solr 5.1+ (and 5.3) introduces Solr Facet functions to solve this exact issue.
From Yonik's introduction of the feature:
$ curl http://localhost:8983/solr/query -d 'q=*:*&
json.facet={
categories:{
type : terms,
field : cat,
sort : "x desc", // can also use sort:{x:desc}
facet:{
x : "avg(price)",
y : "sum(price)"
}
}
}
'
So the suggestion would be to upgrade to the newest version of Solr (the most recent version is currently 5.2.1, be advised that some of the syntax that's on the above link will be landed in 5.3 - the current release target).

So you want to group your results on the field parent_id and inside each group you want to sum up the fields valueToBeSummed and then you want to sort the entire results (the groups) by this new summedvalue field. That is a very interesting use case...
Unfortunately, I don't think there is a built in way of doing what you have asked.
There are function queries which you can use to sort, there is a group.func parameter also, but they will not do what you have asked.
Have you already indexed this data? Or are you still in the process of charting out how to store this data? If its the latter then one possible way would be to have a summedvalue field for each documents and calculate this as and when a document gets indexed. For example, given the sample documents in your question, the first document will be indexed as
{
"id" : 1,
"parent_id" : 22,
"valueToBeSummed": 3
"summedvalue": 3
"timestamp": current-timestamp
},
Before indexing the second document id:2 with parent_id:22 you will run a solr query to get the last indexed document with parent_id:22
Solr Query q=parent_id:22&sort=timestamp desc&rows=1
and add the summedvalue of id:1 with valueToBeSummed of id:2
So the next document will be indexed as
{
"id" : 2,
"parent_id" : 22,
"valueToBeSummed": 1
"summedvalue": 4
"timestamp": current-timestamp
}
and so on.
Once you have documents indexed this way, you can run a regular solr query with &group=true&group.field=parent_id&sort=summedValue.
Please do let us know how you decide to implement it. Like I said its a very interesting use case! :)

You can add the below query
select?q=*:*&stats=true&stats.field={!tag=piv1 sum=true}valueToBeSummed&facet=true&facet.pivot={!stats=piv1 facet.sort=index}parent_id&wt=json&indent=true
You need to use Stats Component for the requirement. You can get more information here. The idea is first define on what you need to have stats on. Here it is valueToBeSummed, and then we need to group on parent_id. We use facet.pivot for this functionality.
Regarding sort, when we do grouping, the default sorting order is based on count in each group. We can define based on the value too. I have done this above using facet.sort=index. So it sorted on parent_id which is the one we used for grouping. But your requirement is to sort on valueToBeSummed which is different from the grouping attribute.
As of now not sure, if we can achieve that. But will look into it and let you know.
In short, you got the grouping, you got the sum above. Just sort is pending

Related

ElasticSearch Sorted Index not working as expected with multiple shards

I have an elastic index with default sort mapping of price:
shop_prices_sort_index
"sort" : {
"field" : "enrich.price",
"order" : "desc"
},
If I insert 10 documents:
100, 98, 10230, 34, 1, 23, 777, 2323, 3, 109
And Fetch results using /_search. By default it returns documents in order of price descending.
10230, 2323...
But if I distribute my documents into 3 shards, Then the same query returns some other sequence of products:
100, 98, 34...
I am really stuck here, I am not sure if I am missing out something basic or Do I need any extra settings to make a Sorted Index behave correctly.
PS: I also tried 'routing' & 'preference'. but no luck.
Any help much appreciated.
When configuring index sorting, you're only making sure that each segment inside each shard is properly sorted. The goal of index sorting is to provide some more optimization during searches
Due to the distributed nature of ES, when your index has many shards, each shard will be properly sorted, but your search query will still need to use sorting explicitly.
So if your index settings contains the following to apply sorting at indexing time
"sort" : {
"field" : "enrich.price",
"order" : "desc"
}
your search queries will also need to contain the same sort specification at query time
"sort" : {
"field" : "enrich.price",
"order" : "desc"
}
By using index sorting you'll hit a little overhead at indexing time, but your search queries will be a bit faster in the end.

Getting child documents

I have an Elasticsearch index. Each document in that index has a number (i.e 1, 2, 3, etc.) and an array named ChildDocumentIds. There are additional properties too. Still, each item in this array is the _id of a document that is related to this document.
I have a saved search named "Child Documents". I would like to use the number (i.e. 1, 2, 3, etc.) and get the child documents associated with it.
Is there a way to do this in Elastisearch? I can't seem to find a way to do a relational-type query in Elasticsearch for this purpose. I know it will be slow, but I'm o.k. with that.
The terms query allows you to do this. If document #1000 had child documents 3, 12, and 15 then the following two queries would return identical results:
"terms" : { "_id" : [3, 12, 15] }
and:
"terms" : {
"_id" : {
"index" : <parent_index>,
"type" : <parent_type>,
"id" : 1000,
"path" : "ChildDocumentIds"
}
}
The reason that it requires you to specify the index and type a second time is that the terms query supports cross-index lookups.

elasticsearch - query between document types

I have a production_order document_type
i.e.
{
part_number: "abc123",
start_date: "2018-01-20"
},
{
part_number: "1234",
start_date: "2018-04-16"
}
I want to create a commodity document type
i.e.
{
part_number: "abc123",
commodity: "1 meter machining"
},
{
part_number: "1234",
commodity: "small flat & form"
}
Production orders are datawarehoused every week and are immutable.
Commodities on the other hand could change over time. i.e abc123 could change from 1 meter machining to 5 meter machining, so I don't want to store this data with the production_order records.
If a user searches for "small flat & form" in the commodity document type, I want to pull all matching records from the production_order document type, the match being between part number.
Obviously I can do this in a relational database with a join. Is it possible to do the same in elasticsearch?
If it helps, we have about 500k part numbers that will be commoditized and our production order data warehouse currently holds 20 million records.
I have found that you can indeed now query between indexs in elasticsearch, however you have to ensure your data stored correctly. Here is an example from the 6.3 elasticsearch docs
Terms lookup twitter example At first we index the information for
user with id 2, specifically, its followers, then index a tweet from
user with id 1. Finally we search on all the tweets that match the
followers of user 2.
PUT /users/user/2
{
"followers" : ["1", "3"]
}
PUT /tweets/tweet/1
{
"user" : "1"
}
GET /tweets/_search
{
"query" : {
"terms" : {
"user" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "followers"
}
}
}
}
Here is the link to the original page
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/query-dsl-terms-query.html
In my case above I need to setup my storage so that commodity is a field and it's values are an array of part numbers.
i.e.
{
"1 meter machining": ["abc1234", "1234"]
}
I can then look up the 1 meter machining part numbers against my production_order documents
I have tested and it works.
There is no joins supported in elasticsearch.
You can query twice first by getting all the partnumbers using "small flat & form" and then using all the partnumbers to query the other index.
Else try to find a way to merge these into a single index. That would be better. Updating the Commodities would not cause you any problem by combining the both.

How to normalize ElasticSearch scores?

For my project I need to find out which results of the searches are considered "good" matches. Currently, the scores vary wildly depending on the query, hence the need to normalize them somehow. Normalizing the scores would allow to select the results above a given threshold.
I found couple solutions for Lucene:
how do I normalise a solr/lucene score?
http://wiki.apache.org/lucene-java/ScoresAsPercentages
How would I go ahead and apply the same technique to ElasticSearch? Or perhaps there is already a solution that works with ES for score normalization?
As far as I searched, there is no way to get a normalized score out of elastic. You will have to hack it by making two queries. First will be a pilot query (preferably with size 1, but rest all attributes same) and it will fetch you the max_score. Then you can shoot your actual query and use functional_score to normalize the score. Pass the max_score you got as part of the pilot query in params to function_score and use it to normalize every score. Refer: This article snippet
It's a bit late.
We needed to normalise the ES score for one of our use cases. So, we wrote a plugin that overrides the ES Rescorer feature.
Supports min-max and z score.
Github: https://github.com/bkatwal/elasticsearch-score-normalizer
Usage:
Min-max
{
"query": {
... some query
},
"from" : 0,
"size" : 50,
"rescore" : {
"score_normalizer" : {
"normalizer_type" : "min_max",
"min_score" : 1,
"max_score" : 10
}
}
}
Usage z-score:
"query": {
... some query
},
"from" : 0,
"size" : 50,
"rescore" : {
"score_normalizer" : {
"normalizer_type" : "z_score",
"min_score" : 1,
"factor" : 0.6,
"factor_mode" : "increase_by_percent"
}
}
}
For complete documentation check the Github repository.

Uniformly distributing results in elastic search based on an attribute

I am using tire to perform searches on sets of objects that have a category attribute (there are 6 different categories).
I want the results to come in pages of 6 with one of each category on a page (while it is possible).
Eg1. So if the first,second and third category had 2 objects each and the fourth, fifth and sixth categories had 1 object each the pages would look like:
Data: [1,1,2,2,3,3,4,5,6]
1: 1,2,3,4,5,6
2: 1,2,3
Eg2. [1,1,1,1,1,2,2,3,4,5]
1: 1,2,3,4,5,1
2: 2,1,1,1
In something like ruby it wouldn't be too difficult to sort based on the number of times a number has appeared.
Something like
times_seen = {}
results.sort_by do |r|
times_seen[r.category] ||= 0
[times_seen[r.category] += 1, r.category]
end
E.g.
irb(main):032:0> times_seen = {};[1,1,1,1,1,2,2,3,4,5].sort_by{|i| times_seen[i] ||= 1; [times_seen[i] += 1, i];}
=> [1, 2, 3, 4, 5, 1, 2, 1, 1, 1]
To do this with a large number of results would be really slow because we would need to pull them all into ruby first and then sort.
Ideally we want to do this in elastic search and let it handle the pagination for us.
There is Script based sorting in elastic search:
http://www.elasticsearch.org/guide/reference/api/search/sort/
{
"query" : {
....
},
"sort" : {
"_script" : {
"script" : "doc['field_name'].value * factor",
"type" : "number",
"params" : {
"factor" : 1.1
},
"order" : "asc"
}
}
}
So if we could do something like this but have the times_seen logic from above in it, it would make life really easy, but it would require having a times_seen variable that persists between scripts.
Any ideas on how to achieve a uniform distribution based on an attribute or if it is possible to somehow use a variable in the script sort?

Resources