ElasticSearch custom score script does not preserve array ordering

I am using ElasticSearch with a function_score query to retrieve documents sorted by createdOn. The createdOn field is stored as an array of date parts, i.e.
"createdOn": [ 2014, 4, 24, 22, 11, 47 ]
Where createdOn[0] is the year, createdOn[1] is the month, createdOn[2] is the day, etc. I am testing the following query, which should return documents scored by year. However, the doc['createdOn'] array does not preserve the order of the elements. In this query, doc['createdOn'].values[0] returns 4, not 2014.
POST /example/1
{
  "name": "apple",
  "createdOn": [2014, 8, 22, 5, 12, 32]
}

POST /example/2
{
  "name": "apple",
  "createdOn": [2011, 8, 22, 5, 12, 32]
}

POST /example/3
{
  "name": "apple",
  "createdOn": [2013, 8, 22, 5, 12, 32]
}

POST /example/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": "doc['createdOn'].values[0]"
      }
    }
  }
}
It appears that this is due to the way ElasticSearch caches field data: http://elasticsearch-users.115913.n3.nabble.com/Accessing-array-field-within-Native-Plugin-td4042848.html:

The docFieldDoubles method gets its values from the in-memory structures of the field data cache. This is done for performance. The field data cache is not loaded from the source of the document (because this would be slow) but from the Lucene index, where the values are sorted (for lookup speed). The get API does work based on the original document source, which is why you see those values in order (note: ES doesn't parse the source for the get API, it just gives you back what you put in it).

You can access the original document (which will be parsed) using the SourceLookup (available from the source method), but it will be slow as it needs to go to disk for every document.

I'm not sure about the exact semantics of what you are trying to achieve, but did you try looking at nested objects? Those allow you to store a list of objects in a way that keeps values together, like [{ "key": "k1", "value": "v1" }, ...].

The only apparent solution, other than using the source method (which is slow), is to use nested queries. Any ideas on how I could rewrite my query using nested queries? It seems like the only efficient way to sort this query by year.

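Not from the thread, but a hedged sketch of the restructuring route: if the date parts are stored under named keys instead of array positions, each part becomes its own single-valued field in the field data cache, so the sorted-cache behavior quoted above stops mattering. Index and field names mirror the example; the mapping is left to dynamic mapping, which would type these as longs:

POST /example/1
{
  "name": "apple",
  "createdOn": {
    "year": 2014, "month": 8, "day": 22,
    "hour": 5, "minute": 12, "second": 32
  }
}

POST /example/_search
{
  "query": {
    "function_score": {
      "boost_mode": "replace",
      "query": { "match_all": {} },
      "script_score": {
        "script": "doc['createdOn.year'].value"
      }
    }
  }
}

The nested [{ "key": ..., "value": ... }] layout suggested in the quote is only needed when the keys themselves vary per document; for a fixed set of date parts, plain object fields keep the scoring script trivial.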
Related

Syntax for referencing sub-buckets in Vega

I have an Elasticsearch (6.2) query that returns the following JSON:
"aggregations": {
  "per_buyer": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Example-buyer",
        "doc_count": 45,
        "per_cart": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 6,
          "buckets": [
            {
              "key": "397105FB",
              "doc_count": 8,
              "net": {
                "value": 10
              }
            }
          ]
        }
      }
    ]
  }
}
What is the correct syntax for the Vega "format" field to display data per "per_cart" bucket? Anything deeper than aggregations.per_buyer.buckets returns the error _.aggregations.per_buyer.buckets.per_cart is undefined. VEGA_DEBUG.view.data shows that some aggregations.per_buyer.buckets have a "per_cart" object, which in turn has buckets. (Filtering so that all buckets have per_cart objects does not change anything.)
I previously asked this question without success on the Elastic forums.
Sorry, I missed the request in the discuss forum. Cross-posting my answer here:
@Steven_Ensslen your format must be aggregations.per_buyer.buckets, simply because that's the list of data that you need. Each data element in that bucket may contain a sub-list of buckets, which you have to access from Vega itself.
If the sub-list will always have just a single bucket (e.g. if you are summing up the total per main bucket), you can access it either directly from a mark, e.g. datum.per_cart.buckets[0].net.value, or you can create a formula transform that copies that value into a top-level field, e.g. {type: 'formula', as: 'net_value', expr: 'datum.per_cart.buckets[0].net.value'}, and use the net_value field in the mark.
If, on the other hand, you have multiple items in the sub-list, you can use the flatten transform to turn the sub-buckets into a non-hierarchical list of items and then apply further transformations to get the data into the shape you need, as in the sketch below.
P.S. The "flatten" transform may not be available until 6.3 or 6.4.
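A fragment of a full (non-Vega-Lite) spec illustrating both transforms, written in Kibana's HJSON style where comments are allowed; the index name and query body are placeholders, not from the thread:

"data": [
  {
    "name": "carts",
    "url": {
      "index": "myindex",
      "body": { /* the per_buyer/per_cart aggregation request */ }
    },
    // start from the list of buyer buckets, as discussed above
    "format": { "property": "aggregations.per_buyer.buckets" },
    "transform": [
      // one row per cart sub-bucket instead of one row per buyer
      { "type": "flatten", "fields": ["per_cart.buckets"], "as": ["cart"] },
      // copy the nested metric up to a flat field for marks to use
      { "type": "formula", "as": "net_value", "expr": "datum.cart.net.value" }
    ]
  }
]

Flatten copies the remaining fields of each parent object onto every output row, so datum.key still identifies the buyer after flattening.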

Elastic - Search across object without key specification

I have an index with hundreds of millions of docs, and each of them has an object "histogram" with values for each day:
"_source": {
  "proxy": {
    "histogram": {
      "2017-11-20": 411,
      "2017-11-21": 34,
      "2017-11-22": 0,
      "2017-11-23": 2,
      "2017-11-24": 1,
      "2017-11-25": 2692,
      "2017-11-26": 11673
    }
  }
}
And I need one of two solutions:
1. Find docs where any value inside the histogram object is greater than XX.
2. Find docs where the average of the values in the histogram object is greater than XX.
For point 1 I can use a range query, but I must specify the exact name of the field (i.e. proxy.histogram.2017-11-20), as in the sketch below; the wildcard version (proxy.histogram.*) does not work.
For point 2 I only found the avg aggregation in ES, but I don't want to aggregate these fields after the query (because of the amount of data); I only want to search for the matching docs.
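For reference, a sketch of the single-field range query from point 1 (the index name and the threshold standing in for XX are placeholders); this form works, but only for one explicitly named day:

POST /myindex/_search
{
  "query": {
    "range": {
      "proxy.histogram.2017-11-20": {
        "gt": 100
      }
    }
  }
}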

String range query in Elasticsearch

I'm trying to query data in an Elasticsearch cluster (2.3) using the following range query. To clarify, I'm searching on a field that contains an array of values that were derived by concatenating two ids together with a count. For example:
Schema:
{
  "id1": 111,
  "id2": 222,
  "count": 5
}
The query I'm using looks like the following:
Query:
{
  "query": {
    "bool": {
      "must": {
        "range": {
          "myfield": {
            "from": "111_222_1",
            "to": "111_222_2147483647",
            "include_lower": true,
            "include_upper": true
          }
        }
      }
    }
  }
}
The to field uses Integer.MAX_VALUE (2147483647).
This works alright but doesn't exactly match the underlying data. Querying through other means produces more results than this method.
More strangely, trying 111_222_5 in the from field produces 0 results, while trying 111_222_10 does produce results.
How is ES (and/or Lucene) interpreting this range query and why is it producing such strange results? My initial guess is that it's not looking at the full value of the last portion of the String and possibly only looking at the first digit.
Is there a way to specify a format for the TermRange? I understand date ranging allows formatting.
A look here provides the answer.
The range is evaluated lexicographically: "5" comes before "50", which comes before "6", and so on. That is also why from: 111_222_5 produces 0 results: as a string, "111_222_5" sorts after the upper bound "111_222_2147483647" (because '5' > '2'), so the range is empty, while "111_222_10" falls inside it.
To get around this, I reindexed using a fixed-length, zero-padded string for the count:
0000000001
0000000100
0001000101
...
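With a fixed width of ten digits (2147483647, the upper bound above, is exactly ten digits long), lexicographic order coincides with numeric order. A sketch of the original query against the reindexed field, with the from value padded to match:

{
  "query": {
    "bool": {
      "must": {
        "range": {
          "myfield": {
            "from": "111_222_0000000001",
            "to": "111_222_2147483647",
            "include_lower": true,
            "include_upper": true
          }
        }
      }
    }
  }
}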

Using MongoDB to store time series data of arbitrary intervals

I want to store time-series-like data. There are no set intervals for the data like normal time series data. Data points could be as often as every few seconds to as seldom as every few years, all in the same time series. I basically need to store the Date data type and a value, over and over.
I would like the ability to very quickly retrieve the most recent item in the series. I would also like the ability to quickly retrieve all the values within a range between two dates. Writing efficiency is nice but not as important.
My initial thought was to use documents with keys set to dates. Something like this:
{
  "entry_last": 52,
  "entry_history": {
    datetime(2013, 1, 15): 94,
    datetime(2014, 12, 23): 25,
    datetime(2016, 10, 23, 5, 34, 00): 52
  }
}
However, from my understanding, keys have to be strings.
So then I came up with this prototype:
{
  "entry_last": 52,
  "entry_history": [
    [datetime(2013, 1, 15), 94],
    [datetime(2014, 12, 23), 25],
    [datetime(2016, 10, 23, 5, 34, 00), 52]
  ]
}
The idea here is to give myself very easy access to the last value with entry_last (the value of which is duplicated in the history), as well as to store each data entry in the most efficient way possible by only storing the date and value in entry_history.
What I'd like to know is whether or not my prototype is an efficient approach to storing my data. Specifically, I'd like to know if this will allow me to efficiently query the most recent value as well as values between two dates. If not, what is a better approach?
You don't have to manually specify the index: you can store only the datetime and use the index of the array.
The main issue I see with your solution is that you have to manually maintain entry_last; if an update ever fails, this doesn't work anymore unless you have a few failsafes. If you build another app with a different technology using the same db, you'll have to re-code the same logic. And I don't see how to query between two dates easily and efficiently here, unless you reorder the array every time you insert an element.
If I had to design this kind of data storage, I would create another collection to store the history (linked to your entries by _id) and index the date for fast queries. But it might depend on the quantity of your data.
/* entry */
{
  "_id": 1234,
  "entryName": "name"
}

/* history */
{
  "_id": 9876,
  "_linkedEntryId": 1234,
  "date": new Date(2013, 1, 15)
}
{
  "_id": 9877,
  "_linkedEntryId": 1234,
  "date": new Date(2014, 12, 23)
}
{
  "_id": 9878,
  "_linkedEntryId": 1234,
  "date": new Date(2016, 10, 23, 5, 34, 00)
}
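A sketch of the index and the two queries this layout serves, using the collection and field names from the example above (the compound index is an assumption; adjust it to your read patterns):

// index covering both "latest entry" and date-range lookups
db.history.createIndex({ "_linkedEntryId": 1, "date": -1 })

// most recent history document for one entry
db.history.find({ "_linkedEntryId": 1234 }).sort({ "date": -1 }).limit(1)

// all history documents for one entry between two dates
db.history.find({
  "_linkedEntryId": 1234,
  "date": { "$gte": new Date(2013, 1, 15), "$lte": new Date(2016, 10, 23) }
})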
To give an idea of the performance: I have MongoDB running on my ultrabook (far from a dedicated server's performance) and I can get the most recent document linked to a specific identifier in 5-10 ms, and the same speed to get all documents between two dates. I'm querying a modest collection of one million documents; it's not random data, and the average object size is 2050 B.

Sum field and sort on Solr

I'm implementing a grouped search in Solr. I'm looking for a way to sum one field and sort the results by this sum. I hope the following data example makes it clearer.
[
  {
    "id": 1,
    "parent_id": 22,
    "valueToBeSummed": 3
  },
  {
    "id": 2,
    "parent_id": 22,
    "valueToBeSummed": 1
  },
  {
    "id": 3,
    "parent_id": 33,
    "valueToBeSummed": 1
  },
  {
    "id": 4,
    "parent_id": 5,
    "valueToBeSummed": 21
  }
]
If the search is made over this data I'd like to obtain
[
  {
    "numFound": 1,
    "summedValue": 21,
    "parent_id": 5
  },
  {
    "numFound": 2,
    "summedValue": 4,
    "parent_id": 22
  },
  {
    "numFound": 1,
    "summedValue": 1,
    "parent_id": 33
  }
]
Do you have any advice on this ?
Solr 5.1+ introduces Solr facet functions, which solve this exact issue (parts of the syntax only land in 5.3).
From Yonik's introduction of the feature:
$ curl http://localhost:8983/solr/query -d 'q=*:*&
  json.facet={
    categories: {
      type: terms,
      field: cat,
      sort: "x desc",    // can also use sort:{x:desc}
      facet: {
        x: "avg(price)",
        y: "sum(price)"
      }
    }
  }'
So the suggestion would be to upgrade to the newest version of Solr (the most recent release is currently 5.2.1; be advised that some of the syntax at the above link will only land in 5.3, the current release target).
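Adapted to the fields in this question, the request might look like the following (an untested sketch assuming the 5.3 facet-function syntax; the bucket name parents is arbitrary):

$ curl http://localhost:8983/solr/query -d 'q=*:*&
  json.facet={
    parents: {
      type: terms,
      field: parent_id,
      sort: "s desc",
      facet: {
        s: "sum(valueToBeSummed)"
      }
    }
  }'

Each parents bucket then carries its document count and the sum s, ordered by the sum, which matches the desired output above.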
So you want to group your results on the field parent_id, sum up the field valueToBeSummed inside each group, and then sort the entire result set (the groups) by this new summed value. That is a very interesting use case...
Unfortunately, I don't think there is a built in way of doing what you have asked.
There are function queries which you can use to sort, there is a group.func parameter also, but they will not do what you have asked.
Have you already indexed this data, or are you still in the process of charting out how to store it? If it's the latter, then one possible way would be to have a summedvalue field on each document and calculate it as and when a document gets indexed. For example, given the sample documents in your question, the first document will be indexed as
{
  "id": 1,
  "parent_id": 22,
  "valueToBeSummed": 3,
  "summedvalue": 3,
  "timestamp": current-timestamp
},
Before indexing the second document, id:2 with parent_id:22, you will run a Solr query to get the last indexed document with parent_id:22:
Solr query: q=parent_id:22&sort=timestamp desc&rows=1
and add the summedvalue of id:1 to the valueToBeSummed of id:2.
So the next document will be indexed as
{
  "id": 2,
  "parent_id": 22,
  "valueToBeSummed": 1,
  "summedvalue": 4,
  "timestamp": current-timestamp
}
and so on.
Once you have documents indexed this way, you can run a regular Solr query with &group=true&group.field=parent_id&sort=summedvalue desc.
Please do let us know how you decide to implement it. Like I said its a very interesting use case! :)
You can use the query below:
select?q=*:*&stats=true&stats.field={!tag=piv1 sum=true}valueToBeSummed&facet=true&facet.pivot={!stats=piv1 facet.sort=index}parent_id&wt=json&indent=true
You need to use the Stats Component for this requirement. You can get more information here. The idea is to first define what you need stats on; here it is valueToBeSummed. Then we need to group on parent_id; we use facet.pivot for this functionality.
Regarding sort: when we do grouping, the default sort order is based on the count in each group. We can sort on a value too; I have done this above using facet.sort=index, so it sorts on parent_id, the field we used for grouping. But your requirement is to sort on the summed valueToBeSummed, which is different from the grouping attribute.
As of now I am not sure if we can achieve that, but I will look into it and let you know.
In short: you got the grouping and you got the sum above; just the sort is pending.
